Re: Regular expressions in lexing and parsing

Christopher F Clark <>
Tue, 18 Jun 2019 18:43:45 -0400

          From comp.compilers

Related articles
Re: Regular expressions in lexing and parsing (Hans-Peter Diettrich) (2019-05-21)
Re: Regular expressions in lexing and parsing (Ev. Drikos) (2019-05-23)
Re: Regular expressions in lexing and parsing (Christopher F Clark) (2019-06-17)
Re: Regular expressions in lexing and parsing (Quinn Jackson) (2019-06-18)
Re: Regular expressions in lexing and parsing (Quinn Jackson) (2019-06-18)
Re: Regular expressions in lexing and parsing (Kaz Kylheku) (2019-06-18)
Re: Regular expressions in lexing and parsing (Christopher F Clark) (2019-06-18)

From: Christopher F Clark <>
Newsgroups: comp.compilers
Date: Tue, 18 Jun 2019 18:43:45 -0400
Organization: Compilers Central
References: 19-06-005
Keywords: parse, lex
Posted-Date: 18 Jun 2019 19:52:18 EDT
In-Reply-To: <>

First, Kaz Kylheku attributed some ideas to me that properly belong to
Quinn Jackson, who deserves credit for them, not me.

Second, I am agnostic on the lexer/parser split issue. There is a
point in separating concerns and in Yacc++ we kept the lexer and
parser separate at run time although you can write one file that
defines both [and that's the default way we write grammars]. However,
the scanner-less parser people have had lots of success with blurring
that distinction.

Moreover, people like Terence Parr echo the sentiment Quinn expressed
at some level. ANTLR uses the same language for traversing ASTs as it
does for parsing to create them in the first place. I don't know
whether ANTLR also merges lexing and parsing, but it wouldn't surprise
me if it did.

However, clearly, having one notation that works for multiple phases
makes sense to some people. And, while our lexer and parser in Yacc++
are separate run time objects, the notation used to create them is
unified [with a few small differences, e.g. lexer features that are
missing from the parser and vice versa], but the general shape of
productions works for both.

But as our moderator said, there are advantages to keeping them
distinct (at least conceptually). In particular, it is nice for the
lexer to drop certain tokens on the floor without ever sending them to
the parser. We called those discard tokens. Whitespace and comments
are typical examples of discard tokens. Having them as tokens means
the lexer has a precise "formal" definition of them, but immediately
dropping them means that the parser is not burdened by specifying when
they are allowed. By the way, which tokens are discarded is *not*
built into Yacc++. The grammar writer chooses which things are
irrelevant to parsing. The ability to specify things that are lexed
but not parsed is an important concept in getting the formalism right.
The details of how it is used to simplify the specification belong in
the hands of the grammar writer.
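The discard-token idea can be sketched in a few lines of Python. This
is only an illustration of the concept, not Yacc++ notation; the token
names and the grammar are made up for the example. Whitespace and
comments get precise definitions in the lexer, but are dropped on the
floor before the parser ever sees the stream.

```python
import re

# Every lexeme, including whitespace and comments, has a formal
# definition.  The names here are illustrative, not Yacc++ syntax.
TOKEN_SPEC = [
    ("NUMBER",     r"\d+"),
    ("IDENT",      r"[A-Za-z_]\w*"),
    ("PLUS",       r"\+"),
    ("WHITESPACE", r"[ \t\n]+"),   # discard token
    ("COMMENT",    r"//[^\n]*"),   # discard token
]
DISCARD = {"WHITESPACE", "COMMENT"}

MASTER = re.compile("|".join(f"(?P<{name}>{pat})"
                             for name, pat in TOKEN_SPEC))

def tokens(text):
    """Yield (kind, lexeme) pairs, dropping discard tokens."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if not m:
            raise SyntaxError(f"bad character at {pos}: {text[pos]!r}")
        pos = m.end()
        if m.lastgroup not in DISCARD:
            yield (m.lastgroup, m.group())

# The parser never sees whitespace or comments:
print(list(tokens("a + 1  // trailing comment")))
# -> [('IDENT', 'a'), ('PLUS', '+'), ('NUMBER', '1')]
```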

Further, to roughly match what Quinn was mentioning, we also have the
concept of "ignored" tokens (and non-terminals). These items can be
used in the grammar at places where they are required, but will also
effectively be passed over by the parser when used in other
places. (For example, if you made a newline token that was ignored,
you could use it to mark places in the grammar where things must be at
the start of a line.) The lexer doesn't discard these and we modify
the parser tables so that they don't impact it in those places when it
isn't required. A slightly different formalism to deal with a
slightly different problem, but the mechanism is standardized so that
it can be made rigorous.
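A minimal Python sketch of the ignored-token mechanism (again
illustrative, not how Yacc++ actually modifies its parser tables):
the lexer keeps NEWLINE in the stream, and the parser silently skips
it everywhere except at positions where the grammar explicitly asks
for it.

```python
IGNORED = {"NEWLINE"}

class TokenStream:
    """Wraps a token list; ignored tokens are invisible unless the
    grammar requires one at this exact position."""
    def __init__(self, toks):
        self.toks = list(toks) + [("EOF", "")]
        self.pos = 0

    def expect(self, kind):
        # Only skip ignored tokens when the grammar is NOT asking
        # for one here.
        if kind not in IGNORED:
            while self.toks[self.pos][0] in IGNORED:
                self.pos += 1
        tok = self.toks[self.pos]
        if tok[0] != kind:
            raise SyntaxError(f"expected {kind}, got {tok[0]}")
        self.pos += 1
        return tok

# A construct that must start on a fresh line: require the NEWLINE.
ts = TokenStream([("NUMBER", "1"), ("NEWLINE", "\n"), ("IDENT", "x")])
ts.expect("NUMBER")
ts.expect("NEWLINE")   # required here: marks start-of-line
ts.expect("IDENT")

# Elsewhere the same NEWLINE is silently passed over:
ts2 = TokenStream([("NUMBER", "1"), ("NEWLINE", "\n"), ("NUMBER", "2")])
ts2.expect("NUMBER")
ts2.expect("NUMBER")   # the NEWLINE between them is ignored
```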

(I don't recall whether we have a way of saying where ignored tokens
are disallowed. It certainly makes sense to add such a feature. We
probably rely on syntactic predicates currently for that issue.)

You can see the same thing occurring with ASTs. You don't put every
token into the AST, you put only the information that is semantically
relevant. And whether a token is semantically relevant can depend
upon context. Still, by including only the relevant information, the
consumer of an AST, be it a separate phase or a visitor pattern, sees
only what is relevant and is not burdened with extraneous detail.
Now, whether you code that up separately or write it as part of one
spec is a design decision, but the separation of concerns, which is
trivially enforced when the two things are separate run-time objects
(passes, phases, coroutines, etc.), is useful. Dropping the
irrelevant information is as important as keeping the relevant bits.
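A toy illustration of that point (the grammar and tree shape are
invented for the example): the parse consumes every token, but the
AST keeps only the semantically relevant ones. Parentheses and commas
shape the parse yet never appear in the tree.

```python
def parse_call(toks):
    """Parse IDENT '(' arg (',' arg)* ')' into ('call', name, args)."""
    it = iter(toks)
    name = next(it)          # IDENT -> kept in the AST
    assert next(it) == "("   # '('   -> dropped: purely syntactic
    args, tok = [], next(it)
    while tok != ")":
        if tok != ",":       # ','   -> dropped: purely syntactic
            args.append(tok) # arguments are semantically relevant
        tok = next(it)
    return ("call", name, args)

print(parse_call(["f", "(", "x", ",", "y", ")"]))
# -> ('call', 'f', ['x', 'y'])
```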

In fact, you can even see that point carried farther in systems like
LLVM, which has two different IRs, one machine independent and one
machine dependent. Again, that forces the separation of concerns.
There are things that should be done in a machine independent manner
(and thus with that IR) and other things that must be done in a
machine dependent manner (and thus with the other IR). There are
details that the machine dependent IR carries that the independent one
is not aware of. I suspect the inverse of that is true also.

The point being, separation of concerns is good. Don't complicate
things with irrelevant details. Formalism is good. Use notation to
make things simpler to understand without reading the implementation.
And, finally, there are many ways to achieve the same goals. However,
no matter which way one chooses, building a robust infrastructure that
supports it makes it more sound and more reliable.

On Mon, Jun 17, 2019 at 8:37 AM Christopher F Clark
<> wrote:
> Hans-Peter Diettrich <> wrote:
> > I think that regular expressions are primarily usable in *formal* token
> > grammars. In contrast to parsers I'm missing *complete* formal grammars
> > for the tokens, including whitespace, comments, strings and other
> > special language elements, and how these fit together.
> I definitely would not discount the use of formal grammars both for
> lexing (i.e. formal regexes) and parsing and the use of tools.
> When we did Yacc++, we specifically made our lexer syntax look like
> our parser syntax (and used LR yacc-style productions extended with
> regular expressions on the right hand side) to make the notation
> formal, regular, and simple. I would call that one of our successes.
> One of the advantages of having a consistent notation and using it in
> a formal manner is that often grammars can be taken almost verbatim
> from a standard and the parts that are missing easily inferred. It's
> amazing how much most lexers *should* look like each other. And, that
> was one of our goals to standardize some of the "tricks of the trade"
> (or hacks if you prefer) so that they became part of the standard
> lexicon and could be reused as languages evolved.
> Not everyone needs PL/I style keywords, but it turns out they are more
> useful than one would think for a dead language, so standardizing how
> that is done turns it into less of a hack and allows better reasoning
> about it. The same argument can be made for the different forms of
> comments, ones that nest and ones that don't. Nesting comments aren't
> hard if your lexer actually can do LR lexing and not just regular
> expressions. However, the point is to gather such things together and
> implement them in a standardized and verifiable way, so that there is
> some level of formal correctness.
> And, that would be my criticism of hand written recursive descent
> parsers. You can do amazing things with them, but you also can
> produce an unmaintainable mess where no one is quite sure what is
> legal and what is not. Reading the source to the compiler is not the
> solution to that problem.
> Anyway, I just wanted to point out that, by doing the above, when I
> sat down to write the front end for the Intel Vmod project I took the
> 1995 Verilog standard and almost copied it verbatim as the source of
> the lexer and parser. It took about 2 days to do so, but when I was
> done, I had a roughly working parser and after a couple of weeks it
> was essentially fully debugged. That's the advantage of using a
> formal style grammar and tools. You get something that actually works
> for some relatively obscure edge cases quite quickly and you can be
> confident of it. And, no, it isn't just me that can do that: one of
> our first customers implemented SGML out of the box the same way, and
> I cannot recall how many times I worked with customers implementing SQL.
> There are some corner cases having to do with joins that are hard to
> get right. The main point is that when you are done, you have
> something reasonably readable that you can have confidence in.
> So, if someone were to seriously argue against formal grammars (and
> formal lexing specs too), I would challenge them to prove their
> contentions.
> Yes, I can see that calling libPCRE with strings that have lots of
> backreferences, assertions, fiddling with greedy v. non-greedy
> operators etc. is not particularly maintainable. However, a simple
> lexer, with simple regular expressions following patterns that have
> been shown to work in language after language is worth striving for
> (and in many cases not that difficult).
> I will let someone else have the soapbox now....
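The nested-comment point quoted above can be made concrete. A pure
regular expression cannot match balanced nesting, but a lexer with a
depth counter (the extra power an LR/pushdown lexer has over plain
regexes) handles it in a few lines. This is a generic sketch using
C-style /* */ delimiters, not code from any particular tool.

```python
def skip_nested_comment(text, pos):
    """Skip a nested /* ... */ comment starting at text[pos].

    Nesting makes this non-regular: a depth counter is required,
    which is exactly what a pushdown/LR lexer adds over regexes.
    """
    assert text.startswith("/*", pos)
    depth, i = 0, pos
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1
            i += 2
        elif text.startswith("*/", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i            # index just past the comment
        else:
            i += 1
    raise SyntaxError("unterminated comment")

src = "/* outer /* inner */ still outer */ code"
print(src[skip_nested_comment(src, 0):])
# -> " code"
```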

Chris Clark email:
Compiler Resources, Inc. Web Site:
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
