Re: Why separate Lexical & Parser Generators (A Johnstone)
Fri, 21 Oct 1994 02:27:27 GMT

          From comp.compilers

Related articles
[5 earlier articles]
Re: Why separate Lexical & Parser Generators (1994-10-10)
Re: Why separate Lexical & Parser Generators (Walter Smith) (1994-10-10)
Re: Why separate Lexical & Parser Generators (Charles Fiterman) (1994-10-11)
Re: Why separate Lexical & Parser Generators (1994-10-11)
Re: Why separate Lexical & Parser Generators (John Lacey) (1994-10-12)
Re: Why separate Lexical & Parser Generators (1994-10-14)
Re: Why separate Lexical & Parser Generators (1994-10-21)
Re: Why separate Lexical & Parser Generators (1994-10-22)
| List of all articles for this month |

Newsgroups: comp.compilers
From: (A Johnstone)
Keywords: lex, parse
Organization: Univ. of London, Royal Holloway College.
References: 94-10-028 94-10-058
Date: Fri, 21 Oct 1994 02:27:27 GMT

Excuse this rather long message, but this is a hobby horse of mine...

John Heron <heronj@smtplink.NGC.COM> writes:

>Pardon me if this question is naive. Why have a separate parser generator
>and lexical analyzer generator? ...

This is an excellent question, and I think some of the previous
responses have been a bit off the point. Historically, lex and yacc
have dominated the world of parser generators because they were
`there' and free (a bit like Unix in the seventies, but that's another

Now regular expressions are powerful enough to express most aspects of
lexical analysis (although try doing nested comment brackets with a
type-0 language) so it made sense to build a tool that did regular
expressions well and produced scanners automatically. It also made
sense to build your parser generator to make use of Lex's
capabilities, and so the regular expression/context free language
separation was born.

However, there have always been languages which are hard to coerce
into this tidy split. Algol-68, which allows you to define new
operators made out of punctuation marks at compile time comes to mind,
and others have mentioned Forth and so on. C, which effectively needs
two lexical analysers, one for the preprocessor and one for the main
language, introduces further headaches.

Worst of all, Lex/YACC style tools require two notations to be used:
regular expressions for the scanner and BNF (or extended BNF) for the
phrase level grammar. As good computer scientists, we should be
striving for unified notations that allow us to say more with less,
not replicating concepts unnecessarily in a sort of reverse Occam's
razor. Lex and YACC are so dominant that many tools have preserved
their interface for backwards compatibility. Terence Parr's superb
PCCTS suite suffers particularly from this although as a previous
poster noted he is promising a unified scanner/phrase level syntax for
a future release.

It is well know that BNF has greater expresive power than regular
expressions, and of course BNF can be used instead of regular
expressions. This has all been done long ago - Alex, the scanner
generator part of the original CoCo (see SIGPlan notices May 1986)
takes this approach. Alex builds scanners based on finite automata,
much like Lex and its relatives.

In my own tool RDP, I have a sort of halfway house where a
basically hardwired scanner is tuned by information gleaned from the
phrase level grammar. In RDP V2.0 (not due until early 1995) you can
use scanner productions to explicitly describe complicated lexical
constructs, and RDP will write scanner code that is just like
recursive descent parser code. One advantage of my scanner is that you
can add lexical symbols at run time (if you are deranged enough to
want to do that - I think user extensible languages are a short cut to
the funny farm personaly). Another advantage is that the scanner is
human readable unlike the finite automata produced by most other
generators, including Lex, DFA and Alex.

So, do we need scanners at all? Why not just do everything at phrase
level? Well yes we do need scanners, although preferably ones
specified by a unified syntax in the same file as the phrase level
grammar. Here are some reasons for maintaining the scanner/phrase
level split:

1. I/O: getting lines buffered up efficiently and making sure that
error messages are generated at the right point is surprisingly hard.
In addition, a good scanner should be able to flip into what I call
text mode, so that input can be transparently taken from a file or
from a text buffer (or from stdin for that matter). This allows macros
and interpreters to be implemented easily. Interestingly, many scanner
generators don't help at all when it comes to the I/O side of things.

2. Efficiency: recursive descent parsers like the ones generated by
PCCTS, PRECC, RDP and CoCo tend to do a lot of function calling.
That's OK at phrase level, but down at the character-by-character
level a monolithic lump of code is what you need using iteration rather
than recursion. Anyone who doubts the importance of efficient scanning
should profile their favorite compiler. You'd be surprised how much of
the time many compilers spend in the scanner.

3. Backtracking and lookahead: most parser generators only allow a
single token of lookahead (but see PCCTS which accepts ll(k)
grammars). Lexical analysis often requires complicated contortions.

4. Ambiguity: Nearly all lexical descriptions are ambiguous. For
indstance, >= might mean > followed by = or the single token >=. The
usual rule is to make the longest match, but that doesn't always work.

Take for instance my favourite nasty lexical construct which is from
Pascal (ironic because I am a great fan of Pascal style languages
simply because they were designed to make ll(1) parsing easy).
Consider the .. token in Pascal which is used to indicate a range.
When scanning, 1.7 is the real number one-and-seven-tenths, but 1..7
is the range 1 to 7. What makes this particularly nasty is that the
poor scanner will already have consumed the first period in the second
case assuming a real is on its way, and will have to put it back on
the input. This kind of thing is a real pain - in CoCo's Alex there
is a special construct IF FOLLOWED BY which is used to sneak a bit of

5. Cleverness on the part of the language designer: anybody looking
for a really frustrating student project might consider building a
parser for Occam, in which there are no explicit brackets for compound
statements but nesting is indicated by indentation. PRECC (an
extremely cute infinite lookahead parser generator) has a particulalry
powerful way of handling this problem, but I wouldn't recommend it for
the faint of heart.

So, my recomendation for a brighter future is: use EBNF for everything
and an optimising parser generator that can recognise things like
token sets and which will write you an efficient parser and an
efficient scanner. Now all I've got to do is finish writing it...


ps RDP V1.2 is now beta testing and can be collected from ( by anon ftp to pub/rdp. Full
release if V1.2 is expected later this week.

Dr Adrian Johnstone, Computer Science Dept, Royal Holloway, University of London
Email: Tel: +44 (0)784 443425 Fax: +44 (0)784 443420

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.