One lexer rule for keywords and identifiers


From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Date: Sun, 26 Jun 2022 00:09:46 +0300
Organization: Compilers Central
References: 22-06-075
Keywords: parse, design
Posted-Date: 25 Jun 2022 20:48:13 EDT

Two parts to my answer.


First, if they are using the Yacc++ that my partner and I wrote, they are
following that rule. Why? Because we built it into the product.


You simply declare keywords like this:


keyword if, then, else, begin, end, for, while, do, defun, let;
substring keyword subroutine, function, procedure; // these match any unique substring
sensitive keyword EoF; // only accepts that exact spelling (not eof or EOF)
keyword amp "&"; // keywords that are not strictly alphanumeric
inline keyword ge ".ge.", le ".le.", eq ".eq."; // don't put these in the symbol table but instead make rules for them
...


It also incorporates the C-typedef hack if you want that, and it deals
with the PL/I issue of context-sensitive keywords, where a word sometimes
is and sometimes is not a keyword. All of this is built-in.


It does that by using Frank DeRemer's idea of separating the scanner
from the screener. The keyword processing is done in the screener (using
the symbol table), and because it uses the symbol table, the parser can add
(or remove) entries as keywords. That's not the technique used for the
PL/I-style context-sensitive ones, but it is the right approach for
C-style typedefs.


So, I would say the Dragon book is correct. Not because it makes the
tables smaller, but because it saves cognitive space. You don't want to be
reinventing the wheel for all those variations. You just want to declare
what you mean and have the tool do the "right" thing.


-----


Now, the reverse side of the coin. ANTLR's lexer may be separate from the
parser, but it doesn't have the concept of a screener. You cannot get
between the lexer and the parser to do "magic". Thus, you have to use
explicit keyword tokens; if you want case-insensitive ones, you must spell
them out with regular expressions (character classes), and the same goes
for substring keywords (where it doesn't even warn you if your substrings
are ambiguous).
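
For example, the usual ANTLR4 idiom for a case-insensitive keyword (my
illustration, not from the original post) is one character class per letter:

BEGIN : [Bb][Ee][Gg][Ii][Nn] ;   // matches begin, BEGIN, Begin, bEgIn, ...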


So, while I normally laud Terence for his work on ANTLR (he gets far more
right than wrong), this is a case where he got it wrong. (BTW, JavaCC,
which borrows heavily from ANTLR's implementation, suffers the same way.)


Now, being fair, ANTLR4 does have a unique feature in this regard: if you
write a keyword, or any token, in quotes in your grammar, it is processed
only when that rule is being recognized. That's part of how ANTLR4 is more
PEG-like, and it is a different solution to the PL/I keywords-as-identifiers
issue than the one we support. As far as I can tell, it even overrides the
longest-match problem. Thus, if you have a place in your grammar where you
want only ">" to be recognized and not ">>", simply put ">" in your parser
at that location.


--
******************************************************************************
Chris Clark                  email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc.     Web Site: http://world.std.com/~compres
23 Bailey Rd                 voice: (508) 435-5016
Berlin, MA 01503 USA         twitter: @intel_chris
------------------------------------------------------------------------------

