Re: problems with identifiers and keywords...

Chris F Clark <cfc@shell01.TheWorld.com>
23 Oct 2004 22:39:08 -0400

From comp.compilers

Related articles
problems with identifiers and keywords... micha-1@fantasymail.de (Micha) (2004-10-21)
*Re: problems with identifiers and keywords... cfc@shell01.TheWorld.com (Chris F Clark)* (2004-10-23)**
Re: problems with identifiers and keywords... gah@ugcs.caltech.edu (glen herrmannsfeldt) (2004-10-24)
Re: problems with identifiers and keywords... clint@0lsen.net (Clint Olsen) (2004-10-25)
Re: problems with identifiers and keywords... cfc@shell01.TheWorld.com (Chris F Clark) (2004-11-02)
Re: problems with identifiers and keywords... gah@ugcs.caltech.edu (glen herrmannsfeldt) (2004-11-06)
Re: problems with identifiers and keywords... wclodius@lanl.gov (2004-11-06)
Re: problems with identifiers and keywords... wyrmwif@tsoft.org (SM Ryan) (2004-11-07)
[22 later articles]

| List of all articles for this month |

From:	Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups:	comp.compilers
Date:	23 Oct 2004 22:39:08 -0400
Organization:	The World Public Access UNIX, Brookline, MA
References:	04-10-148
Keywords:	lex, design, comment
Posted-Date:	23 Oct 2004 22:39:08 EDT

Michael asked (several questions including):
> Identifiers may be the same as keywords. To get this right, I am now
> setting a global flag in the parser-action and if the scanner finds
> the flag set it doesn't look in the symbol-table to get the
> keyword-code but returns an Identifier-ID. Is this the correct way to
> handle this?

Our esteemed moderator replied:
[I've never found a better solution to this problem than the lexical
kludges you're using. -John]

And, while I rarely have cause to disagree with our moderator, this is
one case where I do. The lexical hack of returning an identifier in
some cases and a keyword in others [this is often implemented using
lexer states (in LEX or lexer classes in Yacc++)] while time tested
*may not* be the best solution for your problem.

The lexer state solution involves coordination between the lexer and
the parser. In particular, the parser must tell the lexer whether in
"this context" a keyword is expected or an identifier is
expected. There are several problems with coordinating the lexer's
state with the parser's context. One of the most frequently noted
ones is that it makes (multiple token) lookahead more difficult. A
related problem occurs if the context is miscommunicated between the
parser and the lexer, the lexer may return a keyword when only
identifers are expected or return the keyword as an identifier when
the keyword was supposed to be treated as a keyword.

There is fortunately another solution to the same problem. You can
use the grammar itself to resolve the context. This is done by addid
one (or more non-terminals to the grammar to specify which keywords
are legal as identifiers in each context. The basic principle is
quite simple. Let's assume you have ID as your identifier token and
LET, IF, THEN, and ELSE as keyword tokens.

1) You introduce a new identifier non-terminal, ident, with the
      following rule:

ident: ID | LET | IF | THEN | ELSE ;

2) You replace all uses of ID in your grammar, with ident, for example:

assign : LET ID "=" expr ";" ;

is changed to

assign : LET ident "=" expr ";" ;

3) recompile your grammar and see if there are any "new" conflicts

Resolving the conflicts detected at step 3 is generally the tricky
part. Most of the conflicts introduced will be shift/reduce
conflicts. These shift/reduce conflict generally result from places
where trailing context is required to disambiguate whether the keyword
is really a keyword or an identifier at that location.

For example, if you also had a rule like this one:

cond_assign: LET IF "=" expr THEN ident "=" expr ";" ;

You would get a shift/reduce conflict between IF and ident in the
state whare the parser had just seen LET IF and had the lookahead of
"=".

Note some people erroneously call such conflicts ambiguities, because
based on the first 3 tokens, the assign and cond_assign rules are
"ambiguous". However, I prefer to distinguish conflicts based on only
a portion of the text from "real" ambiguities where the 2 problem
rules really describe the same string after all the tokens have been
looked at.

You will get reduce/reduce conflicts rahter than shift/reduce
conflicts when the problematic keyword occurs at the end of a rule.
For example, with an END keyword, the conflict is likely to be a
reduce/reduce conflict rather than a shift/reduce.

There are several potential soultions you will need to explore to
resolve the conflicts.

1) If all your conflicts are shift/reduce conflicts and the "shift"
      action (i.e. recognizing the keyword as a keyword rather than an
      identifier) is the correct choice in each case, then you can let
      the "default" yacc rules resolve the problem (if you don't mind
      warnings).

2) If the above is true, but you want to get rid of the warnings, you
      can use the associativity and precdence declarations to effect the
      same behaviour with no warnings.

3) If you need more control, you need to intoduce additional rules and
      non-terminals. Let's keep looking at the same LET IF example. We
      can truly resolve the shift/reduce conflict by removing the IF
      keyword from the ident list in that particular context.

ident_no_if: ID | LET | THEN | ELSE ;

assign : LET ident_no_if "=" expr ";" ;

Now, with most grammars the number of conflicts is not too large and
you can get away with only have a few forms of the ident rule, say
ident, indent_no_if, ident_no_then_or_else, ident_no_let_else_or_end,
...

The only problem grammars tend to be highly "expression" oriented
rahter than "statement" oriented. In such grammars, you often don't
have much context to disambiguate your keywords with and the conflicts
can come from 2nd, 3rd or more removed interactions. That is, it
isn't that all ident's in expressions conflict with the THEN keyword,
but only ident's used at the end of expressions inside an THEN clause,
and other uses of expressions have othter conflicts. When this
happens, your grammar debugging becomes a nightmare.

Note, that sometimes the lexical state solution is the correct
resolution of those kinds of problems. The lexical state solution
works best if the problem can be resolved by preceding context. (For
example, after seeing an IF keyword, make THEN not an identifier until
a THEN keyword has been lexed.

However, if you have any kind of control, e.g. this is a language you
are designing for you own use, if you have *any* problems with
anything beyond the simplest "ident : ID | keyword" rule, consider not
allowing keywords as identifiers. For as a rule, if the grammar is
hard to write, then the language will be hard to use.

As a case in point, in Yacc++ we use the above ident: keyword rule,
and only in the simplest form. Recently, we went to add a feature
where it seemed like a keyword might be useful in an expression
context, e.g.

expr: RANGE ( expr ) ; // the expression represents a "range" of
// characters

And we liked the resolution of the RANGE ( as meaning the keyword, so
we used the appropriate precedence directive to suppress the warning.
No sooner had we done that and shipped the prototype to the customer
for whom we were working, did the customer report that their "RANGE"
identifiers stop working. Sure enough, the keyword in that context
was a mistake, it was already legal to say RANGE ( in Yacc++ grammars
and the shift-reduce conflict was trying to tip us off to that case.
Fortunately, the mistake was only made in a prototype version and we
can fix the grammar back to how it was before having to deal with an
obscure wart in our notation.

BTW, one of the reasons I prefer the "ident : keyword" rule approach
to the problem rather than the lexer state solution is that you do get
the conflict messages. And, had I not been the over-confident idiot
who simply said, "oh yeah, I know that there will be a shift-reduce
conflict here that I need precedence to resolve" and instead actually
looked at the underlying conflict, then the problem would have gotten
resolved much sooner. With a lexer state flag, I would never have
kknown that I had introduced a problem until some user had asked "why
doesn't this grammar parse like I expect it to?" Of course, in this
case, that was found soon, but it would have been a problem if we
hadn't caught it and we put it in our documentation and so on and so
on.

Hope this helps,
-Chris

*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
------------------------------------------------------------------------------
[You're right. I think I've been unlucky and mostly had languages where
the ambiguities from adding keywords back made the grammar explode. -John]

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: problems with identifiers and keywords...

Chris F Clark <cfc@shell01.TheWorld.com>23 Oct 2004 22:39:08 -0400

Chris F Clark <cfc@shell01.TheWorld.com>
23 Oct 2004 22:39:08 -0400