Re: problems with identifiers and keywords...

Chris F Clark <cfc@shell01.TheWorld.com>
2 Nov 2004 12:12:05 -0500

          From comp.compilers

Related articles
problems with identifiers and keywords... micha-1@fantasymail.de (Micha) (2004-10-21)
Re: problems with identifiers and keywords... cfc@shell01.TheWorld.com (Chris F Clark) (2004-10-23)
Re: problems with identifiers and keywords... gah@ugcs.caltech.edu (glen herrmannsfeldt) (2004-10-24)
Re: problems with identifiers and keywords... clint@0lsen.net (Clint Olsen) (2004-10-25)
Re: problems with identifiers and keywords... cfc@shell01.TheWorld.com (Chris F Clark) (2004-11-02)
Re: problems with identifiers and keywords... gah@ugcs.caltech.edu (glen herrmannsfeldt) (2004-11-06)
Re: problems with identifiers and keywords... wclodius@lanl.gov (2004-11-06)
Re: problems with identifiers and keywords... wyrmwif@tsoft.org (SM Ryan) (2004-11-07)
Re: problems with identifiers and keywords... vbdis@aol.com (2004-11-07)
Re: problems with identifiers and keywords... cfc@shell01.TheWorld.com (Chris F Clark) (2004-11-14)
Re: problems with identifiers and keywords... genew@mail.ocis.net (Gene Wirchenko) (2004-11-14)
[17 later articles]
| List of all articles for this month |

From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 2 Nov 2004 12:12:05 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 04-10-148 04-10-170 04-10-174
Keywords: parse, syntax
Posted-Date: 02 Nov 2004 12:12:05 EST

Glen Herrmannsfeldt wrote:


> Two languages which have no reserved words and a reasonable number
> of compilers are Fortran and PL/I. Traditionally Fortran has made
> it even harder by not making blank space significant, yet the
> process for writing Fortran parsers is well understood.


Yes, I think that is what the moderator was alluding to when he wrote:
> [You're right. I think I've been unlucky and mostly had languages where
> the ambiguities from adding keywords back made the grammar explode. -John]


PL/I's keywords are probably not parseable by any reasonable grammatic
method, which is why I started with *may not* be. Sadly, it is not
just "old" languages like Fortran and PL/I that have the problem.


While we were consulting at Honeywell on an ANSI-C compiler, Jim
Roskind and I had the opportunity to discuss the "typedef" problem.
His results were interesting. He found a grammatic solution to
typedef usage in ANSI C. As I recall, there are about 4 places you
have to split the grammar (on identifier/typedef) and clone sub-trees
to make it work. However, there is a grammatic solution. He pursued
the same approach on C++ and found that he could unravel the context
to get an approximate solution, but it appeared that a true solution
would require infinite lookahead.


When you look at that work, it becomes obvious why I think that using
hacks to get things to parse are generally a bad idea. Essentially
there are these problematic phrases in C++ where you cannot determine
from context whether an identifier is a new variable declaration or a
reuse of a typedef without looking indefinitely far ahead and those
aren't the only ambiguities. And the sad part about it, is that I
generally like C++ (of course, I also liked PL/I, so maybe I have bad
taste--or perhaps it's just that I used subsets of both languages and
ignore those parts that don't fit my mental model).


The point is that for language design, if you want to allow keywrods
as identifiers (and there are good reasons for that), then make
certain that your keywords can only appear (as keywords) in
unambiguous contexts. If you aren't designing the language, and have
to live with someone else's design, then do what you need to to parse
whatever mess the original designer left you.


How to make the context for keywords unambiguous? That's not that
hard.


1) Start all special phrases with a keyword.


2) Make certain that following the keyword the next token will always
      be:
      2a) an identifier or
      2b) some other token that isn't an expression operator.


3) make sure the following are not otherwise legal
      3a) identifier identifer (if 2a is used)
      3b) identifer special token (if 2b is used)


Pascal's declaration syntax is a good exmaple.
        var y: ^x; (* var y can have no other meaning;
                                    : is not used in expressions *)


C's declaration syntax is loaded with potential ambiguity.
        x*y; // * for ptr and for multiply; is x a typedef or not?


You can loosen those prescriptions a bit. Take the grammar for yacc
as an example.


All keywords are prefixed by a % token. So, if one sees a keyword and
the preceding token is not a %, then the usage is as a plain
identifier.


%left right // left is a keyword here, right is an id


Yacc uses "id :" as a special syntax to start rules. The : token is
only legal as the 2nd token in a rule. That makes it unambiguous
(although requiring two tokens of lookahead) as to the rule
boundaries, and allows yacc to omit trailing semi-colons.


However, personally, I found that too ambiguity prone and in Yacc++ we
required the closing semi-colons on rules (and declarations). It
isn't that difficult to come up with the appropriate lexical hack to
catch the "id :" pair, but once one allows comments and stuff to
intervene the hack becomes cumbersome both to implement and
potentially for users to parse. The problem is not so much
recognizing the start of the production, but recognizing the end of a
production, especially if there are syntax errors involved. Consider
the following fragment:


a: b c d
b // : d e
c : f


The comment "swallows" the : and changes the b line from being a rule
to being a continuation of the previous line. Without optional
semi-colons, the comment causes a syntax error.


And, this is my point about language design--the format with the
optional semi-colons (yacc-style) may appear simpler, but in the
presence of errors, it is actually harder. And, I think language
designers should always consider those impacts. Ask not only, does
your grammar make allow nice syntax for the cases you intend, but in
the presence of syntax errors, is it always easy to distinguish a
slight mispelling of one construct from a mispelling of a different
construct. (If you can, think in terms of error-correcting distances.
You don't want single "bit" errors, causing misparses into other valid
but unintended syntaxes.)


Please excude my wandering off-topic. It's just that one thing lead
to another....


-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.