Re: Help: PCLEX, PCYACC and reserved words

Chris Clark USG <clark@quarry.zk3.dec.com>
10 Dec 1997 00:42:56 -0500

From comp.compilers

Related articles
Help: PCLEX, PCYACC and reserved words garryw@cyberus.com (Garry Whitworth) (1997-12-05)
Re: Help: PCLEX, PCYACC and reserved words fjh@murlibobo.cs.mu.OZ.AU (1997-12-07)
Re: Help: PCLEX, PCYACC and reserved words mark@research.techforce.nl (Mark Thiehatten) (1997-12-10)
Re: Help: PCLEX, PCYACC and reserved words clark@quarry.zk3.dec.com (Chris Clark USG) (1997-12-10)


From:	Chris Clark USG <clark@quarry.zk3.dec.com>
Newsgroups:	comp.compilers
Date:	10 Dec 1997 00:42:56 -0500
Organization:	Digital Equipment Corporation - Marlboro, MA
References:	97-12-036 97-12-047
Keywords:	parse, syntax, lex

"Garry Whitworth" <garryw@cyberus.com> asked:
> Is there an easy way for my lex file to say that PER, PERC, PERCE,
> PERCEN, and PERCENT should all return PERCENT?

Our moderator correctly suggested the use of a keyword table with
partial matching. An alternative approach (not that I am recommending
this) would be rules like the following:

PER(C(E(N(T)?)?)?)? { return PERCENT; }

In addition to being ugly, you must figure out the minimal substrings
by hand, which over the course of 400 keywords can be somewhat of a
pain. {I personally would recommend a tool, such as Yacc++, which
automatically calculates the substrings for you, but I am biased in
that regard, as I helped write it. BTW, Yacc++ pushes the keyword
table approach, as that's what it will automatically generate.}
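For readers who want to see the keyword table approach spelled out, here is a minimal C sketch of the kind of lookup a lex action could call on each word it scans. It is not the code Yacc++ generates. The names compute_min_lengths and lookup_keyword, the token codes, and the extra keyword PEN are all assumptions made up for the illustration; PEN is there only so that PERCENT ends up with PER as its shortest unambiguous abbreviation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; in a real build these would come from the
   header that yacc generates. PEN is an invented keyword. */
enum { ID = 258, PERCENT, PEN, CHARACTER, DECLARE };

struct keyword {
    const char *name;   /* full spelling, upper case */
    int token;          /* token code handed to the parser */
    size_t min_len;     /* shortest accepted abbreviation (computed) */
};

static struct keyword keywords[] = {
    { "PERCENT",   PERCENT,   0 },
    { "PEN",       PEN,       0 },
    { "CHARACTER", CHARACTER, 0 },
    { "DECLARE",   DECLARE,   0 },
};
#define NKEYWORDS (sizeof keywords / sizeof keywords[0])

/* Length of the common prefix of two strings. */
static size_t common_prefix(const char *a, const char *b)
{
    size_t n = 0;
    while (a[n] != '\0' && a[n] == b[n])
        n++;
    return n;
}

/* The shortest unambiguous abbreviation is one character longer than
   the longest prefix shared with any other keyword, capped at the
   keyword's own length. Call this once before any lookup. */
void compute_min_lengths(void)
{
    for (size_t i = 0; i < NKEYWORDS; i++) {
        size_t longest = 0;
        for (size_t j = 0; j < NKEYWORDS; j++) {
            if (i == j)
                continue;
            size_t n = common_prefix(keywords[i].name, keywords[j].name);
            if (n > longest)
                longest = n;
        }
        size_t full = strlen(keywords[i].name);
        keywords[i].min_len = longest + 1 < full ? longest + 1 : full;
    }
}

/* Return the keyword token if word is an acceptable abbreviation
   (compared case-insensitively), otherwise ID. */
int lookup_keyword(const char *word)
{
    assert(word != NULL);
    size_t len = strlen(word);
    for (size_t i = 0; i < NKEYWORDS; i++) {
        const struct keyword *k = &keywords[i];
        if (len < k->min_len || len > strlen(k->name))
            continue;
        size_t n = 0;
        while (n < len && toupper((unsigned char)word[n]) == k->name[n])
            n++;
        if (n == len)
            return k->token;
    }
    return ID;
}
```

With this table, a single lex rule for words could do something like return lookup_keyword(yytext); after compute_min_lengths() has run once at startup. Note that a keyword sharing no prefix with any other gets a minimum length of 1, so a real table would probably also want a floor on the abbreviation length; that floor is left out here.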

> The language I am dealing with also allows the programmer to declare
> a variable using one of its reserved words . . . .

The keyword as identifier problem is also clumsy. However, it can be
made less so with a production like:

ident: id | DECLARE | PERCENT | CHARACTER;

where id is your lexer token for your identifiers and ident is a new
non-terminal which you use in its place.

Now, it is quite likely(*) that there will be at least one place in
the grammar where this produces conflicts. In those places, you need
to flatten the grammar by using the rhs of the rule (id | DECLARE |
...) instead of ident. Unfortunately, flattening may not resolve all
the conflicts, as there may be places where the grammar is truly
ambiguous for one or more of the keywords. In those places, you remove
the keyword from the list and document the limitation. For example,
at the very start of a statement, the word DECLARE might always need
to introduce a declaration. Presuming that the 4GL you are parsing
already has an implementation, it will be possible to get a
conflict-free grammar (eventually). It is simply a matter of
disentangling exactly what the current parser allows. Doing so will
make your language much more robust, as you will have to figure out
any cases where the current parser has implied restrictions.
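
To make the "remove the keyword from the list in that one place" idea concrete, here is a small C sketch written as a predicate a hand-coded parser might use. It is only an analogue of what the flattened yacc rules would express, not yacc output. The token codes, the position enum, and the function name acts_as_ident are assumptions invented for the illustration, as is the choice of DECLARE as the one restricted keyword.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical token codes; NUMBER stands in for any non-identifier. */
enum token { ID = 258, DECLARE, PERCENT, CHARACTER, NUMBER };

/* Where in the statement the token was seen (an assumed distinction). */
enum position { STATEMENT_START, ELSEWHERE };

/* Mirrors  ident: id | DECLARE | PERCENT | CHARACTER;  except that
   DECLARE is dropped from the list at the start of a statement, where
   it must introduce a declaration. */
bool acts_as_ident(enum token t, enum position where)
{
    assert(where == STATEMENT_START || where == ELSEWHERE);
    switch (t) {
    case ID:
    case PERCENT:
    case CHARACTER:
        return true;
    case DECLARE:
        return where != STATEMENT_START;   /* the documented limitation */
    default:
        return false;
    }
}
```

In a yacc grammar the same restriction would show up as a statement-start rule that lists id | PERCENT | CHARACTER directly, while every other context keeps using ident.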

(*) Fortunately, it is not guaranteed that using the ident: id ...
rule will introduce conflicts. It really depends on how carefully your
grammar is constructed. For example, we have the same concept in
Yacc++. You can have a token named "token", and "token" is also the
keyword which introduces a token declaration, as in the following
grammar:

class example;
lexer
token token; // declare the token 'token'
token : "a".."z"+; // specify the regexp for the token 'token'
parser
list : token+; // use the token 'token' in the parser

This does not introduce any conflicts into the Yacc++ grammar for
Yacc++ because we were very careful about the language. In
particular, keywords are mostly used at the beginning of a rule for a
declaration. In those uses of keywords, a keyword is never followed by
a colon and is usually followed by another keyword, an identifier,
or a block of code (and it is never ambiguous as to whether another
keyword is expected or an identifier). In the rule which allows a
keyword as an identifier at the start of the production, the
keyword/identifier is always followed by a colon. This assures us
that the follow-set for the ident production is distinct.
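
The effect of that distinct follow-set can be sketched in C as a one-token lookahead check. This is a hand-rolled analogue for illustration only; Yacc++ gets the same result from its grammar rather than from code like this. The token codes and the function name names_rule are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical token codes for a Yacc++-like specification language. */
enum token { T_ID, T_KW_TOKEN, T_KW_KEYWORD, T_KW_SUBSTR,
             T_COLON, T_SEMI };

static bool is_keyword(enum token t)
{
    return t == T_KW_TOKEN || t == T_KW_KEYWORD || t == T_KW_SUBSTR;
}

/* True if toks[i] is the name of a rule being defined: it is an
   identifier or a keyword, and the very next token is a colon. A
   keyword followed by anything else is read as a keyword. */
bool names_rule(const enum token *toks, size_t n, size_t i)
{
    assert(toks != NULL && i < n);
    if (toks[i] != T_ID && !is_keyword(toks[i]))
        return false;
    return i + 1 < n && toks[i + 1] == T_COLON;
}
```

For the two lines of the example grammar, "token token;" gives a false result for both words, so the first is read as the declaration keyword, while "token : ..." gives a true result for its first word, so it is read as the name of the rule.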

Of course, there were trade-offs in doing that. For example, Yacc++
grammars have keyword declarations and those declarations have
modifiers. However, the modifiers can only occur in a specific order,
as otherwise we would have had an ambiguity as to whether the keyword
representing the modifier was a keyword or an identifier being
declared.

keyword xxx; // xxx is now declared a keyword
substr keyword yyy; // yyy is now declared a keyword with substring matching
keyword substr zzz; // substr and zzz are now declared keywords
// (it does NOT declare zzz a keyword with substring matching)

We had a choice. We could have disallowed certain keywords as being
identifiers in certain contexts (i.e. the user could not have declared
substr as a keyword). Or, we could do what we chose and significantly
restrict the contexts where keywords could be used as keywords. As
they sing, "You can't always get what you want, but if you try
sometimes, you can get what you need."

Hope this helps,
-Chris Clark
************************************************************************
Compiler Resources, Inc. email: compres@world.std.com
3 Proctor St. http://world.std.com/~compres
Hopkinton, MA 01748 phone: (508) 435-5016
USA 24hr fax: (508) 435-4847