Re: How to parse keywords that may be used as identifiers

Scott Stanchfield <scooter@mccabe.com>
19 Aug 1996 23:17:04 -0400

          From comp.compilers


From: Scott Stanchfield <scooter@mccabe.com>
Newsgroups: comp.compilers.tools.pccts,comp.compilers
Date: 19 Aug 1996 23:17:04 -0400
Organization: McCabe & Associates
References: <32184418.167E@research.techforce.nl> 96-08-058
Keywords: parse

>[How can I parse languages where keywords aren't reserved words?]


One way to do this with PCCTS is to use "token classes." For example,
if you were doing a language like PL/I (no reserved words -- how evil!)
you might have something like


#token IF "if"
#token THEN "then"
#token ELSE "else"
#token IDENT "[a-zA-Z$_][a-zA-Z0-9$_]*" // might not be quite right...


#tokclass IDENTIFIER {IDENT IF THEN ELSE}


then, in a rule like


if_statement
    : IF expression THEN statement ELSE statement
    ;


assuming that "expression" could match IDENTIFIER, you should be able to
parse an incredibly evil statement like


    if if then else = then else then = if


fairly well. (Don't sue me if it takes a while to get it to work
right...)


The #tokclass definition is basically shorthand for


identifier
    : IDENT
    | IF
    | THEN
    | ELSE
    ;
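For readers without PCCTS at hand, here is a rough hand-written sketch of the same idea in Python. It is not PCCTS output and not how PCCTS implements token classes; the names (tokenize, Parser, IDENTIFIER, parse) are made up for illustration, "expression" is cut down to a single name, and an assignment statement is assumed so the evil statement has something to parse. The set IDENTIFIER plays the role of the token class, and two tokens of lookahead decide whether a leading "if" starts an if-statement or an assignment.

```python
import re

KEYWORDS = {"if", "then", "else"}

# Plays the role of "#tokclass IDENTIFIER {IDENT IF THEN ELSE}".
IDENTIFIER = {"IDENT", "IF", "THEN", "ELSE"}


def tokenize(text):
    """Return (kind, spelling) pairs; keywords get their own kind.

    Characters that are neither part of a name nor '=' are silently
    skipped, which is good enough for this sketch.
    """
    tokens = []
    for word in re.findall(r"[A-Za-z$_][A-Za-z0-9$_]*|=", text):
        if word == "=":
            tokens.append(("ASSIGN", word))
        elif word.lower() in KEYWORDS:
            tokens.append((word.upper(), word))
        else:
            tokens.append(("IDENT", word))
    tokens.append(("EOF", ""))
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def kind(self, offset=0):
        """Kind of the token `offset` places ahead (clamped at EOF)."""
        index = min(self.pos + offset, len(self.tokens) - 1)
        return self.tokens[index][0]

    def expect(self, kinds):
        """Consume the next token if its kind is in `kinds`."""
        kind, spelling = self.tokens[self.pos]
        if kind not in kinds:
            raise SyntaxError(f"expected {sorted(kinds)}, got {kind} {spelling!r}")
        self.pos += 1
        return spelling

    def statement(self):
        # "if =" starts an assignment to a variable that happens to be named if.
        if self.kind() == "IF" and self.kind(1) != "ASSIGN":
            return self.if_statement()
        return self.assignment()

    def if_statement(self):
        self.expect({"IF"})
        condition = self.expression()
        self.expect({"THEN"})
        then_branch = self.statement()
        self.expect({"ELSE"})
        else_branch = self.statement()
        return ("if", condition, then_branch, else_branch)

    def assignment(self):
        target = self.expect(IDENTIFIER)
        self.expect({"ASSIGN"})
        return ("assign", target, self.expression())

    def expression(self):
        # Any member of the token class is accepted as a name here.
        return ("name", self.expect(IDENTIFIER))


def parse(text):
    parser = Parser(text)
    tree = parser.statement()
    parser.expect({"EOF"})
    return tree
```

Calling parse("if if then else = then else then = if") reads the statement as: if (if) then (else = then) else (then = if). A real expression grammar would need more care about where THEN and ELSE may legally appear as names.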




This may look feasible in yacc as well, but you'll need to delay symbol
table lookup until you're inside the parser so you can determine the
symbol's function based on context. You can't look up "if" in the symbol
table while scanning; you must wait until you see a rule like


primary_expression_component_or_whatever_you_call_it
    : literal
    | IDENTIFIER
            <<lookup in symbol table>>
    ;


This leads to a potentially bigger problem: the language you are parsing
might be syntactically ambiguous, so that a statement's meaning can only
be determined from the "types" of its components. An example is the
T(x) ambiguity in C++: is it a variable declaration or a function call?


To resolve the syntactic ambiguity, a yacc-based parser would likely
have the lexer return different tokens based on the "type" of the
identifier being scanned. (The scanner performs the symbol table
lookup.) With a language that has non-reserved words, you can't have the
scanner just look up something like "if" and tell whether it's being
used as a variable or a keyword unless the scanner keeps track of
context as well.
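A toy Python sketch of that scanner-side lookup, for the T(x) case only. The function names, the token kinds TYPE_NAME / IDENT / PUNCT, and the type_names set standing in for the symbol table are all invented for illustration; a real yacc/lex setup would do this in C inside the lexer.

```python
import re


def scan(text, type_names):
    """Classify each name at scan time by consulting the symbol table.

    `type_names` is a stand-in for the symbol table: the set of names
    currently declared as types.
    """
    tokens = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|[()=;]", text):
        if not (lexeme[0].isalpha() or lexeme[0] == "_"):
            tokens.append(("PUNCT", lexeme))
        elif lexeme in type_names:
            tokens.append(("TYPE_NAME", lexeme))
        else:
            tokens.append(("IDENT", lexeme))
    return tokens


def classify(tokens):
    """Decide what 'X ( y ) ;' means from the kind of its first token alone."""
    return "declaration" if tokens[0][0] == "TYPE_NAME" else "call"
```

With T in the table, classify(scan("T(x);", {"T"})) gives "declaration"; with an empty table the same text gives "call". The scheme works here because a name's kind does not depend on where it appears. For a word like "if" in a language without reserved words it does, and this scanner has nothing to consult for that.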


A predicated parser generator, such as PCCTS, can resolve that ambiguity
using semantic predicates. (See my post on comp.compilers RE lookahead
and parser->scanner communication.) However, if you're lucky, the
language will not be syntactically ambiguous...
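To give a feel for what a semantic predicate buys you, here is a small Python analogue. It is only a sketch of the idea, not PCCTS syntax (PCCTS writes predicates as <<...>>? inside the grammar), and split_tokens, parse_statement and type_names are hypothetical names. The scanner stays ignorant of the symbol table; the parser asks the question at the point where two alternatives match the same tokens.

```python
import re


def split_tokens(text):
    """Scanner with no symbol table access: every name is just a name."""
    return re.findall(r"[A-Za-z_]\w*|[()=;]", text)


def parse_statement(tokens, type_names):
    """Parse 'NAME ( NAME ) ;', which fits two alternatives syntactically."""
    if len(tokens) != 5 or (tokens[1], tokens[3], tokens[4]) != ("(", ")", ";"):
        raise SyntaxError(f"cannot parse {tokens!r}")
    head, arg = tokens[0], tokens[2]
    # Semantic predicate: both alternatives match the token sequence, so
    # the parser consults the symbol table to pick one.
    if head in type_names:
        return ("declaration", head, arg)
    return ("call", head, arg)
```

parse_statement(split_tokens("T(x);"), {"T"}) yields a declaration, while the same tokens with an empty type_names set yield a call. Because the test runs inside the parser, it can be placed only in the rules where the distinction matters.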


Hope this helps a bit,


Scott
--
Scott Stanchfield McCabe & Associates -- Columbia, Maryland
--

