Parsing C, the last word

jar@florida.HQ.Ileaf.COM (Jim Roskind x5570)
Tue, 14 Jan 92 16:23:39 EST

          From comp.compilers

Related articles
Lookahead vs. Scanner Feedback hjelm+@cs.cmu.edu (1992-01-03)
Parsing C, the last word jar@florida.HQ.Ileaf.COM (1992-01-14)
| List of all articles for this month |

Newsgroups: comp.compilers
From: jar@florida.HQ.Ileaf.COM (Jim Roskind x5570)
Keywords: C, parse, FTP
Organization: Compilers Central
References: 92-01-012
Date: Tue, 14 Jan 92 16:23:39 EST

> From: hjelm+@cs.cmu.edu (Mark Hjelm)
> Organization: School of Computer Science, Carnegie Mellon
>
> I have a parser, written using Yacc and Lex, for ANSI C. The grammar is
> taken pretty much verbatim from the standard. The scanner uses the symbol
> table to decide whether to return "identifier" or "typedef name" as the
> token type for an identifier. How do I KNOW that there are no situations
> which, due to parser lookahead, would cause the scanner to return an
> incorrect token type for an identifier (i.e. return "identifier", even
> though the identifier was just/will be made into a "typedef name")? Is
> there a general answer to this question for other parsing strategies
> (possibly with other amounts of lookahead) and other grammars (languages)?
>


This is an excellent and interesting question, and pulled out a very
nice line of responses. My answers are:


1) I have no "general method" of proving correctness.
2) In this case, correctness can be proved.
3) If your grammar came from the standard, the proof will
identify an error in your grammar.
4) A correct grammar exists, and I distribute it for free.




For folks that can't figure out why there is so much discussion about this
problem, consider the code fragment:


T(*b)[4];


Notice how the meaning is very different depending on whether T is a
typedef-name, or an identifier (and hopefully a function in this case).
For this reason most C lexers distinguish between typedef names and other
identifiers. Alas, as the poster points out, there is concern that the
feedback loop (parser annotates symbol table, which lexer interrogates,
and then feeds tokens to the parser) will fail in some complex scenario.


Other threads of response included the use of a GLR parser, which tries
all alternatives. Alas, as will be shown, the ambiguity in C is too
great, and such an approach will not provide a unique syntactically
correct parse when it does exist.


Other respondees mentioned that almost no yacc based parsers get this area
of the standard right. Since the grammar included in the standard is
misleading, the cause for this problem will become abundantly clear. I
will start with a description of how a proof of correctness could proceed,
take a slight aside to show why the traditional grammar tends to cause
this proof to fail, and explain how the grammar can be written (provably)
correctly. Since I am typing this on the fly, I will surely make some
errors, but I hope that most of the readers will find the content
interesting. Note that my grammar distribution includes a very large
paper that deals with many C/C++ ambiguities in much greater detail, and
with many more examples.


The fact that yacc uses LR(1) scanning is pivotal to the argument in this
special case. In this special case you can note that the typedef-ness of
an identifier can *only* change after a declarator is complete.
Specifically, the standard states that an identifier enters scope only
after its declarator is complete. The completion of a declarator is
always marked by one of three tokens:


";" end of declaration statement
"," end of individual declarator (another is comming)
"=" end of declarator, initializer is comming.


To prove correctness, you must show that reductions involved with
annotation of your symbol table take place while one of the three tokens I
just mentioned is waiting in the lookahead buffer. The argument notes
that since none of the three tokens I identified relies on the
typedef-ness of an identifier, everything will work out (i.e., the
identifier vs typedef-name separation by the lexer can always be done in a
correct and timely manner). Basically, all you have to do is look at the
position of the action code that is placed in the grammar to annotate the
symbol table, and be sure that this happens before one of the above three
terminating tokens (context counts) has being processed.


As an example, suppose you had a rule that looked like:


declaration: storage-class INT identifier ';' { /* annotate sym table */ }


Please don't bother to argue that you would never have such a specific
rule in the grammar, rather note that the action is being done *after*
reading the ';'. This is "bad." If you had such a rule, then there is a
chance that yacc would have read the next token (it is allowed to if it
needs to), and there is a chance that the next token *might* be an
identifier. Hence, with the above rule, there is a chance that the lexer
would process an upcoming identifier *prior* to the annotation of the
symbol table, and hence incorrect results could arise. For example:


typedef int T ;
T a;


might be incorrectly parsed (the second "T" might be read *and* tokenized
by the lexer before the symbol table is annotated). A guaranteed correct
rule could be written:


declaration: storage-class INT identifier { /* annotate sym table*/ } ';'


Note that here the action takes place with the parser at *most* looking
ahead at the ';', and hence the lexer would not have had to process future
identifier/typedef-name distinctions.


Getting back to your example, since you copied the grammar from the ANSI
Standard, your grammar is probably wrong. To expose the problem, try
parsing:


typedef int A, B(A); /* my bet is you get a syntax error*/


When correctly parsed, the above is equivalent to:


typedef int A;
typedef int B(A);


or simply:


typedef int A;
typedef int B(int); /* B's type is function taking an int
returning and int */




Again, to verify my interpretation, look at the verbose discussion of
declarations in the ANSI standard and note that an identifier enters scope
when its declarator is complete. Despite this assertion in prose, most
grammars are based on the K&R or ANSI grammars, which gathers an
"init-declarator-list" before combining the declarators with the
"declaration-specifiers." In the above example, gathering the entire list
"A, B(A)" before annotating the symbol table for "A" would allow the lexer
to incorrectly tokenize the second "A".


You might wonder how GNU GCC (or others related compilers) manage to get
the above parse right, when their grammar looks a lot like this standard
(and I assert, faulty) grammar. In the case of GCC, this area of the
action code includes a hack wherein the "decl-specifiers" are fetched from
lower points in the stack using $-1 (using negative numbers allows action
code to reach down into questionable areas of the yacc stack, whereas
positive numbers correspond to the tokens in the productions with the
action). Basically, by reaching backwards when they see a ',' (or '='),
they are able to annotate the symbol table even though they use the
standard grammar. Once again, it is provable that they get the correct
parse (if they can prove that $-1 always contains the desired
decl-specifier value. This latter assertion is far from trivial to prove,
as they must correctly process function declarations, which occasionally
have no decl-specifiers! They get it right, but a second hack is required
atop the first hack, and these two hack unite to make C++ parsing a horror
for them).


For those compiler hackers out there, it is worth noting that the problems
are greater than what was just given in the above example. Not only can
"typdef-ness" be established by a declarator, it can be taken away. For
example, consider the following:


#include <assert.h>
typedef char F;
main() {
long a=sizeof(F), F, b=sizeof(F);
assert(1 == a);
assert(sizeof(long) == b);
}


With proper rewriting, a grammar can be made to correctly process such
declaration without "cheating" and looking at $-1 (or where ever else this
info might be socked away). The trick is to combine the declarator with
the declaration specifier (and annotate the symbol table), then combine
with the optional initializer (rather than forming an init-declarator),
and then if there is a ',', form a token that once again acts just like
the decl-specs to process the next declarator. If you want to see the
correct grammar, take a look at my distribution (which includes a C and a
C++ grammar, along with a cheap FLEX input file).


Since one of the respondees included suggesting that a GLR parser could
just try all interpretations, and that only one would be valid, the
following is a rather interesting example:


typedef int T;
main() {
int T=100, a=(T)+1;
assert(101 == T);
}


Notice that the text "(T)+1" could incorrectly be interpreted, if "T" were
still a typedef-name, as a cast to type T of the expression "+1". Note
that the GLR approach would yield two distinct and valid parses, and hence
there would be no single winner (without additional heuristics).


Another person focused on the issues involving redeclaration of
typedefnames at inner scopes. Indeed, this was something of a bear to
tackle (without kludging). Examples of the code that usually causes
problems include:


typedef int T1 ;
typedef int T2 ;
typedef int T3;
main() {
const T1 T1; /* redeclares T1 to be an int */
const T2 (T2); /* redeclares T2 to be an int */
const T3; /* syntax error : missing declarator */
}


Bruce Blodgett (formerly of Honeywell Corporation) and I independently
implemented clean resolutions of this within a YACC grammar, but the
results can be seen within my distribution. Interestingly, this careful
analysis of C identified some unaddressed scenarios in the ANSI standard.
Since these results go beyond the scope of this posting, interested
readers are encouraged to look at my paper which is included with my
distribution.


Doug Lea and Doug Schmidt have graciously offered to provide anonymous ftp
sites for the 8 files, as well as the Berkeley YACC source (if you need
it).


ics.uci.edu (128.195.1.1) in the ftp/gnu directory (even though neither of
the archives are GNU related) as:


                c++grammar2.0.tar.Z
                byacc1.8.tar.Z


mach1.npac.syr.edu (128.230.7.14) in the ftp/pub/C++ directory as:


                c++grammar2.0.tar.Z
                byacc1.8.tar.z




Jim Roskind
Independent Consultant
(407)729-4348 or (617)290-4990 x5570
jar@hq.ileaf.com or ...!uunet!leafusa!jar


p.s.: I'm always looking for interesting compiler and/or grammar work.
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.