Re: Writing a recursive descent parser in C


From: spinoza1111@yahoo.com (Edward G. Nilges)
Newsgroups: comp.compilers
Date: 3 Dec 2001 20:23:47 -0500
Organization: http://groups.google.com/
References: 01-11-146
Keywords: C, parse
Posted-Date: 03 Dec 2001 20:23:47 EST

bilbo@volcanomail.com (Antoine Lec.) wrote...


> Given a context-free grammar, I'm trying to write a recursive descent
> parser for this grammar, with a minimum of machinery:
>
> That is, the only things I want to do, for the beginning, are:
> incrementing the actual index in the tested string, and testing
> whether string[index]==x, where x is a token; maybe saving and
> restoring the index too.
>
> The dragon book is a bit quick on this subject, and I've found nothing
> that explains this algorithm well on the net.
>
> Thanks for any advice!


The following discussion assumes that you have written an excellent
scanner which in most languages skips white space outside of strings
and builds an array which for each token contains the value and type
of the token. Instead of the value, it can be more economical to
store the position and length of the token. If source programs are
large, the tokenized array can be stored in a temporary data base, and
the following discussion assumes that you have full random access to
all source code either in RAM or on disk.
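
For concreteness, here is a minimal sketch of the kind of token array
such a scanner might build (the type and field names are my own
illustration, not from the original post):


          /* One entry per token.  Storing the position and length of
             the token within the source text, rather than a copy of
             its value, keeps the array small. */
          typedef enum { TK_IDENT, TK_NUMBER, TK_KEYWORD,
                         TK_PUNCT, TK_NEWLINE, TK_EOF } TokenType;

          typedef struct {
                TokenType type;   /* class of the token */
                int pos;          /* offset of the token in the source text */
                int len;          /* length of the token in characters */
          } Token;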


I cannot stress too much the importance of fully debugging the scanner
and in general not occluding or confusing the jobs of the parser and
the scanner. Quite a lot of so-called "fourth generation languages",
pushed by managers as "replacing all those beatnik/hippie/slacker
programmers", are in actuality written by people without compiler
development training, and as a result they occlude scanning and
parsing, with the result that actually coding in the "fourth
generation language" is a nightmare.


Fortunately, you see less of this than in the past (where working with
some "packages" involved understanding some clown's failure to
understand the syntax of Basic or SQL, where to understand
misunderstanding is a sad waste of the human spirit), but thanks to
the MIS anti-intellectualism, which dismisses compiler design theory
as a waste of time, "packages" still emerge based on pathological
programming languages.


Of course, it may be important to avoid scanning the entire source
text and creating a large data structure of tokens. But even here,
you can separate the scanner from the parser by separating the scanner
code as a "scanner server" which maintains a position in the raw text
and provides both the next token, AND the important ability to back
up.
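
A sketch of what such a scanner server might look like in C (the
interface and the names are illustrative, not from the original post):


          /* The scanner keeps its own position in the raw text; the
             parser can save that position and later back up to it. */
          typedef struct {
                const char *src;  /* the raw source text */
                int pos;          /* current scan position */
          } Scanner;

          Token nextToken(Scanner *s);  /* skip white space, return the next token */
          int  scannerMark(const Scanner *s)       { return s->pos; }
          void scannerBackup(Scanner *s, int mark) { s->pos = mark; }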


OK, supposing you've completed the scanner:


For each nonterminal, write a procedure that returns True on success
and False on failure.


Each procedure should have the array of parsed tokens t and the index
i as reference parameters (the index should be passed as an address,
not a value). Every procedure should ALSO have the index of the
last token, for a reason discussed below.
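
In C the common signature might look like this (a sketch; I use
<stdbool.h> for the True/False result):


          #include <stdbool.h>

          /* Every recognizer sees the token array, the index i by
             address (so it can advance it), and the index of the
             last token. */
          bool program(Token t[], int *i, int end);
          bool statement(Token t[], int *i, int end);
          bool expression(Token t[], int *i, int end);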


When the BNF production for the nonterminal is a simple series of
nonterminals in the form


          nonterminal := nt1 nt2 ... ntn


code a series of if statements


          if (!nt1(t, &i, end)) return false;  /* i is the address, not the value, of the index */
          if (!nt2(t, &i, end)) return false;
          ...


where the procedure nti is the recognizer coded for nonterminal nti.


Although it seems to me a good idea to code separate procedures for
each nonterminal symbol, I suggest that you code a single procedure to
recognize terminals, which in your BNF have no further expansion.
This procedure should be able to recognize a specific string or a
terminal class as needed, so (for example) you can call it for
keywords (specific strings) or for a class of terminals such as "get
me the next identifier" or "get me the next number."  Overloading can
be used to implement this polymorphism in C++; plain C has no
overloading, so there an extra argument can select the behavior, as in
the sketch below.
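
In plain C, one way to get that effect is a single checkTerminal that
takes both a token class and an optional specific spelling (a sketch;
the file-scope src pointer and all the names here are my assumptions):


          #include <string.h>

          extern const char *src;  /* the raw source text the tokens point into */

          /* Match a specific spelling (word != NULL) or any token of
             the given class (word == NULL).  Advances *i only on
             success: the one NORMAL place the scan pointer moves. */
          bool checkTerminal(Token t[], int *i, int end,
                             TokenType kind, const char *word)
          {
                if (*i > end || t[*i].type != kind)
                      return false;
                if (word != NULL
                    && (strlen(word) != (size_t)t[*i].len
                        || memcmp(src + t[*i].pos, word, t[*i].len) != 0))
                      return false;
                (*i)++;
                return true;
          }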


If you have the single terminal recognizer properly coded, then when
the BNF production for the nonterminal is a simple series of symbols
in the form


          nonterminal := s1 s2 ... sn


code a series of if statements


          if (!s1(t, &i, end)) return false;
          if (!s2(t, &i, end)) return false;
          ...


As above, each si is either a nonterminal recognizer or the single
terminal recognizer.


Indirectly, each nonterminal recognizer and the terminal recognizer
can be thought of as advancing i, the scan pointer, to one beyond the
most recent terminal successfully scanned. This is somewhat of a
convention; if you start the i pointer at 0, the recognizers could
equally leave i at the index of the most recent terminal successfully
scanned. I prefer the "one beyond" convention because initializing to
one seems slightly more intuitive than starting at zero.


However, the only NORMAL incrementation of the scan pointer i should
be in the single terminal recognizer, after it is completely satisfied
that the terminal either matches a specific string or is of the
desired class. Most of the nonterminal recognizers will only advance
i indirectly, by calling the terminal recognizer either directly or
through a stack of subrecognizers.


The exception should be unusual, and this is where a nonterminal
recognizer decides the hell with this and needs to back up to a
previously saved i value; since i is always available to all the
nonterminal recognizers and also to the single terminal recognizer, it
can at any time be saved and restored.


But it is important, when you backtrack, to remember if the code you
have executed (between saving the token index and the point where you
must backtrack) has done ANYTHING besides advance the pointer…such as
update a symbol table, or generate machine code. These tasks need to
be undone. Here is where using a data base to hold the output of the
compiler can be very handy, for modern data bases support a
"transaction" capability.


This allows you to execute a "start transaction" method on the data
base and, when you are certain that the changes are ready, to execute
a "commit transaction", or, if you decide to backtrack, a "rollback."
Effective implementations of transactioning support it with an
unlimited stack inside the data base engine, which makes it an obvious
choice in a recursive descent parser, which is using the runtime stack
(of C or another runtime) to track its position in the grammar.


If you do not have a data base, then you need to think about what the
"output" routines of your compiler do. Your symbol table should be
able to checkpoint and roll back, as should your machine code
generator. In my usual scenario I am generating a stream of code for a
reverse Polish machine; at any time, checkpointing is merely noting
the size of the generated code, and rollback is truncating the Polish
array back to the saved size.
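
A sketch of that scheme for an in-memory code array (the Code type
and the names are mine):


          /* Growable array of emitted reverse-Polish operations. */
          typedef struct {
                int *ops;   /* the emitted operations */
                int size;   /* how many have been emitted */
                int cap;    /* allocated capacity */
          } Code;

          /* Checkpoint: just note the current size. */
          int codeMark(const Code *c) { return c->size; }

          /* Rollback: truncate back to the saved size; the storage
             itself can stay allocated for re-use. */
          void codeRollback(Code *c, int mark) { c->size = mark; }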


As an example of the basics, take productions requiring some backup
like


nonterminal = s1 s2 | s1 s3


such as the rather contrived syntax


printStatement = PRINT identifier | PRINT * identifier


In parsing these productions, you must check for the specific keyword
PRINT. A successful check will leave i at 2 (relative to the entry
value of i), which in the contrived example must be an identifier or
the specific string asterisk. When you move on to the symbol after
PRINT (always assuming that your tokenizer has gotten rid of white
space) and you do not find an identifier, you must check for an
asterisk followed by an identifier. If you do not find the asterisk
followed by the identifier, you must (to satisfy the expectations of
your caller) BACK UP to the position of PRINT, since whatever you
have, you do not have a valid PRINT statement.


Here is the extempore C:


          isave = i;
          if (checkTerminal(t, &i, end, TK_KEYWORD, "PRINT")) {
                if (checkTerminal(t, &i, end, TK_IDENT, NULL)
                    || (checkTerminal(t, &i, end, TK_PUNCT, "*")
                        && checkTerminal(t, &i, end, TK_IDENT, NULL)))
                      return true;
                i = isave;    /* back up to the position of PRINT */
                return false;
          }
          return false;


I use a SINGLE logical expression with || and && mainly to irritate
people who don't like logical expressions. This is because people who
do not like logical expressions DO NOT BELONG in the programming
profession.


C correctly short-circuits these expressions; if the first clause finds
an identifier, the remaining two clauses are not evaluated, etc. The
clauses are evaluated from left to right in any ANSI compiler. To me
it is better to express the result rather than code a bunch of
instructions which describe how to get the result, and which can be
more readily broken up and destroyed by ham-fisted software
maintainers. A single logical expression is to me more coherent.
Scratch that: it IS more coherent.


If you always know your context and have, as assumed, random access to
all source code, special cases of various kinds can be handled by
backtracking.


However, there are two things to WATCH OUT for: a correct BNF that
causes the WRONG CODE to be generated, and compiling inside
parentheses.


Backus Naur Form is not a programming language, although it seems like
one. All BNF does is give you a series of rules that define what's
valid; it does not show you how to construct a parser or how to
generate machine code.


WATCH OUT for BNFs that present a correct language but cause the WRONG
CODE to be generated, depending on the machine language. For example,
for a symmetric operation like || (logical OR) the following
production


          orExpression = orFactor || orExpression


is correct. And, if you are generating code for a reverse Polish
machine, and in a number of other scenarios, when you hit the || op
followed by a recursive orExpression, it may seem to be Just Fine to
emit the Polish machine op for || AFTER you have recursively executed
orExpression (which will also emit stuff to the reverse-Polish code).


But take the obvious production for an asymmetric math operation like
subtraction or division, such that a-b is not equal to b-a, or a/b is
not equal to b/a:


          subExpression = subFactor - subExpression


Converting this directly into code will seem to work. For a-b it will
generate (assuming a reverse Polish machine, or anything like a parse
tree structured along reverse Polish lines) "push a, push b, apply
subtraction."


But for a-b-c such a simple-minded strategy will FAIL, because it will
emit "push a, push b, push c, subtract, subtract", which computes
a-(b-c) instead of (a-b)-c. The odious thing about this is that for
some operands this will actually work, but suppose a is 1, b is 2 and
c is 3. The simple-minded strategy will generate code that returns 2,
whereas the RIGHT ANSWER is -4.


[Perhaps this is why guys like Nathan Myhrvold of Microsoft hate
little languages, at least according to former Microsoft programmer
Marlin Eller (in his amusing book Barbarians Led by Bill Gates). To
be brutal, 99% of programmers haven't seen how nasty it can get, and
produce awful "little" or "fourth generation" languages which they
then inflict on other programmers, and which are powerful tools for
generating wrong answers. There's no substitute for experience. Many
posters to comp.compilers have written ten C compilers and can avoid
these problems in their sleep but as a programmer who has spent most
of his time writing other things besides compilers, I have to stress
that you have to practice the art in order to meet and overcome these
nasty problems.]


The solution, described in the dragon book, is to change the BNF:


        subExpression = subFactor (subRHS)*
        subRHS = - subFactor


This involves iterating the check for subRHS until it no longer
matches, emitting the subtraction op after each right-hand subFactor
so that the operations are applied left to right, as shown in the
sketch below.
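
In C the iteration might look like this (a sketch; subFactor, emit
and OP_SUB are hypothetical, and checkTerminal is the terminal
recognizer sketched earlier):


          bool subExpression(Token t[], int *i, int end)
          {
                if (!subFactor(t, i, end)) return false;
                /* (subRHS)*: absorb each "- subFactor" and emit the
                   subtraction right away, so that a-b-c produces
                   "push a, push b, subtract, push c, subtract". */
                while (checkTerminal(t, i, end, TK_PUNCT, "-")) {
                      /* a fuller version would back up i here on failure */
                      if (!subFactor(t, i, end)) return false;
                      emit(OP_SUB);
                }
                return true;
          }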


Which brings me to my final point. While in infix languages the
sequencing and the nesting of the BNF shows precedence, most infix
languages have a nonterminal "term" where a term is normally something
atomic such as a number, an identifier, or…a parenthesized expression
containing anything. For example, in a+(b-c) the two main terms are
the identifier a, and (b-c).


If you have coded a recursively callable handler for "any expression"
with a token array parameter, a reference to the index, the value of
the end of the expression, etc., then inside the term handler, when
you see a left parenthesis, you can call all the way back to the top:
call the expression parser to handle the stuff inside the parentheses.


The only issue is limiting the scope, and this is done by a scan-ahead
to find the BALANCING right parenthesis. This algorithm is simple,
involving a level counter initialized to one, increasing it for each
left parenthesis and lowering it for each right parenthesis until the
level goes to zero. It does assume that you have some sort of random
access to all the source code, since the parenthesized expression can
be large. If source texts can be unlimited in size, consider using a
data base to store the tokenized array.
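
A sketch of that scan-ahead, against the token array sketched earlier
(the helper name is mine):


          /* t[start] is the "(" itself; return the index of the
             balancing ")", or -1 if the parentheses are unbalanced. */
          int findBalance(Token t[], int start, int end)
          {
                int level = 1;
                int j;
                for (j = start + 1; j <= end; j++) {
                      if (t[j].type == TK_PUNCT && src[t[j].pos] == '(')
                            level++;
                      if (t[j].type == TK_PUNCT && src[t[j].pos] == ')')
                            level--;
                      if (level == 0)
                            return j;
                }
                return -1;
          }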


Once you find the balancing right parenthesis, pass its position as
the end to a deep recursive call of the expression parser.


I find this overall approach more effective for compiling "little
languages" than yacc. It can be automated if you get bored cranking
out large collections of nonterminal recognizers, each with the same
parameters and in general the same logic (although I would mention
Nilges' law: if you can mess it up, it is fun). I have automated the
generation of recursive descent in two cases: to parse the Rexx
language in Rexx, and to parse all popular dialects of SQL. But as my
technically adept and aware boss at Bell-Northern Research pointed out
to me some time ago, if you have a modern editor, the economics of
code generation change. You can use cutting and pasting more
effectively than running some mainframe batch-oriented tool, which
yacc unfortunately is these days.


In general, recursive descent seems more intuitive than yacc's
bottom-up method, for it sets a goal (compile me a program) and breaks
it down into subgoals. An early article on recursive descent, printed
in Saul Rosen's 1964 collection on Programming Systems and Languages
(now out of print), described a nonterminal in sexist terms as a
father with many sons (the symbols on the right hand side) who would
tell a son, go attain your goal, and then kill him if the goal was not
attained. This was all rather anti-Oedipal, since of course the
Oedipal story has it the other way around, and the primal horde of
sons kill the old man.


One flaw of recursive descent is that it is in general harder to
provide effective contextual error messages and recovery. If an
expected production is not found, the tendency in a recursive descent
parser is to bubble back up to the top of recursion, returning False
at each level, and the default behavior of the rd parser is to say,
baldly, "I cannot parse your stinking code."


However, you can add recovery at any level because the scan pointer
can be advanced on failure, for example, to the next new line.
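
For example, a minimal recovery helper might look like this (assuming
the scanner keeps end-of-line tokens, as the TK_NEWLINE class in the
earlier sketch does):


          /* On failure, advance past the next end-of-line token so
             parsing can resume at the following statement. */
          void skipToNewline(Token t[], int *i, int end)
          {
                while (*i <= end && t[*i].type != TK_NEWLINE)
                      (*i)++;
                if (*i <= end)
                      (*i)++;   /* step past the newline itself */
          }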


A related issue is progress reporting, which is easy in rd if you
have, as you should, one procedure for recognizing all the terminals.
This procedure has the source code and the pointer to the current
entry, as well as the goal symbol, and therefore it can produce texts
of the form "compiling at symbol s of t: p% complete: looking for x",
where x is either a specific keyword or string, or the name of a class
of tokens. The danger is a lot of confusing messages in a large
grammar and/or program; another method is useful if the terminal
recognizer has access to a GUI, for there it can highlight the token
it is looking at. This provides the user with something to watch, and
the developer an idea of the efficiency, as the highlight jumps around
the screen.


A real killer-diller is to call a highlighting progress routine at the
end of each successful nonterminal recognition, highlighting both the
beginning and the end (by saving the scan pointer at the beginning of
the nonterminal recognizer); this tends to highlight bigger and bigger
contexts in a way that is meaningful to the programmer who is
sensitive to syntax, finally highlighting the entire program.


I hope this helps. Best of luck.

