Pre-Parsers firstname.lastname@example.org (Jim Granville) (2000-09-08)
Re: Pre-Parsers email@example.com (Randall Hyde) (2000-09-09)
Re: Pre-Parsers firstname.lastname@example.org (2000-09-13)
Re: Pre-Parsers email@example.com (2000-09-15)
Re: Pre-Parsers firstname.lastname@example.org (2000-09-21)
Re: Pre-Parsers email@example.com (Hans-Bernhard Broeker) (2000-10-08)
Re: Pre-Parsers firstname.lastname@example.org.OZ.AU (2000-10-10)
Re: Pre-Parsers email@example.com (2000-10-12)
Re: Pre-Parsers firstname.lastname@example.org (2000-10-12)
Date: 13 Sep 2000 20:20:52 -0400
Organization: AOL Bertelsmann Online GmbH & Co. KG http://www.germany.aol.com
In article 00-09-065, Jim Granville wrote:
> I am looking into pre-parsers, esp those that also include
>MACRO capability, with the usual define/ifdef/endif.
Ten years ago I implemented such a program in Basic, to extract declarations
from C header files. AFAIR the implementation was as follows:
First the input is tokenized. Several input streams exist: one for
every #include'd source file, and one for every #define'd macro. A
stack of input streams is used to allow for #include and macro
expansion. Tokenization is required only for source files; the macro
definition streams already contain tokens. Every stream includes a
special symbol table, containing either the current arguments of a
macro invocation, or predefined symbols from the invocation of the
preprocessor itself.
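The stream-stack idea described above might be sketched roughly as
follows (all names are hypothetical, and the tokenizer itself is
omitted): each stream carries its own symbol table, and exhausting a
nested stream resumes the outer one.

```python
# Hypothetical sketch of a stack of token streams, as described above.
class TokenStream:
    def __init__(self, tokens, symbols=None):
        self.tokens = iter(tokens)
        self.symbols = symbols or {}   # per-stream table: macro args or predefined symbols

class Preprocessor:
    def __init__(self):
        self.stack = []                # innermost stream on top

    def push(self, stream):
        self.stack.append(stream)      # entered via #include or a macro invocation

    def next_token(self):
        while self.stack:
            tok = next(self.stack[-1].tokens, None)
            if tok is not None:
                return tok
            self.stack.pop()           # stream exhausted: resume the outer one
        return None                    # end of all input

pp = Preprocessor()
pp.push(TokenStream(["int", "x", ";"]))
pp.push(TokenStream(["a", "b"]))       # nested stream is read first
```

Reading tokens from `pp` now yields the nested stream's tokens first,
then falls back to the outer stream, which is exactly the behavior
needed for #include and macro expansion.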
When an escape character (#) is found, the following input is
redirected to the preprocessor and interpreted immediately, until the
whole preprocessor statement is processed. In this state an
interpretation of constant expressions is required, in order to
determine the values of conditional expressions. Global flags
(semaphores) are needed, e.g. to suppress the delivery of tokens
inside the FALSE branches of conditional directives.
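The suppression of tokens in FALSE branches can be sketched with a
stack of booleans, one per nesting level (a simplified, hypothetical
version handling only #ifdef/#else/#endif, working line by line rather
than token by token):

```python
# Sketch: deliver only the lines in active branches of #ifdef/#else/#endif.
def filter_conditionals(lines, defines):
    active = [True]                        # stack: one entry per nesting level
    for line in lines:
        s = line.strip()
        if s.startswith("#ifdef"):
            name = s.split()[1]
            active.append(active[-1] and name in defines)
        elif s.startswith("#else"):
            # flip the current branch, but stay FALSE if an outer level is FALSE
            active[-1] = (not active[-1]) and active[-2]
        elif s.startswith("#endif"):
            active.pop()
        elif all(active):
            yield line                     # token delivery only in TRUE branches

src = ["#ifdef DEBUG", "log();", "#else", "run();", "#endif"]
print(list(filter_conditionals(src, set())))   # ['run();']
```

The stack is what makes nested conditionals work: an inner #ifdef
inside a FALSE branch stays FALSE regardless of its own condition.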
When an input token matches a predefined symbol (a macro argument
etc.), it is replaced by the value of that symbol. Within a macro
definition the token is converted into a reference to the
corresponding macro argument, and is replaced by the actual macro
parameter when the macro is invoked later.
When an input token matches a macro symbol, the macro arguments must
first be parsed into the macro argument list; then input is switched
to the macro definition stream.
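Under these assumptions, macro invocation might look roughly like the
following sketch (hypothetical names; argument parsing ignores nested
parentheses, and the body is substituted in place rather than pushed
as a separate stream):

```python
# Sketch: expand a function-like macro by parsing its argument list and
# substituting parameters via a per-invocation symbol table.
macros = {"MAX": (["a", "b"], ["(", "a", ">", "b", "?", "a", ":", "b", ")"])}

def expand(tokens):
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in macros and i + 1 < len(tokens) and tokens[i + 1] == "(":
            params, body = macros[tok]
            # parse the comma-separated argument list (no nesting handled here)
            j = i + 2
            args, cur = [], []
            while tokens[j] != ")":
                if tokens[j] == ",":
                    args.append(cur); cur = []
                else:
                    cur.append(tokens[j])
                j += 1
            args.append(cur)
            table = dict(zip(params, args))    # per-invocation symbol table
            # substitute parameters; a real implementation rescans for nested macros
            for t in body:
                out.extend(table.get(t, [t]))
            i = j + 1
        else:
            out.append(tok)
            i += 1
    return out

print(expand(["x", "=", "MAX", "(", "1", ",", "2", ")", ";"]))
# ['x', '=', '(', '1', '>', '2', '?', '1', ':', '2', ')', ';']
```

A stream-based implementation, as described in the post, would instead
push the macro body as a new input stream whose symbol table holds the
parsed arguments.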
Finally, the tokens delivered by the current input stream are written
either to the output or, after a #define token has been encountered,
to a macro definition table.
I don't remember whether or how concatenation (the ## operator in C)
was implemented; here an evaluation of the left and right sides may be
necessary before the resulting strings are merged and tokenized again.
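If ## were handled, a minimal sketch of the merge step might look like
the following (spelling concatenation only; the rescanning of the
pasted token that a real preprocessor performs is omitted):

```python
# Sketch: merge the tokens on either side of ## into one new token.
def paste(tokens):
    out = list(tokens)
    while "##" in out:
        i = out.index("##")
        merged = out[i - 1] + out[i + 1]   # concatenate the two spellings
        out[i - 1:i + 2] = [merged]        # replace "left ## right" by one token
    return out

print(paste(["var", "##", "1", "=", "0"]))   # ['var1', '=', '0']
```

In C, parameter substitution happens before pasting, and the merged
spelling must itself form a valid token, which is why re-tokenization
of the result is needed.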
The whole thing is a state machine, with several stacks for the
states, symbol tables (scopes) for macro definitions and other
symbols, and the input and output streams. Every I/O stream can be
another state machine, which behaves differently according to both the
private and global state. The states reflect the handling of the input
tokens, which can be passed, skipped or evaluated. These states may
be different for several token classes, like ordinary tokens,
preprocessor directives, symbols, constants etc.
Some features can require more processing, like the evaluation of
sizeof(x) in C. In this case all type and variable declarations must
also be stored by the parser, so that the size of every declared
symbol can be evaluated in conditional expressions. At the same time
nested scopes must be implemented, so that the parser can find the
appropriate definition of a symbol within the current nesting of
subroutine declarations etc. Such features are closely related to a
specific compiler, and that's why you'll never find a "general"
preprocessor for the current C standard. Even the stand-alone
preprocessors, shipped with some C compilers, may be usable for some
general preprocessing, but may fail to produce correct output for a
different C compiler.
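As an illustration of why this ties the preprocessor to one specific
compiler, a sketch (all sizes and names assumed here) of evaluating
sizeof in a conditional might keep a declaration table alongside
target-specific type sizes:

```python
# Sketch: evaluating sizeof() in a conditional expression requires
# compiler/target-specific type sizes plus the declarations seen so far.
TYPE_SIZES = {"char": 1, "short": 2, "int": 4, "long": 4}  # assumed, target-specific
declarations = {}                       # symbol name -> type name

def declare(name, type_name):
    declarations[name] = type_name

def size_of(symbol_or_type):
    # resolve a declared symbol to its type, then look up the target size
    t = declarations.get(symbol_or_type, symbol_or_type)
    return TYPE_SIZES[t]

declare("x", "short")
print(size_of("x"))      # 2
print(size_of("long"))   # 4
```

Another compiler with different type sizes (or different scoping of
the declarations) would evaluate the same conditional differently,
which is the portability problem the post describes.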