Re: what are parsing generators?

"Orlando Llanes" <ollanes@pobox.com>
1 Nov 2000 18:33:45 -0500

          From comp.compilers

Related articles
what are parsing generators? Drum.Sefex@btinternet.com (Paul Drummond) (2000-10-31)
Re: what are parsing generators? ollanes@pobox.com (Orlando Llanes) (2000-11-01)
Re: what are parsing generators? Drum.Sefex@btinternet.com (Paul Drummond) (2000-11-05)
| List of all articles for this month |

From: "Orlando Llanes" <ollanes@pobox.com>
Newsgroups: comp.compilers
Date: 1 Nov 2000 18:33:45 -0500
Organization: Prodigy http://www.prodigy.com
References: 00-10-240
Keywords: lex, parse
Posted-Date: 01 Nov 2000 18:33:45 EST

"Paul Drummond" <Drum.Sefex@btinternet.com> wrote in message
> 1) In order to learn parsing I am reading COMPILER books. The
> question is, how much of this do I need to know to write a doctool?
> The tool will insert comments above functions, classes and variables,
> and also generate skeliton HTML documentation like KDOC and Doxygen.


        Compiling is basically done in five stages, 1)
Scanning/Tokenizing/Lexing/etc, 2) Parsing, 3) Optimizing, 4) Compiling, 5)
Linking. You definitely need to know how to scan, you'll need to know a
little about how to parse, but you don't need to know how to optimize,
compile, or link to write a doctool. Most likely you'll need to expand
macros, process type declarations, and process variable declarations at a
bare minimum. The easy part is scanning for the obvious stuff such as the
keywords "class", "int", "void", "char", etc.
        Scanning, Tokenizing, and Lexing each mean the same thing (Lex is short
for Lexographical Scanning). What you do is skip junk characters (typically
referred to as whitespace), combine groups of characters together into a
string, and then return a "token code" or "token" based on that string. A
token is typically an enum or integer constant, it's used to keep memory
usage down a bit, and to simplify parsing. A token can represent anything
such as "literals" (aka constants - aka data that can be manipulated at
compile time), reserved words, identifiers, symbols (+, ++, -, /=, <<, %, &,
||, etc), as well as others I don't remember off the top of my head.
        Parsing mainly involves syntax checking, but it also produces
intermediate code. Syntax checking involves making sure types are
compatible, processing expressions, making sure that parenthesis/braces/etc
match up, making sure that variables are declared, expanding macros, etc.
Intermediate code can be assembly language (rare I think), "byte codes", a
parse tree, etc.


        Even though you don't need to know, the following might help you...
        Optimizing can be done during parsing for things such as expression
simplification, removing redundant code out of inner loops, etc. Optimizing
can also examine the intermediary code after parsing, and simplify or
re-arrange things to produce more optimal code.
        Compiling converts the intermediate code into machine code, such as
object files, libraries, etc.
        Linking involves taking the machine code, combining it with other
machine code if necessary, and convert it into an executeable platform and
OS specific executeable.




The following could be a way to accomplish your goal:


A) Open source code file for input


B) Open temporary file for output


C) Read a line of text from input


D) Scan and parse the line of text


Ea) If the line starts with #define it's a macro
Eb) If the line starts with "typedef" it's a user type declaration then
process it and make note of its type name


Fa) If the line starts with the keyword "class" then write comments to the
temporary file
Fb) If the line starts with a user defined type or "int", "char", "void",
"struct"(?), etc and if the line contains "(" it's a function then write
comments to the temporary file
Fc) If the line starts with a user defined type or "int", "char", "void",
"struct", etc then it's a variable, write comments to the temporary file


G) Write the line to the temporary file


H) Repeat C through G until the input file reaches EOF (End Of File)


I) Close and delete the input file


J) Close and rename the temporary file to the input file's name




        Don't forget that several variables can be seperated by a comma,
typedef's may take up more than one line, and typedef's can have multiple
types declared, and typedef's can either have the name after typedef or
before the semi-colon (don't count on braces {} because they're not always
required). Basically when you find a typedef, loop until there are no more
commas.




> 2) Should I learn lex and yacc and make use of them, or should I do it
> myself? I am gonna do it myself anyway (requirement of the
> project)!!, but I'm interested to know whether it is better to use
> lex/yacc or not.


        Lex generates a scanner while Yacc generates a compiler. Yacc can also
generate just a parser, or a translator. There tools called Flex and Bison
which do the same thing as Lex and Yacc but have a different license. I
don't remember the specifics, but I think that Flex and Bison are covered
under the GNU GPL license whereas Lex and Yacc do not allow you to make a
compiler for profit. Also, not all implementations of Lex and Yacc are free.
For what you're doing though, Lex/Yacc might be overkill.


        If this doesn't help, hopefully it's a start.


See ya!
Orlando Llanes


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.