Compiler Compilers: Design and Implementation


From: Alexander Morou <Alex@alexandermorou.com>
Newsgroups: comp.compilers
Date: Sun, 8 Feb 2009 21:03:12 -0600
Organization: Compilers Central
Keywords: question
Posted-Date: 10 Feb 2009 09:36:43 EST

I'm researching language design in general, as well as another area called
'Generative Programming'. To that end, I have a few questions for other
programmers who are interested in parser/compiler design:


1. Are context-aware lexical analyzers favorable?


2. I've written my share of lexers/parsers/interpreters, and I've looked at
code from code generators. A lot of generators emit large tables of integers
that encode, I'm guessing, state-to-state transitions keyed by the current
state and the current character. Is this table-driven approach really
preferred, or would a series of hand-optimized state machines be better for
speed: a switch over the current state with a sub-switch over the character,
yielding true if another character can extend the token, and false if the
character is invalid or the machine is at a proper edge that terminates the
token? (A sketch of what I mean follows.)
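
Here's a minimal sketch of the switch-style machine I mean, in C#. The
token here is just an identifier, and the names and exact shape are
hypothetical, not taken from any particular generator:

    // Hand-coded DFA for identifiers: a switch over the state with a
    // test over the character class, advancing while the token can grow.
    static bool TryScanIdentifier(string input, int start, out int length)
    {
        int state = 0;   // 0 = expecting first char, 1 = inside identifier
        int pos = start;
        while (pos < input.Length)
        {
            char c = input[pos];
            switch (state)
            {
                case 0:
                    if (char.IsLetter(c) || c == '_') { state = 1; pos++; }
                    else { length = 0; return false; } // bad first character
                    break;
                case 1:
                    if (char.IsLetterOrDigit(c) || c == '_') { pos++; }
                    else { length = pos - start; return true; } // proper edge
                    break;
            }
        }
        length = pos - start;
        return state == 1;  // accept only if at least one char was consumed
    }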


3. What kind of speed should I be focusing on once I've written the parser?
I ask because I frequently write programs that write code, and I've noticed
that they tend to emit very large files. What's a good test to throw at a
parser to ensure that it's not only correct but also capable of handling
large files (somewhere in the range of 10+ MB)? A stress-test sketch follows.
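
As a concrete starting point, here's the kind of stress test I have in
mind: synthesize a 10+ MB source file from a small valid fragment and time
the parse. Parse below is only a placeholder for the real parser entry
point:

    using System;
    using System.Diagnostics;
    using System.Text;

    class StressTest
    {
        static void Parse(string source)
        {
            // Placeholder: plug the real parser in here.
        }

        static void Main()
        {
            var sb = new StringBuilder();
            int i = 0;
            // Repeat a small valid fragment (unique class names keep the
            // input plausible) until the text passes ~10 MB.
            while (sb.Length < 10 * 1024 * 1024)
                sb.AppendLine("class C" + (i++) +
                              " { void M(int x) { x = x + 1; } }");

            string source = sb.ToString();
            var sw = Stopwatch.StartNew();
            Parse(source);
            sw.Stop();
            Console.WriteLine("Parsed {0:N0} chars in {1} ms",
                              source.Length, sw.ElapsedMilliseconds);
        }
    }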


4. Beyond left recursion, what issues with LL(k) parsers should I be aware
of, and why should I ever prefer LR (or LALR) parsers? I personally dislike
LR parsers because of how they surface grammar ambiguities (think: the
dangling else). When writing for an LL(k) parser generator, you know the
result is recursive in nature, so the most deeply nested if statement gets
the 'else' when the curly braces are omitted (see the sketch below). Are
there any other LL(k) ambiguities I need to be aware of?
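
For the record, here's a sketch of why recursive descent resolves it that
way. Expect, Accept, and the node types are hypothetical helpers of my
own, not any real API:

    // The innermost active ParseIf call sees the `else` first, so the
    // `else` binds to the nearest `if` automatically.
    Statement ParseIf()
    {
        Expect(TokenKind.If);
        Expect(TokenKind.OpenParen);
        Expression condition = ParseExpression();
        Expect(TokenKind.CloseParen);
        Statement thenPart = ParseStatement();  // may be an inner `if`
        Statement elsePart = null;
        if (Accept(TokenKind.Else))             // consumed by deepest call
            elsePart = ParseStatement();
        return new IfStatement(condition, thenPart, elsePart);
    }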


5. I want to write an LL(k) parser generator program, and to that end I'm
hand-writing a parser for C# to get used to writing like a generator would
(building the state machines myself and handling context awareness; this
stuff is painful by hand). I'm thinking about adding context awareness to
the lexical analyzer, because in certain cases it will be necessary. A
simple example is C#'s generics: a nested generic declaration like A<B<C>>
would be annoying to parse without context awareness, since '>>' is the
right-shift operator, unless the parser itself dipped into the character
stream while parsing a generic type. (Unless I'm missing something very
simple; a sketch of one approach follows.)
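
One approach I'm considering, sketched below: leave the lexer context-free
and let the parser split '>>' when it is closing a type-argument list. The
token types and the PushBack helper here are hypothetical:

    // Called wherever the grammar expects the `>` that closes a
    // type-argument list, e.g. while parsing A<B<C>>.
    void ExpectCloseAngle()
    {
        Token t = Peek();
        if (t.Kind == TokenKind.GreaterThan)
        {
            Consume();
        }
        else if (t.Kind == TokenKind.RightShift)
        {
            // Split `>>` into two `>` tokens: consume it, then push a
            // synthetic `>` back so the enclosing generic can close too.
            Consume();
            PushBack(new Token(TokenKind.GreaterThan, t.Position + 1));
        }
        else
        {
            throw new InvalidOperationException("expected '>'");
        }
    }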


6. I'm planning on writing the C# parser by hand, which means I'll be
writing the parser's state system. The tokens are few in actual count, but
each is individually complex: there's not a separate token for each keyword,
just a single keyword token (423 states in total, 97 keywords). Context
awareness means that each parser state will need to declare not only the
type of token expected but also its sub-structure (i.e., which keywords are
valid there). Will there be speed issues attributed to this, or is passing
~20 bytes of information on each lexical step fine? The parse method will be
simple: all state machines process in tandem, one character at a time until
none can go further, with the set of machines narrowed up front by the first
character of the token. Based on the lengths reached, the longest token is
returned to the parser (see the sketch below). To avoid the worst-case
scenario (re-evaluation after following the wrong path), do multiple tokens
need to be returned when more than one machine succeeds, or does that depend
on the lookahead required by the grammar (since the identification of a
token could change based upon the next token)?
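
Here's a sketch of the tandem, longest-match scan I'm describing.
ITokenMachine, Token, and TokenKind are hypothetical types of my own, not
any library's:

    using System.Collections.Generic;

    interface ITokenMachine
    {
        void Reset();
        bool Advance(char c);        // false once no further char fits
        bool InAcceptState { get; }
        TokenKind Kind { get; }
    }

    static Token ScanLongest(string input, int start,
                             IEnumerable<ITokenMachine> machines)
    {
        var live = new List<ITokenMachine>(machines);
        foreach (var m in live) m.Reset();
        int pos = start, bestLength = 0;
        TokenKind bestKind = TokenKind.Error;
        while (pos < input.Length && live.Count > 0)
        {
            char c = input[pos++];
            live.RemoveAll(m => !m.Advance(c));   // drop dead machines
            foreach (var m in live)
                if (m.InAcceptState && pos - start > bestLength)
                {
                    bestLength = pos - start;     // longest accept so far
                    bestKind = m.Kind;
                }
        }
        return new Token(bestKind, start, bestLength);
    }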


7. If I write a parser generator, should it offer options for generating
multiple kinds of parsers? Different parsers have different uses: some focus
on simple validation and light structure (e.g., Visual Studio integration,
with varying degrees of awareness depending on the desired functionality);
others require full context awareness and must build a complete CST for
transformation into an AST for compilation or interpretation; and still
others might do only weak syntax validation, enough for code colorization
(such as emitting HTML for the code). A sketch of how such an option might
look follows.
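
To make the idea concrete, the generator could take (or key its emitted
parsers by) a fidelity option along these lines; this is purely a
hypothetical sketch:

    // Progressively cheaper parser outputs for progressively lighter uses.
    enum ParseFidelity
    {
        Colorization,   // token stream only, e.g. for emitting HTML
        Validation,     // syntax check plus light structure (IDE support)
        FullTree        // complete CST, ready for CST -> AST transformation
    }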


8. Similar to 7, but focused on the parse stages. Some languages have more
than one parse phase, with different rules on how to handle them. If I build
this program, what would be a good way to handle the different phases (from
a language-description standpoint; I'll be writing a DSL specifically for
this project)?


If I sound very naive, I apologize. I'm researching language theory on my
own and have an interest in programs that build programs. I'm sure a lot of
people here have used compiler compilers, but I'm not sure how many have
written them, so there are likely things I'm simply oblivious to at this
point. Regular-expression state machines are probably a walk in the park
compared to the phase that comes after.


