Re: Parser classifications?

Alexander Morou <alexander.morou@gmail.com>
Mon, 17 Aug 2009 00:32:34 -0700 (PDT)

Related articles
Parser classifications? alexander.morou@gmail.com (Alexander Morou) (2009-08-11)
Re: Parser classifications? zaimoni@zaimoni.com (Kenneth 'Bessarion' Boyd) (2009-08-13)
Re: Parser classifications? alexander.morou@gmail.com (Alexander Morou) (2009-08-13)
Re: Parser classifications? zaimoni@zaimoni.com (Kenneth 'Bessarion' Boyd) (2009-08-16)
Re: Parser classifications? alexander.morou@gmail.com (Alexander Morou) (2009-08-17)

From: Alexander Morou <alexander.morou@gmail.com>
Newsgroups: comp.compilers
Date: Mon, 17 Aug 2009 00:32:34 -0700 (PDT)
Organization: Compilers Central
References: 09-08-028@comp.compilers
Keywords: parse
Posted-Date: 18 Aug 2009 14:19:02 EDT

On Aug 16, 11:30 pm, "Kenneth 'Bessarion' Boyd" <zaim...@zaimoni.com>
wrote:
> On Aug 13, 2:56 pm, Alexander Morou <alexander.mo...@gmail.com> wrote:
> > . . .
> > This is primarily due to the focus on no back tracking. It still
> > doesn't include cases where tokens might overlap. The parser that
> > drives the data aspect of this project is a very simple [badly]
> > hand-written parser. So I ignored the maximal munch principle, though
> > I think I will add a small bit of information to the parser to define
> > precedence (on equal length tokens) since grammars can span multiple
> > files.
>
> Ok; no comment there (not sure how no back-tracking and the maximal
> munch principle/optimization interact).


The reason for this is more a matter of preference. In a given
language you might have one token described as '<' and another
described as '<<'. Depending on the language, there are cases where
two individual '<' tokens encountered in a row are interpreted
differently than a single '<<'. Some would say it is better to scan
them as two separate '<' tokens and resolve the difference in the
semantic analysis phase. In my case, since it's such a simple system,
it is easier to have it parse both possibilities and sort things out
when one of the paths fails (or when it captures a genuine ambiguity,
which can then be resolved through disambiguation).
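
To make that concrete, here is a minimal sketch in Python (my own
illustration, not the actual implementation; the OPERATORS table and
names are invented) of a scanner that reports every operator match at
a position instead of only the longest one:

    OPERATORS = ['<', '<<', '<=']      # hypothetical operator set

    def candidates(text, pos):
        """Return every operator that matches at `pos`, shortest first."""
        found = [op for op in OPERATORS if text.startswith(op, pos)]
        return sorted(found, key=len)

    print(candidates("a << b", 2))     # ['<', '<<'] -- both paths are kept
    print(candidates("a < b", 2))      # ['<'] -- no ambiguity

Both candidates get handed to the parser, and whichever path dies (or
survives as a true ambiguity) settles the question.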


Granted, this introduces back-tracking, but I think it's a better
alternative than parsing the first fifty tokens three times because
of a shared sub-production in an alternation (the sub-production
being buried inside the individual rules of the alternation).


Granted, a large majority of this depends on how the grammar is
defined, but I wanted to structure the system so that it is easy to
define the grammar without worrying about how the parser sorts it
out. Normally, the best solution to such an ambiguity is to factor
out the common prefix and place it before the alternation (someone
correct me if I'm wrong here), as illustrated below.
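
As an illustration of that factoring (the rule names here are
invented), rewriting

    A ::= Prefix B
        | Prefix C

as

    A ::= Prefix (B | C)

means the Prefix sub-production is parsed once rather than once per
alternative.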


Maximal munch implies that you always go with the longest match,
which works for distinguishing between identifiers and keywords: if
the identifier is longer than the keyword, it picks the identifier.
Since I didn't use the maximal munch principle, I've decided to use a
mixture of methods. I'll provide a precedence system which requires
equal-length tokens to have the same precedence before they're both
considered part of the parse and thus create an ambiguity. Put
keywords above identifiers, and they take precedence unless the
identifier is longer. I also added a 'self-ambiguous' option for
tokens, such as operators, which allows them to ignore maximal munch
and give all possible parses for the given pass. On '<<' it would
return both '<' and '<<' if both are in scope (scope is still
important).
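
For what it's worth, a rough sketch of that tie-breaking rule in
Python (the token table and names are my own invention, not the
actual system): the longest match wins, and precedence only decides
between equal-length matches:

    import re

    # Hypothetical token table: (name, pattern, precedence); higher
    # precedence wins among matches of equal length.
    TOKENS = [
        ('Keyword',    re.compile(r'if|else|while'), 2),
        ('Identifier', re.compile(r'[A-Za-z_]\w*'),  1),
    ]

    def next_token(text, pos):
        best = None            # (length, precedence, name, lexeme)
        for name, pattern, prec in TOKENS:
            m = pattern.match(text, pos)
            if m:
                cand = (len(m.group()), prec, name, m.group())
                if best is None or cand[:2] > best[:2]:
                    best = cand
        return best and (best[2], best[3])

    print(next_token("if(x)", 0))    # ('Keyword', 'if') -- precedence breaks the tie
    print(next_token("ifx = 1", 0))  # ('Identifier', 'ifx') -- longer match wins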


> > Improvising? Yes, true; however, everything was improvised at one
> > point. If I had a more complete educational background, perhaps I
> > would have had better luck defining a perfect theory the first time.
>
> Unlikely, when one is doing any kind of new research.


I am not educated enough on parser history to know whether this is new
research.


>>> . . .
> * lex is a tool that generates a C library (and a driver main() ) for
> incrementally lexing a text file. Again, there's a state machine in
> the background. The constraint of returning tokens in sequence,
> rather than doing all of them at once, seems to be a historical
> efficiency measure. (I'm speculating, but based on what I've read
> about the 1970's systems this seems very plausible.)
>
> However, while state machines are relatively easy to write automatic
> generators for, and even to automatically optimize -- they are very
> hard to explicitly verify. So they're not closely related to how one
> would write a parser directly, as explicit verification (both proof,
> and extensive unit-testing) becomes very important.
>
> [Lex and yacc both generate state machines to match their input. Yacc
> also has a state stack that lets it match nested and recursive
> patterns, which the lex state machine can't do. A lex scanner calls
> code you write every time it matches a token; you can return it
> immediately to a calling routine (usually a yacc parser) or buffer it
> somewhere and keep going, lex doesn't care. Compilers frequently feed
> hints back from the parser to the scanner, that modify its behavior
> for subsequent tokenizing, which wouldn't work if the scanner blasted
> through the whole input first. -John]


John brings up a good point. The primary function of the look-ahead
portion of the machine is to provide the scanner with context
information about which kinds of tokens are in scope. This lets it
avoid scanning the text for lexical patterns that aren't pertinent in
the parser's current state, and it doubles as a way to detect when
something goes wrong (read: an unexpected token).
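
A small sketch of that general idea (state names and tables are
invented for illustration; this is not the project's code): the
parser's state names which token categories are in scope, and the
scanner only keeps matches from those categories, treating anything
else as an unexpected token:

    # Hypothetical map from parser state to the token kinds in scope there.
    IN_SCOPE = {
        'start_of_statement': {'Keyword', 'Identifier'},
        'after_expression':   {'Operator', 'RightParen'},
    }

    def filter_by_state(state, matches):
        """matches: list of (token_name, lexeme) found at the current position."""
        allowed = [m for m in matches if m[0] in IN_SCOPE[state]]
        if not allowed:
            raise SyntaxError("unexpected token: %r" % (matches,))
        return allowed

    print(filter_by_state('start_of_statement', [('Identifier', 'foo')]))
    # [('Identifier', 'foo')]
    # filter_by_state('after_expression', [('Identifier', 'foo')]) would raise
    # SyntaxError, since no identifier is in scope in that parser state.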


The difference here is that there's no on-match code injection, which
means there are certain limitations I face and certain liberties I
can take as a result.

