Re: The speed of an Earley parser?

Benjamin S.Scarlet <news0@greynode.net>
6 Aug 2001 04:03:16 -0400

          From comp.compilers

Related articles
The speed of an Earley parser? newspub@wuggy.co.uk (2001-06-17)
Re: The speed of an Earley parser? sting@linguist.Dartmouth.EDU (Michael J. Fromberger) (2001-06-21)
Re: The speed of an Earley parser? joachim_d@gmx.de (Joachim Durchholz) (2001-06-21)
Re: The speed of an Earley parser? newspub@wuggy.co.uk (2001-06-28)
Re: The speed of an Earley parser? idbaxter@semdesigns.com (Ira D. Baxter) (2001-07-02)
Re: The speed of an Earley parser? news0@greynode.net (Benjamin S.Scarlet) (2001-08-06)
Re: The speed of an Earley parser? Mark.van.den.Brand@cwi.nl (M.G.J. van den Brand) (2001-08-08)
Re: The speed of an Earley parser? holzmueller@ics-ag.de (2001-08-15)
| List of all articles for this month |

From: Benjamin S.Scarlet <news0@greynode.net>
Newsgroups: comp.compilers
Date: 6 Aug 2001 04:03:16 -0400
Organization: Posted via Supernews, http://www.supernews.com
References: 01-06-041
Keywords: parse, practice, comment
Posted-Date: 06 Aug 2001 04:03:16 EDT

<posted & mailed>


I've actually implemented something like what you described here. I
did it just using LALR(1) with some additional backtracking to
minimize the impact of shift/reduce conflicts. I'm very interested to
hear about your project -- it sounds like my project is quite a bit
like yours.


I blurred the distinction between terminals and non-terminals. In my
grammar for "foo" I would have productions like


sublang --> "lang" langlookup langblock
langlookup --> string_literal
langblock -> "(" placeholder ")"


with no further productions for placeholder in the "foo" grammar. Then
I'd use langblock as a start symbol for the "bar" grammar, with all
the interesting part of "bar" decending from the placeholder
symbol. The "foo" grammar then knows just enough about "bar" (that it
will begin with a "(") for it to correctly calculate the lookaheads
for the langlookup production. When that production is reduced, it
uses the value of the string_literal token (in your case "bar") to dig
up a different parser from somewhere. It then runs that parser, which
parses, leaves a "langblock" symbol on the input stream, and
returns. The the "foo" parser picks back up and happily shifts the
"langblock" symbol, reduces to "sublang" and marches merrily onward.


It all works quite well in practice. Have you made progress with your
approach?


                Benjamin S. Scarlet
                news0@greynode.net


Ian Woods wrote:


> I'm working on a semi-learning, semi-serious project which requires
> some CF parsing. There are three major considerations on the selection
> of the parsing technique to be used:
>
> o It mustn't substantially restrict the layout of the grammar. It's
> necessary for the grammar to be easily read and understood by humans,
> and forcing a lot of restrictions makes their life harder.
>
> o The parsing operation must be completed in a reasonable period -
> we're not competing for speed, just the user shouldn't have died and
> decomposed before the parsing is completed. Well, preferably a bit
> better than that :)
>
> o This one takes a little explanation. Imagine a language construct a
> little like this:
>
> // language foo
>
> lang "bar"
> (
>
> (* this is language bar! *)
>
> )
>
> // more lang foo
>
> You can define a block but instead of it being some fixed language,
> the language is specified as part of the 'host' languages construct.
>
> (note, the open and close paren after 'lang "bar"' are part of bar, not
> foo... this gets rid of possible ambiguity between the grammars of bar and
> foo.
>
> This kind of thing, I imagine, either requires the construction of a
> meta grammar which contains all of the grammars for all of the
> languages with sufficient syntax for the above (which is probably a
> very bad idea since it will be huge)... or I need to be able to
> 'switch' from one grammar to another and splice their parse trees
> together (which seems a much better idea to me). This switching
> operation requires that the grammar is effectively a data structure in
> memory rather than encoded into a program (like most of the common
> tools seem to produce - programs which can parse one specific
> grammar).
>
> The question is this:
>
> What parsing techniques have the properties I require?
>
> To me, Earley's parser seems to me a good bet: it has 2 of the
> requirements. It's only the speed which I'm concerned about. Although I'm
> likely to be alive by the time it finishes parsing a few thousand tokens
> against a realistic programming language grammar, is the time taken going
> to be to long (in the order of hours)?
>
> Should I simply restrict the grammars to those able to be parsed by
> LALR(k) and just follow the crowd? Or is there some other technique
> which I haven't ran into yet which may be better than either?
>
> Ian Woods.
> [Go ahead and use an Earley parser. For practical languages, it'll almost
> certainly be fast enough. -John]


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.