Re: The case of directly-executable parsers

Karsten Nyblad <148f3wg02@sneakemail.com>
31 Jan 2006 21:20:08 -0500

From comp.compilers

Related articles
The case of directly-executable parsers momchil.velikov@gmail.com (2006-01-28)
Re: The case of directly-executable parsers derek@_NOSPAM_knosof.co.uk (Derek M. Jones) (2006-01-28)
Re: The case of directly-executable parsers momchil.velikov@gmail.com (momchil.velikov@gmail.com) (2006-01-30)
Re: The case of directly-executable parsers clint@0lsen.net (Clint Olsen) (2006-01-31)
Re: The case of directly-executable parsers rmathew@gmail.com (Ranjit Mathew) (2006-01-31)
*Re: The case of directly-executable parsers 148f3wg02@sneakemail.com (Karsten Nyblad)* (2006-01-31)**
Re: The case of directly-executable parsers momchil.velikov@gmail.com (momchil.velikov@gmail.com) (2006-02-02)

| List of all articles for this month |

From:	Karsten Nyblad <148f3wg02@sneakemail.com>
Newsgroups:	comp.compilers
Date:	31 Jan 2006 21:20:08 -0500
Organization:	Compilers Central
References:	06-01-104
Keywords:	parse, performance
Posted-Date:	31 Jan 2006 21:20:08 EST

momchil.velikov@gmail.com wrote:
> Directly-executable LR parsers are well-known and a number of
> researchers have published amazing speedups over table-driver LR
> parsers, in the 2x to 6x or more range, cf.:
>
> Achyutram Bhamidipaty and Todd A. Proebsting
> "Very Fast YACC-Compatible Parsers (For Very Little Effort)
>
> R.N.Horspool and M. Whitney
> "Even faster LR parsing"
>
> Thomas J. Pennello
> "Very fast LR parsing"
>
> I have developed such a LALR(1) generator, which generates a
> directly-executable parser in ISO C. Admittedly, the parser performs
> only a few low-level optimizations - inverted table for non-terminal
> transitions and separating the most frequent transition as a default.
> However, the speedup I observe is nowhere near the previously reported
> ones. Compared to a parser, generated by Bison 2.1 for the ISO C 99
> grammar (taken straight from the standard, with one minor
> modification), the directly-executable parser is only 5-9% faster on
> P4 and about 12% faster on P3 Xeon, at the same time being about 3
> times bigger.
>
> I assume I'm not actually doing anything stupid with the generated
> parser. A sample of the generated code (preprocessed and indent(1)ed)
> is here: http://www.geocities.com/velcok/test-17.i.txt
>
> What could be the reason for this parser performing so poorly? Or for
> the Bison parser performing so well? Has anyone got some recent
> observations with directly-executable vs. table driven recognizers -
> scanning (re2c?), parsing, tree-matching?

There are several possibilities. You have not written, which compiler
you are using, so we can only guess. Your code is full of examples of
obvious inefficiencies, e.g.,

              do
                  {
                  }
              while (0);

Such construct will most likely be removed by the optimizer, but they
may confuse it such that the code does not get as good as it should be.
    It is a dangerous assumption to assume that a compiler generates good
code for constructs human programmers would never write. It might be
that your compiler can handle only so many loops or basic blocks before
it gives up optimizing.

You are manipulating the stack through a set of inlined procedures. You
are passing the stack as a pointer to a record. First, are you sure the
procedures are inlined? Secondly, the compiler might not be smart
enough to realize that the stack as a variable is not truly aliased.
There could be a severe performance hit because the compiler may think
the stack is aliased and therefore put the top of the stack in main
memory in stead of having it in registers. I would drop that library of
stack operations and implement the stack without the use of a record
structure and as local variables of the parse routine.

Then you are jumping around much more than is necessary, e.g.,

        goto symbol_387;

        // lots of code

symbol_387:
      switch (state)
          {
          default:
              goto push_317;
          }

First you are again assuming that the compiler will generate good code
from weird constructs that human programmers will never write. You are
doing that with the switch statement with only a default case label.
Secondly, the compiler will have to be very smart in order to generate
code such that you jump directly from the first goto to push_317. Why
not substitute "goto push_317;" for "goto symbol_387;" and delete the
code following "symbol_387:".

Along the same line: From each block of code starting with a label like
reduce_* you jump to some code starting with a label like symbol_*. Why
not move the code of symbol_* to after the first piece of code jumping
to it?

Then there is your variable state. There is no need to assign to it if
the non kernel of the state does not include items of productions very
the right hand side is not the empty string. You do not need to push
the state, because it will never be read. All you need to do is to
increase the stack top pointer with 1 and let the stack top have an
undefined value. It will never be read anyway.

Also, you are passing the get_token routine in a record. The optimizer
cannot find out that it is the same routine that is called each time.
It will have to generate code to dereference the ctx variable each time
the routine is called. In most cases there is one input source, one
output, and the parser is called once. I would call the lexer directly
and only make support for multiple concurrent calls to the parser optional.

You call xg__stack_ensure in lots of places, but it is possible to
optimize most of these away. Consider a DFA were you store all the read
tokens in a stack, i.e., the highs of the stack is equal to the number
of read symbols. If there is no loops in the DFA, then it is a simple
search to find the maximum number of symbols that can be read. Thus you
can always make sure the stack is big enough before running the DFA. If
there are loops, then you can do with one check for each time the DFA
run through all the states of the loop to ensure enough free space in
the stack, e.g., assume that the numbers are states and 1->2 is a
transition from state 1 to state 2. We do not care about which symbols
are read:

1->2
2->3
3->4
4->5
5->2
5->6

state 1 is the start state and state 6 is the accept state. You can do
with checking the among of free space on the stack at the start and in
one of the states 2, 3, 4, or 5. The same trick can be used in a LR(1)
automaton. In your case it will most likely save you most of the
procedure calls to xg__stack_ensure.

A few more comments: First what about error recovery? A parser
generator that generates parsers without error recovery has only so much
use, and the error recovery can have many implications on how the
directly executed parser is implemented. Secondly, you do not have any
semantic actions on your reductions. All modern computers have cache
between main storage and the processor. That cache will speed up code
as long as the code can be in the cache. That has two implications.
The first implication is that you current benchmarking really is not
that interesting. It should wait until you have your semantic actions
in place. The second implication is that if the table driven parser can
be in the cache, but you directly executed cannot, then it is no
surprise that the directly executed parser is not that fast.

Karsten Nyblad

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: The case of directly-executable parsers

Karsten Nyblad <148f3wg02@sneakemail.com>31 Jan 2006 21:20:08 -0500

Karsten Nyblad <148f3wg02@sneakemail.com>
31 Jan 2006 21:20:08 -0500