Re: Is that difficult to write a Shift-Reduce parser

Stephen Horne <sh006d3592@blueyonder.co.uk>
Fri, 07 May 2010 11:50:15 +0100

          From comp.compilers

Related articles
Is that difficult to write a Shift-Reduce parser kuangpma@gmail.com (kuangpma) (2010-04-29)
Re: Is that difficult to write a Shift-Reduce parser kym@sdf.lonestar.org (russell kym horsell) (2010-05-01)
Is that difficult to write a Shift-Reduce parser chakaram@auth.gr (Chariton Karamitas) (2010-05-02)
Re: Is that difficult to write a Shift-Reduce parser rpw3@rpw3.org (2010-05-01)
Re: Is that difficult to write a Shift-Reduce parser cfc@shell01.TheWorld.com (Chris F Clark) (2010-05-02)
Re: Is that difficult to write a Shift-Reduce parser dot@dotat.at (Tony Finch) (2010-05-04)
Re: Is that difficult to write a Shift-Reduce parser sh006d3592@blueyonder.co.uk (Stephen Horne) (2010-05-07)
Re: Is that difficult to write a Shift-Reduce parser rpw3@rpw3.org (2010-05-09)
| List of all articles for this month |

From: Stephen Horne <sh006d3592@blueyonder.co.uk>
Newsgroups: comp.compilers
Date: Fri, 07 May 2010 11:50:15 +0100
Organization: virginmedia.com
References: 10-04-072
Keywords: parse, LR(1)
Posted-Date: 09 May 2010 12:22:07 EDT

On Thu, 29 Apr 2010 19:48:37 -0700 (PDT), kuangpma
<kuangpma@gmail.com> wrote:


>I am learning compiler theory, several books teach how to write (even
>with source code) a Recursive Decent parser but not Shift Reduce
>parser, only tell the theroy behind it.


I'm a non-expert, but I have written two LR(1) parser generators, sort
of. Both were written as parts of essentially the same learning
exercise.


The first attempt was written in Python, never worked at all, and I
was too much of a mess to be worth trying to debug. As it was written
in odd hours of spare time over probably a month or so, there was also
the memory issue - I was spending more time working out what I had
written already than adding to it.


The second attempt was written in C++, and "probably works" as far as
it goes. That is, once I finished the interesting code and got it to
generate state models that looked about right, I stopped developing
it. Some of the underlying digraph code got refactored out and
massively expanded, such that it now supports a wide range of finite
automata composition stuff developed for other projects, but neither
"generator" ever generated code - only descriptions of state models.


The main problem with the Python version is obvious - incremental
development with no real plan other than the theory, and in
particular, without appropriate underlying types for working results.
Mistakes due to bad memory. Even though it wasn't very big, it was a
"big ball of mud" (or at least a small ball of mud).


In the C++ version, I had a the benefit of that earlier experience,
and had some idea what underlying types and containers I would need.


The main underlying types represented, at various levels, an LR(1)
state description - IIRC something along the lines of...


    - A state description is a set of "state items".
    - A state item is a set of "grammar states" and a lookahead token.


That's not complete and probably not accurate, but should give the
basic idea.


The "grammar states" identified nodes in a collection of digraphs,
because each nonterminal represented its set of rules as a digraph.
This was partly because I had plans to support eBNF in grammar rules,
but mostly because (even for standard LR(1), restricting each
nonterminal to having a tree-structured digraph) it helped simplify
the description of LR states.


Sets obviously occur a fair bit. This gets more manageable if you can
use a bitvector set-of-unsigned-integers for every set - which is easy
so long as you can map from other types to an ID, allocating a new ID
every time a new value occurs. Defining a two-way mapping to handle
this using two underlying map containers is easy enough.


I'm pretty sure I needed some mappings to support searches for partial
keys (ie using custom comparisons). Not possible with std::map AFAIK,
but I used my own associative containers. These days I'd probably use
the intrusive container library from boost.


Useful bitvector operations include finding the lowest member and the
lowest member greater than some specified value, allowing easy
iterating through sets of IDs.


Once you have the underlying types and containers, the state model
derivation is (in principle) a simple "closure" algorithm.


This is "closure" in the abstract algebra sense. A set is closed over
an operation if applying that operation from one value in that set
always takes you to another value in that set. Closing a set means
adding in the minimum number of missing values so that the set becomes
closed. In this case our values are LR(1) state descriptions
(starting, obviously, with the start state) and the basic parsing
actions - in particular shifting both terminal and non-terminal
tokens.


For the closure algorithm, you have a queue of states to evaluate. At
each step, you remove one state from the queue and get its
description. You determine which parse actions are possible from that
state, and for each, derive the description of the resulting next
state. You map that description to a (potentially new) ID, and add the
state (if new) and transition to the result. New states also get added
to the queue.


When the queue is exhausted, all states derived have been fully
evaluated, and all transitions lead to existing states - the state
model is closed and (hopefully) correct.


LALR(1) parsing is identical except for tweaks to the description and
derivation of the state descriptions. LL(1) is probably also very
similar, though I've never attempted it. Certainly the key top-level
algorithm is the closure of the set-of-states, with the low-level
detail being the derivation of new state descriptions from old state
descriptions and the valid operations.


So basically, no, I don't think that this is very hard - but it can
get very fiddly and prone to turning into a "big ball of mud" unless
you have a design to work to and the right underlying containers.


BTW - it's quite impressive how often closure algorithms turn up in
automata algorithms. One annoying thing about them is that the
commonality is difficult to effectively factor out in C++ - it can be
done, but the result is usually a horrible mess, so that you're better
off just duplicating the common code. The spaghetti caused by gotos
back in the day was nothing compared to the
little-fragments-scattered-all-over-the-place that happens in so much
"best practice" OOP.


This is probably a good reason to use a functional language such as
Objective CAML instead.



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.