# Algorithms in MUSKOX parser generator

## Boris Burshteyn <bburshte@us.oracle.com>, Wed, 16 Mar 1994 20:31:53 GMT

From comp.compilers



MUSKOX ALGORITHM
Boris Burshteyn, bburshte@us.oracle.com

This short note is about the algorithm used in the MUSKOX parser generator
and about its implementation in C++. It explains the algorithm and presents
some C++ classes used in the implementation (the overall MUSKOX
functionality is explained in ([1])).

1. Big Picture

MUSKOX implements a version of David Spector's full LR1 algorithm. In the
original article ([2]), D. Spector outlined his algorithm with no
implementation details. So, I really cannot comment on the implementation
differences. On the conceptual level Spector's and MUSKOX algorithms are
similar, but differ in the details and in the order of steps. It seems
that Spector's algorithm first calculates lookaheads and builds trees that
encode the paths coming to the final states with more than one reduction,
then it splits the paths that lead to the states with reduce/reduce
conflicts. On the other hand, the MUSKOX algorithm does not build trees
that encode paths, nor does it calculate lookaheads to find potential
reduce/reduce conflicts before the split. Instead, it splits paths that
lead to the final states with several reductions right away, then
calculates lookaheads, and finally merges some of the states back during
the optimization phase.

2. General LR Considerations

Let G be a context-free grammar, and let A0 be its LR0 directed graph (how
to build LR0 graphs can be found in the literature, for example in [3]). For
our purposes we note that A0 has a unique start state S, that every
transition in A0 is marked by a symbol from grammar G, and that some
states, called final states, are marked with one or more grammar rules so
that the following `path property' holds:

If F is a final state and r = `a : alpha' is a rule that marks F, where
`alpha' is a possibly empty sequence of grammar symbols, then every path
that leads to F has length no less than length(alpha), and each of them
spells `xalpha', where x is some (possibly empty) prefix. Moreover, if R
is a state from which a path of length(alpha) leads to F, then R has an
outgoing transition marked by `a'. The above holds for empty alpha as
well: that is, if r = `a:', then F itself has an outgoing transition
marked by `a'.

The `path property' allows for LR-kind parsing of strings generated by
grammar G. That is, a push-down machine that uses the graph A0 can accept
strings generated by G and reject strings that cannot be generated by G.
Given a string s, the machine pushes the unique start state S on the
stack, and then scans the symbols of s one at a time. After scanning each
symbol t, the machine performs one of the following steps:

1. If the top-most stack state C is not final, then the machine makes the
`shift' move: it follows the transition from C that is marked by the
scanned symbol t to some state D, and D is pushed on the stack. Note that
after a series of shifts the machine advances within the graph A0
along a path that spells a substring of s.

2. If C is a final state, a decision has to be made: either to make a
shift as described in the previous step, or to make a `reduction'. A
reduction is always associated with a particular rule that marks the final
state C. Let r = `a:alpha' be the rule for which a reduction is to be
made. Then the machine pops length(alpha) states off the stack. Let D be
the top-most stack state after the reduction. According to the path
property, D has a transition marked by `a' that leads to a state E. Then E
is pushed on the stack.
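The two steps above can be sketched as a table-driven push-down loop. The
following toy C++ fragment is a sketch under assumed state numbering for the
tiny grammar `E : E '+' 't' | 't'`; the tables and the function `parse` are
illustrative inventions, not the MUSKOX implementation. Since this grammar
has no conflicts, a final state here can always reduce immediately.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace toy {

// Shift table: kShift[state][symbol] is the state to shift to, -1 if none.
// Symbols: 0 = 't', 1 = '+'.
const int kShift[5][2] = {
    {2, -1},   // state 0: start; shift 't'
    {-1, 3},   // state 1: E .          ; shift '+', or accept at end
    {-1, -1},  // state 2: E : 't' .      (final, reduce)
    {4, -1},   // state 3: E : E '+' . 't'
    {-1, -1},  // state 4: E : E '+' 't' . (final, reduce)
};
// A final state reduces by popping length(alpha) states, then takes the
// transition marked by the left-hand side 'E' (the path property
// guarantees that transition exists).
const int kReduceLen[5] = {0, 0, 1, 0, 3};
const int kGotoE[5]     = {1, -1, -1, -1, -1};

bool parse(const std::string& s) {
  std::vector<int> stack{0};  // push the unique start state S
  size_t i = 0;
  for (;;) {
    int c = stack.back();
    if (kReduceLen[c] > 0) {  // final state: make the reduction
      for (int k = 0; k < kReduceLen[c]; ++k) stack.pop_back();
      int g = kGotoE[stack.back()];
      if (g < 0) return false;
      stack.push_back(g);     // push E reached by the 'a'-transition
      continue;
    }
    if (i == s.size()) return c == 1;  // accept only in state 1 at end
    int sym = (s[i] == 't') ? 0 : (s[i] == '+') ? 1 : -1;
    if (sym < 0 || kShift[c][sym] < 0) return false;
    stack.push_back(kShift[c][sym]);   // shift move
    ++i;
  }
}

}  // namespace toy
```

In a real automaton a final state may also have outgoing transitions, and
then the shift-or-reduce decision needs the lookaheads discussed below.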

It is clear that step 1 is fully deterministic, since A0 is a deterministic
graph -- it has no two transitions from the same state marked by the same
symbol. The source of nondeterminism lies in step 2. There are three
possibilities:

1. The top-most final state C has only one associated rule and one or more
transitions. Then the machine has to choose between a shift and a single
reduction. This is a state with a potential `shift/reduce' conflict.

2. The top-most final state C has several associated rules and no
transitions. Then the machine has to choose between a number of
reductions. This is a state with a potential `reduce/reduce' conflict.

3. The top-most final state C has several associated rules and one or more
transitions. Then the machine has to choose between a shift and a number
of reductions. This is a state with potential `shift/reduce' and
`reduce/reduce' conflicts.

So, one can pre-calculate so-called `lookaheads' before parsing in
order to choose the right step when a potential shift/reduce or
reduce/reduce conflict arises. These lookaheads are sets of symbols that can
potentially appear in the source string s after a particular shift or
reduction. So, when scanning a string s, the machine can look ahead one
symbol and decide to make a reduction if the lookahead symbol belongs to
the lookahead set of the rule. Otherwise, if the lookahead symbol does not
belong to the lookahead set, the machine can choose the shift.

If all lookahead sets for all rules of a final state F have no common
elements, and they also have no common elements with the set of symbols
that mark the transitions from F, then the machine can determine
unambiguously what step (shift or reduction) to make by scanning one
symbol ahead. However, it may happen that a lookahead set for a particular
reduction from a final state F contains a symbol that marks a transition
from F (shift/reduce conflict). Or it may happen that lookahead sets for a
number of rules from a final state F have common symbols (reduce/reduce
conflict). In both cases lookahead sets do not help to make a
deterministic decision.

One can classify a number of LR parser generation techniques by how the
graph and its lookahead sets are calculated during the parser generation
phase. For example, there are LALR1 and canonical LR1 ([3]). MUSKOX uses
a different approach, which originated in ([2]).

3. Lookahead Calculation

In MUSKOX, the lookaheads for a final state F are calculated by traversing
A0 backwards starting from F. If r='a:alpha' is the rule that marks F,
then consider all the pairs of states (Ri,Gi) such that there is a path
from Ri to F that spells `alpha', and `a' marks a transition from Ri to
Gi. According to the `path property', there is at least one such pair.
Let T(G) be the set of all symbols that mark transitions from a state G.
Then the lookahead L(F) for F is calculated recursively by the formulas:

L(G) = T(G), if G is not final

L(F) = T(F) + [Sum(L(r,F)) over all reductions r that mark F], if F is final

L(r,F) = Sum(L(Gi)) over all i, if F is final

Here `+' and Sum denote set union.
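The formulas translate almost directly into a recursive traversal. The
sketch below assumes the goto structure reached through the states Gi is
acyclic (the real computation needs memoization with cycle detection, or a
fixpoint iteration); the `State` structure is an illustrative assumption,
not the MUSKOX representation.

```cpp
#include <cassert>
#include <set>
#include <vector>

struct State {
  std::set<char> shifts;  // T(G): symbols marking transitions from G
  bool isFinal = false;
  // gotoStates[r] holds the states Gi reached by the goto of reduction r.
  std::vector<std::vector<const State*>> gotoStates;
};

// L(G) = T(G) if G is not final;
// L(F) = T(F) + Sum over rules r of L(r,F), where L(r,F) = Sum of L(Gi).
std::set<char> lookahead(const State& g) {
  std::set<char> out = g.shifts;
  if (g.isFinal)
    for (const auto& rule : g.gotoStates)    // Sum over reductions r
      for (const State* gi : rule) {         // Sum over the states Gi
        std::set<char> sub = lookahead(*gi); // assumes no cycles here
        out.insert(sub.begin(), sub.end());
      }
  return out;
}
```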

It is clear from the formulas that the fewer states Gi a rule r from state
F has, the thinner its lookahead set L(r,F) may be, and therefore the less
likely it is that the sets L(r,F) for different r's will have common
elements that cause a reduce/reduce conflict. Indeed, one can transform the A0 graph into
another graph A0' by splitting the common parts of the paths that come
from different Ri to F in order to decrease the number of states Gi for a
particular rule r, and by that to eliminate reduce/reduce conflicts. Below
is the idea of the split.

4. Path Splitting

If there are two different states R1 and R2 with paths p1 and p2 that
start in R1 and R2 respectively, both p1 and p2 lead to a final state F,
and both spell `alpha', where r='a:alpha' is a rule that marks the final
state F, then we say that there is a fork on the paths leading to F. Let H
be the first common state on the paths p1 and p2 (the `fork'). Let B1
be the state on p1 previous to H, and let B2 be the state on p2 previous to H.
Let q be the common part of p1 and p2 that starts at H and goes to F. In
order to split the paths p1 and p2, we duplicate all the states of q in
another path q' that leads to a duplicated state F', then cut the link
from B2 to H and connect B2 to H' (the duplicate of H on q').

While duplicating q into q', we preserve all the outgoing transitions from
the states on q for the corresponding states on q', except for the
transitions that constitute q itself. It is clear from the formulas above,
and from the fact that we cut the transition from B2 to H, that L(r,F)
might have no L(G2) symbols in its lookahead after the split (however,
this is not guaranteed). If the L(G2) symbols indeed go away, then L(r,F)
after the split may be a proper subset of L(r,F) before the split.
Therefore, the chances increase that there will be no reduce/reduce
conflicts in F after the split.
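The duplication step can be sketched on a bare transition graph. The
`Node` structure and the function `splitPath` below are illustrative
assumptions (real states also carry reductions, lookaheads, and so on);
the sketch only shows how q is copied with its outgoing edges, how the
edges inside q' are re-routed to the duplicates, and how B2 is
reconnected to H'.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <vector>

struct Node {
  std::map<char, Node*> out;  // outgoing transitions, keyed by symbol
};

// q      : the common suffix H, ..., F (q.front() is H, q.back() is F)
// qSyms  : qSyms[i] marks the transition q[i] -> q[i+1]
// b2, sym: the state B2 and the symbol marking its edge into H
// owner  : keeps the duplicated states alive
// Returns the duplicated path q' (H' first).
std::vector<Node*> splitPath(Node* b2, char sym,
                             const std::vector<Node*>& q,
                             const std::vector<char>& qSyms,
                             std::vector<std::unique_ptr<Node>>& owner) {
  std::vector<Node*> dup;
  for (Node* s : q) {  // copy each state of q together with all out-edges
    owner.push_back(std::make_unique<Node>(*s));
    dup.push_back(owner.back().get());
  }
  for (size_t i = 0; i + 1 < dup.size(); ++i)
    dup[i]->out[qSyms[i]] = dup[i + 1];  // re-route the edges inside q'
  b2->out[sym] = dup.front();            // cut B2 -> H, connect B2 -> H'
  return dup;
}
```

After the call, the first path still runs B1 -> H -> ... -> F, while the
second runs B2 -> H' -> ... -> F', so the two reductions no longer share
their suffix.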

After splitting all existing forks we eventually reach the resultant
automaton A1. Since A1 still has the `path property', it can be used in
the push-down algorithm. At the same time, the lookahead sets for its
states are somewhat `thinner' than the lookahead sets for A0 and,
therefore, potentially have fewer reduce/reduce conflicts.

After the split we calculate lookaheads for those final states that have
outgoing transitions or that are marked by more than one reduction. Then
we compare lookaheads to find out what states still have reduce/reduce and
shift/reduce conflicts. If we find such states, then we conclude that the
original grammar is not LR1. Since we split all the forks of A0
unconditionally, even for states with no reduce/reduce conflict, some
of the new states are merged back during the optimization phase.
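A minimal sketch of this merge-back, under the simplifying assumption that
a state is fully described by its outgoing transitions and the lengths of
its reductions (the invented `MiniState` is far thinner than the real
state classes): two equivalent states are merged by redirecting every edge
that targets one of them to the other.

```cpp
#include <cassert>
#include <map>
#include <vector>

struct MiniState {
  std::map<char, const MiniState*> out;  // outgoing transitions
  std::vector<int> reduceLens;           // one length per marking rule
};

// Two states are interchangeable if they shift and reduce identically.
bool equivalent(const MiniState& a, const MiniState& b) {
  return a.out == b.out && a.reduceLens == b.reduceLens;
}

// Redirect every transition that targets `from` to `to`; afterwards
// `from` is unreachable and can be discarded.
void mergeInto(const MiniState* from, const MiniState* to,
               std::vector<MiniState*>& all) {
  for (MiniState* s : all)
    for (auto& [sym, tgt] : s->out)
      if (tgt == from) tgt = to;
}
```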

Note that path splitting can eliminate only reduce/reduce conflicts; it
cannot eliminate shift/reduce conflicts. Indeed, every duplicated state S'
has transitions to the same states as the original state S. Therefore,
for any final state F with a shift/reduce conflict there is always a
state F' among its images that has the same conflict.

5. Performance

I hope that the above sketch, together with Spector's original article,
gives enough insight into the technique. His article also contains a number
of graphical examples that show exactly how lookahead sets become thinner
after the split.

Spector's original algorithm first calculates lookahead sets for the
final states with two or more reductions. For that, it maintains a tree of
paths that lead to the final states. Then only those paths are split for
which the calculated lookaheads resolve the potential conflict.

In MUSKOX, paths are split unconditionally, and no tree of paths is
maintained. This is done in order to simplify the implementation: instead
of introducing a new tree-like structure to account for lookaheads,
lookaheads are stored with the states of the automaton with split paths.
After the split occurs and lookaheads are calculated, the optimization
phase merges equivalent states that appeared (possibly unnecessarily)
during the split. Indeed, the resultant automata for a number of popular
grammars (such as C, F77, etc.) have no more states than the automata
generated by YACC or BISON.

Unless there is a pathological conflict in a grammar, MUSKOX will not
consume a lot of memory while splitting the paths and calculating
lookaheads. In fact, it takes ~400Kbytes for each of the C, F77, and
MUSKOX grammars (note - this is not the size of the generated tables, but
the amount of consumed memory during the parser generation; the resulting
tables are usually smaller than the ones generated by YACC or BISON).
Also, MUSKOX usually takes the same or less time than YACC or BISON when
working on the same grammars.

The only interesting pathological case I know about is a COBOL grammar
from PCYACC (TM of ABRAXAS SOFTWARE). There, the MUSKOX algorithm reached
the limit of 90Mbytes and then stopped, since no virtual memory was left.
However, MUSKOX can also calculate lookaheads without splitting the paths,
and it turned out that the lookaheads were already disjoint for the
original LR0 automaton, so the grammar had no reduce/reduce conflicts.

6. Acknowledgments

My sincere thanks to Gray Clossman and John Weisz who have carefully
read this note and suggested improvements in the presentation.

REFERENCES

[1] Boris Burshteyn. "Muskox - a Tool for Object Oriented Design and
Programming" ftp from <iecc.com:pub/file/muskox.ps.gz>, or email <send
muskok.uu> to <compilers-server@iecc.com>.

[2] David Spector. "Efficient Full LR(1) Parser Generation" SIGPLAN
NOTICES, V. 23 N. 12 1988 (143-150).

[3] A.V. Aho, R. Sethi, J.D. Ullman. "Compilers: Principles, Techniques,
and Tools". Addison-Wesley, 1986.

Appendix

Here we present some annotated class definitions used in the MUSKOX algorithm.
These classes implement states of the parsing automaton and operations
involved in the construction of the automaton, splitting paths,
calculating lookaheads, etc. The main inheritance hierarchy implements
states with no reductions (simpleState), final states with no transitions
(finalState), and final states with transitions (fullState). Class
protoState contains members common to all the states. Class 'edges'
implements transitions between the states.

Please note that these are not all the classes used in the MUSKOX code. The
information presented just gives a flavor of the C++ features used in the
implementation. For example, one can see how data and function members are
reused, and how some of the member-functions are redefined in the derived
classes.

//
// state transitions
//
class edges : virtual public memAlloc
{
public:
int Counter; // # of outgoing edges
int Free; // # of free slots in Edges
protoState **Edges; // pointer to the array of outgoing edges
. . .
};

//
// common members to all states
//
class protoState : public proto
{
public:
token *In; // name of the incoming edges
edges InEdges; // incoming edges
superState *SuperState; // this state is a superstate
protoState *Trace; // trace path during NFA traversing
virtual void SplitPaths(protoState **){}
void SplitPathBackwards(recordPath *, protoState **);
protoState *FindPaths(int, int *, protoState *, int);
virtual int EnumerateLookAheads() { return 0; }
virtual void ResolveConflicts() {}
void MergeWith(protoState *);
. . .
};

//
// a set of tokens (used for lookaheads)
//
class tokens : public virtual memAlloc
{
public:
int Counter; // # of lookaheads
token **Tokens; // array of tokens
. . .
};

//
// record a state after reduction, then goto
//
class reduceAndGoto : public memAlloc
{
public:
edges *Reduce; // states after reduce
edges *Goto; // states to goto after reduce
. . .
};

//
// classes for conflicts
//
enum conflictType
{
SHIFTREDUCE_,
REDUCEREDUCE_
};

class conflictTokens : public memAlloc
{
public:
token *Token;
conflictType ConfType;
conflictTokens *Next;
. . .
};

//
// reduction for a particular rule
//
class reduction : public memAlloc
{
public:
int Reduce; // # of stack elements to reduce
nonTerminal *Goto; // goto after reduce
reduceAndGoto **Paths; // states to reduce in the case of conflicts
// (array of InEdges.Counter elements)
assoc *Assoc; // associativity and precedence of the rule
conflictTokens *Conflict; // shift/reduce or reduce/reduce lookaheads
action *Action; // action list to execute before reduction
. . .
};

//
// all reductions for a final or full state
//
class reductions : public virtual memAlloc
{
public:
int Counter; // # of reductions
reduction *Reductions; // pointer to the array of reductions
void CreatePaths(protoState *, int);
void SplitPaths_(protoState *, protoState **);
void ResolveConflicts_(squezeColumn *);
void ResolveConflicts_();
. . .
};

//
// state with no rules
//
class simpleState : public protoState, public edges
{
public:
simpleState(token *t, int e) : protoState(t), edges(e){};
simpleState() : protoState(), edges(){};
virtual tokenType Type() { return SIMPLESTATE_; }
virtual void CutOutEdges();
. . .
};

//
// final state with rules and transitions
//
class fullState : public simpleState, public reductions
{
public:
virtual tokenType Type() { return FULLSTATE_; }
virtual void SplitPaths(protoState **);
virtual void ResolveConflicts();
. . .
};

//
// final state with no transitions
//
class finalState : public protoState, public reductions
{
public:
virtual tokenType Type() { return FINALSTATE_; }
virtual void SplitPaths(protoState **);
. . .
};