
Newsgroups: comp.compilers
From: Boris Burshteyn <bburshte@us.oracle.com>
Keywords: tools, parse
Organization: Compilers Central
Date: Wed, 16 Mar 1994 20:31:53 GMT

MUSKOX ALGORITHM

Boris Burshteyn, bburshte@us.oracle.com

Several netters have asked me about the parser generation algorithm used
in the MUSKOX parser generator, and about its implementation in C++. This
short note explains the algorithm and presents some of the C++ classes used
in the implementation (the overall MUSKOX functionality is described in
[1]).

1. Big Picture

MUSKOX implements a version of David Spector's full LR1 algorithm. In the
original article ([2]), Spector outlined his algorithm without
implementation details, so I cannot really comment on implementation
differences. At the conceptual level, Spector's and MUSKOX's algorithms are
similar, but they differ in the details and in the order of steps. It seems
that Spector's algorithm first calculates lookaheads and builds trees that
encode the paths coming into the final states with more than one reduction,
and then splits the paths that lead to the states with reduce/reduce
conflicts. The MUSKOX algorithm, on the other hand, builds no trees that
encode paths, nor does it calculate lookaheads to find potential
reduce/reduce conflicts before the split. Instead, it splits the paths that
lead to final states with several reductions right away, then calculates
lookaheads, and finally merges some of the states back during the
optimization phase.

2. General LR Considerations

Let G be a context-free grammar, and let A0 be its LR0 directed graph (how
to build LR0 graphs can be found in the literature, for example in [3]).
For our purposes we note that A0 has a unique start state S, that every
transition in A0 is marked by a symbol from grammar G, and that some
states, called final states, are marked with one or more grammar rules, so
that the following `path property' holds:

If F is a final state and r = `a : alpha' is a rule that marks F, where
`alpha' is a possibly empty sequence of grammar symbols, then all the paths
that lead to F have lengths no less than length(alpha), and all of them
spell `x alpha', where x is some (possibly empty) prefix. Moreover, let R
be a state from which a path of length(alpha) leads to F. Then R has an
outgoing transition marked by `a'. The above holds for empty alpha as
well: that is, if r = `a :', then F itself has an outgoing transition
marked by `a'.

The `path property' allows LR-style parsing of strings generated by
grammar G. That is, a push-down machine that uses the graph A0 can accept
strings generated by G and reject strings that cannot be generated by G.
Given a string s, the machine pushes the unique start state S onto the
stack, and then scans symbols from s one at a time. After scanning each
symbol t, the machine performs one of the following steps:

1. If the top-most stack state C is not final, the machine makes a `shift'
move: it follows the transition from C marked by the scanned symbol t to
some state D, and D is pushed onto the stack. Note that after a series of
shifts the machine advances within the graph A0 along a path that spells a
substring of s.

2. If C is a final state, a decision has to be made: either to make a
shift as described in the previous step, or to make a `reduction'. A
reduction is always associated with a particular rule that marks the final
state C. Let r = `a : alpha' be the rule for which a reduction is to be
made. The machine then pops length(alpha) states off the stack. Let D be
the top-most stack state after the reduction. According to the path
property, D has a transition marked by `a' leading to some state E. Then E
is pushed onto the stack.
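The two steps above can be sketched as a small table-driven loop. The
following is an illustrative toy in C++, not the MUSKOX code: the table
layout, names, and the accept-state convention are all made up here, and a
final state always reduces, ignoring the conflict-resolution questions
discussed next.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical tables for a toy automaton; names are illustrative only.
struct Rule { std::string lhs; int rhsLen; };

struct Machine {
    // shift[state][symbol] -> next state; the same table also serves the
    // `goto' transitions taken on a rule's left-hand side after a reduction
    std::map<int, std::map<std::string, int> > shift;
    // reduce[state] -> the single rule marking that final state
    std::map<int, Rule> reduce;

    bool parse(const std::vector<std::string>& input, int acceptState) {
        std::vector<int> stack;
        stack.push_back(0);                      // push the start state S
        size_t pos = 0;
        for (;;) {
            int top = stack.back();
            if (top == acceptState && pos == input.size()) return true;
            std::map<int, Rule>::iterator fin = reduce.find(top);
            if (fin != reduce.end()) {           // step 2: reduce
                const Rule& r = fin->second;
                for (int i = 0; i < r.rhsLen; ++i) stack.pop_back();
                int d = stack.back();            // path property: goto exists
                if (!shift[d].count(r.lhs)) return false;
                stack.push_back(shift[d][r.lhs]);
            } else if (pos < input.size()        // step 1: shift
                       && shift[top].count(input[pos])) {
                stack.push_back(shift[top][input[pos]]);
                ++pos;
            } else {
                return false;                    // stuck: reject
            }
        }
    }
};
```

For example, with the tables 0 -'x'-> 1 (final, rule `a : x') and
0 -'a'-> 2 (accept), the machine accepts "x" and rejects "y".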

It is clear that step 1 is fully deterministic, since A0 is a
deterministic graph -- it has no two transitions from the same state
marked by the same symbol. The source of nondeterminism lies in step 2,
where there are three possibilities:

1. The top-most final state C has only one associated rule and one or more

transitions. Then the machine has to choose between a shift and a single

reduction. This is a state with a potential `shift/reduce' conflict.

2. The top-most final state C has several associated rules and no

transitions. Then the machine has to choose between a number of

reductions. This is a state with a potential `reduce/reduce' conflict.

3. The top-most final state C has several associated rules and one or more

transitions. Then the machine has to choose between a shift and a number

of reductions. This is a state with potential `shift/reduce' and

`reduce/reduce' conflicts.

3. Lookaheads

To choose the right step when a potential shift/reduce or reduce/reduce
conflict arises, one can pre-calculate so-called `lookaheads' before
parsing. These lookaheads are sets of symbols that can potentially appear
in the source string s after a particular shift or reduction. When
scanning a string s, the machine can then look ahead one symbol and decide
to make a reduction if the lookahead symbol belongs to the lookahead set
of the rule; otherwise, if the lookahead symbol does not belong to the
lookahead set, the machine can choose the shift.

If all lookahead sets for all rules of a final state F have no common

elements, and they also have no common elements with the set of symbols

that mark the transitions from F, then the machine can determine

unambiguously what step (shift or reduction) to make by scanning one

symbol ahead. However, it may happen that a lookahead set for a particular

reduction from a final state F contains a symbol that marks a transition

from F (shift/reduce conflict). Or it may happen that lookahead sets for a

number of rules from a final state F have common symbols (reduce/reduce

conflict). In both cases lookahead sets do not help to make a

deterministic decision.
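For a final state with a single rule, the one-symbol decision can be
sketched as follows. This is an illustrative fragment, not MUSKOX code;
the two sets are assumed to have been precomputed.

```cpp
#include <cassert>
#include <set>
#include <string>

// shifts:   symbols marking the transitions from the final state F
// reduceLA: the lookahead set of F's single rule r
// next:     the one-symbol lookahead scanned from the input
enum class Action { Shift, Reduce, Error, Conflict };

Action decide(const std::set<std::string>& shifts,
              const std::set<std::string>& reduceLA,
              const std::string& next) {
    bool canShift  = shifts.count(next) != 0;
    bool canReduce = reduceLA.count(next) != 0;
    if (canShift && canReduce) return Action::Conflict; // shift/reduce
    if (canShift)  return Action::Shift;
    if (canReduce) return Action::Reduce;
    return Action::Error;                               // syntax error
}
```

When the two sets are disjoint, every lookahead symbol yields a
deterministic Shift or Reduce; a common element surfaces as Conflict.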

One can classify a number of LR parser generation techniques by how the
graph and its lookahead sets are calculated during the parser generation
phase; for example, there is LALR1 and canonical LR1 ([3]). MUSKOX uses a
different approach, which originated in ([2]).

In MUSKOX, the lookaheads for a final state F are calculated by traversing
A0 backwards starting from F. If r = `a : alpha' is a rule that marks F,
consider all pairs of states (Ri, Gi) such that there is a path from Ri to
F that spells `alpha', and `a' marks a transition from Ri to Gi. According
to the `path property', there is at least one such pair. Let T(G) be the
set of all symbols that mark transitions from a state G. Then the
lookahead L(F) for F is calculated recursively by the formulas:

L(G) = T(G), if G is not final

L(F) = T(F) + [Sum(L(r,F)) for all rules r that mark F], if F is final

L(r,F) = Sum(L(Gi)) for all i, if F is final
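The formulas translate into a short recursive procedure. The sketch below
is illustrative only, not the MUSKOX implementation: it assumes the Gi
states for each rule have already been collected by the backward
traversal, and it uses a `seen' set to cut off cycles in the graph, a
detail the formulas leave implicit.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Illustrative-only state: T holds T(G), the symbols on the outgoing
// transitions; for a final state, goStates[r] lists the Gi reached by the
// goto on rule r's left-hand side (the backward traversal that finds them
// is omitted here).
struct State {
    std::set<std::string> T;
    std::map<std::string, std::vector<State*> > goStates;
    bool isFinal() const { return !goStates.empty(); }
};

// L(G) per the formulas; a state already in `seen' contributes only T(G),
// which breaks cycles.
std::set<std::string> lookahead(State* g, std::set<State*>& seen) {
    std::set<std::string> out = g->T;             // T(G) in every case
    if (!g->isFinal() || !seen.insert(g).second) return out;
    for (std::map<std::string, std::vector<State*> >::iterator rg =
             g->goStates.begin(); rg != g->goStates.end(); ++rg)
        for (size_t i = 0; i < rg->second.size(); ++i) {
            // L(r,F) = Sum(L(Gi)); united into L(F)
            std::set<std::string> s = lookahead(rg->second[i], seen);
            out.insert(s.begin(), s.end());
        }
    return out;
}
```

For instance, if a final state F with T(F) = {z} has a single rule whose
only Gi is a non-final state with T(Gi) = {x, y}, then L(F) = {x, y, z}.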

It is clear from the formulas that the fewer states Gi a rule r from state
F has, the thinner its lookahead set L(r,F) may be, and therefore the less
likely it is that the L(r,F) for different r's will have common elements
that cause a reduce/reduce conflict. Indeed, one can transform the graph
A0 into another graph A0' by splitting the common parts of the paths that
come from different Ri to F, in order to decrease the number of states Gi
for a particular rule r and thereby eliminate reduce/reduce conflicts. The
idea of the split follows.

4. Path Splitting

If there are two different states R1 and R2 with paths p1 and p2 that
start in R1 and R2 respectively, both p1 and p2 lead to a final state F,
and both spell `alpha', where r = `a : alpha' is a rule that marks the
final state F, then we say that there is a fork on the paths leading to F.
Let H be the first common state on both paths p1 and p2 (the `fork'), let
B1 be the state on p1 just before H, let B2 be the state on p2 just before
H, and let q be the common part of p1 and p2 that goes from H to F. In
order to split the paths p1 and p2, we duplicate all the states of q in
another path q' that leads to a duplicated state F', then cut the link
from B2 to H and connect B2 to H' (the duplicate of H on q').

While duplicating q into q', we preserve all the outgoing transitions of
the states on q in the corresponding states on q', except for the
transitions that constitute q itself. It is clear from the formulas above,
and from the fact that we cut the transition from B2 to H, that L(r,F)
might have no L(G2) symbols in its lookahead after the split (however,
this is not guaranteed). If the L(G2) symbols indeed go away, then L(r,F)
after the split may be a proper subset of L(r,F) before the split.
Therefore, the chances increase that there will be no reduce/reduce
conflicts in F after the split.
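The mechanics of the split can be sketched on a bare-bones graph. Again,
this is an illustration under assumptions of my own (the Node layout and
the way the path q is handed in are made up here), not the MUSKOX code: it
duplicates the states of q = H..F into q' = H'..F', keeps each duplicate's
other outgoing edges, rebuilds the edges that constitute q inside q', and
redirects B2's edge from H to H'.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

struct Node {
    std::string name;
    std::map<std::string, Node*> out;  // transitions, keyed by symbol
};

// q lists the states H..F in order; syms[i] labels the edge q[i] -> q[i+1];
// b2Sym labels the edge B2 -> H. Duplicates are appended to `made' so the
// caller owns them. Returns H'.
Node* splitPath(Node* b2, const std::string& b2Sym,
                const std::vector<Node*>& q,
                const std::vector<std::string>& syms,
                std::vector<Node*>& made) {
    std::vector<Node*> dup;
    for (size_t i = 0; i < q.size(); ++i) {
        // copy the state, keeping all its outgoing transitions
        Node* d = new Node(*q[i]);
        d->name += "'";
        dup.push_back(d);
        made.push_back(d);
    }
    for (size_t i = 0; i + 1 < dup.size(); ++i)
        dup[i]->out[syms[i]] = dup[i + 1];  // rebuild q inside q'
    b2->out[b2Sym] = dup.front();           // cut B2 -> H, link B2 -> H'
    return dup.front();
}
```

After the split, B1 still reaches F through the original q, while B2
reaches the duplicate F' through q', which is exactly what thins L(r,F).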

After splitting all existing forks we eventually reach the resulting
automaton A1. Since A1 still has the `path property', it can be used in
the push-down algorithm. At the same time, the lookahead sets for its
states are somewhat `thinner' than the lookahead sets for A0 and,
therefore, potentially cause fewer reduce/reduce conflicts.

After the split we calculate lookaheads for those final states that have
outgoing transitions or are marked by more than one reduction. Then we
compare lookaheads to find out which states still have reduce/reduce or
shift/reduce conflicts. If we find such states, we conclude that the
original grammar is not LR1. Since we split all the forks of A0
unconditionally, even for states with no reduce/reduce conflict, some of
the new states are merged back during the optimization phase.

Note that path splitting can eliminate only reduce/reduce conflicts; it
cannot eliminate shift/reduce conflicts. Indeed, every duplicated state S'
has transitions to the same states as the original state S. Therefore, for
any final state F with a shift/reduce conflict there is always a state F'
among its images that has the same conflict.

5. Performance

I hope that the above sketch, together with Spector's original article,
gives enough insight into the technique. His article also contains a
number of graphical examples that show exactly how lookahead sets become
thinner after the split.

Spector's original algorithm first calculates lookahead sets for the final
states with two or more reductions. For that, it maintains a tree of paths
that lead to the final states. Then only those paths are split for which
the calculated lookaheads resolve the potential conflict.

In MUSKOX, paths are split unconditionally, and no tree of paths is
maintained. This is done in order to simplify the implementation: instead
of introducing a new tree-like structure to account for lookaheads,
lookaheads are stored with the states of the automaton with split paths.
After the split occurs and lookaheads are calculated, the optimization
phase merges equivalent states that appeared (possibly unnecessarily)
during the split. Indeed, the resulting automata for a number of popular
grammars (like C, F77, etc.) have no more states than the automata
generated by YACC or BISON.

Unless there is a pathological conflict in a grammar, MUSKOX does not
consume a lot of memory while splitting the paths and calculating
lookaheads. In fact, it takes ~400 Kbytes for each of the C, F77, and
MUSKOX grammars (note: this is not the size of the generated tables, but
the amount of memory consumed during parser generation; the resulting
tables are usually smaller than the ones generated by YACC or BISON).
Also, MUSKOX usually spends the same or less time than YACC or BISON on
the same grammars.

The only interesting pathological case I know about is a COBOL grammar
from PCYACC (TM of ABRAXAS SOFTWARE). There, the MUSKOX algorithm reached
the limit of 90 Mbytes and then stopped, since there was no virtual memory
left. However, MUSKOX can calculate lookaheads without splitting the
paths, and it turned out that the lookaheads were already different for
the original LR0 automaton, so the grammar had no reduce/reduce conflicts.

6. Acknowledgments

My sincere thanks to Gray Clossman and John Weisz who have carefully

read this note and suggested improvements in the presentation.

REFERENCES

[1] Boris Burshteyn. "Muskox - a Tool for Object Oriented Design and
Programming". ftp from <iecc.com:pub/file/muskox.ps.gz>, or email <send
muskok.uu> to <compilers-server@iecc.com>.

[2] David Spector. "Efficient Full LR(1) Parser Generation". SIGPLAN
Notices, Vol. 23, No. 12, 1988, pp. 143-150.

[3] A.V. Aho, R. Sethi, J.D. Ullman. "Compilers: Principles, Techniques,
and Tools". Addison-Wesley, 1986.

Appendix

Here we present some annotated class definitions used in the MUSKOX
algorithm. These classes implement the states of the parsing automaton and
the operations involved in constructing the automaton, splitting paths,
calculating lookaheads, etc. The main inheritance hierarchy implements
states with no reductions (simpleState), final states with no transitions
(finalState), and final states with transitions (fullState). Class
protoState contains the members common to all states. Class 'edges'
implements transitions between the states.

Please note that these are not all the classes used in the MUSKOX code.
The information presented just gives a flavor of the C++ features used in
the implementation. For example, one can see how data and function members
are reused, and how some of the member functions are redefined in the
derived classes.

//
// state transitions
//
class edges : virtual public memAlloc
{
public:
    int Counter;            // # of outgoing edges
    int Free;               // # of free slots in Edges
    protoState **Edges;     // pointer to the array of outgoing edges
    . . .
};

//
// members common to all states
//
class protoState : public proto
{
public:
    token *In;                  // name of the incoming edges
    edges InEdges;              // incoming edges
    superState *SuperState;     // this state is a superstate
    tokens **LookAheads;        // lookahead tokens
    protoState *Trace;          // trace path during NFA traversal

    virtual void SplitPaths(protoState **) {}
    void SplitPathBackwards(recordPath *, protoState **);
    protoState *FindPaths(int, int *, protoState *, int);
    virtual void GatherUnitedLookAhead(long, int) {}
    virtual int EnumerateLookAheads() { return 0; }
    virtual void MoveLookAheads(token ***) {}
    virtual void ResolveConflicts() {}
    void MergeWith(protoState *);
    . . .
};

//
// lookaheads
//
class tokens : public virtual memAlloc
{
public:
    int Counter;        // # of lookaheads
    token **Tokens;     // array of tokens
    tokens *Link;       // link to the next set of tokens
    . . .
};

//
// record a state after reduction, then goto
//
class reduceAndGoto : public memAlloc
{
public:
    edges *Reduce;          // states after reduce
    edges *Goto;            // states to goto after reduce
    tokens *LookAhead;      // lookahead set for this reduce and goto
    . . .
};

//
// classes for conflicts
//
enum conflictType
{
    SHIFTREDUCE_,
    REDUCEREDUCE_
};

class conflictTokens : public memAlloc
{
public:
    token *Token;
    conflictType ConfType;
    conflictTokens *Next;
    . . .
};

//
// reduction for a particular rule
//
class reduction : public memAlloc
{
public:
    int Reduce;                 // # of stack elements to reduce
    nonTerminal *Goto;          // goto after reduce
    reduceAndGoto **Paths;      // states to reduce in the case of conflicts
                                // (array of InEdges.Counter elements)
    tokens *LookAhead;          // all lookaheads united from Paths
    assoc *Assoc;               // associativity and precedence of the rule
    conflictTokens *Conflict;   // shift/reduce or reduce/reduce lookaheads
    action *Action;             // action list to execute before reduction

    void GatherLookAhead(long, int);
    void UniteLookAheads(int);
    . . .
};

//
// all reductions for a final or full state
//
class reductions : public virtual memAlloc
{
public:
    int Counter;                // # of reductions
    reduction *Reductions;      // pointer to the array of reductions

    void CreatePaths(protoState *, int);
    void GatherUnitedLookAhead_(protoState *, long, int);
    void SplitPaths_(protoState *, protoState **);
    void ResolveConflicts_(squezeColumn *);
    void ResolveConflicts_();
    . . .
};

//
// state with no rules
//
class simpleState : public protoState, public edges
{
public:
    simpleState(token *t, int e) : protoState(t), edges(e) {}
    simpleState() : protoState(), edges() {}
    virtual tokenType Type() { return SIMPLESTATE_; }
    virtual void GatherUnitedLookAhead(long, int) {}
    virtual int EnumerateLookAheads() { return EnumerateLookAheads_(); }
    virtual void MoveLookAheads(register token ***t) { MoveLookAheads_(t); }
    virtual void CutOutEdges();
    . . .
};

//
// final state with rules and transitions
//
class fullState : public simpleState, public reductions
{
public:
    virtual tokenType Type() { return FULLSTATE_; }
    virtual void SplitPaths(protoState **);
    virtual void GatherUnitedLookAhead(long, int);
    virtual int EnumerateLookAheads() { return EnumerateLookAheads_(); }
    virtual void MoveLookAheads(register token ***t) { MoveLookAheads_(t); }
    virtual void ResolveConflicts();
    . . .
};

//
// final state with no transitions
//
class finalState : public protoState, public reductions
{
public:
    tokens *UnitedLookAhead;

    virtual tokenType Type() { return FINALSTATE_; }
    virtual void SplitPaths(protoState **);
    virtual void GatherUnitedLookAhead(long, int);
    virtual int EnumerateLookAheads() { return 0; }
    virtual void MoveLookAheads(register token ***) {}
    virtual void ResolveConflicts();
    . . .
};
