LR(infinity) with EBNF and Context-Free Expressions

"Mark" <whopkins@alpha2.csd.uwm.edu>
18 Oct 2002 23:38:14 -0400

          From comp.compilers

Related articles
LR(infinity) with EBNF and Context-Free Expressions whopkins@alpha2.csd.uwm.edu (Mark) (2002-10-18)
| List of all articles for this month |

From: "Mark" <whopkins@alpha2.csd.uwm.edu>
Newsgroups: comp.compilers
Date: 18 Oct 2002 23:38:14 -0400
Organization: University of Wisconsin - Milwaukee, Computing Services Division
Keywords: parse
Posted-Date: 18 Oct 2002 23:38:14 EDT

From SLK Parsers (parsersinc@yahoo.com):
> The classical, and proven grammar analysis algorithms in parsing theory are
> based on definitions that do not include a notion of regular expressions as
> in EBNF. Adhoc programming methods that internally translate EBNF run the
> risk of introducing incorrectness.


A proper understanding of the parsing problem (i.e., in algebraic
terms), allows one to extend the common methods to the more general
case.


First, one should understand that parsers do not work with context
free grammars, but with syntax direct translations (SDT's), and
usually with simple SDT's (SSDT's). One generally only says that such
and such grammar G is being "processed" by a parser only in reference
to one of its canonical extensions to a SSDT. It's the extension
that's actually being worked with, not G.


For a grammar G, such as:
                    S -> T | S a T
                    T -> F | T m F
                    F -> u S v | n F | x
over alphabet X = {a,m,u,v,n,x}, the canonical bottom-up extension BU(G) would
be the SSDT:
                  start -> S z
                  S -> T b | S a T c
                  T -> F d | T m F e
                  F -> u S v f | n F g | x h
with output alphabet Y = {b,c,d,e,f,g,h,z} and a "standard bottom-up
interpretation" for Y as
                      z = accept
                      b = reduce T to S
                      c = reduce S a T to S
and so on.


The canonical top-down extension TD(G) would be the SSDT:
                  start -> z S
                  S -> b T | c S a T
                  T -> d F | e T m F
                  F -> f u S v | g n F | h x
with the same output alphabet, but a "standard top-down interpretation" as:
                      z = build root
                      b = build branch S -> T
                      c = build branch S -> S a T
and so on.


The original grammar G is a grammar over the word algebra X*, and the
extensions BU(G) and TD(G) are grammars over the word algebra X* x Y*.


However, since the concept of grammars over word algebras is (still!)
unknown, it'll be necessary to define this first.


So, with that, we'll proceed below to show how to do LR parsing in the
most general context.


This is divided into the sections:


(1) Grammars Over Word Algebras
(2) Grammars As Systems Of Inequalities
(3) LR Parsing On CFG's In Algebraic Form
(4) LR Parsing On EBNF Systems In Algebraic Form
(5) Graphical Depiction Of The Parsing Mechanism
(6) The Formal Bra-Ket Stack Algebra
(7) The LR Parser As A Context-Free Expression (simple CFG's)
(8) The Context-Free Expresson Algebra
(9) The LR Parser As A Context-Free Expression (EBNF's)
(10) The LR Parser For As A Context-Free (Translation) Expression (SSDT's)


capped off with the punchline:


(11) Direct Algebraic Manipulation Supersedes And Obsoletes LR Parsing


--------------------------------------


(1) Grammars Over Word Algebras


In general, a grammar G = (Q,S,M,H) defines a subset of a
word algebra M. At the very least, M is a MONOID, which is defined as an
algebra containing the items
                              1 (corresponding to "empty word")
                              (u,v) |-> uv (corresponding to "concatenation").
M could have more structure, such as the structure for a group, linear
algebra, ring, field, etc., and everything below proceeds virtually
unchanged (the exception: free extensions M[Q], referred to below, are
defined as free extensions *with respect to* whatever structure one is
working within).


Q is a set of variables (often called, in respect to the tree-interpretation
"non-terminals") and S the main variable. Q can be any set, finite or
infinite. There is, for instance, such a notion as an infinite regular
grammar.


For general grammars, H is a subset of M[Q] x M[Q], where M[Q] is the
free extension of M by the set Q. For context-free grammars, H is
restricted to being a subset of Q x M[Q].


For a regular grammar, one restricts H further to the subset Q x (M[Q] + M).


The closest thing you have to a general notion of context-sensitive
grammars is to restrict H to the subset M[Q] x M[Q] of words (a,b)
for which |a| <= |b|, where |.| is defined recursively by:
                                |m| = L(m), m in M
                                |q| = 1, for q in Q
                                |ab| = |a| + |b|
where L: M -> Real Numbers is a suitable function defined over M such
that
                                            L(1) = 0; L(mn) = L(m) + L(n)
For X*, one normally defines L(w) = length of word w, to recover the
usual definition of context-sensitive grammar.


If the monoid M were X*, then M[Q] would be X*[Q] = (X + Q)*, where +
denotes disjoint union. So, in the "standard" definition of a grammar,
you'll see reference to (X + Q)* in the description of the set H of
productions. The fact that (X + Q)* generalizes to M[Q] is what's
still unknown. (Except it's not unknown anymore since you're reading
it here).


The set H of productions defines a transformation system over M[Q].
Corresponding to H is the relation => defined recursively by:
                                  (1) a => a, for all a in M[Q]
                                  (2) if a => bdc, (d,e) in H then a => bec
                                          for all a, b, c, d, e in M[Q]
And corresponding to each a in M[Q] is the set
                          [a] = { m in M: a => m }, for a in M[Q].
of all words in M generated from a.


The subset of M generated by the grammar is then [S]. Thus L(G) = [S].


This relation has the following properties:


                  (3) If a => b, b => c, then a => c
                  (4) If a => b, then cad => cbd
                  (5) If (a,b) in H, then cad => cbd
                  (6) If (a,b) in H, cbd => e, then cad => e
                  (7) If a => b, c => d, then ac => bd
and following from this, you get:
                  (8) [ab] >= [a][b]
                  (9) [m] >= {m}, for m in M.
                  (10) [man] >= {m}[a]{n}, for m, n in M
For context-free grammars, (8) and (9) strengthen to
                  (8') [ab] = [a][b]
                  (9') [m] = {m}.
                  (10') [man] = {m}[a]{n}
In fact, the property (10') is where "context-free" part of the name, itself,
comes from. It says that the reductions a => m' are independent of the
left context m and right context n.


When M = X*, the general definition of a context-free grammar gives you the
usual definition of a context-free grammar over an alphabet X. For
M = X* x Y*, it gives you the usual definition of a simple syntax directed
translation with input alphabet X, output alphabet Y. Note that:


                              X*[Q] = (X + Q)*


                              X* x Y* = (X + Y)*,
                              subject to relations xy=yx, x in X, y in Y


                              (X* x Y*)[Q] = (X + Y + Q)*,
                              subject to relations xy=yx, x in X, y in Y.


The role the commutativity rule xy = yx plays in parsing applications is that
each application of the rule is, in essence, a "look-ahead" operation, when
applied in the direction yx -> xy. The commutativity rule is, in fact,
the algebraic underpinning of look-ahead.


--------------------------------------


(2) Grammars As Systems Of Inequalities


The grammar description can be recast in purely algebraic form.


Let M be a monoid. Its power set 2^M is then also a monoid with:


                                    * Identity {1}
                                    * Product AB = { ab: a in A, b in B },


but is also partially ordered with
                                    * A >= B if A contains B as a subset.


It has the additional structure:
                                        0 <==> {}
                                        A + B <==> A union B


The + operation corresponds to the "|" above. It is the least-upper-bound
operator with respect to the partial ordering >=, and 0 is the least
element:
                                                      0 = lub {}
                                              A + B = lub {A, B}


A grammar (of the general type) can be regarded as a system
of inequalities:
                                      a >= b, if (a,b) is in H,
taken over 2^M. For context-free grammars, then, the solution set:
                                        (q = [q]: q in Q)
represents the LEAST solution to the corresponding system. The
subset [S] generated by the grammar is then the least solution to the
system of inequalities in the main variable S.


The free extension 2^M[Q] is, itself, also a partially ordered algebra with
the same additional structure as 2^M:
                                                    a >= b; a + b; 0.


It's this structure that the corresponding system of inequalities is
defined over.


Example: S -> empty word; S -> S T; T -> x; T -> u S v.
                  X = { u, x, v }.
The corresponding system is:
                  S >= 1 + ST = {empty word} union ST
                  T >= x + uSv = {x} union {u}S{v}
and the least solution is:
                  S = [S] = D(u,v) ^ {x}*
                  T = [T] = (D(u,v) ^ {x}*)+
where
                  D(u,v) = The Dyck language on the complementary set u, v.
                  A ^ B = The interleave of languages A, B
                  A+ = A A*.


--------------------------------------


(3) LR Parsing On CFG's In Algebraic Form


This background provides you with a proper understanding in algebraic terms
of what parsers actually deal with. Also, with that background, I'll revert
to using "+" in place of "|" below. So the grammar in the original example
would be written as systems:


                Original CFG:
                        S >= T + SaT; T >= F + TmF; F >= uSv + nF + x
or equivalently
                        S >= T; S >= SaT; T >= F; T >= TmF; F >= uSv; F >= nF; F >= x.


The LR parsing method works with the bottom-up canonical extension BU(G) of a
grammar G = (Q,S,X*,H). So, if G is as in the original example, then BU(G) is:


                  start >= Sz
                  S >= Tb + SaTc
                  T >= Fd + TmFe
                  F >= uSvf + nFg + xh


defined as a grammar over X* x Y*, with Y = {a,b,c,d,e,f,g,h,z}.


LR defines an auxillary system through the definitions:


          For all a in (X* x Y*)[Q]:
          [R] <a> >= a' + sum (<q>: taken over all q in Q such that da/dq != 0)
          [S] a' >= a0 + sum (z <da/dz>: taken over all z in X + Q)


the least solution of which yields definitions for all the <a> and a'.


The first set [R] yields the effect of reductions; the second set [S], the
effect of shifts.


The derivative operation is defined over (X* x Y*)[Q] by:
                                da/dz = { v in (X* x Y*)[Q]: zv is in a }
which yields an equivalent recursive definition:
                          For y in X + Q: dy/dz = 1 if y = z, 0 else
                          For y in Y: dy/dz = 0
                          For a, b in (X* x Y*)[Q]:
                              d(a+b)/dz = da/dz + db/dz
                              d(ab)/dz = da/dz b + a0 db/dz
                              d(a*)/dz = da/dz a*
with
                          a0 = a intersect {1}
also having an equivalent recursive definition:
                          For y in X + Y + Q: a0 = 0
                          For a, b in (X* x Y*)[Q]:
                              (a+b)0 = a0 + b0
                              (ab)0 = a0 b0
                              (a*)0 = 1.


This is the algebra underlying the Item Set Construction. The set of
"states" of the LR parser is actually the set of expressions generated
from "start" by taking derivatives.


Note that,
                          <a+b> = <a> + <b> and (a+b)' = a' + b'.
These both follow, via inductive arguments, ultimately from the linearity
of d/dz: d(a+b)/dz = da/dz + db/dz. Also, we have:
                          <y> = y' = y; for all y in Y
                          <1> = 1' = 1; <0> = 0' = 0


Let's apply the algebra to the example above.


<start> >= Sz + <S>
<S> >= Tb + <T> + SaTc + <S>
<T> >= Fd + <F> + TmFe + <T>
<F> >= uSvf + nFg + xh


the least solution of which is:


<start> = (Sz + Tb + SaTc + Fd + TmFe + uSvf + nFg + xh)'
                = S <z+aTc> + T <b+mFe> + F <d> + u <Svf> + n <Fg> + x <h>
<S> = (Tb + SaTc + Fd + TmFe + uSvf + nFg + xh)'
        = S <aTc> + T <b+mFe> + F <d> + u <Svf> + n <Fg> + x <h>
<T> = (Fd + TmFe + uSvf + nFg + xh)'
        = T <mFe> + F <d> + u <Svf> + n <Fg> + x <h>
<F> = (uSvf + nFg + xh)'
        = u <Svf> + n <Fg> + x <h>


All that the item-set construction does is obtain the closure of <start>.
This is done algebraically below:


<start> >= S <z+aTc> + T <b+mFe> + F <d> + u <Svf> + n <Fg> + x <h>
<z+aTc> >= (z + aTc)' >= z + a <Tc>
<b+mFe> >= (b + mFe)' >= b + m <Fe>
<d> >= d' >= d
<Svf> >= (Svf)' + <S>
            >= (Svf + Tb + SaTc + Fd + TmFe + uSvf + nFg + xh)'
            >= S <vf+aTc> + T <b+mFe> + F <d> + u <Svf> + n <Fg> + x <h>
<Fg> >= (Fg)' + <F>
          >= (Fg + uSvf + nFg + xh)'
          >= F <g> + u <Svf> + n <Fg> + x <h>
<h> >= h' >= h
<Tc> >= (Tc)' + <T>
          >= (Tc + Fd + TmFe + uSvf + nFg + xh)'
          >= T <c+mFe> + F <d> + u <Svf> + n <Fg> + x <h>
<Fe> >= (Fe)' + <F>
          >= (Fe + uSvf + nFg + xh)'
          >= F <e> + u <Svf> + n <Fg> + x <h>
<vf+aTc> >= (vf + aTc)' >= v <f> + a <Tc>
<g> >= g' >= g
<c+mFe> >= (c+mFe)' >= c + m <Fe>
<f> >= f' >= f
<e> >= e' >= e,


thus resulting in the following set of expressions and equivalent system:


q0 = <start>; q0 >= S q5 + T q7 + F q9 + u q1 + n q4 + x qC
q1 = <Svf>; q1 >= S q6 + T q7 + F q9 + u q1 + n q4 + x qC
q2 = <Tc>; q2 >= T q8 + F q9 + u q1 + n q4 + x qC
q3 = <Fe>; q3 >= F qA + u q1 + n q4 + x qC
q4 = <Fg>; q4 >= F qB + u q1 + n q4 + x qC
q5 = <z+aTc>; q5 >= z + a q2
q6 = <vf+aTc>; q6 >= v qD + a q2
q7 = <b+mFe>; q7 >= b + m q3
q8 = <c+mFe>; q8 >= c + m q3
q9 = <d>; q9 >= d
qA = <e>; qA >= e
qB = <g>; qB >= g
qC = <h>; qC >= h
qD = <f>; qD >= f,


The rules q >= y (y in Y) are all REDUCE's, the rules q >= n q (n a
non-terminal in Q) are all GOTO's, and the rules q >= x q' (x in X) are
all SHIFT's.


--------------------------------------


(4) LR Parsing On EBNF Systems In Algebraic Form


This works just as well on ANY system of inequalities, where the right-hand
side is a regular expression. So, it can handle EBNF grammars. Take the
following example:
                                                            S >= T (aT)*
                                                            T >= F (mF)*
                                                            F >= n* (uSv + x)


The canonical bottom-up extension to a SSDT is then:
                                                      start >= Sz
                                                      S >= T (aT)* i
                                                      T >= F (mF)* j
                                                      F >= n* (uSv + x) k
with
                                          Y = { i, j, k, z }.


One finds,
                <start> >= (Sz)' + <S>
                        <S> >= (T (aT)* i)' + <T>
                        <T> >= (F (mF)* j)' + <F>
                        <F> >= (n* (uSv + x) k)'


The least solution for <start> is:
                <start> >= (Sz + T (aT)* i + F (mF)* j + n* (uSv + x) k)'
                                >= S <z> + T <(aT)*i> + F <(mF)*j>
                                  + n <n* (uSv+x) k> + u <Svk> + x <k>.


This time the derivative operation acts non-trivially, giving you (for
instance):
                  d(n* (uSv+x) k)/du = d(n*)/du (uSv+x) k + d((uSv+x)k)/du
                                                        = dn/du n* (uSv+x) k + d(uSvk)/du + d(xk)/du
                                                        = 0 n* (uSv+x) k + Svk + dx/du k
                                                        = Svk
However, this and the other derivatives are found directly by expanding all
stars A* like A* = A A* + 1 and applying algebra. Thus, for instance, you
get:
                  n* (uSv + x) k = (uSv + x) k + n n* (uSv + x) k
                                                = u (Svk) + x (k) + n (n* (uSv + x) k),
resulting in:
                <n* (uSv + x) k> >= u <Svk> + x <k> + n <n* (uSv + x) k>.


So there's no need to actually work out derivatives at all.


The least solution for <S>, <T> and <F> are thus worked out to be:
                        <S> >= (T (aT)* i + F (mF)* j + n* (uSv + x) k)'
                                >= T <(aT)* i> + F <(mF)* j>
                                  + u <Svk> + x <k> + n <n* (uSv + x) k>
                        <T> >= (F (mF)* j + n* (uSv + x) k)'
                                >= F <(mF)* j> + u <Svk> + x <k> + n <n* (uSv + x) k>
                        <F> >= (n* (uSv + x) k)'
                                >= u <Svk> + x <k> + n <n* (uSv + x) k>.


The results for the closure of <start> are progressively determined to be:


<start> >= (Sz + T(aT)*i + F(mF)*j + n* (uSv + x) k)'
                >= S <z> + T <(aT)*i> + F <(mF)*j>
                  + u <Svk> + x <k> + n <n* (uSv+x) k>


<z> >= z' >= z


<(aT)*i> >= ((aT)*i)' >= i + a <T(aT)*i>


<(mF)*j> >= ((mF)*j)' >= j + m <F(mF)*j>


<n* (uSv+x) k> >= u <Svk> + x <k> + n <n* (uSv+x) k>


<Svk> >= (Svk + T (aT)* i + F (mF)* j + n* (uSv + x) k)'
          >= S <vk> + T <(aT)*i> + F <(mF)*j> + u <Svk> + x <k> + n <n* (uSv+x) k>


<k> >= k' >= k


<T(aT)*i> >= (T(aT)*i + F(mF)*j + n* (uSv+x) k)'
                    >= T <(aT)*i> + F <(mF)*j> + u <Svk> + x <k> + n <n* (uSv+x) k>


<F(mF)*j> >= (F(mF)*j + n* (uSv+x) k)'
                    >= F <(mF)*j> + u <Svk> + x <k> + n <n* (uSv+x) k>


<vk> >= (vk)' >= v <k>


This yields the following equivalent system:


q0 = <start>; q0 >= S q6 + T q5 + F q4 + u q1 + x q9 + n q8
q1 = <Svk>; q1 >= S q7 + T q5 + F q4 + u q1 + x q9 + n q8
q2 = <T(aT)*i>; q2 >= T q5 + F q4 + u q1 + x q9 + n q8
q3 = <F(mF)*j>; q3 >= F q4 + u q1 + x q9 + n q8
q4 = <(mF)*j>; q4 >= j + m q3
q5 = <(aT)*i>; q5 >= i + a q2
q6 = <z>; q6 >= z
q7 = <vk>; q7 >= v q9
q8 = <n* (uSv+x) k>; q8 >= u q1 + x q9 + n q8
q9 = <k>; q9 >= k


--------------------------------------


(5) Graphical Depiction Of The Parsing Mechanism


By itself, none of these results provides a parsing mechanism. The
extra missing ingredient is usually kept separate from this part of the
description in the usual presentation of the LR method: that is the stacking
mechanism. This has to be explicitly described in order to best understand
how it works in the special case of ordinary grammars; but especially in
the general case of more general types of systems, including EBNF.


Take the first result, consisting of the "states" (expressions) q0,q1,...,qD.
For each element of X and Q, one can write down the set of all transitions
corresponding to that element in graphic terms (abbreviating the q's to
numbers and letters):


                            S: [0] -> 5 [1] -> 6
                            T: [0,1] -> 7 [2] -> 8
                            F: [0,1,2] -> 9 [3] -> A [4] -> B
                            u: [0,1,2,3,4] -> 1
                            n: [0,1,2,3,4] -> 4
                            x: [0,1,2,3,4] -> C
                            a: [5,6] -> 2
                            m: [7,8] -> 3
                            v: [6] -> D


Consider the rule S -> S a T, which eventually became S >= SaTc. Combine
the graphs for S, a, and T in a natural fashion.


                / [0] -> 5 \ ( [5,6] -> 2 ) / [0,1] -> 7 \
                \ [1] -> 6 / \ [2] -> 8 /


            = / [0] -> 5 ---*---> 2 \ / [0,1] -> 7 \
                \ [1] -> 6 --/ / \ [2] -> 8 /


            = / [0] -> 5 ---*---> 2 -> 8 \
                \ [1] -> 6 --/ /


This represents all the possible progressions of "states" that are
"traversed" going through S, a and T. A REDUCE is supposed to backtrack
on this and re-traverse along S. So, take the final graph and take its
ADJOINT (the reversal of the graph), and combine it with S:


                  / [0] -> 5 ---*---> 2 -> 8 \+ / [0] -> 5 \
                  \ [1] -> 6 --/ / \ [1] -> 6 /


where G+ denotes the adjoint of G. Then reduce it:


                  / [0] -> 5 ---*---> 2 -> 8 \+ / [0] -> 5 \
                  \ [1] -> 6 --/ / \ [1] -> 6 /


            = / 8 <- 2 <---*--- 5 <- [0] \ / [0] -> 5 \
                  \ \-- 6 <- [1] / \ [1] -> 6 /


            = / 8 <- 2 <---*--- 5 <- [0] -> 5 \
                  \ \-- 6 <- [1] -> 6 /


In LR parsing, all the action symbols in the alphabet Y = {a,b,c,d,e,f,g,h,z}
are interpreted as reduce actions, with z specially interpreted as an
"accept" action. The graph above corresponds to element c.


The corresponding graphs for b, d, e, f, g and h are derived likewise. Let
[Z] be the graph corresponding to a symbol Z in X union Q. Define a in X*[Q]
recursively by:
                                                  [ab] = [a] [b]; [1] = 1.


Corresponding to the rule S -> S a T is the action c = [SaT]+ [S]. The
other actions are listed below:


                    b = [T]+ [S]
                        = (7 <- [0,1]) ([0] -> 5; [1] -> 6)
                        = / 7 <--*---- [0] -> 5 \
                            \ \--- [1] -> 6 /


                    d = [F]+ [T]
                        = (9 <- [0,1,2]; A <- [3]; B <- [4]) ([0,1] -> 7; [2] -> 8)
                        = / 9 <--*---- [0,1] -> 7 \
                            \ \--- [2] ---> 8 /


                    e = [TmF]+ [T]
                        = / A <- 3 <--*---- 7 <- [0,1] -> 7 \
                            \ \--- 8 <-- [2] --> 8 /


                    f = [uSv]+ [F]
                        = / /--- [0,1,2] -> 9 \
                            | D <- 6 <- 1 <--*------ [3] ---> A |
                              \ \----- [4] ---> B /


                    g = [nF]+ [F]
                        = / /--- [0,1,2] -> 9 \
                          | B <- 4 <--*------ [3] ---> A |
                            \ \----- [4] ---> B /


                    h = [x]+ [F]
                        = / /--- [0,1,2] -> 9 \
                          | C <--*------ [3] ---> A |
                            \ \----- [4] ---> B /




The accept action is supposed to be where the parser stops, so the
action corresponding to it would be:


                                      z = [S]+ (0 <-)
                                          = / 5 <- [0] \ (0 <- )
                                              \ 6 <- [1] /
                                          = (5 <- 0 <- )


Finally, the parser starts out at "state" q0:


                                  Initial action: (-> 0)


The end result is a 2-state "automaton" given by the right-linear system:


                        S >= (-> 0) F
                        F >= x [x] F SHIFTS: for any symbol x in X
                        F >= [b]+ [a] F REDUCES: for any inequality a >= b in H
                        F >= [S]+ (0 <-) ACCEPT




--------------------------------------


(6) The Formal Bra-Ket Stack Algebra


The graphs just a convenient way to diagram the stacking actions the parser
takes. A more precise rendition of the above is obtained by embedding the
(implicit) graph algebra into a stack algebra.


We treated a combination of graphs (G1; G2; ...; Gn) as a sum
                              (G1; G2; ...; Gn) = G1 + G2 + ... + Gn;
and a concatenation of graphs as a product. Products distribute over
sums, so the graph
                            / A <- 3 <--*---- 7 <- [0,1] \
                            \ \--- 8 <-- [2] /
would actually be
                          (A <- 3 <- ) (7 <- [0,1]; 8 <- [2])
                      = (A <- 3 <- ) ( (7 <- [0,1]) + (8 <- [2]) )
                      = (A <- 3 <- ) (7 <- [0,1]) + (A <- 3 <- ) (8 <- [2])
                      = (A <- 3 <- 7 <- [0,1]) + (A <- 3 <- 8 <- [2])
                      = (A<-) (3<-) (7<-) [0,1] + (A<-) (3<-) (8<-) [2]


The symbol [a1,a2,...,an] was just a sum:
                                  [a1,a2,...,an] = [a1] + [a2] + ... + [an].
When combined in graphs, combination was done according to the following
rules:
                                    [a] [b] = [a], if a = b
                                                    = 0, else
                                (->a) [b] = (->a), if a = b
                                                    = 0, else
                                [a] (b<-) = (b<-), if a = b
                                                    = 0, else


We also pose the identity
                                                [0,1,2,...,n] = 1
where n is the number of distinct states, since this element will behave
like 1, upon multiplication with any other graph.


Completing the correspondence to stacking actions, the basic graphs
(->a) and (b<-) would be respectively the push and pop operations defined
by:
                              (->a) = push symbol a
                              (b<-) = pop and test for equality to b.
The symbol 0 receives special interpretation as:
                              (->0) = create empty stack
                              (0<-) = test for empty stack and delete stack.


Under this interpretation, the symbol [b] would be
                              [b] = test for equality to b on stack top (0 means empty)
                                      = (b<-)(->b)


Thus, the algebra above can be subsumed by the more fundamental identities:


                                    (->a)(b<-) = 1, if a = b
                                                          = 0, else
                                    (0<-)(->0) + (1<-)(->1) + ... + (n<-)(->n) = 1


These are the same identities as are encountered in the algebra of Bra's and
Ket's used in Quantum Physics. So, we'll make the identification:


                                        (->a) = <a|; (<-b) = |b>
                                        [a] = |a><a| = projection on a
Adjoints are defined recursively by:
                                        (uv)+ = (v+) (u+)
                                        (u+v)+ = (u)+ + (v)+
                                        <a|+ = |a>; |b>+ = <b|


The fundamental identities of the formal bra-ket algebra are then:
                                            <i| |j> = delta_{ij}
                                            |0><0| + |1><1| + ... + |n><n| = 1
where delta_{ij} is the Kroenecker Delta defined by:
                                        delta_{ij} = 1, if i = j; 0 else.


--------------------------------------


(7) The LR Parser As A Context-Free Expression (simple CFG's)


So, revisiting the first LR result, we can write (using, again, the
notaton [S] to stand for S's graph, and so on):


                      [S] = [0]<5| + [1]<6|
                      [T] = [0,1]<7| + [2]<8|
                      [F] = [0,1,2]<9| + [3]<A| + [4]<B|
                      [u] = [0,1,2,3,4]<1|
                      [n] = [0,1,2,3,4]<4|
                      [x] = [0,1,2,3,4]<C|
                      [a] = [5,6]<2|
                      [m] = [7,8]<3|
                      [v] = [6]<D|


Thus, for instance:
      [SaT]+ [S] = [T]+ [a]+ [S]+ [S]
                            = (|7>[0,1] + |8>[2]) (|2>[5,6])
                                (|5>[0] + |6>[1]) ([0]<5| + [1]<6|)
                            = (|8>|2>[5,6]) (|5>[0]<5| + |6>[1]<6|)
                            = |8>|2> (|5>[0]<5| + |6>[1]<6|).
We'll also use the extended notation defined recursively by:
                            <ab| = <a|<b|; |ab> = |a>|b>
                            <a+b| = <a|+<b|; |a+b> = |a>+|b>
                            [a1 ... an] = |an ... a1><a1 ... an|
                            [a1,...,an] = [a1] + ... + [an]
so we can write
                                    c = [SaT]+ [S] = |8 2> [0 5, 1 6]
since
                      |5>[0]<5| = |5>|0><0|<5| = |5 0><0 5| = [0 5]
                      |6>[1]<6| = |6>|1><1|<6| = |6 1><1 6| = [1 6]
                      |5>[0]<5| + |6>[1]<6| = [0 5] + [1 6] = [0 5, 1 6]


The other actions are, respectively:
                    b = [T]+ [S] = |7> ([0]<5| + [1]<6|)
                    d = [F]+ [T] = |9> ([0,1]<7| + [2]<8|)
                    e = [TmF]+ [T] = |A 3> [2 8, (0, 1) 7]
                    f = [uSv]+ [F] = |D 6 1> ([0,1,2]<9| + [3]<A| + [4]<B|)
                    g = [nF]+ [F] = |B 4> ([0,1,2]<9| + [3]<A| + [4]<B|)
                    h = [x]+ [F] = |C> ([0,1,2]<9| + [3]<A| + [4]<B|)


                    Final Action: z = [S]+ |0> = |5 0>
                    Initial Action: <0|


The parsing automaton is then represented as 2-state automaton with:
                        S >= <0| F
                        F >= x [x] F SHIFTS: for any symbol x in X
                        F >= [b]+ [a] F REDUCES: for any inequality a >= b in H
                        F >= [S]+ |0> ACCEPT


which, in turn is a system that has, as its least solution the regular
expression:
                                S = <0| ([X] + [H])* [S]+ |0>
where
                                [X] = sum ([x] x: x in X)
                                [H] = sum ([b]+ [a]: (a,b) in H)


--------------------------------------


(8) The Context-Free Expresson Algebra


Let D be the set of all these symbols:
                      D = { <i|: i = 0,...,n-1 } union { |i>: i = 0,...,n-1 }


It is assumed that these symbols commute with those of X.


                                              dx = xd; dy = yd;
                                              for x in X, d in D.


Then the expression above will reduce to the sum (reinserting the "state"
q0 for 0):
                    S = <q0| ([X] + [H])* [S]+ |q0>
                        = sum (w in X*: w is in the language L(G) recognized by G)


In general, we'd have:


                    <q0| ([X] + [H])*
                = sum (<q0|<qa|...<qz| v: v reduces bottom-up right-to-left
                              to za...zz, with q0 >= za qa; qa >= zb qb; ... qy >= zz qz)


which can still be established by the same inductive proof for general
grammars/systems that would be carried out in the "standard" LR parsing
formalism for ordinary context-free grammars. This is where the rules
[R] and [S] defined way up above would come into play.


The underlying algebra is a partially ordered monoid in which:


                      (A) (Rational) Completeness
                              Every rational subset U has a least upper bound sum U,
                              e.g., sum {1,a,aa,aaa,...} = sum {a}* = a*
                      (B) (Rational) Distributivity
                              If U, V are rational subsets then sum (UV) = (sum U) (sum V)
                              e.g., sum a{b}*c = sum {ac,abc,abbc,abbbc,...}
                                                                = sum {a}{b}*{c}
                                                                = (sum {a}) (sum {b}*) (sum {c})
                                                                = a b* c
Such an algebra is called a *-continuous Kleene Algebra and can be
equivalently defined by the items
                                            0 = sum {}; a + b = {a,b};
                                            a* = sum { 1, a, aa, aaa, ... }
through the axioms:
                                  (uv)w = u(vw); u1 = u = 1u
                                  (u+v)+w = u+(v+w); u+0 = u = 0+u; u+v = v+u; u+u = u
with general distributivity stated in the forms:
                                  u0w = 0;
                                  u(v+w)x = uvx + uwx;
                                  sum (u v^n x: n = 0,1,2,...) = u v* x


Every rational subset will have a least upper bound which will be
distributive. This can be proven inductively, noting that the rational
subsets are the closure of the finite subsets under the 3 rational
operations:
                        set-wise product: AB = {ab: a in A, b in B}
                        union: A+B = A union B
                        Kleene star: A* = {1} union A union AA union AAA union ...


If one takes any such algebra K, and adds a copy of the formal Bra-Ket
algebra symbols D in such a way that the D's commute with the K's, then the
result is a *-continuous Kleene Algebra which contains the least upper
not only of the rational subsets of K, but of all the CONTEXT-FREE subsets
of K. So, the result provides an algebra for context-free expressions
over K.


The special case K = R(M), for a monoid M yields an algebra of context-free
expressions that faithfully represents the family C(M) of context-free
subsets of M. This specializes further into the two cases:


      M = X* -- C(M) = context-free languages over X.
      M = X* x Y* -- C(M) = simple syntax directed translations between X and Y.


--------------------------------------


(9) The LR Parser As A Context-Free Expression (EBNF's)


For the example involving a general system with regular expressions in
the right-hand side (which is what EBNF actually is):
                                                            S >= T (aT)*
                                                            T >= F (mF)*
                                                            F >= n* (uSv + x)
with the corresponding canonical bottom-up SSDT:
                                                      start >= Sz
                                                      S >= T (aT)* i
                                                      T >= F (mF)* j
                                                      F >= n* (uSv + x) k
we established the following system:
                        q0 >= S q6 + T q5 + F q4 + u q1 + x q9 + n q8
                        q1 >= S q7 + T q5 + F q4 + u q1 + x q9 + n q8
                        q2 >= T q5 + F q4 + u q1 + x q9 + n q8
                        q3 >= F q4 + u q1 + x q9 + n q8
                        q4 >= j + m q3
                        q5 >= i + a q2
                        q6 >= z
                        q7 >= v q9
                        q8 >= u q1 + x q9 + n q8
                        q9 >= k
which yields the definitions (again, abbreviating q's by numbers):
                        [S] = [0]<6| + [1]<7|
                        [T] = [0,1,2]<5|
                        [F] = [0,1,2,3]<4|
                        [u] = [0,1,2,3,8]<1|
                        [x] = [0,1,2,3,8]<9|
                        [n] = [0,1,2,3,8]<8|
                        [v] = [7]<9|
                        [m] = [4]<3|
                        [a] = [5]<2|
The actions are computed as follows:
                        z = [S]+ |0> = |6 0>
                        i = [T (aT)*]+ [S]
                        j = [F (mF)*]+ [T]
                        k = [n* (uSv + x)]+ [F].
where [] in the more general context is defined by identifying:
                                                  |A*> = |A>*; <A*| = <A|*.
Thus,
                    [T (aT)*] = [T] ([a] [T])*
                                        = [0,1,2]<5| ([5]<2| [0,1,2]<5|)*
                                        = [0,1,2]<5| ([5]<2|<5|)*
                                        = [0,1,2]<5| (|5><5|<2|<5|)*
                                        = [0,1,2] (<5||5><5|<2|)* <5|
                                        = [0,1,2] (<5|<2|)* <5|
                                        = [0,1,2] <(5 2)* 5|
Thus
                  i = [T (aT)*]+ [S] = |5 (2 5)*> [0,1,2] ([0]<6| + [1]<7|)
                                                        = |5 (2 5)*>([0]<6| + [1]<7|).
Similarly,
                  j = [F (mF)*]+ [T] = |4 (3 4)*>[0,1,2]<5|
                  k = [n* (uSv + x)]+ [F] = |9 [7 1] 8*>[0,1,2,3]<4|
where we also write:
                                  [A] = 1 + A
with
                                  <[A]| = 1 + <A|; |[A]> = 1 + |A>.


The corresponding automaton is then:
                                  A = <q0| ([X] + [H])* |q6 q0>
with
                        [X] = [q0,q1,q2,q3,q8] (u<q1| + x<q9| + n<q8|)
                                + [q7] v<q9|
                                + [q4] m<q3|
                                + [q5] a<q2|
                        [H] = i + j + k
                                = |q5 (q2 q5)*>([q0]<q6| + [q1]<q7|)
                                + |q4 (q3 q4)*>[q0,q1,q2]<q5|
                                + |q9 [q7 q1] q8*>[q0,q1,q2,q3]<q4|.


Finally, infinite look-ahead is accomplished by applying commutativity,
noting that the X's commute with the symbols <q|, |q>. We can write:


                                  A = <0| ([X] + [H])* |6 0>
                                      = <0| ([H]* [X])* [H]* |6 0>.


Noting that
                        [H] = |5 (2 5)*>([0]<6| + [1]<7|)
                                + |4 (3 4)*>[0,1,2]<5|
                                + |9 [7 1] 8*>[0,1,2,3]<4|,
we have
                  [H]^2 = |4 (3 4)*>[0,1,2]<5| |5 (2 5)*>([0]<6| + [1]<7|)
                              = |4 (3 4)*>[0,1,2] |(2 5)*> ([0]<6| + [1]<7|)
                              = |4 (3 4)* (2 5)*> ([0]<6| + [1]<7|)
and
                  [H]^3 = 0.
Thus
                  [H]* = 1 + [H] + [H]^2
                            = 1
                            + |(5 + 4 (3 4)*) (2 5)*>([0]<6| + [1]<7|)
                            + |4 (3 4)*>[0,1,2]<5|
                            + |9 [7 1] 8*>[0,1,2,3]<4|,
and
                  [H]* |6 0> = |6 0> + |(5 + 4 (3 4)*) (2 5)*>[0]<6| |6 0>
                                        = |(6 + (5 + 4 (3 4)*) (2 5)*) 0>
                  [H]* [X] = [H]* [0,1,2,3,8] (u<1| + x<9| + n<8|)
                                    + [H]* [7] v<9|
                                    + [H]* [4] m<3|
                                    + [H]* [5] a<2|


                                    = [0,1,2,3,8] (u<1| + x<9| + n<8|)
                                    + [7] v<9| + |(5 + 4 (3 4)*) (2 5)*>[1] v <7 9|
                                    + [4] m<3| + |9 [7 1] 8*>[0,1,2,3] m <4 3|
                                    + [5] a<2| + |4 (3 4)*>[0,1,2] a <5 2|


                                    = u [0,1,2,3,8]<1|
                                    + x [0,1,2,3,8]<9|
                                    + n [0,1,2,3,8]<8|
                                    + v (|7> + |(5 + 4 (3 4)*) (2 5)*>[1]) <7 9|
                                    + m (|4> + |9 [7 1] 8*>[0,1,2,3]) <4 3|
                                    + a (|5> + |4 (3 4)*>[0,1,2]) <5 2|


Thus, the automaton can be represented as an infinite-lookahead automaton
defined by:
                                  A = <0| ([H]* [X])* [H]* |6 0>
                                      = <0| U* V |0>
with
                    U = [H]* [X]
                        = u [0,1,2,3,8]<1|
                        + x [0,1,2,3,8]<9|
                        + n [0,1,2,3,8]<8|
                        + v (|7> + |(5 + 4 (3 4)*) (2 5)*>[1]) <7 9|
                        + m (|4> + |9 [7 1] 8*>[0,1,2,3]) <4 3|
                        + a (|5> + |4 (3 4)*>[0,1,2]) <5 2|
and
                    V = [H]* |6> [0]
                        = |(6 + (5 + 4 (3 4)*) (2 5)*)> [0].


--------------------------------------


(10) The LR Parser For As A Context-Free (Translation) Expression (SSDT's)


All of the formalism is purely algebraic and applies regardless of whether
the grammar is taken over a free monoid X* or a more general monoid M. So,
it's not only more general in the sense of applying to more general types
of grammars/systems, but also in the sense of applying to more general
types of word algebras.


In particular, if the original grammar, itself, was an SSDT, then the
corresponding word algebra would have been of the form M x N for an
input word algebra M and output word algebra N. All the results above
apply virtually unchanged.


So this can be used even when actions are embedded within the right-hand
sides of rules. This is particularly important for EBNF, where you almost
always embed actions within the rules, instead of relying on the
rule-reduction mechanism alone.


Take the EBNF
                                                            S >= T (aT)*
                                                            T >= F (mF)*
                                                            F >= n* (uSv + x)


and extend it with an action alphabet Y = {b,c,d,e} to the SSDT:


                                            S >= T (aTb)*
                                            T >= F (mFc)*
                                            F >= nFd + uSv + xe.


This is then a grammar over the word algebra X* x Y*, where X = {a,m,n,u,v,x}.


The corresponding system canonical extension is:
                                    start >= Sz
                                            S >= T (aTb)* i
                                            T >= F (mFc)* j
                                            F >= (nFd + uSv + xe) k
which is a grammar over the word algebra X* x Y* x Z* with Z = {z,i,j,k}.


The result of applying the definitions <a> and a' to find the closure
of <start> is then:


              q0 = <start>; q0 >= S qD + T q5 + F q6 + n q4 + u q1 + x qA
              q1 = <Svk>; q1 >= S q9 + T q5 + F q6 + n q4 + u q1 + x qA
              q2 = <Tb(aTb)*i>; q2 >= T q7 + F q6 + n q4 + u q1 + x qA
              q3 = <Fc(mFc)*j>; q3 >= F q8 + n q4 + u q1 + x qA
              q4 = <Fdk>; q4 >= F qB + n q4 + u q1 + x qA
              q5 = <(aTb)*i>; q5 >= i + a q2
              q6 = <(mFc)*j>; q6 >= j + m q3
              q7 = <b(aTb)*i>; q7 >= b q5
              q8 = <c(mFc)*j>; q8 >= c q6
              q9 = <vk>; q9 >= v qC
              qA = <ek>; qA >= e qC
              qB = <dk>; qB >= d qC
              qC = <k>; qC >= k
              qD = <z>; qD >= z,


which yields:
                        [S] = [0]<D| + [1]<9|
                        [T] = [0,1]<5| + [2]<7|
                        [F] = [0,1,2]<6| + [3]<8| + [4]<B|


[0,1,2,3,4]<(4 B + 1 9 + A) C|
                        [n] = [0,1,2,3,4]<4|
                        [u] = [0,1,2,3,4]<1|
                        [x] = [0,1,2,3,4]<A|
                        [a] = [5]<2|
                        [m] = [6]<3|
                        [v] = [9]<C|


                        [b] = [7]<5|
                        [c] = [8]<6|
                        [d] = [B]<C|
                        [e] = [A]<C|.


>From this, we get the following actions:
i = [T(aTb)*]+ [S] = |5 (7 2 5)*> ([0]<D| + [1]<9|)
j = [F(mFc)*]+ [T] = |6 (8 3 6)*> ([0,1]<5| + [2]<7|)
k = [nFd + uSv + xe]+ [F] = |C (9 1 + A + B 4)>([0,1,2]<6| + [3]<8| + [4]<B|)
z = |D 0>
Thus, the automaton is given by the expression:


                                      <0| ([X] + [Y] + [H])* |D 0>
with
          [X] = n[n] + u[u] + x[x] + a[a] + m[m] + v[v]
                  = [0,1,2,3,4] (n<4| + u<1| + x<A|)
                  + [5] a<2|
                  + [6] m<3|
                  + [9] v<C|


          [Y] = b[b] + c[c] + d[d] + e[e]
                  = [7]b<5| + [8]c<6| + ([A]e + [B]d)<C|


          [H] = i + j + k
                  = |5 (7 2 5)*> ([0]<D| + [1]<9|)
                  + |6 (8 3 6)*> ([0,1]<5| + [2]<7|)
                  + |C (9 1 + A + B 4)>([0,1,2]<6| + [3]<8| + [4]<B|)


The reduction of this to an infinite-lookahead automaton would be
similar, using the identities, for instance:


                  <0| ([X] + [Y] + [H])* |D 0>
              = <0| ([H]* [X] + [H]* [Y])* [H]* |D 0>
              = <0| (([H]* [Y])* [H]* [X])* ([H]* Y) [H]* |D 0>
              = <0| U* V |0>
this time with
              U = W* [H]* [X]
              V = W* [H]* |D> [0]
              W = [H]* [Y].


The evaluation of these expressions will be left as an exercise.


--------------------------------------


(11) Direct Algebraic Manipulation Supersedes And Obsoletes LR Parsing


The punchline, unfortunately, is that much of this formalism -- having
been developed in the 1960's, is simply rendered obsolete by the formalism
just developed. It is, in fact, very easy to derive a context-free
expression for even the most general case by hand:


                                    start >= Sz
                                            S >= T (aTb)* i
                                            T >= F (mFc)* j
                                            F >= (nFd + uSv + xe) k.


The least solution to any context-free system of inequalities is also a
solution to the corresponding equations:
                                    start = Sz
                                            S = T (aTb)* i
                                            T = F (mFc)* j
                                            F = (nFd + uSv + xe) k.
We'll use that below.


Define SF, TF, FF as the least solutions respectively to:


                      SF >= |0>z + |1> v k FF
                      TF >= [0,1] (aTb)* i SF + |2> b (aTb)* i SF
                      FF >= [0,1,2] (mFc)* j TF + |3> c (mFc)* j TF + |4> d k FF.


In the extended algebra incorporating the formal Bra-Kets, the least
solutions to the original system are sums over context-free subsets of
X* x Y*, and so commute with the Bras and Kets. Thus, defining
                      SS = S SF; TS = T TF; FS = F FF
                      SQ = (aTb)* i SF; TQ = (mFc)* j TF,
we can write
                      <0| SS = S <0| SF = S z = start.
                      <1| SS = S <1| SF = S v k FF
                      [0,1] TS = T [0,1] TF = T (aTb)* i SF = S SF = SS
                      <2| TS = T <2| TF = T b (aTb)* i SF
                      [0,1,2] FS = F [0,1,2] FF = F (mFc)* j TF = T TF = TS
                      <3| FS = F <3| FF = F c (mFc)* j TF
                      <4| FS = F <4| FF = F d k FF.
Thus
                  start >= <0| SS
                  SS >= T (aTb)* i SF = [0,1] TS
                  TS >= F (mFc)* j TF = [0,1,2] FS
                  FS >= (nFd + uSv + xe) k FF
                        = n F d k FF + u S v k FF + x e k FF
                        = m <4| FS + u <1| SS + x e k FF
                  FF >= [0,1,2] TQ + |3> c TQ + |4> d k FF
                  TF >= [0,1] SQ + |2> b SQ
                  SF >= |0>z + |1> v k FF
                  TQ >= (mFc)* j TF
                        = m F c j (mFc)* j TF + j TF
                        = m <3| FS + j ([0,1] + |2>b) SQ
                  SQ >= (aTb)* i SF
                        = a T b (aTb)* i SF + i SF
                        = a <2| TS + i (|0>z + |1>vk FF)
thus arriving at a right-linear system:
                    start >= <0| FS
                    FS >= (m<4| + u<1|) FS + xek FF
                    FF >= ([0,1,2] + |3>c) TQ + |4>dk FF
                    TQ >= m<3| FS + j ([0,1] + |2>b) SQ
                    SQ >= a<2| FS + |0>iz + |1>ivk FF


whose least solution for "start" will be the corresponding context-free
expression.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.