Re: regular expression operators in CF grammars (& SSDTs).

"Mark" <whopkins@alpha2.csd.uwm.edu>
21 Jul 2002 02:11:53 -0400

          From comp.compilers

Related articles
Re: regular expression operators in CF grammars (& SSDTs). whopkins@alpha2.csd.uwm.edu (Mark) (2002-07-21)
| List of all articles for this month |

From: "Mark" <whopkins@alpha2.csd.uwm.edu>
Newsgroups: comp.compilers
Date: 21 Jul 2002 02:11:53 -0400
Organization: University of Wisconsin - Milwaukee, Computing Services Division
Keywords: parse
Posted-Date: 21 Jul 2002 02:11:52 EDT

"Chris F Clark" <cfc@world.std.com> writes:
>Our solution (and it is not a perfect one) is that one uses regular
>expressions when one is willing to accept the ambiguity issues and
>recursive rules when ambiguity is unacceptable.


You can extend the regular expression algebra with the addition of
a suitable operator algebra to acquire the full power of a stack-based
language/transduction. Two examples:


              p, q, b, d: defined by relations
                    pq = bd = 1, pd = bq = 0, db + qp = 1
                    xy = yx, x a language or transduction word, y = p,q,b,d
or more generally
              The 2n symbols <0|, <1|, ..., <n-1|, |0>, |1>, ..., |n-1>
              subject to the same identities regarded as bra's and ket's:
                    <i| |j> = 1 if i = j, 0 else
                    |0><0| + |1><1| + ... + |n-1><n-1| = 1


(Here, + is the algebraic notation corresponding to |, and 1 corresponds
to the empty word.)


or the 2nd example:
              The stack algebra
              p1,...,pn,q1,...,qn,W: defined by the relations
                    pi qj = 1 if i = j, 0 else
                    pi W = 0 = W qj
                    W + q1 p1 + ... + qn pn = 1


The 2nd algebra is embedded into the first by the relations
pi = <i|, qj = |j>, W = |0><0|.


It has a direct interpretation in terms of stack primitives as
              W  = empty stack test
              pi = push symbol #i
              qi = pop & test for symbol #i
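
To make the relations concrete, here is a small Python sketch (an
illustration of mine, not part of the algebra itself) in which each
generator acts as a partial function on explicit stacks, with an
undefined action playing the role of the algebra's zero:

  # pi = <i| : push symbol #i
  def push(i):
      return lambda stack: stack + [i]

  # qi = |i> : pop & test for symbol #i
  def pop(i):
      return lambda stack: stack[:-1] if stack and stack[-1] == i else None

  # W : empty stack test (identity on the empty stack, zero otherwise)
  def empty_test(stack):
      return stack if stack == [] else None

  # Multiplication of generators: apply them left to right, propagating
  # zero (None).
  def compose(*ops):
      def run(stack):
          for op in ops:
              if stack is None:
                  return None
              stack = op(stack)
          return stack
      return run

  assert compose(push(2), pop(2))([7, 3]) == [7, 3]   # pi qi = 1
  assert compose(push(1), pop(2))([7, 3]) is None     # pi qj = 0, i != j
  assert compose(push(1), empty_test)([]) is None     # pi W = 0

The last relation, W + q1 p1 + ... + qn pn = 1, just says that for any
given stack exactly one branch is defined: either the stack is empty, or
its top is some symbol #i that can be popped and pushed back.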


The other issue is that when you're dealing with parsing, you're
not actually dealing with grammars, but with transductions. When one
refers to the "structure" of a grammar "surviving" (or not
"surviving") a transformation, or to a grammar "being" LR(k), it's
actually the canonical simple syntax-directed transduction (SSDT)
associated with the grammar that's being referred to. Two grammars
for the same language may yield different SSDTs.


If you do the algebra on the grammars, you're not representing
and preserving the relevant structure, so there's a natural
reluctance about, and "disclaimer" against, doing unfettered algebra
on grammars. But algebra done on SSDTs DOES preserve the
relevant structure and can be done with abandon -- by hand
or by machine.


An SSDT is just a context-free grammar whose word algebra
is NOT a free word algebra {x1,x2,...,xn}*, but rather one
that splits into a product of free word algebras:


                X* x Y*, X = input alphabet, Y = output alphabet.


That's the same thing as taking the word algebra (X union Y)*
and subjecting it to the relations


x y = y x, x in X, y in Y.


Application of the commutativity rule in an algebraic
calculation, in fact, corresponds directly to what you would
call a "look-ahead". That's what a look-ahead actually is.


If your context-free grammar were like this:


              S -> T + S T
              T -> "{" S "}" + "(" S ")" + "x"
with input alphabet
              X = { "{", "(", "}", ")", "x" }.


then the thing actually being parsed (the canonical bottom-up SSDT, for
instance) would be:
              start -> S z
              S -> T a + S T b
              T -> "{" S "}" c + "(" S ")" d + "x" e
with output alphabet
              Y = {a, b, c, d, e, z}.
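
For concreteness, here is one possible encoding of the two as production
tables in Python (the table layout is my own choice, a sketch rather than
anything canonical); erasing the output letters projects the SSDT back
onto the original grammar:

  X = {"{", "(", "}", ")", "x"}            # input alphabet
  Y = {"a", "b", "c", "d", "e", "z"}       # output alphabet

  ssdt = {
      "start": [["S", "z"]],
      "S":     [["T", "a"], ["S", "T", "b"]],
      "T":     [["{", "S", "}", "c"], ["(", "S", ")", "d"], ["x", "e"]],
  }

  # Sanity check: every symbol is an input symbol, an output letter,
  # or a nonterminal.
  assert all(s in X or s in Y or s in ssdt
             for alts in ssdt.values() for alt in alts for s in alt)

  # Erase the output letters to recover the plain context-free grammar.
  def project_to_input(rules):
      return {lhs: [[s for s in rhs if s not in Y] for rhs in alts]
              for lhs, alts in rules.items()}

  assert project_to_input(ssdt)["T"] == [["{", "S", "}"], ["(", "S", ")"], ["x"]]

Each production simply carries an extra output letter naming the reduction
it performs.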


You can freely manipulate this to your heart's content and whatever
you transform it into will give you exactly the same parse structures
(as encoded by output words from Y). For instance, the S rule becomes


start -> T a (T b)* z.


The parse of an input word in X*, such as v = "{x(x)}", would be the set
of all words w in Y* for which start -> vw. If the SSDT is deterministic,
there is only one such w.
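
That definition can be checked by brute force; the sketch below (mine, and
nothing like the linear-size expressions described later) just searches
leftmost derivations of the SSDT, pruning branches that can no longer
match v:

  Y = set("abcdez")
  SSDT = {
      "start": [["S", "z"]],
      "S":     [["T", "a"], ["S", "T", "b"]],
      "T":     [["{", "S", "}", "c"], ["(", "S", ")", "d"], ["x", "e"]],
  }

  def parses(v, sentential=("start",), consumed=0, out=""):
      """Yield every w in Y* such that start derives v interleaved with w."""
      # Every nonterminal and every input symbol still pending accounts for
      # at least one symbol of v; prune once that is impossible.  (This also
      # cuts off the left recursion in S -> S T b.)
      if sum(1 for s in sentential if s not in Y) > len(v) - consumed:
          return
      if not sentential:
          if consumed == len(v):
              yield out
          return
      head, rest = sentential[0], sentential[1:]
      if head in SSDT:                                 # nonterminal: expand
          for alt in SSDT[head]:
              yield from parses(v, tuple(alt) + rest, consumed, out)
      elif head in Y:                                  # output letter: emit
          yield from parses(v, rest, consumed, out + head)
      elif consumed < len(v) and v[consumed] == head:  # input symbol: match
          yield from parses(v, rest, consumed + 1, out)

  print(list(parses("{x(x)}")))                        # ['eaeadbcaz']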


If it's not deterministic, there can be an *infinite* set of w's. In
fact, in the most general case, the words w themselves can form an
entire context-free language over Y*.


In all cases, the actual set can be written down as a context-free
expression over Y* whose size is LINEAR in the length of v.
For the specific example, a context-free expression corresponding
to S can be written down as a right-linear grammar:


              start -> <0| T
              T -> "{" <1| T +
                   "(" <2| T +
                   "x" e F
              F -> [0,1,2] a Q +
                   |3> b Q
              Q -> <3| T +
                   |1> "}" c F +
                   |2> ")" d F +
                   |0> z
with
              [0,1,2] = |0><0| + |1><1| + |2><2|.


(the actual expression is
                  S -> <0| U V ((W + <3| U) V)* |0> z
where
                  U = ("{" <1| + "(" <2|)* "x" e
                  V = [0,1,2]a + |3>b
                  W = |1> "}" c + |2> ")" d
).


The shift-reduce conflict is proven to be superficial by nothing more than
algebraic manipulation (substituting for T and commuting the symbols,
i.e. the "look-ahead"):


Q -> "{" <3|<1| T +
          "(" <3|<2| T +
          "x" <3| e F +
          "}" |1> c F +
          ")" |2> d F +
          |0> z


So the SSDT itself is deterministic (regardless of the form it's
expressed in, even if you don't carry out the manipulations).
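
Spelled out as a machine, the deterministic SSDT looks like this (a sketch
using my own encoding of the right-linear rules above: <i| read as
"push i", |i> as "pop expecting i", [0,1,2] as a test of the top of the
stack, with one symbol of look-ahead used to choose among Q's
alternatives):

  def transduce(v):
      """Run the right-linear SSDT rules over v; return the output word,
      or None if v is not in the language."""
      stack, out, state, pos = [], [], "start", 0

      def pop(i):
          if stack and stack[-1] == i:
              stack.pop()
              return True
          return False

      while True:
          look = v[pos] if pos < len(v) else None      # one-symbol look-ahead
          if state == "start":                 # start -> <0| T
              stack.append(0); state = "T"
          elif state == "T":                   # T -> "{" <1| T + "(" <2| T + "x" e F
              if look == "{": stack.append(1); pos += 1
              elif look == "(": stack.append(2); pos += 1
              elif look == "x": out.append("e"); pos += 1; state = "F"
              else: return None
          elif state == "F":                   # F -> [0,1,2] a Q + |3> b Q
              if stack and stack[-1] in (0, 1, 2): out.append("a")
              elif pop(3): out.append("b")
              else: return None
              state = "Q"
          else:                                # Q -> <3| T + |1> "}" c F + |2> ")" d F + |0> z
              if look in ("{", "(", "x"):      # the <3| T branch, chosen by look-ahead
                  stack.append(3); state = "T"
              elif look == "}":
                  if not pop(1): return None
                  out.append("c"); pos += 1; state = "F"
              elif look == ")":
                  if not pop(2): return None
                  out.append("d"); pos += 1; state = "F"
              elif look is None:               # |0> z at the end of the input
                  return "".join(out) + "z" if pop(0) else None
              else:
                  return None

  print(transduce("{x(x)}"))                   # eaeadbcaz

For "{x(x)}" it prints the single output word eaeadbcaz, the same word
obtained algebraically below.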


The output set for the word {x(x)} would come straight out of the SSDT:


<0| <1| e [0,1,2]a <3| <2| e [0,1,2]a |2>d |3>b |1>c [0,1,2]a |0> z
= <0|<1|[0,1,2]<3|<2|[0,1,2]|2>|3>|1>[0,1,2]|0> eaeadbcaz
= eaeadbcaz.


-- a 1-element set. You can readily see the tree corresponding to
this with the interpretations:


              a = Build S from [T].
              b = Build S from [S T].
              c = Build T from ["{" S "}"].
              d = Build T from ["(" S ")"].
              e = Build T from ["x"].
              z = Build tree with root [S].


which corresponds to the tree


      S
      |
      T
     /|\
    / | \
   /  S  \
  / /   \ \
 / S     T \
|  |    /|\ \
|  |   / S \ \
|  |  |  |  | \
|  T  |  T  |  |
|  |  |  |  |  |
{  x  (  x  )  }


(In more complex examples, such as cyclic grammars, the algebraic
notation necessarily supersedes any tree-based notation.)
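
Finally, as one more sketch of my own (using the action names listed
above), the output word can be replayed against a stack of subtrees to
rebuild exactly that tree bottom-up:

  def build_tree(output_word):
      """Replay a..e,z as tree-building actions; trees are nested tuples."""
      stack = []
      for action in output_word:
          if action == "a":                    # Build S from [T]
              stack.append(("S", stack.pop()))
          elif action == "b":                  # Build S from [S T]
              t, s = stack.pop(), stack.pop()
              stack.append(("S", s, t))
          elif action == "c":                  # Build T from ["{" S "}"]
              stack.append(("T", "{", stack.pop(), "}"))
          elif action == "d":                  # Build T from ["(" S ")"]
              stack.append(("T", "(", stack.pop(), ")"))
          elif action == "e":                  # Build T from ["x"]
              stack.append(("T", "x"))
          elif action == "z":                  # Build tree with root [S]
              return stack.pop()

  print(build_tree("eaeadbcaz"))
  # ('S', ('T', '{', ('S', ('S', ('T', 'x')),
  #                        ('T', '(', ('S', ('T', 'x')), ')')), '}'))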

