Re: Have I discovered something new?

"Mark" <whopkins@alpha2.csd.uwm.edu>
31 Jul 2002 01:17:59 -0400

          From comp.compilers

Related articles
[2 earlier articles]
Re: Have I discovered something new? joachim_d@gmx.de (Joachim Durchholz) (2002-07-21)
Have I discovered something new? cfc@world.std.com (Chris F Clark) (2002-07-21)
Re: Have I discovered something new? kgw-news@stiscan.com (2002-07-21)
Re: Have I discovered something new? whopkins@alpha2.csd.uwm.edu (Mark) (2002-07-24)
Re: Have I discovered something new? steve@lurking.demon.co.uk (Steve Horne) (2002-07-25)
Re: Have I discovered something new? robert.corbett@sun.com (Robert Corbett) (2002-07-31)
Re: Have I discovered something new? whopkins@alpha2.csd.uwm.edu (Mark) (2002-07-31)
Re: Have I discovered something new? cfc@shell01.TheWorld.com (Chris F Clark) (2002-08-04)
Re: Have I discovered something new? whopkins@alpha2.csd.uwm.edu (Mark) (2002-08-04)
| List of all articles for this month |

From: "Mark" <whopkins@alpha2.csd.uwm.edu>
Newsgroups: comp.compilers
Date: 31 Jul 2002 01:17:59 -0400
Organization: University of Wisconsin - Milwaukee, Computing Services Division
References: 02-07-057
Keywords: parse
Posted-Date: 31 Jul 2002 01:17:59 EDT

"Steve Horne" <steve@lurking.demon.co.uk> writes:
>The basic idea was that for each token recognised, the 'parser' should
>form a new and more specialised grammar from the old one in which the
>head part of all the strings in the grammar is restricted the the
>sequence of tokens shifted before. Thus any ambiguity in the input is
>simply maintained in the output grammar.


The general problem is to find the intersection of a SSDT T and the
regular expression v (X* x Y*), where v is an input symbol in X.


Suppose the grammar for T is


qi -> q1 Ai1 | ... | qn Ain | Bi; i = 1,...,n


where all the left-recursions are removed to the A-expressions.


The "empty-test" grammar is:


[qi] -> [q1] [Ai1] | ... | [qn] [Ain] | [Bi]; i = 1,...,n
where for the A's and B's:
[e] = 1 if e contains the empty word 1; [e] = 0 else


That gives you the solutions [qi] = 1 or 0, for each i (which, thus,
embodies the emptiness test on each q).


In terms of this, the grammar for the intersection with v (X* x Y*) is:


qi' -> q1' Ai1 | ... | qn' Ain |
[q1] Ai1' | ... | [qn] Ain' | Bi'; i = 1,...,n
where for the e's


                      e' = { vrs in e: r in X*, s in Y* }


The example
                            [(avb)* + (bw + cv)*]' = abv(abv)* + c (bw + cv)*


with X = {v,w,x}; Y = {a,b,c}, shows the effect that the commutativity
of the X's with the Y's has on the determination of e'.


The Y symbols all clump up to the head of the components of the
expression e'.


The total grammar is the reduction of the one involving the rules for
q1,...,qn,[q1],...,[qn],q1',...,qn'.


All the [qi]'s reduce to 0 or 1, so they are all gone in a reduced
grammar.


A tree-based approach is a grammar-based approach, since a tree is
just a fancy notation for a grammar (as described in my previous
article).


Any tree-based or grammar-based method is out.


To put it more directly: the instant you decide to do anything with
trees/forests or grammars, in any way at all (no matter what or how),
you've already lost the game, so to say.


To get something that maintains something linear in the size of the
original word MUST make use of some equivalent algebraic notation for
context-free expressions, as illustrated in the previous article.
It's the bare minimum necessary to get an O(n) resolution to the
version of the parsing problem you're alluding to (which matches what
I called "Parsing" under header (E)).


There is, in fact, a very simple way to get the desired partial
reductions. Once you've expressed the SSDT in question as a context
free expression e, then you simply write down e', itself.


In fact, everything can be PRE-COMPUTED all in advance. The net
result will be that you'll be traversing what, for all intents
and purposes, is a finite state automaton (in a suitably generalized
context).


This is defined recursively by:


(AB)' = A' B + [A] B'
1' = 0
x' = 1, if x = V, 0 else; x in X union Y union D
                                (sum U) = sum { u': u in U }
where D comprises the set of operator symbols.


The last rule leads to:
0' = 0; (A+B)' = A' + B'; A*' = A' A*


as consequences; using the definitions
0 = sum {};
A+B = sum {A,B};
A* = sum {1,A,AA,AAA,...}


Here's a real eqregious example to show how this all works:
S -> 1 | S | SS | x S T
T -> u S v | y
with input alphabet
X = {u,v,x,y}.


The canonical bottom-up SSDTG is then


start -> Sz
S -> a | Sb | SSc | xSTd
T -> uSve | yf
with output alphabet
Y = {a,b,c,d,e,f,z}
and standard interpretation in terms of tree-building operations:
a = "reduce S -> 1"
b = "reduce S -> S"
c = "reduce S -> SS"
d = "reduce S -> xST"
e = "reduce S -> uSv"
f = "reduce S -> y"
z = "reduce root -> S"


You can get a context-free expression (CFE) in a similar way as before.
Remove the left-recursion, so:
start >= Sz
                                    S >= a + Sb + SSc + xSTd
T >= uSve + yf
becomes:
                            start >= Sz
S >= (a + xSTd) (b + Sc)*
T >= uSve + yf
Define SF and TF by:
SF >= |0>z + |1>Td (b+Sc)* SF + |2>veTF + |3> c (b+Sc)* SF
                      TF >= |4>d (b+Sc)* SF.
and SS, TS by:
SS = S SF; TS = T TF
and set
SQ = (b + Sc)* SF.
Then the corresponding CFE comes out of reducing the system to right-linear
form:
                          start >= <0| SS
SS >= a SQ + x <1| SS
SQ >= b SQ + <3| SS + SF
SF >= |0> z + |1><4| TS + |2> ve TF + |3> c SQ
TS >= u <2| SS + yf TF
                          TF >= |4> d SQ


The components of the rules not involving input symbols are treated as
right-recursions and are eliminated. The above system can be written
as
start >= A SS
SS >= B SQ + xM
SQ >= C SQ + D SS + E SF
SF >= F TS + G SQ + vN + |0>z
TS >= uP + yQ
TF >= H SQ
where
    A = <0|, B = a, C = b, D = <3|, E = 1, F = |1><4|, G = |3> c, H = |4> d
    M = <1| SS, N = |2>e TF, P = <2| SS, Q = f TF


This yields,
SS >= R (xM + W ((vN + |0>z) + F(uP + yQ)))
where
R = (B (C + EG)* D)* = (a (b + |3>c)* <3|)*
W = B (C + EG)* E = a (b + |3>c)*
and
                        TF >= H X (DxM + E(vN + |0>z) + EF(uP + yQ))
with
X = (C + DB + EG)* = (b + <3|a + |3>c)*
and
start >= AR (xM + W ((vN + |0>z) + F(uP + yQ)))


Together, this yields a quasi-deterministic finite-state automaton
via the rlght-linear system:


start >= x <0| R <1| SS
+ v <0| R W |2> e TF
+ u <0| R W |1><4|<2| SS
+ y <0| R W |1><4| f TF
+ <0| R W |0> z
                  SS >= x R <1| SS
+ v R W |2> e TF
+ u R W |1><4|<2| SS
+ y R W |1><4| f TF
+ R W |0> z
                  TF >= x |4>d <3|<1| SS
+ v |4>d |2> e TF
+ u |4>d |1><4|<2| SS
+ y |4>d |1><4| f TF
+ |4>d |0> z
where
R = (a (b + |3>c)* <3|)*
W = a (b + |3>c)*
X = (b + <3|a + |3>c)*.


Parsing the word xuxyv yields the context-free expression:


    start = (<0|R<1|) (RW|1><4|<2|) (R<1|) (RW|1><4|f) (|4>d|2>e) (|4>d|0>z)


where the bracketed items corresponding respectively to the symbols
x, u, x, y and v in the input; and (for |1>d|0>z) the end-marker.


This expression gives you the subset of Y* corresponding to all the
parses of the input. This subset is not only infinite, but it's not
even regular -- it's a context-free subset in its own right!


This is why you can't get away from using some form of algebraic
notation for context-free expressions.


The other versions of parsing I mentioned (recognition, enumeration)
then reduce to the problem of testing for emptiness and enumerating
on this context-free expression.


The emptiness test comes by replacing all the Y-symbols by 1's:


R = ((1 + |3>)* <3|)*
W = (1 + |3>)*
X = (1 + <3| + |3>)*.
    start = <0|R<1| RW|1><4|<2| R<1| RW|1><4| |4>|2> |4>|0>


and reducing to normal form:
R = X = |3>* <3|*
W = |3>*
and
    start = <0|<3|*<1| |1><4|<2| <3|*<1| |1><4| |4>|2> |4>|0>
= <0| <3|* <4| <2| <4| |4> |2> |4> |0>
= <0| <3|* |0> = 1


For the word xuxv, the expression would be the part without the
RW|1><4|:


    start = <0|R<1| RW|1><4|<2| R<1| |4>|2> |4>|0>


and the emptiness test would give you:


        start = <0|<3|*<1| |1><4|<2| <3|* <1| |4>|2> |4>|0> = 0


since <1| |4> = 0. So, xuxv is not recognized by the original CFG.


In general, if a version of Valiant's algorithm is used, rendered
suitably in algebraic form as a process for reducing pure operator
expressions to normal form, then the test will be O(n^log_2(7)) in
complexity.


The problem of enumeration would be that of rattling off the actual
output words of Y* from the context-free expression. Here, that would mean
enumerating all the words, one-by-one, that come out of the
expression:


      <0|R<1| RW|1><4|<2| R<1| RW|1><4|f |4>d|2>e |4>d|0>z
with
R = (a (b + |3>c)* <3|)*
W = a (b + |3>c)*
X = (b + <3|a + |3>c)*


Since the SSDT came from the original grammar via the canomical "bottom-up"
extension, then each word is actually a word that directs the building of a
parse tree from bottom-up.


I won't discuss that here.


------------------


As an aside:


In the first article, different varieities of Kleene algebras were
described: one closed under arbitrary sums (Quantales); another
under sums of rational subsets (*-continuous Kleene algebras) and
two others closed respectively closed under sums of context-free
subsets (a "CF-continuous" Kleene algebra), and Turing subsets
(a "turing-continuous" Kleene algebra). The latter two are unknown.


The key theorem, not precisely stated in the first article, in full
generality gives you a way to represent these larger algebras using
regular expressions.


In detail, if you take a *-continuous Kleene Algebra, K; then close
it up under context-free sums or turing-sums to get respectively C(K)
or T(K), then the result -- regarded as a *-continous Kleene algebra --
can be expressed as that part of the algebra
Cn x K (for any n >= 2)
Cm x Cn x K (for any m, n >= 1)
which commutes with the operator symbols in the Cm's and Cn's.


The tensor product A x B of the algebras A and B is the common
extension of A and B in which the A's commute with the B's. It can
be formally defined by the set { (a,b): a in A, b in B } subject to
the identities:


(a,b) (a',b') = (a a', b b')
(1,1) = 1
and
sum (U x V) = (sum U, sum V); U, V rational subsets of A & B


What was described in th first article was the construction of
Cn x R(M), for a monoid M. The closure of R(M) under context-free sums
is C(R(M)) = C(M), the context-free subsets of M. So the result
is that C2 x R(M) is an (*-continuous) Kleene algebra for context-free
subsets of M. The cases M = X*, M = X* x Y* give you respectively
algebra for context free languages over X, and SSDT's for input alphabet
X, output alphabet Y. The latter are what parsers work with.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.