24 Jul 2002 02:26:47 -0400

Related articles |
---|

Have I discovered something new? steve.horne@iris.co.uk (Stephen Horne) (2002-07-15) |

Re: Have I discovered something new? torbenm@diku.dk (Torben Ægidius Mogensen) (2002-07-21) |

Re: Have I discovered something new? joachim_d@gmx.de (Joachim Durchholz) (2002-07-21) |

Have I discovered something new? cfc@world.std.com (Chris F Clark) (2002-07-21) |

Re: Have I discovered something new? kgw-news@stiscan.com (2002-07-21) |

Re: Have I discovered something new? whopkins@alpha2.csd.uwm.edu (Mark) (2002-07-24) |

Re: Have I discovered something new? steve@lurking.demon.co.uk (Steve Horne) (2002-07-25) |

Re: Have I discovered something new? robert.corbett@sun.com (Robert Corbett) (2002-07-31) |

Re: Have I discovered something new? whopkins@alpha2.csd.uwm.edu (Mark) (2002-07-31) |

Re: Have I discovered something new? cfc@shell01.TheWorld.com (Chris F Clark) (2002-08-04) |

Re: Have I discovered something new? whopkins@alpha2.csd.uwm.edu (Mark) (2002-08-04) |

From: | "Mark" <whopkins@alpha2.csd.uwm.edu> |

Newsgroups: | comp.compilers |

Date: | 24 Jul 2002 02:26:47 -0400 |

Organization: | University of Wisconsin - Milwaukee, Computing Services Division |

References: | 02-07-057 |

Keywords: | parse |

Posted-Date: | 24 Jul 2002 02:26:46 EDT |

"Stephen Horne" <steve.horne@iris.co.uk> writes:

*>My main question refers to something I read in 'Parsing Techniques, A*

*>Practical Guide'. This book contained a statement that the possibility*

*>of an linear-time parsing algorithm for arbitrary context-free*

*>grammars was still an open question - that there was no proof or*

*>disproof, that every known context-free grammar can be parsed in*

*>linear time using specially designed methods, but that no general*

*>algorithm currently exists which works for all context-free grammars.*

The problem is ambiguous and can actually be resolved into quite a few

INDEPENDENT problems, each with its own complexity:

(A) Recognition:

Given a CFG G over an alphabet X, and a word w in X*,

determine if w is in the subset L(G) of X*.

That -- commonly confused with "parsing" is most decidedly NOT parsing.

Known optimal recognition methods are based on Valiant (a path-search

method) and are of order O(n^log_2(7)).

(B) Enumeration 1:

Given an SSDT G over an input alphabet X, output alphabet Y,

and a word w in X*, enumerate the set G(w) of v in Y* such that

wv is generated by G.

An SSDT with input alphabet X, output alphabet Y is a context free grammar

over the word algebra X* x Y*.

A complete enumeration is impossible since since G(w) is generally an

infinite set for a given w in X* -- in fact, a context free language

over Y*, itself.

(C) Enumeration 2:

With the same conditions as in (B), let G(w) = {v0,v1,v2,...},

a countable subset of Y*. Given G, a word w, and a number

n = 0,1,2,..., find vn (if G(w) is finite of size N and n >= N,

no answer need be given).

This can be regarded as "parsing", if G is deterministic, for then G(w)

will never have more than one member in it.

(D) Enumeration 3: Choice

Enumeration 2, with n = 0. Find at least ONE output v corresponding

to w.

This MIGHT be equivalent, in complexity terms, to (A) Recognition.

(E) Parsing

Given G and w, as in (B), find the set G(w), itself.

This is the form of "parsing" as it more commonly thought of, for

instance, in comprehensive methods (any dynamic-programming based

method: Tomita, Earley, etc.) where entire tree-sets are produced

for a given word.

The complexity depends on the form of representation chosen for

sets G(w). The "list them one-by-one" representation, besides being

the least efficient form, is not an option since G(w) may be infinite.

Since G(w) can be any context-free language (over Y), then whatever

representation used for parse-sets must amount to nothing more or

nothing less than a representation for context-free languages, themselves

-- i.e., a context-free analogue of the notation for regular

expressions -- "context-free expressions".

Tree-based representation is also out of the question. A tree (or

"packed" forest) is just a fancy graphic notation for a context-free grammar.

At best, it will yield representations of O(n^3) complexity for the

description of a set G(w).

A suitable notation for context-free expressions makes this problem LINEAR

-- as is already dictated by Kolomogorov complexity theory. Translation

cannot increase the complexity of a description. Just writing down G(w)

is, itself, a notation for the sets G(w), since it uniquely identifies

the set. G(w) is its own representation ... and is linear in w and G

(it's just G + w + "(" + ")").

So, the best representation gives you a linear complexity for (E).

But this particular example not particularly amenable to use by the other

versions of parsing listed before.

Generally, you can factor the various versions of parsing into (E),

plus something:

(A) Recognition = Parsing + Emptiness Test (of G(w)'s).

(B),(C),(D) Enumeration = Parsing + Enumeration (of individual elements

v in Y* from the set G(w)).

So, the notation used for sets G(w) -- and, indeed, for context-free

expressions in general -- should be suitable to use by these other

problems.

So, let me describe one such notation for context-free expressions,

which I've made reference to here a large numbers of times here and

elsewhere.

Regular expressions, context free expressions and general language

expressions can be thought of as residing in an algebra called a

Quantale. This has the following structure:

(A) A PRODUCT (interpreted as word concatenation) (a,b) |-> ab.

(B) An IDENTITY (interpreted as the 'empty word' or 'null action') 1.

(C) An PARTIAL ORDERING relation <= (which can be interpreted as

'inclusion' or 'deriveability').

(A) and (B) together form a MONOID -- an algebra with the following

rules:

(ab)c = a(bc); a1 = a = 1a.

Every word algebra X* over an alphabet X is a monoid, in fact a FREE

monoid generated by the set X. That is: the structure derived by

forming 'words' from X and subjecting them to the two rules above,

but nothing more.

A non-free monoid would, in contrast, be one with the additional

rule:

ab = ba.

This example gives you a COMMUTATIVE monoid ... and the free commutative

monoid P(X) generated by X is just X* itself, with the additional rule

ab = ba imposed. P(X) is just the family of Parikh vectors over X with

the correspondence:

If X = {x1,x2,...}

then (a1,a2,...,an) <-> x1^a1 x2^a2 ... xn^an.

The other example, alluded to above, is the non-free monoid derived by

taking (X union Y)* and subjecting it to the relations

xy = yx, x in X, y in Y.

This gives you the direct product

X* x Y*

which is the word algebra underlying SSDT's and all other classes of

translation/transductions.

A 3-way or n-way generalization (i.e. context-free relation) could be

arrived at by considering more general products, like

X1* x X2* x ... x Xn*.

The partial ordering, in a Quantale Q, has the following property:

(A) Completeness

Every subset A of Q has a least upper bound (sum A).

Sums correspond to "0" and the "|" operator and are interpreted as

"failure" (for 0) and "non-deterministic branching" (for |).

(B) Continuity (or Distributivity)

If A and B are subsets of Q, then

sum AB = (sum A) (sum B)

where AB is the set { ab: a in A, b in B }.

Sums of special interest are:

0 = sum {}

a+b = sum {a,b}

a* = sum {1,a,aa,aaa,aaaa,...}.

Other types of algebras, more restricted than the Quantale, can be

arrived at by limiting the scope of (A) and (B) to more specific

families of subsets, such as:

Finite subsets Q -- dioids = idempotent semiring

Rational subsets Q -- *-continuous Kleene algebras

Context-free subsets Q -- (no name exists).

Turing subsets of Q -- (no name exists).

Countable subsets Q -- "omega-continuous" Kleene algebras

A suitable algebra for representing context-free expressions need only

have context-free subsets covered by (A) and (B).

Dioids can be defined by a purely algebraic axiom set:

(ab)c = a(bc); a1 = a = 1a; a0d = 0; a(b+c)d = abd+acd

(a+b)+c = a+(b+c); a+0 = a = 0+a; a+a = a; a+b = b+a

The ordering is defined by

a >= b <==> a = a + b.

*-continuous Kleene algebras have the additional operator a* and can be

represented with the additional axiom:

sum { ad, abd, abbd, abbbd, ... } = a b* d.

*>From this, alone, it's possible to prove that (A) and (B) hold for all*

rational subsets of Q.

The family R(Q) of rational subsets of Q is, itself, a *-continuous

Kleene algebra. It's defined as the smallest family of subsets of Q

representable by regular expressions; i.e., the closure of the family

of finite subsets of Q under the 3 Kleene operations

A union B; AB = {ab:a in A, b in B}

A* = {1} union A union AA union AAA union ...

Likewise, the family C(Q) of context-free subsets is, itself, the

archetypical (context-free-closed) algebra.

The family C(X*) is just the family of context-free languages over X,

and C(X* x Y*) is the family of simple syntax directed transductions

(SSDT's) over input alphabet X and output alphabet Y.

The family 2^X* of ALL subsets of X* is the free quantale generated by X;

likewise 2^M is the free extension of the monoid M to a quantale.

A suitable algebra for C(M) for ANY monoid is obtained by taking the

quantale 2^M, and adding to it an operator algebra powerful enough to

represent a stack machine. A more direct representation of a stack

machine yields a notation more amenable to parsing methods.

The algebra I've described elsewhere is the formal Dirac "Bra-Ket"

algebra C_n, given by the 2n symbols:

<0|, <1|, ..., <n-1|, |0>, |1>, ...., |n-1>,

subject to the identities:

Orthogonality: <i| |j> = 1 if i = j; 0 else

Completeness: |0><0| + |1><1| + ... |n-1><n-1| = 1.

The special case for n = 2 can be written as:

b, d, p, q; subject to the relations

bd = 1 = pq; bq = 0 = pd; db + qp = 1

Any of the other algebras C_n can be embedded within C_2, so that

C_2 alone suffices.

A stack-based interpretation can be posed as:

|0><0| = empty stack test

<i| = push symbol #i (i = 1,...,n-1)

|i> = pop and test for symbol #i (i = 1,...,n-1).

Since quantales have addition over unrestricted subsets, you can define

matrix algebras of quantales of any dimension (even uncountably infinite).

This gives you an algebra M_n(Q) of n x n matrices over a quantale Q.

The algebra C_2 is its own 2 x 2 matrix algebra

C_2 = M_2(C_2)

with the correspondence:

A B <--> dAb + dBp + qCb + qDp

C D

This also means that

C_2 = M_2(C_2) = M_2(M_2(C_2)) = M_4(C_2)

C_2 = M_4(C_2) = M_4(M_2(C_2)) = M_8(C_2)

etc.

so that C_2 contains ALL of its (finite-dimensional) matrix algebras.

The algebra formed by combining 2^M and C_n and subjecting it to the

relations

A q = q A; A in 2^M; q in C_n

gives you an algebra

C_n(2^M).

This can be represented as a matrix algebra over 2^M by identifying:

A <-> A I, A in 2^M

d <-> transpose of b

q <-> transpose of p

b <-> 1 0 0 0 0 0 ...

0 0 1 0 0 0 ...

0 0 0 0 1 0 ...

...

p <-> 0 1 0 0 0 0 ...

0 0 0 1 0 0 ...

0 0 0 0 0 1 ...

...

The subalgebra obtained by taking the rational subsets R(M) and C_n

gives you an algebra C_n(M) that gives you a notation for context-free

languages over M build directly on top of the algebra of regular

expressions R(M) over M.

The basic property if C_n(M) is that:

Fundamental Theorem of Context Free Expressions:

(A kind of "Schur's Lemma"):

An expression e in C_n(M) commutes with all the members

of C_n if and only if e has the form

e = A I,

some a context-free subset A in C(M).

This holds for any n > 1.

So, a general context free expression over a monoid M is a regular expression

combined with the extra notation for C_n subject to the commutativity rule

qw = wq, w = M, q in C_n.

A good example showing how this algebra works and showing its suitability

as a notation for context-free expressions (and parser output sets G(w))

comes by looking at the worst kind of contorted pathological grammar

you can think of:

S -> 1 | S S | x.

taken over the monoid {x}*.

In algebraic terms, a grammar is a system of inequalities whose least

solution is the expression for the language (or transduction or relation,

etc.) generated by the grammar:

S = least s for which: s >= 1 + s s + x.

In algebraic notation, the grammar is just:

S >= 1 + S S + x.

The corresponding system of inequalities would be a "context-free" system,

by analogous use of terms.

A parser actually works with an SSDT, not a grammar. It is concerned with

deriving output structures from a grammar base. The two canonical

extensions of this grammar to an SSDT are:

"Top-down": start >= Z S; S >= A + B S S + C x.

"Bottom-up": start >= S z; S >= a + S S b + x c.

The bottom up case is the one we're interested in here. The output

alphabet Y = {a,b,c,z} has the standard interpretation as tree-building

operations:

a = "form an S node from []"

b = "form an S node from [SS]"

c = "form an S node from [x]"

z = "form the root node from [S]"

The subset of X* x Y* corresponding to this SSDT commutes, in the

larger algebra C_n(X* x Y*), with all the operator symbols. So, you

can actually derive a regular expression (with operator symbols)

directly from the grammar itself.

The least solution to any system containing the inequality

x >= u + x v

is also the least solution to the system with the inequality replaced

by

x >= u v*.

So the above system can be replaced by:

start >= S z

S >= (a + x c) (Sb)*.

In general, all left-recursions can be eliminated from a context-free

system.

Define SF as the least solution to

SF >= |1> b (Sb)* SF + z |0>.

A similar rule allows inequlities of the form

x >= u + v x

to be replaced by inequalities of the form

x >= v* u

to yield equivalent systems.

Using the above property can also be written as the least solution

to

SF >= (|1>b (Sb)*)* z |0>

which is just

SF = (|1>b (Sb)*)* z |0>,

itself. Thus

SF = |1>b (Sb)* SF + z |0>.

In general, the least solution to ANY system of inequalities will

also satisfy the corresponding system of equalities. So, we can also

write

start = S z

S = (a + x c) (Sb)*

and obtain the desired SSDT as the least solution.

But now S commutes with the operators. So, define SS = S SF. Then,

<0| SS = S <0| SF = S z = start.

<1| SS = S <1| SF = S b (Sb)* SF.

So

start = <0| SS

SS = S SF = (a + xc) (Sb)* SF

and

(Sb)* SF = SF + Sb (Sb)* SF

= SF + <1| SS

= <1| SS + |1>b (Sb)* SF + z |0>.

Thus, the SSDT in question is the least solution to a right-linear

system:

start >= <0| A

A >= (a + xc) B

B >= <1| A + |1>b B + z |0>.

or equivalently,

start >= <0| (a + xc) B

B >= (<1| (a + xc) + |1>b) B + z |0>,

or

start >= <0| (a + xc) B

B >= (<1| (a + xc) + |1>b)* z |0>,

or just

start = <0| (a + xc) (<1| (a + xc) + |1>b)* z |0>.

To arrive at something more suitable for parsing, it's best to keep

with the right-linear system above and reduce until every item

on the right starts with an input word:

start >= <0| A

A >= a B + x c B

B >= <1| A + |1>b B + z|0>

The least solution to any system involving

A >= uB + a

B >= vA + wB + b

is that obtained by solving this subsystem first:

A >= a + u (vu + w)* (va + b)

B >= (vu + w)* (va + b)

Applying this to the above gives you the equivalent system:

start >= <0| A

A >= x c B + a U (<1| x c B + z |0>).

B >= U (<1| x c B + z |0>).

where

U = (<1|a + |1>b)*

or

start >= <0| A

A >= x [a U <1|] c B + aUz |0>

B = x U<1| c B + Uz |0>.

where

[z] = 1 + z = "optional z".

*>From this, you can directly read off the output sets G(x^n) corresponding*

to the words x^n, n = 0, 1, 2, ...

G(1) = <0| aUz |0> = a <0| (<1|a + |1>b)* |0> z = a D(a,b) z

where

D(p,q) is the context-free language (the Dyck language)

corresponding to properly nested sequences of p's & q's.

G(x) = <0| [aU<1|] c U z |0>

G(x^{n+1}) = <0| [aU<1|] c (U<1|c)^n U z |0>

= <0| [aU<1|] (cU<1|)^n cUz |0>, n = 1, 2, 3, ...

All of these subsets of Y* are infinite. The complexity of the

expression for G(x^n) is O(n), or O(log(n)) if exponential notation is

explicitly used.

Enumeration would amount to following all the paths of the right-linear

system in a fashion similar to Tomita.

Emptiness test amounts to reducing the pure C_n expression obtained by

replacing all X and Y symbols by 1:

G(1) -> empty[G(1)] = <0| u |0>

G(x^{n+1}) -> empty[G(x^{n+1})] = <0| [u<1|] (u<1|)^n u |0>

u = (<1| + |1>)*

There may be a way of directly implementing Valiant in purely algebraic

form as an algorithm for reducing expressions in C_n to normal form.

Normal form (like the analogue in Quantum Physics) consists of a sum

of terms where in each term all the |n>'s appear to the left of <n|'s.

The normal form of u is just

u = (<1| + |1>)* = |1>* <1|*.

so

empty[G(1)] = <0| |1>* <1|* |0>

= <0| |0> = 1

since

<0| |1>* = <0| (1 + |1> |1>*)

= <0| + <0| |1> |1>* = <0|,

similarly

<1|* |0> = |0>.

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.