Parsing != (either 'recursive descent' or 'operator precedence')

hunk@alpha1.csd.uwm.edu (Mark William Hopkins)
16 May 1999 23:54:41 -0400

          From comp.compilers

Related articles
Re: Jack W. Crenshaw - Any clues how to optimize ? mallors@ips1.msfc.nasa.gov (1999-04-19)
Re: Jack W. Crenshaw - Any clues how to optimize ? andersh@maths.lth.se (Anders Holtsberg) (1999-05-03)
Re: Jack W. Crenshaw - Any clues how to optimize ? mikee@cetasoft.cog (1999-05-07)
Parsing != (either 'recursive descent' or 'operator precedence') hunk@alpha1.csd.uwm.edu (1999-05-16)
| List of all articles for this month |

From: hunk@alpha1.csd.uwm.edu (Mark William Hopkins)
Newsgroups: comp.compilers
Date: 16 May 1999 23:54:41 -0400
Organization: University of Wisconsin - Milwaukee, Computing Services Division
References: 99-04-067 99-05-010 99-05-021
Keywords: parse

On 3 May 1999 14:44:50 -0400, Anders Holtsberg <andersh@maths.lth.se> wrote:
><[Op precedence parsing more economical than rec. descent]>
> (Or am I saying something stupid here: for me "recursive
>decent" and "operator precedance" is mutually exclusive. Correct???)


You can algebraically derive a parser from the SDT that specifies it.


This, by far, transcends anything that even remotely resembles
recursive descent or operator precedence (or anything, in actual fact,
that is published or widely known).


Note: parsers operate on tranductions, not grammars. A parser for a
'context-free language' is actually processing a syntax directed
transduction, or SDT.


An SDT can be converted to an SSDT (simple SDT) by allowing, as output
actions, operations on a 'value stack' or 'tree-building operation'.
This is why you see these items commonly in relation to parsers. The
function actually being fulfilled by having these objects around is to
convert SDT's to SSDT's, because the latter are much more tractible --
as about to be described below.


An SSDT can be specified as a list of rules of the form


N -> RE(x,y,N)


where RE is a regular expression involving input symbols (the x's), output
actions (the y's) and non-terminals (the N's).


Every SSDT is a system of inequalities of the form


N >= RE(x, y, N)


over an algebra that is contains the x's, y's and involves regular
expression operations. The algebra, itself, has the following operators


            0: the 'fail' symbol
            1: the 'empty word'
            A+B: to indicate alteration (may also be denoted A|B).
            A B: to indicate concatenation
            A*: to indicate iteration (repetitions of 0, 1, 2, or more A's).


It's called a Kleene algebra and is axiomatized by the following:


(w + x) + y = w + (x + y) (wx)y = w(xy)
w + 0 = w = 0 + w w1 = w = 1w
w + x = x+w w(x + y)z = wxz + wyz
w + w = w w0z = 0


with inequality defined by
u >= v if and only if u = v + x for some x
                                                        equivalently: if and only if u = u + v


and by the following axioms for the star-operator:


x* >= 1 + x x*


if x >= a b^n c for all n = 0, 1, 2, ... then x >= a b* c


A weaker axiomatization replaces the last rule by the following:


if x >= ax + xb + c then x >= a* x b*


which is a general property deriveable from the last rule. However, the
stronger axiomatization is required for what follows.


----------


The most important, and basic, properties of the >= operator are that:


(A) A + B is the least upper bound of the set { A, B }: A + B = lub {A, B}
        Everything >= 0. Another way to say this: 0 = lub {}.
        Every finite set { A1, A2, ..., An } has, as its least upper bound
              lub { A1, A2, ..., An } = 0 + A1 + A2 + ... + An.
(B) lub(A) lub(B) = lub(A B), for all finite sets A, B


and as a consequence of the last rule:
(C) A* = lub { 1, A, A^2, A^3, ... } = lub {A}*


In general:


(D) Every rational subset RE of the algebra has a least upper bound, such
        that
lub(U V) = lub(U) lub(V), for all rational subsets U, V
                            lub(U*) = lub(U)*
lub(U + V) = lub(U) + lub(V)


The rational subsets of an algebra are those constructed from finite sets
using the set operations:


A B = { u v: u in A, v in B }
A* = {1} union A union A^2 union A^3 ...
A + B = A union B


So a Kleene algebra is actually an image of its own rational subsets.


----------


An SSDT over alphabets X (for inputs) and Y (for outputs) is a finite system
of inequalities of the form


N >= <Expression involving N's, X's, Y's>


with a set of variables (designated above by N's) called non-terminals,
and a specially designated variable (usually denoted S) called the 'start'
symbol.


The subset of (X+Y)* specified by an SSDT will also be called an SSDT.


The fundamental property of SSDT's with respect to this representation is
that if the transduction 'generated' by S is the set L (a subset of (X+Y)*)
then the least solution for the SSDT, in the main variable S is:


                                       S = lub L


So, every SSDT is represented by an object in the algebra: its "translation
expression".


The algebra this occurs in is defined by the following:


* It is generated by the symbols from X and Y
* xy = yx, for all x in X, y in Y
* It is closed under 'least solutions' to SSDT's.


A general property of such SSDT's is that they can be embedded in a larger
(but simpler) algebra consisting of the following:


* If is generated by the symbols from X and Y
* It includes the operator symbols from a set D:
D = { <0|, <1|, ..., <n-1|, |0>, |1>, ..., |n-1> }
                        where n > 1.
* xy = yx, xd = dx, yd = dy, for all x in X, y in Y and d in D
                    * <i| |j> = 1 if i = j, 0 else
* |0><0| + |1><1| + ... |n-1><n-1| = 0


For anyone familiar with Quantum Physics, the distinct (and actually deep)
analogy entailed by the last two properties is the reason the extra
symbols are denoted as they are.


Call this algebra C_n(X, Y). Then the following is true:


FUNDAMENTAL THEOREM OF TRANSDUCTION EXPRESSIONS:
      The SSDT's are precisely those subsets, L, of (X+Y)* which have least
      upper bounds in C_n(X, Y) for which


(lub L) d = d (lub L), for d in D


As such, C_n(X, Y) is a simple extension to the 'regular expression' notation
which happens to also be sufficiently powerful to represent all translation
expressions. The ones which commute with the operators in D are provably
equal to least upper bounds of the very subsets of (X+Y)* that happen to
be SSDT's.


The connection this has to parsers is via the following interpretation:


0 = error
1 = do nothing
AB = do A, then do B
A+B = do A or do B
A* = do A: 0, 1, 2, or more times
x = test the following input for equality to x, and shift if equal.
y = carry out action y
<i| = push i
|j> = pop and test for equality to j.


So, the act of solving an SSDT amounts, essentially, to writing down a
non-deterministic parser for the SSDT. Since every element of C_n(X, Y)
is the least upper bound of a rational subset of X+Y+D, then the solution
can always be expressed as a "regular expression" over the symbols X, Y
and D. This, in turn, can be expressed as (at the least solutuion to) a
finite system of inequalities of the form:


q >= sum of terms of the forms 1, z q, for z in X+Y+D
                            with only one inequality for each q.


This directly corresponds to a finite automaton with the interpretations


q = state
q >= ... 1 ... means q is a final state
q >= ... z q' ... means q has a transition to q' after doing z.


If the original SSDT is deterministic, this system can be written as
directly as deterministic code using the following correspondences:


* state = goto label
* state transition = goto
* final state = return (or exit)
* tests are ultimately incorporated in if-then-else statements.


To actually render tests as if-then-else statements, it may be necessary
to use the commutativity rule to pull up pop's and input symbols ahead
of everything else. For example, if


q = ... + y q' + ...
q' = x q''


then this is equivalent to:


q = ... + x y q'' + ...


The key is to get a sum of terms headed by distinct x's and |i>'s.


This corresponds closely to what are called "look-aheads" in parser
terminology.


The actual process of resolving the non-deterministic branching in favor
of if-then-else's, via 'look-ahead' substitution and other algebraic means
is where the actual work of the computation would be involved.


The other major task is to resolve is error-handling -- which can
actually be done quite effectively, once you have the derivation of
the rest of the SSDT on-hand, sitting before you.


Since all of the computation involved in transforming an SSDT to a
regular expression (or, equivalently, a finite system of equations)
more art than science this is not something that can be easily
automated. A corollary of this is that no regular or automated
process is going to come anywhere close to embodying any substantial
portion of expertise involved in doing these computations. To
actually get an automated process to do so amounts to nothing less
than writing an artificial intelligence expert system for SSDT
system-solving and Kleene algebra computation.


In comparison, the tools in the current state of the art are rather
blunt and inefficient by about a couple orders of magnitude --
especially given that the processing of a language (normally
considered 'difficult') like that of C is rather routine using the
methods based on this background information. Also, since they
generally do not embody any of the information here (because it is
still 'unknown' at this time), they represent backwards and obsolete
technology.


For a good example of a place where this has actually been applied,
look at the regular expression demo software in the regex directory of
the comp.compilers archive. The parsers used in processing the inputs
in these demos were derived by hand.


Also: listed in the free compilers archive is the CAS 8051 assembler,
which incorporates a large subset of C's expression syntax and a
fair-sized portion of its statement syntax (for doing assembler
directives). This parser was derived in a similar fashion, by hand.


In all cases, the derivation process is actually rather simple,
routine and very reliable.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.