Bottoms-up, another hour has come: Quantum Parsing

mark@freenet.uwm.edu (Mark Hopkins)
Thu, 17 Feb 1994 09:24:30 GMT

          From comp.compilers

Related articles
Bottom up hand-written parsers in under an hour mark@freenet.uwm.edu (1994-02-12)
Bottoms-up, another hour has come: Quantum Parsing mark@freenet.uwm.edu (1994-02-17)
| List of all articles for this month |

Newsgroups: comp.compilers
From: mark@freenet.uwm.edu (Mark Hopkins)
Keywords: parse
Organization: Compilers Central
References: 94-02-079
Date: Thu, 17 Feb 1994 09:24:30 GMT

(0) Introduction
      This is a follow up to that other article, "Bottom up parsing by
hand, in under an hour" (at which point I could have sworn I heard some
people saying D.O.C.). So if you just joined in, refer to that article
for details.


      A reference list in case you're interested:


Ref: "Bottom up parsing by hand...", me, comp.compilers, Feb 1994
          "The Equational Approach", me, comp.compilers,comp.theory, Oct 1993
          "Quantum Parsing", me, comp.compilers, Aug 1992
          "Introduction to Formal Language Theory", Harrison, M.A. 1978


Harrison discusses representation theorems in Chapter 10 of his book,
particularily the Chomsky-Schuetzenberger Representation Theorem, which
helps bring light to some of the stuff we're doing here. Schuetzenberger
really got off on this kind of stuff and on monoids and algebra, so he was
probably thinking along the same lines, but for some inexplicable reason
just never made the leap to the Context Free Expression notation we're
using here (nor for that matter did anyone else...)


(1) Operator Representations
      In the previous article, we introduced the operator Bra-Ket notation
(borrowed from Quantum Physics), <m|, |m>, and gave them the respective
interpretations:


                                The Stack Representation
                                <m|: Push m.
                                |m>: Pop and test for m.


A point to emphasize, however, is that these are not merely abbreviations
for stack operations, but are abstract algebraic operators in their own
right. This point can be brought to bear by pointing out other possible
representations:


                                The Counter Representation
                                <0|: Counter = 0
                                <m|: Counter = M*(Counter + 1) + m
                                where M is the largest operator number.


                                |0>: Test for Counter == 0
                                |m>: Test for (--Counter)%M == 0, and Counter /= M.


In a 32-bit representation, this leads to an equivalent stack depth of
2^32 for M = 1, and about 30 for M = 2. So this representation is
effectively limited in its application to M < 3.


      Another representation, of a theoretical nature, is the Automaton
Representation. Take the quantum grammar:


                              S = <0| T
                              T = <1| p T + |1> q T + <2| b T + |2> d T + |0>


Underlying this is a 2-state finite automaton:


                              S = T
                              T = p T + q T + b T + d T + 1


                                  [S] -----> [[T]]
                                                        | ^
                                                          \ / p, q, b, d


Replicate this automaton, creating an infinite M-ary tree of their copies.
Then interpret the operators as follows:


                              <0|: Go to the root level
                              |0>: Test for root level.
                              <m|: Go to branch #m below the current level.
                              |m>: Test for branch #m and return to parent level.


With this interpretation, the term <1| p T will become a transition from
state T to state T one level down on the left branch. The term |1> q T
will return from the left-branch back to T. The resulting automaton
looks like this:
                                                        [S]
                                                          |
                                                          v
                                                      [[T]]
                                    *-------- ^ ^ --------*
                                    | | | |
                                p | q | | d | b
                                    v | | v
                                    T --------* *-------- T
                      *---- ^ ^ ----* *---- ^ ^ ----*
                      | | | | | | | |
                  p | q | | d | b p | q | | d | b
                      v | | v v | | v
                      T ----* *---- T T ----* *---- T
                    ... ... ... ...


Notice that only one S state is in use, so none of the other copies of
it are shown.


      In earlier references, I called it an Infinite Automaton. A better
name for it, perhaps, would be: Quantum Automaton.


      Enough theory for today. In this article, we will be graduating from
the Toy language of the last article to a semi-respectable language: a
C subset. It's still somewhat of a toy (no function calls), but you
can actually use it as an interpreter (by the time we're through here).


      This language will have 2 types (int, char), dynamic unbounded arrays
(because I was too lazy to think about bounds checking), and pointers.
As for statements, it will have the usual complement, except for the
switch/case/default (same reason for omission). Expressions: all the
operators except type-casting and fields (because there's no structures
or unions).


      No functions. That you'll have to wait for in the next language.


      The language has the following grammar, presented in Equational form:


      start = (S + D)+


      D = Sp Dc ("," Dc)* ";"
            Sp = "int" + "char"
            Dc = X + "*" Dc + Dc "[" "]"


      S = "if" "(" E ")" S ["else" S]
          + "while" "(" E ")" S
          + "do" S "while" "(" E ")" ";"
          + "for" "(" [E] ";" [E] ";" [E] ")" S
          + "{" S* "}"
          + X ":" S
          + "goto" X ";"
          + "break" ";"
          + "continue" ";"
          + [E] ";"


      E = E "?" E ":" E
          + E "," E
          + E "||" E
          + E "&&" E
          + E b E
          + u E
          + E p
          + E "[" E "]"
          + "(" E ")"
          + X
          + NUM
          + CHAR
          + STR


      X: Identifier
  NUM: Integer number
CHAR: Character constant
  STR: String constant
      b: Binary, infix operators, including the folliowing:
            == != < > <= >= << >> | ^ & + - * / %
            = <<= >>= |= ^= &= += -= *= /= %=
      u: Unary, prefix operators, including the following:
            ! ~ + - ++ -- & *
      p: Postfix operators, including: ++ --


      Note that we're using regular expression notation on the right hand
sides of the rules:
                                            1 means empty string,
                                            [A] = 1 + A,
                                    and A* = 1 + A + A A + ...


This only serves to emphasise the point of view here that Context Free
Grammars are recursive systems of equations between Context Free
Expressions. Ironically, this view of grammars as equation systems was
the one originally adopted in the early days of Formal Language Theory,
but for some odd inexplicable reason it was largely dropped (with the
effect of setting the discovery of the methods described here back by over
30 years). Well, no use crying over spilt milk, I wasn't around so it
makes me no difference...


      As you might have guessed, D, S and E stand respectively for Declaration,
Statement, and Expression. In addition, you have Sp ("type specifier"),
and Dc ("declarator"). All the operators have the exact same precedence as
in C.


      As in the prior article, we'll argument these rules by an Action
Alphabet: Y = { y0, y1, ..., y42 }. This alphabet, also mentioned in the
article, commutes with the input alphabet, X: x y = y x for all x in X,
y in Y. The result of the augmentation:


      start = (S + D)+


      D = Sp y0 Dc y1 ("," Dc y1)* ";"
            Sp = "int" y2 + "char" y3
            Dc = X y4 + "*" Dc y5 + Dc "[" "]" y6


      S = "if" "(" E y7 ")" S [y8 + "else" y9 S y10]
          + "while" "(" y11 E y12 ")" S y13
          + "do" y14 S y15 "while" "(" E y16 ")" ";"
          + "for" "(" [E] y17 ";" (E y18 + y19) ";" [E] y20 ")" S y21
          + "{" S* "}"
          + X ":" y22 S
          + "goto" X ";" y23
          + "break" ";" y24
          + "continue" ";" y25
          + E ";" y26
          + ";"


      E = E y27 "?" E y28 ":" E y29
          + E "," y30 E
          + E "||" y31 E y32
          + E "&&" y33 E y34
          + E b E y35
          + u E y36
          + E p y37
          + E "[" E "]" y38
          + "(" E ")"
          + X y39
          + NUM y40
          + CHAR y41
          + STR y42


I stuck them where I did because I'm thinking ahead to the next article
(the Translation Grammar), where the y's will be defined and the source
code given. The effect of y14, for instance, will be to mark the place
for a "continue" statement inside the loop to go to. As mentioned in the
last article, this notation is powerful enough to represent any language
acceptable to a push down transducer: the Simple Syntax Directed
Translation Languages. Therefore, it constitutes a rather elegant means
of representing Translation Expressions as well, not just Context Free
Expressions (and I haven't even pulled out the big toys).


      Next, the operator alphabet, D = { <0|, ..., <m|, |0>, ..., |m> } is
introduced (D stands for Dyck, see the Harrison Reference). Note that
in the following, I didn't even bother to enclose any of the
non-terminals in the equation for D by operators. They all "inherit"
the <0|, |0> operator pair from D (which gets it from the first rule).
The area enclosed within an operator pair, <m|, |m> will be called
Context #m. When we carry out conflict resolution, our disambiguation
rules will be conditioned on the context they occur in (which almost
sounds like a tautology).


  start = (<0| (S + D) |0>)+


  D = Sp y0 Dc y1 ("," Dc y1)* ";"
        Sp = "int" y2 + "char" y3
        Dc = X y4 + "*" <1| Dc |1> y5 + Dc "[" "]" y6


  S = "if" "(" <2| E |2> y7 ")" <3| S |3> [y8 + "else" y9 <4| S |4> y10]
      + "while" "(" y11 <5| E |5> y12 ")" <6| S |6> y13
      + "do" y14 <7| S |7> y15 "while" "(" <8| E |8> y16 ")" ";"
      + "for" "(" [<9| E |9>] y17 ";"
                              (<10| E |10> y18 + y19) ";"
                              [<11| E |11>] y20 ")" <12| S |12> y21
      + "{" (<13| S |13>)* "}"
      + X ":" y22 S
      + "goto" X ";" y23
      + "break" ";" y24
      + "continue" ";" y25
      + E ";" y26
      + ";"


      E = E y27 "?" <14| E |14> y28 ":" <15| E |15> y29
          + E "," y30 <16| E |16>
          + E "||" y31 <17| E |17> y32
          + E "&&" y33 <18| E |18> y34
          + E b <19| E |19> y35
          + u <20| E |20> y36
          + E p y37
          + E "[" <21| E |21> "]" y38
          + "(" <22| E |22> ")"
          + X y39
          + NUM y40
          + CHAR y41
          + STR y42


      I skipped the left-recursion removal step performed first in the
previous article. Ideally, one should carry out an entire left-recursion
removal process AFTER the action symbols are inserted, but BEFORE the
operator symbols are added. So in actual fact, one will have effectively
converted the syntax to a Greibach normal form before operator symbols are
added. But it's not absolutely essential to do so.


      Roughly speaking, the operator symbols are inserted in such a way as to
guarantee that all occurrences of a given non-terminal with distinct
"follow-sets" will be enclosed within a distinct context, including the
left-recursive occurrences, which are considered to "inherit" their
contexts from their parent non-terminal. The one apparent exception to
the general rule is the two occurrences of Dc, which both inherit context
#0. However, they have the same follow set: y1 ("," Dc y1)* ";"


      Conversion to a Quantum Grammar is a little bit more difficult than in
the last article because this time we're faced with regular expressions on
the right hand sides. As before, we append N_F to each term on the right
hand side of the equation for the non-terminal N (which introduces new
non-terminals, D_F, Sp_F, Dc_F, S_F, E_F). But now, instead of chopping
off terms after the non-terminals, now we have to construct all the
possible non-terminal to non-terminal paths that occur with the regular
expression on the right hand side. For instance, in the equation for D,
these paths would correspond to the following:


                          D = Sp y0 Dc y1 ("," Dc y1)* ";" D_F
                          Paths: Sp y0 Dc
                                        Dc y1 "," Dc
                                        Dc y1 ";" D_F


Other than that, the basic rule would be the same, we'd place each of the
resulting pieces into their appropriate equations:


                                        Sp_F = ... + y0 Dc
                                        Dc_F = ... + y1 "," Dc
                                        Dc_F = ... + y1 ";" D_F


and the resulting equation for D becomes just: D = Sp.


      As a result of this process, we get the following Quantum Grammar:


  start = <0| (S + D)


  D = Sp = "int" y2 Sp_F + "char" y3 Sp_F
  D_F = |0><0| (S + D)
          + |0>
Sp_F = y0 Dc


Dc = X y4 Dc_F
      + "*" <1| Dc
Dc_F = y1 ("," Dc + ";" D_F)
          + |1> y5 Dc_F
          + "[" "]" y6 Dc_F


  S = "if" "(" <2| E
      + "while" "(" y11 <5| E
      + "do" y14 <7| S
      + "for" "(" (<9| E + F1
      + "{" S1
      + X ":" y22 S
S_F = |0>
        + |0><0| (S + D)
        + |3> y8 S_F + |3> "else" y9 <4| S + |4> y10 S_F
        + |6> y13 S_F
        + |7> y15 "while" "(" <8| E
        + |12> y21 S_F
        + |13> S1
        + "goto" X ";" y23 S_F
        + "break" ";" y24 S_F
        + "continue" ";" y25 S_F
        + E
        + ";" S_F


F1 = y17 ";" <10| E + y19 F2
F2 = ";" (<11| E + F3)
F3 = y20 ")" <12| S
S1 = <13| S + "}" S_F


E = u <20| E
    + "(" <22| E
    + X y39 E_F
    + NUM y40 E_F
    + CHAR y41 E_F
    + STR y42 E_F


E_F = |2> y7 ")" <3| S
        + |5> y12 ")" <6| S
        + |8> y16 ")" ";" S_F
        + |9> F1
        + |10> y18 F2
        + |11> F3
        + ";" y26 S_F
        + |14> y28 ":" <15| E
        + |15> y29 E_F
        + |16> E_F
        + |17> y32 E_F
        + |18> y34 E_F
        + |19> y35
        + |20> y36 E_F
        + |21> "]" y38 E_F
        + |22> ")" E_F
        + y27 "?" <14| E
        + "," y30 <16| E
        + "||" y31 <17| E
        + "&&" y33 <18| E
        + b <19| E
        + p y37 E_F
        + "[" <21| E


      The terms, F1, F2, F3, and S1 were added for convenience to
avoid duplications.


      All the spurious left-recursions have been removed, such as the
following:


                S_F = S_F: from S = ... + X ":" y22 S + ...
                E = E : from E = E p y37 + E b <19| E |19> y35 + ...


      In the following table, the contexts each of the non-terminals are
defined in, and their inherited contexts are listed:


                  explicit contexts inherited contexts (from)
            D 0 -
          Sp - 0 (D)
          Dc 1 0 (D), 1 (Dc)
            S 0,3-4,6-7,12-13 -
            E 2,5,8-11,14-21 0,3-4,6-7,12-13 (S)
                                                                              0,2-21 (E)


Except for the trivial left-recursions,


                                  Dc = Dc ... + ...,
                                    E = E ... + ...,


there are no conflicting contexts. But note that each of these
occurrences will be directly responsible for precedence conflicts.


      More to follow soon...
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.