Disambiguating an arbitrary CFL (was: disambiguating transformations)

markh@csd4.csd.uwm.edu (Mark)
Sat, 25 Sep 1993 17:22:19 GMT

          From comp.compilers

Related articles
disambiguating transformations MATT@waikato.ac.nz (Matt Melchert) (1993-09-22)
Re: disambiguating transformations rad@cs.ucsb.edu (1993-09-25)
Disambiguating an arbitrary CFL (was: disambiguating transformations) markh@csd4.csd.uwm.edu (1993-09-25)
How to... (was: disambiguating transformations) markh@csd4.csd.uwm.edu (1993-09-26)
Re: disambiguating transformations bart@skinner.cs.uoregon.edu (Bart Massey) (1993-09-26)
Re: disambiguating transformations thinx!jos@relay.nluug.nl (Jos Horsmeier) (1993-09-27)
| List of all articles for this month |

Newsgroups: comp.compilers
From: markh@csd4.csd.uwm.edu (Mark)
Keywords: parse, theory
Organization: Computing Services Division, University of Wisconsin - Milwaukee
References: 93-09-099 93-09-117
Date: Sat, 25 Sep 1993 17:22:19 GMT

rad@cs.ucsb.edu (Ron Dolin) writes:
>Look in Hopcroft and Ullman -- "Introduction to Automata Theory,
>Languages, and Computation": Theorem 8.9 states, "It is undecidable
>whether an arbitrary CFG is ambiguous." In fact, section 4.7 is titled,
>"The Existence of Inherently Ambiguous Context-Free Languages" -- any and
>all grammars for such languages are ambiguous because the LANGUAGE is
>ambiguous, so there's no way to transform the grammar to a non-ambiguous CFG.


An inherently ambiguous CFL is one given by the grammar (using an equational
EBNF notation):


                                      S = U c* + a* V
                                      U = 1 + a U b
                                      V = 1 + b V c


or the set: { a^i b^j c^k: i = j or j = k }.


One can proceed by completely disregarding the fact that this language is
ambiguous (or that it's even a CFL at all) and attempt to construct a
minimal DFA for it, which means constructing a series of linear equations,
with S, the main variable, representing the start state. The result is an
infinite DFA. In particular, define the following expressions:


                            A(n) = U b b^n c* + a* V
                            B(n) = V c^n
                            C = c*
                            D(n, p) = b^n c* + V c^p
                            E(n) = c^n


then, treating the grammar as a system of equations, the following is true:


                            S = U c* + a* V
                                = c* + a U b c* + a a* V + V
                                = 1 + a (U b c* + a* V) + b (V c) + c (c*)
                                = 1 + a A(0) + b B(1) + c C


Similarily,
                            A(n) = 1 + a A(n+1) + b D(n, 1)
                            B(n) = E(n) + b B(n+1)
                            C = 1 + c C
                            D(0, p) = 1 + b B(p+1) + c C
                            D(n+1,p) = E(p) + b D(n, p+1)
                            E(0) = 1
                            E(n+1) = c E(n)


This leads to a very elegant looking minimal infinite DFA for the ambiguous
language (and as a consequence, to a linear deterministic parsing algorithm),
derived from the one below by eliminating the lambda transitions (it looks
nicer if they're kept in):


                                    c c c c
                [[E(0)]] <- E(1) <--- E(2) <--- E(3) <--- ...


          c b b b
        +--+ B(1) ---> B(2) ---> B(3) ---> ...
        v | ^ ^ ^
      [[C]] <---+-----/-+------/--+-----/---+--- ...
                          | /b | / b | / b |
                          | / | / | / |
                          S D(0,1) D(0,2) D(0,3) ...
                          | ^ ^ ^
                    a | / b / b / b / b
                          v / / / /
                        A(0) D(1,1) D(1,2) D(1,3) ...
                          | ^ ^ ^
                    a | / b / b / b / b
                          v / / / /
                        A(1) D(2,1) D(2,2) D(2,3) ...
                          | ^ ^ ^
                    a | / b / b / b / b
                          v / / / /
                        A(2) D(3,1) D(3,2) D(3,3) ...
                          |
                        ...


Not shown: each of the states in a given column has a lambda transition to the
state, E(n), in the same column.


This system of equations:
                            S = 1 + a A(0) + b B(1) + c C
                            A(n) = 1 + a A(n+1) + b D(n, 1)
                            B(n) = E(n) + b B(n+1)
                            C = 1 + c C
                            D(0, p) = 1 + b B(p+1) + c C
                            D(n+1,p) = E(p) + b D(n, p+1)
                            E(0) = 1
                            E(n+1) = c E(n)


is what I call an Index Grammar. There are restrictions on how the
indices are allowed to vary (basically only the exprssions 0, x+1, x-1
allowed), and because the maximum number of indices is 2, I call this a
C[1, 1] Index Grammar. In general, C[1, 1] languages can be formed as
intersections of two C[1] languages (C[1] is also the subclass of CFL
languages that have counter automata).


The more general Index Grammar, C[m1, ..., mk] uses "indices" from sets
{0,..., m1-1}*, ... {0, ..., mk-1}*. Every CFG is an index grammar of
the form C[m], where m is the minimum number of stack symbols needed to
represent the language.


Also, it's actually more natural to deal with the class of languages
derived from CFL's by intersection. This is especially true in that all
these languages have parsing algorithms of equal complexity as CFL's.
Define the C_1 languages as the CFL's. The C_2 languages represent the
class of languages formed by intersections of CFL's. The class C_3, C_4,
and so on, are defined analogously. All these languages, collectively,
for the class of C* languages. A parsing automaton for a C_n language
consists of n CFL parsers running concurrently. So another way to think
of C* languages is as concurrent-CFL's.


These are a few results:
            The class of languages formed by intersection of
                  C[m1, ..., mk] and C[n1, ..., nk]
            languages is C[m1, ..., mk, n1, ..., nk]


            The class C_1 = CFL (by definition) = the union of all C[n]'s.
            The class C_2 = the union of all C[n1, n2]'s
            etc.


            The class C* is the class of languages with Index Grammars.


            There is no algorithm that can find the minimal Index Grammar for a
            C* language. But there are certainly plenty of conceivable ways to
            perform the optimization task with good results.


and a conjecture:


            Every C* language has an unambiguous Index Grmmar (but, as stated above
            finding the minimal one is an unsolveable problem in general).


or more specifically:


            Every C_n language has an unambiguous C_m grammar for some m >= n.


The construction performed above illustrates a deterministic C_2 grammar
for the inherently C_1 (or CFL) language.


Also, there's a way to generalize Regular Expressions to expressions that
represent C* languages, or C* Expressions. The relation to Regular
Expressions is much the same as the relation of Calculus to Algebra.


--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.