Re: How to extract grammar from a program? (Mark William Hopkins)
15 Feb 1999 23:32:05 -0500

          From comp.compilers

Related articles
How to extract grammar from a program? (Rahul Jain) (1999-02-05)
Re: How to extract grammar from a program? (Chris F Clark) (1999-02-10)
Re: How to extract grammar from a program? (Derek Ross) (1999-02-12)
Re: How to extract grammar from a program? (1999-02-15)
Re: How to extract grammar from a program? (David Bruce) (1999-02-15)
Re: How to extract grammar from a program? (1999-02-15)
Re: How to extract grammar from a program? (1999-02-15)
Re: How to extract grammar from a program? (1999-02-15)
Re: How to extract grammar from a program? (1999-12-09)

From: (Mark William Hopkins)
Newsgroups: comp.compilers
Date: 15 Feb 1999 23:32:05 -0500
Organization: University of Wisconsin - Milwaukee, Computing Services Division
References: 99-02-025 99-02-052
Keywords: parse

>In principle, dictionary based compression techniques work by creating a
>grammar from a body of text such as a set of programs. One recent
>compression technique which might be appropriate for your needs is Sequitur
>by Craig Nevill-Manning which can be used for "inferring hierarchies from
>sequences". Check the website at for more

The algorithm described there works on single inputs in an incremental
fashion by maintaining a grammar in which duplicate occurrences of pairs
are abstracted out as rules as soon as they are encountered. This creates
a grammar with rules of the form A -> B C -- that is: Chomsky Normal Form.

In addition, another process inlines any rule that is used only once in
the grammar. This runs concurrently with the formation of the grammar,
in lock-step with rule generation, so rules whose right-hand sides are
longer than 2 items can appear.
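The digram-replacement idea can be sketched as follows. This is a toy
version for illustration only -- Nevill-Manning's actual implementation
maintains a digram index so that each update is constant-time, and
interleaves the rule-inlining step described above, neither of which is
done here:

```python
# Toy digram replacement in the spirit of Sequitur: whenever a pair of
# adjacent symbols occurs twice (non-overlapping) in the top-level
# sequence, abstract it into a new rule A -> B C and substitute the
# rule's name for every occurrence of the pair.

def infer(seq):
    rules = {}                      # name -> (B, C)
    top = list(seq)
    next_name = iter("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

    changed = True
    while changed:
        changed = False
        seen = {}                   # pair -> index of first occurrence
        for i in range(len(top) - 1):
            pair = (top[i], top[i + 1])
            j = seen.get(pair)
            if j is not None and j + 1 < i:   # second, non-overlapping hit
                name = next(next_name)
                rules[name] = pair
                # Replace every non-overlapping occurrence of the pair.
                out, k = [], 0
                while k < len(top):
                    if k + 1 < len(top) and (top[k], top[k + 1]) == pair:
                        out.append(name)
                        k += 2
                    else:
                        out.append(top[k])
                        k += 1
                top = out
                changed = True
                break
            seen.setdefault(pair, i)
    return top, rules
```

For the input "abcabc" this yields the rules A -> a b and B -> A c,
with top-level sequence B B -- each rule in Chomsky Normal Form, as
described above.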

A few issues concerning the process:

(A) Complexity
The asymptotic complexity of the process is quadratic.

The paper argues that if the input falls under a certain size, the
processing is linear in complexity. But there is no such thing as
"non-asymptotic complexity".

(B) Incremental Processing
The algorithm is incremental in its generation of rules. This is a
critical feature for applications that are on-line, one-pass or real-time
(such as on-line compression). But it is a bad thing for grammar
induction, which is INHERENTLY non-monotonic.

When a process is incremental, the earliest samples in a presentation
will force a commitment to a set of rules which may have little or
no bearing on the full sample set. That is, the learning process is
fraught with "garden-path" presentation sequences, entirely analogous
to the "garden-path" sentences of natural language.
This isn't to say that no incremental method will work. Instead, it
implies that what should be incremented is not the set of rules which
are inferred, but the structures used to produce those rules.

This means that the structures have to keep track of items that may even
overlap, because at some point, one of the items in the overlap (that
may be inferred from the earliest samples in a presentation) could be
dropped in favor of the other item (which would be inferred from the
presentation as more samples appear).

That makes the required structure inherently quadratic in the size of the
largest input. And then the process of forming the rules from this
structure will be O(n^3), instead of just O(n^2).

The trouble at hand, as related in my prior article, is that this is
being treated as an exercise in logical induction, whereas in fact it
is inherently a problem of statistical induction and statistical
modelling.

The simplest approach stores the counts of all the subsequences
generated. The subsequences can be stored in a trie. The storage is
asymptotically O(n^2) in the size, n, of the largest input.

Each subsequence is identified with a non-terminal and the optimum
splitting of the sequence into 2 components is found by optimizing
some measure, such as the product of the components' counts. This
process is linear in the size of the subsequence which, in the worst
case is n. So the generation of rules from the structure is O(n^3).
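A minimal sketch of this counting scheme, using a plain dict of
substring counts in place of a trie, and maximizing the product of the
two components' counts as the illustrative measure (the function names
here are mine, not from any particular implementation):

```python
# Count every contiguous substring of the samples; a Counter stands in
# for the trie, and storage is O(n^2) in the input size either way.
from collections import Counter

def substring_counts(samples):
    counts = Counter()
    for s in samples:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                counts[s[i:j]] += 1
    return counts

def best_split(sub, counts):
    # Linear scan over the split points of one substring; repeated over
    # all O(n^2) substrings, rule generation is O(n^3) as noted above.
    return max(range(1, len(sub)),
               key=lambda k: counts[sub[:k]] * counts[sub[k:]])

counts = substring_counts(["abab", "abc"])
k = best_split("abab", counts)   # splits "abab" as "ab" + "ab"
```

Here the repeated component "ab" (count 3) wins over the splits
"a" + "bab" and "aba" + "b", so "abab" becomes a rule over two
occurrences of the non-terminal for "ab".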

The rules can then be pruned by a top-down pass, which actually leads
directly to the next item:

(C) Cyclic Rules
The algorithm described in the Sequitur paper is strictly bottom-up. It will
form rules, starting at the symbol level, and work its way up. That means it
is impossible for it to infer stars, e.g.,

A -> B C*

or, worse, rules which involve self-embedding, either directly or at
one or more removes, e.g.,
A -> B A
A -> C B
B -> D A

This is directly related to the fact that it only processes single
inputs.

To form stars and self-embeddings, you need a top-down pass, which
absolutely requires multiple samples, instead of one. For example,
if the samples are

                                    a aaa aaaaa aaaaaaaaa

the bottom-up pass could form rules such as

A -> a a
B -> A a
C -> A B
D -> A C
E -> A D

But a top-down pass would be required to infer the stars or
self-embeddings. In the example above, it would start from the rule

S -> a | B | C | E

maybe pass through an intermediate stage

S -> a | A a | A B | A D

and might infer from this

S -> a | A S
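The worked example can be checked mechanically. A short sketch that
expands the bottom-up rules above and confirms that the inferred
grammar S -> a | A S derives every sample (it generates exactly the
odd-length strings of a's):

```python
# Bottom-up rules from the example above.
rules = {"A": "aa", "B": "Aa", "C": "AB", "D": "AC", "E": "AD"}

def expand(sym):
    # Rewrite a symbol down to its terminal string.
    return "".join(expand(t) for t in rules[sym]) if sym in rules else sym

def derives(s):
    # S -> a | A S with A -> a a: strip "aa" prefixes until "a" remains.
    while s.startswith("aa"):
        s = s[2:]
    return s == "a"

samples = ["a", "aaa", "aaaaa", "aaaaaaaaa"]
# B, C and E expand to the samples "aaa", "aaaaa" and "aaaaaaaaa"
# (D, at 7 a's, is an intermediate not present in the sample set),
# and every sample is derivable from S -> a | A S.
```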
