Re: Have I discovered something new?

"Steve Horne" <steve@lurking.demon.co.uk>
25 Jul 2002 23:24:06 -0400

          From comp.compilers

Related articles
Have I discovered something new? steve.horne@iris.co.uk (Stephen Horne) (2002-07-15)
Re: Have I discovered something new? torbenm@diku.dk (Torben Ægidius Mogensen) (2002-07-21)
Re: Have I discovered something new? joachim_d@gmx.de (Joachim Durchholz) (2002-07-21)
Have I discovered something new? cfc@world.std.com (Chris F Clark) (2002-07-21)
Re: Have I discovered something new? kgw-news@stiscan.com (2002-07-21)
Re: Have I discovered something new? whopkins@alpha2.csd.uwm.edu (Mark) (2002-07-24)
Re: Have I discovered something new? steve@lurking.demon.co.uk (Steve Horne) (2002-07-25)
Re: Have I discovered something new? robert.corbett@sun.com (Robert Corbett) (2002-07-31)
Re: Have I discovered something new? whopkins@alpha2.csd.uwm.edu (Mark) (2002-07-31)
Re: Have I discovered something new? cfc@shell01.TheWorld.com (Chris F Clark) (2002-08-04)
Re: Have I discovered something new? whopkins@alpha2.csd.uwm.edu (Mark) (2002-08-04)
| List of all articles for this month |

From: "Steve Horne" <steve@lurking.demon.co.uk>
Newsgroups: comp.compilers
Date: 25 Jul 2002 23:24:06 -0400
Organization: Customer of PlusNet
References: 02-07-057
Keywords: parse, summary
Posted-Date: 25 Jul 2002 23:24:06 EDT

Thanks for all your replies (especially Mark's, which I suspect will
give me a great deal to think about when I finally understand it).


I did discuss my method with Chris Clark, and the end result was that
while it is an unusual approach and may be of some interest, it is
not entirely original, and it has a fundamental flaw which prevents it
from handling ambiguity in general grammars as well as I thought it
would. I also have some significant doubt about whether it can really
be called parsing - Chris mentioned the term 'partial specialisation',
which sounds about right, but it's been some years since college and I
don't remember exactly what it means.


Anyway, it seems bad form not to at least explain the idea, given the
time some people have put into their replies, so here goes.


The basic idea was that for each token recognised, the 'parser' should
form a new, more specialised grammar from the old one, in which the
head part of every string in the language is restricted to the
sequence of tokens shifted so far. Thus any ambiguity in the input is
simply maintained in the output grammar.
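
A minimal sketch of that per-token specialisation, written just for
this post (the Python code is my own illustration, and the
lowercase-for-terminals convention and all names are inventions of the
sketch). What it computes amounts to the left quotient of the
grammar's language by the token:

    # Illustration only.  Grammar: dict from nonterminal to a list of
    # alternatives; each alternative is a list of symbols.  Lowercase
    # symbols are terminals (a convention invented for this sketch).

    def nullable_set(g):
        """Nonterminals that can derive the empty string (fixpoint)."""
        nullable, changed = set(), True
        while changed:
            changed = False
            for nt, alts in g.items():
                if nt not in nullable and any(
                        all(s in nullable for s in alt) for alt in alts):
                    nullable.add(nt)
                    changed = True
        return nullable

    def derive(g, start, tok):
        """New grammar whose start symbol derives { w : tok w in L(start) }."""
        nullable = nullable_set(g)
        out = dict(g)        # derived rules sit alongside the old ones
        cache = {}           # original nonterminal -> derived nonterminal

        def d_nt(nt):
            if nt in cache:
                return cache[nt]
            dnt = "D[%s]%s" % (tok, nt)
            cache[nt] = dnt
            out[dnt] = []    # install first, so left recursion terminates
            for alt in g[nt]:
                out[dnt].extend(d_seq(alt))
            return dnt

        def d_seq(seq):
            """Every way of consuming tok from the front of the sequence."""
            alts = []
            for i, sym in enumerate(seq):
                rest = list(seq[i + 1:])
                if sym.islower():                # terminal: must match tok
                    if sym == tok:
                        alts.append(rest)
                    break
                alts.append([d_nt(sym)] + rest)  # tok consumed inside sym
                if sym not in nullable:          # only step past nullable syms
                    break
            return alts

        return out, d_nt(start)

    # S -> a S | b ; after shifting 'a' the new start symbol behaves like S.
    g = {"S": [["a", "S"], ["b"]]}
    g2, s2 = derive(g, "S", "a")
    print(s2, g2[s2])        # prints: D[a]S [['S']]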


Although this would not generate a parse tree, strictly speaking very
few parsers do. For instance, yacc-alikes use function calls that
*could* be used to build a parse tree but very rarely are, and at
least one parser I've seen just inserts special notations at
appropriate points in a list of symbols. The important point is that
analysing the result to extract useful information should be easy.


The pure approach would obviously not be linear, because of all the
grammar modification operations, so the intention was to precompute a
state model in which each state has a derived grammar associated
with it.
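
To make the precomputation concrete (again my own illustration,
reusing derive() from the sketch above): it is just a breadth-first
exploration of derivatives. States here are keyed by the derived start
symbol's name, so two equivalent but differently-named grammars count
as distinct states and the loop needs an arbitrary cap - which is
precisely the problem raised in the next paragraph.

    from collections import deque

    def build_states(g, start, alphabet, cap=100):
        """Illustration only: explore derivative states breadth-first."""
        states, edges = {start: g}, {}
        work = deque([start])
        while work and len(states) < cap:
            s = work.popleft()
            for tok in alphabet:
                g2, s2 = derive(states[s], s, tok)
                edges[(s, tok)] = s2         # the precomputed transition
                if s2 not in states:
                    states[s2] = g2
                    work.append(s2)
        return states, edges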


To avoid the next obvious problem (an infinite number of states), the
intention was to use a push-down automaton model to handle layers of
rules, much like an LR parser. Reduce/reduce conflicts and
shift/reduce conflicts would be handled by allowing ambiguous
nonterminal symbols on the stack and doing what I thought would be
relatively simple modifications, with a strictly limited scope, as the
parse continues.


Ambiguity, including infinite ambiguity, could be handled easily
because at no point are all the possibilities stored or listed. An
infinite set of possibilities can be represented by a finite, small
and simple grammar, provided the set follows a simple pattern.
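
For a concrete example (mine, not part of the original design): the
three-rule grammar S -> S S | a | <empty> gives the single-token input
'a' infinitely many derivations, yet all of them are represented at
once by those three rules.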


However, the fundamental flaw is almost too obvious when the idea is
written out as above. Reduce/reduce conflicts can be handled fairly
simply, but shift/reduce conflicts are a very different beast. I had
in mind that the parse would continue to the last possible reduce
and, after reducing, would go to a state determined by the state
reduced from and the state at the top of the stack after reducing - an
arrangement that created enough haze to fool me for a while. But
clearly this very rapidly reaches a point where the derived grammar
needs to be manipulated at run time, so the time to handle each
successive symbol cannot be constant. Thus the 'parse' (if it can be
called that) is not linear either.


In my defence, I also had in mind an idea similar to what Chris calls
LR(inf), which I thought might be used to pick and choose when to
reduce an ambiguous rule-like thing in the state model. I felt sure
that this could be made to handle any degree of ambiguity, and that a
quick look at the precompiled state model would yield all the
definitive information about possible ambiguities. If I'd known that
the ambiguity of an arbitrary CF grammar is undecidable, I'd have
examined that more closely.


At present I have an idea in which two passes are made through the
input - one forward and the other backward. The first pass could
purely handle ambiguity in single-rule layers (I'm not considering
shift/reduce conflicts yet). The second pass would need a run-time
representation of the grammar, but would gain from the different
conflicts occurring with the different directionality. However, I have
no illusions - at best this is a curiosity.


It is certainly no use for my applications. I have an idea which
basically treats the behaviour of a coroutine, process or object as
having a grammar. This needs to handle unbounded length sentences and
give the best information possible without waiting for lookaheads.


For example, consider two competing robots in a game. Each may
determine its best possible move directly from the current state, but
that implies quite a bit of analysis. If there is a well-known set of
tactics that work well in this competition, each robot can instead use
a grammar to represent sequences of observed events and appropriate
reactions. The analysis to determine what to do in the next turn then
reduces to extending the parse by one symbol (ie observed event) into
the future, which (grammar and parsing method permitting) can be done
in essentially constant time (amortised constant time, perhaps - I've
never been entirely clear what 'amortised' means, and just assume it
has something to do with averaging out occasional 'glitches' such as,
in this case, a whole bunch of rules being reduced at once after a
single terminal symbol shift).
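
As a toy illustration of that constant-time step (entirely my own
invention, and flattening the grammar to a plain finite-state table
for brevity):

    # Illustration only: observed events drive a precompiled transition
    # table, one event per turn; recognised tactic states carry reactions.
    TRANSITIONS = {                        # (state, event) -> state
        ("idle", "gap"):          "opening_seen",
        ("idle", "enemy"):        "threat_seen",
        ("threat_seen", "enemy"): "surrounded",
    }
    REACTIONS = {"opening_seen": "attack",
                 "threat_seen":  "retreat",
                 "surrounded":   "call for help"}

    def step(state, event):
        """One turn: a constant-time table lookup, no lookahead needed."""
        state = TRANSITIONS.get((state, event), "idle")
        return state, REACTIONS.get(state)

    state = "idle"
    for event in ["gap", "enemy", "enemy"]:
        state, reaction = step(state, event)
        print(event, "->", state, reaction)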


Anyway, derive the grammar automatically from the rules of the game
and you may well have a very efficient game-playing robot. For
turn-taking games this is quite easy if you accept a digraph model
(think railroad diagram) as an alternative to EBNF in the
specification of a rule. Converting the digraph back to EBNF is a
hassle if it's needed, but it probably never will be.
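
For the record, by the digraph form of a rule I just mean something
like this ad hoc encoding (notation invented here):

    # The rule  A -> b (c | d) e  as a digraph (railroad diagram):
    # nodes plus symbol-labelled edges, one entry node, one exit node.
    RULE_A = {
        "in":  [("b", "n1")],
        "n1":  [("c", "n2"), ("d", "n2")],   # the (c | d) branch re-joins
        "n2":  [("e", "out")],
        "out": [],
    }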


More likely, I suspect a small set of grammars will be 'parsed'
concurrently, with each grammar 'triggered' by some recognised
pattern or by entry into a particular state in some other grammar.
This avoids needing to incorporate perfect knowledge of the game into
a single grammar, though it only gives a gain if the knowledge
embedded in the set of mini-grammars is imperfect.
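
A sketch of the triggering I have in mind (illustration only, all
names invented):

    # Several small recognisers run side by side; entering a designated
    # state in one of them switches another one on.
    class Mini:
        def __init__(self, table, start, trigger_states=()):
            self.table = table                  # (state, event) -> state
            self.state = start
            self.trigger_states = set(trigger_states)

        def feed(self, event):
            self.state = self.table.get((self.state, event), self.state)
            return self.state in self.trigger_states

    scout  = Mini({("watch", "enemy"): "alert"}, "watch", ["alert"])
    combat = Mini({("off", "wake"): "engaged"}, "off")

    for event in ["gap", "enemy"]:
        if scout.feed(event):     # the scout reaching 'alert' wakes combat
            combat.feed("wake")
    print(combat.state)           # prints: engaged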


I'm confident a fairly good real-time AI could be derived using this
kind of technique.


My plan now is to return to the basic form of the idea I had before -
the one that was similar to LR(inf). Instead of variable length
lookahead, I'm using 'delayed recognition' - allowing shifts to
continue beyond the state where a reduce should have occurred until an
ambiguity is resolved, then shoving those extra shifts back to the
input so they can be re-shifted after the reduce. Each state will
therefore contain a complete description of the ambiguity in the
current rule, and the stack items (from which the current partial
parse trees are dangling) will give complete information about the
input so far.
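
The mechanics of shoving the extra shifts back look roughly like this
(a sketch of the intention, not working parser code; names invented):

    from collections import deque

    def resolve_deferred_reduce(stack, pending, lhs, rhs_len, extra):
        """Undo `extra` over-shifts, then perform the delayed reduce."""
        for _ in range(extra):
            pending.appendleft(stack.pop())  # shove extra shifts back to input
        children = [stack.pop() for _ in range(rhs_len)][::-1]
        stack.append((lhs, children))        # reduce now ambiguity is resolved

    stack = ["a", "b", "c", "d"]             # c, d were shifted past the reduce
    pending = deque(["e"])
    resolve_deferred_reduce(stack, pending, "X", rhs_len=2, extra=2)
    print(stack, list(pending))   # [('X', ['a', 'b'])] ['c', 'd', 'e']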


Rather than dotted rules and lookaheads, I'll be using a 'tail
grammar' for the state definitions. I'll use a depth-first search to
extend this tail grammar just far enough to remove conflicts from the
states, and as the recursion unwinds I'll reduce the definition of
each state to a minimal form (the set of actions and their destination
states) to optimise the number of states in the resulting tables.
This means I can safely grow the tail grammar one layer (the tail
ends of a set of parent rules) at a time, keeping the
depth-first-search time practical while not creating too many rules.
In this way, I should get close to LALR(1)-sized tables without having
to remove the (admittedly very rare) LALR(1) conflicts from the
grammar manually.


Finally, I'll have a set of heuristics which can warn when the grammar
cannot be proven unambiguous, and when the depth-first search is at
risk of never terminating.


To be honest, though, I might just use a modified Tomita parser with
LR(0) as the underlying state model so I don't have to supply
lookaheads before they are available. It's probably perfectly usable
as long as the ambiguity stays within reasonable bounds at any time,
and I can probably set the grammar up so that it is fairly resilient
to having some stack duplications discarded to conserve resources.
Labelling states with priority values should allow the least important
stack duplicate to be discarded, and a rule such as...


    START -> ANYTERMINAL START            // keep start state always in stack
           | DANGERSIGNAL01 { reaction; }
           | DANGERSIGNAL02 { reaction; } ...


with the first state of START at maximum priority should keep the
critical 'executive' stack duplicate (or perhaps one of several that
are linked in a ring) present at all times.
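
The discard policy is nothing cleverer than this (invented names, and
a much-simplified picture of the duplicated stacks):

    # Illustration only: each live stack duplicate carries the priority
    # of its current state; when resources run low, drop the least
    # important duplicates and keep the rest.
    def prune(stack_tops, budget):
        """stack_tops: list of (priority, stack) pairs, pruned in place."""
        stack_tops.sort(key=lambda top: top[0], reverse=True)
        del stack_tops[budget:]

    tops = [(0, ["x"]), (9, ["START"]), (3, ["y"])]
    prune(tops, budget=2)
    print(tops)   # [(9, ['START']), (3, ['y'])] - the 'executive' survives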


So I hope that explains what my idea was, why I wanted to do it, and
what the fatal flaw was.


--
Steve Horne
steve@lurking.demon.co.uk

