[Query] Tree-based parsing?


Related articles
[Query] Tree-based parsing? jlilley@empathy.com (John Lilley) (1997-04-16)
Re: [Query] Tree-based parsing? fjh@mundook.cs.mu.OZ.AU (1997-04-18)
Re: [Query] Tree-based parsing? bear@sonic.net (Ray Dillinger) (1997-04-22)
Re: [Query] Tree-based parsing? geert@sun3.iaf.nl (1997-05-04)
Re: [Query] Tree-based parsing? nkramer@cs.cmu.edu (Nick Kramer) (1997-05-04)
Re: [Query] Tree-based parsing? scotts@metaware.com (Scott Stanchfield) (1997-05-08)
Re: [Query] Tree-based parsing? scotts@metaware.com (Scott Stanchfield) (1997-05-08)
[1 later articles]

From: John Lilley <jlilley@empathy.com>
Newsgroups: comp.compilers
Date: 16 Apr 1997 00:32:19 -0400
Organization: Nerds for Hire, Inc.
Keywords: parse, question

Hi,


I've had this idea recently that surely has been explored by those
more knowledgeable than I, so I want to know if there are scholarly
articles available on what I would call a "tree-based" approach to
source-code parsing. Essentially, I would take the following steps:


1) Lex the input stream into tokens as usual.


2) Convert the token stream directly into a degenerate "AST", which
would start out as a flat linked list of all tokens.


3) Walk the flat "AST" using LL or LALR methods and match gross
features of the language that are invariant regardless of local
syntax. For C/C++, this would involve matching pairs of {} [] ().
The result would be a transformation of the linked-list "AST" into a
real tree structure describing nested pairs and the tokens and
subtrees contained therein (a rough sketch of steps 2-4 follows this
list).


4) Perform successive refinement steps of the AST. A logical next
step is to split the "leaf" token lists across global delimiters like
';' for C/C++. Then refine the tree further looking for control flow
constructs, and on to more detailed stuff like declarations...


Yes, this approach would definitely take a lot of memory and probably
be slower than a "normal" parser. I am hoping that, despite these
drawbacks, it would have significant benefits in terms of:


1) Error recovery. Since the input is split according to gross
features first and then successively refined, syntactic error
avalanches would be localized, e.g. to a single statement or
declaration.


2) Incremental processing. I would like to perform a quick minimal
reparse of the input as a user edits it. With the previous tree
available, one should be able to use simple heuristics to determine
that no syntactic reparse is necessary for the tokens enclosed by
e.g. {} if no tokens changed inside the {}. Of course, there are
problems with things like #define and typedef performing "action at a
distance", but ignoring that for a moment...


3) AST generation. The result of the process carried to its
conclusion is already an AST, so no further processing is necessary.


4) Decoupling of semantics and syntax. For ambiguous grammars like
C/C++, where typedef information must be used during the parse, there
are still many cases where an identifier can be assumed, based solely
on syntax, to be either a type or a non-type. For example, the
following can be deduced to be a cast without knowing that y is a
type, even though in the more general case the definition of y would
be required:
        x = (y)z;
If these cases are parsed first, leaving only the truly ambiguous
ones, then the quality of error messages can be improved by moving
them to the semantic phase (e.g. "y is not a type" rather than a bare
syntax error). A toy sketch follows this list.


5) I believe, though I have no concrete evidence, that successive
tree rewrites in which, for example, sequences of modifiers are
collapsed into a subtree before being used, would allow more powerful
pattern matching than is available from a monolithic parse of the
token stream.
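
To make point 4 concrete, here is a self-contained toy check (my own
simplification, not a real C++ parser): purely from token shapes it
classifies "(y)z" as a cast, while "(y)+z" remains genuinely
ambiguous until we know whether y names a type.

  #include <cctype>
  #include <cstddef>
  #include <iostream>
  #include <string>
  #include <vector>

  // True if a token could begin an operand: an identifier, a literal,
  // or an opening parenthesis.  Grossly simplified on purpose.
  bool startsOperand(const std::string& t) {
      return t == "(" ||
             std::isalnum(static_cast<unsigned char>(t[0])) != 0;
  }

  // Tokens i.. of the form "(" identifier ")" <operand> can only be a
  // cast: a parenthesized *expression* is never followed directly by
  // another operand.  No typedef table is needed to decide this case.
  bool looksLikeCast(const std::vector<std::string>& toks, std::size_t i) {
      if (i + 3 >= toks.size()) return false;
      bool ident =
          std::isalpha(static_cast<unsigned char>(toks[i + 1][0])) != 0;
      return toks[i] == "(" && ident && toks[i + 2] == ")" &&
             startsOperand(toks[i + 3]);
  }

  int main() {
      std::vector<std::string> a = {"x", "=", "(", "y", ")", "z", ";"};
      std::vector<std::string> b = {"x", "=", "(", "y", ")", "+", "z", ";"};
      std::cout << looksLikeCast(a, 2) << "\n";  // 1: must be a cast
      std::cout << looksLikeCast(b, 2) << "\n";  // 0: cast or addition,
                                                 //    depends on what y is
  }

In the tree-based scheme this test would run over the already-nested
() groups rather than the raw token stream, but the idea is the same:
resolve what syntax alone can resolve, and defer the rest to the
semantic phase.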


Any ideas, experience, or references would be greatly appreciated.


john lilley
--

