|Can Coco/R do multiple tokenizations email@example.com (2005-08-13)|
|Re: Can Coco/R do multiple tokenizations firstname.lastname@example.org (George Neuner) (2005-08-16)|
|Re: Can Coco/R do multiple tokenizations DrDiettrich@compuserve.de (Hans-Peter Diettrich) (2005-08-16)|
|Re: Can Coco/R do multiple tokenizations email@example.com (Gene Wirchenko) (2005-08-16)|
|Re: Can Coco/R do multiple tokenizations cfc@shell01.TheWorld.com (Chris F Clark) (2005-08-21)|
|Re: Can Coco/R do multiple tokenizations firstname.lastname@example.org (Darius Blasband) (2005-08-21)|
|From:||Darius Blasband <email@example.com>|
|Date:||21 Aug 2005 00:21:57 -0400|
|Posted-Date:||21 Aug 2005 00:21:57 EDT|
George Neuner wrote:
>>Can Coco/R, (can any other parser/lexer generator ) do multiple
>>tokenizations & parser-tree-generations
> AFAIK, there are no existing lexer gen tools which allow alternate
> tokenizations for the same input text. You are free to write one of
> I don't know Coco/R, but what you want is possible by using deliberate
> backtracking and multiple lexers. It's a slow and painstaking process
> of trying a particular parse, saving the AST if the parse succeeds,
> then backtracking, switching lexers and trying the same parse again.
> If you end up with no ASTs, the parse failed, and if you end up with
> multiple ASTs, the code was ambiguous.
As a matter of fact, my PhD thesis
was just about that: a lexer and parser
generator that can backtrack, and in practice, this is convenient
enough. Most ambiguities are local, up to a point where backtracking
is a convenience feature with reasonable impact on performance, far
from the theoretical worst case. This technology is robust enough to
be used in RainCode (http://www.raincode.com) and has been used to
parse poorly defined languages such as COBOL and PL/1.
For instance, in COBOL, something such as:
could be a picture definition, recognized as a single token,
or an access to an array named X. Originally, the COBOL standard
specified that parenthesis should be surrounded by spaces, but most
modern compiler would accept the notation above anyway.
Our backtracking lexer will first recognize the above fragment
as a single "Picture" token (please note that there is no need for more
complex structures, as one cannot nest parenthesis in this context).
If the parser fails to do anything useful with this, it will then
recognize the same fragment as 4 tokens ("X", "(", "9", ")").
The same can be used to parse fortran (where white space is ignored
altogether) and PL/1 where keywords are not reserved and can be used
as user-defined identifiers.
Return to the
Search the comp.compilers archives again.