Re: Java Comment-Preserving Grammar

Chris F Clark <cfc@shell01.TheWorld.com>
15 Jun 2004 01:02:17 -0400

          From comp.compilers


From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 15 Jun 2004 01:02:17 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 04-05-075 04-06-004 04-06-022 04-06-041 04-06-057
Keywords: Java, parse
Posted-Date: 15 Jun 2004 01:02:17 EDT

In the ongoing whitespace discussion,
glen herrmannsfeldt <gah@ugcs.caltech.edu> posted:
> Maybe it is that languages designed around yacc/bison work
> that way?


Actually, the trend precedes Yacc by quite some time. Many Algol-60
compilers worked that way. In fact, I think the BNF for Algol-60 can
be interpreted that way (Burroughs and Univac did so). I can't recall
for sure, as the document also suggests that every Algol keyword is
effectively a separate special character, and some compilers tried to
go that way (CDC did so by requiring the "reserved words" to be quoted,
which made for an almost useless implementation).


> At the time I was thinking about PL/I, where there
> are some places where spaces (or comments) are needed to separate
> tokens:
>
> DO I=J TO K BY L;


And, this PL/I case is exactly what I was referring to. You need
whitespace to separate tokens. However, the whitespace is not present
in a typical PL/I grammar. (For example, at Prime Computer, where I
worked on PL/I compilers, neither the Freiburghouse compiler nor the
homegrown plp compiler used a grammar in which whitespace was present.)
Moreover, you do not want the places where whitespace and/or comments
are needed to separate tokens to show up in a PL/I "parsing"
grammar. You want to apply the rule mentioned by our esteemed and
erudite moderator.


> [In most modern languages, whitespace is optional unless the two
> tokens next to each other look like one token, e.g. in 2+2 the
> whitespace is optional, but in int foo it's not. PL/I has extra
> excitement beyond tokenizing because it has no reserved words and
> you can write IF IF=THEN THEN IF=ELSE; ELSE IF=THEN; -John]


If you apply this rule together with the typical "LEX" rule that the
longest match defines a token, you get a natural lexing of most
languages (legacy Fortran dialects, RPG, and a few others excepted).
And, once you have lexed such languages, you don't need whitespace for
determining the parse.
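
To make the point concrete, here is a minimal sketch in Python of a
longest-match lexer that uses whitespace and comments only to separate
tokens and then throws them away. The token classes (WS, ID, NUM, OP)
are my own names for illustration and are not taken from any real PL/I
compiler. Note that Python's re module tries alternatives in order
rather than picking the longest one; the classes here start with
different characters, and the greedy quantifiers give the longest match
within each class, so the result is the same for this small example.

```python
import re

# Hypothetical token classes for illustration only.
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+|/\*.*?\*/)             # whitespace and comments: separators only
  | (?P<ID>[A-Za-z_][A-Za-z0-9_]*)    # identifiers and keywords (PL/I reserves none)
  | (?P<NUM>\d+)
  | (?P<OP>[=;+\-*/(),])
""", re.VERBOSE | re.DOTALL)

def lex(text):
    """Return (kind, lexeme) pairs, dropping whitespace and comments."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

With this sketch, lex("DO I=J TO K BY L;") yields nine tokens and none
of them is whitespace, while lex("DOI=JTOKBYL;") yields only four,
because the longest match swallows DOI and JTOKBYL as single
identifiers. The parser sees the same kind of token stream either way
and never needs a whitespace token.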


One of the few cases where you need whitespace at parsing time is C
preprocessing (if you implement it in the parsing grammar). In a
#define, the presence or absence of whitespace between the identifier
being defined and a following parenthesis determines whether the
identifier is a parameterized macro (and the parenthesis begins an
argument list) or not (and the parenthesis is part of the expansion).
However, even in this case, the problem can be solved lexically by
returning two different token sequences for "id(" and "id (".
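
A rough sketch of that lexical trick, again in Python. The function
name lex_define_head and the token kinds (FUNC_MACRO_NAME,
OBJ_MACRO_NAME, LPAREN_PARAMS) are hypothetical names of mine and do
not come from any particular preprocessor. The lexer looks at the
character immediately after the macro name and commits to one token
sequence or the other, so the grammar never has to mention whitespace.

```python
import re

# Matches "#define NAME" and, optionally, a "(" that touches the name.
DEFINE_RE = re.compile(r"\s*#\s*define\s+([A-Za-z_]\w*)(\()?")

def lex_define_head(line):
    """Return (tokens, rest_of_line) for the head of a #define line."""
    m = DEFINE_RE.match(line)
    if m is None:
        raise ValueError("not a #define line")
    name, paren = m.group(1), m.group(2)
    rest = line[m.end():]
    if paren:
        # "id(" : parameterized macro, the parenthesis opens the parameter list
        return [("DEFINE", "#define"),
                ("FUNC_MACRO_NAME", name),
                ("LPAREN_PARAMS", "(")], rest
    # "id (" or plain "id": everything after the name is the expansion
    return [("DEFINE", "#define"), ("OBJ_MACRO_NAME", name)], rest
```

For "#define f(x) x+1" the second token is FUNC_MACRO_NAME and the rest
of the line starts inside the parameter list; for "#define f (x) x+1"
the second token is OBJ_MACRO_NAME and " (x) x+1" is left over as the
expansion.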


Note, it was this specific whitespace problem that prompted the
"ignore" extension in Yacc++, which specifically allows one to omit
whitespace from all parts of the grammar where it isn't important for
the parsing (and not just the lexing) phase, but to include it where
it is important. The same problem led Quinn Tyler Jackson to a
different solution in meta-S.


Similarly, trying to recapture tokens that were only important for
lexing and are not present at all in the parsing grammar inspired the
"special token" concept in JavaCC.
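
The idea can be sketched as follows, in Python rather than JavaCC. This
is only an illustration of the concept and not JavaCC's API: the Token
class, its special field, and lex_with_specials are names I made up.
Comments never reach the parser as tokens of their own; instead each
one rides along on the next regular token, so a comment-preserving tool
can still find it.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Token:
    kind: str
    text: str
    special: list = field(default_factory=list)  # comments seen just before this token

TOKEN_RE = re.compile(
    r"(?P<COMMENT>//[^\n]*|/\*.*?\*/)|(?P<WS>\s+)|(?P<WORD>\w+)|(?P<PUNCT>[^\w\s])",
    re.DOTALL,
)

def lex_with_specials(text):
    """Lex text, attaching each comment to the regular token that follows it."""
    tokens, pending, pos = [], [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)  # every character falls in some class
        pos = m.end()
        if m.lastgroup == "COMMENT":
            pending.append(m.group())
        elif m.lastgroup != "WS":
            tokens.append(Token(m.lastgroup, m.group(), pending))
            pending = []
    # Comments after the last regular token hang off an end-of-file token.
    tokens.append(Token("EOF", "", pending))
    return tokens
```

The parser only looks at kind and text, so the grammar stays free of
comments, while a pretty-printer can walk the special lists to put the
comments back.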


Finally, some very modern languages (e.g. Python and Haskell) have
introduced a new dependence on whitespace in terms of being indentation
sensitive--and that is a new and distinct problem for which such
compilers have their own solution. I believe such indentation issues
are solved lexically though and not present in the "parsing" grammar
as explicit whitespace tokens--however, I have not looked at these
grammars so I could be wrong.
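
For what it is worth, the Python language reference does describe its
lexical analyzer as generating INDENT and DEDENT tokens from a stack of
indentation widths. A minimal sketch of that lexical approach is below.
It is a simplification under stated assumptions: indentation is counted
in spaces only, each non-blank line becomes a single LINE token instead
of being lexed further, and the function name indent_tokens is mine.

```python
def indent_tokens(source):
    """Turn leading spaces into INDENT/DEDENT tokens using a width stack."""
    stack = [0]          # indentation widths of the enclosing blocks
    out = []
    for line in source.splitlines():
        stripped = line.lstrip(" ")
        if not stripped:
            continue     # blank lines do not affect indentation
        width = len(line) - len(stripped)
        if width > stack[-1]:
            stack.append(width)
            out.append(("INDENT", ""))
        while width < stack[-1]:
            stack.pop()
            out.append(("DEDENT", ""))
        if width != stack[-1]:
            raise IndentationError("dedent does not match any outer level")
        out.append(("LINE", stripped))
    while len(stack) > 1:    # close any blocks still open at end of input
        stack.pop()
        out.append(("DEDENT", ""))
    return out
```

The parsing grammar can then treat INDENT and DEDENT exactly like
"begin" and "end" brackets, and no explicit whitespace token appears in
it.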


Note, whatever the current issues are, it is likely over time that new
grammar extensions will be defined to treat them. The "feature" set
of Yacc++ is inspired by the "canonical" solutions learned to parse
PL/I, Pascal, C, and 4GL dialects. That means it works quite well for
parsing C++, Java, and C#. Some changes to improve it for parsing
HTML and certain binary formats can easily be foreseen, as those are
also problems for which "canonical" solutions are known. I will be
truly surprised if someone doesn't write a parser generator that
includes a "good" solution for the "indentation problem".


Hope this helps,
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
------------------------------------------------------------------------------

