Re: Languages with optional spaces

Christopher F Clark <>
Sun, 1 Mar 2020 10:07:52 +0200

          From comp.compilers

Related articles
[6 earlier articles]
Re: Languages with optional spaces (Kaz Kylheku) (2020-02-26)
Re: Languages with optional spaces (awanderin) (2020-02-26)
Re: Languages with optional spaces (Ev. Drikos) (2020-02-28)
Re: Languages with optional spaces (Christopher F Clark) (2020-02-29)
Re: Languages with optional spaces (Ev. Drikos) (2020-02-29)
Re: Languages with optional spaces (Hans-Peter Diettrich) (2020-03-01)
Re: Languages with optional spaces (Christopher F Clark) (2020-03-01)
Re: Languages with optional spaces (Ev. Drikos) (2020-03-01)
Re: Languages with optional spaces (Christopher F Clark) (2020-03-02)
Re: Languages with optional spaces (Ev. Drikos) (2020-03-02)
Re: Languages with optional spaces (2020-03-02)
Re: Languages with optional spaces (Ev. Drikos) (2020-03-12)
Re: Languages with optional spaces (Gene) (2020-04-14)
[2 later articles]
| List of all articles for this month |

From: Christopher F Clark <>
Newsgroups: comp.compilers
Date: Sun, 1 Mar 2020 10:07:52 +0200
Organization: Compilers Central
References: 20-02-015 20-02-017 20-02-033 20-02-034
Injection-Info:; posting-host=""; logging-data="54277"; mail-complaints-to=""
Keywords: lex, design, comment
Posted-Date: 01 Mar 2020 12:52:31 EST

I'm sorry if my reply sounded like a criticism of Ev's post. I didn't
intend it to be. It was more pointing out that when one does things
by hand, one often makes things that are challenging to formalize.

I noticed this particular aspect as I was trying to imagine what would
need to add to the lexer generator in Yacc++ (you can use lex (or
flex) with it, but it also includes its own model which has extensions
so that you can write LR rules in your lexer too and a few other
things to minimize conflicts, integrate with a symbol table, etc).
One of our goals with writing it was to capture in a formalizable way
some of the important "hacks" one often used to solve problems (e.g.
how to do PL/I style keywords).

And this problem looked like one where if the hack could be distilled,
it would make sense to do so. Not that I am a big fan of languages
with optional spaces, but that doesn't mean that they might not
represent something people need to handle.

But John, our moderator, and Dodi are right in a particular sense.
Certain things were done informally by hand "in the old days". And
this actually points at another possible direction to a solution, PEGs
(parsing expression grammars). It's a different way of attacking the
ambiguous language problem. In PEGs, "or" constructs are not
un-ordered, but instead you take them in a specific order the grammar
writer chooses and the first one that applies is the correct one. You
get somewhat the same effect with lex, where if two rules match you
use the one specified lexically first. The only difference is that in
lex, the longest match rule overrides that, where in a PEG, the
ordering rule overrides (and you control the ordering at the caller
level, by how you write the "or" not by the order of the rules that
the "or" invokes).

In my ambiguous example, a PEG like construct where the first "TO" is
the rule that matches is probably not that hard to write.

And, PEG rules can actually be added to LL and LR grammars. They came
from syntactic predicates that Terence Parr invented (and we added
those to Yacc++ by using the idea of "Tunnel Automata" pioneered by
Josef Gro"lsch). If one goes down that path, it might be possible to
actually capture what people have done by hand. I don't recall
whether we added them to the lexer portion of the generator. So,
now it's on my to-do list.

On Sat, Feb 29, 2020 at 11:48 AM Christopher F Clark
<> wrote:
> "Ev. Drikos" <> posted an interesting albeit partial solution
> to the problem of keywords being part of identifiers in languages with
> optional spaces.
> I won't include it here.
> The problem is that some keywords can appear at places other than the
> beginning of an identifier.
> In fact, in the worst case scenario, the language can be ambiguous.
> Consider the following "BASIC" program extended with variables that
> are more than one letter long
> and spaces being optional.
> 10 LET ITO = 1
> 20 LET I = 2
> 30 LET JTOK = 3
> 40 LET K = 4
> 80 PRINT N;
> 90 NEXT N
> 100 END
> The problem with such solutions is one is tempted to "fix" them one by
> one as they are encountered.
> Maury Markowitz <> mentioned this in his post
> where ATO was considered.
> It could be A TO or AT O (presuming that TO and AT are both keywords)
> Note that this is even an issue with 1 letter variable names if one
> has both keywords.
> As one starts patching up these cases, the "grammar"
> (or its recursive descent implementation most likely)
> begins to become what I call "ad hack".
> With a GLR parser (or something equivalent in power, e.g. an Earley
> parser or CYK) and
> a lexer that returns all possible sets of tokenizations one can find
> all the relevant parse
> trees and then see if only 1 makes semantic sense.
> In the above example, that won't help as both interpretations are
> legal programs.
> One prints 2 3, the other 1 2 3 4.
> I cannot imagine a programmer being happy with the error message:
Chris Clark email:
Compiler Resources, Inc. Web Site:
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
[So how do you lex PL/I? It's not ambiguous, and I think you never need more than
one token lookahead to decide whether a string of letters that looks like a
keyword is instead a variable name as in



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.