Languages with optional spaces

Maury Markowitz <maury.markowitz@gmail.com>
Wed, 19 Feb 2020 07:35:59 -0800 (PST)

          From comp.compilers


From: Maury Markowitz <maury.markowitz@gmail.com>
Newsgroups: comp.compilers
Date: Wed, 19 Feb 2020 07:35:59 -0800 (PST)
Organization: Compilers Central
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="74218"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, question, comment
Posted-Date: 19 Feb 2020 11:24:00 EST

I'm trying to write a lex/yacc (flex/bison) interpreter for classic BASICs
like the original DEC/MS, HP/DG etc. I have it mostly working for a good chunk
of 101 BASIC Games (DEF FN is the last feature to add).


Then I got to Super Star Trek. To save memory, SST removes most spaces, so
lines look like this:


100FORI=1TO10


Here are my current patterns that match parts of this line:


FOR { return FOR; }


[:,;()\^=+\-*/\<\>] { return yytext[0]; }


[0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
                            yylval.d = atof(yytext);
                            return NUMBER;
                        }


"FN"?[A-Za-z@][A-Za-z0-9_]*[\$%\!#]? {
                            yylval.s = g_string_new(yytext);
                            return IDENTIFIER;
                        }


These correctly pick out some parts, numbers and = for instance, so the
scanner sees:


100 FORI = 1 TO 10


The problem is the FORI part. Some BASICs allow variable names with more than
two characters, so in theory FORI could be a variable. These BASICs rule that
out in their parsers: any string that starts with a keyword is cut off at the
end of that keyword, so this would always parse as FOR followed by I. In lex,
the longest match wins and FORI is longer than FOR, so it returns a variable
token called FORI.


Is there a way to represent this in lex? Over on Stack Overflow the only
suggestion seemed to be to use trailing context on the keywords, but that
appears to require modifying every one of the simple keyword patterns with
some extra (and ugly) syntax. Likewise, one might modify the variable name
pattern, but I'm not sure how one says "everything that doesn't start with one
of these other 110 patterns".
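
One possible direction, offered as a sketch under stated assumptions rather
than a tested answer: leave the patterns alone and let the IDENTIFIER action
decide. The C helpers below check whether a matched string starts with a
keyword, and how much of it can stand as a variable name before an embedded
keyword begins. The keyword table, the function names keyword_prefix_len()
and identifier_len(), and token_for() in the comment are all hypothetical.
Treating keywords embedded in the middle of a name as keywords is also an
assumption about how these BASICs behave. yyless() is flex's standard way to
push the tail of a match back onto the input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical keyword table; a real BASIC would list all of its keywords.
   Longer keywords are listed first so a longer keyword is tried before any
   shorter keyword that happens to be its prefix. */
static const char *const keywords[] = {
    "NEXT", "STEP", "THEN", "FOR", "TO", "IF"
};

/* Return the length of the keyword that text starts with, or 0 if none. */
size_t keyword_prefix_len(const char *text)
{
    assert(text != NULL);
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        size_t n = strlen(keywords[i]);
        if (strncmp(text, keywords[i], n) == 0)
            return n;
    }
    return 0;
}

/* Return how many leading characters of an identifier-shaped match to keep
   as the variable name: everything before the first embedded keyword. */
size_t identifier_len(const char *text)
{
    size_t len = strlen(text);
    for (size_t i = 1; i < len; i++)
        if (keyword_prefix_len(text + i) > 0)
            return i;
    return len;
}

/* Sketch of use inside the flex IDENTIFIER action (token_for() is a
 * hypothetical lookup from keyword text to token code):
 *
 *   size_t n = keyword_prefix_len(yytext);
 *   if (n > 0) { yyless(n); return token_for(yytext); }
 *   yyless(identifier_len(yytext));
 *   yylval.s = g_string_new(yytext);
 *   return IDENTIFIER;
 */
```

With this, FORI would come back as FOR with I pushed back for the next scan,
and TO10 as TO followed by 10. The cost is that no variable name can contain
a keyword at all, which matches the restriction described above.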


Is there a canonical cure for this sort of problem that isn't worse than the
disease?
[Having written Fortran parsers, not that I've ever found. I did a prepass
over each statement to figure out whether it was an assignment or something
else, then the lexing was straightforward if not pretty. -John]

