Re: Learning only one lexer made me blind to its hidden assumptions

Kaz Kylheku <480-992-1380@kylheku.com>
Fri, 15 Jul 2022 14:16:59 -0000 (UTC)

From comp.compilers

Related articles
Learning only one lexer made me blind to its hidden assumptions costello@mitre.org (Roger L Costello) (2022-07-07)
Re: Learning only one lexer made me blind to its hidden assumptions luser.droog@gmail.com (luser droog) (2022-07-12)
Re: Learning only one lexer made me blind to its hidden assumptions jvilar@uji.es (Juan Miguel Vilar Torres) (2022-07-13)
Re: Learning only one lexer made me blind to its hidden assumptions drikosev@gmail.com (Ev. Drikos) (2022-07-13)
Re: Learning only one lexer made me blind to its hidden assumptions antispam@math.uni.wroc.pl (2022-07-13)
Re: Learning only one lexer made me blind to its hidden assumptions gneuner2@comcast.net (George Neuner) (2022-07-14)
*Re: Learning only one lexer made me blind to its hidden assumptions 480-992-1380@kylheku.com (Kaz Kylheku)* (2022-07-15)**
Re: Learning only one lexer made me blind to its hidden assumptions antispam@math.uni.wroc.pl (2022-07-15)

| List of all articles for this month |

From:	Kaz Kylheku <480-992-1380@kylheku.com>
Newsgroups:	comp.compilers
Date:	Fri, 15 Jul 2022 14:16:59 -0000 (UTC)
Organization:	A noiseless patient Spider
References:	22-07-006
Injection-Info:	gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="72062"; mail-complaints-to="abuse@iecc.com"
Keywords:	lex
Posted-Date:	15 Jul 2022 12:26:45 EDT

On 2022-07-07, Roger L Costello <costello@mitre.org> wrote:
> Hi Folks,
>
> For months I have been immersed in learning and using Flex. Great fun indeed.
>
> But recently I have been reading a book, Crafting a Compiler with C, and
> reading its chapter on lexers. The chapter describes two lexer-generators:
> ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden
> assumptions in Lex/Flex. Without learning ScanGen I would have continued to
> think that the way things are done in Lex/Flex way is the only way.

It seems hard to find information about ScanGen. I found someone's
slides claim that it's something written by a Gary Sevitsky in 1979,
and then worked on over the next couple of years by Robert Gray
and Charles Fischer.

By "learning ScanGen" do you mean that you have an implementation
and have created some scanners?

> Below I have documented some of the differences between Lex/Flex and ScanGen.
>
> Difference:
> - Flex allows overlapping regexes. It is up to Flex to use the 'correct'
> regex. Flex has rules for picking the correct one: longest match wins, regex
> listed first wins.
> - ScanGen does not allow overlapping regexes. Instead, you create one regex
> and then, if needed, you create "Except" clauses. E.g., the token is an
> Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'
>
> Difference:
> - Flex regexes use juxtaposition for specifying concatenation.
> - ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls

That's verbose. Juxtaposition is used for catenation so that you can
simply use "foo" if you want to match the text "foo".

It behooves pattern matching notations to look like the problem space;
i.e. that expression which just match single terms rather than sets
just look like the terms they are matching.

> it 'catenation' not 'concatenation'

Careful speaker and writers of the English language avoid the
"concatenate" pleonasm.

"catena" is Latin for "chain" (e.g. "catenary curve: the curve formed by
a chain supported horizontally at its endpoints, under gravity"). To
catenate literally means to chain; the "con" semantics is implicit,
since chaining is always "together".

The Unix command is called "cat" not "concat"; the function is "strcat"
not "strconcat".

If you could think of a way to chain things apart, that could be
grounds for a functional prefix. For instance, chaining a pair of
agrressive dogs to opposite walls of a yard so they cannot get at each
other could plausibly be called "discatenation". :)

> Difference:
> - Flex regexes use | for specifying alteration in regexes
> - ScanGen uses ',' to specify alternation

That's just gratuitously weird. Commas are widely used for separating
groups in sequences: domain names, object member access,
date.month.year, integer.fraction, filename.suffix, ...

The | character is used in BNF.

I can see that in 1979, stringent character set limitations would still
have to be taken into account when developing something; there would
still be reasons to stick to a six bit character set.

> Difference:
> - With Flex, tossing out characters (e.g., toss out the quotes surrounding a
> string) may involve writing C code to reprocess the token
> - ScanGen has a 'Toss' command to toss out a character, e.g, Quote(Toss). No
> token reprocessing needed

That C code is just:

      char *interior = yytext + 1;
      yytext[yyleng - 1] = 0;

You could write a library a function like this, which you can reuse in
future lex jobs:

    /* chop "front" characters from front, "back" characters
          from back, of yytext, and return pointer to chopped
          lexeme. */

    char *yytrim(int front, int back)
    {
        assert (front >= 0);
        assert (back >= 0);
        assert (front <= yyleng);
        assert (back <= yyleng);

        {
            char *f = yytext + front;
            char *b = yytext + yyleng - back;
            if (b < f)
                b = f;
            *b = 0;
            return f;
        }
    }

If the string literal syntax supports the escaping of quotes, and other
features, stripping the quotes becomes insufficient.

There are plenty of situations in which it is handy to strip something
from either end of a lexeme, though; the above function would be useful
in the "-ll" lex library.

> Difference:
> - Flex deals with individual characters
> - ScanGen lumps characters into character classes and deals with classes. Use
> of character classes decreases (quite significantly) the size of the
> transition table

Obviously, Lex has explicit character classes, like [cxb]: the class
containing 'c', 'x' and 'b'.

Even without explicit classes in the syntax, the DFA subset constrution
implicitly identifies character classes. A regex like (c|x|b) should not
generate more states than [cxb].

Subset construction will squeeze out things like, say (adcdx|abcdy),
which should produce the same DFA as a[bd]cd[xy].

A state in a DFA can have many transitions leading out of it, triggered
by different input symbols: that constitutes a character class.
This class is essentially discovered by the algorithm.

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: Learning only one lexer made me blind to its hidden assumptions

Kaz Kylheku <480-992-1380@kylheku.com>Fri, 15 Jul 2022 14:16:59 -0000 (UTC)

Kaz Kylheku <480-992-1380@kylheku.com>
Fri, 15 Jul 2022 14:16:59 -0000 (UTC)