Learning only one lexer made me blind to its hidden assumptions

Roger L Costello <costello@mitre.org>
Thu, 7 Jul 2022 17:49:44 +0000

From comp.compilers


From:	Roger L Costello <costello@mitre.org>
Newsgroups:	comp.compilers
Date:	Thu, 7 Jul 2022 17:49:44 +0000
Organization:	Compilers Central
Injection-Info:	gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="18181"; mail-complaints-to="abuse@iecc.com"
Keywords:	lex, question, comment
Posted-Date:	11 Jul 2022 20:26:04 EDT
Content-Language:	en-US

Hi Folks,

For months I have been immersed in learning and using Flex. Great fun indeed.

But recently I have been reading a book, Crafting a Compiler with C, and
reading its chapter on lexers. The chapter describes two lexer-generators:
ScanGen and Lex. Oh my! Learning ScanGen opened my eyes to the hidden
assumptions in Lex/Flex. Without learning ScanGen I would have continued to
think that the way things are done in Lex/Flex is the only way.

Below I have documented some of the differences between Lex/Flex and ScanGen.

Difference:
- Flex allows overlapping regexes. It is up to Flex to use the 'correct'
regex. Flex has rules for picking the correct one: longest match wins, regex
listed first wins.
- ScanGen does not allow overlapping regexes. Instead, you create one regex
and then, if needed, you create "Except" clauses. E.g., the token is an
Identifier, except if the token is 'Begin' or 'End' or 'Read' or 'Write'.
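
To make the two approaches concrete, here is a rough Python sketch. It is only an illustration, not code from Flex or ScanGen: the rule table, the lowercase keywords, and the function names flex_style and except_style are all made up for this example. The first function mimics "longest match wins, regex listed first wins"; the second uses one identifier pattern plus a table of exceptions.

```python
import re

# Flex-style: overlapping rules. Longest match wins; on a tie, the rule
# listed first wins. (Hypothetical rule table, for illustration only.)
RULES = [
    ("BEGIN", re.compile(r"begin")),
    ("END", re.compile(r"end")),
    ("ID", re.compile(r"[a-z][a-z0-9]*")),
]

def flex_style(text):
    """Return (token name, lexeme) for the token at the start of text."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text)
        # Only a strictly longer match replaces the current best, so on a
        # tie the earlier rule is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# Except-style: a single identifier pattern, then a lookup of exceptions.
IDENT = re.compile(r"[a-z][a-z0-9]*")
EXCEPT = {"begin": "BEGIN", "end": "END"}

def except_style(text):
    """Match an identifier, then reclassify it if it is a listed exception."""
    m = IDENT.match(text)
    if m is None:
        return None
    lexeme = m.group()
    return (EXCEPT.get(lexeme, "ID"), lexeme)

print(flex_style("beginner"), except_style("begin x"))
```

Both functions classify "begin" as a keyword and "beginner" as an identifier; they just get there by different routes.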

Difference:
- Flex regexes use juxtaposition for specifying concatenation.
- ScanGen uses '.' to specify concatenation. And oh by the way, ScanGen calls
it 'catenation' not 'concatenation'

Difference:
- Flex regexes use | for specifying alternation in regexes
- ScanGen uses ',' to specify alternation

Difference:
- With Flex, tossing out characters (e.g., toss out the quotes surrounding a
string) may involve writing C code to reprocess the token
- ScanGen has a 'Toss' command to toss out a character, e.g., Quote(Toss). No
token reprocessing needed
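
A small Python sketch of the idea, assuming a simple double-quoted string with no escape sequences. The function name scan_string is hypothetical and this is not ScanGen's actual mechanism; it only shows quotes being consumed but never stored, so the token text needs no second pass.

```python
def scan_string(text):
    """Scan a double-quoted string at the start of text.

    Returns (lexeme, token_text): lexeme is everything consumed,
    token_text is what is left after tossing the quotes.
    Returns None if text does not start with a complete string.
    """
    if not text.startswith('"'):
        return None
    kept = []
    i = 1  # the opening quote is consumed but tossed
    while i < len(text) and text[i] != '"':
        kept.append(text[i])  # ordinary characters are kept
        i += 1
    if i == len(text):
        return None  # unterminated string
    # The closing quote at text[i] is consumed but tossed.
    return text[:i + 1], "".join(kept)

lexeme, token_text = scan_string('"hi" rest')
print(lexeme, token_text)
```

The reprocessing alternative would be to keep the whole lexeme and strip it afterwards, e.g. lexeme[1:-1], which gives the same token text here.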

Difference:
- Flex regexes use ^ for specifying 'not', e.g., [^ab] means any char except a
and b
- ScanGen regexes use 'Not', e.g., Not(Quote)

Difference:
- Flex deals with individual characters
- ScanGen lumps characters into character classes and deals with classes. Use
of character classes decreases (quite significantly) the size of the
transition table
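
To see why classes shrink the table, here is a toy Python sketch. The class names, the two-state identifier automaton, and the 128-character alphabet are assumptions made for this example, not ScanGen's actual tables.

```python
import string

CLASSES = ["letter", "digit", "blank", "other"]

def char_class(ch):
    """Map a character to a coarse class (hypothetical class names)."""
    if ch in string.ascii_letters:
        return "letter"
    if ch in string.digits:
        return "digit"
    if ch in " \t\n":
        return "blank"
    return "other"

# Tiny DFA for identifiers: letter (letter | digit)*
# State 0 = start, state 1 = inside an identifier.
# The table is indexed by (state, class), not (state, character).
TRANSITIONS = {
    (0, "letter"): 1,
    (1, "letter"): 1,
    (1, "digit"): 1,
}

def is_identifier(text):
    """Run the DFA over text; accept only if it ends in state 1."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:
            return False  # no transition defined: reject
    return state == 1

NUM_STATES = 2
per_char_cells = NUM_STATES * 128            # one column per ASCII character
per_class_cells = NUM_STATES * len(CLASSES)  # one column per class
print(per_char_cells, per_class_cells)
```

With two states the full table drops from 256 cells to 8; the cost is one extra lookup per character to find its class.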

Difference:
- Flex regexes use the ? meta-symbol
- ScanGen doesn't have that. Instead, it has 'Epsilon'
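
The two notations express the same thing: r? is r or the empty string. A quick Python check of that equivalence, using Python's re module rather than either tool's syntax (the pattern names are made up, and the empty alternative stands in for Epsilon):

```python
import re

# Optional minus sign written with the ? meta-symbol.
WITH_OPTIONAL = re.compile(r"-?[0-9]+")
# The same language written as (minus | empty); the empty branch plays
# the role of epsilon.
WITH_EPSILON = re.compile(r"(?:-|)[0-9]+")

def agree(s):
    """True if both patterns accept s or both reject it."""
    return bool(WITH_OPTIONAL.fullmatch(s)) == bool(WITH_EPSILON.fullmatch(s))

for sample in ["42", "-42", "--42", "", "-"]:
    print(repr(sample), agree(sample))
```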

Difference:
- ScanGen has something called a Major number and a Minor number for each
token
- Flex doesn't have that concept
[For the same reason, I don't think it's a good idea to learn only one programming language. -John]
