Re: lexer speed, was Bison

"BartC" <>
Tue, 21 Aug 2012 17:39:38 +0100

          From comp.compilers

Related articles
Bison deterministic LALR(1) parser for Java/C (2012-08-17)
Re: Bison deterministic LALR(1) parser for Java/C (Hans-Peter Diettrich) (2012-08-18)
Re: lexer speed, was Bison (Hans-Peter Diettrich) (2012-08-20)
Re: lexer speed, was Bison (Hans-Peter Diettrich) (2012-08-20)
Re: lexer speed, was Bison (BGB) (2012-08-20)
Re: lexer speed, was Bison (Hans-Peter Diettrich) (2012-08-21)
Re: lexer speed, was Bison (BartC) (2012-08-21)

From: "BartC" <>
Newsgroups: comp.compilers
Date: Tue, 21 Aug 2012 17:39:38 +0100
Organization: A noiseless patient Spider
References: 12-08-005 12-08-006 12-08-008
Keywords: parse, lex, performance
Posted-Date: 21 Aug 2012 20:58:29 EDT

"Hans-Peter Diettrich" <> wrote in message
>> [Compilers spend a lot of time in the lexer, because that's the only
>> phase that has to look at the input one character at a time. -John]
> When the source code resides in a memory buffer, the time for reading
> e.g. the characters of an identifier (in the lexer) is negligible vs.
> the time spent in lookup and entering the identifier into a symbol table
> (in the parser).
> Even if a lexer reads single characters from a file, most OSs maintain
> their own file buffer, so that little overhead is added over the
> program-buffered solution.
> I really would like to see some current benchmarks about the behaviour
> of current compilers and systems.

I was recently testing a new language with a bare lexer for a near-identical
syntax.

This test, which excluded file input and any identifier lookups (just
tokenising), ran at 17M chars/second, 3M tokens/second, and some 800K
lines/second, on a low-end desktop PC. At that rate, tokenising the entire
compiler might take 25 msec.
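
(As a quick check, those figures are mutually consistent:

    17M chars/sec / 3M tokens/sec  ~=  5.7 chars per token
    17M chars/sec / 800K lines/sec ~= 21 chars per line
    25 msec * 17M chars/sec        ~= 425K chars of compiler source

all plausible for dense source code.)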

If that represented the bulk of the compilation time, then I'd be quite
happy! But I suspect it will be just a fraction.

While lexers do have to deal with a character at a time, the processing
might be quite simple (e.g. merely six instructions for each character of an
identifier in my version).
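
To show the kind of inner loop I mean, here is a minimal sketch in C (not
my actual code, which is in my own language); with a precomputed
character-class table, the loop body comes to roughly the half-dozen
instructions mentioned:

#include <ctype.h>
#include <stdio.h>

static unsigned char is_ident_char[256];

static void init_tables(void)
{
    for (int c = 0; c < 256; c++)
        is_ident_char[c] = isalnum(c) || c == '_';
}

/* Copy an identifier from p to out; return the first char past it.
   Inner loop: load, table lookup, test/branch, store, and two
   increments -- about six instructions per character. */
static const char *scan_identifier(const char *p, char *out)
{
    while (is_ident_char[(unsigned char)*p])
        *out++ = *p++;
    *out = '\0';
    return p;
}

int main(void)
{
    char buf[64];
    init_tables();
    const char *rest = scan_identifier("counter1 = 0", buf);
    printf("ident=%s rest=%s\n", buf, rest);   /* ident=counter1 */
    return 0;
}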

Oh, I should mention that this lexer is written in an interpreted,
dynamically typed language. That means a native code version might process
some 20M tokens per second (3 or 4 msec to scan the entire compiler). That
doesn't really suggest it's going to be a bottleneck!
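
(The arithmetic behind that estimate, taking the implied 6-7x speedup of
native code over the interpreter as a guess at typical interpreter
overhead:

    3M tokens/sec * ~6.5          ~= 20M tokens/sec
    425K chars / 5.7 chars/token  ~= 75K tokens in the compiler
    75K tokens / 20M tokens/sec   ~=  4 msec)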

> [The benchmarks I did were a while ago, but they showed a large
> fraction of time in the lexer. -John]

Testing an older compiler (which doesn't lend itself to isolating just the
bare tokeniser) showed it spent about 50% of compilation time on
tokenising, lookups and parsing. That compiler generates simple byte-code;
with a more sophisticated one doing more passes, optimisation and so on,
that fraction would likely be even smaller.

(But it's also possible that my compilers are highly inefficient apart from
the tokenising parts.)

