Re: what scanner scheme is efficient? (Chris Clark USG)
24 Oct 1996 22:29:27 -0400

          From comp.compilers

Related articles
what scanner scheme is efficient? (1996-10-12)
Re: what scanner scheme is efficient? (1996-10-15)
Re: what scanner scheme is efficient? (Ram Bhamidipaty) (1996-10-16)
Re: what scanner scheme is efficient? (Peter Brueckner) (1996-10-18)
Re: what scanner scheme is efficient? (1996-10-20)
Re: what scanner scheme is efficient? (Jan Gray) (1996-10-20)
Re: what scanner scheme is efficient? (1996-10-24)
Re: what scanner scheme is efficient? (James Mansion) (1996-10-24)
Re: what scanner scheme is efficient? (1996-10-30)
Re: what scanner scheme is efficient? (1996-11-12)
Re: what scanner scheme is efficient? (1996-11-15)
Re: what scanner scheme is efficient? (1996-11-19)
Re: what scanner scheme is efficient? (1996-11-21)
[2 later articles]

From: (Chris Clark USG)
Newsgroups: comp.compilers
Date: 24 Oct 1996 22:29:27 -0400
Organization: Digital Equipment Corporation - Marlboro, MA
References: 96-10-076 96-10-081 96-10-097
Keywords: lex, performance

Ram Bhamidipaty <> wrote:
>Is scanning/parsing _speed_ ever reason to go toward using a hash
>table for identifying keywords? I would venture to say no: If one's
>scanner has explicit patterns for each keyword then the "minimal"
>amount of work is done to recognize the keyword itself, whereas with
>the hash table approach, not only must the IDENT be recognized, but
>now a hash value must be computed for _every_ word and then the hash
>table probed, probably this last step is very cheap.

(Jerry Roth) replied (citing a reply by Vern Paxson, the author of flex):
> The response then was that if *speed* is the issue it is better to put
> your keywords into your grammar and have the scanner recognize them
> than it is to use a hash table. Of course other issues besides speed
> (ie, maintainability or table size) may dictate a different approach.

Vern commented on ignoring secondary effects (page faults etc.) in his
analysis. However, in the days of fast RISC machines with high memory
latencies, these cannot always be ignored.

In particular, for each character that the lexer processes it must do
(the equivalent of, see 1 below) a table lookup to find the
appropriate action for that character in that state. That guarantees
a memory read and if the lexer transition is not in the cache, a cache
miss. Thus, for an n letter keyword, there may be as many as n cache
misses.
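To make that per-character lookup concrete, here is a toy table-driven
DFA in C that recognizes just the keyword "if". The table layout and
names are purely illustrative, not what any particular generator emits;
a real lexer's table covers every token, which is why the per-character
memory reads can miss the cache.

```c
#include <string.h>

enum { NSTATES = 3, NCHARS = 256 };
static signed char next_state[NSTATES][NCHARS];

/* Build a toy DFA that accepts exactly "if": 0 -'i'-> 1 -'f'-> 2.
   -1 means "no transition". */
void build_table(void)
{
    memset(next_state, -1, sizeof next_state);
    next_state[0]['i'] = 1;
    next_state[1]['f'] = 2;
}

/* Returns the final state (2 = accepted "if"), or -1 on failure. */
int scan(const char *p)
{
    int state = 0;
    while (*p) {
        /* one table read per input character -- each a potential
           data-cache miss if the row is not resident */
        int s = next_state[state][(unsigned char)*p++];
        if (s < 0)
            return -1;
        state = s;
    }
    return state;
}
```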

A shift and add (xor) hash function does only a few register to
register operations to advance the character, only doing a lookup on
the final hashed value, resulting in only one potential cache miss.
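A sketch of such a shift-and-xor hash (the shift amounts and table size
here are illustrative assumptions, not taken from flex or any other
tool):

```c
/* Shift-and-xor rolling hash: a few register-to-register operations
   per character, with the single memory access (the table probe)
   deferred until the end of the word. */
unsigned hash_word(const char *s, unsigned table_size)
{
    unsigned h = 0;
    while (*s)
        h = (h << 5) ^ (h >> 27) ^ (unsigned char)*s++;  /* registers only */
    return h % table_size;  /* the one probe is then table[h] */
}
```

On a superscalar machine, those shifts and xors can often issue
alongside the loop's other work, which is what makes the hash nearly
"free" when issue slots are plentiful.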

Therefore, if your machine has multiple instruction units, it is
possible that the shift and add hash function can be done entirely in
spare instruction slots and thus be "free". If cache misses are cheap
and instruction issue slots are expensive, the hashing solution loses.
Thus, before deciding whether to incorporate the keywords into the
lexer tables or use a hash function, you should time the results on
your target machine. [Kudos go to Ram for his doing measurements.]

The second consideration, before deciding against a hash function, is
whether you will be doing symbol table lookups on identifiers anyway.
If your identifier processing expects
each identifier to be represented by a symbol node, then you can
simply use those nodes for your keyword hashing and again have your
keyword hashing for free. (By the way, I would recommend a much
larger hash table (than 256), so that normal identifiers do not clash,
say 8K, when using hashing for symbol table lookup.)
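As a sketch of that idea (the names, the 8K table size, and the
chaining scheme are my assumptions, not any particular compiler's
code), the keywords are interned into the same table before scanning
starts, so recognizing them costs nothing beyond the lookup you were
doing anyway:

```c
#include <string.h>
#include <stdlib.h>

#define TABLE_SIZE 8192   /* large, so ordinary identifiers rarely clash */
#define TOK_IDENT  0

typedef struct Sym {
    const char *name;     /* caller's string must outlive the node;
                             a real compiler would copy it */
    int token;            /* keyword token number, or TOK_IDENT */
    struct Sym *next;     /* hash chain */
} Sym;

static Sym *table[TABLE_SIZE];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = (h << 5) ^ (h >> 27) ^ (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Return the node for name, creating one with the given token if new.
   Seed the table with keywords at startup, e.g. intern("while", TOK_WHILE);
   afterward, intern(lexeme, TOK_IDENT)->token tells the scanner whether
   the word it just read was a keyword, at no extra lookup cost. */
Sym *intern(const char *name, int token)
{
    unsigned h = hash(name);
    for (Sym *p = table[h]; p; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    Sym *p = malloc(sizeof *p);
    p->name = name;
    p->token = token;
    p->next = table[h];
    table[h] = p;
    return p;
}
```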

A third consideration is the lexer state table size related issues.
Because lexer transition tables can get quite large, there are various
technologies used for compressing them. If adding keywords to your
lexer changes the size of the lexer table, such that a more powerful
compressing scheme is (or must be) used to compact the tables, then
the overhead of that compressing scheme will affect the speed of all
states which it compresses. (It is worth mentioning that many
generators do not allow the user to select the table compression
mechanism, always making the same default choice, rendering the issue
moot. However, the tool I use, Yacc++(R), does (I believe PGS does
also), so I consider it worth mentioning.)

The most speed efficient but space inefficient method of representing
the transition tables is a 2-d array indexed by state and input
character. However, this can result in very large arrays, since each
state has 256 entries when processing normal (non-multi-byte)
characters (or 128 if you still use 7-bit ASCII). The comb-vector
approach to packing can significantly cut the table size, but adds an
extra check to each table access (potentially halving the lexer's
performance). This situation is similar for the use of character
classes. There are also list packing methods which can entail using a
binary (or, worse, linear) search once the appropriate row or column is
found.
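A minimal illustration of a comb-vector access, with toy table contents
and a two-character alphabet (real generators overlap many rows into
one long vector and chain default transitions):

```c
/* Comb-vector compressed transition lookup: per-state rows are
   overlapped into shared next/check vectors, so each access needs
   an extra ownership comparison. Contents are a made-up example:
   state 0 owns slots 0-1, state 1 owns slots 2-3, alphabet {0,1}. */
enum { DEFAULT_STATE = 0 };

static const int base_[] = { 0, 2 };        /* per-state offset        */
static const int next_[] = { 1, 0, 1, 0 };  /* packed transitions      */
static const int check[] = { 0, 0, 1, 1 };  /* which state owns a slot */

int transition(int state, int ch)
{
    int idx = base_[state] + ch;
    if (check[idx] == state)    /* the extra check the text mentions */
        return next_[idx];
    return DEFAULT_STATE;       /* real lexers chain to a default row */
}
```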

(Note 1: It is also possible that the FSA is implemented in direct
code, similar to recursive descent parsing, rather than table lookups.
The tradeoffs there change to I-cache misses (and program size) rather
than D-cache ones (and data space size).)
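For contrast, here is a direct-coded recognizer for the keyword "if",
where the transitions live in control flow rather than in a data table
(again a toy sketch, not any generator's actual output):

```c
/* Direct-coded DFA for "if": each state is a point in the code, so
   scanning exercises the I-cache and branch predictor instead of
   reading a transition table from the D-cache. */
int scan_direct(const char *p)
{
    /* state 0: expect 'i' */
    if (*p++ != 'i') return -1;
    /* state 1: expect 'f' */
    if (*p++ != 'f') return -1;
    /* state 2: accept only at end of input */
    return *p == '\0' ? 2 : -1;
}
```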

The final point worth mentioning is that depending on your
application, the cost of choosing hashing or incorporating keywords
into the lexer may not be a significant performance hit or boost.
Again, measure before you optimize! On one consulting job I looked at
the performance of the lexer and doubled its running speed, only to
measure afterward and discover that the resulting tokens were being
looked up in a database, which took 25 times as long as the entire
lexing and parsing process.


I had a hand in writing Yacc++, so I am biased.
Chris Clark Internet :
Compiler Resources, Inc. CompuServe : 74252,1375
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)
