Re: Is Minimal Perfect Hashing the wrong algorithm

chase@centerline.com (David Chase)8 Apr 1996 15:47:26 -0400

From comp.compilers

Related articles
Is Minimal Perfect Hashing the wrong algorithm? cef@geodesic.com (Charles Fiterman) (1996-04-02)
Re: Is Minimal Perfect Hashing the wrong algorithm chase@centerline.com (1996-04-08)
Re: Is Minimal Perfect Hashing the wrong algorithm bobduff@world.std.com (1996-04-08)
Re: Is Minimal Perfect Hashing the wrong algorithm? dave@occl-cam.demon.co.uk (Dave Lloyd) (1996-04-08)
Minimal Perfect Hashing and Cichelli's algorithm CSPT@giraffe.ru.ac.za (Pat Terry) (1996-04-10)
| List of all articles for this month |

 From: chase@centerline.com (David Chase) Newsgroups: comp.compilers Date: 8 Apr 1996 15:47:26 -0400 Organization: CenterLine Software References: 96-04-012 Keywords: theory, symbols

Charles Fiterman <cef@geodesic.com> () writes:
> Many compiler writers use Minimal Perfect Hashing to generate some
> hash tables such as a key word table.

> But this algorithm minimizes table size. Shouldn't it minimize lookup
> time with a cap on table size? That is shouldn't it look at the time
> required to create a hash from an entry?

Nowadays, whenever I need a hash table, any hash table, I use the one
that I wrote back in 1986. The reason for this is that there is no
hash table out there whose marginal benefit (in terms of speed,
assurance that it will work) exceeds that of this table. This is both
because I took the time to build it right in the beginning (lots of
profiling to figure out where the time went) and because it has been
so thoroughly used over the years.

Each hash table entry has a list of buckets hanging off of it. The
"hash" function is supposed to return a 32-bit number, which is also
stored in the bucket. (The hash table is parameterized by hash
function, compare function, copy function, and freeing function --
I've used it for everything from strings, to trees, to sets of trees,
to stack backtraces, to pointers.) The table index is derived from
this number, but it is also used for a rapid inequality check when
scanning the list of entries, and to allow more-rapid table
rearrangement if it must grow (not a problem for compiler keywords, of
course). Whenever there's a hit, it also does a swap-to-front on the
bucket list. (Resizing is automatic; the table monitors its
probe-to-call ratio, and if it goes bad, it resizes. The resizing
heuristic grows the table in one of two ways -- if the table is very
full and the probe ratio is bad, then it grows it a lot, but if the
table is not full and the probe ratio is bad, then it grows it just a
little, on the assumption that this will redistribute the buckets in
the offending busy table entries).

The rapid inequality check is an enormous win, at least judging from
the profiling results when I originally put this together (most times,
a lookup costs you one "hash", one "compare", a remainder operation, plus a
little bit of pointer-chasing). Swap-to-front turns out to be pretty
useful for many problems (temporal locality of reference) and not very
expensive for the rest. The one thing to beware of is failure to
provide a good hash function -- the design here assumes that the hash
function provided will have good statistical properties -- it should
act like a random number generator. The fast-inequality-check and the
table growth heuristics both depend on this.

I noodled around a little bit with perfect hashing once, but the
benefits didn't seem to be worth the effort.

speaking for myself,

David Chase
--

Post a followup to this message