Mon, 17 Dec 90 15:43:56 +0100

Related articles |
---|

hash specifics: GOOD NEWS te@prosun.first.gmd.de (Thilo Ernst) (1990-12-17) |

Re: hash specifics: GOOD NEWS jeffj@cs.umr.edu (1990-12-19) |

Newsgroups: | comp.compilers |

From: | Thilo Ernst <te@prosun.first.gmd.de> |

Keywords: | design |

Organization: | Compilers Central |

Date: | Mon, 17 Dec 90 15:43:56 +0100 |

Hello hash-fellows,

In a 1985 paper in Comm. of the ACM, T.J. Sager of the University of Missouri

presented the so-called mincycle algorithm for computing perfect (collision-

free) hash functions.

Although the problem is nasty (NP-complete) in general, the algorithm

was shown to work in polynomial time for virtually all cases.

Sager's hash functions have the general form

H(w) = ( h0(w) + g (h1 (w)) + g (h2 (w)) ) mod M

which doesn't seem very efficient on first glance, but have a closer look:

h0, h1, h2:

three functions which map the string into a triple of (bounded)

integers - usually ord (w[j]) mod k, with k some power of 2.

The only restriction on their simplicity is that there be no

fatal conflicts, i.e. two keywords with the same (h0, h1, h2)

triple.

M: equals N, the number of keywords to hash, or slightly greater

(may be forced to be a power of 2, too)

g: a stored mapping, i.e. an integer array.

This is the main output of the mincycle algorithm. Its size will

be around N/3.

As you see, all operations occuring in the definition can be implemented very

efficiently on machine level. h0, h1, h2 may be simple bitwise AND's on

single bytes of the string to hash; for small keyword sets, one of them

can be set to zero, that is, omitted in practice.

M is an output of Sager's algorithm, with the property M>=N. However (good

news continued), in all his (and our) tests, M was equal to N. That means,

there are no gaps in the hashing table; Sager's hash functions are *minimal*

(although he had no proof). BTW, Sager computed MPHF's for keyword sets up to

a size of 512.

I don't know if this is the best result about hash functions, but in my

opinion it is a significant breakthrough compared to the work listed in the

posting of J. R. Levine, which falls in two categories: bad or unknown

algorithm complexity (restricting the keyword set to small sizes) and bad or

inefficient resulting hash functions (non-perfect/non-minimal HF's,

large-number divisions required, etc.).

And now for the best: we have done it, Adriaan. We included an implementation

of the mincycle algorithm (with some improvements and optimizations) in our

compiler compiler to speed up the lexical analyzers generated and were very

satisfied with the results. Our largest example had 350 keywords and took

about ten minutes to compute a MPHF on a PC/AT.

The implementation was done in Modula-2, and I don't know if it would be of

use for you. I'd rather suggest you to e-mail me your keyword set(s) and to

look forward to my reply, including MPHF, in January.

(Sorry for errors/inexact information - unfortunately, I don't have Sager's

paper here; neither my implementation and the tech. rep. about it.

However, to meet Adriaan Degroot's mail deadline, I had to write here & now.

Also, sorry for abuse of the English language, if any.)

Regards, Thilo.

Thilo Ernst ( te@gmdtub.uucp, te@prosun.first.gmd.de )

Institute of Informatics and Computing Techniques Berlin

Rudower Chaussee 5, D-O-1199 Berlin, Germany

Phone: (+37-2-) 6 74 51 52, (+49-30-) 25 49 91 16

Fax: (+37-2-) 6 76 22 00

[Thanks for pointing this one out, I'd completely missed it. The reference is

Sager, Thomas J., "A Polynomial Time Generator for Minimal Perfect Hash

Functions," CACM 28:5, May 1985, pp. 523-532. The article includes a listing

of a Pascal version of his algorithm. -John]

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.