
Related articles:
- Question: Table Compression Methods in Dragon Book SCHMIDTG@iccgcc.cs.hh.ab.com (1994-10-09)
- Re: Question: Table Compression Methods in Dragon Book janpeter@mpi.nl (Jan Peter de Ruiter) (1994-10-10)
- Re: Table Compression Methods in Dragon Book nandu@cs.clemson.edu (1994-10-17)

Newsgroups: comp.compilers
From: nandu@cs.clemson.edu
Keywords: yacc
Organization: Compilers Central
References: 94-10-071 94-10-080
Date: Mon, 17 Oct 1994 21:28:42 GMT

Greg Schmidt wrote:

> There is a section titled "Table-Compression Methods" at the end of the

> chapter discussing lexical analysis in the newer dragon book. The intent

> is to implement efficient transition tables by leveraging off of the

> similarities among states, but the example is unclear to me. I don't

> understand what is meant by "default" transitions or "special" states.

> I would be grateful if some kind soul could provide a simple example

> showing the appropriate structures.

Jan Peter de Ruiter wrote:

!I too have problems understanding this mechanism from the dragon book.

!Even my compiler guru admits he doesn't understand it. I'd like to see an

!algorithm (in pseudocode or something equivalent) that describes how one

!*builds up* this table. Once you have the table it's easy to apply the

!pseudocode from the dragon book, but how the h*ll do you create it is my

!problem.

Let us consider a DFA that accepts the following regular expressions and
returns the corresponding tokens:

end      {return END}
else     {return ELSE}
[endls]+ {return IDENTIFIER}

The keywords are "end" and "else" and everything else is an
identifier. The state transition table is:

state |  e  n  d  l  s
------+----------------
  0   |  1  7  7  7  7
  1   |  7  2  7  2  7
  2   |  7  7  3  7  7
  3   |  7  7  7  7  7
  4   |  7  7  7  7  5
  5   |  6  7  7  7  7
  6   |  7  7  7  7  7
  7   |  7  7  7  7  7

(Note: as written, state 1 goes to state 2 on both "n" and "l", which
makes state 4 unreachable, so this particular DFA never actually
reaches the ELSE state; a transition from state 1 to state 4 on "l"
would fix that. The table is used as-is below, since the point here is
the compression method.)
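For concreteness, the uncompressed table is just direct two-dimensional
indexing. A minimal sketch (the column numbering e,n,d,l,s = 0..4 is an
assumption carried through the rest of this post):

```python
# Uncompressed baseline: the full 8x5 adjacency matrix, indexed directly.
COLS = {'e': 0, 'n': 1, 'd': 2, 'l': 3, 's': 4}
MATRIX = [
    [1, 7, 7, 7, 7],   # state 0
    [7, 2, 7, 2, 7],   # state 1
    [7, 7, 3, 7, 7],   # state 2
    [7, 7, 7, 7, 7],   # state 3 (END)
    [7, 7, 7, 7, 5],   # state 4
    [6, 7, 7, 7, 7],   # state 5
    [7, 7, 7, 7, 7],   # state 6 (ELSE)
    [7, 7, 7, 7, 7],   # state 7 (IDENTIFIER)
]

def move(state, ch):
    return MATRIX[state][COLS[ch]]   # one O(1) lookup, 40 stored entries
```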

Such a 2-dimensional table can be compressed so fewer entries are

actually stored.

An obvious scheme is to store an adjacency list instead of the
adjacency matrix: each state stores a list of transitions on certain
inputs. For example, state 1 carries three transitions: to state 2 on
input "n", to state 2 on input "l", and, if both of those cases fail,
a third, "default" transition to state 7, "default" because that is
the transition on any other input. (This representation could be
further compressed by exploiting the fact that on both inputs "n" and
"l" the next state is 2.) Adjacency lists conserve space, but the
overhead is in searching through the lists for the appropriate
transition.
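The adjacency-list scheme above can be sketched as follows (the data
structure and names are mine, not from the book): each state keeps only
its special transitions plus a single default target.

```python
# Each state maps to (special transitions, default target).
transitions = {
    0: ([('e', 1)], 7),
    1: ([('n', 2), ('l', 2)], 7),
    2: ([('d', 3)], 7),
    3: ([], 7),
    4: ([('s', 5)], 7),
    5: ([('e', 6)], 7),
    6: ([], 7),
    7: ([], 7),
}

def step(state, ch):
    specials, default = transitions[state]
    for c, target in specials:   # linear search: the time overhead
        if c == ch:
            return target
    return default               # "default": taken on any other input
```

Here only 6 special transitions and 8 defaults are stored, 14 entries
against the matrix's 40, at the cost of a linear search on every move.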

The default-base-next-check method has the advantage of direct access,
since all the transitions are stored in arrays. The disadvantage (the
worst-case scenario, in my opinion) is that the next-check arrays can
each end up as large as the adjacency matrix. I have trouble believing
the first part of the discussion on page 145 of the dragon book, where
the authors write "...Here we use a data structure consisting of four
arrays indexed by state numbers...". The fact is that the next-check
arrays can be much larger than the number of states, as we will see in
the following example. Anyway...

The basic idea of this method is to store the two-dimensional
adjacency matrix as a linear array; in other words, to lay out the
rows of the matrix linearly. Simply linearizing the rows would mean a
space utilization equal to that of the adjacency matrix. Instead, what
if some of the entries could be overlapped? Well, that is where the
default transitions help. If we look at the transitions out of state 1
of the above table from a different angle, we can think of all
transitions out of state 1 as being "default" and the transitions on
inputs "n" and "l" as being "special", since the default transitions
outnumber the special cases in the above example. So the "next" table
is nothing but a linear representation of the adjacency matrix with
some of the entries overlapped (hence the table compression).

At this point, we know that we have to linearize the rows, but we also
know that a dumb approach will not buy us any _compression_. One
heuristic that I have given some thought to is to identify some sort
of an "inheritance" tree; in other words, to identify rows such that
the special transitions in the rows do not clash when overlapped.
(Informally, a row inherits from another if all the entries in the
inheriting row are the same as in the base row, except for the special
transitions.) For example, if the rows corresponding to states 0 and 5
are overlapped, the special transitions on "e" clash, but overlapping
0 and 3 does not result in any clashes. Similarly, 0, 1 and 4 can be
overlapped without their special cases clashing with each other. So we
have to identify such combinations so that the rows can be stored
within the same "space" in the linear organization. This heuristic
will buy us storage space in most cases. Once we identify such
combinations, we start filling the next-check arrays, beginning at the
lowest-numbered index possible and laying out the rows one by one,
overlapping as much as possible. The "next" array's entries fall
directly out of the adjacency matrix, and the "check" array's entries
are based on the construction of the "next" array.
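Since the original question was how one *builds* the table, here is one
possible construction in the spirit just described: a first-fit sketch
of my own (the book does not spell out an algorithm). For each state it
stores only the transitions that differ from its default state's row,
and slides the row's base right until those special entries land in
free slots.

```python
matrix = [            # rows = states 0..7, columns = e, n, d, l, s
    [1, 7, 7, 7, 7],
    [7, 2, 7, 2, 7],
    [7, 7, 3, 7, 7],
    [7, 7, 7, 7, 7],
    [7, 7, 7, 7, 5],
    [6, 7, 7, 7, 7],
    [7, 7, 7, 7, 7],
    [7, 7, 7, 7, 7],
]
default = [3, 3, 3, None, 3, 3, 3, 3]    # state 3 is the root default

def build(matrix, default, order):
    base = [0] * len(matrix)
    nxt, check = [], []
    for s in order:
        if default[s] is None:           # root row: store every column
            specials = list(enumerate(matrix[s]))
        else:                            # store only where the row differs
            specials = [(c, t) for c, t in enumerate(matrix[s])
                        if t != matrix[default[s]][c]]
        b = 0                            # first-fit: smallest clash-free base
        while any(b + c < len(check) and check[b + c] is not None
                  for c, _ in specials):
            b += 1
        base[s] = b
        for c, t in specials:
            while len(nxt) <= b + c:     # grow the arrays as needed
                nxt.append(None)
                check.append(None)
            nxt[b + c], check[b + c] = t, s
    return base, nxt, check

def lookup(s, c, base, nxt, check, default):
    while True:
        i = base[s] + c
        if i < len(check) and check[i] == s:
            return nxt[i]
        s = default[s]                   # fall back to the default state
```

With the insertion order 3, 6, 7, 0, 1, 4, 5, 2 used below, the arrays
this produces agree with the adjacency matrix on every (state, input)
pair; the exact layout may differ from the hand-built table, since
first-fit packs rows as tightly as it can.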

In the table above, we notice that states 3, 6 and 7 can be treated as
default states, since (1) they are all identical anyway and (2) all
other states can be treated as "inheriting" from them. So let us lay
them out first in the next-check table. Next, let us insert rows 0 and
1, overlapping them as much as possible. At this point the tables look
as follows:

state  default  base      index  next  check
---------------------------------------------
  0       3       5          0     7     3
  1       3       5          1     7     3
  2                          2     7     3
  3       X       0          3     7     3
  4                          4     7     3
  5                          5     1     0
  6       3       X          6     2     1
  7       3       X          7     X     X
                             8     2     1
                             9     X     X

We will assume that the offsets from the base index into the
next-check arrays are computed using the values 0 through 4,
corresponding to e, n, d, l and s respectively. For this table and the
subsequent one, treat X as an invalid index that does not exist in the
table. We see that we already need 10 entries in the next-check arrays
even though some are not used: the base index for states 0 and 1 is 5,
and each can be faced with 5 different inputs, so offsets 0 through 4
beyond the base are needed, reaching index 9. For instance, if, at
state 0, an input "e" is seen, the next state is computed directly as
state 1 (index 5, where the check entry is 0, the current state). If
the input "s" is seen instead, the check entry at index 9 is not 0, so
the default state (3) is taken as the current state and the next state
is computed as 7. And so on. The final table, after inserting the rows
in the order 3, 6, 7, 0, 1, 4, 5, 2, is:

token  state  default  base      index  next  check
----------------------------------------------------
IDENT    0       3       5          0     7     3
IDENT    1       3       5          1     7     3
IDENT    2       3       8          2     7     3
END      3       X       0          3     7     3
IDENT    4       3       5          4     7     3
IDENT    5       3       7          5     1     0
ELSE     6       3       X          6     2     1
IDENT    7       3       X          7     6     5
                                    8     2     1
                                    9     5     4
                                   10     3     2
                                   11     X     X
                                   12     X     X

space usage = 2*8 (default-base) + 2*13 (next-check) = 42 entries

size of the adjacency matrix = 8*5 = 40 entries
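Transcribed into arrays (with X as None), the final table drives the
dragon book's lookup loop directly; a small sketch:

```python
# The four arrays indexed by state, plus the shared next-check arrays.
DEFAULT = [3, 3, 3, None, 3, 3, 3, 3]
BASE    = [5, 5, 8, 0, 5, 7, None, None]
NEXT    = [7, 7, 7, 7, 7, 1, 2, 6, 2, 5, 3, None, None]
CHECK   = [3, 3, 3, 3, 3, 0, 1, 5, 1, 4, 2, None, None]
OFFSET  = {'e': 0, 'n': 1, 'd': 2, 'l': 3, 's': 4}

def next_state(s, ch):
    c = OFFSET[ch]
    while True:
        b = BASE[s]
        if b is not None:
            i = b + c
            if i < len(CHECK) and CHECK[i] == s:
                return NEXT[i]       # this entry really belongs to s
        s = DEFAULT[s]               # otherwise retry from the default state

def scan(word):
    s = 0
    for ch in word:
        s = next_state(s, ch)
    return s
```

For example, scan("end") walks 0 -> 1 -> 2 -> 3 and ends in state 3,
the END state, exactly as the full matrix would.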

Although we have ended up using 2 entries more than the adjacency
matrix in this example, this method would compress the table in any
typical scanner, where the number of states is far more than 8 and
almost all possible inputs have to be considered. The order of
inserting the rows also matters: for instance, inserting the rows in
the order 6, 7, 3, 1, 4, 2, 0, 5 uses 46 entries.

I have also included a "token" entry with each of the states so that,
when the end of the current token is recognized, the value returned is
the token at the final state. (The dragon book mentions this in a
footnote.)

Bibliography

--------------

@book{AAVUJD77,
  author    = "Aho, A. V. and Ullman, J. D.",
  title     = "Principles of Compiler Design",
  publisher = "Addison-Wesley",
  address   = "Reading, {MA}",
  year      = 1977
}

@book{AVARSJDU86,
  author    = "A. V. Aho, R. Sethi and J. D. Ullman",
  title     = "Compilers: Principles, Techniques and Tools",
  publisher = "Addison-Wesley",
  year      = 1986
}

@article{RETACY2211,
  author  = "Robert E. Tarjan and A. C. Yao",
  title   = "Storing a sparse table",
  journal = "Communications of the ACM",
  volume  = 22,
  number  = 11,
  pages   = "606--611",
  year    = 1979
}

@article{dencker:84,
  author  = "Peter Dencker and Karl {D\"urre} and Johannes Heuft",
  title   = "Optimization of Parser Tables for Portable Compilers",
  journal = "ACM Transactions on Programming Languages and Systems",
  volume  = 6,
  number  = 4,
  pages   = "546--572",
  month   = oct,
  year    = 1984
}

@article{fredman:84,
  author  = "M. L. Fredman and J. Komlos and E. Szemeredi",
  title   = "Storing a sparse table with {O}(1) worst case access time",
  journal = "Journal of the ACM",
  volume  = 31,
  number  = 3,
  pages   = "538--544",
  year    = 1984
}

---------------------------------------------------------------------------

Nandakumar Sankaran, G34 Jordan, Comp. Sci. Dept., Clemson Univ. SC 29634

311-8 Old Greenville Hwy. Clemson SC 29631-1651 (803)653-7749

http://www.cs.clemson.edu/~nandu/nandu.html nandu@cs.clemson.edu
