Tue, 16 Mar 1993 23:08:26 GMT

Related articles |
---|

regular expressions (bug-report) megatest!plethorax!djones@uu4.psi.com (1993-03-14) |

Re: regular expressions (bug-report) Paul.Klint@cwi.nl (1993-03-15) |

Re: regular expressions (bug-report) mab@wdl39.wdl.loral.com (1993-03-15) |

regular expressions (bug-report) jos@thinx.convex.com (Jos Horsmeier) (1993-03-16) |

Re: regular expressions (bug-report) pardo@cs.washington.edu (1993-03-19) |

Re: regular expressions (bug-report) henry@zoo.toronto.edu (1993-03-25) |

Re: regular expressions (bug-report) kanze@us-es.sel.de (1993-03-26) |

Newsgroups: | comp.compilers |

From: | Jos Horsmeier <jos@thinx.convex.com> |

Originator: | daemon@toonder.rivm.nl |

Keywords: | lex |

Organization: | AND Software Inc, Texas |

References: | 93-03-046 |

Date: | Tue, 16 Mar 1993 23:08:26 GMT |

megatest!plethorax!djones@uu4.psi.com (Dave Jones) writes:

[ Transforming an NFA into a DFA: ]

|However it is not immediately clear to me how to handle character range

|matches. For example, most reg-exp packages allow you to indicate a range

|of matches with a construct such as this:

| [a-zA-Z]

|The above means, match any letter, either upper or lower case. There is

|potential ecconomy for the implementation as well as for the user, because

|the processor need not check for a match against every letter in the

|range. It can take advantage of the fact that in ASCII, the letters a-z

|are contiguous, and merely check to see if the current character is in

|range.

|It is not immediately obvious to me how to handle these character-ranges

|when converting an NFA to a DFA.

|The simplest way would be to split the construct [a-zA-Z] into 52

|single-character transitions. But the character-range tests should be

|preserved in the implementation when possible. When it is not possible,

|perhaps a subrange or a couple of subranges could be used in the DFA.

It's just a matter of implementation. One implementation of the DFA simply

pops those character ranges into existence. After transforming the NFA

into a DFA in the normal way, i.e. build a separate transition for all

separate characters, simply ignore those character classes, one can

represent the DFA as a matrix, where all the rows are labeled with the

state labels of the DFA and all columns are labeled with all the

characters in the input alphabet. As known, these matrices are usually

very sparse. Now, have a look at any row: if this row contains two or more

cells, all adjacent, and their contents is the same, we've found a range

of characters that all cause the DFA to do a transition from the state

labeled at that particular row, to the state stored in all those cells.

An example:

a b c d e f g h i j k l m n

+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

S | T | T | T | T | | | U | U | U | | | V | | |

+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

This row represents all transitions starting from state S. This row could

easily be represented as:

S: R#(a,d)T, R#(g,i)U, S#(l)V

Here R#(x,y) represents a range [x-y] and S#(x) represents a single

character x. The last characters (T, U and V) represent the resulting

state after the transition.

If long ranges exist in those row, this `range representation' is quite a

compact method of storing the entire DFA. In practice, as I have noticed,

if one represents those ranges as strings, lots of rows in the matrix are

(proper) prefixes of other rows. So, taking advantage of that, one could

easily reduce the amount of memory needed for the storage of a DFA.

So, concluding: let go of the idea of preserving character ranges, while

doing the transformation of the NFA into a DFA, because those character

classes (and maybe others) will show up again when the DFA construction is

ready.

kind regards,

Jos aka jos@and.nl aka thinx!jos@convex.com

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.