2 Nov 1997 23:10:41 -0500

Related articles |
---|

Some Finite State Machine and Parsing problems rboland@csi.uottawa.ca (Ralph Boland) (1997-11-02) |

Re: Some Finite State Machine and Parsing problems clark@quarry.zk3.dec.com (1997-11-03) |

From: | Ralph Boland <rboland@csi.uottawa.ca> |

Newsgroups: | comp.compilers,comp.theory |

Date: | 2 Nov 1997 23:10:41 -0500 |

Organization: | Department of Computer Science, University of Ottawa |

Distribution: | inet |

Keywords: | DFA, theory, question |

I am quite interested in algorithms relating to

Finite state machines (FSMs) and regular expressions (REs).

Here are two problems that perhaps others will find

interesting. If you have solutions or references of

use to me I would very much appreciate hearing from you.

Problem 1)

I have recently discovered FSTs (finite state tranducers).

That is FSM like machines that output one character for each

character of input. I wondered if there exists expressions

that correspond to FSTs the way REs correspond

to FSMs. I came up with the following:

Let <a,b> denote the operation of inputting a and

outputting b. Then using the RE operators of concatenation, or,

kleene closure, and paranthesis we can write expressions like:

<a,b><d,f>*(<g,h>|<i,j>)

I call such expressions regular expression tranducers (RETs).

Clearly for each FST there EXISTs a (actually more than one) RET.

Q1.1 Are there useful applications where we would want to

represent an FST by a RET?

Unlike REs-FSMs there does not necessarily exist an FST corresponding

to a given RET. Things can go wrong three ways:

a) The RET can be ambiguous.

b) The RET can require a finite lookahead.

c) The RET can require an infinite (arbitrarily large) lookahead.

An example of c) is: (<a,b>*<c,d>) | (<a,e>*<f,g>)

In this example we may input an arbitrarily large number of a's before

we input an c or an f. So output must be delayed while an

arbitrarily large amount of input is processed. Yet this RET is

not ambiguous.

Q1.2 Are there algorithms for detecting if an RET has

any of the above problems? References please.

Q1.3 If a RET has problem b) then (I claim without proof that)

LR(k) parsing can be used to process the RET where the RET

requires a lookahead of k elements. What is the simplest

parsing technowlogy that can handle such RETs?

RRP(k) parsing is a generalization of LR(k) parsing in which regular

expressions are allowed on the right hand side of productions.

We can extend RRP(k) parsing to allow RETs on the right hand side

of each production. It seems to me that this can then be parsed

using RRP(k) parsers. That is we can construct a RRP(k) like

parser to parse the grammar if a RRP(k) parser can parse the

same grammar with the output information removed.

Q1.4 Can we really do this and if so what are the applications?

References please.

RRP(k) parsers work by processing input twice, once in each direction.

Q1.5 If we try the same trick with a RET can we build a machine

with a finite number of states corresponding to the RET?

Clearly we can process the input from left to right with

a finite number of states. Can we then process the input

a second time in the reverse direction so that we can

output a character for each character of input read?

I'm ignoring here the problem that the output will then be

in reverse order.

Problem 2)

We can easily write a grammar cooresponding to a reqular expression

and such grammars are easy to parse. Consider the following

generalization of this problem.

First we write the regular expression in the form

X = (a regular expression).

This provides a label for the regular expression.

Second, we allow the label to be used in the regular

expression itself. For example we can write an

expression for a palidrome this way:

Y = aYa | bYb | (empty set)

I call such expressions recursive regular expressions (RREs).

Q2.1 Can we translate such expressions into a grammar?

I assume that answer is yes since regular expressions

can easily be converted into a grammar.

Q2.2 Are there parsing techknowlogies that can parse such

grammars efficiently, preferably in linear time.

We say that a RRE is simple if, when we remove the kleene

star operator from the expression and treat the expression variable

as simply an input character, the expression variable occurs

at most once in any input. For example:

X = (a(bXc)*d) | (eXf)* | (gh*i)

is simple because the simplified expression is

"(a(bXc)d) | (eXf) | (gh*i)"

and the only legal inputs are then

"abXcd", "eXf", and "ghi"

each of which has X occur at most one time.

However the expression

X = aXbXc

is not simple since the the only accepted input in this case is

"aXbXc"

which has two X's in it.

Q2.3 (Questions Q2.1 and Q2.2 but for simple RREs)?

Q2.4 Assume that we allow input to be read from either end of

the input stream. Can we build a machine with a finite

number of states to process any given simple RRE?

Can we minimize the number of states of such a machine?

Note that the best solution that I have here requires a

doublely exponential amount of time (Exponential in the

size of the machine and the machine may have an exponential

number of states) even if the expression is a RE.

Note that, though LR(k) parsing cannot handle a palidrome

we can construct a simple finite state machine (that reads from both

ends of the input stream) that can. Consider:

Y = aYa | bYb | (empty set)

The machine is

state 1 (start) forward input a: go to state 2.

forward input b: go to state 3.

empty input : accept.

state 2 backward input a: go to state 1.

backward input b: reject.

empty input : reject.

state 3 backward input a: reject.

backward input b: go to state 1.

empty input : reject.

I hope some of you find these problems interesting

and can provide me with some answers or directions for

future research.

Thanks in advance for your reading this (rather long)

posting.

Ralph Boland

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.