# Some Finite State Machine and Parsing problems

## Ralph Boland <rboland@csi.uottawa.ca>2 Nov 1997 23:10:41 -0500

From comp.compilers

Related articles
Some Finite State Machine and Parsing problems rboland@csi.uottawa.ca (Ralph Boland) (1997-11-02)
Re: Some Finite State Machine and Parsing problems clark@quarry.zk3.dec.com (1997-11-03)
| List of all articles for this month |
 From: Ralph Boland Newsgroups: comp.compilers,comp.theory Date: 2 Nov 1997 23:10:41 -0500 Organization: Department of Computer Science, University of Ottawa Distribution: inet Keywords: DFA, theory, question

I am quite interested in algorithms relating to
Finite state machines (FSMs) and regular expressions (REs).
Here are two problems that perhaps others will find
interesting. If you have solutions or references of
use to me I would very much appreciate hearing from you.

Problem 1)

I have recently discovered FSTs (finite state tranducers).
That is FSM like machines that output one character for each
character of input. I wondered if there exists expressions
that correspond to FSTs the way REs correspond
to FSMs. I came up with the following:

Let <a,b> denote the operation of inputting a and
outputting b. Then using the RE operators of concatenation, or,
kleene closure, and paranthesis we can write expressions like:

<a,b><d,f>*(<g,h>|<i,j>)

I call such expressions regular expression tranducers (RETs).
Clearly for each FST there EXISTs a (actually more than one) RET.

Q1.1 Are there useful applications where we would want to
represent an FST by a RET?

Unlike REs-FSMs there does not necessarily exist an FST corresponding
to a given RET. Things can go wrong three ways:

a) The RET can be ambiguous.
b) The RET can require a finite lookahead.
c) The RET can require an infinite (arbitrarily large) lookahead.

An example of c) is: (<a,b>*<c,d>) | (<a,e>*<f,g>)
In this example we may input an arbitrarily large number of a's before
we input an c or an f. So output must be delayed while an
arbitrarily large amount of input is processed. Yet this RET is
not ambiguous.

Q1.2 Are there algorithms for detecting if an RET has
any of the above problems? References please.

Q1.3 If a RET has problem b) then (I claim without proof that)
LR(k) parsing can be used to process the RET where the RET
requires a lookahead of k elements. What is the simplest
parsing technowlogy that can handle such RETs?

RRP(k) parsing is a generalization of LR(k) parsing in which regular
expressions are allowed on the right hand side of productions.
We can extend RRP(k) parsing to allow RETs on the right hand side
of each production. It seems to me that this can then be parsed
using RRP(k) parsers. That is we can construct a RRP(k) like
parser to parse the grammar if a RRP(k) parser can parse the
same grammar with the output information removed.

Q1.4 Can we really do this and if so what are the applications?

RRP(k) parsers work by processing input twice, once in each direction.

Q1.5 If we try the same trick with a RET can we build a machine
with a finite number of states corresponding to the RET?
Clearly we can process the input from left to right with
a finite number of states. Can we then process the input
a second time in the reverse direction so that we can
output a character for each character of input read?
I'm ignoring here the problem that the output will then be
in reverse order.

Problem 2)

We can easily write a grammar cooresponding to a reqular expression
and such grammars are easy to parse. Consider the following
generalization of this problem.

First we write the regular expression in the form

X = (a regular expression).

This provides a label for the regular expression.

Second, we allow the label to be used in the regular
expression itself. For example we can write an
expression for a palidrome this way:

Y = aYa | bYb | (empty set)

I call such expressions recursive regular expressions (RREs).

Q2.1 Can we translate such expressions into a grammar?
I assume that answer is yes since regular expressions
can easily be converted into a grammar.

Q2.2 Are there parsing techknowlogies that can parse such
grammars efficiently, preferably in linear time.

We say that a RRE is simple if, when we remove the kleene
star operator from the expression and treat the expression variable
as simply an input character, the expression variable occurs
at most once in any input. For example:

X = (a(bXc)*d) | (eXf)* | (gh*i)

is simple because the simplified expression is

"(a(bXc)d) | (eXf) | (gh*i)"

and the only legal inputs are then

"abXcd", "eXf", and "ghi"

each of which has X occur at most one time.
However the expression

X = aXbXc

is not simple since the the only accepted input in this case is

"aXbXc"

which has two X's in it.

Q2.3 (Questions Q2.1 and Q2.2 but for simple RREs)?

Q2.4 Assume that we allow input to be read from either end of
the input stream. Can we build a machine with a finite
number of states to process any given simple RRE?
Can we minimize the number of states of such a machine?
Note that the best solution that I have here requires a
doublely exponential amount of time (Exponential in the
size of the machine and the machine may have an exponential
number of states) even if the expression is a RE.

Note that, though LR(k) parsing cannot handle a palidrome
we can construct a simple finite state machine (that reads from both
ends of the input stream) that can. Consider:

Y = aYa | bYb | (empty set)

The machine is

state 1 (start) forward input a: go to state 2.
forward input b: go to state 3.
empty input : accept.

state 2 backward input a: go to state 1.
backward input b: reject.
empty input : reject.

state 3 backward input a: reject.
backward input b: go to state 1.
empty input : reject.

I hope some of you find these problems interesting
and can provide me with some answers or directions for
future research.