|Limits of regular languages firstname.lastname@example.org (e0726905) (2010-02-11)|
|Re: Limits of regular languages email@example.com (Matthias-Christian Ott) (2010-02-13)|
|Re: Limits of regular languages firstname.lastname@example.org (Kaz Kylheku) (2010-02-13)|
|Re: Limits of regular languages cfc@shell01.TheWorld.com (Chris F Clark) (2010-02-13)|
|Re: Limits of regular languages email@example.com (Max Hailperin) (2010-02-13)|
|Re: Limits of regular languages firstname.lastname@example.org (e0726905) (2010-02-27)|
|From:||Chris F Clark <cfc@shell01.TheWorld.com>|
|Date:||Sat, 13 Feb 2010 16:13:10 -0500|
|Organization:||The World Public Access UNIX, Brookline, MA|
|Posted-Date:||13 Feb 2010 19:35:04 EST|
e0726905 <email@example.com> writes:
> Is there a way implement the usual longest-match behaviour of a scanner
> with a DFA without any backtracking mechanisms, meaning is that
> behaviour strictly regualr or not?
If I understand the question you are tyring to ask, the answer is "no".
consider the following regular expression:
a | a+ b
and the input where you see a string of a characters. You cannot by
looking at it one a character at a time determine whether to return
the single a case, or the a+ b case, until you see something that
isn't an a character. Thus, for any fixed lookahead, which of the two
choices is matched is ambiguous.
Now, if one looks at the above issue, one wonders how all regular
languages can be LL(1), which they are. How do we resolve that
seeming conflict? The answer is perspective. The theory of regular
expressions is forumlated in terms of what strings the expression
generates (or recognizes). However, when dealing with recognition, it
is assumed that the string is the entire input. In some places, this
is made explicit by discussing end-markers. The key point is that the
machine knows where the end of the string is during recognition and if
it is in a terminal state at the end of the string, the string is
Now, compare this to the situation where we are tokenizing text. We
don't know where the end of the token is (apriori, before hand), we
are trying to find it. Thus, if there are extra a characters after
the first a, we don't know whether we should terminate the token or
not. If those a characters are followed by a b character, then we
should not have terminated the token. However, if those a characters
are followed by anything other than a b character (including the end
marker), then a single a character is the correct token.
This subtlety is skipped over in all the texts that I have read on the
topic. And at some level, it is what separates theory from practice.
We have a perfectly good theory of regular expressions, NFAs, and
DFAs. However, when we apply those theories in practice, we don't
always meet all the requirements of the theory and those discrepancies
are often swept under the rug.
So, while regular languages have a simple theory, their use in
tokenizing a text, doesn't strictly meet the requirements of the
theory. Thus, when tokenizing, one needs the ability to keep the
longest match seen so far and back track to it if the string doesn't
match a longer potential match. However, worth noting is that you
don't need a stack of these when using a DFA matcher, one longest
match seen suffices. I will leave that as an exercise for you to
prove that to yourself.
And, therein lies some insight. The problem isn't the theory. It
isn't even the practice. The key is that the theory is close enough
to the practice, that the details that get swept under the rug don't
invalidate the theory entirely. They just require slight adaptations
when being used in practice. The amount of adaptation the theory can
be bent to accept is limited though. For example, the theoretical
equivalence of DFAs and NFAs has certain requirements that need to be
However, many of the extensions to regular expressions (captures and
backreferences, greedy, lazy, and possesive operators) stress that
theory farther than we know how to bend it. At that point the
equivalence between DFAs and NFAs break down, and the reliance on back
tracking methods to get the "desired" result become significant.
> I know that backreferences make a language non-regular (more about that
> later), but what forms of capture are regualr? Are there any tools out
> there who always answer membership in O(n) that support some kind of
> subexpression capture? (Without having to hack around yourself with
> embedded actions.)
There are forms of capture that are regular, and forms corresponding
to the various classes of languages, LL, LR, etc. I believe that
general capture is equivalent to a TM, even without back references.
Ville Laurikari's work examines this fairly thoroughly and others have
taken it even further.
> I also have two rather unrelated questions that have been bugging me for
> quite some time:
> Is every regular language LL(1)? I know that every regular language is a
> context-free language, and that every LL(1) language is also
> context-free. But are all regular languages contained in LL(1)?
Yes. All regular languages can be rewritten as a right recursive
grammar that is LL(1). That said, you need to treat that with the
proper prespective. If you are writing an LL(1) tokenizer, you still
have to deal with the end of token issue described above.
> What class of language do regexes with backreferences fall into?
> Membership is NP-Complete so they are not in CFG (or are they? assuming
> thanks, fabian
BTW, the theory gets stressed even further if you don't know where the
match should start, i.e. one is looking for a regular expression as a
subset of a string and attempting to find both the start and end
boundaries. Solving that problem efficiently with fixed hardware is
interesting to me. It is more subtle than it looks.
Hope this helps,
Chris Clark email: firstname.lastname@example.org
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
Return to the
Search the comp.compilers archives again.