Re: '^' and '$' in Regular expression

Chris F Clark <cfc@shell01.TheWorld.com>
Thu, 22 Apr 2010 11:39:21 -0400

From comp.compilers

Related articles
'^' and '$' in Regular expression march1896@gmail.com (Tangel) (2010-04-21)
Re: '^' and '$' in Regular expression cfc@shell01.TheWorld.com (Chris F Clark) (2010-04-21)
Re: '^' and '$' in Regular expression march1896@gmail.com (Tangel) (2010-04-21)
Re: '^' and '$' in Regular expression mailings@jmksf.com (mailings@jmksf.com) (2010-04-22)
Re: '^' and '$' in Regular expression armelasselin@hotmail.com (Armel) (2010-04-22)
Re: '^' and '$' in Regular expression cfc@shell01.TheWorld.com (Chris F Clark) (2010-04-22)
Re: '^' and '$' in Regular expression quinn_jackson2004@yahoo.ca (Quinn Tyler Jackson) (2010-04-22)
Re: '^' and '$' in Regular expression cfc@shell01.TheWorld.com (Chris F Clark) (2010-04-22)


From:	Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups:	comp.compilers
Date:	Thu, 22 Apr 2010 11:39:21 -0400
Organization:	The World Public Access UNIX, Brookline, MA
References:	10-04-052 10-04-055 10-04-057
Keywords:	lex, DFA
Posted-Date:	22 Apr 2010 12:53:07 EDT

Tangel <march1896@gmail.com> writes:

> Thanks very much, the problem I have is like this.
> the regexp is /^a*$/, and I try to test if it accept string "aaaaa",
> which becomes "\naaaaa\n" after pre-processing. And the regexp accept
> the string.
> But when the regexp is /a*/, and the buffer is "aaaaa",
> after the pre-precessing, "\naaaaa\n" is not acceptable for /a*/.

You have to decide what semantics you want for regular expressions.
There are subtle differences in how regular expressions behave when
they are used in different contexts or for different purposes.

For example, if you want the semantics to be "does the buffer exactly
match the pattern", add your begin-of-buffer and end-of-buffer (\n)
bytes to the buffer and ^ and $ to the ends of the pattern (if they
aren't already there). Thus, you rewrite the buffer "aaa" to
"\naaa\n" and the pattern a* to ^a*$.
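To make the exact-match rewriting concrete, here is a minimal sketch in
Python. It assumes "\n" is the sentinel byte and never occurs inside the
buffer itself. The helper names (wrap_buffer, anchor_pattern,
exact_match) are made up for illustration, and Python's re module
stands in for whatever matcher you have built yourself.

```python
import re

SENTINEL = "\n"  # assumed begin/end-of-buffer marker, as in the example above


def wrap_buffer(buf):
    """Rewrite the buffer "aaa" to "\\naaa\\n"."""
    return SENTINEL + buf + SENTINEL


def anchor_pattern(pat):
    """Add ^ and $ to the ends of the pattern if they aren't already there.

    Naive check: it does not recognise an escaped "\\$" at the end.
    """
    if not pat.startswith("^"):
        pat = "^" + pat
    if not pat.endswith("$"):
        pat = pat + "$"
    return pat


def exact_match(pat, buf):
    """True if the whole buffer matches the pattern."""
    anchored = anchor_pattern(pat)
    body = anchored[1:-1]
    # In this scheme ^ and $ are ordinary symbols that match the sentinel,
    # so they are translated to the sentinel character here.
    machine = re.compile(
        re.escape(SENTINEL) + "(?:" + body + ")" + re.escape(SENTINEL)
    )
    return machine.fullmatch(wrap_buffer(buf)) is not None
```

With this, exact_match("a*", "aaa") and exact_match("^a*$", "aaaaa")
both succeed, while exact_match("a*", "aab") fails.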

However, if you are asking whether the pattern is "in" the buffer, you
rewrite it differently, adding ^.* to the start and .*$ to the end.
Thus, a* becomes ^.*a*.*$.
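The containment rewriting can be sketched the same way. Again this is
only an illustration: the function name contains_match is hypothetical,
"\n" is assumed to be the sentinel and to be absent from the buffer,
the pattern is assumed to arrive without ^ or $, and Python's re module
is a stand-in for your own matcher. It relies on "." not matching "\n",
so the added .* cannot run over the sentinels.

```python
import re

SENTINEL = "\n"  # assumed begin/end-of-buffer marker


def contains_match(pat, buf):
    """True if the pattern matches somewhere inside the buffer."""
    # Equivalent of rewriting pat to ^.*pat.*$, with ^ and $ translated
    # to the sentinel. The group keeps alternations in pat intact.
    rewritten = (
        re.escape(SENTINEL) + ".*(?:" + pat + ").*" + re.escape(SENTINEL)
    )
    return re.fullmatch(rewritten, SENTINEL + buf + SENTINEL) is not None
```

Note that a* matches the empty string, so contains_match("a*", buf) is
true for every buffer. A pattern such as ab+ shows the difference
better: it is found in "xxabbyy" but not in "xxayy".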

Now, this rewriting doesn't have to be explicit. It can happen
implicitly as part of your implementation. As Jan suggested, you can
get the same semantics simply by making some extra checks when the
machine terminates. If you check that the start and end of your match
are at the start and end of the buffer, then you accept only exact
matches, whereas if you don't check, you also accept matches that are
merely contained in the buffer.
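As a rough sketch of the "check when the machine terminates" approach,
the Python below runs a small hand-written DFA and only afterwards
looks at where the match started and ended. The DFA is for aa* (one or
more a's) rather than a*, because a* also matches the empty string and
would make the contained case trivially true. The table layout and the
function names are assumptions made for this example.

```python
# DFA for aa*: state 0 is the start state, state 1 is accepting.
TRANSITIONS = {(0, "a"): 1, (1, "a"): 1}
ACCEPTING = {1}


def longest_match_end(buf, start):
    """Run the DFA from `start`; return the end of the longest match or None."""
    state = 0
    last_accept = start if state in ACCEPTING else None
    pos = start
    while pos < len(buf):
        nxt = TRANSITIONS.get((state, buf[pos]))
        if nxt is None:
            break  # no transition: the machine terminates here
        state = nxt
        pos += 1
        if state in ACCEPTING:
            last_accept = pos
    return last_accept


def find_match(buf):
    """Return (start, end) of the leftmost-longest match, or None."""
    for start in range(len(buf) + 1):
        end = longest_match_end(buf, start)
        if end is not None:
            return start, end
    return None


def matches(buf, exact):
    """With exact=True, add the extra start/end checks after the run."""
    span = find_match(buf)
    if span is None:
        return False
    if not exact:
        return True  # contained semantics: any match will do
    start, end = span
    return start == 0 and end == len(buf)
```

For the buffer "baaa", matches(buf, exact=False) is true because "aaa"
is found at offset 1, while matches(buf, exact=True) is false because
that match does not begin at the start of the buffer.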

Notably, you can make the check for the start or end of the buffer
part of the semantics of processing an edge (transition, arc) that is
labelled with ^ or $. If you do that, and if your matcher implicitly
finds matches contained in the buffer, then you can get both semantics
depending upon how a rule is written. However, to my knowledge the
details of that are not covered in the Dragon book.
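One way to picture edges labelled ^ and $ is the small backtracking
matcher below, written in Python. It is a sketch under my own
assumptions, not a description of any particular tool: a pattern is a
list of (kind, symbol) edges, the kinds "bol", "eol", "char" and "star"
are invented names, and a "bol" or "eol" edge consumes no input and
only tests the current position. The search routine tries every start
position, so unanchored rules get contained semantics for free.

```python
def match_here(edges, i, buf, pos):
    """True if edges[i:] can match starting at buf[pos]."""
    if i == len(edges):
        return True
    kind, sym = edges[i]
    if kind == "bol":  # edge labelled ^: only passable at the buffer start
        return pos == 0 and match_here(edges, i + 1, buf, pos)
    if kind == "eol":  # edge labelled $: only passable at the buffer end
        return pos == len(buf) and match_here(edges, i + 1, buf, pos)
    if kind == "char":  # ordinary edge consuming one symbol
        return (
            pos < len(buf)
            and buf[pos] == sym
            and match_here(edges, i + 1, buf, pos + 1)
        )
    if kind == "star":  # zero or more repetitions of one symbol
        while True:
            if match_here(edges, i + 1, buf, pos):
                return True
            if pos < len(buf) and buf[pos] == sym:
                pos += 1
            else:
                return False
    raise ValueError("unknown edge kind: " + kind)


def search(edges, buf):
    """Implicitly look for a match starting anywhere in the buffer."""
    return any(match_here(edges, 0, buf, s) for s in range(len(buf) + 1))


A_STAR = [("star", "a")]                                      # a*
ANCHORED_A_STAR = [("bol", None), ("star", "a"), ("eol", None)]  # ^a*$
```

Here search(ANCHORED_A_STAR, "aaaaa") succeeds and
search(ANCHORED_A_STAR, "baaa") fails, while search(A_STAR, "baaa")
succeeds, so the way the rule is written selects the semantics.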

Hope this helps,
-Chris

******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------
