LL(1) grammar for regex

"James Brown \[MVP\]" <not@www-no.ucsd.edu>
22 Mar 2006 23:41:45 -0500

From comp.compilers

Related articles
*LL(1) grammar for regex not@www-no.ucsd.edu (James Brown \[MVP\])* (2006-03-22)**
Re: LL(1) grammar for regex schmitz@i3s.unice.fr (Sylvain Schmitz) (2006-03-27)
Re: LL(1) grammar for regex cfc@shell01.TheWorld.com (Chris F Clark) (2006-03-27)

| List of all articles for this month |

From:	"James Brown \[MVP\]" <not@www-no.ucsd.edu>
Newsgroups:	comp.compilers
Date:	22 Mar 2006 23:41:45 -0500
Organization:	Compilers Central
Keywords:	lex, question
Posted-Date:	22 Mar 2006 23:41:45 EST

All,

I'm currently trying to build a regex engine by hand (using C++). The aim is
to build a top-down parser for regex expressions. Therefore I am attempting
to build a LL(1) grammar for regex's. I am at the early stage of building a
syntax-tree from my regex grammar, but I am having a hard time with regex
'concatenation':

I am basically copying the code (and grammar) found in the following regex
tutorial:

http://www.gamedev.net/reference/programming/features/af5/

Here's the BNF grammar:

//
// expr ::= concat '|' expr
// | concat
//
// concat ::= rep . concat
// | rep
//
// rep ::= atom '*'
// | atom
//
// atom ::= chr
// | '(' expr ')'
//
// char ::= alphanumeric character

The grammar above represents regex concatenation by a '.' (dot) operator.
This means that regular expressions of the form "abcde" must be written as
"a.b.c.d.e". I can appreciate that it is much simpler to parse in this form,
however, I'd like to remove the reliance on the concatenation operator.

The current pseudo-code for parsing the concatenation is:

node *concat()
{
      node *left = rep();

      if(ch == '.')
      {
              ch = nextch();
              node *right = concat();
              return new node(CONCAT, 0, left, right);
      }

      return left;
}

I briefly tried to describe concatenation as:

concat ::= rep concat
                      | rep

node *concat()
{
      node *left = rep();

      if(ch) // note I don't test for '.'
      {
              node *right = concat();
              return new node(CONCAT, 0, left, right);
      }

      return left;
}

Of course, this destroys the parsing because I've introduced a recursion
which consumes the rest of the input-stream. I think my version of the
'concat' rule is no longer LL(1)..??

Can anyone advise me on where I should look for help with this? The article
I am using is perfect for me because the grammar is understandable (to me)
and the source-code is nice and simple also....but I just can't figure out
how to proceed from here.

thanks,
James

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

LL(1) grammar for regex

"James Brown \[MVP\]" <not@www-no.ucsd.edu>22 Mar 2006 23:41:45 -0500

"James Brown \[MVP\]" <not@www-no.ucsd.edu>
22 Mar 2006 23:41:45 -0500