LL(1) grammar for regex

"James Brown \[MVP\]" <not@www-no.ucsd.edu>
22 Mar 2006 23:41:45 -0500

          From comp.compilers

Related articles
LL(1) grammar for regex not@www-no.ucsd.edu (James Brown \[MVP\]) (2006-03-22)
Re: LL(1) grammar for regex schmitz@i3s.unice.fr (Sylvain Schmitz) (2006-03-27)
Re: LL(1) grammar for regex cfc@shell01.TheWorld.com (Chris F Clark) (2006-03-27)
| List of all articles for this month |

From: "James Brown \[MVP\]" <not@www-no.ucsd.edu>
Newsgroups: comp.compilers
Date: 22 Mar 2006 23:41:45 -0500
Organization: Compilers Central
Keywords: lex, question
Posted-Date: 22 Mar 2006 23:41:45 EST

All,


I'm currently trying to build a regex engine by hand (using C++). The aim is
to build a top-down parser for regex expressions. Therefore I am attempting
to build a LL(1) grammar for regex's. I am at the early stage of building a
syntax-tree from my regex grammar, but I am having a hard time with regex
'concatenation':


I am basically copying the code (and grammar) found in the following regex
tutorial:


http://www.gamedev.net/reference/programming/features/af5/


Here's the BNF grammar:


//
// expr ::= concat '|' expr
// | concat
//
// concat ::= rep . concat
// | rep
//
// rep ::= atom '*'
// | atom
//
// atom ::= chr
// | '(' expr ')'
//
// char ::= alphanumeric character


The grammar above represents regex concatenation by a '.' (dot) operator.
This means that regular expressions of the form "abcde" must be written as
"a.b.c.d.e". I can appreciate that it is much simpler to parse in this form,
however, I'd like to remove the reliance on the concatenation operator.


The current pseudo-code for parsing the concatenation is:


node *concat()
{
      node *left = rep();


      if(ch == '.')
      {
              ch = nextch();
              node *right = concat();
              return new node(CONCAT, 0, left, right);
      }


      return left;
}


I briefly tried to describe concatenation as:


concat ::= rep concat
                      | rep


node *concat()
{
      node *left = rep();


      if(ch) // note I don't test for '.'
      {
              node *right = concat();
              return new node(CONCAT, 0, left, right);
      }


      return left;
}


Of course, this destroys the parsing because I've introduced a recursion
which consumes the rest of the input-stream. I think my version of the
'concat' rule is no longer LL(1)..??


Can anyone advise me on where I should look for help with this? The article
I am using is perfect for me because the grammar is understandable (to me)
and the source-code is nice and simple also....but I just can't figure out
how to proceed from here.


thanks,
James



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.