Re: LR-parser-based lexical analysis - does it work?

"Chris F Clark" <cfc@shell01.TheWorld.com>
18 Oct 2002 23:04:46 -0400

From comp.compilers

Related articles
LR-parser-based lexical analysis - does it work? soenke.kannapinn@wincor-nixdorf.com (=?iso-8859-1?Q?S=F6nke_Kannapinn?=) (2002-10-13)
*Re: LR-parser-based lexical analysis - does it work? cfc@shell01.TheWorld.com (Chris F Clark)* (2002-10-18)**
Re: LR-parser-based lexical analysis - does it work? vmakarov@redhat.com (Vladimir N. Makarov) (2002-10-18)
Re: LR-parser-based lexical analysis - does it work? vbdis@aol.com (VBDis) (2002-10-18)
Re: LR-parser-based lexical analysis - does it work? brian-l-smith@uiowa.edu (Brian Smith) (2002-10-18)
Re: LR-parser-based lexical analysis - does it work? grosch@cocolab.de (Josef Grosch) (2002-10-18)
Re: LR-parser-based lexical analysis - does it work? zackw@panix.com (Zack Weinberg) (2002-10-20)

| List of all articles for this month |

From:	"Chris F Clark" <cfc@shell01.TheWorld.com>
Newsgroups:	comp.compilers
Date:	18 Oct 2002 23:04:46 -0400
Organization:	The World Public Access UNIX, Brookline, MA
References:	02-10-030
Keywords:	LR(1), lex
Posted-Date:	18 Oct 2002 23:04:46 EDT

Sönke Kannapinn wrote:
> In a compiler generator project we are thinking about building
> scanners using LR parser techniques.
...
> * Does that attempt work (at least for languages the lexical
> structure is "not too complicated")? Or does one really need GLR
> parsing?

Lexing using pure LR techniques definitely works for a reasonable
class of languages. The lexers generated by Yacc++ are definitely LR
based (with only 1 character of lookahead allowed in all the released
versions of Yacc++). So, one can write a lexer using an LR grammar
and one doesn't need backtracking to do so.

BTW, LR lexing has several advantages, such as allowing "natural"
definitions of comments even if the language allows nested ones. The
lexer for Yacc++ has a token for action code, which is C++ code in
nested braces. The LR rules for that token handle not only nested
braces, but also comments and strings inside the braces without
thinking that the closing brace inside a C style comment is
significant and should be matched to an opening brace, see lexer
grammar fragment below.

We still have cases where lexer states (which we do as classes) are
necessary, but generally one does not need states just to handle
complex tokens. One only needs them to handle places where the parser
wants to restrict the tokens it can see at certain points (e.g. in
Verilog idenitifiers and hex constants are ambiguous, but hex
constants can only occur in specific contexts).

Note the speed of the resulting lexer is not dependent on the grammar
class, but more dependent on little details like buffering schemes.
The fact is that Yacc++ lexers generally run as fast as Flex ones for
the same set of tokens--the reason being that both Yacc++ and Flex
have roughly the same overhead in buffer operations.

> There should be rules to eliminate some LR conflicts "automatically"
> a) realizing what is called "longest match";
> b) realizing the fact that usually keywords override identifiers.
> There should also some clean way to deal with whitespace&comment.

Longest match is roughly equivalent to resolving all shift/reduce
conflicts by picking shift.

Keywords conflicts are generally reduce/reduce conflicts and you can
use the ordering rule to resolve them (e.g. always reduce the rule
that appears first (or last) in the grammar in a r/r conflict). In
Yacc++, we sidestepped this problem, by putting keywords into a string
(or symbol) table and allowing the symbol type to change the grammatic
type of the resulting rule.

The whitespace, comment, etc. problem is solved by allowing the user
to specify which tokens get sent onto the parser when the lexer
reduces them and which get dropped on the floor (discarded).

----------------------------------------------------------------------

Actual defintion of token representing balanced C++ code embedded in
Yacc++ grammars. @ is special symbol meaning match any characters not
otherwise matched, sort of like [^xyz] in Lex dialects.

EMBEDDED_CODE :

        "{" { wr_act_bgn(); }
                  (matching_braces | @ | embedded_comment | embedded_string | "/" @
                    | "/\n" // newline
                        { ++yy_buf_lineno(); }
                    | "\n" // newline
                        { ++yy_buf_lineno(); }
                  )*
        "}"
                                ;
        {
                /* remove trailing brace from spelling */
                yy_lex_rslt().as_sym_ptr = yy_lookup(yy_lex_token(), yy_lex_len()-1,
                        EMBEDDED_CODE_);
                wr_act_end(yy_lex_rslt().as_sym_ptr);
        }

matching_braces : // this is done, because Yacc++ disallows rules for tokens
                                    // (e.g. things returned to the parser) as recursive only
                                    // fragments within a token can have recursive rules, so
                                    // one simply splits off 1 level of the recursion as the
                                    // token rule

        "{"
                (matching_braces | @ | embedded_comment | embedded_string | "/" @
                  | "/\n" // newline
                        { ++yy_buf_lineno(); }
                  | "\n" // newline
                        { ++yy_buf_lineno(); }
                )*
        "}"
                                  ;

embedded_comment :

          "/*" (( "\n"
                { ++yy_buf_lineno(); }
          | @ )* "*")+ "/"

    | "//" @ * "\n"
                { ++yy_buf_lineno(); }
                                  ;

embedded_string : '"'
                                            ( "\\\""
                                            | "\\\\"
                                            | "\\" @
                                            | "\\\n" { ++yy_buf_lineno(); }
                                            | "\n" { ++yy_buf_lineno(); }
                                            | @ )*
                                    '"'
                                | "'"
                                            ( "\\'"
                                            | "\\\\"
                                            | "\\" @
                                            | "\\\n" { ++yy_buf_lineno(); }
                                            | "\n" { ++yy_buf_lineno(); }
                                            | @ )*
                                    "'"
                                ;

------------------------------------------------------------------------

Hope this helps,
-Chris

*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: LR-parser-based lexical analysis - does it work?

"Chris F Clark" <cfc@shell01.TheWorld.com>18 Oct 2002 23:04:46 -0400

"Chris F Clark" <cfc@shell01.TheWorld.com>
18 Oct 2002 23:04:46 -0400