Re: DFAC, DFA lexer generator for C/C++

Paul Mann <paul@paulbmann.com>
Wed, 7 Jul 2010 13:20:35 -0700 (PDT)

          From comp.compilers


From: Paul Mann <paul@paulbmann.com>
Newsgroups: comp.compilers
Date: Wed, 7 Jul 2010 13:20:35 -0700 (PDT)
Organization: Compilers Central
References: 10-05-074 10-07-002
Keywords: lex, tools
Posted-Date: 08 Jul 2010 10:27:05 EDT

In case anybody is confused, let me say:


DFAC is a complete package, which includes the 'dfac.exe' program
and C/C++ source code in 'main.cpp', 'lexer.cpp' and 'parser.cpp'.


Also included is a C Lexical Grammar, which is a good example of
the lexical-grammar notation used by DFAC. When compiled, the
resulting program makes a good speed test for lexical analyzers.


The lexical grammar notation differs from regular expressions
because it was borrowed from the kind of BNF used to define parsers.
Other compiler-compiler systems (ANTLR, SableCC) use this kind of
notation for defining lexers. Here is a small example:


      // IDENTIFIER is a defined constant (returns a number to the parser).
      <identifier> => IDENTIFIER


      <identifier> -> letter (letter|digit)*


      letter -> 'a'..'z' | 'A'..'Z' | '_'


      digit -> '0'..'9'
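

To make the small example concrete, here is a hand-written C sketch of
the two-state DFA that the <identifier> rule describes. It is only an
illustration: it is not the code that DFAC generates, and the names
is_letter, is_digit and match_identifier are hypothetical. It assumes
ASCII input in a NUL-terminated string.

```c
#include <stddef.h>

/* letter -> 'a'..'z' | 'A'..'Z' | '_'   (ASCII assumed) */
static int is_letter(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

/* digit -> '0'..'9' */
static int is_digit(unsigned char c)
{
    return c >= '0' && c <= '9';
}

/* <identifier> -> letter (letter|digit)*
   Returns the length of the longest identifier at the start of s,
   or 0 if s does not start with an identifier. */
size_t match_identifier(const char *s)
{
    size_t n;

    /* State 0: the first character must be a letter. */
    if (!is_letter((unsigned char)s[0]))
        return 0;

    /* State 1 (accepting): loop while we see letter|digit. */
    n = 1;
    while (is_letter((unsigned char)s[n]) || is_digit((unsigned char)s[n]))
        n++;
    return n;
}
```

A lexer built this way would return the IDENTIFIER constant to the
parser whenever match_identifier reports a nonzero length.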


Here is the C Lexical Grammar included in the product.
(You might want to view this in a fixed font, such as Courier).


/* C Lexical Grammar by CompilerWare, July 2010. */


// Tokens:


      <eof> => T_EOF
      <identifier> => T_IDENTIFIER
      <number> => T_NUMBER
      <literal> => T_LITERAL
      <string> => T_STRING


      `auto` => T_AUTO
      `break` => T_BREAK
      `case` => T_CASE
      `cdecl` => T_CDECL
      `char` => T_CHAR
      `const` => T_CONST
      `continue` => T_CONTINUE
      `default` => T_DEFAULT
      `do` => T_DO
      `double` => T_DOUBLE
      `else` => T_ELSE
      `enum` => T_ENUM
      `extern` => T_EXTERN
      `far` => T_FAR
      `float` => T_FLOAT
      `for` => T_FOR
      `goto` => T_GOTO
      `huge` => T_HUGE
      `if` => T_IF
      `int` => T_INT
      `interrupt` => T_INTERUPT
      `long` => T_LONG
      `near` => T_NEAR
      `pascal` => T_PASCAL
      `register` => T_REGISTER
      `return` => T_RETURN
      `short` => T_SHORT
      `signed` => T_SIGNED
      `sizeof` => T_SIZEOF
      `static` => T_STATIC
      `struct` => T_STRUCT
      `switch` => T_SWITCH
      `typedef` => T_TYPEDEF
      `union` => T_UNION
      `unsigned` => T_UNSIGNED
      `void` => T_VOID
      `volatile` => T_VOLATILE
      `while` => T_WHILE


      '+' => T_PLUS
      '-' => T_MINUS
      '*' => T_ASTERISK
      '/' => T_SLASH
      '%' => T_PERCENT
      ',' => T_COMMA
      ';' => T_SEMICOLON
      '=' => T_EQUALS
      '{' => T_LEFTBRACE
      '}' => T_RIGHTBRACE
      ':' => T_COLON
      '(' => T_LPAREN
      ')' => T_RPAREN
      '[' => T_LBRACKET
      ']' => T_RBRACKET
      '...' => T_ELIPSIS
      '!' => T_EXCLAMATION
      '^' => T_BITEXOR
      '|' => T_BITOR
      '&' => T_BITAND
      '*=' => T_MULEQ
      '/=' => T_DIVEQ
      '%=' => T_MODEQ
      '+=' => T_ADDEQ
      '-=' => T_SUBEQ
      '<<=' => T_SHLEQ
      '>>=' => T_SHREQ
      '&=' => T_ANDEQ
      '^=' => T_EXOREQ
      '|=' => T_OREQ
      '++' => T_PLUSPLUS
      '--' => T_MINUSMINUS
      '~' => T_TILDE
      '.' => T_DOT
      '->' => T_ARROW
      '#' => T_HASHMARK
      '\' => T_BACKSLASH
      '?' => T_QUESTION
      '||' => T_OR
      '&&' => T_AND
      '==' => T_EQ
      '!=' => T_NOTEQ
      '<' => T_LT
      '>' => T_GT
      '<=' => T_LTEQ
      '>=' => T_GTEQ
      '<<' => T_SHL
      '>>' => T_SHR


      <whitespace> [] // ignore this
      <comment1> []
      <comment2> []


// Lexical rules:


      <eof>         -> \z


      <identifier>  -> letter (letter|digit)*


      <number>      -> digits
                    -> float


      <literal>     -> ''' lchar+ '''
      lchar         -> '\' '\'
                    -> '\' 't'
                    -> '\' 'n'
                    -> '\' '''
                    -> '\' '0'
                    -> lany


      <string>      -> '"' '"'
                    -> '"' schar+ '"'
      schar         -> '\' '\'
                    -> '\' 't'
                    -> '\' 'n'
                    -> '\' '"'
                    -> '\' '0'
                    -> sany


      <whitespace>  -> space+


      float         -> rational
                    -> digits exp
                    -> rational exp
      rational      -> digits '.'
                    -> '.' digits
                    -> digits '.' digits
      exp           -> 'e' digits
                    -> 'E' digits
                    -> 'e' '-' digits
                    -> 'E' '-' digits
                    -> 'e' '+' digits
                    -> 'E' '+' digits


      <comment1>    -> '/' '*' EndInAst '/'


      EndInAst      -> '*'+
                    -> NA+ '*'+
                    -> EndInAst NANS '*'+
                    -> EndInAst NANS NA+ '*'+


      NA            -> 0..127 - \z - '*'
      NANS          -> 0..127 - \z - '*' - '/'


      <comment2>    -> '/' '/'
                    -> '/' '/' NEOL+


      NEOL          -> 32..127 | \t


      digits        -> digit+


      letter        -> 'a'..'z' | 'A'..'Z' | '_'
      digit         -> '0'..'9'


      lany          -> any - ''' - '\' - \n
      sany          -> any - '"' - '\' - \n


      space         -> \t | \f | \n | ' '


      any           -> 0..127 - \z


      \t            -> 9
      \n            -> 10
      \v            -> 11
      \f            -> 12
      \r            -> 13
      \z            -> 26 // End-of-file character


/* End of C Lexical Grammar. */
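

The <comment1> rule is the least obvious one in the grammar, so here is
a hand-written C sketch of the same language as a small state machine.
Again, this is an illustration under stated assumptions, not the output
of DFAC: the name match_block_comment is hypothetical, the input is
assumed to be a NUL-terminated string, and NUL is treated as end of
input (the grammar itself only excludes \z, character 26).

```c
#include <stddef.h>

/* <comment1> -> '/' '*' EndInAst '/'
   EndInAst is any text that ends in a run of one or more '*'.
   Returns the length of the comment at the start of s, or 0 if s
   does not start with a complete block comment. */
size_t match_block_comment(const char *s)
{
    enum { BODY, STARS } state = BODY;
    size_t i;

    if (s[0] != '/' || s[1] != '*')
        return 0;

    for (i = 2; ; i++) {
        unsigned char c = (unsigned char)s[i];

        /* End of input (NUL or \z) or a character outside 0..127:
           the comment is unterminated or not in the grammar. */
        if (c == '\0' || c == 26 || c > 127)
            return 0;

        if (state == BODY) {
            /* Corresponds to NA / NANS: anything but '*'. */
            if (c == '*')
                state = STARS;
        } else {
            /* We have just seen '*'+ (the tail of EndInAst). */
            if (c == '/')
                return i + 1;      /* closing '/' : accept */
            if (c != '*')
                state = BODY;      /* NANS: back into the body */
        }
    }
}
```

The two states mirror the grammar: BODY covers the NA and NANS
characters, and STARS covers the trailing '*'+ that every EndInAst
alternative ends with, so a '/' is accepted only directly after a star.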


The main reason I use angle brackets for <identifier> is
consistency. The parser grammar uses the same notation,
and the LALR parser generator synchronizes very well with the
DFAC lexer generator.


For more information and to download the product, see:


      http://compilerware.com


Paul B Mann


