Early access availability of a C++/Java PG

"Evangelos Drikos" <drikosv@otenet.gr>
26 Jan 2006 14:14:08 -0500

          From comp.compilers

Related articles
Early access availability of a C++/Java PG drikosv@otenet.gr (Evangelos Drikos) (2006-01-26)
| List of all articles for this month |

From: "Evangelos Drikos" <drikosv@otenet.gr>
Newsgroups: comp.compilers
Date: 26 Jan 2006 14:14:08 -0500
Organization: An OTEnet S.A. customer
Keywords: tools, available
Posted-Date: 26 Jan 2006 14:14:08 EST

Hi all,


I've developed a Parser/Scanner Generator which I am planning
to make publicly available. Its supported notation is described
here and any feedback is welcome.


Also, if someone wants to try it, I give some details about which
components of the tool are currently available and which ones will be
soon available for commercial use.


At the end of the message there is a comprehensive example which
shows in a nutshell the capabilities of the program.


"Syntaxis.jar" is a zero installation visual Parser/Scanner Generator
written in pure Java, and requires JRE 1.5 or above.


The notation accepted by the Parser Generator is common with that
of the Scanner Generator (Lexical Analyzer).


One defines a BNF Grammar and can construct a Lexical Analyzer
or a Parser, with the following restrictions:


(1) The Scanner Generator accepts lexical conventions expressed as
grammar productions without recursive definitions and cycles.


(2) To build a Parser one can choose any of the three fast parsing
algorithms, LALR(1), SLR(1) and CLR(1), if she/he use pure BNF.


(3) In alternative, one can define an epsilon free grammar and choose
the powerful hybrid/modified GLR to build a parser fast. This Parser
can carry out all possible reductions as soon as it shifts each terminal
without examining the next terminal.




NOTATION
---------------
Initially, I had implemented the following BNF notation:


    1) "< >" A character string enclosed in angle brackets is the name
                                            of a syntactic element. Brackets are optional.


    2) "::=" The definition operator is used in a production rule to
                                            separate the element defined by the rule from its
                                            definition. The element being defined appears to
                                            the left of the operator.


    3) "[ ] " Square brackets indicate optional elements in the
                                            formula.


    4) "{ } " Braces group elements in a formula. The portion of the
                                            formula within the braces shall be explicitly
                                            specified.


    5) " | " The alternative operator. The vertical bar indicates
                                            that the portion of the formula following the bar is
                                            an alternative to the portion preceding the bar. If
                                            the bar appears at a position where it is not
                                            enclosed in braces or square brackets, it specifies
                                            a complete alternative for the element defined by
                                            the production rule.


    6) "..." The ellipsis indicates that an element to which
                                            it applies in a formula may be repeated any number
                                            of times.


    7) \u0020 Unicode Literals. It can be stated also as \x0020.


The LALR/SLR/CLR Parser does not accept (3), (4), and (6).


Further, Semantic Actions are enclosed as C++/Java comments at the
end of each complete alternative of a production rule. (see 5)


Finally, there are no operators for associativity and precedence. As an
alternative solution the program supports two techniques of elimination
of reductions by single productions. This technique combined with the
operator (6) above result to very packed parse trees in the GLR.


Now, I'm extending the notation to support, among others, character
classes and fixed length occurrences/sequences.


Probably, the notation should be consistent with the above BNF and
I cannot use in example "[a-z]" for character classes. Further, since I
can logically OR two character classes I find no reason to support,
sth like "[a-z,A-Z,_]". Thus, my first approach is as follows:




    8) " a .. z " Character Class
    9) "r[m..n]" m to n occurrences of r; m and n should be integers.
10) " r[m] " exactly m occurrences of r; if m is not an integer
                                              it means "r followed optionally by m". (See 3 above)




The following two operators are supported only in the GLR parser, the
current implementation is experimental, and have an additional runtime
overhead but I think are very powerful:




  11) " -= " Binary Operator "but not".
                                              "a::= b -= c "
                                                means, a is defined as b, unless it is c.


  12) " += " It is the opposite of (11) above.
                                                It offers clarity and reusability.


The following two operators enable the construction of parsers
based on fully formal specifications. In other words, these
two binary operators enable scanner-less GLR's.


    13) "-:" Binary Operator "not follows".
                                                <id> ::= <id start> <id part>... -: <id part>
                                                <id part> ::= <letter> | <digit>


    14) "+:" Binary Operator "follows".
                                                <id> ::= <id start> <id part>... +: <delimiter>
                                              <delimiter> ::= <wchar_t> -= <id part>
                                              <wchar_t> ::= \u0001 .. \uffff


In example, there are grammars that state in plain text that
consecutive letters should be examined as one word. This is
exactly how Lexical Analyzers function.


Usually, Lexical Analyzers return tokens that match the longest
sequence of the input. With operators "follows" and "not follows"
this can be stated in a formal grammar specification.


The implementation currently supports as the right operand, either a
terminal or a non-terminal that is simply a grouping of terminals.
In General, the non-terminal should have "width" exactly one
terminal. (see examples of 13 & 14)




At the end of this message I give a comprehensive example. Its
purpose is to demonstrate the use of operators 1 to 12.


I tested it works or better say what I tested worked. It is a scanner-less
Parser that validates a date according to the Gregorian calendar
and I hope it will be easily understood. There is a windows/C++
build(108KB) of this grammar.




AVAILABILITY
-----------------
The tool is currently available without the productive C++ & Java
Parser Engines, Templates and Extractors. Those components
will be soon available for commercial use.


The operators 8-14 listed above have been implemented in an
embedded Simulator that does not execute semantic checks and
semantic actions and I am in the phase of updating the templates.


Yet, the tool has a value even without the Parser Engines &
Templates. One can define a grammar with the user friendly BNF
Editor or import it from a text file, build the parsing decision tables
and then try the grammar by using the Simulator.


Within the Simulator parse trees are visualized (with a JTree in
the GLR). In addition, the tool constructs state-based error
messages with a button click.


The tool comes with three examples:
- A simple Arithmetic Calculator.
Defined in pure BNF, it can be used with any Parser provided.


- An SQL Syntax Checker.
Defined with extended BNF it requires the GLR Parser.


- A Date Validation Parser.
    It is the example provided at the end of this message.


For those three examples, builds (C++) are provided.


Since there is no WEB Site hosting the tool yet, a limited number of
early access requests, if any, could be possibly served. In such case,
priority will be given to those for commercial use. Also on demand
I could probably develop a Parser Engine in a different target language.
Such requests should be submitted directly to my private e-mail and I
will try to answer promptly.


With regard to the notation, any feedback that will help me to improve
it will be highly appreciated.




Ev. Drikos






(Syntax Rules of a Date Validation Parser)
---------------------------------------------------------
<grm> ::=
                      <date value>


<date value> ::=
                      <date format> += <valid date> /* printf("Valid Date"); */
          | <date format> -= <valid date> /* printf("Invalid Date"); */
          | <wchar_t>[10] -= <date format> /* printf("Format Error"); */


<date format> ::=
                      <years value> - <months value> - <days value>
          | <years value> / <months value> / <days value>


<years value> ::=
                      <digit>[4]


<digit> ::=
                      0 .. 9


<months value> ::=
                      <digit>[2]


<days value> ::=
                      <digit>[2]


<valid date> ::=
                      <valid values> -= <invalid date>


<valid values> ::=
                      { <years value> <-> <valid month> <-> <valid day> }


<-> ::=
                      -
          | /


<valid month> ::=
                      0 1
          | 0 2
          | 0 3
          | 0 4
          | 0 5
          | 0 6
          | 0 7
          | 0 8
          | 0 9
          | 1 0
          | 1 1
          | 1 2


<valid day> ::=
                      { { 0 | 1 | 2 } 1 .. 9 }
          | 1 0
          | 2 0
          | 3 0
          | 3 1


<invalid date> ::=
                      <non leap year> <-> 0 2 <-> 2 9
          | <years value> <-> 0 2 <-> 3 0
          | <years value> <-> 0 2 <-> 3 1
          | <years value> <-> 0 4 <-> 3 1
          | <years value> <-> 0 6 <-> 3 1
          | <years value> <-> 0 9 <-> 3 1
          | <years value> <-> 1 1 <-> 3 1
          | <years value> <-> <invalid month> <-> <valid day>
          | <years value> <-> <valid month> <-> <invalid day>
          | <years value> <-> <invalid month> <-> <invalid day>


<non leap year> ::=
                      <years value> -= <leap year>


<leap year> ::=
                      <divisible by 4> -= { <divisible by 100> -= <divisible by 400> }


<divisible by 4> ::=
                      { [ [ <digit> ... ] <even> ] { 0 | 4 | 8 } }
          | { [ <digit> ... ] <odd> { 2 | 6 } }


<even> ::=
                      0
          | 2
          | 4
          | 6
          | 8


<odd> ::=
                      1
          | 3
          | 5
          | 7
          | 9


<divisible by 100> ::=
                      <digit> ... 0 0


<divisible by 400> ::=
                      <divisible by 4> 0 0


<invalid month> ::=
                      <months value> -= <valid month>


<invalid day> ::=
                      <days value> -= <valid day>


<wchar_t> ::=
                      \u0001 .. \uffff


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.