Re: Compiling expressions

James Harris <james.harris.1@gmail.com>
Thu, 3 Jan 2013 13:33:50 -0800 (PST)

          From comp.compilers

Related articles
[2 earlier articles]
Re: Compiling expressions mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2012-12-30)
Re: Compiling expressions james.harris.1@gmail.com (James Harris) (2013-01-02)
Re: Compiling expressions james.harris.1@gmail.com (James Harris) (2013-01-02)
Re: Compiling expressions matzebraun@googlemail.com (matzebraun@googlemail.com) (2013-01-03)
Re: Compiling expressions torbenm@diku.dk (2013-01-03)
Re: Compiling expressions james.harris.1@gmail.com (James Harris) (2013-01-03)
Re: Compiling expressions james.harris.1@gmail.com (James Harris) (2013-01-03)
Re: Compiling expressions mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2013-01-04)
Re: Compiling expressions james.harris.1@gmail.com (James Harris) (2013-01-06)
Re: Compiling expressions vonbrand@inf.utfsm.cl (Horst von Brand) (2013-01-14)
Re: Compiling expressions james.harris.1@gmail.com (James Harris \(es\)) (2013-03-07)
| List of all articles for this month |

From: James Harris <james.harris.1@gmail.com>
Newsgroups: comp.compilers
Date: Thu, 3 Jan 2013 13:33:50 -0800 (PST)
Organization: Compilers Central
References: 12-12-035
Keywords: parse
Posted-Date: 04 Jan 2013 11:41:07 EST

On Dec 29 2012, 1:11 pm, James Harris <james.harri...@gmail.com>
wrote:


...


> 1. Hand-written, not the output of a parser generator.
> 2. Efficient and without backtracking.
> 3. Precedences (and possibly associativities) defined in tables.
> 4. Output to be a tree structure.
> 5. Parenthesised subexpressions allowed.
> 6. Some operator families are *not* to associate with each other. See
> below.
> 7. Monadic prefix, dyadic infix and monadic postfix operators are all
> allowed.
> 8. Prefix and infix operators can use some same symbols (e.g. minus
> sign).
>
> Infix and postfix operators use distinct symbols.


...


Here is an idea for an expression parser to try to address the points
mentioned. I am not sure if it covers all the bases yet. For some
reason, although the requirement "parse an expression" is ostensibly
straightforward, when precedence and three types of operators are
thrown in I have been finding it very hard to reduce to easily
understandable concepts.


I should say what concepts the proposal is based on:


* An expression has exactly two key types of elements which I am
calling values and operators. The operators are the normal types of
operator symbols: prefix, infix and postfix. The values are
identifiers and literals. A subexpression in parentheses is also a
value.


* A value can have an arbitrary number of prefix operators to its left
and an arbitrary number of postfix operators to its right.


* A prefix operator applies not just to the value immediately to its
right but to an arbitrary expression to its right as far as an
operator with lower precedence. For example,


    -a.b+c


The unary minus in this case applies not just to the variable, a, but
to the subexpression a.b and terminates at the + sign. The dot here is
a binding operator which has a higher precedence than unary minus.


* Parentheses can appear in two contexts. Before a value (i.e. when
scanning for a value) a left paren is regarded as a grouping
parenthesis and starts the enclosure of a subexpression. Conversely,
after a value a left paren indicates a function call or similar. For
example,


    a + (...) //a grouped expression
    a(...) //a function call


So, here is an attempt to produce such an expression parser. This one
is partially recursive and written in fairly low-level pseudocode. Of
the approaches I have tried I think this is the easiest to understand.
It is certainly short (which is cool!) but it may not be completely
right. Just about all the issues I have had have been to do with
applying precedence properly. Discussion and/or corrections would be
welcome.


Parsing of an expression is intended to start with a call to
expression_parse which takes one parameter. On the initial call this
is a dummy start-of-expression symbol which has a lower precedence
than any real operator. Once the parse is complete expression_parse
should return a tree representing the expression.


This function is essentially in two parts: first obtain a value and
then iterate over as many pairs of (infix operator, value) as have
higher precedence than that of the initial symbol.


function expression_parse(op1)
    op2 = token //a copy of the current token
    if op2 is lparen //*grouping* parens
        v1 = expression_parse(lparen)
        consume rparen
    else if op2 is a prefix operator
        v1 = expression_parse(prefix version of op2)
        v1 = node(PREFIX_OP, op2, v1)
    else
        v1 = value_parse(op1)
    op2 = current token (or dummy end of expression)
    while prec(op2) > prec(op1)
        if incompatible(op1, op2)
            raise "incompatible - grouping parens needed"
        v2 = expression_parse(op2)
        v1 = node(INFIX_OP, op2, v1, v2)
        op2 = current token (or dummy end of expression)
    return v1


I won't insult your intelligence by explaining it all but some notes
are in order. Each call to this routine ends leaving the current token
pointer pointing at the next unconsumed operator. The line


      v1 = expression_parse(prefix version of op2)


is intended to deal with such things as a prefix (unary) minus having
a different precedence than an infix minus. When such a symbol is seen
in prefix context the "prefix version" of it (which has the higher
precedence) is used on the recursive call.


When expression_parse gets to a value - an identifier or a literal -
it calls value_parse, below. The value_parse function is intended not
just to read the value but to accumulate all the following operations
- postfix and infix - which bind tighter than the precedence in force
at the time value_parse was called.


function value_parse(op1)
    v1 = value() //identifier or literal
    while true
        op2 = current token or dummy end of expression
        if lparen //parens of a function call or similar
            v1 = args_parse(op2)
            consume rparen
        else if prec(op2) > prec(op1)
            if incompatible(op1, op2)
                raise "incompatible - grouping parens needed"
            v1 = node(POSTFIX_OP, v1)
    return v1


There is a need to accumulate a list of arguments. That is the job of
args_parse, below.


function args_parse()
    v1 = empty list
    while true
        append to v1 expression_parse(lparen)
        if current token is a comma
            consume comma
        else
            break loop
    return v1


I am aware of some little things I would change to render it into a
programming language but the main issue is whether it would handle
precedences generally and correctly (or could be made easier to
understand). As I say, comments and corrections - especially on the
bigger issues - would be very welcome.


James


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.