Re: Concrete to Abstract syntax

greg@cee.hw.ac.uk
24 Jun 1996 11:00:19 -0400

          From comp.compilers

Related articles
Concrete to Abstract syntax rkanagy@erols.com (Ron Kanagy) (1996-06-21)
Re: Concrete to Abstract syntax bwb@concentra.com (Brent Benson) (1996-06-23)
Re: Concrete to Abstract syntax greg@cee.hw.ac.uk (1996-06-24)
Re: Concrete to Abstract syntax kadhim@VNET.IBM.COM ((Basim Kadhim)) (1996-06-27)
| List of all articles for this month |

From: greg@cee.hw.ac.uk
Newsgroups: comp.compilers
Date: 24 Jun 1996 11:00:19 -0400
Organization: Dept of Computing and Electrical Engineering, Heriot-Watt University
References: 96-06-079
Keywords: parse, syntax
Originator: greg@thor.cee.hw.ac.uk

The following is from I course I used to give on language definition.
Greg Michaelson
--------------------------------------------------------------------


Abstract syntax


When we come to define semantics we want to associate meanings
with syntactic constructs. However, concrete syntax contains
information which is not directly relevant to semantics.


Consider the grammar for expressions once more:


          <expression> ::= <term> | <term> + <expression> | <term> - <expression>
          <term> ::= <factor> | <factor> * <term> | <factor> / <term>
          <factor> ::= <base> | - <base>
          <base> ::= <integer> | ( <expression> )


This grammar is chosen to reflect the operator precedence:


          () before
          - before
          * / before
          + -


Thus, the structure of the rules imposes a structure on symbol
sequences: "+" and "-" in <expression>s qualify the preceding
<term>s and following <expression>s; "*" and "/" in <term>s
qualify the preceding <factor>s and following <term>s; "-" in
<factor>s qualify the following <base>s; "(" and ")" in <base>s
qualify the enclosed <expression>s.


Consider parsing:


          (3+4)*5-6


using the above grammar for <expression>s:


                                                                    <expression>
                                                                              |
                                              +---------------+----------------+
                                              | | |
                                        <term> - <expression>
                                              | |
                                +------+---------+ <term>
                                | | | |
                        <factor> * <term> <factor>
                                | | |
                          <base> <factor> <base>
                                | | |
                +-------+-------+ <base> <integer>
                | | | | |
                ( <expression> ) <integer> 6
                                | |
                  +------+------+ 5
                  | | |
            <term> + <expression>
                  | |
          <factor> <term>
                  | |
            <base> <factor>
                  | |




                                            Monday, June 24, 1996






                                                            - 2 -




          <integer> <base>
                  | |
                  3 <integer>
                                              |
                                              4


This structure reflects the order in which processing is to
proceed: "3+4" before "(3+4)*5" before "(3+4)*5-6". However, from
the point of view of that processing, much of the structure is
irrelevant. Thus, to calculate "3+4" we don't need to know that
"3" is a <integer> is a <base> is a <factor>: knowing that "3" is
a <integer> rather than a sub-expression is sufficient. Similarly,
to calculate "(3+4)" the brackets are irrelevant. Similarly, to
calculate "(3+4)*5" we don't need to know that "(3+4)" is a <base>
is a <factor> is a <term>: knowing that it is a sub-expression
rather than a <integer> is sufficient.


In general, provided that we know the order in which processing is
to take place, all we need to distinguish are <integer>s, sub-
expressions and operators. Thus, we can simplify the above
structure to:


                                                                          <expression>
                                                                                    |
                                                    +---------------+----------------+
                                                    | | |
                                        <expression> - <expression>
                                                    | |
                                      +------+---------+ <integer>
                                      | | | |
                        <expression> * <expression> 6
                                      | |
                      +-------+-------+ <integer>
                      | | | |
          <expression> + <expression> 5
                      | |
              <integer> <integer>
                      | |
                      3 4


This corresponds to the abstract syntax:


          <expression> ::= <integer> | <expression> <op> <expression> | - <expression>
          <op> ::= + | - | * | /


The most significant change is that we have dropped chains of
single rules. Thus:


          <expression> -> <term> -> <factor> -> <base> -> <number> -> 6


has become:


          <expression> -> <number> -> 6


This abstract syntax has lost precedence structure and brackets.
However, its purpose is not to recognise symbol sequences but to
identify the structure of symbol sequences relevant to their
meaning.


From a compiling point of view, concrete syntax tells us how to
recognise the overall structure of symbol sequences; abstract
syntax tells us what aspects of that structure need to be
retained, say in a parse tree, for semantic processing.


We can find concrete syntax from abstract syntax by folding rules
together. Consider expressions yet again:


          <expression> ::= <term> | <term> + <expression> | <term> - <expression>
          <term> ::= <factor> | <factor> * <term> | <factor> / <term>
          <factor> ::= <base> | - <base>
          <base> ::= <integer> | ( <expression> )


First of all, replace the single <base> in <factor> with the
options from <base> and then replace the qualified <base> with
<factor>:


          <factor> ::= <integer> | ( <expression> | - <factor>


Next, replace the single <factor> in <term> with the options from
<factor> and replace all qualified <factor>s with <term>:


          <term> ::= <integer> | ( <expression> ) | - <term> | <term> * <term> |
                                <term> / <term>


Next, replace the single <term> in <expression> with the options
from <term> and replace all qualified <term>s with <expression>:


          <expression> ::= <integer> | ( <expression> ) | - <expression> |
                                            <expression> * <expression> | <expression> / <expression> |
                                            <expression> + <expression> | <expression> - <expression>


Next, generalise the options with an operator between
<expression>s:


          <expression> ::= <integer> | ( <expression> ) | - <expression> |
                                            <expression> <op> <expression>
          <op> ::= * | / | + | -


Next, drop the brackets from the bracketed <expression> as to
process it we just process the <expression>:


          <expression> ::= <integer> | <expression> | - <expression> |
                                            <expression> <op> <expression>
          <op> ::= * | / | + | -


Finally, drop the redundant single <expression>:


          <expression> ::= <integer> | - <expression> | <expression> <op> <expression>
          <op> ::= * | / | + | -


In general, for each rule Ri:


i) for an option which is a single rule Rj, replace it with the
          options for that rule


ii) replace any other references to Rj with Ri


These two rules remove the chains of single rules from parse
trees.


iii) drop any symbols which are not significant for distinguishing
          meaning


This rule is partly a matter of style. In principle we can drop
all symbols except one to distinguish the construct. In practise,
we may leave some "redundant" symbols in place to ease
understanding of the final abstract syntax.


iv) drop any options which are just Ri


This rule removes redundant circularities like:


          <expression> ::= <expression>


For example, for statements:


          <statements> ::= <statement> | <statement> ; <statements>
          <statement> ::= <assign> | <input> | <output>
          <assign> ::= <name> := <expression>
          <input> ::= READ <name>
          <output> ::= WRITE <name>


First of all:


          <statement> ::= <name> := <expression> | READ <name> | WRITE <expression>


Next:


          <statements> ::= <name> ::= <expression> | READ <name> |
                                            WRITE <expression> | <statements> ; <statements>


Note that we do not just fold the grammar into one enormous
abstract syntax rule. We need to keep separate rules for
structures which are semantically distinct. Thus, here we might
keep separate rules for expressions, which return values,
statements, which modify variables, and declarations, which
introduce variables.


It is usual to use a different notation for abstract syntax:


i) use simple identifiers instead of full names in <>


ii) use "->" instead of "::="


Thus, for expressions:


          e -> i | - e | e op e
          op -> + | - | * | /


Thus, for statements:


          s -> n := e | READ n | WRITE e | s ; s


We will use this notation when we come to describe semantics.


We also need to indicate how abstract syntax constructs correspond
to concrete syntax constructs. For example, for expressions:


          e in <expression>
          op in <operator>
          i in <integer>


          e -> i | - e | e op e
          op -> + | - | * | /


For example, for statements:


          s in <statement>
          e in <expression>
          n in <name>


          s -> n := e | READ n | WRITE e | s ; s
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.