Re: Infinite look ahead required by C++?

"Ira Baxter" <idbaxter@semdesigns.com>
Sat, 6 Feb 2010 12:37:17 -0600

          From comp.compilers

Related articles
Infinite look ahead required by C++? ng2010@att.invalid (ng2010) (2010-02-05)
Re: Infinite look ahead required by C++? cfc@shell01.TheWorld.com (Chris F Clark) (2010-02-06)
Re: Infinite look ahead required by C++? idbaxter@semdesigns.com (Ira Baxter) (2010-02-06)
Re: Infinite look ahead required by C++? thurston@complang.org (Adrian Thurston) (2010-02-08)
Re: Infinite look ahead required by C++? sh006d3592@blueyonder.co.uk (Stephen Horne) (2010-02-09)
Re: Infinite look ahead required by C++? gah@ugcs.caltech.edu (glen herrmannsfeldt) (2010-02-10)
Re: Infinite look ahead required by C++? sh006d3592@blueyonder.co.uk (Stephen Horne) (2010-02-10)
Re: Infinite look ahead required by C++? cfc@shell01.TheWorld.com (Chris F Clark) (2010-02-10)
Re: Infinite look ahead required by C++? martin@gkc.org.uk (Martin Ward) (2010-02-11)
[23 later articles]
| List of all articles for this month |

From: "Ira Baxter" <idbaxter@semdesigns.com>
Newsgroups: comp.compilers
Date: Sat, 6 Feb 2010 12:37:17 -0600
Organization: Compilers Central
References: 10-02-024
Keywords: C++, parse
Posted-Date: 10 Feb 2010 11:00:00 EST

"ng2010" <ng2010@att.invalid> wrote in message
> What elements of C++ make it so hard to parse? Is it a weakness of
> compiler designs rather than a weakness of the language design? I've read
> somewhere that the language requires potentially infinite look ahead.
> Why? And how do compilers handle it?
> [It's ambiguous syntax. Others can doubtless fill in the details. -John]


Well, there's "hard to parse" and then there's "hard to handle".


As John has observed, C++ has an ambiguous grammar.


I really don't like the phrase "hard to parse", because it's relative
to your parsing technology. GLR parsers are capable of parsing C++
easily in spite of the ambiguous grammar.


People that insist on claiming C++ is hard to parse are those that
insist on using LL or LALR parser generators. Their problem is the
classic one that comes from looking under the lamppost for one's keys.


The DMS Software Reengineering Toolkit uses GLR parsing machinery, and
has an industrial strength C++ parser
(http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html).


DMS's parser generator accepts a grammar definition that is very close
to what's in the standard, with deviations, some because even the
standard grammar isn't always convenient, but mostly because our
parsers handle a wide variety of C++ dialects (ANSI, GCC, Microsoft,
...) and contain exensions to support those extensions, too,
(including what we call "dialect conditionals" [you'll see some below
in an attribute grammar rule example])..


One convenient property of DMS's parser generator is it will *tell*
you where (some) ambiguities are. [Some say this computation is
impossible; in the abstract they are correct, you can't compute *all*
possible ambiguities for any possible grammar. But you can compute
*some* by doing a depth-limited search. over abstract sequences of
tokens for nonterminals]:


Here's the output for DMS's GCC3 dialect of C++,
for a pretty shallow search:


================================================================


What next? ambiguities
Searching for ambiguous rules...
Ambiguous Rules:
postfix_expression = 'typeid' '(' expression ')' ;
postfix_expression = 'typeid' '(' type_id ')' ;


Ambiguous Rules:
pseudo_destructor_name = '::' identifier_or_template_id '::' '~'
identifier_or_template_id ;
pseudo_destructor_name = '::' nested_name_specifier '~'
identifier_or_template_id ;


Ambiguous Rules:
pseudo_destructor_name = identifier_or_template_id '::' '~'
identifier_or_template_id ;
pseudo_destructor_name = nested_name_specifier '~' identifier_or_template_id
;


Ambiguous Rules:
unary_expression = postfix_expression ;
unary_expression = '~' cast_expression ;


Ambiguous Rules:
unary_expression = 'sizeof' unary_expression ;
unary_expression = 'sizeof' '(' type_id ')' ;


Ambiguous Rules:
unary_expression = '__alignof__' unary_expression ;
unary_expression = '__alignof__' '(' type_id ')' ;


Ambiguous Rules:
unary_expression = '__alignof' unary_expression ;
unary_expression = '__alignof' '(' type_id ')' ;


Ambiguous Rules:
statement = expression_statement ;
statement = declaration_statement ;


Ambiguous Rules:
for_init_statement = expression_statement ;
for_init_statement = simple_declaration ;


Ambiguous Rules:
simple_declaration = init_declarator_list ';' ;
simple_declaration = decl_specifier_seq ';' ;


Ambiguous Rules:
type_specifier = simple_type_specifier ;
type_specifier = cv_qualifier ;


Ambiguous Rules:
simple_type_specifier = 'typeof' '(' expression ')' ;
simple_type_specifier = 'typeof' '(' type_id ')' ;


Ambiguous Rules:
simple_type_specifier = '__typeof' '(' expression ')' ;
simple_type_specifier = '__typeof' '(' type_id ')' ;


Ambiguous Rules:
simple_type_specifier = '__typeof__' '(' expression ')' ;
simple_type_specifier = '__typeof__' '(' type_id ')' ;


Ambiguous Rules:
enum_specifier = enum_key '{' enumerator_list '}' ;
enum_specifier = enum_key '{' pp_enumerator_list_prefix_seq '}' ;


Ambiguous Rules:
enum_specifier = enum_key identifier '{' enumerator_list '}' ;
enum_specifier = enum_key identifier '{' pp_enumerator_list_prefix_seq '}' ;


Ambiguous Rules:
pp_enumerator_list = pp_enumerator_list_prefix_seq pp_enumerator_seq ;
pp_enumerator_list = pp_enumerator_seq pp_enumerator_list_postfix_seq ;


Ambiguous Rules:
pp_enumerator_seq = enumerator_definition ;
pp_enumerator_seq = if_directive enumerator_list else_directive
enumerator_list endif_directive ;


Ambiguous Rules:
declarator_id = id_expression ;
declarator_id = identifier_or_template_id ;


Ambiguous Rules:
direct_abstract_declarator = '(' parameter_declaration_clause ')' ;
direct_abstract_declarator = '(' gcc_attributes ')' ;


Ambiguous Rules:
pp_parameter_declaration_list = pp_parameter_declaration_seq ;
pp_parameter_declaration_list = pp_parameter_declaration_list_prefix_seq
pp_parameter_declaration_seq ;


Ambiguous Rules:
pp_parameter_declaration_list = pp_parameter_declaration_list_prefix_seq
pp_parameter_declaration_seq ;
pp_parameter_declaration_list = pp_parameter_declaration_seq
pp_parameter_declaration_list_postfix_seq ;


Ambiguous Rules:
member_declaration = member_declarator_list ';' ;
member_declaration = '::' nested_name_specifier 'template' unqualified_id
';' ;


Ambiguous Rules:
member_declaration = member_declarator_list ';' ;
member_declaration = nested_name_specifier 'template' unqualified_id ';' ;


Ambiguous Rules:
base_specifier = identifier_or_template_id ;
base_specifier = macro_invocation ;


Ambiguous Rules:
base_specifier = 'virtual' identifier_or_template_id ;
base_specifier = 'virtual' macro_invocation ;


Ambiguous Rules:
mem_initializer_id = identifier_or_template_id ;
mem_initializer_id = macro_invocation ;


Ambiguous Rules:
template_argument = assignment_expression ;
template_argument = type_id ;


Ambiguous Rules:
template_argument = assignment_expression ;
template_argument = id_expression ;


Ambiguous Rules:
template_argument = type_id ;
template_argument = id_expression ;


Ambiguous Rules:
assert_token = keyword ;
assert_token = '__asm' ;


=========================================================================


This pair:


=================================================
statement = expression_statement ;
statement = declaration_statement ;
=================================================


is hiding the classic example of the ambiguous C++ code fragment
        A*B;


A fair number of these ambiguities are induced by people's classic interest
in
overloading semantics on syntax, mostly by defining identifiers in the
grammar
that carry in implied type.


You can attempt to resolve ambiguities while parsing (to avoid the need
to capture multiple ambiguous parses) by providing the parser with type
information
as it runs. When you do that, you tangle parsing and semantic information
collection,
and you get the classic mess that we see in most real C and C++ parsers.


By using a parser that simply captures the ambiguities during parsing (such
as GLR),
you can avoid that awful tangle completely and thus get a parser directly
from the grammar.
The raw syntax for the DMS's GCC3 C++ parser is **2318** lines according to
wc.


What you get out of a GLR parser is an abstract syntax DAG, with subtrees
for separate derivations of various nonterminals, Ambiguity nodes where
multiple parses can occur, and subtree sharing under the ambiguity nodes.


Now we get to the distinction between "hard to parse" and "hard to handle".
C++ has a complex type system, and arguably an even more complex scoping
mechanism requires rather ugly lookup rules. Much of the C++ reference
manual is devoted to explaining the interactions between identifiers
and lookups.


With DMS, we encode "name and type" resolution using an attribute grammar
(AG)
which you can think of as a functional program coded in terms of syntax
rules.
The AG defines how information is passed up and down instance parse trees,
and mostly what is passed are symbol table scopes and identifier lists. Our
AGs are augmented by procedural code that actually build, insert-into,
or inspect specific scopes and scope links.


Our AGs also have a nice property to support handling ambiguities:
if an AG-rule, when executed,
declares an error, then the syntax tree in which it triggers is simply
deleted from that tree upward to any parent ambiguity node.
Viola, inconsist interpretations simply vanish from the tree.
Any remaining tree is consistent. So our C++ name resolver
simply does name resolution on all the variant interpretations
of each ambiguous rule, and wrong interpretations simply vanish.
What's left is the "parse" tree you really wanted.


The AG-decoration of the
grammar rule "statement = expression_statement" is:


===============================================================================
statement = expression_statement; -- needed to allow "macro_invocation;"
IfExpressionIsNotMacroInvocation
    <<CollectMacrosAndEstablishMicroContext>>:
        { -- inherit whether to collect macro define/undef
            expression_statement[1].add_symbols_to_symbol_table =
statement[0].add_symbols_to_symbol_table;
            -- inherit and derive micro context
            expression_statement[1].micro_context_in =
statement[0].micro_context_in;
            statement[0].micro_context_out =
expression_statement[1].micro_context_out;
        }
    <<ResolveSymbols>>:
        { -- inherit dynamic ancestors
            expression_statement[1].dynamic_ancestors =
statement[0].dynamic_ancestors;
            -- inherit type of current class
            expression_statement[1].current_type_declaration =
statement[0].current_type_declaration;
            -- inherit current symbol space
            IF statement[0].ambiguity_status == NonAmbiguous \/
statement[0].ambiguity_status == AmbiguousTryAlternative THEN
  expression_statement[1].lookup_symbol_space =
statement[0].lookup_symbol_space;
  expression_statement[1].label_symbol_space =
statement[0].label_symbol_space;
            ELSE
  IF statement[0].ambiguity_status == AmbiguousIgnoreAlternative THEN
      expression_statement[1].lookup_symbol_space = VoidSymbolSpace;
      expression_statement[1].label_symbol_space = VoidSymbolSpace;
  ELSE
      << ThisCannotHappen("Unexpected ambiguity status for statement.");
      -- mark inherited attributes as being written
      ASSUME WRITTEN expression_statement[1].lookup_symbol_space;
      ASSUME WRITTEN expression_statement[1].label_symbol_space;
  ENDIF;
            ENDIF;
            -- inherit argument type for function or method
            expression_statement[1].implicit_argument_type =
statement[0].implicit_argument_type;
            -- inherit whether statement occurs in template declaration
            expression_statement[1].is_in_template_declaration =
statement[0].is_in_template_declaration;
            -- inherit whether symbols have to be resolved
            expression_statement[1].resolve_symbols =
statement[0].resolve_symbols;
            -- inherit result type of function or method
            expression_statement[1].result_type = statement[0].result_type;
            -- derive number of for statements
            statement[0].number_of_declarations_or_for_statements = 0;
            -- derive type if last statement is expression statement
            statement[0].derived_type = expression_statement[1].derived_type;
            -- derive next unqualified lookup symbol space
            statement[0].next_unqualified_lookup_symbol_space =
statement[0].lookup_symbol_space;
            -- derive errors
#IF ECMA372c2005 | GCC3 | ISO14882c1998
            statement[0].errors = expression_statement[1].errors;
#ELSIF VectorXBox360 | Union | VisualCpp6 | VisualStudio2005
            IF statement[0].is_in_template_declaration THEN
  IF expression_statement[1].errors == NoErrors THEN
      statement[0].errors = NoErrors;
  ELSE
      IF ErrorsAreNotReported(expression_statement[1].errors) THEN
          << DestroyErrors(expression_statement[1].errors);
          statement[0].errors = NoErrors;
      ELSE
          statement[0].errors = expression_statement[1].errors;
      ENDIF;
  ENDIF;
            ELSE
  statement[0].errors = expression_statement[1].errors;
            ENDIF;
#ELSE
    #FAIL
#ENDIF
            -- derive information about last declaration specifier
            statement[0].last_declaration_specifier = NoDeclarationSpecifier;
            -- mark attributes that are accessed for ambiguity resolution as being
read
            ASSUME READ statement[0].ambiguity_status; -- is also (over)written
for ambiguity resolution
            ASSUME READ statement[0].lookup_symbol_space;
            ASSUME READ statement[0].insert_symbol_space;
            ASSUME READ statement[0].is_in_template_declaration;
            ASSUME READ statement[0].last_declaration_specifier;
            ASSUME READ statement[0].errors;
        }
    <<DeriveTypeOfExpression>>:
        { -- derive type if last statement is expression statement
            statement[0].derived_type = expression_statement[1].derived_type;
        }


===============================================================================


To give you a sense of the difference in complexity between the grammar and
the full
name an type resolver, the AG for C++ is **281295** lines. The relative
differences
between this and the grammar size isn't quite what fair,
as the full AG handles *all* the dialects of C++ and I only counted the size
the
grammar rules specific to GCC3.


But two orders of magnitude difference handling the basic semantics vs. the
syntax I think
clearly makes the point that "C++ is hard to parse" vs. "C++ is hard to
handle" correctly.


If you believe that writing any kind of code/specification takes linear time
in its size,
this suggests that what will hurt you by far the most isn't the parsing, but
rather the
name and type resolution for a langauge like C++.


Other languages aren't as complex IMHO, but have much of the same
differential
between "syntax" and "semantics". So what this suggests is that building
a parser
is easy, and building the support to understand the language is a lot
harder.
Most people don't seem to understand this; I continually hear "if I just had
a parser
for X I could..."


Beyond name and type resolution, if you really want to manipulate programs,
are control and data flow analyses, ... These aren't small either.


The argument for a tool like DMS is that the community in general can't
afford
to build all this standard machinery, and by doing it once that cost can be
amortized.


(One can argue that cost has been amortized by GCC, but if you want to build
a general purpose program analysis and transformation tool, GCC isn't the
answer by a long shot).


--
Ira Baxter, CTO
www.semanticdesigns.com



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.