Re: Parsing partial sentences

Hans-Peter Diettrich <DrDiettrich1@netscape.net>
Tue, 11 Apr 2017 10:31:30 +0200

From comp.compilers

From:	Hans-Peter Diettrich <DrDiettrich1@netscape.net>
Newsgroups:	comp.compilers
Date:	Tue, 11 Apr 2017 10:31:30 +0200
Organization:	Compilers Central
References:	17-04-001 17-04-002 17-04-003 17-04-004 17-04-006 17-04-007
Keywords:	C, parse, comment
Posted-Date:	11 Apr 2017 10:08:33 EDT

On 11.04.2017 at 02:22, George Neuner wrote:
> On Fri, 7 Apr 2017 22:24:04 +0200, Hans-Peter Diettrich

>> I already wrote the parser, but it's LL. For code snippets a LR parser
>> looks like the only solution?
>
> You could do it in either LL or LR ... LR would be more runtime
> efficient, but I think it would not be any easier to create the parser
> in the first place.

Perhaps I should clarify: by LL I mean a top-down parser, by LR a
bottom-up parser.

Given a snippet, such an LR parser will stop with one or more tokens
left, which should then fit (reduce to) one of the expected
non-terminals. This last step should be automatic if the goal is the set
of all expected non-terminals. A statement can become a function. An
expression can become a named constant or a function, depending on its
operands.

> To do what you want you'd need non-terminals not just for every legal
> (sub)expression in C, but for every individual keyword, operator and
> symbol, and also for any quasi-legal combination of them. The parser
> would be enormous.
>
>
> Building on John's example, consider what you'd do with
>
> # define FOO +
> # define BAR + 42
> # define BAZ + c /* note 'c' is undefined */

None of these can be reduced to constants or functions. A problem may
arise for "+ 42", though, where the '+' could be interpreted as the
sign of a constant value. In that case the following code

> int a, b;
>
> :
> a = a FOO b BAR BAZ;

would raise a parser error "expecting operator between BAR and BAZ".
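
The classification discussed above can be sketched roughly as follows. This is only an illustration in Python, not the translator's actual code: it uses a tiny hand-written top-down expression parser instead of an LR parser, it knows only decimal integers, identifiers, parentheses and a few binary operators, and the names `tokenize`, `parse_expr`, `parse_operand` and `classify` are all hypothetical. It tries the goals "expression" and "expression statement" and reports what the macro body could become.

```python
import re

# Hypothetical, deliberately tiny token grammar: decimal numbers,
# identifiers, and single-character punctuation.
TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(.))")
BINOPS = {("op", c) for c in "+-*/%="}
SIGNS = {("op", "+"), ("op", "-")}


def tokenize(text):
    """Split a macro body into (kind, text) pairs."""
    tokens = []
    for num, ident, punct in TOKEN.findall(text):
        if num:
            tokens.append(("num", num))
        elif ident:
            tokens.append(("id", ident))
        elif punct.strip():          # ignore stray whitespace matches
            tokens.append(("op", punct))
    return tokens


def parse_operand(toks, pos):
    """Parse [sign]* (number | identifier | '(' expr ')'); return the
    position after it, or None if no operand starts at pos."""
    while pos < len(toks) and toks[pos] in SIGNS:
        pos += 1
    if pos >= len(toks):
        return None
    kind, text = toks[pos]
    if kind in ("num", "id"):
        return pos + 1
    if text == "(":
        pos = parse_expr(toks, pos + 1)
        if pos is not None and pos < len(toks) and toks[pos] == ("op", ")"):
            return pos + 1
    return None


def parse_expr(toks, pos):
    """Parse operand (binop operand)*, ignoring precedence, since only
    acceptance matters here. Return the end position or None."""
    pos = parse_operand(toks, pos)
    while pos is not None and pos < len(toks) and toks[pos] in BINOPS:
        pos = parse_operand(toks, pos + 1)
    return pos


def classify(body):
    """Return 'constant', 'function' or 'unhandled' for a macro body."""
    toks = tokenize(body)
    end = parse_expr(toks, 0)
    if end is None:
        return "unhandled"
    has_ids = any(kind == "id" for kind, _ in toks)
    if end == len(toks):                       # goal: expression
        return "function" if has_ids else "constant"
    if end == len(toks) - 1 and toks[end] == ("op", ";"):
        return "function"                      # goal: expression statement
    return "unhandled"
```

With this naive grammar a bare `+` is rejected, but `+ 42` is accepted as a signed constant and `+ c` as an expression with an identifier, because a leading `+` is read as a sign. A real tool would need an explicit rule for such bodies.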

A more practical example would be windows.h. I expect well over 50% of
its #defines to be named constants, which could be detected and
converted easily. This automated handling would reduce the manual
inspection and classification of many hundreds (thousands?) of #defines
to a few unhandled or not easily translatable macros.
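
As a rough sketch of that kind of automated pass (again Python, with hypothetical names, and not how the existing translator works): scan a header for object-like #defines whose body is a single integer literal, emit them as Pascal-style constant declarations, and list everything else for manual inspection. The sketch assumes one #define per line, drops C suffixes such as `UL` (losing type information), and leaves octal, parenthesized and expression bodies unhandled.

```python
import re

# "#define NAME rest-of-line"; function-like macros have '(' directly
# after the name.
DEFINE = re.compile(r"^\s*#\s*define\s+([A-Za-z_]\w*)(.*)$")
# Optional sign, hex or decimal integer (no octal), optional u/l suffixes.
INT_LITERAL = re.compile(r"^([+-]?)(0[xX][0-9a-fA-F]+|0|[1-9][0-9]*)[uUlL]*$")
COMMENT = re.compile(r"/\*.*?\*/")


def translate_defines(header_text):
    """Return (const_lines, unhandled_names) for the #defines in a header."""
    consts, unhandled = [], []
    for line in header_text.splitlines():
        m = DEFINE.match(line)
        if not m:
            continue                      # not a #define line
        name, rest = m.groups()
        if rest.startswith("("):
            unhandled.append(name)        # function-like macro
            continue
        body = COMMENT.sub("", rest).strip()
        lit = INT_LITERAL.match(body)
        if lit is None:
            unhandled.append(name)        # needs manual classification
            continue
        sign, digits = lit.groups()
        if digits[:2].lower() == "0x":
            digits = "$" + digits[2:]     # Pascal hex notation
        consts.append(f"{name} = {sign}{digits};")
    return consts, unhandled
```

For a line like `#define WM_USER 0x0400` this yields `WM_USER = $0400;`, while `#define FOO +` ends up in the unhandled list.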

> And then consider what you'd need to handle sh..stuff like this for
> every keyword, operator, variable, etc.

The identifier lists are already implemented, for both the preprocessor
and parser. The current implementation translates all MS (msvc), GNU
(gcc) and BCC style C source files. Further compilers can be added by
the user, who writes a header file containing all predefined symbols.
The output is Delphi (OPL) code; further back ends could be added, such
as C++, Java or whatever compatible language you prefer.

One use of such a translation into a strictly typed language is to
check for invalid type casts or wrong enumerated constants, as found
e.g. in the Windows SDK samples. Every inspected source file, compiled
with BCC after removal of excess type casts, contained more than 50
bugs. Even in GNU sources, a translation into OPL revealed bugs that had
not been found or fixed for many years.

DoDi

[I think you may be trying to solve the wrong problem. Really, you
cannot preparse C #define statements unless you're prepared to handle
irregular parse tree fragments. In most compilers, the lexer takes
vastly more time than the parser, so if you store tokens rather than
text, you'll get most of the performance. Also, one of the reasons C
and C++ added const and inline is that they do most of what #define
does while being normal syntax.

Here's a thought that sometimes works: try parsing the #define text,
if it succeeds store the parsed version, otherwise store the text.
But be prepared for that to fail, e.g.:

#define FOO a + b

d = FOO * c;

Be sure that your preparsed FOO doesn't expand into (a+b)*c. That
may well be what the programmer meant, but it's not what she said.
-John]
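
The pitfall in the moderator's note can be made concrete with a small Python sketch. It is only an illustration under simplifying assumptions: the names `tokenize`, `expand_tokens` and `expand_preparsed` are hypothetical, macros are expanded once without rescanning, and Python's own `eval` stands in for the C expression evaluator, which is acceptable here only because `+` and `*` have the same relative precedence in both languages.

```python
import re


def tokenize(text):
    """Split text into identifiers, decimal numbers and single punctuation."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", text)


def expand_tokens(source, macros):
    """Token-level expansion: splice the macro's tokens in as they are."""
    out = []
    for tok in tokenize(source):
        out.extend(tokenize(macros[tok]) if tok in macros else [tok])
    return " ".join(out)


def expand_preparsed(source, macros):
    """Naive use of a pre-parsed macro: its body is treated as one closed
    subexpression, which amounts to adding parentheses."""
    out = []
    for tok in tokenize(source):
        if tok in macros:
            out.extend(["(", *tokenize(macros[tok]), ")"])
        else:
            out.append(tok)
    return " ".join(out)


if __name__ == "__main__":
    macros = {"FOO": "a + b"}
    values = {"a": 1, "b": 2, "c": 3}
    textual = expand_tokens("FOO * c", macros)       # a + b * c
    preparsed = expand_preparsed("FOO * c", macros)  # ( a + b ) * c
    print(textual, "=", eval(textual, {}, values))      # 7
    print(preparsed, "=", eval(preparsed, {}, values))  # 9
```

The two expansions disagree (7 versus 9 for a=1, b=2, c=3), so a stored parse of FOO can only be reused where splicing it in as a unit gives the same tree as splicing in its tokens.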
