Re: Parsing techniques for generating sentences from grammar

Peter Wilson <>
28 Jul 1999 01:48:04 -0400

          From comp.compilers


From: Peter Wilson <>
Newsgroups: comp.compilers
Date: 28 Jul 1999 01:48:04 -0400
Organization: The Boeing Company
References: 99-07-036 99-07-050
Keywords: testing, parse

James Jones wrote:
> Seems to me that it would be easy to turn a recursive descent parser
> into a sentence generator of the sort desired; where the parser
> expects to see a token, it should instead print it. Wherever there's
> a * or +, you'd need some way to keep it from going on forever (flip a
> coin or extend the grammar syntax to let the user attach probabilities
> to alternatives, the latter being of wider applicability than just
> preventing endless recursion). Not that this is new stuff; check out
> Charles Wetherell's paper on the subject in a _Computing Surveys_
> issue back in the early 1980s.

        Some time ago I did something like the suggestion above as part of
a larger project. I generated a parser for a grammar specification
language based on Wirth Syntax Notation and organised the symbol
table/parse tree with a view to supporting various back ends: for
example, to determine whether a grammar was LL(1), LL(2), or worse; to
pretty print the grammar together with a cross-reference table; to
generate sentences corresponding to the grammar; and to put random
typos into previously valid sentences.
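
As an illustration of the last back end, here is a minimal sketch of a
typo injector. How the original tool chose its typos is not described,
so the four edit kinds below (delete, insert, substitute, transpose),
the class name TypoInjector and the character set are all assumptions
made for the example.

```java
import java.util.Random;

/** Hypothetical typo injector: applies one random character-level edit. */
final class TypoInjector {
    private static final String CHARS = "abcdefghijklmnopqrstuvwxyz0123456789();+-*/= ";

    private static char randomChar(Random rng) {
        return CHARS.charAt(rng.nextInt(CHARS.length()));
    }

    /** Returns a copy of the sentence with a single random edit applied. */
    static String inject(String sentence, Random rng) {
        if (sentence.length() < 2) {
            return sentence; // too short to transpose; leave it alone
        }
        StringBuilder sb = new StringBuilder(sentence);
        int i = rng.nextInt(sb.length() - 1); // i + 1 is always a valid index
        switch (rng.nextInt(4)) {
            case 0: // drop a character
                sb.deleteCharAt(i);
                break;
            case 1: // insert a stray character
                sb.insert(i, randomChar(rng));
                break;
            case 2: // overwrite a character (may pick the same one by chance)
                sb.setCharAt(i, randomChar(rng));
                break;
            default: // swap two neighbours (no visible change if they are equal)
                char c = sb.charAt(i);
                sb.setCharAt(i, sb.charAt(i + 1));
                sb.setCharAt(i + 1, c);
                break;
        }
        return sb.toString();
    }
}
```

Working on characters rather than tokens tends to exercise the scanner
as well as the parser; a token-level variant would be an equally
reasonable choice.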

        The sentence generator could be started at any production and
would churn away generating strings corresponding to tokens until it
had worked its way (recursively) through this `start' production. One
important part to get right was the random generation of things like
identifiers (e.g., variable names, function names), numbers (integers,
reals, complex), string values, truth values, etc. It was also useful
to be able to limit the number of different variable names that were
generated or to be able to input a list of names to be picked
from. The format of identifiers, strings, etc., also had to be
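
The random generation of identifiers, numbers and so on might be
sketched as follows. This is not the original code: the class name
LexemeSource, its methods and the lexeme formats (lower-case names of
one to six characters, small decimal numbers, double-quoted strings)
are assumptions chosen for the example. It does show the two options
mentioned above, a cap on the number of distinct names and a
caller-supplied list of names.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Hypothetical helper that produces random lexemes for terminal classes. */
final class LexemeSource {
    private static final String LETTERS = "abcdefghijklmnopqrstuvwxyz";
    private static final String ALNUM = LETTERS + "0123456789";

    private final Random rng;
    private final List<String> namePool;

    private LexemeSource(Random rng, List<String> namePool) {
        if (namePool.isEmpty()) {
            throw new IllegalArgumentException("name pool must not be empty");
        }
        this.rng = rng;
        this.namePool = namePool;
    }

    /** Pool of at most maxNames generated identifiers (duplicates collapse). */
    static LexemeSource withGeneratedNames(Random rng, int maxNames) {
        if (maxNames < 1) {
            throw new IllegalArgumentException("maxNames must be at least 1");
        }
        Set<String> names = new LinkedHashSet<>();
        for (int i = 0; i < maxNames; i++) {
            names.add(randomName(rng));
        }
        return new LexemeSource(rng, new ArrayList<>(names));
    }

    /** Identifiers are picked only from the names supplied by the caller. */
    static LexemeSource withNames(Random rng, List<String> names) {
        return new LexemeSource(rng, new ArrayList<>(names));
    }

    /** A letter followed by up to five letters or digits. */
    private static String randomName(Random rng) {
        StringBuilder sb = new StringBuilder();
        sb.append(LETTERS.charAt(rng.nextInt(LETTERS.length())));
        int extra = rng.nextInt(6);
        for (int i = 0; i < extra; i++) {
            sb.append(ALNUM.charAt(rng.nextInt(ALNUM.length())));
        }
        return sb.toString();
    }

    String identifier() { return namePool.get(rng.nextInt(namePool.size())); }
    String integer()    { return Integer.toString(rng.nextInt(10000)); }
    String real()       { return rng.nextInt(1000) + "." + rng.nextInt(1000); }
    String truthValue() { return rng.nextBoolean() ? "true" : "false"; }
    String string()     { return "\"" + randomName(rng) + "\""; }
}
```

Keeping the pool small makes it more likely that a generated sentence
uses the same name more than once.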

        The most difficult part was getting the generator to write a
reasonably comprehensive set of sentences before running out of stack
space (it was programmed in Java). If the grammar was recursive (which
is typical of LL grammars for mathematical expressions or parenthesised
lists of parenthesised lists) it could easily keep burrowing down,
never to usefully return. As a heuristic solution the generator kept
track of its current depth of recursion. The depth was used in a
non-linear fashion to control the output of the random number
generators that decided which production to pick when there were
multiple choices, to bias against recursing on optional productions,
and so on. Further, when the recursion depth reached a particular
level the generator tried to do the minimal amount of work needed to
complete.
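
A minimal sketch of this kind of depth control, for a small expression
grammar, is given below. It is an illustration under stated
assumptions rather than the original generator: the grammar, the class
name ExprGenerator, the quadratic falloff and the constant 0.7 are all
invented for the example. Where a recursive descent parser would
expect a token, the generator emits one.

```java
import java.util.Random;

/**
 * Depth-biased sentence generator for the (assumed) grammar
 *   expr   = term { ("+" | "-") term } .
 *   term   = factor { ("*" | "/") factor } .
 *   factor = number | "(" expr ")" .
 */
final class ExprGenerator {
    private final Random rng;
    private final int maxDepth; // level at which only minimal work is done
    private final StringBuilder out = new StringBuilder();

    ExprGenerator(Random rng, int maxDepth) {
        this.rng = rng;
        this.maxDepth = maxDepth;
    }

    /** Generates one sentence starting from the expr production. */
    String generate() {
        out.setLength(0);
        expr(0);
        return out.toString().trim();
    }

    /** Chance of taking an optional or recursive branch at this depth. */
    private double branchProbability(int depth) {
        if (depth >= maxDepth) {
            return 0.0; // at the limit: never repeat, never recurse
        }
        double remaining = 1.0 - (double) depth / maxDepth;
        return 0.7 * remaining * remaining; // non-linear falloff; constants are arbitrary
    }

    private void expr(int depth) {
        term(depth);
        while (rng.nextDouble() < branchProbability(depth)) {
            emit(rng.nextBoolean() ? "+" : "-");
            term(depth);
        }
    }

    private void term(int depth) {
        factor(depth);
        while (rng.nextDouble() < branchProbability(depth)) {
            emit(rng.nextBoolean() ? "*" : "/");
            factor(depth);
        }
    }

    private void factor(int depth) {
        if (rng.nextDouble() < branchProbability(depth)) {
            emit("(");
            expr(depth + 1); // the only place the nesting depth grows
            emit(")");
        } else {
            emit(Integer.toString(rng.nextInt(100))); // shortest way to finish
        }
    }

    private void emit(String token) {
        out.append(token).append(' ');
    }
}
```

With this arrangement parentheses never nest more than maxDepth deep,
so the Java call stack stays within a few frames per level, while
sentences near the top level can still be long.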

        I used a configuration file to set the various parameters that
controlled the generator. At first I had built them into the program
but I rapidly found that different grammars often needed wildly
different parameters to generate sentences of interest. Generally
speaking, the tokens corresponding to the leftmost expressions in a
production were most likely to appear in the output. Biasing of the
random number generator could modify this tendency, or sometimes the
ordering of the expressions in a production could be changed.
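
One way to keep such parameters out of the program is a
java.util.Properties file read at start-up, sketched below. The
format of the original configuration file is not given here, so the
key names (maxDepth, recurseBias, maxNames, seed) and their default
values are purely hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical generator settings loaded from a properties file. */
final class GeneratorConfig {
    final int maxDepth;       // depth at which the generator does minimal work
    final double recurseBias; // base chance of taking a recursive branch
    final int maxNames;       // cap on distinct generated identifiers
    final long seed;          // seed for the random number generator

    private GeneratorConfig(Properties p) {
        // Missing keys fall back to defaults (arbitrary values for this sketch).
        maxDepth = Integer.parseInt(p.getProperty("maxDepth", "8").trim());
        recurseBias = Double.parseDouble(p.getProperty("recurseBias", "0.7").trim());
        maxNames = Integer.parseInt(p.getProperty("maxNames", "10").trim());
        seed = Long.parseLong(p.getProperty("seed", "0").trim());
    }

    /** Reads key=value settings from any character stream. */
    static GeneratorConfig load(Reader in) throws IOException {
        Properties p = new Properties();
        p.load(in);
        return new GeneratorConfig(p);
    }

    /** Convenience for configuration text that is already in memory. */
    static GeneratorConfig parse(String text) {
        try {
            return load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A grammar that needs shallower recursion could then ship a file
containing just the line "maxDepth = 4" and inherit the other
defaults.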

        I found the sentence generator to be a useful tool. At the time I
was working on a compiler for a language whose grammar as given was
LL(N) with N > 4, and for which there were no examples of either valid
or invalid sentences. The generator was initially used to generate
example sentences (the sentence generator doesn't care about lookahead
problems). Then, as the grammar kept on being tweaked to reduce the
lookahead, the generator was used as an aid in checking that the
language remained unchanged. Typically, the generated sentences looked
nothing like those that a real user would write, and they did produce
some very stressful cases for the real parser (and perhaps some
surprises for the language designers).

        The generator produces syntactically correct sentences, but it is
highly unlikely that any of them will be semantically sensible. They
do, though, give the compiler's error reporting code a good workout.
Peter W.
