|Error recovery in PEG parser email@example.com (Dustin Voss) (2008-11-29)|
|From:||Dustin Voss <firstname.lastname@example.org>|
|Date:||Sat, 29 Nov 2008 17:00:05 -0800|
|Posted-Date:||01 Dec 2008 06:59:19 EST|
I'm working on adding error recovery to my PEG parser.
Your basic PEG parser will try alternate parses (using backtracking)
and if it can't find one, fails. Instead, I want the parser to skip
ahead in the stream to some point where it might be able to resume
The only technique I've seen so far is to add a final ordered choice
that acts as a catch-all, advancing the parser to a point where
parsing can resume. It seems to me that this catch-all should only be
used in cases where the preceding grammar matched and there aren't any
alternate interpretations of the following grammar. For example, if
your declaration production is:
'define' 'constant' variable '=' expression
you know that if you see 'define' and 'constant', the remainder must
be a constant-definition and if there is some parse failure, there's
no point in backtracking; instead you should just note the error and
skip ahead to the next top-level definition (using some
heuristic). Thus, a production that includes recovery might look like
'define' 'constant' (variable '=' expression / skip-to-next-tld)
where skip-to-next-tld notes the error as a side-effect.
This looks fine, but I don't find it acceptable because the error
message is too high-level. Consider a function definition:
(name '(' argument-list? ')' body 'end' / skip-to-next-tld)
argument (',' argument)*
If you place the catch-all in the function-definition production, then
if it fails the best you can do is report "syntax error in function;
skipping." To get better error messages, you have to push the
catch-all rule farther down the parse tree.
However, if you place the catch-all lower, the best skip-ahead
heuristic becomes contextual. For example, say I have the following
'define' 'function' name '(' arguments? ')' body 'end'
'define' 'constant' variable default
'=' expression / skip-to-next-???
In an argument-list, you can just skip to the next right-paren
(ignoring nested pairs) and try to continue parsing the function body;
but in a constant-definition, you have to skip to the next top-level
Now, normally, I'd just punt on this issue. I'd place the catch-all
high in the tree and let the user figure it out. But that really isn't
an option for my current project.
I'm writing a documentation generator for the Dylan language. It scans
source code looking for definitions to document. The Dylan language
has an extensible grammar. Without writing most of a real compiler, I
can't know what is or is not a valid expression; I will have to skip
certain edge cases, but I want to capture all I can of every
This means I need my catch-alls as low on the parse tree as possible.
Does anyone have any advice or suggestions?
Here's what I've come up with so far.
Since the problem with low catch-alls is choosing the right skip-ahead
algorithm, I thought I'd just have the catch-all signal that a
skip-ahead is needed, and let a higher-level production supply that
skip-ahead, perhaps via an exception handler or attribute of the
higher-level production (my parser does support inherited & synthesized
But I think there might be an issue when the low-level production
finishes skipping ahead and exits recursion if an upper-level production
expects some additional tokens which got skipped over. This would cause
the upper-level production to fail, and then to backtrack, undoing the
skip-ahead and then some.
I also thought about an alternative to the catch-all method, where the
parser is resynchronized with a recovery point later in the stream.
You would have your skip-ahead production that finds a place in the
stream where a target production or productions might be able to resume
parsing. That place would be memorized. Then, parsing would continue.
But all productions would forcibly fail until, in the course of
backtracking, one of the target productions is tried. Right before the
production is tried, the stream position would be set to the memorized
place where the target production might work.
That solution seems easy to get wrong, but also easy to conceptualize
with respect to a grammar.
Return to the
Search the comp.compilers archives again.