Re: First vs Predict and LL(*)



From: Hans-Peter Diettrich <DrDiettrich1@netscape.net>
Newsgroups: comp.compilers
Date: Wed, 07 Jan 2015 23:04:47 +0100
Organization: Compilers Central
References: 15-01-003
Keywords: LL(1), parse
Posted-Date: 07 Jan 2015 20:41:37 EST

alexander.morou@gmail.com wrote:


> In my process of regaining my legs in the project, I re-read some common
> articles, and noticed: I don't see the point of differentiating First vs
> Predict sets.
>
> The type of project I'm writing is an LL(*) parser builder, which aims to scan
> ahead and yield a state which denotes which path is the most logical path to
> follow next.


Here are some mostly practical considerations:


"Traditional" parsers distinguish between an lexer/tokenizer and the
parser. The lexer accepts characters and outputs terminal tokens, which
then are interpreted by the parser. That's the way to go in e.g.
programming languages, where the language construction should prevent
any unwanted/unexpected misinterpretation. Remember that grammars and
regular expressions are fine for telling a machine how to interpret
input, but not for humans *preparing* that input. LL(1) is perfect for
communication between humans and machines, higher degrees make things
more complicated for both; machines then have to use stacks or parallel
processing, while humans may be annoyed by unexpected syntax errors.
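
As a minimal sketch of that division of labor (Python; the token names
and the tiny grammar are mine, purely for illustration):

    import re

    # Lexer: characters in, terminal tokens out. Whitespace dies here,
    # so the parser never has to mention it.
    TOKEN_RE = re.compile(r"\s*(?:(?P<NUM>\d+)|(?P<PLUS>\+))")

    def lex(text):
        pos = 0
        while pos < len(text):
            m = TOKEN_RE.match(text, pos)
            if not m:
                raise SyntaxError("bad input at %d" % pos)
            pos = m.end()
            yield (m.lastgroup, m.group(m.lastgroup))
        yield ("EOF", "")

    # Parser: tokens in, result out. LL(1) for  Expr ::= NUM ('+' NUM)*
    def parse(text):
        tokens = lex(text)
        kind, value = next(tokens)
        if kind != "NUM":
            raise SyntaxError("expected NUM")
        total = int(value)
        kind, value = next(tokens)
        while kind == "PLUS":            # one token of lookahead decides
            kind, value = next(tokens)
            if kind != "NUM":
                raise SyntaxError("expected NUM")
            total += int(value)
            kind, value = next(tokens)
        if kind != "EOF":
            raise SyntaxError("trailing input")
        return total

    print(parse("1 + 2 + 3"))   # 6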


Merging both into a "scannerless parser" opens a can of worms, at
least when the input is prepared by humans and should be readable
(parseable) by other people as well. One of the most annoying
"features" of a scannerless parser IMO is the handling of token
separators. You can easily see how complicated the grammar for some
well-known programming language becomes as soon as whitespace handling
is moved out of the lexer.
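
A tiny illustration, in the notation of your examples (the '_'
separator rule is my invention):

      // with a separate lexer, whitespace dies in the lexer:
      Assign ::= Ident '=' Expr ';';

      // scannerless, a separator has to be threaded through
      // every production by hand:
      _      := [ \t\r\n]*;
      Assign ::= Ident _ '=' _ Expr _ ';';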


But of course use cases for scannerless parsers do exist, particularly
for languages other than programming languages.


> To accomplish this, my basic approach is to track the source of a given symbol
> referenced by a state within a rule's deterministic state machine.
>
> This can lead to situations where a given rule has tens to hundreds of
> possible sources, especially if the nesting goes 20+ deep (expression
> precedence defined through left-recursive rules, versus defining operator
> precedence sugar in the grammar definition language.)


Right, precedence rules are a weak point of formal grammars. Here the
human factor suggests (to me) some formal annotation, rather than deeply
nested rules, for the implementation of operator precedence, grouping etc.
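
Compare the two styles on a two-operator expression grammar (the %left
annotations are yacc-like and for illustration only):

      // one rule per precedence level:
      Expr   ::= Expr '+' Term | Term;
      Term   ::= Term '*' Factor | Factor;
      Factor ::= '(' Expr ')' | Number;

      // one flat rule plus precedence annotations:
      %left '+'
      %left '*'   // declared later, binds tighter
      Expr ::= Expr '+' Expr | Expr '*' Expr | '(' Expr ')' | Number;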




> So far I'm to the point where I've collapsed common routes into a single view,
> so if '(' is available through two possible parse paths, but both are the same
> outgoing target at the current look ahead, it considers them the same.
>
> An example:
> A ::= B | C; //::= defines rules
> B ::= D | E;
> D ::= '(' P ')';
> E ::= '(' Q ')';
> C ::= @'O'*; //'@' means case insensitive.
> P := [0-9A-Fa-f]+; //:= defines tokens
> Q := [:Ll:]+; //:Xx: means unicode character classes as defined by .NET's
> standard sets.
>
> 'A' would contain '(' within its first set, but it would be observed through
> 'B' and only 'B', so it would be valid to parse 'B' always when '(' is
> encountered at the start of 'A'. B's definition would have to look ahead
> further, for the sake of simplicity, token-level ambiguities are resolved by
> longest prevails, or first come first serve. So if you want [A-F] to be
> parsed in a 'Q' token, you'd have to specify it first, unless other letters
> are parsed that make 'Q' parse a longer sub-string.


Here I feel a need for some annotation that tells the (human) generator
of a sentence how to proceed in order to reach the intended goal. I
suspect you have run into the same need already?


In LL(1) the rules D and E could be merged into
      DE ::= '(' (P|Q) ')';
which I'd find easier to understand. I prefer to use EBNF as much as
possible, in order to reduce the number of rules in an LL grammar. PEG
parsers/grammars are also easier for me to understand when it comes to
clarifying ambiguities: a PEG parser uses the first interpretation
found in the grammar, so it's easier to work out how to make the parser
consider some alternative interpretation.
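
For instance, with PEG's ordered choice '/' (my example), the order of
the alternatives *is* the disambiguation:

      Keyword <- 'if' / 'ifdef'    // can never match "ifdef": 'if'
                                   // succeeds first and commits
      Keyword <- 'ifdef' / 'if'    // the longer alternative must come first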


> (Next step will involve collecting all states which observe a given symbol and
> looking ahead further, then further, and so on, until a decision is possible.
> I might find this is computationally implausible. Left-recursive rules will
> be handled through loops that parse alternate paths until the rule itself can
> be pushed to the stack, once this is done, it can safely re-parse alternate
> paths until there are no more symbol reductions. This is what I do by hand in
> the Grammar Definition Language's parser.)


IMO left recursion should be resolved by EBNF notation or, where that is
not easily possible, an LR parser should be used instead of an LL parser.
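
For example, the usual rewrite:

      List ::= List ',' Item | Item;   // left recursive, traps an RD parser
      List ::= Item (',' Item)*;       // same language, EBNF iteration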


> The main question: what's the functional intent behind the First vs. Follow
> sets,


Follow sets are useful in the implementation/emulation of left
recursion, when the first token in a rule can be empty (epsilon). Then
both the First set of that first (optional) token and the First set of
the following token are acceptable for that non-terminal. And once the
RD parser has entered that subroutine, when will it return from there?


The decision is easy as long as the First and Follow sets are disjoint
(LL(1)): the loop over tokens in the First set is left as soon as a
different token is found, which consequently must be a member of the
First set of the next token.
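
A concrete case (my example):

      A ::= B 'x';
      B ::= 'y' B | ;   // B may derive epsilon

First(B) = {'y'} and Follow(B) = {'x'}. The subroutine for B keeps
consuming 'y'; the first non-'y' token must be the 'x' that follows B,
or the input is in error. The sets are disjoint, so one token decides.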


When both sets overlap (an LL(1) First/First conflict), descending
immediately into the same subroutine will yield an endless loop. That's
why I add a rule to my LL parsers that every call *must* consume at
least one terminal token before descending and (possibly) recursing. In
the simplest implementation the parser remembers the current position in
the input stream after consuming its first (assumed) terminal, then
tries to parse the first (optional) token/alternative. In case of an
error it returns to the remembered input position and tries the next
alternative. This model can be optimized by using a stack or queue of
tokens, in order to eliminate repeated lexing of the offending terminals.
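
A minimal sketch of that remember/rewind model (Python; the class and
method names are mine, and the token list stands in for the stack or
queue of tokens):

    class ParseError(Exception):
        pass

    class Parser:
        def __init__(self, tokens):
            self.tokens = list(tokens)   # lex once, rewind cheaply
            self.pos = 0

        def consume(self, kind):
            # Every successful alternative consumes at least one
            # terminal before any further descent: the rule that
            # prevents the endless loop.
            if self.pos < len(self.tokens) and self.tokens[self.pos][0] == kind:
                self.pos += 1
                return self.tokens[self.pos - 1]
            raise ParseError(kind)

        def first_of(self, *alternatives):
            saved = self.pos             # remember the input position
            for alt in alternatives:
                try:
                    return alt()
                except ParseError:
                    self.pos = saved     # on error: rewind, try the next one
            raise ParseError("no alternative matched")

With the D/E rules from above, first_of(parse_D, parse_E) would try D,
rewind on failure, and try E, without ever re-lexing a token.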


Other models can be used as well, but they easily lead to LR parsers,
which have little or no use for recursive descent at all. E.g. LALR
parsers use the First sets (SLR parsers also the Follow sets) in order
to abbreviate the search for the applicable production, i.e. to reduce
the size of their LR token stack.


> and the goal of the project I'm writing is a recursive descent parser,
> from what's described above, is my classification of LL(*) accurate?


I don't see your classification of LL(*) :-(


I'd say that an LL(k) parser definitely knows how to proceed after
inspection of at most k terminals, but I may be wrong here: is k a
number of terminals or of tokens?


DoDi

