# Re: FIRST function computation problem.

## =?Windows-1252?Q?S=F6nke_Kannapinn?= <soenke.kannapinn@wincor-nixdorf.com>2 Jul 2003 00:41:23 -0400

From comp.compilers

Related articles
FIRST function computation problem. mefrill@yandex.ru (2003-06-22)
Re: FIRST function computation problem. derkgwen@HotPOP.com (Derk Gwen) (2003-07-02)
Re: FIRST function computation problem. torbenm@diku.dk (2003-07-02)
Re: FIRST function computation problem. haberg@matematik.su.se (2003-07-02)
Re: FIRST function computation problem. soenke.kannapinn@wincor-nixdorf.com (=?Windows-1252?Q?S=F6nke_Kannapinn?=) (2003-07-02)
| List of all articles for this month |

 From: =?Windows-1252?Q?S=F6nke_Kannapinn?= Newsgroups: comp.compilers Date: 2 Jul 2003 00:41:23 -0400 Organization: Siemens Business Services References: 03-06-103 Keywords: parse Posted-Date: 02 Jul 2003 00:41:23 EDT

> I have a problem with FIRST function for 1 lookahead symbol evalation.
> To lookup such the symbols for some nonterminal A I am doning follows:
>
> 1. go through all rules that left part is A. Here I do following:
> a) If I have rule A --> aB, where a is terminal. I just add this
> terminal to lookahead set for A.
> b) If I have rule A --> e, where e is epsilon, I add e to lookahead
> set for A.
> c) If I have rule A --> B C ... where B - nonterminal, I at first
> compute the lookahead set for B, if B's lookahead set contains
> epsilon, check C and etc. I found that recursive function here is the
> best. But I do not know how to work with rules like follows: A --> B,
> B --> C A D... To compute lookahead set for A I have to at first
> compute lookaheads for B, but if lookaheads for C contain epsilon I
> need to know the same about A to understand do checking of D or
> not...
>
> Please, help to decide this problem and there may be the good FIRTS
> computaion algorithm?
>
> Thanks,

I have commented on the computation of FIRST sets some time ago in
comp.compilers (http://compilers.iecc.com/comparch/article/01-04-079)

Unfortunately, when browsing old pure-ASCII articles as html from the
comp.compiler archive all "formatting" spaces seem to be messed up.
I therefore decided to add a copy of that old text again with
formatting spaces re-inserted. Hope it helps you understand the nature
of FIRST.

Regards,
Dr. S�nke Kannapinn

--
The computation of FIRST and FOLLOW sets of a grammar should, of
course, not be limited to special grammars (such as non-left-
recursive ones) because their formal definition is based simply on the
respective grammar's derivation relation which means that FIRST and
FOLLOW are always *defined* no matter how weird the grammar is. So,
some grammars then, clearly, the algorithm is insufficient!

I'll describe an algorithm below which is adequate because its
structure will follows exactly the structure of the problem. This is
what's wrong with your approach. Unfortunately, I know only a single
compiler-related text book that treats the problem well (see the
trailer); most compiler books present algorithms I'd call
"appropriate" in the above sense. (Don't misunderstand that: I didn't
say they are wrong; I said they are inappropriate.) Usual solutions
apply some least-fixpoint-computation strategy that can roughly be
sketched as follows (I'll deal with FIRST only):

Compute obvious initializing sets of FIRST(A) for all nonterminals A.
REPEAT
Find all dependencies in the grammar where some set FIRST(A)
must obviously contain some set FIRST(B) for some nonterminals A
and B, and (re-)compute FIRST(A) := FIRST(A) + FIRST(B).
UNTIL, during a complete grammar examination, none of the set unions
computed has added any new elements

Clearly, there are unnecessary re-computations done in this approach
because of the REPEAT-recompute-everithing-UNTIL-no-more-change
pattern.

I'll try to explain now how to do better. The algorithm I'll sketch
assumes that the grammar is reduced.

I can't do completely without formulas, so let me use G=(V,T,P,S)
(V vocabulary, T termininals, P productions, S start symbol,
N=V\T nonterminals). Furthermore,
- greeks alpha, beta, ... are strings from V*,
- epsilon is the empty word,
- A,B,... are nonterminals from N, and
- a,b,... are terminals from T.

Then we have

FIRST(A) = { a in T : A =>* a alpha } and
FOLLOW(A) = { a in T : S =>* alpha A a beta }.

(Of course, => is the derivation relation, =>* its reflexive
transitive closure.)

Now, looking at the grammar productions we see that we can read
off direct and indirect contributions to sets FIRST(A):

Examining successively each p in P, we see

direct contributions:
If p is of the shape A -> alpha a beta where alpha =>* epsilon
then, of course,
a in FIRST(A).

indirect contributions:
If p is of the shape A -> alpha B beta where alpha =>* epsilon
then, of course,
FIRST(A) is a superset of FIRST(B).
Let's express this formally by a relationship
(A,B) in contains-the-FIRSTs-of, or, in infix notation:
A contains-the-FIRSTs-of B.

We'll interpret the relation contains-the-FIRSTs-of in N x N as
the edge relation of a directed graph with node set N.
If you think about it, loops in this graph express left recursion!

Here's an intricate example grammar:
A -> F C a | E A
B -> C G h
C -> D B a | b h | c | epsilon
D -> F e | epsilon
E -> G h | D f
F -> E A | d F | epsilon
G -> g

>From the above, we know:
FIRST0(A)={a}, FIRST0(B)={}, FIRST0(C)={b,c}, FIRST0(D)={e},
FIRST0(E)={f}, FIRST0(F)={d}, FIRST0(G)={g}, and
contains-the-FIRSTs-of = { (A,C), (A,E), (A,F), (B,C), (B,G), (C,B),
(C,D), (D,F), (E,D), (E,G), (F,E)}.

Drawn as a poor man's ASCII-graph this is:

G <------------
^ _______ |
| | | |
| | v |
| A-->F-->E--
| | ^ |
| v | |
B<--C-->D<--
| ^
| |
---

Let's assume that we know whether a nonterminal is nullable. Then,
both
- the initial sets FIRST0(A), reflecting the "direct contributions",
- and the relation contains-the-FIRSTs-of, reflecting exactly the
"indirect contributions" (subset-dependencies of FIRST sets),
can be computed in a single pass over the grammar productions in
time linear to the size of G.

Now one observes that, for each nonterminal A, any terminal a in
FIRST(A) indeed stems either from a direct contribution - i.e.
a in FIRST0(A) - or from an indirect contribution - i.e. a in
FIRST(B) for some nonterminal B, and A contains-the-FIRSTs-of A.
In other words, we can collectively characterize all FIRST sets
as the smallest sets FIRST(A) satisfying, for each A in N:
1.) FIRST(A) contains FIRST0(A), and
2.) for each nonterminal B, s.t. A contains-the-FIRSTs-of B:
FIRST(A) contains FIRST(B).
By the way, this is equivalent to
FIRST(A) = Union of { FIRST0(B) : A contains-the-FIRSTs-of* B }.

Now, what an appropriate algorithm must do in view of the
problem visualization as a graph is to compute the FIRST0 set of
each graph node ("init"), and to traverse the contains-the-FIRSTs-of
graph in depth-first manner unioning FIRST(B) into FIRST(A)
when returning from the traversal of en edge (A,B) ("union").

Well, and cycles must be treated correctly, i.e. all nodes
(nonterminals) along a cycle must finally get the same FIRST sets
assigned ("distribute"):
For example, for the cycle D->F->E->D we have
FIRST(D) contains FIRST(F) contains FIRST(E) contains FIRST(D),
so FIRST(D) = FIRST(F) = FIRST(E) = FIRST(D).
And there can be cycles in cycles...

Therefore, to do the minimum work required, an appropriate
algorithm must determine the structure of the contains-the-FIRSTs-of
graph w.r.t. cycles (or, to be precise, the "strongly connected
components" (SCCs) of the graph).

All this can be perfectly combined with Tarjan's brilliant algorithm
for the computation of a directed graph's SCCs. All we must do is
to add three set computation statements to Tarjan's algorithm, and
there we are.

Here is a sketch of an implementation-friendly variant of Tarjan's
algorithm computing what we need in case of FIRST (the three
set-computing statements added are marked by (*...!*) ).

(* asssume that nonterminals are INTEGERs 0, 1, ..., n-1 *)
(* and that sets FIRST0[0..n - 1] are already computed. *)

PROCEDURE ComputeFIRST;

CONST clean = - 1; done = n;

VAR Stack, Low: ARRAY [0 .. n - 1] OF INTEGER;
Top: INTEGER;

PROCEDURE Traverse(v: INTEGER);

VAR k, j, w, Top1: INTEGER;

BEGIN
INC(Top); Stack[Top] := v; Low[v] := Top; Top1 := Top;
FOREACH w such that "v contains-the-FIRSTs-of w" DO
IF Low[w] = clean THEN
Traverse(w);
FIRST[v] := FIRST[v] + FIRST[w]; (*union!*)
END;
IF Low[w] < Low[v] THEN Low[v] := Low[w] END
END;
IF Low[v] = Top1 THEN (* v root of SCC *)
WHILE Top >= Top1 DO w := Stack[Top];
FIRST[w] := FIRST[v]; (*distribute!*)
Low[w] := done; DEC(Top)
END
END
END Traverse;

VAR v: INTEGER;

BEGIN
FOR v := 0 TO n - 1 DO FIRST[v] := FIRST0[v] END; (*init!*)
Top := - 1;
FOR v := 0 TO n - 1 DO Low[v] := clean END;
FOR v := 0 TO n - 1 DO
IF Low[v] = clean THEN Traverse(v) END
END
END ComputeFIRST;

In general, this algorithm does no more unnecessary set
computations. The underlying graph depth-first graph traversal
algorithm visits each node (nonterminal) exactly once and traverses
each edge (contains-the-FIRSTs-of relationship) exactly once, and it
has linear run-time w.r.t. the size of the graph (or, here, the
grammar). You can hardly do any better.

The algorithm is appropriate because its basic algorithmic structure
corresponds to the structure of the FIRST-computation problem. I
hope I have made that clear enough so that you will agree.

Computing FOLLOW is structurally very similar except for the fact,
of course, that the initial sets FOLLOW0 are different from the sets
FIRST0, and that the graph-spanning relation contains-the-FOLLOWs-of
is different from contains-the-FIRSTs-of.
In other words, a completely different graph will be traversed.
Having understood the above, it shouldn't be hard to derive how

Some final remarks:

In case you compute FIRST sets for an LL(1) parser you can easily
extend the above algorithm to complain about left recursion, i.e.
cycles in the graph.

I chose the above example grammar such that the resulting graph
is the one used in the classic book
K.Mehlhorn: Graph Algorithms and NP-Completeness, Springer 1984
to study Tarjan's SCC algorithm with.

The idea to add set-computing statements to (a variant of) Tarjan's
SCC algorithm is explained in the beautiful article
F.DeRemer, T.Pennello: Efficient Computation of LALR(1) Look-Ahead
Sets, TOPLAS 1984.

Tarjan's SCC algorithm has been published in
R.E.Tarjan: Depth-First Search and Linear Graph Algorithms, SIAM
Journal on Computing 1972.

Tarjan's original article on SCC computation is hard to understand.
Maybe you prefer the presentation in
T.H.Cormen, C.E.Leiserson, R.L. Rivest: Introduction to Algorithms,
The MIT Press, 1990.

Also, chapter 2 of
S.Sippu, E.Soisalon-Soininen: Parsing Theory, vol.1, Springer 1988
treats the computation of set-valued functios over graphs in detail.
It explains more mathematically inclined what I've only sketched
above for the "theory-aware practitioner".

Post a followup to this message