Re: Context sensitive scanner ?

"Scott Stanchfield" <thetick@magelang.com>
24 Nov 1997 23:38:47 -0500

          From comp.compilers

Related articles
Context sensitive scanner ? hat@se-46.wpa.wtb.tue.nl (Albert Theo Hofkamp) (1997-11-20)
Re: Context sensitive scanner ? johnm@non.net (1997-11-23)
Re: Context sensitive scanner ? pjj@cs.man.ac.uk (1997-11-23)
Re: Context sensitive scanner ? Mikael.Pettersson@sophia.inria.fr (Mikael Pettersson) (1997-11-23)
Re: Context sensitive scanner ? genew@vip.net (1997-11-23)
Re: Context sensitive scanner ? thetick@magelang.com (Scott Stanchfield) (1997-11-24)
Re: Context sensitive scanner ? cfc@world.std.com (Chris F Clark) (1997-11-28)
Re: Context sensitive scanner ? henry@zoo.toronto.edu (Henry Spencer) (1997-11-28)
Re: Context sensitive scanner ? ok@cs.rmit.edu.au (1997-11-29)
Re: Context sensitive scanner ? hat@se-46.wpa.wtb.tue.nl (Albert Theo Hofkamp) (1997-11-29)
Re: Context sensitive scanner ? thetick@magelang.com (Scott Stanchfield) (1997-11-30)
Re: Context sensitive scanner ? johnm@non.net (1997-11-30)
[4 later articles]

From: "Scott Stanchfield" <thetick@magelang.com>
Newsgroups: comp.compilers
Date: 24 Nov 1997 23:38:47 -0500
Organization: MageLang Institute - http://www.magelang.com
References: 97-11-117 97-11-127
Keywords: parse

Wow! That sounds _exactly_ like what I had to do to parse Visual Basic!


Add labeled end-ifs on top of that... (I think this was the one.)


IF X=1 THEN
    statement
100 END IF


which needs THREE tokens of lookahead. I handled it the same way you did,
although I called it a "scanner wrapper". Since I was using lex at the time, I
had my makefile rename the lex-generated scanner to "real_yylex" and wrote my
own yylex that called real_yylex and changed things like END IF into a single
END_IF token.
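Something along these lines -- just a minimal sketch, with made-up token
codes and the yylval plumbing left out, not the actual code:

#include "y.tab.h"             /* token codes (END, IF, END_IF, ...) from the
                                  yacc grammar -- the names here are invented */

extern int real_yylex(void);   /* the lex-generated scanner, renamed by the makefile */

static int pushback = -1;      /* one token of pushback for the peek */

int yylex(void)
{
    int tok;

    if (pushback != -1) {      /* return the token we peeked at last time */
        tok = pushback;
        pushback = -1;
        return tok;
    }

    tok = real_yylex();
    if (tok == END) {          /* peek ahead: is this "END IF"? */
        int next = real_yylex();
        if (next == IF)
            return END_IF;     /* fold the two tokens into one */
        pushback = next;       /* plain END statement; save the peeked token */
    }
    /* (other END pairs were handled the same way; yylval handling for the
       pushed-back token is omitted for brevity) */
    return tok;
}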


Did your language have something evil like NEXT X,Y...?


(Obviously the language was designed purely with the end user in mind, with
no thought for the poor compiler writer...)


Perhaps we should write a paper on this stuff... ;)


--
  -- Scott


==============================================
Scott Stanchfield -- http://www.scruz.net/~thetick
Magelang Institute -- http://www.magelang.com
>>the second expression should be returned as IDEN x, DOT, NUM 1, DOT, NUM 2
>> ...
>>I'd rather want to know what tokens the parser is expecting, and recognize
>>only those tokens (maybe with an exception on keywords) in the scanner.
>
>I don't know about any tools for dealing with this situation, but I
>once had to cope with something similar, but worse.
>
>Back in 1988, I wrote a compiler (a translator actually) for a proprietary
>Basic dialect, using yacc for the parser and a hand-written scanner (never
>liked Unix C lex much). The language had some "interesting" features:
>
>* It had an IF <exp> THEN <statements> [ELSE ...] END IF construct:
> * "END" and "IF" are separate tokens, thus requiring two tokens
> lookahead in the parser ("END" all by itself is a statement).
> As we all know, yacc does LALR(1), not LALR(2). My very first
> grammar had > 600 reduce/reduce conflicts!
> * You could *omit* the "END IF" part if the following statement
> started with a line number. In this case, a number of "virtual"
> "END IF" tokens would be supplied to terminate any pending open "IF"s.
>* Two-token lookahead was also required for some other cases, like
> "END WHILE" and "END FOR".
>* There were other cases where the token to be returned for some
> construct depended on a limited range of surrounding context.
> (I cannot, and really don't want to, recall the details now.)
>
>Now, one could imagine some awful hack, adding persistent (i.e., C
>"static") state variables in the scanner routines to recognize the
>different situations and cope with them. An early version of my
>scanner did just that, but it was an unmaintainable mess that never
>really worked well.
>
>I eventually recast the problem as a stack of coroutines
>processing streams of tokens. That is:
>1. A simple low-level scanner generates a stream of tokens.
>2. A "transformation" scanner is a coroutine with an input stream
> and an output stream. It steps through its input stream
> looking for its particular kind of transformation. If it
> doesn't find one, the first input token is output as-is,
> otherwise the transformation consumes N input tokens and
> produces M output tokens.
>3. Push one transformation scanner for each individual kind
> of transformation. Keep each scanner as simple as possible.
>4. The yacc-generated parser gets its input tokens from the
> output stream of the top-most transformation scanner.
>
>For instance, one transformation scanner looked for two-token
>sequences "END IF/WHILE/FOR", emitting instead synthetic
>"END_IF/END_WHILE/END_FOR" tokens.
>Another maintained the number of pending END_IF tokens by counting
>IF tokens: upon seeing a LINE_NUMBER token, it would insert
>the pending number of END_IF tokens before the LINE_NUMBER.
>
>This strategy was quite successful, since I could essentially
>forget about all the nasty special cases in both the yacc grammar
>and the low-level scanner. The implementation of the coroutines
>is almost trivial. Haven't seen anything like this described
>in the compiler literature, though.
>
>/Mikael
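For the record, that whole pipeline can be sketched in plain C as a chain of
filter functions, each with a little pushback state, pulling tokens from the
stage below it. The token codes and function names here are invented for
illustration, and yylval plumbing is again left out:

#include "y.tab.h"    /* IF, END, END_IF, END_WHILE, END_FOR, LINE_NUMBER, ... */

extern int raw_lex(void);          /* stage 1: the simple low-level scanner */

/* stage 2: merge two-token "END IF/WHILE/FOR" into single synthetic tokens */
static int merge_pending = -1;
static int merge_ends(void)
{
    int tok;
    if (merge_pending != -1) { tok = merge_pending; merge_pending = -1; return tok; }
    tok = raw_lex();
    if (tok == END) {
        int next = raw_lex();
        if (next == IF)    return END_IF;
        if (next == WHILE) return END_WHILE;
        if (next == FOR)   return END_FOR;
        merge_pending = next;      /* bare END statement */
    }
    return tok;
}

/* stage 3: before a LINE_NUMBER, emit one END_IF for every IF still open
   (simplified: single-line IFs and the like are ignored here) */
static int open_ifs, queued_endifs, held = -1;
static int close_ifs(void)
{
    int tok;
    if (queued_endifs > 0) { queued_endifs--; return END_IF; }
    if (held != -1) { tok = held; held = -1; return tok; }
    tok = merge_ends();
    if (tok == IF)
        open_ifs++;
    else if (tok == END_IF && open_ifs > 0)
        open_ifs--;
    else if (tok == LINE_NUMBER && open_ifs > 0) {
        queued_endifs = open_ifs - 1;  /* the rest come out on later calls */
        open_ifs = 0;
        held = tok;                    /* the line number follows the END_IFs */
        return END_IF;
    }
    return tok;
}

/* stage 4: the yacc-generated parser reads the top of the stack */
int yylex(void) { return close_ifs(); }

Each stage only knows about its own transformation, which is exactly what
makes the approach pleasant to maintain.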





