Re: Parsing a simple BASIC language

"Barry Kelly" <barry_j_kelly@hotmail.com>
10 Apr 2001 01:15:43 -0400

          From comp.compilers

Related articles
Parsing a simple BASIC language paul.dunn4@ntlworld.com (paul.dunn4) (2001-04-04)
Re: Parsing a simple BASIC language barry_j_kelly@hotmail.com (Barry Kelly) (2001-04-10)
Re: Parsing a simple BASIC language stephan@pcrm.win.tue.nl (2001-04-10)
Re: Parsing a simple BASIC language christl@fmi.uni-passau.de (2001-04-12)
Re: Parsing a simple BASIC language paul.dunn4@ntlworld.com (Dunny) (2001-04-12)
Re: Parsing a simple BASIC language barry_j_kelly@hotmail.com (Barry Kelly) (2001-04-14)
Re: Parsing a simple BASIC language marcov@toad.stack.nl (2001-04-18)
Re: Parsing a simple BASIC language michael@moria.de (2001-04-18)
[2 later articles]
| List of all articles for this month |

From: "Barry Kelly" <barry_j_kelly@hotmail.com>
Newsgroups: comp.compilers
Date: 10 Apr 2001 01:15:43 -0400
Organization: Ireland On-Line Customer
References: 01-04-014
Keywords: parse, Basic
Posted-Date: 10 Apr 2001 01:15:43 EDT

"paul.dunn4" <paul.dunn4@ntlworld.com> wrote in message
> I have recently begun a new project, [...] I am writing a Sinclair
> Spectrum BASIC interpreter - [...] what has got me stumped is parsing
> the text entered by the user for errors.


> The problem that I have is one of speed.


> I would like to create a simple syntax highlighter which displays a
> syntax template (such as FOR <variable> = <Numexpr> TO <numexpr> ) and
> highlights errors in red as you type. Unfortunately, it slows down to
> the point of unusability if you type over about 150 chars in the input
> string.


> 1) How to parse a simple BASIC language (like Sinclair Basic) quickly,
> 2) Any better method than the one I'm using here. I used BNF for
> everything, but that was even slower than the method I am using.


I'm not familiar with Sinclair Basic, but I have written many parsers
in Object Pascal (Delphi). Might I suggest that you use a simple
recursive descent parser (and change your grammar to LL(1), with
context if necessary), and begin highlighting in red at the beginning
of the first invalid token found by the parser?


For speed, the method I have found to work efficiently for lexical
analysis in Object Pascal looks like this. It uses a little hack,
similar to Classes.pas's TParser, which defines TToken as a character
type; this makes single-character tokens such as ':' and ';' easy to
handle directly:


---8<---
// class outline edited for clarity
type
    TToken = #0..#255;


    // use a hash lookup at the end of NextToken to check if
    // a tokIdentifier is actually a tokKeyword
    TKeyword = (keyNone, keyBlah, keyFor, keyEtc);


    TLexer = class
    private
        FCurrentPosition: Integer;
    public
        constructor Create(ASource: string);


        procedure NextToken;


        property Token: TToken;
        property TokenStart: Integer;
        property TokenLength: Integer;
        property TokenAsString: string;
        property TokenAsInteger: Integer;
        property TokenAsFloat: TFloat;
        property TokenAsWhatever: TWhatever;
        property TokenAsKeyword: TKeyword;
    end;


const
    tokEOF = TToken(0);
    tokInteger = TToken(1);
    tokFloat = TToken(2);
    tokWhatever = TToken(3);
    tokKeyword = TToken(4);
    tokIdentifier = TToken(5);
--->8---


The job of the NextToken procedure is:


beginning at the CurrentPosition:
    skip whitespace
    determine the token type from its first character
    read in that token type fully
    leave the CurrentPosition 1 after the end of the token just read
    set TokenAsString, TokenAsInteger, Token, etc as appropriate.


I'm sure you're familiar with writing native-code regexp matchers, but I
have seen many people using Object Pascal do it inefficiently, so I'm
going to be fairly verbose (again edited a little for clarity):


---8<---
procedure TLexer.NextToken;
// assumes backing fields for the properties in the outline above:
// FToken, FTokenStart, FTokenLength, FTokenAsString, FTokenAsInteger,
// FTokenAsFloat, plus FSource holding the text being scanned
const
    WhiteSpace = [#1..#32];
    Alpha = ['a'..'z', 'A'..'Z', '_'];
    Numeric = ['0'..'9'];
    AlphaNumeric = Alpha + Numeric;
var
    currPos: Integer; // allows register variable
    // another optimization, if Delphi doesn't do automatic
    // loop variable induction, is to use a PChar for currPos
    // instead; it'll mean slightly more pointer arithmetic
    // to figure out TokenStart, and you should use the
    // SetString function instead of the Copy function.
begin
    currPos := FCurrentPosition;

    // skip whitespace
    while FSource[currPos] in WhiteSpace do
        Inc(currPos);

    // determine token type
    case FSource[currPos] of

        #0:
        begin
            // a #0 (null character) means end of source
            FToken := tokEOF;
            // we don't touch the current position,
            // so repeated NextToken keeps setting
            // the token to tokEOF
            FTokenStart := currPos;
            FTokenLength := 1;
        end;

        'a'..'z', 'A'..'Z', '_': // Alpha (a set constant can't be a case label)
        begin
            // an identifier
            FTokenStart := currPos;

            while FSource[currPos] in AlphaNumeric do
                Inc(currPos);

            // the current position is now at the first
            // character that ISN'T part of the identifier

            // extract token (from start, length currpos - start)
            FTokenAsString := Copy(FSource, FTokenStart,
                currPos - FTokenStart);

            // set token type
            FToken := tokIdentifier;
            FTokenLength := currPos - FTokenStart;
        end;

        '0'..'9': // Numeric
        begin
            // a number
            FTokenStart := currPos;

            while FSource[currPos] in Numeric do
                Inc(currPos);

            // extract token (from start, length currpos - start)
            FTokenAsString := Copy(FSource, FTokenStart,
                currPos - FTokenStart);

            // StrToInt lives in SysUtils
            FTokenAsInteger := StrToInt(FTokenAsString);
            FTokenAsFloat := FTokenAsInteger;

            // set token type
            FToken := tokInteger;
            FTokenLength := currPos - FTokenStart;
        end;

    else
        // some operator like + or whatever
        // an easy cheat is to store the token as
        // a character type and use characters below
        // 32 (normally whitespace and control
        // characters) to represent token types like
        // integer, float, identifier, etc.
        FToken := FSource[currPos];
        FTokenStart := currPos;
        FTokenLength := 1;

        // VERY IMPORTANT to leave current position
        // AT ONE AFTER current token
        Inc(currPos);
    end;

    FCurrentPosition := currPos;
end;
--->8---


Make sure you initialize FCurrentPosition to 1 (not the default 0) in the
constructor. Keep a copy of ASource in something like FBuffer if you use
PChars instead of indices, otherwise the refcount of the string might go to
zero when the constructor returns.
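
A minimal constructor along those lines might look like this; FSource is
an assumed field name for the string-indexing version shown above, and
appending the #0 sentinel is my own addition so the scanner always finds
its end-of-source marker:

---8<---
constructor TLexer.Create(ASource: string);
begin
    inherited Create;
    // keep our own reference to the text; the trailing #0 is the
    // end-of-source sentinel that NextToken turns into tokEOF
    FSource := ASource + #0;
    FCurrentPosition := 1; // Delphi strings are 1-based
end;
--->8---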


Recursive descent parsers are trivial to write. Essentially, you start off
writing a class that takes a TLexer parameter in its constructor; the
constructor should probably call NextToken on the lexer so the first token
is already available when parsing starts. The class should have a public
method that corresponds to a starting grammar rule, like ReadStatement, or
whatever. Each grammar rule should eat anything it expects in the token
stream, collect information about identifiers, etc., and at the end do a
case statement that decides which rule to continue with. For rules like
subtraction or division, which need to associate to the left, use a while
or repeat loop instead of recursing into the same function, as in the
sketch below.
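
For instance, an additive expression rule might look like the following
sketch; ReadTerm is a hypothetical helper for the next level down:

---8<---
// Expression = Term { ('+' | '-') Term }
// looping instead of recursing makes a - b - c group from the left
procedure TParser.ReadExpression;
begin
    ReadTerm;
    while Lexer.Token in ['+', '-'] do
    begin
        Lexer.NextToken; // eat the operator
        ReadTerm;
    end;
end;
--->8---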


A fuller example of a grammar rule (I'm just checking for consistency):


---8<---
procedure TParser.ReadForLoop;
begin
    EatKeyword(keyFor); // should be a helper method
    EatToken(tokIdentifier); // just checking for consistency
    // easy enough to add more support if necessary
    EatToken('='); // this is possible with the 'hack'


    ReadExpression; // change to return type of
    // expression if you like


    EatKeyword(keyTo);


    ReadExpression;


    if Lexer.TokenAsKeyword = keyStep then
    begin
        Lexer.NextToken;
        ReadExpression;
    end;
end;
--->8---


Of course, the Eat* methods should raise exceptions if the token in the
lexer doesn't correspond to the expected token.
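
They only need a couple of lines each; something like this, where
EParseError is an assumed exception class derived from Exception:

---8<---
procedure TParser.EatToken(AToken: TToken);
begin
    if Lexer.Token <> AToken then
        raise EParseError.Create('unexpected token at position ' +
            IntToStr(Lexer.TokenStart));
    Lexer.NextToken;
end;

procedure TParser.EatKeyword(AKeyword: TKeyword);
begin
    if Lexer.TokenAsKeyword <> AKeyword then
        raise EParseError.Create('keyword expected at position ' +
            IntToStr(Lexer.TokenStart));
    Lexer.NextToken;
end;
--->8---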


If you add methods to the lexer and parser to reset themselves, they should
easily be fast enough to handle thousands of characters without any
noticeable delay. You should catch any exceptions thrown by the Eat*
methods (or Expected* methods - similar to Eat*, but they don't call
Lexer.NextToken) and begin your highlighting at the position given by the
lexer state. You shouldn't need to parse the input again until the user
presses return or modifies something before the error position.
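
Pulled together, the per-keystroke check can be one small function that
returns where to start the red highlighting; CheckLine, ReadStatement as
the start rule, and EParseError are assumed names:

---8<---
// returns 0 if the line parses cleanly, otherwise the 1-based
// offset at which to start highlighting in red
function CheckLine(const ALine: string): Integer;
var
    lexer: TLexer;
    parser: TParser;
begin
    Result := 0;
    lexer := TLexer.Create(ALine);
    try
        parser := TParser.Create(lexer);
        try
            try
                parser.ReadStatement;
                if lexer.Token <> tokEOF then
                    Result := lexer.TokenStart; // trailing junk
            except
                on EParseError do
                    Result := lexer.TokenStart; // first bad token
            end;
        finally
            parser.Free;
        end;
    finally
        lexer.Free;
    end;
end;
--->8---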


This recursive descent parser technique was inspired by Jack Crenshaw.


One final note for speed: make sure you're using a hash table for the
keyword lookup. I've got a fairly fast hash table at


http://codecentral.borland.com/codecentral/ccweb.exe/listing?id=15171.
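
If you'd rather not pull in a separate hash table unit, a sorted
TStringList (from Classes) does the job well enough for a few dozen
keywords; a sketch, with the list, its initialization and LookupKeyword
being my own names:

---8<---
var
    Keywords: TStringList;

// map an identifier to a TKeyword; keyNone if it isn't a keyword.
// Find does a binary search on the sorted list.
function LookupKeyword(const AIdent: string): TKeyword;
var
    index: Integer;
begin
    if Keywords.Find(UpperCase(AIdent), index) then
        Result := TKeyword(Integer(Keywords.Objects[index]))
    else
        Result := keyNone;
end;

initialization
    Keywords := TStringList.Create;
    Keywords.Sorted := True;
    Keywords.AddObject('FOR', TObject(Ord(keyFor)));
    Keywords.AddObject('TO', TObject(Ord(keyTo)));
    Keywords.AddObject('STEP', TObject(Ord(keyStep)));
    // ... and so on for the rest of the keyword set
finalization
    Keywords.Free;
--->8---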


Much of this goes without saying, but I want to make sure.


-- Barry
barry_j_kelly@hotmail.com

