Re: Just Starting Writing An Interpreter


          From comp.compilers

Related articles
Just Starting Writing An Interpreter alankemp@bigfoot.com (Alan Kemp) (1997-10-26)
Re: Just Starting Writing An Interpreter burley@cygnus.com (Craig Burley) (1997-10-29)

From: Craig Burley <burley@cygnus.com>
Newsgroups: comp.compilers
Date: 29 Oct 1997 23:05:52 -0500
Organization: Cygnus Support
References: 97-10-114
Keywords: interpreter

"Alan Kemp" <alankemp@bigfoot.com> writes:


> This is the definition of the very simple language I am working with:
>
> It will have no comments
> It will have only one command:
> PRINT (text to be printed goes here);
> All commands will end with a semi-colon
> The program will terminate upon finding the word:
> END


I hope that is the definition of the *sample* language you're trying
to get working before you actually start supporting the rest of the
language! Otherwise, you don't need to write it in C -- presumably
perl (or sed or awk, etc.) would handle this in a line or two of code.
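

For instance, a minimal sketch in C++ (the names, the use of standard
input, and the assumption that everything between "PRINT (" and ");" is
literal text are all mine, not Alan's) could dispose of the whole
language in one small loop:


    /* Minimal sketch: echo whatever sits between "PRINT (" and ");",
       and stop at "END".  Error handling for malformed input omitted. */
    #include <iostream>
    #include <iterator>
    #include <string>

    int main()
    {
        std::string src((std::istreambuf_iterator<char>(std::cin)),
                        std::istreambuf_iterator<char>());
        std::string::size_type pos = 0;
        for (;;)
        {
            std::string::size_type p = src.find("PRINT", pos);
            std::string::size_type e = src.find("END", pos);
            if (e != std::string::npos
                && (p == std::string::npos || e < p))
                break;                     /* END terminates the program */
            if (p == std::string::npos)
                break;                     /* no more commands           */
            std::string::size_type open = src.find('(', p);
            std::string::size_type close = src.find(");", open);
            std::cout << src.substr(open + 1, close - open - 1) << "\n";
            pos = close + 2;
        }
        return 0;
    }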


Before I can answer much in the way of questions on this, *why* are
you trying to write an interpreter? That you seem to be defining your
own language as you go along makes it difficult to tell whether you'd
prefer defining a language that is inherently easy to interpret
(e.g. a PostScript-like language) or one that is difficult to
interpret (e.g. Fortran 90). Either would be a good exercise, though
the former, while easier, is much less likely to provide opportunities
to learn about the difficulties of writing an interpreter (which, in
some areas, can be more trying than those of writing a compiler!).


> I am using a separate function to deal with every possible command
> (at the moment only one). Is this the best way to do it?


Yes and no. A separate function to deal with a separate *aspect* of
every possible command is quite reasonable. Having a single function
deal with *all* aspects of a command might be less so for a fairly
rich language, though that relates to language design.


For example, does your language permit a command to define a distinct
syntax for its arguments? If so, there should probably be a distinct
function provided to parse the arguments for that command, plus (of
course) the function to perform the command.
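

Concretely, that split might be sketched (with entirely hypothetical
names; none of this is from your code) as a table mapping each command
name to its own argument parser and its own executor:


    /* Hypothetical sketch: each command gets its own argument parser
       and its own executor.  Tokens and Args are just strings here to
       keep the sketch self-contained. */
    #include <iostream>
    #include <string>
    #include <vector>

    typedef std::vector<std::string> Tokens; /* raw tokens of one statement */
    typedef std::vector<std::string> Args;   /* digested arguments          */

    struct Command
    {
        const char *name;
        Args (*parse_args)(const Tokens &toks); /* command-specific syntax */
        void (*execute)(const Args &args);      /* performs the command    */
    };

    /* "print" uses the ordinary argument syntax... */
    Args parse_print_args(const Tokens &toks)
    {
        return Args(toks.begin() + 1, toks.end());
    }

    void execute_print(const Args &args)
    {
        for (size_t i = 0; i < args.size(); ++i)
            std::cout << args[i] << (i + 1 < args.size() ? " " : "\n");
    }

    static const Command cmd_table[] =
    {
        { "print", parse_print_args, execute_print },
        /* ...a command with its own strange argument syntax would simply
           supply a different parse_args function here. */
    };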


That way, you can handle parsing code like


    if (some_cond)
        do
            command_with_strange_syntax("2000);
        end
    ...


more elegantly. The problem shown here is that, while in most cases
double-quote would, say, start a double-quote-delimited string in your
language, in the case of this command it doesn't -- let's say it
starts an octal constant (a la some popular extensions to FORTRAN 77).


So, if some_cond turns out to be FALSE, you don't want to *execute*
command_with_strange_syntax, but you do probably need to parse it the
same way so you know where it ends (and thus where the "end" begins).
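

In terms of the hypothetical table sketched above, that
parse-versus-execute split might look like (again, just a sketch):


    /* Continuing the earlier sketch: the command's own parser always
       runs, so we always learn where the statement ends, but we only
       call its executor when the surrounding condition was true. */
    void process_statement(const Command &cmd, const Tokens &toks,
                           bool actually_execute)
    {
        Args args = cmd.parse_args(toks);  /* needed even when skipping */
        if (actually_execute)
            cmd.execute(args);             /* only now do we act on it  */
    }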


It helps me a lot to view the artificial-language-processing problem
as a series of distinct layers -- lexing, syntactic analysis, semantic
analysis, and execution being obvious ones.


Lexing includes recognizing "print", "end", "(", ";", and so on as
single tokens, and, as such, it is usually made a distinct phase in
the processor, one that has little or no knowledge of the subsequent
phases. (For example, your lexer really shouldn't see "print" as
fundamentally different from "foobar", unless your language requires
it.)
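

A sketch of such a command-agnostic lexer (the token categories here
are my own invention) might be:


    /* Sketch: "print" and "foobar" both come out as plain WORD tokens;
       punctuation such as "(", ")" and ";" comes out one character at a
       time.  The lexer knows nothing about which words are commands. */
    #include <cctype>
    #include <string>
    #include <vector>

    enum TokenKind { WORD, PUNCT, END_OF_INPUT };

    struct Token
    {
        TokenKind   kind;
        std::string text;
    };

    std::vector<Token> lex(const std::string &src)
    {
        std::vector<Token> toks;
        std::string::size_type i = 0;
        while (i < src.size())
        {
            if (std::isspace((unsigned char) src[i]))
                ++i;
            else if (std::isalpha((unsigned char) src[i]))
            {
                std::string::size_type start = i;
                while (i < src.size()
                       && std::isalnum((unsigned char) src[i]))
                    ++i;
                toks.push_back(Token{ WORD, src.substr(start, i - start) });
            }
            else
            {
                toks.push_back(Token{ PUNCT, std::string(1, src[i]) });
                ++i;
            }
        }
        toks.push_back(Token{ END_OF_INPUT, "" });
        return toks;
    }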


It's *possible* that lexing would demand, or encourage, use of a
distinct function per command/statement, though this would more likely
be per lexeme; otherwise, lexing would have to include enough syntactic
analysis to know which command's lexing function to call.


Syntactic analysis includes recognizing ";" as ending a statement,
determining the type of statement, and so on. Again, ideally, this
phase shouldn't really care a lot about the other phases, but some
languages (Fortran, C) make that difficult or impossible.


It's in syntactic analysis where you might want (or need) your first
distinct function per command/statement.
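

Building on the sketches above, the driver for that phase might just
gather the tokens up to the ";" and look the statement up by its first
word:


    /* Continuing the sketch: gather one statement (everything up to the
       ";"), then pick the per-command parser by its first token. */
    const Command *find_command(const std::string &name)
    {
        size_t n = sizeof cmd_table / sizeof cmd_table[0];
        for (size_t i = 0; i < n; ++i)
            if (name == cmd_table[i].name)
                return &cmd_table[i];
        return 0;                          /* unknown statement type */
    }

    /* 'pos' is advanced past the terminating ";". */
    Tokens next_statement(const std::vector<Token> &toks, size_t &pos)
    {
        Tokens stmt;
        while (pos < toks.size()
               && toks[pos].kind != END_OF_INPUT
               && toks[pos].text != ";")
            stmt.push_back(toks[pos++].text);
        if (pos < toks.size() && toks[pos].text == ";")
            ++pos;                         /* consume the terminator */
        return stmt;
    }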


Semantic analysis includes recognizing variable names and function
names, determining/inferring type information, and maybe handling
precedence (although in a well-designed language, this is done in
syntactic analysis; that's basically the case for Fortran, anyway, but
not C or C++).


In semantic analysis you might want/need yet another distinct function
per command/statement, depending on your language and how much neat
stuff you want to put in your interpreter (e.g. stuff like detecting
undeclared variables in statements that aren't executed due to, for
example, being disabled via a false "if" or "else").
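

One small piece of that (hypothetical, and independent of the sketches
above) could be a check over every statement's variable references,
whether or not the statement will ever run:


    /* Sketch: complain about any variable reference that isn't in the
       symbol table, regardless of whether the statement executes. */
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    void check_variables(const std::vector<std::string> &names_used,
                         const std::set<std::string> &declared,
                         int line_no)
    {
        for (size_t i = 0; i < names_used.size(); ++i)
            if (declared.find(names_used[i]) == declared.end())
                std::cerr << "line " << line_no << ": variable '"
                          << names_used[i] << "' is not declared\n";
    }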


The execution phase is where you're probably already thinking of
having distinct functions for commands, and that's fine. It's here
you either generate code to perform the command (a compiler, usually)
or perform the command directly (an interpreter).


So, you might want to have two or more distinct functions for "print"
and other commands.


> One of the main problems I have is reading the character at position
> lineNo, charNo. Is there a way of doing this in C/C++?


Yes, but except when using file formats involving fixed-length
records, it's slower than remembering the byte offset of the data in
the file and using the "normal" positioning to go there, which I think
requires accessing the file via direct access instead of sequential
(but I haven't worried about C/C++ I/O in a long time, so what do I
know?).
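

One way that byte-offset approach might look in C++ (a sketch only;
error handling is omitted, and the file is opened in binary so the
offsets are exact):


    /* Sketch: remember where each line starts on the first pass, then
       seek directly to (lineNo, charNo) instead of re-reading lines. */
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        std::ifstream in("program.src", std::ios::binary); /* hypothetical */
        std::vector<std::streampos> line_start;
        std::string line;

        std::streampos pos = in.tellg();
        while (std::getline(in, line))
        {
            line_start.push_back(pos);   /* offset where this line began */
            pos = in.tellg();
        }

        int line_no = 3;                 /* say we want line 3, column 5 */
        int char_no = 5;                 /* (both 1-based)               */
        in.clear();                      /* clear the EOF state          */
        in.seekg(line_start[line_no - 1]);
        in.seekg(char_no - 1, std::ios::cur);
        std::cout << (char) in.get() << "\n";
        return 0;
    }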


Another approach to the problem I think you're trying to solve is
simply to save up (in memory) any code that might later be "jumped back"
to. Depending on the language, that can mean saving lots of
unnecessary stuff; depending on the language, that can mean saving the
already-somewhat-digested (e.g. lexed, syntactically analyzed,
semantically analyzed, or maybe even compiled) code to reduce
subsequent processing time and, perhaps, memory usage. Also, knowing
when it is okay to finally free some of the memory used for some code
can be problematic -- block-structured languages can make this much
easier if they don't allow arbitrary references to within blocks, for
example.
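

In that style, for example, a "jump back" can be nothing more than
resetting an index into the already-digested statements (hypothetical
types again):


    /* Sketch: keep each already-lexed statement around, so jumping back
       means resetting an index, not re-reading or re-lexing the file. */
    #include <string>
    #include <vector>

    typedef std::vector<std::string> Statement; /* one digested statement */

    struct Program
    {
        std::vector<Statement> stmts;  /* everything processed so far */
        size_t                 pc;     /* index of the next statement */
    };

    /* A jump target is simply a statement index saved earlier. */
    void jump_back(Program &prog, size_t saved_index)
    {
        prog.pc = saved_index;
    }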


This is why knowing what you want to accomplish can govern what you
can feasibly play with, and learn from, in a single interpreter
project where you get to define the language.


E.g. you can go to one extreme and define a language so flexible (say,
by allowing code to redefine the lexemes and syntax of subsequently
executed commands) that you won't have any real opportunity to do the
sorts of optimizations that are practical and effective in most real
languages. Imagine a language with a statement
that redefines the comment character for subsequent statements, for
example -- pretty much any lexing done before executing that statement
would be useless afterward, so any statements executed afterward would
have to be "freshly lexed" to cope with the new comment character.


Or you can go to the other extreme and define a language that would be
quite clean whether implemented by an interpreter or compiler, but
then miss out on some of the arcane stuff that real interpreters (like
shells) have to deal with because *they* provide more flexible
languages.


So, as with most things in life, it's best to know your goals and
explain them to obtain guidance!
--
James Craig Burley, Software Craftsperson burley@gnu.org
--

