Re: Buffered input for a lexer?

"Randall Hyde" <rhyde@cs.ucr.edu>
19 Apr 2002 11:31:50 -0400

          From comp.compilers

Related articles
[12 earlier articles]
Re: Buffered input for a lexer? bear@sonic.net (Ray Dillinger) (2002-04-10)
Re: Buffered input for a lexer? bear@sonic.net (Ray Dillinger) (2002-04-10)
Re: Buffered input for a lexer? cgweav@aol.com (2002-04-13)
Re: Buffered input for a lexer? ralph@inputplus.co.uk (2002-04-16)
Re: Buffered input for a lexer? joachim_d@gmx.de (Joachim Durchholz) (2002-04-16)
Re: Buffered input for a lexer? cgweav@aol.com (2002-04-17)
Re: Buffered input for a lexer? rhyde@cs.ucr.edu (Randall Hyde) (2002-04-19)
Re: Buffered input for a lexer? monnier+comp.compilers/news/@RUM.cs.yale.edu (Stefan Monnier) (2002-04-19)
Re: Buffered input for a lexer? rhyde@cs.ucr.edu (Randall Hyde) (2002-04-20)
Re: Buffered input for a lexer? joachim_d@gmx.de (Joachim Durchholz) (2002-04-23)
Re: Buffered input for a lexer? bear@sonic.net (Ray Dillinger) (2002-04-23)
Re: Buffered input for a lexer? rhyde@cs.ucr.edu (Randall Hyde) (2002-04-23)
| List of all articles for this month |

From: "Randall Hyde" <rhyde@cs.ucr.edu>
Newsgroups: comp.compilers
Date: 19 Apr 2002 11:31:50 -0400
Organization: Prodigy Internet http://www.prodigy.com
References: 02-04-061 02-04-081 02-04-097
Keywords: lex, practice
Posted-Date: 19 Apr 2002 11:31:50 EDT

Joachim Durchholz wrote in message 02-04-097...
>
>I think it makes more sense to have the language processor properly
>interpret lines without an end-of-line. It's just good programming
>practice to have an as small set of constraints on the input as is
>reasonable.


While I agree with much of what has been said in this thread up to
this point, I feel compelled to make a few minor comments since my
name has come up in this thread.


>I happen to be nearly done with a handcrafted lexer right now, and
>while handling boundary cases like this wasn't exactly child's play, I
>have encountered no serious problems. It was all just an exercize in
>organizing the data flow and putting the "if" statements into the
>right levels of abstraction. (Interpreting end-of-file as end-of-line
>was just one boundary condition that I had to consider, and actually
>one of the easier ones.)


I, too, have finished a handcrafted lexer for HLA v2.0 (the High Level
Assembler). That compiler, like GCC, will complain if it discovers
that the last character in the file is not some sort of delimiter
character. The reason is simple, HLA uses memory mapped files for all
the source files it processes. If the last character of a source file
just happens to also be the last byte of a memory page *and* the next
physical page in memory is not read-enabled (or worse yet, is
read-enabled and contains data that looks like it is a continuation of
the lexeme ending in the previous byte), *and* the last character of
the file is not some sort of delimiter (or single character lexeme),
then it is possible for HLA v2.0 to do some bad things (i.e., continue
scanning into the next page). At worst, this could cause a seg fault.
At best, the next character would terminate the current lexeme and
things would be fine. Now the likelihood of all the conditions
happening is exceedingly small (less than one chance in 10,000-100,000
using an off-the-cuff estimate). Of course, an professional software
engineer would still check...


However that check can be very expensive it certain spots of the
lexer. For example, the loop that processes identifiers, integer
constants, and a few other lexeme classes contains only a few machine
instructions (per character) in HLA v2.0. Sticking in even *two*
machine instructions into this loop increases the processing time of
the loop by 33-50%. This is very significant since these loops are
among the most commonly executed in the lexer (and the lexer consumes
the majority of the compiler's time for HLA). Therefore, for
performance reasons I make the assumption that a source file does not
end with the last character of an identifier, numeric literal, and
certain other lexemes *and* the size of the source file is some
multiple of the page size (4096 bytes). If so, the compiler emits a
warning before attempting to process the source file.


Now some would argue that this is poor engineering; this is a known
defect and it's easily corrected. However, it occurs so infrequently,
is easily worked around by the user (just add a newline to the end of
the file, for example), and leaving the 'defect' in produces a
noticable boost in compilation speed. As long as the compiler warns
you about the possibility of a problem (and doesn't bomb or
mis-process the source code as a result of the defect), I can live
with the problem.


Note, btw, that HLA does not require a newline at the end of the file.
Any single character lexeme, whitespace, or other delimiter works fine
(since HLA does check for EOF *between* lexemes).


John suggested tacking on an extra page to the end of the mapped file.
Alas, this isn't portable (at least between the crop of OSes I'm
looking at) so that's not an option for me. At one time I was tempted
to trap the seg fault, if one occurs, and attempt to deal with it.
However, given the rarity of the problem, and the fact that the
compiler can warn the user about the problem, I didn't feel this to be
worthwhile.


While I appreciate what you've done with "organizing the data flow and
putting the "if" statements..." this wasn't an option in my case
because one of those "if" statements was my two machine instructions I
wanted to eliminate.


Now granted, HLA's lexer (which is written in HLA, that is, assembly
language) is probably fast enough that most people would be happy with
it. However, the boost in performance I'm seeing on my old 300 MHz
PII machine is impressive (to me, at least). So this is a compromise
I'm willing to live with.


I totally agree with other poster's comments that the compiler must
not modify the source file. There may be some good reasons that the
programmer left off an EOLN or organized the source file in some other
manner. (not to mention, some OSes only allow read-only memory mapped
files, so modifying the source isn't an option in HLA's case).


Randy Hyde
[How about looking at the end of the file, and using a special slow
version of the lexer if the eof isn't a newline? -John]



Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.