Tokenizer theory and practice

Hans-Peter Diettrich <>
Tue, 13 May 2008 14:51:21 +0200

          From comp.compilers

Related articles
Tokenizer theory and practice (Hans-Peter Diettrich) (2008-05-13)
Re: Tokenizer theory and practice (cr88192) (2008-05-16)
Re: Tokenizer theory and practice (Dmitry A. Kazakov) (2008-05-16)
Re: Tokenizer theory and practice (Hans-Peter Diettrich) (2008-05-17)
Re: Tokenizer theory and practice (Hans Aberg) (2008-05-17)
Re: Tokenizer theory and practice (Hans-Peter Diettrich) (2008-05-17)
Re: Tokenizer theory and practice (cr88192) (2008-05-18)
[4 later articles]

From: Hans-Peter Diettrich <>
Newsgroups: comp.compilers
Date: Tue, 13 May 2008 14:51:21 +0200
Organization: Compilers Central
Keywords: lex
Posted-Date: 14 May 2008 11:57:08 EDT

I'm currently thinking about lexing/tokenizing binary data and Unicode,
as found e.g. in PDF files, and about possible optimizations in LL/PEG parsers.

Binary data typically comes in fixed-length chunks, where the parser has
to provide the length and encoding of the next token. This procedure is
quite different from lexing text, which is why I prefer the term
"tokenizer" in this context; bottom-up (LR) parsers do not seem to be
very usable in this case. A lexer can at best skip over binary data,
which is often sufficient, but then it must not be confused by embedded
null "characters" in the input. Traditional lexers often use null
characters as end markers, either for the end of the whole input or for
the end of the input buffer, so this value needs special handling
anyway. It should then not be hard to give such bytes an additional
meaning, as legal members of the expected character set. But how should
that handling be implemented?

IMO the most time-efficient procedure would use a pointer into an input
buffer, so that special handling (e.g. fill_buffer) is only required
when a null byte is detected. A dedicated getter procedure, in contrast,
can implement transparent buffering, returning a dedicated (epsilon/EOF)
value only at the real end of the input. The latter approach seems
more suitable for lexer generators, where null bytes could be
substituted by some non-zero code; this would not require changes to an
automatically generated lexer, apart from handling an extended character
set of at least 257 members. Opinions?
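To make the first variant concrete, here is a minimal C sketch of the
sentinel scheme I have in mind. All names (Source, src_next, the buffer
layout) are illustrative, not taken from any existing lexer: the fast
path is a plain pointer dereference, and only a zero byte triggers the
test that distinguishes an embedded NUL from the buffer-end sentinel
(where fill_buffer or EOF handling would go).

```c
#include <stddef.h>

/* Sketch: an input buffer terminated by a 0 sentinel byte. Reading a 0
   takes the slow path, which decides whether it is the sentinel (refill
   or real EOF) or a genuine NUL byte belonging to the data. */

enum { SRC_EOF = -1 };

typedef struct {
    const unsigned char *cur;   /* scan pointer */
    const unsigned char *limit; /* points at the 0 sentinel */
} Source;

static void src_init(Source *s, const unsigned char *buf, size_t len) {
    s->cur = buf;
    s->limit = buf + len;       /* buf[len] must be 0 (the sentinel) */
}

/* Return the next byte (0..255), or SRC_EOF at the real end of input.
   A refill of the buffer would also be hooked into the slow path. */
static int src_next(Source *s) {
    unsigned char c = *s->cur;
    if (c == 0 && s->cur == s->limit)
        return SRC_EOF;         /* sentinel reached, not a data byte */
    s->cur++;
    return c;                   /* a real byte, possibly a genuine NUL */
}
```

Note that embedded NUL bytes simply fall through the zero test, because
only the byte at the sentinel position terminates the scan.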

Unicode introduces a couple of problems into lexers, which I don't want
to discuss too deeply. The most important seems to be the expansion of
the character codes from single to multiple bytes. Together with the
various encoding forms (UTF...), a getter method seems preferable to
using pointers. The getter can then implement and encapsulate the
decoding, so that the lexer only has to deal with uniform (4 byte)
character codes.
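Such a getter could look roughly like the following C sketch, which
decodes UTF-8 from a byte buffer and hands the lexer uniform 32-bit
code points. The names are made up, and error recovery for malformed
sequences is deliberately omitted; the point is only that the encoding
stays hidden behind one function:

```c
#include <stdint.h>

/* Hypothetical getter: decode UTF-8 into 32-bit code points so the
   lexer never sees individual bytes. CP_EOF marks the end of input. */

#define CP_EOF (-1)

typedef struct {
    const unsigned char *p;
    const unsigned char *end;
} Utf8Source;

static int32_t utf8_next(Utf8Source *s) {
    uint32_t cp;
    int extra, i;
    unsigned char b;
    if (s->p >= s->end)
        return CP_EOF;
    b = *s->p++;
    if (b < 0x80)      return b;                        /* ASCII fast path */
    else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }    /* 2-byte form */
    else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }    /* 3-byte form */
    else               { cp = b & 0x07; extra = 3; }    /* 4-byte form */
    for (i = 0; i < extra && s->p < s->end; i++)
        cp = (cp << 6) | (*s->p++ & 0x3F);              /* fold in continuation bytes */
    return (int32_t)cp;
}
```

With this in place the ASCII parts of the language still take the cheap
single-byte path, while literals with "real" Unicode content go through
the multi-byte branches.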

In formal languages a mix of ASCII (keywords) and "real" Unicode parts
(literals...) can be expected, so an optimized tokenizer should be
able to treat these parts differently. Again a top-down parser seems
preferable, because it can switch the tokenizer between the different
modes (SBCS, MBCS, binary...). But wouldn't it make sense, then, to
replace the lexer/tokenizer with multiple "matchers", which are called
by the parser as appropriate, and which can e.g. match expected literals
(required keywords...) without guessing at the kind of the next token?
But can such specialized functions really speed up or simplify parsing?
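As a sketch of what I mean by a "matcher", a C function like the
following could be called by the parser wherever a specific keyword is
required, succeeding and consuming input only on a match (the name and
interface are illustrative only):

```c
#include <string.h>

/* Sketch of a parser-driven matcher: instead of a lexer classifying
   the next token, the parser asks for exactly the keyword it expects.
   Returns 1 and advances *pp past the keyword on success, returns 0
   without consuming any input otherwise. */
static int match_keyword(const char **pp, const char *kw) {
    size_t n = strlen(kw);
    const char *p = *pp;
    char c;
    while (*p == ' ')                   /* skip leading blanks */
        p++;
    if (strncmp(p, kw, n) != 0)
        return 0;
    c = p[n];                           /* keyword must not continue as an identifier */
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
        (c >= '0' && c <= '9') || c == '_')
        return 0;
    *pp = p + n;
    return 1;
}
```

A failed match leaves the input position untouched, so the parser can
simply try the next alternative, much as a PEG parser would.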

Back to binary files: for PDF, ZIP and other file formats with
structured content, a multi-level parser seems to be required in most
cases, one that can separate the file into distinct parts which can
(and often must) be processed in random order. I vaguely remember an
approach to formalizing the description and decoding of binary files -
can somebody give me more precise hints in that direction?

