Re: Tokenizer theory and practice

"cr88192" <cr88192@hotmail.com>
Fri, 16 May 2008 11:13:16 +1000

          From comp.compilers

Related articles
Tokenizer theory and practice DrDiettrich1@aol.com (Hans-Peter Diettrich) (2008-05-13)
Re: Tokenizer theory and practice cr88192@hotmail.com (cr88192) (2008-05-16)
Re: Tokenizer theory and practice mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2008-05-16)
Re: Tokenizer theory and practice DrDiettrich1@aol.com (Hans-Peter Diettrich) (2008-05-17)
Re: Tokenizer theory and practice haberg_20080406@math.su.se (Hans Aberg) (2008-05-17)
Re: Tokenizer theory and practice DrDiettrich1@aol.com (Hans-Peter Diettrich) (2008-05-17)
Re: Tokenizer theory and practice cr88192@hotmail.com (cr88192) (2008-05-18)
Re: Tokenizer theory and practice cr88192@hotmail.com (cr88192) (2008-05-18)
[3 later articles]
| List of all articles for this month |

From: "cr88192" <cr88192@hotmail.com>
Newsgroups: comp.compilers
Date: Fri, 16 May 2008 11:13:16 +1000
Organization: Saipan Datacom
References: 08-05-050
Keywords: lex, design
Posted-Date: 15 May 2008 21:23:45 EDT

"Hans-Peter Diettrich" <DrDiettrich1@aol.com> wrote in message
> I'm currently thinking about lexing/tokenizing binary data and Unicode,
> in e.g. PDF files, and possible optimizations in LL/PEG parsers.
>
> Binary data typically comes in fixed-length chunks, where the parser has
> to provide the length and encoding of the next token. This procedure is
> quite different from lexing text, which is why I prefer the term
> "tokenizer" in this context, and bottom-up (LR) parsers do not seem
> very usable in this case. A lexer at best can skip over binary data,
> which often is sufficient, but then it should not be confused by embedded
> null "characters" in the input. Traditional lexers often use null
> characters as end markers, either for the end of the whole input or for
> the end of the input buffer, so this value needs special handling
> anyway. Then it should not be hard to add another meaning to such
> bytes, as legal members of the expected character set. But how would one
> implement that handling?


<snip/>


> Back to binary files: for PDF, ZIP, and other file formats with
> structured content, a multi-level parser seems to be required in most
> cases, one that can separate the file into distinct parts, which can
> (and often must) be processed in random order. I vaguely remember an
> approach to formalizing the description and decoding of binary files -
> can somebody give me more precise hints in that direction?


IMO, parsers of this sort (especially text-oriented parser
generators) are not a good approach to parsing binary data, and
neither are the formalisms often employed, which are themselves
intended for textual structures.


Rather, I suggest looking at how such formats are actually structured,
and how they are commonly represented within format specs, and trying
to derive from that a general notation, formalism, and possible
implementation.


One issue, I think: there is a wide variety of ways in which data can
be, and often is, structured:
uniform-sized blocks (common in filesystems, formats like GCF, B-trees, ...);
structures and offsets (ELF, COFF, ...);
TLV structures (RIFF, PNG, ASN.1, ...; see the sketch after this list);
bitstreams (Deflate, ...);
bytestreams (Java class files, WBXML, many simpler LZ77 variants, ...);
escaped tags (JPEG, MPEG, FLAC, ZIP, ...);
...
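
For instance, here is a minimal sketch in C of walking TLV chunks,
assuming a RIFF-like layout (a FOURCC tag, a 32-bit little-endian
length, then the payload); the names are mine, not from any particular
library:

/* walk RIFF-like TLV chunks in an in-memory buffer;
   error handling is deliberately thin */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static uint32_t get_u32le(const unsigned char *p)
{
    /* assemble a 32-bit little-endian value byte by byte */
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static void walk_chunks(const unsigned char *buf, size_t len)
{
    size_t pos = 0;
    while (pos + 8 <= len) {
        char tag[5];
        uint32_t sz;

        memcpy(tag, buf + pos, 4);     /* FOURCC tag */
        tag[4] = 0;
        sz = get_u32le(buf + pos + 4); /* payload length */
        if (sz > len - (pos + 8))      /* truncated chunk: stop */
            break;

        printf("chunk '%s', %u bytes\n", tag, (unsigned)sz);
        pos += 8 + sz;
        pos += pos & 1;                /* RIFF pads chunks to even sizes */
    }
}

The same skeleton covers PNG-style chunks with the byte order and
field order adjusted (and a trailing CRC), which is part of why TLV
formats are among the easier ones to describe generically.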


So, rather than a single formalism (such as LL or LR), more likely one
will end up with a kind of mini programming language just to specify
how the pieces fit together.


Presumed features: structs (with explicit on-disk layout), conditionals,
support for variable-length fields and members, the ability to deal with
file offsets, and similar ideas.


Even then, there is likely to be a whole lot more that can't be
formalized than that can.


More likely, one will end up with what amounts to a format-design tool
rather than something that can reliably parse existing formats.


Even then, parsing the format is usually not the bulk of the work;
processing the data usually takes a lot more effort.




For example (just pulling something out of nowhere):

struct PACK_Entry {
    char name[56];          /* entry name, padded to 56 bytes */
    u32le offset;           /* file offset of this entry's data */
    u32le size;             /* size of the data, in bytes */
    byte@offset data[size]; /* payload, located via the offset field */
};

struct PACK_Header {
    FOURCC magic="PACK";         /* file magic */
    u32le offset;                /* file offset of the directory */
    u32le ents;                  /* number of directory entries */
    PACK_Entry@offset dir[ents]; /* directory, located via offset */
};


Now, this is a trivial format, but even so it is awkward to
formalize...


And most other formats are nowhere near this trivial...




Now, personally, I don't usually use such formalisms or tools, since I
have usually found it most effective simply to write the parser myself.
And if a format were beyond my ability to parse by hand, it would
likely be somewhat beyond the abilities of these formalisms as well...
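
As a rough illustration (and of how little code the PACK example above
actually needs), a hand-written reader in C, assuming the layout
sketched earlier; the names and the thin error handling are mine:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* read a 32-bit little-endian value at the current file position */
static uint32_t read_u32le(FILE *fd)
{
    unsigned char b[4];
    if (fread(b, 1, 4, fd) != 4) return 0;
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* list the directory of a PACK-style file: check the magic, jump to
   the directory via its offset, then read fixed-size entries */
int pack_list(const char *path)
{
    FILE *fd = fopen(path, "rb");
    char magic[4], name[57];
    uint32_t diroffs, ents, i;

    if (!fd) return -1;
    if (fread(magic, 1, 4, fd) != 4 || memcmp(magic, "PACK", 4)) {
        fclose(fd);
        return -1;                      /* not a PACK file */
    }
    diroffs = read_u32le(fd);           /* directory offset */
    ents = read_u32le(fd);              /* entry count */

    fseek(fd, (long)diroffs, SEEK_SET); /* random access: jump to directory */
    for (i = 0; i < ents; i++) {
        uint32_t offs, size;
        if (fread(name, 1, 56, fd) != 56) break;
        name[56] = 0;                   /* force NUL termination */
        offs = read_u32le(fd);
        size = read_u32le(fd);
        printf("%s: %u bytes at offset %u\n",
               name, (unsigned)size, (unsigned)offs);
    }
    fclose(fd);
    return 0;
}

Note the fseek(): the directory is reached by random access through
the header's offset field, which is exactly the part that stream-based,
text-oriented formalisms handle poorly.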

