Re: Tokenizer theory and practice

"cr88192" <cr88192@hotmail.com>
Sun, 18 May 2008 09:51:34 +1000

From comp.compilers

Related articles
Tokenizer theory and practice DrDiettrich1@aol.com (Hans-Peter Diettrich) (2008-05-13)
Re: Tokenizer theory and practice cr88192@hotmail.com (cr88192) (2008-05-16)
Re: Tokenizer theory and practice mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2008-05-16)
Re: Tokenizer theory and practice DrDiettrich1@aol.com (Hans-Peter Diettrich) (2008-05-17)
Re: Tokenizer theory and practice haberg_20080406@math.su.se (Hans Aberg) (2008-05-17)
Re: Tokenizer theory and practice DrDiettrich1@aol.com (Hans-Peter Diettrich) (2008-05-17)
Re: Tokenizer theory and practice cr88192@hotmail.com (cr88192) (2008-05-18)
*Re: Tokenizer theory and practice cr88192@hotmail.com (cr88192)* (2008-05-18)**
Re: Tokenizer theory and practice mailbox@dmitry-kazakov.de (Dmitry A. Kazakov) (2008-05-18)
Re: Tokenizer theory and practice DrDiettrich1@aol.com (Hans-Peter Diettrich) (2008-05-18)
Re: Tokenizer theory and practice cr88192@hotmail.com (cr88192) (2008-05-20)

| List of all articles for this month |

From:	"cr88192" <cr88192@hotmail.com>
Newsgroups:	comp.compilers
Date:	Sun, 18 May 2008 09:51:34 +1000
Organization:	Saipan Datacom
References:	08-05-050 08-05-065 08-05-067
Keywords:	lex
Posted-Date:	18 May 2008 01:06:05 EDT

"Hans-Peter Diettrich" <DrDiettrich1@aol.com> wrote in message
> cr88192 schrieb:
>
>>> Back to binary files, including PDF, ZIP and other file formats with
>>> structured content, a multi-level parser seems to be required in most
>>> cases, which can separate the file into distinct parts, which can (often
>>> must) be processed in random order. I vaguely remember an approach to
>>> formalize the description and decoding of binary files - can somebody
>>> give me more precise hints in that direction?
>>
>> IMO, parsers of this sort (especially text-intended parser
>> generators), are not a good approach to parsing binary data, nor do I
>> believe are the formalisms often employed, which are themselves
>> intended for textual structures.
>
> That's why I mentioned multi-level parsers, where the top level handles
> the file structure. In PDF files the file structure is textual, with
> embedded binary streams. In other files the structure can be binary,
> with embedded text snippets.
>

ok.

>
>> Rather, I suggest looking at how such formats are actually structured,
>> and how they are commonly represented within format specs, and try to
>> derive from this a general notation, formalism, and possible
>> implementation.
>
> It would be nice to start with some already existing formalism.
>

should one exist...

>> an issue I think:
>> there are a wide variety of possible ways in which data can be, and often
>> is, structured:
>> uniform-sized blocks (common in filesystems, formats like GCF, B-Trees,
>> .....);
>> structures and offsets (ELF, COFF, ...);
>> TLV structures (RIFF, PNG, ASN.1, ...);
>> bitstreams (Deflate, ...);
>> bytestreams (Java class, WBXML, many simpler LZ77 variants, ...);
>> escaped tags (JPEG, MPEG, FLAC, ZIP, ...);
>> .....
>>
>> So, rather than having a single formalism (such as LL or LR), more
>> likely, one will end up with a kind of mini programming language just
>> to specify how the pieces fit together.
>
> I hope to separate the data structures, as syntactical elements, from
> attributes etc. as kind of semantic code. In the best case it should be
> possible to derive both an serializer and an de-serializer from a given
> formal description.
>

ok.

so, one first describes all of the pieces, and then later how they are
assembled?...
ok, but it may be difficult to pull off...

>
>> presumed features:
>> structs (explicit on-disk layout), conditionals, support for variable
>> length
>> fields and members, ability to deal with file offsets and similar ideas,
>> ...
>>
>> even then, there is likely to be a whole lot more that can't be
>> formalized,
>> than that can be.
>
> I don't expect that a single formalism can cover really all weird file
> formats. But it would be nice to have more than a description of the
> basic "tokens", which cannot be much more than numbers or strings - in a
> lot of different formats, of course. I don't want to decode bitstreams
> or encrypted data, that would be a task for an automaton in an sub-parser.
>

yes, these would be types.
however, I find it doubtful that they could be recognized automatically,
except maybe in some crude likilihood-based measure ("this looks like it is
probably a 32 bit integer"...).

>
>> more likely, one can end up more with what would amount to a
>> format-design
>> tool, than something that can actually reliably parse existing formats.
>>
>> even then, it is usually not parsing the format that is the work, rather
>> it
>> is processing the data that is usually a lot more work.
>
> The task of the parser is finished, when a file is converted into a tree
> structure, possibly with additional tables for file offsets, tag names
> and more.
>

yes, ok.

however, in many formats, the parsing and processing are heavily
interconnected.
for example, how do you decode a JPEG, apart from first processing the more
basic elements (headers, huffman table, ...).

of course, one could limit themselves to context-independent fileformats,
but this is fairly limiting.

>> for example (just pulling something out of nowhere):
>> struct PACK_Entry {
>> char name[56];
>> u32le offset;
>> u32le size;
>> byte@offset data[size];
>> };
>>
>> struct PACK_Header {
>> FOURCC magic="PACK";
>> u32le offset;
>> u32le ents;
>> PACK_Entry@offset dir[ents];
>> };
>
> Such a description indeed will cover the most common data structures, of
> a fixed or counted length. We'll have to add (zero) terminated items
> (strings and other vectors), and multi-level descriptions for offsets
> (relative to some file section, table lookup...). A "whitespace"
> equivalent will be required as well, for the description of alignment
> and padding.
>

as noted, I also said that there would be conditionals, that or we could
also support BNF-based descriptions.

u64 uvli() {
byte v;
return((v&0x80)?(((v&0x7f)<<7)|uvli()):(v&0x7f));
};

u64 svli() {
uvli v;
return((v&1)?(-((v+1)>>1)):(v>>1));
}

of course, this does not make it clear how to encode these types...

>
>> now, personally, I don't usually use such formalisms or tools, since
>> personally I have usually found it to be most effective simply to
>> write the parser myself, and if likely it were beyond my abilities to
>> write a parser, it would also likely be somewhat beyond the abilities
>> of these formalisms as well...
>
> I've decoded many file formats myself, and my most useful tool was a
> hex viewer with formatting capabilities. It could display files as
> tables of fixed-size records, where the record formats were described
> by strings, with 'c' for a character, 'l' for a longword etc. The
> table positions and formats were written to an file, ready for later
> reuse. What I'm still missing is the next step, with more
> sophisticated types, and an ability to convert the fixed table offsets
> and counts into computed elements, as templates for viewing files of
> the same type. It may be more promising to extend that old tool with
> the capabilities, mentioned above, instead of formalizing all aspects
> of automated parsing of non-textual files. BTW, IDA is a fine example
> for such a tool...
>

your intent is for reverse engineering or something?...
that is what this sounds like to me at least...

or, at least, in my case when examining files (usually to verify encoders or
decoders are working about right) I often end up decomposing the structure
in my head. this works fairly well for byte-based formats at least, but
things like huffman-coded data or highly compressed formats go beyond my
mental abilities to decode.

usually for more complex or bulky formats, I write special tools...

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: Tokenizer theory and practice

"cr88192" <cr88192@hotmail.com>Sun, 18 May 2008 09:51:34 +1000

"cr88192" <cr88192@hotmail.com>
Sun, 18 May 2008 09:51:34 +1000