Re: Tokenizer theory and practice

Hans-Peter Diettrich <DrDiettrich1@aol.com>
Sun, 18 May 2008 10:37:49 +0200

From comp.compilers

From:	Hans-Peter Diettrich <DrDiettrich1@aol.com>
Newsgroups:	comp.compilers
Date:	Sun, 18 May 2008 10:37:49 +0200
Organization:	Compilers Central
References:	08-05-050 08-05-065 08-05-067 08-05-071
Keywords:	lex
Posted-Date:	19 May 2008 21:26:42 EDT

cr88192 wrote:

>> I hope to separate the data structures, as syntactical elements, from
>> attributes etc. as a kind of semantic code. In the best case it should be
>> possible to derive both a serializer and a deserializer from a given
>> formal description.
>>
>
> ok.
>
> so, one first describes all of the pieces, and then later how they are
> assembled?...
> ok, but it may be difficult to pull off...

You are right. In some popular binary file formats, the file must be
written sequentially, possibly with offsets in preceding tables
patched afterwards. When an offset table is written last, it must be
read first, from the end of the file. I'm not sure whether such
procedures can be described formally, in a way usable for both reading
and writing such a file. A corresponding grammar may become
context-sensitive, which makes the problem very interesting from a
scientific point of view, but it is unlikely to result in a usable
tool.
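
To make the "offset table written last, read first" case concrete,
here is a minimal sketch in C. The layout (records, then a table of
32-bit offsets, then a two-word trailer holding the record count and
the table position) is invented for illustration and does not
correspond to any particular real format. The names write_file and
read_record are hypothetical, and native byte order is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_RECS 16

/* Hypothetical layout:
 *   [record 0]...[record n-1][u32 offset[n]][u32 n][u32 table_pos]
 * Records are NUL-terminated strings. The writer only learns the
 * offsets while writing, so the table and trailer go last. */
int write_file(FILE *f, const char **recs, uint32_t n)
{
    uint32_t offs[MAX_RECS];
    assert(f != NULL);
    if (n > MAX_RECS) return -1;
    for (uint32_t i = 0; i < n; i++) {
        offs[i] = (uint32_t)ftell(f);              /* remember where it starts */
        fwrite(recs[i], 1, strlen(recs[i]) + 1, f);
    }
    uint32_t table_pos = (uint32_t)ftell(f);
    fwrite(offs, sizeof offs[0], n, f);            /* offset table */
    fwrite(&n, sizeof n, 1, f);                    /* trailer: count */
    fwrite(&table_pos, sizeof table_pos, 1, f);    /* trailer: table position */
    return 0;
}

/* The reader works in the opposite direction: trailer first (from the
 * end of the file), then the table entry, then the record itself. */
int read_record(FILE *f, uint32_t index, char *buf, size_t cap)
{
    uint32_t trailer[2], off;
    if (cap == 0) return -1;
    if (fseek(f, -(long)sizeof trailer, SEEK_END) != 0) return -1;
    if (fread(trailer, sizeof trailer[0], 2, f) != 2) return -1;
    uint32_t n = trailer[0], table_pos = trailer[1];
    if (index >= n) return -1;
    if (fseek(f, (long)(table_pos + index * sizeof off), SEEK_SET) != 0) return -1;
    if (fread(&off, sizeof off, 1, f) != 1) return -1;
    if (fseek(f, (long)off, SEEK_SET) != 0) return -1;
    size_t i = 0;
    int c;
    while (i + 1 < cap && (c = fgetc(f)) != EOF && c != 0)
        buf[i++] = (char)c;
    buf[i] = '\0';
    return 0;
}
```

Note that the two procedures visit the file in different orders, which
is exactly what a single grammar would have to express for both
directions.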

> as noted, I also said that there would be conditionals, that or we could
> also support BNF-based descriptions.
>
> u64 uvli() {
> byte v;
> return((v&0x80)?((v&0x7f)|(uvli()<<7)):(v&0x7f));
> }
>
> s64 svli() {
> uvli v;
> return((v&1)?(-((v+1)>>1)):(v>>1));
> }
>
> of course, this does not make it clear how to encode these types...
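
For comparison, a hand-written encoder/decoder pair for such
variable-length integers might look like the sketch below. It assumes
7-bit groups stored least significant first with 0x80 as the
continuation bit, which is only one possible choice since the encoding
is not specified in this thread. The signed mapping follows the quoted
svli (odd values are negative). All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode v as base-128 groups, least significant group first.
 * Returns the number of bytes written (1..10). */
size_t uvli_encode(uint64_t v, uint8_t *out)
{
    size_t n = 0;
    assert(out != NULL);
    while (v >= 0x80) {
        out[n++] = (uint8_t)((v & 0x7f) | 0x80);   /* more groups follow */
        v >>= 7;
    }
    out[n++] = (uint8_t)v;                         /* last group, high bit clear */
    return n;
}

/* Decode one value. Returns the number of bytes consumed, or 0 if the
 * input is truncated or runs longer than 10 bytes. */
size_t uvli_decode(const uint8_t *in, size_t len, uint64_t *v)
{
    uint64_t r = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        r |= (uint64_t)(in[i] & 0x7f) << (7 * i);
        if (!(in[i] & 0x80)) {
            *v = r;
            return i + 1;
        }
    }
    return 0;
}

/* Signed <-> unsigned mapping as in the quoted svli:
 * 0, -1, 1, -2, 2, ... <-> 0, 1, 2, 3, 4, ... */
uint64_t svli_zigzag(int64_t s)
{
    return s < 0 ? (((uint64_t)(-(s + 1))) << 1) | 1
                 : ((uint64_t)s) << 1;
}

int64_t svli_unzigzag(uint64_t u)
{
    return (u & 1) ? -(int64_t)(u >> 1) - 1
                   : (int64_t)(u >> 1);
}
```

The encoder and decoder are written separately here, by hand. Deriving
both from one description is the part that remains open.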

A common example would be UTF-8 encoding/decoding, which is easily
described in a procedural way, and possibly also in a grammar, but
deriving the code from such a grammar seems to exceed my capabilities.
<sigh>
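
The procedural description could look roughly like the following C
sketch. It handles a single code point at a time and rejects
surrogates, overlong forms and values above U+10FFFF. The function
names are hypothetical, and this is an illustration rather than a
complete validator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point. Returns the number of bytes written (1..4),
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Decode one code point. Returns the number of bytes consumed (1..4),
 * or 0 on truncated, malformed or overlong input. */
size_t utf8_decode(const uint8_t *in, size_t len, uint32_t *cp)
{
    /* smallest code point that legitimately needs n bytes */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t n;
    uint32_t r;
    if (len == 0) return 0;
    if (in[0] < 0x80)                { n = 1; r = in[0]; }
    else if ((in[0] & 0xE0) == 0xC0) { n = 2; r = in[0] & 0x1F; }
    else if ((in[0] & 0xF0) == 0xE0) { n = 3; r = in[0] & 0x0F; }
    else if ((in[0] & 0xF8) == 0xF0) { n = 4; r = in[0] & 0x07; }
    else return 0;                   /* stray continuation or invalid lead byte */
    if (len < n) return 0;
    for (size_t i = 1; i < n; i++) {
        if ((in[i] & 0xC0) != 0x80) return 0;
        r = (r << 6) | (in[i] & 0x3F);
    }
    if (r < min_cp[n] || r > 0x10FFFF || (r >= 0xD800 && r <= 0xDFFF))
        return 0;
    *cp = r;
    return n;
}
```

The length of each sequence depends on the value being encoded, and
the decoder has to reject sequences that are well formed bit-wise but
overlong, which is the kind of condition that is awkward to state in a
plain grammar.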

> your intent is for reverse engineering or something?...
> that is what this sounds like to me at least...

That's the background of how I came to parsers at all ;-)

> usually for more complex or bulky formats, I write special tools...

After considering all the topics mentioned in this thread, I had better
leave the theory to other people, too. As you stated before:
>>
more likely, one can end up more with what would amount to a
format-design tool, than something that can actually reliably parse
existing formats.
<<

DoDi
