Re: String tokenizer place in Chomsky hierarchy?

Hans-Peter Diettrich <>
Fri, 11 Apr 2008 17:15:39 +0200

          From comp.compilers

Related articles
String tokenizer place in Chomsky hierarchy? (Tegiri Nenashi) (2008-04-08)
Re: String tokenizer place in Chomsky hierarchy? (Mitch) (2008-04-11)
Re: String tokenizer place in Chomsky hierarchy? (Hans-Peter Diettrich) (2008-04-11)
Re: String tokenizer place in Chomsky hierarchy? (2008-04-15)
Re: String tokenizer place in Chomsky hierarchy? (Rock Brentwood) (2008-04-25)

From: Hans-Peter Diettrich <>
Newsgroups: comp.compilers,comp.theory
Date: Fri, 11 Apr 2008 17:15:39 +0200
Organization: Compilers Central
References: 08-04-030
Keywords: lex
Posted-Date: 11 Apr 2008 16:46:28 EDT

Tegiri Nenashi wrote:

> String tokenizer is the simplest parser of all. In java, arguably, it
> is much more frequently used than regular expressions. Yet I fail to
> see any parsing theory book ever mentioning it.

Just a shot in the dark:

Tokenizers are typically described as automata, which are not bound
to the restrictions of grammars [questionable, I know ;-]. With
regard to the Chomsky hierarchy, a tokenizer is not bound or
restricted to any specific Chomsky class.

The most important difference, IMO, is the goal: an automaton is
allowed to stop *before* reaching the end of the input, whereas a
grammar requires that only the *entire* input can be a valid element
of the language. OTOH an automaton is allowed to never stop, which at
least is undesirable for a tokenizer, and for a parser as well.
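To illustrate the point (my own sketch, not part of the original
post): a recognizer for a single token class, run as an automaton,
consumes a maximal prefix of the input and then stops, reporting how
far it got. A grammar, by contrast, only answers yes/no for the whole
string.

```python
def match_word(s, pos=0):
    """Consume a maximal run of 'b'/'c' starting at pos.

    Returns (lexeme, new_pos). The automaton is free to stop before
    the end of the input; the unconsumed rest stays for the caller.
    """
    start = pos
    while pos < len(s) and s[pos] in "bc":
        pos += 1
    return s[start:pos], pos

print(match_word("babcac"))  # ('b', 1): stops well before the end
```

The second element of the result is exactly the "stopped before the
end" behaviour: position 1 of a 6-character input.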

Of course one can argue that the language recognized by a
tokenizer/lexer simply consists of a concatenation of valid tokens.
Even then I'd prefer to distinguish between the "words" and
"sentences" of a language, where the tokenizer recognizes the words,
and a parser recognizes sentences consisting of words. It's a kind of
multi-level grammar, whose levels can be analyzed separately. At
least I found considerable problems in transforming languages that
were *designed* for distinct lexer and parser "grammars" into a
single grammar for a scannerless parser.

I leave it to somebody else to find a more "mathematical" definition
for my thoughts, so that my "obvious" differences can be discussed in
a more formal way ;-)

> Consider an example, an alphabet of terminals {a,b,c} with the "a"
> being a separator. Can you suggest a grammar that be able to tokenize
> the word "babcac" into {b,bc,c}?

goal ::= [separator] { word [separator] } EOF.
separator ::= "a".
word ::= ( "b" | "c" ) { "b" | "c" }.
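A direct rendering of that grammar as code (my sketch, in Python for
brevity): words are the maximal runs of non-separator characters, and
separators are discarded rather than put into the result.

```python
def tokenize(s, sep="a"):
    """Split s into words per the grammar above.

    A word is a maximal run of characters other than the separator;
    leading, trailing and repeated separators yield no empty words.
    """
    words, current = [], ""
    for ch in s:
        if ch == sep:
            if current:          # close the current word, if any
                words.append(current)
                current = ""
        else:
            current += ch
    if current:                  # flush the last word at EOF
        words.append(current)
    return words

print(tokenize("babcac"))  # ['b', 'bc', 'c']
```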

> would work, but does this 8(!) rule grammar present any new insight to
> what string tokenizor is?

IMO that's a matter of the grammar notation.

> Besides, assuming this grammar produces an
> unambigious parse tree, there is still an extra step of extracting the
> set of words from the tree.

That's why a lexer does not normally produce a parse tree, but a
stream of tokens. The lexer can be considered a filter, sitting in a
pipeline between the input and the parser. In practice, e.g. in a C
parser, more filters sit in the tool chain, implementing e.g. the
preprocessor with macro expansion and include file handling. But
again, this is only an argument for the separation of a parser into
distinct parts, for easy modelling and implementation, not a formal
discussion of the underlying math or algebra.
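The filter/pipeline view can be sketched like this (my own
illustration, with made-up token names): the lexer is a generator
that turns a character stream into a token stream, and the "parser"
consumes tokens without ever seeing raw characters.

```python
import re

# Token classes for the {a,b,c} example; names are illustrative.
TOKEN_RE = re.compile(r"(?P<WORD>[bc]+)|(?P<SEP>a)")

def lex(text):
    """Filter stage 1: characters in, (kind, lexeme) tokens out.

    Separator tokens are filtered out here, so the parser never
    sees them -- the lexer acts as a filter in the pipeline.
    """
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup == "WORD":
            yield ("WORD", m.group())

def parse(tokens):
    """Filter stage 2: a trivial 'parser' that collects the words."""
    return [lexeme for kind, lexeme in tokens]

print(parse(lex("babcac")))  # ['b', 'bc', 'c']
```

Each stage can be replaced or extended independently, e.g. another
generator between lex and parse playing the role of a preprocessor.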

