String tokenizer place in Chomsky hierarchy?

Tegiri Nenashi <TegiriNenashi@gmail.com>
Tue, 8 Apr 2008 10:44:53 -0700 (PDT)

          From comp.compilers

Related articles
String tokenizer place in Chomsky hierarchy? TegiriNenashi@gmail.com (Tegiri Nenashi) (2008-04-08)
Re: String tokenizer place in Chomsky hierarchy? maharri@gmail.com (Mitch) (2008-04-11)
Re: String tokenizer place in Chomsky hierarchy? DrDiettrich1@aol.com (Hans-Peter Diettrich) (2008-04-11)
Re: String tokenizer place in Chomsky hierarchy? sipayi@gmail.com (2008-04-15)
Re: String tokenizer place in Chomsky hierarchy? markwh04@yahoo.com (Rock Brentwood) (2008-04-25)
| List of all articles for this month |

From: Tegiri Nenashi <TegiriNenashi@gmail.com>
Newsgroups: comp.compilers,comp.theory
Date: Tue, 8 Apr 2008 10:44:53 -0700 (PDT)
Organization: Compilers Central
Keywords: lex
Posted-Date: 10 Apr 2008 23:26:18 EDT

String tokenizer is the simplest parser of all. In java, arguably, it
is much more frequently used than regular expressions. Yet I fail to
see any parsing theory book ever mentioning it.


In my practical experience it is easer write a scanner on string
tokenizer foundation rather than to use off the shellf reg exp engine.
I'm interested however what is the language theory perspective onto
this phenomenon.


Formally, a set of terminals is partitioned into a separator or a set
of separators, and the rest of terminals. Then, string tokenizer
translates a given word into a set (or list) of words. Here we have
the first technical difficulty, what exactly this translation is? Is
it a mapping from Monoid to Idempotent Semiring? Certainly not if we
insist on the result being a list of words, and not the set. Then if
we try exrtend this mapping naturally to the domain of sets, then we
have to deal with set of sets (or set of lists) on the range side of
the mapping?


Consider an example, an alphabet of terminals {a,b,c} with the "a"
being a separator. Can you suggest a grammar that be able to tokenize
the word "babcac" into {b,bc,c}? Sure something like


s >= a
t >= b
t >= c
w >= wt
w >= 1
u >= ws
v >= 1
v >= uv


would work, but does this 8(!) rule grammar present any new insight to
what string tokenizor is? Besides, assuming this grammar produces an
unambigious parse tree, there is still an extra step of extracting the
set of words from the tree.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.