Re: Lexing Unicode strings?

gah4 <gah4@u.washington.edu>
Tue, 4 May 2021 01:11:51 -0700 (PDT)

          From comp.compilers

Related articles
Lexing Unicode strings? johann@myrkraverk.com (Johann 'Myrkraverk' Oskarsson) (2021-04-21)
Re: Lexing Unicode strings? johann@myrkraverk.com (Johann 'Myrkraverk' Oskarsson) (2021-05-03)
Re: Lexing Unicode strings? gah4@u.washington.edu (gah4) (2021-05-04)
Re: Lexing Unicode strings? christopher.f.clark@compiler-resources.com (Christopher F Clark) (2021-05-04)
Re: Lexing Unicode strings? gah4@u.washington.edu (gah4) (2021-05-04)
Re: Lexing Unicode strings? haberg-news@telia.com (Hans Aberg) (2021-07-14)
| List of all articles for this month |

From: gah4 <gah4@u.washington.edu>
Newsgroups: comp.compilers
Date: Tue, 4 May 2021 01:11:51 -0700 (PDT)
Organization: Compilers Central
References: 21-05-001
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="48110"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, i18n, comment
Posted-Date: 04 May 2021 11:33:47 EDT
In-Reply-To: 21-05-001

On Monday, May 3, 2021 at 4:58:22 PM UTC-7, Johann 'Myrkraverk' Oskarsson
wrote:
> On 21/04/2021 4:20 pm, Johann 'Myrkraverk' Oskarsson wrote:




Snip.


> > [The obvious approach if you're scaning UTF-8 text is to keep treating the input as
> > a sequence of bytes. UTF-8 was designed so that no character representation is a
> > prefix or suffix of any other character, so it should work without having to be clever
> > -John]


> That's not always feasible, nor the right approach. Let's consider the
> range of all lowercase greek letters. In the source file, that range
> will look something like [\xCE\xB1-\xCF\x89] and clearly the intent is
> not to match the bytes \xCE, \xB1..\xCF, and \x89.


Ranges that are ranges of bytes in ASCII won't necessarily be in Unicode.
You needs some Boolean logic in your matching, though that shouldn't be so
hard.


> There is also the question of validating the input. It seems more
> natural to put the overlong sequence validator, and legal code point
> validator into the lexer, rather than preprocess the source file.


This reminds me of the question, which I don't remember where I saw,
about applying Boyer-Moore to Unicode. One answer is, as John notes,
to apply it to the UTF-8 bytes. But the whole idea behind Boyer-Moore
is to use the unequal probability distribution of characters, to quickly
skip over impossible matches. But the bytes of UTF-8 don't have the
same (im)probabiity of the individual characters.


It seems to me that you lose some of the speed advantage of Boyer-Moore,
but maybe it is still plenty fast enough.


On the other hand, Java uses a 16 bit char, and one should be able to apply
Boyer-Moore just as well in that case. The tables will be bigger, but then
so are computer memories today.


So, as with many problems, there are trade-offs between speed and size,
and one has to choose the best case for the specific problem.


Note that in addition to have a 16 bit Unicode char, the Java language
itself is defined in terms of Unicode. Variable names can be any Unicode
letter, followed by Unicode letters and digits. Presumably, then, the
designers
of Java compilers have figured this out, I suspect using the 16 bit char.


One can, for example, have variables named A and Α in the same program.
(In case you can't see it, the second one is an Alpha.)


Yes, Unicode can be fun!
[Remember that Unicode is a 20 bit code and for characters outside the first 64K,
Java's UTF-16 uses pairs of 16 bit chars known as surrogates that make UTF-8 seem clean and beautiful. -John]


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.