Re: Fragments

Kaz Kylheku <493-878-3164@kylheku.com>
Sat, 21 Dec 2019 20:07:49 +0000 (UTC)

          From comp.compilers

Related articles
Fragments borucki.andrzej@gmail.com (Andy) (2019-12-21)
Re: Fragments 493-878-3164@kylheku.com (Kaz Kylheku) (2019-12-21)
Re: Fragments jamin.hanson@googlemail.com (Ben Hanson) (2019-12-22)
| List of all articles for this month |

From: Kaz Kylheku <493-878-3164@kylheku.com>
Newsgroups: comp.compilers
Date: Sat, 21 Dec 2019 20:07:49 +0000 (UTC)
Organization: Aioe.org NNTP Server
References: 19-12-013
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="6270"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, i18n
Posted-Date: 21 Dec 2019 15:28:25 EST

On 2019-12-21, Andy <borucki.andrzej@gmail.com> wrote:
> In examples is usually used very small alphabet: 3 to 5 letters but in
> lexical analysing is not only Ascii but many thousands of Unicode.
> Many chars are grouped by the same action: for example digits->a
> letter->b whitepsaces->c
> We can use "fragments" [A-Za-z], [0-9] instead of alone letters.
> Problem that fragments not always are disjoint: digits and all chars, letters and letter 'a', etc.
>
> How to handle with not disjoint fragments? on input we get regular
> expression in Posix standard and we want make DFA with a few
> transitions.


In the TXR language, I wrote a regex engine to handle the full Unicode
range of characters.


To represent character ranges, I wrote a polymorphic "char set" data
type in C. See:


http://www.kylheku.com/cgit/txr/tree/regex.c


The main declaration looks like:


    typedef union char_set {
        struct any_char_set any;
        struct small_char_set s;
        struct displaced_char_set d;
        struct large_char_set l;
    #ifdef FULL_UNICODE
        struct xlarge_char_set xl;
    #endif
    } char_set_t;


Each of the types included in the union has a type field in the same
place which indicates which type of set it is.


The different set types are geared toward different situations.


The small_char_set is a 256 entry bitmask; it handles the ASCII/ISO
Latin 8 bit range.


The large_char_set has a two-level tree structure: basically a sparsely
populated array of bitmasks.


The xlarge_char_set has a three-level structure.


the displaced_char_set is also a small bitmask, but with an offset,
so it can represent characters anywhere in the map, if they fall into
the same 256-wide window.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.