RE: Integer sizes and DFAs

Christopher F Clark <christopher.f.clark@compiler-resources.com>
Sun, 27 Mar 2022 15:02:16 +0300


From: Christopher F Clark <christopher.f.clark@compiler-resources.com>
Newsgroups: comp.compilers
Date: Sun, 27 Mar 2022 15:02:16 +0300
Organization: Compilers Central
References: 22-03-073 22-03-074
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="46571"; mail-complaints-to="abuse@iecc.com"
Keywords: lex, performance
Posted-Date: 27 Mar 2022 12:20:00 EDT

@gah4: OK, the state space for the whole of GenBank is larger. And, if
we ever get to the point where we can deal with epigenomic factors, we
probably need extra coding space for that too. But I still think it is
likely that we will stay in the range where we can make "disks" (or
"disk arrays") large enough for them and use 64-bit integers to
address them.


And, wandering off into the space of hardware implementations: at
Intel we were considering doing that circa 2010, so a bit later than
you were. In particular, we were looking to see whether we could adapt
our technology to implement the "FASTA" algorithm. Like you, we never
brought that to market.


But I want to make a couple of small points about the encoding of transitions.


If one is using byte addressing and always puts the states on aligned
boundaries, one has a few low-order bits in the address that will
always be 0. You can steal one of those to be your "match
complete/final" bit and then mask it off when using the value to
address the next state. Alternatively, one keeps a separate list of
"final states"; that is more common in implementations. A third choice
is to have a state header with extra data, such as whether the state
is final or not. The last two alternatives expand your memory
footprint somewhat, but by a relatively minimal amount.
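
To make the first option concrete, here is a minimal C sketch of
tagging the final bit into an aligned state pointer. The names
(state_t, FINAL_BIT) are made up for illustration, and it assumes
states are at least 2-byte aligned so bit 0 of a state pointer is
otherwise always zero:

    /* sketch only: tag bit 0 of an aligned state pointer as "final" */
    #include <stdint.h>
    #include <stdbool.h>

    #define FINAL_BIT ((uintptr_t)1)

    typedef struct state state_t;
    struct state {
        /* one tagged next-state pointer per input byte */
        uintptr_t next[256];
    };

    static inline bool is_final(uintptr_t tagged) {
        return (tagged & FINAL_BIT) != 0;
    }

    static inline state_t *next_state(uintptr_t tagged) {
        /* mask the tag off before using it as an address */
        return (state_t *)(tagged & ~FINAL_BIT);
    }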


Now, if you don't use byte addressing but use "word" addressing (a la
most computers in the 1960s), you have actually shrunk your address
requirement. You've moved those 0 bits from the bottom of the address
to the top. Even more so if you use "state" addressing. Of course, if
your hardware has byte-addressable memory, you may have to convert
those addresses into byte addresses before going to actual memory. But
none of that involves extra memory lookups, just a small amount of
integer/bit arithmetic, which on modern computers is essentially
irrelevant and gets swamped by pipeline stalls waiting on the cache.


Per your idea of keeping them on disk, you can "split" a 64-bit
address into a "page address" and an "address within a page" and use
the paging mechanism to bring in the relevant pages. I believe some
"huge page" implementations actually do that to work well with TLBs
and caches. Anyway, what you proposed is not far-fetched.
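
The split itself is just a shift and a mask. This small sketch assumes
2 MiB pages (an assumption, not a requirement); how the pages are
actually fetched from disk is left out:

    /* sketch only: split a 64-bit state address into page and offset */
    #include <stdint.h>

    #define PAGE_SHIFT 21                        /* 2 MiB pages, assumed */
    #define PAGE_MASK  ((UINT64_C(1) << PAGE_SHIFT) - 1)

    static inline uint64_t page_number(uint64_t addr) {
        return addr >> PAGE_SHIFT;
    }
    static inline uint64_t page_offset(uint64_t addr) {
        return addr & PAGE_MASK;
    }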


Beyond that, there are lots of tricks one can play. We thought for a
while of Huffman encoding the addresses and using relative pointers.
If you are doing something like the Aho-Corasick algorithm, the
pointers that "match" lead further on in the machine, so they are
always positive numbers, while the "failure" pointers are the ones
that point backward, and they do so to a limited number of states,
where absolute addressing makes sense. There is more you can do along
those lines, but those are the basic steps.


Now, in actual genes there are "network expressions" and "spacers".
The network expressions can be segments that are repeated, and that
leads to some extra backward links, but the number of them is also
small (and those are relative and not failure links).


---------


Still, none of this impacts the fact that, when doing analysis, you
can still treat those memory references as O(1) and say your lexer is
linear in the number of bytes processed (and that having more states
doesn't change that), because fundamentally reading from memory is the
bottleneck, and we are always talking about fixed memory images (RAM
or "disks"; that is, it is a FINITE state machine) and not reading
from a Turing machine tape.


--
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------

