Expected Token Density in Random Stream

Andrew Tomazos <andrew@tomazos.com>
Wed, 7 Dec 2011 15:06:23 -0800 (PST)

          From comp.compilers

Related articles
Expected Token Density in Random Stream andrew@tomazos.com (Andrew Tomazos) (2011-12-07)
Re: Expected Token Density in Random Stream kaz@kylheku.com (Kaz Kylheku) (2011-12-11)
Re: Expected Token Density in Random Stream andrew@tomazos.com (Andrew Tomazos) (2011-12-13)
Re: Expected Token Density in Random Stream gene.ressler@gmail.com (Gene) (2011-12-19)
| List of all articles for this month |

From: Andrew Tomazos <andrew@tomazos.com>
Newsgroups: comp.compilers
Date: Wed, 7 Dec 2011 15:06:23 -0800 (PST)
Organization: Compilers Central
Keywords: parse
Posted-Date: 11 Dec 2011 08:11:23 EST

Summary: We want to find out how often a given token appears in a
random stream formed by concatenating randomly chosen strings from a
given set of strings.


Details:


Given S, an array of n (non-empty) strings; and T, a string of length
m;


We create a random stream of characters by the following process:


        1. assign i = a (uniformly) random integer between 1 and n
inclusive
        2. write the characters of the ith element of S to the stream
        3. goto 1


(The elements of S are not necessarily the same length)


For example:


        if S = {"a", "bug", "ug"}


then the stream may start as follows:


        augbugugugaabug...


Each time T appears in the stream we call that a hit. More precisely:


      1. Initialize a queue Q with m null elements


      2. If Q equals T record a hit
      3. Pop front of Q
      4. Push to back of Q next char from stream
      5. Goto 2


For example:


      If T = "ugu"...


then:


        augbugugugaabug...
                1 2


We get two hits so far.


(Note hits can overlap each other)


What is the expected frequency of hits in terms of S and T?


More precisely:


        let y be the number of chars read from the stream so far
        let x be the number of hits


        What is the expected limit of x/y as y approaches infinity?


We can approximate this empirically as follows in C++:


      $ cat > HitFinder.cpp
          <paste below code>
      $ g++ -o HitFinder HitFinder.cpp
      $ ./HitFinder ugu a bug ug
T = ugu
A = {a, bug, ug}


(x, y, x/y) = (0, 1, 0)
(x, y, x/y) = (0, 2, 0)
(x, y, x/y) = (0, 4, 0)
(x, y, x/y) = (0, 8, 0)
(x, y, x/y) = (1, 16, 1/16)
(x, y, x/y) = (3, 32, 1/10.6667)
(x, y, x/y) = (8, 64, 1/8)
(x, y, x/y) = (16, 128, 1/8)
  ...
(x, y, x/y) = (29821708, 268435456, 1/9.00134)
(x, y, x/y) = (59648899, 536870912, 1/9.00052)
(x, y, x/y) = (119304725, 1073741824, 1/8.99999)
(x, y, x/y) = (238593314, 2147483648, 1/9.0006)
(x, y, x/y) = (477198099, 4294967296, 1/9.00039)


As you can see the answer for this converges on 1/9


How can we calculate this figure by direct computation?


ie How would you implement a function with signature...


      double ExpectedHitLimit(string T, vector<string> A)


...that quickly calculates this limit for any given T and A?


Thanks and have fun,
Andrew.


--
Andrew Tomazos <andrew@tomazos.com> <http://www.tomazos.com>




// ======================= HitFinder.cpp Cut Here ========================
// (C) 2011, Andrew Tomazos <andrew@tomazos.com>. All Rights Reserved.


#include <queue>
#include <string>
#include <iostream>
#include <limits>
#include <cstdlib>


using namespace std;


typedef long long int64;
bool IsPow2(int64 i)
{
return i != 0 && !(i & (i - 1));
}


int64 x = 0;


void Hit()
{
x++;
}


int Rand(int n)
{
return rand() % n;
}


class RandCharStream
{
public:
vector<string> S;
size_t n, istr, ichar;


RandCharStream(vector<string> S_) :
S(S_),
n(S.size()),
istr(Rand(n)),
ichar(0)
{
}


char nextChar()
{
if (ichar == S[istr].size())
{
istr = Rand(n);
ichar = 0;
}


return S[istr][ichar++];
}
};


class HitFinder
{
public:
RandCharStream& stream;
queue<char> T;
int m;
queue<char> Q;


HitFinder(RandCharStream& stream_, string T_) :
stream(stream_),
m(T_.size())
{
for (int i = 0; i < m; i++)
{
T.push(T_[i]);
Q.push('\0');
}
}


void nextChar()
{
Q.pop();
Q.push(stream.nextChar());


if (Q == T)
Hit();
}
};


int main(int argc, char** argv)
{
if (argc < 3)
{
cerr << "Usage: " << argv[0] << " <T> <S1> <S2> ... <Sn>" << endl;
return -1;
}


string T = argv[1];
cout << "T = " << T << endl;


cout << "A = {";
vector<string> A;
for (int i = 2; i < argc; i++)
{
string s = argv[i];
cout << s << (i < argc - 1 ? ", " : "");
A.push_back(s);
if (s.empty())
{
cerr << "Element S" << i - 1 << " is empty" << endl;
return -1;
}
}
cout << "}" << endl;
cout << endl;


RandCharStream stream(A);
HitFinder finder(stream, T);


for (int64 y = 1; y < __LONG_LONG_MAX__; y++)
{
finder.nextChar();


if (IsPow2(y))
{
cout << "(x, y, x/y) = (" << x << ", " << y << ", ";
if (x == 0)
cout << "0";
else
cout << "1/" << double(y) / double(x);
cout << ")" << endl;
}
}


return 0;
}
//======================= HitFinder.cpp Cut Here ========================


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.