Re: regular expression generation

Hans-Peter Diettrich <>
Tue, 02 Feb 2010 09:03:37 +0100

          From comp.compilers

Related articles
regular expression generation (Ralph Boland) (2010-01-31)
Re: regular expression generation (Hans-Peter Diettrich) (2010-02-02)
Re: regular expression generation (Matthias-Christian Ott) (2010-02-02)
| List of all articles for this month |

From: Hans-Peter Diettrich <>
Newsgroups: comp.compilers
Date: Tue, 02 Feb 2010 09:03:37 +0100
Organization: Compilers Central
References: 10-02-007
Keywords: lex
Posted-Date: 02 Feb 2010 23:08:20 EST

Ralph Boland schrieb:
> I am building a finite state machine generator.
> It currently has the ability to generate random
> regular expressions for testing the engine
> but the generation of random character classes
> (i.e. expressions of the form [ab-ew-zA-CD;@&])
> is not supported.
> I was wondering if anyone has done this and can
> give me ideas on the best way to do so.

I don't understand what your particular problem is, it looks to me
more like establishing rules for creating well formed (nice looking)
expressions - that's up to you.

For nice looking classes you can use predefined character classes
(digits, lower/upper case characters, punctuators...), and randomly
merge subsets of these classes into a new character class.

For handling character classes, Delphi/Pascal has Sets of up to 256
elements, i.e. covering single-byte character codes. The
implementation uses an array of bits, indexed by e.g. the character
byte code. This makes the test of membership of a character (byte) in
an given set almost one machine insctruction. For random selection
from a character class, a string (array of characters) containing all
characters will be faster, using an random index to pick one of the
stored codes. An implementation of a character class may contain and
use both representations internally.

Creating fully random character classes can be implemented by a random
set size first, resulting in the accordingly dimensioned character
array, which then is filled with random character values. Duplicates
can be omitted by according checks, or the code can be incremented on
a collission when character clusters are preferred. Or you create
random pairs of a starting S and ending E character for ranges, where
an S<E will represent an range <S>-<E>, otherwise a single character

Finally the created (any) character class can be translated into the
according expression syntax.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.