3 Jul 2001 23:13:26 -0400

Related articles |
---|

Any references for efficient division code sequences? shankar@webnexus.com (Shankar Unni) (2001-06-28) |

Re: Any references for efficient division code sequences? torbenm@diku.dk (2001-07-02) |

Re: Any references for efficient division code sequences? rv3s@cobra.cs.virginia.edu (Raja Venkateswaran) (2001-07-02) |

Re: Any references for efficient division code sequences? David.Chase@naturalbridge.com (David Chase) (2001-07-03) |

Re: Any references for efficient division code sequences? kadiyala@home.com (Suresh Kadiyala) (2001-07-17) |

From: | David Chase <David.Chase@naturalbridge.com> |

Newsgroups: | comp.compilers |

Date: | 3 Jul 2001 23:13:26 -0400 |

Organization: | Compilers Central |

References: | 01-06-060 |

Keywords: | arithmetic |

Posted-Date: | 03 Jul 2001 23:13:25 EDT |

Shankar Unni wrote:

*> Specifically for dividing by small constants (< 16). We're working on a*

*> small MIPS-like network processor that has no MUL/DIV unit, and need to*

*> be able to do a *fast* mod by small numbers.*

*> Are there any references that discuss code sequences for dividing (or*

*> specifically in our case, remaindering) integers by small constants (for*

*> any processor that doesn't have a MUL/DIV unit)?*

References for your situation, I'm not sure off hand.

What I have done in the past for computing results

modulo-N is precompute the largest N * (2**M), and then

iteratively subtract. For example:

unsigned int modulo_small_odd_n(unsigned int x, unsigned int n) {

unsigned int largest_multiple = lm[n];

while (largest_multiple > n) {

if (largest_multiple >= x) x -= largest_multiple;

largest_multiple = largest_multiple >> 1;

}

return x;

}

This is going to give you nearly wordsize iterations.

Another way of doing this is with two (or more) tables.

Since your small constant is less than 16, you could use

16 sets of 4 tables of 256 bytes each (16 k total), where

(for 1 <= x <= 16, 0 <= y <= 255)

T_x_3[y] = (y << 24) mod x

T_x_2[y] = (y << 16) mod x

T_x_1[y] = (y << 8) mod x

T_x_0[y] = (y << 0) mod x

Given an input z, z mod x is

T_x_0[ T_x_0[z & 255] +

T_x_1[(z >> 8) & 255] +

T_x_2[(z >> 16) & 255] +

T_x_3[(z >> 24) & 255] ]

This approach works for x <= 64, because the maximum value of the sum

is 4 times (x-1) which must be smaller than 256.

The tables aren't really that large, since you can share some of the

tables. I got bored and wrote a Java program to compute them all;

there's 31 256-byte tables, or less than 8k total.

I got more bored, and wrote a benchmark, and the official no-remainder

code (included below, minus the table initializations, which are left

as an exercise to the reader or available on request) is about 25%

faster than computing with the hardware (when the hardware is an

800Mhz Pentium 3) (I am thinking to myself that I know of a few places

where this might be useful in my own code. Note that this could

trivially be extended to 64 bits for a modulus as large as 32.)

Now, once you have modulus, if you are dealing with a small range of

integers and want division, then you can subtract out the modulus, and

replace the division with a (possible) shift and a multiply by the

(odd) divisor's inverse in the ring 2**N. (This is different from the

inverse in the Alverson paper.)

Or, you can modify the iterative division presented at the top.

Neither of these choices is likely to be faster than computing both

quotient and remainder with a single machine instruction, on those

machines that support it.

Table driven mod follows:

static unsigned char * mod_tables [17][4] = {

{table_1_0,table_1_0,table_1_0,table_1_0}, /* Dummy table */

{table_1_0,table_1_0,table_1_0,table_1_0},

{table_2_0,table_1_0,table_1_0,table_1_0},

{table_3_0,table_3_0,table_3_0,table_3_0},

{table_4_0,table_1_0,table_1_0,table_1_0},

{table_5_0,table_5_0,table_5_0,table_5_0},

{table_6_0,table_6_1,table_6_1,table_6_1},

{table_7_0,table_7_1,table_7_2,table_7_0},

{table_8_0,table_1_0,table_1_0,table_1_0},

{table_9_0,table_9_1,table_9_2,table_9_0},

{table_10_0,table_10_1,table_10_1,table_10_1},

{table_11_0,table_11_1,table_11_2,table_11_3},

{table_12_0,table_12_1,table_12_1,table_12_1},

{table_13_0,table_13_1,table_13_2,table_13_0},

{table_14_0,table_14_1,table_14_2,table_14_3},

{table_15_0,table_15_0,table_15_0,table_15_0},

{table_16_0,table_1_0,table_1_0,table_1_0}

*};*

int mod(unsigned int x, unsigned int m) {

int x0 = 255 & (x >> 0);

int x1 = 255 & (x >> 8);

int x2 = 255 & (x >> 16);

int x3 = 255 & (x >> 24);

unsigned char * t0 = mod_tables[m][0];

unsigned char * t1 = mod_tables[m][1];

unsigned char * t2 = mod_tables[m][2];

unsigned char * t3 = mod_tables[m][3];

return t0[t0[x0]+t1[x1]+t2[x2]+t3[x3]];

*}*

You can shrink this slightly (removing table_{2,4,8,16}_0 ) and

improve performance slightly (assuming a uniform distribution in the

1-16 range) by prechecking for modulus by a power of two:

int modB(unsigned int x, unsigned int m) {

unsigned int mask = m-1;

if (m == (m & ~mask)) {

return x & m;

} else {

...

The timings on my machine for many 115,200,000 mods (this includes

loop overhead) are:

1. straight remainder 8.9s (62 cycles)

2. simple table 6.8s (47 cycles)

3. check + table/mask 5.9s (41 cycles)

Do note, this is for unsigned arithmetic. Doing this for signed

arithmetic is messier. The modified code is:

int modC(unsigned int x, unsigned int m) {

int mask = (int) x >> 31;

x = (x ^ mask) - mask;

{

int x0 = 255 & (x >> 0);

int x1 = 255 & (x >> 8);

int x2 = 255 & (x >> 16);

int x3 = 255 & (x >> 24);

unsigned char * t0 = mod_tables[m][0];

unsigned char * t1 = mod_tables[m][1];

unsigned char * t2 = mod_tables[m][2];

unsigned char * t3 = mod_tables[m][3];

return ((t0[t0[x0]+t1[x1]+t2[x2]+t3[x3]]) ^ mask) - mask ;

}

*}*

With timings

4. table + signage 8.5s

5. signed remainder 8.7s

Notice that (for whatever reason, could be compilation artifacts) the

signed remainder instruction is faster than the unsigned remainder

instruction. The table is still a hair faster, but only a hair with

the extra overhead added.

David Chase

--

David.Chase@NaturalBridge.com

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.