Fri, 23 Oct 1992 20:10:15 GMT

Related articles |
---|

replacing multipication of constant by simpler instructions (LONG) wu@sequent.com (Youfeng Wu) (1992-10-23) |

Newsgroups: | comp.compilers |

From: | Youfeng Wu <wu@sequent.com> |

Organization: | Compilers Central |

Date: | Fri, 23 Oct 1992 20:10:15 GMT |

Keywords: | arithmetic, architecture |

Replacing Multiplication of Integer Constant

by Sequence of SUB/ADD/SHIFT/LEA Instructions

1. Introduction

For pipelined or superscalar processors, multiplication usually is many

times slower than the simple instructions like add, sub, shift, or LEA

(shift by 1, 2, or 3, and add). Replacing a multiplication by a sequence

of simple instructions is beneficial because that it reduces the number of

cycles and provides more opportunity for instruction level parallelism.

For example, for the INTEL i486 processor, multiplication of a 4-bytes

integer takes 26 cycles, but add and sub take only one cycle and shift and

LEA take two cycles. If we replace reg0*136 by the following two

instructions,

reg = reg0 shift 7 *** reg0 left shift 7 bits

reg = LEA ( reg, X, 3) *** reg add ( reg0 left shift 3 bits )

the multiplication will take only 3 cycles rather than 26 cycles.

Since the speed discrepancy between the multiplication and the simple

instructions are so high, we can usually use a sequence of more than two

instructions and still gain performance improvement. For example, we can

use the following five instructions to perform reg0 * 1950 and save 17

cycles.

reg = LEA ( reg0, reg0, 1 )

reg = reg shift 5

reg = LEA ( reg, reg0, 1 )

reg1 = reg0 shift 11

reg = reg1 sub reg

In addition to the cycles savings, sequence of simple instructions are

RISC style of coding and are more suitable for instruction level

parallelism. For example, it is easy to insert instructions among the

sequence of simple instructions and run them in parallel on a superscalar

processor.

Here we focus our attention on converting multiplication by constant to

sub/add/shift/LEA instruction sequence. More powerful instructions are

not allowed. Also we assume the integer constant is positive and negative

number can be handled similarly as mentioned by Bernstein [2].

2. Problem Formulation.

If only add is allowed in the simple instruction sequence, a nice mathematic

formulation of the problem can be found in Knuth [1], namely the concept of

"addition chain". For a multiplication X * C, where C is a constant,

an "addition chain" for C is defined as a sequence of integers

1 = a0, a1, a2, ..., ar = C

that are generated by the rule

ai = aj + ak, for some j, k such that i > j >= k for all i in [1. r].

Once an addition chain is found, the calculation of X * C can be performed

by the sequence of additions:

a1 = X + X

...

ai = aj + ak i > j >= k

...

C = ar = aw + au r > w >= u.

For example, an addition chain for the number 9 can be 1 2 4 5 9 because

9 = 5+4, 5=4+1, 4=2+2, 2=1+1. From this addition chain, the following

sequence of additions can be generated to calculate reg1*9:

reg2 = reg1 + reg1

reg3 = reg2 + reg2

reg4 = reg3 + reg1

reg5 = reg4 + reg2

The problem to find the optimal sequence is equivalent to find the addition

chain with the minimal length, which is believed to be NP-complete (no time to

prove it yet).

In the case where add, sub, shift, and LEA instructions are all allowed,

Deniel Magenheimer [2] et el extended the concept of addition chain to

"sub/add/shift/LEA chain" (I will call it SASL-chain for short). That is,

the rule

ai = aj + ak, for some j, k such that i > j >= k for all i in [1. r].

is extended to

ai = aj + ak or

ai = aj - ak or

ai = aj << ak or

ai = LEA(ak, aj, 1) or

ai = LEA(ak, aj, 2) or

ai = LEA(ak, aj, 3) for some j, k such that i > j >= k

for all i in [1. r].

For example, an SASL-chain for 9 is 1, 9 because 9 = LEA(1, 1, 3).

Although including sub/shift/LEA instructions can reduce the length of

instruction sequence, it should increase the size of the search space for

the minimal sequence, and therefore the problem of finding the optimal

sequence becomes harder and should remain NP-complete. Any practical

solution to find a fast sequence for multiplication by constant should

using heuristics.

3. Heuristics solution of Robert Bernstein

Robert Bernstein [3] provided a nice "branch and bound" heuristics

searching for a good add/sub/shift sequence for constant multiplication.

We extended it to use LEA instructions as much as possible.

The extended algorithm can be summerized as follows.

Algorithm branch_bound (n)

Input: an integer n, costs for add, sub, shift, and LEA instructions

Output: an SASL-chain for multiplication by n and the associated cost:

( SASL-chain, cost of the chain).

Process: In the following, chain.result stands for the intermediate

result after evaluating the SASL-chain "chain".

if ( n <= 1 )

return (nil, 0)

if ( bound condition meet, abort this search path )

return (nil, max_cost)

if ( n is even and n = m * 2**i where m is an odd number )

( chain, cost) = branch_bound ( m )

return ( [chain, chain.result << i], cost + cost of shift );

(chain1, cost1) = branch_bound ( n - 1 )

(chain2, cost2) = branch_bound ( n + 1 )

(chain3, cost3) = branch_bound ( n / (2**i + 1) ), i = 1, 2, or 3

(chain4, cost4) = branch_bound ( n / (2**i + 1) ), i > 3

(chain5, cost5) = branch_bound ( n / (2**i - 1) ), i > 0

if ( n == (2**i + 1) ), i = 1, 2, or 3

(chain6, cost6) = branch_bound ( 1 )

min_cost = min ( cost1, cost2, cost3, cost4, cost5, cost6 )

if ( min_cost = cost1 )

return ( [chain, chain.result + 1], cost + cost1 of add );

if ( min_cost = cost2 )

return ( [chain, chain.result - 1], cost + cost2 of sub );

if ( min_cost = cost3 )

return ( [chain, LEA(chain.result, chain.result, i)],

cost3 + cost of LEA );

if ( min_cost = cost4 )

return ( [chain, chain.result << i + chain.result],

cost4 + cost of shift + cost of add );

if ( min_cost = cost5 )

return ( [chain, chain.result << i - chain.result ],

cost5 + cost of shift + cost of sub );

if ( min_cost = cost6 )

return ( [chain, LEA(chain.result, chain.result, i)],

cost6 + cost of LEA );

This heuristics can discover many good sequences, and in many case the

result is optimal. For example, Preston pointed that for number between 1

to 1000, the original Bernstein's algorithm requires a total of 5512

cycles whereas the optimal sequences require 5039".

But the extended Bernstein algorithm using LEA still didn't take the full

advantages of LEA instruction. An example of non-optimal case is X*29,

which need the sequence of (even with the extension to use LEA

instruction):

8 = 1 << 3 reg1 = X << 3

7 = 8 - 1 reg2 = reg1 - X

28 = 7 << 2 reg3 = reg2 << 2

29 = 28 + 1 reg4 = reg3 + X

while we notice that 29 = 32 - 3 so the following three instructions are

sufficient:

reg1 = LEA(X, X, 1)

reg2 = X << 5

reg3 = reg2 - reg1

We will compare this extended Bernstein algorithm using LEA with our

algorithm later.

4. Heuristics solution of Deniel Magenheimer [2] et el.

A rule based heuristics was used in Deniel Magenheimer [2] et el, which

takes full advantage of LEA instructions.

The advantage of the rule based system is that many small tricks we human

do can be specified as rules and they may be picked up by the inference

engine. Deniel Magenheimer [2] et el. mentioned that the rule based

approach generated optimal sequences for all numbers under 10,000 except

12. We believe that can be done and it should not be too difficult to add

more rules to cover the 12 cases too. But we speculate that the rule

bases approach may not be as fast as a procedural approach. Also it may

not be cost-effective to include a rule based system into an existing

compiler solely for the purpose of replacing multiplication by constants.

This situation may change since people are proposing rule based approach

for several other compiler optimizations ([ref]).

5. Our Heuristics exploring LEA instructions

LEA instruction is very powerful. It potentially can perform two

operations in one instruction and should be explored as much as possible.

Assume LEA (B,I,S) = B + I << S, S = 1, 2, ..., MAX_S.

5.1. Basic algorithm using LEA

Each integer constant N can be represented as a binary sequence

N = sum ( ai*2**i | i = 0, ..., k ) where ai = either 0 or 1.

To use LEA, we note that the above number can be represented recursively

as the following, assuming ai and aj are the first two bits that are ares:

N = 2**i ( 1 + 2**(j-i) * N' )

where N' = ( N - 2**i ) >> j has one less 1 bits than N.

If (j-i) <= MAX_S, N can be calculated by calculating N' followed by

an LEA instruction and possibly a shift instruction, namely,

N' = ...

temp = LEA(1, N', (j-i) )

N = temp << i

If i == 0 (i.e. a0 = 1), the last shift can be removed.

Note that, if (j-i) <= MAX_S, N' will have a0' = 1, so a shift

is not needed after calculating N'.

If (j-i) > MAX_S, then

N = 2**i ( 1 + 2**MAX_S * N' )

where N' = ( ( N - 2**i ) >> (MAX_S + i) ) has one less 1 bits than N.

and N can be calculated as following:

N' = ...

temp = LEA(1, N', MAX_S )

N = temp << i

In this case, N' will have a0' = 0, so a shift may be needed

after calculating N'.

Assume N has r ones, t instances in which the consecutive ones are more

than MAX_S apart. This basic algorithm will use r-1 LEA, t+a0 shifts for

multiplying N. Namely a total of r - 1 + t + a0 LEA and shift

instructions.

An improvement can be made to the above procedure when we notice

that if 1 <= i <= MAX_S,

N = 2**i + N'

where N' = ( N - 2**i ) has one less 1 bits than N.

and N can be calculated as following:

N' = ...

temp = LEA(N', 1, i)

This will save the shift instruction in the above procedure when

1 <= i <= MAX_S.

Before describing the basic algorithm, we introduce the following

"condensed representation" of integer numbers which simplifies the

implementation of the above procedure.

For an integer number N, assume the number of ones in its binary

representation is r. We may mark the ones in the binary sequence, and

assume they are at locations:

L0, L1, ..., Lr-1, where Li+1 > Li >= 0

The condensed representation of the number N is

N = R0, R1, ..., Rr-1

where R0 = L0, and Ri = Li - Li-1, for i = 1, ..., r-1.

The basic algorithm can be described as follows.

Algorithm basic_decompose_mul ( R, r, N )

Input: N - the constant to be multiplied

r - the number of 1 in the constant to be processed.

R[0...r-1] - the condensed representation of the constant.

Output: emit LEA/shift instructions for multiplying the constant

and return the root of the instruction tree emitted.

Process:

if ( R[0] > MAX_S ) {

r0 = R[0];

R[0] = 0;

rslt = decompose_mul ( R, r );

rslt = emit_shift ( rslt, r0 );

} else if ( R[0] > 0 ) {

r0 = R[0];

R[1] += r0;

rslt = decompose_mul ( R[1..r-1], r - 1 );

rslt = emit_LEA ( rslt, 1, r0 );

} else {

r1 = R[1] > MAX_S ? MAX_S : R[1];

R[1] -= r1;

rslt = decompose_mul ( R[1..r-1], r - 1 );

rslt = emit_LEA ( 1, rslt, r1 );

}

return rslt

Since the number of instructions generated by this algorithm is in

proportion to the number of ones in N, we should try to reduce the

number of ones when possible.

5.2. Generalized complementary scheme

The above recursive procedure can be improved to convert a sequence of

one's to a single one with a subtraction (if it is safe to do so). For

example, X consecutive 1's can be converted to 2**X - 1. This

complementary approach can reduce the number of ones very quickly. Since

2**X - 1 takes two instructions, X should be at least three for this

complementary scheme to be beneficial.

This method can be extended to a sequence of 0 and 1's as long

as the final sequence takes less instructions. Assume

N = sum ( ai*2**i | i = 0, ..., k )

where ai = either 0 or 1, and 2**k < N < 2**(k+1)

then

2**(k+1) = sum ( (ai + ~ai)*2**i | i = 0, ..., k ) + 1

= N + sum ( (~ai)*2**i | i = 0, ..., k ) + 1

That is,

N = 2**(k+1) - N'

where N' = sum ( (~ai)*2**i | i = 0, ..., k ) + 1

Note that sum ( (~ai)*2**i | i = 0, ..., k ) is the number of zeros

in N before the last one.

So if N has r 1's and t zeros among the ones, r - t - 2 will be the

number of instructions that will be saved using the conversion.

The generalized complementary scheme is to replace N with 2**(k+1) - N'

when

r - t - 2 > 0.

When t = 0, we get back to the case for consecutive 1's.

This generalized scheme can be incorporated into the above recursive

procedure very nicely. Namely, at the beginning of each step of

recursion, check N to identify a prefix that maximizes r - t - 2 > 0. If

such a prefix is found, say N = N1 + N2, where N1 = 2**(k+1) - N1',

then N can be calculated as following:

N1' = ...

temp = N2 + 2**(k+1)

N = temp - N1'

5.3. Factoring to use LEA

If N is a multiple of ( 1 + 2**i ), for i = 1, 2, ..., or MAX_S, namely N can

be defined as

N = ( 1 + 2**i ) * N'

where N' = N / ( 1 + 2**i )

then the multiplication by N can be calculated as:

N' = ...

N = LEA ( N', N', i )

In most cases, N' will have much less number of 1 bits than N so the above

factoring will reduce the number of 1 bits very rapidly. But that is not

always true. There are cases for which N' has more 1's than N. So it is

potentially possible that factoring may do worse than without factoring.

An example is 290 = 58 * 5, where 290 = 2**1 + 2**5 + 2**8, while 58 =

2**1 + 2**3 + 2**4 + 2**5.

So we apply factoring only when N' has less one bits than N has.

5.4. Combined algorithm

The compiled procedure for replacing multiplication of integer N with a

sequence of sub/add/shift/LEA uses all of the three techniques mentioned

in 5.1, 5.2, and 5.3. For a number N, it first checks if it has a prefix

sub-sequence of 0 and 1's that can be complemented to reduce the number of

ones in N. If so apply the complementary scheme with N' with less ones.

If not, it checks if N can be factored to use LEA and reduce number of

ones. If so apply the factoring scheme. If not the basic algorithm is

used. This can be described as follows:

Algorithm decompose_mul( R, r, N )

Input: N - the constant to be multiplied

r - the number of 1 in the constant to be processed.

R[0...r-1] - the condensed representation of the constant.

Output: emit LEA/shift instructions for multiplying the constant

and return the root of the instruction tree emitted.

Process:

if ( r <= 2 )

return evaluate_simple_cases(R, r, N);

if ( complementary scheme applied )

return its result

if ( factoring scheme applied )

return its result

return basic_decompose_mul ( R, r, N );

5.5. Performance result

The above algorithm combines three techniques to use LEA and reduce the

number of simple instructions for multiplication of constant. Because it

reduces the number of ones of the input number by at least one in each

iteration, it runs very fast. Table 1 compares the execution time (on an

i486 system) of our algorithm (ALG1) with Bernstein's algorithm

implemented by Preston Briggs with our extension to use LEA. Both user

time and system time are listed with system time in parenthesis.

The execution time of ALG1 is farely uniform, roughly at 50/1,000,000

second per multiplication, and the amount of system time is ignorable. On

the other hand, the execution of of ALG2 seemingly grows more than

linearly as the number of constant increases. Also, ALG2 uses a

significant amount of system time overhead (we have more than doubled the

hash table size to speed up the search time in ALG2).

Table 1. Execution time comparison (0.1 sec)

NUMBER RANGES ALG1 ALG2

===============================================

1 100000 42.7 (0.1) 40.7 (36.3)

1 200000 92.2 (0.1) 149.7 (184.6)

1 300000 145.3 (0.1) 309.2 (296.7)

1 400000 199.1 (0.1) 523.0 (357.6)

1 500000 255.6 (0.1) 777.7 (390.8)

1 600000 312.9 (0.2) 1089.1 (394.7)

1 700000 371.4 (1.2) 1456.8 (396.0)

1 800000 428.0 (0.4) 1893.9 (399.2)

1 900000 488.4 (0.2) 2356.3 (408.3)

1 1000000 548.9 (0.2) 2882.3 (440.1)

The following compares the quality of the generated instruction

sequences.

For simplicity, we assume each of the simple instructions takes

only one cycle. Thus we can compare the goodness of instruction

sequence by simply comparing how many instructions it uses.

Table 2 compares the number of instructions used by our combined

algorithm and by Bernstein's algorithm implemented by Preston

Briggs with our extension to use LEA.

ALG2 constantly uses more (around 7%) instructions than

ALG1. Furthermore the difference is higher for small numbers

than for big numbers.

Table 2. Number of instructions

NUMBER RANGES ALG1 ALG2 % diff

======================================================

1 100000 679904 740311 8.885%

1 200000 1446258 1565844 8.269%

1 300000 2247738 2422837 7.790%

1 400000 3065921 3301085 7.670%

1 500000 3921855 4197590 7.031%

1 600000 4755289 5098608 7.220%

1 700000 5604671 6012433 7.275%

1 800000 6478143 6938681 7.109%

1 900000 7369242 7871300 6.813%

1 1000000 8275983 8813951 6.500%

The following summerizes the relative effects of each of three techniques

in our algorithm on the number of instructions used. The basic algorithm

for exploring LEA is referred to as STEP1, STEP1 with generalized

complementary scheme is referred to as STEP2, and STEP2 with factoring for

LEA is referred to as STEP3. For each range of numbers, Table 3 shows the

difference between the total number 1 bits and the number of the

instructions used. The bigger is the difference, the better is the

techniques since that means less instructions are used. Cleanly, the

performance of the algorithm gets better as it evolves from STEP1 to STEP2

and then to STEP3.

Table 3. Performance results of the algorithm

NUMBER RANGES STEP1 STEP2 STEP3

================================================================

00000001 10000000 -14631598 3546659 17975029

10000001 20000000 -16631634 3574258 19228199

20000001 30000000 -16412241 3235187 20628282

30000001 40000000 -18850950 3908062 19065351

40000001 50000000 -17188144 3869904 21889975

50000001 60000000 -17636373 2596056 20623154

60000001 70000000 -16601799 12169129 25227637

70000001 80000000 -23099969 -4279395 14079310

80000001 90000000 -20917643 1244995 20105549

90000001 100000000 -15458721 6554094 24913372

5.6. Preventing overflows

Converting multiplication by constant to a sequence of sub/add shift/LEA

instructions may introduce runtime overflow. This can be prevented by

excluding sub instruction from the sequence. Since sub instruction is

only used in the generalized complementary scheme, we added a compiler

option which disables that operation when it is specified.

6. Acknowledgment

I would like to thank Rocky Bernstein, Preston Briggs, Cliff Click, Hans

Olsson, Thomas David Rivers, and many other people from the net for

references, insights, and discussions on the topic of multiplication by

constant. Special thanks to Preston Briggs for his implementation of

Bernstein algorithm which is used in our performance comparison.

References

[1] Knuth, The Art of Computer Programming, Volume 2/Seminumerical

Algorithms. pp 402-422.

[2] Robert Bernstein, "Multiplication by Integer Constants,"

Software -- Practice and Experience, Vol 16, No. 7, Jul, 1986. pp 641-652

[3] Deniel Magenheimer, et. al, "Multiplication and Division on

the HP Precision," Proceedings of ASPLOS II. pp. 90-99.

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.