
Related articles
Re: High Precision Arithmetic Benchmarks glew@ichips.intel.com (1996-07-28)
Re: High Precision Arithmetic Benchmarks hbaker@netcom.com (1996-07-31)
Re: High Precision Arithmetic Benchmarks Terje.Mathisen@hda.hydro.com (Terje Mathisen) (1996-07-31)
Re: High Precision Arithmetic Benchmarks jgk@jgk.org (Joe Keane) (1996-08-04)

From: Terje Mathisen <Terje.Mathisen@hda.hydro.com>
Newsgroups: comp.arch,comp.benchmarks,comp.compilers
Date: 31 Jul 1996 19:19:55 -0400
Organization: Hydro
References: <4svpfb$ifq@news-rocq.inria.fr> <DOCONNOR.Jul2330104@sedona.intel.com> 96-07-196
Keywords: arithmetic

Andy Glew wrote:

[snip]

> There aren't all that many ways of writing extended precision add or
> multiply. Certainly, it is not at all unreasonable to expect a
> compiler to recognize an extended precision addition coded in the
> naive 8 bit or 16 bit algorithm - which is highly portable - and
> optimize it to be performed 64 or more bits at a time.
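For concreteness, the "naive 8 bit" portable addition being described might be sketched like this (my sketch, not anyone's actual code; the name `mp_add8` and the little-endian limb order are assumptions):

```c
#include <stdint.h>
#include <stddef.h>

/* r = a + b over n 8-bit limbs, least significant limb first.
   Returns the final carry out. The widened sum fits easily in an
   unsigned int: at most 255 + 255 + 1 = 511. */
static unsigned mp_add8(uint8_t *r, const uint8_t *a,
                        const uint8_t *b, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned t = (unsigned)a[i] + b[i] + carry;
        r[i] = (uint8_t)t;   /* low 8 bits of the limb sum */
        carry = t >> 8;      /* carry into the next limb */
    }
    return carry;
}
```

The point of the quote is that this loop's structure is simple enough that a compiler could legitimately recognize it and widen the limbs to the machine word size.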

[snip]

> The more advanced extended precision algorithms, Karatsuba or Comba,
> are a bit harder to scale from small to large integers.
>
> I therefore do not think it at all unreasonable that compilers should
> be able to support extended precision arithmetic coded in a high level
> language like C or C++, even if the algorithm is expressed in terms of
> smaller-than-optimal integer sizes, like 8 or 16 bits, for
> portability. Conversion to use larger integers like 64 bits, or
> conversion to use small integer vectors, should be within present
> compiler technology.

Even with Karatsuba you still need a regular O(N^2) multiply at the end;
this kernel seems to run faster (at least on a Pentium) when written using
fp muls, doing the minimum possible amount of carry propagation.

I do this by generating a single result digit each time through the outer
loop, instead of using saxpy()-like code on each row.

I.e. my code currently looks sort of like this (modulo some unrolling etc.):

  // a[N]*b[N] -> r[2N], each element is a 24-bit integer stored as an
  // IEEE single. At this point, N is guaranteed to be less than 32, so
  // the sum of 48-bit results cannot overflow a 53-bit double!
  double top, sum = 0.0;
  for (i = 0; i < N+N-1; i++) {
      for (j = max(0,i-N+1); j < min(N,i+1); j++) {
          sum += (double) a[j] * b[i-j];
      }
      top = floor(sum * two_to_minus_24);  // carry into the next column
      r[i] = sum - top * two_to_24;        // low 24 bits of the column sum
      sum = top;
  }
  r[N+N-1] = sum;

Even with only 24*24->48 bits generated by each mul, this code is quite a

bit faster than a 32*32->64 integer multiply on a Pentium, due to the

pipelined fp multiplier.

Properly scheduled code for the inner loop can run at 4 cycles/FMAC, which

more than makes up for the increased number of operations.

It seems to me that the integer version will have a tough job even on a PPro

(which has a 5-cycle pipelined integer mul), because the x86 architecture

doesn't have enough registers available to feed the multiplier while

accumulating 64-bit results.
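The integer version I have in mind is the usual schoolbook multiply with 32-bit limbs and a 64-bit accumulator, roughly like this (again a sketch with names of my own choosing, the saxpy()-per-row shape mentioned earlier):

```c
#include <stdint.h>
#include <stddef.h>

/* r[2n] = a[n] * b[n], 32-bit limbs, least significant first.
   Each inner step is a 32x32->64 mul plus two adds; the widened sum
   (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1 just fits in a uint64_t. */
static void mul_int32(uint32_t *r, const uint32_t *a,
                      const uint32_t *b, size_t n)
{
    for (size_t i = 0; i < 2 * n; i++)
        r[i] = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < n; j++) {
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;  /* low word */
            carry = t >> 32;         /* high word carries onward */
        }
        r[i + n] = (uint32_t)carry;
    }
}
```

Note that each inner step needs the product, the partial-result limb, and the carry live at once, which is exactly the register pressure problem on the x86.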

> I therefore think it *DESIRABLE* that any putative encryption
> benchmark contain efficient C-coded algorithms. Adaptation of the
> benchmark rules for assembly coded support should be optional,
> and, indeed, discouraged.

Agreed.

--

- <Terje.Mathisen@hda.hydro.com>
