Re: High Precision Arithmetic Benchmarks

Terje Mathisen <Terje.Mathisen@hda.hydro.com>
31 Jul 1996 19:19:55 -0400

          From comp.compilers

Related articles
Re: High Precision Arithmetic Benchmarks glew@ichips.intel.com (1996-07-28)
Re: High Precision Arithmetic Benchmarks hbaker@netcom.com (1996-07-31)
Re: High Precision Arithmetic Benchmarks Terje.Mathisen@hda.hydro.com (Terje Mathisen) (1996-07-31)
Re: High Precision Arithmetic Benchmarks jgk@jgk.org (Joe Keane) (1996-08-04)
| List of all articles for this month |

From: Terje Mathisen <Terje.Mathisen@hda.hydro.com>
Newsgroups: comp.arch,comp.benchmarks,comp.compilers
Date: 31 Jul 1996 19:19:55 -0400
Organization: Hydro
References: <4svpfb$ifq@news-rocq.inria.fr> <DOCONNOR.Jul2330104@sedona.intel.com> 96-07-196
Keywords: arithmetic

Andy Glew wrote:
[snip]
>
> There aren't all that many ways of writing extended precision add or
> multiply. Certainly, it is not at all unreasonable to expect a
> compiler to recognize an extended precision addition coded in the
> naive 8 bit or 16 bit algorithm - which is highly portable - and
> optimize it to be performed 64 or more bits at a time.


[snip]


> The more advanced extended precision algorithms, Karatsuba or Coomba,
> are a bit harder to scale from small to large integers.


> I therefore do not think it at all unreasonable that compilers should
> be able to support extended precision arithmetic coded in a high level
> language like C or C++, even if the algorithm is expressed in terms of
> smaller-than-optimal integer sizes, like 8 or 16 bits, for
> portability. Conversion to use larger integers like 64 bits, or
> conversion to use small integer vectors, should be within present
> compiler technology.


Even with Karatsuba you still need a regular O(N^2) multiply at the end,
this kernel seems to run faster (at least on a Pentium) when written using
fp muls, and doing the minimum possible amount of carry propagation.


I do this by generating a single result each time through the outer loop,
instead of using saxpy()-like code on each row.


I.e. my code currently looks sort of like this (modulo some unrolling etc):


// a[N]*b[N] -> r[2N], each element is a 24-bit integer stored as IEEE single
// At this point, N is guaranteed to be less than 32, so the sum of 48-bit
// results cannot overflow a 53-bit double!


    double top, sum = 0.0;
    for (i = 0; i < N+N-1; i++) {
        for (j = max(0,i-N+1); j < min(N,i+1); j++) {
            sum += (double) a[j] * b[i-j];
        }
        top = floor(sum * two_to_minus_24);
        r[i] = sum - top * two_to_24;
        sum = top;
    }
    r[N+N-1] = sum;


Even with only 24*24->48 bits generated by each mul, this code is quite a
bit faster than a 32*32->64 integer multiply on a Pentium, due to the
pipelined fp multiplier.


Properly scheduled code for the inner loop can run at 4 cycles/FMAC, which
more than makes up for the increased number of operations.


It seems to me that the integer version will have a tough job even on a PPro
(which has a 5-cycle pipelined integer mul), because the x86 architecture
doesn't have enough registers available to feed the multiplier while
accumulating 64-bit results.


> I therefore think it *DESIRABLE* that any putative encryption
> benchmark contain efficient C-coded algorithms. Adaptation of the
> benchmark rules for assembly coded support should be optional,
> and, indeed, discouraged.


Agreed.


--
- <Terje.Mathisen@hda.hydro.com>
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.