Re: Writing fast compilers II rs/6000 (John D. McCalpin)
18 Aug 91 14:17:49 GMT


Newsgroups: comp.compilers
From: (John D. McCalpin)
Keywords: design, optimize
Organization: College of Marine Studies, U. Del.
References: <>
Date: 18 Aug 91 14:17:49 GMT

[From comp.arch -John]

>>>>> On 18 Aug 91 00:42:43 GMT, (Piercarlo Grandi) said:
pcg> Consider the appropriate choice of abstraction level: if your
pcg> application requires a matrix multiplication, is it clearer and more
pcg> efficient to write three nested loops to do it and have them rewritten
pcg> behind your back to five only if the target is a vector machine, or to
pcg> write 'a matmult b' and have the compiler generate either the three- or
pcg> five-loop version depending on machine type?
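
For concreteness, the three-loop version here is just the textbook
triple loop. A minimal C sketch (the function name is mine, for
illustration only); on a vector or hierarchical-memory machine a
compiler would strip-mine and block this into five or more loops:

/* Textbook three-loop matrix multiply, c = a*b, all n x n. */
void matmul(int n, double a[n][n], double b[n][n], double c[n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            /* innermost loop is the dot product of row i and column j */
            for (int k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}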

That is fine, of course, if you are doing matrix multiplication. Since I
occasionally find myself doing other things {:-)}, I find (not too
surprisingly) that there exist no HLL's that implement the operations that I
need. If the language is suitably extensible, then I can define them myself.
But to get the required efficiency, I still have to program the kernels in an
ugly, unreadable, machine-specific way.

The example I posted a few days ago is a case in point. I want to be
able to write:
t = jacobian(h,z)
where the finite-difference Jacobian operator is defined as the
average of the 3 straightforward forms (J0,J+,J-), which are defined
by the 3 groups of lines in my previous posting (original form). In
a suitable HLL, one might provide the compiler with something like:

array function jacobian(a,b) = (J0(a,b)+JP(a,b)+JM(a,b))/3.0
array function J0(a,b) = (first two lines of "original code")
array function JP(a,b) = (next three lines of "original code")
array function JM(a,b) = (next three lines of "original code")
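
The "original code" lines are not reproduced here, so the following is
only a sketch of the kind of kernel these expand to. Taking J0 to be
the standard second-order centered-difference Jacobian
J(a,b) = (da/dx)(db/dy) - (da/dy)(db/dx) on a uniform grid, a C
version might read as below; the exact J0/JP/JM forms remain those of
the earlier posting, not this sketch:

/* Sketch only: centered-difference Jacobian on a uniform grid with
   spacings dx and dy, interior points only. */
void j0(int m, int n, double dx, double dy,
        double a[m][n], double b[m][n], double t[m][n])
{
    double r = 1.0 / (4.0 * dx * dy);
    for (int i = 1; i < m - 1; i++)
        for (int j = 1; j < n - 1; j++)
            t[i][j] = r * ((a[i+1][j] - a[i-1][j]) * (b[i][j+1] - b[i][j-1])
                         - (a[i][j+1] - a[i][j-1]) * (b[i+1][j] - b[i-1][j]));
}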

I believe that optimizations such as those I showed in the "optimized" form
are well beyond the ability of modern compilers when presented with a
"high-level" description of the problem. I wish it were not so, but I have
not yet found *any* compiler that can fuse and unroll loops to produce code
that is within 25% of the manually "improved" versions.
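
To make the missing transformation concrete: the straightforward
translation of t = jacobian(h,z) is four full array sweeps (J0, JP,
and JM into temporaries, then the average), while the hand-"improved"
version is a single sweep with the three stencils combined in
registers. A C sketch of the fused shape follows; for brevity one
centered stencil stands in for all three forms, so this shows the
shape of the optimization, not the real code:

/* Fused-loop sketch (illustrative only).  The naive version writes
   three temporary arrays and reads them back for the average; this
   version makes one pass, so each element of a and b is loaded from
   memory once per iteration and no temporary arrays are needed. */
static double stencil(int n, double a[][n], double b[][n], int i, int j)
{
    /* stand-in for the per-point J0/JP/JM expressions */
    return (a[i+1][j] - a[i-1][j]) * (b[i][j+1] - b[i][j-1])
         - (a[i][j+1] - a[i][j-1]) * (b[i+1][j] - b[i-1][j]);
}

void jacobian_fused(int m, int n, double dx, double dy,
                    double a[m][n], double b[m][n], double t[m][n])
{
    double r = 1.0 / (4.0 * dx * dy);
    for (int i = 1; i < m - 1; i++)
        for (int j = 1; j < n - 1; j++)
            /* all three forms combined in registers in one pass */
            t[i][j] = r * (stencil(n, a, b, i, j)    /* "J0" */
                         + stencil(n, a, b, i, j)    /* "JP" stand-in */
                         + stencil(n, a, b, i, j))   /* "JM" stand-in */
                        / 3.0;
}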

The biggest problem is that the optimizations that are *crucial* on
hierarchical-memory machines (function/subroutine inlining -> loop
fusion/rewriting -> loop unrolling) are still rather poorly done.
On machines that have enough memory bandwidth to run in streaming
mode (e.g., the ETA-10 and the Cray X and Y), the original code gives
performance that is not too far from optimum.
John D. McCalpin
Assistant Professor
College of Marine Studies, U. Del. J.MCCALPIN/OMNET
