Compilers taking advantage of architectural enhancements

aglew@crhc.uiuc.edu (Andy Glew)
11 Oct 90 19:49:20 GMT


Newsgroups: comp.compilers
From: aglew@crhc.uiuc.edu (Andy Glew)
In-Reply-To: gillies@m.cs.uiuc.edu's message of 10 Oct 90 21:51 CDT
Keywords: design, optimize
Organization: Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign
References: <162639@<1990Oct9> <3300194@m.cs.uiuc.edu>
Date: 11 Oct 90 19:49:20 GMT

[Copied from comp.arch. -John]


>[gillies@m.cs.uiuc.edu]
>One of the problems in new CPU designs is that designers don't realize
>which architecture enhancements are pointless, because we don't and
>may never have the optimization technology to take advantage of them.


Ah, a potentially interesting and useful topic. Perhaps we can start
a discussion that will lead to a list of possible hardware
architectural enhancements that a compiler can/cannot take advantage
of? Maybe the real compiler guys (preston, David C.) will help us
out?


By the way, it is useful in such discussions to distinguish between
run-of-the-mill optimizations that can pretty much be added to any
existing compiler by a fairly good CS grad, and optimizations that
require so much background that only the "experts" in the field can
reasonably be expected to implement them. For example, many of the
parallelizing FORTRAN loop optimizations can reasonably be expected
to be done only by people from Rice or UIUC CSRD (note that I'm
UIUC, but CRHC, not CSRD), so unless you are willing to pay for these
guys (or their assorted front companies) you aren't likely to get them
onto your machine.
        Cost of compiler development can be significant. Sometimes a
company might put a hardware feature in even though they know a
compiler approach would be better, because they can't afford the
compiler guys (or the compiler guys already have exclusive contracts
with a competitor).




Let me list a few things to start off the discussion. I hope and
expect to be shot down on a few of them. It's easy to list a few of
the hardware enhancements that we already know compilers can take
advantage of.




Branch delay slots - small number
        One or two branch delay slots can be filled quite easily.
        DIFFICULTY: a reasonably good CS grad can do it.
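
To make the flavor concrete, here is a toy sketch in C of the usual
single-slot trick: walk backwards from the branch and move the first
instruction that nothing after it (including the branch) depends on
into the slot. The instruction record and field names are invented for
the example; a real compiler does this on its own IR and also has to
worry about memory operations, which this ignores.

/* Toy single-delay-slot filler.  An instruction is summarized by one
   defined register and up to two used registers (-1 means "none");
   loads, stores, and other side effects are not modeled. */

typedef struct {
    int def;        /* register written, -1 if none */
    int use[2];     /* registers read, -1 if unused */
    int is_branch;
} Insn;

static int uses(const Insn *i, int r)
{
    return r != -1 && (i->use[0] == r || i->use[1] == r);
}

/* Two instructions may be reordered if neither reads or rewrites a
   register the other writes. */
static int independent(const Insn *a, const Insn *b)
{
    return !uses(a, b->def) && !uses(b, a->def) &&
           (a->def == -1 || a->def != b->def);
}

/* block[n-1] is the delayed branch.  Return the index of an earlier
   instruction that can be moved into the delay slot, or -1 if the
   slot has to hold a NOP. */
int pick_slot_filler(const Insn *block, int n)
{
    for (int i = n - 2; i >= 0; i--) {
        int ok = !block[i].is_branch;
        for (int j = i + 1; j < n && ok; j++)
            ok = independent(&block[i], &block[j]);
        if (ok)
            return i;
    }
    return -1;
}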


Branch delay slots - large number
        Most groups report difficulty filling a large number of branch delay
        slots, although my group's compiler guy (UIUC IMPACT, Prof. Wen-mei Hwu)
        has been able to get good results for up to 8 delay slots, on slightly
        different hardware (forward semantic, with profiling feedback).
        DIFFICULTY: given the slightly different hardware, and exposure
         to the necessary ideas, a reasonably good CS grad can do it.


Register file - moderate sized (up to 32 registers)
        Most compilers can use these efficiently.
        DIFFICULTY: a reasonably good CS grad can do it.
         Avoiding IBM's patents is a slight problem.
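
For reference, the core of the standard technique (Chaitin-style graph
coloring) is compact; the sketch below is a simplified illustration
only, assuming the interference graph is already built, and leaving
out spilling, coalescing, and precolored machine registers, which is
where much of the real work goes.

/* Simplified Chaitin-style simplify/select over an interference graph
   with V virtual registers and K machine registers. */

#define V 64          /* virtual registers (illustrative size) */
#define K 32          /* machine registers */

static int interferes[V][V];   /* 1 if the two values are live at once */
static int color[V];           /* assigned machine register, -1 = none */

static int degree(int v, const int removed[])
{
    int d = 0;
    for (int u = 0; u < V; u++)
        if (u != v && !removed[u] && interferes[v][u])
            d++;
    return d;
}

void allocate(void)
{
    int removed[V] = {0}, stack[V], sp = 0;

    /* Simplify: repeatedly remove a node with fewer than K neighbors.
       A real allocator would pick a spill candidate when no node is
       removable; this sketch just stops. */
    for (int done = 0; !done; ) {
        done = 1;
        for (int v = 0; v < V; v++) {
            if (!removed[v] && degree(v, removed) < K) {
                removed[v] = 1;
                stack[sp++] = v;
                done = 0;
            }
        }
    }

    /* Select: pop nodes and give each the lowest color no neighbor has. */
    for (int v = 0; v < V; v++) color[v] = -1;
    while (sp > 0) {
        int v = stack[--sp], used[K] = {0};
        for (int u = 0; u < V; u++)
            if (interferes[v][u] && color[u] >= 0)
                used[color[u]] = 1;
        for (int c = 0; c < K; c++)
            if (!used[c]) { color[v] = c; break; }
    }
}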


Register file - large (around 128 registers, or more)
        Most compilers do not get enough benefit from these to justify
        the extra hardware or the slowed-down register access.


Multiple, hierarchical register sets
        Like Cray's fast S registers and slower T registers.
        It took a very long time for compilers to take advantage
        of these multiple register sets.
        DIFFICULTY: complex.




Separate floating point register file
        Most compilers can take advantage of this.
        DIFFICULTY: easy


Heterogeneous register file
        Few compilers have been developed that can take advantage of a truly
        heterogeneous register file, one in which, for example, the divide unit
        writes to registers D1..D2, the add unit writes to registers A1..A16, and
        the shift unit writes to registers S1..S4 -- even though such hardware
        conceivably has a cycle-time advantage over homogeneous registers, even
        on VLIW machines where data can easily be moved to generic registers
        when necessary.
        DIFFICULTY: hard.
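
To see where the pain comes from: the compiler can no longer treat "a
register" as one interchangeable resource, because every value is
pinned to the class of the unit that computes it, and some classes are
tiny. A toy illustration in C (class sizes taken from the example
above, everything else invented):

/* Register classes for the hypothetical machine above: the add unit
   writes A1..A16, the divide unit D1..D2, the shift unit S1..S4. */

typedef enum { CLASS_A, CLASS_D, CLASS_S, NCLASS } RegClass;

const int class_size[NCLASS] = { 16, 2, 4 };

typedef enum { OP_ADD, OP_DIV, OP_SHIFT } Opcode;

/* Each operation's result lands in its own unit's class. */
RegClass result_class(Opcode op)
{
    switch (op) {
    case OP_DIV:   return CLASS_D;
    case OP_SHIFT: return CLASS_S;
    default:       return CLASS_A;
    }
}

/* A consumer that reads from a different class forces an explicit
   copy, so the allocator is really solving several small, coupled
   coloring problems (only 2 colors for D!) plus a move-insertion
   problem, instead of one problem with interchangeable colors. */
int needs_copy(Opcode producer, RegClass consumer_reads)
{
    return result_class(producer) != consumer_reads;
}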


Instruction cache
        Many groups have reported compilers that can take advantage of small
        instruction caches by rearranging code to minimize I-cache misses, e.g.
        HP and UIUC IMPACT.
        DIFFICULTY: moderately complex - a good CS grad can do it, but it
        requires tools and background that take a while to develop and are not
        widely available. One of the complicating factors is that this sort of
        optimization is most useful when it is done in cooperation with the
        linker/loader, which is something that your average compiler hacker
        doesn't have background in (although it isn't too hard).
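
The flavor of that work, very roughly: weight the transitions between
basic blocks with profile counts, then greedily chain the hottest
pairs so that frequently taken paths become straight-line code and
stay together in the cache. The sketch below is in that spirit only
(it is not HP's or IMPACT's algorithm), with the profile already
reduced to a frequency matrix:

/* Greedy profile-guided block layout sketch.  Blocks are 0..N-1 and
   freq[a][b] is the profiled number of transitions a -> b.  Each
   block starts as its own chain; repeatedly glue the chains joined by
   the hottest remaining edge whose endpoints are a chain tail and a
   chain head. */

#define N 8   /* illustrative number of basic blocks */

static long freq[N][N];
static int next_in_chain[N];   /* -1 = chain tail */
static int prev_in_chain[N];   /* -1 = chain head */

void layout(void)
{
    for (int i = 0; i < N; i++)
        next_in_chain[i] = prev_in_chain[i] = -1;

    for (;;) {
        long best = 0;
        int ba = -1, bb = -1;
        for (int a = 0; a < N; a++) {
            if (next_in_chain[a] != -1) continue;      /* a must be a tail */
            for (int b = 0; b < N; b++) {
                if (b == a || prev_in_chain[b] != -1)  /* b must be a head */
                    continue;
                /* don't close a cycle: b's chain must not end at a */
                int t = b;
                while (next_in_chain[t] != -1) t = next_in_chain[t];
                if (t == a) continue;
                if (freq[a][b] > best) { best = freq[a][b]; ba = a; bb = b; }
            }
        }
        if (ba < 0) break;                 /* no positive edge left */
        next_in_chain[ba] = bb;
        prev_in_chain[bb] = ba;
    }
    /* Emitting the chains hottest-first would complete the layout. */
}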


Data cache - software managed consistency
        This reportedly has been done, but certainly isn't run-of-the-mill.
        DIFFICULTY: needs a skilled compiler expert.


Micro-scheduling parallelism (like CONVEX's ASAP)
        Apparently it isn't too hard to develop compiler support for this.
        Hardware is rather expensive, though.
        DIFFICULTY: moderately complex


Vectorizing
        Done in lots of places, but high-performance vectorizing compilers
        still seem to be a problem for a lot of companies.
        DIFFICULTY: complex. The easy things are easy, but milking the
        last few FLOPs out of your vector unit requires compiler experts.
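
The "easy things are easy" part is the textbook case: a loop with no
loop-carried dependence gets strip-mined into vector-length chunks.
Shown below in plain C with an illustrative VL; the inner loop stands
in for what would really become vector load/multiply/add/store
instructions.

/* What a vectorizer conceptually does to the easy case. */

#define VL 64   /* illustrative vector length, e.g. a Cray-style 64 */

void saxpy_scalar(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];          /* no loop-carried dependence */
}

void saxpy_strip_mined(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i += VL) {
        int len = (n - i < VL) ? n - i : VL;   /* last, partial strip */
        /* each strip would become one vector load/multiply/add/store */
        for (int j = 0; j < len; j++)
            y[i + j] = a * x[i + j] + y[i + j];
    }
}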


Parallelizing - fine grain, small numbers of processors
        (like Alliant's concurrent hardware)
        Done, e.g., by Alliant and Cray.
        DIFFICULTY: to be competitive with this sort of hardware, you
         need real compiler experts on your team. Harder than vectorizing.


Parallelizing - fine grain, large numbers of processors
        What Tera is trying to do.
        DIFFICULTY: very complex. Leading edge.
         Good luck, Tera.


Multiple functional units - heterogeneous - VLIW or superscalar
        DIFFICULTY: complex.
         Complexity of the packing algorithms grows rapidly as
         you try to go beyond basic-block parallelism.


Multiple functional units - homogeneous - VLIW or superscalar
        DIFFICULTY: moderately complex
         Easier than the heterogeneous case, and the packing algorithms
         are considerably simpler.
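
For the basic-block core of either case, the packer is essentially
list scheduling; the sketch below is a bare-bones homogeneous version
(no priority heuristic, no per-unit slot constraints -- adding those
is exactly what makes the heterogeneous case harder). The Op record
and its fields are invented for the illustration; dependences and
latencies (assumed to be at least 1) are taken as precomputed.

/* Bare-bones list scheduling of one basic block onto a machine that
   can issue WIDTH independent operations per cycle. */

#define WIDTH  4     /* issue slots per cycle (illustrative) */
#define MAXOPS 128

typedef struct {
    int npreds;        /* unscheduled predecessors remaining */
    int ready;         /* earliest issue cycle once npreds reaches 0 */
    int latency;       /* cycles until the result can be used (>= 1) */
    int ndeps;         /* number of dependent ops */
    int deps[MAXOPS];  /* indices of dependent ops */
    int scheduled;
} Op;

/* Returns the number of cycles used; fills cycle_of[i] for each op. */
int list_schedule(Op *ops, int n, int cycle_of[])
{
    int done = 0, cycle = 0;
    while (done < n) {
        int issued = 0;
        for (int i = 0; i < n && issued < WIDTH; i++) {
            if (ops[i].scheduled || ops[i].npreds > 0 || ops[i].ready > cycle)
                continue;
            ops[i].scheduled = 1;
            cycle_of[i] = cycle;
            issued++;
            done++;
            /* successors become ready after this op's latency */
            for (int d = 0; d < ops[i].ndeps; d++) {
                int s = ops[i].deps[d];
                ops[s].npreds--;
                if (ops[s].ready < cycle + ops[i].latency)
                    ops[s].ready = cycle + ops[i].latency;
            }
        }
        cycle++;
    }
    return cycle;
}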


Special hardware instructions - scalar
        Taking advantage of simple instructions like abs(), conditional exchange, etc.
        DIFFICULTY:
         (1) When treated not as a compiler problem, but as a problem of simply
         writing libraries that inline optimized machine code: EASY.
         Requires inlining support.
         (2) When treated as a compiler problem, e.g. the compiler should recognize
         IF a < 0 THEN a = -a
         and automatically convert it to a = ABS(a): MODERATELY COMPLEX.
        I expect Herman Rubin to comment on this.
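
A sketch of what (2) amounts to at the IR level, on a little invented
statement representation (a real compiler would do the same match on
its own trees or flow graph, and would also verify the guard is
exactly "variable < 0"):

#include <stddef.h>

/* Toy IR-level recognition of "IF v < 0 THEN v = -v" as v = ABS(v). */

typedef enum { STMT_IF, STMT_NEG_ASSIGN, STMT_ABS_ASSIGN } StmtKind;

typedef struct Stmt {
    StmtKind kind;
    int var;                  /* variable assigned (for assignments) */
    int cmp_lt_zero_var;      /* for STMT_IF: variable tested with < 0 */
    struct Stmt *then_body;   /* for STMT_IF: guarded statement list */
    struct Stmt *next;
} Stmt;

void recognize_abs(Stmt *s)
{
    for (; s != NULL; s = s->next) {
        if (s->kind == STMT_IF &&
            s->then_body != NULL &&
            s->then_body->next == NULL &&              /* exactly one stmt */
            s->then_body->kind == STMT_NEG_ASSIGN &&
            s->then_body->var == s->cmp_lt_zero_var) { /* same variable */
            s->kind = STMT_ABS_ASSIGN;                 /* v = ABS(v) */
            s->var = s->cmp_lt_zero_var;
            s->then_body = NULL;
        }
    }
}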


--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]
--

