Re: Re How many vector registers are useful? (Bill Mangione-Smith)
Tue, 2 Feb 1993 14:05:40 GMT

          From comp.compilers

Related articles
How many vector registers are useful? (1993-01-25)
Re How many vector registers are useful? (1993-01-26)
Re: Re How many vector registers are useful? (1993-01-28)
Re: Re How many vector registers are useful? idacrd!desj@uunet.UU.NET (1993-01-31)
Re: Re How many vector registers are useful? (1993-02-02)
| List of all articles for this month |

Newsgroups: comp.compilers
From: (Bill Mangione-Smith)
Keywords: optimize, vector
Organization: University of Michigan
References: 93-01-174 93-02-016
Date: Tue, 2 Feb 1993 14:05:40 GMT

Bill Mangione-Smith <> writes:
> Santosh Abraham, Ed Davidson, and I had a paper two asplos's ago that
> looked at the minimal number of vector registers required for specific
> codes. [.... W]e decided to focus on the minimal number of registers
> required to achieve optimal performance.

idacrd!desj@uunet.UU.NET (David desJardins) writes:
      I haven't looked at your paper, but I think that you have to be very
      careful in using the word "optimal" here. I have written a fair number of
      assembly-language routines for vector machines, and it is very often the
      case that the number of vector registers needed for "nearly optimal" code
      is substantially less than that needed for "perfectly optimal" code.

I've certainly seen that happen as well. We constructed a tool that used
implicit enumeration, along with a predictable set of latencies (as you
suspected), to guarantee optimal performance and minimal register
requirements. The study really focused on register requirements, and not
code scheduling. But once I convince you that we meant real optimal, and
not fuzzy optimal like most people seem to, you've got to either assume
that I don't know the problem complexity or I'm not selling compiler

      In my experience, what often happens is that you can get a code which is
      "nearly optimal" in the sense of taking the correct number of chimes to
      execute the loop, but a few more ticks than is strictly necessary, because
      the usage of the vector registers is not perfectly synchronized. A vector
      functional unit might have to wait for its input for a few ticks, for
      example, because the latency of the unit feeding it is greater than its
      own latency. These few ticks might only add a few percent to the
      execution time of the loop, but it might take as much as double the number
      of vector registers to eliminate them.

In our study we did assume an idealized machine model, but you could
easily account for these effects without increasing the complexity of the
problem. In fact, the addition of resource hazards tends to reduce the
search time by pruning the tree sooner. But you are right, in practice
you probably want to ignore these issues. Jalby, Eisenbies, and
Lichnewsky had a very simple machine model for the Cray-2 which achieved
very good results. Of course, you could get into trouble if you were
scheduling on a machine with either short vector registers or a
programmable scratch pad that the compiler decided to divide into lots of
short registers.

      Perhaps you were looking at some sort of "ideal" vector machine? Assuming
      things like constant latencies in the functional units would certainly
      simplify a truly optimal analysis while probably producing nearly
      equivalent results for practical purposes.

I don't see how we can discuss real optimal schedules without assuming
constant latencies. For many machines the function units really have
constant latencies, and even without multiprocessor contention memory
references are a tough problem (to say the least). But one thing we can
do is assume a pessimistic memory system (maybe modeled in the compiler as
a varied load depending on the code) and schedule to that, like the
Cydrome people did.

Ultimately, who knows what the memory is going to do from run to run,
except in a really degenerate case? Scheduling pessimistically can cause
resource collisions that you would not have otherwise seen. It may not be
possible to say for sure what the optimal performance on live hardware is,
let alone identify or generate code that achieves it.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.