
Related articles:
optimizing compiler against average assembly programmer. gclind01@starbase.spd.louisville.edu (1997-06-15)
Re: optimizing compiler against average assembly programmer. gclind01@starbase.spd.louisville.edu (1997-06-30)
Re: optimizing compiler against average assembly programmer. debray@cs.arizona.edu (1997-06-30)
Re: optimizing compiler against average assembly programmer. rhyde@cs.ucr.edu (1997-07-04)
Superoptimizer (was: optimizing compiler against assembly programmer.) pardo@cs.washington.edu (1997-07-13)
Re: Superoptimizer (was: optimizing compiler against assembly programmer.) qua@microunity.com (1997-07-16)

From: qua@microunity.com (Henry Massalin)

Newsgroups: comp.compilers

Date: 16 Jul 1997 23:04:32 -0400

Organization: Compilers Central

References: 97-06-071 97-06-110 97-06-140 97-07-023 97-07-067

Keywords: performance, optimize, architecture

Hi!

As author of the original Superoptimizer program, I feel moved to

respond to this thread, if only to say hello and expand on a few

things.

First of all, I feel quite honored that even 10 years after the fact,

this is still being talked about. I'd like to publicly acknowledge my

Ph.D. advisor, Calton Pu. It was his help and encouragement to forge

ahead that led to what was my very first research paper.

| The 68K has some tricky instructions so I do expect weird

| combinations to output unexpected results, but what about

| more streamlined ISAs [...] Would the Superoptimizer really

| be useful on a MIPS ?

Oh, yes! See the rest of the response.

| As the title above suggests, the SO produces the *shortest*

| possible program. This is not always the same thing as the

| *fastest* possible program.

True. The original program used this metric because:

(1) It is the simplest cost-metric to implement.

(2) It happened that for the instructions in its search-space,

there was a 1:1 correspondence between program length and

execution-time. (i.e., no latency-dependent stalls).

As many have already pointed out, shortest-program isn't the only

metric, nor necessarily the preferred one.
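To make the shortest-program metric concrete, here is a sketch of the
brute-force, shortest-first search in Python. Everything here is
invented for illustration - a five-instruction, two-register toy ISA,
not the original 68020 instruction set - but the structure (enumerate
length 1, then 2, ... and test each candidate against a known-good
function) is the essence of the approach:

```python
# Brute-force shortest-program search over a toy 2-register (a, b) ISA.
# All opcode names, the ISA, and the target function are illustrative.
from itertools import product

MASK = 0xFFFFFFFF

def sar31(v):                       # arithmetic shift right by 31 bits
    return MASK if v & 0x80000000 else 0

# Each op maps machine state (a, b) -> (a, b).
OPS = {
    "mov  b,a": lambda a, b: (a, a),
    "sar31 b":  lambda a, b: (a, sar31(b)),
    "add  a,b": lambda a, b: ((a + b) & MASK, b),
    "sub  a,b": lambda a, b: ((a - b) & MASK, b),
    "xor  a,b": lambda a, b: (a ^ b, b),
    "neg  a":   lambda a, b: (-a & MASK, b),
}

TESTS = [x & MASK for x in (0, 1, 5, -1, -5, 123456, -70000)]

def target(x):                      # abs() on 32-bit two's complement
    sx = x - (1 << 32) if x & 0x80000000 else x
    return abs(sx) & MASK

def run(prog, x):
    a, b = x, 0                     # input in a, b starts zeroed
    for op in prog:
        a, b = OPS[op](a, b)
    return a

def search(max_len):
    # Shortest-first enumeration: the first hit is a shortest program.
    for n in range(1, max_len + 1):
        for prog in product(OPS, repeat=n):
            if all(run(prog, x) == target(x) for x in TESTS):
                return prog
    return None

print(search(4))
```

This finds a branch-free 4-instruction abs(). Note the hazard, too: a
handful of test vectors can be fooled by a lookalike function, so a
real superoptimizer needs a much stronger verification step than this
sketch bothers with.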

However, "fastest-program" isn't the correct metric either. At least,

not unless you have a *very* sophisticated cache and memory model to

back it up. Why? Because one can always compute complicated

functions in essentially constant time using table-lookup in a

sufficiently large address space. But most people would probably not

want to dedicate a (2^67)-byte chunk of cache memory just to make

double-precision sin(x) go fast... :-)

So you'd somehow have to cap the total memory used. A weighted

cost-function may help:

cost = exec_time (us) + mem_used * (ConversionFactor us/byte)

Of course, all this may be overly complicated: All superopt-like

programs that I'm aware of implicitly cap memory usage - they don't

search the class of functions having internal lookup tables.

While on the topic of metrics, an interesting one might be to assume

all memory except for the code-fragment starts out zeroed, and to

optimize the total runtime for N executions of the target function

plus the initialization cost of any internal tables. This does weigh

the searcher toward easy-to-initialize tables; on the other hand, if a

table is easy to initialize, there's probably also a simple

instruction sequence that obviates the need for that table...
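Under that metric the comparison is just arithmetic. A sketch, with
made-up timings (every number below is illustrative, not measured):

```python
def amortized(init_us, per_call_us, n):
    """Total runtime for n executions, counting one-time table setup."""
    return init_us + per_call_us * n

# Illustrative numbers only: a 256-entry table costing 25.6 us to
# initialize but 0.04 us/call, vs. table-free code at 0.10 us/call.
table = lambda n: amortized(25.6, 0.04, n)
plain = lambda n: amortized(0.0, 0.10, n)

# Break-even point: init_cost / (slow_per_call - fast_per_call)
breakeven = 25.6 / (0.10 - 0.04)    # about 427 calls

for n in (100, 1000, 10000):
    print(n, table(n), plain(n))
```

Below the break-even count the table-free code wins, which is exactly
how this metric penalizes expensive-to-initialize tables.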

Of course, there is the question of how to search over functions that

have internal data tables. Counting through all possible values for

each of the table entries doesn't get you very far. Some kind of

theorem-prover is necessary to eliminate huge chunks of search space.

This is difficult, but not impossible.

The general approach I've tried is to "execute" the test-code

symbolically and come up with an expression that relates the values in

the tables to the squared-difference between the function-value

returned and the wanted result. The goal is to find the global

minimum squared-error and the table entries that correspond to it. If

the table affects the output in an arithmetic way (i.e., no bitwise

masking ops between the table-lookup and the final output), one can

take partial-derivatives with respect to each table entry and solve

for zero to locate the minima.

A correct function has a sum-of-squared-differences equal to 0, which

must be the global minimum, so this method will find, if they exist,

the necessary table entries that go with a particular code-fragment.

You can do this in O(N^3) time, where N is the number of elements in

the table. (I think it's O(N^3)...might be O(N^4). And again,

subject to the condition that the table affects the output in an

arithmetic way.) This is a powerful technique because the length of

code fragment required to compute a function often goes down

significantly when you allow lookup-tables. So even though the cost

of program-testing goes way up (symbolic execution is *very*

expensive), the searching-depth required goes way down.
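Here is a sketch of that solver for an assumed toy fragment,
out(x) = x + t[x >> 2], with a 4-entry correction table (the fragment,
target function, and sizes are all invented for illustration).
Symbolic execution of the fragment yields, for each test input, a
coefficient row saying how each table entry flows into the output,
plus a table-independent base term; because the table enters
arithmetically, minimizing the summed squared error reduces to the
normal equations - an O(N^3) linear solve:

```python
# Least-squares table-entry finder for an assumed toy fragment:
#   out(x) = base(x) + sum_j C[x][j] * t[j],  here base(x) = x
N = 4                                   # table entries
xs = range(16)
target = lambda x: (x * x) >> 2         # function we want to approximate
base = lambda x: x

def coeff_row(x):                       # from "symbolic execution"
    row = [0.0] * N
    row[x >> 2] = 1.0                   # this fragment does one lookup
    return row

# Accumulate the normal equations: (C^T C) t = C^T r, r = target - base
A = [[0.0] * N for _ in range(N)]
b = [0.0] * N
for x in xs:
    c, r = coeff_row(x), float(target(x) - base(x))
    for i in range(N):
        b[i] += c[i] * r
        for j in range(N):
            A[i][j] += c[i] * c[j]

def solve(A, b):                        # Gaussian elimination, O(N^3)
    n = len(b)
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            b[i] -= f * b[k]
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
    t = [0.0] * n
    for i in reversed(range(n)):
        t[i] = (b[i] - sum(A[i][j] * t[j]
                           for j in range(i + 1, n))) / A[i][i]
    return t

t = solve(A, b)
sse = sum((target(x) - base(x) - t[x >> 2]) ** 2 for x in xs)
print([round(v, 3) for v in t], round(sse, 3))
```

A zero sse would mean this fragment-plus-table computes the target
exactly; here it stays positive, telling the searcher that no 4-entry
additive table rescues this particular fragment.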

I have, in fact, used a variant of this method to help find 8- and

16-bit fixed-point approximations to sin(x) (among others) - each

having small (16-entry) internal lookup tables. Using the Microunity

processor's vector-lookup instruction, this gets nearly 50 million

evaluations of complex z*exp(I w) per second at a chip clock of 200MHz.

A similar method can often help find values for immediate fields

in instructions without having to search through all possibilities.
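The reciprocal-multiply trick for unsigned division is a good concrete
instance (my choice of example, not one from the original paper): the
multiplier immediate k for dividing by d with shift s can be solved
for directly as k = ceil(2^s / d), rather than searching all 2^s
candidates:

```python
# Solving for an immediate field analytically instead of by search:
# unsigned divide-by-d via multiply-and-shift, 16-bit inputs.
d, s = 10, 19                       # toy parameters
k = ((1 << s) + d - 1) // d         # ceil(2^s / d) -> 52429

# One pass over the whole 16-bit input range confirms the solved
# immediate, where a blind search would test half a million candidates.
assert all((x * k) >> s == x // d for x in range(1 << 16))
print(f"(x * {k}) >> {s} == x // {d} for all 16-bit x")
```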

| (2) Optimization is provably NP-Complete and that's how

| the SO runs.

This is still true. However, you can use theorem-prover generated

rulebases and symbolic execution to greatly reduce the size of the

exponent. The idea is to derive sets of equivalent code sequences.

You search only one member from each equivalence set; when you find a

code fragment that does what you want, then you can go grovelling for

which particular choice from each equivalence set gives the shortest

overall execution time. In fact, once you're at the level of

equivalence-sets, you're really representing small code-fragments as

a set of transformation-rules. The testing phase goes faster, since

you don't have to emulate every instruction one at a time. It helps

with the symbolic execution necessary to find values for any internal

lookup-tables. And it cuts down on the branching factor.

With equivalence-sets up to depth 5, the branching factor is around 4 -

small enough to do a rich search to depth 20 in about a week on an

ordinary 10-processor DEC-Alpha machine :-)
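The redundancy being exploited is easy to demonstrate. The sketch
below (a toy single-register 8-bit ISA of my own invention - the real
system derives its equivalence sets with a theorem prover, not by
enumeration) fingerprints every sequence by its exact behavior on all
256 inputs and counts distinct behaviors per depth; the ratio of
successive counts is the effective branching factor:

```python
# Measuring the effective branching factor of a toy ISA by collapsing
# behaviorally-equivalent instruction sequences into one class each.
from itertools import product

MASK = 0xFF                          # 8-bit toy machine, one register
OPS = {
    "inc": lambda a: (a + 1) & MASK,
    "dec": lambda a: (a - 1) & MASK,
    "not": lambda a: ~a & MASK,
    "neg": lambda a: -a & MASK,
    "shl": lambda a: (a << 1) & MASK,
}

def behavior(prog):
    """Exact fingerprint: the sequence's output on every 8-bit input."""
    outs = []
    for x in range(256):
        a = x
        for op in prog:
            a = OPS[op](a)
        outs.append(a)
    return tuple(outs)

counts, prev = {}, None
for n in range(1, 5):
    classes = {behavior(p) for p in product(OPS, repeat=n)}
    counts[n] = len(classes)
    bf = len(classes) / prev if prev else float(len(classes))
    print(f"depth {n}: {5**n:4d} sequences, {len(classes):4d} distinct "
          f"behaviors, effective branching ~{bf:.2f}")
    prev = len(classes)
```

Even at depth 2 the collapse is visible: "inc dec", "dec inc",
"not not", and "neg neg" are all the identity, and "not neg" is just
"inc", so only one representative of each needs searching.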

Another idea is to treat classes of short code-fragments as if they

were lookup-tables albeit with constraints on the entries, and use a

method similar to the lookup-table-finder discussed earlier. At this

level, you're essentially "proving" what classes of code-fragments

admit potential solutions.

One of these days I may actually get the time to write a followup

paper on the subject....

So while NP-class problems are hard, it doesn't mean you can't take

huge whacks at that exponent... :-)

Henry
