Interesting summary of compiler optimzation testing (Eugene Miya N.)
12 Jun 87 01:11:14 GMT

          From comp.compilers

Related articles
Interesting summary of compiler optimzation testing (1987-06-12)
| List of all articles for this month |

From: (Eugene Miya N.)
Newsgroups: comp.compilers
Keywords: Functional testing, performance, speedy
Date: 12 Jun 87 01:11:14 GMT
Organization: NASA Ames Research Center, Moffett Field, CA.

I sorry to take to so long in posting this summary, but these comments
have trickled in. I did actually miss some of the back and forth flaming.
But, I think in the end it was all worth it for Bill Wulf's
message at the very end. Note, more than one note to me asked to
anonymity because they felt very badly about the compilers they were
offering being poor optimizers. I think this is an area which needs
more work because it appears we are taking optimization for granted.
I agree with Wulf's sentiments about generators, I only wish we could do
something in the public domain to benefit everyone.

>From the Rock of Ages Home for Retired Hackers:

--eugene miya
    NASA Ames Research Center
From: "J Michael Lake" <>
Subject: Re: Testing your optimizing compilers

One good source of information would be the many texts and articles
which are now appearing which discuss formal testing methods. While
not specifically targeted to optimizing compilers, the methods outlined
are usually applicable. I know of at least one compiler (a private
project, which I am not at liberty to discuss in much depth--sorry) where
a good mix of white- and black-box testing discovered several architectural
errors (we thought the hardware did one thing, when it really did another--
out of date (or simply wrong?) hardware doc's misled us there) when
eliminating explicit test instructions.

Tartan Laboratories, Pittsburgh, PA, is in the business of producing "the
best optimizing compilers available." (Their boast.) You may want to
contact them to see if they are willing to discuss their methods.

Mike Lake
Institute for Defense Analyses
{I am primarily MLAKE@ADA20.ISI.EDU...followups should go there.}

From: Amos Shapir <nsc!nsta!instable.ether!amos@Sun.COM>
Subject: Re: Testing your optimizing compilers

I do not work on the compiler group here (I'm sure one of them is going
to answer you soon) but when I worked on CCI's 6/32 compiler I used to
have a set of programs, and an automatic compile & diff utility; the changes
from one version to the next were small enough to check that the code
remained correct and really optimized. The programs were an assorted mixture
of system sources, plus a growing number of 'weird special cases' routines.
Amos Shapir
National Semiconductor (Israel)
6 Maskit st. P.O.B. 3007, Herzlia 46104, Israel Tel. (972)52-522261 @{hplabs,pyramid,sun,decwrl} 34 48 E / 32 10 N
From: primerd!bobsun!bob@EDDIE.MIT.EDU (Bob Pellegrino)


I am currently writing a book on debugging techniques. I would like it
very much if you could share the information you receive on debugging
optimizing compilers. I too can sign a non-disclosure agreement, or
perhaps you can forward the names of people who have sensitive information
and I could ask their permission to see their response.

If you could find the time for this cooperation, I would appreciate it
very much. Thank you.

Bob Pellegrino

or you might try
From: (Rich A. Hammond)
Subject: Testing optimizing compilers

I test by looking at the assembly code output (reverse assemble if necessary)
and deciding that it does the same function as the unoptimized code.
That only works for small benchmarks, or parts of benchmarks.
I generally have lots of information that I can get dumped to stderr
with special flags, I use that information to determine what code to
look at.

My best test is running the compiler through itself and comparing the
output of the new (optimized) version and the base version for differences
given the same input files.

I don't suppose that works for FORTRAN very well.

I'd love to have a program that could compare two assembly language
programs and decide that they functioned identically, but I view that
as harder than writing the optimizer to begin with.

My guess is that many manufacturers build up suites of validation tests
by collecting bug reports from users. I'm sure Convex does that with
our bug reports, since they want working programs that demonstrate the
problem with real input/output. I assume this is so they can compile the
validation program, run it and diff the output with 'correct' output.

You may quote me publicly.
I hope you get some more interesting responses.
Rich Hammond
From: Andrew Lue <ames!nsc.NSC.COM!decwrl!andrew>
Subject: optimizing compiler testing

Hello, Eugene.

I saw your posting on the net.

I am a software support specialist at National Semiconductor. We
market optimizing compilers (currently C and Fortran, soon Modula-2 and Pascal).

Being in support, I get to see the flaws in the products. I am definitely in
terested in learning about ways to test optimizing compilers. The only way
I've check the various optimizations is by manually inspecting the
assembly code. I would appreciate your sending me a summary of your findings.


/andrew lue

>From ames!decwrl!pyramid!oliveb!intelca!mipos3!omepd!max Thu May 28 00:23:03 1987
Subject: Testing Optimizing Compilers

I am currently involved in evaluating a new C compiler. The
general validation of correctness is more or less done, but
we have just barely started evaluating the optimizations.
Much of the testing MUST involve eye balling the assembler -
I would love to hear of any more automatic QA alternatives.

          For some optimizations, you can alter programs in some
way that will be removed by the optimization and then more
or less automatically compare the two assembler ouputs. For
example, you can add a dead-store at the source level, and
check that it vanishes, if your compiler eliminates them.

          So my strategy involves taking a piece of code C, get-
ting the assembler for it (call it S), perturbing it in some
way (C'), getting S', and then diffing S and S'.

          For some optimizations, this strategy can handle con-
cerns 1-3. These are mainly simple, local optimizations
that don't globally effect the code (ie. interact with
register allocation.)

          Part of the problem is the relative lack of tools that
can readily process assembler in a meaningful way. Suppose
you want to verify that pipeline interlock's never cause a
delay in the execution of a new processor? (ie. in a MIPS
style processor) This tool would have to parse assembler,
and keep track of who is touching what part of the register
set, and lengths of times of execution for each form of each

          Does anybody have time to write that sort of thing from
scratch, as part of a testing strategy? I sure don't. I am
convinced that automatically testing some optimizations
requires effort on the level of actually IMPLEMENTING the
optimization, give or take a power of 2.

          Summarize me to your hearts content, if you think i've
said anything useful.
From: Preston Briggs <>


regarding how to test/validate/... optimizing compilers

Hard questions! Generally, I think of validation suites as testing
correct implementation of the language and that the language
accepted is the intended language (e.g. Pascal, not almost-Pascal).
In addition, validation usually tests that illegal constructs
are rejected.

I worked for a time at Intermetrics, maintaining the HAL/S compiler,
[Gee whiz, I know this language^^ enm]
a fairly large optimizing compiler written about 1970--72.
Over the years, a large number of "acceptance" tests had accumulated---
some as part of the initial validation suite (never formally completed),
some as examples of bugs to be fixed, many as examples that the bugs
had been fixed.

All this seems pretty similar to normal program acceptance and maintenance.

Optimizations are sort of a study all to themselves (suprise!).
A lot of work has been done on "effective" and "correct". The algorithms
are susceptable to proof, but the details of implementation are
much harder (why can't I spell?). An interesting difference
between "correct" and "close" comes up when rearranging floating
point computations, and in avoiding overflow in integer computation.

Dead code in the compiler could be found (or suggested) by profiling.

Interaction with nasty side-effects is (I think) not too much of a
problem, given the correctness of the transformations. Sometimes,
there are phase ordering problems where an optimization cannot be done
because it depends on results from a later optimization; this is
normally a design issue. Sometimes, one optimization will undo the
effects of an earlier optimization. An example is spilling during
register allocation (because of insufficient registers) will undo
to some extent the effects of common subexpression elimination
and loop-invarient removal.

I expect that (historically) early bad impressions about the correctness
of optimizing compilers arose for several reasons. Mistakes in the
implementation (just like any other program, except it is a very
complex program) rather than algorithmic design errors, loopholes
and misunderstandings in the language definition (early languages
tended to be defined by their implementation rather than by a spec),
inplementation specific code (i.e. uses non-portable features for
speed or space efficiency, and then get bit by the optimizer).

Even when the language is well defined
there are easy examples where users and the optimizer make different
assumptions. For example, in FORTRAN, the compiler is allowed to
assume that parameters do not introduce aliases, which simplifies
a lot of optimizations. In a simple compiler, the user can introduce
all sorts of aliases, without any problems. When he tries to optimize
his code, suddenly there are errors. FORTRAN also allows the
compiler to rearrange expression evaluation in any fashion,
as long as parentheses are not violated. This can lead to integer
overflow and floating point imprecision if the user doesn't
parenthesize cases where it might be important.

Pascal specifically (I think) allows expressions containing
ANDs and ORs to be evaluated fully, though a simple optimization
will sometimes improve the code. That is, if S is a string,
then the expression "(length(S) > 0) and (S[I] = 'A') could generate
a subscipt error even if the length of S <= 0. It would
be legal to evaluate the right hand side first, second, or
not at all.

Hmmm, that example would be better as "(I <= length(S)) and (S[I] = 'A')".

Anyway, I'm not particularly impressed with what I've written here.
It might be best to see "Compilers: Design, Techniques, and Tools" by
Aho, Sethi, and Ullman (the Dragon Book). I'd enjoy talking
more about the subject, or reading other responses you get.
Feel free to pass this around (you might edit it a little first!)
if you like.

Preston Briggs (maybe)


Well, I thought some more about your questions.
I guessit would be good to know if you are trying to build a compielr
or verify someone else's. An interesting article describing a
research compiler is "An overview of the PL.8 compiler," ACM SIGPLAN
Notices 17:6, 1982, Auslander and Hopkins. This describes work done
at IBM Yorktown. The compiler is organized as many passes, with
each pass being essentially a single optimization. Some passes are
repeated. By doing this, they were able to experimentally determine
the effectiveness of each optimization and different combinations
and orderings of optimizations. Additionally, this makes it much
easier to test the implementation of individual optimizations.

On the other hand, it is a research compiler, and very slow (though
the resulting code is quite good), primarily because of the large number of
passes and the necessarily large amount of I/O. Most commercial
compilers try to do a lot in each pass, for speedier compilation.

The PL.8 article also describes an interesting testing/verification
technique. The compiler is written in PL.8 and compiles itself.
After a change, the source is compiled on the original (say C) giving
a new compiler (C1). The compiler is compiler is compiled again,
but this time using C1 giving C2 (the hopefully final version).
C2 is used to compile the source (yet again) giving C3. C3 should be
bit for bit identical with C2. (I just described that from memory,
but I think it's correct.) They also use a large suite of test programs
with known runtime results, together with some sort of automated,
testing tool (like a fancy shell).

Generally, people don't seem to be doing so much work on plane-jane
(I mean plain-jane) optimization anymore. Current stuff includes
retargetable code generation, interprocedural optimizations,
better compiler generation tools, incremental compilation,
finding parallelism in sequential source, generalization
of well-known optimizations, and so forth.

Preston Briggs
From: Mark Douglas Lerner <>
Subject: Compilers and optimizations.

I'd like to here more about these optimizations, particularly references,
for a paper that I'm writing. Thanks. /Mark

Mark Lerner
Office: 280-8183
Home: 866-0995
From: ames!allegra!pitt!tl-vaxa!wulf (Bill Wulf)
Subject: compiler testing

Since my name was mentioned, I thought some of you might be interested in
my views on testing optimizing compilers. I have no objection to this
being distributed, and I would like to hear other folk's comments.

First, some general remarks on a variety of subjects:

It shouldn't need to be said, but I'll say it anyway --

Testing can increase your confidence in the correctness/effectiveness
of an optimizing compiler (or any other software), but it can't be
a guarantee.

I know that there are lots of buggy optimizing compilers in use -- but
it needn't be so. In my experience, the number of bugs is inversely
correlated with the amount of theroy that exists (and is USED) in any
software area -- that's why we see so few bugs in parsers, for example.
There is LOTS of optimization theory, but for reasons that elude me, it isn't
as commonly used as it should. At Tartan we build optimizing compilers, and
they contain bugs -- but relatively few of them are in the optimizer, and
I think it's because of the amount of supporting theory.

You didn't say it, but there are lots of SLOW optimizing compilers too.
Much the same comments apply -- they needn't be! There are some really
good algorithms around today, if only people would engineer their
implementations well. Take our C compiler for the RT PC: it's not quite
twice the speed of the PCC-based compiler that IBM delivers when it's
optimizations are turned off, and just a tad slower than the PCC compiler
when all optimizations are turned on -- and we do virtually all the
optimizations "known to man".

Standard language suites are almost useless for testing an optimizer.
They tend to test language features, often in relative isolation. Such
tests seldom tweak the convoluted combinations of, often simple, constructs
that give optimizers fits.

This could sound Dijksta-esque, but I don't mean it to sound that way.
Nonetheless, optimization isn't an add-on feature. It's got to be designed
in. Some optimizing compilers have "grown" out of earlier non-optimizing
ones, or have been built by grafting existing front-ends onto an optimizer;
either way, they suffer in the process. A lot!

Finally, a soap-box that I've been on for a long time:

There is no such thing as a "machine-independent" optimization!

Not one! People who use the phrase don't understand the problem! There are
lots of semantics-preserving transformations that improve code-size or speed
for some machines under some circumstances. But those same transformations may
be pessimizations for another machine! The relevance of this comment in the
current context is that, in testing for effectiveness, the test cases must be

Second, on testing for "correctness":

Forget the language-based test suites. At Tartan we have developed our own
test suite -- or, really a test suite generator.

The generated suite is "structure oriented" in the sense that the generator
knows a great deal about how our optimizer is structured, and uses this
knowledge to generate tests that drive the compiler through its various paths.
The suite is also "graded" in the sense that it starts by exhaustively testing
simple things in isolation; then, later, more complex tests pseudo-randomly
sample the use of the the (known to be working) simple constructs in ever
increasing complex combination. The "graded" approach is used to limit the
exponential explosion that would otherwise result. Since the generator knows
the structure of what it's testing, it's also known that the sampling approach
is "safe". Finally, because of the generator's knowledge of the structure of
the compiler, the tests (all of which are self-checking), can generally point
to the precise area of the compiler that's failing.

The "generator" approach was taken for several reasons. First, the sheer
volume of tests required would be prohibitive to write manually (currently we
have ~200,000 tests in the suite, as opposed to 2500 in the Ada ACVC's for
example). Second, one has more confidence that all the combinations have in
fact been generated, and without duplication. Third, it's much easier to
add/change knowledge in the generator as the compiler changes than it would
be to modify manually written tests.

It would be pretentious (as well as wrong) to call the generator an "expert
system", but it does have some things in common with the philosophy of their
construction. That is, the generator contains a great deal of knowledge
about the way our optimizers are built -- and an effort was made to
separate that knowledge from the test-generation mechanism so that it's
easy(er) to change as the compilers change. There's good news and bad
news here: the good news is that we have gained a lot of confidence
that our optimizers are correct, the bad news is that it would be a far
less effective set of tests for someone else's compiler.

I don't have quatative measures of the benefits we've realized from the
generator, unfortunately -- but the impressions internally are that it
has been a BIG win.

One last comment. We were surprised to find that the generated tests
found a number of problems in the front end of our Ada compiler -- even
though it's validated, and even though the tests don't really push
on any of the more exotic aspects of the language. I think there's
a lesson there. Isolated feature tests (which is what the ACVC's are,
mostly) are nice for checking that a compiler implements the full
language and nothing else (which is what the ACVC's are supposed to
do). However, the semantically subtle aspects of any language, and
especially big/complex ones, arise from the interaction of features.
If you want to check the robustness/correctness of a compiler, that's
what you have to push on. If someone asked me to produce an ACCC
("Ada Compiler Correctness Capability"), I'd take a similar
generator-based approach, and pound hard on the cross-product of

Third, on testing for "effectiveness":

This is a real toughie, especially given my soap-box position. I don't
have a general solution, indeed not even a good special-case one.

Of course, we have a set of benchmark programs, and we compare their
running time to that of the same code compiled with other folks
compilers. But that only tells you whether you're better of worse;
it doesn't tell you how much better you might be, if only ...

The best idea we have had is for a sort of regression testing.

The debugging version of (some of) our compilers contain a LARGE number of
counters -- one associated with each (variant of) transformation that
can be performed. We also have a collection of benchmark programs that
exercise most of the optimization situations. The value of the
set of counters after compiling one of these benchmarks forms a sort
of "signature" for that compilation.

So, IF you hand-inspect a compilation of one of the benchmarks, and
IF you're happy with the quality of the code, then a comparison of
that signature with that on a subsequent compilation of the same
benchmark will at least tell you whether something has changed, and,
if so, what. The only advantages over comparing output is that it's
insensitive to minor changes in things like register allocation, and
that it points a bit more closely to the internal compiler difference
(with a good optimizer, it's not always obvious from the output
what caused a specific effect).

This scheme is not wonderful. Not even close. The root of the problem is that
we don't even have a definition of "optimal" or any way to measure how close we
are to it. I would like to hear what other people have to say on the subject.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.