optimizing for caches

Richard Cownie <richard@meiko.com>
Tue, 1 Dec 1992 01:02:32 GMT

          From comp.compilers


Newsgroups: comp.compilers
From: Richard Cownie <richard@meiko.com>
Organization: Compilers Central
Date: Tue, 1 Dec 1992 01:02:32 GMT
Keywords: hardware, performance, optimize, summary, bibliography

Thanks to all who replied to my request for references.

As promised, here is a summary of the bibliography.

Initial thoughts from scanning these titles and other messages are:

1) Instruction caching is well under control.

2) Tricks for speeding up dense matrix manipulations are known
      and implemented in production compilers.

3) There are lots of hardware tricks for improving DRAM *bandwidth*.
      Improving the cache miss latency (related to DRAM access time)
      is harder.
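The blocking (tiling) trick behind point 2 can be sketched in C. This is a minimal illustration, not code from any of the papers below; the matrix size and tile width are invented, and in practice the tile size is chosen so the working set of a tile triple fits in the data cache:

```c
#include <string.h>

#define N 64
#define B 16  /* tile width; invented here, normally tuned to cache size */

/* Naive i-j-k multiply: the inner loop walks b[][] down a column with
 * stride N, so large matrices thrash the data cache. */
static void matmul_naive(const double a[N][N], const double b[N][N],
                         double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
}

/* Blocked version: iterate over B x B tiles so each tile of b is reused
 * B times while it is still cache-resident. */
static void matmul_blocked(const double a[N][N], const double b[N][N],
                           double c[N][N]) {
    memset(c, 0, N * N * sizeof(double));
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++) {
                        double aik = a[i][k];
                        for (int j = jj; j < jj + B; j++)
                            c[i][j] += aik * b[k][j];
                    }
}
```

Both loop nests compute the same product; only the order of memory accesses changes, which is exactly why the transformation is legal for the compiler to apply automatically.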

Thanks again,

      Richard Cownie (richard@meiko.com)


Reply-To: Urs Hoelzle <hoelzle@CS.Stanford.EDU>

>So the relative cost of a cache miss has already risen from about 1.4
>instructions to > 5 instructions, and the Viking clock speed is still only
>40MHz; the technology exists now to build processors running at 150MHz
>(e.g. Alpha), which will take the cost of a cache miss over 20

Actually a cache miss on the SS-2 is around 24-26 cycles, not 5. (It
refills a 4-word line, though, not just one word.)

Some references:

%T Achieving high instruction cache performance with an optimizing compiler
%A W. Hwu
%A Pohua P. Chang
%J 16th Annual Symp. on Computer Architecture, Jerusalem
%D May 28-June 1, 1989
%P 242-251
%X Inlining & cache behavior of C programs. Code size +0-30%, call frequency
-0-99% (typically 50%). Uses profiling and trace selection, puts error
branches out of line. Paper gives the trace selection algorithm. See also
the PLDI '89 paper.
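The "error branches out of line" idea in the entry above can be shown with a hand-done C sketch. The paper does this placement automatically from profile data; the function names and the overflow check here are invented for illustration:

```c
#include <stdio.h>
#include <stdlib.h>

/* Cold path moved out of line: the error handler lives in its own
 * function, so its instructions do not occupy cache lines in the middle
 * of the hot loop's instruction stream. */
static void report_overflow(int i, int v) {
    fprintf(stderr, "element %d overflows: %d\n", i, v);
    exit(1);
}

/* Hot path: the common-case trace is a dense straight-line run; only a
 * compare-and-branch to the out-of-line handler remains in it. */
int sum_checked(const int *a, int n, int limit) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] > limit)          /* rarely taken */
            report_overflow(i, a[i]);
        s += a[i];
    }
    return s;
}
```

The point is instruction cache density: the rarely executed code no longer dilutes the lines fetched on the common path.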

%T Code Reorganization for Instruction Caches
%A A. Dain Samples
%A Paul N. Hilfinger
%D Oct 1988
%X A simple algorithm cuts misses by 30-50% while relocating only 3-8% of
the basic blocks.

%T The impact of code density on instruction cache behavior
%A Peter Steenkiste
%J 16th Annual Symp. on Computer Architecture, Jerusalem
%D May 28-June 1, 1989
%P 252-260
%X "modest increases in code size (35%) have little performance impact in
well-balanced systems"

%T Program analysis and optimization for machines with instruction cache
%A Scott McFarling
%J Stanford CSL-TR-91-493 (thesis)
%D 1991
%X procedure merging/inlining, MIPS-X

and also interesting are

%T Improving direct-mapped cache performance by the addition of a small
fully-associative cache and prefetch buffers
%A Norman P. Jouppi
%J Proc 17th Symp Comp Arch, Seattle
%V 18
%N 2
%D June 1990
%P 364-373
%X small "victim caches" (1-5 entries) give significant speedup

%T Cache behavior of combinator graph reduction
%A Philip Koopman
%A Peter Lee
%A Daniel Siewiorek
%V 14
%N 2
%D April 1992


[Sorry, I lost the header on this one - thanks anyway]

Luddy Harrison at U of I, CSRD, has a compiler that tries to optimize
given a memory hierarchy (as exists in their Cedar machine). A reference:

    Harrison, et al., in 5th workshop on compilers and languages for
        parallel computing, editors Gelernter, Nicolau, and Padua.


Reply-To: preston@miranda.cs.rice.edu (Preston Briggs)

I agree that the problem is terribly important. On the other hand,
Portland Group's compiler for the i860 certainly attacks the problem, as
does the Kuck preprocessor that is so maligned in discussions of the SPEC
benchmarks. Additionally, there are several research compilers attacking
these problems (at Stanford, Rice, Illinois, and other places).

The Cache Performance and Optimizations of Blocked Algorithms
Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf
ASPLOS IV, pages 63-74


Reply-To: Paul H J Kelly <phjk@doc.imperial.ac.uk>

This is indeed a big deal and the subject of a great deal of research - and
a fair number of production compilers can do something about it.
It is one of the major reasons, apparently, why SPECmark89 figures
have become distorted: the compiler writers started applying blocking
algorithms to matrix300.

The real bibliography is enormous; a nice place to start is the following
very entertaining paper:

@STRING{asplos4="4th International Conference on Architectural Support for
Programming Languages and Operating Systems"}

author = "M.S. Lam and E.E. Rothberg and M.E. Wolf",
title = "The cache performance and optimizations of
block algorithms",
booktitle = asplos4,
pages = "63--74",
address = "Santa Clara, CA",
month = April,
year = 1991}

This paper gives some of the key references, and the authors are worth
looking up for more recent work. The technology relies on dependence
analysis covered nicely in the following textbook:

author = "Hans Zima and Barbara Chapman",
title = "Supercompilers for Parallel and Vector Computers",
publisher = "ACM Press",
year = 1990


Reply-To: moss@cs.cmu.edu (Eliot Moss)

Here are some of the relevant references (from a seminar I gave last spring
titled "Compiling for Modern Architectures"):

@string{acm = "ACM"}
@string{ibmjrd = "IBM Journal of Research and Development"}
@string{ieeetoc = "IEEE Transactions on Computers"}
@string{toplas = "ACM Transactions on Programming Languages and Systems"}

    author = "David Callahan and Ken Kennedy and Allan Porterfield",
    title = "Software Prefetching",
    crossref = "asplos4",
    pages = "40--52"

    author = "Scott McFarling",
    title = "Program Optimization for Instruction Caches",
    crossref = "asplos3",
    pages = "183--191"

    author = "Scott McFarling",
    title = "Procedure Merging with Instruction Caches",
    crossref = "sigplan91",
    pages = "71--79"

    author = "Susan Owicki and Anant Agarwal",
    title = "Evaluating the Performance of Software Cache Coherence",
    crossref = "asplos3",
    pages = "230--242"

    author = "Michael E. Wolf and Monica S. Lam",
    title = "A Data Locality Optimizing Algorithm",
    crossref = "sigplan91",
    pages = "30--44"


    title = "Third International Conference on Architectural Support for
                                    Programming Languages and Operating Systems",
    booktitle = "Third International Conference on Architectural Support for
                                    Programming Languages and Operating Systems",
    year = 1989,
    key = "ASPLOS",
    publisher = acm,
    address = "Boston, MA",
    month = apr

    title = "Fourth International Conference on Architectural Support for
                                    Programming Languages and Operating Systems",
    booktitle = "Fourth International Conference on Architectural Support for
                                    Programming Languages and Operating Systems",
    year = 1991,
    key = "ASPLOS",
    publisher = acm,
    address = "Santa Clara, CA",
    month = apr

    title = "Proceedings of the {SIGPLAN} '91 Conference on Programming
                                    Language Design and Implementation",
    booktitle = "Proceedings of the {SIGPLAN} '91 Conference on Programming
                                    Language Design and Implementation",
    year = 1991,
    key = "SIGPLAN",
    publisher = acm,
    address = "Toronto",
    month = jun
