Re: compilers, in a nutshell

chase@Think.COM (David Chase)
Mon, 9 May 1994 23:05:32 GMT

          From comp.compilers

Related articles
compilers, in a nutshell ellard@endor.harvard.edu (1994-05-09)
Re: compilers, in a nutshell chase@Think.COM (1994-05-09)
Re: compilers, in a nutshell munk@prl.philips.nl (1994-05-10)
Switch code generation, correction. chase@Think.COM (1994-05-10)
Re: compilers, in a nutshell bill@amber.ssd.csd.harris.com (1994-05-11)
Re: compilers, in a nutshell newburn@aslan.ece.cmu.edu (1994-05-12)
| List of all articles for this month |

Newsgroups: comp.compilers
From: chase@Think.COM (David Chase)
Keywords: courses
Organization: Thinking Machines Corporation, Cambridge MA, USA
References: 94-05-018
Date: Mon, 9 May 1994 23:05:32 GMT

Only half a lecture on optimization, eh? (just kidding.)


I think what might be missing from your list (maybe needed as a
prerequisite, or perhaps implicit in the topic "what is a compiler") is
some notion of "equivalent program" and what things are irrelevant to
equivalence. Note that there's at least two common notions of equivalence
-- "I/O equivalence" (any two programs with same inputs and outputs and
termination are equivalent) and "debugger-observable equivalence" (any two
programs which perform the same set of steps are equivalent).


It should be clear by now (well, given acceptably good compilers) that
there's no difference between "a++;" or "a+=1;" or "a=a+1;". It should be
clear by now that "register" is a noise word in C. It should be clear
that a compiler ought to be free to move things in and out of memory, and
to reuse memory, and to heap-allocate stack frames if that suits the
implementor's purposes. It should be clear that a compiler can convert a
switch statement into a linear sequence of tests, or a binary search tree,
or a jump table, or a hashed lookup table, depending upon the speed of the
various operations in the underlying architecture.


The reason I suggest this is that people often understand their programs
(or worse, their programming language) at "too low a level", which tends
to limit their idea of what it is a compiler can do. At times, I've seen
people get into a tail-wagging-dog-mode where "the hardware does things
thusly (e.g., division) and thus the language must surely do what the
hardware does." At times this reaches the level of language standards --
signed integer division depends on an implementation-defined choice "lest
existing fast code be gravely slowed" (Rationale for the Ansi C standard
-- and just how fast was integer division running in 1989, anyway?)


=======================================================


By-the-way, the switch example makes a decent exercise on picky code
generation for superscalar machines (recall recent discussions of the
goodness or badness of "arrays of labels").


For SuperSparc, for instance, you get:


linear tests, test value in %o0, assume small integer cases:


    cmp %o0,K1
    be case1 ! cycle 1
    cmp %o0,K2
    be case2 ! cycle 2
    ...
    cmp %o0,KN
    be,a caseN ! cycle N
    delay inst
    <default>


This costs one cycle per branch, plus one for the (useless) delay slot
instruction when you finally branch. If you assume a uniform distribution
of cases, and none to the default, then N cases take (N+3)/2 cycles to get
you into the case code.


indexed table:


    sub %o0,K0,%o0 ! Normalize to zero
    sll %o0,2,%g1 ! (cyc 1)
    cmp %o0,KN
    sethi %hi(table),%g2
    bgu default ! Use unsigned comparison hack (cyc 2)
    add %g2,%lo(table),%g2 ! (cyc 3)
    ld [%g1+%g2],%g1 ! (cyc 4)
    ! pipeline bubble, I think (cyc 5)
    jmp %g1 ! (cyc 6)
    nop ! Useless delay slot (cyc 7)


So, this runs faster (assuming uniform distribution, and no branches to
the default) in 7 cycles. For the linear case, you could do 10 tests in 7
cycles. On the other hand, this is much faster going into the default
case. (These numbers get worse for position- indendent code.) (You can
burn table space and avoid the normalize to zero if it thrills you, but it
only buys you about 1/2 cycle.)


binary search (assume cases 0-13)


cmp %o0,8
bge GE8 (cyc 1)
<linear tests for 0-7 (8 cases)>


GE8:
<linear tests for 8-13 (6 cases)>


In this case, the costs are


    [8*(1 + 11/2) + 6*(2 + 9/2)] / 13


or (4*13+3*13)/13 = 7 cycles.
In general, that is


      [M*(1+T(M)) + (N-M)*(2+T(N-M))] / N, where


T(X) is the time required to do an X-case switch for whatever method suits
you. The "1+" is the fall-through portion of the two-way split, the "2+"
is the branch taken portion of the two-way split.


So, given uniform distribution, no defaults taken (but still compiled for,
since that is what the language requires), we find that linear search
takes 7 cycles at N=10, jump tables always take 7 cycles, and this simple
two-way split takes 7 cycles at N=13.


Given this model, the even break between linear and binary search comes at
N=6, assuming that you split it into either 4 and 2 or 3 and 3


  T(4) = 7/2, T(3) = 3, T(2) = 5/2, T(6) = 9/2


  (2*9+9)/6 = 27/6 = 9/2
  [3*(1+3)+3*(2+3)]/6 = 27/6 = 9/2.


On the other hand, if you use a deeper binary tree, you can get an
improved result for larger numbers. For example, if you take N=19, and
first split it into 12 and 7, and then split 7 into 4 and 3, and 12 into 7
and 5, you get:


T(7) = [4+4T(4)+6+3T(3)]/7 = (4+14+6+9)/7 = 33/7
T(5) = (5+3)/2 = 4
T(12) = (7+7T(7)+10+5T(5))/12 = (7+33+10+20)/12 = 70/12
T(19) = [12+12T(12) + 14+7T(7)]/19 =
      (12+70+14+33)/19 = 6.79,


which is slightly better than the table lookup. Thus, when compiling
switches for the SS10, if you believe that the default is rarely taken
(i.e., if you are nuts) from 1 to 5 cases are treated with linear tests,
from 6 to 19 are treated with binary search, and 20 and more use table
lookup (assuming that they are dense).


NOW, if you assume a certain percentage of default cases, that changes
everything. OR, if you have any profiling feedback at all, that changes
everything.


Treating the hashed case is more entertaining yet, since the time spent
hashing is in competition with the binary search case (which, if you
haven't noticed by now, is more or less a fibonacci tree). The closed
form for the time spent executing one of these cases is a little messy,
both because they are fibonacci trees, not binary trees, and because they
have these little linear tails.


This, by the way, is code generation at an extraordinarily picky low
level. There's higher-level stuff that you could do, too. Constant
propagation is a good introduction. I'm a big fan of the SSA framework,
myself.


David Chase, speaking for myself
Thinking Machines Corp.
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.