Re: Unrolling and spilling

Carlie Coats <carlie@jyarborough.com>
Sat, 23 May 2009 11:52:46 -0400

          From comp.compilers

Related articles
Unrolling and spilling linuxkaffee_@_gmx.net (Stephan Ceram) (2009-01-28)
Re: Unrolling and spilling SidTouati@inria.fr (Sid Touati) (2009-01-29)
Re: Unrolling and spilling harold.aptroot@gmail.com (Harold Aptroot) (2009-01-29)
Re: Unrolling and spilling cr88192@hotmail.com (cr88192) (2009-01-30)
Re: Unrolling and spilling carlie@jyarborough.com (Carlie Coats) (2009-05-23)
| List of all articles for this month |

From: Carlie Coats <carlie@jyarborough.com>
Newsgroups: comp.compilers
Date: Sat, 23 May 2009 11:52:46 -0400
Organization: Compilers Central
References: 09-01-058 09-01-061
Keywords: optimize
Posted-Date: 24 May 2009 19:41:15 EDT

Harold Aptroot wrote:
[snip...]
> Instruction scheduling is a very important optimization for pipelined
> processors, they really should have mentioned how it affects
> spilling/unrolling
> Software pipelining is a good example that combines scheduling with
> unrolling..


> Anyway, when an array is accessed in the loop body, and it reads
> back from the same array a few iterations after it had written that
> entry, unrolling could cause the compiler to keep the value in a
> register (staying live across the boundaries of the copies of the
> loop bodies), adding to the pressure. I don't expect that to be
> overly common...


Maybe you don't do numerical analysis codes...


A simple example is a numerical second-derivative code; let's chase
through that:


For pseudocode, you have:


          input array Y[0:N+1]. For realism, assume N large.
          spacing DX, scratch variable DXSQ=DX*DX
          output array D2Y[1:N]
          Loop: I=1,...,N
                  D2Y[I] = (Y[I+1]-2.0*Y[I]+Y[I-1])/DXSQ


With clever set-up, the main body has one array load, four
sequentially-dependent FP-arithmetic ops, and one array store, using 5
FP registers.


And long pipeline depth will kill your performance.


Unrolled by 4:
          Loop:
                  D2Y[I ] = (Y[I+1]-2.0*Y[I ]+Y[I-1])/DXSQ
                  D2Y[I+1] = (Y[I+2]-2.0*Y[I+1]+Y[I ])/DXSQ
                  D2Y[I+2] = (Y[I+3]-2.0*Y[I+2]+Y[I+1])/DXSQ
                  D2Y[I+3] = (Y[I+4]-2.0*Y[I+3]+Y[I+2])/DXSQ


The main body now has one array load, four independent *sets* of four
sequentially-dependent FP-arithmetic operations, and four
array-stores, using eleven FP registers. You can "cover" a lot of
that pipeline depth, and get almost four times the performance.


For big problems on K-long pipelines and lots of registers, unroll by
K-1, using 2K-1 FP registers... Oops (Prescott!) maybe K=30 and you
have nowhere near 59 registers ;-(


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.