Is register stack compilers' friend?

Dave Lloyd <dave@occl-cam.demon.co.uk>
Mon, 6 Nov 1995 18:18:57 GMT

          From comp.compilers

Related articles
Is register stack compilers' friend? jaidi@technet.sg (1995-10-31)
Re: Is register stack compilers' friend? hbaker@netcom.com (1995-11-04)
Re: Is register stack compilers' friend? cliffc@ami.sps.mot.com (1995-11-05)
Re: Is register stack compilers' friend? stevec@pact.srf.ac.uk (1995-11-06)
Is register stack compilers' friend? dave@occl-cam.demon.co.uk (Dave Lloyd) (1995-11-06)
Re: Is register stack compilers' friend? paulb@pablo.taligent.com (1995-11-09)
Re: Is register stack compilers' friend? jeremy@suede.sw.oz.au (1995-11-13)

Newsgroups: comp.compilers
From: Dave Lloyd <dave@occl-cam.demon.co.uk>
Keywords: architecture, design
Organization: Compilers Central
References: 95-11-026
Date: Mon, 6 Nov 1995 18:18:57 GMT

The Transputer has a three-deep register stack labelled A, B, C, so
monadics pop A and push the result, and dyadics pop A and B and push
the result. There are also some triadics (block move). I recall David
May defended the decision for 3 registers as being sufficient for most
typical expressions (an arbitrary sequence of monadics or operations
with the same precedence, but only one outstanding value for a lower
precedence operation). If you don't try to preserve state across
'statements' (which was part of the design of transputers, to allow
frequent and fast descheduling of processes), I would agree that it is
quite adequate in practice. It was also expected that the workspace
(external stack) would be placed in fast on-chip static RAM, so the
register stack can be treated as a cache of the real stack - in this
way transputers differ from traditional stack machines, which had no
registers (or maybe an accumulator). Unfortunately the on-chip RAM
is not big enough to rely on as the program execution stack and
has to be shared between processes.
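To make the three-deep limit concrete, here is a minimal model in
Python of evaluating an expression on a 3-deep stack. It is purely
illustrative - the tuple expression format, the `evaluate` function,
and the spill bookkeeping are my own sketch, not Transputer code -
but it shows that an ordinary two-level expression peaks at exactly
three live values:

```python
# Minimal model of a 3-deep evaluation stack (A, B, C).  Expressions
# are nested tuples ("op", left, right) or leaf values.  We walk them
# in simple depth order and record any value that would have to spill
# to the workspace when more than three are live at once.

MAX_DEPTH = 3

def evaluate(expr, stack, spills):
    if not isinstance(expr, tuple):          # leaf: a load pushes one value
        if len(stack) == MAX_DEPTH:
            spills.append(stack.pop(0))      # bottom value goes to workspace
        stack.append(expr)
        return
    op, left, right = expr
    evaluate(left, stack, spills)
    evaluate(right, stack, spills)
    b, a = stack.pop(), stack.pop()          # dyadic pops A and B...
    stack.append({"add": a + b, "mul": a * b}[op])   # ...and pushes the result

# (a + b) * (c + d): peak depth is exactly 3, so nothing spills.
stack, spills = [], []
evaluate(("mul", ("add", 1, 2), ("add", 3, 4)), stack, spills)
print(stack, len(spills))   # [21] 0
```

A deeper expression tree would overflow the model's three slots and
start pushing values into the `spills` list, which is exactly when a
real code generator has to stage intermediates through the workspace.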


There is not much that you can do to improve upon simple depth-ordered
traversal for basic expressions, so the Transputer code generators
for our Algol68 and Fortran90 compilers were a quick port that still
produces good code. I like Transputers for much this reason. There
are a few extra tricks appropriate to the architecture, and most of
these are spelt out in Inmos' book "Transputer Instruction Set: A
Compiler Writer's Guide" (ISBN 0-13-929100-8), which is worth a read
if you're interested in more details.
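Depth-ordered traversal here is essentially Sethi-Ullman numbering:
label each node with the number of stack slots its subtree needs, and
emit the needier subtree first. A hedged sketch in Python (the
`need`/`emit` names are mine, and the `ldl` mnemonic is used loosely
for "load a leaf" - this is not taken from any real back end):

```python
# Sethi-Ullman-style depth-ordered code emission for a stack machine.
# need() labels each node with the stack slots its subtree requires;
# emit() evaluates the needier subtree first so the peak depth is
# minimised.  Swapping operands is only safe for commutative ops -
# a real back end would use a reversed-operand instruction otherwise.

def need(expr):
    if not isinstance(expr, tuple):
        return 1                       # a leaf needs one slot to load into
    _, l, r = expr
    nl, nr = need(l), need(r)
    return max(nl, nr) if nl != nr else nl + 1

def emit(expr, code):
    if not isinstance(expr, tuple):
        code.append(f"ldl {expr}")     # load leaf (mnemonic is illustrative)
        return
    op, l, r = expr
    if need(r) > need(l):              # needier subtree first (commutative ops)
        l, r = r, l
    emit(l, code)
    emit(r, code)
    code.append(op)

code = []
emit(("mul", ("add", "a", "b"), ("add", "c", "d")), code)
print(code)   # ['ldl a', 'ldl b', 'add', 'ldl c', 'ldl d', 'add', 'mul']
```

For this example `need` reports 3 at the root, which is precisely why
a 3-deep stack handles typical expressions without touching workspace.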


However, since the ALU and FPU both have a 3-deep stack, you can work
to schedule integer operations against floats (and with a T800 taking
11 cycles for a single-precision multiply, this is very worthwhile).


More interesting is the T9000, the new-generation processor, which has
a novel form of pipelining. Since it uses the same instruction set as
the T800, a "group scheduler" works in hardware to group instructions
which can be operated on at once. The T9000 is notionally capable of
doing two address calculations, two loads, an operation, and a store
or load every cycle. You can now pretend that the T9000 is a
three-address memory machine, and to further the illusion the T9000
has an extra-fast cache for the top 32 words of workspace (your
regular register file...). This raises the possibility of compiler
optimisations that arrange the code to group well, similar to the
scheduling optimisations required for other RISCs.


When T800s appeared in the mid-80s, they were fast compared to most
of the competition and their parallel architecture scaled easily (it
is still not uncommon to find a 256-T800 processor array in
university departments). I suspect the simple design of the ALU had a
lot to do with that. However, the game has changed in the last ten
years and memory is now very slow - even fast cache is slow! The
large register files of modern RISCs allow much scheduling of
independent operations, but this work in general falls on the
compiler writer and is often not achieved with day-to-day software.
The T9000 (when there is silicon which works at full speed) will be
comparable to a mid-performance SPARC, a reflection of a design which
is now several years late. They are still scalable, of course.


The other example of a register stack that I have to support is the
Intel x87, and it is interesting to note that the Pentium also has a
hardware optimisation (FXCH is free in some circumstances) to let you
pretend that the register stack is actually a two-address register
file. The x87 register stack is more awkward than the Transputer's,
as it is 8 deep, which really begs to be filled with longer-term
values.
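As a sketch of the FXCH trick - this is my own illustrative
pseudo-emitter, not Intel's recommendation or any real compiler's
code - the idea is that to apply "r[i] += r[j]" with virtual register
r[k] held in st(k), you exchange slot i to the top, operate, and
exchange back; on the Pentium the FXCH pairs with the arithmetic
instruction, so the shuffle costs no extra cycles:

```python
# Pretend the 8-deep x87 stack is a register file: emit code for
# r[i] += r[j], where virtual register r[k] lives in st(k).  FXCH
# brings r[i] to the top (swapping r[0] into slot i), FADD does the
# work, and a second FXCH restores the original stack layout.

def emit_add(i, j):
    """Emit x87 instructions for r[i] += r[j]."""
    code = []
    if i != 0:
        code.append(f"fxch st({i})")        # bring r[i] to the top of stack
        # the exchange moved r[0] into slot i and r[i] into slot 0,
        # so renumber j if it referred to either of those registers
        if j == 0:
            j = i
        elif j == i:
            j = 0
    code.append(f"fadd st(0), st({j})")     # r[i] += r[j]
    if i != 0:
        code.append(f"fxch st({i})")        # restore the original layout
    return code

print(emit_add(3, 1))
# ['fxch st(3)', 'fadd st(0), st(1)', 'fxch st(3)']
```

When the destination is already st(0) the two exchanges vanish, which
is why compilers that keep hot values near the top of the stack get
the register-file illusion almost for free.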


So, in general, register stack machines are great for compilers, as
the obvious code is also the optimal code. This allows the compiler
writer to spend more effort on good translation strategies from the
high-level language into basic operations and leave the silicon
designer to worry about getting the most done in the cycles. However,
it seems that current conditions favour three-address register-file
RISCs, which do better *when* the compiler has ordered the code for
the processor's best benefit.


----------------------------------------------------------------------
Dave Lloyd Email: Dave@occl-cam.demon.co.uk
Oxford and Cambridge Compilers Ltd Phone: (44) 1223 572074
55 Brampton Rd, Cambridge CB1 3HJ, UK
[Three registers for temporaries is about right -- back in 1963 the paper
on the then-new 360 series in the IBM Systems Journal said that four
registers were plenty for expression temporaries, and the rest were for
addressing and for fast storage of variables. -John]

