Re: Stack based-Register Based

Andrew Berg <andrewb@votehere.net>
28 Jan 2001 02:09:52 -0500

          From comp.compilers

Related articles
Stack based-Register Based vinoth@india.ti.com (Vinoth Kumar) (2001-01-19)
Re: Stack based-Register Based anton@mips.complang.tuwien.ac.at (2001-01-20)
Re: Stack based-Register Based vbdis@aol.com (2001-01-20)
Re: Stack based-Register Based roedy@mindprod.com (Roedy Green) (2001-01-26)
Re: Stack based-Register Based peter_flass@yahoo.com (2001-01-26)
Re: Stack based-Register Based andrewb@votehere.net (Andrew Berg) (2001-01-28)
Re: Stack based-Register Based rog@vitanuova.com (2001-02-01)
Re: Stack based-Register Based ucapjab@ucl.ac.uk (Jonathan Barker) (2001-02-01)
Re: Stack based-Register Based rpw3@rigden.engr.sgi.com (2001-02-04)
Re: Stack based-Register Based Martin.Ward@durham.ac.uk (2001-02-04)
Re: Stack based-Register Based anton@mips.complang.tuwien.ac.at (2001-02-04)
Re: Stack based-Register Based anton@mips.complang.tuwien.ac.at (2001-02-04)
[5 later articles]

From: Andrew Berg <andrewb@votehere.net>
Newsgroups: comp.compilers
Date: 28 Jan 2001 02:09:52 -0500
Organization: Compilers Central
References: 01-01-124 01-01-137
Keywords: architecture
Posted-Date: 28 Jan 2001 02:09:52 EST



VBDis wrote:
>
> Vinoth Kumar <vinoth@india.ti.com> schreibt:
>
> > What is the motivation behind designing Java as a stack based,
> > when lots of processors around us are register based.
>
> A stack is virtually unlimited, while you always have too few
> registers, on any machine.


The TAO VM used in the new AmigaOS is register based, but they claim
that you can define an unlimited number of registers. Whether it
really is unlimited, or the limit is just really big, I don't know.
What kind of expression would make 2^32 registers be "too few"?
(That's an expression I would hate to meet in a dark alley.)


Admittedly, the TAO stuff was designed to be compiled to native code as
it was loaded rather than ever being interpreted. However, many Java
VMs have JITs which attempt to do a very similar thing. Having
register usage not obscured by DUPs, ROTs, PICKs, and DROPs can only
help the JIT compiler.
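
To make this concrete, here is a toy example (hand-written
pseudo-bytecode as plain Java, not any actual VM's output): the same
expression, (a*b) + (a*c), evaluated stack-style and register-style.
In the stack version a DUP and a SWAP are the only clues that the two
multiplies share an operand; in the register version the reuse of r0
is spelled out.

    import java.util.ArrayDeque;

    public class StackVsRegisters {
        // Hypothetical expression: (a * b) + (a * c)

        // Stack-machine flavour: the DUP hides *which* value is being
        // reused from anything but a careful simulator.
        static int stackStyle(int a, int b, int c) {
            ArrayDeque<Integer> stack = new ArrayDeque<>();
            stack.push(a);                         // PUSH a
            stack.push(stack.peek());              // DUP
            stack.push(b);                         // PUSH b
            stack.push(stack.pop() * stack.pop()); // MUL   -> a*b
            int prod = stack.pop();                // SWAP: bring the spare a back up
            int aCopy = stack.pop();
            stack.push(prod);
            stack.push(aCopy);
            stack.push(c);                         // PUSH c
            stack.push(stack.pop() * stack.pop()); // MUL   -> a*c
            stack.push(stack.pop() + stack.pop()); // ADD
            return stack.pop();
        }

        // Register-machine flavour: three-address code, every operand named.
        static int registerStyle(int a, int b, int c) {
            int r0 = a, r1 = b, r2 = c;   // LOAD r0,a ; LOAD r1,b ; LOAD r2,c
            int r3 = r0 * r1;             // MUL r3, r0, r1
            int r4 = r0 * r2;             // MUL r4, r0, r2  (reuse of r0 is obvious)
            return r3 + r4;               // ADD r5, r3, r4
        }

        public static void main(String[] args) {
            System.out.println(stackStyle(2, 3, 4));    // 14
            System.out.println(registerStyle(2, 3, 4)); // 14
        }
    }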


> For Java, the virtual machine will always run on a physical
> machine. The virtual machine then can optimize the use of the
> existing registers, without any modifications to the Java byte
> code. Performance would be bad, if the VM assumed more registers,
> than are available on a specific physical CPU.


Since the stack is "virtually unlimited" it is quite likely to be
larger than the number of physical registers. The use of a register
file in memory should not be appreciably slower than the use of a
stack in memory. Most architectures that I have experience with have
some support for fast address generation of array lookups when the
entries are word-aligned, making the register file operations
similarly fast.
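
To make the "register file in memory" point concrete, here is a toy
sketch (a hypothetical encoding, not any real VM's format): a dispatch
loop that executes three-operand instructions against an int[] register
file. Each operand access is a base-plus-scaled-index load, which is
exactly the word-aligned array access most architectures handle cheaply.

    public class RegisterFileInterpreter {
        // Hypothetical instruction encoding packed into an int:
        // bits 0-7 = opcode, 8-15 = dest reg, 16-23 = src1, 24-31 = src2.
        static final int HALT = 0, ADD = 1, MUL = 2;

        static int encode(int op, int rd, int rs1, int rs2) {
            return op | (rd << 8) | (rs1 << 16) | (rs2 << 24);
        }

        // The "register file" is just a word-aligned array in memory;
        // regs[rs1] compiles down to a base + 4*index load.
        static void run(int[] code, int[] regs) {
            int pc = 0;
            while (true) {
                int insn = code[pc++];
                int op  = insn & 0xFF;
                int rd  = (insn >>> 8)  & 0xFF;
                int rs1 = (insn >>> 16) & 0xFF;
                int rs2 = (insn >>> 24) & 0xFF;
                switch (op) {
                    case ADD:  regs[rd] = regs[rs1] + regs[rs2]; break;
                    case MUL:  regs[rd] = regs[rs1] * regs[rs2]; break;
                    case HALT: return;
                    default:   throw new IllegalStateException("bad opcode " + op);
                }
            }
        }

        public static void main(String[] args) {
            int[] regs = new int[256];          // the VM's "unlimited" registers
            regs[1] = 2; regs[2] = 3; regs[3] = 4;
            int[] code = {
                encode(MUL, 4, 1, 2),           // r4 = r1 * r2
                encode(MUL, 5, 1, 3),           // r5 = r1 * r3
                encode(ADD, 6, 4, 5),           // r6 = r4 + r5
                encode(HALT, 0, 0, 0)
            };
            run(code, regs);
            System.out.println(regs[6]);        // 14
        }
    }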


> What is, in your opinion, the difference between values in registers
> and values in a stack?


The biggest difference IMO is that with registers it is clear (at
least to a compiler or to hardware) which values are being used for
what, whereas values on a stack are all jumbled together.


> [The following is my opinion, but it may be outdated a bit ;-]


I think that the same may be fairly said of me. :)


> IMO there exists no big difference, only that the field for the
> register number in a machine instruction is smaller than the field
> for a stack offset.


There is also the difference that the stack pointer is used as an
implicit destination address, even if it is not the ideal one for the
situation. Imagine a three-operand register instruction where the
destination register must always be R1. That would be bad.


> Operations with values in registers may execute faster, than with
> values in a stack, but the ratio cannot be predicted easily. The
> values in registers must be loaded and saved, at least whenever a
> subroutine or other scope is entered or left.


I wonder if any study has been done of a compiler which analyzes the
called routine's register usage and only saves those registers which
are both overwritten by the called routine and contain important
values. This might be hard to get right, since doing it properly
requires a transitive closure over the called routines, and they would
likely not all reside in the same compilation unit.
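
I haven't seen such a study, but the analysis itself is simple enough
to sketch. Here is a rough, hypothetical outline (nothing to do with
any real compiler): compute each routine's transitively clobbered
register set with a fixed-point pass over the call graph, then at each
call site save only the registers that are both live across the call
and in the callee's set. The separate-compilation problem remains, of
course.

    import java.util.*;

    public class ClobberAnalysis {
        // For each routine: registers it overwrites directly, and routines it calls.
        static Map<String, Set<Integer>> transitiveClobbers(
                Map<String, Set<Integer>> directClobbers,
                Map<String, List<String>> calls) {
            Map<String, Set<Integer>> result = new HashMap<>();
            for (var e : directClobbers.entrySet())
                result.put(e.getKey(), new HashSet<>(e.getValue()));
            for (String r : calls.keySet())
                result.computeIfAbsent(r, x -> new HashSet<>());
            // Iterate to a fixed point: a routine clobbers whatever its callees clobber.
            boolean changed = true;
            while (changed) {
                changed = false;
                for (var e : calls.entrySet()) {
                    Set<Integer> mine = result.get(e.getKey());
                    for (String callee : e.getValue())
                        changed |= mine.addAll(result.getOrDefault(callee, Set.of()));
                }
            }
            return result;
        }

        // At a call site, save only the registers that are live across the call
        // AND (transitively) clobbered by the callee.
        static Set<Integer> registersToSave(Set<Integer> liveAcrossCall,
                                            Set<Integer> calleeClobbers) {
            Set<Integer> save = new HashSet<>(liveAcrossCall);
            save.retainAll(calleeClobbers);
            return save;
        }

        public static void main(String[] args) {
            Map<String, Set<Integer>> direct = Map.of(
                "leaf",   Set.of(1, 2),
                "helper", Set.of(3),
                "main",   Set.of(4));
            Map<String, List<String>> calls = Map.of(
                "leaf",   List.of(),
                "helper", List.of("leaf"),
                "main",   List.of("helper"));
            var clobbers = transitiveClobbers(direct, calls);
            // r2 and r5 live across a call to "helper": only r2 needs saving.
            System.out.println(registersToSave(Set.of(2, 5), clobbers.get("helper")));
        }
    }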


In any case, the effort to save/restore all used registers around a
subroutine call should be no more than the effort a stack machine goes
to, between and during every expression evaluation, to keep all of the
values in memory.


> There exist(ed) processors, which implement their registers in a
> stack, and moved the base pointer with every scope change. Many
> registers, but all in memory! On some newer machines IMO a data
> cache in the CPU can make operations with memory operands nearly as
> fast as with register operands. Then the number of "registers in
> use" can vary between several versions of the same CPU, whereas the
> number of "directly addressable registers" is hard wired in the
> (machine architecture? and) instruction set.


That sounds similar to the register windows in the SPARC. It also
sounds reminiscent of the register renaming happening in the newer
out-of-order processors. Hmmm, speaking of out-of-order execution, how
does having every instruction dependent on the contents of a stack
pointer which is updated by nearly every instruction help in
exploiting instruction-level parallelism? It is probably surmountable,
but IMO clearly not helpful.


Also, this cache would probably have to be separate from the normal
memory D-cache, since if it is being accessed per-instruction it will
need a lot of ports. Many D-caches are dual ported already, just to
keep up with regular loads and stores. If this cache is quad ported,
it would be able to supply operands to up to two instructions per cache
cycle, since I assume that both operands will be coming from the
stack. We'd better make that cache pretty small so that cache cycles
are only one clock cycle. We will also have to add four adders to our
pipeline to calculate the addresses. If we have a "base address" to
add to this, make that eight more adders. We should be able to
implement some kind of stack pointer predictor that runs at
instruction decode time, so it should be able to pre-calculate the
stack pointer, allowing us to dispatch our two instructions as soon as
they get their operands from the cache. Since stack architectures
tend to have a lot of serialization of values being used at the top of
the stack, we will need some really long reservation stations to find
independent instructions to execute, but longer reservation stations
require more physical registers, which increases the logic to locate
ready instructions. I believe the increase goes as the square of the
number of physical registers. Along with really hurting our IPC, we
are introducing a lot of extra little drop, swap, rot, dup
instructions that don't necessarily do any useful work.


Sorry for the non-comp.compilers rant there. This is just something I
feel strongly about.


> On other (RISC) architectures the best usage of the registers IMO
> can become very complicated, so that a compiler can optimize the
> usage of the registers better than an application programmer. So why
> bother with a limited number of registers in a high level language,
> when their usage then typically only will /disallow/ the compiler,
> to produce the fastest possible code?


I don't think that is true. My intuition is that as long as the high
level compiler is optimizing for a number of virtual registers greater
than or equal to the number of physical registers, everything is fine.
The JIT compiler can perform the same kind of register scheduling that
the high level one did, and the high level compiler's optimization
should not degrade the final result.


IOW: I claim that if our VM has 256 registers, our ISA has 32
registers and our expression has 300 concurrent temporary values, the
high level compiler can schedule the block down to 256 registers by
adding stores and loads as appropriate. These stores and loads should
not hinder the JIT or load-time compiler from also being able to
generate an optimized register scheduling. In fact, the stores and
loads added in should be ones that would have been added in anyhow if
the compiler were optimizing directly from the 300 values to the 32
values in a single step.
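
To make that claim concrete, here is a crude, entirely hypothetical
sketch (not how any real JIT does it): a pass that rewrites a
straight-line block of three-address code so that only K physical
registers are referenced, spilling the least recently used virtual
register to a memory slot with explicit STORE/LOAD pseudo-ops. Run it
once at K=256 in the high level compiler and again at K=32 in the JIT
and the added loads and stores do not get in each other's way; they are
the same kind the one-step compiler would have added anyhow. The LRU
policy is obviously naive; the point is only that the spill code is
mechanical and does not block a later, smarter allocation.

    import java.util.*;

    public class SpillRewriter {
        // A three-address instruction over virtual registers: vDest = vSrc1 OP vSrc2.
        record Insn(String op, int dest, int src1, int src2) {}

        private final List<String> out = new ArrayList<>();    // rewritten code, as text
        private final LinkedHashMap<Integer, Integer> inReg =  // vreg -> preg, LRU order
                new LinkedHashMap<>(16, 0.75f, true);
        private final Deque<Integer> free = new ArrayDeque<>();

        SpillRewriter(int k) {              // k physical registers; assumes k >= 2
            for (int p = 0; p < k; p++) free.add(p);
        }

        // Make sure vreg lives in some physical register, emitting STORE/LOAD as
        // needed. Memory slots are assumed to hold the block's initial input values.
        private int ensure(int vreg, boolean needValue) {
            Integer preg = inReg.get(vreg);
            if (preg != null) return preg;
            if (free.isEmpty()) {                               // evict least recently used
                var victim = inReg.entrySet().iterator().next();
                out.add("STORE p" + victim.getValue() + " -> slot[v" + victim.getKey() + "]");
                free.add(victim.getValue());
                inReg.remove(victim.getKey());
            }
            preg = free.poll();
            if (needValue)
                out.add("LOAD  slot[v" + vreg + "] -> p" + preg);
            inReg.put(vreg, preg);
            return preg;
        }

        List<String> rewrite(List<Insn> code) {
            for (Insn i : code) {
                int s1 = ensure(i.src1(), true);
                int s2 = ensure(i.src2(), true);
                int d  = ensure(i.dest(), false);  // dest is overwritten, no load needed;
                                                   // a spilled source's value survives in
                                                   // its old preg until this insn executes
                out.add(i.op() + "   p" + d + " = p" + s1 + ", p" + s2);
            }
            return out;
        }

        public static void main(String[] args) {
            // Toy block with 5 virtual registers squeezed into 2 physical ones.
            List<Insn> block = List.of(
                new Insn("MUL", 2, 0, 1),
                new Insn("MUL", 3, 0, 1),
                new Insn("ADD", 4, 2, 3));
            new SpillRewriter(2).rewrite(block).forEach(System.out::println);
        }
    }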


I don't, however, know if this is true. Anyone out there able to offer
a better proof either way?


-andrew

