Re: Portable Fast Direct Threaded Code

firth@sei.cmu.edu (Robert Firth)
4 Apr 91 13:27:21 GMT

From comp.compilers

Related articles
Portable Fast Direct Threaded Code eliot@cs.qmw.ac.uk (Eliot Miranda) (1991-03-28)
Re: Portable Fast Direct Threaded Code Tom.Lane@G.GP.CS.CMU.EDU (1991-04-01)
Re: Portable Fast Direct Threaded Code metzger@watson.ibm.com (1991-04-02)
Re: Portable Fast Direct Threaded Code pardo@cs.washington.edu (1991-04-02)
Re: Portable Fast Direct Threaded Code vestal@SRC.Honeywell.COM (1991-04-03)
Re: Portable Fast Direct Threaded Code firth@sei.cmu.edu (1991-04-04)
Re: Portable Fast Direct Threaded Code pardo@cs.washington.edu (1991-04-04)
Re: Portable Fast Direct Threaded Code bpendlet@bambam.es.com (1991-04-08)

Newsgroups: comp.compilers
From: firth@sei.cmu.edu (Robert Firth)
Keywords: interpreter, performance, design
Organization: Software Engineering Institute, Pittsburgh, PA
References: <3035@redstar.cs.qmw.ac.uk> <1991Apr2.014216.25150@watson.ibm.com> <1991Apr2.192125.7464@beaver.cs.washington.edu> <1991Apr3.182334.16164@src.honeywell.com>
Date: 4 Apr 91 13:27:21 GMT

In article <1991Apr3.182334.16164@src.honeywell.com> vestal@SRC.Honeywell.COM (Steve Vestal) writes:


>How architecturally dependent is the performance of these techniques
>(relative to compiling to native code)?


[cost of threaded code on PDP-11, RISC &c]


We might have a misunderstanding here, because what I think of as threaded
code doesn't have a decoding and interpretation step. But I'll talk of
what I know.


A program in threaded code is just an array of addresses, possibly
interspersed with operands. So the fragment


c := a + b


becomes something like


address of 'load'
address of 'a'
address of 'load'
address of 'b'
address of '+'
address of 'store'
address of 'c'


This implies a very simple virtual stack machine - you can get more clever
by implementing a virtual register machine.
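

In C terms (a sketch of mine, not part of the original article; all the
names are invented), the thread is nothing more than an initialized
array of pointers, with the primitives' entry points interleaved with
operand addresses:


/* Layout sketch only: the thread for "c := a + b" as an array of
   addresses.  load, plus and store stand for the primitives' entry
   points, declared but not defined here; how control threads through
   the array is shown in the sketches further down. */
int a, b, c;

void load(void), plus(void), store(void);   /* primitive entry points */

void *thread[] = {
    (void *)load, &a,       /* address of 'load', address of 'a' */
    (void *)load, &b,
    (void *)plus,
    (void *)store, &c,      /* address of 'store', address of 'c' */
};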


The basic execution thread then does this. We point a global register at
the table of addresses, and each primitive has the form


treg := treg + address'size
...
jump (treg)


As you can see, this is great on the PDP-11, since that reduces to one
instruction


MOV (treg)+,PC ; NOTE TO MAINTAINER: FASTER THAN JMP - DON'T TOUCH!


On a typical RISC machine, it's two cycles, since you can put almost
anything in the delay slot(s) after the jump.
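

In modern terms, and in the spirit of the thread's subject line, the same
dispatch can be written in GNU C with the labels-as-values extension. A
minimal sketch (mine, not Firth's; it assumes gcc, and all names are
invented):


#include <stdio.h>

int main(void) {
    /* the thread: two primitives, then halt */
    static void *thread[] = { &&hello, &&hello, &&halt };
    void **ip = thread;        /* ip plays the part of treg */
    goto **ip++;               /* start the machine */

hello:
    puts("a primitive ran");
    goto **ip++;               /* NEXT: treg := treg + size; jump (treg) */
halt:
    return 0;
}


One difference of convention: here ip is post-incremented at each
dispatch, so it already points at the next cell when a primitive gains
control, whereas in the text treg points at the current cell and the
primitive advances it.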


Now, the load instruction, for instance, would say


load: treg := treg + address'size
fetch (treg) into tempreg ; tempreg := address of the operand
load (tempreg) into tempreg ; dereference to get the operand itself
treg := treg + address'size
push (tempreg) onto simulated stack
jump (treg)


On the PDP-11, that's


MOV @(treg)+, -(SP)
MOV (treg)+, PC


On a RISC, it's much more like


L R0, 4(treg) ; get operand address
L R0, 0(R0) ; dereference to get operand
SUBI SP, #4 ; decrement simulated SP
S R0, 0(SP) ; push operand on stack
ADDI treg, #8 ; step over two addresses (mine & operands)
JR (treg) ; over to you, Moriarty!


[to fill the delay slots, reorder the six instructions above to 1,3,2,5,6,4]


Well, if you have one delay slot after each load and one after the
branch, you can fill all three of them, so that's 6 cycles. Given that a
typical load is only 1.1 cycles in direct code (90% of the delays
filled), this is certainly a lot more than a 2:1 overhead! When you add
the cost of a simulated stack (lots of needless loads and stores), I can
well believe 10:1 time expansion for simple code.


In fact, it was more than that on the PDP-11, if you compared threaded
code with direct code from a decent compiler. The big win in the Fortran
compiler came from (a) very compact threaded code, and (b) floating-point
operations implemented in software, so the overhead of threaded code was
swamped by the cost of floating addition, subtraction &c.


Here's the full code of the above example, so you can see for yourself


Direct:
MOV a, R0
ADD b, R0
MOV R0, c


Threaded:
MOV @(treg)+, -(SP)
MOV (treg)+, PC
* MOV @(treg)+, -(SP)
* MOV (treg)+, PC
* ADD (SP)+,(SP)
MOV (treg)+, PC
MOV (SP)+, @(treg)+
MOV (treg)+, PC


Note that, if you implement a one-address add, you save two instructions,
since the three starred (*) instructions reduce to


ADD @(treg)+, (SP)


But even then, it's coming out at over 4:1.
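

For comparison on a modern machine, here is the whole fragment as a
complete direct-threaded program in GNU C (again my sketch, not Firth's;
it assumes the labels-as-values extension, and every identifier is
invented). The comments map each primitive to the PDP-11 instructions
above:


#include <stdio.h>

int main(void) {
    static int a = 3, b = 4, c;
    int stack[16], *sp = stack;          /* simulated stack, grows upward */

    /* the thread for "c := a + b" */
    static void *thread[] = {
        &&load, &a, &&load, &b, &&plus, &&store, &c, &&halt
    };
    void **ip = thread;
    goto **ip++;

load:                                    /* MOV @(treg)+, -(SP) */
    *sp++ = *(int *)*ip++;               /* fetch operand address, dereference, push */
    goto **ip++;                         /* MOV (treg)+, PC */
plus:                                    /* ADD (SP)+, (SP) */
    --sp;
    sp[-1] += sp[0];
    goto **ip++;
store:                                   /* MOV (SP)+, @(treg)+ */
    *(int *)*ip++ = *--sp;
    goto **ip++;
halt:
    printf("c = %d\n", c);               /* prints c = 7 */
    return 0;
}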


What architectural features make threaded code more efficient? The
fundamental one is main memory that is fast (or not too slow) relative to
registers, since you're doing a lot more fetching. Another is a set of
address modes with double indirection, since you're accessing most
operands one level of indirection further back. And good old
autoincrement helps a little, too. Alas, none of that says 'risc', and
much of it says '1960s'.


Incidentally, if I were to do this again today, I'd definitely simulate a
general-register machine and use a subset of the real machine's registers.
If you take 8 of them, you then have 8 loads and stores, one for each
register, but if you make an absolute rule that nobody even thinks about
touching one of those 8 that doesn't belong to him, then all the good
tricks about register allocation, slaving &c will still work. If you then
implement the operations as one-address general-register, you have again 8
versions (add into R0, add into R1, ...) and lo! you're programming a very
familiar old friend.
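

A sketch of that register-machine idea in the same GNU C style (my
invention of detail, not a description of any real system; four virtual
registers rather than 8, to keep it short):


#include <stdio.h>

int main(void) {
    int r[4];                            /* the virtual registers */
    static int a = 3, b = 4, c;

    /* c := a + b, register style, with one-address operations */
    static void *thread[] = {
        &&load_r0, &a, &&add_r0, &b, &&store_r0, &c, &&halt
    };
    void **ip = thread;
    goto **ip++;

load_r0:  r[0]  = *(int *)*ip++;  goto **ip++;
add_r0:   r[0] += *(int *)*ip++;  goto **ip++;   /* add into R0 */
store_r0: *(int *)*ip++ = r[0];   goto **ip++;
/* ...and likewise one load/add/store per remaining register, so each
   operation comes in per-register copies, as described above */
halt:
    printf("c = %d\n", c);               /* prints c = 7 */
    return 0;
}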


"But that was in another country, and besides, the machine is dead..."


Robert Firth
--

