Re: Interpreters for VLIW (Anton Ertl)
18 Jan 2001 00:52:41 -0500

          From comp.compilers


From: (Anton Ertl)
Newsgroups: comp.compilers
Date: 18 Jan 2001 00:52:41 -0500
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
References: 01-01-070
Keywords: interpreter, VLIW
Posted-Date: 18 Jan 2001 00:52:38 EST

(Paolo Bonzini) writes:
>I have compiled GNU Smalltalk on the Itanium and found it to be
>horribly slow --- Less than half the speed of an Intel chip with the
>same clock! The reason is, the basic blocks in the interpreter are
>too small for GCC to do nice instruction scheduling. Does anyone know
>of pointers to papers on optimizing interpreters for VLIW

    author = "Jan Hoogerbrugge and Lex Augusteijn and Jeroen Trum
              and Rik van de Wiel",
    title = "A code compression system based on pipelined
             interpreters",
    journal = "Soft\-ware\emdash Prac\-tice and Experience",
    volume = "29",
    number = "11",
    pages = "1005--1023",
    month = sep,
    year = "1999",
    coden = "SPEXBL",
    ISSN = "0038-0644",
    bibdate = "Sat Sep 18 18:25:59 MDT 1999",
    url = ";PLACEBO=IE.pdf;
    acknowledgement = ack-nhfb,

For the IA-64 and PPCs you also have to consider the branch registers.
I don't know the timings for the Itanium, but on the PPCs you have to
set them as early as possible; if you do the move to the branch
register at the start of the VM instruction, the branch at the end
still often has to wait for the result (e.g., 5 cycles latency from
mtctr to bctr on the PPC 604e). If you have to do some load
instructions first, that adds even more. Also, you cannot tell gcc
where to place the move, and it's not very smart about moving it up by
itself.

In the following I'll assume you want to implement direct threaded
code (right?). So you should do the load of the target address in the
previous VM instruction (i.e., prefetch the VM instruction). E.g.,
Gforth does this for the PPC.
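
As a rough sketch of the prefetching idea (not Gforth's actual code):
the toy stack VM below, its opcodes, and all names such as
run_prefetch are hypothetical. It relies on labels as values, a GNU C
extension, and assumes straight-line VM code; a VM branch would have
to reload `next` itself. Each handler starts with the address of its
successor already in `next`, so the compiler is free to move it to a
branch register right away, and then loads the address for the
dispatch after that.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy VM; none of these names come from Gforth or GNU Smalltalk. */
enum { OP_LIT, OP_ADD, OP_MUL, OP_HALT };
typedef struct { int op; long arg; } VmOp;       /* source form */
typedef struct { void *target; long arg; } Cell; /* direct threaded form */

enum { MAX_CODE = 64, STACK_SIZE = 16 };

/* Runs straight-line code; returns the top of stack at OP_HALT (0 if empty). */
long run_prefetch(const VmOp *prog, size_t n)
{
    /* Labels as values: a GNU C extension (gcc, clang). */
    static void *const labels[] = { &&do_lit, &&do_add, &&do_mul, &&do_halt };
    Cell code[MAX_CODE + 2];
    long stack[STACK_SIZE];
    long *sp = stack;
    const Cell *ip;
    void *next;    /* handler address of the instruction after the current one */
    void *target;  /* address used by the current instruction's dispatch */
    size_t i;

    assert(n <= MAX_CODE);
    for (i = 0; i < n; i++) {
        assert(prog[i].op >= OP_LIT && prog[i].op <= OP_HALT);
        code[i].target = labels[prog[i].op];
        code[i].arg = prog[i].arg;
    }
    /* Two halting sentinels keep the ip[2] prefetch inside the array. */
    for (i = n; i < n + 2; i++) {
        code[i].target = labels[OP_HALT];
        code[i].arg = 0;
    }

    ip = code;
    next = ip[1].target;     /* prime the pipeline */
    goto *ip[0].target;

do_lit:
    target = next;           /* loaded by the previous VM instruction */
    next = ip[2].target;     /* prefetch for the following dispatch */
    assert(sp < stack + STACK_SIZE);
    *sp++ = ip->arg;
    ip++;
    goto *target;

do_add:
    target = next;
    next = ip[2].target;
    assert(sp - stack >= 2);
    sp--;
    sp[-1] += sp[0];
    ip++;
    goto *target;

do_mul:
    target = next;
    next = ip[2].target;
    assert(sp - stack >= 2);
    sp--;
    sp[-1] *= sp[0];
    ip++;
    goto *target;

do_halt:
    return sp > stack ? sp[-1] : 0;
}
```

Whether the compiler really hoists the move to the branch register to
the top of each handler has to be checked in the generated assembly;
the C source only makes that placement possible.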

If you already do all that, the next step in this direction would be
to perform what the software pipelining guys call loop unrolling and
modulo variable renaming: you make a second copy of the interpreter,
with each copy using a different branch register, and make sure that
VM instructions are executed alternately from the two copies. Then you
can do the move to the branch register one VM instruction earlier.
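
A sketch of the two-copy scheme, under the same assumptions as any toy
example: the VM, its opcodes, and names such as run_two_copies, to_a
and to_b are hypothetical, the code uses the GNU C labels-as-values
extension, and VM branches are left out. The variables to_a and to_b
only stand in for two branch registers; C cannot force that register
assignment, so this shows the control structure rather than a
guaranteed speedup.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_LIT, OP_ADD, OP_MUL, OP_HALT };
typedef struct { int op; long arg; } VmOp;
/* Each cell carries the handler address in both interpreter copies, since
   an instruction could be reached at either parity once VM branches exist. */
typedef struct { void *a, *b; long arg; } Cell2;

enum { MAX_CODE = 64, STACK_SIZE = 16 };

long run_two_copies(const VmOp *prog, size_t n)
{
    static void *const in_a[] = { &&lit_a, &&add_a, &&mul_a, &&halt };
    static void *const in_b[] = { &&lit_b, &&add_b, &&mul_b, &&halt };
    Cell2 code[MAX_CODE + 2];
    long stack[STACK_SIZE];
    long *sp = stack;
    const Cell2 *ip;
    void *to_a;  /* stands in for the branch register used by copy B */
    void *to_b;  /* stands in for the branch register used by copy A */
    size_t i;

    assert(n <= MAX_CODE);
    for (i = 0; i < n + 2; i++) {
        /* Cells n and n+1 are halting sentinels for the ip[2] prefetch. */
        int op = i < n ? prog[i].op : OP_HALT;
        assert(op >= OP_LIT && op <= OP_HALT);
        code[i].a = in_a[op];
        code[i].b = in_b[op];
        code[i].arg = i < n ? prog[i].arg : 0;
    }

    ip = code;
    to_a = ip[0].a;   /* the first instruction runs in copy A */
    to_b = ip[1].b;   /* its successor runs in copy B */
    goto *to_a;

    /* Copy A: dispatches through to_b, which was set one VM instruction
       ago, and sets to_a for the dispatch after the next one. */
lit_a:
    to_a = ip[2].a;
    assert(sp < stack + STACK_SIZE);
    *sp++ = ip->arg;
    ip++;
    goto *to_b;
add_a:
    to_a = ip[2].a;
    assert(sp - stack >= 2);
    sp--;
    sp[-1] += sp[0];
    ip++;
    goto *to_b;
mul_a:
    to_a = ip[2].a;
    assert(sp - stack >= 2);
    sp--;
    sp[-1] *= sp[0];
    ip++;
    goto *to_b;

    /* Copy B: the same handlers with the roles of to_a and to_b swapped. */
lit_b:
    to_b = ip[2].b;
    assert(sp < stack + STACK_SIZE);
    *sp++ = ip->arg;
    ip++;
    goto *to_a;
add_b:
    to_b = ip[2].b;
    assert(sp - stack >= 2);
    sp--;
    sp[-1] += sp[0];
    ip++;
    goto *to_a;
mul_b:
    to_b = ip[2].b;
    assert(sp - stack >= 2);
    sp--;
    sp[-1] *= sp[0];
    ip++;
    goto *to_a;

halt:
    return sp > stack ? sp[-1] : 0;
}
```

The cost is a doubled interpreter and cells that hold two addresses,
so the extra code size and cache pressure have to be weighed against
the longer distance between setting and using each branch register.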

Another direction to take would be to combine VM instructions; this
reduces the number of dispatches necessary and increases the length of
the code of each instruction, so there is more scheduling opportunity.
I am currently working on a VM interpreter generator that does much of
this automatically (it's based on Gforth's generator). It should be
ready in a few weeks. Mail me if you are interested.
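
To illustrate combining VM instructions, here is a minimal sketch with
a hypothetical toy VM (the opcodes, combine_pairs and run_counting are
made up for this example and are not the generator mentioned above).
It merges a LIT followed by ADD or MUL into one combined instruction
and counts dispatches. It assumes straight-line code; a real combiner
must not merge across a VM branch target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy VM; OP_LIT_ADD and OP_LIT_MUL are combined instructions. */
enum { OP_LIT, OP_ADD, OP_MUL, OP_HALT, OP_LIT_ADD, OP_LIT_MUL };
typedef struct { int op; long arg; } VmOp;

enum { STACK_SIZE = 16 };

/* Peephole pass: rewrites LIT x; ADD as LIT_ADD x, and LIT x; MUL as
   LIT_MUL x. `out` must have room for n entries. Returns the new length. */
size_t combine_pairs(const VmOp *in, size_t n, VmOp *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i].op == OP_LIT && i + 1 < n && in[i + 1].op == OP_ADD) {
            out[m++] = (VmOp){ OP_LIT_ADD, in[i].arg };
            i++;                       /* skip the ADD that was absorbed */
        } else if (in[i].op == OP_LIT && i + 1 < n && in[i + 1].op == OP_MUL) {
            out[m++] = (VmOp){ OP_LIT_MUL, in[i].arg };
            i++;                       /* skip the MUL that was absorbed */
        } else {
            out[m++] = in[i];
        }
    }
    return m;
}

/* Plain switch interpreter that also counts how many dispatches it made. */
long run_counting(const VmOp *code, size_t n, size_t *dispatches)
{
    long stack[STACK_SIZE];
    long *sp = stack;

    *dispatches = 0;
    for (size_t pc = 0; pc < n; pc++) {
        ++*dispatches;
        switch (code[pc].op) {
        case OP_LIT:
            assert(sp < stack + STACK_SIZE);
            *sp++ = code[pc].arg;
            break;
        case OP_ADD:
            assert(sp - stack >= 2);
            sp--;
            sp[-1] += sp[0];
            break;
        case OP_MUL:
            assert(sp - stack >= 2);
            sp--;
            sp[-1] *= sp[0];
            break;
        case OP_LIT_ADD:               /* longer body, one dispatch */
            assert(sp > stack);
            sp[-1] += code[pc].arg;
            break;
        case OP_LIT_MUL:
            assert(sp > stack);
            sp[-1] *= code[pc].arg;
            break;
        case OP_HALT:
            return sp > stack ? sp[-1] : 0;
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
    return sp > stack ? sp[-1] : 0;
}
```

For the program LIT 2, LIT 3, ADD, LIT 4, MUL, HALT the pass produces
LIT 2, LIT_ADD 3, LIT_MUL 4, HALT, so the same result is computed with
four dispatches instead of six.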

- anton
M. Anton Ertl
Some things have to be seen to be believed
Most things have to be believed to be seen
