Re: Fixed stack / dynamic stack for each thread

"Ira Baxter" <>
Tue, 23 Dec 2008 10:59:47 -0600

          From comp.compilers

Related articles
Re: Fixed stack / dynamic stack for each thread (Ira Baxter) (2008-12-23)
Re: Fixed stack / dynamic stack for each thread (2009-01-05)
Re: Fixed stack / dynamic stack for each thread (Ira Baxter) (2009-01-10)
Re: Fixed stack / dynamic stack for each thread (2009-01-25)
| List of all articles for this month |

From: "Ira Baxter" <>
Newsgroups: comp.programming.threads,comp.compilers
Date: Tue, 23 Dec 2008 10:59:47 -0600
Organization: Compilers Central
References: <> <>
Keywords: parallel, storage
Posted-Date: 24 Dec 2008 13:43:14 EST

<> wrote in message
> In article
> <>,
> () wrote:
> > > or it needs every single function that could possibly need to
> > > switch stacks (with programming languages that allow recursion,
> > > this is easily and safely approximated by "all the functions")
> > > to check the stack space and be prepared to switch, both on
> > > function entry and function exit.
> > By saying 'switch stack', do you mean that more stack space is needed
> > to proceed?
> Yes. We're about to have used up all the space in the current stack.
> Since we cannot extend the address space it occupies, because the next
> thing in the address space is a stack block belonging to a different
> thread, we therefore have to make use of a new stack block, and need to
> set things up appropriately.
> On a little more thought - I never considered this idea before yesterday
> - this is not trivial for languages that use SP-relative addressing for
> both received arguments and local variables. One of the easier ways to
> handle this might be to use two registers, one for things in the stack
> frame above you, and one for things in your own stack frame. These would
> have the same value most of the time, but if the function detected the
> need to start using a new stack block, they would have different values,
> one pointing into the old stack block and one into the new.
> Said further thought also reveals that you have a very serious problem
> with any truly block-structured language: C isn't really one of those,
> because you cannot define functions within the scope of functions. In
> languages where you can, it is common for inner functions to be able to
> see the variables of the outer functions. Making that cope with stack
> switching, given there is no small limit on the depth to which such
> functions may be nested, would be somewhat complicated. It would also be
> hard to make it fast.
> It's quite possible that people theorising about dynamic stacks, and
> maybe even experimenting with them, have solved all these problems. But
> I'd want to see some evidence for it.

It exists in real implementations.

I don't know about pthreads, we don't have anything to do with it, but...

Our PARLANSE parallell programming language
is based on the philosophy that there are an unbounded number of small
computational grains. This maximizes the opportunity to collect many grains
of computation and hand them out to available processors (simulated by
"threads" under Windows). PARLANSE is "block structured" in your sense;
functions have lexical access to the variables in surrounding scope.
PARLANSE allows allows (encourages) recursion, used to climb over
graphs and trees that we tend to use it for. We don't use PARLANSE
for numerical computations, but rather for irregular parallelism

It operates by using allocation activation records on a per-function
call basis, where the activation record size is determined by the
needs just of the function. Each function call thus does a stack

Argument handling occurs by passing arguments in the registers, or
passing a pointer to a block of parameters in the caller's stack.
Small results are returned in registers. "Large results" are returned
to a preallocated slot in the caller's stack frame by fast block move
operations where possible, and slower copy constructors when dynamic
values are returned (but that's hard to avoid anyway). Uplevel
addressing to lexically scoped, addressable values in surrounding
function definitions is handled by constructing a "display", an old
Algol idea which uses a small array per function, indexed by scope
level number, containing a pointer to each parent activation record.
We construct a "sparse" display, containing pointers to just the
parent activation records that might be accessed. Typical displays
are only a few entries long; in practice, nobody lexically nests
dozens of functions deep. Seems our brains just can't handle this :-}
The recieving function copies the part of the scope array it (or its
subfunctions) may use.

The overhead for all this is cheaper than you'd expect. it only takes
a few machine instructions per function call, and this matters only if
you are computing fibonacci or something else equally rediculously
simple. In practice, function activation records can be large, due to
inlining other (simple) functions, or due to paralllelism primitives,
which for PARLANSE statically allocate grain-tracking data structures
in the activation record for common static-parallelism primitives,
such as (parallel/concurrent a b c.. z) for fixed numbers of grains a
b ..., for partial order parallelism (always has a fixed number of
grains), etc. The overhead seems to be maybe 5%; this is well worth
it because we can usefully go parallel with reasonably large numbers
of processors (threads).

A real curse on under Windows, is the need for activation records to
be unnecessarily large. Because Windows insists on pushing the entire
machine state on your stack, including the floating point state, when
a thread traps, in debugging mode we round activation records up to
some 500+ bytes to allow for this. (Windows could put the machine
state in a thread-specific location rather than on the stack as its
first or an optional reaction, and then we wouldn't pay this price,
but getting microsoft to listen is pretty hard). In production, our
code doesn't trap, and activation records can be as small as 128
bytes. We think we can get them down to 64 bytes, e.g. a single cache
line. Yes, activation records are cache-line aligned.

None of these ideas are really new or original with us. We're just
engineers :-}

What's more remarkable is that PARLANSE propagates exceptions across
not only function boundaries, but cleanly across parallelism
boundaries. That requires a lot of complicated machinery but it makes
programming large robust systems a lot more practical. We have a very
large application, DMS (which does program analysis and
transformation), which is some 500K SLOC of hand code PARLANSE and
some typical 1M SLOC auto-generated PARLANSE in a typical
configuration. Seems to work reasonably well.

Ira Baxter, CTO

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.