Mon, 20 Nov 1995 17:36:35 GMT

Related articles |
---|

EQNTOTT Vectors of 16 bit Numbers [Was: Re: Yikes!!! New 200Mhz Intel glew@ichips.intel.com (1995-11-09) |

Re: EQNTOTT Vectors of 16 bit Numbers [Was: Re: Yikes!!! New 200Mhz In pardo@cs.washington.edu (1995-11-14) |

Re: EQNTOTT Vectors of 16 bit Numbers cdg@nullstone.com (1995-11-17) |

Re: EQNTOTT Vectors of 16 bit Numbers hbaker@netcom.com (1995-11-19) |

Re: EQNTOTT Vectors of 16 bit Numbers cliffc@ami.sps.mot.com (1995-11-20) |

Re: EQNTOTT Vectors of 16 bit Numbers bernecky@eecg.toronto.edu (1995-11-20) |

Re: EQNTOTT Vectors of 16 bit Numbers bernecky@eecg.toronto.edu (1995-11-21) |

Newsgroups: | comp.compilers |

From: | bernecky@eecg.toronto.edu (Robert Bernecky) |

Keywords: | architecture, optimize, APL, comment |

Organization: | University of Toronto, Computer Engineering |

References: | <47j9hm$tdp@caesar.ultra.net> 95-11-079 95-11-132 |

Date: | Mon, 20 Nov 1995 17:36:35 GMT |

pardo@cs.washington.edu (David Keppel) writes:

*>>["Intel's special SPEC optimization."]*

*>*

*>This reminds me of APL benchmarking from twenty years ago. The*

*>interpreters recognized particular idioms and implemented them as*

*>special cases. The APL vendors (notably, IBM) were accused of*

*>"benchmark optimizations", though I think the original motivation*

*>was speeding up real use.*

We call this product engineering -- it solves users' real problems.

It is traditional to blame marketeers for taking our work and

using it in benchmarks. 8^}

Actually, IBM was fairly innocent on that front. It was us competitors

in that field, primarily the I.P. Sharp Associates development

group who did a lot of it. For some reason, IBM was very slow to

adopt algorithmic improvements in their early APL products --

Work I did on fast set membership took at least 10 years to

hit the IBM products. IBM's APL2 does a very nice job on

structural operations -- probably ahead of all competitors on that

front (transpose, reverse, take, drop...)

APL implementors worked on two fronts:

- storage savings [Typical workspace size back then was

a generous 100k, up from the 36k of yore [yore wasn't that

much earlier]], to avoid workspace full problems.

Logically, these used loop fusion techniques to

avoid generating large temps. The earliest one I remember

was the "compress iota" idiom:

A common technique in APL is to use a Boolean compression

mask to compress a list of the first N integers.

For example, let's call the index generator "iota".

Then:

iota 5

0 1 2 3 4

and compress (denoted by "/") with a Boolean left argument

gives the elements of the right argument that correspond

to ones in the left argument. E.g., to locate the blanks

a character vector:

x=. ' Where are the blanks here? "

x=' '

10000010001000111000000100000111

Stick blanks between the digits above so its a numeric vector.

I left them out so you can see what's happening.

(x=' ')/ iota shape x

0 6 10 14 15 16 23 29 30 31

Now, a naive implementation would run across "iota shape n"

and build an integer vector w/as many elements as there are

elements in x. That would get fed to compress which would

discard 99% of it.

The integer temp is 4 times as big as characters (32-bit ints)

and 32 times as big if x is Boolean.

Result: Big workspace full problem for common applications.

So, we (I think Larry Breed was the originator of this

particular work. If not Larry, then it was Roger Moore -- both

Hopper Award winners for APL\360) made the system recognize

the "compress iota" idiom, replaced it by a new token in the

internal representation, and avoided generation of the integer

list temp. Much faster and much less (usually) space.

This was generalized in work by Phil Abrams (Stanford PhD thesis:

An APL Machine) as "J-vectors", aka "Arithmetic Progression Vectors".

These represent numeric vectors as a 3-element vector of

initial value, element count, increment

A system that implemented j-vectors [EBIverson: APL73, Springer&Verlag]

would have iota generate a j vector of: 0, (shape x), 1

and then have compress operate as desired on the resulting j-vector.

This appeared in the Siemens BS4000 APL system.

Lessons:

a. APVs are good, but in an interpreter have not proved worthwhile.

The bookkeeping gets excessive, and doesn't pay off in the

long run -- it's very similar to CISC vs RISC design arguments --

if your gate delays to do something fancy [per-operation

overheads] are larger than the potential gains from something

fancy, the design is a net loss.

In a compiler, APVs can pay off big time, as the analysis and

dispatch time is done once at compile time.

b. The compress iota kludge doesn't generalize as nicely as one would

like to other expressions. You can support a handful of common

idioms, but an interpreter doesn't have the luxury of time to

perform the required analysis. Again, a compiler can do much

better, and in a more general way. For example, loop fusion of

all sorts comes essentially for free.

We also engaged in a variety of time-saving algorithmic changes, such

as:

fast set membership and table lookup (Bernecky APL73)

fast matrix products (unpublished, but uses CDC STAR 100 algorithm)

compiled scalar functions and other commonly used operations

On a vaguely related topic, I happen to be evaluating the loop fusion

capabilities of APEX (The APL Parallel Executor) right now, to

determine how well it handles cases such as the above compress iota.

(APEX cranks out SISAL code from APL.)

[I believe that a lot of people came up with APVs at about the same time. I

invented them for an APL compiler I wrote as a class project in the fall of

1972, for example. -John]

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.