Re: EQNTOTT Vectors of 16 bit Numbers (Robert Bernecky)
Mon, 20 Nov 1995 17:36:35 GMT

          From comp.compilers

Related articles
EQNTOTT Vectors of 16 bit Numbers [Was: Re: Yikes!!! New 200Mhz Intel (1995-11-09)
Re: EQNTOTT Vectors of 16 bit Numbers [Was: Re: Yikes!!! New 200Mhz In (1995-11-14)
Re: EQNTOTT Vectors of 16 bit Numbers (1995-11-17)
Re: EQNTOTT Vectors of 16 bit Numbers (1995-11-19)
Re: EQNTOTT Vectors of 16 bit Numbers (1995-11-20)
Re: EQNTOTT Vectors of 16 bit Numbers (1995-11-20)
Re: EQNTOTT Vectors of 16 bit Numbers (1995-11-21)
| List of all articles for this month |

Newsgroups: comp.compilers
From: (Robert Bernecky)
Keywords: architecture, optimize, APL, comment
Organization: University of Toronto, Computer Engineering
References: <47j9hm$> 95-11-079 95-11-132
Date: Mon, 20 Nov 1995 17:36:35 GMT (David Keppel) writes:
>>["Intel's special SPEC optimization."]
>This reminds me of APL benchmarking from twenty years ago. The
>interpreters recognized particular idioms and implemented them as
>special cases. The APL vendors (notably, IBM) were accused of
>"benchmark optimizations", though I think the original motivation
>was speeding up real use.

We call this product engineering -- it solves users' real problems.
It is traditional to blame marketeers for taking our work and
using it in benchmarks. 8^}

Actually, IBM was fairly innocent on that front. It was us competitors
in that field, primarily the I.P. Sharp Associates development
group who did a lot of it. For some reason, IBM was very slow to
adopt algorithmic improvements in their early APL products --
Work I did on fast set membership took at least 10 years to
hit the IBM products. IBM's APL2 does a very nice job on
structural operations -- probably ahead of all competitors on that
front (transpose, reverse, take, drop...)

  APL implementors worked on two fronts:
      - storage savings [Typical workspace size back then was
            a generous 100k, up from the 36k of yore [yore wasn't that
            much earlier]], to avoid workspace full problems.
            Logically, these used loop fusion techniques to
            avoid generating large temps. The earliest one I remember
            was the "compress iota" idiom:
              A common technique in APL is to use a Boolean compression
              mask to compress a list of the first N integers.
              For example, let's call the index generator "iota".
                    iota 5
              0 1 2 3 4
              and compress (denoted by "/") with a Boolean left argument
              gives the elements of the right argument that correspond
              to ones in the left argument. E.g., to locate the blanks
              a character vector:
                x=. ' Where are the blanks here? "
            x=' '
              Stick blanks between the digits above so its a numeric vector.
              I left them out so you can see what's happening.
                (x=' ')/ iota shape x
              0 6 10 14 15 16 23 29 30 31

            Now, a naive implementation would run across "iota shape n"
            and build an integer vector w/as many elements as there are
            elements in x. That would get fed to compress which would
            discard 99% of it.
            The integer temp is 4 times as big as characters (32-bit ints)
            and 32 times as big if x is Boolean.
            Result: Big workspace full problem for common applications.

              So, we (I think Larry Breed was the originator of this
            particular work. If not Larry, then it was Roger Moore -- both
            Hopper Award winners for APL\360) made the system recognize
            the "compress iota" idiom, replaced it by a new token in the
            internal representation, and avoided generation of the integer
            list temp. Much faster and much less (usually) space.

This was generalized in work by Phil Abrams (Stanford PhD thesis:
An APL Machine) as "J-vectors", aka "Arithmetic Progression Vectors".
These represent numeric vectors as a 3-element vector of
      initial value, element count, increment
A system that implemented j-vectors [EBIverson: APL73, Springer&Verlag]
would have iota generate a j vector of: 0, (shape x), 1
and then have compress operate as desired on the resulting j-vector.
This appeared in the Siemens BS4000 APL system.

  a. APVs are good, but in an interpreter have not proved worthwhile.
        The bookkeeping gets excessive, and doesn't pay off in the
        long run -- it's very similar to CISC vs RISC design arguments --
        if your gate delays to do something fancy [per-operation
        overheads] are larger than the potential gains from something
        fancy, the design is a net loss.
        In a compiler, APVs can pay off big time, as the analysis and
        dispatch time is done once at compile time.

b. The compress iota kludge doesn't generalize as nicely as one would
      like to other expressions. You can support a handful of common
      idioms, but an interpreter doesn't have the luxury of time to
      perform the required analysis. Again, a compiler can do much
      better, and in a more general way. For example, loop fusion of
      all sorts comes essentially for free.

We also engaged in a variety of time-saving algorithmic changes, such
    fast set membership and table lookup (Bernecky APL73)
    fast matrix products (unpublished, but uses CDC STAR 100 algorithm)
    compiled scalar functions and other commonly used operations

On a vaguely related topic, I happen to be evaluating the loop fusion
  capabilities of APEX (The APL Parallel Executor) right now, to
  determine how well it handles cases such as the above compress iota.
  (APEX cranks out SISAL code from APL.)
[I believe that a lot of people came up with APVs at about the same time. I
invented them for an APL compiler I wrote as a class project in the fall of
1972, for example. -John]

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.