software pipelining on s/390s email@example.com (1995-11-12)
From: firstname.lastname@example.org (Robert Bernecky)
Keywords: architecture, optimize, IBM, comment
Organization: University of Toronto, Computer Engineering
Date: Sun, 12 Nov 1995 22:23:37 GMT
Does anyone have some hard numbers [I'll even settle for soft-boiled
numbers] on the utility of software pipelining for array operations
such as double_vector+double_vector on current S/390 machines? The
reason I ask is that I'm trying it out myself, and getting fairly
puzzling results, as if it is no help at all.
The initial loop looks roughly like:
lp ld d0,0(r5)
The pipelined loop loads d4,d6 while doing the la/la/adr, then
moves d4,d6 into d0,d2 [for code generator limitation reasons],
then starts load of next operand pair into d4,d6, etc.
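
For readers who want the shape of the transformation without the assembler, here is a rough C-level sketch of the scheme described above. It is an illustration under assumptions, not the code the generator actually emits: the function names vadd_plain and vadd_pipelined are hypothetical, the output vector is assumed not to overlap the inputs, and the comments mapping variables to d0/d2/d4/d6 are only a loose correspondence.

```c
#include <stddef.h>

/* Straightforward loop: load both operands, add, store. */
void vadd_plain(double *z, const double *x, const double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}

/* Software-pipelined variant: the loads for element i+1 are issued
   before the add and store for element i, so the load latency can
   overlap with the arithmetic. Assumes z does not alias x or y. */
void vadd_pipelined(double *z, const double *x, const double *y, size_t n)
{
    if (n == 0)
        return;

    /* Prologue: fetch the first operand pair (roughly d4,d6). */
    double next_a = x[0];
    double next_b = y[0];

    for (size_t i = 0; i + 1 < n; i++) {
        /* Move the prefetched pair into the working pair (d4,d6 -> d0,d2). */
        double a = next_a;
        double b = next_b;

        /* Start the loads of the next operand pair. */
        next_a = x[i + 1];
        next_b = y[i + 1];

        /* Add and store the current element. */
        z[i] = a + b;
    }

    /* Epilogue: the last pair was loaded but not yet added. */
    z[n - 1] = next_a + next_b;
}
```

Both functions should produce identical results; any difference between them would be in timing only, and an optimizing C compiler may well reschedule either loop on its own.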
My numbers suggest that either caching [we ARE stride 1] is working
VERY well, or that something else in the system is working very well.
Or else, some other part of the system is running so damn slow that
it's swamping all my measurements either way.
[The 390 probably has great big cache lines. Also, they may be using
Tomasulo's scheme (invented for the 360/91) which gives the effect of
software pipelining in hardware, and is particularly useful on the 360 arch
since it only has 4 FP registers. -John]