Inline block moves

disque@unx.sas.com (Thomas Disque)
Mon, 11 Nov 91 16:26:36 GMT

From comp.compilers

Related articles
*Inline block moves disque@unx.sas.com* (1991-11-11)**
Re: Inline block moves mwm@pa.dec.comMeyer) (1991-11-11)
Inline block moves jfc@ATHENA.MIT.EDU (John Carr) (1991-11-11)
Inline block moves jfc@ATHENA.MIT.EDU (John Carr) (1991-11-12)
Re: Inline block moves christer@cs.umu.se (1991-11-12)
Re: Inline block moves Bruce.Hoult@actrix.gen.nz (1991-11-12)
Re: Inline block moves meissner@osf.org (1991-11-15)

| List of all articles for this month |

Newsgroups:	comp.compilers
From:	disque@unx.sas.com (Thomas Disque)
Keywords:	assembler, optimize
Organization:	SAS Institute Inc.
Date:	Mon, 11 Nov 91 16:26:36 GMT

Due to an overwhelming response, I am posting below my article on inline
block moves. Please note that although I am describing optimal code as
only hand assembly can craft it, I realize that compilers cannot always
generate this code. I am simply outlining what I percieve as the ideal
towards which we should strive. The previous statement was to prevent my
being skewered by compiler writers more knowledgable than I am :-)

          All The Myriad Moves
          by Tom Disque

An assembly language buff such as myself can derive considerable enjoyment
from comparisons of machines with different architectures. I'd like to
give those of you familiar with only one or two machine architectures a
brief glimpse of twelve different machines, ranging from micro to
mainframe. The thread to tie these thoughts together will be the C
language built-in function.

Built-in functions are benign deceptions perpetrated by a compiler to
improve performance. While they look like function calls to the user, the
machine code they generate is in-line. This allows faster execution
without greatly increasing code size. In fact, in some cases, the in-line
code may not be any larger than a function call. The choice of built-in
functions varies depending upon the machine generating the code. On the
6502 microprocessor, for example, a simple 8 bit multiply is a bad choice
for in-line code, while an 8 bit multiply on a VAX would be ridiculous to
implement as a subroutine call.

I have chosen one case of a routine that I believe can be implemented as a
built-in function on all the host machines presented here. I will show
examples of possible (and in some cases, actual) code generated for each
of the following architectures: the 6502, 6809, Z-80, 8088, 80286, 80386,
68000, and 68020 microprocessors; the Prime 9955 , Data General Eclipse
(R) MV/8000, and the Digital VAX (TM) 8700 minicomputers; and the IBM 4341
model 2 mainframe computer. I will present all code in uppercase in order
to easily distinguish it from the text. All timings presented are
calculated from the manufacturer's timing tables unless noted as
empirical.

The built-in function I will use to illustrate different code sequences is
the ANSI C compiler function, memcpy. It's calling sequence is:

MEMCPY(TO,FROM,N);

where FROM and TO are pointers and N is an unsigned integer. Please note
that TO and FROM are reversed on the Data General. In order to assure
that in-line code would be sensible on all hosts, I have assumed the
special case of N as a constant value. I have presented these in the most
natural form on each host. For instance, the Z-80 uses 16 bit pointers,
while the VAX uses 32 bit pointers. I will also contrast the myriad ways
of accomplishing the same task on different machines. This, to me, is the
most fascinating part of this study.

THE 8 BIT MICROPROCESSORS

The 6502 is an 8 bit microprocessor with a physical address space of 64K
bytes, although, as on the other 8 bit microprocessors, there exist
implementations that use bank switching to address more than 64K bytes.
The 6502 is a fairly simple chip, with three 8 bit general purpose
registers to choose from: the accumulator, the X-index register, and the
Y-index register. As for arithmetic instructions, there exists only 8 bit
add and 8 bit subtract. The 6502 provides binary coded decimal (BCD)
addition and subtraction, as do all the other hosts presented here.
Examples of 6502-based machines are the Atari, Apple, and Commodore 8 bit
computers.

The 6502 has no 'trick' instructions to do block moves, as do the Z-80,
the Intel chips, the minis, and the mainframe. It must rely on loops to
move its data, but unlike the 6809 and the Motorola 68000 series, it can
only move a byte at a time. This tends to make the code more cumbersome;
indeed, it is questionable whether block moves of more than 256 bytes
should even be in-line. Another serious problem lies in the need to have
these pointers in zero-page memory, i.e. memory locations 0 thru 255.
Zero-page locations are the only ones from which we can do indirect
indexed addressing, but they are always in short supply, since most of
them are used by the operating system. The five possible code sequences
are shown below.

    0 < N < 128 128 <= N < 256 N = 256 COMMENTS
        10 BYTES 12 BYTES 9 BYTES
          LDY #N LDX #N LDY #0 Load loop counter
          DEY LDY #0
LOOP LDA (FROM),Y LOOP LDA (FROM),Y LOOP LDA (FROM),Y Move thru the
          STA (TO),Y STA (TO),Y STA (TO),Y accumulator
          DEY INY DEY counter is also index
                                                DEX for n < 128 or n=256
          BPL LOOP BNE LOOP BNE LOOP Continue the loop

    256 < N < 65536, 256 < N < 65536,
    N modulo 256 NOT = 0 N modulo 256 = 0 COMMENTS
        40 BYTES 30 BYTES
          LDA FROM+1 LDA FROM+1 Save the pointer values
          PHA PHA being changed on the
          LDA TO+1 LDA TO+1 stack
          PHA PHA
          LDX #N/256 LDX #N/256 Load top byte of counter
          LDY #0 LDY #0 Load index
LOOP LDA (FROM),Y LOOP LDA (FROM),Y Move thru the
          STA (TO),Y STA (TO),Y accumulator, 256 bytes
          DEY DEY each time thru the
          BNE LOOP BNE LOOP inner loop
          INC FROM+1 INC FROM+1 Point to the next
          INC TO+1 INC TO+1 256 byte block
          DEX DEX Dec. block counter
          BNE LOOP BNE LOOP All 256 byte pages are
          LDX #N modulo 256 finished when x = 0
PART LDA (FROM),Y Move partial page
          STA (TO),Y
          INY
          DEX
          BNE PART
          PLA PLA Restore original
          STA TO+1 STA TO+1 pointer values
          PLA PLA
          STA FROM+1 STA FROM+1

The Motorola 6809 is an 8 bit microprocessor with a physical address space
of 64K bytes. It has four general purpose 16 bit registers: the D-register
(which is the accumulator), the X-index register, the Y-index register,
and the U-register (the user stack pointer). The D-register can also be
accessed as the A-register (the high 8 bits of D) or the B-register (the
low 8 bits). The 6809 also has an 8 bit register called the direct page
register, which can contain the high 8 bits of an address, allowing the
code to be smaller and faster by specifying only the low 8 bits of its
addresses. This1 is similar to the segment:offset structure used by the
Intel 16 bit chips. The 6809 can perform a 16 bit add or subtract and an 8
bit multiply. The 6809 is noted for its highly orthogonal instruction set.
An example of a machine using the 6809 is the Radio Shack Color Computer.

The example of the 6809 code is presented below. Here we can take
advantage of the 6809's ability to load and store 16 bits of data in a
single instruction, at a cost of two extra instructions if the count is
odd and no cost for an even count. Note that for large values of n, the
counter (the D register) does not have a decrement instruction. We would
be forced to use subtract if we had handled the loop counter in a more
orthodox manner. This seems odd in light of the fact that we have auto
increment and auto decrement modes available when using the X, Y, and U
pointer registers, but it is consistent with the 6502's inability to
increment and decrement its accumulator. At any rate, the number of
instructions for large values of n is quite less than is required for the
6502. Please note that we assume a Radio Shack Color Computer here, which
does not reserve the user stack pointer. If we were using another 6809
implementation, we would need to save the user stack pointer at the the
beginning and restore it at the end.

                    0 < N < 512
          ODD EVEN COMMENTS
          20 BYTES 16 BYTES
          LDX FROM LDX FROM Load pointers
          LDY TO LDY TO
          LDA ,X+ Copy odd byte
          STA ,Y+
          LDB #N/2 LDB #N/2 Load loop counter
LOOP LDU ,X++ LOOP LDU ,X++ Copy thru the user
          STU ,Y++ STU ,Y++ stack pointer
          DECB DECB Decrement counter
          BNE LOOP BNE LOOP until finshed

                    512 <= N < 65536
          ODD EVEN COMMENTS
          24 BYTES 20 BYTES
          LDY FROM LDY FROM Load pointers
          LDU TO LDU TO
          LDA ,Y+ Copy odd byte
          STA ,U+
          LDD #N/2 LDD #N/2 Load loop counter
LOOP LDX ,Y++ LOOP LDX ,Y++ Copy thru the
          STX ,U++ STX ,U++ accumulator
          DECB DECB Decrement lower counter
          BNE LOOP BNE LOOP until exhausted
          DECA DECA Decrement upper counter
          BPL LOOP BPL LOOP Note: this is "BNE LOOP"
                                                                        if N modulo 256 = 0

The Zilog Z-80 microprocessor is an 8 bit microprocessor with a physical
address space of 64K bytes. It overlaps the fetch, decode, and execute
phases where possible (known as pipelining). It has two index registers
(IX and IY) and seven 8 bit general purpose registers (A, B, C, D, E, H,
and L). The last 6 registers can be addressed as 3 16 bit registers (BC,
DE, and HL). All seven registers can be bank-switched with an alternate
bank of registers (A', B', C', D', E', H', L', BC', DE', and HL'). The
Z-80 provides 8 bit and 16 bit addition and subtraction. An example of a
Z-80 implementation is the Radio Shack TRS-80 Model 3.

At this point, it seems appropriate to note the two 'families' that
microprocessors seem to fall into: the 'sixers' (6502, 6809, and 680x0
microprocessors) and the 'eighters' (Z-80, 8088, 80286, and 80386
microprocessors). Each family has some characteristics that contrast with
the other. For instance, the sixers have an explicit
auto-increment/decrement addressing mode, whereas the eighters have
implicit auto-increment/decrement as part of their block mode instructions
(which are, in themselves, distinguishing features of the eighters). The
direction of the addressing (increment or decrement) on the eighters is
controlled by a direction flag. The 6502 and the 680x0 microprocessors
have self-contained instructions for binary-coded decimal arithmetic. The
6502 has a decimal flag that operates in a fashion similar to the
direction flag of the eighters. The 680x0 microprocessors use special
self-contained instructions for BCD add and subtract. The 6809 serves to
remind us that there is an exception to every rule, however. It uses the
eighters technique of using the same instruction to perform BCD arithmetic
as it uses for binary arithmetic. This is followed by a BCD adjust
instruction to convert the result back to BCD.

It is obvious from the example given below and the Intel example following
it that the Z-80 and the Intel chips share a common philosophy. The
concept of the block instruction, where specific registers are loaded with
beginning values and a single instruction performs the operation, is one
of the most notable features of the eighters.

                    0 < N < 65536 Comments
                    11 BYTES
                    LD HL,(FROM) Load the pointers into
                    LD DE,(TO) the registers required
                    LD BC,N Load the counter
                    LDIR (L)oa(D) with (I)ncrement (R)epeating

                    16-BIT AND 32-BIT MICROPROCESSORS

The Intel 8088 microprocessor is a 16 bit pipelined microprocessor with an
8 bit external bus width. It is capable of addressing up to one megabyte
of memory and has a four byte instruction cache. The 8088 has seven 16
bit general purpose registers: AX, BX, CX, DX, SI, DI, and BP. All of
these registers have special uses in connection with one instruction or
another. For example, the AX register is the dividend in for 16 bit
division. The BX, BP, SI, and DI registers can be used as index registers.
The upper and lower bytes of AX, BX, CX, and DX can be addressed as 8 bit
registers (AH, AL, BH, BL, CH, CL, DH, and DL, respectively). The 8088
provides 8 bit and 16 bit addition, subtraction, multiplication, and
division, as well as 32 bit by 16 bit division. A floating-point
coprocessor, the 8087, is available. An example of an 8088 implementation
is the IBM PC/XT.

The Intel 80286 is a 16 bit pipelined microprocessor with a 16 bit
external bus width. It is capable of addressing up to one megabyte in real
mode (which emulates the IBM PC/XT) and up to sixteen megabytes in
protected mode. It has a six byte instruction cache, and a floating-point
coprocessor (the 80287) is available. Most of the other information about
the 8088 applies to the 80286 as well. An example of an 80286
implementation is the IBM PC/AT.

The Intel 80386 is a 32 bit pipelined microprocessor with a 32 bit
external bus width and virtual machine capability. It can can address up
to four gigabytes of physical memory and up to sixty-four terabytes of
virtual memory. It has a 12 byte instruction cache. The EAX, EBX, ECX,
EDX, ESI, EDI, and EBP are 32 bit registers incorporating AX, BX, CX, DX,
SI, DI, and BP, respectively, into the lower 16 bits of each of them. The
80386 supports the full range of 8, 16, and 32 bit integer arithmetic. It
also relaxes some of the restrictions put upon the 8088 and the 80286. For
example, any register can be used as an index. Floating-point arithmetic
can be handled by either the 80287 or the 80387, although the 80386 can
emulate some floating-point operations faster than a 4 Mhz 80287 can
perform them! An example of an 80386 implementation is the Compaq Deskpro
386.

An 'enhancement' of the Z-80 example is shown below. The memcpy code for
the Intel 8088, 80286, and 80386 combine the advantages of the Z-80 block
instruction with the 6809's ability to move more than a byte at a time.
All of these Intel chips can move a 16 bit word at a time, and the 80386
can move a 32 bit double word each time the block move is executed.

The observant reader may be puzzled at this point by the limit of 65536
bytes on the smaller Intel chips, especially since the 6809 limit on small
moves was doubled from 256 to 512 by moving 16 bit words. Why can't up to
131,072 (128K) bytes be moved, since that value divided by two will fit in
our 16 bit counter? The answer is because we are only incrementing the 16
bit offsets in the block move instruction, not the segments. Therefore, we
can only move 64K blocks at a time; however, we can move 4 gigabytes at a
time on the 80386!

                0 < N < 65536
                ALL INTEL CHIPS
          ODD EVEN COMMENTS
          10 BYTES 9 BYTES
          LDS SI,[FROM] LDS SI,[FROM] Load the pointers into
          LES DI,[TO] LES DI,[TO] the registers required
          MOVSB (MOV)e (S)tring (B)yte
          MOV CX,N/2 MOV CX,N/2
          REP MOVSW REP MOVSW (REP)eat (MOV)e (S)tring (W)ord

                0 < N < 4 gigabytes
                80386 chip
N MODULO 4 = 3 N MODULO 4 = 2 N MODULO 4 = 1 N MODULO 4 = 0
    20 BYTES 19 BYTES 19 BYTES 18 BYTES
    MOV ESI,[FROM] MOV ESI,[FROM] MOV ESI,[FROM] MOV ESI,[FROM]
    MOV EDI,[TO] MOV EDI,[TO] MOV EDI,[TO] MOV EDI,[TO]
    MOVSB MOVSB
    MOVSW MOVSW
    MOV ECX,N/4 MOV ECX,N/4 MOV ECX,N/4 MOV ECX,N/4
    DB 66H DB 66H DB 66H DB 66H
    REP MOVSW REP MOVSW REP MOVSW REP MOVSW
(Note: the "DB 66H" line causes a toggle from 16 bit mode to
32 bit mode, causing the movsw instruction to move 32 bit words.)

The Motorola MC68000 is a 32 bit microprocessor with physical and virtual
memory address ranges of 16 megabytes, and virtual machine capability. It
has eight data registers (D0 - D7) and eight address registers (A0 - A7).
The data registers may be used to process 8 bit, 16 bit, or 32 bit data,
and the address registers may be used as pointers, base address registers,
or 16 bit and 32 bit data. Any register may be used as an index. The
68000 supports the full range of 8 bit, 16 bit, and 32 bit arithmetic
except for 32 bit multiplication and division. A floating-point
coprocessor, the 68881, is available. An example of a 68000
implementation is the Apple Macintosh.

The Motorola MC68020 is a 32 bit pipelined microprocessor with physical
and virtual memory address ranges of 4 gigabytes. It has an instruction
cache of 256 bytes, although only 128 bytes can be used for instructions
(the rest is taken by tag information). The 68020 instruction set is a
superset of the 68000. An example of a 68020 implementation is the Apple
Macintosh II.

With the Motorola 68000 series, we have no block instructions, but we can
move up to 32 bits in a single instruction. There is a restriction,
however. The 68000 cannot address anything larger than a byte from an odd
address; 16 and 32 bit data must be accessed from an even address. The
68020 eases this restriction; however, the time penalty for non-optimal
alignment is substantial, as we shall see later.

As a result of the addressing restrictions and of the fact that the dbf
(decrement and branch until false) instruction only uses the lower 16 bits
of the register, the code below can only move up to 64K on the 68000 and
256K on the 68020. Note that the 68000 series also has the auto-increment
and auto-decrement addressing modes seen earlier on the 6809.

          0 < N < 65536
          68000 COMMENTS
          22 BYTES
          MOVEA.L FROM,A0 Load pointers into any
          MOVEA.L TO,A1 unused address registers
          MOVE.W #N-1,D0 N-1 because dbf stops at -1
LOOP MOVE.B (A0)+,(A1)+ Move a byte
          DBF D0,LOOP Continue looping

                0 < N < 256K
                68020 chip
          N MODULO 4 = 3 N MODULO 4 = 2
          26 BYTES 24 BYTES
          MOVEA.L FROM,A0 MOVEA.L FROM,A0
          MOVEA.L TO,A1 MOVEA.L TO,A1
          MOVE.B (A0)+,(A1)+
          MOVE.W (A0)+,(A1)+ MOVE.W (A0)+,(A1)+
          MOVE.W #N/4-1,D0 MOVE.W #N/4-1,D0
LOOP MOVE.L (A0)+,(A1)+ LOOP MOVE.L (A0)+,(A1)+
          DBF D0,LOOP DBF D0,LOOP

          N MODULO 4 = 1 N MODULO 4 = 0
          20 BYTES 18 BYTES
          MOVEA.L FROM,A0 MOVEA.L FROM,A0
          MOVEA.L TO,A1 MOVEA.L TO,A1
          MOVE.B (A0)+,(A1)+
          MOVE.W #N/4-1,D0 MOVE.W #N/4-1,D0
LOOP MOVE.L (A0)+,(A1)+ LOOP MOVE.L (A0)+,(A1)+
          DBF D0,LOOP DBF D0,LOOP

Another possible code sequence involves the use of the movem instruction,
which can load and store multiple registers. Although it generates more
code, it can be almost twice as fast on the 68000. The speed increase on
the 68020 is much smaller, less than 10 percent. It also allows up to 2
megabyte moves to be generated while still using a dbf instruction, due to
the fact that 32 bytes per iteration are moved. One (big) problem with
this code is that it uses most of the registers, which could cause a lot
of otherwise unnecessary saving of register values to memory to accomodate
it.

When I tested the 68010, it presented an interesting problem for this
enhancement due to its two instruction (eight byte) cache. In the above
example, the main loop is two instructions, so it fits nicely into the
68010's cache. The example below, however, will cause continual reloading
of the cache during the loop because it is four instructions long.
Therefore, the reloading must be figured into the time for the loop below.
The 68000 and the 68020 do not share this problem because the 68000 has no
cache and the 68020 has plenty of room for a four instruction loop.

          0 < N < 2
          68000 and 68020 COMMENTS
          44 BYTES
          MOVEA.L FROM,A2 Load pointers into
          MOVEA.L TO,A3 the address registers
          MOVE.B (A2)+,(A3)+ Generate this if bit 0 of N is set
          MOVE.W (A2)+,(A3)+ Generate this if bit 1 of N is set
          MOVE.W #N&28/4-1,D0 Generate this to move 4 byte words
WORD MOVE.L (A2)+,(A3)+ These will move everything less than
          DBF D0,WORD 32 byte multiples
          MOVE.W #N/32-1,D0 N-1 because dbf stops at -1
          MOVEQ.L #1,D1 These 2 instructions replace
          ASR.L #5,D1 a MOVE.L #32,D1 and save 2 bytes
LOOP MOVEM.L (A2)+,D2-D7/A0-A1
          MOVEM.L D2-D7/A0-A1,(A3)
          ADD.L D1,A3 Auto-increment not allowed above
          DBF D0,LOOP Continue looping

          For values of N greater than those specified above, the
          following code should be added at the end:

          SUB.L #$10000,D0 Decrement the top 16 bits of the length
          BGE.L LOOP Continue looping if length is positive

                MINICOMPUTERS

The Data General MV8000 is a 16 bit pipelined minicomputer with a physical
address space of 4 megabytes and a virtual address space of 4 gigabytes
(newer DG machines, such as the DG 20000, can address up to 64 megabytes).
Although it is a 16 bit machine, most instructions will work on 32 bit
data. It has four 32 bit fixed-point accumulators and four 64 bit
floating-point accumulators. The MV8000 supports 8 bit, 16 bit, and 32
bit fixed-point arithmetic, as well as 64 bit floating- point arithmetic.
The MV8000 has block move instructions similar to those of the Z-80 and
the Intel chips. The code example is shown below.

                0 <= N < 4 GIGABYTES
        10 BYTES COMMENTS
    NLDAI N,1 Destination size in accumulator 0
    WMOV 0,1 Copy to source size
    LLEFB 3,FROM,0 Load source address
    LLEFB 2,TO,0 Load destination address
    WCMV Do the move

It is interesting to note a difference between microcomputers and the
larger machines at this point. None of the microcomputers listed here has
load doubleword/store doubleword instructions, while all the larger
machines have those instructions. This makes the case of small moves,
especially 8-byte moves, very efficient.

The Prime 9955 is a 32 bit pipelined minicomputer which can address up to
16 megabytes of physical memory and up to 512 megabytes of virtual memory.
It has an instruction cache of from 2 bytes to 16K bytes, which can also
be used for data. The 9955 has eight register sets, each containing
thirty-two 32 bit registers. Four of these are user register sets, but
only one set can be assigned to a given process. The 9955 supports 16 bit
and 32 bit fixed-point arithmetic, and 32 bit, 64 bit, and 128 bit
floating-point arithmetic. It has a block move instruction, shown below.

      0 < N < 65536 65536 < N < 64 MEGABYTES
        18 BYTES 26 BYTES COMMENTS
      LFLI 1,N LDL =N Load byte count
                                          TLFL 1 Transfer L to FL1
      EAFA 0,FROM EAFA 0,FROM Load source address
      EAFA 1,TO EAFA 1,TO Load destination address
      ZMVD ZMVD Move bytes

The Digital VAX 8700 is a 32 bit pipelined minicomputer with a physical
address range of 128 megabytes and a virtual address range of 4 gigabytes.
The VAX 8700 has a 16-byte instruction cache. It supports 8 bit, 16 bit,
32 bit, 64 bit, and 128 bit fixed-point arithmetic, and 32 bit, 64 bit,
and 128 bit floating-point arithmetic. The VAX 8700 has a large
repertoire of instructions and addressing modes, many of them specialized
for certain tasks. The VAX code for a short memcpy, shown below, is
shorter than any of the others. It uses only one instruction!

          0 < N < 65536
          14 BYTES
          MOVC3 #N,FROM,TO

          65536 <= N < 4 GIGABYTES COMMENTS
          MOVL #L,R1 Outer loop index; L = N / 65536
LOOP MOVC3 #65535,FROM,TO Move 64K - 1 each time
          SOBGTR R1,LOOP The '- 1' portion is moved below
          MOVC3 #M,FROM,TO M = N modulo 65536 + N /65536

              MAINFRAME

The IBM 4341 Model 2 is a 32 bit pipelined mainframe with 16 megabytes of
physical memory and 4 gigabytes of virtual memory. The instruction set
used on the 4341 series is basically the same as on its IBM 360 and 370
ancestors, with enhancements (for instance, the MVCL instruction, shown
below, did not exist on the 360, but the MVC did). The first two code
examples are actual sequences generated by the SAS/C (TM) compiler,
although the alternate sequences are not due to size and complexity. The
case of N > 16 megabytes is not shown here because it generates a
subroutine call. The alternate sequences shown for large N have been
shown empirically to be faster on the 4341 series model 2 and the 3081.
The MVC in a loop is faster than an MVCL because the MVCL is an
interruptable instruction, whereas the MVC is not. A 65536 byte move on
the 4341 takes 3.37 milliseconds using the MVCL code, and 2.93
milliseconds using the alternate sequence, both assuming the conditions in
table 1, footnote 3.

      0 < N <=256 256 < N < 16M COMMENTS
      14 bytes 20 bytes

      L 2,FROM L 2,FROM Load pointers
      L 4,TO L 4,TO Note: for MVCL, the count
                                            L 3,=F'N' must be in the odd registers
                                            LR 5,3 of both even-odd pairs
      MVC 0(N,4),0(2) MVCL 4,2 (M)o(V)e (C)haracters

      ALTERNATE (M = N MODULO 256)
      256 < N < 16M
      M NOT = ZERO M = ZERO COMMENTS
      44 bytes 30 bytes

          L 2,FROM L 2,FROM Load pointers
          L 4,TO L 4,TO
        MVC 0(M,4),0(2) Move remainder
          LA 2,M(2) Increment pointers
          LA 4,M(4)
          L 3,N/256 L 3,N/256 Move 256 bytes per
LOOP MVC 0(256,4),0(2) LOOP MVC 0,(256,4),0(2) iteration
          LA 2,256(2) LA 2,256(2) Increment pointers
          LA 4,256(4) LA 4,256(4)
          BCT 3,LOOP BCT 3,LOOP Continue looping

                  Analysis of Results

Table 1 gives sizes and timings for each of the host machines to move 255
bytes, based on timing information from the manufacturers. The amount of
255 bytes was an arbitrary decision and should not be construed to mean
that one machine is faster or better in any way. Rather, the purpose of
this discussion is to note trends as we move from smallest to largest
machine.

An obvious trend as we move from smallest to largest is the decrease in
amount of time required to perform the memcpy function. The size, however,
shows no correlation to size of machine. The microprocessors show a
definite decrease in cycle time as we move from 8 bit to 16 bit to 32 bit
chips.

Other trends could not be condensed into a table. For example, there seems
to be a general trend for the number of cycles per instruction to increase
as the machine size increases. Perhaps this is because each instruction
performs more work as size increases. Interestingly, though, the number
of cycles per instruction decreases within a family of processors as size
increases, as in the Intel family of 8088, 80286, and 80386 processors.

Some parallels between families of machines should be noted. The IBM 360
mainframe computers were introduced in the early 1960s, and they had many
restrictions on boundary alignment. For example, a doubleword (8 bytes)
had to be aligned on a doubleword boundary (evenly divisible by 8). Later
IBM mainframes such as the 370, 4300, and 3080 series no longer require
these alignments, but a time penalty must be paid for non-optimal
alignment. This phenomenon is also exhibited in the Motorola 68000 series.
As mentioned earlier, the 68000 cannot address anything larger than a byte
from an odd address, but the 68020 can. Empirical timings for moving 8
bytes one million times on a HP 9000/320 (which is a 16.67 Mhz 68020) are:

TYPE OF MOVE TIME(SECONDS)
eight 1 byte moves 3.6
two 4 byte moves, odd addresses 2.7
two 4 byte moves, even addresses 2.0
two 4 byte moves, 8 byte boundary 1.4

All emperical timings shown include the loop overhead. Timings to gauge
the effect of the 32 bit move vs. the move multiple were performed by
moving 10000 bytes 100000 times. The results are shown below. Note how
the 68010's two instruction cache diminishes the speed increase of the
move multiple instruction.

                                                                            32-BIT MOVE
MACHINE MOVE(SECONDS) MULTIPLE
Amiga 2000 (68000,7.14 Mhz) 1450 1014
Apollo DN300 (68010,12 Mhz) 799 775
Apollo DN3000 (68020,12 Mhz) 292 255

Empirical timing for the Intel family also show an adressing boundary
relationship, even though none of the Intel chips have any alignment
restrictions. The timings for moving 40000 bytes 200 times using the rep
movsw instruction sequence on the machines listed in table 1 are:

TYPE OF MOVE TIME(SECONDS)
byte move, even or odd address,8088 20.7
word move, even or odd address,8088 15.8

byte move, even or odd address,80286 6.4
word move, odd address, 80286 6.4
word move, even address, 80286 3.2

byte move, even or odd address,80386 2.1
16 bit word move, odd address, 80386 2.9
16 bit word move, even address, 80386 1.1
32 bit word move, odd address, 80386 2.3
32 bit word move, 2-byte boundary, 80386 2.3
32 bit word move, 4-byte boundary, 80386 .5

>From the above results, we can draw the general conclusion that if you
have to move from an odd address, you may as well use a byte move on the
Intel chips.

Timings for the DG memcpy, shown below, were for 8, 255, 256, 512, and
65536 bytes, each executed 1000 times. A time is given for both addresses
being even, both odd, and one odd/one even. Note the much faster execution
time for the 8 byte, both even example. This is due to the fact that a
load doubleword/store doubleword combination was generated for that case.
Also note the much longer times in the "by the book" row for 255 and 65536
byte moves. These times were calculated using the DG Principles of
Operations manual timing of 1.43 microseconds per byte for the wcmv
instruction, which is referred to as a maximum time.

The DG timings indicated an unfortunate decision by the DG C compiler
authors. Note that the timings are fastest for the case of both addresses
being even, then both odd, with the one even/one odd combination being
slowest, below 256 bytes. At 256 bytes and above, the compiler generates
a wblm (wide block move) for the even address case instead of the wcmv
(wide character move) that it generates for all other cases. As a result,
the even address moves above 255 bytes are slower than the odd address
move. The odd/even moves are the slowest in all cases, however, and I have
no plausible explanation as to why. My first assumption was that as the
pointers are incremented thru memory, the both odd and both even cases
will be addressed from an even boundary half the time, whereas the
odd/even case will always be addressing from one odd numbered location.
This explains why the odd case is faster than the odd/even case, but it
does not explain why the even case is faster than the odd case.

                                                    TIME IN MICROSECONDS
ADDRESSES 8 BYTES 255 BYTES 256 BYTES 512 BYTES 65536 BYTES
both even 6 55 92 178 42095
both odd 15 81 83 154 35070
odd/even 15 180 180 351 71189
by the book 13 366 368 734 93714

The only timings that were measured for the VAX were times of .43 seconds
to move 65535 bytes 128 times and .88 seconds to move 65535 bytes 255
times. This was measured on a VAX 8700 during the day under a "normal"
load, whatever that is.

The mainframe timings demonstrated how relative timing relationships
between two instructions can change from one machine to another. The
table below contains emperical timings from the IBM 4341 and the IBM 3081
for an MVC instruction in a loop, an MVCL instruction, and a LM/STM (load
multiple/store multiple) instruction pair in a loop. The code sequence
from the MC68000 microprcessors using MOVEM (move multiple) inspired us to
test the same kind of sequence on the mainframe. I used the same amount
of registers, loading and storing 32 bytes per iteration. Although the
MVC in a loop still wins the race, the LM/STM came in second on the IBM
3081. It ran a distant third when tested on the IBM 4341 and fell back to
third place when tested on the 3090. All tests moved 16K bytes 10000
times, and all times are in seconds.

TYPE OF TEST 3090 3081 4341

EMPTY LOOP .01 .02 .08
MVC 0.90 2.30 15.98
MVCL 1.18 3.49 18.29
LM/STM 1.65 3.37 29.05

                                  Conclusion

Although this has been a comparison of machines, it has not been a
competition between them. They each have their merits, and it would be
foolish to judge them on the basis of a single instruction sequence. Just
as I would not recommend the use of one and only one language for all
programming tasks, I would not recommend the use of only one machine. I
encourage the reader to learn to use as many different machines as
practical and enjoy the diversity.

        Table 1: 255 byte move statistics

                Code Cycle Total time to move
                Size Time Representative Machine 255 bytes
Machine bytes nsec. Name Speed(Mhz) usec. cycles
6502 12 559 Atari 400 1.79 2708 4848(1)
6809 20 1117 TRS-80 Color Computer 0.895 3010 2694
Z-80 11 500 TRS-80 Model III 2 2705 5410
8088 10 210 IBM PC/XT 4.77 685 3266
80286 10 125 IBM PC/AT 8 82 656
80386 62.5 Compaq Deskpro 386 16 29.4 471
68000 18 128 Apple Macintosh 7.83 334 2618
                44 (68000 move multiple) 188 1474
68020 22 63.8 Apple Macintosh II 15.6672 44.2 693(2)
                44 (68020 move multiple) 40.7 638
MV8000 20 - DG MV8000 - 366(4) -
9955 18 - Prime 9955 - - -
8700 14 45 VAX 8700 22 13.4(5) -
4341-2 14 120 IBM 4341 Model 2 8.3 to 10.3 86 to
                              to 240 4.2 43(3)

(1) Assumes page boundary crossed one-half of the time.
(2) Assumes an empty cache on entry.
(3) Assumes no operands cross doubleword boundaries, the index
        register is zero, and the operands are mutually aligned on
        doubleword boundaries. The 4341 has a variable cycle time,
        ranging from 120 to 240 nanoseconds.
(4) The 1.43 microsecond per byte timing given for the DG block move is
        referred to as a maximum time. The measured times were much faster.
(5) This timing was generated by moving 65535 bytes 255 times, then
        dividing the cpu time (.88 seconds) by 65535. Since it is a measured
        time and not "from the book", as the others are, take it with a large
        grain of salt.

Table 2: Summary of machine characteristics

                  Word Address space Pipe- Floating Instr. Virtual
Machine Size Phys. Virt. lined? point? cache? Machine?
6502 8 64K N(1) N N N N
6809 8 64K N(1) N N N N
Z-80 8 64K N(1) N N N N
8088 8/16(2) 1M N Y Y 4 N
80286 16 1M/16M(3) 16M Y Y 6 N
80386 32 4G 64T Y Y 256 Y
68000 32 16M 16M N Y N Y
68020 32 4G 4G Y Y 256 Y
DG 32 64M 4G Y
Prime 32 16M 512M 16K
Vax 32 4G 4G
IBM 32 16M 4G

(1) Bank switching on some models.
(2) 8 bit external bus width, 16 bit internal.
(3) 1 megabyte in real mode, 16 in protected mode.

                      Acknowledgements

I would like to acknowledge the invaluable assistance of the following
people: Alan Beale, Tony Fisher, Robert Cross, Mike Hecht, Larry Noe,
Rodney Radford, Brian Schellenberger, Dan Squillance, John Toebes, and Ed
Trumbull.

                          References

Programming the 6502, Rodnay Zaks, Sybex, 1980.
Programming the 6809, Rodnay Zaks and William Labak, Sybex, 1982.
Programming the Z-80, Rodnay Zaks, Sybex, 1982.
iAPX 86/88, 186/188 User's Manual, Intel Corp., 1985.
iAPX 286 Programmer's Reference, Intel Corp., 1985.
Introduction to the 80386, Intel Corp., 1986.
80386 Hardware Reference Manual, Intel Corp., 1986.
MC68000 Programmer's Reference Manual, Motorola, 1986.
MC68020 User's Manual, Motorola, 1985.
DG Eclipse MV/ Family 32 bit systems Principles of Operations,
  Data General Corp., January 1986.
Prime 50 Series Technical Summary, Prime Computer Inc., May 1983.
Prime Instruction Sets Guide, Prime Computer Inc., January 1985.
VAX Architecture Handbook, Digital Equipment Corp., 1981.
IBM 4341 Model 2 Functional Characteristics, IBM, 1981.
Digital Technical Journal, Digital Equipment Corporation, February 1987.

SAS/C is a trademark of SAS Institute Inc., Cary, NC, USA.
Copyright 1987 by SAS Institute Inc.
--

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Inline block moves

disque@unx.sas.com (Thomas Disque)Mon, 11 Nov 91 16:26:36 GMT

disque@unx.sas.com (Thomas Disque)
Mon, 11 Nov 91 16:26:36 GMT