vectorization in icc

"Bik, Aart" <>
3 Dec 2002 00:43:39 -0500

          From comp.compilers


From: "Bik, Aart" <>
Newsgroups: comp.compilers
Date: 3 Dec 2002 00:43:39 -0500
Organization: Compilers Central
References: 02-11-173 02-12-006
Keywords: parallel, performance
Posted-Date: 03 Dec 2002 00:43:38 EST

>I've been experimenting with the Intel C/C++ compiler for Linux, and in
>particular, with the automatic vectorization.
You did not copy and paste the full context of the loop and the resulting
assembly code, so I am unable to determine whether all arrays are aligned at
16-byte boundaries (although some of them clearly are). Let's assume the code
reads something like this:
char d[16], dm[16], mm[16], B[16]; /* 16-byte alignment enforced by Intel
                                      compiler */
void doit(void) {
    int j;
    for( j = 0; j < 16; j++ ) {
        d[ j ] = d[ j ] + d[ j ];
        d[ j ] = d[ j ] | B[ j ];
        dm[ j ] = d[ j ] & mm[ j ];
    }
}
Then the assembly after automatic vectorization with the Intel compiler will
consist of a full SIMD version of the function doit(), as shown below, where
16-byte alignment has been automatically enforced on the external arrays:
                PUBLIC _doit
                movdqa xmm1, XMMWORD PTR _d
                movdqa xmm0, XMMWORD PTR _mm
                paddb xmm1, xmm1
                por xmm1, XMMWORD PTR _B
                movdqa XMMWORD PTR _d, xmm1
                pand xmm1, xmm0
                movdqa XMMWORD PTR _dm, xmm1
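As an aside, the same sequence can be written by hand with SSE2 intrinsics,
which makes the mapping from the loop to the instructions explicit. This is
only a sketch and not compiler output: the name doit_simd is hypothetical, the
arrays are declared unsigned char so that the byte-wise wraparound is well
defined, and the alignment is requested explicitly with C11 _Alignas rather
than enforced automatically by the compiler.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Assumption: alignment requested explicitly instead of being
   enforced by the compiler. */
_Alignas(16) unsigned char d[16];
_Alignas(16) unsigned char dm[16];
_Alignas(16) unsigned char mm[16];
_Alignas(16) unsigned char B[16];

/* Hypothetical hand-written counterpart of the vectorized doit(). */
void doit_simd(void) {
    /* Aligned loads and stores (movdqa) need 16-byte aligned addresses. */
    assert((uintptr_t)d % 16 == 0 && (uintptr_t)dm % 16 == 0);
    assert((uintptr_t)mm % 16 == 0 && (uintptr_t)B % 16 == 0);

    __m128i vd  = _mm_load_si128((const __m128i *)d);   /* movdqa load  */
    __m128i vmm = _mm_load_si128((const __m128i *)mm);  /* movdqa load  */
    __m128i vb  = _mm_load_si128((const __m128i *)B);

    vd = _mm_add_epi8(vd, vd);                          /* paddb        */
    vd = _mm_or_si128(vd, vb);                          /* por          */
    _mm_store_si128((__m128i *)d, vd);                  /* movdqa store */
    vd = _mm_and_si128(vd, vmm);                        /* pand         */
    _mm_store_si128((__m128i *)dm, vd);                 /* movdqa store */
}
```

One call processes all 16 bytes at once, which is why no loop remains. If any
of the arrays could be unaligned, the aligned loads and stores would have to be
replaced by _mm_loadu_si128 and _mm_storeu_si128.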
Calling doit() 10000000 times on a 2.66GHz Pentium 4 processor gives an
execution time of 0.28s for the serial version (/O2) against 0.11s for the
compiler-vectorized version listed above (/QxW), which is more than twice as
fast. Likewise, when the arrays are local arrays (as implied by your assembly
code), as in:
void doit(void) {
    char d[16], dm[16], mm[16], B[16]; /* 16-byte alignment of stack frame +
                                          arrays will be enforced by the Intel
                                          compiler */
    /* ... same loop as in the first version ... */
}
Then the Intel compiler will enforce a 16-byte alignment of the stack frame +
local arrays, which yields similar code and speedup. Of course, neither
experiment is representative of how the loop may behave in your real-life
application (memory effects, for example, are not accounted for at all, the
effective alignment of some of the arrays may be suboptimal, or some of the
arrays may be accessed through pointers). But it shows the potential
performance boost that could be obtained.
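For anyone who wants to repeat this kind of measurement, a timing harness
could look roughly like the sketch below. It is not the harness used for the
numbers above: the helper time_calls is hypothetical, and it uses the standard
clock() function, which measures CPU time at a fairly coarse resolution.
Calling through a function pointer is an attempt to keep the compiler from
inlining the call and optimizing the repeated work away, which it may still do
in some settings.

```c
#include <time.h>

char d[16], dm[16], mm[16], B[16];

/* The loop under test, as in the global-array version of doit(). */
void doit(void) {
    int j;
    for (j = 0; j < 16; j++) {
        d[j] = d[j] + d[j];
        d[j] = d[j] | B[j];
        dm[j] = d[j] & mm[j];
    }
}

/* Hypothetical helper: calls fn() the given number of times and
   returns the elapsed CPU time in seconds. */
double time_calls(void (*fn)(void), long calls) {
    long i;
    clock_t start = clock();
    for (i = 0; i < calls; i++) {
        fn();
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Building this once with /O2 and once with /QxW and printing
time_calls(doit, 10000000L) from main() gives the two times to compare.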
If you could provide more information on the context in which your loop is
used, I may be able to help you get a similar performance boost.
Aart Bik, Senior Staff Engineer, SSG, Intel Corporation
2200 Mission College Blvd. SC12-301, Santa Clara CA 95052
