Programming Assignment I: SIMD Programming

The purpose of this assignment is to familiarize yourself with SIMD (single instruction, multiple data) programming. Most modern processors include some form of vector operations (i.e., SIMD instructions), and some applications may take advantage of SIMD instructions to improve performance through vectorization. Although modern compilers support automatic vectorization optimizations, the capabilities of compilers to fully auto-vectorize a given piece of code are often limited. Fortunately, almost all compilers (targeted to processors with SIMD extensions) provide SIMD intrinsics to allow programmers to vectorize their code explicitly.

Please download the code and unzip it:

$ wget https://nctu-sslab.github.io/PP-s21/HW1/HW1.zip
$ unzip HW1.zip -d HW1
$ chmod -R 700 HW1 # avoid cheating
$ cd HW1

1. Vectorizing Code with Automatic Vectorization Optimizations

Take the exercises below and answer questions Q1, Q2, and Q3.

We are going to start from scratch and try to let the compiler do the brunt of the work. You will notice that this is not a “flip a switch and everything is good” exercise, but it also requires effort from the developer to write the code in a way that the compiler knows it can do these optimizations. The goal of this assignment is to learn how to fully exploit the optimization capabilities of the compiler such that in the future when you write code, you write it in a way that gets you the best performance for the least amount of effort.

Auto-vectorization is enabled by default at optimization levels -O2 and -O3. We first use -fno-vectorize to disable automatic vectorization, and start with the following simple loop (in test1.cpp):

void test1(float* a, float* b, float* c, int N) {
  __builtin_assume(N == 1024);
  
  for (int i=0; i<I; i++) {
    for (int j=0; j<N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}

We have added an outer loop over I whose purpose is to eliminate measurement error in gettime(). Notice that __builtin_assume(N == 1024) tells the compiler more about the inputs of the program—say this program is used in a mobile phone and always has the same input size—so that it can perform more optimizations.

You can compile this C++ code fragment with the following command and see the generated assembly code in assembly/test1.novec.s.

$ make clean; make test1.o ASSEMBLE=1

You are recommended to try out Compiler Explorer, a nifty online tool that provides an “interactive compiler”. This link is pre-configured for 10.0.1 version of clang and compiler flags from the makefile. (To manually configure yourself: select language C++, compiler version x86-64 clang 10.0.1 and enter flags -O3 -std=c++17 -Wall -fno-vectorize. A screenshot is shown below.

fno-vectorize

1.1 Turning on auto-vectorization

Let’s turn the compiler optimizations on and see how much the compiler can speed up the execution of the program.

We remove -fno-vectorize from the compiler option to turn on the compiler optimizations, and add -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize to get more information from clang about why it does or does not optimize code. This was done in the makefile, and you can enable auto-vectorization by typing the following command, which generates assembly/test1.vec.s.

$ make clean; make test1.o ASSEMBLE=1 VECTORIZE=1

You should see the following output, informing you that the loop has been vectorized. Although clang does tell you this, you should always look at the assembly to see exactly how it has been vectorized, since it is not guaranteed to be using the vector registers optimally.

test1.cpp:14:5: remark: vectorized loop (vectorization width: 4, interleaved count: 2) [-Rpass=loop-vectorize]
    for (int j=0; j<N; j++) {
    ^

You can observe the difference between test1.vec.s and test1.novec.s with the following command or by changing the compiler flag on Compiler Explorer.

$ diff assembly/test1.vec.s assembly/test1.novec.s

1.2 Adding the __restrict qualifier

Now, if you inspect the assembly code—actually, you don’t need to do that, which is out of the scope of this assignment—you will see the code first checks if there is a partial overlap between arrays a and c or arrays b and c. If there is an overlap, then it does a simple non-vectorized code. If there is no overlap, it does a vectorized version. The above can, at best, be called partially vectorized.

The problem is that the compiler is constrained by what we tell it about the arrays. If we tell it more, then perhaps it can do more optimization. The most obvious thing is to inform the compiler that no overlap is possible. This is done in standard C by using the restrict qualifier for the pointers. By adding this type qualifier, you can hint to the compiler that for the lifetime of the pointer, only the pointer itself or a value directly derived from it (such as pointer + 1) will be used to access the object to which it points.

C++ does not have standard support for restrict, but many compilers have equivalents that usually work in both C++ and C, such as the GCC’s and clang’s __restrict__ (or __restrict), and Visual C++’s __declspec(restrict).

The code after adding the __restrict qualifier is shown as follows.

void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);

  for (int i=0; i<I; i++) {
    for (int j=0; j<N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}

Let’s modify test1.cpp accordingly and recompile it again with the following command, which generates assembly/test1.vec.restr.s.

$ make clean; make test1.o ASSEMBLE=1 VECTORIZE=1 RESTRICT=1

Now you should see the generated code is better—the code for checking possible overlap is gone—but it is assuming the data are NOT 16 bytes aligned (movups is unaligned move). It also means that the loop above can not assume that the arrays are aligned.

If clang were smart, it could test for the cases where the arrays are either all aligned, or all unaligned, and have a fast inner loop. However, it is unable to do that currently.🙁

1.3 Adding the __builtin_assume_aligned intrinsic

In order to get the performance we are looking for, we need to tell clang that the arrays are aligned. There are a couple of ways to do that. The first is to construct a (non-portable) aligned type, and use that in the function interface. The second is to add an intrinsic or three within the function itself. The second option is easier to implement on older code bases, as other functions calling the one to be vectorized do not have to be modified. The intrinsic has for this is called __builtin_assume_aligned:

void test(float* __restrict a, float* __restrict b, float* __restrict c, int N) {
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 16);
  b = (float *)__builtin_assume_aligned(b, 16);
  c = (float *)__builtin_assume_aligned(c, 16);
  
  for (int i=0; i<I; i++) {
    for (int j=0; j<N; j++) {
      c[j] = a[j] + b[j];
    }
  }
}

Let’s modify test1.cpp accordingly and recompile it again with the following command, which generates assembly/test1.vec.restr.align.s.

$ make clean; make test1.o ASSEMBLE=1 VECTORIZE=1 RESTRICT=1 ALIGN=1

Let’s see the difference:

$ diff assembly/test1.vec.restr.s assembly/test1.vec.restr.align.s

Now finally, we get the nice tight vectorized code (movaps is aligned move.) we were looking for, because clang has used packed SSE instructions to add 16 bytes at a time. It also manages load and store two at a time, which it did not do last time. The question is now that we understand what we need to tell the compiler, how much more complex can the loop be before auto-vectorization fails.

1.4 Turning on AVX2 instructions

Next, we try to turn on AVX2 instructions using the following command, which generates assembly/test1.vec.restr.align.avx2.s

$ make clean; make test1.o ASSEMBLE=1 VECTORIZE=1 RESTRICT=1 ALIGN=1 AVX2=1

Let’s see the difference:

$ diff assembly/test1.vec.restr.align.s assembly/test1.vec.restr.align.avx2.s

We can see instructions with prefix v*. That’s good. We confirm the compiler uses AVX2 instructions; however, this code is still not aligned when using AVX2 registers.

Q1: Fix the code to make sure it uses aligned moves for the best performance.

Hint: we want to see vmovaps rather than vmovups.

1.5 Performance impacts of vectorization

Let’s see what speedup we get from vectorization. Build and run the program with the following configurations, which run test1() many times, and record the elapsed execution time.

# case 1
$ make clean && make && ./test_auto_vectorize -t 1
# case 2
$ make clean && make VECTORIZE=1 && ./test_auto_vectorize -t 1
# case 3
$ make clean && make VECTORIZE=1 AVX2=1 && ./test_auto_vectorize -t 1

Note that you may wish to use the workstations provided by this course, which support AVX2; otherwise, you may get a message like “Illegal instruction (core dumped)”. You can check whether or not a machine supports the AVX2 instructions by looking for avx2 in the flags section of the output of cat /proc/cpuinfo.

$ cat /proc/cpuinfo | grep avx2

Q2: What speedup does the vectorized code achieve over the unvectorized code? What additional speedup does using -mavx2 give (AVX2=1 in the Makefile)? You may wish to run this experiment several times and take median elapsed times; you can report answers to the nearest 100% (e.g., 2×, 3×, etc). What can you infer about the bit width of the default vector registers on the PP machines? What about the bit width of the AVX2 vector registers.

Hint: Aside from speedup and the vectorization report, the most relevant information is that the data type for each array is float.

You may also run test2() and test3() with ./test_auto_vectorize -t 2 and ./test_auto_vectorize -t 2, respectively, before and after fixing the vectorization issues in Section 2.6.

1.6 More examples

1.6.1 Example 2

Take a look at the second example below in test2.cpp:


void test2(float *__restrict a, float *__restrict b, float *__restrict c, int N)
{
  __builtin_assume(N == 1024);
  a = (float *)__builtin_assume_aligned(a, 16);
  b = (float *)__builtin_assume_aligned(b, 16);
  c = (float *)__builtin_assume_aligned(c, 16);

  for (int i = 0; i < I; i++)
  {
    for (int j = 0; j < N; j++)
    {
      /* max() */
      c[j] = a[j];
      if (b[j] > a[j])
        c[j] = b[j];
    }
  }
}

Compile the code with the following command:

make clean; make test2.o ASSEMBLE=1 VECTORIZE=1

Note that the assembly was not vectorized. Now, change the function with a patch file (test2.cpp.patch), which is shown below, by running patch -i ./test2.cpp.patch.

--- test2.cpp
+++ test2.cpp
@@ -14,9 +14,8 @@
     for (int j = 0; j < N; j++)
     {
       /* max() */
-      c[j] = a[j];
-      if (b[j] > a[j])
-        c[j] = b[j];
+      if (b[j] > a[j]) c[j] = b[j];
+      else c[j] = a[j];
     }
   }

Now, you actually see the vectorized assembly with the movaps and maxps instructions.

Q3: Provide a theory for why the compiler is generating dramatically different assembly.

1.6.2 Example 3

Take a look at the third example below in test3.cpp:

double test3(double* __restrict a, int N) {
  __builtin_assume(N == 1024);
  a = (double *)__builtin_assume_aligned(a, 16);

  double b = 0;

  for (int i=0; i<I; i++) {
    for (int j=0; j<N; j++) {
      b += a[j];
    }
  }

  return b;
}

Compile the code with the following command:

$ make clean; make test3.o ASSEMBLE=1 VECTORIZE=1

You should see the non-vectorized code with the addsd instructions.

Notice that this does not actually vectorize as the xmm registers are operating on 8 byte chunks. The problem here is that clang is not allowed to re-order the operations we give it. Even though the the addition operation is associative with real numbers, they are not with floating point numbers. (Consider what happens with signed zeros, for example.)

Furthermore, we need to tell clang that reordering operations is okay with us. To do this, we need to add another compile-time flag, -ffast-math. Compile the program again with the following command:

$ make clean; make test3.o ASSEMBLE=1 VECTORIZE=1 FASTMATH=1

You should see the vectorized code with the addpd instructions.

2. Requirements

You will need to meet the following requirements and answer the questions (marked with “Q1~Q3”) in a REPORT using HackMD. (Markdown is a common format that is widely used for developer documentation (e.g., GitHub), and HackMD is a markdown service which is free, powerful, and the most importantly made in Taiwan. To learn Markdown, you may refer to the video listed in the references.)

Answer the three questions (Q1, Q2, and Q3). We don’t test your code. If you have code for answering the questions, show the code and explain it thoroughly in your report.

3. Grading Policy

NO CHEATING!! You will receive no credit if you are found cheating.

Total of 30%:

4. Evaluation Platform

Your program should be able to run on UNIX-like OS platforms. We will evaluate your programs on the workstations dedicated for this course. You can access these workstations by ssh with the following information. (To learn how to use ssh and scp, you can refer to the video listed in the references)

The workstations are based on Ubuntu 18.04 with Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz processors. g++-10(used in part1) and clang++-11(used in part2) have been installed.

IP Port User Name Password
140.113.215.195 37076 ~ 37080 {student_id} {Provided by TA}

Login example:

$ ssh <student_id>@140.113.215.195 -p <port>

6. Submission

All your files should be organized in the following hierarchy and zipped into a .zip file, named HW1_xxxxxxx.zip, where xxxxxxx is your student ID.

Directory structure inside the zipped file:

Zip the file:

$ zip HW1_xxxxxxx.zip -r url.txt

Notice that you just need to provide the URL of your HackMD report in url.txt, and enable the write permission for someone who knows the URL so that TAs can give you feedback directly in your report. If the URL is unreachable, you won’t get the points.

Be sure to upload your zipped file to new E3 e-Campus system by the due date.

You will get NO POINT if your ZIP’s name is wrong, the ZIP hierarchy is incorrect

Due Date: 23:59, March 22, Monday, 2021

7. References