We know already how to read data from memory (LOAD), how to perform basic vertical and horizontal operations, and how to write results back to memory (STORE). But operating on very short vectors is not particularly useful, as we usually want to handle large data sets. The solution is very simple: just use C++ loops. Consider following example:
Example: Given scalar coefficients a and b, and two arrays x and y, compute a set of variables given by N linear equations:
Solution:
Since we don’t know how big (or small) N is, we have to consider the case, when N is not a multiple of number of elements packed in SIMDVec (we will call this value simd stride). We therefore have to split N in two parts:
- part executed using SIMDVec code (peel loop), and
- part executed using scalar code (remainder loop).
For the actual code we will assume the SIMD stride to be 4 and packed precision type to be float.
#include <umesimd/UMESimd.h> using namespace UME::SIMD; //... void linear_equation_evaluator( float *y, float *x, float a, float b, int N) { SIMDVec<float, 4> x_vec, y_vec; SIMDVec<float, 4> a_vec(a), b_vec(b); int REMAINDER_COUNT=N%4; int REMAINDER_OFFSET=N-REMAINDER_COUNT; // Compute 'peel' part: for(int i=0; i<N; i+=4) { x_vec.load(&x[i]); y_vec=a_vec*x_vec+b_vec; y_vec.store(&y[i]); } // Compute 'remainder' part: for(int i=REMAINDER_OFFSET; i<N; i++) { y[i]=a*x[i]+b; } } //...
The above code structure is worth memorizing. It will appear in UME::SIMD codes very often. What we do first is to reserve some registers for x and y values. Wa also prepare vectors containing constants a and b. Additional necessary information in form of remainder loop offset is calculated as REMAINDER_OFFSET.
We start computations by iterating data elements in chunks of 4 (the SIMD stride). This peel loop is the major source of our performance-gain: instead of processing one element at a time (as happens in remainder loop), we are processing up to SIMD-stride elements. In theory this should give us up to 4x speedup! (In practice this will rarely be the case. We will discuss this in one of future posts dedicated to portability and selection of proper stride.) This speedup is due to the fact that UME::SIMD library is internally using hardware based mechanisms for data level parallelism, such as specialized CPU instructions.
Finally we calculate remainder loop, which performs the calculations over the remainig elements. Mind that remainder loop in this case is just regular scalar code you would normally write for solving the problem presented in the example. The only difference is the initial value of the loop index i. For practical scenarios, be advised to first write the scalar version of your kernel, as you might still need it for remainder part of your vectorized kernel. Having the scalar version is also important for results verification.
The function presented in our solution can only be simplified if we know from the definition of the problem that N is multiple of 4. In such scenario, we could completely remove remainder loop from the function definition. Alternatively if we know that N is always less than 4, we could completely remove the peel part, as it will never be executed. Unfortunately for most scenarios this will mean some sort of code duplication. And code duplication is always bad: twice the bugs, twice the testing, quadruple the maintenance time… The next post will show you how this code duplication could be avoided.