Category Archives: Vectorization

SIMD approaches in C++ – comparison

One of the key elements of developing high-performance modern software is the use of SIMD instructions. I dedicated a series of posts here to the library I’ve been developing for a while: UME. This is just one approach, the one I like the most, but there are others.

In this post I will briefly discuss the pros and cons of each of the popular approaches. This is a lightweight post intended to give you an overview of how you can approach your problem. I won’t delve into the gory technical details, but rather try to point you to some external links where you can learn more about each specific technique. I will also limit the discussion to tools available for C++ projects.

Autovectorization

From the programmer’s perspective this is the simplest and most pleasing concept: write your code normally and rely on the compiler to generate SIMD instructions. This is the biggest advantage of this approach. Unfortunately, it is not that simple in practice.

First of all, the compiler needs to be aware of SIMD instructions. If you are using g++, clang++ or some proprietary compiler developed specifically for this purpose, then you are on the safe side. If the compiler is developed in-house, or targets a niche language, then this approach will probably not work.

Secondly, the compiler cannot vectorize arbitrary code. Autovectorization is usually limited to loops, and not all loops are eligible. In most cases the code has to be rewritten to meet some specific criteria:

  • the number of iterations has to be countable – that is, it has to be known before the loop starts running,
  • no data dependencies between iterations – results of later iterations cannot use results from preceding iterations,
  • no control flow that breaks the loop – in the C++ context this means no break and goto statements.
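The criteria above can be illustrated with a minimal sketch. This loop is countable, carries no cross-iteration dependency and has no early exit, so g++ and clang++ can typically autovectorize it at -O2/-O3 (you can verify with flags such as -fopt-info-vec or -Rpass=loop-vectorize):

```cpp
// Countable trip count, independent iterations, no break/goto:
// a classic autovectorization candidate.
void saxpy(float* y, const float* x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
// In contrast, 'y[i] = y[i-1] + x[i]' carries a loop dependency
// and will stay scalar.
```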

The third problem is performance stability. Imagine that you are working on a project where you are responsible for providing performant components. Suddenly someone is given the task of extending the functionality of primitives that you tediously tuned for performance. They modify one hot loop with a seemingly innocent additional statement. The change makes it into the baseline and goes to the customer. Suddenly you get tons of bug reports saying that between the two versions there is, say, an 8x slow-down… Now guess who is going to be blamed for that?

Some more reading about autovectorization:
GCC gives a nice list of loops that can be autovectorized.
A nice post on how SIMD can be used in Java can be found here.
A guide to using autovectorization with the Intel compiler.

One additional technology that I would also put into this category is OpenMP. (I know, some people will not agree, but we are only talking about #pragma omp simd, which is a very small feature of the whole standard.) The approach is based on annotating loops or functions with directives informing the compiler that it can safely replace a scalar construct with its SIMD equivalent. The only difference from basic autovectorization is that it is easier to inform the compiler that the specific criteria for loop vectorization are fulfilled. It also requires a compliant compiler.
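For illustration, a hypothetical kernel annotated with the OpenMP SIMD directive might look like the sketch below. Compiled with e.g. g++ -fopenmp-simd, the pragma asserts that the iterations are safe to vectorize; without OpenMP support the pragma is simply ignored and the loop stays correct:

```cpp
// The directive tells the compiler the loop iterations are independent,
// so the usual autovectorization legality analysis can be skipped.
void scale(float* y, const float* x, float a, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i];
}
```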

Assembly

The crudest way of programming. Regardless of what people tell you, this is still used. There is a plethora of disadvantages to manual assembly: validation, portability, problems with compiler interaction, code maintenance etc. The way I see it, there is one perfectly good reason why you might use it: WYWIWYG (What You Want Is What You Get). When you write a list of assembly instructions, you get exactly what you requested. No transformations are performed by the compiler, except for mapping register names. This gives you almost total control over what will happen inside your kernels.

Using SIMD instructions with this approach does not differ from regular assembly programming, so you have to be familiar with inline assembly and the instruction set details (Intel and ARM as examples).
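As a taste of what this looks like, here is a minimal sketch using GCC/Clang extended inline assembly on x86-64 (SSE, AT&T syntax). It adds four packed floats; this is not portable and exists purely to show the register-level control you get:

```cpp
// c[0..3] = a[0..3] + b[0..3] using one ADDPS instruction (x86-64 only).
void add4(float* c, const float* a, const float* b) {
    __asm__ volatile(
        "movups (%1), %%xmm0 \n\t"   // load 4 floats from a
        "movups (%2), %%xmm1 \n\t"   // load 4 floats from b
        "addps  %%xmm1, %%xmm0 \n\t" // packed single-precision add
        "movups %%xmm0, (%0) \n\t"   // store 4 results to c
        :
        : "r"(c), "r"(a), "r"(b)
        : "xmm0", "xmm1", "memory");
}
```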

SIMD Intrinsic functions

Intrinsic functions are a form of unofficial extension to the language. Because autovectorization requires a lot of work on both compiler theory and engineering, a simpler solution was offered in the early days of SIMD instruction sets. The basic idea is that you write your code using functions and special register types. As opposed to autovectorization, the use of SIMD operations is stated explicitly by the programmer. This reduces the problem of performance stability to just a few corner cases. This approach has been shown many times to be a sufficient methodology for reaching near-peak performance.

There are a few drawbacks. First of all, intrinsics only expose instructions (or groups of instructions, if you think of SVML). This means that you still have to program for a specific instruction set, so portability is heavily limited.

To address the portability problem, you have to look at your project design, decide which kernels you want to specialize for different architectures, and provide some dispatching mechanism. This is cumbersome, as it requires you to conditionally include fragments of your implementation. While programming with intrinsics is not that difficult, the structural requirements for big libraries/applications can be pretty heavy.
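The compile-time side of such dispatching can be sketched with predefined ISA macros (a runtime CPUID-based dispatcher is the other common option). The per-ISA header names in the comments are hypothetical:

```cpp
// Select a kernel variant per instruction set at compile time.
#if defined(__AVX2__)
static const char* kernel_variant = "avx2";   // would include e.g. kernels_avx2.h
#elif defined(__SSE2__)
static const char* kernel_variant = "sse2";   // would include e.g. kernels_sse2.h
#else
static const char* kernel_variant = "scalar"; // portable fallback
#endif
```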

Another problem is name mangling. Intrinsic functions were developed to address the problem of direct access to the instructions from the C/C++ level. This means that function declarations have to conform to the C standard. A very simple operation such as addition cannot simply be named _add(…), as this would have to be overloaded for all sorts of register types. This means that there are multiple addition functions, each of them having a different name and input types. For instance, a search in the Intel intrinsics guide gives something like this:

[Image: Intel Intrinsics Guide search results for “add”]

Encoding/decoding the names is not that difficult once you get familiar with the convention, but it is still slower than just writing ‘A + B’. You can get used to that, but if you want to use the full horsepower of C++, you will have to develop wrappers just to make the code a little easier to handle.
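To make the naming convention concrete, here is the same conceptual operation, ‘A + B’, under a few of its mangled SSE/SSE2 names, one per register type (AVX repeats the pattern with an _mm256_ prefix):

```cpp
#include <immintrin.h>

float first_of_sum() {
    __m128  f = _mm_add_ps(_mm_set1_ps(1.0f), _mm_set1_ps(2.0f));      // 4 x float
    __m128d d = _mm_add_pd(_mm_set1_pd(1.0),  _mm_set1_pd(2.0));       // 2 x double
    __m128i i = _mm_add_epi32(_mm_set1_epi32(1), _mm_set1_epi32(2));   // 4 x int32
    (void)d; (void)i;            // unused here; shown for the naming only
    return _mm_cvtss_f32(f);     // extract lowest float lane
}
```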

The last thing, a problem in some situations and a feature in others, is that intrinsics don’t map exactly to individual instructions. The compiler has to perform some additional operations when managing registers (think of register spill/fill). So the code you write with intrinsics will not be exactly what you would get writing inline assembly. Because the compiler is involved, it can replace some operations with others, for instance recognizing FMA or streaming-store opportunities, actually improving performance.

If you ever want to use this approach on x86 platforms, this is the link you have to know. You can find similar documentation for ARM.

Explicit Vectorization Libraries aka. Wrapper Classes

In order to improve maintenance and portability, several attempts have been made at providing a better type-oriented interface for intrinsic functions. A few libraries you might be interested in are: Vc, VCL, boost::simd and of course UME::SIMD. What is most important about this approach is that it tries to squeeze all instruction sets under an umbrella of uniform data types. It also makes use of C++ features such as operator overloading to make the notation shorter and easier to work with.
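A minimal sketch of the wrapper-class idea is shown below. The Vec4f type is hypothetical and is not the actual API of Vc, VCL, boost::simd or UME::SIMD; it only illustrates how operator overloading hides the mangled intrinsic names:

```cpp
#include <immintrin.h>

// One uniform type wrapping an SSE register.
struct Vec4f {
    __m128 r;
    Vec4f(float f) : r(_mm_set1_ps(f)) {}    // broadcast a scalar
    Vec4f(__m128 v) : r(v) {}
    float operator[](int i) const {          // lane access, for inspection
        float t[4]; _mm_storeu_ps(t, r); return t[i];
    }
};
// Now 'a + b' compiles down to a single ADDPS.
inline Vec4f operator+(Vec4f a, Vec4f b) { return Vec4f(_mm_add_ps(a.r, b.r)); }
inline Vec4f operator*(Vec4f a, Vec4f b) { return Vec4f(_mm_mul_ps(a.r, b.r)); }
```

With this in place, a kernel like `Vec4f c = a * x + b;` reads like scalar code while still mapping to SIMD instructions.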

This approach is very nice in terms of performance portability, although it can still have a few caveats. For instance, in-register permutations are heavily limited by instruction sets, so it might be possible to speed them up with a more direct approach, such as intrinsics.

In terms of platform portability, this is perfect. You can write code once and run it anywhere, albeit with no performance promises. From what I’ve seen this is not a big issue, as most architectures are already covered by existing implementations.

Code written using this approach still has to be more explicit than with autovectorization, so it might require the user to understand the SIMD computation model. The user still has to manage peel and remainder loops, as well as operate on vector registers. This is definitely a drawback, as it dictates a certain additional effort in the structural part of the user code.
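The structural pattern the user manages is sketched below with raw SSE intrinsics for brevity (a wrapper library would replace the intrinsic calls with overloaded operators, but the two-loop shape stays the same):

```cpp
#include <immintrin.h>

// Main vector loop over full 4-wide registers, plus a scalar
// remainder loop for the leftover n % 4 elements.
void vec_add(float* c, const float* a, const float* b, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(c + i,
            _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; i++)           // remainder loop
        c[i] = a[i] + b[i];
}
```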

Embedded Domain Specific Languages (EDSL)

This is barely a category, as I only want to present here one specific library: UME::VECTOR. In a sense this library is unique, although the concepts have been used in some way in both Vc and boost::simd, and it has an interface predecessor in C++ in std::valarray. The fundamental idea behind this approach is that while explicit SIMD libraries only address the problem of instruction selection, the DSL approach allows handling of any array-like containers.

The approach gives very efficient code generation, provided that a given implementation can expose it. Secondly, a more efficient, if not optimal, memory access pattern is handled by the library. Thirdly, the interface does not use the concept of SIMD registers, but rather 1-D vector arithmetic.

For instance, UME::VECTOR exposes a set of 1-D vector types and operations that use lazy code generation: the code is generated not on a per-operation basis, but rather at the end of the definition of a computational kernel. Once the whole kernel is known, the compiler uses it, together with a library-defined code generation scheme, to generate the actual list of CPU instructions. The library uses UME::SIMD for generating target-specific code, so it does not have to re-implement everything from scratch.

The drawback is: expression templates and their debugging… Because the result of an EDSL operation is not a value, but rather a nested type representing the operations to be carried out, the final operation and the way it is handled might seem like gibberish even to advanced programmers. This is the main reason why std::valarray is not very popular. The ET technique is an idiom that can be well understood only if you know template metaprogramming and have some clue about code generation. If you know the concept, this should be a perfect approach for you. If you don’t, well, you either gain some basic understanding or use one of the previous approaches.
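To show what the expression-template idiom looks like, here is a stripped-down, purely illustrative sketch (this is not UME::VECTOR’s actual implementation). ‘a + b’ builds a lightweight node type holding references to its operands; no arithmetic happens until assignment, which evaluates the whole expression in a single fused loop:

```cpp
#include <vector>
#include <cstddef>

// The kind of type 'a + b' produces: a node referring to its operands.
template<typename L, typename R>
struct AddExpr {
    const L& l; const R& r;
    float operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::vector<float> data;     // assumed pre-sized before assignment
    float operator[](std::size_t i) const { return data[i]; }
    template<typename E>
    Vec& operator=(const E& e) { // the evaluation point: one fused loop
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = e[i];
        return *this;
    }
};

inline AddExpr<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }
template<typename L, typename R>
AddExpr<AddExpr<L, R>, Vec> operator+(const AddExpr<L, R>& a, const Vec& b) { return {a, b}; }
```

Note how the type of `a + b + b` is already `AddExpr<AddExpr<Vec, Vec>, Vec>`; compiler diagnostics for such nested types are exactly the “gibberish” mentioned above.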

JIT

This approach is rising in popularity, and I think it already has a pretty broad range of use cases. Two popular libraries that implement JIT assembly and are ready to be used are asmjit and xbyak. They primarily solve the problem of code being generated statically, so they might be a solution for situations where all previous approaches failed. The basic idea is to build computational kernels at runtime, instead of at compile time. This requires additional overhead, but only at initialization time. Once kernels are built, they can be executed as normal functions from regular C/C++ code.
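The core mechanism, stripped to its bones, is shown below for x86-64 on a POSIX system: emit machine code into an executable buffer at runtime, then call it as a function. Libraries like asmjit and xbyak wrap exactly this with a proper assembler API instead of hand-written byte sequences:

```cpp
#include <sys/mman.h>
#include <cstring>

typedef int (*fn_t)();   // a generated function: no arguments, returns int

// Emit 'mov eax, imm32; ret' and hand it back as a callable function.
fn_t make_return_const(int value) {
    unsigned char code[6] = { 0xB8, 0, 0, 0, 0, 0xC3 };
    std::memcpy(code + 1, &value, 4);             // patch the immediate
    void* mem = mmap(nullptr, sizeof(code),
                     PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return nullptr;
    std::memcpy(mem, code, sizeof(code));
    return reinterpret_cast<fn_t>(mem);
}
```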

The first benefit is that you don’t need to provide any target-specific binaries to target different architectures. This removes some packaging problems and the handling of many build-system configurations. The instruction set can be decided at runtime, and only a dedicated version of a kernel has to be generated.

The second benefit is that you can decide when kernels have to be generated. This means that you can wait until the very last moment, and even generate a kernel suitable for your input data! Can you imagine the new possibilities…?

Now the drawbacks: the first one is, again, portability. So far I haven’t seen a good standalone library that addresses code generation for multiple architectures. JIT assemblers target specific instruction sets only. Compiler toolchains such as LLVM allow you to develop a JIT compiler, but they don’t offer a deployable solution.

The second issue is programming complexity: with JIT assemblers you don’t have an abstraction over the instruction set. If you want to use a JIT compiler, then the problem might be the need to understand some compiler design practices, as well as to develop a form of an embedded language.

Summary

To summarize, I would rank the different approaches in the following categories:

  • Performance in basic, most common scenarios (like ‘A+B’),
  • Performance in corner-case situations (like permutations or data-dependent control flow),
  • Performance portability (how likely is it to see performance degradation of the same code on another platform?),
  • Performance stability (how resilient is the performance of a specific fragment of code to changes in its near neighborhood?),
  • Target portability (can the code be re-compiled for another platform?),
  • User code structure (does the user need to write any additional code except for the SIMD-sensitive parts?),
  • Architecture knowledge (how much does the user need to know about a specific instruction set?),
  • SIMD knowledge (how much does the user need to know to efficiently use SIMD instructions?),
  • Additional programming knowledge (are there any framework conventions or specific idioms that need to be known?)

As for the grades I am giving, these are completely subjective values in the range 1-6. The higher the value, the better an approach scores in a given category.

I might be biased by what I already know, and I also considered the different ranking categories in the context of existing tools. Things might change as all approaches evolve, so this is not a definitive comparison. If you disagree, let me know.

[Image: evaluation table with the subjective grades for each approach]

In general, all approaches should give you pretty high performance in the most common scenarios. To really compare them, you also have to look at how important portability is for your project, how much of an effort it will be to scale the use of a specific technology across the whole project, and how comfortable you feel with a specific way of approaching the task at hand. There is no silver bullet.

UME::SIMD Tutorial 9: gather/scatter operations

As you already know, SIMD registers can communicate with the scalar domain (or more precisely with memory) using LOAD and STORE operations. The drawback of these operations is that they only allow you to move data elements that are placed contiguously in memory. However, very often in our codes we have to deal with non-contiguous memory access. In this tutorial we will explain GATHER/SCATTER operations and how they generalize LOAD/STORE operations.

There are scenarios in which you might want to populate a register with data coming from non-consecutive memory locations. A few examples are:

  1. Accessing every second element of an array (strided access)
  2. Accessing elements of an array at freshly computed offsets (indexed access)
  3. Accessing elements in a different order (shuffling access)

We will discuss the first two situations in this post. The last class of problems needs a more thorough discussion in the context of permutations, so we will cover it in the next tutorial.

What then is a SCATTER operation? It is simply the reverse of a GATHER operation: it ‘scatters’ the content of a register over the memory. It is therefore similar to a STORE operation, but capable of non-contiguous memory access.

Strided access

A memory access pattern is called strided when the memory fields accessed are equally distant. This distance is called a stride (not to be mistaken with the SIMD stride!). A simple visualization of strided access:

[Image: visualization of strided gather access]

As you can see, STRIDE-1 access is a special case of the GATHER operation: a LOAD operation. Why then do we have separate LOAD and GATHER operations (and STORE and SCATTER) instead of just simplifying things and using only GATHERs? There are two explanations. Firstly, there is the historical one: early processors only implemented LOAD instructions to move data between memory and scalar registers. Since in the scalar domain you can access any element using a single scalar index, there was no need for more flexible operations.
Secondly, there is the performance aspect: apart from the base memory address, the gather instruction also requires information about how to calculate the specific offsets to load from. No matter how the instruction is implemented within the processor, this is an additional degree of freedom, possibly meaning longer execution time, but for sure meaning additional circuitry.

PERFORMANCE HINT: Use load/store operations whenever you can, and avoid gather/scatter if possible. In most scenarios this will mean that your data structures/algorithms have to be modified. Use gather/scatter only when you are certain that re-designing the structures is not possible.

When performing a strided access you need to pass the base address, as a pointer to the beginning of your data, and the stride value, as a scalar integer. The stride is always passed as a number of elements, rather than a memory offset, so that the programming can be simplified.

The following code example shows how to perform strided memory accesses:

float a[LARGE_DATA_SIZE];
uint32_t STRIDE = 8;
...
for(int i = 0; i < PROBLEM_SIZE; i+=8) {
  SIMDVec<float, 8> vec;

  // Note that we have to scale the loop index.
  int offset = i*STRIDE;

  // 'load' the data to vec.
  vec.gather(&a[offset], STRIDE);
  // do something useful
  vec += 3.14;
  // store the result at original locations
  vec.scatter(&a[offset], STRIDE);
}

When using strided access, we have to pass the address of the first element. This is done by calculating the offset variable in every iteration. The gather operation then uses this local base address together with a scalar stride to calculate the offsets of the corresponding elements.
Once the necessary calculations are done, the updated results are stored back at the original locations.

Indexed access

Indexed access is even more general-purpose than strided access. The primary difference is that instead of passing a scalar stride parameter, you have to pass a SIMDVec of unsigned integer indices. Indexed access can be visualized using the following graphic:

[Image: visualization of indexed gather/scatter access]

The advantage of this mode is that every element can be retrieved using a dedicated index. The disadvantage is that indexing this way might completely defeat hardware memory prefetching, and in turn negatively impact your performance. Also mind that the elements might be far apart, meaning that more cache lines have to be moved to the lowest cache level.

An example code using indexed access:

float a[LARGE_DATA_SIZE];
uint32_t indices[PROBLEM_SIZE];
...
for(int i = 0; i < PROBLEM_SIZE; i+=8) {
  SIMDVec<float, 8> vec;

  // Here we are using precomputed indices,
  // but they can be computed on-the-fly if necessary.
  SIMDVec<uint32_t, 8> indices_vec(&indices[i]);

  // 'load' the data to vec.
  vec.gather(&a[0], indices_vec);
  // do something useful
  vec += 3.14;
  // store the result at original locations
  vec.scatter(&a[0], indices_vec);
}

The basic difference is that instead of passing a scalar stride, we use an unsigned integer vector of 32-bit indices.

NOTE: at the moment the library uses a vector of unsigned integers having the same precision as the scalar elements of the gathered vector. This causes trouble when dealing with mixed precision, and when small types (e.g. uint8_t) don’t have enough bits to represent the full range of indices. The library will be updated to always use vectors of uint32_t indices.

Securing conditional access

One of the problems you might spot when writing your code is dealing with conditional statements. Consider the following scalar code:

float a[PROBLEM_SIZE], b[PROBLEM_SIZE];
float c[LARGE_DATASET_SIZE];
...
for(int i = 0; i < PROBLEM_SIZE; i++) {
 // Here we are checking if for some reason one of the 
 // values in (a[i],b[i]) pair is not determined properly.
 if (std::isfinite(a[i] - b[i])) {
   // Calculate the index only if both 'a' and 'b' are well defined
   int index = int(a[i] - b[i]);
   // 'gather' single element at a time
   float temp = c[index]; 
   // Do something with the value
   temp += 3.14; 
   // Update the values of 'c'
   c[index] = temp;
 }
}

As you should already know (see UME::SIMD Tutorial 8: Conditional execution using masks), vector code using masks will execute all operations within the ‘if’ branch, regardless of whether ‘a’ and ‘b’ are well defined or not. It might be tempting to re-write the above code as:

float a[PROBLEM_SIZE], b[PROBLEM_SIZE];
float c[LARGE_DATASET_SIZE];
...
// For simplification we are assuming that: ((PROBLEM_SIZE % 8) == 0)
for(int i = 0; i < PROBLEM_SIZE; i+= 8) {
 // Here we are checking if for some reason (e.g. a design decision) one 
 // of the values in (a[i],b[i]) pair is not determined properly.

 SIMDVec<float, 8> a_vec(&a[i]), b_vec(&b[i]);
 SIMDVecMask<8> condition = (a_vec - b_vec).isfin();

// if (std::isfin(a[i] - b[i])) {
 SIMDVec<uint32_t, 8> indices = a_vec - b_vec;
 SIMDVec<float, 8> temp;

 temp.gather(&c[0], indices); // This is WRONG!!!
 temp.adda(condition, 3.14); // only change selected elements
 temp.scatter(&c[0], indices); // Again WRONG!!!
}

Why are the gather and scatter operations in this snippet wrong? They both try to access memory even if the indices are incorrect. But neither GATHER nor SCATTER happens to care about that. They have to trust us to provide them with correct indices, if only for performance reasons. If the indices are incorrect, they will try to access memory that is possibly outside the boundaries of the ‘c’ dataset. This might result in a memory fault, and for that reason we have to secure both the memory read and write using our conditional mask:

float a[PROBLEM_SIZE], b[PROBLEM_SIZE];
float c[LARGE_DATASET_SIZE];
...
// For simplification we are assuming that: ((PROBLEM_SIZE % 8) == 0)
for(int i = 0; i < PROBLEM_SIZE; i+= 8) {
 // Here we are checking if for some reason (e.g. by design?) one of the 
 // values in (a[i],b[i]) pair is not determined properly.

 SIMDVec<float, 8> a_vec(&a[i]), b_vec(&b[i]);
 SIMDVecMask<8> condition = (a_vec - b_vec).isfin();

 // if (std::isfin(a[i] - b[i])) {
 SIMDVec<uint32_t, 8> indices = a_vec - b_vec;
 SIMDVec<float, 8> temp;

 temp.gather(condition, &c[0], indices); // Now the read is CORRECT!!!
 temp += 3.14; // we don't have to mask this operation
 temp.scatter(condition, &c[0], indices); // Again no problems here.
}

So in the corrected version we simply have to pass an additional mask parameter to the memory access operations, rather than to the arithmetic operations. The library will take care of securing the accesses so that they touch only allowed memory addresses.

Summary

In this tutorial we introduced the concept of GATHER/SCATTER operations and explained why they are a useful addition to our SIMD programming model. We looked at both strided and indexed memory access patterns, and explained how this concept generalizes LOAD/STORE operations.

While very useful, GATHER/SCATTER operations are a double-edged sword that can both make our lives easier and destroy our performance.

In the next tutorial we will talk about permutations. There we will also discuss GATHER/SCATTER as one of the possible strategies for re-arranging vector elements.

UME::SIMD Tutorial 8: Conditional execution using masks

The first real difference between scalar code and SIMD code relates to conditional execution of calculations. C++ is, by design, a language heavily oriented towards scalar operations. Apart from arithmetic statements, the language also offers control flow statements, such as if-then-else, switch and goto. All of these can be mixed with UME::SIMD code, but there are a few restrictions imposed by the language. We will only discuss the ‘if-else’ block here, but the same considerations apply to other conditional statements.

Scalar conditional statements

A simple scalar conditional statement would look like this:

int x;
int a = 2;
int b = 3;
int c;
...

if(x) {
    c = a + b;
}
else {
    c = a - b;
}

Now, depending on the value of x, the variable c will take either the value ‘5’ or ‘-1’. To understand which of the two branches (the if branch or the else branch) will be selected, we have to look at the predicate expression in the if clause:

if(<predicate expression>)

According to C++ rules, the type of the predicate expression needs to be converted into a boolean. That is, if the expression uses only bool variables, then the result will be pretty straightforward to predict. A different case is when we use expressions that evaluate to integers (in our case x is an expression that evaluates to an integer!). In that case we need to know how C++ converts this integer expression to a boolean expression. The rule for doing so is called an implicit conversion rule, and is defined within the language standard. The rule goes like this:

Zero value, null pointer value, or null member pointer value is converted to false; any other value is converted to true.

In our particular case, the same predicate expression will therefore be interpreted as a boolean predicate (or a boolean expression):

   
bool predicateValue = (x != 0);
if(predicateValue)

Understanding this relation is critical to understanding why we can’t use vector types directly within if() or while() statements.

Vector conditional statements

Let’s visualize what we will be talking about in this part using a piece of code:

 
UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...

if(x) { // Error: What does this mean?!
    c = a + b;
}
else {
    c = a - b;
}

In the above code we run into a very interesting problem: how do we convert the vector x into a boolean value? Unfortunately, since C++ is not aware of the vector types, it cannot make any assumptions about implicit conversions to a boolean expression. More than that: you might want to express different operations with this piece of code. Do you want to execute each block for the specific vector elements that fulfill the predicate:

 
UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...

for(int i = 0; i < 2; i++)
{
  if(x[i] != 0) {
    c[i] = a[i] + b[i];
  }
  else {
    c[i] = a[i] - b[i];
  }
}
 

or perhaps you want to execute the whole block only if all values in the vector fulfill a certain condition:

UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...

bool predicate = (x[0] != 0) && (x[1] != 0);
if(predicate) {
    c = a + b;
}
else {
    c = a - b;
}

or maybe you want to execute the blocks if any of the elements meets the predicate criteria:

UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...

bool predicate = (x[0] != 0) || (x[1] != 0);
if(predicate) {
    c = a + b;
}
else {
    c = a - b;
}

As you can see, the original language is too ambiguous to let you simply use the same syntax. At the same time, a library such as UME::SIMD does not have the capabilities of a compiler, allowing it to look ahead in the code and make a decision based on forward analysis. We therefore need more straightforward, yet unambiguous, mechanisms to express our intent.

Mask types

UME::SIMD uses the concept of data flow, apart from the already well-established control flow, to steer the way actual numerical values are passed through the program. The data flow concept doesn’t require any conditional statements. It instead uses the concept of an execution mask. A mask is a vector consisting of only boolean values. A ‘true’ value means that the specific vector lane should execute the operation, while a ‘false’ value means that the operation should be blocked, that is, the result should not be written to the output.

The library exposes a category of types named SIMDVecMask for data-flow control. We will now go through the basic operations that are allowed with these types.

Mask initialization using scalars

A basic mask declaration looks similar to a vector declaration:


// Broadcast single scalar value
UME::SIMD::SIMDVecMask<4> mask0(true);

// Initialize using scalars
UME::SIMD::SIMDVecMask<4> mask1(true, false, false, true);
...
bool maskValues[4] = {true, false, false, true};
// Initialize using an array
UME::SIMD::SIMDVecMask<4> mask2(&maskValues[0]);

Mask initialization using predicate operations

A mask can also be obtained by applying one of the predicate operations to arithmetic vectors:


UME::SIMD::SIMDVec<float, 8> vec1, vec2;

// 'CoMPare EQual'
UME::SIMD::SIMDVecMask<8> m0=vec1.cmpeq(vec2); // or 'm0=vec1==vec2'
// 'CoMPare Not EQual'
UME::SIMD::SIMDVecMask<8> m1=vec1.cmpneq(vec2); // or 'm1=vec1!=vec2'
// 'CoMPare Greater Than'
UME::SIMD::SIMDVecMask<8> m2=vec1.cmpgt(vec2); // or 'm2=vec1>vec2'
// 'CoMPare Greater than or Equal'
UME::SIMD::SIMDVecMask<8> m3=vec1.cmpge(vec2); // or 'm3=vec1>=vec2'
// 'CoMPare Less Than'
UME::SIMD::SIMDVecMask<8> m4=vec1.cmplt(vec2); // or 'm4=vec1<vec2'
// 'CoMPare Less than or Equal'
UME::SIMD::SIMDVecMask<8> m5=vec1.cmple(vec2); // or 'm5=vec1<=vec2'

For each predicate, a specific comparison operation is applied in an elementwise manner to the elements of vectors vec1 and vec2.

Mask arithmetics

Masks can also be operated on, but the available operations are slightly different than for arithmetic vectors. This is for portability reasons: it is not possible to mix arithmetic vectors and masks, except for a few special situations. The code below shows most of the operations allowed on masks, and their interaction with arithmetic types:


// Some vertical operations:
UME::SIMD::SIMDVecMask<8> m0, m1, m2;
....
m0 = m1.land(m2);
m0 = m1 && m2;    // the same as above
m0 = m1.lor(m2);
m0 = m1 || m2;    // the same as above
m0 = m1.lxor(m2);
m0 = m1 ^ m2;     // the same as above
m0 = m1.landnot(m2); // equivalent to: !m1 && m2

// Also some horizontal (reduction) operations
bool b0;
...
b0 = m1.hland(); // m1[0] && m1[1] && ... && m1[7]
b0 = m1.hlor();  // m1[0] || m1[1] || ... || m1[7]

Using masks

Once a mask is calculated, it can be passed as an optional parameter to most of the arithmetic operations:


UME::SIMD::SIMDVec<float, 4> vec0, vec1, vec2, vec3;
UME::SIMD::SIMDVecMask<4> m0;
...
m0 = vec0 > vec1;
// Add only pairs where m0 is true
vec2 = vec0.add(m0, vec1);
// Similar expressed using operator syntax
vec2[m0] = vec0 + vec1;

// 'Blend' two vectors using masking.
// 'vec2[i]' becomes 'vec1[i]' where 'm0[i]'
// is true, and 'vec0[i]' where false.
vec2 = vec0.blend(m0, vec1);

Now let’s look at the specific examples we discussed before. Scenario 1 would look like this in SIMD arithmetic:


UME::SIMD::SIMDVec<int,2> x(0, 10);
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
UME::SIMD::SIMDVec<int,2> t0, t1;
UME::SIMD::SIMDVecMask<2> m0;
...

m0 = x!=0;                // (false, true)
t0 = a+b;                 // (5, 2) 
t1 = a-b;                 // (-1, 8)
c = t1.blend(m0, t0);     // (-1, 2)

// Or briefly:
c=(a-b).blend(x!=0, a+b); // (-1, 2)

PERFORMANCE PITFALL: when executing scalar code, only one branch gets executed for every element of a vector. The decision about which branch to execute is made on a per-iteration basis. This allows the microprocessor to completely skip certain parts of the code.
In the SIMD world, an arithmetic operation happens atomically for all elements of the vector, and it is not possible to ‘skip’ the execution. As a result we have to execute both the ‘if’ and ‘else’ statements in every iteration. For long ‘if-else’ blocks this might mean executing a large number of additional operations (as compared to scalar code), and therefore might decrease the expected vectorization gains.

The second example is the execution of the ‘if all true’ variant. Mind that in this case we might still use the same ‘if-else’ statement:


UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...
// The original code was using following formula:
// bool predicate = (x[0] != 0) && (x[1] != 0)

// We first declare the mask, and perform a proper
// comparison for values of 'x':
UME::SIMD::SIMDVecMask<2> m0 = x.cmpneq(0);
// Now we have to perform a 'Horizontal Logical AND'
// (reduction) operation:
bool predicate = m0.hland();

// At this moment we already know that
// we have to execute either 'if' or 'else' block
// for all elements of a vector.
if(predicate) {
    c = a + b;
}
else {
    c = a - b;
}

The same pattern can be used for Scenario #3, but with a ‘Horizontal Logical OR’ operation. In the general case it is up to you to decide which construct is more useful in a given situation.

PERFORMANCE PITFALL: Reduction operations are expensive on their own, so they should be used sparingly. While we might be able to decrease the number of instructions executed in scenarios #2 and #3, the cost of the reduction together with the cost of handling the ‘if-else’ statement might actually be higher than executing the equivalent scalar code.

Summary

This tutorial discussed the basic usage of masks. While control flow remains an important part of UME::SIMD based programs, the SIMD programming model also requires data-flow constructs to open performance opportunities to user programs.
Masks can be treated as vectors of boolean values, obtained either by explicit initialization or by calling a comparison operation on arithmetic vectors. They can then be used to control the flow of data within a UME::SIMD program.

UME::SIMD Tutorial 7: Dynamic alignment

In the previous post we discussed how to control the alignment of statically declared variables, and struct/class members. In this post we will look at ways of performing dynamic allocations with alignment control.

The C++ way

Unfortunately, so far there is no convenient, standard way of performing aligned allocations. Current C++ requires the user to overallocate and then use only the data starting at an aligned offset. This can be done using the std::align function.

Example code:

#include <iostream>
#include <iomanip>
#include <memory>
#include <cstdlib>

void print_address(uint64_t addr) {
  std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

int alignment(uint64_t addr) {
 uint64_t result = 1;
 for(uint64_t i = 2; i <= 4096; i*=2) {
  if(addr % i != 0) break;
  result = i;
 }
 return result;
}

int main()
{
  const int ALIGNMENT = 64; // align to 64B (512b) boundary
  int ARRAY_LENGTH = 5;     // size in number of array fields
  void *original_ptr;       // this has to be 'void*' so that we could use it with std::align
  void *aligned_ptr;
  
  std::size_t SIZE = ARRAY_LENGTH*sizeof(float);
  std::size_t OVERALLOCATED_SIZE    // overallocated size in bytes 
    = (SIZE + ALIGNMENT - sizeof(float));
  
  original_ptr = malloc(OVERALLOCATED_SIZE);
  
  std::cout << "[original_ptr]" 
  << "\naddress: ";
  print_address((uint64_t)original_ptr);
  std::cout << "\nvariable alignment: " << alignment((uint64_t)original_ptr)
  << "\nrequired type alignment: " << alignof(decltype(*(float*)original_ptr))
  << "\nrequired ptr alignment: " << alignof(decltype(original_ptr)) << "\n";

  // We would like to retain the 'original_ptr' 
  // to be able to free the buffer in the future
  aligned_ptr = original_ptr;
  aligned_ptr = std::align(
                    ALIGNMENT,
                    sizeof(float),
                    aligned_ptr,
                    OVERALLOCATED_SIZE);  

  if(aligned_ptr == nullptr) {
     std::cout << "std::align failed: no sufficient space in the buffer.\n";
  }

  std::cout << "[aligned_ptr]" 
  << "\naddress: ";
  print_address((uint64_t)aligned_ptr);
  std::cout << "\nvariable alignment: " << alignment((uint64_t)aligned_ptr)
  << "\nrequired type alignment: " << alignof(decltype(*(float*)aligned_ptr))
  << "\nrequired ptr alignment: " << alignof(decltype(aligned_ptr)) << "\n";
  
  free(original_ptr);
}

Exemplary output:

[original_ptr]
address: 0x1109c20
variable alignment: 32
required type alignment: 4
required ptr alignment: 8
[aligned_ptr]
address: 0x1109c40
variable alignment: 64
required type alignment: 4
required ptr alignment: 8

What the code does is:

  1. it calculates the overallocation size (OVERALLOCATED_SIZE), taking into account additional padding,
  2. allocates an oversized buffer using a standard allocation method,
  3. creates a new pointer to point at properly aligned location within the buffer,
  4. uses the allocated data to do something (possibly useful),
  5. frees the data using the original pointer.

This method is pretty troublesome. First of all, we need to manually calculate the size of the new buffer. This is not difficult, but it is a potential source of errors, and becomes tedious when aligned allocations are frequent in the code. The allocated size can be calculated from the following formula:

  std::size_t OVERALLOCATED_SIZE = (SIZE + ALIGNMENT - sizeof(float));

A small graphic to show why this formula works:
(figure: overallocation)

In the worst case scenario we could end up with malloc returning an address pointing just after a 64B-aligned address. This means that the first element would have to be shifted 60B to the right (ALIGNMENT - sizeof(float)). The last element would also be shifted, so the span between the original location of the first element and the new location of the last element would be 60 + 5*sizeof(float), that is 80 bytes. This is exactly how much space we need to allocate.

Secondly, we have to retain the knowledge of the original location of the memory buffer. This requires us to track two pointers: one for the original buffer and one for the aligned buffer. Even then it is easy to fall into a PITFALL: passing the original pointer to the std::align function, which may modify it! We can avoid this by copying the original pointer into the new pointer and passing that instead:

aligned_ptr = original_ptr;
aligned_ptr = std::align(
                  ALIGNMENT,
                  sizeof(float),
                  aligned_ptr,
                  OVERALLOCATED_SIZE);

Dynamic allocation using UME

UME offers a set of portable wrappers for aligned allocation. You can access these using the UME::DynamicMemory::AlignedMalloc() and UME::DynamicMemory::AlignedFree() functions. The behaviour is almost identical to a standard malloc call, with the difference that an additional ALIGNMENT parameter has to be passed. Here’s a code example:

#include <iostream>
#include <iomanip>

#include <umesimd/UMESimd.h>

void print_address(uint64_t addr) {
  std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

int alignment(uint64_t addr) {
 uint64_t result = 1;
 for(uint64_t i = 2; i <= 4096; i*=2) {
  if(addr % i != 0) break;
  result = i;
 }
 return result;
}

int main()
{
  const int ALIGNMENT = 128; // align to 128B boundary
  int ARRAY_LENGTH = 5;     // size in number of array fields   
  std::size_t SIZE = ARRAY_LENGTH*sizeof(float);

  float *ptr = (float*) UME::DynamicMemory::AlignedMalloc(SIZE, ALIGNMENT);

  std::cout << "[ptr]" 
  << "\naddress: ";
  print_address((uint64_t)ptr);
  std::cout << "\nvariable alignment: " << alignment((uint64_t)ptr)
  << "\nrequired type alignment: " << alignof(decltype(*ptr))
  << "\nrequired ptr alignment: " << alignof(decltype(ptr)) << "\n";

  UME::DynamicMemory::AlignedFree(ptr);
}

Nothing more, nothing less. No additional calculations required, no additional pointers to track, simple syntax.

UME::SIMD Tutorials 6: Static alignment

As explained briefly in the previous tutorial, data alignment can be an important factor in performance-sensitive code. In this tutorial we will go through direct techniques that can be used for static alignment control. Static alignment is the alignment of variables, including class members, that can be deduced at compile time.

Variable alignment

Static alignment can be controlled in C++ using the alignas specifier. Let’s look at an example containing only basic scalar variables:

#include <iostream>
#include <iomanip>

void print_address(uint64_t addr) {
  std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

int alignment(uint64_t addr) {
 uint64_t result = 1;
 for(uint64_t i = 2; i < 4096; i*=2) {
  if(addr % i != 0) break;
  result = i;
 }
 return result;
}

int main()
{
  uint8_t a;
  uint32_t b;
  alignas(16) uint32_t c;
  alignas(4) uint8_t d;

  std::cout << "[a]"
    << "\naddress: ";
    print_address((uint64_t)&a);
    std::cout << "\nvariable alignment: " << alignment((uint64_t)&a)
    << "\nrequired type alignment: " << alignof(decltype(a)) << "\n";
  std::cout << "\n[b]"
    << "\naddress: ";
    print_address((uint64_t)&b);
    std::cout << "\nvariable alignment: " << alignment((uint64_t)&b)
    << "\nrequired type alignment: " << alignof(decltype(b)) << "\n";
  std::cout << "\n[c]"
    << "\naddress: ";
    print_address((uint64_t)&c);
    std::cout << "\nvariable alignment: " << alignment((uint64_t)&c)
    << "\nrequired type alignment: " << alignof(decltype(c)) << "\n";

  std::cout << "\n[d]"
    << "\naddress: ";
    print_address((uint64_t)&d);
    std::cout << "\nvariable alignment: " << alignment((uint64_t)&d)
    << "\nrequired type alignment: " << alignof(decltype(d)) << "\n";
}

Running the above code will give output similar to this:

[a]
address: 0x7ffdf0452c04
variable alignment: 4
required type alignment: 1
[b]
address: 0x7ffdf0452c00
variable alignment: 1024
required type alignment: 4
[c]
address: 0x7ffdf0452c10
variable alignment: 16
required type alignment: 4
[d]
address: 0x7ffdf0452c08
variable alignment: 8
required type alignment: 1

I recommend re-running the example a couple of times to observe how the specific alignments change.

The first variable a is required to be aligned to a 1B boundary, as its natural alignment is 1B (sizeof(uint8_t)). The actual alignment is 4B, but this is fine, as it respects the 1B requirement (as well as 2B). In other runs it can be aligned to any allowed boundary.

The second variable b also respects its natural alignment, which in this case is 4B (sizeof(uint32_t)). The alignment of this variable should never be smaller than 4B.

For variables c and d we artificially force the alignment to be 16B and 4B respectively. In the first case the actual alignment ends up exactly as requested. For d we end up with a stronger alignment but, as before, this is also acceptable.

Array alignment

We can also use the alignas specifier for statically declared arrays.

#include <iostream>
#include <iomanip>

void print_address(uint64_t addr) {
  std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

int alignment(uint64_t addr) {
 uint64_t result = 1;
 for(uint64_t i = 2; i < 4096; i*=2) {
  if(addr % i != 0) break;
  result = i;
 }
 return result;
}

int main()
{
  uint8_t a[10];
  uint32_t b[20];
  alignas(16) uint32_t c[7];
  alignas(4) uint8_t d[13];

  std::cout << "[a]"
    << "\naddress: ";
    print_address((uint64_t)&a);
    std::cout << "\narray alignment: " << alignment((uint64_t)&a)
    << "\narray size: " << sizeof(a)
    << "\nrequired type alignment: " << alignof(decltype(a));
  std::cout << "\n[b]"
    << "\naddress: ";
    print_address((uint64_t)&b);
    std::cout << "\narray alignment: " << alignment((uint64_t)&b)
    << "\narray size: " << sizeof(b)
    << "\nrequired type alignment: " << alignof(decltype(b)) << "\n";
  std::cout << "\n[c]"
    << "\naddress: ";
    print_address((uint64_t)&c);
    std::cout << "\narray alignment: " << alignment((uint64_t)&c)
    << "\narray size: " << sizeof(c)
    << "\nrequired type alignment: " << alignof(decltype(c)) << "\n";
  std::cout << "\n[d]"
    << "\naddress: ";
    print_address((uint64_t)&d);
    std::cout << "\narray alignment: " << alignment((uint64_t)&d)
    << "\narray size: " << sizeof(d)
    << "\nrequired type alignment: " << alignof(decltype(d)) << "\n";
}

An exemplary output:

[a]
address: 0x7ffcfddbfc79
array alignment: 1
array size: 10
required type alignment: 1
[b]
address: 0x7ffcfddbfc00
array alignment: 1024
array size: 80
required type alignment: 4
[c]
address: 0x7ffcfddbfc50
array alignment: 16
array size: 28
required type alignment: 4
[d]
address: 0x7ffcfddbfc6c
array alignment: 4
array size: 13
required type alignment: 1

As defined in C++, the minimal alignment requirement for an array is the natural alignment of its elements. So the required alignment is 1B for the array a (8-bit elements) and 4B for the array b (32-bit elements). Forcing the alignment of an array only forces the alignment of its first element. In this particular case, the alignment is forced on the arrays c and d.

PITFALL: Mind that the rest of the elements are guaranteed to be aligned only to their natural alignment, since C++ requires the contiguous packing of plain array elements. In other words: alignment of c[0] will be at least 16, but the alignment of c[1] might be only 4.

Member alignment

We will only discuss structures, but the general conclusions apply also for classes.

Again, the code you can use to tinker a little:

#include <iostream>
#include <iomanip>

void print_address(uint64_t addr) {
  std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

void print_raw(uint8_t *ptr, int length) {
  for(int i = 0; i < length; i++) {
    std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << (uint32_t)ptr[i] << " " ;
  }
  std::cout << std::dec << std::endl;
}

int alignment(uint64_t addr) {
 uint64_t result = 1;
 for(uint64_t i = 2; i < 4096; i*=2) {
  if(addr % i != 0) break;
  result = i;
 }
 return result;
}

struct S_unaligned{
 uint8_t a;
 uint8_t b;

 // We initialize the data structure with some data
 // so that the output is more readable
 S_unaligned() : a(3), b(7) {}
};

struct S_aligned_a{
 alignas(16) uint8_t a;
 uint8_t b;

 S_aligned_a(): a(3), b(7) {}
};

struct S_aligned_b{
 uint8_t a;
 alignas(16) uint8_t b;

 S_aligned_b(): a(3), b(7) {}
};

int main()
{
  S_unaligned x0;

  std::cout << "[S_unaligned] size: " << sizeof(x0)
    << "\naddress: " << &x0
    << "\nrequired type alignment: " << alignof(S_unaligned)
    << "\nobject alignment: " << alignment((uint64_t)&x0)
    << "\ndata:";
  print_raw((uint8_t*) &x0, sizeof(x0));

  alignas(128) S_unaligned x0_a;
  std::cout << "\n[aligned S_unaligned] size: " << sizeof(x0_a)
    << "\naddress: " << &x0_a
    << "\nrequired type alignment: " << alignof(S_unaligned)
    << "\nobject alignment: " << alignment((uint64_t)&x0_a)
    << "\ndata:";
  print_raw((uint8_t*) &x0_a, sizeof(x0_a));

  S_aligned_a x1;
  std::cout << "\n[S_aligned_a] size: " << sizeof(x1)
    << "\naddress: " << &x1
    << "\nrequire dtype alignment: " << alignof(S_aligned_a)
    << "\nobject alignment: " << alignment((uint64_t)&x1)
    << "\ndata:";
  print_raw((uint8_t*) &x1, sizeof(x1));

  alignas(128) S_aligned_a x1_a;
  std::cout << "\n[aligned S_aligned_a] size: " << sizeof(x1_a)
    << "\naddress: " << &x1_a
    << "\ntype alignment: " << alignof(S_aligned_a)
    << "\nobject alignment: " << alignment((uint64_t)&x1_a)
    << "\ndata:";
  print_raw((uint8_t*) &x1_a, sizeof(x1_a));  

  S_aligned_b x2;
  std::cout << "\n[S_aligned_b] size: " << sizeof(x2)
    << "\naddress: " << &x2
    << "\nrequire type alignment: " << alignof(S_aligned_b)
    << "\nobject alignment: " << alignment((uint64_t)&x2)
    << "\ndata:";
  print_raw((uint8_t*) &x2, sizeof(x2));

  alignas(128) S_aligned_b x2_a;
  std::cout << "\n[aligned S_aligned_b] size: " << sizeof(x2_a)
    << "\naddress: " << &x2_a
    << "\nrequire type alignment: " << alignof(S_aligned_b)
    << "\nobject alignment: " << alignment((uint64_t)&x2_a)
    << "\n data:";
  print_raw((uint8_t*) &x2_a, sizeof(x2_a));
}

We will now discuss a few selected structures.

Structure #1: No members aligned

For the S_unaligned x0 the output is:

[S_unaligned] size: 2
address: 0x7fff2c36ca78
required type alignment: 1
object alignment: 8
data:0x03 0x07

A few things to point out here. First of all, the size of the S_unaligned structure is 2 bytes. The structure has two fields, each of them 1B long (natural alignment of 1B). According to C++, a structure has to be aligned to the strictest alignment requirement among its members; here both members require 1B, and indeed this is what is reported:

type alignment: 1

While the structure itself has to be aligned to 1B boundary, in this particular case it is aligned to 8B boundary:

object alignment: 8

Again, this is OK as higher alignment requirement also respects lower alignments, i.e. 8B aligned structure is also 4B, 2B and 1B aligned.

When we align an object of this structure, we obtain something like:

[aligned S_unaligned] size: 2
address: 0x7ffe25de5080
required type alignment: 1
object alignment: 128
data:0x03 0x07

Nothing changed, except for the minimal value the object alignment can take.

Structure #2: only the first member aligned

Now let’s modify the structure to have the a member aligned to some boundary. In this particular example, we are aligning the first field to 16B boundary.

  struct S_aligned_a{
   alignas(16) uint8_t a;
   uint8_t b;

   S_aligned_a(): a(3), b(7) {}
  };

The output for a specific execution is:

[S_aligned_a] size: 16
address: 0x7ffc5b06a0f0
require dtype alignment: 16
object alignment: 16
data:0x03 0x07 0xea 0xe9 0x21 0x7f 0000 0000 0x90 0x77 0xea 0xe9 0x21 0x7f 0000 0000

A few interesting things happened here. First of all, the size of the structure immediately got larger: it is now 16B, 8 times the original! But we can also see from data: 0x03 0x07 ... that a and b are still in the same place within the structure. What is happening? Let’s imagine an example with pointer arithmetic:

S_aligned_a data[10];
S_aligned_a* ptr = &data[0];

for (int i = 0; i < 10; i++) {
 std::cout << ptr->a << std::endl;
 ptr++;
}

Due to how pointer arithmetic works, incrementing a pointer (as in the loop using ptr) increases the stored address by the size of the pointer’s base type. If the size were 2B, as for the original structure, the pointer would be increased by 2B each time. This means we would end up accessing structures of S_aligned_a type at the following addresses:


0x7ffc5b06a0f0,
0x7ffc5b06a0f2,
0x7ffc5b06a0f4,
0x7ffc5b06a0f6,
0x7ffc5b06a0f8,
0x7ffc5b06a0fa,
0x7ffc5b06a0fc,
0x7ffc5b06a0fe,
0x7ffc5b06a100,
...

Out of all the above addresses, only 0x7ffc5b06a0f0 and 0x7ffc5b06a100 are aligned to a 16B boundary! This means that a would not be aligned for data[1], data[2], …, data[7], data[9] etc.
Now the obvious solution is to add padding at the end of each structure object, so that the next structure, starting directly after the previous one, is automatically aligned to the required boundary. This is exactly what is done. Unfortunately in some specific cases, as the one presented, this can grow the required size enormously.

As for the previous case, aligning the whole structure doesn’t change much:

[aligned S_aligned_a] size: 16
address: 0x7ffe25de5000
type alignment: 16
object alignment: 2048
data:0x03 0x07 0x60 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

Structure #3: only the second member aligned

Now let’s try defining another structure, this time with only b aligned:

struct S_aligned_b{
 uint8_t a;
 alignas(16) uint8_t b;

 S_aligned_b(): a(3), b(7) {}
};

A particular test execution gives us:


[S_aligned_b] size: 32
address: 0x7ffc5b06a160
require type alignment: 16
object alignment: 32
data:0x03 0000 0000 0000 0x10 0000 0000 0000 0x01 0000 0000 0000 0000 0000 0000 0000 0x07 0x19 0x60 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

What?! Yet another size increase!? C++ is not allowed to reorder members (their addresses must increase in declaration order), so the only thing the compiler can do is insert some unused padding between a and b. Let’s think for a moment about why this is necessary.

The a member is aligned to a 1B boundary, that is, to its natural alignment. Let’s assume that an object of the S_aligned_b structure is aligned to a 16B boundary. a would then be aligned to 16B, but the next 15 bytes would only be guaranteed to be aligned to 1, 2, 4 and 8 byte boundaries. This means that the first unoccupied address after a that is aligned to a 16B boundary starts at offset 16 from the beginning of the structure. Hence the space between a and b has to be padded with unused bytes, so that the alignment requirement for b is met.

We can use the same rationale to explain why the required type alignment is 16: it has to be, to guarantee that b is properly aligned.

One more time, aligning the object itself does not have a drastic impact:

[aligned S_aligned_b] size: 32
address: 0x7ffe25de4f80
require type alignment: 16
object alignment: 128
data:0x03 0x8e 0x9f 0xa2 0x01 0x7f 0000 0000 0x90 0x17 0x9f 0xa2 0x01 0x7f 0000 0000 0x07 0xc0 0x66 0xa2 0x01 0x7f 0000 0000 0x80 0x7c 0x9f 0xa2 0x01 0x7f 0000 0000

Structure packing

As shown before, the size of a structure can grow surprisingly quickly. Can we fight it, and how? We can reorder the members of a class/struct so that they pack more tightly. The code to show this:

#include <iostream>
#include <iomanip>

void print_address(uint64_t addr) {
  std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

void print_raw(uint8_t *ptr, int length) {
  for(int i = 0; i < length; i++) {
    std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << (uint32_t)ptr[i] << " " ;
  }
  std::cout << std::dec << std::endl;
}

int alignment(uint64_t addr) {
 uint64_t result = 1;
 for(uint64_t i = 2; i < 4096; i*=2) {
  if(addr % i != 0) break;
  result = i;
 }
 return result;
}

struct S{
 uint8_t a;
 uint16_t b;
 uint32_t c;
 uint64_t d;
 alignas(256) uint64_t e;

 S(): a(3), b(7), c(9), d(1), e(6) {}
};

struct S_packed{
 alignas(256) uint64_t e;
 uint64_t d;
 uint32_t c;
 uint16_t b;
 uint8_t a;

 S_packed(): a(3), b(7), c(9), d(1), e(6) {}
};

struct S_packed_noalign{
 uint64_t e;
 uint64_t d;
 uint32_t c;
 uint16_t b;
 uint8_t a;

 S_packed_noalign(): a(3), b(7), c(9), d(1), e(6) {}
};


int main()
{
  S x0;

  std::cout << "[S] size: " << sizeof(x0) 
    << "\naddress: " << &x0
    << "\nrequired type alignment: " << alignof(S)
    << "\nobject alignment: " << alignment((uint64_t)&x0)
    << "\ndata:";
  print_raw((uint8_t*) &x0, sizeof(x0));
  
  S_packed x1;
  
  std::cout << "[S_packed] size: " << sizeof(x1) 
    << "\naddress: " << &x1
    << "\nrequired type alignment: " << alignof(S_packed)
    << "\nobject alignment: " << alignment((uint64_t)&x1)
    << "\ndata:";
  print_raw((uint8_t*) &x1, sizeof(x1));
  
  S_packed_noalign x2;
  
  std::cout << "[S_packed_noalign] size: " << sizeof(x2) 
    << "\naddress: " << &x2
    << "\nrequired type alignment: " << alignof(S_packed_noalign)
    << "\nobject alignment: " << alignment((uint64_t)&x2)
    << "\ndata:";
  print_raw((uint8_t*) &x2, sizeof(x2));

}

The output (only the first bytes of each data dump carry the values we initialized; the rest is padding filled with whatever happened to be in memory):

[S] size: 512
address: 0x7fff1beb2b00
required type alignment: 256
object alignment: 256
data:0x03 0x34 0x07 0000 0x09 0000 0000 0000 0x01 0000 0000 0000 0000 0000 0000 0000 0x46 0x1d 0x41 0x11 0x6a 0x7f 0000 0000 0x58 0xb7 0xbc 0x10 0x6a 0x7f 0000 0000 0x98 0x66 0x40 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0x01 0000 0000 0000 0xec 0x07 0000 0000 0x01 0000 0000 0000 0x25 0x0c 0x78 0x11 0x6a 0x7f 0000 0000 0x48 0xec 0xbb 0x10 0x6a 0x7f 0000 0000 0xa8 0x45 0x99 0x11 0x6a 0x7f 0000 0000 0xc0 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x16 0x04 0x78 0x11 0x6a 0x7f 0000 0000 0x50 0x42 0x99 0x11 0x6a 0x7f 0000 0000 0x10 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x5e 0x96 0x93 0x1c 0000 0000 0000 0000 0x10 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x07 0000 0000 0000 0000 0000 0000 0000 0xb8 0x4f 0x96 0x11 0x6a 0x7f 0000 0000 0xda 0x16 0x87 0x30 0000 0000 0000 0000 0x25 0x0c 0x78 0x11 0x6a 0x7f 0000 0000 0x07 0000 0000 0000 0000 0000 0000 0000 0x5b 0x1c 0xc2 0000 0000 0000 0000 0000 0x1a 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x07 0000 0000 0000 0000 0000 0000 0000 0x90 0x2d 0xeb 0x1b 0xff 0x7f 0000 0000 0xa8 0x06 0x3f 0x11 0x6a 0x7f 0000 0000 0xb0 0xc3 0x40 0x11 0x6a 0x7f 0000 0000 0x70 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x16 0x04 0x78 0x11 0x6a 0x7f 0000 0000 0xa8 0x06 0x3f 0x11 0x6a 0x7f 0000 0000 0xa0 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x06 0000 0000 0000 0000 0000 0000 0000 0xa0 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x07 0000 0000 0000 0000 0000 0000 0000 0xb8 0x4f 0x96 0x11 0x6a 0x7f 0000 0000 0xed 0xe9 0x43 0x2b 0000 0000 0000 0000 0x62 0x0d 0x78 0x11 0x6a 0x7f 0000 0000 0x62 0x85 0x43 0x11 0x6a 0x7f 0000 0000 0xa7 0x0f 0xad 0000 0000 0000 0000 0000 0x2d 0000 0000 0000 0x6a 0x7f 0000 0000 0xb8 0xd6 0xbb 0x10 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x20 0x2e 0xeb 0x1b 0xff 0x7f 0000 0000 0x48 0xec 0xbb 0x10 0x6a 0x7f 0000 0000 0xf8 0x35 0xbc 0x10 0x6a 0x7f 0000 0000 0xe0 0x2d 0xeb 0x1b 0xff 0x7f 0000 0000 0x08 0x2e 0xeb 0x1b 0xff 0x7f 0000 0000 0x50 0x42 0x99 0x11 0x6a 0x7f 0000 0000 0x20 0x31 0x96 0x11 0x6a 0x7f 0000 
0000 0xda 0x16 0x87 0x30 0000 0000 0000 0000 0xea 0x0f 0x78 0x11 0x6a 0x7f 0000 0000 0x88 0x91 0x99 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x60 0x30 0x96 0x11 0x6a 0x7f 0000 0000 0xf8 0x44 0x96 0x11 0x6a 0x7f 0000 0000 0xe7 0x08 0x40 0000 0000 0000 0000 0000 0x58 0xb7 0xbc 0x10 0x6a 0x7f 0000 0000 0x10 0x04 0x40 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x01 0000 0000 0000 0x12 0x03 0000 0000 0x01 0000 0000 0000 0x50 0x42 0x99 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0xe0 0x94 0x99 0x11 0x6a 0x7f 0000 0000
[S_packed] size: 256
address: 0x7fff1beb2d00
required type alignment: 256
object alignment: 256
data:0x06 0000 0000 0000 0000 0000 0000 0000 0x01 0000 0000 0000 0000 0000 0000 0000 0x09 0000 0000 0000 0x07 0000 0x03 0000 0x60 0x30 0x96 0x11 0x6a 0x7f 0000 0000 0xed 0xe9 0x43 0x2b 0000 0000 0000 0000 0xea 0x0f 0x78 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x60 0x30 0x96 0x11 0x6a 0x7f 0000 0000 0x01 0000 0000 0000 0xff 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x01 0000 0000 0000 0xff 0x7f 0000 0000 0x88 0x91 0x99 0x11 0x6a 0x7f 0000 0000 0x3a 0x02 0000 0000 0000 0000 0000 0000 0x64 0x4c 0x41 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0xd5 0xb9 0x78 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0xe0 0x94 0x99 0x11 0x6a 0x7f 0000 0000 0x20 0x2e 0xeb 0x1b 0xff 0x7f 0000 0000 0x50 0x42 0x99 0x11 0x6a 0x7f 0000 0000 0x38 0x2e 0xeb 0x1b 0xff 0x7f 0000 0000 0x85 0x4d 0x02 0x04 0x01 0000 0000 0000 0x50 0xea 0x40 0x11 0x6a 0x7f 0000 0000 0xe7 0x08 0x40 0000 0000 0000 0000 0000 0x48 0x96 0x38 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x94 0x5e 0x77 0x11 0x6a 0x7f 0000 0000 0x90 0xe7 0x76 0x11 0x6a 0x7f 0000 0000 0000 0x90 0x3e 0x11 0x6a 0x7f 0000 0000 0x80 0x4c 0x77 0x11 0x6a 0x7f 0000 0000 0x18 0xf2 0x76 0x11 0x6a 0x7f 0000 0000 0x50 0x95 0x76 0x11 0x6a 0x7f 0000 0000
[S_packed_noalign] size: 24
address: 0x7fff1beb2e00
required type alignment: 8
object alignment: 512
data:0x06 0000 0000 0000 0000 0000 0000 0000 0x01 0000 0000 0000 0000 0000 0000 0000 0x09 0000 0000 0000 0x07 0000 0x03 0000

The general rule for reordering is to compute max(required_alignment(member), sizeof(member)) for each member and declare the members with higher values first. In this particular case the packing technique halved the size of a structure in which one field requires a very high alignment.
If we can also relax the alignment requirement, as in the structure S_packed_noalign, the amount of space required shrinks even further. We might however run into problems with efficient register LOAD/STORE operations if the alignment criteria are not met.

PITFALL: If structures are nested, modifications made in components might change their size/alignment requirements, and may therefore also require rearranging the structures that contain them. A common, nasty scenario is when alignment problems appear after a remote commit not directly related to the affected structure. In such scenarios, seemingly innocent changes can drastically decrease the performance of an application.

Summary

We discussed the topic of static alignment. A few performance hints to remember:

  1. The alignment requirement can be controlled using alignas keyword.
  2. The natural alignment can be obtained using alignof keyword.
  3. Alignment of an array is forced only for its first element. Therefore consecutive elements might not respect the stronger alignment constraint.
  4. When aligning members of structures, the size of the structure might increase substantially due to padding introduced by the compiler.
  5. Structures can be packed more densely by manually reordering members, so that the resulting size decreases.

I don’t think that the topic of static alignment is exhausted here, but the post already got eXtra Large. So let me know if something requires more thorough explanation.

In the next post I will discuss how alignment can be controlled in a dynamic way.

UME::SIMD Tutorial #5: Memory subsystem and alignment

In this tutorial we will discuss a few basic ideas related to memory management in the context of the memory hierarchy. While efficient code generation is an issue, the real performance problems are related to memory. This topic is very broad, and possibly deserves another tutorial series, but some basic knowledge is required to understand vector code development. The material here is theoretical; we will take a closer look at specific techniques in the next posts.

Cache subsystem

If you ever took a basic computer architecture or digital electronics course, you should already be aware of two basic computer architectures: the Von Neumann architecture and the Harvard architecture. It should be noted that modern computer systems are a variation of both.

(figure: Von Neumann architecture)

The Von Neumann architecture in its simplest implementation consists of a memory device, a CPU and I/O interfaces, possibly connected to some external devices. In this tutorial we won’t discuss I/O in detail; we will focus on memory and CPU interaction.

The whole system can work like this:

  1. CPU loads a value from specific memory cells into registers,
  2. CPU performs some calculations on registers,
  3. CPU writes back the result into the memory.

This is very similar to what we did in the previous tutorial. Unfortunately this mode of operation has long been known to hurt performance. Imagine that the CPU carries out some computations and then tries to reuse results generated in the past. The number of registers available is very small, even in modern systems. If the data does not fit in the register bank, new data has to be loaded and the old data stored back to memory; once the old result is needed again, it has to be loaded again. Both memory reads and writes are expensive compared to moving data between registers, simply because of the physical distance between the arithmetic/control units, RAM and the register bank.

Our programs often iterate over large memory containers, with data laid out over consecutive memory addresses. This opens an additional optimization opportunity: loading the data that will be needed in the future while the CPU works on previous data. This basic concurrency technique allows hiding memory latency, and therefore decreasing overall computation time. Unfortunately, with the original architecture, the size of the register bank limits us to loading only a few additional fields.

An engineering approach to the problem of memory access latency was the invention of a cache: a small additional memory placed somewhere between the CPU and RAM, preferably with a dedicated, direct connection to the CPU. Initially the cache was placed on motherboards, slowly moving closer and closer to the CPU cores. As the cache moved closer to the cores, the amount of physical space available dropped, limiting the size of the cache. At the same time the initially single cache was split into a hierarchy of cache levels.

[Figure: Cache hierarchy]

The above picture gives a brief overview of how the memory subsystem works nowadays. Data is moved from RAM through a series of cache levels, each of them replicating the original data and communicating only with the immediately higher and lower memory layers. Cache level L1 is the smallest one and the only one capable of exchanging information with CPU registers. Cache level L2 is larger than L1, but also restricted to a single CPU core. One reason L1 and L2 are distinct in some CPUs is that L1 might come in two pieces: a data cache and an instruction cache. Cache level L3 is usually much larger than L2 and is shared between multiple cores of a single CPU. The existence of a shared cache can improve inter-core communication, and thus is a very interesting topic for high-performance engineering.

Now how does the cache system work? Imagine a scenario: a value (let’s say a 64b integer) is loaded from RAM to L3, then from L3 to L2, then from L2 to L1, finally landing in one of the registers. The CPU performs some computations and, if it runs out of registers, uses the L1 cache to temporarily store the data. Once the data is needed again, it only has to be loaded from L1 and not from RAM! OK, but L1 is very small. So what happens when L1 is depleted? Simply, some of the data is moved from L1 to L2, and moved back from L2 to L1 and then to registers when needed. In a similar way data from L2 is pushed back to L3 if L2 is full. Finally, if L3 is full, the only thing that can happen is to write the data back to RAM.

It is very possible that you, a programmer, never had to deal with the cache before. Why is that? The cache subsystem is managed primarily by the hardware and systems software. The hardware decides when and what data has to be moved between different levels of cache. The system software, including compilers or some privileged modules, might have some form of direct effect on cache behaviour by issuing hints or by configuring the cache management hardware. Regardless of that, from the software perspective the cache is almost completely transparent. Almost, as it might become visible when measuring application performance.

Register alignment

When moving data between registers and memory, it is important to be aware that data is easiest to load when it is register aligned, that is:

[address of data element] modulo [size of register] = 0

This can be depicted using the following example:

[Figure: Register alignment]

In the diagram, we have a structure definition with a possible layout of that structure in contiguous memory. Field a is aligned to its natural alignment, defined as:

[address of data element] modulo [size of data element] = 0

Field b is an array starting immediately after the a field and aligned, according to C++ rules, to the first address that meets the natural alignment of its elements, that is 64b. Mind that all fields of b are 64b aligned, and the array itself is 64b aligned, as its first field’s alignment is 64b.

On some architectures, only register aligned load operations are permitted. Other architectures offer unaligned loads, but penalise them with longer execution time. For both performance and correctness reasons it is best to always align the data to the register alignment when performing computations. As depicted in the example above:

  1. loading b[i] to any 64b register is always possible, as each element of b is register aligned;
  2. loading a pair b[0]-b[1] into a 128b register (i.e. a vector register) might cause alignment issues, as the pair is not aligned to a 128b boundary;
  3. loading a pair b[1]-b[2] into a 128b register will not cause alignment issues, as the pair is aligned to a 128b boundary.

At the same time, aligning too much might waste space, as padding will be added to fill the gaps between aligned fields. Knowing what needs to be aligned, and what should be left untouched, comes with practice and with an understanding of the internals of an application.

In the case presented above, we also assumed that the structure S itself is at least 128b aligned. This does not have to be the case in a general scenario.
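These alignment rules can be checked directly in C++ with alignof and a bit of pointer arithmetic. Below is a minimal sketch; the structure S and its field names mirror the description above, but the exact layout is illustrative, not the one from the diagram:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative structure, similar in spirit to the one in the diagram
// (field names are assumptions, not the exact layout from the figure).
struct S {
    int32_t a;      // naturally aligned to 4B (32b)
    int64_t b[4];   // each element naturally aligned to 8B (64b)
};

// Check: [address of data element] modulo [alignment] == 0
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

For example, `is_aligned(&s.b[0], 8)` is guaranteed by the C++ rules mentioned above, while `is_aligned(&s.b[0], 16)` (the 128b register alignment from point 2) is not guaranteed and has to be checked at runtime or enforced, e.g. with alignas(16).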

Cachelines

To simplify the hardware a bit (and make things faster), instead of passing a single byte at a time, the data is transferred in chunks called cachelines. On x86 the cacheline size is always 64B. On arm32 it is typically 32B and on arm64 it is 64B. Other processors might use different values.

The existence of the cache has an impact on performance. Reading the first byte of a cacheline will take some time, as the data has to be transported through the cache hierarchy. Reading the next 63 bytes, regardless of whether they belong to the same data set or not, will introduce little to no latency. This means that the most efficient memory layout is a contiguous array of data. If you are using an array-of-structures (AoS) layout, then accessing only a single field of each structure will not be an optimal access pattern: more cachelines will have to be loaded into L1, and the utilisation of each cacheline will be very low.
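The AoS problem can be sketched with a hypothetical example (the structure and field names below are illustrative). Summing a single 8B field touches one whole 64B cacheline per element in the AoS layout, but only one cacheline per 8 elements in the structure-of-arrays (SoA) layout:

```cpp
#include <vector>

// AoS: each 64B struct fills a whole cacheline, but we only use 8B of it.
struct Particle {
    double x;        // the only field we sum
    double pad[7];   // other fields (velocity, mass, ...), 56B we never touch
};

double sum_aos(const std::vector<Particle>& p) {
    double s = 0.0;
    for (const auto& e : p) s += e.x;   // ~1 cacheline loaded per element
    return s;
}

// SoA: the x values are contiguous; 8 of them fit in one 64B cacheline.
double sum_soa(const std::vector<double>& x) {
    double s = 0.0;
    for (double v : x) s += v;          // ~1 cacheline loaded per 8 elements
    return s;
}
```

Both functions compute the same result; only the cacheline utilisation differs, which is what shows up in measurements on large data sets.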

We will say that a data element is cacheline aligned if it begins at a cacheline boundary, that is:

[address of data element] modulo [cacheline size] = 0

A large portion of application optimization revolves around making data layouts cache friendly. This means both reusing data and maximizing cacheline consumption.
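Cacheline alignment can be requested explicitly in C++ with alignas. A minimal sketch, assuming the 64B x86 cacheline size mentioned above:

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cacheline size (64B on x86, as discussed above).
constexpr std::size_t CACHELINE_SIZE = 64;

// A buffer that always starts at a cacheline boundary, so iterating
// over it consumes whole cachelines from the very first byte.
struct alignas(CACHELINE_SIZE) AlignedBuffer {
    float data[1024];
};

// Check: [address of data element] modulo [cacheline size] == 0
inline bool is_cacheline_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % CACHELINE_SIZE == 0;
}
```
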

Pages

Computer systems have very limited resources in terms of RAM, compared to permanent storage such as hard drives. Running out of RAM is bad for everyone, as it most often results in an application crash and sometimes even in a system crash. A common way of dealing with insufficient memory is known as memory virtualization. Memory virtualization is also a very useful mechanism for protecting applications from the malicious influence of other, possibly untrusted applications.

In great, great simplification, memory virtualization works as follows:

  1. When an application starts, the operating system allocates a certain amount of space required by the static requirements of the program (e.g. the program stack). This space starts at a virtual address 0x0 and ends at some virtual address 0xff..ff. The precise size of this space depends on the OS and hardware. These address spaces are not unique between applications, but the addresses are unique within a single system process.
  2. The virtual address space is divided into 4kB blocks, called pages. Each page of virtual memory that is considered useful is registered in a virtual address translation table. Each entry of the table consists of: a process ID, a virtual page address and a physical page address. The physical page address is unique in the scale of the operating system.
  3. When the process allocates some memory, the OS adds entries to the table, and reserves some space in the physical memory. When the process frees up the memory, the OS removes specific entries from the table.
  4. When the OS runs out of physical memory it allocates a virtual memory file somewhere on a hard drive. This file extends the RAM capacity by allowing OS directed movement of pages between physical memory and virtual memory file.
  5. When a user process (i.e. an application) accesses a specific address in virtual memory, the OS checks the physical address of the page. If the page resides in RAM, it starts the process of address translation. If the page is not in RAM, the OS moves the specific page from the virtual memory file to RAM, possibly replacing some other process’ pages, and re-assigns the physical address in the virtual address translation table. Once this memory migration happens, the OS calculates the physical address and returns it to the application.

In some sense the paging mechanism acts as an OS-based caching mechanism. While initially paging was implemented purely in software, a large amount of effort was put into hardware support, primarily for performance reasons. We won’t be talking about paging much in this series, but it is an important topic to consider in the context of parallel applications.

For the sake of completeness: we will call a memory element page aligned if it meets the following equation:

[address of memory element] modulo [page size] = 0
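Page-aligned memory can also be requested explicitly. A sketch using C++17 std::aligned_alloc, assuming a 4kB page size (portable code should query the OS instead, e.g. sysconf(_SC_PAGESIZE) on POSIX systems):

```cpp
#include <cstddef>
#include <cstdlib>

// Assumed page size (4kB, as in the text); query the OS in portable code.
constexpr std::size_t PAGE_SIZE = 4096;

// Allocate at least n bytes of page-aligned memory; free with std::free.
void* page_aligned_alloc(std::size_t n) {
    // std::aligned_alloc (C++17) requires the size to be a multiple of
    // the alignment, so round n up to a whole number of pages.
    std::size_t rounded = (n + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;
    return std::aligned_alloc(PAGE_SIZE, rounded);
}
```

A pointer returned this way satisfies the page-alignment equation above.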

You can read more about memory virtualization here.

Summary

By now you should be aware of the cache subsystem and the paging mechanism. You should also be able to distinguish between natural alignment, register alignment, cacheline alignment and page alignment.

A few performance hints to remember:

  1. Accessing data in a contiguous manner is usually the most efficient way, as it allows complete consumption of cachelines.
  2. Maximizing data reuse is the key to exploiting the cache subsystem. Perform as many operations as possible on the most recently used/created data.
  3. Try to meet alignment requirements, as this can reduce the number of transported cachelines and make cache-register communication faster.

In the following posts we will look at some practical aspects of memory alignment and efficient cache utilisation. We will also look at data structures and memory layouts that are critical from the performance perspective.

UME::SIMD Tutorial 4: Handling longer datasets

We already know how to read data from memory (LOAD), how to perform basic vertical and horizontal operations, and how to write results back to memory (STORE). But operating on very short vectors is not particularly useful, as we usually want to handle large data sets. The solution is very simple: just use C++ loops. Consider the following example:

Example: Given scalar coefficients a and b, and two arrays x and y, compute a set of y_i variables given by N linear equations:

y_i = a*x_i + b , i \in [0, N)

Solution:
Since we don’t know how big (or small) N is, we have to consider the case when N is not a multiple of the number of elements packed in a SIMDVec (we will call this value the SIMD stride). We therefore have to split N into two parts:

  1. part executed using SIMDVec code (peel loop), and
  2. part executed using scalar code (remainder loop).

For the actual code we will assume the SIMD stride to be 4 and the packed precision type to be float.

#include <umesimd/UMESimd.h>
using namespace UME::SIMD;
//...
void linear_equation_evaluator(
    float *y, 
    float *x,
    float a,
    float b,
    int N)
{
  SIMDVec<float, 4> x_vec, y_vec;
  SIMDVec<float, 4> a_vec(a), b_vec(b);

  int REMAINDER_COUNT=N%4;
  int REMAINDER_OFFSET=N-REMAINDER_COUNT;

  // Compute 'peel' part:
  for(int i=0; i<REMAINDER_OFFSET; i+=4)
  {
    x_vec.load(&x[i]);
    y_vec=a_vec*x_vec+b_vec;
    y_vec.store(&y[i]);
  }

  // Compute 'remainder' part:
  for(int i=REMAINDER_OFFSET; i<N; i++)
  {
    y[i]=a*x[i]+b;
  }
}
//...

The above code structure is worth memorizing; it will appear in UME::SIMD code very often. What we do first is reserve some registers for the x and y values. We also prepare vectors containing the constants a and b. The additional information needed by the remainder loop, its starting offset, is calculated as REMAINDER_OFFSET.

We start computations by iterating over the data elements in chunks of 4 (the SIMD stride). This peel loop is the major source of our performance gain: instead of processing one element at a time (as happens in the remainder loop), we process up to SIMD-stride elements at once. In theory this should give us up to a 4x speedup! (In practice this will rarely be the case; we will discuss this in a future post dedicated to portability and selection of the proper stride.) This speedup is due to the fact that the UME::SIMD library internally uses hardware mechanisms for data-level parallelism, such as specialized CPU instructions.

Finally we execute the remainder loop, which performs the calculations over the remaining elements. Mind that the remainder loop in this case is just the regular scalar code you would normally write to solve the problem presented in the example. The only difference is the initial value of the loop index i. For practical scenarios, be advised to first write the scalar version of your kernel, as you might still need it for the remainder part of your vectorized kernel. Having the scalar version is also important for result verification.
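For reference, the same peel/remainder control flow can be sketched in plain C++, with the peel loop unrolled by hand to a stride of 4 instead of using SIMDVec. This is an illustration of the loop structure only, not of the actual vector instructions the library generates:

```cpp
// Plain-C++ sketch of the peel/remainder split (illustrative, no UME::SIMD).
void linear_equation_evaluator_scalar(
    float* y, const float* x, float a, float b, int N)
{
    const int STRIDE = 4;
    int REMAINDER_COUNT = N % STRIDE;
    int REMAINDER_OFFSET = N - REMAINDER_COUNT;

    // 'Peel' part: process full chunks of STRIDE elements. Note the loop
    // bound is REMAINDER_OFFSET, not N, so we never run past the last
    // complete chunk.
    for (int i = 0; i < REMAINDER_OFFSET; i += STRIDE) {
        y[i]     = a * x[i]     + b;
        y[i + 1] = a * x[i + 1] + b;
        y[i + 2] = a * x[i + 2] + b;
        y[i + 3] = a * x[i + 3] + b;
    }

    // 'Remainder' part: the leftover 0..3 elements, plain scalar code.
    for (int i = REMAINDER_OFFSET; i < N; i++) {
        y[i] = a * x[i] + b;
    }
}
```

A scalar version like this is also exactly what you would use to verify the results of the vectorized kernel.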

The function presented in our solution can only be simplified if we know from the definition of the problem that N is a multiple of 4. In such a scenario, we could completely remove the remainder loop from the function definition. Alternatively, if we know that N is always less than 4, we could completely remove the peel part, as it would never be executed. Unfortunately for most scenarios this will mean some sort of code duplication. And code duplication is always bad: twice the bugs, twice the testing, quadruple the maintenance time… The next post will show you how this code duplication could be avoided.