# UME::SIMD Tutorial 9: gather/scatter operations

As you already know a SIMD registers can communicate with the scalar domain (or more precisely with the memory) using LOAD and STORE operations. The drawback of these operations is that they only allow you to move data elements, that are placed contiguously in the memory. However very often in our codes we have to deal with non-contiguous memory access. In this tutorial we will explain GATHER/SCATTER operations and how they generalize LOAD/STORE operations.

There are scenarios in which you might want to populate a register with data comming from non-consecutive memory locations. Few examples are:

1. Accessing every second element of an array ( strided access
2. Accessing element of an array at freshly computed offsets ( indexed access)
3. Accessing elements in different order (shuffling access)

We will discuss first two situations in this post. The last class of problems needs more thorough discussion in the context of permutations, so we will discuss it in the next tutorial.

What then is a SCATTER operation? It is simply a reverse of a GATHER operation, which ‘scatters’ the content of a register over the memory. It is therefore similar to STORE operation, but capable of non-congiguous memory access.

## Strided access

The memory access pattern is called strided when memory fields accessed are equally distant. This distance is called a stride (not to be mistaken with SIMD-stride!). A simple visualization of strided access:

As you can see, the STRIDE-1 access is a special case of GATHER operation: a LOAD operation. Why then do we have separate LOAD and GATHER operations (and STORE and SCATTER) instead just simplifying things and using only GATHERs? There are two explanations. Firstly, there is the historical one: early processors only implemented LOAD instructions to move data between memory and scalar registers. Since in scalar domain you can access any element using a single scalar index, there was no need for more flexible operations.
Secondly, there is the performance aspect: apart from passing the base memory address, the gather instruction requires also information about how to calculate specific offsets to load from. No matter how the instruction is implemented within the processor, this is an additional degree of freedom, possibly meaning longer execution time, but for sure meaning additional circuitry.

PERFORMANCE HINT Use load/store operations whenever you can, and avoid gather/scatter if possible. In most scenarios this will mean that your data structures/algorithm have to be modified. Use gather/scatter only when you are certain, that re-designing structures is not possible.

When performing a strided access you need to know what are the base address, passed as a pointer to the beginning of your data, and the stride value, passed as a scalar integer. The stride is always passed as the number of elements, rarther than memory offset so that the programming can be simplified.

Following code example shows how to perform a strided memory accesses:

float a[LARGE_DATA_SIZE];
uint32_t STRIDE = 8;
...
for(int i = 0; i < PROBLEM_SIZE; i+=8) {
SIMDVec<float, 8> vec;

// Note that we have to scale the loop index.
int offset = i*STRIDE;

// 'load' the data to vec.
vec.gather(&a[offset], STRIDE);
// do something useful
vec += 3.14;
// store the result at original locations
vec.scatter(&a[offset], STRIDE);
}


When using strided access, we have to pass the address of the first element. This is done by calculation offset variable in every iteration. The gather operation then uses this local base address together with a scalar stride to calculate offsets of the corresponding elements.
Once the necessary calculations are over, the updated results are stored back at original locations.

## Indexed access

The indexed access is even more general-purpose than strided access. The primary difference is, that instead of passing a scalar stride parameter, you have to pass a SIMDVec of unsigned integer indices. The indexed access can be visualised using following graphic:

The adventage of this mode is, that every element can be retrieved using a dedicated index. Disadvantage is that the indexing this way might completely destroy hardware based memory prefetching, and in turn negatively impact your performance. Also mind that all elements might be far away, meaning that more cachelines have to be moved to the lowest cache level.

An example code using indexed access:

float a[LARGE_DATA_SIZE];
int indices[PROBLEM_SIZE];
uint32_t STRIDE = 4;
...
for(int i = 0; i < PROBLEM_SIZE; i+=8) {
SIMDVec<float, 8> vec;

// Here we are using precomputed indices,
// but they can be computed on-the-fly if necessary.
SIMDVec<uint32_t, 8> indices_vec(&indices[i];

// 'load' the data to vec.
vec.gather(&a[0], indices_vec);
// do something useful
vec += 3.14;
// store the result at original locations
vec.scatter(&a[0], indices_vec);
}


The basic difference is that instead of passing a scalar stride we will use an unsigned integer vector of 32b indices.

NOTE: at the moment the library is using vector of unsigned integers having the same precision as the scalar elements of the gathered vector. This causes troubles when dealing with mixed precision and when small types (e.g. uint8_t) don’t have enough bits to represent a full range of indices. The library will be updated to always use vectors of uint32_t indices

## Securing conditional access

One of the problems you might spot when writing your code is when you are trying to deal with conditional statements. Consider following scalar code:

float a[PROBLEM_SIZE], b[PROBLEM_SIZE];
float c[LARGE_DATASET_SIZE];
...
for(int i = 0; i < PROBLEM_SIZE; i++) {
// Here we are checking if for some reason one of the
// values in (a[i],b[i]) pair is not determined properly.
if (std::isfin(a[i] - b[i])) {
// Calculate the index only if both 'a' and 'b' are well defined
int index = int(a[i] - b[i]);
// 'gather' single element at a time
float temp = c[index];
// Do something with the value
temp += 3.14;
// Update the values of 'c'
c[index] = temp;
}
}


As you should already know (see UME::SIMD Tutorial 8: Conditional execution using masks) the vector code using masks will execute all operations within the ‘if’ branch, regardless of the fact whether ‘a’ and ‘b’ are well defined or not. It might be tempting to re-write the above code as:

float a[PROBLEM_SIZE], b[PROBLEM_SIZE];
float c[LARGE_DATASET_SIZE];
...
// For simplification we are assuming that: ((PROBLEM_SIZE % 8) == 0)
for(int i = 0; i < PROBLEM_SIZE; i+= 8) {
// Here we are checking if for some reason (e.g. a design decision) one
// of the values in (a[i],b[i]) pair is not determined properly.

SIMDVec<float, 8> a_vec(&a[i]), b_vec(&b[i]);
SIMDVecMask<8> condition = (a_vec - b_vec).isfin();

// if (std::isfin(a[i] - b[i])) {
SIMDVec<uint32_t, 8> indices = a_vec - b_vec;
SIMDVec<float, 8> temp;

temp.gather(&c[0], indices); // This is WRONG!!!
temp.adda(condition, 3.14); // only change selected elements
temp.scatter(&c[0], indices); // Again WRONG!!!
}


Why are the gather and scatter operations in this snippet wrong? The are both trying to access the memory, even if the indices are incorrect. But neither GATHER nor SCATTER happen to care about that. They have to trust us to provide them with correct indices, at least for the performance reasons. If the indices are incorrect, they will try to access memory that is possibly outside of the boundaries of ‘c’ dataset. This might result in a memory fault, and for that reason we have to secure both memory read and write using our conditional mask:

float a[PROBLEM_SIZE], b[PROBLEM_SIZE];
float c[LARGE_DATASET_SIZE];
...
// For simplification we are assuming that: ((PROBLEM_SIZE % 8) == 0)
for(int i = 0; i < PROBLEM_SIZE; i+= 8) {
// Here we are checking if for some reason (e.g. by design?) one of the
// values in (a[i],b[i]) pair is not determined properly.

SIMDVec<float, 8> a_vec(&a[i]), b_vec(&b[i]);
SIMDVecMask<8> condition = (a_vec - b_vec).isfin();

// if (std::isfin(a[i] - b[i])) {
SIMDVec<uint32_t, 8> indices = a_vec - b_vec;
SIMDVec<float, 8> temp;

temp.gather(condition, &c[0], indices); // Now the read is CORRECT!!!
temp += 3.14; // we don't have to mask this operation
temp.scatter(condition, &c[0], indices); // Again no problems here.
}


So in the corrected version we have to simply pass an additional mask parameter to memory access operations, rather to arithmetic operations. The library will take care of securing the reads so that they access only allowed memory addresses.

## Summary

In this tutorial we introduced concept of GATHER/SCATTER operations, and explained why they are a useful addition to our SIMD programming model. We looked at both strided and indexed memory access patterns, and explained how this concept generalizes LOAD/STORE operations.

While very usefull, GATHER/SCATTER operations are a double-edged sword that can both make our lives easier, and destroy our performance.

In the next tutorial we will talk about permutations. There we will discuss also the GATHER/SCATTER as one of possible strategies for re-arranging vector elements.

# UME::SIMD Tutorial 8: Conditional execution using masks

The first real difference between scalar code and SIMD code relates to conditional execution of calculations. C++ is, by design, a language heavily supporting scalar operations. Apart from the arithmetic statements, the language offers also control flow statements, such as: if-then-else, switch and goto. All these can be mixed with UME::SIMD code, but there are few restrictions imposed by the language. We will only discuss here the ‘if-else’ block, but the same considerations apply for other conditional statmeents.

## Scalar conditional statements

A simple scalar conditional statement would look like this:

int x;
int a = 2;
int b = 3;
int c;
...

if(x) {
c = a + b;
}
else {
c = a - b;
}


Now depending on the value of x, the variable c will take either value of ‘5’ or ‘-1’. To understand which of two branches ( if branch or else branch) will be selected, we have to look at the predicate expression in the if clause:

if(<predicate expression>)


According to C++ rules, the type of the predicate expression, needs to be converted into a boolean. That is, if the expression is using only bool variables, then the result will be prety straightforward to predict. A different case is when we use expressions that evaluate to integers (in our case x is an expression that evaluates to an integer!). In such case we need to know how the C++ converts this integer expression to a boolean expression. The rule to do so is called implicit conversion rule, and is defined within the language standard. The rule goes like this:

Zero value, null pointer value, or null member pointer value is converted to false; any other value is converted to true.

In our particular case the same predicate expression will be therefore interpreted as a boolean predicate (or a boolean expression):


bool predicateValue = (x!=0)
if(predicateValue)


Understanding this relation is critical in understanding why we can’t use vector types directly within if() or while() statements.

## Vector conditional statements

Let’s visualise what we will be talking about in this part using a piece of code:


UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...

if(x) { // Error: What does this mean?!
c = a + b;
}
else {
c = a - b;
}


In the above code we are running into a very interesting problem: how to convert a vector x into a boolean value? Unfortunately, since C++ is not aware of the vector types, it cannot make any assumptions as for the implicit conversions to boolean expression. More of that: you might want to express different operations by this piece of code. Do you want to execute each block for specific vector elements that fulfill the predicate:


UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...

for(int i = 0; i < 2; i++)
{
if(x[i] != 0) {
c[i] = a[i] + b[i];
}
else {
c[i] = a[i] - b[i];
}
}



or perhaps you want to execute whole block only if all values in the vector fulfill certain condition:

UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...

bool predicate = (x[0] != 0) && (x[1] != 0)
if(predicate) {
c = a + b;
}
else {
c = a - b;
}


or maybe you want to execute the blocks if any of the elements meets the predicate criteria:

UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...

bool predicate = (x[0] != 0) || (x[1] != 0)
if(predicate) {
c = a + b;
}
else {
c = a - b;
}


As you can see, the original language is too ambiguous to let you simpy use the same syntax. At the same time, a library such as UME::SIMD does not have capabilities of a compiler, allowing it to look forward into the code and make a decision based on forward analysis. We therefore need more straightforward, yet unambiguous mechanisms to express our intent.

UME::SIMD uses a concept of data flow apart from the already well established control flow to steer the way the actual numerical values are being passed through the program. The data flow concept doeasn’t require any conditional statements. It rather uses a concept of an execution mask. Mask is a vector consisting of only boolean values. A ‘true’ value means that specific vector lanes should execute operation, while a ‘false’ value means that operation should be blocked, that is the result should not be written to the output.

The library exposes a category of types named SIMDVecMask for date-flow control. We will now go through basic operations that are allowed with these types.

A basic declaration of looks similar to a vector declaration:



// Initialize using scalars
...
bool maskValues[4] = {true, false, false, true};
// Initialize using an array



### Mask initialization using predicate operations

A mask can be also obtained using one of the predicate operations applied on arithmetic vectors:


UME::SIMD::SIMDVec<float, 8> vec1, vec2;

// 'CoMPare EQual'
// 'CoMPare Not EQual'
// 'CoMPare Greater Than'
// 'CoMPare Greater than or Equal'
// 'CoMPare Less Than'
// 'CoMPare Less than or Equal'



For each predicate a specific comparison operation is applied in an elementwise manner for elements of vectors vec1 and vec2.

Masks can be also operated bit there are slightly different operations than for arithmetic vectors. This is due to the fact, that for portability reasons. It is not possible to mix arithmetic vectors, and masks, except for few special situations. Code below shows most of the operations allowed on masks, and their interaction with arithmetic types:


// Some vertical operations:
....
m0 = m1.land(m2);
m0 = m1 && m2;    // the same as above
m0 = m1.lor(m2);
m0 = m1 || m2;    // the same as above
m0 = m1.lxor(m2);
m0 = m1 ^ m2;     // the same as above
m0 = m1.landnot(m2); // equivalent to: !m1 && m2

// Also some horizontal (reduction) operations
bool b0;
...
b0 = m1.hland() // m1[0] && m1[1] && ... && m1[7]
b0 = m1.hlor() // m1[0] || m1[1] && ... && m1[7]



Once the mask is calculated, it can be passed as an optional parameter to most of the arithmetic operations:


UME::SIMD::SIMDVec<float, 4> vec0, vec1, vec2, vec3;
...
m0 = vec0 > vec1;
// Add only pairs where m0 is true
// Similar expressed using operator syntax
vec2[m0] = vec0 + vec1

// 'Blend' two vectors using masking.
// 'vec2[i]' becomes 'vec1[i]' where 'mask[i]'
// is true, and 'vec0[i]' where false.



Now let’s look at specific examples we discussed before. Scenario 1 would look like this in SIMD arithmetic:


UME::SIMD::SIMDVec<int,2> x(0, 10);
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
UME::SIMD::SIMDVec<int,2> t0, t1;
...

m0 = x!=0;                // (false, true)
t0 = a+b;                 // (5, 2)
t1 = a-b;                 // (-1, 8)
c = t1.blend(m0, t0);     // (-1, 2)

// Or briefly:
c=(a-b).blend(x!=0, a+b); // (-1, 2)



PERFORAMNCE PITFALL: when executing scalar code, only one branch of code gets executed for every element of a vector. The decision about which branch to execute is made on a per-iteration basis. This allows the microprocessor to completely skip certain parts of the code.
In SIMD world, an arithmetic operation happens atomicaly for all elements of the vector, and it is not possible to ‘skip’ the execution. As a result we have to execute both if and ‘else’ statements in every iteration. For long ‘if-else’ blocks this might mean executing large number of additional operations (as compared to scalar code), and therefore might decrease the expected vectorization gains.

The second example is the execution of the variant with ‘if all true’. Mind that in this case, we might still use the same ‘if-else’ statement:


UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...
// The original code was using following formula:
// bool predicate = (x[0] != 0) && (x[1] != 0)

// We first declare the mask, and perform a proper
// comparison for values of 'x':
// Now we have to perform a 'Horizontal Logical AND'
// (reduction) operation:
bool predicate = m0.hland();

// At this moment we already know that
// we have to execute either 'if' or 'else' block
// for all elements of a vector.
if(predicate) {
c = a + b;
}
else {
c = a - b;
}



The same format can be used for Scenario #3 but using ‘Horizontal Logical OR’ operation. In general case it is up to you, to decide which construct is more usefull in a given situation.

PERFORMANCE PITFALL Using reduction operations is expensive on it’s own, so it should be avoided. While we might be able to decrease the number of instructions executed in scenarios #2 and #3, the cost of reduction together with the cost of handling ‘if-else’ statement might be actually higher than execution of equivalent scalar code.

## Summary

This tutorial discussed basic usage of masks. While control-flow still remains an important part of UME::SIMD based programs, the SIMD programming model requires also data flow to open performance opportunities to user programs.
Masks can be treated as vectors of boolean values, and obtained either by explicit initialization or by calling a specific comparison operation engaging arithmetic vectors. Then they can be used to block the flow of data within a UME::SIMD program.

# UME::SIMD Tutorial 7: Dynamic alignment

In the previous post we discussed how to control the alignment of statically declared variables, and struct/class members. In this post we will look at ways of performing dynamic allocations with alignment control.

## The C++ way

Unfortunately so far there is no convenient, standard way for aligned allocations. Current C++ requires the users to perform overallocation and then to use only the data at aligned offset. This can be done using std::align function.

Example code:

#include <iostream>
#include <iomanip>
#include <memory>

std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

uint64_t result = 1;
for(uint64_t i = 2; i <= 4096; i*=2) {
if(addr % i != 0) break;
result = i;
}
return result;
}

int main()
{
const int ALIGNMENT = 64; // align to 64B (512b) boundary
int ARRAY_LENGTH = 5;     // size in number of array fields
void *original_ptr;       // this has to be 'void*' so that we could use it with std::align
void *aligned_ptr;

std::size_t SIZE = ARRAY_LENGTH*sizeof(float);
std::size_t OVERALLOCATED_SIZE    // overallocated size in bytes
= (SIZE + ALIGNMENT - sizeof(float));

original_ptr = (float*) malloc(OVERALLOCATED_SIZE);

std::cout << "[original_ptr]"
std::cout << "\nvariable alignment: " << alignment((uint64_t)original_ptr)
<< "\nrequired type alignment: " << alignof(decltype(*(float*)original_ptr))
<< "\nrequired ptr alignment: " << alignof(decltype(original_ptr)) << "\n";

// We would like to retain the 'original_ptr'
// to be able to free the buffer in the future
aligned_ptr = original_ptr;
aligned_ptr = std::align(
ALIGNMENT,
sizeof(float),
aligned_ptr,
OVERALLOCATED_SIZE);

if(aligned_ptr == nullptr) {
std::cout << "std::align failed: no sufficient space in the buffer.\n";
}

std::cout << "[aligned_ptr]"
std::cout << "\nvariable alignment: " << alignment((uint64_t)aligned_ptr)
<< "\nrequired type alignment: " << alignof(decltype(*(float*)aligned_ptr))
<< "\nrequired ptr alignment: " << alignof(decltype(aligned_ptr)) << "\n";

free(original_ptr);
}


Exemplary output:
 [original_ptr] address: 0x1109c20 variable alignment: 32 required type alignment: 4 required ptr alignment: 8 [aligned_ptr] address: 0x1109c40 variable alignment: 64 required type alignment: 4 required ptr alignment: 8 

What the code does is:

1. it calculates the overallocation size ( OVERALLOCATED_SIZE) taking into account additional padding,
2. allocates an oversized buffer using a standard allocation method,
3. creates a new pointer to point at properly aligned location within the buffer,
4. uses the allocated data to do something (possibly useful),
5. frees the data using the original pointer.

This method is pretty troublesome. First of all we need to manually calculate the size of the new buffer. It is not difficult, however it might be a source of error, and is troublesome in case that aligned allocations have to be frequent in the code. The allocated size can be calculated from the following formula:

  std::size_t OVERALLOCATED_SIZE = (SIZE + ALIGNMENT - sizeof(float));


A small graphics to show why this formula would work:

In a worst case scenario we could end up with malloc returning an address that is pointing just after a 64B aligned address. This means, that the first element would have to be shifted 60B to the right ( ALIGNMENT - sizeof(float)). The last element would also have to be shifted, so the span between original location of the first element, and the new location of the last data element would be: 60 + 5*sizeof(float) that is 80 bytes. This is exactly how much space we need to allocate.

Secondly, we have to make sure that we retain the knowledge about original location of the memory buffer. This requires us to actually control two pointers: one for the original buffer, and one for the aligned buffer. Even with that it is easy to fall into a PITFALL of passing the original pointer to the std::align function, as it might be modified by the function! We can avoid this by copying the original pointer into the new pointer and passing it instead:

aligned_ptr = original_ptr;
aligned_ptr = std::align(
ALIGNMENT,
sizeof(float),
aligned_ptr,
OVERALLOCATED_SIZE);


## Dynamic allocation using UME

UME offers you a set of portable wrappers for aligned allocation. You can acces these using UME::DynamicMemory::AlignedMalloc() and UME::DynamicMemory::AlignedFree() methods. The behaviour is almost identical to standard malloc call, with the difference that also additional ALIGNMENT parameter has to be defined. Here’s a code example:

#include <iostream>
#include <iomanip>

#include <umesimd/UMESimd.h>

std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

uint64_t result = 1;
for(uint64_t i = 2; i <= 4096; i*=2) {
if(addr % i != 0) break;
result = i;
}
return result;
}

int main()
{
const int ALIGNMENT = 128; // align to 128B boundary
int ARRAY_LENGTH = 5;     // size in number of array fields
std::size_t SIZE = ARRAY_LENGTH*sizeof(float);

float *ptr = (float*) UME::DynamicMemory::AlignedMalloc(SIZE, ALIGNMENT);

std::cout << "[ptr]"
std::cout << "\nvariable alignment: " << alignment((uint64_t)ptr)
<< "\nrequired type alignment: " << alignof(decltype(*ptr))
<< "\nrequired ptr alignment: " << alignof(decltype(ptr)) << "\n";

UME::DynamicMemory::AlignedFree(ptr);
}


Nothing more, nothing less. No additional calculations required, no additional pointers to track, simple syntax.

# UME::SIMD Tutorials 6: Static alignment

As explained briefly in previous tutorial, data alignment can be an important factor in performance sensitive codes. In this tutorial we will go through direct techniques that can be used for static alignment control. Static alignment is the alignment of variables, including class members, which can be deduced at compile time.

## Variable alignment

The static alignment can be controlled in C++ using alignas specifier. Let’s look at an example containing basic scalar variables only:

#include <iostream>
#include <iomanip>

std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

uint64_t result = 1;
for(uint64_t i = 2; i < 4096; i*=2) {
if(addr % i != 0) break;
result = i;
}
return result;
}

int main()
{
uint8_t a;
uint32_t b;
alignas(16) uint32_t d;
alignas(4) uint8_t c;

std::cout << "[a]"
std::cout << "\nvariable alignment: " << alignment((uint64_t)&a)
<< "\nrequired type alignment: " << alignof(decltype(a)) << "\n";
std::cout << "\n[b]"
std::cout << "\nvariable alignment: " << alignment((uint64_t)&b)
<< "\nrequired type alignment: " << alignof(decltype(b)) << "\n";
std::cout << "\n[c]"
std::cout << "\nvariable alignment: " << alignment((uint64_t)&c)
<< "\nrequired type alignment: " << alignof(decltype(c)) << "\n";

std::cout << "\n[d]"
std::cout << "\nvariable alignment: " << alignment((uint64_t)&d)
<< "\nrequired type alignment: " << alignof(decltype(d)) << "\n";
}


Running above code will give an output looking like this:
 [a] address: 0x7ffdf0452c04 variable alignment: 4 required type alignment: 1 [b] address: 0x7ffdf0452c00 variable alignment: 1024 required type alignment: 4 [c] address: 0x7ffdf0452c10 variable alignment: 16 required type alignment: 4 [d] address: 0x7ffdf0452c08 variable alignment: 8 required type alignment: 1 

I recommend re-running the example couple of times to observe how the specific alignments change.

The first variable a is required to be aligned to 1B boundary, as it’s natural alignment is 1B (sizeof(uint8_t)). The actual alignment is 4B, but this is OK, as it respects 1B requirement (as well as 2B). In other runs it can ba aligned to an arbitrary allowed boundary.

The second variable b also respects the natural alignment, which is in this case 4B (sizeof(uint32_t)). In this case the alignment of the variable should never be smaller than 4B.

For variables c and d we artificially force the alignment to be 16B and 4B respectively. In the first case the actual alignment ends up to be as expected. For d we end up with a stronger alignment but, as previously, it is also acceptable.

## Array alignment

We can also use the alignas specifier for statically declared arrays.

#include <iostream>
#include <iomanip>

std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

uint64_t result = 1;
for(uint64_t i = 2; i < 4096; i*=2) {
if(addr % i != 0) break;
result = i;
}
return result;
}

int main()
{
uint8_t a[10];
uint32_t b[20];
alignas(16) uint32_t c[7];
alignas(4) uint8_t d[13];

std::cout << "[a]"
std::cout << "\narray alignment: " << alignment((uint64_t)&a)
<< "\narray size: " << sizeof(a)
<< "\nrequired type alignment: " << alignof(decltype(a));
std::cout << "\n[b]"
std::cout << "\narray alignment: " << alignment((uint64_t)&b)
<< "\narray size: " << sizeof(b)
<< "\nrequired type alignment: " << alignof(decltype(b)) << "\n";
std::cout << "\n[c]"
std::cout << "\narray alignment: " << alignment((uint64_t)&c)
<< "\narray size: " << sizeof(c)
<< "\nrequired type alignment: " << alignof(decltype(c)) << "\n";
std::cout << "\n[d]"
std::cout << "\narray alignment: " << alignment((uint64_t)&d)
<< "\narray size: " << sizeof(d)
<< "\nrequired type alignment: " << alignof(decltype(d)) << "\n";
}


An exemplary output:
 [a] address: 0x7ffcfddbfc79 array alignment: 1 array size: 10 required type alignment: 1 [b] address: 0x7ffcfddbfc00 array alignment: 1024 array size: 80 required type alignment: 4 [c] address: 0x7ffcfddbfc50 array alignment: 16 array size: 28 required type alignment: 4 [d] address: 0x7ffcfddbfc6c array alignment: 4 array size: 13 required type alignment: 1 

As defined in C++, the minimal alignment requirement for an array, is the natural alignment of its elements. So for the array of 8b integers we obtain the required alignment of the arrays to be 1B for a and 4B for b. Forcing the alignment of an array results only in forcing the alignment of the first element. In this particular case, the alignment is forced on variables c and d.

PITFALL: Mind that the rest of the elements are guaranteed to be aligned only to their natural alignment, since C++ requires the contiguous packing of plain array elements. In other words: alignment of c[0] will be at least 16, but the alignment of c[1] might be only 4.

## Member alignment

We will only discuss structures, but the general conclusions apply also for classes.

Again, the code you can use to tinker a little:

#include <iostream>
#include <iomanip>

std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

void print_raw(uint8_t *ptr, int length) {
for(int i = 0; i < length; i++) {
std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << (uint32_t)ptr[i] << " " ;
}
std::cout << std::dec << std::endl;
}

uint64_t result = 2;
for(uint64_t i = 2; i < 4096; i*=2) {
if(addr % i != 0) break;
result = i;
}
return result;
}

struct S_unaligned{
uint8_t a;
uint8_t b;

// We initialize the data structure with some data
// so that the output is more readable
S_unaligned() : a(3), b(7) {}
};

struct S_aligned_a{
alignas(16) uint8_t a;
uint8_t b;

S_aligned_a(): a(3), b(7) {}
};

struct S_aligned_b{
uint8_t a;
alignas(16) uint8_t b;

S_aligned_b(): a(3), b(7) {}
};

int main()
{
S_unaligned x0;

std::cout << "[S_unaligned] size: " << sizeof(x0)
<< "\nrequired type alignment: " << alignof(S_unaligned)
<< "\nobject alignment: " << alignment((uint64_t)&x0)
<< "\ndata:";
print_raw((uint8_t*) &x0, sizeof(x0));

alignas(128) S_unaligned x0_a;
std::cout << "\n[aligned S_unaligned] size: " << sizeof(x0_a)
<< "\nrequired type alignment: " << alignof(S_unaligned)
<< "\nobject alignment: " << alignment((uint64_t)&x0_a)
<< "\ndata:";
print_raw((uint8_t*) &x0_a, sizeof(x0_a));

S_aligned_a x1;
std::cout << "\n[S_aligned_a] size: " << sizeof(x1)
<< "\nrequire dtype alignment: " << alignof(S_aligned_a)
<< "\nobject alignment: " << alignment((uint64_t)&x1)
<< "\ndata:";
print_raw((uint8_t*) &x1, sizeof(x1));

alignas(128) S_aligned_a x1_a;
std::cout << "\n[aligned S_aligned_a] size: " << sizeof(x1_a)
<< "\ntype alignment: " << alignof(S_aligned_a)
<< "\nobject alignment: " << alignment((uint64_t)&x1_a)
<< "\ndata:";
print_raw((uint8_t*) &x1_a, sizeof(x1_a));

S_aligned_b x2;
std::cout << "\n[S_aligned_b] size: " << sizeof(x2)
<< "\nrequire type alignment: " << alignof(S_aligned_b)
<< "\nobject alignment: " << alignment((uint64_t)&x2)
<< "\ndata:";
print_raw((uint8_t*) &x2, sizeof(x2));

alignas(128) S_aligned_b x2_a;
std::cout << "\n[aligned S_aligned_b] size: " << sizeof(x2_a)
<< "\nrequire type alignment: " << alignof(S_aligned_b)
<< "\nobject alignment: " << alignment((uint64_t)&x2_a)
<< "\n data:";
print_raw((uint8_t*) &x2_a, sizeof(x2_a));
}


We will now discuss few selected structures.

#### Structure #1: No members aligned

For the S_unaligned x0 the output is;
 [S_unaligned] size: 2 address: 0x7fff2c36ca78 required type alignment: 1 object alignment: 8 data:0x03 0x07 

Few things to point out here. First of all, the size of the S_unaligned structure is 2 bytes. This means, that also the natural alignment of this structure is 2B. At the same time, the structure has two fields, each of them 1B long (natural alignment equal to 1B). According to C++, the structure itself has to be aligned to the natural alignment of its first member. And indeed it is, as indicated by:

type alignment: 1

While the structure itself has to be aligned to 1B boundary, in this particular case it is aligned to 8B boundary:

object alignment: 8

Again, this is OK as higher alignment requirement also respects lower alignments, i.e. 8B aligned structure is also 4B, 2B and 1B aligned.

When we align an object of this structure, we obtain something like:
 [aligned S_unaligned] size: 2 address: 0x7ffe25de5080 required type alignment: 1 object alignment: 128 data:0x03 0x07 

Nothing changed, except for the minimal value the object alignment can take.

#### Structure #2: only the first member aligned

Now let’s modify the structure to have the a member aligned to some boundary. In this particular example, we are aligning the first field to 16B boundary.

  struct S_aligned_a{
alignas(16) uint8_t a;
uint8_t b;

S_aligned_a(): a(3), b(7) {}
};


The output for a specific execution is:
 [S_aligned_a] size: 16 address: 0x7ffc5b06a0f0 require dtype alignment: 16 object alignment: 16 data:0x03 0x07 0xea 0xe9 0x21 0x7f 0000 0000 0x90 0x77 0xea 0xe9 0x21 0x7f 0000 0000 

Few interesting things happened here. First of all, the size of the structure immediately got larger. The size is now 16B, which means 8 times longer than the initial one! But we also can see from data: 0x03 0x07 ... that a and b are still in the same place within the structure. What is happening? Let’s imagine an example with pointer arithmetic:

S_aligned_a data[10];
S_aligned_a* ptr = &data[10];

for (int i = 0; i <10; i++) {
std::cout <a << std::endl;
ptr++;
}


Due to how pointer arithmetics work, when incrementing the pointer, such as in the loop using ptr, the value stored in the pointer (the address at which the pointer is pointing) has to be increased with the size of the base type of the pointer. In this specific case if the size was 2B, as for the original structure, then the pointer would be increased by 2B each time. This means that we would end up accessing structures of S_aligned_a type at following addresses:

 0x7ffc5b06a0f0, 0x7ffc5b06a0f2, 0x7ffc5b06a0f4, 0x7ffc5b06a0f8, 0x7ffc5b06a0fa, 0x7ffc5b06a0fc, 0x7ffc5b06a0fe, 0x7ffc5b06a100, ... 

Out of all of above adresses, only 0x7ffc5b06a0f0 and 0x7ffc5b06a100 are aligned to 16B boundary! This means that a would not be aligned for data[1], data[2],…, data[6], data[8] etc.
Now the obvious solution is to add padding at the end of each structure object so that the next structure, starting directly after the previous one, was automatically aligned to required boundary. This is exactly how it is done. Unfortunately in some specific cases, as the one presented, this might grow the required size enormously.

As for the previous case, aligning the whole structure doesn’t change much:
 [aligned S_aligned_a] size: 16 address: 0x7ffe25de5000 type alignment: 16 object alignment: 2048 data:0x03 0x07 0x60 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 

#### Structure #3: only the second member aligned

Now let’s try defining another structure, this time with only b aligned:

struct S_aligned_b{
uint8_t a;
alignas(16) uint8_t b;

S_aligned_b(): a(3), b(7) {}
};


A particular test execution gives us:

 [S_aligned_b] size: 32 address: 0x7ffc5b06a160 require type alignment: 16 object alignment: 32 data:0x03 0000 0000 0000 0x10 0000 0000 0000 0x01 0000 0000 0000 0000 0000 0000 0000 0x07 0x19 0x60 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 

What?! Yet another size increase!? Again, C++ is not allowed to reorder the members (due to serialization problems it might cause), so the only way it can behave is by padding with some unused fields between a and b. Let’s think for a bit, why is this necessary.

The a member, is aligned to 1B boundary, that is to its’ natural alignment. Let’s assume that an object of the S_aligned_b structure is aligned to 16B boundary. a would then be aligned to 16B, but the next 15B would only be guaranteed to be aligned to 1, 2, 4 and 8 byte boundaries. This means that the first unocupied address after a, aligned to 16B boundary is starting at the offset 16 from the beginning of the structure. Hence the data fields between two structures have to be padded with unused space, so that the alignment requirement for b are met.

We could use the same rationalization for why the required type alignment is 16: it has to be to guarantee that b is properly aligned.

Yet one more time aligned object does not have a drastic impact:
 [aligned S_aligned_b] size: 32 address: 0x7ffe25de4f80 require type alignment: 16 object alignment: 128 data:0x03 0x8e 0x9f 0xa2 0x01 0x7f 0000 0000 0x90 0x17 0x9f 0xa2 0x01 0x7f 0000 0000 0x07 0xc0 0x66 0xa2 0x01 0x7f 0000 0000 0x80 0x7c 0x9f 0xa2 0x01 0x7f 0000 0000 

## Structure packing

As shown before, the size of the structure can grow surprisingly quickly. Can we fight it, and how? We can reorder the members of a class/struct so that they have tighter packing. The code to show this:

#include <iostream>
#include <iomanip>

std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

void print_raw(uint8_t *ptr, int length) {
for(int i = 0; i < length; i++) {
std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << (uint32_t)ptr[i] << " " ;
}
std::cout << std::dec << std::endl;
}

uint64_t result = 2;
for(uint64_t i = 2; i < 4096; i*=2) {
if(addr % i != 0) break;
result = i;
}
return result;
}

struct S{
uint8_t a;
uint16_t b;
uint32_t c;
uint64_t d;
alignas(256) uint64_t e;

S(): a(3), b(7), c(9), d(1), e(6) {}
};

struct S_packed{
alignas(256) uint64_t e;
uint64_t d;
uint32_t c;
uint16_t b;
uint8_t a;

S_packed(): a(3), b(7), c(9), d(1), e(6) {}
};

struct S_packed_noalign{
uint64_t e;
uint64_t d;
uint32_t c;
uint16_t b;
uint8_t a;

S_packed_noalign(): a(3), b(7), c(9), d(1), e(6) {}
};

int main()
{
S x0;

std::cout << "[S] size: " << sizeof(x0)
<< "\nrequired type alignment: " << alignof(S)
<< "\nobject alignment: " << alignment((uint64_t)&x0)
<< "\ndata:";
print_raw((uint8_t*) &x0, sizeof(x0));

S_packed x1;

std::cout << "[S_packed] size: " << sizeof(x1)
<< "\nrequired type alignment: " << alignof(S_packed)
<< "\nobject alignment: " << alignment((uint64_t)&x1)
<< "\ndata:";
print_raw((uint8_t*) &x1, sizeof(x1));

S_packed_noalign x2;

std::cout << "[S_packed_noalign] size: " << sizeof(x2)
<< "\nrequired type alignment: " << alignof(S_packed_noalign)
<< "\nobject alignment: " << alignment((uint64_t)&x2)
<< "\ndata:";
print_raw((uint8_t*) &x2, sizeof(x2));

}



The output (i marked the actual data that matters for us):
 [S] size: 512 address: 0x7fff1beb2b00 required type alignment: 256 object alignment: 256 data:0x03 0x34 0x07 0000 0x09 0000 0000 0000 0x01 0000 0000 0000 0000 0000 0000 0000 0x46 0x1d 0x41 0x11 0x6a 0x7f 0000 0000 0x58 0xb7 0xbc 0x10 0x6a 0x7f 0000 0000 0x98 0x66 0x40 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0x01 0000 0000 0000 0xec 0x07 0000 0000 0x01 0000 0000 0000 0x25 0x0c 0x78 0x11 0x6a 0x7f 0000 0000 0x48 0xec 0xbb 0x10 0x6a 0x7f 0000 0000 0xa8 0x45 0x99 0x11 0x6a 0x7f 0000 0000 0xc0 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x16 0x04 0x78 0x11 0x6a 0x7f 0000 0000 0x50 0x42 0x99 0x11 0x6a 0x7f 0000 0000 0x10 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x5e 0x96 0x93 0x1c 0000 0000 0000 0000 0x10 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x07 0000 0000 0000 0000 0000 0000 0000 0xb8 0x4f 0x96 0x11 0x6a 0x7f 0000 0000 0xda 0x16 0x87 0x30 0000 0000 0000 0000 0x25 0x0c 0x78 0x11 0x6a 0x7f 0000 0000 0x07 0000 0000 0000 0000 0000 0000 0000 0x5b 0x1c 0xc2 0000 0000 0000 0000 0000 0x1a 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x07 0000 0000 0000 0000 0000 0000 0000 0x90 0x2d 0xeb 0x1b 0xff 0x7f 0000 0000 0xa8 0x06 0x3f 0x11 0x6a 0x7f 0000 0000 0xb0 0xc3 0x40 0x11 0x6a 0x7f 0000 0000 0x70 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x16 0x04 0x78 0x11 0x6a 0x7f 0000 0000 0xa8 0x06 0x3f 0x11 0x6a 0x7f 0000 0000 0xa0 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x06 0000 0000 0000 0000 0000 0000 0000 0xa0 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x07 0000 0000 0000 0000 0000 0000 0000 0xb8 0x4f 0x96 0x11 0x6a 0x7f 0000 0000 0xed 0xe9 0x43 0x2b 0000 0000 0000 0000 0x62 0x0d 0x78 0x11 0x6a 0x7f 0000 0000 0x62 0x85 0x43 0x11 0x6a 0x7f 0000 0000 0xa7 0x0f 0xad 0000 0000 0000 0000 0000 0x2d 0000 0000 0000 0x6a 0x7f 0000 0000 0xb8 0xd6 0xbb 0x10 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x20 0x2e 0xeb 0x1b 0xff 0x7f 0000 0000 0x48 0xec 0xbb 0x10 0x6a 0x7f 0000 0000 0xf8 0x35 0xbc 0x10 0x6a 0x7f 0000 0000 0xe0 0x2d 0xeb 0x1b 0xff 0x7f 0000 0000 0x08 0x2e 0xeb 0x1b 0xff 0x7f 0000 0000 0x50 0x42 0x99 0x11 0x6a 0x7f 0000 0000 0x20 0x31 0x96 0x11 0x6a 0x7f 0000 0000 0xda 0x16 0x87 0x30 0000 0000 0000 0000 0xea 0x0f 0x78 0x11 0x6a 0x7f 0000 0000 0x88 0x91 0x99 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x60 0x30 0x96 0x11 0x6a 0x7f 0000 0000 0xf8 0x44 0x96 0x11 0x6a 0x7f 0000 0000 0xe7 0x08 0x40 0000 0000 0000 0000 0000 0x58 0xb7 0xbc 0x10 0x6a 0x7f 0000 0000 0x10 0x04 0x40 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x01 0000 0000 0000 0x12 0x03 0000 0000 0x01 0000 0000 0000 0x50 0x42 0x99 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0xe0 0x94 0x99 0x11 0x6a 0x7f 0000 0000 [S_packed] size: 256 address: 0x7fff1beb2d00 required type alignment: 256 object alignment: 256 data:0x06 0000 0000 0000 0000 0000 0000 0000 0x01 0000 0000 0000 0000 0000 0000 0000 0x09 0000 0000 0000 0x07 0000 0x03 0000 0x60 0x30 0x96 0x11 0x6a 0x7f 0000 0000 0xed 0xe9 0x43 0x2b 0000 0000 0000 0000 0xea 0x0f 0x78 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x60 0x30 0x96 0x11 0x6a 0x7f 0000 0000 0x01 0000 0000 0000 0xff 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x01 0000 0000 0000 0xff 0x7f 0000 0000 0x88 0x91 0x99 0x11 0x6a 0x7f 0000 0000 0x3a 0x02 0000 0000 0000 0000 0000 0000 0x64 0x4c 0x41 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0xd5 0xb9 0x78 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0xe0 0x94 0x99 0x11 0x6a 0x7f 0000 0000 0x20 0x2e 0xeb 0x1b 0xff 0x7f 0000 0000 0x50 0x42 0x99 0x11 0x6a 0x7f 0000 0000 0x38 0x2e 0xeb 0x1b 0xff 0x7f 0000 0000 0x85 0x4d 0x02 0x04 0x01 0000 0000 0000 0x50 0xea 0x40 0x11 0x6a 0x7f 0000 0000 0xe7 0x08 0x40 0000 0000 0000 0000 0000 0x48 0x96 0x38 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x94 0x5e 0x77 0x11 0x6a 0x7f 0000 0000 0x90 0xe7 0x76 0x11 0x6a 0x7f 0000 0000 0000 0x90 0x3e 0x11 0x6a 0x7f 0000 0000 0x80 0x4c 0x77 0x11 0x6a 0x7f 0000 0000 0x18 0xf2 0x76 0x11 0x6a 0x7f 0000 0000 0x50 0x95 0x76 0x11 0x6a 0x7f 0000 0000 [S_packed_noalign] size: 24 address: 0x7fff1beb2e00 required type alignment: 8 object alignment: 512 data:0x06 0000 0000 0000 0000 0000 0000 0000 0x01 0000 0000 0000 0000 0000 0000 0000 0x09 0000 0000 0000 0x07 0000 0x03 0000 

The general rule for reordering is to calculate max(required_alignment(member), sizeof(member)) and declare the members with higher values first. For this particular case this packing technique halved the size of the structure when one of the field requires a very high alignment.
Also if we can relax the alignment requirement, as in the structure S_packed_noalign, then the amount of space required will be even smaller. We might however run into problems with efficient register LOAD/STORE operations if the alignment criteria are not met.

PITFALL: If the structures are nested, modifications made in components might change their size/alignment requirements. Therefore it might also require rearrangement of derived structures. A common, nasty scenario is when the alignment problems appear after a remote commit, not directly related to affected structure. In such scenarios, seemingly innocent changes can drasticaly decrease the performance of an application.

## Summary

We discussed topic of static alignment. Few performance hints to be remembered:

1. The alignment requirement can be controlled using alignas keyword.
2. The natural alignment can be obtained using alignof keyword.
3. Alignment of array is set only for the first element. Therefor consecutive elements might not respect the necessary alignment constraint.
4. When aligning members of structures the size of the structure might increase substantially due to padding introduced by the compiler.
5. Structures can be packed densely using manual reordering of members, so that the resulting size is decreased.

I don’t think that the topic of static alignment is exhausted here, but the post already got eXtra Large. So let me know if something requires more thorough explanation.

In the next post I will discuss how can alignment be controlled in a dynamic way.

# UME::SIMD Tutorial #5: Memory subsystem and alignment

In this tutorial we will discuss few basic ideas related to memory management in the context of memory hierarchy. While efficient code generation is an issue, the real performance problems are related to memory. This topic is very broad, and possibly deserves another tutorial series, but some basic knowledge is required to understand basic aspects of vector code development. The material here is theoretical, as we will be taking look at specific techniques closer in next posts.

## Cache subsystem

If you ever took basic computer architecture or digital electronics courses, you should be already aware of two basic computer architectures: Von Neumann architecture and Harvard architecture. It should be noted that modern computer systems are a variation of both of them.

The Von Neumann architecture in a simplest implementation consists of: memory device, a CPU and I/O interfaces, possibly connected to some external devices. In this tutorial we won’t discuss in detail the topic of I/O, but we will be focusing on memory and CPU interaction.

The whole system can work like this:

1. CPU loads a value from specific memory cells into registers,
2. CPU performs some calculations on registers,
3. CPU writes back the result into the memory.

This is very similar to what we did in previous tutorial. Unfortunately this mode of operation was already proven to be very hurtful for performance. Imagine that CPU carries some computations, and then tries to reuse results generated in the past. The number of registers available is very small, even in modern systems. If the data does not fit the register bank then possibly new data has to be loaded and the old data has to be stored back to memory. Once the result is needed again, it has to be loaded again. Both memory reads and writes are expensive in comparison to moving data between registers, simply because of physical distance between: arithmetic/control units, RAM and register bank.

Often our programs iterate over large memory containers, with data layed out over consecutive memory addresses. An additional optimization opportunity opens: loading the data that will be needed in the future while CPU works on previous data. This basic concurrency technique allows hiding memory latency, and therefore decreasing overall computation time. Unfortunately with the original architecture, the size of register bank limits us to loading only few additional fields.

An engineering approach for solving the problem of memory access latency was invention of a cache, a small additional memory placed somewhere between the CPU and RAM, and preferrably having a dedicated, direct connection to CPU. Initially the cache was placed on motherboards, slowly moving closer and closer towards CPU cores. As the cache was moving closer to cores, the amount of physical space available was dropping, limiting the size of cache. At the same time, initially single cache, has been split into a hierarchy of cache levels.

The above picture gives a brief overview of how does the memory subsystem work nowadays. The data is moved from RAM through a series of cache levels, each of them replicating the original data, and being able to communicate only with immediately higher and lower memory layers. Cache level L1 is the smallest one and the only one capable of exchanging information with CPU registers. Cache level L2 is larger than L1, but also restricted to single CPU core. The main reason why L1 and L2 are distinct in some CPUs is that L1 might come in two pieces: data cache and instruction cache. Cache level L3 is usually much larger than L2 and is shared between multiple cores of a single CPU. Existence of shared cache can improve inter-core communication, and thus is a very interesting topic for high-performance engineering.

Now how does the cache system work? Imagine a scenario: a value (let’s say 64b integer) is loaded from RAM to L3, then from L3 to L2, then from L2 to L1, finally landing in one of the registers. CPU performs some computations and, if it runs out of registers, uses L1 cache to temporarily store the data. Once the data is needed again, it only has to be loaded from L1 and not from RAM! OK, but L1 is very small. So what happens when L1 is depleted? Simply some of the data is moved from L1 to L2, and moved back from L2 to L1 and then to registers when needed. In a similar way the data from L2 is being pushed back to L3 if L2 is full. Finally if L3 is full, the only thing that can happen is to write the data to RAM.

It is very possible that you, a programmer, never had to deal with cache before. Why is that? The cache subsystem is managed primarily by the hardware and systems software. The hardware decides when and what data has to be moved between different levels of cache. The system software, including compilers or some privilaged modules, might have some form of direct effect on cache behaviour by issuing hints or by configuring the cache management hardware. Regardless of that, from the software perspective the cache is almost completely transparent. Almost, as it might be visible when measuring application performance.

## Register alignment

When moving data between registers and memory, it is important to be aware, that the data is easiest to load, when it is register aligned, that is:

[address of data element] modulo [size of register] = 0

This can be depicted using following example:

In the diagram, we have some structure definition with a possible layout of that structure in the contiguous memory. Field a is aligned to it’s natural alignment defined as:

[address of data element] modulo [size of data element] = 0

Field b is an array starting immediately after the a field and being aligned according to C++ rules, to the first address that meets natural alignment of its’ elements, that is 64b. Mind that all fields of b are 64b aligned, and the array itself is 64b aligned as its’ first fields’ highest alignment is 64b.

On some architectures, only register aligned load operations are permitted. Other architectures offer unaligned loads, but penalise them with longer execution time. For both performance and correctness reasons it is best to always align the data to the register alignment when performing computations. As depicted in the example above:

1. loading b[i] to any 64b register is always possible, as each element of b is register aligned;
2. loading a pair b[0]-b[1] to a 128b register (i.e. a vector register) might cause alignment issues, as the field is not aligned to 128b boundary;
3. loading a pair b[1]-b[2] to a 128b register will not cause the alignment issues, as both fields together are aligned to 128b boundary.

At the same time, aligning too much might waste some space, as padding will be added to fill the gaps between aligned fields. Knowing what needs to be aligned, and what should be left untouched comes with some practice, and with understanding of internals of an application.

In the case presented above, we also assumed, that the structure S itself is at least 128b aligned. This does not have to be the case in a general scenario.

## Cachelines

To simplify the hardware a bit (and make things faster), instead of passing a single byte at a time, the data is being transfered in chunks called cachelines. In x86 the cache line size is always 64B. In arm32 it’s 32B and in arm64 it is 64B. Other processors might use different values.

Existence of cache has an impact on performance. Reading first byte of a cacheline will take some time, as the data has to be transported through the cache. Reading next 63 bytes, regardless of whether they contain the same data set or not, will introduce very little if no latency. This means that the most efficient memory layout is a contiguous array of data. If you are using an array of structures (AoS) layout, then accessing only a single field of each structure will not be an optimal access pattern: more cachelines will have to be loaded to L0, and the percentage of utilisation of each cacheline will be very low.

We will say that a data element is cacheline aligned if it begins with a new cacheline, that is:

[address of data element] modulo [cacheline size] = 0

A large portion of application optimization is revolving around making the data layouts cache friendly. This means both data reuse and maximizing cacheline consumption.

## Pages

Computer systems have very limited resources in terms of RAM, compared to parmanent storage such as hard drives. Running out of RAM is bad for everyone, as it most often results in application and sometimes even in system crash. A common way of avoiding insufficient memory is known as memory virtualization. Memory virtualization is also a very useful mechanism to protect other applications from malicious influence of other, possibly untrusted applications.

In great, great simplification, memory virtualization works as follows:

1. When application starts, the operating system allocates certain amount of space required by static requirements of the program (e.g. the program stack). The space is starting at a virtual address 0x0, and ends at some virtual address 0xff..ff. The precise size of this space depends on the OS and hardware. These address spaces are not unique between applications, but the addresses are unique within a single system process.
2. The virtual address space is divided into 4kB blocks, called pages. Each page of virtual memory that is being considered to be useful is registered in virtual address translation table. The table consists of: process ID, virtual page address and physical page address. Physical page address is unique in the scale of the operating system.
3. When the process allocates some memory, the OS adds entries to the table, and reserves some space in the physical memory. When the process frees up the memory, the OS removes specific entries from the table.
4. When the OS runs out of physical memory it allocates a virtual memory file somewhere on a hard drive. This file extends the RAM capacity by allowing OS directed movement of pages between physical memory and virtual memory file.
5. When a user process (i.e. an application) accesses a specific address in virtual memory, the OS checks the physical address of the page. If the address fits into the RAM, it starts the process of address translation. If the address is not in the RAM, the OS moves specific page from virtual memory file to RAM, possibly replacing some other process’ pages, and re-assigning the physical address in the virtual address translation table. Once this process of memory migration happens, the OS calculates the physical address and returns it to the application.

In some sense the paging mechanism acts as a OS based caching mechanism. While initially paging was implemented purely in the software, a large amount of effort was put into hardware support, primarilly for performance reasons. We won’t be talking about paging too much in the context of this series, but it is an important topic to be considered in a context of parallel aplications.

For the sake of completeness: we will call a memory element to be page aligned if it meets the following equation:

[address of memory element] modulo [page size] = 0

## Summary

At this moment you should be already aware of the cache subsystem and the paging mechanism. You should also be able to distinguish between natural alignment, register alignment, cacheline alignment and page alignment.

Few performance hints to be remembered:

1. Accessing data in a contiguous manner is usually the most efficient way as it allows complete consumption of cachelines.
2. Maximizing data reusal is the key to exploit cache subsystem. Perform as many operations as possible with the most recently used/created data.
3. Try to meet alignment requirements, as it could reduce number of transported cachelines and make cache-registers communication faster

In following posts we will be looking at some practical aspects of memory alignment and efficient cache utilisation. We will also look at data structures and some memory layouts that are critical from the perspective of performance.

# UME::SIMD Tutorial 4: Handling longer datasets

We know already how to read data from memory (LOAD), how to perform basic vertical and horizontal operations, and how to write results back to memory (STORE). But operating on very short vectors is not particularly useful, as we usually want to handle large data sets. The solution is very simple: just use C++ loops. Consider following example:

Example: Given scalar coefficients a and b, and two arrays x and y, compute a set of $y_i$ variables given by N linear equations:

$y_i = a_i*x_i + b_i , i \in <0,N)$

Solution:
Since we don’t know how big (or small) N is, we have to consider the case, when N is not a multiple of number of elements packed in SIMDVec (we will call this value simd stride). We therefore have to split N in two parts:

1. part executed using SIMDVec code (peel loop), and
2. part executed using scalar code (remainder loop).

For the actual code we will assume the SIMD stride to be 4 and packed precision type to be float.

#include <umesimd/UMESimd.h>
using namespace UME::SIMD;
//...
void linear_equation_evaluator(
float *y,
float *x,
float a,
float b,
int N)
{
SIMDVec<float, 4> x_vec, y_vec;
SIMDVec<float, 4> a_vec(a), b_vec(b);

int REMAINDER_COUNT=N%4;
int REMAINDER_OFFSET=N-REMAINDER_COUNT;

// Compute 'peel' part:
for(int i=0; i<N; i+=4)
{
y_vec=a_vec*x_vec+b_vec;
y_vec.store(&y[i]);
}

// Compute 'remainder' part:
for(int i=REMAINDER_OFFSET; i<N; i++)
{
y[i]=a*x[i]+b;
}
}
//...


The above code structure is worth memorizing. It will appear in UME::SIMD codes very often. What we do first is to reserve some registers for x and y values. Wa also prepare vectors containing constants a and b. Additional necessary information in form of remainder loop offset is calculated as REMAINDER_OFFSET.

We start computations by iterating data elements in chunks of 4 (the SIMD stride). This peel loop is the major source of our performance-gain: instead of processing one element at a time (as happens in remainder loop), we are processing up to SIMD-stride elements. In theory this should give us up to 4x speedup! (In practice this will rarely be the case. We will discuss this in one of future posts dedicated to portability and selection of proper stride.) This speedup is due to the fact that UME::SIMD library is internally using hardware based mechanisms for data level parallelism, such as specialized CPU instructions.

Finally we calculate remainder loop, which performs the calculations over the remainig elements. Mind that remainder loop in this case is just regular scalar code you would normally write for solving the problem presented in the example. The only difference is the initial value of the loop index i. For practical scenarios, be advised to first write the scalar version of your kernel, as you might still need it for remainder part of your vectorized kernel. Having the scalar version is also important for results verification.

The function presented in our solution can only be simplified if we know from the definition of the problem that N is multiple of 4. In such scenario, we could completely remove remainder loop from the function definition. Alternatively if we know that N is always less than 4, we could completely remove the peel part, as it will never be executed. Unfortunately for most scenarios this will mean some sort of code duplication. And code duplication is always bad: twice the bugs, twice the testing, quadruple the maintenance time… The next post will show you how this code duplication could be avoided.

# UME::SIMD Tutorial 0: Installation

The library is provided in a header-only form, which makes its’ installation trivial. We will present installation procedures for both Linux and Windows operating systems. Mind that as there is no perfectly portable build system, we only limit ourselves to most common tools used.

This installation procedure shows only steps required for UME::SIMD installation as a standalone library, but we make some small adjustments so that switching to full UME framework would be possible in the future.

# Linux installation procedure

The library in its’ primitive form doesn’t require any build system. The only configuration that has to be made is by passing proper compilation flags to the compiler. The specific flags used will depend on both: hardware, compiler and operating system. We will discuss specific configurations in future tutorials in this series.

This installation procedure should work for most of the Linux based systems and requires a GIT client (https://git-scm.com/).

1. Navigate to the directory where you want to store downloaded the current version of the library and create a directory for the framework:
$mkdir ume$ cd ume 
1. Clone the repository:
$git clone https://github.com/edanor/umesimd.git  1. You might want to checkout a tagged version to keep track of library updates: $ cd umesimd
$git checkout tags/v0.8.1$ cd ..

1. export the library directory
$pwd /home/ume$ export CPLUS_INCLUDE_PATH=\$CPLUS_INCLUDE_PATH:/home/ume

1. Include the library code into your C++ project:

Now you should be able to include the library into your program using following directive:

 #include <umesimd/UMESimd.h>

That’s it!

# Windows installation procedure

Windows installation procedure is similar, but requires additional steps in configuring the specific IDE. We will discuss this configuration for MS Visual Studio 2015 only, but similar rules apply for others.

1. Navigate to the directory where you want to store downloaded the current version of the library and create a directory for the framework:
c:\> mkdir ume
c:\> cd ume 
1. Clone the repository:
c:\ume> git clone https://github.com/edanor/umesimd.git

1. You might want to checkout a tagged version to keep track of library updates:

c:\ume> cd umesimd
c:\ume\umesimd> git checkout tags/v0.8.1
c:\ume> cd ..

1. Create a new project (or open an existing one):

[Ctrl+Shift+N] or File->New->Project…

1. Add new source file to the project:

1. Fill the main.cpp with:
#include <umesimd/UMESimd.h>

int main()
{
return 0;
}

1. Open project properties and configure path to the ‘ume’ directory created before

[Alt+Enter] or Project->Properties

There you go!