
UME::SIMD Tutorial 9: gather/scatter operations

As you already know, SIMD registers can communicate with the scalar domain (or, more precisely, with memory) using LOAD and STORE operations. The drawback of these operations is that they only allow you to move data elements that are placed contiguously in memory. However, very often our code has to deal with non-contiguous memory access. In this tutorial we will explain GATHER/SCATTER operations and how they generalize LOAD/STORE operations.

There are scenarios in which you might want to populate a register with data coming from non-consecutive memory locations. A few examples are:

  1. Accessing every second element of an array (strided access)
  2. Accessing elements of an array at freshly computed offsets (indexed access)
  3. Accessing elements in a different order (shuffling access)

We will discuss the first two situations in this post. The last class of problems needs a more thorough discussion in the context of permutations, so we will cover it in the next tutorial.

What then is a SCATTER operation? It is simply the reverse of a GATHER operation: it ‘scatters’ the content of a register over the memory. It is therefore similar to a STORE operation, but capable of non-contiguous memory access.
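
To make the semantics concrete, here is a plain scalar sketch (illustrative only, not UME::SIMD code) of what a 4-element GATHER and SCATTER do; the names base, indices and reg are hypothetical:

float base[16];                    // some larger dataset
int   indices[4] = {0, 5, 2, 11};  // arbitrary, possibly non-contiguous offsets
float reg[4];
...
// GATHER: read elements from scattered locations into the 'register'.
for(int i = 0; i < 4; i++) {
  reg[i] = base[indices[i]];
}
// ... operate on reg ...

// SCATTER: write the results back to the same scattered locations.
for(int i = 0; i < 4; i++) {
  base[indices[i]] = reg[i];
}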

Strided access

The memory access pattern is called strided when the memory fields accessed are equally distant. This distance is called the stride (not to be confused with the SIMD-stride!). A simple visualization of strided access:

[Figure: strided gather access]

As you can see, the STRIDE-1 access is a special case of a GATHER operation: a LOAD operation. Why then do we have separate LOAD and GATHER operations (and STORE and SCATTER) instead of just simplifying things and using only GATHERs? There are two explanations. Firstly, there is the historical one: early processors only implemented LOAD instructions to move data between memory and scalar registers. Since in the scalar domain you can access any element using a single scalar index, there was no need for more flexible operations.
Secondly, there is the performance aspect: apart from passing the base memory address, the gather instruction also requires information about how to calculate the specific offsets to load from. No matter how the instruction is implemented within the processor, this is an additional degree of freedom, possibly meaning longer execution time, and for sure meaning additional circuitry.

PERFORMANCE HINT: Use load/store operations whenever you can, and avoid gather/scatter if possible. In most scenarios this will mean that your data structures/algorithm have to be modified. Use gather/scatter only when you are certain that re-designing the structures is not possible.
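
As an illustration of such a redesign (a hypothetical example, not tied to any particular library API): switching from an array-of-structures layout to a structure-of-arrays layout turns a strided access over one field into a contiguous access, so a plain load/store is enough:

// Array-of-structures: the 'x' fields are three floats apart,
// so filling a register with them requires a strided gather.
struct ParticleAoS { float x, y, z; };
ParticleAoS particles_aos[PROBLEM_SIZE];

// Structure-of-arrays: the 'x' fields are contiguous,
// so a regular LOAD (and STORE) is sufficient.
struct ParticlesSoA {
  float x[PROBLEM_SIZE];
  float y[PROBLEM_SIZE];
  float z[PROBLEM_SIZE];
};
ParticlesSoA particles_soa;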

When performing a strided access you need to know the base address, passed as a pointer to the beginning of your data, and the stride value, passed as a scalar integer. The stride is always expressed as a number of elements, rather than as a memory offset in bytes, so that the programming is simplified.

The following code example shows how to perform strided memory accesses:

float a[LARGE_DATA_SIZE];
uint32_t STRIDE = 8;
...
for(int i = 0; i < PROBLEM_SIZE; i+=8) {
  SIMDVec<float, 8> vec;

  // Note that we have to scale the loop index.
  int offset = i*STRIDE;

  // 'load' the data to vec.
  vec.gather(&a[offset], STRIDE);
  // do something useful
  vec += 3.14;
  // store the result at original locations
  vec.scatter(&a[offset], STRIDE);
}

When using strided access, we have to pass the address of the first element to be accessed. This is done by calculating the offset variable in every iteration; for example, with STRIDE = 8 the first iteration gathers a[0], a[8], …, a[56], and the next iteration starts at offset 64. The gather operation then uses this local base address together with the scalar stride to calculate the offsets of the remaining elements.
Once the necessary calculations are done, the updated results are stored back at the original locations.

Indexed access

Indexed access is even more general-purpose than strided access. The primary difference is that instead of passing a scalar stride parameter, you have to pass a SIMDVec of unsigned integer indices. Indexed access can be visualised using the following graphic:

[Figure: indexed gather/scatter access]

The advantage of this mode is that every element can be retrieved using a dedicated index. The disadvantage is that indexing this way might completely defeat hardware-based memory prefetching, and in turn negatively impact your performance. Also mind that the accessed elements might be far apart, meaning that more cache lines have to be moved to the lowest cache level.

Example code using indexed access:

float a[LARGE_DATA_SIZE];
uint32_t indices[PROBLEM_SIZE];
...
for(int i = 0; i < PROBLEM_SIZE; i+=8) {
  SIMDVec<float, 8> vec;

  // Here we are using precomputed indices,
  // but they can be computed on-the-fly if necessary.
  SIMDVec<uint32_t, 8> indices_vec(&indices[i]);

  // 'load' the data to vec.
  vec.gather(&a[0], indices_vec);
  // do something useful
  vec += 3.14;
  // store the result at original locations
  vec.scatter(&a[0], indices_vec);
}

The basic difference is that instead of passing a scalar stride, we use a vector of unsigned 32-bit integer indices.

NOTE: at the moment the library uses a vector of unsigned integers with the same precision as the scalar elements of the gathered vector. This causes trouble when dealing with mixed precision, and when small types (e.g. uint8_t) don’t have enough bits to represent the full range of indices. The library will be updated to always use vectors of uint32_t indices.

Securing conditional access

One of the problems you might spot when writing your code appears when you try to deal with conditional statements. Consider the following scalar code:

float a[PROBLEM_SIZE], b[PROBLEM_SIZE];
float c[LARGE_DATASET_SIZE];
...
for(int i = 0; i < PROBLEM_SIZE; i++) {
 // Here we are checking if for some reason one of the 
 // values in (a[i],b[i]) pair is not determined properly.
 if (std::isfinite(a[i] - b[i])) {
   // Calculate the index only if both 'a' and 'b' are well defined
   int index = int(a[i] - b[i]);
   // 'gather' single element at a time
   float temp = c[index]; 
   // Do something with the value
   temp += 3.14; 
   // Update the values of 'c'
   c[index] = temp;
 }
}

As you should already know (see UME::SIMD Tutorial 8: Conditional execution using masks), vector code using masks will execute all operations within the ‘if’ branch, regardless of whether ‘a’ and ‘b’ are well defined or not. It might be tempting to re-write the above code as:

float a[PROBLEM_SIZE], b[PROBLEM_SIZE];
float c[LARGE_DATASET_SIZE];
...
// For simplification we are assuming that: ((PROBLEM_SIZE % 8) == 0)
for(int i = 0; i < PROBLEM_SIZE; i+= 8) {
 // Here we are checking if for some reason (e.g. a design decision) one 
 // of the values in (a[i],b[i]) pair is not determined properly.

 SIMDVec<float, 8> a_vec(&a[i]), b_vec(&b[i]);
 SIMDVecMask<8> condition = (a_vec - b_vec).isfin();

 // if (std::isfinite(a[i] - b[i])) {
 SIMDVec<uint32_t, 8> indices = a_vec - b_vec;
 SIMDVec<float, 8> temp;

 temp.gather(&c[0], indices); // This is WRONG!!!
 temp.adda(condition, 3.14); // only change selected elements
 temp.scatter(&c[0], indices); // Again WRONG!!!
}

Why are the gather and scatter operations in this snippet wrong? They both try to access memory even if the indices are incorrect. Neither GATHER nor SCATTER happens to care about that: they have to trust us to provide correct indices, if only for performance reasons. If the indices are incorrect, they will try to access memory that is possibly outside the boundaries of the ‘c’ dataset. This might result in a memory fault, and for that reason we have to secure both the memory read and the memory write using our conditional mask:

float a[PROBLEM_SIZE], b[PROBLEM_SIZE];
float c[LARGE_DATASET_SIZE];
...
// For simplification we are assuming that: ((PROBLEM_SIZE % 8) == 0)
for(int i = 0; i < PROBLEM_SIZE; i+= 8) {
 // Here we are checking if for some reason (e.g. by design?) one of the 
 // values in (a[i],b[i]) pair is not determined properly.

 SIMDVec<float, 8> a_vec(&a[i]), b_vec(&b[i]);
 SIMDVecMask<8> condition = (a_vec - b_vec).isfin();

 // if (std::isfinite(a[i] - b[i])) {
 SIMDVec<uint32_t, 8> indices = a_vec - b_vec;
 SIMDVec<float, 8> temp;

 temp.gather(condition, &c[0], indices); // Now the read is CORRECT!!!
 temp += 3.14; // we don't have to mask this operation
 temp.scatter(condition, &c[0], indices); // Again no problems here.
}

So in the corrected version we simply have to pass an additional mask parameter to the memory access operations, rather than to the arithmetic operations. The library will take care of securing the reads and writes so that they only touch allowed memory addresses.

Summary

In this tutorial we introduced the concept of GATHER/SCATTER operations, and explained why they are a useful addition to our SIMD programming model. We looked at both strided and indexed memory access patterns, and explained how this concept generalizes LOAD/STORE operations.

While very useful, GATHER/SCATTER operations are a double-edged sword that can both make our lives easier and destroy our performance.

In the next tutorial we will talk about permutations. There we will also discuss GATHER/SCATTER as one of the possible strategies for re-arranging vector elements.

UME::SIMD Tutorial 8: Conditional execution using masks

The first real difference between scalar code and SIMD code relates to conditional execution of calculations. C++ is, by design, a language that heavily supports scalar operations. Apart from arithmetic statements, the language also offers control flow statements, such as: if-then-else, switch and goto. All of these can be mixed with UME::SIMD code, but there are a few restrictions imposed by the language. We will only discuss the ‘if-else’ block here, but the same considerations apply to other conditional statements.

Scalar conditional statements

A simple scalar conditional statement would look like this:

int x;
int a = 2;
int b = 3;
int c;
...

if(x) {
    c = a + b;
}
else {
    c = a - b;
}

Now, depending on the value of x, the variable c will take either the value ‘5’ or ‘-1’. To understand which of the two branches (the if branch or the else branch) will be selected, we have to look at the predicate expression in the if clause:

if(<predicate expression>)

According to C++ rules, the type of the predicate expression needs to be converted into a boolean. That is, if the expression uses only bool variables, then the result will be pretty straightforward to predict. A different case is when we use expressions that evaluate to integers (in our case x is an expression that evaluates to an integer!). In such a case we need to know how C++ converts this integer expression to a boolean expression. The rule for doing so is called the implicit conversion rule, and is defined within the language standard. The rule goes like this:

Zero value, null pointer value, or null member pointer value is converted to false; any other value is converted to true.

In our particular case the same predicate expression will therefore be interpreted as a boolean predicate (or a boolean expression):

   
bool predicateValue = (x != 0);
if(predicateValue)

Understanding this relation is critical in understanding why we can’t use vector types directly within if() or while() statements.

Vector conditional statements

Let’s visualise what we will be talking about in this part using a piece of code:

 
UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...

if(x) { // Error: What does this mean?!
    c = a + b;
}
else {
    c = a - b;
}

In the above code we run into a very interesting problem: how do we convert the vector x into a boolean value? Unfortunately, since C++ is not aware of the vector types, it cannot make any assumptions about implicit conversions to a boolean expression. What is more, you might want this piece of code to express different operations. Do you want to execute each block for the specific vector elements that fulfill the predicate:

 
UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...

for(int i = 0; i < 2; i++)
{
  if(x[i] != 0) {
    c[i] = a[i] + b[i];
  }
  else {
    c[i] = a[i] - b[i];
  }
}
 

or perhaps you want to execute the whole block only if all values in the vector fulfill a certain condition:

UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...

bool predicate = (x[0] != 0) && (x[1] != 0);
if(predicate) {
    c = a + b;
}
else {
    c = a - b;
}

or maybe you want to execute the blocks if any of the elements meets the predicate criterion:

UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...

bool predicate = (x[0] != 0) || (x[1] != 0);
if(predicate) {
    c = a + b;
}
else {
    c = a - b;
}

As you can see, the original language syntax is too ambiguous to let you simply reuse it. At the same time, a library such as UME::SIMD does not have the capabilities of a compiler, which would allow it to look forward into the code and make a decision based on such analysis. We therefore need a more straightforward, yet unambiguous, mechanism to express our intent.

Mask types

UME::SIMD uses the concept of data flow, in addition to the already well established control flow, to steer the way actual numerical values are passed through the program. The data flow concept doesn’t require any conditional statements. It instead uses the concept of an execution mask. A mask is a vector consisting of only boolean values. A ‘true’ value means that the specific vector lane should execute the operation, while a ‘false’ value means that the operation should be blocked, that is, the result should not be written to the output.
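
To illustrate the idea with a plain scalar sketch (illustrative only, not library code; the array names are hypothetical), a masked addition behaves per lane as follows:

float a[4], b[4], c[4];
bool mask[4];
...
for(int i = 0; i < 4; i++) {
  // The sum is evaluated for every lane, but the output is
  // only updated where the mask holds 'true'.
  float result = a[i] + b[i];
  if(mask[i]) c[i] = result;
}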

The library exposes a category of types named SIMDVecMask for data-flow control. We will now go through the basic operations that are allowed with these types.

Mask initialization using scalars

A basic declaration of a mask looks similar to a vector declaration:


// Broadcast single scalar value
UME::SIMD::SIMDVecMask<4> mask0(true);

// Initialize using scalars
UME::SIMD::SIMDVecMask<4> mask1(true, false, false, true);
...
bool maskValues[4] = {true, false, false, true};
// Initialize using an array
UME::SIMD::SIMDVecMask<4> mask2(&maskValues[0]);

Mask initialization using predicate operations

A mask can also be obtained by applying one of the predicate operations to arithmetic vectors:


UME::SIMD::SIMDVec<float, 8> vec1, vec2;

// 'CoMPare EQual'
UME::SIMD::SIMDVecMask<8> m0=vec1.cmpeq(vec2); // or 'm0=vec1==vec2'
// 'CoMPare Not EQual'
UME::SIMD::SIMDVecMask<8> m1=vec1.cmpneq(vec2); // or 'm1=vec1!=vec2'
// 'CoMPare Greater Than'
UME::SIMD::SIMDVecMask<8> m2=vec1.cmpgt(vec2); // or 'm2=vec1>vec2'
// 'CoMPare Greater than or Equal'
UME::SIMD::SIMDVecMask<8> m3=vec1.cmpge(vec2); // or 'm3=vec1>=vec2'
// 'CoMPare Less Than'
UME::SIMD::SIMDVecMask<8> m4=vec1.cmplt(vec2); // or 'm4=vec1<vec2'
// 'CoMPare Less than or Equal'
UME::SIMD::SIMDVecMask<8> m5=vec1.cmple(vec2); // or 'm5=vec1<=vec2'

For each predicate, a specific comparison operation is applied elementwise to the elements of the vectors vec1 and vec2.
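
For instance (a small sketch, using per-element constructors in the same style as the earlier examples), comparing two concrete vectors yields a mask with one boolean per element pair:

UME::SIMD::SIMDVec<float, 4> vec1(1.0f, 2.0f, 3.0f, 4.0f);
UME::SIMD::SIMDVec<float, 4> vec2(4.0f, 3.0f, 2.0f, 1.0f);

// Elementwise 'greater than': (1>4, 2>3, 3>2, 4>1)
UME::SIMD::SIMDVecMask<4> m = vec1.cmpgt(vec2); // (false, false, true, true)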

Mask arithmetic

Masks can also be operated on, but the set of operations differs slightly from that available for arithmetic vectors. This is due to the fact that, for portability reasons, it is not possible to mix arithmetic vectors and masks, except in a few special situations. The code below shows most of the operations allowed on masks, and their interaction with arithmetic types:


// Some vertical operations:
UME::SIMD::SIMDVecMask<8> m0, m1, m2;
....
m0 = m1.land(m2);
m0 = m1 && m2;    // the same as above
m0 = m1.lor(m2);
m0 = m1 || m2;    // the same as above
m0 = m1.lxor(m2);
m0 = m1 ^ m2;     // the same as above
m0 = m1.landnot(m2); // equivalent to: !m1 && m2

// Also some horizontal (reduction) operations
bool b0;
...
b0 = m1.hland(); // m1[0] && m1[1] && ... && m1[7]
b0 = m1.hlor();  // m1[0] || m1[1] || ... || m1[7]

Using masks

Once the mask is calculated, it can be passed as an optional parameter to most of the arithmetic operations:


UME::SIMD::SIMDVec<float, 4> vec0, vec1, vec2, vec3;
UME::SIMD::SIMDVecMask<4> m0;
...
m0 = vec0 > vec1;
// Add only pairs where m0 is true
vec2 = vec0.add(m0, vec1);
// The same expressed using operator syntax
vec2[m0] = vec0 + vec1;

// 'Blend' two vectors using masking.
// 'vec2[i]' becomes 'vec1[i]' where 'm0[i]'
// is true, and 'vec0[i]' where false.
vec2 = vec0.blend(m0, vec1);

Now let’s look at the specific examples we discussed before. Scenario 1 would look like this in SIMD arithmetic:


UME::SIMD::SIMDVec<int,2> x(0, 10);
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
UME::SIMD::SIMDVec<int,2> t0, t1;
UME::SIMD::SIMDVecMask<2> m0;
...

m0 = x!=0;                // (false, true)
t0 = a+b;                 // (5, 2) 
t1 = a-b;                 // (-1, 8)
c = t1.blend(m0, t0);     // (-1, 2)

// Or briefly:
c=(a-b).blend(x!=0, a+b); // (-1, 2)

PERFORMANCE PITFALL: when executing scalar code, only one branch of code gets executed for every element of a vector. The decision about which branch to execute is made on a per-iteration basis. This allows the microprocessor to completely skip certain parts of the code.
In the SIMD world, an arithmetic operation happens atomically for all elements of the vector, and it is not possible to ‘skip’ the execution. As a result, we have to execute both the ‘if’ and ‘else’ statements in every iteration. For long ‘if-else’ blocks this might mean executing a large number of additional operations (compared to scalar code), and might therefore decrease the expected vectorization gains.

The second example is the execution of the ‘if all true’ variant. Mind that in this case we can still use the same ‘if-else’ statement:


UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...
// The original code was using following formula:
// bool predicate = (x[0] != 0) && (x[1] != 0)

// We first declare the mask, and perform a proper
// comparison for values of 'x':
UME::SIMD::SIMDVecMask<2> m0 = x.cmpneq(0);
// Now we have to perform a 'Horizontal Logical AND'
// (reduction) operation:
bool predicate = m0.hland();

// At this moment we already know that
// we have to execute either 'if' or 'else' block
// for all elements of a vector.
if(predicate) {
    c = a + b;
}
else {
    c = a - b;
}

The same format can be used for Scenario #3, but using the ‘Horizontal Logical OR’ operation. In the general case it is up to you to decide which construct is more useful in a given situation.
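
For reference, a sketch of the Scenario #3 variant (any element non-zero), in the same style as the previous snippet; only the reduction changes:

UME::SIMD::SIMDVec<int,2> x;
UME::SIMD::SIMDVec<int,2> a(2, 5);
UME::SIMD::SIMDVec<int,2> b(3, -3);
UME::SIMD::SIMDVec<int,2> c;
...
UME::SIMD::SIMDVecMask<2> m0 = x.cmpneq(0);
// 'Horizontal Logical OR' (reduction): true if ANY element is non-zero.
bool predicate = m0.hlor();

if(predicate) {
    c = a + b;
}
else {
    c = a - b;
}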

PERFORMANCE PITFALL: Using reduction operations is expensive in its own right, so it should be avoided where possible. While we might be able to decrease the number of instructions executed in scenarios #2 and #3, the cost of the reduction together with the cost of handling the ‘if-else’ statement might actually be higher than executing the equivalent scalar code.

Summary

This tutorial discussed the basic usage of masks. While control flow still remains an important part of UME::SIMD based programs, the SIMD programming model also requires data flow to open performance opportunities to user programs.
Masks can be treated as vectors of boolean values, obtained either by explicit initialization or by calling a specific comparison operation on arithmetic vectors. They can then be used to block the flow of data within a UME::SIMD program.