UME::SIMD Tutorial 9: gather/scatter operations

As you already know a SIMD registers can communicate with the scalar domain (or more precisely with the memory) using LOAD and STORE operations. The drawback of these operations is that they only allow you to move data elements, that are placed contiguously in the memory. However very often in our codes we have to deal with non-contiguous memory access. In this tutorial we will explain GATHER/SCATTER operations and how they generalize LOAD/STORE operations.

There are scenarios in which you might want to populate a register with data comming from non-consecutive memory locations. Few examples are:

  1. Accessing every second element of an array ( strided access
  2. Accessing element of an array at freshly computed offsets ( indexed access)
  3. Accessing elements in different order (shuffling access)

We will discuss first two situations in this post. The last class of problems needs more thorough discussion in the context of permutations, so we will discuss it in the next tutorial.

What then is a SCATTER operation? It is simply a reverse of a GATHER operation, which ‘scatters’ the content of a register over the memory. It is therefore similar to STORE operation, but capable of non-congiguous memory access.

Strided access

The memory access pattern is called strided when memory fields accessed are equally distant. This distance is called a stride (not to be mistaken with SIMD-stride!). A simple visualization of strided access:

gather

As you can see, the STRIDE-1 access is a special case of GATHER operation: a LOAD operation. Why then do we have separate LOAD and GATHER operations (and STORE and SCATTER) instead just simplifying things and using only GATHERs? There are two explanations. Firstly, there is the historical one: early processors only implemented LOAD instructions to move data between memory and scalar registers. Since in scalar domain you can access any element using a single scalar index, there was no need for more flexible operations.
Secondly, there is the performance aspect: apart from passing the base memory address, the gather instruction requires also information about how to calculate specific offsets to load from. No matter how the instruction is implemented within the processor, this is an additional degree of freedom, possibly meaning longer execution time, but for sure meaning additional circuitry.

PERFORMANCE HINT Use load/store operations whenever you can, and avoid gather/scatter if possible. In most scenarios this will mean that your data structures/algorithm have to be modified. Use gather/scatter only when you are certain, that re-designing structures is not possible.

When performing a strided access you need to know what are the base address, passed as a pointer to the beginning of your data, and the stride value, passed as a scalar integer. The stride is always passed as the number of elements, rarther than memory offset so that the programming can be simplified.

Following code example shows how to perform a strided memory accesses:

float a[LARGE_DATA_SIZE];
uint32_t STRIDE = 8;
...
for(int i = 0; i < PROBLEM_SIZE; i+=8) {
  SIMDVec<float, 8> vec;

  // Note that we have to scale the loop index.
  int offset = i*STRIDE;

  // 'load' the data to vec.
  vec.gather(&a[offset], STRIDE);
  // do something useful
  vec += 3.14;
  // store the result at original locations
  vec.scatter(&a[offset], STRIDE);
}

When using strided access, we have to pass the address of the first element. This is done by calculation offset variable in every iteration. The gather operation then uses this local base address together with a scalar stride to calculate offsets of the corresponding elements.
Once the necessary calculations are over, the updated results are stored back at original locations.

Indexed access

The indexed access is even more general-purpose than strided access. The primary difference is, that instead of passing a scalar stride parameter, you have to pass a SIMDVec of unsigned integer indices. The indexed access can be visualised using following graphic:

indexed_gather_scatter

The adventage of this mode is, that every element can be retrieved using a dedicated index. Disadvantage is that the indexing this way might completely destroy hardware based memory prefetching, and in turn negatively impact your performance. Also mind that all elements might be far away, meaning that more cachelines have to be moved to the lowest cache level.

An example code using indexed access:

float a[LARGE_DATA_SIZE];
int indices[PROBLEM_SIZE];
uint32_t STRIDE = 4;
...
for(int i = 0; i < PROBLEM_SIZE; i+=8) {
  SIMDVec<float, 8> vec;

  // Here we are using precomputed indices,
  // but they can be computed on-the-fly if necessary.
  SIMDVec<uint32_t, 8> indices_vec(&indices[i];

  // 'load' the data to vec.
  vec.gather(&a[0], indices_vec);
  // do something useful
  vec += 3.14;
  // store the result at original locations
  vec.scatter(&a[0], indices_vec);
}

The basic difference is that instead of passing a scalar stride we will use an unsigned integer vector of 32b indices.

NOTE: at the moment the library is using vector of unsigned integers having the same precision as the scalar elements of the gathered vector. This causes troubles when dealing with mixed precision and when small types (e.g. uint8_t) don’t have enough bits to represent a full range of indices. The library will be updated to always use vectors of uint32_t indices

Securing conditional access

One of the problems you might spot when writing your code is when you are trying to deal with conditional statements. Consider following scalar code:

float a[PROBLEM_SIZE], b[PROBLEM_SIZE];
float c[LARGE_DATASET_SIZE];
...
for(int i = 0; i < PROBLEM_SIZE; i++) {
 // Here we are checking if for some reason one of the 
 // values in (a[i],b[i]) pair is not determined properly.
 if (std::isfin(a[i] - b[i])) {
   // Calculate the index only if both 'a' and 'b' are well defined
   int index = int(a[i] - b[i]);
   // 'gather' single element at a time
   float temp = c[index]; 
   // Do something with the value
   temp += 3.14; 
   // Update the values of 'c'
   c[index] = temp;
 }
}

As you should already know (see UME::SIMD Tutorial 8: Conditional execution using masks) the vector code using masks will execute all operations within the ‘if’ branch, regardless of the fact whether ‘a’ and ‘b’ are well defined or not. It might be tempting to re-write the above code as:

float a[PROBLEM_SIZE], b[PROBLEM_SIZE];
float c[LARGE_DATASET_SIZE];
...
// For simplification we are assuming that: ((PROBLEM_SIZE % 8) == 0)
for(int i = 0; i < PROBLEM_SIZE; i+= 8) {
 // Here we are checking if for some reason (e.g. a design decision) one 
 // of the values in (a[i],b[i]) pair is not determined properly.

 SIMDVec<float, 8> a_vec(&a[i]), b_vec(&b[i]);
 SIMDVecMask<8> condition = (a_vec - b_vec).isfin();

// if (std::isfin(a[i] - b[i])) {
 SIMDVec<uint32_t, 8> indices = a_vec - b_vec;
 SIMDVec<float, 8> temp;

 temp.gather(&c[0], indices); // This is WRONG!!!
 temp.adda(condition, 3.14); // only change selected elements
 temp.scatter(&c[0], indices); // Again WRONG!!!
}

Why are the gather and scatter operations in this snippet wrong? The are both trying to access the memory, even if the indices are incorrect. But neither GATHER nor SCATTER happen to care about that. They have to trust us to provide them with correct indices, at least for the performance reasons. If the indices are incorrect, they will try to access memory that is possibly outside of the boundaries of ‘c’ dataset. This might result in a memory fault, and for that reason we have to secure both memory read and write using our conditional mask:

float a[PROBLEM_SIZE], b[PROBLEM_SIZE];
float c[LARGE_DATASET_SIZE];
...
// For simplification we are assuming that: ((PROBLEM_SIZE % 8) == 0)
for(int i = 0; i < PROBLEM_SIZE; i+= 8) {
 // Here we are checking if for some reason (e.g. by design?) one of the 
 // values in (a[i],b[i]) pair is not determined properly.

 SIMDVec<float, 8> a_vec(&a[i]), b_vec(&b[i]);
 SIMDVecMask<8> condition = (a_vec - b_vec).isfin();

 // if (std::isfin(a[i] - b[i])) {
 SIMDVec<uint32_t, 8> indices = a_vec - b_vec;
 SIMDVec<float, 8> temp;

 temp.gather(condition, &c[0], indices); // Now the read is CORRECT!!!
 temp += 3.14; // we don't have to mask this operation
 temp.scatter(condition, &c[0], indices); // Again no problems here.
}

So in the corrected version we have to simply pass an additional mask parameter to memory access operations, rather to arithmetic operations. The library will take care of securing the reads so that they access only allowed memory addresses.

Summary

In this tutorial we introduced concept of GATHER/SCATTER operations, and explained why they are a useful addition to our SIMD programming model. We looked at both strided and indexed memory access patterns, and explained how this concept generalizes LOAD/STORE operations.

While very usefull, GATHER/SCATTER operations are a double-edged sword that can both make our lives easier, and destroy our performance.

In the next tutorial we will talk about permutations. There we will discuss also the GATHER/SCATTER as one of possible strategies for re-arranging vector elements.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s