
UME::SIMD Tutorial 7: Dynamic alignment

In the previous post we discussed how to control the alignment of statically declared variables and struct/class members. In this post we will look at ways of performing dynamic allocations with alignment control.

The C++ way

Unfortunately, so far there is no convenient, standard way to perform aligned allocations. Current C++ requires the user to overallocate and then use only the data starting at an aligned offset. This can be done using the std::align function.

Example code:

#include <iostream>
#include <iomanip>
#include <memory>
#include <cstdint>
#include <cstdlib>

void print_address(uint64_t addr) {
  std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

int alignment(uint64_t addr) {
 uint64_t result = 1;
 for(uint64_t i = 2; i <= 4096; i*=2) {
  if(addr % i != 0) break;
  result = i;
 }
 return result;
}

int main()
{
  const int ALIGNMENT = 64; // align to 64B (512b) boundary
  int ARRAY_LENGTH = 5;     // size in number of array fields
  void *original_ptr;       // this has to be 'void*' so that we could use it with std::align
  void *aligned_ptr;
  
  std::size_t SIZE = ARRAY_LENGTH*sizeof(float);
  std::size_t OVERALLOCATED_SIZE    // overallocated size in bytes 
    = (SIZE + ALIGNMENT - sizeof(float));
  
  original_ptr = malloc(OVERALLOCATED_SIZE);
  
  std::cout << "[original_ptr]" 
  << "\naddress: ";
  print_address((uint64_t)original_ptr);
  std::cout << "\nvariable alignment: " << alignment((uint64_t)original_ptr)
  << "\nrequired type alignment: " << alignof(decltype(*(float*)original_ptr))
  << "\nrequired ptr alignment: " << alignof(decltype(original_ptr)) << "\n";

  // We would like to retain the 'original_ptr' 
  // to be able to free the buffer in the future
  aligned_ptr = original_ptr;
  aligned_ptr = std::align(
                    ALIGNMENT,
                    sizeof(float),
                    aligned_ptr,
                    OVERALLOCATED_SIZE);  

  if(aligned_ptr == nullptr) {
     std::cout << "std::align failed: no sufficient space in the buffer.\n";
  }

  std::cout << "[aligned_ptr]" 
  << "\naddress: ";
  print_address((uint64_t)aligned_ptr);
  std::cout << "\nvariable alignment: " << alignment((uint64_t)aligned_ptr)
  << "\nrequired type alignment: " << alignof(decltype(*(float*)aligned_ptr))
  << "\nrequired ptr alignment: " << alignof(decltype(aligned_ptr)) << "\n";
  
  free(original_ptr);
}

Example output:

[original_ptr]
address: 0x1109c20
variable alignment: 32
required type alignment: 4
required ptr alignment: 8
[aligned_ptr]
address: 0x1109c40
variable alignment: 64
required type alignment: 4
required ptr alignment: 8

What the code does is:

  1. calculates the overallocation size (OVERALLOCATED_SIZE), taking into account additional padding,
  2. allocates an oversized buffer using a standard allocation method,
  3. creates a new pointer pointing at a properly aligned location within the buffer,
  4. uses the allocated data to do something (possibly useful),
  5. frees the data using the original pointer.

This method is pretty troublesome. First of all, we need to manually calculate the size of the new buffer. It is not difficult, but it can be a source of errors, and it becomes tedious when aligned allocations are frequent in the code. The allocated size can be calculated from the following formula:

  std::size_t OVERALLOCATED_SIZE = (SIZE + ALIGNMENT - sizeof(float));

A small diagram to show why this formula works:
[Figure: overallocation]

In the worst case scenario malloc could return an address pointing just after a 64B aligned address. This means that the first element would have to be shifted 60B to the right (ALIGNMENT - sizeof(float)). The last element would also be shifted, so the span between the original location of the first element and the new location of the last element would be 60 + 5*sizeof(float), that is 80 bytes. This is exactly how much space we need to allocate.
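
The same arithmetic written out for this specific example (SIZE = 5*sizeof(float) = 20B, ALIGNMENT = 64B):

  OVERALLOCATED_SIZE = SIZE + ALIGNMENT - sizeof(float)
                     = 20 + 64 - 4
                     = 80 bytes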

Secondly, we have to retain the knowledge about the original location of the memory buffer. This requires us to keep track of two pointers: one for the original buffer, and one for the aligned location. Even then it is easy to fall into a PITFALL: do not pass the original pointer to the std::align function, as it might be modified by the function! We can avoid this by copying the original pointer into the new pointer and passing that instead:

  aligned_ptr = original_ptr;
  aligned_ptr = std::align(
                    ALIGNMENT,
                    sizeof(float),
                    aligned_ptr,
                    OVERALLOCATED_SIZE);
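
To reduce the risk of these mistakes, the whole pattern can be wrapped in a small helper. Below is a minimal, illustrative sketch (the AlignedBuffer name and its members are made up for this example and are not part of any library):

#include <cstdlib>
#include <memory>

struct AlignedBuffer {
  void *original = nullptr; // pointer returned by malloc, required by free()
  void *aligned = nullptr;  // first address inside the buffer that meets the alignment

  AlignedBuffer(std::size_t size, std::size_t alignment) {
    std::size_t space = size + alignment - 1; // conservative overallocation
    original = std::malloc(space);
    void *p = original;
    // std::align adjusts 'p' and 'space' in place and returns the aligned
    // pointer, or nullptr if the buffer is too small
    aligned = (p != nullptr) ? std::align(alignment, size, p, space) : nullptr;
  }
  ~AlignedBuffer() { std::free(original); }
};

The destructor always frees through the original pointer, which is exactly the detail that is easiest to forget when the two pointers are managed by hand.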

Dynamic allocation using UME

UME offers a set of portable wrappers for aligned allocation. You can access these using the UME::DynamicMemory::AlignedMalloc() and UME::DynamicMemory::AlignedFree() functions. The behaviour is almost identical to a standard malloc call, with the difference that an additional ALIGNMENT parameter has to be passed. Here’s a code example:

#include <iostream>
#include <iomanip>
#include <cstdint>

#include <umesimd/UMESimd.h>

void print_address(uint64_t addr) {
  std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

int alignment(uint64_t addr) {
 uint64_t result = 1;
 for(uint64_t i = 2; i <= 4096; i*=2) {
  if(addr % i != 0) break;
  result = i;
 }
 return result;
}

int main()
{
  const int ALIGNMENT = 128; // align to 128B boundary
  int ARRAY_LENGTH = 5;     // size in number of array fields   
  std::size_t SIZE = ARRAY_LENGTH*sizeof(float);

  float *ptr = (float*) UME::DynamicMemory::AlignedMalloc(SIZE, ALIGNMENT);

  std::cout << "[ptr]" 
  << "\naddress: ";
  print_address((uint64_t)ptr);
  std::cout << "\nvariable alignment: " << alignment((uint64_t)ptr)
  << "\nrequired type alignment: " << alignof(decltype(*ptr))
  << "\nrequired ptr alignment: " << alignof(decltype(ptr)) << "\n";

  UME::DynamicMemory::AlignedFree(ptr);
}

Nothing more, nothing less. No additional calculations required, no additional pointers to track, simple syntax.

UME::SIMD Tutorial 6: Static alignment

As explained briefly in the previous tutorial, data alignment can be an important factor in performance sensitive code. In this tutorial we will go through the techniques that can be used for static alignment control. Static alignment is the alignment of variables, including class members, that can be deduced at compile time.

Variable alignment

Static alignment can be controlled in C++ using the alignas specifier. Let’s look at an example containing basic scalar variables only:

#include <iostream>
#include <iomanip>
#include <cstdint>

void print_address(uint64_t addr) {
  std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

int alignment(uint64_t addr) {
 uint64_t result = 1;
 for(uint64_t i = 2; i < 4096; i*=2) {
  if(addr % i != 0) break;
  result = i;
 }
 return result;
}

int main()
{
  uint8_t a;
  uint32_t b;
  alignas(16) uint32_t c;
  alignas(4) uint8_t d;

  std::cout << "[a]"
    << "\naddress: ";
    print_address((uint64_t)&a);
    std::cout << "\nvariable alignment: " << alignment((uint64_t)&a)
    << "\nrequired type alignment: " << alignof(decltype(a)) << "\n";
  std::cout << "\n[b]"
    << "\naddress: ";
    print_address((uint64_t)&b);
    std::cout << "\nvariable alignment: " << alignment((uint64_t)&b)
    << "\nrequired type alignment: " << alignof(decltype(b)) << "\n";
  std::cout << "\n[c]"
    << "\naddress: ";
    print_address((uint64_t)&c);
    std::cout << "\nvariable alignment: " << alignment((uint64_t)&c)
    << "\nrequired type alignment: " << alignof(decltype(c)) << "\n";

  std::cout << "\n[d]"
    << "\naddress: ";
    print_address((uint64_t)&d);
    std::cout << "\nvariable alignment: " << alignment((uint64_t)&d)
    << "\nrequired type alignment: " << alignof(decltype(d)) << "\n";
}

Running the above code will give an output looking like this:

[a]
address: 0x7ffdf0452c04
variable alignment: 4
required type alignment: 1
[b]
address: 0x7ffdf0452c00
variable alignment: 1024
required type alignment: 4
[c]
address: 0x7ffdf0452c10
variable alignment: 16
required type alignment: 4
[d]
address: 0x7ffdf0452c08
variable alignment: 8
required type alignment: 1

I recommend re-running the example a couple of times to observe how the specific alignments change.

The first variable a is required to be aligned to a 1B boundary, as its natural alignment is 1B (sizeof(uint8_t)). The actual alignment is 4B, but this is OK, as it respects the 1B requirement (as well as 2B). In other runs it can be aligned to an arbitrary allowed boundary.

The second variable b also respects its natural alignment, which in this case is 4B (sizeof(uint32_t)). The alignment of this variable should never be smaller than 4B.

For variables c and d we artificially force the alignment to be 16B and 4B respectively. In the first case the actual alignment ends up exactly as requested. For d we end up with a stronger alignment but, as previously, that is also acceptable.

Array alignment

We can also use the alignas specifier for statically declared arrays.

#include <iostream>
#include <iomanip>
#include <cstdint>

void print_address(uint64_t addr) {
  std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

int alignment(uint64_t addr) {
 uint64_t result = 1;
 for(uint64_t i = 2; i < 4096; i*=2) {
  if(addr % i != 0) break;
  result = i;
 }
 return result;
}

int main()
{
  uint8_t a[10];
  uint32_t b[20];
  alignas(16) uint32_t c[7];
  alignas(4) uint8_t d[13];

  std::cout << "[a]"
    << "\naddress: ";
    print_address((uint64_t)&a);
    std::cout << "\narray alignment: " << alignment((uint64_t)&a)
    << "\narray size: " << sizeof(a)
    << "\nrequired type alignment: " << alignof(decltype(a));
  std::cout << "\n[b]"
    << "\naddress: ";
    print_address((uint64_t)&b);
    std::cout << "\narray alignment: " << alignment((uint64_t)&b)
    << "\narray size: " << sizeof(b)
    << "\nrequired type alignment: " << alignof(decltype(b)) << "\n";
  std::cout << "\n[c]"
    << "\naddress: ";
    print_address((uint64_t)&c);
    std::cout << "\narray alignment: " << alignment((uint64_t)&c)
    << "\narray size: " << sizeof(c)
    << "\nrequired type alignment: " << alignof(decltype(c)) << "\n";
  std::cout << "\n[d]"
    << "\naddress: ";
    print_address((uint64_t)&d);
    std::cout << "\narray alignment: " << alignment((uint64_t)&d)
    << "\narray size: " << sizeof(d)
    << "\nrequired type alignment: " << alignof(decltype(d)) << "\n";
}

Example output:

[a]
address: 0x7ffcfddbfc79
array alignment: 1
array size: 10
required type alignment: 1
[b]
address: 0x7ffcfddbfc00
array alignment: 1024
array size: 80
required type alignment: 4
[c]
address: 0x7ffcfddbfc50
array alignment: 16
array size: 28
required type alignment: 4
[d]
address: 0x7ffcfddbfc6c
array alignment: 4
array size: 13
required type alignment: 1

As defined in C++, the minimal alignment requirement for an array is the natural alignment of its elements. So the required alignment is 1B for a (an array of 8-bit integers) and 4B for b (an array of 32-bit integers). Forcing the alignment of an array only forces the alignment of the first element. In this particular case, the alignment is forced on arrays c and d.

PITFALL: Mind that the rest of the elements are guaranteed to be aligned only to their natural alignment, since C++ requires the contiguous packing of plain array elements. In other words: alignment of c[0] will be at least 16, but the alignment of c[1] might be only 4.
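
If every element of an array needs to respect a stronger alignment, the alignment has to be attached to the element type itself rather than to the array declaration. A minimal sketch (the AlignedElem name is made up for this example):

#include <cstdint>

// Every element of an array of AlignedElem is 16B aligned, because the
// alignment (and hence the padded size) belongs to the type itself.
struct alignas(16) AlignedElem {
  uint32_t value; // 4B of payload, 12B of padding per element
};

int main()
{
  alignas(16) uint32_t c[7]; // only c[0] is guaranteed to be 16B aligned
  AlignedElem e[7];          // every e[i] is 16B aligned, at the cost of padding
  (void)c; (void)e;
}

The price is the padding added to every element, which is the same effect the member alignment section below discusses for structures.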

Member alignment

We will only discuss structures, but the general conclusions also apply to classes.

Again, here is the code you can use to tinker a little:

#include <iostream>
#include <iomanip>
#include <cstdint>

void print_address(uint64_t addr) {
  std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

void print_raw(uint8_t *ptr, int length) {
  for(int i = 0; i < length; i++) {
    std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << (uint32_t)ptr[i] << " " ;
  }
  std::cout << std::dec << std::endl;
}

int alignment(uint64_t addr) {
 uint64_t result = 1;
 for(uint64_t i = 2; i < 4096; i*=2) {
  if(addr % i != 0) break;
  result = i;
 }
 return result;
}

struct S_unaligned{
 uint8_t a;
 uint8_t b;

 // We initialize the data structure with some data
 // so that the output is more readable
 S_unaligned() : a(3), b(7) {}
};

struct S_aligned_a{
 alignas(16) uint8_t a;
 uint8_t b;

 S_aligned_a(): a(3), b(7) {}
};

struct S_aligned_b{
 uint8_t a;
 alignas(16) uint8_t b;

 S_aligned_b(): a(3), b(7) {}
};

int main()
{
  S_unaligned x0;

  std::cout << "[S_unaligned] size: " << sizeof(x0)
    << "\naddress: " << &x0
    << "\nrequired type alignment: " << alignof(S_unaligned)
    << "\nobject alignment: " << alignment((uint64_t)&x0)
    << "\ndata:";
  print_raw((uint8_t*) &x0, sizeof(x0));

  alignas(128) S_unaligned x0_a;
  std::cout << "\n[aligned S_unaligned] size: " << sizeof(x0_a)
    << "\naddress: " << &x0_a
    << "\nrequired type alignment: " << alignof(S_unaligned)
    << "\nobject alignment: " << alignment((uint64_t)&x0_a)
    << "\ndata:";
  print_raw((uint8_t*) &x0_a, sizeof(x0_a));

  S_aligned_a x1;
  std::cout << "\n[S_aligned_a] size: " << sizeof(x1)
    << "\naddress: " << &x1
    << "\nrequire dtype alignment: " << alignof(S_aligned_a)
    << "\nobject alignment: " << alignment((uint64_t)&x1)
    << "\ndata:";
  print_raw((uint8_t*) &x1, sizeof(x1));

  alignas(128) S_aligned_a x1_a;
  std::cout << "\n[aligned S_aligned_a] size: " << sizeof(x1_a)
    << "\naddress: " << &x1_a
    << "\ntype alignment: " << alignof(S_aligned_a)
    << "\nobject alignment: " << alignment((uint64_t)&x1_a)
    << "\ndata:";
  print_raw((uint8_t*) &x1_a, sizeof(x1_a));  

  S_aligned_b x2;
  std::cout << "\n[S_aligned_b] size: " << sizeof(x2)
    << "\naddress: " << &x2
    << "\nrequire type alignment: " << alignof(S_aligned_b)
    << "\nobject alignment: " << alignment((uint64_t)&x2)
    << "\ndata:";
  print_raw((uint8_t*) &x2, sizeof(x2));

  alignas(128) S_aligned_b x2_a;
  std::cout << "\n[aligned S_aligned_b] size: " << sizeof(x2_a)
    << "\naddress: " << &x2_a
    << "\nrequire type alignment: " << alignof(S_aligned_b)
    << "\nobject alignment: " << alignment((uint64_t)&x2_a)
    << "\n data:";
  print_raw((uint8_t*) &x2_a, sizeof(x2_a));
}

We will now discuss a few selected structures.

Structure #1: No members aligned

For the S_unaligned object x0 the output is:

[S_unaligned] size: 2
address: 0x7fff2c36ca78
required type alignment: 1
object alignment: 8
data:0x03 0x07

A few things to point out here. First of all, the size of the S_unaligned structure is 2 bytes: two 1B fields packed one after another. The alignment requirement of a structure is the strictest alignment requirement among its members, which here is the 1B natural alignment of uint8_t. And indeed this is what we get, as indicated by:

required type alignment: 1

While the structure itself only has to be aligned to a 1B boundary, in this particular case it is aligned to an 8B boundary:

object alignment: 8

Again, this is OK, as a higher alignment also respects lower alignments, i.e. an 8B aligned structure is also 4B, 2B and 1B aligned.

When we align an object of this structure, we obtain something like:

[aligned S_unaligned] size: 2
address: 0x7ffe25de5080
required type alignment: 1
object alignment: 128
data:0x03 0x07

Nothing changed, except for the minimal value the object alignment can take.

Structure #2: only the first member aligned

Now let’s modify the structure to have the a member aligned to some boundary. In this particular example, we are aligning the first field to a 16B boundary.

  struct S_aligned_a{
   alignas(16) uint8_t a;
   uint8_t b;

   S_aligned_a(): a(3), b(7) {}
  };

The output for a specific execution is:

[S_aligned_a] size: 16
address: 0x7ffc5b06a0f0
required type alignment: 16
object alignment: 16
data:0x03 0x07 0xea 0xe9 0x21 0x7f 0000 0000 0x90 0x77 0xea 0xe9 0x21 0x7f 0000 0000

A few interesting things happened here. First of all, the size of the structure immediately got larger. The size is now 16B, which is 8 times larger than the original! But we can also see from data: 0x03 0x07 ... that a and b are still in the same places within the structure. What is happening? Let’s imagine an example with pointer arithmetic:

S_aligned_a data[10];
S_aligned_a* ptr = &data[0];

for (int i = 0; i < 10; i++) {
 std::cout << (int)ptr->a << std::endl; // cast so the uint8_t prints as a number
 ptr++;
}

Due to how pointer arithmetic works, incrementing the pointer, as in the loop using ptr, increases the value stored in the pointer (the address it points at) by the size of the pointed-to type. If the size were 2B, as for the original structure, the pointer would be increased by 2B on each iteration. This means that we would end up accessing objects of the S_aligned_a type at the following addresses:


0x7ffc5b06a0f0,
0x7ffc5b06a0f2,
0x7ffc5b06a0f4,
0x7ffc5b06a0f6,
0x7ffc5b06a0f8,
0x7ffc5b06a0fa,
0x7ffc5b06a0fc,
0x7ffc5b06a0fe,
0x7ffc5b06a100,
...

Out of all of the above addresses, only 0x7ffc5b06a0f0 and 0x7ffc5b06a100 are aligned to a 16B boundary! This means that a would not be aligned for data[1], data[2], ..., data[7], data[9] and so on.
Now the obvious solution is to add padding at the end of each structure object, so that the next structure, starting directly after the previous one, is automatically aligned to the required boundary. This is exactly what the compiler does. Unfortunately in some specific cases, such as the one presented, this might grow the required size enormously.
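
These layout properties can be checked at compile time. A short sketch (the first two assertions are guaranteed by the language; the exact offset of b is what the compiler used for the run above):

#include <cstddef>
#include <cstdint>

struct S_aligned_a {
  alignas(16) uint8_t a;
  uint8_t b;
};

// The alignment of the member propagates to the whole structure, and the size
// is rounded up to a multiple of that alignment, so array elements stay aligned.
static_assert(alignof(S_aligned_a) >= 16, "alignment propagated from the member");
static_assert(sizeof(S_aligned_a) % alignof(S_aligned_a) == 0, "size is a multiple of alignment");
static_assert(offsetof(S_aligned_a, b) == 1, "b still directly follows a");

int main() {}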

As for the previous case, aligning the whole structure doesn’t change much:

[aligned S_aligned_a] size: 16
address: 0x7ffe25de5000
required type alignment: 16
object alignment: 2048
data:0x03 0x07 0x60 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

Structure #3: only the second member aligned

Now let’s try defining another structure, this time with only b aligned:

struct S_aligned_b{
 uint8_t a;
 alignas(16) uint8_t b;

 S_aligned_b(): a(3), b(7) {}
};

A particular test execution gives us:


[S_aligned_b] size: 32
address: 0x7ffc5b06a160
required type alignment: 16
object alignment: 32
data:0x03 0000 0000 0000 0x10 0000 0000 0000 0x01 0000 0000 0000 0000 0000 0000 0000 0x07 0x19 0x60 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

What?! Yet another size increase!? Again, C++ is not allowed to reorder the members (due to the serialization problems it might cause), so the only thing the compiler can do is insert padding between a and b. Let’s think for a moment about why this is necessary.

The a member is aligned to a 1B boundary, that is, to its natural alignment. Let’s assume that an object of the S_aligned_b structure is aligned to a 16B boundary. a would then be aligned to 16B, but the following 15 bytes would only be guaranteed to be aligned to 1, 2, 4 and 8 byte boundaries. This means that the first unoccupied address after a that is aligned to a 16B boundary starts at offset 16 from the beginning of the structure. Hence the space between a and b has to be padded with unused bytes, so that the alignment requirement for b is met.

We can use the same reasoning to explain why the required type alignment is 16: it has to be, to guarantee that b is properly aligned.
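
The same kind of compile-time check, this time for S_aligned_b (again, the exact offsets and size are those produced by the compiler used for the runs above; the language only guarantees that b lands on a 16B-aligned offset):

#include <cstddef>
#include <cstdint>

struct S_aligned_b {
  uint8_t a;
  alignas(16) uint8_t b;
};

static_assert(offsetof(S_aligned_b, a) == 0, "a sits at the beginning");
static_assert(offsetof(S_aligned_b, b) == 16, "b pushed to the next 16B boundary");
static_assert(sizeof(S_aligned_b) == 32, "tail padding keeps array elements aligned");

int main() {}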

Once again, aligning the whole object does not have a drastic impact:

[aligned S_aligned_b] size: 32
address: 0x7ffe25de4f80
required type alignment: 16
object alignment: 128
data:0x03 0x8e 0x9f 0xa2 0x01 0x7f 0000 0000 0x90 0x17 0x9f 0xa2 0x01 0x7f 0000 0000 0x07 0xc0 0x66 0xa2 0x01 0x7f 0000 0000 0x80 0x7c 0x9f 0xa2 0x01 0x7f 0000 0000

Structure packing

As shown before, the size of the structure can grow surprisingly quickly. Can we fight it, and how? We can reorder the members of a class/struct so that they pack more tightly. The code to show this:

#include <iostream>
#include <iomanip>
#include <cstdint>

void print_address(uint64_t addr) {
  std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << addr << std::dec;
}

void print_raw(uint8_t *ptr, int length) {
  for(int i = 0; i < length; i++) {
    std::cout << std::showbase << std::internal << std::setfill('0') <<std::hex << std::setw(4) << (uint32_t)ptr[i] << " " ;
  }
  std::cout << std::dec << std::endl;
}

int alignment(uint64_t addr) {
 uint64_t result = 1;
 for(uint64_t i = 2; i < 4096; i*=2) {
  if(addr % i != 0) break;
  result = i;
 }
 return result;
}

struct S{
 uint8_t a;
 uint16_t b;
 uint32_t c;
 uint64_t d;
 alignas(256) uint64_t e;

 S(): a(3), b(7), c(9), d(1), e(6) {}
};

struct S_packed{
 alignas(256) uint64_t e;
 uint64_t d;
 uint32_t c;
 uint16_t b;
 uint8_t a;

 S_packed(): a(3), b(7), c(9), d(1), e(6) {}
};

struct S_packed_noalign{
 uint64_t e;
 uint64_t d;
 uint32_t c;
 uint16_t b;
 uint8_t a;

 S_packed_noalign(): a(3), b(7), c(9), d(1), e(6) {}
};


int main()
{
  S x0;

  std::cout << "[S] size: " << sizeof(x0) 
    << "\naddress: " << &x0
    << "\nrequired type alignment: " << alignof(S)
    << "\nobject alignment: " << alignment((uint64_t)&x0)
    << "\ndata:";
  print_raw((uint8_t*) &x0, sizeof(x0));
  
  S_packed x1;
  
  std::cout << "[S_packed] size: " << sizeof(x1) 
    << "\naddress: " << &x1
    << "\nrequired type alignment: " << alignof(S_packed)
    << "\nobject alignment: " << alignment((uint64_t)&x1)
    << "\ndata:";
  print_raw((uint8_t*) &x1, sizeof(x1));
  
  S_packed_noalign x2;
  
  std::cout << "[S_packed_noalign] size: " << sizeof(x2) 
    << "\naddress: " << &x2
    << "\nrequired type alignment: " << alignof(S_packed_noalign)
    << "\nobject alignment: " << alignment((uint64_t)&x2)
    << "\ndata:";
  print_raw((uint8_t*) &x2, sizeof(x2));

}

The output (in the data dumps only the leading bytes holding the member values matter; the rest is padding filled with whatever happened to be in memory):

[S] size: 512
address: 0x7fff1beb2b00
required type alignment: 256
object alignment: 256
data:0x03 0x34 0x07 0000 0x09 0000 0000 0000 0x01 0000 0000 0000 0000 0000 0000 0000 0x46 0x1d 0x41 0x11 0x6a 0x7f 0000 0000 0x58 0xb7 0xbc 0x10 0x6a 0x7f 0000 0000 0x98 0x66 0x40 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0x01 0000 0000 0000 0xec 0x07 0000 0000 0x01 0000 0000 0000 0x25 0x0c 0x78 0x11 0x6a 0x7f 0000 0000 0x48 0xec 0xbb 0x10 0x6a 0x7f 0000 0000 0xa8 0x45 0x99 0x11 0x6a 0x7f 0000 0000 0xc0 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x16 0x04 0x78 0x11 0x6a 0x7f 0000 0000 0x50 0x42 0x99 0x11 0x6a 0x7f 0000 0000 0x10 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x5e 0x96 0x93 0x1c 0000 0000 0000 0000 0x10 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x07 0000 0000 0000 0000 0000 0000 0000 0xb8 0x4f 0x96 0x11 0x6a 0x7f 0000 0000 0xda 0x16 0x87 0x30 0000 0000 0000 0000 0x25 0x0c 0x78 0x11 0x6a 0x7f 0000 0000 0x07 0000 0000 0000 0000 0000 0000 0000 0x5b 0x1c 0xc2 0000 0000 0000 0000 0000 0x1a 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x07 0000 0000 0000 0000 0000 0000 0000 0x90 0x2d 0xeb 0x1b 0xff 0x7f 0000 0000 0xa8 0x06 0x3f 0x11 0x6a 0x7f 0000 0000 0xb0 0xc3 0x40 0x11 0x6a 0x7f 0000 0000 0x70 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x16 0x04 0x78 0x11 0x6a 0x7f 0000 0000 0xa8 0x06 0x3f 0x11 0x6a 0x7f 0000 0000 0xa0 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x06 0000 0000 0000 0000 0000 0000 0000 0xa0 0x2c 0xeb 0x1b 0xff 0x7f 0000 0000 0x07 0000 0000 0000 0000 0000 0000 0000 0xb8 0x4f 0x96 0x11 0x6a 0x7f 0000 0000 0xed 0xe9 0x43 0x2b 0000 0000 0000 0000 0x62 0x0d 0x78 0x11 0x6a 0x7f 0000 0000 0x62 0x85 0x43 0x11 0x6a 0x7f 0000 0000 0xa7 0x0f 0xad 0000 0000 0000 0000 0000 0x2d 0000 0000 0000 0x6a 0x7f 0000 0000 0xb8 0xd6 0xbb 0x10 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x20 0x2e 0xeb 0x1b 0xff 0x7f 0000 0000 0x48 0xec 0xbb 0x10 0x6a 0x7f 0000 0000 0xf8 0x35 0xbc 0x10 0x6a 0x7f 0000 0000 0xe0 0x2d 0xeb 0x1b 0xff 0x7f 0000 0000 0x08 0x2e 0xeb 0x1b 0xff 0x7f 0000 0000 0x50 0x42 0x99 0x11 0x6a 0x7f 0000 0000 0x20 0x31 0x96 0x11 0x6a 0x7f 0000 0000 0xda 0x16 0x87 0x30 0000 0000 0000 0000 0xea 0x0f 0x78 0x11 0x6a 0x7f 0000 0000 0x88 0x91 0x99 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x60 0x30 0x96 0x11 0x6a 0x7f 0000 0000 0xf8 0x44 0x96 0x11 0x6a 0x7f 0000 0000 0xe7 0x08 0x40 0000 0000 0000 0000 0000 0x58 0xb7 0xbc 0x10 0x6a 0x7f 0000 0000 0x10 0x04 0x40 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x01 0000 0000 0000 0x12 0x03 0000 0000 0x01 0000 0000 0000 0x50 0x42 0x99 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0xe0 0x94 0x99 0x11 0x6a 0x7f 0000 0000
[S_packed] size: 256
address: 0x7fff1beb2d00
required type alignment: 256
object alignment: 256
data:0x06 0000 0000 0000 0000 0000 0000 0000 0x01 0000 0000 0000 0000 0000 0000 0000 0x09 0000 0000 0000 0x07 0000 0x03 0000 0x60 0x30 0x96 0x11 0x6a 0x7f 0000 0000 0xed 0xe9 0x43 0x2b 0000 0000 0000 0000 0xea 0x0f 0x78 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x60 0x30 0x96 0x11 0x6a 0x7f 0000 0000 0x01 0000 0000 0000 0xff 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x01 0000 0000 0000 0xff 0x7f 0000 0000 0x88 0x91 0x99 0x11 0x6a 0x7f 0000 0000 0x3a 0x02 0000 0000 0000 0000 0000 0000 0x64 0x4c 0x41 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0xd5 0xb9 0x78 0x11 0x6a 0x7f 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0xe0 0x94 0x99 0x11 0x6a 0x7f 0000 0000 0x20 0x2e 0xeb 0x1b 0xff 0x7f 0000 0000 0x50 0x42 0x99 0x11 0x6a 0x7f 0000 0000 0x38 0x2e 0xeb 0x1b 0xff 0x7f 0000 0000 0x85 0x4d 0x02 0x04 0x01 0000 0000 0000 0x50 0xea 0x40 0x11 0x6a 0x7f 0000 0000 0xe7 0x08 0x40 0000 0000 0000 0000 0000 0x48 0x96 0x38 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0x94 0x5e 0x77 0x11 0x6a 0x7f 0000 0000 0x90 0xe7 0x76 0x11 0x6a 0x7f 0000 0000 0000 0x90 0x3e 0x11 0x6a 0x7f 0000 0000 0x80 0x4c 0x77 0x11 0x6a 0x7f 0000 0000 0x18 0xf2 0x76 0x11 0x6a 0x7f 0000 0000 0x50 0x95 0x76 0x11 0x6a 0x7f 0000 0000
[S_packed_noalign] size: 24
address: 0x7fff1beb2e00
required type alignment: 8
object alignment: 512
data:0x06 0000 0000 0000 0000 0000 0000 0000 0x01 0000 0000 0000 0000 0000 0000 0000 0x09 0000 0000 0000 0x07 0000 0x03 0000

The general rule for reordering is to calculate max(required_alignment(member), sizeof(member)) for every member and declare the members with higher values first. In this particular case, where one of the fields requires a very high alignment, the packing technique halved the size of the structure.
If we can additionally relax the alignment requirement, as in the structure S_packed_noalign, the amount of space required becomes even smaller. We might, however, run into problems with efficient register LOAD/STORE operations if the alignment criteria are not met.
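
Applied to the members of S, the rule gives the following ordering (a quick worked sketch; ordering_key is a helper made up for this example, and the sizes assume a typical 64-bit platform):

#include <cstddef>
#include <cstdint>

// Ordering key used for packing: max(required alignment, size).
// For the members of S:
//   e: max(256, 8) = 256  -> declare first
//   d: max(8, 8)   = 8
//   c: max(4, 4)   = 4
//   b: max(2, 2)   = 2
//   a: max(1, 1)   = 1    -> declare last
constexpr std::size_t ordering_key(std::size_t align, std::size_t size) {
  return align > size ? align : size;
}

static_assert(ordering_key(256, sizeof(uint64_t)) == 256, "key for e (alignas(256))");
static_assert(ordering_key(alignof(uint64_t), sizeof(uint64_t)) == 8, "key for d");
static_assert(ordering_key(alignof(uint8_t), sizeof(uint8_t)) == 1, "key for a");

int main() {}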

PITFALL: If structures are nested, modifications made in component structures might change their size/alignment requirements, and therefore might also require rearrangement of the enclosing structures. A common, nasty scenario is when alignment problems appear after a remote commit that is not directly related to the affected structure. In such scenarios, seemingly innocent changes can drastically decrease the performance of an application.

Summary

We discussed the topic of static alignment. A few performance hints to remember:

  1. The alignment requirement can be controlled using the alignas keyword.
  2. The natural alignment can be obtained using the alignof keyword.
  3. Alignment of an array is set only for the first element. Therefore consecutive elements might not respect the requested alignment constraint.
  4. When aligning members of structures, the size of the structure might increase substantially due to padding introduced by the compiler.
  5. Structures can be packed more densely by manually reordering their members, so that the resulting size is decreased.

I don’t think that the topic of static alignment is exhausted here, but the post already got eXtra Large. So let me know if something requires more thorough explanation.

In the next post I will discuss how alignment can be controlled in a dynamic way.

UME::SIMD Tutorial #5: Memory subsystem and alignment

In this tutorial we will discuss a few basic ideas related to memory management in the context of the memory hierarchy. While efficient code generation is an issue, the real performance problems are related to memory. This topic is very broad, and possibly deserves another tutorial series, but some basic knowledge is required to understand the basic aspects of vector code development. The material here is theoretical, as we will take a closer look at specific techniques in the next posts.

Cache subsystem

If you ever took basic computer architecture or digital electronics courses, you should already be aware of two basic computer architectures: the Von Neumann architecture and the Harvard architecture. It should be noted that modern computer systems are a variation of both of them.

[Figure: Von Neumann architecture]

The Von Neumann architecture in its simplest implementation consists of a memory device, a CPU and I/O interfaces, possibly connected to some external devices. In this tutorial we won’t discuss the topic of I/O in detail; we will focus on the memory and CPU interaction.

The whole system can work like this:

  1. CPU loads a value from specific memory cells into registers,
  2. CPU performs some calculations on registers,
  3. CPU writes back the result into the memory.

This is very similar to what we did in the previous tutorial. Unfortunately this mode of operation has long been proven to be very hurtful for performance. Imagine that the CPU carries out some computations, and then tries to reuse results generated in the past. The number of registers available is very small, even in modern systems. If the data does not fit in the register bank, then new data possibly has to be loaded and the old data has to be stored back to memory. Once the result is needed again, it has to be loaded again. Both memory reads and writes are expensive in comparison to moving data between registers, simply because of the physical distance between the arithmetic/control units, RAM and the register bank.

Often our programs iterate over large memory containers, with data laid out over consecutive memory addresses. An additional optimization opportunity opens up: loading the data that will be needed in the future while the CPU works on previous data. This basic concurrency technique allows hiding the memory latency, and therefore decreasing the overall computation time. Unfortunately with the original architecture, the size of the register bank limits us to loading only a few additional fields.

An engineering approach to solving the problem of memory access latency was the invention of a cache: a small additional memory placed somewhere between the CPU and RAM, preferably having a dedicated, direct connection to the CPU. Initially the cache was placed on motherboards, slowly moving closer and closer towards the CPU cores. As the cache was moving closer to the cores, the amount of physical space available was dropping, limiting the size of the cache. At the same time, the initially single cache was split into a hierarchy of cache levels.

[Figure: cache hierarchy]

The above picture gives a brief overview of how the memory subsystem works nowadays. The data is moved from RAM through a series of cache levels, each of them replicating the original data, and each able to communicate only with the immediately higher and lower memory layers. Cache level L1 is the smallest one and the only one capable of exchanging information with the CPU registers. Cache level L2 is larger than L1, but also restricted to a single CPU core. The main reason why L1 and L2 are distinct in some CPUs is that L1 might come in two pieces: a data cache and an instruction cache. Cache level L3 is usually much larger than L2 and is shared between multiple cores of a single CPU. The existence of a shared cache can improve inter-core communication, and thus is a very interesting topic for high-performance engineering.

Now how does the cache system work? Imagine a scenario: a value (let’s say a 64b integer) is loaded from RAM to L3, then from L3 to L2, then from L2 to L1, finally landing in one of the registers. The CPU performs some computations and, if it runs out of registers, uses the L1 cache to temporarily store the data. Once the data is needed again, it only has to be loaded from L1 and not from RAM! OK, but L1 is very small. So what happens when L1 is depleted? Simply, some of the data is moved from L1 to L2, and moved back from L2 to L1 and then to registers when needed. In a similar way the data from L2 is pushed back to L3 if L2 is full. Finally, if L3 is full, the only thing that can happen is to write the data back to RAM.

It is very possible that you, a programmer, never had to deal with the cache before. Why is that? The cache subsystem is managed primarily by the hardware and system software. The hardware decides when and what data has to be moved between different levels of cache. The system software, including compilers or some privileged modules, might have some form of direct effect on cache behaviour by issuing hints or by configuring the cache management hardware. Regardless of that, from the software perspective the cache is almost completely transparent. Almost, as it might become visible when measuring application performance.

Register alignment

When moving data between registers and memory, it is important to be aware that the data is easiest to load when it is register aligned, that is:

[address of data element] modulo [size of register] = 0

This can be depicted using the following example:

[Figure: register alignment]

In the diagram, we have a structure definition with a possible layout of that structure in contiguous memory. Field a is aligned to its natural alignment, defined as:

[address of data element] modulo [size of data element] = 0

Field b is an array starting immediately after the a field and, according to C++ rules, aligned to the first address that meets the natural alignment of its elements, that is 64b. Mind that all fields of b are 64b aligned, and the array itself is 64b aligned, as the alignment of its first field is 64b.

On some architectures, only register aligned load operations are permitted. Other architectures offer unaligned loads, but penalise them with longer execution time. For both performance and correctness reasons it is best to always align the data to the register alignment when performing computations. As depicted in the example above:

  1. loading b[i] to any 64b register is always possible, as each element of b is register aligned;
  2. loading a pair b[0]-b[1] to a 128b register (i.e. a vector register) might cause alignment issues, as the field is not aligned to 128b boundary;
  3. loading a pair b[1]-b[2] to a 128b register will not cause the alignment issues, as both fields together are aligned to 128b boundary.

At the same time, aligning too much might waste some space, as padding will be added to fill the gaps between aligned fields. Knowing what needs to be aligned, and what should be left untouched, comes with some practice, and with an understanding of the internals of an application.

In the case presented above, we also assumed that the structure S itself is at least 128b aligned. This does not have to be the case in a general scenario.
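
A small helper for checking any of these alignment conditions in code (an illustrative sketch, not part of UME or the standard library; the same check works for the cacheline and page alignments discussed below):

#include <cstddef>
#include <cstdint>

// True if 'ptr' is aligned to 'boundary' bytes (boundary is a power of two
// for the typical use cases: 16, 64, 4096, ...).
inline bool is_aligned(const void* ptr, std::size_t boundary) {
  return (reinterpret_cast<std::uintptr_t>(ptr) % boundary) == 0;
}

// Usage: is_aligned(&b[0], 16) tells whether b[0] and b[1] can be loaded
// together into a 128b register without an unaligned access.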

Cachelines

To simplify the hardware a bit (and make things faster), instead of passing a single byte at a time, the data is transferred in chunks called cachelines. On x86 the cacheline size is 64B. On arm32 it is 32B and on arm64 it is 64B. Other processors might use different values.

The existence of the cache has an impact on performance. Reading the first byte of a cacheline will take some time, as the data has to be transported through the cache. Reading the next 63 bytes, regardless of whether they contain the same data set or not, will introduce little to no latency. This means that the most efficient memory layout is a contiguous array of data. If you are using an array of structures (AoS) layout, then accessing only a single field of each structure will not be an optimal access pattern: more cachelines will have to be loaded into L1, and the percentage of utilisation of each cacheline will be very low.
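
A minimal sketch of the two layouts (the Particle names are made up for this example):

// Array of Structures (AoS): iterating over only the 'x' field touches a new
// cacheline every few elements, and most of each cacheline goes unused.
struct ParticleAoS { float x, y, z, w; };
ParticleAoS particles_aos[1024];

// Structure of Arrays (SoA): all 'x' values are contiguous, so a loop over
// them consumes entire cachelines.
struct ParticlesSoA { float x[1024]; float y[1024]; float z[1024]; float w[1024]; };
ParticlesSoA particles_soa;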

We will say that a data element is cacheline aligned if it begins at a new cacheline, that is:

[address of data element] modulo [cacheline size] = 0

A large portion of application optimization revolves around making data layouts cache friendly. This means both data reuse and maximizing cacheline consumption.

Pages

Computer systems have very limited resources in terms of RAM, compared to permanent storage such as hard drives. Running out of RAM is bad for everyone, as it most often results in an application crash and sometimes even in a system crash. A common way of avoiding insufficient memory is known as memory virtualization. Memory virtualization is also a very useful mechanism to protect applications from the malicious influence of other, possibly untrusted applications.

In great, great simplification, memory virtualization works as follows:

  1. When an application starts, the operating system allocates a certain amount of space required by the static requirements of the program (e.g. the program stack). The space starts at a virtual address 0x0 and ends at some virtual address 0xff..ff. The precise size of this space depends on the OS and hardware. These address spaces are not unique between applications, but the addresses are unique within a single system process.
  2. The virtual address space is divided into 4kB blocks, called pages. Each page of virtual memory that is considered useful is registered in the virtual address translation table. The table consists of: process ID, virtual page address and physical page address. The physical page address is unique within the operating system.
  3. When the process allocates some memory, the OS adds entries to the table and reserves some space in physical memory. When the process frees the memory, the OS removes the specific entries from the table.
  4. When the OS runs out of physical memory, it allocates a virtual memory file somewhere on a hard drive. This file extends the RAM capacity by allowing OS-directed movement of pages between physical memory and the virtual memory file.
  5. When a user process (i.e. an application) accesses a specific address in virtual memory, the OS checks the physical address of the page. If the page resides in RAM, it starts the process of address translation. If it is not in RAM, the OS moves the specific page from the virtual memory file to RAM, possibly replacing some other process’ pages and re-assigning the physical address in the virtual address translation table. Once this memory migration happens, the OS calculates the physical address and returns it to the application.

In some sense the paging mechanism acts as an OS based caching mechanism. While initially paging was implemented purely in software, a large amount of effort was put into hardware support, primarily for performance reasons. We won’t be talking about paging too much in the context of this series, but it is an important topic to be considered in the context of parallel applications.

For the sake of completeness: we will call a memory element page aligned if it meets the following equation:

[address of memory element] modulo [page size] = 0


Summary

At this moment you should already be aware of the cache subsystem and the paging mechanism. You should also be able to distinguish between natural alignment, register alignment, cacheline alignment and page alignment.

A few performance hints to remember:

  1. Accessing data in a contiguous manner is usually the most efficient way, as it allows complete consumption of cachelines.
  2. Maximizing data reuse is the key to exploiting the cache subsystem. Perform as many operations as possible on the most recently used/created data.
  3. Try to meet alignment requirements, as doing so can reduce the number of transported cachelines and make cache-to-register communication faster.

In the following posts we will look at some practical aspects of memory alignment and efficient cache utilisation. We will also look at data structures and memory layouts that are critical from the perspective of performance.