In this tutorial we will discuss few basic ideas related to memory management in the context of memory hierarchy. While efficient code generation is an issue, the real performance problems are related to memory. This topic is very broad, and possibly deserves another tutorial series, but some basic knowledge is required to understand basic aspects of vector code development. The material here is theoretical, as we will be taking look at specific techniques closer in next posts.
If you ever took basic computer architecture or digital electronics courses, you should be already aware of two basic computer architectures: Von Neumann architecture and Harvard architecture. It should be noted that modern computer systems are a variation of both of them.
The Von Neumann architecture in a simplest implementation consists of: memory device, a CPU and I/O interfaces, possibly connected to some external devices. In this tutorial we won’t discuss in detail the topic of I/O, but we will be focusing on memory and CPU interaction.
The whole system can work like this:
- CPU loads a value from specific memory cells into registers,
- CPU performs some calculations on registers,
- CPU writes back the result into the memory.
This is very similar to what we did in previous tutorial. Unfortunately this mode of operation was already proven to be very hurtful for performance. Imagine that CPU carries some computations, and then tries to reuse results generated in the past. The number of registers available is very small, even in modern systems. If the data does not fit the register bank then possibly new data has to be loaded and the old data has to be stored back to memory. Once the result is needed again, it has to be loaded again. Both memory reads and writes are expensive in comparison to moving data between registers, simply because of physical distance between: arithmetic/control units, RAM and register bank.
Often our programs iterate over large memory containers, with data layed out over consecutive memory addresses. An additional optimization opportunity opens: loading the data that will be needed in the future while CPU works on previous data. This basic concurrency technique allows hiding memory latency, and therefore decreasing overall computation time. Unfortunately with the original architecture, the size of register bank limits us to loading only few additional fields.
An engineering approach for solving the problem of memory access latency was invention of a cache, a small additional memory placed somewhere between the CPU and RAM, and preferrably having a dedicated, direct connection to CPU. Initially the cache was placed on motherboards, slowly moving closer and closer towards CPU cores. As the cache was moving closer to cores, the amount of physical space available was dropping, limiting the size of cache. At the same time, initially single cache, has been split into a hierarchy of cache levels.
The above picture gives a brief overview of how does the memory subsystem work nowadays. The data is moved from RAM through a series of cache levels, each of them replicating the original data, and being able to communicate only with immediately higher and lower memory layers. Cache level L1 is the smallest one and the only one capable of exchanging information with CPU registers. Cache level L2 is larger than L1, but also restricted to single CPU core. The main reason why L1 and L2 are distinct in some CPUs is that L1 might come in two pieces: data cache and instruction cache. Cache level L3 is usually much larger than L2 and is shared between multiple cores of a single CPU. Existence of shared cache can improve inter-core communication, and thus is a very interesting topic for high-performance engineering.
Now how does the cache system work? Imagine a scenario: a value (let’s say 64b integer) is loaded from RAM to L3, then from L3 to L2, then from L2 to L1, finally landing in one of the registers. CPU performs some computations and, if it runs out of registers, uses L1 cache to temporarily store the data. Once the data is needed again, it only has to be loaded from L1 and not from RAM! OK, but L1 is very small. So what happens when L1 is depleted? Simply some of the data is moved from L1 to L2, and moved back from L2 to L1 and then to registers when needed. In a similar way the data from L2 is being pushed back to L3 if L2 is full. Finally if L3 is full, the only thing that can happen is to write the data to RAM.
It is very possible that you, a programmer, never had to deal with cache before. Why is that? The cache subsystem is managed primarily by the hardware and systems software. The hardware decides when and what data has to be moved between different levels of cache. The system software, including compilers or some privilaged modules, might have some form of direct effect on cache behaviour by issuing hints or by configuring the cache management hardware. Regardless of that, from the software perspective the cache is almost completely transparent. Almost, as it might be visible when measuring application performance.
When moving data between registers and memory, it is important to be aware, that the data is easiest to load, when it is register aligned, that is:
[address of data element] modulo [size of register] = 0
This can be depicted using following example:
In the diagram, we have some structure definition with a possible layout of that structure in the contiguous memory. Field a is aligned to it’s natural alignment defined as:
[address of data element] modulo [size of data element] = 0
Field b is an array starting immediately after the a field and being aligned according to C++ rules, to the first address that meets natural alignment of its’ elements, that is 64b. Mind that all fields of b are 64b aligned, and the array itself is 64b aligned as its’ first fields’ highest alignment is 64b.
On some architectures, only register aligned load operations are permitted. Other architectures offer unaligned loads, but penalise them with longer execution time. For both performance and correctness reasons it is best to always align the data to the register alignment when performing computations. As depicted in the example above:
- loading b[i] to any 64b register is always possible, as each element of b is register aligned;
- loading a pair b-b to a 128b register (i.e. a vector register) might cause alignment issues, as the field is not aligned to 128b boundary;
- loading a pair b-b to a 128b register will not cause the alignment issues, as both fields together are aligned to 128b boundary.
At the same time, aligning too much might waste some space, as padding will be added to fill the gaps between aligned fields. Knowing what needs to be aligned, and what should be left untouched comes with some practice, and with understanding of internals of an application.
In the case presented above, we also assumed, that the structure S itself is at least 128b aligned. This does not have to be the case in a general scenario.
To simplify the hardware a bit (and make things faster), instead of passing a single byte at a time, the data is being transfered in chunks called cachelines. In x86 the cache line size is always 64B. In arm32 it’s 32B and in arm64 it is 64B. Other processors might use different values.
Existence of cache has an impact on performance. Reading first byte of a cacheline will take some time, as the data has to be transported through the cache. Reading next 63 bytes, regardless of whether they contain the same data set or not, will introduce very little if no latency. This means that the most efficient memory layout is a contiguous array of data. If you are using an array of structures (AoS) layout, then accessing only a single field of each structure will not be an optimal access pattern: more cachelines will have to be loaded to L0, and the percentage of utilisation of each cacheline will be very low.
We will say that a data element is cacheline aligned if it begins with a new cacheline, that is:
[address of data element] modulo [cacheline size] = 0
A large portion of application optimization is revolving around making the data layouts cache friendly. This means both data reuse and maximizing cacheline consumption.
Computer systems have very limited resources in terms of RAM, compared to parmanent storage such as hard drives. Running out of RAM is bad for everyone, as it most often results in application and sometimes even in system crash. A common way of avoiding insufficient memory is known as memory virtualization. Memory virtualization is also a very useful mechanism to protect other applications from malicious influence of other, possibly untrusted applications.
In great, great simplification, memory virtualization works as follows:
- When application starts, the operating system allocates certain amount of space required by static requirements of the program (e.g. the program stack). The space is starting at a virtual address 0x0, and ends at some virtual address 0xff..ff. The precise size of this space depends on the OS and hardware. These address spaces are not unique between applications, but the addresses are unique within a single system process.
- The virtual address space is divided into 4kB blocks, called pages. Each page of virtual memory that is being considered to be useful is registered in virtual address translation table. The table consists of: process ID, virtual page address and physical page address. Physical page address is unique in the scale of the operating system.
- When the process allocates some memory, the OS adds entries to the table, and reserves some space in the physical memory. When the process frees up the memory, the OS removes specific entries from the table.
- When the OS runs out of physical memory it allocates a virtual memory file somewhere on a hard drive. This file extends the RAM capacity by allowing OS directed movement of pages between physical memory and virtual memory file.
- When a user process (i.e. an application) accesses a specific address in virtual memory, the OS checks the physical address of the page. If the address fits into the RAM, it starts the process of address translation. If the address is not in the RAM, the OS moves specific page from virtual memory file to RAM, possibly replacing some other process’ pages, and re-assigning the physical address in the virtual address translation table. Once this process of memory migration happens, the OS calculates the physical address and returns it to the application.
In some sense the paging mechanism acts as a OS based caching mechanism. While initially paging was implemented purely in the software, a large amount of effort was put into hardware support, primarilly for performance reasons. We won’t be talking about paging too much in the context of this series, but it is an important topic to be considered in a context of parallel aplications.
For the sake of completeness: we will call a memory element to be page aligned if it meets the following equation:
[address of memory element] modulo [page size] = 0
You can read more about memory virtualization here.
At this moment you should be already aware of the cache subsystem and the paging mechanism. You should also be able to distinguish between natural alignment, register alignment, cacheline alignment and page alignment.
Few performance hints to be remembered:
- Accessing data in a contiguous manner is usually the most efficient way as it allows complete consumption of cachelines.
- Maximizing data reusal is the key to exploit cache subsystem. Perform as many operations as possible with the most recently used/created data.
- Try to meet alignment requirements, as it could reduce number of transported cachelines and make cache-registers communication faster
In following posts we will be looking at some practical aspects of memory alignment and efficient cache utilisation. We will also look at data structures and some memory layouts that are critical from the perspective of performance.