One of the key elements of developing modern software for performance is the use of SIMD instructions. I have dedicated a series of posts to UME, the library I’ve been developing for a while. It is the approach I like the most, but there are others.
In this post I will briefly discuss the pros and cons of each of the popular approaches. This is a lightweight post intended to give you an overview of how you can approach your problem. I won’t delve into the gory technical details, but rather try to point you to external links where you can learn more about a specific technique. I will also limit the discussion to the tools available for C++ projects.
From the programmer’s perspective, autovectorization is the simplest and most pleasing concept: write your code normally and rely on the compiler to generate SIMD instructions. This is its biggest advantage. Unfortunately, it is not that simple in practice.
First of all, the compiler needs to be aware of SIMD instructions. If you are using g++, clang++ or some proprietary compiler developed specifically for that purpose, you are on the safe side. If the compiler is developed in-house or targets a niche language, this approach will probably not work.
Secondly, the compiler cannot vectorize arbitrary code. Autovectorization is usually limited to loops, and not all loops can be vectorized. In most cases the code has to be rewritten to meet some specific criteria:
- number of iterations has to be countable – that is it has to be known before the loop starts running,
- no data dependencies between iterations – later iterations cannot use results from preceding iterations,
- no control flow that breaks the loop – in C++ context this means no break and goto statements.
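The criteria above can be illustrated with two small loops. This is only a sketch: the function names `saxpy` and `prefix_sum` are made up for this example, and whether a given compiler actually vectorizes the first loop depends on its version and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Good candidate for autovectorization: the trip count is known before the
// loop starts, every iteration touches only its own elements, and there is
// no break or goto inside the loop body.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = y.size();
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Poor candidate as written: iteration i reads the value produced by
// iteration i - 1, so there is a dependency between iterations.
void prefix_sum(std::vector<float>& y) {
    const std::size_t n = y.size();
    for (std::size_t i = 1; i < n; ++i) {
        y[i] = y[i] + y[i - 1];
    }
}
```

To check what the compiler did, GCC can report vectorized loops with -fopt-info-vec and Clang with -Rpass=loop-vectorize.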
The third problem is performance stability. Imagine that you are working on a project where you are responsible for providing performance-critical components. Someone is given the task of extending the functionality of primitives that you painstakingly tuned for performance. They modify one hot loop with a seemingly innocent additional statement. The change makes it into the baseline and goes to the customer. Suddenly you get tons of bug reports saying that between two versions there is, say, an 8x slow-down… Now guess who is going to be blamed for that?
Some more reading about autovectorization:
GCC gives a nice list of loops that can be autovectorized.
A nice post on how SIMD can be used in java can be found here.
A guide to use autovectorization with Intel compiler.
One additional technology that I would also put into this category is OpenMP. (I know, some people will not agree, but we are only talking about #pragma omp simd, which is a very small feature of the whole standard.) The approach is based on annotating loops or functions with directives telling the compiler that it can safely replace a scalar construct with its SIMD equivalent. The only difference from basic autovectorization is that it is easier to inform the compiler that the specific criteria for loop vectorization are fulfilled. It also requires a compliant compiler.
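As a minimal sketch of the directive in use, here is a dot product with a reduction clause. The function name `dot` is made up for this example. If the compiler is not told to honor OpenMP SIMD directives (for instance with -fopenmp-simd on GCC and Clang), the pragma is ignored and the loop simply runs as ordinary scalar code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

float dot(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    const float* pa = a.data();
    const float* pb = b.data();
    float sum = 0.0f;
    // The directive asserts that the loop is safe to vectorize and that
    // 'sum' is a reduction variable, so partial sums may be kept per lane.
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += pa[i] * pb[i];
    }
    return sum;
}
```

Note that the reduction clause allows the additions to be reordered, so for floating-point data the result may differ slightly from the strictly sequential sum.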
Manual assembly is the crudest way of programming. Regardless of what people tell you, it is still used. There is a plethora of disadvantages to manual assembly: validation, portability, problems with compiler interaction, code maintenance etc. The way I see it, there is one perfectly good reason why you might use it: WYWIWYG (What You Want Is What You Get). When you write a list of assembly instructions, you will get exactly what you requested. The compiler performs no transformations, except for mapping register names. This gives you almost total control over what will happen inside your kernels.
Using SIMD instructions with this approach does not differ from assembly programming, so you probably have to be familiar with inline assembly and instruction set details (Intel and ARM as examples).
SIMD Intrinsic functions
Intrinsic functions are a form of unofficial extension to a language. Because auto-vectorization requires a lot of work on both compiler theory and engineering, a simple solution was offered in the early days of SIMD instruction sets. The basic idea is that you write your code using functions and special register types. As opposed to auto-vectorization, the use of SIMD operations is stated explicitly by the programmer. This reduces the problem of performance stability to just a few corner cases. This approach has been shown many times to be a sufficient methodology for reaching near-peak performance.
There are a few drawbacks. First of all, intrinsics only expose instructions (or groups of instructions if you think of SVML). This means that you still have to program for a specific instruction set, so portability is heavily limited.
To address the portability problem, you have to look at your project design, decide which kernels you want to specialize for different architectures, and provide some dispatching mechanism. This is cumbersome as it requires you to conditionally include fragments of your implementation. While programming with intrinsics is not that difficult, the structural requirements for big libraries/applications can be pretty heavy.
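One possible shape of such a dispatching mechanism is sketched below. It assumes GCC or Clang on x86 for the specialized path (using the `target` function attribute and `__builtin_cpu_supports`) and falls back to scalar code everywhere else. The names `AddKernel`, `add_scalar`, `add_avx`, `select_add_kernel` and the macro `SKETCH_X86_GNU` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define SKETCH_X86_GNU 1
#include <immintrin.h>
#endif

using AddKernel = void (*)(const float*, const float*, float*, std::size_t);

// Portable fallback, always available.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

#if defined(SKETCH_X86_GNU)
// AVX specialization. The target attribute lets this single function use
// AVX intrinsics without compiling the whole file with -mavx.
__attribute__((target("avx")))
void add_avx(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];  // leftover elements
}
#endif

// Pick an implementation once, at runtime, based on what the CPU supports.
AddKernel select_add_kernel() {
#if defined(SKETCH_X86_GNU)
    if (__builtin_cpu_supports("avx")) return add_avx;
#endif
    return add_scalar;
}
```

The conditional compilation needed even for this single kernel hints at how heavy the structure becomes once many kernels and many instruction sets are involved.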
Another problem is naming. Intrinsic functions were developed to give direct access to the instructions from the C/C++ level. This means that function declarations have to conform to the C standard, which has no function overloading. A very simple operation such as addition cannot simply be named _add(…), as this would have to be overloaded for all sorts of register types. Instead there are multiple addition functions, each with a different name and different input types. For instance, a search in the Intel intrinsics guide returns many differently named addition functions.
Encoding/decoding the names is not that difficult once you get familiar with the convention, but it is still slower than just writing ‘A + B’. You can get used to that, but if you want to use the whole horsepower of C++ you will have to develop wrappers just to make the code a little bit easier to handle.
One last thing, a problem in some situations and a feature in others, is that intrinsics don’t map exactly to individual instructions. The compiler has to perform some additional operations when managing registers (think of register spill/fill), so the code you write with intrinsics will not be exactly what you would get when writing inline assembly. Because the compiler is involved, it can also replace some operations with others, for instance by recognizing FMA or streaming store opportunities, actually improving performance.
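To make the naming convention concrete, here is a sketch using three real SSE/SSE2 addition intrinsics: `_mm_add_ps` (four floats), `_mm_add_pd` (two doubles) and `_mm_add_epi32` (four 32-bit integers). The wrapper function names are made up for this example, and a scalar branch is included only so the snippet still compiles on targets without SSE2.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE ones)
#endif

// Four packed single-precision floats: suffix _ps.
void add4_float(const float* a, const float* b, float* out) {
    assert(a && b && out);
#if defined(__SSE2__)
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#else
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];
#endif
}

// Two packed double-precision floats: suffix _pd.
void add2_double(const double* a, const double* b, double* out) {
    assert(a && b && out);
#if defined(__SSE2__)
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
#else
    for (int i = 0; i < 2; ++i) out[i] = a[i] + b[i];
#endif
}

// Four packed 32-bit integers: suffix _epi32, and a different register type.
void add4_int32(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
    assert(a && b && out);
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
#else
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i];
#endif
}
```

Three data types already require three function names, three load/store spellings and three register types, all for the same conceptual ‘A + B’.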
If you ever want to use this approach on x86 platforms, this is the link you have to know. You can find similar documentation for ARM.
Explicit Vectorization Libraries aka. Wrapper Classes
In order to improve maintenance and portability, several attempts have been made to provide a better, type-oriented interface for intrinsic functions. A few libraries you might be interested in are: VC, VCL, boost::simd and of course UME::SIMD. What is most important about this approach is that it tries to squeeze all instruction sets under an umbrella of uniform data types. It also makes use of C++ features such as operator overloading to make the notation shorter and easier to work with.
This approach is very nice in terms of performance portability, although it still has a few caveats. For instance, in-register permutations are heavily limited by instruction sets, so it might be possible to speed them up with a more direct approach, such as intrinsics.
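The wrapper idea can be sketched in a few lines. The type `Vec4f` and the function `multiply_add` below are made up for this example and do not reproduce the interface of any of the libraries named above. The type hides either an SSE register or a plain array behind the same operators.

```cpp
#include <cassert>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// One uniform data type; the instruction set is an implementation detail.
struct Vec4f {
#if defined(__SSE2__)
    __m128 reg;
    static Vec4f load(const float* p) { return Vec4f{_mm_loadu_ps(p)}; }
    void store(float* p) const { _mm_storeu_ps(p, reg); }
    friend Vec4f operator+(Vec4f a, Vec4f b) { return Vec4f{_mm_add_ps(a.reg, b.reg)}; }
    friend Vec4f operator*(Vec4f a, Vec4f b) { return Vec4f{_mm_mul_ps(a.reg, b.reg)}; }
#else
    // Scalar emulation for targets without SSE2.
    float lane[4];
    static Vec4f load(const float* p) { return Vec4f{{p[0], p[1], p[2], p[3]}}; }
    void store(float* p) const { for (int i = 0; i < 4; ++i) p[i] = lane[i]; }
    friend Vec4f operator+(Vec4f a, Vec4f b) {
        Vec4f r{};
        for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
        return r;
    }
    friend Vec4f operator*(Vec4f a, Vec4f b) {
        Vec4f r{};
        for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] * b.lane[i];
        return r;
    }
#endif
};

// User code is written once, with ordinary operators.
void multiply_add(const float* a, const float* b, const float* c, float* out) {
    assert(a && b && c && out);
    Vec4f result = Vec4f::load(a) * Vec4f::load(b) + Vec4f::load(c);
    result.store(out);
}
```

Only the body of the wrapper changes per instruction set; `multiply_add` stays the same.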
In terms of platform portability – this is perfect. You can write code once, and run it anywhere, albeit with no performance promises. From what I’ve seen this is not a big issue, as most architectures are already covered by existing implementations.
Code written using this approach still has to be more direct than with autovectorization, so it might require the user to understand the SIMD computation model. The user still has to manage peel and remainder loops as well as operate on vector registers. This is definitely a drawback, as it demands additional effort on the structural part of the user code.
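The loop structure the user ends up managing looks roughly like the sketch below. To keep it portable, the inner loop over `kWidth` lanes stands in for a single SIMD operation. The names `kWidth`, `kAlign` and `scale` are made up for this example, and a 4-lane, 16-byte register is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kWidth = 4;                       // lanes per (assumed) register
constexpr std::size_t kAlign = kWidth * sizeof(float);  // 16-byte alignment target

void scale(float* data, std::size_t n, float factor) {
    assert(data != nullptr || n == 0);
    std::size_t i = 0;

    // Peel loop: scalar iterations until data + i sits on an aligned address.
    while (i < n && reinterpret_cast<std::uintptr_t>(data + i) % kAlign != 0) {
        data[i] *= factor;
        ++i;
    }

    // Main loop: whole registers only. The inner loop is a stand-in for one
    // aligned vector load, multiply and store.
    for (; i + kWidth <= n; i += kWidth) {
        for (std::size_t lane = 0; lane < kWidth; ++lane) {
            data[i + lane] *= factor;
        }
    }

    // Remainder loop: the last n % kWidth elements that do not fill a register.
    for (; i < n; ++i) {
        data[i] *= factor;
    }
}
```

Every element is visited exactly once regardless of where the buffer starts, which is precisely the bookkeeping that autovectorization would otherwise do for you.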
Embedded Domain Specific Languages (EDSL)
This is barely a category, as I only want to present one specific library here: UME::VECTOR. In a sense this library is unique, although the concepts have been used in some way in both VC and boost::simd, and it has an interface predecessor in C++ in the form of std::valarray. The fundamental idea behind this approach is that while explicit SIMD libraries only address the problem of instruction selection, the DSL approach allows handling of any array-like containers.
The approach gives very efficient code generation, provided that a given implementation can expose it. Secondly, a more efficient, if not optimal, memory access pattern is handled by the library. Thirdly, the interface does not use the concept of SIMD registers, but rather that of 1-D vector arithmetic.
For instance, UME::VECTOR exposes a set of 1-D vector types and operations that use lazy code generation: the code is generated not on a per-operation basis, but rather at the end of the definition of a computational kernel. Once the whole kernel is known, the compiler uses it and a library-defined code generation scheme to generate the actual list of CPU instructions. The library uses UME::SIMD for generating target-specific code, so it does not have to re-implement everything from scratch.
The drawback is: expression templates and their debugging… Because the result of an EDSL operation is not a value, but rather a nested type representing the operations to be carried out, the final operation and the way it is handled might seem like gibberish even to advanced programmers. This is the main reason why std::valarray is not very popular. The ET technique is an idiom that can be well understood only if you know template metaprogramming and have some clue about code generation. If you know the concept, this should be the perfect approach for you. If you don’t, well, you either get some basic understanding or use one of the previous approaches.
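For readers who have not met expression templates before, here is a deliberately tiny sketch of the idiom. It is not how UME::VECTOR is implemented; the namespace `et_sketch` and the types `Vec` and `AddExpr` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace et_sketch {

// Node representing "lhs + rhs". It stores references only; no arithmetic
// happens when the node is created.
template <typename L, typename R>
struct AddExpr {
    const L& lhs;
    const R& rhs;
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec {
    std::vector<float> data;
    explicit Vec(std::size_t n, float v = 0.0f) : data(n, v) {}
    float operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assignment is where the whole expression tree is finally evaluated,
    // in a single loop and without temporary vectors.
    template <typename E>
    Vec& operator=(const E& expr) {
        assert(expr.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data[i] = expr[i];
        return *this;
    }
};

// operator+ builds a node instead of computing a result. Nodes refer to
// their operands, so they must not outlive the full expression.
template <typename L, typename R>
AddExpr<L, R> operator+(const L& l, const R& r) {
    return AddExpr<L, R>{l, r};
}

}  // namespace et_sketch
```

With vectors a, b and c, the right-hand side of `r = a + b + c` has the type `AddExpr<AddExpr<Vec, Vec>, Vec>`, and this nested type is what shows up in compiler errors and in the debugger.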
JIT assembly is rising in popularity, and I think it already has a pretty broad range of use cases. Two popular libraries that implement JIT-Assembly and are ready to be used are: asmjit and xbyak. They primarily solve the problem of code being generated statically, so this might be a solution for situations where all previous approaches have failed. The basic idea is to build computational kernels at runtime instead of at compile time. This requires additional overhead, but only at initialization time. Once kernels are built, they can be executed as normal functions from regular C/C++ code.
The first benefit is that you don’t need to provide any target-specific binaries to target different architectures. This removes some packaging problems and the need to handle many build-system configurations. The instruction set can be decided at runtime, and only a dedicated version of a kernel has to be generated.
The second benefit is that you can decide when kernels have to be generated. This means that you can wait until the very last moment, and even generate a kernel suitable for your input data! Can you imagine the new possibilities…?
Now the drawbacks: the first one is again portability. So far I haven’t seen a good standalone library that addresses code generation for multiple architectures. JIT assemblers are targeting specific instruction sets only. Compiler tool chains such as LLVM allow you to develop a JIT compiler but they don’t offer a deployable solution.
The second issue is programming complexity: with JIT assemblers you don’t have an abstraction available over an instruction set. If you want to use a JIT compiler, then the problem might be the need to understand some compiler design practices as well as to develop a form of an embedded language.
Just to summarize, I would rank the different approaches in the following categories:
- Performance in basic, most common scenarios (Like ‘A+B’)
- Performance in corner-case situations (Like permutations or data-dependent control flow),
- Performance portability (How likely is it to see performance degradation of the same code on another platform?)
- Performance stability (How resilient is the performance of a specific fragment of code to changes in its near neighborhood?)
- Target portability (Will the code be able to be re-compiled for another platform?)
- User code structure (Does the user need to write any additional code except for the SIMD sensitive parts?)
- Architecture knowledge (How much does the user need to know about specific instruction set?)
- SIMD knowledge (How much does the user need to know to efficiently use SIMD instructions?)
- Additional programming knowledge (Are there any framework conventions or specific idioms that need to be known?)
As for the scores that I am giving, those are completely subjective values in the range 1-6. The higher the value, the better an approach scores in a given category.
I might be biased by what I already know and I also considered the different ranking categories in the context of existing tools. Things might change as all approaches will evolve, so it is not a definitive comparison. If you disagree, let me know.
In general, all approaches should give you pretty high performance in the most common scenarios. To really compare the approaches you also have to look at how important portability is for your project, how much effort it will take to scale the use of a specific technology across the whole project, and how comfortable you feel with a specific way of approaching the task at hand. There is no silver bullet.