Getting a Feel for the Hardware

Let us put the five aspects to the test, and see how they can help us to better understand the capabilities of a particular machine. I am writing this chapter with a SIMD extension of a scalar processor in mind, but I strongly believe that other architectures, such as the increasingly programmable graphics processing units, can be analyzed in the same way and with the same benefit. The more exotic machines are usually hidden beneath more layers of abstraction (if they are opened up to third-party programmers at all), but I am confident that the five aspects will only make it easier to pierce this veil.

As a first step, you would look at the programming model of the platform, i.e. at the semantics of the instruction set architecture. The plan is to go through all the features and primitives of the programming language and classify each one with respect to the five aspects. The goal is to identify potential strengths and/or weaknesses of the machine, as well as any interesting weirdness that does not immediately make sense. These weirder details tend to be of situational use later, and generally reveal some of the philosophy that the hardware designers were following. Let me illustrate this process for each of the aspects in turn.

SIMD Data Formats

Finding out which data types the machine handles natively is straightforward. Take note of how complete the support is for each type, and how interchangeable the types are with respect to the provided computational operations. The more orthogonal the type support, the fewer restrictions you will have to work around in software.

If there seem to be significant gaps, keep an eye out for helpful functionality such as explicit type conversion (within registers; otherwise the functionality falls under data storage), or computational instructions that implicitly convert types (such as widening sum or widening multiply). Pack and unpack instructions play a role here, too, even though they are technically data flow functionality. Data storage instructions with type conversion do not usually fill data type gaps, because of the memory overhead. But they can occasionally be viable solutions, so they warrant some attention under this aspect.
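
For illustration, here is a minimal sketch of this pattern using SSE2 intrinsics: 8-bit data is widened via unpack, processed at 16-bit precision, and narrowed back with a saturating pack. The scenario (scaling bytes by a fractional gain) is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */

/* Widen unsigned bytes to 16-bit words by interleaving with zero,
   multiply by a gain in 8.8 fixed point (256 == 1.0; products of
   gains up to 1.0 stay exact in 16 bits), then narrow back down
   with a saturating pack. */
__m128i scale_bytes(__m128i pixels, __m128i gain)
{
    __m128i zero = _mm_setzero_si128();
    __m128i lo = _mm_unpacklo_epi8(pixels, zero);  /* bytes 0..7  -> words */
    __m128i hi = _mm_unpackhi_epi8(pixels, zero);  /* bytes 8..15 -> words */
    lo = _mm_srli_epi16(_mm_mullo_epi16(lo, gain), 8);
    hi = _mm_srli_epi16(_mm_mullo_epi16(hi, gain), 8);
    return _mm_packus_epi16(lo, hi);               /* saturate back to bytes */
}
```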

A more advanced scenario involves secondary data types such as multiprecision integers (also known as bignums), or extended precision floating point (where pairs of native floating point numbers represent numerical values of higher precision). The machine in question might or might not offer functionality to handle such types.
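
As a taste of what software has to do when the hardware offers no such support, here is the scalar building block of double-double arithmetic, Knuth's TwoSum. It is only a sketch: it relies on strict IEEE evaluation order, so aggressive floating point optimizations (such as -ffast-math) must be off.

```c
/* Error-free transformation: on return, *s + *err == a + b exactly.
   The rounding error of the addition is recovered in *err. */
void two_sum(double a, double b, double *s, double *err)
{
    double sum = a + b;
    double bv  = sum - a;               /* the part of b that made it into sum */
    *err = (a - (sum - bv)) + (b - bv); /* what got rounded away */
    *s   = sum;
}
```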

Fixed point numbers are another important derived data type. These are typically handled with the usual integer operations, so you rarely see operations specific to them. The availability and power of bit shifts are important, though, as are data flow facilities that enable efficient splitting and merging of the fractional and integral parts of a fixed point number.
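
A minimal scalar sketch of a Q16.16 fixed point type shows why shifts and masks are so central here (the format choice is just an example):

```c
#include <stdint.h>

typedef int32_t q16_16;               /* 16 integer bits, 16 fraction bits */

#define Q_ONE (1 << 16)               /* 1.0 in Q16.16 */

/* Multiplication needs the full 64-bit product before the
   renormalizing shift; this is where shift availability matters. */
static q16_16 q_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) >> 16);
}

/* Splitting into integral and fractional parts is pure bit surgery,
   which is why shifts and masks (or equivalent data flow tricks)
   are so important for this data type. */
static int32_t q_int(q16_16 x)  { return x >> 16; }
static int32_t q_frac(q16_16 x) { return x & (Q_ONE - 1); }
```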

SIMD Data Storage

In real machines, this aspect tends to be implemented with some redundancy: often there is specific functionality that could, in theory, be subsumed by more general functionality that is also available. The reasons will become clear in step two below, when you consider runtime efficiency as well.

The two most important qualities of this aspect are the units of addressability and the address restrictions. Can the machine address vector elements in memory individually, or only whole vectors? And can the addresses be chosen arbitrarily, or must they meet a set of restrictions? Practically without exception, the most efficient path to memory involves whole vectors located at addresses that are integer multiples of the vector size; these are called "aligned vector loads/stores". A few machines offer the luxury of individually addressed vector elements, but need a vector full of addresses to make it possible; this is known as "scatter/gather" functionality.
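
To get a feeling for the cost gap, here is a hypothetical sketch of what a gather looks like when it has to be synthesized from scalar accesses (SSE2 intrinsics assumed):

```c
#include <stdint.h>
#include <emmintrin.h>

/* Emulating a gather on a machine that only has vector loads from a
   single address: four scalar loads, then assembly into a vector.
   The cost difference against one aligned vector load is exactly
   what this aspect is about. */
__m128i gather4(const int32_t *base, const int32_t idx[4])
{
    return _mm_set_epi32(base[idx[3]], base[idx[2]],
                         base[idx[1]], base[idx[0]]);
}
```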

Platforms which limit themselves to aligned vector loads and stores tend to define efficient idioms for unaligned access. These usually involve data flow facilities and possibly address calculation functions. Such helper instructions can be very specific, though they need not be. Be aware that there is more to this aspect than the actual memory access instructions.
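
A concrete example of such an idiom is the classic AltiVec sequence for an unaligned vector load, where a permute pattern derived from the address merges two aligned loads (a sketch, assuming at least one byte past the end of the desired data is readable):

```c
#include <altivec.h>

/* Two aligned loads that straddle the desired bytes, plus a permute
   pattern that vec_lvsl derives from the low bits of the address. */
vector unsigned char load_unaligned(const unsigned char *p)
{
    vector unsigned char lo  = vec_ld(0, p);    /* aligned load at or below p */
    vector unsigned char hi  = vec_ld(15, p);   /* aligned load covering p+15 */
    vector unsigned char sel = vec_lvsl(0, p);  /* address-derived pattern    */
    return vec_perm(lo, hi, sel);
}
```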

The architects of some machines have chosen to provide some form of conditional loads and stores. The more traditional version skips all vector elements that have an "off" condition, and skips the corresponding memory addresses as well; this is called a "masked load/store". An interesting new twist on this will be implemented in Intel's upcoming Larrabee architecture: addresses won't be skipped, but vector elements will be (Intel calls these operations "compress"/"expand"). In effect, such memory accesses implement a conditional data flow facility.
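 
The distinction is easiest to see in scalar reference code; the following sketch (the names are placeholders, n is the number of vector elements) spells out both flavors:

```c
/* Masked store: both the element and its address are skipped. */
void masked_store(int *mem, const int *vec, const int *mask, int n)
{
    for (int i = 0; i < n; i++)
        if (mask[i]) mem[i] = vec[i];
}

/* Compress store: addresses are NOT skipped, only elements are,
   so the "on" elements end up densely packed in memory. */
void compress_store(int *mem, const int *vec, const int *mask, int n)
{
    int j = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) mem[j++] = vec[i];
}
```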

SIMD Data Flow

This aspect is about addressing individual data elements in vector registers, just like data storage is about addressing data in memory. In an ideal machine, these two aspects would balance and complement each other. Your task is to find out the exact ways in which a real machine falls short of this ideal, i.e. to identify cases where neither the data storage nor the data flow facilities of a given machine cover the data flow patterns required by some algorithm.

SIMD data flow is necessarily limited to short distances (across one or two vector registers), but even then the sheer number of possible data flow patterns is huge. Nevertheless, more and more machines are equipped with a full crossbar switch and can handle arbitrary permutes of data elements within one or two vector registers (for a description of one generic vector permute instruction, VPERM, see here). The ability to mix/shuffle vector elements from two source vector registers into one destination is significantly more powerful than permutes of only a single vector, because the contents of two vector registers can come from two distinct memory addresses even on machines with extremely limited data storage facilities. So take note if a particular machine cannot freely mix elements of different vectors.
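
For reference, the semantics of such a generic two-source permute can be sketched in a few lines of scalar code (a sketch in the spirit of VPERM, not the definition of any particular instruction):

```c
#include <stdint.h>

/* Each selector byte picks one of 32 source bytes; a full crossbar
   in hardware does this in a single instruction. */
void vperm_ref(uint8_t dst[16], const uint8_t a[16],
               const uint8_t b[16], const uint8_t sel[16])
{
    for (int i = 0; i < 16; i++) {
        unsigned s = sel[i] & 0x1F;            /* 5 selector bits */
        dst[i] = (s < 16) ? a[s] : b[s - 16];
    }
}
```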

Some machines integrate a bit of vector element shuffling "for free" with the operand access of almost any computational operation. Due to combinatorial explosion, only a tiny subset of the possible permutes can be offered this way. Furthermore, data from different operands cannot usually be mixed by this kind of facility. In my opinion, this kind of data flow capability is too limited and too hard to use to be relied on in the general case. Nevertheless, it can make for elegant and efficient code in the few benchmarks where it is applicable. (I admit that a case could probably be made for broadcasting/splatting arbitrary individual elements across a whole vector; this specific pattern is perhaps frequent and generic enough to be integrated with operand access.)

On machines with weaker data flow features, it can be helpful to abuse instructions like bitwise shifts/rotates (to shuffle narrower data items within a wider type), or element-wise conditional select (to mix data from two source registers). Integer multiply can broadcast/splat a narrow scalar value across a wider type (because a multiply is a series of shifted adds). Other instructions which widen or narrow their operands also have some potential to be repurposed for data flow tasks. Specialized table lookup instructions, depending on their exact semantics, might be able to do double duty as awkward but fairly generic permute operations.
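
Two small scalar sketches of this kind of abuse (plain C, no SIMD extension assumed):

```c
#include <stdint.h>

/* Broadcasting a byte across a 32-bit word with a plain integer
   multiply: the multiplier 0x01010101 is a series of shifted adds,
   i.e. a tiny data flow operation in disguise. */
uint32_t splat_byte(uint8_t x)
{
    return (uint32_t)x * 0x01010101u;
}

/* Swapping the two 16-bit halves of a word with shifts, for machines
   whose data flow facilities cannot do it directly. */
uint32_t swap_halves(uint32_t x)
{
    return (x << 16) | (x >> 16);
}
```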

At least one popular GPU programming model pretends that the hardware executes independent concurrent threads. It therefore cannot offer clean SIMD data flow functionality, because that would not fit the threaded paradigm. I find that unfortunate. In reality, there are lanes in a wide vector that are processed in lockstep; SIMD data flow primitives could therefore be two or three orders of magnitude more efficient than classical thread synchronization and communication primitives. Not only does the threaded programming model make promises that the hardware cannot keep, it also precludes the possibility of utilizing some of the unique strengths of the machine. Unfortunate. If your target machine follows this model, check if there is still a way to access data flow functionality. No matter how awkward, non-standard, or limited it may be, it will still prove very useful.

SIMD Computational Operations

This was a tough section to write. On the one hand, it is a pretty boring topic: just check whether the standard scalar operations are provided with properly vectorized operands. On the other hand, there are still open questions about which scalar operations should belong to that standard set. For example, a majority of SIMD instruction sets offer min and max operations, but only a minority of scalar instruction sets have them. The issue is further complicated by the fact that you will probably program a SIMD machine at a somewhat lower level than scalar machines. Hence your mental set of computational operations might have been distorted by compilers, and might not be a good fit for the SIMD machine.

Yet another obstacle to getting a clear view of the expressiveness of a set of computational operations is the fact that most SIMD architectures include quite a few special purpose instructions. These usually merge a small number of processing stages of some larger algorithm, and it is not always obvious what their intended purpose is. Such instructions are rather specific in use, but quite powerful when they can be cleverly employed. So try to make sense of them when you encounter them.

Finally, there are SIMD operations that are not straightforward derivations of scalar instructions, for example bitwise shifts of a whole vector register, regardless of element boundaries. Such instructions don't arise naturally from the vectorization of scalar loops, but lend themselves to SIMD-specific programming idioms. Some of these idioms are rather obscure. With any luck, the platform vendor has provided programming examples which present some of them (I will mention a few in the subsequent chapters on the software side of the five aspects).
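
As a hypothetical taste of such an idiom: SSE2 shifts whole registers only in byte granularity, but a bit-granular 128-bit shift can be composed from element-wise shifts. This sketch is valid for shift counts between 1 and 63:

```c
#include <emmintrin.h>

/* Shift the full 128-bit register left by N bits (0 < N < 64):
   shift both 64-bit halves, then OR in the bits that crossed the
   boundary, recovered via a whole-register byte shift. */
#define SHIFT_LEFT_128(v, N)                                    \
    _mm_or_si128(_mm_slli_epi64((v), (N)),                      \
                 _mm_srli_epi64(_mm_slli_si128((v), 8), 64 - (N)))
```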

In closing, I have to admit that I won't be able to teach you this hardware aspect in just a few paragraphs of text. You will have to learn it the hard way, using other people's code as a reference. If you are starting out, use your common sense to judge the completeness of the basic operations, and save the tricky bits for later.

SIMD Control Flow

All SIMD machines rely on methods that convert control flow into data flow. Interestingly, though, no one uses actual data flow functionality to do that (for a hypothetical example instruction VPSPLIT, see my other text here). Instead, they use the well known technique of predication/if-conversion. Some machines might have predicates as first-class data types. Some might use primitives like conditional move or select. But one weakness is common to all these programming models: they eliminate the conditional branches, but they do not eliminate the overhead of those branches (the EPIC programming model has exactly the same weakness). Some GPU programming models pretend to execute conditional branches, but internally map them onto traditional SIMD design patterns, and suffer from the same problems.

Ultimately, all of today's SIMD machines end up executing both paths of a conditional branch. That's roughly equivalent to a guaranteed branch misprediction every single time (where you execute the wrong path first, and the correct path after that). This is the sore spot of SIMD, so watch out for every little bit of help: min/max instructions, instructions for clamping values, table lookup instructions, anything that helps you express decisions in a program without resorting to branches.
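
A small SSE2 sketch makes the cost model concrete: if-conversion computes both sides and merges them with a mask, while min/max shortcut the common clamping case entirely.

```c
#include <emmintrin.h>

/* If-conversion: element-wise "x > y ? a : b".  Both a and b must
   already be computed, which is the "execute both paths" cost. */
__m128i select_gt(__m128i x, __m128i y, __m128i a, __m128i b)
{
    __m128i mask = _mm_cmpgt_epi32(x, y);           /* all-ones where x > y */
    return _mm_or_si128(_mm_and_si128(mask, a),     /* take a where true    */
                        _mm_andnot_si128(mask, b)); /* take b elsewhere     */
}

/* Clamping via min/max: two decisions per element, zero branches. */
__m128i clamp16(__m128i v, __m128i lo, __m128i hi)
{
    return _mm_min_epi16(_mm_max_epi16(v, lo), hi);
}
```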

If you happen to target a platform with unusually powerful data flow or data storage features, see if you can emulate something like the VPSPLIT operation mentioned at the beginning of this section. It is the only hope you have of executing only one side of a branch in a SIMD way (you do not eliminate the branch, but you reduce its overhead from one guaranteed misprediction per vector element down to one non-guaranteed misprediction per whole vector).
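
Short of a full VPSPLIT emulation, a related and much cheaper trick is often available: test whether all lanes agree on the condition, and branch over an entire path for the whole vector. This is not VPSPLIT itself, just a sketch of the same branch-per-vector idea using SSE2:

```c
#include <emmintrin.h>

/* If the mask is uniform, one ordinary (non-guaranteed) branch can
   skip an entire path for all vector elements at once. */
int all_false(__m128i mask) { return _mm_movemask_epi8(mask) == 0; }
int all_true(__m128i mask)  { return _mm_movemask_epi8(mask) == 0xFFFF; }
```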

Semantics vs. Timing

As a second step, you would look at the timing details of the specific hardware you are targeting. This information may or may not be documented by the hardware vendor, and even if it is documented, it may be inaccurate. Take the time to dig a little deeper. If you really can't find anything, don't shy away from running experiments to infer the relative latencies and throughputs of important functionality. It will pay off later, for a simple reason: low performance can make some functionality effectively unavailable for code tuning.

I feel I must repeat this for emphasis: low performance can make some functionality effectively unavailable for code tuning. This may seem exaggerated or trivial, but do not forget that we vectorize code for the overall goal of optimal efficiency. If there is some semantically neat feature that does not bring us closer to that goal, we are better off pretending it does not exist. A fellow code tuner calls such features "performance trojan horses". This is the mindset you need here: do not accept something merely because it is offered as a gift.

In the context of SIMD programming, high throughput usually trumps low latency. That is because the data sets are large (which we established as one of the fundamental requirements for effective, massively data-parallel programming). Latency still matters, but chances are that you will have an easy time covering it with other useful work. Consequently, throughput tends to be the more suitable of the two basic performance metrics. Start from there, until you have other experience to go by.
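
If you do end up experimenting, the core idea is simple: a dependent chain of operations exposes latency, while several independent chains expose throughput. A naive sketch (SSE2; real measurements additionally need careful timing, warm-up, and guards against the compiler optimizing the loops away):

```c
#include <emmintrin.h>

/* Latency: each add has to wait for the previous one. */
__m128i chain_latency(__m128i v, long n)
{
    for (long i = 0; i < n; i++)
        v = _mm_add_epi32(v, v);
    return v;
}

/* Throughput: four independent chains keep the pipeline busy. */
__m128i chain_throughput(__m128i v, long n)
{
    __m128i a = v, b = v, c = v, d = v;
    for (long i = 0; i < n; i++) {
        a = _mm_add_epi32(a, a);
        b = _mm_add_epi32(b, b);
        c = _mm_add_epi32(c, c);
        d = _mm_add_epi32(d, d);
    }
    return _mm_add_epi32(_mm_add_epi32(a, b), _mm_add_epi32(c, d));
}
```

Comparing the per-iteration time of the two loops gives a rough estimate of the latency-to-throughput ratio of the operation under test.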

So in this second step, you should make another pass over the list of goodies, baddies, and weirdness, and adjust it accordingly. Some goodies might have to be qualified with caveats, some might even have to be pruned. Some baddies might not be so bad after all, and some of the weirdness might clear itself up. In the ideal case, you should now have a coarse mental model of the machine that allows you to make forecasts with some accuracy: how easy or hard it might be to implement certain functionality, and where performance bottlenecks are likely to show up under a given set of conditions.