Assessing the Expressive Eloquence of a SIMD ISA

In passing, I mentioned five aspects of computation: data formats, data storage, data flow, computational operations, and control flow. I am not really happy with this classification, because those categories are not sharply distinguished from one another. But they are the best I could come up with, so let me define my terminology in more detail.

SIMD Data Formats:
The data types and layouts that a SIMD ISA supports natively inside its vector registers. This covers differences like packed vs. unpacked vector elements, general vs. special purpose vector registers limited to specific data types, and so on. The simultaneous coexistence of elements of different types in the same vector register falls under this category as well, and is not a purely hypothetical thing. Stay tuned. :-)

SIMD Data Storage:
The data types and layouts that a SIMD ISA supports in main memory. This differs from internal data formats because of issues like address alignment restrictions, scatter/gather operation, masked vector loads/stores, loads/stores with format conversion, and so on.

SIMD Data Flow:
The facilities that a SIMD ISA offers to shuffle vector elements around internally. Existing ISAs rely on widely different concepts: shuffle instructions with invariant data flow (specified at compile time), permute instructions with variable data flow (specified at runtime), invariant shuffles implied in specific computational instructions, variable permutes implied in most any computational instruction, and so on.

SIMD Computational Operations:
The set of computational primitives of a SIMD ISA. Naturally, most are straight forward generalizations of their scalar counterparts. But things get interesting when the SIMD environment allows or necessitates different solutions: e.g. a widening multiply can halve the number of vector elements rather than fill two destination registers with high and low part of the result. The various existing SIMD ISAs tend to differ significantly in this area.

SIMD Control Flow:
The facilities of a SIMD ISA to enable some form of conditional processing. Traditionally the major weak point of the SIMD concept. Usually there are vector wide compare instructions which produce a vector of boolean values, and ordinary branch instructions that can act on the condition codes of one whole vector. Element-wise conditionals can be emulated with a select operator based on the element-wise condition codes. Hypothetically, a special type of conditional permute instruction might enable efficient conditional processing in a SIMD environment. Stay tuned. :-)

I want to mention that these five aspects suggested themselves not by lots and lots of routines that I vectorized, but rather by lots and lots of routines that failed to vectorize. I find the difficult cases more interesting, and so I eventually ended up with a sizable collection of algorithms and routines that were not practically vectorizable. The five aspects represent essentially distinct weaknesses of existing vector processors, as far as I was able to identify and distinguish them from the non-vectorizable examples. That is why any specific vector instruction might not clearly fall under a single one of the five aspects.


An Example Comparison of a few (almost) Existing SIMD ISAs

Armed with this terminology, we are able to estimate the expressive power of a given SIMD ISA without looking at specific algorithms or code examples. In my opinion it is important not to make too specific assumptions about programming styles and idioms. I sure cannot foresee what future programmers will do with emerging programming paradigms such as SIMD, and I doubt that anybody else can. Therefor I focus my view on the available tools of a SIMD ISA rather than on specific case studies (i.e. benchmarks).


Classical Age: Cray

I start with the claim that Seymour Cray's original vector computers were ahead of their time. Unlike the more recent "short vector" SIMD extensions to popular microprocessor families, the original SIMD computers were designed to be general purpose machines. They ultimately didn't succeed in that sense, but if you look at the five aspects defined above, you can see that the old vector computers always used the most general way of doing things that was known back then.

They handled all integer and floating point data formats consistently: a vector was basically a collection of unpacked scalars. This means there is quite a bit of wasted register space for the narrower data types, but that does not differ from how a scalar RISC handles, say, an 8 bit quantity in a 64 bits wide register.

Data storage is handled in the most general way possible: with full scatter/gather support. Thanks to fast SRAM main memory, this lets specific data flow facilities drop below the radar. (Those old machines achieved hardware parallelism with "very deep" pipelining, so the scatter/gather operations were pretty much as fast as any other instruction.)

The set of computational operations was somewhat simplified, but not beyond the point that some simplistic RISC ISAs of that day considered optimal. Most scalar operations existed as straight forward generalizations, and most of the missing instructions could be built with short sequences of the existing simple operations.

For control flow, the old vector computers used the standard if-conversion, in full generality with boolean mask registers (i.e. a vector of boolean values).

As we can see, Cray's original vector computers suffer the restrictions of the SIMD paradigm, but they do not suffer artificial restrictions beyond that. It seems obvious to me that those old machines were designed to give programmers as much freedom as possible. Which makes sense, given that the SIMD paradigm was new and no one really knew what people would do with those machines.


The Renaissance: Multimedia Extensions

In contrast, modern "short vector" SIMD ISAs started out as a special purpose thing. They were conceived as small additions to existing scalar ISAs, and one of the prime considerations was low hardware complexity. One might wonder why such efforts were pursued at all, when they were not allowed to consume many resources, and not intended to be generally powerful.

In my opinion, it was three independent developments coming together that created a unique opportunity: 64 bit processing (in server and workstation markets, but also in embedded CPUs), the digital multimedia hype, and a set of old and arcane programming tricks known as "SIMD Within A Register" (SWAR).

The basic idea behind SWAR is to use one relatively wide register to work with several smaller items of data in parallel. The standard example is eight bytes packed into a single 64 bit wide general purpose register. The hardware is blissfully unaware of those individual bytes. Yet some computational instructions, like the boolean operations, can already process those eight bytes in a SIMD manner.

Things get more complicated for, say, addition. But the overhead incurred by stopping carries from spilling across byte boundaries is still quite manageable, so the eight way parallelism is usually a win. Of course, in hardware it would be a lot easier to control those carry overs, requiring almost no additional logic.

In the beginning of the digital media age, the world had not yet settled on standards. Both the applications (what to do with audio and video) and the formats (how to store and transmit audio and video) were still somewhat in flux. The market could move quickly and suddenly, so it was desirable to implement media processing in software rather than hardware.

And that was it: digital media as the killer app, and a little bit of hardware for efficient SWAR processing started the renaissance of SIMD processors.

The first generation of modern "short vector" SIMD ISA extensions were fairly similar in their capabilities and restrictions. The focus on very uniform data of low dynamic range (i.e. pixels and sound samples) means that SIMD data types were limited to 8 and 16 bit integers. There were no special vector registers. Instead, existing GPRs or FPRs were re-used for SWAR processing.

Consequently, data storage was handled with scalar loads and stores, and usually implied strict address alignment (or specialized instructions for unaligned access). Some aspects of data flow could be emulated with the existing bitwise shifts and rotates (only DEC's Alpha went a bit further, trying to compensate the lack of single byte loads and stores with more SWAR instructions).

The available SIMD computational operations were generally rather limited. Wherever possible, existing scalar instructions were re-used, and the number of specialized SIMD instructions was kept low. The basic idea back then was to be significantly faster than the best scalar implementation for the codecs of the day, but at the lowest possible cost in additional hardware complexity.

Typically, no special precautions were taken to enable SIMD control flow. The expected bulk processing of very uniform data did not need it. Some of the instruction manuals showed how to build a vector select operation from simpler booleans, but generally the need for conditional processing was not acknowledged in those early days.


Modern Times: The Next Generation

At the time of this writing, few of the older processor families are still alive and kicking. But several of the survivors have evolved more powerful SIMD ISAs, as estimated by the five aspects, and as proven by vectorized code out in the wild.

Those newer processors can afford the luxury of spending a sizable chunk of silicon real estate on SIMD functionality. The available resources are always used to extend the scope and applicability of SIMD. Typically those processors have dedicated vector registers that are wider than 64 bits. This enables a reasonable amount of parallelism even for wider data types like floating point and 32 bit ints. But the narrower packed integer data types are still supported as well.

The dedicated vector registers use dedicated loads & stores, but address alignment restrictions usually still apply. However, most of those SIMD ISAs offer some help for reasonably efficient unaligned memory access, even though the idioms might differ substantially between architectures. Basic support for loading and storing individual vector elements is provided in several architectures of this generation, too.

By now, SIMD data flow finally becomes a concern of the architects, but they chose to take very different paths. Some of the more specialized vector processors offer some shuffling "for free" with each operation (typically restricted to a single operand, and only within one single vector's width). The SIMD extensions of established scalar ISAs tend to lean more on specialized shuffle or permute instructions (probably because their opcode space is much more crowded). I am very fond of AltiVec's VPERM instruction, which I consider the most versatile and efficient way of providing this functionality, so take my comparisons with a grain of salt. :-)

The computational operations follow the general pattern of extending scope and applicability of SIMD processing. At the same time, digital media is still a privileged application. But this time around, this is reflected in specialized hybrid operations which fold short common sequences of operations into single unusually complex instructions. I believe this is a result of a matured digital media landscape, where established standards imply that certain algorithms and transforms are here to stay, and can safely be incorporated into an ISA.

The Achilles' Heel of SIMD control flow remains a weak point. But by now, the architects acknowledge the need for supporting it, and so the second generation generally offers concise but complete sets of vector compare instructions and a generic vector select operator. Other implicitly conditional operations such as Min & Max operations are often included, too. But while those are very helpful in the cases where they are applicable, they do not generally address the problem of conditional processing.


Back to the Future: Larrabee

At the time of this writing, Intel's upcoming massively data parallel single chip multiprocessor, code named Larrabee, has already been documented publicly to some degree. What is known about its SIMD instruction set is intriguing in that it seems to be a tribute to Cray's original machines. The following paragraphs are based on incomplete information, and should not be taken as an authoritative reference.

Larrabee will have significantly wider vector registers than today's "next generation", but at the same time drops support for packed narrow integer data types. Only 32 bit floats and ints, as well as 64 bit floats will be supported. I guess the old SWAR programming tricks will become popular again on Larrabee when 8 or 16 bit integers have to be processed. In addition to the vector registers, there is a separate set of mask registers (i.e. vectors of boolean condition codes), and a separate instruction subset to work on those.

Unusually powerful SIMD data storage facilities will compensate for the reduced number of data types. Specialized loads and stores can convert data on the fly, including the slightly exotic half precision floating point format used by dedicated graphics processors. Larrabee will have generic scatter/gather instructions, and the interesting compressing/expanding loads & stores that will indirectly help with some cases of conditional processing.

There are fairly rich, but oddly non-generic facilities for SIMD data flow in Larrabee. The limitations posed by the instruction set hint at implementations which process a vector in four distinct chunks of 128 bits each. Many instructions can shuffle one operand for free, but due to limited opcode space, only a tiny subset of permutation patterns are available. There is a semi-generic shuffle instruction, but it is limited by the aforementioned 128 bit chunk borders.

The available computational operations are fairly standard and complete. There will be a few special purpose instructions, this time not targeted at digital media processing, but at three dimensional graphics rendering. The limited number of data types allows for a generally more regular and simpler set of operations than other modern SIMD ISAs.

Intel took control flow seriously, allowing most computational instructions to be predicated with a variable mask. As usual, SIMD compare instructions generate those masks. This model is slightly more efficient than a separate select operator, but still suffers from the typical SIMD weakness that some elements of a vector become dead weight and reduce efficiency.


Utopias: one Sunken, one Hidden, one Unborn

I hope that the above examples, as superficial as they are, can still provide a convincing justification for my terminology of the five aspects of SIMD. I will now turn around and use this same line of reasoning to look at a few select individual instructions in more detail. Finally I can explain in a more rigorous manner why I like AltiVec's VPERM as much as I do. :-) But that won't be the end of it; there is a minor detail in Intel's SSE that could be the starting point for significant progress; and finally I will offer a radically new and powerful concept for doing conditional processing in a SIMD context.


AltiVec's Vector Permute

This instruction takes three 16 byte wide vectors as inputs: two data vectors a and b, which are conceptually concatenated into a 32 byte long array A[], and a third control vector c. Each byte element of c is interpreted as a five bit wide index into A[] (the upper 3 bits are ignored), and the output vector d is filled with the result of the following table lookup:

d[i] := A[c[i]]

There are three fundamentally different ways to use this instruction:
1. For shuffling, c is invariant, and an input stream is moving through a and b.
2. For table lookup, a and b are invariant, and an input stream is moving through c.
3. Permutations themselves can be operated upon when a and c are variables. In that case, b is ignored.

There are three fundamentally different ways to apply this instruction:
1. For data flow, obviously.
2. For data storage: AltiVec re-uses VPERM for unaligned memory access and for packing and unpacking integer data.
3. For computation with table lookups.

I cannot imagine a single SIMD instruction that does that much in such a straight forward and general way. I remember back when I browsed the AltiVec instruction manual and encountered VPERM for the first time, I was struck by a sense of awe. I knew I was witnessing a significant improvement of the SIMD idea. Unfortunately, this knowledge was a purely intuitive thing. It took me pretty much ten years of reasoning to arrive here at this text you are reading, ten years before I was able to explain in simple terms why VPERM is such a significant idea.

This is my sunken Utopia. AltiVec lives on in the current crop of gaming consoles. But with Apple having abandoned PowerPC, I doubt we will see a continued evolution of the AltiVec instruction set. AMD's SSE5 is a ray of hope, with a slightly extended version of VPERM - but with my luck, Intel will finally crush them rather than learn from them.


SSE's Complex Numbers

I believe it was the transition from Northwood to Prescott (two stages of evolution of the processor marketed as Pentium 4) when Intel added a few vector instructions that would treat even and odd elements of a vector slightly differently. Back then I believed that this was a useless pollution of opcode space, designed to speed up the inner loops of one or two SPEC benchmark codes. And perhaps that was all it was.

But when viewing this in light of the five aspects, I can see how Intel laid a subtle and modest foundation for the support of a new data type: complex numbers (these are pairs of numbers with a specific mathematical meaning). I think this is a promising path to follow further. IBM has done similar things with their SIMOMD idea; but when you work on vectors of just two elements, the concept of even and odd vector elements is too simple to be interesting.

This is my hidden Utopia: a promising idea that even its creators are probably still unaware of.


Conditional Vector Pair Split

This is my unborn Utopia: the radically new and powerful way to do conditional processing in a SIMD context. The example instruction presented here does have its warts: as part of a hypothetical instruction set it would require a bit of infrastructure; a few more helper instructions to make it widely applicable. Hardware designers will hate the example instruction presented here, because in order to do as much work as it does, it needs to read and write a lot of operands. Fasten your seat belts, here we go:

VPSPLIT takes a total of four inputs: two data vectors (i1, i2) and two boolean mask vectors (m1, m2); it produces a total of three outputs: two data vectors (o1, o2) and one scalar condition code bit c.

The inputs i1 and i2 are conceptually concatenated into one double wide vector i. Likewise, a mask m is formed from m1 and m2.

A well defined unique permutation P is constructed based on the contents of m such that all zeros in m are sorted towards the left end of m and all ones are sorted towards the right end. The sort is stable, i.e. the relative order of zeros remains unchanged, as well as the relative order of ones.

This permutation P is then applied to the input data i, producing a double wide output vector o. This output vector is returned in vectors o1 and o2, along with condition code c, which is zero when m contains more zeros than ones, and one otherwise (the choice is arbitrary when there are equally many zeros and ones, but should be consistent). In essence, c specifies which of the two result vectors o1 and o2 contains uniform data (i.e. elements that all share the same condition with respect to m).

The intended usage of this instruction is as follows: i1 and m1 are carried over from the previously processed vector. New data is placed into i2 and m2. Then the split is performed to create one vector full of uniform data items. A scalar branch is needed to determine which of the two halves is ready for processing in this iteration. The respective program path does its work, and then places the unprocessed non-uniform half into i1 and sets m1 appropriately. Rinse, repeat.

That's it. We have successfully replaced control flow with data flow. We still need a scalar branch instruction based on c, but that is only one branch per vector rather than one branch per element. And we no longer carry dead weight in vector registers in the form of elements that are predicated "off".

Some of the hardware complexity can be handled by basing VPSPLIT on a standalone VPERM. In that case, VPSPLIT would produce just the permute control vectors and a scalar condition code. Theoretically it would suffice to generate only the permute control for one half of the result. But then the overhead grows to the point where VPSPLIT cannot provide a benefit in the frequent case where one conditional path does nothing at all, and the other path does relatively little work.

Pitfalls: one might think that inputting a condition code instead of two mask vectors is a good idea, because this reduces the number of inputs and the mask can still be generated internally. But doing so actually makes the instruction less powerful, because while you can still split the data, you can no longer split the vector of addresses that remembers where the data came from.

Disclaimer: as far as I am aware, I was the first person to describe this "conditional vector pair split" instruction. However, I am standing on the shoulders of giants: there is an ancient "bit gravity" instruction from mainframe days, as well as a single vector version of split from Cray's times. As far as I am concerned, both these earlier forms could be considered prior art. Nevertheless, you have my permission to take my contribution and run with it. I hereby put it in the public domain.