Data Parallelism, Analytical Perspective

Not many programs match the extremely restricted pattern of the previous section. Fortunately, only two of those restrictions are truly fundamental, while the others can be relaxed by creative hardware architects and software engineers. One fundamental restriction is that we need a fairly large amount of data, otherwise we simply do not have enough work to do to make parallelization effective. The other fundamental restriction is that the computations need to be mostly independent of each other, otherwise the implied serialization limits the amount of work that can be done simultaneously.

The first restriction is not usually a concern. A small problem size almost always means that computation is finished quickly, and no one bothers to optimize that piece of code. The second restriction, though, is the single biggest obstacle to any kind of parallelization. There is an ongoing philosophical debate among computer scientists about how severe this restriction really is. Some think that a large fraction of all useful algorithms are inherently serial. Others believe that over time, we will discover more parallel algorithms. I readily admit that I see myself on the more optimistic side, but I am unable to offer evidence for settling the question. Instead, I will focus on the restrictions that can indeed be relaxed.

I will broadly classify these restrictions into five categories, which I call the five aspects of SIMD programming. My terminology is the result of futile attempts at vectorization, and a close look at the reasons for each failure. Therefor, each aspect is an essential capability of a specific SIMD computer, or an essential requirement of a specific algorithm (in other words, I look at a machine from a certain angle, or at an algorithm from the same angle, and I identified five such angles that yield useful insights). No aspect should be reduced to individual machine instructions, or to specific features of an instruction set architecture; it is often the comparison between different SIMD instruction sets that makes the aspects clearly stand out. The five aspects of SIMD are:

SIMD Data Formats
The hardware aspect regards the data types and layouts natively supported in vector registers. This covers details like packed vs. unpacked vector elements, general vs. special purpose vector registers limited to specific data, and so on. The software aspect here is simply the data types required by a specific algorithm.

SIMD Data Storage
The hardware aspect looks at the data types and layouts supported in main memory. It is distinguished from internal data formats by additional restrictions such as address alignment, unit stride requirement, or by additional freedoms such as type conversion implied in memory access, masked element-wise memory access, or full scatter/gather operation. The software aspect is concerned with conflicting memory layout requirements of an algorithm, like for instance row vs. column vectors of a matrix, which cannot both be unit stride at the same time.

SIMD Data Flow
The hardware aspect covers means and facilities to shuffle individual elements of one or more vector registers independent of memory access. The software aspect focuses on the data flow patterns required by an algorithm. The specific methods used by different vector processors vary as widely as the requirements of different algorithms. Therefor the software side may have some leeway trading off SIMD Data Flow against SIMD Data Storage and vice versa. The two aspects are nevertheless cleanly distinguished by SIMD Data Flow not having any overhead from memory management or memory/cache access.

SIMD Computational Operations
The hardware aspect views the set of available computational primitives. Interesting cases arise when there is more than one possible way of generalizing a scalar operation to a vector version. The software aspect is simply the set of computational operations required by an algorithm.

SIMD Control Flow
The hardware aspect is concerned with facilities that enable conditional processing of individual vector elements. This is a traditional weakness of the SIMD paradigm, because it contradicts the idea that each and every element of a vector is treated the same way. Therefor the software aspect usually focuses on reducing the required amount of conditional branching as much as possible.

In the following sections, I will present ways to deal with the restrictions distinguished by the five aspects as well as the fundamental restrictions of the SIMD paradigm.