Finding Data Parallelism

Before you, dear reader, can apply the theory of the five aspects to the specific problem you wish to solve, you have to overcome a more substantial obstacle: data parallelism is inherently limited. The very first thing you need to do in any project involving SIMD programming is to find out if you can extract enough parallelism from the algorithm to make vectorization worthwhile.

You can start thinking about this even before I show you how to apply the five aspects. The aspects would help, but the decisions you make this early on are more abstract and tend to be revisited several times before you settle on one particular "axis" along which to vectorize the problem. It is more important to become aware of the different alternatives than to settle on a "perfect" design early on. Ultimately, time is the teacher that will leave you with the necessary experience.

When the data parallelism is obvious, there are usually still choices to be made. The typical example is an algorithm with a few nested loops at its core. Inner and outer loops can be vectorized; loops can be transformed, transposed, merged and split. Try to understand the resulting graph of data dependencies, so that you can determine how "orthogonal" various potential vectorization axes are to the data dependencies.
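To make this concrete, here is a small, hypothetical C++ sketch (the function and its parameters are made up for illustration): a dense matrix-vector product with two nested loops, each of which is a candidate vectorization axis with rather different dependency properties.

// Hypothetical example: y = A * x, with A stored row-major.
// Both loops are potential vectorization axes.
void matvec(const float* A, const float* x, float* y, int rows, int cols)
{
    for (int r = 0; r < rows; ++r) {          // outer loop: rows are independent,
        float sum = 0.0f;                     // so vectorizing here computes several
        for (int c = 0; c < cols; ++c) {      // row sums at once
            sum += A[r * cols + c] * x[c];    // inner loop: a reduction, i.e. a
        }                                     // loop-carried dependency on 'sum'
        y[r] = sum;
    }
}

Vectorizing the inner loop turns the accumulation into per-lane partial sums plus a final horizontal reduction, whereas vectorizing the outer loop keeps one independent accumulator per lane; which axis is preferable depends on these dependencies and on how the data is laid out in memory.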

When the data parallelism is not obvious, it may merely be obscured by the complexity of the algorithm. As a simple illustrative example, consider a two-dimensional image filter in which the value of each pixel depends on its two neighbours to the top and to the left. Any serial implementation would simply iterate row-wise or column-wise with a loop-carried dependency, but a parallel implementation needs to process anti-diagonals, which run from the bottom left to the top right. The axis of vectorization is oblique to the natural dimensions of the problem, but it still exists.
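A minimal sketch of this wavefront traversal, assuming a made-up filter that merely adds the two neighbours to each pixel (the function name and the flat row-major layout are my own choices for illustration):

// Each interior pixel is updated from its top and left neighbours.
// All pixels on one anti-diagonal (r + c constant) are independent of
// each other, so the inner loop is the axis to vectorize.
void filter_wavefront(float* img, int rows, int cols)
{
    for (int d = 2; d <= rows + cols - 2; ++d) {                  // one anti-diagonal per step
        int r_begin = (d - (cols - 1) > 1) ? d - (cols - 1) : 1;  // clip the diagonal to the
        int r_end   = (d - 1 < rows - 1) ? d - 1 : rows - 1;      // interior of the image
        for (int r = r_begin; r <= r_end; ++r) {                  // iterations are independent
            int c = d - r;
            img[r * cols + c] += img[(r - 1) * cols + c]          // top neighbour
                               + img[r * cols + (c - 1)];         // left neighbour
        }
    }
}

Each round of the outer loop finishes one anti-diagonal; all updates within a round are independent and could be done with vector operations, albeit with gathers and scatters, because the elements of an anti-diagonal are not contiguous in a row-major image.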

When data parallelism is believed to be non-existent, or is even proven to be very limited, you have no choice but to abandon the original algorithm. With any luck, one of the potential replacements is vectorizable. The search space is most limited if the replacement algorithm has to produce bitwise identical results, so a little freedom here is a big help; find out whether an equivalent result would do, and what exactly "equivalent" means in the specific case. And do not be fooled by embarrassingly parallel algorithms that are really inefficient: potential parallelism is only one of the criteria that the replacement algorithm has to meet.

An Example

When dealing with any specific data dependency, it can be helpful to understand whether it is rooted in fundamental, theoretical causes or comes from pragmatic considerations of programming. Such knowledge can sometimes help to weaken or even break a dependency. As an example, consider a data compression method called arithmetic coding. I will not try to explain it here in full; suffice it to say that at its heart it computes a long sequence of nested intervals. Every symbol of the original message corresponds to a specific subinterval. As the symbols are processed one by one in sequence, the interval keeps shrinking. After the message has been read in full, a representative number that encodes the message is chosen from the final interval.
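To make the shrinking step concrete, here is one common way to write it, as a toy sketch with plain doubles (the names Interval and shrink are mine, and real coders do not use plain floating point, for the reason discussed below): the current interval is narrowed to the part of it that the symbol's subinterval of [0, 1) selects.

struct Interval { double low, high; };   // toy sketch only; see the precision caveat below

// Shrink the current interval by the symbol's subinterval of [0, 1).
Interval shrink(Interval current, Interval symbol)
{
    double width = current.high - current.low;
    return { current.low + symbol.low  * width,
             current.low + symbol.high * width };
}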

Arithmetic coding is often thought to be a strictly sequential algorithm. And indeed, the kind of interval shrinking needed by arithmetic coding, regarded as a mathematical operator, is not commutative. That means the order of the operands cannot be changed freely. An example of a commutative operation is addition: the sum remains the same regardless of the order in which we add a given set of numbers. However, interval shrinking is associative. This means that while we cannot move operands around, we can influence the order of operations without changing the result. Therefore, we may take the long sequence of shrinks (using "#" to denote the operation of shrinking one interval by another)

s1 # s2 # s3 # s4 # s5 # ... # sN

and compute N/2 partial results in parallel

(s1 # s2) # (s3 # s4) # ... # (sN-1 # sN)

and we can apply this grouping recursively to arrive at the final interval in lb(N) (binary logarithm of N) parallel steps rather than N sequential ones. So much for the theory. Unfortunately, this is completely impractical, because the interval borders are real numbers between 0.0 and 1.0 with unlimited precision.
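Purely to fix the shape of the idea before it is refined, a pairwise reduction over the toy shrink() sketch from above could look like the following. It illustrates only the grouping; with plain doubles it throws away exactly the accuracy that practical coders, as described next, work hard to preserve.

#include <cstddef>
#include <vector>

// Reduce a non-empty sequence of symbol intervals pairwise; under exact
// arithmetic this yields the same final interval as the sequential
// left-to-right loop, because the operation is associative.
Interval reduce_pairwise(std::vector<Interval> s)
{
    while (s.size() > 1) {
        std::vector<Interval> next;
        for (std::size_t i = 0; i + 1 < s.size(); i += 2)
            next.push_back(shrink(s[i], s[i + 1]));   // adjacent pairs: independent work
        if (s.size() % 2 != 0)
            next.push_back(s.back());                 // odd element is carried over unchanged
        s.swap(next);                                 // lb(N) rounds instead of N steps
    }
    return s.front();
}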

Practical implementations of arithmetic coding handle this by tracking the (floating) binary point along that unlimited sequence of fractional binary digits in a strictly sequential manner, so that they can maintain high accuracy with finite precision. Nevertheless, a deeper understanding of the problem allows us to trade accuracy of the result against a small constant factor of speed. We will not be able to go all the way and compute all interval shrinks in parallel with finite precision, but we can reduce the sequential part of the algorithm to half or a quarter of its size and still maintain good accuracy.