Addendum: the three exact ways in which VPSPLIT sucks

I finally found a few ... er ... volunteers to discuss the conditional vector pair split. The unsurprising thing is that VPSPLIT sucks, but the surprising thing is that it does so in fairly interesting ways. Not all of the following points were explicitly mentioned in the debate, so I want to recollect them here. In my opinion, knowing the specific limitations only makes such a primitive more applicable (in the sense that we better know the exact niche it lives in).

The intended code idiom for VPSPLIT looks something like this:

...
vectorData_t leftHalf, rightHalf;  // transient data storage
vectorMask_t leftMask, rightMask;  // transient boolean condition of data

vectorData_t* ptr;  // some array of data
int i;              // loop counter
bool flag;          // auxiliary

// loop preparation
leftHalf = ptr[0];  // indexed per vector, not per element
rightHalf = ptr[1];

for (i = 2; i <= imax; i++) {
  leftMask = isSomeCondition(leftHalf);
  rightMask = isSomeCondition(rightHalf);

  (leftHalf, rightHalf, flag) = vpsplit(leftHalf, rightHalf, leftMask, rightMask)

  if (flag) {
    processRightHalf();
    rightHalf = ptr[i];
  }  else {
    processLeftHalf();
    leftHalf = ptr[i];
  }
}

// loop epilogue
leftMask = isSomeCondition(leftHalf);
rightMask = isSomeCondition(rightHalf);

(leftHalf, rightHalf, flag) = vpsplit(leftHalf, rightHalf, leftMask, rightMask)

if (flag) {
  processRightHalf();
  processLeftovers(leftHalf);
}  else {
  processLeftHalf();
  processLeftovers(rightHalf);
}
...

This construct has three big problems:

1. It obscures what was clear, causing extra work
Changing the order of data elements is the whole point of the operation, but it causes additional complexity. The immediate consequence is that on average, every data element has its condition checked twice rather than just once. Theoretically, the boolean information is still there, but it has been scrambled by the shuffling of data.

2. It destroys memory locality
The above pseudocode does not show how results are written back to memory. Clearly, some sort of scatter store is needed in the general case. Even if the hardware provides that, the potentially irregular access patterns will have a price in terms of bandwidth and latency.

3. It shovels data around without processing it
The above pseudocode does not show how complex the actual processing of data (in both true and false code paths) can be. Potentially, a lot of state might be associated with every data element. So you might have to shuffle dozens of vector variables around for every single condition.

The trouble here is that only the first of those VPSPLIT operations produces new information, while the subsequent ones keep re-computing an identical permutation. Even on machines with generic vector permute capability (where VPSPLIT can be decomposed into 1. determine permutation, 2. apply permutation), you might end up shuffling around lots and lots of data, without actually getting closer to any desired result.

Conclusion

The conditional vector pair split is not only not a general solution, but it is not even a complement to predication. VPSPLIT and predication together still do not cover all the needs of the SIMD control flow aspect. Any more bright ideas?