In thefollowing passes, the input to each thread is always two sorted vector arrays A and B and the output isthe sorted merge of these two arrays. One vector, a, will be takenfrom A and one vector, b, from B, and thecomponents of these two vectors,a and b, will be sorted sothat a contains thelowest four floats and bthe highest four floats. This iseasily done if one realizes that, for an element an to be among thefour highest elements, there must be at least 4 elements lower than it, andsince there will be nlower elements in athat there must be 4 ?n lower elements in b.
I.e., b4 ? n ? 1 must be lower than an.
Thus this can be accomplished bytwo compare-and-swap instructions, see Figure 1(a). // get the four lowest floats a.xyzw = (a.xyzw < b.
wzyx) ? a.xyzw : b.wzyx // get the fourhighest floats b.
xyzw = (b.xyzw >= a.wzyx) ? b.xyzw: a.wzyx The vectors a and b are theninternally sorted by calling sortElements(a)andsortElements(b). ais output as thenext vector on the output and btakes the place of a(see Figure 1(b)). A new bvector is fetched from Aor B depending on whichof these contains the lowest value.
This process, adding one sorted vector tothe output, is repeated until either Aor B isempty. When either input array is empty, the (already sorted) remains inthe other array are simply appended to the output. Each threadwill be working on some specific part of the input stream, so the argumentspassed on to each thread in each pass is the offset where its two consecutivelists begins and the number of elements it shall merge. The thread will startreading its input values from this offset in the input list and will startwriting the result into the same offset of the output list. In the next pass,the input and output lists will be swapped. This methodof merge-sorting a list works very well in the first passes, where many threadscan merge many lists in parallel.
Looking at the execution times for each pass,we can easily see that the algorithm will become slow when there is not enoughparallel threads left to keep all processors busy (see Figure 2). In the finalstep, the entire sorted result will be created by a single processor of asingle multiprocessor, on the G80 effectively leaving all but one processoridle. See Figure 3 for pseudo code.