following passes, the input to each thread is always two sorted vector arrays A and B and the output is
the sorted merge of these two arrays. One vector, a, will be taken
from A and one vector, b, from B, and the
components of these two vectors,
a and b, will be sorted so
that a contains the
lowest four floats and b
the highest four floats.
easily done if one realizes that, for an element an to be among the
four highest elements, there must be at least 4 elements lower than it, and
since there will be n
lower elements in a
that there must be 4 ?
n lower elements in b. I.e., b4 ? n ? 1 must be lower than an. Thus this can be accomplished by
two compare-and-swap instructions, see Figure 1(a).
// get the four lowest floats
a.xyzw = (a.xyzw < b.wzyx) ? a.xyzw : b.wzyx // get the four
b.xyzw = (b.xyzw >= a.wzyx) ? b.xyzw
The vectors a and b are then
internally sorted by calling sortElements(a)
is output as the
next vector on the output and b
takes the place of a
(see Figure 1(b)). A new b
vector is fetched from A
or B depending on which
of these contains the lowest value. This process, adding one sorted vector to
the output, is repeated until either A
or B is
empty. When either input array is empty, the (already sorted) remains in
the other array are simply appended to the output.
will be working on some specific part of the input stream, so the arguments
passed on to each thread in each pass is the offset where its two consecutive
lists begins and the number of elements it shall merge. The thread will start
reading its input values from this offset in the input list and will start
writing the result into the same offset of the output list. In the next pass,
the input and output lists will be swapped.
of merge-sorting a list works very well in the first passes, where many threads
can merge many lists in parallel. Looking at the execution times for each pass,
we can easily see that the algorithm will become slow when there is not enough
parallel threads left to keep all processors busy (see Figure 2). In the final
step, the entire sorted result will be created by a single processor of a
single multiprocessor, on the G80 effectively leaving all but one processor
idle. See Figure 3 for pseudo code.