Sorting Large Data Sets on a Massively Parallel System

Ralf Diekmann, Jörn Gehring, Reinhard Lüling, Burkhard Monien, Markus Nübel, Rolf Wanka†
Department of Mathematics and Computer Science, University of Paderborn, D-33095 Paderborn, Germany
e-mail:fdiek,joern,rl,bm,nuebel,[email protected]

Abstract

This paper presents a performance study of many of today’s popular parallel sorting algorithms. It is the first to present a comparative study on a large scale MIMD system. The machine, a Parsytec GCel, contains 1024 processors connected as a two-dimensional grid. To justify the experimental results, we develop a theoretical model to predict the performance in terms of communication and computation times. The experiments agree very closely with the theoretical model as long as the edge congestion caused by the algorithms is predicted precisely. We compare Bitonicsort, Shearsort, Gridsort, Samplesort, and Radixsort. Experiments were performed using random instances according to a well known benchmark problem. Results show that for the machine we used, Bitonicsort performs best for smaller numbers of keys per processor (< 2048) and Samplesort outperforms all other methods for larger instances.

1 Introduction

The problem. One of the fundamental problems in computer science is to sort large data sets quickly. Parallel computers make it possible to reduce the time needed to sort. Therefore, parallel sorting has been studied for a long time. One of the realizable parallel processor network architectures is the two-dimensional $n \times n$ grid. This network is used for our studies. Sorting algorithms can roughly be divided into methods that are exclusively based on the compare-exchange operation and methods that use additional operations such as counting, copying, and a routing that depends on each specific input. Batcher’s famous bitonic sorting circuit [1] can be implemented on the grid in a way that results in a running time of $O(n)$ [22] by using an efficient routing scheme. For practical purposes, this means only that the runtime of an implementation is proportional to the diameter of the grid, but the real constant factor is not determined. Algorithms that are inspired by the structure of the grid are Shearsort [19] and Gridsort [13, 16] with a running time of $O(n \log n)$ each, and the theoretically optimal algorithm due to Schnorr and Shamir [18] that has a running time of $3n + o(n)$. Unfortunately, the low order term increases the running time of this algorithm on existing machines significantly. Other, more or less topology independent sorting methods are parallel versions of Quicksort [5, 6], Samplesort [9, 14], and Radixsort [7].

Previous experimental work. To demonstrate the usefulness of these algorithms and to investigate them from a practical point of view, they have been implemented on different existing parallel machines. Implementations were done on large scale SIMD machines and on MIMD systems. In [4] and [21], a number of algorithms are compared using a Thinking Machines CM-2 (1024 nodes). Both authors found Samplesort to perform best on this machine. A similar result is presented in [11] using a MasPar system containing 1024 nodes. Implementations on MIMD architectures so far have been done for medium sized systems only. In [23], results for an implementation on a 128 processor AP1000 system are presented. Implementations of Bitonicsort on 64 processors are presented in [21] using the iWarp system. The same number of processors was used for the implementation of Samplesort on an iPSC/860 system [14].

The new results. This paper presents implementation results of five sorting algorithms on a 1024 processor Parsytec GCel system, which is a $32 \times 32$ mesh-connected MIMD machine. The implementations were tested by a series of random 32-bit numbers built according to the generator routine of the NAS Integer Benchmark [2, 3]. To justify our experimental results, we analyze the behavior of the algorithms under a theoretical framework which is applied to our machine. For this, we investigate the costs of the main operations, i.e. comparisons, memory movements, inter-processor communication, and synchronization. This enables us to predict the overall runtime of the algorithm very precisely as long as network congestion can be predicted exactly. Implementations and theoretical investigations lead to the following results:

Partially supported by DFG-Forschergruppe “Effiziente Nutzung massiv paralleler Systeme”, by Esprit Basic Research Action Nr. 7141 (ALCOM II), and by DFG Leibniz Award.
† Partially supported by Volkswagenstiftung.


For the 1024 processor MIMD system we used, Bitonicsort achieves the best running times if the number of keys per processor is small (< 2048). For larger instances, Samplesort outperforms all other methods. The predicted running times correspond almost exactly to the measured running times. This gives us interesting insights into the structure of the algorithms and shows which operations affect the overall performance most.

The paper is organized as follows. First, we present the theoretical framework and its application to the Parsytec GCel 1024. Then we present the results of the implementations of the following algorithms: Bitonicsort, Shearsort, Gridsort, Samplesort, and Radixsort. The fastest methods, Bitonicsort and Samplesort, are discussed in detail, the remaining ones briefly. In the final section, we discuss the advantages and disadvantages of the algorithms using the theoretical framework and compare the results of the implementations.

2 A Theoretical Model

In theory, sequential sorting algorithms are analyzed in terms of the number of comparisons they need. To predict their running time more precisely, the number of assignments also has to be analyzed. Parallel sorting algorithms require communication in addition to comparisons and assignments. Let $M$ be the number of assignments (memory movements), $C$ the number of comparisons, and $K$ the number of communication phases an algorithm $A$ needs to sort $N$ keys with $P$ processors. The predicted running time of $A$ is

$$T_R(A) = T_{Move}(M) + T_{Compare}(C) + T_{Comm}(K).$$

The functions $T_{Move}$, $T_{Compare}$ and $T_{Comm}$ depend on the characteristics of the individual hardware. The values of $M$, $C$ and $K$ are machine independent and defined only by the algorithm. We show that this model is suited to predict the running times of different sorting algorithms if the three functions are defined in a way that describes the given hardware closely enough. In some examples, we also see that if the adaptation is not sufficiently exact (in particular, $T_{Comm}$ is not easy to determine exactly), the prediction is less precise. Nevertheless, in all cases the model can serve as a useful tool to understand the behavior of a given sorting algorithm.

The Parsytec GCel. The GCel is a two-dimensional $32 \times 32$ grid of INMOS T805 Transputers, each running at a speed of 30 MHz. The Transputers are arranged in clusters of $4 \times 4$ processors. The whole machine contains $4 \times 4$ gigacubes, each built from $2 \times 2$ clusters (cf. Figure 1). The operating system used for our implementations is PARIX 1.2 [8, 15]. One of the most important features of the PARIX runtime environment is its ability to implement virtual links between Transputers that are not physically connected. This allows the use of a large number of topologies, e.g. hypercubes, butterflies or higher-dimensional meshes, which are automatically mapped onto the hardware [17]. The runtime environment implements virtual links with wormhole routing.

In the following, we present the runtime of the basic operations $T_{Move}$, $T_{Compare}$ and $T_{Comm}$ that establish the theoretical model.

Memory movements. For the time of an assignment operation, we have $T_{Move}(M) = M \cdot 3.24 \cdot 10^{-7}$ ms.

Comparisons. All implemented algorithms use a common compare function. Measurements give $T_{Compare}(C) = C \cdot 6.77 \cdot 10^{-3}$ ms.

Communication. For all parallel sorting algorithms presented in this paper, inter-processor communication consumes between 50 and 90 percent of the algorithms’ total running time on the GCel. There are three main parameters that determine communication speed: packet size, dilation, and congestion. The packet size is the number of bytes that have to be sent during a communication phase, dilation is the number of physical links this message has to cross, and congestion is the number of virtual links that are mapped onto a physical link. Throughput measurements on the GCel using different packet sizes and dilations have led to a classification of the virtual links depending on the location of the processors they connect. These classes are (cf. Figure 1):

Class A: The processors are neighboring in the same cluster.
Class B: The processors are neighboring in different clusters.
Class C: The processors belong to the same cluster, but are not neighboring.
Class D: The processors are neither neighbors nor in the same cluster.

Figure 1 shows performance measurements for dilations from 1 through 22 and the corresponding classes in the GCel. The maximum communication throughput is reached for packet sizes larger than 1 KByte. Since communication distances for the classes A, B, and C are very short, the edge congestion may be ignored in these classes. It has to be taken into consideration in class D only. In this case, our measurements have shown that the maximum edge congestion is the most important factor. This parameter is very difficult to predict in an asynchronous computer network, because it is not known in advance at which time communication really takes place on a physical link. Practice has shown that for large packets a maximum edge congestion of $E_{max}$ slows down a communication phase by a factor of approximately $\mathrm{Cong}(E_{max}) := \max\{1, E_{max} - 1\}$.

Using this fact and the measurements presented in Figure 1, we are now able to define formulas estimating the communication time in ms, given the edge congestion $E_{max}$ and the packet size $S$ (in bytes). We use $T_{Comm\,A}(S)$, $T_{Comm\,B}(S)$, $T_{Comm\,C}(S)$, and $T_{Comm\,D}(S)$ to express the communication times in the different classes, where the $T_{Comm\,x}$ are third-order approximations of the curves of Figure 1 obtained by hyperbolic regression (due to space limitations, the polynomials are omitted). $T_{Comm}(K)$ is the overall communication time obtained by combining the $T_{Comm\,x}$. If a communication link is used bidirectionally, a communication step is not as fast as in the case of a single unidirectional communication. For our system, the time for one send operation has to be multiplied by $C_{Bid} = 1.08$ if bidirectional instead of unidirectional communication is used.
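The cost model of this section can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function and constant names are hypothetical, the two per-operation constants are copied from the text as printed, and because the regression polynomials for the $T_{Comm\,x}$ are omitted from the paper, the uncongested class D time must be supplied by the caller.

```python
# Per-operation constants quoted in Section 2 (milliseconds).
MOVE_MS = 3.24e-7     # T_Move(M) = M * 3.24e-7 ms, as printed in the text
COMPARE_MS = 6.77e-3  # T_Compare(C) = C * 6.77e-3 ms
C_BID = 1.08          # slowdown of a bidirectional versus a unidirectional send


def t_move(m):
    """Time in ms for m assignments (memory movements)."""
    return m * MOVE_MS


def t_compare(c):
    """Time in ms for c key comparisons."""
    return c * COMPARE_MS


def cong(e_max):
    """Slowdown factor for a maximum edge congestion e_max: max{1, e_max - 1}."""
    return max(1, e_max - 1)


def class_d_step_ms(t_comm_d_ms, e_max, bidirectional=True):
    """Time of one class D communication step.

    t_comm_d_ms is the uncongested T_Comm_D(S) for the packet size in use.
    The paper omits its regression polynomials, so this value is an input here.
    """
    factor = C_BID if bidirectional else 1.0
    return factor * cong(e_max) * t_comm_d_ms


def predicted_runtime_ms(m, c, t_comm_ms):
    """T_R = T_Move(M) + T_Compare(C) + T_Comm, with T_Comm given in ms."""
    return t_move(m) + t_compare(c) + t_comm_ms
```

For instance, `class_d_step_ms(10.0, 4)` scales a hypothetical 10 ms uncongested step by $C_{Bid} \cdot \mathrm{Cong}(4) = 1.08 \cdot 3$.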

3 Comparison-based Algorithms

Comparison-based algorithms are usually designed to sort with $N/P = 1$, but in practice, every single processor holds many keys. Therefore, we replace the original compare-exchange operations by split&merge operations [12]: one processor participating in the original compare-exchange operation receives the smaller $N/P$ keys, the other the greater $N/P$ keys. To implement this operation efficiently, the keys each processor holds are sorted locally before the original algorithm is started. This initial internal sorting is performed by sequential Quicksort. It takes $1.4\,\frac{N}{P}\log\frac{N}{P}$ comparisons and assignments on average. Thus, a single split&merge needs $\log\frac{N}{P}$ comparisons in order to determine the number of keys that have to be taken from the other processor, and $\frac{N}{P}$ comparisons and $3\frac{N}{P}$ assignments for the actual merging. Furthermore, two bidirectional communication steps are necessary.

[Figure 1: Link classes according to the architecture of the GCel, and communication throughput. The diagram marks the link classes A to D relative to cluster and cube borders; the plot shows throughput in KBytes/s (0 to 1200) against packet size in bytes (0 to 32768) for classes A, B, C, and D.]

3.1 Bitonic Sort

The algorithm. The Bitonicsort algorithm [1] is very well suited for the hypercube topology, but by using the PARIX virtual links, it can also be implemented on the GCel easily. The basic idea of this algorithm for the $k$-dimensional hypercube is as follows: first recursively sort the so-called upper and lower $(k-1)$-dimensional subhypercubes, one into increasing and one into decreasing order, then merge these sorted hypercubes. Using this principle, the algorithm takes exactly $\frac{1}{2}k(k+1)$ parallel steps to sort on a hypercube with $2^k$ processors. Note that edges belonging to high-numbered dimensions are used seldom. As the GCel is a two-dimensional grid instead of a hypercube, the way the hypercube is mapped onto this grid is very important. We tested several mapping strategies. The strategy that was found to give the best results works as follows: the hypercube of dimension $k$ is divided into two subcubes of dimension $k-1$. Each subcube is mapped onto a physical $2^{(k-1)/2} \times 2^{(k-1)/2}$ subgrid if $k$ is odd and onto a $2^{\lfloor (k-1)/2 \rfloor} \times 2^{\lceil (k-1)/2 \rceil}$ subgrid otherwise, in order to obtain almost square meshes. This strategy, which takes the communication structure of Bitonicsort into account, leads to very good results for the implementation of Bitonicsort on grid architectures, because the dilation for an edge belonging to dimension $i$ is $2^{\lceil i/2 \rceil} - 1$. Thus, links used more often by the algorithm in low-numbered hypercube dimensions are mapped in a way resulting in a smaller dilation than the rarely used links in higher-numbered dimensions. We have compared the performance of Bitonicsort using this algorithm-sensitive mapping to a simple 1-1 mapping that assigns hypercube node $j$ to grid node $j$ and to the PARIX-system hypercube mapping that uses a so-called bandwidth mapping. This comparison showed that the 1-1 mapping and the bandwidth mapping lead to runtimes 26 percent and 100 percent, respectively, above the runtime using the algorithm-sensitive mapping. The results presented in the following analysis refer to the optimized mapping described first. If not mentioned otherwise, $P$ denotes the total number of processors and $N$ the total number of keys to be sorted.

Practical and theoretical time consumption. Now we analyze Bitonicsort with respect to the theoretical model described in Section 2. Exploiting the Bitonicsort algorithm and the embedding, we count the number of communication steps for the classes A, C, and D. In the optimized mapping, no communication takes place in class B. Split&merge along dimension 0 of the hypercube takes place $\log P$ times; along dimension 1 it takes $\log P - 1$ communications. So we have $2\log P - 1$ communication steps in class A. Class C communication happens in dimensions 2 and 3 at a total of $2\log P - 5$ times. The remaining $\sum_{i=4}^{\log P - 1} C_{Bid}\,\mathrm{Cong}(2^{\lfloor i/2 \rfloor})\,(\log P - i)$ communication steps belong to class D. These considerations are summarized in Table 1. Time loss caused by edge congestion is only significant for class D and is therefore not considered in the evaluation of class A and C communication. In Figure 2, we give a theoretical prediction of the total time in seconds used for comparisons, assignments, and communication steps, as well as the measured running time of our implementation. The figure presents these numbers in a cumulative style ($T_R = T_{Comp} + T_{Move} + T_{Comm}$), i.e. the communication curve also contains the times for assignments and comparisons and therefore gives the total running time $T_R$. The dotted line refers to our measurements and nearly coincides with the predicted running time. This strong correlation arises because Bitonicsort works very synchronously, so operations on all processors take place nearly simultaneously. Further, the maximum edge congestion can be predicted very well. So our theoretical formulas model the real behavior precisely. Figure 2 shows that comparisons and communication steps are the most time consuming parts. The time necessary for assignments can be ignored for this algorithm.

Split&merge steps $R$: $\frac{1}{2}\log P\,(\log P + 1)$
Comparisons: $R\left(\frac{N}{P} + \log\frac{N}{P}\right) + 1.4\,\frac{N}{P}\log\frac{N}{P}$
Assignments: $3R\,\frac{N}{P} + \frac{N}{P}\log\frac{N}{P}$
Class A comm.: $C_{Bid}\,(2\log P - 1)\;T_{Comm\,A}\!\left(\frac{N}{P}\cdot\text{key size}\right)$
Class C comm.: $C_{Bid}\,(2\log P - 5)\;T_{Comm\,C}\!\left(\frac{N}{P}\cdot\text{key size}\right)$
Class D comm.: $\sum_{i=4}^{\log P - 1} C_{Bid}\,\mathrm{Cong}(2^{\lfloor i/2 \rfloor})\,(\log P - i)\;T_{Comm\,D}\!\left(\frac{N}{P}\cdot\text{key size}\right)$

Table 1: Estimation formulas for Bitonicsort

[Figure 2: Time consumption of Bitonicsort for P = 1024. Axes: time (s, ticks 20 to 100) versus keys per processor (1024 to 65536); cumulative curves for comparisons, assignments, and communication, plus the measured running time (dotted).]

Advantages and disadvantages. Transfer of data packets happens during split&merge operations only. The total amount of keys exchanged in each step leads to a packet size of more than one kilobyte if the number of keys per processor is larger than 256. In this case, startup times can be ignored. The achieved link performance is very close to the maximum. The operation that is mainly performed by this algorithm is split&merge. Because of that, the processors are synchronized frequently. Since every processor has nearly the same amount of work per iteration, the synchronization loss caused by unbalanced workload is negligible compared to the total running time. The basic Bitonicsort algorithm is very simple and can be implemented in a few lines of code. The main complexity is buried in the mapping algorithm that builds up the hypercube's virtual links. As the program is very small, there is enough main memory available to sort large sets of data within a reasonable time.

3.2 Shearsort

The algorithm. Shearsort [19, 20] is one of the simplest sorting algorithms for two-dimensional grids. It sorts rows and columns of the grid alternately with Odd-Even Transposition Sort [12], which performs split&merge operations only. One row/column sorting is called an iteration. Iterations are repeated until the whole grid is sorted. Shearsort is well suited for two-dimensional grids because it performs only neighborhood communication. As shown in [20], on an $n \times n$ grid, $\lceil n/2^{t-1} \rceil$ parallel steps are sufficient to sort a column during iteration $t$. $\lceil \log n \rceil + 1$ iterations are sufficient to sort. If $n$ is a power of 2, the algorithm terminates within $n\log n + 3n - 1$ split&merge steps.

Practical and theoretical time consumption. The precalculated theoretical times for comparisons, assignments, and communications on a 1024 processor grid, as well as the measured running times (dotted curve) of the implemented algorithm, are shown in Figure 3. As Figure 3 shows, the main time consuming parts are the comparisons. This is because Shearsort requires more split&merge steps than, e.g., Bitonicsort.

[Figure 3: Time consumption of Shearsort for P = 1024. Axes: time (s, ticks 20 to 100) versus keys per processor (1024 to 65536); cumulative curves for comparisons, assignments, and communication, plus the measured running time (dotted).]

Advantages and disadvantages. As for the Bitonicsort algorithm, most time is spent computing split&merge operations. In contrast to the Bitonicsort method, this operation is performed only by physically neighboring processors (communication classes A and B). As the number of keys each processor holds is nearly equal, no workload imbalance occurs during runtime. This leads to very small synchronization loss. The large number of split&merge operations in the worst case is the main drawback of Shearsort. Sometimes the number of split&merge operations can be decreased by using termination detection to check whether keys changed their position during the last iteration. But experiments show that a reduction can be achieved only for small values of $N/P$, because large numbers of local keys increase the probability of worst case behavior.

3.3 Gridsort

The algorithm. We call the following short-periodic algorithm Gridsort. It is due to Schwiegelshohn [16], who shows an $O(n \log n)$ running time on $n \times n$ grids with $n$ even. The algorithm uses a mesh with some additional wrap-around edges (see Figure 4). It repeats eight phases shown in Figure 4. A connection between two nodes represents a split&merge operation.
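Each connection in Gridsort's phases, like each step of Bitonicsort and Shearsort, is a split&merge operation as defined at the start of Section 3. The following Python sketch illustrates one way to realize it on two locally sorted key lists of equal length: a binary search finds how many keys must change sides (the $\log\frac{N}{P}$ comparisons mentioned there), and two merges produce the results. The function names are hypothetical, and the two lists merely stand in for the memories of two processors; the paper does not show its implementation.

```python
import heapq


def exchange_count(low, high):
    """Number of keys the two partners must swap.

    low and high are sorted lists of equal length k. The test
    low[k-1-i] > high[i] is true for the first j indices i and false
    afterwards, so j can be found by binary search.
    """
    k = len(low)
    lo, hi = 0, k
    while lo < hi:
        mid = (lo + hi) // 2
        if low[k - 1 - mid] > high[mid]:
            lo = mid + 1
        else:
            hi = mid
    return lo


def split_and_merge(low, high):
    """Return (smaller half, greater half) of the union, both sorted."""
    k = len(low)
    j = exchange_count(low, high)
    # The partner holding `low` keeps its k-j smallest keys and receives
    # the j smallest keys of `high`; the other partner gets the rest.
    new_low = list(heapq.merge(low[:k - j], high[:j]))
    new_high = list(heapq.merge(high[j:], low[k - j:]))
    return new_low, new_high
```

Calling `split_and_merge([1, 4, 7, 9], [2, 3, 8, 10])` swaps two keys in each direction and yields `([1, 2, 3, 4], [7, 8, 9, 10])`.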

[Figure 4: Parallel steps of one iteration of Gridsort. Grid diagrams of the phases; the surviving panel labels are Phase 1 + 5, Phase 4, and Phase 8.]

Though some aspects are similar to Shearsort, there are some significant differences in the communication schemes and structures. Further detailed results are presented in [13].

Practical and theoretical time consumption. Schwiegelshohn’s analysis states an upper bound of $11n \log n$ split&merge operations, but this bound is known not to be exact. We measured a value of less than $3n \log n$ on average by implementing a termination detection scheme that stops the program once a correct sorting is detected. This result is used in the formulas of our theoretical model. Analyzing a $32 \times 32$ grid, we obtain the curves in Figure 5 for a variable number of local keys.

[Figure 5: Time consumption of Gridsort for P = 1024. Axes: time (s, ticks 20 to 100) versus keys per processor (1024 to 65536); cumulative curves for comparisons, assignments, and communication, plus the measured running time (dotted).]

The time for data movements is very low, so this curve is nearly identical to the curve showing the time necessary for comparisons. The difference between the predicted running time $T_R$ and the measured time (dotted curve) is due to the inaccuracy of the upper bound and the difficulty of measuring the congestion caused by communication on the wrap-around edges.

Advantages and disadvantages. The behavior of Gridsort concerning communication and packet size is similar to Shearsort (see Section 3.2), except that communication using wrap-around edges takes more time and therefore belongs to our classes C and D. The synchronization loss is of nearly the same order as for Shearsort, except that the wrap-around edges cause some additional loss. The coding complexity is similar to Shearsort except for the mapping of the wrap-around edges, which needs some additional programming effort, but generally this does not affect the overall simple structure too much. Because only a constant number of split&merge operations is repeated during one iteration, this method is especially suited to VLSI implementations.

4 Algorithms with Additional Operations

4.1 Samplesort

The algorithm. Samplesort [4, 9] is a hashing-type multiple phase sorting algorithm using some randomization techniques. From the whole set of keys, a small random sample is chosen and sorted in parallel by an arbitrary parallel sorting algorithm. For our purpose, we used Bitonicsort with the optimized hypercube mapping as described in Section 3.1. Bitonicsort is especially suited to sort small numbers of keys (cf. Section 6). From the sorted sample, $P - 1$ splitter keys are selected to split the total set of keys into $P$ buckets. These buckets are assigned to the $P$ processors, i.e. all keys belonging to bucket $i$ are routed to processor $i$. Hence, a total order can be achieved by performing a final local sort on the received keys (in this case, Quicksort is used as the local method again). There is no guarantee that the buckets are equal-sized after the routing phase. The variation of the sizes of all buckets strictly depends on the size of the sample. Let $b(i)$ denote the size of bucket $i$ after the routing phase. We call $\frac{b(i)}{N/P}$ the bucket expansion of bucket $i$.

Practical and theoretical time consumption. To determine formulas within our theoretical framework, we observe that it costs $0.02\,\frac{N}{P}$ assignments to choose a sample. According to some measurements, the local sample size was set to 2% of the local number of keys. Propagating the splitters requires $\log P$ bidirectional communication steps with a packet size of $P \cdot \text{key size}$ bytes and a maximum congestion of $\frac{1}{2}\sqrt{P}$ in class D. As we sort 32-bit numbers, the routing of the keys requires $P$ bidirectional communication steps with 4 bytes for sending counting tags followed by an average of $\frac{N/P}{P} \cdot \text{key size}$ bytes in class D. The random communication scheme during this routing phase makes the occurring edge congestion unpredictable. Experiments concerning this problem show that an average congestion of 4 can be observed for a $32 \times 32$ processor grid. Additionally, $\frac{N}{P}$ data movements have to be performed. Final reordering with sequential Quicksort needs $1.4\,\frac{N}{P}\log\frac{N}{P}$ comparisons and data assignments, because according to our own experiments and the results of [4], bucket expansion can be neglected for larger values of $N/P$. Table 2 shows our formulas. $\mathrm{BitonComp}(a, b)$ is the number of comparisons used by the bitonic subsorter with $a$ local keys on $b$ available processors. $\mathrm{BitonMove}(a, b)$ counts the data movements caused by the bitonic subsorter. $\mathrm{BitonComm}_{A,B,C,D}(a, b)$ is the communication time in the corresponding classes. The results of Table 2 as well as the implementation results are visualized for 1024 processors and different numbers of keys in Figure 6. Again, the time for data movements can be ignored. The time consumed by comparisons is smaller than the time used by periodic sorting methods, because in this case comparisons are needed only by the bitonic sorter on a small set of keys and by the binary search on the splitters.

Comparisons: $1.4\,\frac{N}{P}\log\frac{N}{P} + \mathrm{BitonComp}\!\left(0.02\,\frac{N}{P}, P\right)$
Assignments: $1.02\,\frac{N}{P} + \frac{N}{P}\log\frac{N}{P} + \mathrm{BitonMove}\!\left(0.02\,\frac{N}{P}, P\right)$
Class A comm.: $\mathrm{BitonComm}_A\!\left(0.02\,\frac{N}{P}, P\right)$
Class B comm.: $\mathrm{BitonComm}_B\!\left(0.02\,\frac{N}{P}, P\right)$
Class C comm.: $\mathrm{BitonComm}_C\!\left(0.02\,\frac{N}{P}, P\right)$
Class D comm.: $C_{Bid}\,\log P\;T_{Comm\,D}(P \cdot \text{key size})\,\mathrm{Cong}\!\left(\tfrac{1}{2}\sqrt{P}\right) + \mathrm{BitonComm}_D\!\left(0.02\,\frac{N}{P}, P\right) + C_{Bid}\,P\left(T_{Comm\,D}(4) + T_{Comm\,D}\!\left(\tfrac{N/P}{P}\cdot\text{key size}\right)\right)\mathrm{Cong}(4)$

Table 2: Estimation formulas for the Samplesort algorithm
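The phases of Samplesort described in this section (sampling, splitter selection, routing to buckets, final local sort) can be illustrated with a sequential Python simulation in which each inner list stands for the keys of one processor. This is a sketch under stated assumptions, not the authors' implementation: all names are hypothetical, the sample is sorted with the built-in sort instead of parallel Bitonicsort, the final local sort uses the built-in sort instead of Quicksort, and the splitters are taken at evenly spaced positions of the sorted sample, a choice the paper does not specify.

```python
import bisect
import random


def samplesort(local_keys, sample_fraction=0.02, seed=0):
    """Simulate Samplesort; local_keys[i] holds the keys of processor i."""
    rng = random.Random(seed)
    p = len(local_keys)

    # Phase 1: every processor contributes a random sample of its keys.
    sample = []
    for keys in local_keys:
        size = min(len(keys), max(1, int(sample_fraction * len(keys))))
        sample.extend(rng.sample(keys, size))
    sample.sort()  # stand-in for the parallel Bitonicsort of the sample

    # Phase 2: pick P-1 splitters (evenly spaced: an assumption of this sketch).
    splitters = [sample[(i * len(sample)) // p] for i in range(1, p)]

    # Phase 3: route each key to its bucket via binary search on the splitters.
    buckets = [[] for _ in range(p)]
    for keys in local_keys:
        for key in keys:
            buckets[bisect.bisect_right(splitters, key)].append(key)

    # Phase 4: final local sort on every processor.
    for bucket in buckets:
        bucket.sort()  # stand-in for sequential Quicksort
    return buckets


def bucket_expansion(buckets):
    """b(i) / (N/P) for every bucket i."""
    average = sum(len(b) for b in buckets) / len(buckets)
    return [len(b) / average for b in buckets]
```

Concatenating the returned buckets in processor order gives the sorted sequence, and `bucket_expansion` reports how far each bucket deviates from the ideal size $N/P$.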

local keys. For a given sorting problem, a satisfying compromise has to be found between speed performance and memory utilization. A drawback of this algorithm is the possibly large bucket-expansion. In the last routing phase, the keys are ordered according to the splitters, i. e. a good choice of the splitters will result in a good key distribution and a bad one in a much worse bucket-expansion, respectively. Our experiments show that for samples containing 2% of the total number of keys, the bucket-expansion converges to a value of 1 for increasing numbers of keys.

Time (s)

.... ....... ...... . ...... . .. ...... .... ......... ..... . . . . . ... .. ..... ........ ..... ....... .......... ... ...... .......... ..... . ....... .......... .. ..... ....... ................. . ............... ..... .... . ..... . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ............... .... ..... ...... ...... ........ ..................................... ................ .. ............................................................. ..... ...... ...... ..... ...... ..... ................................................. ...... ...............................................................................................................

1024

Assignments

Communication 4096

Comparisons

16384

4.2

65536

The algorithm. Radixsort [7, Subsection 9.3] is a special purpose method. All sorting keys are considered to be bitstrings (k?1 : : :0 ), 0 ; : : :; k?1 2 f0; 1g. Rather than performing comparisons between keys, the bitstrings are divided in R parts, which are used as subkeys in several runs. In the i’th subsort, the keys are sorted according to the i’th part of the bitstring. For details on our parallel implementation of Radixsort, see [10].

Keys per processor Figure 6: Time consumption of Samplesort for P

=

Radixsort

1024

Because Samplesort uses a general routing phase, the congestion is larger than that found for Bitonicsort and for the periodic grid sorting methods and, what is even more important, it is not exactly predictable. So the theoretical times of the analysis do not match the measured times exactly (cf. Figure 6) and can only serve as upper bound to the real running times.

Practical and theoretical time consumption. Figure 7 shows the curves of the 1024 node grid, according to the formulas of our theoretical model. The time for data movements is very small, so that the curve is nearly identical to the X-axis. The rest of the total time is consumed by communication. As mentioned above, the dotted curve shows the measured time of our implementation on a 32 32 grid. Radixsort is a non-comparison based algorithm and therefore makes no use of key comparisons. Thus, this component is absent in Figure 7.

Advantages and disadvantages. During the routing phase, the keys are moved within relatively large packets. This results in an acceptable link performance. But for some calculations during the routing and gossiping, additional packets of smaller size are used. Since these packets are transmitted with a worse link performance, this additional overhead deteriorates the overall communication behavior. On the other hand, there are many different destination buckets for the keys held initially on one processor. So the routing distances can become very long, which affects dilation and congestion unfavorably. Synchronization only takes place at the beginning of the algorithm, during the sorting of the sample and the distribution of the splitters. The local computations and the routing phase performed afterwards are done in a more or less asynchronous way. Therefore, idle times are very small. Due to its more sophisticated structure, the coding complexity of Samplesort is higher than that of the periodic methods. The relatively large code segments decrease the memory available to store

[Figure 7: Time consumption of Radixsort for P = 1024. x-axis: keys per processor (1024, 4096, 16384, 65536); y-axis: time (s), ticks 20, 40, 60; curves: communication, assignments.]

Advantages and disadvantages. Main operations of Radixsort are routing and prefix computation, both of which use small data packets that decrease the throughput of the communication links. Generally, the communication aspects correspond to those already found for Samplesort. Because Radixsort iterates routing and prefix computations, global synchronization has to be performed frequently. The running time between two synchronizations can differ considerably from processor to processor, because communication operations use different packet sizes, dilations, and congestions. Therefore, it is not possible to evaluate the synchronization behavior as exactly as for the periodic sorting algorithms.

The number of iterations is generally low. It is governed by the number of substrings that results from partitioning the bitstring. On the one hand, more substrings lead to a higher number of iterations but decrease the computational complexity of every round, because the handled substrings are shorter. On the other hand, fewer substrings cause a low number of iterations and a higher computational complexity in every round. So the best number of substrings has to be found for a given parallel system. Practice has shown that 3 substrings, leading to a substring length of up to 11 bits, is best for our purposes.

The coding complexity is similar to that of Samplesort. A lot of technical overhead causes a complex structure and decreases the number of keys that can be stored locally.

... of the running time on N. For smaller values of N/P, the influence of the idle times increases and so the achievable link performance decreases. Thus the curves for these algorithms rise to the left. As already mentioned, the total running time of the comparison-based algorithms mainly depends on the number of split&merge operations. In the theoretical discussions, it was mentioned that Gridsort and Shearsort require nearly the same number of split&merge operations, while Bitonicsort needs only about a quarter of this number on 1024 processors. The difference between the periodic grid sorters originates mainly from Gridsort’s additional use of wrap-around edges. This leads to some class D communication. So it is likely that on architectures that have such edges, Gridsort would outperform Shearsort, because it requires fewer split&merge operations.

In contrast to the comparison-based algorithms, Samplesort and Radixsort route the keys directly to their destination processors. This is the main reason for their better performance. Because these algorithms contain parts that depend only on P (prefix computation in Radixsort, splitter distribution in Samplesort), their time curves rise for small values of N/P. A drawback of these methods is that they result in complex code. As a consequence, the number of keys per processor is more limited than in the easy-to-implement comparison-based methods.

Other authors achieve similar results. For example, Blelloch et al. [4] implemented Bitonicsort, Radixsort, and Samplesort on a CM-2, a SIMD machine. Their results largely correspond to ours: for small numbers of local keys N/P they found that Bitonicsort is best, and that for larger numbers Samplesort is the best method. The differences in total sorting times compared to our implementation can be explained as follows. On the one hand, the CM-2 has specialized routing hardware, which prevents the processor nodes from being slowed down by routing operations, and on the other hand, the architecture itself is a hypercube, which is a great advantage for Bitonicsort. Our hardware, however, is a MIMD machine without hardware routing and with an underlying two-dimensional grid topology. This often requires a link mapping that causes a performance loss.

Other analogous results were achieved by Hightower, Prins, and Reif [11], who implemented Bitonicsort and Samplesort on an MP-1. Due to the small amount of local storage of the MP-1, they only give results for N/P values up to 1024, and so their Samplesort implementation reaches the performance of Bitonicsort but does not outperform it. Like the CM-2, the MP-1 is capable of hardware routing, which is very advantageous for algorithms that rely heavily on routing. Finally, Stricker’s implementation [21] of Bitonicsort uses a torus consisting of iWarp nodes. The general behavior is similar to that of the above-mentioned CM-2 algorithm. By using the iWarp, Stricker can additionally improve the speed for small values of N/P.

5 Discussion

To compare the overall performance of the algorithms implemented on our 1024-processor grid directly, Figure 8 presents sorting curves normalized to the time one processor spends per key. This makes it possible to compare the results of our implementations with results found in the literature, even if the machines differ substantially.

[Figure 8 plot. y-axis: time per key per proc., ticks 0 to 8000.]
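The normalization used for these curves can be written as a one-line computation. The following is a minimal sketch, assuming T is the measured total sorting time, N the total number of keys, and P the number of processors. The function name and the sample numbers are illustrative only and are not measurements from this paper.

```python
def time_per_key_per_processor(total_time, n_keys, n_procs):
    """Time one processor spends per key it holds: T / (N / P)."""
    return total_time * n_procs / n_keys


# Hypothetical run: 10 s to sort 4096 keys per processor on 1024 processors.
normalized = time_per_key_per_processor(10.0, 4096 * 1024, 1024)
```

Because the result depends only on the time and on N/P, runs on machines with different processor counts can be placed on the same axis.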