
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 5, NO. 1, JANUARY 1994

The Hierarchical Hypercube: A New Interconnection Topology for Massively Parallel Systems


Qutaibah M. Malluhi and Magdy A. Bayoumi, Senior Member, IEEE

Abstract-Interconnection networks play a crucial role in the performance of parallel systems. This paper introduces a new interconnection topology that is called the hierarchical hypercube (HHC). This topology is suitable for massively parallel systems with thousands of processors. An appealing property of this network is the low number of connections per processor, which enhances the VLSI design and fabrication of the system. Other alluring features include symmetry and logarithmic diameter, which imply easy and fast algorithms for communication. Moreover, the HHC is scalable; that is, it can embed HHC's of lower dimensions. The paper presents two algorithms for data communication in the HHC. The first algorithm is for one-to-one transfer, and the second is for one-to-all broadcasting. Both algorithms take O(log_2 k) time, where k is the total number of processors in the system. A wide class of problems, the Divide & Conquer class (D&Q), is shown to be easily and efficiently solvable on the HHC topology. Parallel algorithms are provided to describe how a D&Q problem can be solved efficiently on an HHC structure. The solution of a D&Q problem instance having up to k inputs requires a time complexity of O(log_2 k).

Index Terms-Broadcasting, diameter, divide and conquer, hypercube, interconnection network, multiprocessor, n-cube, parallel system.

I. INTRODUCTION

IN the last decade, progress in VLSI technology has produced a technological environment in which massively parallel computing systems with hundreds or even thousands of processors are feasible to implement. One of the dominating factors that govern the performance of a parallel system is the underlying communication network and how it fits the algorithms to be executed on the system. Several topologies for interconnection networks have been proposed to fit different styles of computation. The hypercube topology, proposed by Squire and Palais in 1963 [13], has proved itself a very powerful topology in which many other topologies, such as rings, trees, and meshes, can be embedded. Consequently, it accommodates numerous classes of problems [6], [10], [11]. Many other topologies revolving around the basic hypercube have been proposed by researchers [1], [2], [3], [4], [5], [8], [9], [15]. These topologies, in one way or another, are hypercubes with a slight generalization or modification tending to improve some of the hypercube properties.

Manuscript received October 23, 1991; revised May 27, 1992. This paper was recommended by Associate Editor J. R. Jump. The authors are with the Center for Advanced Computer Studies, University of Southwestern Louisiana, Lafayette, LA 70504. IEEE Log Number 9214393.

When used in large systems, however, the hypercube has some practical limitations. In an n-cube (a hypercube of degree n or, equivalently, a hypercube with 2^n nodes), each processor (node) is connected to n other processors. As the degree n increases, each node becomes more difficult to design and fabricate because of the larger fanout. In fact, this is the most serious drawback of hypercubes, and it is often considered the main limiting factor for using hypercubes in large systems [10], [12]. In addition, a high-degree n-cube also has the problem of matching the internal processor speed with the available wide bandwidth [12]. In order to make use of the n channels of a given processor simultaneously, the processor should be capable of feeding the n channels with data concurrently. Moreover, the processor should be able to consume the simultaneous arrival of data on its channels. Thus, the node functional units and internal bus speeds should match the speed of the fanout bandwidth. These requirements increase the complexity and cost of the system and reduce its practical feasibility. To overcome these problems, the Shuffle Exchange network [14] and the Cube Connected Cycles (CCC) [10] were proposed as substitutes for the Hypercube network (HC). The CCC takes into consideration the practical limitation of increasing the number of input-output (I/O) ports by restricting the node fanout to three. However, such a restriction may reduce the CCC performance.

In this paper, we propose a new hierarchical structure for interconnection networks in parallel systems. This structure is referred to as the Hierarchical Hypercube (HHC). The number of links in the HHC forms a compromise between those of the HC and the CCC. As a result, the achieved performance of the HHC is superior to that of the CCC. Taking time as a cost measure, the cost of executing a large class of algorithms on the CCC is shown to be about 30% more than the cost of executing them on the HHC (see Section VI). A k-node HHC, where k is a power of 2, is a symmetric structure consisting of a father hypercube whose nodes are themselves hypercubes rather than simple processors. Unlike the hypercube, the HHC implementation is feasible even when k is very large. It requires O(log log k) connections per processor. For example, if k is 2^32 = 4 Giga processors, only five connections per processor are required. The HHC is a homogeneous and symmetric structure in which no node plays a special role.

Recently, several hierarchical topologies were proposed for massively parallel systems [3], [5], [9]. In these hierarchical architectures, each level of the hierarchy comprises a network of modules. Each module contains a set of lower-level modules interconnected in some


fashion, as well as a designated I/O node. The I/O node is used as a communication processor to handle intermodule communication. A major drawback of these structures is that the processors are not identical. This lack of symmetry complicates the VLSI design and leads to uneven traffic distribution over the processors. Moreover, the I/O processors in these architectures suffer from large degrees, leading to high contention around these nodes and thus a potential for degradation of performance. In addition, the special role played by these I/O processors suggests a diminished fault tolerance.

Being a hierarchical structure, the HHC bears the advantages usually gained by hierarchy. In general, hierarchy is a useful means for modular design. In addition, hierarchical structures are capable of exploiting the locality of reference (communication), and they are fault tolerant. Other attractive properties of the HHC structure are logarithmic diameter and a topology inherited from, and closely related to, the hypercube topology. The former property implies fast communication, and the latter implies easy mapping of operations from the HC to the HHC. The HHC can emulate the hypercube for a large class of problems (Divide & Conquer) without a significant increase in processing time. The HHC can embed rings and HHC's of lower dimension. In addition, the HHC embeds the CCC. As a result, the performance of the HHC is in the worst case equivalent to the performance of the CCC.

This paper is organized as follows. Section II describes the HHC structure. Section III presents a deeper investigation of the HHC topology by illustrating some of its topological properties. In Section IV, we address the scalability of the HHC. Section V describes algorithms for data communication in the HHC. In Section VI, we show how the HHC can emulate the hypercube for the D&Q type of algorithms, and then we discuss its performance in comparison to that of the HC and the CCC.

II. HHC STRUCTURE

The structure of an n-HHC consists of three levels of hierarchy. To simplify the description of the HHC structure, let's assume for the time being that n = 2^m + m for a non-negative integer m. (This condition is relaxed later on.) At the lowest level of hierarchy, we have a pool of 2^n nodes. These nodes are grouped into clusters of 2^m nodes each, and the nodes in each cluster are connected to form an m-cube called the Son cube, or Scube. The set of Scubes constitutes the second level of hierarchy. A father cube, called the Fcube, connects the 2^(n-m) = 2^(2^m) Scubes in a hypercube fashion. Edges of the Scubes are called internal edges, and edges of the Fcube are referred to as external edges. An Scube having 2^m nodes is connected to exactly 2^m external edges, each incident to one node of the Scube. Fig. 1 shows two adjacent Scubes in an 11-HHC (an HHC with n = 11 and m = 3).


Fig. 1. Two adjacent Scubes in an 11-HHC.

Formally, for n = 2^m + m, an n-HHC is a 2^n-node graph G = (V, E). The sequence of binary bits (b_{n-1} b_{n-2} ... b_0) is the identifier, or address, of a node. The address of a node is divided into two parts, the S part and the P part, and is represented as a two-tuple (s, p). The S part is the (n - m)-bit binary number (b_{n-1} b_{n-2} ... b_m), representing the address of the Scube in which the node is located. The P part is the m-bit binary number (b_{m-1} b_{m-2} ... b_0), representing the address of the node (processor) within the Scube. Integers and their binary encodings are used interchangeably in this paper; thus, in some contexts, the S part and the P part are used to refer to the integer values whose binary encodings are (b_{n-1} b_{n-2} ... b_m) and (b_{m-1} b_{m-2} ... b_0), respectively. We denote the S and P parts of a node by the node name subscripted with s and p, respectively. For example, the S and P parts of node A are As and Ap, respectively. The set of edges E is the union of two sets E_int and E_ext, which are the sets of internal and external edges, respectively, as the following equations illustrate:

E = E_int ∪ E_ext, where
E_int = { ((s, p), (s, p')) | p ⊕ p' = 2^i for some 0 ≤ i < m }
E_ext = { ((s, p), (s', p)) | s ⊕ s' = 2^p }.

Thus, an HHC node (s, p) is connected to the following:
1) m nodes in the same Scube through internal edges. These are the nodes whose addresses are found by changing only one bit of the P part of the address.
2) Exactly one node in a neighboring Scube through an external edge, corresponding to the change of the pth bit of the S part of the address.
To simplify the description of the HHC structure, it was assumed that n = 2^m + m. However, it is easy to generalize the structure for an arbitrary n by choosing m to be the smallest integer such that n ≤ 2^m + m. In this situation, some nodes will not possess an external edge. A 5-HHC is shown in Fig. 2. Notice that even though the structure looks like that of a CCC, it is not a CCC: the nodes of an Scube are connected in a cube fashion, not in a cyclic fashion. When the condition n = 2^m + m is satisfied, the HHC is referred to as a perfect HHC.
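As a quick illustration of the addressing and adjacency rules above, the following Python sketch (the function names are ours, not the paper's) computes m for a given n, splits an address into its S and P parts, and lists a node's neighbors.

def hhc_m(n):
    """Smallest m such that n <= 2**m + m (for a perfect HHC, n = 2**m + m)."""
    m = 0
    while n > 2**m + m:
        m += 1
    return m

def split(addr, n, m):
    """Split an n-bit node address into its S part (high n-m bits) and P part (low m bits)."""
    return addr >> m, addr & ((1 << m) - 1)

def neighbors(addr, n, m):
    """Neighbors of node (s, p): m internal neighbors, plus one external
    neighbor whenever bit p exists in the S part (always true in a perfect HHC)."""
    s, p = split(addr, n, m)
    nbrs = [(s << m) | (p ^ (1 << i)) for i in range(m)]        # internal edges
    if p < n - m:                                               # external edge exists
        nbrs.append(((s ^ (1 << p)) << m) | p)
    return nbrs

# Example: the 5-HHC of Fig. 2 has m = 2, so every node has at most 3 links.
n = 5
m = hhc_m(n)
print(m, sorted(neighbors(0b00000, n, m)))    # 2 [1, 2, 4]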



Fig. 2. A 5-HHC.

III. HHC TOPOLOGICAL PROPERTIES

In this section, we are going to explore the HHC structure more deeply, and we will prove some of its topological properties. Maybe the most attractive property of the HHC is its compliance with the technological constraints of VLSI. The number of links coming out of a node (the degree of the node) is much less than that of the ordinary hypercube. This property is important for systems that have a large number of processors, in which case, as discussed earlier, configuring the system as a hypercube becomes infeasible. Considering the current state of the art, systems with a large number of processors can be thought of as systems with 1000 or more processors, requiring a hypercube of degree greater than or equal to 10.

Let's calculate the degree of an n-HHC (recall that the degree of a graph is the highest among the degrees of its nodes). Obviously, if n = 2^m + m (i.e., the HHC is perfect), all nodes have a degree of m + 1, because each node is incident to one external edge and m internal edges. When n < 2^m + m, the HHC nodes can be partitioned into two groups with respect to their degrees. In one group, the nodes possess external edges, and hence their degree is m + 1. In the other group, nodes do not have external edges, and their degree is m. Thus, we can conclude that the degree of an n-HHC is m + 1. This helps us to compare the maximum number of connections per processor in the HC, HHC, and CCC topologies. Fig. 3 shows the degrees of the HC, HHC, and CCC for various practical values of n.

Lemma 1: The number of edges in an n-HHC is (m(2^m - 1) + n) 2^(n-m-1).
Proof: Recall that the number of edges in an m-cube is (2^m m)/2. In an n-HHC, there are 2^(n-m) Scubes, each having (2^m m)/2 edges (internal edges). In addition, because the Fcube is an (n - m)-cube, there are (2^(n-m) (n - m))/2 external edges in the n-HHC. Adding up the internal and external edges, we get that the total number of edges is equal to 2^(n-m-1) (m 2^m + n - m) = 2^(n-m-1) (m(2^m - 1) + n). □
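Lemma 1 can be checked numerically by generating the edge set directly from the adjacency rules of Section II and comparing its size with the formula. The sketch below uses our own helper names.

def hhc_edge_count(n, m):
    """Brute-force edge count of an n-HHC from the adjacency rules of Section II."""
    edges = set()
    for addr in range(2**n):
        s, p = addr >> m, addr & ((1 << m) - 1)
        for i in range(m):                       # internal edges
            edges.add(frozenset({addr, (s << m) | (p ^ (1 << i))}))
        if p < n - m:                            # external edge (if it exists)
            edges.add(frozenset({addr, ((s ^ (1 << p)) << m) | p}))
    return len(edges)

def lemma1(n, m):
    """(m(2^m - 1) + n) 2^(n-m-1), the count claimed by Lemma 1."""
    return (m * (2**m - 1) + n) * 2**(n - m - 1)

for n, m in [(5, 2), (6, 2), (11, 3)]:
    assert hhc_edge_count(n, m) == lemma1(n, m), (n, m)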


From the above lemma, it is easy to see that for a perfect HHC (i.e., n = 2^m + m), the number of edges is 2^(n-1) (m + 1). Fig. 4 shows the total number of edges in each of the HC, HHC, and CCC for different values of n.

Define parity(X), the parity of a node X (or a binary number X), to be the exclusive-OR of the bits of the address of X. It is clear that neighboring nodes have different parities. Lemma 2 below states that a cycle of any odd length can never be mapped onto an HHC.

Lemma 2: There are no cycles of odd length in an n-HHC.
Proof: Consider a cycle X_1, X_2, ..., X_t, where X_t = X_1. As we travel from X_i to X_{i+1}, 1 ≤ i < t, the parity changes. Since X_1 = X_t, there must be an even number of changes. Thus, the length of the cycle is even. □

Before going on with the HHC properties, we need to prove a result for the hypercube. In a hypercube, nodes are given addresses such that nodes A and B are adjacent if and only if their addresses differ in exactly one bit. A Hamiltonian path of a graph is a path that passes over each node of the graph once and only once. The problem of finding a Hamiltonian path in an n-cube is the problem of finding a sequence of the 2^n distinct n-bit binary numbers such that any two consecutive numbers differ in only one bit. Such a sequence of binary numbers exists and is called a Gray code. For example, the sequence of binary numbers (000, 001, 011, 010, 110, 111, 101, 100) is a 3-bit Gray code representing a Hamiltonian path in a 3-cube starting at node 000 and ending at node 100.

Lemma 3: Let P_1 and P_2 be two nodes of a hypercube. There is a Hamiltonian path for the hypercube that starts at P_1 and ends at P_2 if and only if parity(P_1 ⊕ P_2) = 1.
Proof: Necessary part: Assume that there is a Hamiltonian path between P_1 and P_2. This path involves 2^n nodes, where n is the degree of the hypercube. This implies that the length of the path is 2^n - 1, which is odd; hence there is an odd number of parity changes between P_1 and P_2, and therefore parity(P_1 ⊕ P_2) = 1.
Sufficient part: We have parity(P_1 ⊕ P_2) = 1; the existence of a Hamiltonian path from P_1 to P_2 is to be proved. The proof is by induction on the degree g of the hypercube. The result is trivial for g = 1. The induction hypothesis is that the lemma is true for g = n. In an (n+1)-cube, let P_1 and P_2 be any two nodes such that parity(P_1 ⊕ P_2) = 1; then there is at least one bit difference between P_1 and P_2. Without loss of generality, assume that bit_n(P_1) is not equal to bit_n(P_2). Let P_1 = xX and P_2 = x'Y, where x is a bit value, X and Y are sequences of n bits each, and x' is the complement of x. Let Z be a sequence of n bits such that parity(X ⊕ Z) = 1. By the induction hypothesis, there is a Hamiltonian path X = A_0, A_1, ..., A_{2^n - 1} = Z passing over all of the possible n-bit binary numbers. Because parity(xX ⊕ x'Y) = 1, we have parity(X ⊕ Y) = 0; and since parity(X ⊕ Z) = 1, it follows that parity(Z ⊕ Y) = 1. By the induction hypothesis again, there exists a Hamiltonian path Z = B_0, B_1, ..., B_{2^n - 1} = Y in an n-cube. It is possible to construct a Hamiltonian path between P_1 and P_2 in an (n+1)-cube as follows: P_1 = xX = xA_0, xA_1, ..., xA_{2^n - 1} = xZ, x'Z = x'B_0, ..., x'B_{2^n - 1} = x'Y = P_2. Thus, the lemma is true for hypercubes of any degree. □
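The sufficient part of Lemma 3 is constructive, and the recursion in the proof can be rendered directly in code. The sketch below uses our own function names; the choice of the intermediate corner z is one of several valid choices, so the path produced is just one of many.

def ham_path(dims, p1, p2):
    """Hamiltonian path over the subcube spanned by the bit positions in `dims`
    (all other bits of p1 and p2 must agree), from p1 to p2.
    Following the inductive construction in the proof of Lemma 3:
    p1 and p2 must differ in an odd number of the bits in `dims`."""
    if len(dims) == 1:
        return [p1, p2]
    d = next(b for b in dims if (p1 ^ p2) & (1 << b))   # a dimension where they differ
    rest = [b for b in dims if b != d]
    z = p1 ^ (1 << rest[0])                             # odd distance from p1 within `rest`
    first = ham_path(rest, p1, z)                       # stays in p1's half of dimension d
    second = ham_path(rest, z ^ (1 << d), p2)           # crosses d once, ends at p2
    return first + second

# Example: a Hamiltonian path of the 3-cube from 000 to 100 (parity of 000 xor 100 is 1).
path = ham_path([0, 1, 2], 0b000, 0b100)
print([format(v, "03b") for v in path])
assert len(set(path)) == 8
assert all(bin(a ^ b).count("1") == 1 for a, b in zip(path, path[1:]))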


Fig. 3. Comparing the degrees of the HC, HHC, and CCC.


Fig. 4. Total number of edges for the HC, HHC, and CCC.

Theorem 1: In an n-HHC, dist((s_1, p_1), (s_2, p_2)), the distance between any two nodes (s_1, p_1) and (s_2, p_2), is less than or equal to 2^(m+1).
Proof: A path of length ≤ 2^(m+1) between any two nodes (s_1, p_1) and (s_2, p_2) will be constructed. Each edge in a path between (s_1, p_1) and (s_2, p_2) corresponds to one bit change in the P part of the address (internal edge) or in the S part of the address (external edge). Let's concentrate on the changes of the P part throughout the path. In the path to be constructed, the changes of the P part correspond to a Hamiltonian path in an m-cube starting at p_1 and ending at p_2 if parity(p_1 ⊕ p_2) = 1, or ending at a neighbor of p_2 if parity(p_1 ⊕ p_2) = 0. By Lemma 3, such a Hamiltonian path always exists. If the Hamiltonian path reaches a neighbor of p_2, then one additional internal edge is required to reach p_2. Thus, the number of internal edges involved is ≤ 2^m. While the P part is changing over all of the possible m-bit binary values, if the P part is equal to p_i and bit_{p_i}(s_1 ⊕ s_2) = 1 (which means that the p_i-th bit of s_1 does not equal the p_i-th bit of s_2), then move over the external edge changing the p_i-th bit of the S part. This operation takes



us one step toward an Scube closer to Scube s_2. Since the P part is changing over all of the possible values, we are sure to reach s_2. Obviously, the number of external links required is ≤ 2^m, so the total number of links (external and internal) in the path is ≤ 2^(m+1). □

Fig. 5. General path structure in an HHC.

Let's investigate the shape of a path in an HHC. Fig. 5 shows a general path between nodes A and B. Starting at node A, we have the choice of moving in the Scube to which A belongs, using internal edges, or taking A's external link (if it exists) to node C in a new Scube. At node C, we cannot use C's external edge, because it takes us back to A. This means that at least one internal edge should follow. Similarly, after any external edge except the last, there should be at least one internal edge. For this reason, the number of internal edges in a path is greater than or equal to the number of external edges minus one.

Theorem 1 above states that for an n-HHC, 2^(m+1) is an upper bound on the distance between a pair of nodes. Theorem 2 shows that this bound is tight for the perfect HHC. In other words, in a perfect n-HHC, there exist two nodes the distance between which is 2^(m+1). As a result, the diameter D (the maximum among the minimum distances between pairs of nodes) of a perfect HHC is actually 2^(m+1).

Theorem 2: The diameter D of a perfect n-HHC is 2^(m+1), and it is attained between any two nodes (s, p) and (s*, p*), where s* is the one's complement of s and parity(p ⊕ p*) = 0.
Proof: Let's find the minimum number of edges in a path between (s, p) and (s*, p*). There are at least n - m = 2^m external edges to move from Scube s to Scube s*. As discussed earlier, at least one internal link is associated with each external link except the last. By Lemma 3, after the last external edge, the P part cannot be p*, because parity(p ⊕ p*) = 0; hence at least one internal link is needed after the last external link. From this, the number of internal links is ≥ 2^m, and dist((s, p), (s*, p*)) ≥ 2^(m+1). But by Theorem 1, D ≤ 2^(m+1). Hence, D = 2^(m+1) and dist((s, p), (s*, p*)) = D. □

Example: In a 6-HHC, n = 6 and m = 2, so the diameter D = 2^3 = 8. Two nodes between which the distance is D are nodes (0000,00) and (1111,00), because the S parts are complements and parity((00) ⊕ (00)) = 0. One possible shortest path between these nodes is ((0000,00), (0001,00), (0001,01), (0011,01), (0011,11), (1011,11), (1011,10), (1111,10), (1111,00)). Notice that the edges in the path alternate: an external edge followed by an internal edge, followed by an external edge, followed by an internal edge, and so on. □

Because 2^(m+1) is of the same order as the base-2 logarithm of the total number of nodes in an n-HHC, the HHC diameter has a logarithmic order. The implication of a logarithmic-order diameter is fast communication. A message from any node to any other node takes O(2^m) = O(n) hops to reach its destination (see Section V). The CCC diameter is known to be 1.25 x 2^(m+1) - 2. Thus, the HHC diameter is about 20% less than that of the CCC. Fig. 6 compares the diameters of the HC, HHC, and CCC.

Fig. 6. Comparing the diameters of HC, HHC, and CCC topologies.
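Theorem 2 can be confirmed for the perfect 6-HHC by exhaustive breadth-first search. The short sketch below (helper names are ours) rebuilds the adjacency rule of Section II and measures eccentricities.

from collections import deque

def neighbors(addr, n, m):
    """Adjacency of an n-HHC node, as in Section II."""
    s, p = addr >> m, addr & ((1 << m) - 1)
    nbrs = [(s << m) | (p ^ (1 << i)) for i in range(m)]
    if p < n - m:
        nbrs.append(((s ^ (1 << p)) << m) | p)
    return nbrs

def ecc(src, n, m):
    """Eccentricity of src (longest shortest path from src), by BFS."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in neighbors(u, n, m):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values()), dist

# Perfect 6-HHC: n = 6, m = 2, diameter 2**(m+1) = 8 (Theorem 2).
n, m = 6, 2
diam = max(ecc(v, n, m)[0] for v in range(2**n))
_, dist = ecc(0b0000_00, n, m)          # node (0000, 00)
print(diam, dist[0b1111_00])            # expect 8 and 8 (distance to (1111, 00))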

IV. RECURSIVE DEFINITION OF HHC STRUCTURE

This section addresses the scalability feature of the HHC structure, which has a number of attractive consequences. By scalability, we mean the ability to define an (n+1)-HHC in terms of the n-HHC. Scalability implies that the HHC performance does not decline when the input size does not match the number of processors in the system. This is a significant improvement over the CCC. For the D&Q class of algorithms (see Section VI), the CCC emulates the hypercube by pipelining its operations. A stage of the pipeline performs a data rotation in the CCC cycles followed by an operation on the data. If the input size matches the CCC size, each processor in a cycle will have exactly one data item. As a result, the rotate-operate pipeline will always be full after the initial setup period. When the input size is smaller than the total number of processors, however, there will be a performance penalty. Consider, for example, a 2^7-node CCC. Each cycle will have eight processors, four of which are connected to other cycles by external edges. Suppose that a D&Q algorithm is to be executed on the CCC and that the input size is 2^6. How should the input be distributed over the nodes? One suggestion is to distribute the data items in such a manner that four inputs are stored in each cycle (each in the local memory of a processor). The other suggestion is to distribute the data items so that eight inputs are stored in a cycle, filling up eight of the existing cycles. In either case, it would be more efficient if we ran the algorithm on a smaller CCC of 2^6 nodes. When an n-HHC (n ≥ 6) is used instead of the CCC, the above problem disappears, because a t-HHC for any t < n can be embedded in the n-HHC.

Another consequence of being able to construct an (n+1)-HHC from lower-dimension HHC's is the increased fault tolerance arising from the ability to use the smaller-dimension HHC's when an unpleasant event affects the links or the nodes of the original HHC. In building up an (n+1)-HHC from an n-HHC, we differentiate between two cases: where the n-HHC is perfect and where it is not perfect. Below it is shown what to do in each of these cases. We use || to refer to the concatenation operator.


if n = 2^m + m then {the n-HHC is perfect}
  1. Duplicate each Scube and give the same node names to the newly generated cubes.
  2. Rename each node (As, Ap) of the original n-HHC as (As, 0||Ap).
  3. Rename each newly generated node (As, Ap) as (As, 1||Ap).
  4. Connect (As, 0Ap) with (As, 1Ap).



else {the n-HHC is not perfect}
  1. Duplicate the n-HHC.
  2. Rename each node (As, Ap) of the original n-HHC as (0||As, Ap).
  3. Rename each node (As, Ap) of the newly generated n-HHC as (1||As, Ap).
  4. Connect (0As, Ap) with (1As, Ap) if Ap = the number of bits in As.
end {if}

Fig. 7. Construction of 4-HHC from 3-HHC and 5-HHC from 4-HHC.

Fig. 7 illustrates the construction of the 4-HHC from the 3-HHC and then the 5-HHC from the 4-HHC. A simple application of the principle of mathematical induction, together with the construction procedure above, shows the following lemma.
Lemma 4: A t-HHC can be embedded into the n-HHC graph for any t ≤ n.
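The two construction cases can be written down compactly. The Python sketch below is our own rendering (it represents nodes as (S, P) integer pairs and edges as two-element frozensets, which is not the paper's notation); repeated application reproduces the growth shown in Fig. 7.

def double_hhc(nodes, edges, n, m):
    """Build an (n+1)-HHC from an n-HHC, following the construction above."""
    perfect = (n == 2**m + m)
    if perfect:
        # the new dimension extends the P part: duplicate the Scubes only
        relabel = lambda sp, b: (sp[0], (b << m) | sp[1])
        new_m = m + 1
    else:
        # the new dimension extends the S part: duplicate the whole n-HHC
        relabel = lambda sp, b: ((b << (n - m)) | sp[0], sp[1])
        new_m = m
    new_nodes = {relabel(v, b) for v in nodes for b in (0, 1)}
    new_edges = set()
    for e in edges:
        u, v = tuple(e)
        internal = (u[0] == v[0])                 # same Scube?
        copies = (0, 1) if (internal or not perfect) else (0,)
        for b in copies:
            new_edges.add(frozenset({relabel(u, b), relabel(v, b)}))
    for s, p in nodes:                            # step 4: connect the two copies
        if perfect or p == n - m:
            new_edges.add(frozenset({relabel((s, p), 0), relabel((s, p), 1)}))
    return new_nodes, new_edges, n + 1, new_m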

V. DATA COMMUNICATION IN HHC

Effective data communication is crucial in parallel systems. In this section, we introduce two efficient algorithms for data transfer in the HHC. The first is for one-to-one communication, and the second is for one-to-all communication (broadcasting). Efficient one-to-one transfer is essential because it is the most basic type of communication. Likewise, fast broadcasting is important because it is employed in a large number of well-known algorithms, such as Gaussian elimination, the Conjugate Gradient algorithm [12], matrix-vector multiplication, matrix-matrix multiplication, LU factorization, and the Householder transformation [7], as well as in various image processing applications.

It is natural to try to find optimal communication algorithms that incur the smallest possible number of time units. However, the hierarchical structure prescribes that we deal with the problem at various levels of hierarchy. Thus, instead of finding the globally optimal communication procedure, we divide the problem logically into two parts. The first part is concerned


with data transfer within the Fcube (between Scubes), and the second part is for data communication within the Scube. An optimal communication procedure is used for each of these two parts. The two locally optimal procedures are merged together in a way that produces a near-optimal communication algorithm for the HHC. The division of the communication algorithms into two parts is rather conceptual. In the actual algorithms presented, these two parts are molded together, resulting in algorithms that route through the entire HHC and that mix and interleave communication steps of the two conceptual parts.

Definition 1: G_m is defined to be an m-bit Gray code obtained by the recursion G_1 = (0, 1), G_{m+1} = (0G_m, 1G_m^R), where G_m^R is the sequence obtained by reversing the order of the numbers of G_m, and 0G_m / 1G_m is the sequence obtained by prefixing 0/1 to each element in the sequence G_m.

Definition 2: g_m(X) is defined to be the Gray encoding of the m-bit number X = (x_{m-1}, x_{m-2}, ..., x_1, x_0); equivalently, g_m(X) is the Xth number in G_m. Let g_m(X) = X' = (x'_{m-1}, x'_{m-2}, ..., x'_1, x'_0); then the conversion of binary into Gray encoding is defined by x'_i = x_i ⊕ x_{i+1} (taking x_m = 0). Let g_m^{-1} be the inverse of g_m; then the opposite conversion of Gray into binary encoding is given by x_i = x'_i ⊕ x'_{i+1} ⊕ ... ⊕ x'_{m-1}.

Definition 3: G_m^k, where 0 ≤ k ≤ 2^m - 1, is defined as the sequence obtained by g_m^{-1}(k) left rotations of G_m. Thus, G_m^0 = G_m, and G_m^k is the rotation of G_m until k is the first element in the sequence.

Definition 4: The partial order ≺_k (G^k-less) is defined over the set of integers less than 2^m as follows: A ≺_k B (read as A is G^k-less than B) if and only if A precedes B in G_m^k. Conversely, B ≻_k A (read as B is G^k-greater than A) if and only if A ≺_k B.

Example: Using the recursion in Definition 1, we get the following values:
G_2 = (00, 01, 11, 10)
G_3 = (000, 001, 011, 010, 110, 111, 101, 100) = (0, 1, 3, 2, 6, 7, 5, 4).
From G_3, we can deduce that the values of g_3(2), g_3(3), and g_3^{-1}(2) are 3, 2, and 3, respectively. By Definition 3, we get the following:
G_3^0 = G_3 = (0, 1, 3, 2, 6, 7, 5, 4)
G_3^2 = (2, 6, 7, 5, 4, 0, 1, 3)
G_3^7 = (7, 5, 4, 0, 1, 3, 2, 6).
Notice that any G_m^k represents a Hamiltonian path in an m-cube. From Definition 4, we have the relations 3 ≺_0 2, 2 ≺_2 3, and 5 ≺_7 1; hence, we have 2 ≻_0 3, 3 ≻_2 2, and 1 ≻_7 5. In G_3^0, 0 is the G^0-smallest and 4 is the G^0-greatest. □
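Definitions 1-4 translate directly into short helper functions. The sketch below (the function names are ours) reproduces the values of the example above.

def gray(x):
    """g_m(x): the x-th element of G_m (binary-to-Gray conversion)."""
    return x ^ (x >> 1)

def gray_inv(x):
    """g_m^{-1}(x): the position of x in G_m (Gray-to-binary conversion)."""
    y = 0
    while x:
        y ^= x
        x >>= 1
    return y

def G(m, k=0):
    """G_m^k: the m-bit Gray code sequence rotated left until k comes first."""
    seq = [gray(i) for i in range(2**m)]
    r = gray_inv(k)
    return seq[r:] + seq[:r]

def gk_less(a, b, m, k):
    """A is G^k-less than B: A precedes B in G_m^k (Definition 4)."""
    seq = G(m, k)
    return seq.index(a) < seq.index(b)

print(G(2))        # [0, 1, 3, 2]
print(G(3))        # [0, 1, 3, 2, 6, 7, 5, 4]
print(G(3, 2))     # [2, 6, 7, 5, 4, 0, 1, 3]
print(G(3, 7))     # [7, 5, 4, 0, 1, 3, 2, 6]
print(gk_less(3, 2, 3, 0), gk_less(5, 1, 3, 7))   # True True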

A. One-to-One Communication

Let A = (As, Ap) be the source node and B = (Bs, Bp) be the destination node. The following algorithm is a routing procedure executed by the source node A and by every node C = (Cs, Cp) on the path to the destination.

procedure node-to-node-routing(C, B, M)
{Route message M at node C toward destination B}
begin
  if C ≠ B then
    if Cs = Bs then
      Scube-routing(C, B, M);
    else
      D := C ⊕ B;
      I := the set of indices of 1's in Ds;
      if Cp ∈ I then
        send M on the external edge;
      else
        i := the G^{Cp}-smallest element in I;
        Scube-routing(C, (Cs, i), M);
      end; {if}
    end; {if}
  end; {if}
end; {node-to-node-routing}

procedure Scube-routing(A, B, M)
{Route message M from node A to node B within the same Scube}
begin
  Dp := Ap ⊕ Bp;
  j := the index of the first 1 in Dp;
  send M on the internal edge along the jth dimension;
end; {Scube-routing}

The algorithm starts by checking whether the current node C is the destination, and it stops if that is the case. Next, it tests whether the current Scube is the destination Scube. If it is, procedure Scube-routing is used to route the message M to the destination B over internal edges. Procedure Scube-routing is no more than a typical routing algorithm for an ordinary hypercube. If the message is to be routed to another Scube, the algorithm finds I, the set of Fcube dimensions that are to be changed in order to reach the destination Scube. The algorithm passes the message M on the current external link if this link takes M along one of the dimensions in I. Otherwise, M is routed within the current Scube Cs to the processor whose P part is the G^{Cp}-smallest element in I. From there, M will be sent along an external edge.

Example: If the source A is (0000,00) and the destination B is (0111,01), the path produced by procedure node-to-node-routing is (0000,00), (0001,00), (0001,01), (0011,01), (0011,11), (0011,10), (0111,10), (0111,11), (0111,01). □

The algorithm uses a shortest path between the source Scube As and the destination Scube Bs in the Fcube. The shortest path used has the property that edges of the Fcube are traversed in G^{Ap}-ascending order (sorted with respect to the relation ≺_{Ap}). The rationale behind this choice is to make the P-part values on the path to the destination Scube change in one direction and to prohibit their fluctuation. This puts an upper limit on the number of internal links that are traversed in order to change the S part (by moving on external links) to the required value Bs. This upper limit is equal to |G_m^{Ap}| - 1 = 2^m - 1.
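The routing procedure can be simulated in a few lines. The sketch below follows the description above (route to the G^{Cp}-smallest pending Fcube dimension, exit on an external edge, and finish inside the destination Scube). Tie-breaking inside an Scube, such as which differing bit is flipped first, is our own choice, so the intermediate nodes may differ from the paper's example while the path length stays within the bound of Theorem 3.

def gray(x): return x ^ (x >> 1)

def gray_inv(x):
    y = 0
    while x:
        y ^= x
        x >>= 1
    return y

def G(m, k):
    """G_m^k, as in the previous sketch."""
    seq = [gray(i) for i in range(2**m)]
    r = gray_inv(k)
    return seq[r:] + seq[:r]

def route(A, B, m):
    """Path of (s, p) nodes from A to B in a perfect HHC with 2^m-bit S parts."""
    (Cs, Cp), (Bs, Bp) = A, B
    path = [(Cs, Cp)]

    def walk_to(target_p):
        nonlocal Cp
        while Cp != target_p:
            bit = next(b for b in range(m) if (Cp ^ target_p) & (1 << b))
            Cp ^= 1 << bit
            path.append((Cs, Cp))

    while Cs != Bs:
        I = [j for j in range(2**m) if (Cs ^ Bs) & (1 << j)]
        if Cp not in I:
            order = G(m, Cp)
            walk_to(min(I, key=order.index))    # go to the chosen exit node
        Cs ^= 1 << Cp                           # external edge on dimension Cp
        path.append((Cs, Cp))
    walk_to(Bp)                                 # finish inside the destination Scube
    return path

# Example from the text: A = (0000, 00), B = (0111, 01) in a 6-HHC (m = 2).
for s, p in route((0b0000, 0b00), (0b0111, 0b01), 2):
    print(format(s, "04b"), format(p, "02b"))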


Theorem 3: The length of the path produced by the node-to-node-routing algorithm is ≤ H(As, Bs) + 2^m + m - 1, where H(As, Bs) is the Hamming distance between the S parts of nodes A and B.
Proof: Let Π be the path from A to B produced by the algorithm node-to-node-routing. Let Π = Π1 Π2, where Π1 is the path from node A to the entry node (Bs, e) of the destination Scube Bs (i.e., the first node of Scube Bs encountered in Π), and Π2 is the path from (Bs, e) to B within the destination Scube Bs. Obviously, Π1 contains exactly H(As, Bs) external links. Moreover, the number of internal links in Π1 is at most |G_m^{Ap}| - 1 = 2^m - 1. The path Π2 has no external edges, but at most m internal edges. Summing up, the length of path Π is ≤ H(As, Bs) + 2^m + m - 1. □

Since the S part of a node address consists of at most 2^m bits, we have H(As, Bs) ≤ 2^m. Thus,
|Π| ≤ 2^(m+1) + m - 1 ≤ 2(2^m + m) = 2n = O(n).

B. One-to-All Communication (Broadcasting)

One-to-all transfer can also be examined at two levels of hierarchy. At the higher level, broadcast is performed within the Fcube, using external edges to address the message to every Scube. At the lower level, broadcast is performed within each Scube using internal edges. Since the Scubes and the Fcube are all hypercubes, we first introduce a broadcast algorithm for the hypercube. This algorithm is a generalized version of the known one-to-all communication procedure in hypercubes [7], [8], [12]. Let α be any partial order on the set of integers {a | 0 ≤ a < 2^k}. Procedure HC-Broadcast communicates the message M from node A to every other node in a k-cube.

procedure HC-Broadcast(A, M)
{Broadcasts message M from node A to every other node in the HC}
begin
  A sends M to all of its neighbors;
  for any node B receiving M on dimension i do
    for every dimension j such that i α j do
      B sends M on dimension j;
end; {HC-Broadcast}

HC-Broadcast has the following properties:
1) The message M reaches every node in the HC once and exactly once. This is ensured by the use of the partial order α.
2) For all nodes B, M is routed from A to B along a shortest path Π. That is, |Π| = H(A, B).
3) For all nodes B, the links on the path Π (i.e., the path followed from A to B) are sorted with respect to the partial order α.
Fig. 8 shows the broadcast operation in a 3-cube where α is the ordinary less-than-or-equal relation (≤).


Fig. 8. Broadcast tree in a 3-HC.
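A small simulation of HC-Broadcast with α taken as the ordinary ≤ order reproduces the shape of the broadcast tree of Fig. 8. In the sketch below (our names), a node forwards only on dimensions strictly after the one the message arrived on, which is how the exactly-once property is realized.

def hc_broadcast(k, src=0):
    """Simulate HC-Broadcast in a k-cube. Returns, for each node,
    the sender it received M from and the round in which it received M."""
    recv = {src: (None, 0)}            # node -> (sender, round)
    frontier = [(src, -1)]             # (node, dimension it received M on)
    step = 0
    while frontier:
        step += 1
        nxt = []
        for node, i in frontier:
            dims = range(k) if node == src else [j for j in range(k) if j > i]
            for j in dims:
                nb = node ^ (1 << j)
                if nb not in recv:
                    recv[nb] = (node, step)
                    nxt.append((nb, j))
        frontier = nxt
    return recv

recv = hc_broadcast(3)
assert len(recv) == 8                      # every node gets M exactly once
print(max(t for _, t in recv.values()))    # 3: depth of the broadcast tree for k = 3
for node, (sender, t) in sorted(recv.items()):
    print(format(node, "03b"), "from", "-" if sender is None else format(sender, "03b"), "at step", t)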

An efficient HHC broadcast algorithm based on procedure HC-Broadcast is obtained by using the partial order ≺_{Ap} for Fcube broadcasting and the partial order ≤ for broadcasting in the Scubes. Broadcasting within an Scube starts as soon as the message is received. A transfer on an external edge (broadcasting in the Fcube) from an Scube waits until the proper exit node in the Scube has received the message. This is the node whose P part matches the dimension of the external link in the Fcube.

procedure HHC-Broadcast(A, M)
{Broadcasts message M from node A to every other node in the HHC}
begin
  A sends M to all of its neighbors;
  for any node B receiving M on an external edge do
    B sends M on every adjacent internal edge;
  for any node B receiving M on an internal edge on dimension i do
  begin
    for every internal edge on dimension j such that i ≤ j do
      B sends M on j;
    if As = Bs then {node B is in the source Scube}
      B sends M on the external edge;
    else {node B is not in the source Scube}
      I := the set of indices of 1's in As ⊕ Bs;
      k := the G^{Ap}-greatest element in I;
      if k ≺_{Ap} Bp then
        B sends M on the external edge;
    end; {if}
  end; {for}
end; {HHC-Broadcast}

An Scube is entered through an external edge once and only once. This is ensured by the use of the partial order ≺_{Ap} for Fcube broadcasting. Once the Scube is entered, broadcasting starts within this Scube. Therefore, in procedure HHC-Broadcast, a node receiving M on an external link sends M on every adjacent internal link. A node receiving M on an internal link does two things: i) it continues the broadcast operation already started within the Scube, and ii) it continues the Fcube broadcasting by forwarding M on the external link if one of the following two conditions is met:
1) The current Scube Bs is the source for Fcube broadcasting (i.e., Bs = As); therefore, M should be sent on every external link adjacent to Scube Bs (examine procedure HC-Broadcast).
2) k ≺_{Ap} Bp for every Fcube dimension k traversed earlier. This is equivalent to checking that the Fcube


dimension on which the Scube Bs has been entered is G^{Ap}-less than the dimension of the external link incident to node B (reexamine procedure HC-Broadcast).

Fig. 9. Broadcast tree in a 6-HHC.

Fig. 9 illustrates the broadcast tree produced by the HHC-Broadcast algorithm in a 6-HHC. Let B be an arbitrary node in the HHC, and let Π be the path produced by procedure HHC-Broadcast along which M is routed from the source node A to B. The path Π has the following characteristics:
1) The number of external links in Π is H(As, Bs) (by property 2 of HC-Broadcast).
2) Any portion of Π containing a set of consecutive internal links is of minimum length. That is, if this portion is between nodes (Cs, p) and (Cs, p'), then its length is equal to H(p, p') (by property 2 of HC-Broadcast).
3) The external links in Π appear in G^{Ap}-ascending order of their dimensions (by property 3 of HC-Broadcast).
4) The links in any portion of Π containing a set of consecutive internal links appear in ascending order of their dimensions (by property 3 of HC-Broadcast).

Theorem 4: Algorithm HHC-Broadcast requires at most n + 2^m - 1 time units (communication steps) for completion.
Proof: Let Π be as given above, and write Π = Π1 Π2, where Π1 and Π2 are the same as they are in the proof of Theorem 3. By property 1 of the path Π, Π1 contains


exactly H(As, Bs) external edges. By properties 2 and 3, the number of internal edges in Π1 is not greater than the length of a Hamiltonian path in an m-cube (corresponding to G_m^{Ap}). Therefore, the number of internal links in Π1 is less than or equal to 2^m - 1. Moreover, we have |Π2| ≤ m. Hence, |Π| ≤ H(As, Bs) + 2^m + m - 1. Because H(As, Bs) ≤ n - m, we can write |Π| ≤ n + 2^m - 1. Therefore, the depth of the broadcast tree generated by HHC-Broadcast is less than or equal to n + 2^m - 1, and thus algorithm HHC-Broadcast can be completed in n + 2^m - 1 communication steps. □

We know that 2^m is O(n). Thus, by Theorem 4, HHC-Broadcast requires O(n) time units. This is the same as the time complexity for broadcasting a message in a hypercube. We should not end this section without comparing the time required by the near-optimal solution produced by HHC-Broadcast with the time of an optimal solution. The time of an optimal solution is specified by the diameter D of the topology. By using the results of Section III, one can easily verify that (n + 2^m - 1) - D ≤ m - 1. Thus, the maximum departure of HHC-Broadcast from the optimal solution is m - 1 time units. As a final comment, we note that the route produced by HHC-Broadcast along which M is forwarded to a node B is identical to the route generated by procedure node-to-node-routing(A, B, M).



VI. HHC FOR A CLASS OF ALGORITHMS

As discussed later in this paper, an HHC can be viewed as an ordinary hypercube with certain links deleted. The links are deleted in such a manner that it remains possible to utilize the flexibility of the hypercube topology and to emulate the hypercube operations to solve a problem effectively and systematically. In this section, we demonstrate how the HHC topology can be employed to solve the divide & conquer class of problems. A systematic procedure to construct an algorithm for this class of problems is provided.

Divide and Conquer (D&Q) is a very popular paradigm for solving many types of problems. Examples of problems solvable by Divide & Conquer are matrix multiplication, matrix transpose, sorting, the Fourier transform, and convolution. D&Q is a well-known and much-studied algorithm design technique. The general strategy of divide & conquer is to follow these three steps in sequence:
1) Divide the problem instance into a number of subproblems (usually two) of the same type as the original problem.
2) Solve these subproblems separately by the same method.
3) Combine the solutions in some fashion to give the solution of the original problem instance.
Sequential implementation of D&Q is usually recursive. Of interest to us is the iterative (nonrecursive) description of D&Q.

Formulating the above description, assume that the input V consists of k = 2^n values v[0], v[1], ..., v[k-1]. A sequential iterative version of a Divide and Conquer algorithm manipulating the input V has the following general structure:

procedure Seq-D&Q(V)
begin
  for j := 0 to n - 1 do
    for i := 0 to k - 1 do
      if bit_j(i) = 0 then
        OP(v[i], v[i + 2^j]);
end {Seq-D&Q}

The above algorithm states that a certain operation OP is applied first to all pairs of inputs that are 2^0 positions apart, then to those 2^1 positions apart, then to those 2^2 positions apart, and so on. The last set of operations is applied to pairs that are 2^(n-1) positions apart. Fig. 10 illustrates how a D&Q algorithm performs its operations on an input of size 8.

Fig. 10. General form of a D&Q algorithm for input of size 8.

The class of algorithms described above is similar to what was referred to as the ascend class in [10]. Some variations of the above general structure of a D&Q algorithm are possible. For example, it is sometimes required to rearrange the inputs of a problem if the problem specifications do not fit the general D&Q algorithm given above. In some problems, j could go the other way around, descending from n - 1 to 0. An example of a variation of the general algorithm will be encountered later in this paper (procedure Cyclize in the next subsection).

Fig. 10 illustrates that there is an inherent parallelism in a D&Q algorithm. All the operations between pairs that are 2^j positions apart can be performed concurrently. A parallel version of a D&Q computation is described in the procedure Par-D&Q below.

procedure Par-D&Q(V)
begin
  for j := 0 to n - 1 do
    for each i, 0 ≤ i < k, do
      cobegin
        if bit_j(i) = 0 then
          OP(v[i], v[i + 2^j]);
      coend;
end {Par-D&Q}

Par-D&Q is most suitably implemented on an n-cube. This is because OP is always applied to values that are 2^j positions apart. If each value is stored in a node of an n-cube, then OP is always applied to values that are adjacent in the cube. It is easy to see that the time complexity of Par-D&Q is O(n) or, equivalently, O(log_2 k), and that Par-D&Q, when executed on a hypercube machine, requires n = log_2 k steps for its completion. For this class, the HC topology achieves the maximum parallelism. As discussed before, however, for a large n, the usage of n-cubes is expensive and infeasible. The next subsection explains how the HHC can be used to execute Par-D&Q efficiently.
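To make the ascend pattern concrete, here is a small Python rendering of Par-D&Q in which OP is chosen, purely for illustration, to replace a pair of values by (sum, sum); with that choice every position ends up holding the total of the original inputs.

def par_dq(v, op):
    """One rendering of Par-D&Q: at stage j, op combines elements 2**j apart.
    Within a stage all pairs are independent, so they could run concurrently."""
    v = list(v)
    n = len(v).bit_length() - 1            # k = 2**n inputs
    for j in range(n):                     # sheaf / dimension j
        for i in range(len(v)):
            if (i >> j) & 1 == 0:          # bit_j(i) = 0
                v[i], v[i + 2**j] = op(v[i], v[i + 2**j])
    return v

# Illustrative OP: the result is an all-reduce of the inputs.
print(par_dq([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: (a + b, a + b)))
# [36, 36, 36, 36, 36, 36, 36, 36]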

A. Par-D&Q Computation on HHC

Another way to look at the HHC is as an incomplete hypercube (i.e., a hypercube with some links missing). The set of edges of an n-cube can be partitioned into n subsets. An edge is in subset i, 0 ≤ i < n, if it connects two nodes that differ in the ith bit. In other words, subset i contains all of the hypercube edges along the ith dimension. We will refer to the edges of subset i collectively as sheaf i. The same term (sheaf i) is used in the HHC to refer to the collection of edges connecting nodes whose ith bit is different, where the ith bit of a node is the ith bit of the concatenation of the S and P parts of the node address.

Starting from an n-cube, if, for i = m, m+1, ..., n-2, n-1, all links belonging to sheaf i are removed except the links connecting nodes of the form (., i-m), that is, nodes whose P part is i-m (see Fig. 11(a)), the resulting topology is an n-HHC. Let's investigate the correctness of the preceding statement. First, any edge corresponding to a change in the jth bit, where j < m, is retained. Those edges are exactly the internal edges of an n-HHC. The other retained edges are the



edges of the form ((s_1, i-m), (s_2, i-m)) differing in the ith bit or, restated, differing in bit i-m of the S part of the node address. These edges are exactly the external edges of an n-HHC. This implies that the resulting topology is an n-HHC. Fig. 11(b) illustrates how we handle the links incident to node (00000000101) = (0,5) in the 11-HC when transforming it into the 11-HHC.

Fig. 11. (a) Transforming the n-HC into the n-HHC. (b) Transforming the 11-HC into the 11-HHC.

The above transformation suggests the following strategy for executing a D&Q algorithm on the HHC:

procedure HHC-D&Q(V)
begin
  for j := 0 to m - 1 do
    for each i, 0 ≤ i < k, do
      cobegin
        if bit_j(i) = 0 then
          OP(v[i], v[i + 2^j]);
      coend;
  Handle-High-Sheaves;
end {HHC-D&Q}
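The link-deletion view of the preceding paragraphs can be checked mechanically. The sketch below (our helper names; a perfect HHC is assumed) prunes the high sheaves of an n-cube as described and verifies that what remains is exactly the n-HHC edge set.

def hhc_edges(n, m):
    """Edge set of a perfect n-HHC, from the rules of Section II."""
    E = set()
    for a in range(2**n):
        s, p = a >> m, a & ((1 << m) - 1)
        for i in range(m):                               # internal edges
            E.add(frozenset({a, (s << m) | (p ^ (1 << i))}))
        E.add(frozenset({a, ((s ^ (1 << p)) << m) | p})) # external edge
    return E

def pruned_cube_edges(n, m):
    """n-cube edges after removing sheaf i (m <= i < n) except the links whose
    endpoints have P part equal to i - m, as described above."""
    E = set()
    for a in range(2**n):
        for i in range(n):
            if i < m or (a & ((1 << m) - 1)) == i - m:
                E.add(frozenset({a, a ^ (1 << i)}))
    return E

n, m = 6, 2                      # perfect 6-HHC
assert pruned_cube_edges(n, m) == hhc_edges(n, m)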


The Handle-High-Sheaves procedure deals with operations on sheaves m to n - 1 (i.e., OP on operands that are 2^m, 2^{m+1}, ..., 2^{n-1} positions apart). At any time instant, only one of the nodes of an Scube s can be performing OP on sheaf j, m ≤ j < n. This is because exactly one node in the Scube is connected to another node 2^j positions distant (by an external edge, of course). As might be expected, the address of this node is (s, j - m). Therefore, at least 2^m steps are needed to perform OP on sheaf j for all nodes of the Scube. This is done by allowing the data in each node to pass over the node (s, j - m). The OP operations on different sheaves are pipelined in order to reduce the total processing time of Handle-High-Sheaves. Handle-High-Sheaves consists of three main parts:
1) Cyclize Scube data
2) A series of OP and rotate operations
3) Cubify Scube data

Cyclization of Scube Data: An Scube is a hypercube of m dimensions. The hypercube is known to have a Hamiltonian circuit (a cycle that passes through each node once and only once and then returns to the initial node). One way to find a Hamiltonian circuit in an m-cube is to generate the m-bit Gray code G_m. Because the first and last numbers in G_m differ in only one bit, the sequence of numbers in G_m is the sequence of nodes in a Hamiltonian circuit. Cyclization of cube data means rearranging cube data in such a way that the data stored in processor k of an Scube


The algorithm "Rotate" for data rotation within the Scubes Hamiltonian cycles and the algorithm "Handle-High-Sheaves" are given below. To simplify the description of the algorithms and to enhance understandability, w[i] is used to refer to the data value residing in processor i , where i is the integer value of ( s , p ) , the catenation of the S and P parts of the processor address. In addition, int(X) is used to refer to the integer value of the binary number X .

procedure Rotate begin for each (s,p)lO 5 s < 2n-m,0 5 p < 2" do 001 011 010 110 111 101 Ham.cycle: OOO 100 cobegin Datavalues: v[Ol v[51 v[61 v[71 v[ll v[21 v[31 v[41 (compute K ; the element immediately preceding (b) gm(p) in Gm} Fig. 12. (a) Cyclize procedure applied on input of size 8. (b) Hamiltonian IC := gm((g,l(p) - l)mod2"); cycle of processors in an Scube with corresponding data values after Cyclize. let i = int((s,p)); let j = int((s,IC)); w[i] := w[j] is redirected to processor gm(k). By this, repeated rotations coend; within the cycle corresponding to G,, allow all data items in end { Rotate } the Scube to pass over the node (s,j - m), and consequently procedure Handle-High-Sheaves OP on sheaf j can be performed for each of the Scube nodes. begin Fig. 12 shows the cyclization in a 3-cube. The cyclization Cyclize; problem is a D&Q variation whose solution can be described for i := 0 to 2"+' - 2 do {2"+l - 1 pipeline informally as follows: Tear the m-cube along the m - 1 steps are needed } dimension to obtain two identical (m - 1)-subcubes. Then begin cyclize the data in each subcube separately and combine { Find the active processors } the two results. A formal description of this operation is if i < 2" then L := 0; H := i; given below as procedure "Cyclize." An illustration of how else L := i - 2" + 1;H := 2" - 1; procedure Cyclize works is given in Fig. 12(a). The hypercube end { if }; Hamiltonian cycle and its stored data after cyclization are for each (s,p)lO 5 s < 2", L 5 g-'(p) 5 H do shown in Fig. 12(b). Obviously, Cyclize takes m - 1 time cobegin steps. if bit p ( s )= 0 then procedure Cyclize o p (w [int((s,p))I,w[int((s + 2P,p))l); begin coend for j:= m - 2 to 0 step -1 do Rotate; for each i10