On the Design Complexity of the Issue Logic of Superscalar Machines

Sorin Cotofana    Stamatis Vassiliadis
Delft University of Technology, Electrical Engineering Department
Mekelweg 4, 2628 CD Delft, The Netherlands
{Sorin,[email protected]}

Abstract

In this paper we investigate the complexity of superscalar decode/issue logic assuming primitive gates. We show, assuming that the issuing is performed on the basis of opcodes, that the complexity of checking data dependencies is in the order of $k^3$ gates and $\log k$ gate delay, $k$ being the issue width, when assuming infinite resources and in-order issuing. When assuming out-of-order issuing the complexities are in the order of $2^k$ gates and $\log k$ gate delay, and for out-of-order issuing with renaming in the order of $2^k$ gates and $k^2$ gate delay. When the resources are restricted we show that the complexity is in the order of $n^k$ gates and $k^2 \log n$ delay, $n$ being the cardinality of the instruction set. Finally, by assuming that the issuing is performed using grouping of instructions rather than an opcode-specific description, the complexity is in the order of $m^k$ gates and $k^2 \log m$ delay, where $m$ is the number of instruction groups.

1. Introduction

It is widely accepted that in superscalar machine organizations, see for example [4, 10, 6], the decode/issue stage of processing is expensive to implement in terms of area and delay. Assessing the magnitude of its complexity has been, and continues to be, of substantial theoretical and practical importance [2, 7]. In view of several proposals for aggressive superscalar and superscalar "related" machine organizations, see for example [8, 5, 9], the issue/decode processing could indeed be one of their limiting factors. Regarding the complexity of dependency checking, it has been postulated [2] that it grows^1 in the order of $n^2$. It is further indicated [2] that the checking of dependencies could indeed limit the issue rate of superscalar machines. The previous assertion leaves open the following questions:

- Is the $n^2$ complexity conjecture a correct one?
- Is the data dependency^2 checking indeed the limiting factor of the issue rate?

We will investigate both questions and show the following:

- The check for data dependencies among the instructions in a $k$-instruction buffer can be achieved with: an area in the order of $k^3$ and a delay in the order of $\log k$ for in-order issuing; an area in the order of $2^k$ and a delay in the order of $\log k$ for out-of-order issuing; and an area in the order of $2^k$ and a delay in the order of $k^2$ for out-of-order issuing with renaming.
- Assuming a superscalar processor with restricted hardware resources, the corresponding issue block based on opcode decoding has an area in the order of $n^k$ and a delay in the order of $k^2 \log n$.

^1 In [2] it is not very clear what $n$ stands for. We assume it to be the number of issued instructions per cycle.

It should be noted that these conclusions strongly suggest that approaches such as the one described in [3], which assume instruction opcodes to determine if multiple instructions can be issued in parallel, severely limit the issue rate that can be achieved^3 per cycle. Clearly there are other approaches, denoted here as hierarchical and hardware utilization, that alleviate the problem of opcode decision making. It is widely accepted, for example, that floating-point instructions can be hierarchically separated and issued to the floating-point unit with no checking against fixed-point instructions, see for example [1]. Furthermore, instructions can be separated by hardware utilization, as defined in [11], to further alleviate the problem. Finally, it has been suggested [11] that the decode/issue of instructions should actually be computed in advance, tagged, and possibly removed from the decode/issue pipeline stage. While such approaches have been successful in substantially reducing the complexity, there are no studies assessing the degree of complexity reduction. In this paper we also investigate the following open question:

^2 We assume here that what is referred to in [2] as dependency checking relates to data dependencies. Other interpretations, such as structural dependencies, are also possible. In all cases the interpretation of the word will not change the discussion and the conclusions to follow.
^3 Indeed, in [3], assuming two instructions per cycle, 38 "first" instructions followed by 57 "second" instructions were implemented.


- What is the complexity of issuing using partitioning (hierarchical and hardware-utilization based)?

Regarding this question we show that, assuming a superscalar processor with restricted hardware resources, an issue width of $k$ instructions per cycle, and the instruction vocabulary partitioned into $m$ classes, the corresponding issue block has an area in the order of $m^k$ and a delay in the order of $k^2 \log m$.

The organization of the discussion is as follows: First we introduce some assumptions and preliminary notions for instruction issuing. Subsequently, we investigate the complexity of data dependency checking with unrestricted resources. Finally, we restrict the resources and investigate the complexity of issuing under such a restriction. We consider two basic scenarios: opcode-based and instruction-partition-based issuing.

2. Machine Organization and Preliminaries

Consider a machine organization capable of issuing more than one instruction per cycle, depicted in Figure 1.

[Figure 1. Basic Superscalar Architecture: Instruction Fetch, the Decode & Issue Logic, the Register File, several Functional Units, and Memory.]

Assume that the instruction set executed by the processor is $I = \{I_1, I_2, \ldots, I_n\}$ and that at most $k$ instructions can be issued per cycle, described by the $k$-tuple $P = (i_1, i_2, \ldots, i_k)$, with $i_j \in I$, $j = 1, 2, \ldots, k$. Furthermore assume that at least $k$ instructions are fetched into an instruction buffer and that a decision is reached on whether or not a $k$-instruction tuple can be issued and executed in parallel. This decision making process is performed by the "Decode & Issue" logic and is usually based on the opcodes of the instructions, on the availability of resources, and on the structural and data dependencies. In assuming that the issuing decision is based on opcodes, a number of rules have to be put in place describing whether a sequence of instructions is potentially issuable^4. To clarify, consider a machine operating on the instruction set $I = \{Add, Sub, Comp, Load, Store\}$, issuing at most two instructions per cycle, and having available two ALUs executing the instructions $Add$, $Sub$, $Comp$. For this example machine the following rule has to be put in place: "An issuable ALU pair is $(Add, Add)$ or $(Add, Sub)$ or $(Add, Comp)$ or $(Sub, Add)$ or $(Sub, Sub)$ or $(Sub, Comp)$ or $(Comp, Add)$ or $(Comp, Sub)$ or $(Comp, Comp)$". Generally speaking such rules can be viewed to logically form a table^5. Furthermore, assuming that the instruction buffer contains $k$ instructions^6, the organization of the "Decode & Issue" block in Figure 1 can be detailed as graphically depicted in Figure 2. It should be noted that in real implementations the "Rule Table" is actually embedded in the "Decode & Issue" logic.

[Figure 2. Decode & Issue Functional Unit: a $k$-entry Instruction Buffer (positions $0, 1, \ldots, k-2, k-1$) feeding the Decode & Issue logic and its Rule Table, which produce the Issued Instructions.]

^4 Issuable here means that other conditions, e.g., data dependencies, conflicts, etc., have to be also respected.
^5 Logically means that there are several ways to implement the table. In most implementations the table is realized with Boolean logic rather than with a memory array.
^6 In this presentation we will assume only restricted scope instruction issue, i.e., a $k$-instruction buffer, in order to better cope with the complexity of the problem we are analyzing. Sometimes an extended scope, i.e., a $w$-instruction buffer, $w > k$, is assumed in practice in order to improve performance. This assumption will increase the complexities we report in this paper. For example an extended scope issue policy increases the delay complexity we will report in Theorem 3 from $k^2 \log n$ to $w k \log n$ because in this case all the $C_w^k$ possible $k$-instruction tuples over the $w$-instruction buffer have to be considered.
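To make the rule-table view concrete, the following Python sketch models the two-ALU example machine above as a set of issuable opcode pairs. It is only an illustrative rendering of the rule just stated (the opcode names are those of the example, everything else is assumed for the sketch), not a description of an actual implementation.

```python
from itertools import product

# Example machine from the text: I = {Add, Sub, Comp, Load, Store},
# at most two instructions issued per cycle, two ALUs executing Add/Sub/Comp.
ALU_OPS = {"Add", "Sub", "Comp"}

# The rule stated above, expressed as the set of issuable ALU opcode pairs.
ISSUABLE_PAIRS = set(product(ALU_OPS, repeat=2))

def issuable_pair(pair):
    """True if the opcode pair may be issued in the same cycle according to
    this (resource-only) rule; data dependencies are checked separately."""
    return pair in ISSUABLE_PAIRS

print(issuable_pair(("Add", "Comp")))   # True: both map onto the two ALUs
print(issuable_pair(("Add", "Load")))   # False: no rule covers this pair here
```

In hardware the same decision is taken by the Boolean logic mentioned in footnote 5; the set-membership test above merely plays the role of the table lookup.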


3. Unrestricted Computational Resources

In this section we assume that the processor has unrestricted computational hardware resources and at most $k$ instructions can be issued per cycle. This means that if the functional unit $FU_j$ can execute an instruction $I_j$, $j = 1, 2, \ldots, n$, the processor will contain $k$ units of each type. This clearly corresponds to an area of the computational part of the machine in the order of $kn$ units. Under this assumption the parallel issue of a $k$-instruction tuple $P = (i_1, i_2, \ldots, i_k)$, $i_j \in I$, $j = 1, 2, \ldots, k$, in a single cycle can be performed with no hardware conflicts. If the $k$ instructions in the instruction buffer have no data dependencies the tuple can be issued; otherwise it is assumed that the "Decode & Issue" block graphically depicted in Figure 3 will attempt to issue as many instructions as possible.

[Figure 3. Decode & Issue for Unrestricted Resources: the $k$-entry Instruction Buffer (positions $0, \ldots, k-2, k-1$) feeds the Dependency Logic and the Decode/Issue Logic, which produce the output signals $Issue(1), Issue(2), \ldots, Issue(k-1), Issue(k)$.]

Generally speaking, for an in-order issue policy, the dependency logic circuit will set the signal $Issue(l)$ to "1" in order to signal that the first $l$, $l \in \{1, 2, \ldots, k\}$, instructions in the buffer can be issued in parallel, and all the other output signals $Issue(m)$, with $m \neq l$ and $1 \le m \le k$, will be set to "0". In the theorem to follow we will investigate the complexity of the circuit that performs the data dependency checking on a $k$-instruction tuple.

Theorem 1 Assuming a superscalar processor with unrestricted hardware resources, an in-order issue policy, and an issue width of $k$ instructions per cycle, the corresponding issue logic can be implemented with an area in the order of $k^2$ and a constant delay.

Proof: Assume the instruction buffer contains $k$ instructions. For an in-order issue policy the issue logic can signal the parallel issue of all the $k$ instructions in the buffer, or of the first $k-1$, or of the first $k-2$, etc. Consequently, in order to decide how many instructions to issue, the issue logic has to implement $k-1$ rules. If rule $R_i$ corresponds to the case when the first $i$ instructions can be issued in parallel, it can be formulated as follows:

- The first $i$ instructions, $i = 2, 3, \ldots, k$, can be parallelly issued if there is no data dependency between the instruction pairs $(i_l, i_m)$, $1 \le l < m \le i$, and if $i < k$ there is at least a dependency between an instruction $i_l$ and an instruction $i_m$, $1 \le l \le i < m \le k$.

Given that the decision on how many instructions can be issued has to be as fast as possible, we assume that the issue logic implements a parallel evaluation scheme for all the $k-1$ rules. One approach^7 to implement this parallel evaluation scheme is to assume that $ND(i_l, i_m)$ is a combinational circuit that checks if there are any data dependencies between the operands of the instruction $i_m$ and those of the instruction $i_l$, and to compute first the following $PIssue(i)$, $i = 2, 3, \ldots, k$, quantities:

$$PIssue(i) = ND(i_1, i_2) \wedge ND(i_1, i_3) \wedge ND(i_2, i_3) \wedge \cdots \wedge ND(i_{i-2}, i_i) \wedge ND(i_{i-1}, i_i) \qquad (1)$$

Consequently, because only the output signal $Issue(i)$ with $i$ corresponding to the largest number of instructions that can be issued has to be "1" and all the others should be "0", the output signals of the issue logic can be computed by:

$$\begin{aligned}
Issue(k) &= PIssue(k) \\
Issue(k-1) &= PIssue(k-1) \wedge \overline{PIssue(k)} \\
&\;\;\vdots \\
Issue(3) &= PIssue(3) \wedge \overline{PIssue(4)} \\
Issue(2) &= PIssue(2) \wedge \overline{PIssue(3)}
\end{aligned} \qquad (2)$$

In order to check all the possible instruction pairs we need $\sum_{j=1}^{k-1}(k-j) = \frac{k(k-1)}{2}$ ND circuits working in parallel on all the instructions in the buffer. Assume that an ND circuit can be implemented with $d$ gates, and that the computation of the PIssue quantities can be performed with $k-2$ gates and those of the Issue quantities with another $k-2$ gates. If we assume that a gate can be implemented, regardless of its fan-in^8, in $A_g$ area, the implementation cost of the issue block is given by^9:

$$A = \left( d\,\frac{k(k-1)}{2} + 2(k-2) \right) A_g \qquad (3)$$

Asymptotically speaking this corresponds to an area in the order of $k^2$. The delay of the issue logic we introduced does not depend^10 on $k$ because on the critical path of the design we have an ND function that can be computed in a fixed constant delay $\Delta_{ND}$ plus the delay of a gate. □

^7 We note here that one may propose also other schemes to implement the rules but they will provide the same complexity for the circuit area and delay.
^8 We note here that some of the gates computing PIssue quantities will require a very large fan-in, the largest fan-in being $\frac{k(k-1)}{2}$. This implies that the implementation cost and also the delay of such a gate are substantially larger than the ones required by the gates computing an Issue quantity.
^9 Note that, because no changes will occur in the final result and for simplicity, we use the generic term "gate" instead of the specific function or "cell"-like terminology used for gate arrays.
^10 In an implementation the delay will actually be a function of $k$ because the AND gate on the critical path (the one computing the $PIssue(k)$ condition) has a fan-in of $\frac{k(k-1)}{2}$, and it is well known that the delay of a gate depends on its fan-in.

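The computation scheme used in the proof of Theorem 1 can be paraphrased behaviourally as follows. This is a software sketch only: the ND predicate is modelled with hypothetical dst/src register-set fields, and the loop structure mirrors Equations (1) and (2) rather than an actual gate netlist.

```python
def nd(instr_a, instr_b):
    """Model of ND(a, b): True when there is NO data dependency between
    instruction a (earlier) and instruction b (later). The 'dst'/'src'
    register-set fields are hypothetical and exist only for this sketch."""
    raw = instr_a["dst"] & instr_b["src"]   # b reads what a writes
    waw = instr_a["dst"] & instr_b["dst"]   # both write the same register
    war = instr_a["src"] & instr_b["dst"]   # b writes what a reads
    return not (raw or waw or war)

def in_order_issue(buffer):
    """Return the one-hot Issue() decision of Theorem 1 as an integer:
    the largest i such that the first i instructions are pairwise independent."""
    k = len(buffer)
    # PIssue(i): AND of ND over all pairs among the first i instructions (Eq. 1).
    pissue = {i: all(nd(buffer[l], buffer[m])
                     for l in range(i) for m in range(l + 1, i))
              for i in range(2, k + 1)}
    # Issue(i) = PIssue(i) AND NOT PIssue(i+1); Issue(k) = PIssue(k) (Eq. 2).
    for i in range(k, 1, -1):
        if pissue[i] and (i == k or not pissue[i + 1]):
            return i
    return 1  # the first instruction alone can always be issued

buf = [{"dst": {1}, "src": {2, 3}},
       {"dst": {4}, "src": {5}},
       {"dst": {6}, "src": {1}}]   # third instruction depends on the first
print(in_order_issue(buf))          # prints 2
```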
Thus far we have shown that, in order to check in-order the possible data dependencies that might appear in a $k$-instruction buffer, the issue block has an area of $k^2$ gates with arbitrarily large fan-in and this computation can be performed in constant delay. Given that in practice, depending on the fabrication technology, the maximum fan-in is limited to a value $f$, we investigate in the following corollary the consequences of this assumption on the design complexity of the issue logic.

Corollary 1 Assuming a superscalar processor with unrestricted hardware resources, an in-order issue policy, an issue width of $k$ instructions per cycle, and a fabrication technology that can accommodate logic gates with at most $f$ inputs, the corresponding issue block has an expected area in the order of $k^3$ and a delay in the order of $\log k$.

Proof: We maintain the computation scheme in Theorem 1. Given that we restricted the gates' fan-in to $f$, some of the PIssue quantities can not be computed with one AND gate. Generally speaking, in order to compute an $m$-input elementary function with gates that have their fan-in upper bounded by $f$, we need an $f$-ary tree of depth $\lceil \log_f m \rceil$. Such a tree is built with at most $\lceil \frac{m-1}{f-1} \rceil$ gates. Given that the fan-in of the gate computing $PIssue(l)$, $l = 3, 4, \ldots, k$, is given by $\frac{l(l-1)}{2}$, this method changes the implementation cost, measured in terms of gates, of the PIssue quantity as follows:

$$A_{PIssue(l)} = \begin{cases} 1 & \text{if } \frac{l(l-1)}{2} \le f \\[4pt] \left\lceil \dfrac{l(l-1)-2}{2(f-1)} \right\rceil & \text{if } \frac{l(l-1)}{2} > f \end{cases} \qquad (4)$$

for $l = 3, 4, \ldots, k$. Consequently, the implementation cost of the issue block becomes:

$$A_f = \left( d_f\,\frac{k(k-1)}{2} + \sum_{l=k_f}^{k} \left\lceil \frac{l(l-1)-2}{2(f-1)} \right\rceil + k + k_f - 5 \right) A_g \qquad (5)$$

In the previous equation we assumed that the implementation cost for the ND function is $d_f$ gates and that $k_f$ is the smallest integer for which the fan-in of the gate computing $PIssue(k_f)$ exceeds $f$. More precisely, $k_f$ can be computed by solving the equation $\frac{k_f(k_f-1)}{2} = f$, which gives $k_f \approx \frac{1}{2}\left(1 + \sqrt{1+8f}\right)$. We note here that if $k_f$ is larger than $k$, case in which we assume it as being equal with $k+1$, all the equations computing PIssue functions can be implemented directly and Equation (5) gives the same result as Equation (3). Given that in practice $f$ can not be very large when compared with the maximum fan-in required by the computation of $PIssue(k)$, i.e., $f \ll \frac{k(k-1)}{2}$, $k_f$ will assume a value a lot closer to $2$ than to $k$, i.e., $k_f \ll k$, and the implementation cost in Equation (5) will be dominated by the term $\sum_{l=k_f}^{k} \lceil \frac{l(l-1)-2}{2(f-1)} \rceil$. This implies an asymptotic complexity in the order of $k^3$ for the implementation cost of the issue logic. The critical path delay is imposed by the computation of the $PIssue(k)$ signal. This $\frac{k(k-1)}{2}$-term AND expression is implemented with an $f$-ary gate tree. Because the depth of this tree is $\lceil \log_f \frac{k(k-1)}{2} \rceil$, the delay of the issue block is in the order of $\log k$. □

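The counting behind Equations (4) and (5) can be tabulated with a short script. The sketch below follows the same f-ary-tree accounting (one gate when the fan-in fits, otherwise the ceiling bound from the proof); the value chosen for d_f is an arbitrary placeholder, and the numbers produced are illustrative estimates, not figures from the paper.

```python
import math

def and_tree_gates(m, f):
    """Gates needed to AND m signals with fan-in limited to f (an f-ary tree),
    using the ceil((m - 1) / (f - 1)) bound from the proof of Corollary 1."""
    return 1 if m <= f else math.ceil((m - 1) / (f - 1))

def issue_area_bounded_fanin(k, f, d_f=4):
    """Gate-count estimate in the spirit of Equation (5); d_f (the cost of one
    ND circuit) is an arbitrary placeholder used only for this sketch."""
    nd_gates = d_f * k * (k - 1) // 2
    pissue_gates = sum(and_tree_gates(l * (l - 1) // 2, f) for l in range(3, k + 1))
    issue_gates = k - 2
    return nd_gates + pissue_gates + issue_gates

def issue_delay_bounded_fanin(k, f):
    """Depth, in gate levels, of the widest AND tree (the one for PIssue(k))."""
    return math.ceil(math.log(k * (k - 1) / 2, f))

for k in (4, 8, 16, 32):
    print(k, issue_area_bounded_fanin(k, f=4), issue_delay_bounded_fanin(k, f=4))
```

For fixed f the per-l term grows like l^2, so the sum grows like k^3, while the deepest AND tree has depth about log_f(k(k-1)/2), i.e., O(log k), matching the corollary.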
Thus far we have assumed an in-order issue policy. It is well known that, in order to better exploit the instruction level parallelism present in an instruction stream, an out-of-order instruction issue policy is used. This increases the probability of finding a larger number of instructions with no data dependencies and obviously leads to a performance increase due to a more efficient use of the computation resources. As one may expect, the out-of-order issue policy is more computationally demanding on the issue logic. In the following theorem we will investigate the complexity of an issue block that can perform out-of-order issue out of a $k$-instruction buffer.

Theorem 2 Assuming a superscalar processor with unrestricted hardware resources, an out-of-order issue policy, and an issue width of $k$ instructions per cycle, the corresponding issue block has an area in the order of $2^k$ and a delay in the order of $\log k$.

Proof: For an out-of-order issue policy the issue logic can signal the parallel issue of one of the following combinations: all the $k$ instructions in the buffer, one $(k-1)$-instruction tuple out of the $C_k^{k-1}$ possible tuples, one $(k-2)$-instruction tuple out of the $C_k^{k-2}$ possible tuples, etc. More precisely, if rule $R_i$ corresponds to the case when $i$ instructions out of the $k$ present in the buffer $IB = (i_1, i_2, \ldots, i_k)$ can be issued in parallel, it can be formulated as follows:

- An $i$-instruction tuple $P_i = (p_1, p_2, \ldots, p_i)$, with $p_j \in IB$, $j = 1, 2, \ldots, i$, can be parallelly issued if there is no data dependency between all the instruction pairs $(p_l, p_m)$, $1 \le l < m \le i$, and if $i < k$ there is at least a dependency between an instruction $p_l$, $1 \le l \le i$, and an instruction $i_m$, $i_m \in IB$, $i_m \notin P_i$, $1 \le m \le k$.

Because we have allowed for out-of-order issue there are $C_k^i$ combinations that have to be evaluated in the rule $R_i$. If we maintain the assumption on the $ND(i_l, i_m)$ circuits, the number of PIssue quantities that we have to compute changes from $k-1$ to $\sum_{i=2}^{k} C_k^i$. Consequently, the signal $PIssue(i)$ in Equation (1) will be replaced by a bundle of signals $PIssue(i)(j)$, $j = 1, 2, \ldots, C_k^i$. This will change also the fan-in, but not the number, of the gates performing the computation in Equation (2), because the quantity $PIssue(i)$ has to be replaced with $\bigvee_{j=1}^{C_k^i} PIssue(i)(j)$. In order to check all the possible instruction pairs we will need $\sum_{j=1}^{k-1}(k-j) = \frac{k(k-1)}{2}$ ND circuits working in parallel on all the instructions in the data buffer. We assume that the area of the gates does not depend on their fan-in and that any ND circuit can be implemented with $d$ gates. The computation of the PIssue quantities can be performed with $\sum_{i=2}^{k} C_k^i$ gates and those of the Issue quantities with another $k-2$ gates. Consequently, the implementation cost of the issue block is given by:

$$A_O = \left( d\,\frac{k(k-1)}{2} + 2^k - 3 \right) A_g \qquad (6)$$

Asymptotically speaking this corresponds to an area in the order of $2^k$. At first glance the delay of the issue logic we introduced does not depend on $k$ because on the critical path of the design we have an ND function that can be computed in a fixed constant delay $\Delta_{ND}$ plus the delay of a gate. However, any practical implementation has to use gates with a limited fan-in and this will lead to a delay in the order of $\log k$. The complexity for the area will not be affected by the limited fan-in assumption. □

The last question we analyze in this section refers to the influence of the register renaming procedure, which is used in some implementations to solve the non-true data dependencies (WAR and WAW) [4], on the issue logic complexities we reported in the previous theorem.

Corollary 2 Assuming a superscalar processor with unrestricted hardware resources, an out-of-order issue policy with register renaming, and an issue width of $k$ instructions per cycle, the corresponding issue block has an area in the order of $2^k$ and a delay in the order of $k^2$.

Proof: Assume that, in order to check data dependencies on two instructions, we use a more complicated circuit $ND_R$, which is able to solve the WAR and the WAW dependencies by register renaming and can be implemented with $d_R > d$ gates. Use the same scheme as in Theorem 2 but, instead of the ND circuits, use $ND_R$ circuits. This will change only the value of the constant $d$ in Equation (6), hence the area of the issue logic will remain unchanged, i.e., $2^k$. However, the renaming increases the delay. When renaming is used the $ND_R$ circuits share resources and can not work in parallel any longer. This leads to a serial behavior of the logic that checks the instruction pairs for data dependencies. More formally, the check for data dependency of the pair $(p_i, p_{i+j})$, $1 \le i < k$, $j = 1, 2, \ldots, k-i$, can not be initiated until all the pairs $(p_{i-1}, p_{i+j-1})$ were checked. Consequently, the expected delay is in the order of $k^2$. □

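A quick way to see the exponential area growth stated in Theorem 2 is to evaluate the combinatorial sums from the proof numerically; the snippet below does only that, with d again a placeholder ND cost.

```python
from math import comb

def out_of_order_area(k, d=4):
    """Gate-count estimate following the proof of Theorem 2: d*k(k-1)/2 gates
    for the ND circuits, one AND gate per candidate subset of size >= 2
    (sum of C(k, i) for i = 2..k, i.e. 2**k - k - 1), and k - 2 Issue gates."""
    nd_gates = d * k * (k - 1) // 2
    pissue_gates = sum(comb(k, i) for i in range(2, k + 1))
    issue_gates = k - 2
    return nd_gates + pissue_gates + issue_gates

for k in (2, 4, 8, 16):
    print(k, out_of_order_area(k))   # grows like 2**k once k gets moderately large
```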
4. Restricted Hardware Resources and No Data Dependencies

In this section we assume that there are no data dependencies in the incoming instruction stream but the hardware resources are limited, i.e., if the functional unit $FU_j$ can perform the instruction $I_j$, $j = 1, 2, \ldots, n$, the processor contains $k_j < k$ $FU_j$ units^11. Consequently, in order to decide whether or not a number of instructions can be issued in parallel, the issue logic has to check if enough computational resources are available.

Generally speaking, for an $n$-instruction processor, there are $n^k$ possible $k$-instruction combinations that might appear in the instruction buffer. This holds true for the most general case, when we allow for repetition, i.e., a given instruction $I_j \in I$ can appear more than once in the composition of the tuple $P = (i_1, i_2, \ldots, i_k)$, $i_j \in I$, $j = 1, 2, \ldots, k$, and when the instruction order is relevant, i.e., $(i_1, i_2, \ldots, i_k) \neq (i_{\sigma(1)}, i_{\sigma(2)}, \ldots, i_{\sigma(k)})$, with $\sigma$ a permutation over $\langle 1, 2, \ldots, k \rangle$. We can reduce the number of possible combinations, if we assume that the instruction order in the buffer is not relevant^12 from the point of view of the issue logic that checks for the availability of computational resources, to $\bar{C}_n^k = C_{n+k-1}^k = \frac{(n+k-1)!}{k!\,(n-1)!}$. Given that we assumed that the processor has restricted resources, not all of these combinations can be issued and executed in parallel. Consequently, the issue logic should first be able to recognize if the instructions in the instruction buffer represent an executable combination and after that carry on the corresponding parallel issue procedure. Therefore, besides the instruction buffer able to accommodate $k$ instructions, such an organization requires combinational logic which implements a number of rules to control the parallel instruction issue. In the following we will analyze the complexity of the issue logic that performs this type of verification by evaluating the complexity of the rule table it has to implement.

^11 Some of the $k_j$ may be equal with $k$ but not all of them.
^12 Not relevant here means, for example, that the following instruction tuples are equivalent: $(ADD, MUL, LD) \equiv (ADD, LD, MUL) \equiv (LD, ADD, MUL) \equiv (LD, MUL, ADD) \equiv (MUL, ADD, LD) \equiv (MUL, LD, ADD)$, and they can be checked for resource conflicts with the same rule.

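The two counts used above, ordered k-tuples versus combinations with repetition, can be checked numerically with a few lines of Python; the (n, k) values below are arbitrary examples.

```python
from math import comb

def ordered_tuples(n, k):
    """Ordered k-tuples over an n-instruction set, repetition allowed: n**k."""
    return n ** k

def unordered_tuples(n, k):
    """Combinations with repetition, C(n + k - 1, k): the rule count when the
    order of the instructions in the buffer is irrelevant to the resource check."""
    return comb(n + k - 1, k)

for n, k in ((32, 2), (32, 4), (64, 4)):
    print(n, k, ordered_tuples(n, k), unordered_tuples(n, k))
```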
Let us first assume a processor that can provide an issue rate of 2 instructions per cycle. In this case the rule table corresponding to an $n$-instruction processor has to have $n^2$ entries. The element $R_{k,l}$ in the rule table provides information about the parallel execution of the instruction pair $(i_l, i_k)$. Consequently, we need at most $n^2$ rules in order to be able to check all the resource conflicts that might appear in an instruction stream. The precise number of rules is actually lower than that because, from the issue logic perspective, we can assume that the symmetric instruction pairs are equivalent, i.e., $(i_l, i_k) \equiv (i_k, i_l)$, and under this assumption the number of rules decreases to $\bar{C}_n^2 = C_{n+1}^2 = \frac{n(n+1)}{2}$, which of course is in the order of $n^2$. In the following theorem we will provide an answer to the following research question:

- What are the consequences of an increased issue rate assumption (a larger instruction level parallelism) on the design complexity of the issue logic that has to check for resource availability?

Theorem 3 Assuming a superscalar processor with restricted hardware resources, an in-order issue policy, and an issue width of $k$ instructions per cycle on an instruction stream with no data dependencies, the corresponding issue block based on opcode decoding has an area in the order of $n^k$ and a delay in the order of $k^2 \log n$.

Proof: Assume an issue rate $k \ge 2$. In this case we need to implement rules for any possible tuple of $k$ instructions $P = (i_1, i_2, \ldots, i_k)$, where $i_j \in I$, $j = 1, 2, \ldots, k$. Assume also that the instruction order is not relevant, i.e., $(i_1, i_2, \ldots, i_k) \equiv (i_{\sigma(1)}, i_{\sigma(2)}, \ldots, i_{\sigma(k)})$, with $\sigma$ a permutation over $\langle 1, 2, \ldots, k \rangle$, and that repetition is allowed, i.e., a given instruction $I_j \in I$ can appear more than one time in the composition of the tuple $P$. Consequently, the number of rules to be accommodated by the issue logic will be given by the number of combinations with repetition $\bar{C}_n^k$, and this is no longer quadratic in the number of instructions $n$ supported by the processor.

However, the issue logic should be able to do more than this. When an issue width $k$ larger than $2$ is targeted, it is not enough to have a rule table which gives the information related to the parallel issue of $k$-instruction tuples. When the issue logic decides that the parallel execution of the $k$ instructions under consideration can not be performed, it should be able to find the maximum number of instructions $k' < k$ which can be issued in parallel for the considered instruction tuple. Consequently, when $k$ is larger than $2$ the issue logic should be able to analyze $k$-instruction tuples as well as $k'$-instruction tuples, $k' = k-1, k-2, \ldots, 3, 2$, in order to exploit the maximum instruction level parallelism one can get on an instruction stream. In other words, the issue logic will have to use a rule table with $\bar{C}_n^k + \bar{C}_n^{k-1} + \cdots + \bar{C}_n^3 + \bar{C}_n^2$ rules.

Assume that $CR_l$, $l = k, k-1, \ldots, 2$, is the circuit that implements all the $C_{n+l-1}^l$ rules^13 that correspond to $l$-instruction tuples. This circuit, which behaves like a decoder, has $C_{n+l-1}^l$ outputs, each of them signaling the successful issue of a certain instruction combination, and at most one of its outputs can be "1" at a time. If one of these outputs is "1" the issue logic is signaling the fact that the first $l$ instructions in the buffer were successfully parallelly issued. If all the $C_{n+l-1}^l$ outputs are "0" the $l$-instruction tuple can not be issued and the issue logic has to check the potential issue of an $(l-1)$-instruction tuple. Given that a $CR_l$ circuit has to evaluate $l$-term AND products, the delay of such a circuit can be considered in the order of $\log l$. In order to detect if for the current value of $l$ an issue occurred or not we have to evaluate a logic OR over all the $CR_l$ outputs, i.e., $n^l$ signals, and this OR can be computed after a delay in the order of $l \log n$. The $CR_l$ circuit can be implemented with $n^l$ AND gates and one OR gate. Under the assumption that the gate area does not depend on the fan-in, the implementation cost of the $CR_l$ circuit can be estimated as being equal with $n^l + 1$ gates. Consequently, the overall cost of the logic is given by $A = \sum_{l=2}^{k} (n^l + 1) = \frac{n^2(n^{k-1}-1)}{n-1} + (k-1)$ and this is indeed in the order of $n^k$.

In the best case scenario the issue logic can signal and accomplish the parallel issue of the entire instruction buffer, and this corresponds to a delay in the order of $\log k + k \log n$. In the worst case scenario the logic initiates the investigation of the parallel issue with a $k$-instruction tuple and ends it with a $2$-instruction pair. This corresponds to a delay of

$$\Delta = \log k + k \log n + \log(k-1) + (k-1)\log n + \cdots + \log 3 + 3\log n + \log 2 + 2\log n = \log k! + \frac{(k-1)(k+2)}{2}\log n$$

which is, asymptotically speaking, in the order of $k^2 \log n$. This delay can be reduced if we assume that all the $CR_l$, $l = k, k-1, \ldots, 2$, circuits initiate the resource checking at the same time, but the circuit $CR_l$ does not issue a possible $l$-instruction tuple until it receives a signal from the circuit $CR_{l+1}$ that signifies that the $(l+1)$-instruction tuple could not be issued. This reduces the $\log k!$ term of $\Delta$ to $\log k$ but it does not change the asymptotic bound for the overall delay. □

In the previous theorem we assumed an in-order issue policy and have shown that the issue logic has an area in the order of $n^k$ and that the preprocessing takes $k^2 \log n$ delay.

^13 We note here that some of these rules do not actually have to be implemented. They correspond to combinations that can not be issued, and in practice we have to implement only the YES set of rules. The cardinality of the YES set depends on the values we assumed for the number of units $k_j$, $j = 1, 2, \ldots, n$, and can assume values between $C_n^l$ and $\bar{C}_n^l$. However, in order to simplify the derivation, and because both limit values are in the order of $n^l$, we assume in the following that we implement all the rules. This assumption also has as a side effect the fact that the constants $k_j$ do not appear in the computations because they were all replaced by their upper bound $k$.

These complexities do not change if we assume an out-of-order issue policy, even though for an implementation both the area and the delay will increase. In order to handle out-of-order issue we maintain the scheme in Theorem 3, but now there are $C_k^l$ possible $l$-instruction tuples to be evaluated for each $l$, $l = k, k-1, \ldots, 2$. If we want to be able to evaluate all of them in parallel we have to build $C_k^l$ $CR_l$ circuits for each $l$, $l = k, k-1, \ldots, 2$, instead of one. Consequently, the overall circuit area is given by $A_{O3} = \sum_{l=2}^{k} C_k^l\, A(CR_l) = \sum_{l=2}^{k} C_k^l (n^l + 1) = (n+1)^k - kn + 2^k - k - 2$, and this implies an implementation cost in the order of $n^k$. The delay for this scheme is given by $\Delta_{O3} = \sum_{l=2}^{k} \left( \log l + \log C_k^l + l \log n \right) = \log k! + \log\!\left( C_k^2\, C_k^3 \cdots C_k^k \right) + \frac{(k-1)(k+2)}{2}\log n$, and this is in the order of $k^2 \log n$.

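The area and delay sums from the proof of Theorem 3 (in-order case) can likewise be tabulated; the sketch below simply evaluates the sum of (n^l + 1) and the worst-case serial delay for a few arbitrary (n, k) pairs, under the same simplification that all rules are implemented.

```python
from math import log2

def rule_table_area(n, k):
    """Gate count of the Theorem 3 scheme: one CR_l block per tuple size l,
    each with n**l AND gates plus one OR gate, for l = 2 .. k."""
    return sum(n ** l + 1 for l in range(2, k + 1))

def worst_case_delay(n, k):
    """Serial check over l = k, k-1, ..., 2: each step costs about log2(l)
    levels for the l-term AND products plus l*log2(n) levels for the OR
    over the n**l outputs of CR_l."""
    return sum(log2(l) + l * log2(n) for l in range(2, k + 1))

for n, k in ((32, 2), (32, 4), (64, 4)):
    print(n, k, rule_table_area(n, k), round(worst_case_delay(n, k), 1))
```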
5. Restricted Hardware Resources with Data Dependencies

In the following we assume that the issue logic has to handle both data dependencies and hardware conflicts, i.e., the real life scenario. In this case the area is given by the sum of the area of the data dependency logic and the area of the hardware conflict logic. Even though the data dependency logic and the hardware conflict logic can work in parallel, what we would like to do in practice is to first check the instruction buffer for data dependencies (solving some of them by renaming) and after that try to issue as many instructions with no data dependencies as possible. The asymptotic complexities for an issue logic which has to solve data dependencies as well as resource conflicts are stated in the following corollary.

Corollary 3 Assuming a superscalar processor with restricted hardware resources and an issue width of $k$ instructions per cycle, the corresponding issue logic based on opcode decoding has an area in the order of $n^k$ and a delay in the order of $k^2 \log n$.

Proof: Immediate from the fact that $n$, the cardinality of the instruction set, is larger than $k$, and from the previous results. The area is dominated, regardless of the assumed issue policy, by the complexity of the logic that performs the check for resource conflicts and is in the order of $n^k$. The same observation holds true for the delay and this leads to an expected overall delay in the order of $k^2 \log n$. □

As one may observe, the delay associated with the issue logic depends on the issue width and on the cardinality of the instruction set $n$. A large delay for the issue logic might actually mean that more than one pipeline stage will be needed for the decode & issue logic. This solution is very much detrimental to performance due to the more complicated and lengthy procedure such an organization will need in order to recover from mispredicted branches. The area is dominated by the term depending on $n$, the number of instructions supported by the processor. Consequently, if one seeks a substantial complexity reduction one should focus on the reduction of this term^14. In the section to follow we will investigate potential schemes to reduce the area complexity of the issue logic.

^14 One can reduce the area and also the delay by reducing the value of $k$, but this is not of interest because this type of solution will decrease the machine parallelism.

6. Partitioning of Issuing

Even though there have been superscalar machine designs deciding on issuing via opcode description, see for example [3], as the previous results suggest such an approach should be avoided as it severely limits the potential of the machine implementation to a very restrictive issuing. There are two main reduction techniques that could be followed (and have been followed), namely:

- hierarchical partitioning of instructions and
- hardware utilization partitioning of instructions.

Both techniques, separately and preferably in combination, allow the actual rules that get implemented to be reduced substantially. The hierarchical partitioning is based on the following principles (assuming a load/store architecture):

- A subset of instructions, excluding loads and stores that could be treated separately, belongs to a hierarchical class if they operate on the same set of registers.
- A subset of instructions that manipulate sets of registers is viewed as a separate class and treated separately.

This definition also accommodates architectures containing register-to-memory operations but does not include memory-to-memory instructions. The problem is resolved with an additional statement indicating that such a subset of instructions belongs to a separate hierarchical class. It must be noted that current architectures mostly can be subdivided into two to three large classes, namely memory-to-memory, floating point, and fixed point. It is quite advisable to enlarge the number of classes. However, currently such a division is not sufficient to substantially reduce the complexity of the decode/issue logic. An additional technique, named hardware utilization partitioning and discussed in [11], should be added to substantially reduce the rules. In such a technique the instruction set $I = \{I_1, I_2, \ldots, I_n\}$ is partitioned into classes $C_i = \{I_1^i, I_2^i, \ldots, I_{c_i}^i\}$, $I_j^i \in I$, $j = 1, 2, \ldots, c_i$, $i = 1, 2, \ldots, m$, such that $I = \bigcup_{i=1}^{m} C_i$, as follows:

- An instruction belongs to a class $C_i$ if it uses the same hardware units as all the others in that class.
- All the instructions $I_j^i \in C_i$, $j = 1, 2, \ldots, c_i$, are viewed as the same instruction from the perspective of the issue logic.

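As an illustration of hardware-utilization partitioning, the sketch below maps opcodes to classes and checks resource rules at class granularity; the class names, the opcode-to-class map, and the issuable class pairs are all assumptions made for the example (they are not taken from [11]), but they show how the rule table shrinks from opcode tuples to class tuples.

```python
# Hypothetical mapping of opcodes to hardware-utilization classes (m = 3 here).
CLASS_OF = {
    "Add": "ALU", "Sub": "ALU", "Comp": "ALU", "And": "ALU", "Or": "ALU",
    "Load": "MEM", "Store": "MEM",
    "FAdd": "FPU", "FMul": "FPU",
}

# Rules are defined over class pairs (k = 2 here), so at most m**2 of them
# are needed, instead of n**2 opcode-pair rules.
ISSUABLE_CLASS_PAIRS = {
    ("ALU", "ALU"), ("ALU", "MEM"), ("MEM", "ALU"),
    ("ALU", "FPU"), ("FPU", "ALU"), ("MEM", "FPU"), ("FPU", "MEM"),
}

def issuable(op_a, op_b):
    """Resource-conflict check performed at class granularity."""
    return (CLASS_OF[op_a], CLASS_OF[op_b]) in ISSUABLE_CLASS_PAIRS

print(issuable("Add", "FMul"))    # True under the assumed class rules
print(issuable("Load", "Store"))  # False: a single memory port is assumed here
```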
The reasoning behind the definition of such a partition is the fact that in an implementation there are a limited number of functional units and each of them operates on a multiplicity of instructions. As an example, an ALU can execute instructions such as add, subtract, compare, logical OR, logical AND, and so forth. These instructions are not the same, but the differences are such that they can be resolved by simple control signals or by minor modifications to accommodate the operation. For example, an addition differs from a subtraction (in 2's complement representation) in that the latter requires an inversion and the addition of a 1. This difference is typically solved by injecting a "hot 1" carry-in into the ALU together with control information indicating that a subtraction is to be performed rather than an addition. As a consequence of partitioning, all the instructions in a class are equivalent from the perspective of the issue logic. This means that instead of rules for instruction tuples, as for example described in [3], we have to define and implement rules for class tuples. Consequently, the number of rules to be implemented by the logic that checks whether or not a $k$-instruction tuple can be parallelly issued will diminish. As an example, if $k = 2$ is assumed, the rule table will reduce its dimension from $n^2$ to $m^2$. The design complexity of the issue logic of a superscalar processor with a partitioned instruction set is stated in the following corollary.

Corollary 4 Assuming a superscalar processor with restricted hardware resources, an issue width of $k$ instructions per cycle, and with the instruction set partitioned into $m$ classes, the corresponding issue block has an area in the order of $m^k$ and a delay in the order of $k^2 \log m$.

Proof: Trivial from Corollary 3 and the partitioning of the instruction set in $m$ classes. □

Clearly the hierarchical partitioning also follows the same order of complexity and is not discussed any further.

7. Conclusions

In this paper we have investigated issues related to the design complexity of the instruction decode/issue logic. In particular we have shown that, in order to check only for data dependencies, the decode/issue logic has to have an area in the order of $k^3$ and a delay in the order of $\log k$ for in-order issuing; an area in the order of $2^k$ and a delay in the order of $\log k$ for out-of-order issuing; and an area in the order of $2^k$ and a delay in the order of $k^2$ for out-of-order issuing with renaming. To check for data dependencies as well as for resource availability, the issue block has an area in the order of $n^k$ and a delay in the order of $k^2 \log n$ when the issue decision making is based on opcodes. When the decision making is based on instruction partitioning, the issue block has an area in the order of $m^k$ and a delay in the order of $k^2 \log m$, where $m$ is the number of instruction classes.

References

[1] G. Grohoski. Machine organization of the IBM RISC System/6000 processor. IBM Journal of Research and Development, 34(1):37–58, Jan. 1990.
[2] J. Hennessy and N. Jouppi. Computer Technology and Architecture: An Evolving Interaction. Computer, 24:18–29, Sept. 1991.
[3] R. Horst, R. Harris, and R. Jardine. Multiple Instruction Issue in the NonStop Cyclone Processor. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 216–226, 1990.
[4] W. Johnson. Superscalar Microprocessor Design. Prentice-Hall, Englewood Cliffs, NJ, 1991.
[5] M. Lipasti and J. Shen. Superspeculative Microarchitecture for Beyond AD 2000. Computer, 30:59–66, Sept. 1997.
[6] J. Liptay. Design of the IBM Enterprise System/9000 high-end processor. IBM Journal of Research and Development, 36(4):713–731, July 1992.
[7] S. Palacharla, N. Jouppi, and J. Smith. Complexity-Effective Superscalar Processors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, 1997.
[8] Y. Patt, S. Patel, M. Evers, D. Friendly, and J. Stark. One Billion Transistors, One Uniprocessor, One Chip. Computer, 30:51–57, Sept. 1997.
[9] J. Smith and S. Vajapeyam. Trace Processors: Moving to Fourth-Generation Microarchitectures. Computer, 30:68–74, Sept. 1997.
[10] H. Torng and S. Vassiliadis. Instruction-Level Parallel Processors. IEEE Computer Society Press, 1995.
[11] S. Vassiliadis, B. Blaner, and R. Eickemeyer. SCISM: A Scalable Compound Instruction Set Machine. IBM Journal of Research and Development, 38(1):59–78, Jan. 1994.