Proceedings of the 6th WSEAS Int. Conf. on Systems Theory & Scientific Computation, Elounda, Greece, August 21-23, 2006 (pp214-219)

Evaluating x86 condition codes impact on superscalar execution

VIRGINIA ESCUDER, RAÚL DURÁN, RAFAEL RICO
Computer Engineering Department, Universidad de Alcalá
Escuela Politécnica, Campus Universitario, 28871 Alcalá de Henares
SPAIN

Abstract: - The design of instruction sets is a fundamental aspect of computer architecture. A critical requirement of instruction set design is to allow for concurrent execution, avoiding those constructs that may produce data dependencies among instructions. It is therefore important to have methods and tools to evaluate the behavior of instruction sets and to quantify the influence of particular architectural features on the overall available parallelism. We propose an analysis method that applies graph theory to gather metrics that evaluate the impact of different characteristics of instruction sets as sources of coupling, thus quantifying the available parallelism. We present a case study using the x86 instruction set and obtain some measures of the influence of condition flags on code coupling.

Key-Words: - Instruction Set Architecture, Instruction Level Parallelism, Graph Theory.

1 Introduction

The design of instruction sets is a fundamental part of computer architecture. One of its most critical aspects has to do with concurrency: the requirements that must be fulfilled to allow for superscalar execution. A crucial condition for the parallel execution of programs is to avoid the code coupling caused by data dependencies among instructions. It is therefore important to have tools to evaluate the behavior of existing instruction sets and to quantify the influence of particular architectural features on the overall available parallelism. Despite its importance, there is not as much research on this subject as one could expect, and the majority of studies about instruction set architecture evaluate the distribution of instruction type usage (e.g. on the VAX [2] or the x86 [1]) rather than data usage [5]. The method we propose applies graph theory to gather metrics that evaluate the impact of different characteristics of instruction sets as sources of coupling, and quantifies the available parallelism according to the length of dependency paths [4, 7]. To demonstrate the application of the model on a case study, we show the results of analyzing the influence of condition codes in a program that uses the x86 instruction set.

2 The analytical model

Modeling parallelism using dependency graphs has been a popular approach for medium- and large-grain parallelism, but not so much for fine grain. In our approach we model instruction sequences that show data dependency as directed graphs and then use the matrix representation of graphs for data processing. Matrix D is the dependency matrix, defined as:

d_{ij} = \begin{cases} 1, & \text{whenever instruction } i \text{ depends on } j; \\ 0, & \text{otherwise} \end{cases}    (1)

In this matrix, a row is a vector d_i that specifies all direct data dependencies of instruction i, and the set of all rows corresponds to the sequence of code under analysis. As this matrix is to be used consistently for the analysis of instruction level parallelism, we define a set of restrictions and properties on D:
• The labeling of instructions should not affect the properties of D. Consequently, permutation of rows and columns does not affect these properties. The labeling that corresponds to the order of appearance of instructions in the code (program order labeling) is the most intuitive approach.
• As an instruction does not depend on itself, the matrix's diagonal is null.
• If an instruction depends on another instruction, the latter cannot depend on the former; this asymmetry implies that the matrix is not symmetric.
• If we use program order labeling, the matrix D is lower triangular. In this case we have the canonical matrix Dc.
• The l-th power of matrix D, D^l, contains all dependency paths of length l+1 computation steps.
• In a sequence of n instructions, D^n is always null.
• There are no dependency cycles, as an instruction can never depend on itself in any dependency path; therefore the diagonal of any power of D is also null.

A dependency path is composed of computing steps, which in turn are the successive processing states of an ordered sequence of instructions. For example, a completely serial n-instruction sequence (with no parallelism at all) would require a maximum of n computing steps, one per instruction. Conversely, a totally independent sequence could, in theory, be processed in one (parallel) computing step. Algebraically, we can obtain some useful information about the potential parallelism of the code (see [3] for formal definitions and demonstrations). These are:
• Coupling C: a tightly coupled code shows more dependencies, so we use coupling to quantify the potential ordering of the code and find its bounds:

C = \sum_{i=0}^{n-1} \sum_{k=0}^{n-1} d_{ik} = \sum_{i=1}^{n-1} \sum_{k=0}^{i-1} d_{c,ik};  0 \le C \le \binom{n}{2}    (2)

• Data reuse, or minimum data lifetime: the longest length of the dependency paths existing from a data-producer instruction to its different consumer instructions. We state this as:

t_i^{min} = \max_{1 \le j \le n} \left\{ 0, k_j : \left[D^{k_j}\right]_{ij} \neq 0, \left[D^{k_j+1}\right]_{ij} = 0 \right\}    (3)

• Critical path length L: the longest dependency path of the instruction sequence, obtained as L = l for the smallest l such that D^l = 0. It is bounded by 1 and n computing steps.
• Parallelism degree Gp: we identify the number of instructions that can potentially be executed in parallel as an expression derived from L:

G_p = n / L;  G_p \in [1, n]    (4)

A minimal sketch of these computations on a toy dependency matrix is shown below.

2.1 Dependency source composition

A very important property of the proposed model is that matrix D can be the aggregation of different dependency sources if we represent each of these sources as a dependency matrix itself. So we have:

D = D_{S_1} \text{ or } D_{S_2} \text{ or } \ldots \text{ or } D_{S_m}    (5)

where D_{S_m} is the matrix for the m-th source of dependency. All the properties and parameters defined for D apply to each component matrix. In particular, the critical path length of the composition is bounded by the components' critical paths:

\max_i \{L_{S_i}\} \le L \le \min\left( \sum_i L_{S_i},\; n \right)    (6)

A short sketch of this composition follows.
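Under the same assumptions as the previous sketch (an illustration of ours, not the authors' tool), the fragment below composes two hypothetical dependency sources with an element-wise OR as in Eq. (5) and checks the bound of Eq. (6):

import numpy as np

def critical_path(D):
    # Smallest l with D^l = 0, i.e. the longest dependency path in steps.
    P, L = D.copy(), 1
    while P.any():
        P = P @ D
        L += 1
    return L

# Hypothetical sources over a 3-instruction sequence.
D_cc  = np.array([[0,0,0], [0,0,0], [0,1,0]])   # e.g. condition-code dependences
D_ncc = np.array([[0,0,0], [1,0,0], [0,0,0]])   # e.g. register/memory dependences

D = D_cc | D_ncc                                # Eq. (5): element-wise OR

L_cc, L_ncc, L = critical_path(D_cc), critical_path(D_ncc), critical_path(D)
assert max(L_cc, L_ncc) <= L <= min(L_cc + L_ncc, len(D))   # Eq. (6)
print(L_cc, L_ncc, L)                                       # -> 2 2 3

Note how the composition forms the chain 2 -> 1 -> 0, so its critical path (3) exceeds that of either component (2), yet stays within the bounds of Eq. (6).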

3 Condition codes in the x86ISA

One motive for selecting the x86ISA is its current heavy utilization: even though the x86 instruction set was designed long ago and the criteria used then are now obsolete, the set is still in use because it has been maintained for binary compatibility reasons. Another reason for selecting the x86 architecture is its behavior in superscalar environments: it performs very poorly compared to non-x86 sets. The IPC is 0.5 to 3.5 in different x86 execution models, compared to an IPC of 2.5 to 15 (with peaks of 30) for non-x86 processors. This fact leads us to think that the architecture of the instruction set itself may be a limiting factor in the available parallelism. Some of the x86ISA factors that may negatively affect the amount of available parallelism are the dedicated use of registers, the use of implicit operands, complex address computation and the use of condition codes.

Using condition codes is one alternative for implementing conditional control flow. The evaluation of the branch condition is performed using one or more condition-code bits. A processing instruction typically precedes a conditional branch and therefore creates a dependence which requires serial execution. Architectures using this scheme are called status register architectures; the x86 and the PowerPC are examples. In superscalar execution, condition codes increase the ordering of instructions as they pass information from one instruction to the next. Theoretically, there are two other alternatives for implementing conditional control flow: evaluating the contents of a register named in the branch instruction against a criterion also made explicit in the branch instruction, and atomizing the comparison and branching actions into a single instruction. The first alternative, commercially adopted in the Alpha and MIPS architectures for instance, is simple and also optimal for superscalar execution, whereas the second one, used in the PA-RISC and VAX processors, makes the pipeline design more complex as it joins two operations in one.

0: r1 op_a r2 → r3
1: r4 op_b mem1
2: r5 op_c r6 → r7 (→ status)
3: if status == cc go to 0

Fig. 1. Condition codes impact on parallelism in a typical basic block. [Graph: arc from instruction 2 to instruction 3, a dependence through condition codes.]


To better understand the impact of condition codes on parallelism, we focus the analysis on the basic block structure. Figure 1 is an example of this, where a basic block is shown together with its corresponding condition-code data dependence graph. Instruction 2 generates a true dependence with instruction 3. True dependences have computational meaning and require serial execution: the block needs two computing steps to execute in a superscalar environment. This example shows how using condition codes intrinsically decreases the parallelism of a basic block.

0: r1 op_a r2 → r3 (→ status)
1: r4 op_b mem1
2: r5 op_c r6 → r7 (→ status)
3: if status == cc go to 0

Fig. 2. Two processing operations with two status writes. [Graph: output dependence through condition codes from instruction 0 to instruction 2, true dependence through condition codes from instruction 2 to instruction 3.]

In Fig. 2 we have another, similar basic block with two processing instructions instead of one. Both write into the status register, so three computing steps are necessary to process the block, revealing a lower degree of available parallelism in the block. Compared to Fig. 1, the true dependence remains, and the new data dependence is an output dependence that is imposed by the architecture of the instruction set and has no computational meaning.

0: r1 op_a r2 → r3
1: r4 op_b mem1
2: r5 op_c r6 → r7 (→ status)
3: if status == cc go to 0

Fig. 3. Two processing operations with a single status write. [Graph: single arc from instruction 2 to instruction 3, a dependence through condition codes.]

The PowerPC is another status register architecture but, in contrast to the x86, the format of its data processing instructions includes a bit that indicates whether the condition bits must be updated or not. This effectively limits the coupling produced by condition codes to the cases where it really has computational meaning, and the compiler is in charge of driving the decision. The resulting graph, shown in Fig. 3, only has a true dependence arc, so the block may be processed in fewer computing steps.

Although condition codes are typically used for conditional branching in status register architectures, in the case of the x86ISA status flags can also be used as input operands for some operations, thus creating true dependences. We then need to take this special use into account for the analysis. Instructions in the x86ISA can be classified into five groups:
• Group I: data transfer instructions whose purpose is to copy condition codes into the accumulator or to the stack and vice versa.
• Group II: processing instructions using condition codes as an extra input operand. This includes instructions for BCD representation adjustments, instructions for extended arithmetic, and rotations through the carry flag.
• Group III: processing instructions (arithmetic or logical) accessing status flags exclusively for writing, in order to qualify the result of the operation performed (see Table 1).
• Group IV: conditional branch instructions, shown in Table 2. The condition may require logically combining more than one status flag, resulting in only eight different access patterns; pairs of entries denote complementary conditions, that is, those looking for a false (0) instead of a true (1) value in the same flags.
• Group V: special instructions implementing loops, prefix instructions, string handling, conditional overflow interrupt and carry flag handling, which use condition flags in miscellaneous manners.

Group III (flags read: none; flags written: per instruction)
Arithmetic: ADD, CMP, DEC, DIV, IDIV, IMUL, INC, MUL, NEG, SUB
Logic: AND, OR, ROL, ROR, SHL/SAL, SAR, SHR, TEST, XOR

Table 1. Processing instructions. [The original table marks, for each instruction, which of the O, S, Z, A, P and C status flags it writes; none of these instructions reads the flags. The per-flag layout was lost in extraction.]

Group IV (flags read per condition; no flags written)

JB/JNAE:  C          JNB/JAE:  C
JBE/JNA:  Z, C       JNBE/JA:  Z, C
JE/JZ:    Z          JNE/JNZ:  Z
JL/JNGE:  S, O       JNL/JGE:  S, O
JLE/JNG:  S, Z, O    JNLE/JG:  S, Z, O
JO:       O          JNO:      O
JP/JPE:   P          JNP/JPO:  P
JS:       S          JNS:      S

Table 2. Conditional branch instructions. [Each row pairs complementary conditions, which test opposite values of the same flags, giving the eight access patterns mentioned above.]
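The access patterns of Tables 1 and 2 can be encoded as per-mnemonic flag sets to detect condition-code dependences mechanically. The following fragment is an illustrative sketch of that idea (our construction, not the paper's analyzer), covering only a few mnemonics with their standard x86 flag semantics:

# Flag sets for a few Group III / Group IV mnemonics (standard x86 semantics).
FLAGS_WRITTEN = {
    "ADD": {"O", "S", "Z", "A", "P", "C"},
    "CMP": {"O", "S", "Z", "A", "P", "C"},
    "INC": {"O", "S", "Z", "A", "P"},      # INC does not write the carry flag
}
FLAGS_READ = {
    "JE":  {"Z"},
    "JL":  {"S", "O"},
    "JBE": {"Z", "C"},
}

def cc_true_dependence(producer, consumer):
    # True dependence through condition codes: the consumer reads a flag
    # that the producer writes.
    return bool(FLAGS_WRITTEN.get(producer, set()) &
                FLAGS_READ.get(consumer, set()))

print(cc_true_dependence("CMP", "JE"))   # True:  CMP writes Z, JE reads Z
print(cc_true_dependence("INC", "JL"))   # True:  INC writes S and O
print(cc_true_dependence("JE", "JBE"))   # False: branches write no flags

Analogous sets for flag reads by Group II instructions and flag writes crossing instructions would complete the detection of anti- and output dependences.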


4 Coupling in the basic block due to condition codes

The basic block is an advantageous structure for analyzing the impact of condition codes on the potential parallelism of a program, as it is a good scenario for identifying and understanding different data coupling patterns. As we are interested in the coupling produced by condition-code accesses only, we shall identify how these accesses impose an order of precedence on the execution of the block. Considering condition-code data dependences only, each basic block necessarily contains a true dependence between the last instruction updating the status flags before the conditional branch instruction that reads them and the branch instruction itself. From a statistical point of view, the number of true dependences will increase with the number of basic blocks present in the code. In other words, the smaller the basic block, the higher the number of true dependences in the trace.

Another typical coupling in the basic block is due to output dependences: the order imposed by processing instructions performing successive writes to the same resource, in this case the status register. The limitation on the amount of available parallelism due to this type of dependence is a direct consequence of the instruction set architecture and has no computational meaning at all. Statistically, the average length of output dependence chains grows with the number of processing instructions in a basic block, and large basic blocks also tend to contain a large number of processing instructions. In summary, we can reasonably state the following:
• Larger basic block sizes decrease the hazard of true dependences caused by condition codes.
• Larger basic block sizes may increase the length of output dependence chains caused by condition codes.

Both effects can be seen on the basic block of Fig. 2 in the sketch below.
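As an illustration (an encoding of ours, reusing the model of Section 2), the condition-code dependence matrix of the Fig. 2 basic block combines one true dependence and one output dependence into a fully serial chain:

import numpy as np

# Fig. 2 block: instructions 0 and 2 write the status flags, branch 3 reads them.
D_cc = np.zeros((4, 4), dtype=int)
D_cc[2, 0] = 1   # output dependence: 2 overwrites the flags written by 0
D_cc[3, 2] = 1   # true dependence: 3 reads the flags written by 2

# Critical path: D^2 != 0 (path 3 -> 2 -> 0), D^3 = 0, so L = 3 computing
# steps, matching the three steps stated for Fig. 2.
P, L = D_cc.copy(), 1
while P.any():
    P, L = P @ D_cc, L + 1
print(L)  # -> 3

Removing the arc D_cc[2, 0] models the PowerPC-style single status write of Fig. 3 and brings L down to 2.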

5 Quantifying the impact of condition codes accesses

5.1 Defining compositions

The complete space of contributing data dependence sources depends on the objective of the ongoing analysis. In our case we are focusing on condition codes, so we have:
a. contribution from all data types;
b. contribution from condition codes only;
c. contribution from non condition codes only.
We also wish to distinguish among the different types of dependences: true dependences and non-true dependences, the latter further divided into anti-dependences and output dependences. Combining data type contributions and dependence types into meaningful compositions leads to the combinations shown in Table 3.

Composition ID   Composition Mix    Data contribution
1a / 1b / 1c     ALL                All data / CC only / Non CC
2a / 2b / 2c     TRUE               All data / CC only / Non CC
3a / 3b / 3c     NON-TRUE           All data / CC only / Non CC
4a / 4b / 4c     ANTI-dependences   All data / CC only / Non CC
5a / 5b / 5c     OUTPUT             All data / CC only / Non CC

Table 3. List of each composition identifier and its components.

According to Equation 6, the longest path length among all partial components is the lower bound for the composition path length L. That is, for the full composition 1a, either the 1b path (Lcc) or the 1c path (Lncc) may limit the available parallelism by exhibiting the longest path. If Lcc is longer than Lncc, then there is room for improving the concurrency of the full composition by minimizing the condition codes contribution. Furthermore, if this happens, any effort to minimize dependences caused by other sources will have no effect, as the condition codes influence is dominant.

5.2 Testbench

The testbench is a set of DOS utility programs (comp, find and debug) compiled in real mode, as well as some popular applications: the file compressor rar (v. 1.52) and the tcc C-language compiler (v. 1.0). The program go from the SPECint95 suite has also been included, using two different compilation options: one optimizes for size and the other for speed. The programs were run in step-by-step mode and under specific workload conditions to avoid excessively long traces. Nevertheless, more than 190 million instructions were executed.

5.3 Experimental framework

We selected static 512-instruction sequence windows. Sliding windows, the typical mode used at the physical layer of processors and simulators, impose an excessively heavy computational load and add no additional precision compared to a scenario using sufficiently large static windows. We tested window sizes up to 2,048 instructions and found practically no changes in the results obtained, while computing time increased substantially. Relevant literature also confirms that, for large instruction windows, the information obtained from sliding and static windows is the same [8]. We can also argue that there is a very significant difference in magnitude between the number of instructions in a large window and the number of data locations supported by the ISA, even considering memory as a single resource, so the frontier effects caused by a static window can be neglected.

The mathematical manipulation for the quantitative evaluation described above is performed automatically by a software application designed for this very purpose [9]. It allows the analysis of variable-size instruction windows. Given a profile of the dependence contributions, it builds the relevant dependence matrices and obtains the parameter set presented in Section 2 for each component and for the total composition. A sketch of the windowing procedure follows.
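The following sketch shows one way such a static-window analysis could be organized (an assumption on our part; the actual ADD tool [9] may differ): the trace is cut into fixed 512-instruction windows and the critical path is computed per window, so dependences crossing a window boundary are ignored:

import numpy as np

WINDOW = 512

def critical_path(D):
    # Smallest l with D^l = 0: longest dependency path in computing steps.
    P, L = D.copy(), 1
    while P.any():
        P = P @ D
        L += 1
    return L

def windowed_critical_paths(deps, n):
    # deps: (i, j) pairs meaning instruction i depends on j, with i > j.
    lengths = []
    for start in range(0, n, WINDOW):
        size = min(WINDOW, n - start)
        D = np.zeros((size, size), dtype=int)
        for i, j in deps:
            if start <= j and i < start + size:  # keep intra-window arcs only
                D[i - start, j - start] = 1
        lengths.append(critical_path(D))
    return lengths

# Example: the Fig. 2 block analyzed as a single (short) window.
print(windowed_critical_paths([(2, 0), (3, 2)], 4))  # -> [3]

Because the number of architectural data locations is tiny compared to the window size, very few dependence arcs can span a 512-instruction boundary, which is the frontier-effect argument made above.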

5.4 Results

Figure 4 contrasts the length of the critical path when condition codes are or are not included in the dependence source compositions. Each graph is normalized against the critical path corresponding to the contribution from all data types. The first graph shows the condition codes contribution for all dependence types. In the case of rar-compressing and debug, the path length Lcc is a lower bound on the complete composition's critical path length L because it is longer than Lncc. For these two cases the impact of condition codes is dominant, and it will jeopardize any general decoupling expected from changes operating on the rest of the data. These results agree with those presented in [7], where the absence of dependences caused by condition codes produces a very important performance improvement for these same programs. The second graph of Fig. 4 shows that the impact caused by condition codes on true dependences is negligible compared to the impact caused by the other data type contributions; it corresponds to a processing instruction writing a condition code read by the branch instruction following it. The third and fourth graphs show that condition codes contribute basically as output dependences and cause practically no anti-dependences compared to other sources. According to the fourth graph, the most important output dependence contribution is for program rar-decompressing, which also has a rather large basic block; this is in agreement with the tendency, mentioned above, of larger basic blocks to increase the length of output dependence chains caused by condition codes.

Fig. 4. Contribution of cc / non-cc for each dependence type or dependence composition.

In the context of basic blocks, anti-dependences occur when we have a pair of instructions where the first one reads a condition code (to evaluate a branch condition) and the next one writes it (a processing instruction after the branch); this corresponds to inter-block dependences crossing the boundary between two consecutive basic blocks, and the length of these chains is 2, a much smaller value than the lengths found in the other data type source (non-cc) compositions. This explains the shape of the third graph.

Figure 5 provides a view of the contribution of the condition codes source alone to each dependence type on all program traces. Data is normalized considering 100% as the length of the critical path for all data dependence sources and all types of dependences, that is, the critical path length of the full composition.


Fig. 5. Isolated contribution of condition codes to the different compositions of dependence types for each program trace of the testbench.

In Fig. 5 we observe that condition codes mainly produce output dependences, and the columns for true and anti-dependences are very low. However, in general, the combination of these two (true dependences and anti-dependences) with the output dependences seems to enlarge the overall dependence chains. It appears that the few existing true and anti-dependences link two or more output dependence chains, producing a new, longer chain. Another interesting observation is that the resulting total dependences are much larger for program traces rar-compressing and debug than for the rest of the program traces. It seems that, when output dependences are accounted together with the other dependence types (true dependences and anti-dependences), the combination produces very different magnitudes of L. As both of these programs exhibit a block size quite a bit smaller than the other programs in the testbench, there seems to be a correlation between the size of the basic block and this effect of irregular enlargement.

6 Conclusions

In the x86ISA, condition codes are used mainly to assess branch decisions; even though they may also be used as input operands in some instructions, causing true dependences, this usage is very unusual. The analysis made shows a correlation between the size of the basic block and dependence hazards. A large basic block decreases the hazard of coupling due to true dependences caused by condition codes, although a larger block size may also lengthen the output dependence chains due to condition codes. Condition codes decrease the amount of available parallelism basically by generating output dependences. These types of dependences can be avoided using register renaming techniques; but because they have no computational meaning and originate only from the architecture of the x86ISA, this hardware solution is an absolute waste of resources. The analysis of particular features of instruction sets and the quantification of their impact in concurrent execution environments is an important aspect of instruction set architecture design. The analytical method we present here can be applied to study several other aspects related to concurrent execution over any ISA.

Acknowledgment

This work was supported in part by the Vicerrectorado de Investigación de la Universidad de Alcalá under Grant UAH PI2005/072.

References:
[1] T. L. Adams and R. E. Zimmerman, An analysis of 8086 instruction set usage in MS DOS programs, Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, 1989, pp. 152–160.
[2] D. Clark and H. Levy, Measurement and analysis of instruction set use in the VAX-11/780, Proceedings of the 9th Symposium on Computer Architecture, 1982, pp. 9–17.
[3] R. Durán and R. Rico, On Applying Graph Theory to ILP Analysis, Technical Note TN-UAH-AUTGAP-2005-01, 2005 (http://atc2.aut.uah.es/~gap).
[4] R. Durán and R. Rico, Quantification of ISA Impact on Superscalar Processing, Proceedings of EUROCON 2005, 2005, pp. 701–704.
[5] I. J. Huang and T. C. Peng, Analysis of x86 Instruction Set Usage for DOS/Windows Applications and Its Implication on Superscalar Design, IEICE Transactions on Information and Systems, Vol. E85-D, No. 6, 2002, pp. 929–939.
[6] R. Rico, Proposal of test-bench for the x86 instruction set (16 bits subset), Technical Report TR-UAH-AUTGAP-2005-21, 2005 (http://atc2.aut.uah.es/~gap).
[7] R. Rico et al., The impact of x86 instruction set architecture on superscalar processing, Journal of Systems Architecture, Vol. 51, No. 1, 2005, pp. 63–77.
[8] D. W. Wall, Limits of instruction-level parallelism, Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, 1991, pp. 176–188.
[9] Software tool (source code, configuration files, documentation): data dependence analyzer ADD; CVS repository (anonymous user): CVSROOT :pserver:[email protected]:2401/home/cvsmgr/repositorio