Evaluating Inlining Techniques
TR-96-001

C. R. Ramakrishnan, Dept. of CS, SUNY at Stony Brook, Stony Brook, NY 11794 (e-mail: [email protected])
Owen Kaser, Dept. of MSCS, UNBSJ, Saint John, N.B. E2L 4L5 (e-mail: [email protected])

August 14, 1996

1 Introduction

Procedure inlining is widely used by optimizing compilers for imperative and declarative languages [App92, BT86, Bal82, CMCH92, PS94, Sch77, Sta93]. Inlining permits a trade-off between code size and execution speed. The goal of inlining has traditionally been to reduce the execution time with a limited expansion of code space. The gain in execution speed may be due to direct effects, such as the reduction in the number of call and return instructions executed, or due to indirect effects, such as cache and virtual memory behaviour and in-context optimizations permitted on the inlined code [DH92, CHT91, McF91].

For efficiency and ease of implementation, many compilers restrict the conditions under which a procedure may be inlined. In some compilers (e.g., [CMCH92]), the calls that may be inlined depend on the order in which the procedures are declared. Several diverse heuristics are used to restrict inlining of recursive procedures. We call the restrictions imposed on inlining the inlining policy. An inlining technique consists of an inlining policy and a strategy for choosing a sequence of inlining operations that is consistent with the policy. Given a policy and a goal, it is often difficult to find an efficient strategy. For instance, maximizing the reduction in dynamic function calls, given a limited code space expansion, is NP-hard [Sch77]. Thus, strategies are often based on heuristics, and a greedy strategy is often chosen.

The focus of this report is on the comparison of inlining policies and techniques. First, we formalize the power and flexibility of inlining policies (see Section 3). We then identify policies based on the version of the inlined procedure that is used, as described below.

The Version Issue: There may be many versions of each procedure during inlining, but only one version will be used at a particular step. For instance, suppose a call site within the body of procedure P is inlined. After inlining, we have a new version of P, say P', which is semantically equivalent to the version of P before inlining. Now, if a call to P is subsequently inlined, it is equally valid to substitute P' (in fact, any version of P) in place of the call. However, the two versions may have important operational differences: for instance, substituting with P' may result in fewer function calls but may consume more code space. Thus, the choice of the version to inline may impact the effectiveness of the inlining technique.

Although some inliners have such restricted policies that the version issue is unimportant, the less restricted inliners all implicitly use the current version of a procedure, i.e., the most recent version available when the call is inlined. We show that any policy based exclusively on inlining current versions is inherently less powerful than a policy that allows any version to be inlined. In contrast, we show that a policy based exclusively on inlining original versions of procedures (the version before any inlining was performed) is as powerful as a policy that allows any version to be inlined. We also show, in Section 3.3, that the version issue is important only in the context of recursive programs.

We then consider greedy strategies with limited lookahead and propose a method for estimating the effect of one inlining step (see Section 4). We show that the lookahead-based strategy naturally generalizes the special-purpose greedy strategies used in previous inliners. For instance, we demonstrate that, if Scheifler's inlining policy [Sch77] is used, the result is equivalent to Scheifler's inliner; if our less-restrictive policy is used, we obtain a new "hybrid inliner". We present experimental results (in Section 5) which show that our hybrid inliner performs uniformly better when compared with Scheifler's inliner, and with the inliners described in [Sta93] and [CMCH92].

2 Preliminaries

We assume a C-like language with block structure, and without nested procedure declarations. A program P is composed of a collection of n procedures P_1, P_2, ..., P_n. The locations in P_i from which calls are made to statically known procedures are called the call sites of P_i. For convenience, all call sites in a program are uniquely numbered¹. There may be other locations where calls are made to procedures that are known only at run-time (as in C, via function pointers), but this article does not consider these locations to be call sites. A call graph provides an abstraction of this program information.

¹Call sites are not "renumbered" when other sites are eliminated or new sites are added.

Call Graph: A call graph for a program P is a labelled, directed multigraph with a node for each procedure P_i. There is an arc (P_i, P_j) labelled a in the call graph iff a is a call site in P_i (the caller) at which P_j (the callee) is invoked. The notation caller(a) and callee(a) is used to indicate the caller and callee of arc a. (We may also refer to a as the a-th call site in the program.) Following [CMCH92], we add an extra node, SYSTEM, to the call graph. It reflects the interaction of the program with its environment, especially the operating system. The call graph contains an arc from SYSTEM to every other node, since the operating system can potentially invoke any procedure. Since we need not consider arcs from other procedures to SYSTEM, we omit them from our call graphs.

We say that a program is recursive if its call graph contains a circuit; otherwise, it is said to be nonrecursive. Though intuitive, this differs slightly from the normal definitions for these terms: a program with a call-graph circuit may never actually make a recursive call, and our acyclic call graphs do not guarantee that recursion is not created, for instance, by function pointers.

An inlining operation replaces a single call site, for instance a call from P_i to P_j, with the code of P_j, nesting this new code within P_i as a new block. Minor adjustments to the code may be required to preserve the semantics of P_i; for instance, parameter passing must be simulated. Note that only one replacement is done, even if there are other call sites invoking P_j. Moreover, any semantically equivalent version of P_j's code may be used. Also note that the operation will create new call sites iff the chosen version of P_j contains call sites. After the operation, P_i has another, semantically equivalent, version (the current version of P_i) available for future inlining operations. After all inlining operations, we obtain a collection of versions for each procedure. The final program is assumed to include the most recently derived version of each procedure.
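As a purely illustrative sketch (not taken from the report's implementation), such a call graph might be represented in C as follows; all type and field names here are our own inventions.

    /* Illustrative call-graph representation; names are ours, not the
       report's.  Node 0 plays the role of SYSTEM, whose arcs to every
       procedure are left implicit, as in the text. */
    typedef struct {
        int id;       /* unique call-site number, never reused */
        int caller;   /* index of the procedure containing the site */
        int callee;   /* index of the statically known procedure invoked */
    } Arc;

    typedef struct {
        int nprocs;   /* procedures are numbered 1..nprocs; 0 is SYSTEM */
        int narcs;    /* parallel arcs are allowed: this is a multigraph */
        Arc *arcs;
    } CallGraph;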

Code Size: Consider inlining a call to P_j from P_i to create a new version of P_i, say, P_i'. The size of P_i' is often less than size(P_i) + size(P_j), due to the elimination of the space overhead in parameter passing, and the increased scope for optimization. Moreover, if this were the only call to P_j in the program, and P_j cannot be called directly from SYSTEM (e.g., static functions in C), we could omit P_j from the program. Hence, the code size of an inlined program may actually be smaller than that of the original program. For simplicity, though, we conservatively assume the size of the new version is the sum of its parts' sizes, and that no procedure is omitted from the program after inlining.

2.1 Flexibility and Power

Recall that an inlining policy specifies the conditions under which calls may be inlined, and hence restricts the set of programs that can be derived by inlining. Examples of inlining policies are: "No directly recursive calls may be inlined", and "Only the current version of the callee's code may be used when inlining". Let I_1 and I_2 be two inlining policies and P_{I_1} and P_{I_2} be the sets of final programs that can be formed when inlining is constrained to follow I_1 or I_2, respectively. The notion of relative flexibility of two inlining policies is defined as follows.

Definition 1 (Flexibility) The policy I_1 is at least as flexible as I_2 iff P_{I_2} ⊆ P_{I_1}. If P_{I_2} ⊊ P_{I_1} then we say I_1 is more flexible than I_2.

Although flexibility is a useful metric to compare two policies, it is possible that the additional programs that one policy can derive are inferior to the best program that another policy can derive, under some metric. To capture this, we define the notion of relative power of two policies as follows.

Definition 2 (Power) Given a metric μ that measures the merit of a program, a policy I_1 is more powerful than policy I_2 under μ iff I_1 is more flexible than I_2, and

$$\max_{p \in P_{I_1}} \mu(p) > \max_{p \in P_{I_2}} \mu(p).$$

Note that flexibility is purely a structural notion, whereas a metric must be fixed to discuss power. For a simple example comparing policies, consider I_1 = "no directly recursive calls may be inlined" and I_2 = "only calls to leaf procedures may be inlined". Since by definition a leaf procedure contains no call site, and a directly recursive call implies a nonleaf procedure, clearly I_1 is at least as flexible as I_2. It is also easy to see that I_1 is, in fact, more flexible than I_2.

3 Power of Inlining Original Versions

There is an implicit assumption in previous inlining work: that the current version of the callee's code is inlined. This is not always the best policy, as we next show. We consider the following three inlining policies.

cv: The current versions of the caller and callee are always used.
ov: The original version of the callee and the current version of the caller are always used.
av: Any version of the caller and callee may be used.

[Figure 1 (diagram not reproduced): the current versions of two procedures, p and q; p contains a former site marked a (already inlined) and call sites 1 (call p, marker c), 2 (call q), 3 (call r), and 4 (call q), while q contains call site 5 (call r). Also shown are the block-nesting trees for p and q, and the new tree created when site 2 is inlined.]

Figure 1: A block-nesting graph. Left: Current versions of two procedures of a program. Middle: Block-nesting diagram for these two procedure versions. The circled letters in the procedures represent markers for current or former call sites. The former site marked by a has been inlined, while current site 1 has marker c. Right: New tree added to the block nesting graph when call site 2 is inlined.

Clearly, av-inlining is the least restrictive policy. In this section, we show that ov-inlining is at least as flexible as av-inlining. We then show that ov-inlining is more powerful than cv-inlining under a reasonable metric.

3.1 Block Nesting Graph

In order to measure the flexibility of one policy with respect to another, we abstract the program structure by a block nesting graph. A block nesting graph abstracts the structure of each version of each procedure, showing how the major blocks in each version nest (see Figure 1). Each procedure initially has only a single major block, which includes all its code. Whenever a call site is inlined, all major blocks within the callee are duplicated within the caller.

The block nesting graph is a forest, with a tree T_i^v for each version v of each procedure P_i. The tree has two kinds of nodes, mb-nodes that represent major blocks, and c-nodes that represent call sites. The c-nodes may only be leaves, whereas the mb-nodes may be internal nodes or leaves. If a call site c (or another major block M') is nested within a major block M, then the mb-node for M has the node for c (or M', respectively) as a child. Each arc in the tree is labelled with a call site in the original program, and each c-node is in 1-1 correspondence with a call site in the version being represented and is labelled thus. For a tree T in a block nesting graph, we define E_in(T) to be the set of arcs entering mb-nodes. The initial forest contains trees with a single mb-node and as many c-nodes as call sites.

Suppose a new version v_i' of P_i is formed by taking version v_i of P_i and, at site s, inlining version v_j of P_j. This inlining step is reflected in the block nesting graph by adding a new tree T_i^{v_i'}. The new tree is constructed by replacing the c-node labelled s in a copy of T_i^{v_i} by a copy of T_j^{v_j} and relabelling all the c-nodes derived from T_j^{v_j}. The rightmost part of Figure 1 shows the new tree added to the block nesting graph as a result of inlining at site 2.

Except for small variations introduced by the "minor adjustments" performed when the inlined procedure's code is integrated, we make the following assumption: Consider program R to which two different sequences of inlining operations have been applied, leading to programs R_1 and R_2. For some procedure P_i in the program, let T^1 = T_i^{v_1} in R_1, and T^2 = T_i^{v_2} in R_2. If T^1 = T^2, then version v_1 in R_1 and version v_2 in R_2 contain the same code.
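The tree manipulation just described is mechanical enough to sketch in code. The following C fragment is our own illustration, with invented names; it assumes the callee tree root has no siblings, and leaves the relabelling of copied c-nodes to the caller.

    /* Illustrative sketch of a block-nesting tree and the splice
       performed when call site s is inlined; names are ours. */
    #include <stdlib.h>

    typedef enum { MB_NODE, C_NODE } Kind;

    typedef struct Node {
        Kind kind;
        int label;              /* label of the arc entering this node */
        int site;               /* for C_NODE: current call-site number */
        struct Node *child, *sibling;
    } Node;

    static Node *copy_tree(const Node *t) {
        if (!t) return NULL;
        Node *n = malloc(sizeof *n);
        *n = *t;
        n->child = copy_tree(t->child);
        n->sibling = copy_tree(t->sibling);
        return n;
    }

    /* Replace the c-node whose current site number is s by a copy of
       the callee's tree.  c-nodes are leaves, so freeing the matched
       node loses no children. */
    static Node *inline_site(Node *t, int s, const Node *callee_tree) {
        if (!t) return NULL;
        if (t->kind == C_NODE && t->site == s) {
            Node *repl = copy_tree(callee_tree);
            repl->sibling = t->sibling;  /* keep place among siblings */
            repl->label = t->label;      /* entering arc keeps label s */
            free(t);
            return repl;
        }
        t->child = inline_site(t->child, s, callee_tree);
        t->sibling = inline_site(t->sibling, s, callee_tree);
        return t;
    }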

3.2 Power of ov-inlining

Using the abstraction of block nesting graphs, we can establish the following theorem:

Theorem 1 The ov-inlining policy is at least as flexible as the av-inlining policy.

Proof Sketch: Consider any final program P obtained using av-inlining. We show how the same program can be obtained using ov-inlining. First, consider the subset of the block nesting diagram for P that contains only one tree per procedure: the tree for the version of the procedure that appears in P. Observe that, if we can construct a program Q whose block nesting diagram matches that of P, then Q and P are the same (structurally, as well as semantically). (We require the trees be isomorphic and arc labels, but not node labels, must match.) However, any required tree is easily obtained by "growing" it top down: wherever the desired tree has an mb-node and the current tree has a c-node, perform an ov-inlining operation on the call site corresponding to the c-node. This can be repeated until the desired tree obtains². For a precise proof, see [Kas93]. □

²In our experiments, we have used this approach to simulate cv-inlining with ov-inlining. Each cv-inlining operation is replaced by the sequence of ov-inlining operations that makes the same additions to the block nesting graph.

Next, we consider a figure of merit μ for inlined programs, which rewards the reduction of function calls executed at runtime, while requiring the program's size not be excessive. Let f(P) denote the number of function calls performed when a program P is executed with a prespecified input. Let σ(P) denote the size of P, let S be the code-size bound, and let P_orig be the original program whose inlined form is P.

$$\mu(P) = \begin{cases} -1 & \text{if } \sigma(P) > S \\ f(P_{orig}) - f(P) & \text{otherwise} \end{cases}$$

Theorem 2 Under μ, ov-inlining is more powerful than cv-inlining.

To prove the theorem, consider the program in Figure 2. Clearly, the call site within f should be inlined. Inlining the original version twice leaves us with a program where f(x) invokes f(x-3), with f now thrice its original size. If this is the maximum code growth permitted, cv-inlining is only able to inline once at the recursive call site. (The next inlining would leave f at quadruple its original size.) Thus, the best cv-inlined program would perform about 50% more calls than the ov-inlined program.

    proc f(x)
    begin
      if x = 1 then
        return 0
      else
        y = f(x-1)
        return 1 - y
      endif
    end

    proc main
    begin
      z = f(300)
    end

Figure 2: Program used to demonstrate that ov-inlining is more powerful than cv-inlining.
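For concreteness, here is the same program transcribed into C, together with a hand-written rendering (our own, not compiler output) of f after the recursive site has been ov-inlined twice; the residual call proceeds in steps of three, as claimed above.

    #include <stdio.h>

    /* Original version of f, as in Figure 2. */
    int f(int x) {
        if (x == 1) return 0;
        return 1 - f(x - 1);
    }

    /* f after ov-inlining the recursive call site twice: each step
       substitutes the ORIGINAL body, so the result invokes f3(x-3)
       and is roughly three times the original size.  Base-case tests
       guard each inlined copy. */
    int f3(int x) {
        if (x == 1) return 0;
        /* first inlined copy: original body of f, at x-1 */
        if (x - 1 == 1) return 1 - 0;
        /* second inlined copy: original body of f, at x-2 */
        if (x - 2 == 1) return 1 - (1 - 0);
        return 1 - (1 - (1 - f3(x - 3)));
    }

    int main(void) {
        printf("%d %d\n", f(300), f3(300));  /* both print the same value */
        return 0;
    }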

Note: In [Kas93], we find an example where

$$\frac{\max_{p \in P_{ov}} \mu(p)}{\max_{p \in P_{cv}} \mu(p)} \geq \frac{N}{\log N}.$$

3.3 Importance of Recursion

It is no coincidence that the program in Figure 2 was recursive. For non-recursive programs, we have the following result:

Theorem 3 For a non-recursive program, cv-inlining is at least as flexible as av-inlining.

Proof Sketch: Consider some program P, from which av-inlining has obtained P'. Since the call graph of P is acyclic, its procedures can be ordered topologically; thus, a procedure precedes all procedures it calls (directly or transitively). We can obtain the block-nesting diagram for P' (and thus P' itself) by working exclusively on one tree at a time. First, we inline call sites within the topologically first procedure, until the final form of the tree (procedure) is achieved. This is possible because the only current version of each callee is the original version. Then we proceed similarly with the topologically second procedure, and so forth. For every inlining operation, the callee has only one version, since it appears topologically after the caller. □

4 A Multi-version Inlining Technique

Having shown theoretical advantages to the policy of ov-inlining, we now show how to exploit the flexibility in an inliner. The goal of the inliner is to obtain the maximum reduction in function calls when the program is run with a fixed input, without exceeding a code-size bound. This goal has previously been used in [Sch77, CMCH92, BT86, Bal82]. Scheifler has shown that the problem of choosing a sequence of inlining steps to attain the above goal is NP-hard³. We examine suitable greedy heuristics to solve this problem in Sections 4.2 and 4.3. At every step, we will find the call site whose inlining yields the largest reduction in the number of calls.

An obvious problem arises at this point: how are we to know the effect of an inlining operation? It is relatively easy to gather profile data to obtain the count of the number of calls performed at each call site for a given input in the original program. However, except when inlining a leaf procedure, new call sites will be created. To compute accurately the number of calls that will be performed at these sites, we would need the complete call history upon reaching every call site in the original program. This is prohibitively expensive, as discussed in [CMCH92, Sch77]. Instead of attempting to predict the effect of an inlining step exactly, we estimate its effect using profile data with a one-level history: the number of times each procedure was originally entered, and the number of times each call site was taken. The fundamental assumption behind the estimation technique is that the behavior of a procedure can be approximated well by an "average behavior" that is independent of where the procedure is called from. This corresponds to the constant ratios assumption made in [Sch77]. The use of the assumption is not new, and has been validated in [Sch77, Kas93]. The estimation technique, however, is novel and is based on a probabilistic model⁴ described below.

³It should be noted that although the hardness result was established under cv-inlining, the result is general, since it was established for the case of nonrecursive programs.
⁴Our model resembles the stochastic program model described for flowcharts in [Deo74, pp. 439-448].

4.1 Probabilistic Model

In this model we associate, with each call site a, an expected execution frequency denoted by ν_a, which corresponds to the average number of times a is reached per invocation⁵ of caller(a). We also maintain the expected call frequency m_ij for every pair of procedures P_i and P_j, which is the expected number of times P_j directly invokes P_i (from statically known call sites) for each invocation of P_j. Clearly, m_ij = Σ_a ν_a, summed over sites a in the program with callee(a) = P_i and caller(a) = P_j. For an n-procedure program, we arrange the expected call frequencies as a matrix M, the direct call matrix of the program. Each column of M is a vector that describes the calling behaviour of one invocation of (the current version of) a particular procedure.

The total number of times procedure P_i is entered, denoted by λ_i, is divided into direct entries, from known call sites, and indirect entries, from SYSTEM. The latter do not vary with inlining, unlike the former. The number of indirect entries to P_i is denoted by s_i. Let λ = (λ_1, λ_2, ..., λ_n)^T and s = (s_1, s_2, ..., s_n)^T. From the definitions of M, λ, and s, we have λ = M·λ + s and hence λ = (I - M)^{-1}·s.

⁵In programs free of loop constructs, the expected execution frequency of site a coincides with the conditional probability of reaching a on any invocation of caller(a).

[Figure 3 (diagram not reproduced): SYSTEM invokes P1 once and P2 twice (dotted arcs); P1 has a recursive call site with ν = 0.75 and calls P2 with ν = 1.5; P2 calls P1 from two sites, with ν = 0.065779 and ν = 0.10, and calls P3 with ν = 2.]

Figure 3: Call graph for a program with three procedures. Dotted arcs represent indirect calls from the system, made the indicated number of times. Solid arcs represent direct calls with the shown ν values.

Refer to Figure 3 for an example, where

$$\lambda = \begin{pmatrix} 0.75 & 0.165779 & 0 \\ 1.5 & 0 & 0 \\ 0 & 2 & 0 \end{pmatrix} \lambda + \begin{pmatrix} 1 \\ 2 \\ 0 \end{pmatrix},$$

and its solution is λ = (1000, 1502, 3004)^T. (The two parallel arcs from P2 to P1 together contribute m_12 = 0.065779 + 0.10 = 0.165779.)

It is easy to see that λ = (I + M + M² + ···)·s. Clearly, the infinite sum in the above equation converges whenever the program execution terminates. We say that the matrix U = (I - M)^{-1} = Σ_{k=0}^{∞} M^k is the indirect call matrix of the program: the (i,j)-th entry u_ij represents the expected number of times P_i will be invoked (directly or indirectly) for every invocation of P_j from SYSTEM.
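As a numerical sanity check (our own illustration, not part of the report), the following self-contained C program solves (I - M)λ = s for the Figure 3 data by Gaussian elimination and reproduces λ ≈ (1000, 1502, 3004)^T.

    /* Solve (I - M) * lambda = s for the Figure 3 example.
       A minimal sketch; any linear solver would do. */
    #include <stdio.h>

    #define N 3

    int main(void) {
        /* m[i][j]: expected calls to P_{i+1} per invocation of P_{j+1} */
        double m[N][N] = { {0.75, 0.165779, 0.0},
                           {1.5,  0.0,      0.0},
                           {0.0,  2.0,      0.0} };
        double s[N] = { 1.0, 2.0, 0.0 };   /* calls from SYSTEM */
        double a[N][N], lam[N];

        /* Form A = I - M. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = (i == j) - m[i][j];

        /* Forward elimination (no pivoting; fine for this matrix). */
        for (int k = 0; k < N; k++)
            for (int i = k + 1; i < N; i++) {
                double factor = a[i][k] / a[k][k];
                for (int j = k; j < N; j++) a[i][j] -= factor * a[k][j];
                s[i] -= factor * s[k];
            }

        /* Back substitution. */
        for (int i = N - 1; i >= 0; i--) {
            lam[i] = s[i];
            for (int j = i + 1; j < N; j++) lam[i] -= a[i][j] * lam[j];
            lam[i] /= a[i][i];
        }

        /* Prints approximately 1000 1502 3004. */
        printf("%.0f %.0f %.0f\n", lam[0], lam[1], lam[2]);
        return 0;
    }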

4.2 Estimating the Effect of an Inlining Operation

When a copy of a call site is created by inlining, the model assumes that the new call site maintains the expected execution frequency with respect to the new (outermost) nested block that is introduced into the block nesting diagram. More formally, consider inlining at the call site a such that caller(a) = P_i and callee(a) = P_j. For each call site b in P_j, a new call site c is now introduced in the new version of P_i, with ν_c = ν_a · ν_b. For instance, in Figure 1, if all call sites a in the original program had ν_a = 0.5, we now have ν_4 = ν_5 = ν_9 = 0.5; ν_1 = ν_6 = ν_2 = ν_3 = ν_8 = 0.25; and ν_7 = 0.125. Call sites 1-4 are not part of the program, since they are not within the current version of procedure p.

The benefit due to an inlining operation is the number of calls it saves. At any point while inlining, there may be several potential inlining operations possible, and one must be selected as the next operation. Using a greedy strategy, at each step we select the inlining operation (the arc to be inlined, and the version to be used) that results in the largest number of calls saved per unit increase in code space. For the inlining process to be fast, efficient selection of the best inlining operation is crucial. Therefore, we next derive an efficient procedure to determine the best operation and its effect, under both ov-inlining and cv-inlining policies.

Consider inlining an arc a = (P_i, P_j). The effect on M is that the i-th column of M changes to reflect the addition of any new call sites and the removal of the old call site. Specifically, under the cv-inlining policy, we have

$$M_{new} = M + \nu_a M \cdot 1_{j,i} - \nu_a 1_{j,i}$$

where 1_{j,i} is a matrix which is 1 at the (j,i)-th element and zero elsewhere. Under the ov-inlining policy, we have:

$$M_{new} = M + \nu_a M_{orig} \cdot 1_{j,i} - \nu_a 1_{j,i} \qquad (1)$$

where M_orig denotes the direct call matrix of the original program (before any inlining). From (1) and the invariant U^{-1} = (I - M) we obtain

$$U_{new}^{-1} = U^{-1} + U_{orig}^{-1} \cdot \nu_a 1_{j,i}$$

It should be noted that since inlining operations do not affect the termination characteristics of programs, the matrix U_new = (I - M_new)^{-1} exists.

Next, let U_new = B · U. The matrix B essentially captures the change in U due to the inlining step. Note that B = U_new · U^{-1} and hence B is invertible. Since U_new^{-1} = U^{-1} · B^{-1} = U^{-1} + U_{orig}^{-1} · ν_a 1_{j,i}, we have

$$B^{-1} = I + U \cdot U_{orig}^{-1} \cdot \nu_a 1_{j,i}.$$

Now, let A = U · U_{orig}^{-1}. Hence

$$B = (I + A \cdot \nu_a 1_{j,i})^{-1}. \qquad (2)$$

We will now use the identity that

$$(I + cD \cdot 1_{k,l})^{-1} = I - \left(\frac{c}{1 + c\,d_{lk}}\right) D \cdot 1_{k,l}$$

to get

$$B = I - \left(\frac{\nu_a}{1 + \nu_a a_{ij}}\right) A \cdot 1_{j,i}$$

and

$$I - B = \left(\frac{\nu_a}{1 + \nu_a a_{ij}}\right) A \cdot 1_{j,i}.$$

Note that (1 + ν_a a_ij) cannot be zero; otherwise the entire i-th row of B would have been zero and hence B would not have been invertible.

Let |λ| = Σ_i λ_i. The benefit of an inlining step is simply |λ| - |λ_new|. Let e be a row-vector of length n such that e_i = 1 for all i. Then, using the definitions of U and U_new, we have

$$|\lambda| - |\lambda_{new}| = e\lambda - e\lambda_{new} = eUs - eU_{new}s = e(U - U_{new})s = e(U - BU)s = e(I - B)Us = e(I - B)\lambda$$

Now, from the definition of (I - B) and e, e·(I - B) is a row-vector f of length n such that

$$f_l = \begin{cases} \dfrac{\nu_a \sum_k a_{kj}}{1 + \nu_a a_{ij}} & \text{if } l = i \\ 0 & \text{otherwise} \end{cases}$$

Continuing, the number of calls saved is

$$|\lambda| - |\lambda_{new}| = e(I - B)\lambda = f_i \lambda_i = \left(\frac{\nu_a \sum_k a_{kj}}{1 + \nu_a a_{ij}}\right)\lambda_i$$

Note that the matrix A = U · U_{orig}^{-1} need be computed only once per inlining step. At that time, we can also precompute Σ_k a_kj for each value of j. Thus, at every inlining step, the number of calls saved by inlining each arc can be computed in time proportional to the number of arcs. Moreover, we need compute only A at each step, and A_new can be computed directly from B and A. In fact, using (2) and the definitions of U and U_new, A_new can be computed in O(n²) time, since its (k,k')-th entry is

$$a_{kk'} - \frac{\nu_a\, a_{kj}\, a_{ik'}}{1 + \nu_a a_{ij}}.$$

Thus the choice of arc to be inlined can be made in time O(no. of arcs) + O(n²).

Specialization to cv-inlining: Note that for cv-inlining, the above procedure still holds with A = U · U^{-1} = I. So Σ_k a_kj = 1. If we are inlining a non-recursive arc (i.e., i ≠ j) then a_ij = 0 and hence the number of calls saved is ν_a λ_i. If the arc inlined is recursive (i = j), then a_ij = 1 and hence the number of calls saved is ν_a λ_i / (1 + ν_a). Thus we get Scheifler's equations.
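To make the bookkeeping concrete, here is a small self-contained C illustration (our own, with invented names) of the closed-form benefit computation and the O(n²) update of A; with A = I, as holds before any inlining, it reproduces the cv-inlining (Scheifler) special case.

    /* Illustration of the Section 4.2 formulas; names are ours.
       calls_saved() evaluates
           nu_a * (sum_k a_kj) * lambda_i / (1 + nu_a * a_ij)
       for an arc from caller P_i to callee P_j with frequency nu_a. */
    #include <stdio.h>

    #define N 3   /* number of procedures, for illustration */

    double calls_saved(double a[N][N], const double lambda[N],
                       int i, int j, double nu_a) {
        double colsum = 0.0;                 /* sum_k a_kj */
        for (int k = 0; k < N; k++) colsum += a[k][j];
        return nu_a * colsum * lambda[i] / (1.0 + nu_a * a[i][j]);
    }

    /* O(n^2) rank-one update after inlining arc (i, j):
       A_new[k][l] = A[k][l] - nu_a * A[k][j] * A[i][l] / (1 + nu_a * A[i][j]).
       Column j and row i are snapshotted first, since the update
       would otherwise alias them. */
    void update_A(double a[N][N], int i, int j, double nu_a) {
        double c = nu_a / (1.0 + nu_a * a[i][j]);
        double colj[N], rowi[N];
        for (int k = 0; k < N; k++) { colj[k] = a[k][j]; rowi[k] = a[i][k]; }
        for (int k = 0; k < N; k++)
            for (int l = 0; l < N; l++)
                a[k][l] -= c * colj[k] * rowi[l];
    }

    int main(void) {
        /* Before any inlining A = I, so calls_saved() collapses to
           Scheifler's equations: nu_a * lambda_i for a non-recursive
           arc, nu_a * lambda_i / (1 + nu_a) for a recursive one. */
        double a[N][N] = { {1,0,0}, {0,1,0}, {0,0,1} };
        double lambda[N] = { 1000.0, 1502.0, 3004.0 };  /* from Figure 3 */

        printf("%.1f\n", calls_saved(a, lambda, 0, 1, 1.5));   /* 1500.0 */
        printf("%.1f\n", calls_saved(a, lambda, 0, 0, 0.75));  /* 428.6  */
        update_A(a, 0, 1, 1.5);     /* after an ov-step on arc P1 -> P2 */
        printf("%.3f\n", a[1][0]);  /* entry a_{2,1} becomes -1.500 */
        return 0;
    }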

4.3 A Hybrid Inlining Strategy

It is straightforward to obtain a greedy strategy that considers only ov-inlining steps, based on the above procedure to estimate the effect of an inlining operation. There is an additional difficulty, however. A cv-inlining operation always leads to fewer calls at runtime. This does not hold for ov-inlining, where an ov-inlining operation can replace a call to a highly optimized procedure by its unoptimized code. In many such cases, an ov-inlining operation will actually lead to more calls at runtime, though further ov-inlining would lead to a superior result. These small, local moves in the state space can result in the greedy strategy becoming trapped at a local maximum. In such cases [Kri84] it helps to add some look-ahead capability, so the strategy can greedily choose a small sequence of operations.

We can exploit the fact that cv-inlining always leads to fewer calls (in reality, and in predictions). The greedy strategy now considers both ov- and cv-inlining at every call site. Since a cv-inlining operation can be duplicated by a sequence of ov-inlining operations, our solution gives the greedy strategy some lookahead along certain chosen paths through the state space, and these non-local choices provide escapes from local maxima. We refer to this form of inlining as hybrid inlining.

Discussion: The resulting inlining technique specializes to that in [Bal82, Sch77] if we adopt a cv policy and a greedy strategy without additional lookahead. Since these inliners have shown good performance, and our technique relies on the same probabilistic model but has fewer restrictions, we also expect good performance.
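The overall control loop can be sketched as follows. This toy C skeleton is entirely our own, with made-up candidate numbers; it only illustrates the shape of the hybrid strategy: both a cv-step and an ov-step are scored for each arc, and the candidate with the best predicted calls saved per unit of code growth wins, subject to the size budget.

    /* Toy skeleton of the hybrid greedy loop.  Benefits would come
       from the Section 4.2 estimator; here they are illustrative
       constants. */
    #include <stdio.h>

    typedef struct {
        const char *site;   /* which arc to inline */
        int use_original;   /* 1 = ov-step, 0 = cv-step */
        double saved;       /* predicted calls saved */
        double growth;      /* code-size increase of this step */
    } Candidate;

    int main(void) {
        Candidate cand[] = {            /* illustrative numbers */
            { "f->f",    1, 100.0, 40.0 },
            { "f->f",    0,  80.0, 60.0 },
            { "main->f", 0,   1.0, 40.0 },
        };
        int n = sizeof cand / sizeof cand[0];
        double budget = 100.0;

        /* Repeatedly take the candidate with the best saved/growth
           ratio that still fits; a real inliner would re-estimate all
           candidates after every step, since M, U, and the arc set
           all change. */
        for (;;) {
            int best = -1;
            for (int i = 0; i < n; i++)
                if (cand[i].growth <= budget &&
                    (best < 0 || cand[i].saved * cand[best].growth >
                                 cand[best].saved * cand[i].growth))
                    best = i;
            if (best < 0) break;
            printf("inline %s (%s), save %.0f calls\n", cand[best].site,
                   cand[best].use_original ? "ov" : "cv", cand[best].saved);
            budget -= cand[best].growth;
            cand[best].growth = budget + 1;  /* crude: never pick twice */
        }
        return 0;
    }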

5 Empirical Evaluation

In this section, we use experiments to compare the performance of different inlining techniques. We compare our hybrid inlining technique against the ov-inlining technique (without lookahead), against the cv-inlining technique of Ball and Scheifler, and against restricted techniques that resemble those used in [CMCH92] and GCC.

Ultimately, the success of a particular inliner is likely to be judged by the speedup it gives to programs running on some particular architecture. However, our goal now is to confirm that additional flexibility (for instance, that afforded by ov-inlining) translates to more success in inlining. Approaches based on Scheifler's method are guided by an attempt to minimize the number of function calls performed at runtime. In our experimental comparison, then, it is most appropriate to measure the predicted percentage reduction in function calls, with a fixed code-size expansion. The relationship between execution times and the number of calls performed has been examined in [DH92], and the accuracy of the predictions has been examined in [Sch77, Kas93]. In [Kas93] we also examine the greedy strategy with cv-, ov-, and hybrid inlining, using exact call information instead of relying on predictions made using the probabilistic model.

5.1 Our Inliner

To gather data, an inliner was designed for the C language. The GNU C compiler [Sta93] was modified to emit a call graph for each separately compiled module and to generate code to permit suitable profiling. Our inliner then profiled the program, built a program-wide call graph, and emitted information on the desired inlining operations and the predicted number of calls executed. In principle, the emitted inlining commands could have been read and acted upon; however, GCC has a very restrictive inlining policy, and our measurements must not be tainted by the arbitrary restrictions of the compiler chosen for modification. Thus, inter-module inlining was permitted, as in [CMCH92].

The chosen programs included Bison (version 1.24), processing the grammar for GCC 2.7.2; Flex (version 2.5.2), creating the lexical analyzer for equals [KRRS97]; Cccp, the macro preprocessor for GCC 2.5.8, processing cccp.c; and a variety of C programs (discussed below) automatically produced by the equals compiler. We note that the increased use of C as a "portable assembly language" by compilers for declarative languages [Emb94] means that previous assumptions about C programs and their inlining needs may no longer be valid. For instance, the observation that handwritten C programs do not typically perform many recursive calls may tempt a compiler writer to forbid inlining at recursive call sites. However, the C programs produced, for instance, by the equals compiler are highly recursive. Thus, our measurements include both machine-generated as well as handwritten C programs.

The equals programs included several toy programs (Euler, MatMult, MergeSort, Nfib, NumInt, Queens, QuickSort, Sieve, Strassen, SumFrom, and Tak) that have been described before in [Kas93]. Several slightly larger programs were also included: FFT, ODProv, PCProv, Pascal, Event, and Ida. The last two are from Hartel's benchmark suite [HL93], while ODProv [O'D85] and PCProv are toy theorem provers. Pascal is an interpreter for a subset of Pascal. All of the larger programs except FFT make extensive use of lazy evaluation [Pey87]. This reduces the number of calls that procedures make directly to one another; instead, procedures invoke (and are invoked by) the runtime system frequently. Recall that we assume such calls cannot be inlined by any technique. To obtain meaningful results, then, we show the predicted percentage of inlinable calls remaining. The more successful a technique, the lower its numbers. See Tables 1 and 2, where the smaller benchmarks were permitted to double or triple in size, respectively, and the larger benchmarks were permitted a 5% and a 20% expansion, respectively.

Note that GCC's inliner does not take profile information into account; rather, it either relies on programmer declarations, or inlines suitably small procedures within a module. We have followed the latter approach, and it is clear that profile information would have been helpful. The inliner that was based on the technique discussed in [CMCH92] did not perform particularly well, especially on the recursive programs. This was anticipated, as the strategy does not inline recursive calls, which predominate in these benchmarks. An extension to their basic inliner, which preprocesses directly recursive calls, was sketched but apparently not used in [CMCH92]. Since it is vague on how to deal with procedures containing several direct calls, such as our divide-and-conquer benchmarks, we did not implement a preprocessing step.

Examining the differences between the ov-, cv-, and hybrid inliners, it is clear that, while the ov policy is more flexible than cv, trapping at local maxima is a serious problem unless lookahead is used. Further, we see that hybrid inlining can be an improvement on both cv- and ov-inlining; for instance, in Table 2 there are four such examples. In these cases, a mixture of cv- and ov-steps has been used to achieve a result that would not have been found by either technique alone. On the other hand, we see in Table 1 five benchmarks on which the greedy strategy did not benefit from the additional flexibility; rather, it led to choices that, ultimately, were counterproductive. Additional lookahead would thus be useful, to further improve the hybrid inliner.


Benchmark     CV     OV   Hybrid  IMPACT    GCC
Euler        7.9   25.4     8.7    61.6   61.7
MatMult     10.0   25.2     9.9    99.5   75.4
MergeSort   39.4   36.1    36.4    83.2   88.1
Nfib        44.4   46.2    46.2   100.0  100.0
NumInt      33.3   28.5    28.5    50.0   50.0
Queens       8.4   24.1     8.4    24.1  100.0
QuickSort   24.3   32.1    24.3    68.0   53.4
Sieve       13.2   12.5    12.5    99.8  100.0
Strassen    74.4   61.8    61.8    69.6  100.0
SumFrom     44.4   46.6    46.6   100.0  100.0
Tak         42.2   39.1    39.1   100.0  100.0

Bison       42.9   44.2    42.9    68.3   82.3
Cccp        49.8   52.9    49.8    60.1   99.7
Event       23.7   23.7    23.7    54.8  100.0
FFT         35.6   30.8    30.8    98.3  100.0
Flex        40.7   41.0    40.7    60.9   99.0
Ida          3.1   23.1     3.1    26.6  100.0
ODProv      61.8   60.2    60.2    71.8  100.0
Pascal       5.5    9.2     9.2    11.0   94.5
PCProv      67.3   63.3    63.3    87.0  100.0

Geom. Mean  25.0   32.6    24.9     ??     ??
Wins          11      9      15      0      0

Table 1: Inlinable calls remaining (per cent) after a moderate code-size increase, for several inlining techniques. These include the standard (current version) inliner due to Scheifler (CV); the ov policy with a greedy strategy without lookahead (OV); the hybrid technique, corresponding to an ov policy with a greedy strategy and cv lookahead (Hybrid); a technique similar to that used in IMPACT [CMCH92]; and a technique similar to that used by GCC. The best result in each row is the smallest entry. Top: smaller benchmarks with a 100% expansion. Bottom: larger benchmarks with a 5% expansion.


Benchmark     CV     OV   Hybrid  IMPACT    GCC
Euler        5.4   24.4     4.5    61.6   61.7
MatMult      5.3   25.1     5.2    99.5   75.4
MergeSort   31.4   29.8    28.2    83.2   88.8
Nfib        44.4   34.6    34.6   100.0  100.0
NumInt      22.2   25.7    21.3    50.0   50.0
Queens       3.4   24.1     3.4    24.1  100.0
QuickSort   16.3   22.5    15.9    68.0   64.1
Sieve        7.0    6.8     6.8    99.8  100.0
Strassen    65.4   44.5    44.5    60.9  100.0
SumFrom     44.4   35.0    35.0   100.0  100.0
Tak         37.9   31.6    31.6   100.0  100.0

Bison       12.9   24.8    12.9    41.6   82.3
Cccp        27.7   35.5    27.7    35.2   98.4
Event        0.1    0.1     0.1    19.1   89.8
FFT         16.0   15.2    15.2    96.5  100.0
Flex        16.0   23.3    16.0    41.8   98.6
Ida          1.3   23.0     1.3    26.0  100.0
ODProv      39.8   37.6    37.6    54.9  100.0
Pascal       0.0    0.0     0.0    11.0   94.5
PCProv      43.3   36.4    36.4    85.4  100.0

Wins           7     10      20      0      0

Table 2: Inlinable calls remaining (per cent) after a moderate code-size increase. The best result in each row is the smallest entry. Top: smaller benchmarks with a 200% expansion. Bottom: larger benchmarks with a 20% expansion.


6 Discussion

References

[App92] Andrew W. Appel. Compiling with Continuations. Cambridge University Press, Cambridge, UK, 1992.

[Bal82] John Eugene Ball. Program Improvement by the Selective Integration of Procedure Calls. PhD thesis, University of Rochester, 1982.

[BT86] Henri E. Bal and Andrew S. Tanenbaum. Language- and machine-independent global optimization on intermediate code. Journal of Computer Languages, 11:105-121, 1986.

[CHT91] Keith D. Cooper, Mary W. Hall, and Linda Torczon. An experiment with inline substitution. Software: Practice and Experience, 21:581-601, 1991.

[CMCH92] Pohua P. Chang, Scott A. Mahlke, William Y. Chen, and Wen-mei W. Hwu. Profile-guided automatic inline expansion for C programs. Software: Practice and Experience, 22(5):349-369, 1992.

[Deo74] Narsingh Deo. Graph Theory with Applications to Engineering and Computer Science. Prentice Hall, 1974.

[DH92] Jack W. Davidson and Anne M. Holler. Subprogram inlining: A study of its effects on program execution time. IEEE Transactions on Software Engineering, 18:89-101, 1992.

[Emb94] Hugh Emberson. Compiling to C references. Posted to comp.lang.functional, November 1994. Bibliography summarizing a request for information.

[HL93] Pieter H. Hartel and Koen G. Langendoen. Benchmarking implementations of lazy functional languages. In ACM Conference on Functional Programming Languages and Computer Architecture (FPCA), pages 341-349, New York, June 1993. ACM.

[Kas93] Owen Kaser. Computational Aspects of Inlining. PhD thesis, SUNY at Stony Brook, May 1993.

[Kri84] Balakrishnan Krishnamurthy. An improved min-cut algorithm for partitioning VLSI networks. IEEE Transactions on Computers, C-33:438-446, May 1984.

[KRRS97] Owen Kaser, C. R. Ramakrishnan, I. V. Ramakrishnan, and R. C. Sekar. Equals: a fast parallel implementation of a lazy language. Journal of Functional Programming, 1997. Tentatively scheduled to appear March 1997.

[McF91] Scott McFarling. Procedure merging with instruction caches. In Proc. of the SIGPLAN '91 Conference on Programming Language Design and Implementation, pages 71-79, 1991.

[O'D85] Michael J. O'Donnell. Equational Logic as a Programming Language. Foundations of Computing. MIT Press, 1985.

[Pey87] Simon L. Peyton-Jones. The Implementation of Functional Programming Languages. Prentice-Hall, 1987.

[PS94] Simon L. Peyton-Jones and Andre Santos. Compilation by transformation in the Glasgow Haskell compiler. In Functional Programming, Glasgow '94, Workshops in Computing, pages 184-204. Springer-Verlag, September 1994.

[Sch77] Robert W. Scheifler. An analysis of inline substitution for a structured programming language. Communications of the ACM, 20(9):647-654, 1977.

[Sta93] Richard M. Stallman. Using and Porting GCC. Free Software Foundation, October 1993. For version 2.5.
