Bayesian Forecasting of Parts Demand

Phillip M. Yelland
Sun Microsystems Laboratories, 16 Network Circle, Menlo Park, CA 94025, U.S.A.

Abstract

As supply chains for high technology products increase in complexity, and as the performance expected of those supply chains also increases, forecasts of parts demand have become indispensable to effective operations management in these markets. Unfortunately, rapid technological change and an abundance of product configurations mean that demand for parts in high-tech is frequently volatile and hard to forecast. The paper describes a Bayesian statistical model developed to forecast parts demand for Sun Microsystems, Inc., a major vendor of network computer products. The model embodies a parametric description of the part life-cycle, allowing it to anticipate changes in demand over time. Furthermore, using hierarchical priors, the model is able to pool demand patterns for a collection of parts, producing calibrated forecasts for new parts with little or no demand history. The paper discusses the problem addressed by the model, the model itself and a procedure for calibrating it, and compares its forecast performance with that of alternatives.

Key words: Bayesian methods, demand forecasting, forecasting practice, state space models, supply chain

1 Introduction

1.1 Background

Manufacturing modern high technology products like computers is an exacting business: competition is intense, product life cycles are fleeting, components are frequently expensive and prone to rapid obsolescence, and supply chains often span the globe. As Lapide (2006) observes, demand forecasts have become increasingly central to supply chain management for participants in these markets. This paper focuses on one such market participant—Sun Microsystems Inc., a vendor of enterprise computing products—and on forecasts of the demand for the manufacturing parts used in its supply chain.

Email address: [email protected]


[Figure 1 appears here: nine panels plotting Units demanded against Period for sample parts, with fitted life-cycle curves annotated (α, δ) = (3.6, 5.98), (4.1, 33.8), (5.8, 18.8), (2.4, 12.9), (5, 21.6), (3.6, 13.0), (4.1, 14.4), (3.0, 14.1) and (4.4, 18.1).]

Figure 1. Sample part demands (solid lines): Time intervals on the horizontal axes of the graphs are financial planning periods, of roughly one month’s duration; Q2P1, for example, is the first planning period of the second quarter of a financial year. Unit shipments have been normalized pro rata to 20 trading days in each period. Dotted curves and values recorded for α and δ delineate part life cycles—they are discussed in detail in Section 2.3.


                    Minimum  1st Quartile   Median     Mean  3rd Quartile  Maximum
Total units             474         1,058    1,951    2,817         3,958    8,916
Length in periods     13.00         17.00    21.00    20.78         24.00    31.00
Mean units/period     23.79         62.15   110.66   136.94        210.08   405.28
Coef. of variation     0.49          0.68     0.87     0.85          0.97     1.27

Table 1. Summary statistics of full sample of 45 part demand series (selected members of which are illustrated in Figure 1).

As might be expected for a diversified IT vendor with approximately $14 billion in annual sales,[1] Sun's manufacturing operations consume a multitude of parts, ranging from semipopulated computer chassis and disk-drive assemblies to power cables and case fasteners. Since these components often have appreciable costs and lead times,[2] short-term forecasts of the quantities to be consumed in manufacturing are used to ensure appropriate levels of supply in advance (either by stocking inventories of parts, or by securing "just-in-time" delivery commitments from suppliers). Forecast accuracy can have a substantial bearing on the effectiveness of a supply chain, as de Kok, Janssen, van Doremalen, van Wachem, Clerkx, and Peters (2005) indicate. Surveys of commercial practice, including (Dalrymple 1987; Wisner and Stanley 1994; Sanders and Manrodt 1994; Klassen and Flores 2001; Sanders and Manrodt 2003), also attest to the importance of forecasting in supply chain operation.

Tens of thousands of different parts are used in Sun's product lines, and around 1,000 of these (the exact number fluctuates constantly) are sufficiently expensive and subject to sufficiently long lead times that they require forecast updates at least every month. Prior to the work described in this paper, the company produced forecasts for these latter parts using a heuristic calculation—detailed later in the paper—based on the product-level sales forecasts produced by its sales and marketing organizations (Yelland 2004). The study described in this paper was initiated with the aim of establishing whether it might be possible to improve upon these heuristic forecasts by using a statistical model.

Figure 1 displays the lifetime demands for a selection of the manufacturing parts at issue.[3] Here demand is represented by unit shipments in a planning period, which are of roughly one month's duration. For the purposes of supply chain planning, the company's fiscal quarters are divided into three such planning periods, of four, four and five weeks—Q2P1, for example, is the first planning period of the second quarter of a financial year. Unit shipments in the figure (and throughout this paper) have been normalized pro rata to 20 trading days in each period. The parts in the figure are taken from a larger sample of some 45 parts that was used to guide the development of the model described in this paper. Characteristics of the series in this full sample are summarized in Table 1. In selecting parts for study, we were careful to ensure that: (1) the per-unit cost of a part (which ranged into thousands of dollars for those in the study) was sufficient to justify the effort involved in developing the model and running it on a routine basis, and (2) unit shipments per planning period (as displayed in the Table) were generally high enough that issues such as demand intermittency (Boylan 2005) and discreteness (McCabe and Martin 2005; Yelland 2009) might reasonably be put aside.

These selection criteria notwithstanding, the Figure and the Table illustrate the challenge involved in producing forecasts for manufacturing parts at Sun. Demand for parts is normally much more variable than that of the products to which they belong, since part demand depends on customer configuration choices, supplier sourcing arrangements, technological changes, the availability of substitutes, and so on. (This variability is roughly quantified by the coefficients of variation in Table 1, which are calculated by taking the ratio of the standard deviation of demand in a series to the series mean.) In addition, technological development and changes in manufacturing and procurement generally mean that part life cycles are very short—appreciably shorter than those of the products themselves.

[1] $13.88 billion in the financial year ending June 2008.
[2] The lead time of a component is the time required to make or procure it.
[3] To preserve commercial confidentiality, the data have been mildly disguised.
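The coefficient of variation used in Table 1 is simply the ratio of a series' standard deviation to its mean; a minimal sketch (with made-up demand numbers, not Sun data):

```python
import statistics

def coef_of_variation(series):
    """Ratio of the (sample) standard deviation of a demand series to its
    mean, as reported in Table 1 as a rough measure of demand variability."""
    return statistics.stdev(series) / statistics.mean(series)

# Hypothetical normalized per-period demands for illustration only.
demand = [120, 80, 200, 150, 40, 310, 90, 60]
print(round(coef_of_variation(demand), 2))
```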

1.2 Prior Work

The literature exploring operations management in the face of uncertain demand is vast, and dates back to the 1950s and beyond (Arrow, Karlin, and Scarf 1958). By and large, however, researchers have concentrated on investigating policy responses to demand that is generated by a known stochastic process. For example, early works such as (Veinott 1965) assume that demand conforms to a known set of independent probability distributions, while more recent exercises assume a known member of a class of processes, such as ARIMA[4] (Gilbert 2005) or linear/Gaussian state-space (Aviv 2003). Specific prescriptions for using formal quantitative methods to forecast demand for supply chain management are far less numerous. (Harvey and Snyder 1990; Snyder, Koehler, and Ord 2002; Gardner 1990) are amongst the few examples describing generally applicable methods, while (Boylan 2005) and (Shenstone and Hyndman 2005) concentrate on specialized forecasting techniques applicable to items such as spare parts with intermittent demands.

Documented instances of statistical forecasting models actually employed in commercial supply chain management are rarer still: while commercially available software packages for supply chain management include forecasting techniques such as ARIMA, linear regression and spectral-decomposition-based smoothing (Yurkiewicz 2006), both Sanders and Manrodt (1994) and Lapide (2006) note the continued heavy reliance on judgmental and ad hoc methods in most companies. Our own experience at Sun attested to the difficulties involved in forecasting for operations using statistical models—we found that such approaches were frequently stymied by little or no historical data (which rendered techniques such as ARIMA and spectral smoothing inapplicable) or by a lack of suitable predictors (ruling out linear regression, for example).

The quantitative forecasting technique that perhaps most appeals to both academics and practitioners is one of the most venerable: exponential smoothing. For practical applications, exponential smoothing is technically straightforward to implement, and requires little or no historical data for calibration and no predictors or regressors. Armstrong and Green (2005) aver that "exponential smoothing is the most popular and cost effective of the statistical extrapolation methods", and Gardner (1990) documents the use of exponential smoothing in operations management. On the theoretical side, as summarized by Gardner (1985, 2006), the technique has enjoyed almost 40 years of continuous development since its invention by Brown (1959), and has recently engendered the development of a new class of structural time series models (Hyndman, Koehler, Ord, and Snyder 2008). Although exponential smoothing appeared a natural choice as a generic approach to the parts forecasting problem, experience with the product-level sales model documented in (Yelland 2004) suggested that a specially designed statistical model might well yield superior results, and also pointed to the effectiveness of Bayesian techniques for extrapolating short time series like part demand histories.

[4] ARIMA is an abbreviation of "Auto-Regressive Integrated Moving Average"—c.f. (Box, Jenkins, and Reinsel 1994).
Unfortunately, the model in (Yelland 2004) relies on informative Bayesian priors deduced from judgmental forecasts for product sales, and these priors are an unreliable guide to demand for parts in those products, for the reasons described in the previous section. On the other hand, the use of diffuse or noninformative priors is ruled out by the necessity of producing forecasts early in the life of a part, before sufficient observations accrue to produce proper forecast distributions. Therefore, though the model presented here is also Bayesian, it uses a hierarchical prior (Gelman and Hill 2006) to produce initial parameter estimates from sales records of established parts. This sort of "forecasting by analogy" echoes the work of Duncan, Gorr, and Szczypula (2001), who observe that Bayesian pooling also helps deal with time-series volatility. Frühwirth-Schnatter and Kaufmann (2008) use hierarchical Bayesian priors for time-series analysis, too, though their focus is on clustering rather than forecasting.

2 A Model for Parts Demand at Sun

This section describes the Bayesian model developed for the parts forecasting problem. For reference purposes, notation, quantities and probability distributions used throughout the paper are documented in Appendix A and Appendix B. A summary of the model is provided in Section 2.9. We begin with a few prefatory remarks.


2.1 Motivation

In the conventional forecasting situation examined in the literature, a suitably long time series of observed values—y = (y_1, …, y_T), say—is presented for extrapolation, and (restricting the discussion to a single-period forecast horizon for simplicity) the task of the forecaster is to predict the next value of the series, y_{T+1}. In the case of the part demand forecasting exercise examined here, however, short life cycles mean that individual parts frequently lack sufficient observed demand values to support reliable extrapolation. In this application, therefore, input to the forecasting process consists not only of previous demands for a particular part, but also of observed demands for other parts—even parts which were recently withdrawn from the manufacturing process. Thus the data take the form of a collection of series, y_1, …, y_N, of potentially differing lengths (some of which may be zero), so that for i = 1, …, N, y_i = (y_{i1}, …, y_{iT_i}). The aim is to forecast y_{k,T_k+1} for some chosen part k.

The objective of the model presented in this section is a statistical representation of such a collection of demand series using a set of random quantities. This latter set—which we denote in the abstract by the vector θ—contains not only model parameters in the conventional sense, but also the values of latent variables or processes.[5] The representation in terms of θ is sufficiently detailed that all the elements of the series are conditionally independent given θ, i.e.:[6]

p(y_1, …, y_N | θ) = ∏_{i=1}^{N} ∏_{t=1}^{T_i} p(y_it | θ).    (1)

A Bayesian forecast of y_{k,T_k+1} rests on its posterior predictive distribution. The latter is simply the conditional distribution of y_{k,T_k+1} given the historical demands, p(y_{k,T_k+1} | y_1, …, y_N).[7] The conditional independence property of the model expressed in equation (1) is pivotal to the derivation of this distribution, since on the assumption that y_{k,T_k+1} is also well-represented by θ, it should be conditionally independent of the historical demands, just as the historical demands were conditionally independent of each other:

p(y_{k,T_k+1}, y_1, …, y_N | θ) = p(y_{k,T_k+1} | θ) p(y_1, …, y_N | θ).    (2)

[5] Jackman (2000) highlights the fact that the Bayesian approach actually makes no formal distinction between such quantities.
[6] Since some of the T_i may be zero, we adopt the convention that for any expression •, ∏_{i=1}^{0} • = 1.
[7] For a definitive account of prediction in Bayesian statistics, see (Geisser 1993).


Now it is easy to show[8] that with the provision of a prior distribution for θ, p(θ), the posterior predictive distribution may be expressed as:

p(y_{k,T_k+1} | y_1, …, y_N) = ∫ p(y_{k,T_k+1} | θ) p(θ | y_1, …, y_N) dθ    (3)
                             ∝ ∫ p(y_{k,T_k+1} | θ) p(y_1, …, y_N | θ) p(θ) dθ.    (4)

Note that the second factor in the integral on the right hand side of equation (3) is the posterior distribution of θ given the observed data, y_1, …, y_N. As many treatments of Bayesian statistics illustrate (Bernardo and Smith 1994; Gelman et al. 2003, for example), provided that the observed data are sufficiently informative, even if p(θ) is diffuse or noninformative for elements of θ, the posterior distribution will be sharp enough to yield reasonably precise predictions for y_{k,T_k+1} in equation (3). This is an advantage in this application, where very little prior information was available in advance of the model's deployment.

We also note that given a mechanism—such as the Markov chain Monte Carlo (MCMC) simulator described in Section 3—for drawing samples from the posterior distribution of θ, the right hand side of equation (3) shows how one may sample from the posterior predictive distribution: draw a value θ̃ from the posterior distribution p(θ | y_1, …, y_N), and then draw one from the conditional distribution p(y_{k,T_k+1} | θ̃). The resulting sample may then be used to forecast y_{k,T_k+1}.

Finally, we should point out that in order to achieve reasonable fidelity to the data, we have adopted a so-called hierarchical or multilevel prior (Gelman 2006; Gelman and Hill 2006) in the model. This may be thought of abstractly as dividing θ into a number of collections of sub-vectors: a collection ζ = (ζ_1, …, ζ_N) of parameter vectors associated with parts, a collection φ = (φ_1, …, φ_J) of parameter vectors associated with the products to which the parts belong, and finally a single vector ϑ of common "population-level" parameters. The prior for θ as a whole is expressed by defining the priors for the part-level parameters in terms of the values of the product- and population-level parameters, and the product-level parameters in terms of the population-level parameters; the population-level parameters receive their own free-standing priors. This means that:

p(θ) = [∏_{i=1}^{N} p(ζ_i | φ, ϑ)] [∏_{j=1}^{J} p(φ_j | ϑ)] p(ϑ).    (5)

As a general rule, "pooling" model information by using common parameters at the population level makes for convenient expression, speedier estimation and more stable predictions; the product- and part-level parameters are vital, however, to capture the heterogeneity exhibited by the data (see Gelman and Hill 2006, for a general discussion).

[8] See (Gelman, Carlin, Stern, and Rubin 2003, p. 9) for details.
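The two-step sampling scheme just described can be illustrated with a deliberately simple stand-in for the model: a conjugate normal observation model with known variance, where the posterior of the single parameter θ is available in closed form. Everything here (data and prior values) is invented for illustration; it is not the paper's model.

```python
import random
random.seed(0)

# Toy stand-in for equations (3)-(4): normal observations with known
# variance and a conjugate normal prior on the mean theta.
sigma2 = 4.0            # known observation variance
mu0, tau2 = 0.0, 100.0  # diffuse-ish prior on theta
y = [9.8, 11.2, 10.5, 9.1, 10.9]

n = len(y)
ybar = sum(y) / n
# Conjugate posterior p(theta | y) is normal with:
post_var = 1.0 / (1.0 / tau2 + n / sigma2)
post_mean = post_var * (mu0 / tau2 + n * ybar / sigma2)

# Two-step posterior predictive sampling, as described in the text:
# draw theta~ from p(theta | y), then y_new from p(y_new | theta~).
draws = []
for _ in range(20000):
    theta = random.gauss(post_mean, post_var ** 0.5)
    draws.append(random.gauss(theta, sigma2 ** 0.5))

pred_mean = sum(draws) / len(draws)
print(round(pred_mean, 1))  # close to the posterior mean of theta
```

The predictive draws are overdispersed relative to the posterior of θ (their variance is roughly sigma2 + post_var), which is exactly the point of forecasting from the predictive distribution rather than a plug-in parameter estimate.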


2.2 Constituents of Part Demand

As a first step in making concrete the general discussion above, we need to specify the conditional distribution p(y_it | θ) of unit demand for part i in period t. Rather than defining the distribution directly to begin with (an explicit definition is given in Section 2.9), it is more helpful for the purposes of exposition to consider part demand as determined by the combination of four random quantities:

y_it = (λ_it + ε_it + x_it) γ_i.    (6)

The right hand side of this equation is the product of two factors. The first—which for convenience during the discussion we refer to as the shape process associated with part i—is a discrete-time stochastic process that captures life-cycle effects, together with temporally uncorrelated and autocorrelated errors, as they influence demand throughout the part's life cycle; these are discussed in further detail below. The second factor in the product is a part-specific quantity, γ_i, which scales the part's shape process to match its period demands. This part-specific scaling allows the shape processes of different parts to be parameterized similarly, even though their demands might differ in overall magnitude, and—as we will show—this in turn permits the use of common parameters to pool forecast information.

2.3 Life Cycle Curve

Like the forecasting model in (Yelland 2004), the model in this paper incorporates a stylized representation of a product's life cycle, so as to capture systematic changes in demand (illustrated in Figure 1) as a part is introduced to and withdrawn from the manufacturing process. The representation used in this model is derived from the Weibull distribution, following the work of Moe and Fader (2002), who use it in the analysis of new product adoption.[9] Thus the quantity λ_it in equation (6), which traces the evolution of demand over the life cycle of a part, is determined by the difference between the values of a suitably parameterized Weibull cumulative distribution function (CDF) at t and t − 1:

λ_it = W(t | α_i, δ_i) − W(t − 1 | α_i, δ_i).    (7)

In their work, Moe and Fader use the conventional parameterization of the Weibull curve, according to which the value of the Weibull CDF at t is equal to 1 − e^{−(t/k)^η}, where η and k are (positive) parameters of the distribution. To help the convergence of the MCMC simulator described in Section 3, and as an aid to interpretability, we actually use an alternate parameterization of the Weibull in equation (7), indexed by α_i and δ_i, which are respectively the 20th percentile of the distribution and the difference between its 95th and 20th percentiles.[10] A little algebraic manipulation yields conventional Weibull parameters η_i and k_i corresponding to α_i and δ_i, so that:

W(t | α_i, δ_i) = 1 − e^{−(t/k_i)^{η_i}},  where  η_i = 2.6 / (log(α_i + δ_i) − log(α_i)),  k_i = α_i / 0.22^{1/η_i}.    (8)

[9] Moe and Fader draw in turn on the body of research into new product diffusion modeling summarized by Mahajan, Muller, and Wind (2000).

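As a quick numerical sanity check on this parameterization, the sketch below evaluates the reparameterized Weibull CDF and the per-period increments of equation (7), using the rounded constants 2.6 and 0.22 from equation (8); the (α, δ) values are taken from one of the panels of Figure 1.

```python
import math

def weibull_cdf(t, alpha, delta):
    """Weibull CDF W(t | alpha, delta) under the paper's reparameterization:
    alpha is (approximately) the 20th percentile and alpha + delta the 95th,
    via the rounded constants 2.6 and 0.22 of equation (8)."""
    eta = 2.6 / (math.log(alpha + delta) - math.log(alpha))
    k = alpha / 0.22 ** (1.0 / eta)
    return 1.0 - math.exp(-((t / k) ** eta))

def life_cycle_increment(t, alpha, delta):
    """lambda_it in equation (7): the share of life-cycle demand in period t."""
    return weibull_cdf(t, alpha, delta) - weibull_cdf(t - 1, alpha, delta)

# Check with the (alpha, delta) values annotated on one panel of Figure 1.
alpha, delta = 3.6, 13.0
print(round(weibull_cdf(alpha, alpha, delta), 3))          # ~0.2
print(round(weibull_cdf(alpha + delta, alpha, delta), 3))  # ~0.95
# The per-period increments telescope to (essentially) 1 over the life cycle.
print(round(sum(life_cycle_increment(t, alpha, delta) for t in range(1, 200)), 3))
```

The small discrepancies from 0.2 and 0.95 come from the rounding of the constants; the telescoping sum is what makes the scale factor γ_i interpretable as expected total demand in Section 2.6.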
Moe and Fader’s (2002) use of a Weibull curve to describe the time to first purchase of a new product has theoretical appeal, given the Weibull distribution’s origins in the analysis of events that occur after a period of random duration. The use of the Weibull in this application is rather more pragmatic—as Moe and Fader observe, it offers an appealing combination of parsimony and flexibility. 11 An informal measure of how well the Weibull model captures the trend in the Sun’s part demands with only two parameters can be gauged from Figure 1, where (appropriately scaled) Weibull curves have been fitted to the sample series. Completing the specification of λit given by equations (7) and (8) requires that we provide prior distributions for the part-specific parameters αi and δi . The priors for both of these parameters are hierarchical, incorporating information from parts that constitute the same product, and from parts in general. Treating the case of αi in detail (the treatment of δi is analogous): αi —which is necessarily positive—is drawn from a normal distribution, truncated on the left at 0. 12 The scale parameter σα of this truncated normal distribution is common to all parts, but the location parameter, aprod(i) , is shared only with other parts for the same product (we use the expression “prod(i )” to denote the index of the product to which part i belongs). At the next level of the hierarchy, for all products j, a j is drawn from a normal distribution with mean and variance common to all products. Finally, the mean µ a of this latter distribution has a non-informative prior. In symbols: αi ∼ N[0,∞) ( aprod(i) , σα2 ),

a j ∼ N(µ a , σa2 ),

p(µ a ) ∝ 1.

(9)

Priors for scale parameters σα and σa are discussed in Section 2.7. We use the difference between the 20th and 95th percentiles rather than the 95th percentile itself because αi and δi might reasonably be considered a priori independent, making for easier specification of the model prior. 11 It could be said that our outlook conforms with the technological approach to modeling of Bernardo and Smith (1994, p.238), in that we are concerned less with the “‘true’ mechanisms of the phenomenon under study . . . [than] simply with providing a reliable basis for practical action in predicting . . . the phenomena of interest”. In fact an ab initio argument for the Weibull model might be made by postulating some form of “adoption” process for the parts themselves, along the lines of that presented by Norton and Bass (1987), for example. However, without detailed information about end-user behavior to support it (very difficult to obtain in this context), such a construction would amount to little more than “armchair theorizing”. 12 Strictly speaking, truncation on the left should be at a point slightly greater than 0, but the technical elision is of no practical consequence.

10


2.4 Uncorrelated Errors

The second constituent, ε_it, of the shape process in equation (6) represents deviations in demand from the value specified by λ_it that are uncorrelated over time (for convenience, we refer to these deviations somewhat loosely as "uncorrelated errors"). In the model, ε_it is drawn from a normal distribution with zero mean and a standard deviation specific to part i and period t:

ε_it ∼ N(0, ς_it²).    (10)

Here, ς_it changes to accommodate occasional outliers in part demand, in keeping with the binary selection model described by Congdon (2003, sec. 3.6.1). In this construction, each observation of demand y_it is associated with a latent binary variable v_it ∈ {0, 1}, which identifies y_it as an outlier iff v_it = 1. Observations identified as outliers are drawn from a distribution with the same zero mean, but with a standard deviation four times[13] as large as the standard deviation for non-outlying observations, σ_ε. Thus:

ς_it = (1 + 3 v_it) σ_ε.    (11)

Since the occurrence of an outlier is ipso facto a rare event, the prior for the indicator v_it is a Bernoulli distribution such that the probability that v_it = 1 is 5%. The prior for σ_ε will be detailed in Section 2.7; note that since ς_it is independent of the scale of part i's demand, the same standard deviation parameter σ_ε may be used for all parts. Assuming that ε_it is normally distributed is a substantial technical convenience, and follows the precedent set by Srinivasan and Mason (1986), who use normal errors in a similar context. Strictly speaking, however, it gives rise to forecast distributions that lack coherence in the sense of McCabe and Martin (2005), since they give support to negative demand values. Fortunately, we have found that in practice, point forecasts (the focus of interest for Sun's supply chain) produced using the mean of the forecast distributions from the model (see Section 4 for details) are invariably positive.
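The outlier mixture in equations (10) and (11) can be sketched as follows; σ_ε = 1 is an arbitrary illustrative value (the shape process is scale-free, so nothing is lost by this choice in a sketch).

```python
import random
random.seed(1)

# Sketch of the uncorrelated-error term of equations (10)-(11): with
# probability 0.05 an observation is an outlier (v_it = 1), and its error
# is drawn with standard deviation 4 * sigma_eps instead of sigma_eps.
sigma_eps = 1.0
p_outlier = 0.05

def draw_epsilon():
    v = 1 if random.random() < p_outlier else 0
    scale = (1 + 3 * v) * sigma_eps   # equation (11): 1x or 4x sigma_eps
    return random.gauss(0.0, scale)

draws = [draw_epsilon() for _ in range(50000)]
# Implied marginal variance: 0.95 * 1 + 0.05 * 16 = 1.75 (sd ~ 1.32).
var = sum(e * e for e in draws) / len(draws)
print(round(var, 2))
```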

2.5 Autocorrelated Errors

The quantity x_it in equation (6) represents the value in period t of a latent autoregressive process associated with part i, intended to capture random demand variations that are correlated across time. West and Harrison (1997, p. 300) suggest the use of a time series process of this form as a "catch-all noise model used as a purely empirical representation of residual variation," and we have found that including a latent autoregressive process significantly improves short-term forecasts made with the model. As the name suggests, the value of the autoregressive process in period t is a randomly perturbed linear combination of the values in the preceding two periods:

x_it = ϕ_i1 x_i,t−1 + ϕ_i2 x_i,t−2 + ξ_it,  where ξ_it ∼ N(0, σ_ξ²).    (12)

[13] A scale inflation factor of 3 or 4 is suggested for general use in characterizing discrepant observations by West and Harrison (1997, p. 400).

For identifiability, the terms ξ_it and ε_it are assumed to be independent (see West and Harrison 1997, sec. 2.1), and as a precaution against over-fitting, the standard deviation σ_ξ in equation (12) is fixed at a constant multiple of the standard deviation, σ_ε, used in equation (11) to characterize uncorrelated errors. This multiple constitutes a tuning parameter of the model; we found that setting σ_ξ = 0.8 σ_ε yields good results (as before, the scale-free property of a part's shape process allows the same parameter to be used across parts). Starting values x_i0 and x_i,−1 of the autoregression are provided informative priors centered around zero—a zero expectation is appropriate for an "error" term, and it seems reasonable a priori to assert that there is little "residual variation" (to use West and Harrison's terminology) at the outset of a part's life. Again, since the shape process of each part is scale-free, the same priors may be used for all parts:

(x_i0, x_i,−1)⊤ ∼ N(µ_x0, Σ_x0),   µ_x0 = (0, 0)⊤,   Σ_x0 = diag(2, 2).    (13)

The regression coefficients in equation (12) are specified using a hierarchical prior: ϕ_i1 and ϕ_i2 are drawn directly from a multivariate normal distribution that is common across all parts, and the mean and variance of this common distribution are given a non-informative multivariate Jeffreys prior (Sun and Berger 2006). Thus:

(ϕ_i1, ϕ_i2)⊤ ∼ N(µ_ϕ, Σ_ϕ),   p(µ_ϕ, Σ_ϕ) ∝ |Σ_ϕ|^(−2).    (14)
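As an illustration (not part of the estimation procedure), the latent AR(2) component can be simulated forward as follows; the coefficient values ϕ_i1 = 0.5 and ϕ_i2 = −0.2 are hypothetical, chosen only to give a stationary process.

```python
import random
random.seed(2)

# Sketch of the latent AR(2) "catch-all noise" process of equation (12),
# with the paper's tuning choice sigma_xi = 0.8 * sigma_eps and starting
# values drawn from the equation (13) prior, N(0, 2) on each component.
sigma_eps = 1.0
sigma_xi = 0.8 * sigma_eps
phi1, phi2 = 0.5, -0.2      # hypothetical values of (phi_i1, phi_i2)

x_prev2 = random.gauss(0.0, 2 ** 0.5)   # x_{i,-1} ~ N(0, 2)
x_prev1 = random.gauss(0.0, 2 ** 0.5)   # x_{i,0}  ~ N(0, 2)

path = []
for t in range(1, 25):
    xi = random.gauss(0.0, sigma_xi)
    x = phi1 * x_prev1 + phi2 * x_prev2 + xi
    path.append(x)
    x_prev1, x_prev2 = x, x_prev1

print(len(path), round(sum(path) / len(path), 2))
```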

2.6 Scale Factor

The prior for the (necessarily positive) scale factor γ_i of part i in equation (6) has the same hierarchical structure as that provided for parameters α_i and δ_i—namely, a left-truncated normal distribution whose location parameter is shared with other parts of the same product, governed itself by a normal distribution with non-informative priors (again, see Section 2.7 for the prior of σ_g):

γ_i ∼ N_[0,∞)(g_prod(i), σ_γ²),   g_j ∼ N(µ_g, σ_g²),   p(µ_g) ∝ 1.    (15)

Even in combination with the likelihood induced by equation (6), however, this prior determines γ_i with insufficient precision to produce good forecasts—with only a few demands observed for part i, the posterior distribution for γ_i gives undue support to very large values. To overcome this problem, we use "superannuated" parts, whose entire life cycles have been observed, to place additional constraints on the distribution parameters in (15)—and thus, indirectly, on the scale factors of current parts, whose life cycles are yet to complete. To see how, consider that conditional on γ_i, the expectation (in period 0) of the total demand for part i is equal to γ_i, since from equation (6):

E[∑_{t=1}^{∞} y_it] = γ_i E[∑_{t=1}^{∞} (λ_it + ε_it + x_it)]
                    = γ_i { E[∑_{t=1}^{∞} λ_it] + ∑_{t=1}^{∞} E[ε_it] + ∑_{t=1}^{∞} E[x_it] }
                    = γ_i E[∑_{t=1}^{∞} λ_it]    (16)
                    = γ_i E[W(1|α_i, δ_i) − W(0|α_i, δ_i) + W(2|α_i, δ_i) − W(1|α_i, δ_i) + …]
                    = γ_i E[W(∞|α_i, δ_i) − W(0|α_i, δ_i)]
                    = γ_i (1 − 0),    (17)

where in equation (16) we use the fact that the expectations in period 0 of both error terms are 0, and equation (17) follows from the definition of the Weibull CDF. Therefore, if the entire life cycle of part i has been observed, it is plausible to assert that the observed total demand for that part, s_i, is approximately equal to γ_i. Operationally, we have found that asserting s_i ∼ N(γ_i, [0.2 γ_i]²) for superannuated parts yields good results.

2.7 Priors for Scale Parameters

We use non-informative priors for scale parameters of normal and truncated normal distributions throughout the model. Gelman (2006) demonstrates that producing a truly "non-informative" prior for variance parameters is a delicate business, particularly in hierarchical models such as this one (where, for example, the popular "weakly informative" inverse-gamma prior of Spiegelhalter, Thomas, Best, Gilks, and Lunn (2003) can lead to degenerate posterior distributions for the variances of group parameters). Here, we use a uniform density on the positive half-line as the prior for the standard deviations in (9), which, as Gelman (2006) indicates, is formally equivalent to an inverse-χ² density with −1 degrees of freedom for the corresponding variances. Explicitly, therefore, for σ◦ ∈ {σ_α, σ_a, σ_δ, σ_d, σ_ε, σ_γ, σ_g}, we have p(σ◦) ∝ I(σ◦ > 0), where the indicator function I(·) is equal to 1 if its argument is true, and 0 otherwise.

2.8 Distribution of Part Demand

To complete the specification of the model, we can derive explicitly the conditional distribution of part demand introduced in Section 2.1, which we passed over in Section 2.2. First:

    yit = γi(λit + xit) + γi εit    from (6).

Furthermore:

    εit ∼ N(0, ςit²)    from (10)
    ⇒ γi εit ∼ N(0, [γi ςit]²).

Therefore, conditional on the model parameters:

    yit ∼ N(γi(λit + xit), [γi ςit]²).

Figure 2. Model summary:

    Part demand            yit = (λit + εit + xit)γi
    Life cycle curve       λit = W(t | αi, δi) − W(t − 1 | αi, δi),
                           αi ∼ N[0,∞)(aprod(i), σα²),   aj ∼ N(µa, σa²),
                           δi ∼ N[0,∞)(dprod(i), σδ²),   dj ∼ N(µd, σd²)
    Uncorrelated errors    εit ∼ N(0, ςit²),   ςit = (1 + 3vit)σε,   vit ∼ Bern(0.05)
    Autocorrelated errors  xit = ϕi1 xi,t−1 + ϕi2 xi,t−2 + ξit,   ξit ∼ N(0, σξ²),   σξ = 0.8 σε,
                           (ϕi1, ϕi2)⊤ ∼ N(µϕ, Σϕ),   p(µϕ, Σϕ) ∝ |Σϕ|^(−3/2),
                           (xi0, xi,−1)⊤ ∼ N(µx0, Σx0),   µx0 = (0, 0)⊤,   Σx0 = diag(2, 2)
    Scale factor           γi ∼ N[0,∞)(gprod(i), σγ²),   gj ∼ N(µg, σg²),   si ∼ N(γi, [0.2γi]²)

Unless otherwise stated, location parameters of the form µ◦ and scale parameters σ◦ have non-informative priors p(µ◦) ∝ 1 and p(σ◦) ∝ I(σ◦ > 0), respectively. Indexes i, j and t range over parts, products and periods, resp.
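To make the generative structure just summarized concrete, the sketch below simulates a single part's demand path from the model's sampling equations. The particular values of ϕ, σε and the outlier rate are illustrative assumptions of this sketch; hyperparameter draws and the hierarchical priors are taken as given rather than sampled.

```python
import math
import random

def weibull_cdf(t, alpha, delta):
    """W(t | alpha, delta): Weibull CDF with shape alpha and scale delta."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / delta) ** alpha))

def simulate_part_demand(alpha, delta, gamma, periods,
                         sigma_eps=0.02, phi=(0.5, -0.1), rng=None):
    """Draw one synthetic demand path y_i1..y_iT from the model's sampling
    equations: y_it = gamma * (lambda_it + x_it + eps_it)."""
    rng = rng or random.Random(0)
    sigma_xi = 0.8 * sigma_eps            # as specified in the model summary
    x_prev, x_prev2 = 0.0, 0.0            # latent AR(2) state, started at its prior mean
    ys = []
    for t in range(1, periods + 1):
        lam = weibull_cdf(t, alpha, delta) - weibull_cdf(t - 1, alpha, delta)
        v = 1 if rng.random() < 0.05 else 0              # outlier indicator v_it
        eps = rng.gauss(0.0, (1 + 3 * v) * sigma_eps)    # inflated-variance error
        x = phi[0] * x_prev + phi[1] * x_prev2 + rng.gauss(0.0, sigma_xi)
        x_prev2, x_prev = x_prev, x
        ys.append(gamma * (lam + x + eps))
    return ys
```

Because the λit increments telescope to the Weibull CDF, total simulated demand over a full life cycle is close to γi, consistent with the treatment of si above.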


Figure 3. Conditional independence relationships of the model in Figure 2: Indexes i, j and t range over parts, products and periods, resp.

2.9 Model Summary

Figure 2 summarizes the model, collecting together the definitions made in this section. The model's hierarchical structure is illustrated by Figure 3, which captures dependencies in the form of a directed graph of the sort described by Rossi, Allenby, and McCulloch (2005, pp. 67–81), for example. Quantities in the model are represented as nodes in the graph, with a node's parents comprising those quantities bearing directly on the definition of the quantity represented by that node. 14 Rossi et al. (ibid.) demonstrate how such a directed graph may be used to derive an explicit expression for the prior distribution along the lines set out in equation (5). We refrain from writing out the prior in full here, as it would contribute much clutter but little additional information; estimation of the model depends not on the explicit expression of the prior distribution, but on the conditional distributions given in the MCMC sampler in the next section.

Conditional independence assertions involving ranges of index variables—written out schematically in equation (5)—are left implicit in the diagram. Buntine’s (1994) plates provide a graphical mechanism for expressing such relationships explicitly, but the complexity of this model means that the use of plates here tends to obscure matters, particularly given the presence of the autoregressive process xi .


3 Estimation

This section discusses the simulation procedure used to approximate the posterior distributions of the model parameters, and summarizes the approximate posteriors produced with the full sample of part demands described in Section 1.

3.1 Gibbs Sampler As is commonly the case in modern applied Bayesian statistics, estimation of posterior distributions for the model described in the previous section is carried out using a Markov chain Monte Carlo simulator—specifically a Gibbs sampler. Schematic descriptions of Gibbs sampling now abound in the literature—see (Gilks, Richardson, and Spiegelhalter 1996, chp. 1), for example; briefly, beginning with an arbitrary configuration of the random variables in the model, such a simulator constructs a Markov chain whose states converge to a dependent sample from the joint posterior of those random variables. Each transition in this Markov chain involves drawing a new value of one of the variables from its distribution conditional on the current value of the other variables and the observed data. The individual transitions of this particular sampler are described below. Many of the steps rely on standard results concerning conjugate updating in Bayesian analysis, which may be found in texts such as (Gelman et al. 2003) or (Bernardo and Smith 1994). Where such closed-form updates are not available, we resort to Metropolis-Hastings sampling (also discussed by Gilks et al.); the latter relies on proposed values that are generated using Geweke and Tanizaki’s (2003) Taylored chain procedure, details of which are provided in Appendix C. In the following, each step is introduced by the conditional distribution from which a sample is to be drawn. Variables of which the sampled quantity is conditionally independent are omitted from the conditioning set. We use the abbreviations: vi = (vi1 , . . . , viTi ), xi = ( xi1 , . . . , xiTi ) and yi = (yi1 , . . . , yiTi ). Also, for each product j, let parts( j) = {i | prod(i ) = j} be the set of associated parts. In the interests of brevity, we specify draws for a j , σα , µ a and σa only; samples for d j , σδ , µd and σd , and g j , σγ , µ g and σg are generated in an analogous fashion. 
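Before turning to the individual steps, the overall scheme can be illustrated with a toy Gibbs sampler, unrelated to the parts model: for a standard bivariate normal with correlation ρ, both full conditionals are normal, and the chain simply alternates between them. All names and values here are illustrative.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter=5000, burn_in=500, seed=1):
    """Toy Gibbs sampler for (X, Y) standard bivariate normal with correlation rho.
    Each full conditional is normal: X | Y = y ~ N(rho * y, 1 - rho^2)."""
    rng = random.Random(seed)
    x = y = 0.0                        # arbitrary starting configuration
    s = math.sqrt(1.0 - rho * rho)     # conditional standard deviation
    draws = []
    for i in range(n_iter):
        x = rng.gauss(rho * y, s)      # draw x from x | y
        y = rng.gauss(rho * x, s)      # draw y from y | x
        if i >= burn_in:
            draws.append((x, y))
    return draws
```

Each transition resamples one quantity from its distribution conditional on the current values of the others, which is exactly the pattern followed by the steps below.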
γi | yi, σε, αi, δi, xi, vi, gprod(i), σγ, si: The kernel of the full conditional distribution is given by the expression:

    [ ∏_{t=1}^{Ti} N(yit | γi(λit + xit), [γi(1 + 3vit)σε]²) ] × N[0,∞)(γi | gprod(i), σγ²) × ϕi,

where ϕi = N(si | γi, [0.2γi]²) if the entire life cycle of part i has been observed, and ϕi = 1 otherwise. In either case, sampling is carried out using the Geweke-Tanizaki Taylored chain to generate a proposal in a Metropolis-Hastings step.

αi | yi, γi, σε, δi, xi, vi, aprod(i), σα: The full conditional is proportional to the expression:

    [ ∏_{t=1}^{Ti} N(yit | γi(λit + xit), [γi(1 + 3vit)σε]²) ] × N[0,∞)(αi | aprod(i), σα²).

This is sampled using a Taylored chain proposal in a Metropolis-Hastings step.

δi | yi, γi, σε, αi, xi, vi, dprod(i), σδ: As above, but this time sampling from:

    [ ∏_{t=1}^{Ti} N(yit | γi(λit + xit), [γi(1 + 3vit)σε]²) ] × N[0,∞)(δi | dprod(i), σδ²).

vit | yit, γi, σε, αi, δi, xit: Let ȳit = γi(λit + xit), and:

    p0 = (1 − 0.05) × N(yit | ȳit, [γi σε]²),
    p1 = 0.05 × N(yit | ȳit, [4γi σε]²).

Sample vit from the Bernoulli distribution with success probability p1/(p0 + p1).

xi | yi, γi, σε, αi, δi, vi, µx0, Σx0: Following West and Harrison (1997, example 9.6), xi is given by the state vector of the dynamic linear model, or DLM, specified by (Ft, Gt, Vt, Wt), for t ∈ 1, . . . , Ti, where:

    Ft = (1, 0)⊤,   Gt = [ ϕi1  ϕi2 ; 1  0 ],   Vt = [(1 + 3vit)σε]²,   Wt = diag(σξ², 0).

Artificial “observations” for this DLM are equal to yit/γi − λit, for t ∈ 1, . . . , Ti. First and second moments of the multivariate normal prior for the initial state configuration (m0 and C0 in West and Harrison's formulation) are µx0 and Σx0, respectively.
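The state draw relies on forward filtering/backward sampling, described next. Its mechanics are easiest to see in the univariate local-level special case (state θt = θt−1 + wt, observation zt = θt + vt); the bivariate AR(2) state used here follows the same filter-then-sample pattern with matrix arithmetic. The sketch below is a simplified illustration, not the model's exact DLM.

```python
import math
import random

def ffbs_local_level(z, V, W, m0=0.0, C0=1e6, rng=None):
    """Forward-filter, backward-sample the states of a local-level DLM:
    theta_t = theta_{t-1} + w_t (variance W), z_t = theta_t + v_t (variance V)."""
    rng = rng or random.Random(0)
    m, C = m0, C0
    ms, Cs, Rs, As = [], [], [], []       # filtered moments, predictive moments
    for obs in z:
        a, R = m, C + W                   # state prediction
        Q = R + V                         # one-step observation variance
        A = R / Q                         # Kalman gain
        m, C = a + A * (obs - a), R - A * R
        ms.append(m); Cs.append(C); Rs.append(R); As.append(a)
    # backward sampling, starting from the final filtered distribution
    theta = [0.0] * len(z)
    theta[-1] = rng.gauss(ms[-1], math.sqrt(Cs[-1]))
    for t in range(len(z) - 2, -1, -1):
        B = Cs[t] / Rs[t + 1]
        h = ms[t] + B * (theta[t + 1] - As[t + 1])
        H = Cs[t] * (1.0 - B)
        theta[t] = rng.gauss(h, math.sqrt(max(H, 0.0)))
    return theta
```

With near-noiseless observations the sampled state path tracks the data closely, which provides a simple correctness check.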


Procedures for sampling the state of such a DLM (generally under the moniker forward filtering/backwards sampling algorithms) are described by Frühwirth-Schnatter (1994), West and Harrison (1997) and Durbin and Koopman (2001), amongst others.

ϕi1, ϕi2 | xi, σξ, µϕ, Σϕ: Draw from the posterior distribution of the coefficients in the linear regression xit ∼ N(ϕi1 xi,t−1 + ϕi2 xi,t−2, σξ²), given a prior (ϕi1, ϕi2)⊤ ∼ N(µϕ, Σϕ)—see e.g. (Gelman et al. 2003, chp. 8).

µϕ, Σϕ | ϕ1, . . . , ϕN: Conjugate updating for parameters of a multivariate normal distribution.

aj | {αk | k ∈ parts(j)}, σα: Sampling is carried out using a device due to Griffiths (2004): Specifically, for k ∈ parts(j), let:

    α̃k = aj + σα Φ⁻¹( [Φ((αk − aj)/σα) − Φ(−aj/σα)] / [1 − Φ(−aj/σα)] ),    (18)

where Φ(·) denotes the standard normal cumulative distribution function. Then as Griffiths demonstrates, supposing that α̃k ∼ N(aj, σα²) and drawing from the conditional distribution aj | {α̃k | k ∈ parts(j)}, σα (a straightforward application of semi-conjugate updating) is equivalent to drawing from aj | {αk | k ∈ parts(j)}, σα given that αk ∼ N[0,∞)(aj, σα²).

σα | α1, . . . , αN, a1, . . . , aJ: Again, using Griffiths's device, draw from σα | α̃1, . . . , α̃N, given that α̃i − aprod(i) ∼ N(0, σα²), where α̃i is defined in equation (18).

µa, σa | a1, . . . , aJ: Two-step semi-conjugate updating for parameters of a normal distribution, drawing µa first from µa | a1, . . . , aJ, σa, and then σa from σa | a1, . . . , aJ, µa.

σε | y1, α1, δ1, γ1, x1, v1, . . . , yN, αN, δN, γN, xN, vN: Let rit = (1 + 3vit)⁻¹(yit/γi − λit − xit), for i = 1, . . . , N, t = 1, . . . , Ti. Then rit ∼ N(0, σε²), and conjugate updating applies.
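As a numerical sanity check of Griffiths's device, the transform in equation (18) can be coded directly. The inverse normal CDF below uses crude bisection, an expedient of this sketch rather than anything prescribed by the paper.

```python
import math
import random

def phi_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi_inv(p, lo=-40.0, hi=40.0):
    """Inverse standard normal CDF by bisection (adequate for a demo)."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if phi_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def griffiths_transform(alpha_k, a_j, sigma):
    """Map a draw from the left-truncated normal N_[0,inf)(a_j, sigma^2) to a draw
    distributed as the untruncated N(a_j, sigma^2), per equation (18)."""
    p0 = phi_cdf(-a_j / sigma)                                 # mass truncated away
    u = (phi_cdf((alpha_k - a_j) / sigma) - p0) / (1.0 - p0)   # uniform on (0, 1)
    return a_j + sigma * phi_inv(u)
```

Applying the transform to truncated-normal draws yields (to numerical accuracy) untruncated normal draws, so the standard semi-conjugate updates apply to the transformed values.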


Figure 4. Posterior distributions and convergence diagnostics for model fit to the full sample of demand series: On the left are interquartile ranges and 1.5×interquartile ranges of quantities associated with selected parts (pt.-), products (pd.-) and hyperparameters, together with the mean and standard deviation of the corresponding distribution. On the right are values of Geweke's convergence statistic; all but three of the values lie between the 5th and 95th percentiles of the standard normal distribution (statistic values and p-values appear on the far right), so convergence is reasonably assured.

3.2 Posterior Estimates and Diagnostics

Figure 4 (inspired by similar figures on e.g. p. 351 of Gelman and Hill 2006) illustrates the results of estimating the model with the full sample of demand histories described in Section 1.1. Here the Gibbs sampler was run in a single chain for 4,000 iterations, with samples from the first 1,200 discarded; no thinning of the remaining samples was performed. On the left of the Figure are displayed posterior distributions for the quantities associated with a random selection of parts (“pt.” 8, 10, 29 and 35) and randomly selected products (“pd.” 3, 4, 5 and 8), as well as population-level parameters. Parameters αi, δi, γi and ϕi1 are summarized for the selected parts, and the fitted value of yit in the fifth period of each part's life cycle 15 is displayed as “y.5”. Also shown are the product-level location parameters aj for the selected products, as well as scale parameters σa, σα and σε (labeled “a,” “alpha” and “epsilon” in the Figure). Each posterior distribution is summarized graphically by a condensed box-and-whisker plot, 16 with the distribution's mean and (parenthesized) standard deviation given numerically. The right-hand side of the Figure plots Geweke's (1992) convergence diagnostic for each of the quantities in question. Derived from a comparison of the first and last segments of the Markov chain associated with a quantity, Geweke's statistic z has an asymptotically standard normal distribution if the chain is stationary (i.e. convergence has occurred). On the diagram, values of z are plotted together with the 5th and 95th percentiles of the standard normal distribution. The plots indicate that convergence has been achieved—note that with 27 quantities displayed, we would expect around 3 values of z to fall outside of the percentile bounds even with convergence.

4 Testing Forecast Performance

To test the forecasting effectiveness of the model, we used it to produce forecasts for the full demand series sample described in Section 1.1, and compared its performance with that of three other forecasting methods.

4.1 Setup

Data used in the test comprised records of period demand for the sample of 45 parts in Section 1.1, which were used in 9 products in all. Together, the life cycles of the parts in the test spanned 54 planning periods. Each part i was associated with a period Si, the period in which the part was first used (and in which demand for the part was first observed), and a life cycle length Li, the number of periods during which the part was in use. From the demand records, a 45 × 54 matrix was assembled, with element Dip equal to the number of units of part i demanded in test period p—Dip = 0 for p < Si or p ≥ Si + Li. 17

Since all of the forecasting methods require some historical data at the outset of the test to make forecasts, the parts from 3 randomly-chosen products—16 parts in total—were held in reserve. Forecasts were produced for the remaining 29 holdout parts, belonging to 6 products. Forecasts were made using the candidate methods in each of the 54 periods spanned by the test products' life cycles. The forecast horizon was a single planning period, since this encompassed the lead time of the bulk of the manufacturing parts at issue.

In producing the forecasts, we endeavored to mimic as closely as possible the actual demand information available in each forecast period—reserve parts aside, only demands which had occurred prior to the forecast period were provided to the forecasting methods. For the forecasting method based on the model described in this paper (and to a lesser extent for the other methods tested) it was imperative that all demand series provided were aligned, so that the demand recorded in the first period of a part's life cycle appeared as the first value in the associated demand series (the yi of Section 2.1) input to the forecasting method, regardless of which period the part's life cycle actually began. Therefore, for each period p = 1, . . . , 54, the following procedure was followed:

(1) Define the set C = {i | Si ≤ p}, which collects “current” parts whose life cycles began on or before p.
(2) For i = 1, . . . , 45, let Ti equal the number of periods of demand observed for part i prior to period p:

    Ti = Li if i is reserved;  Ti = 0 if i ∉ C;  Ti = min(Li, p − Si) otherwise.    (19)

(3) Define the set F = {k ∈ C | Tk < Lk} of parts for which forecasts are to be calculated (namely, current parts barring reserved parts and parts whose life cycles concluded before p).
(4) Assemble demand series (y1, . . . , y45), where for i = 1, . . . , 45, yi is the empty sequence if Ti = 0, and otherwise consists of the sequence (Di,Si, . . . , Di,Si+Ti−1).
(5) Now using each of the forecasting methods, for each of the parts k in the set F, produce a forecast ŷk,Tk+1, which corresponds to the actual part demand recorded as Dkp.

The four candidate forecasting methods are described below.

15 By analogy with regression diagnostics, the “fitted value” of yit is the value predicted by the other random quantities in the model.
16 “Boxes” delimit the interquartile range of the distributions, and “whiskers” extend 1.5 times the interquartile range from the ends of the boxes—see Tukey (1977) for further details.
17 Throughout this section, we use the index p to refer to a period within the span of the test, and t to index the demand series for a particular part, i.e. the series yi given below. Since yi actually begins in test period Si, in general t = p − Si + 1, for a given part i.
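The bookkeeping in steps (1)–(3) is mechanical; the sketch below renders it with hypothetical data structures (the dictionaries and part identifiers are illustrative, not from the study):

```python
def observed_periods(S, L, reserved, p):
    """Equation (19): periods of demand observed for each part prior to period p.
    S: part -> first-use period; L: part -> life-cycle length; reserved: held-out ids."""
    T = {}
    for i in S:
        if i in reserved:
            T[i] = L[i]                   # reserved parts contribute their full history
        elif S[i] > p:                    # i not in C: life cycle has not yet begun
            T[i] = 0
        else:
            T[i] = min(L[i], p - S[i])
    return T

def forecast_set(S, L, reserved, p):
    """The set F = {k in C | T_k < L_k} of parts to forecast in period p."""
    T = observed_periods(S, L, reserved, p)
    return {i for i in S if S[i] <= p and T[i] < L[i]}
```

Reserved parts are excluded from F automatically, since their Ti equals Li.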


Bench: This is an all-but-trivial “benchmark” method that serves as a baseline for comparison. If Tk > 0, it simply repeats the observation from the previous period. For the first period of a part's life cycle (i.e. when Tk = 0), it uses the mean of the first-period demands for other parts of the same product, if there are any suitable demand histories in the data provided to the method, and the mean of the available first-period demands of all parts otherwise:

    ŷk,Tk+1 = yk,Tk if Tk > 0;  ŷk,Tk+1 = mean{yi1 | i ∈ Πk} otherwise.    (20)

Here, the expression “mean S” denotes the arithmetic mean of the set S, and the set Πk is defined as follows:

    Πk = Pk if Pk ≠ ∅;  Πk = H otherwise,    (21)

where:

    Pk = {i ∈ H | prod(k) = prod(i)},   H = {i | Ti > 0}.
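Equations (20) and (21) translate into a few lines; the dictionary-based data layout is an assumption of this sketch:

```python
def bench_forecast(k, histories, prod):
    """Benchmark (naive) forecast for part k, per equation (20).
    histories: part -> list of observed demands (the aligned series y_i);
    prod: part -> product identifier."""
    y_k = histories.get(k, [])
    if y_k:                                   # T_k > 0: repeat the last observation
        return y_k[-1]
    # T_k = 0: average first-period demand of "like" parts, equation (21)
    H = [i for i, y in histories.items() if y]
    P_k = [i for i in H if prod[i] == prod[k]]
    pool = P_k if P_k else H
    firsts = [histories[i][0] for i in pool]
    return sum(firsts) / len(firsts)
```

For a part with no history and no same-product peers, the method falls back to the first-period mean across all parts with history.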

Judg: This emulates the forecasting method currently used in Sun's supply chain. It relies on the provision (usually by the Company's sales and marketing units) of a forecast ŵip for each part i, of total unit demand for the product of which part i is a component, for the quarter into which planning period p falls. 18 Then with Πk defined as in equation (21), define an attach rate, ρk, which for Tk > 0 represents demand for part k in period Tk as a proportion of the forecast quarterly demand of the corresponding product, and for Tk = 0 is the mean of the first-period attach rates for like parts:

    ρk = mean{yi1/ŵi,Si | i ∈ Πk} if Tk = 0;  ρk = yk,Tk/ŵk,Sk+Tk−1 otherwise.    (22)

The assumption in forecasting with this method is that the attach rate should remain reasonably stable from period to period, so we let:

    ŷk,Tk+1 = ρk ŵkp.    (23)
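Equations (22) and (23) can be sketched as follows, with one simplification: the product-demand forecasts ŵ are supplied here as per-period lists aligned with each part's series (plus one extra entry for the forecast period), rather than indexed by calendar quarter as in the paper. All names are illustrative.

```python
def judg_forecast(k, histories, w_hat, like_parts):
    """Attach-rate forecast, per equations (22)-(23).
    histories: part -> aligned demand series; w_hat: part -> product-demand
    forecasts aligned with that series (one extra entry for the forecast
    period); like_parts: part -> the peer set Pi_k used for cold starts."""
    y_k = histories[k]
    if y_k:                               # T_k > 0: most recent observed attach rate
        rho = y_k[-1] / w_hat[k][len(y_k) - 1]
    else:                                 # T_k = 0: mean first-period rate of like parts
        rates = [histories[i][0] / w_hat[i][0] for i in like_parts[k]]
        rho = sum(rates) / len(rates)
    return rho * w_hat[k][len(y_k)]       # equation (23)
```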

ExpS

As an exemplar of the exponential smoothing techniques outlined in Section 1.2, we use the forecast package developed for the R statistical programming environment (Venables and Smith 2002) by Hyndman and Khandakar (2008). An embodiment of the forecasting framework developed by Hyndman, Koehler, Snyder, and Grose (2002)—itself an outgrowth of the structural approach to exponential smoothing detailed at length by Hyndman et al. (2008)—the forecast package provides for the automatic selection, estimation and extrapolation of an exponential smoothing model from a taxonomy of such models that incorporate additive or multiplicative seasonal and error components and/or additive, multiplicative or damped trends. Its performance is demonstrably superior in a number of applications, as Hyndman et al. (2002) relate.

A number of considerations recommended the forecast package as a candidate for comparison: a) The package provides access to state-of-the-art exponential forecasting models, together with sophisticated techniques for selecting, fitting and forecasting with them; b) by operating entirely automatically, the package represents a plausible representative of the best “off the shelf” technology that might be available to operations managers; and c) again, by dint of its automatic operation, the package avoids the possibility that incompetence or invidiousness on our part might affect its performance in the comparison.

Forecasting the value ŷk,Tk+1 given the (possibly empty) demand history yk is very straightforward:

(1) Load the package forecast into R.
(2) Attempt to fit an exponential smoothing model to yk using the package's ets(·) function.
(3) If the attempt is unsuccessful (that is, if ets(·) returns an error—almost invariably because yk is too short), use the forecast produced by applying the Bench method to the same series.
(4) Otherwise, apply the function forecast(·) to the model produced in step 2. The point forecast used for ŷk,Tk+1 is the mean component of the list returned by forecast(·). 19

18 For example (with minor abuse of notation), ŵi,Q2P1 is a forecast of demand for product prod(i) in quarter Q2. For the test, we used published forecasts of product demand from the Company's historical records.
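The R package's automatic model taxonomy is not reproduced here; as a rough stand-in that conveys the recursive one-step-ahead update underlying such methods, here is simple exponential smoothing with a fixed, illustrative smoothing weight (the package would instead select and estimate a model automatically):

```python
def ses_forecast(y, alpha=0.3):
    """One-step-ahead forecast by simple exponential smoothing; a hedged
    stand-in for the automatic ets() machinery, not a reproduction of it."""
    if not y:
        raise ValueError("need at least one observation; fall back to Bench")
    level = y[0]                      # initialize the level at the first observation
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level
```

As with the R-based procedure, an empty history must be handled by falling back to the Bench method.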

Mod: Forecasting with the Bayesian model simply involves subjecting the assembled demand series (y1, . . . , y45) to the procedure set out in Section 2.1, using the MCMC simulator described in Section 3.1 to produce a sample from the posterior predictive distribution for yk,Tk+1. In keeping with the approach used in the method ExpS, we use the mean of this sample as the point estimate ŷk,Tk+1.

4.2 Results The exercise described in the previous section resulted in rolling one-step-ahead forecasts for the 29 parts in the holdout set. A preliminary illustration of the relative performance 19

19 R code for steps 2–4 is (roughly) as follows:

xit   Value of the latent autoregressive process for part i in period t.
ξit   Error associated with xit.
ϕi = (ϕi1, ϕi2)   Coefficients of autoregression for part i.


Notation   Meaning

Scale factor
γi   Scale factor associated with part i; equal to expected life-time demand for part i.
gj   Location of platform-level prior for γi, for i ∈ parts(j).
si   Observed total demand for part i over its entire life cycle (if available).

Parameters
µθ, σθ   Parameters (usually mean and std. dev., resp.) of prior distribution for generic parameter θ.

Testing
Si   The period in which part i was first used.
Li   The length of part i's entire life cycle.
p ∈ {1, . . . , 54}   Index ranging over planning periods spanned by the test.
Dip   Element of matrix recording actual part demands, equal to the unit demand for part i during period p.
ŷk,Tk+1   A point estimate of yk,Tk+1.
C   Set of parts whose life cycles began in or before period p.
F   Set of parts to be provided forecasts in period p.
m ∈ 1, . . . , M   Indexes ranging over forecasting methods.

Miscellaneous
∅   The empty set.
mean S   The mean of set S.
I(·)   The indicator function, equal to 1 if its argument is true, and 0 otherwise.
x⊤   The transpose of x.

B Standard Probability Distributions

Distribution, description, and density/mass function:

N(µ, σ²)   Normal distribution with mean µ and standard deviation σ.
    N(x | µ, σ²) = (2πσ²)^(−1/2) exp[−(x − µ)²/(2σ²)]

Weib(λ, k)   Weibull distribution with shape λ and scale k.
    Weib(x | λ, k) = λ k^(−λ) x^(λ−1) e^(−(x/k)^λ),   x ≥ 0

Bern(p)   The Bernoulli distribution with success probability p.
    Bern(x | p) = p^x (1 − p)^(1−x),   x ∈ {0, 1}

N(µ, Σ)   Multivariate normal distribution with mean µ and positive definite d × d covariance matrix Σ.
    N(x | µ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp[−½ (x − µ)⊤ Σ⁻¹ (x − µ)]

N[0,∞)(µ, σ²)   The normal distribution N(µ, σ²), truncated on the left at 0.
    N[0,∞)(x | µ, σ²) = N(x | µ, σ²)/Φ(µ/σ),   x ≥ 0

Exp(λ)   The exponential distribution with parameter λ.
    Exp(x | λ) = λ e^(−λx),   x > 0

Mult(n; p1, . . . , pk)   The multinomial distribution with n trials and bin probabilities p1, . . . , pk.
    Mult(x | n; p1, . . . , pk) = [n!/(x1! · · · xk!)] p1^x1 · · · pk^xk,   xj = 0, 1, 2, . . . , n;  Σ xj = n

Inv–χ²(ν)   The inverse chi-squared distribution with ν degrees of freedom.
    Inv–χ²(x | ν) = [2^(−ν/2)/Γ(ν/2)] x^(−(ν/2+1)) exp[−1/(2x)],   x > 0

C The Geweke-Tanizaki (2003) Taylored Chain 26

With x the current state of the sampler, to produce a proposal x⁺ for a target kernel p(z), let q(z) = log p(z), with q′(z) and q″(z) the first and second derivatives thereof. Proceed by cases:

26 Sampling techniques similar to the Taylored chain are also discussed by Qi and Minka (2002).


Case 1: q″(x) < −ε, where ε is a suitable small constant, such as 0.1. 27 Rewrite the Taylor expansion of q(z) around x:

    q(z) ≈ q(x) + q′(x)(z − x) + ½ q″(x)(z − x)²
         = −½ (−q″(x)) [z − (x − q′(x)/q″(x))]² + c,

where c does not depend on z. Since q″(x) < 0, the quadratic term constitutes the exponential part of a normal density, which implies that the target kernel in the vicinity of x may be approximated by a normal distribution with mean x − q′(x)/q″(x) and standard deviation 1/√(−q″(x)); sample x⁺ accordingly.

Case 2: q″(x) ≥ −ε and q′(x) < 0. Approximate q(z) by a line passing through x and x1*, the largest mode of q(z) smaller than x:

    q(z) ≈ q(x1*) + [(q(x1*) − q(x))/(x1* − x)] (z − x1*).

In this case, the linear term indicates an exponential distribution, and the proposal is: 28

    x⁺ = x̂1 + w,  where w ∼ Exp(λ1),  λ1 = (q(x1*) − q(x))/(x1* − x),  x̂1 = x1* − 1/λ1.

Case 3: q″(x) ≥ −ε and q′(x) > 0. Approximate q(z) by a line passing through x and x2*, the smallest mode of q(z) larger than x. The proposal is developed in a manner parallel to that in Case 2:

    x⁺ = x̂2 − w,  where w ∼ Exp(λ2),  λ2 = (q(x2*) − q(x))/(x2* − x),  x̂2 = x2* + 1/λ2.

Case 4: q″(x) ≥ −ε and q′(x) = 0. In this instance, x⁺ is sampled from a uniform distribution over a range [x1, x2], such that x1 < x < x2. End points x1 and x2 are set to suitable modes of q(·), if they can be found, and to user-supplied values otherwise.

27 By ensuring that |q″(x)| > 0, using ε rather than 0 reduces the occurrence of proposed values that depart too markedly from the current state.
28 The origin of the proposal, x̂1, is offset from the mode x1* in order to guarantee irreducibility of the resulting Markov chain; see (Geweke and Tanizaki 2003) for details.
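Case 1 is the workhorse in practice. The sketch below fits the local normal by numerical differentiation; the finite-difference step and the hard-coded ε are expedients of this illustration (an actual sampler would typically use analytic derivatives of the conditional kernels):

```python
import math

def case1_normal_proposal(q, x, h=1e-5):
    """Case 1 of the Taylored chain: when the second derivative of q = log p
    is sufficiently negative at x, return the (mean, sd) of the locally
    fitted normal proposal."""
    q1 = (q(x + h) - q(x - h)) / (2 * h)              # numerical first derivative
    q2 = (q(x + h) - 2 * q(x) + q(x - h)) / (h * h)   # numerical second derivative
    if q2 >= -0.1:                                    # epsilon = 0.1, as in the text
        raise ValueError("q''(x) not sufficiently negative: use cases 2-4")
    mean = x - q1 / q2
    sd = 1.0 / math.sqrt(-q2)
    return mean, sd
```

For an exactly Gaussian target the local fit recovers the target's own mean and standard deviation, regardless of the current state x.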

D Forecast Error Metrics

Assume that forecast method m provides forecast ŷmit for part i and t ∈ 1, . . . , Ti, and that yit is the corresponding actual value. Then the mean absolute error, root mean square error and mean absolute percentage error resp. for method m and part i are defined as follows: 29

    MAEmi = (1/Ti) Σ_{t=1}^{Ti} |ŷmit − yit|
    RMSEmi = [ (1/Ti) Σ_{t=1}^{Ti} (ŷmit − yit)² ]^(1/2)
    MAPEmi = 100 × (1/Ti) Σ_{t=1}^{Ti} |(ŷmit − yit)/yit|

Relative versions of the above are defined by ratios to the corresponding error metric for the benchmark forecast method, Bench: 30

    RelMAEmi = MAEmi/MAEBench,i
    RelRMSEmi = RMSEmi/RMSEBench,i
    RelMAPEmi = MAPEmi/MAPEBench,i

Finally, average metrics for a particular method across all parts in the sample are derived using the geometric mean; for example:

    RelMAEm = [ ∏_{i=1}^{N} RelMAEmi ]^(1/N).

29 The MAPE is included in the set of metrics used in this paper since it is widely favored by practitioners of forecasting for supply chain management. It should be noted, however, that it is prone to erratic performance, particularly when actual demands are low—see (Makridakis 1993; Koehler 2001; Hyndman and Koehler 2006; Coleman and Swanson 2007) for a discussion.
30 Metric RelMAEmi is closely related to the mean absolute scaled error of Hyndman and Koehler (2006), and RelRMSEmi conforms to one definition of Theil's U2 statistic (Armstrong and Collopy 1992).
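The definitions above translate directly into code; the function names below are ours:

```python
import math

def mae(f, y):
    """Mean absolute error of forecasts f against actuals y."""
    return sum(abs(a - b) for a, b in zip(f, y)) / len(y)

def rmse(f, y):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, y)) / len(y))

def mape(f, y):
    """Mean absolute percentage error (unstable when actuals are near zero)."""
    return 100.0 * sum(abs((a - b) / b) for a, b in zip(f, y)) / len(y)

def rel_metric(metric, f_method, f_bench, y):
    """Ratio of a method's error to the benchmark's, for one part."""
    return metric(f_method, y) / metric(f_bench, y)

def geometric_mean(values):
    """Average of relative metrics across parts, as in the definitions above."""
    return math.exp(sum(math.log(v) for v in values) / len(values))
```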



[Figure 1 comprises nine demand-series panels; the fitted life-cycle parameters shown are (α, δ) = (3.6, 5.98), (4.1, 33.8), (5.8, 18.8), (2.4, 12.9), (5, 21.6), (3.6, 13.0), (4.4, 18.1), (4.1, 14.4) and (3.0, 14.1).]

Figure 1. Sample part demands (solid lines): Time intervals on the horizontal axes of the graphs are financial planning periods, of roughly one month's duration; Q2P1, for example, is the first planning period of the second quarter of a financial year. Unit shipments have been normalized pro rata to 20 trading days in each period. Dotted curves and values recorded for α and δ delineate part life cycles—they are discussed in detail in Section 2.3.

                     Minimum   1st Quartile   Median     Mean   3rd Quartile   Maximum
Total units              474          1,058    1,951    2,817          3,958     8,916
Length in periods      13.00          17.00    21.00    20.78          24.00     31.00
Mean units/period      23.79          62.15   110.66   136.94         210.08    405.28
Coef. of variation      0.49           0.68     0.87     0.85           0.97      1.27

Table 1. Summary statistics of full sample of 45 part demand series (selected members of which are illustrated in Figure 1).

As might be expected for a diversified IT vendor with approximately $14 billion in annual sales, [1] Sun's manufacturing operations consume a multitude of parts, ranging from semi-populated computer chassis and disk-drive assemblies to power cables and case fasteners. Since these components often have appreciable costs and lead times, [2] short-term forecasts of the quantities to be consumed in manufacturing are used to ensure appropriate levels of supply in advance (either by stocking inventories of parts, or securing "just-in-time" delivery commitments from suppliers). Forecast accuracy can have a substantial bearing on the effectiveness of a supply chain, as de Kok, Janssen, van Doremalen, van Wachem, Clerkx, and Peters (2005) indicate. Surveys of commercial practice including (Dalrymple 1987; Wisner and Stanley 1994; Sanders and Manrodt 1994; Klassen and Flores 2001; Sanders and Manrodt 2003) also attest to the importance of forecasting in supply chain operation. Tens of thousands of different parts are used in Sun's product lines, and around 1,000 of these (the exact number fluctuates constantly) are sufficiently expensive and subject to sufficiently long lead times that they require forecast updates at least every month. Prior to the work described in this paper, the company produced forecasts for these latter parts using a heuristic calculation (detailed later in the paper) based on the product-level sales forecasts produced by its sales and marketing organizations (Yelland 2004). The study described in this paper was initiated with the aim of establishing whether it might be possible to improve upon these heuristic forecasts by using a statistical model.

Figure 1 displays the lifetime demands for a selection of the manufacturing parts at issue. [3] Here demand is represented by unit shipments in a planning period, which are of roughly one month's duration. For the purposes of supply chain planning, the company's fiscal quarters are divided into three such planning periods, of four, four and five weeks; Q2P1, for example, is the first planning period of the second quarter of a financial year. Unit shipments in the figure (and throughout this paper) have been normalized pro rata to 20 trading days in each period. The parts in the figure are taken from a larger sample of some 45 parts that was used to guide the development of the model described in this paper. Characteristics of the series in this full sample are summarized in Table 1. In selecting parts for study, we were careful to ensure that: (1) the per-unit cost of a part (which ranged into thousands of dollars for those in the study) was sufficient to justify the effort involved in developing the model and running it on a routine basis, and (2) unit shipments per planning period (as displayed in the Table) were generally high enough that issues such as demand intermittency (Boylan 2005) and discreteness (McCabe and Martin 2005; Yelland 2009) might reasonably be put aside. These selection criteria notwithstanding, the Figure and the Table illustrate the challenge involved in producing forecasts for manufacturing parts at Sun. Demand for parts is normally much more variable than that of the products to which they belong, since part demand depends on customer configuration choices, supplier sourcing arrangements, technological changes, the availability of substitutes, and so on. (This variability is roughly quantified by the coefficients of variation in Table 1, which are calculated by taking the ratio of the standard deviation of demand in a series to the series mean.) In addition, technological development and changes in manufacturing and procurement generally mean that part life cycles are very short, appreciably shorter than those of the products themselves.

[1] $13.88 billion in the financial year ending June 2008.
[2] The lead time of a component is the time required to make or procure it.
[3] To preserve commercial confidentiality, the data have been mildly disguised.

1.2 Prior Work

The literature exploring operations management in the face of uncertain demand is vast, and dates back to the 1950s and beyond (Arrow, Karlin, and Scarf 1958). By and large, however, researchers have concentrated on investigating policy responses to demand that is generated by a known stochastic process. For example, early works such as (Veinott 1965) assume that demand conforms to a known set of independent probability distributions, while more recent exercises assume a known member of a class of processes, such as ARIMA [4] (Gilbert 2005) or linear/Gaussian state-space (Aviv 2003). Specific prescriptions for using formal quantitative methods to forecast demand for supply chain management are far less numerous. (Harvey and Snyder 1990; Snyder, Koehler, and Ord 2002; Gardner 1990) are amongst the few examples describing generally applicable methods, while (Boylan 2005) and (Shenstone and Hyndman 2005) concentrate on specialized forecasting techniques applicable to items such as spare parts with intermittent demands. Documented instances of statistical forecasting models actually employed in commercial supply chain management are rarer still: While commercially-available software packages for supply chain management include forecasting techniques such as ARIMA, linear regression and spectral-decomposition-based smoothing (Yurkiewicz 2006), both Sanders and Manrodt (1994) and Lapide (2006) note the continued heavy reliance on judgmental and ad-hoc methods in most companies. Our own experience at Sun attested to the difficulties involved in forecasting for operations using statistical models: we found that such approaches were frequently stymied by little or no historical data (which rendered techniques such as ARIMA and spectral smoothing inapplicable) or a lack of suitable predictors (ruling out linear regression, for example).

The quantitative forecasting technique that perhaps most appeals to both academics and practitioners is one of the most venerable: exponential smoothing. For practical applications, exponential smoothing is technically straightforward to implement, requires little or no historical data for calibration, and needs no predictors or regressors. Armstrong and Green (2005) aver that "exponential smoothing is the most popular and cost effective of the statistical extrapolation methods", and Gardner (1990) documents the use of exponential smoothing in operations management. On the theoretical side, as summarized by Gardner (1985, 2006), the technique has enjoyed almost 40 years of continuous development since its invention by Brown (1959), and has recently engendered the development of a new class of structural time series models (Hyndman, Koehler, Ord, and Snyder 2008). Although exponential smoothing appeared a natural choice as a generic approach to the parts forecasting problem, experience with the product-level sales model documented in (Yelland 2004) suggested that a specially-designed statistical model might well yield superior results, and also pointed to the effectiveness of Bayesian techniques for extrapolating short time series like part demand histories.

[4] ARIMA is an abbreviation of "Auto-Regressive Integrated Moving Average"; c.f. (Box, Jenkins, and Reinsel 1994).
Unfortunately, the model in (Yelland 2004) relies on informative Bayesian priors deduced from judgmental forecasts for product sales, and these priors are an unreliable guide to demand for the parts in those products, for the reasons described in the previous section. On the other hand, the use of diffuse or non-informative priors is ruled out by the necessity of producing forecasts early in the life of a part, before sufficient observations accrue to produce proper forecast distributions. Therefore, though the model presented here is also Bayesian, it uses a hierarchical prior (Gelman and Hill 2006) to produce initial parameter estimates from the sales records of established parts. This sort of "forecasting by analogy" echoes the work of Duncan, Gorr, and Szczypula (2001), who observe that Bayesian pooling also helps deal with time-series volatility. Frühwirth-Schnatter and Kaufmann (2008) use hierarchical Bayesian priors for time-series analysis too, though their focus is on clustering rather than forecasting.

2 A Model for Parts Demand at Sun

This section describes the Bayesian model developed for the parts forecasting problem. For reference purposes, notation, quantities and probability distributions used throughout the paper are documented in Appendix A and Appendix B. A summary of the model is provided in Section 2.9. We begin with a few prefatory remarks.


2.1 Motivation

In the conventional forecasting situation examined in the literature, a suitably long time series of observed values, y = (y_1, ..., y_T) say, is presented for extrapolation, and (restricting the discussion to a single-period forecast horizon for simplicity) the task of the forecaster is to predict the next value of the series, y_{T+1}. In the case of the part demand forecasting exercise examined here, however, short life cycles mean that individual parts frequently lack sufficient observed demand values to support reliable extrapolation. In this application, therefore, input to the forecasting process consists not only of previous demands for a particular part, but also of observed demands for other parts, too, even parts which were recently withdrawn from the manufacturing process. Thus the data take the form of a collection of series, y_1, ..., y_N, of potentially differing lengths (some of which may be zero), so that for i = 1, ..., N, y_i = (y_{i1}, ..., y_{iT_i}). The aim is to forecast y_{k,T_k+1} for some chosen part k.

The objective of the model presented in this section is a statistical representation of such a collection of demand series using a set of random quantities. This latter set, which we denote in the abstract by the vector θ, contains not only model parameters in the conventional sense, but also the values of latent variables or processes. [5] The representation in terms of θ is sufficiently detailed that all the elements of the series are conditionally independent given θ, i.e.: [6]

p(y_1, ..., y_N | θ) = ∏_{i=1}^{N} ∏_{t=1}^{T_i} p(y_{it} | θ).    (1)

A Bayesian forecast of y_{k,T_k+1} rests on its posterior predictive distribution. The latter is simply the conditional distribution of y_{k,T_k+1} given the historical demands, p(y_{k,T_k+1} | y_1, ..., y_N). [7] The conditional independence property of the model expressed in equation (1) is pivotal to the derivation of this distribution, since on the assumption that y_{k,T_k+1} is also well-represented by θ, it should be conditionally independent of the historical demands, just as the historical demands were conditionally independent of each other:

p(y_{k,T_k+1}, y_1, ..., y_N | θ) = p(y_{k,T_k+1} | θ) p(y_1, ..., y_N | θ).    (2)

[5] Jackman (2000) highlights the fact that the Bayesian approach actually makes no formal distinction between such quantities.
[6] Since some of the T_i may be zero, we adopt the convention that ∏_{i=1}^{0} • = 1 for any expression •.
[7] For a definitive account of prediction in Bayesian statistics, see (Geisser 1993).


Now it is easy to show [8] that with the provision of a prior distribution for θ, p(θ), the posterior predictive distribution may be expressed as:

p(y_{k,T_k+1} | y_1, ..., y_N) = ∫ p(y_{k,T_k+1} | θ) p(θ | y_1, ..., y_N) dθ    (3)
                              ∝ ∫ p(y_{k,T_k+1} | θ) p(y_1, ..., y_N | θ) p(θ) dθ.    (4)
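Equation (3) licenses a simple two-stage Monte Carlo scheme: draw θ from its posterior, then draw the future observation conditional on that draw. The sketch below is our own illustration (not taken from the paper), using a toy normal model with known observation variance, where the posterior and the posterior predictive are available in closed form for comparison:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: y_t ~ N(theta, sigma2) with prior theta ~ N(mu0, tau0_2).
sigma2, mu0, tau0_2 = 4.0, 0.0, 100.0
y = rng.normal(2.5, np.sqrt(sigma2), size=30)   # observed history

# Conjugate posterior for theta: N(mu_n, tau_n2).
tau_n2 = 1.0 / (1.0 / tau0_2 + len(y) / sigma2)
mu_n = tau_n2 * (mu0 / tau0_2 + y.sum() / sigma2)

# Composite sampling from the posterior predictive, per equation (3):
# draw theta ~ p(theta | y), then y_new ~ p(y_new | theta).
theta_draws = rng.normal(mu_n, np.sqrt(tau_n2), size=100_000)
y_pred = rng.normal(theta_draws, np.sqrt(sigma2))

# The exact posterior predictive here is N(mu_n, tau_n2 + sigma2).
pred_mean_exact = mu_n
pred_var_exact = tau_n2 + sigma2
```

The Monte Carlo mean and variance of `y_pred` agree with the closed-form predictive moments, which is exactly the property the paper exploits with MCMC draws in place of the conjugate posterior.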

Note that the second factor in the integral on the right hand side of equation (3) is the posterior distribution of θ given the observed data, y_1, ..., y_N. As many treatments of Bayesian statistics illustrate (Bernardo and Smith 1994; Gelman et al. 2003, for example), provided that the observed data is sufficiently informative, even if p(θ) is diffuse or non-informative for elements of θ, the posterior distribution will be sharp enough to yield reasonably precise predictions for y_{k,T_k+1} in equation (3). This is an advantage in this application, where very little prior information was available in advance of the model's deployment.

We also note that given a mechanism, such as the Markov chain Monte Carlo (MCMC) simulator described in Section 3, for drawing samples from the posterior distribution of θ, the right hand side of equation (3) shows how one may sample from the posterior predictive distribution by drawing a value θ̃ from the posterior distribution p(θ | y_1, ..., y_N) and then drawing one from the conditional distribution p(y_{k,T_k+1} | θ̃). The resulting sample may then be used to forecast y_{k,T_k+1}.

Finally, we should point out that in order to achieve reasonable fidelity to the data, we have adopted a so-called hierarchical or multilevel prior (Gelman 2006; Gelman and Hill 2006) in the model. This may be thought of abstractly as dividing θ into a number of collections of sub-vectors: a collection ζ = (ζ_1, ..., ζ_N) of parameter vectors associated with parts, a collection φ = (φ_1, ..., φ_J) of parameter vectors associated with the products to which the parts belong, and finally a single vector ϑ of common "population-level" parameters.
The prior for θ as a whole is expressed by defining the priors for the part-level parameters in terms of the values of the product- and population-level parameters, and the product-level parameters in terms of the population-level parameters; the population-level parameters receive their own free-standing priors. This means that:

p(θ) = [ ∏_{i=1}^{N} p(ζ_i | φ, ϑ) ] [ ∏_{j=1}^{J} p(φ_j | ϑ) ] p(ϑ).    (5)

As a general rule, "pooling" model information by using common parameters at the population level makes for convenient expression, speedier estimation and more stable predictions; the product- and part-level parameters are vital, however, to capture the heterogeneity exhibited by the data (see Gelman and Hill 2006, for a general discussion).

[8] See (Gelman, Carlin, Stern, and Rubin 2003, p. 9) for details.


2.2 Constituents of Part Demand

As a first step in making concrete the general discussion above, we need to specify the conditional distribution p(y_it | θ) of unit demand for part i in period t. Rather than defining the distribution directly to begin with (an explicit definition is given in Section 2.9), it is more helpful for the purposes of exposition to consider part demand as determined by the combination of four random quantities:

y_it = (λ_it + ε_it + x_it) γ_i.    (6)

The right hand side of this equation is the product of two factors. The first, which for convenience during the discussion we refer to as the shape process associated with part i, is a discrete-time stochastic process that captures life cycle effects, together with temporally uncorrelated and autocorrelated errors, as they influence demand throughout the part's life cycle; these are discussed in further detail below. The second factor in the product is a part-specific quantity, γ_i, which scales the part's shape process to match its period demands. This part-specific scaling allows the shape processes of different parts to be parameterized similarly, even though their demands might differ in overall magnitude, and, as we will show, this in turn permits the use of common parameters to pool forecast information.

2.3 Life Cycle Curve

Like the forecasting model in (Yelland 2004), the model in this paper incorporates a stylized representation of a product's life cycle, so as to capture systematic changes in demand (illustrated in Figure 1) as a part is introduced to and withdrawn from the manufacturing process. The representation used in this model is derived from the Weibull distribution, following the work of Moe and Fader (2002), who use it in the analysis of new product adoption. [9] Thus the quantity λ_it in equation (6), which traces the evolution of demand over the life cycle of a part, is determined by the difference between the values of a suitably parameterized Weibull cumulative distribution function (CDF) at t and t − 1:

λ_it = W(t | α_i, δ_i) − W(t − 1 | α_i, δ_i).    (7)

In their work, Moe and Fader use the conventional parameterization of the Weibull curve, according to which the value of the Weibull CDF at t is equal to 1 − e^{−(t/k)^η}, where η and k are (positive) parameters of the distribution. To help the convergence of the MCMC simulator described in Section 3, and as an aid to interpretability, we actually use an alternate parameterization of the Weibull in equation (7), indexed by α_i and δ_i, which are respectively the 20th percentile of the distribution and the difference between its 95th and 20th percentiles. [10] A little algebraic manipulation yields conventional Weibull parameters η_i and k_i corresponding to α_i and δ_i, so that:

W(t | α_i, δ_i) = 1 − e^{−(t/k_i)^{η_i}},  where η_i = 2.6 / (log(α_i + δ_i) − log(α_i)),  k_i = α_i / 0.22^{1/η_i}.    (8)

[9] Moe and Fader draw in turn on the body of research into new product diffusion modeling summarized by Mahajan, Muller, and Wind (2000).
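The mapping in equation (8) can be checked numerically: with η_i and k_i derived from (α_i, δ_i), the Weibull CDF should place its 20th percentile at α_i and its 95th percentile at α_i + δ_i. (The constants 2.6 and 0.22 appear to be rounded values of log(log 0.05 / log 0.8) ≈ 2.597 and −log 0.8 ≈ 0.223, so agreement is to roughly two decimal places.) A small sketch of the check, ours rather than the paper's, using the parameter values α = 3.6, δ = 5.98 quoted in a Figure 1 panel:

```python
import math

def weibull_params(alpha, delta):
    """Map the (20th percentile, 95th-minus-20th-percentile) parameterization
    of equation (8) to the conventional Weibull parameters (eta, k)."""
    eta = 2.6 / (math.log(alpha + delta) - math.log(alpha))
    k = alpha / 0.22 ** (1.0 / eta)
    return eta, k

def weibull_cdf(t, alpha, delta):
    """W(t | alpha, delta), per equation (8)."""
    eta, k = weibull_params(alpha, delta)
    return 1.0 - math.exp(-((t / k) ** eta))

alpha, delta = 3.6, 5.98
p20 = weibull_cdf(alpha, alpha, delta)           # should be close to 0.20
p95 = weibull_cdf(alpha + delta, alpha, delta)   # should be close to 0.95
```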

Moe and Fader's (2002) use of a Weibull curve to describe the time to first purchase of a new product has theoretical appeal, given the Weibull distribution's origins in the analysis of events that occur after a period of random duration. The use of the Weibull in this application is rather more pragmatic: as Moe and Fader observe, it offers an appealing combination of parsimony and flexibility. [11] An informal measure of how well the Weibull model captures the trend in Sun's part demands with only two parameters can be gauged from Figure 1, where (appropriately scaled) Weibull curves have been fitted to the sample series.

Completing the specification of λ_it given by equations (7) and (8) requires that we provide prior distributions for the part-specific parameters α_i and δ_i. The priors for both of these parameters are hierarchical, incorporating information from parts that constitute the same product, and from parts in general. Treating the case of α_i in detail (the treatment of δ_i is analogous): α_i, which is necessarily positive, is drawn from a normal distribution, truncated on the left at 0. [12] The scale parameter σ_α of this truncated normal distribution is common to all parts, but the location parameter, a_prod(i), is shared only with other parts for the same product (we use the expression "prod(i)" to denote the index of the product to which part i belongs). At the next level of the hierarchy, for all products j, a_j is drawn from a normal distribution with mean and variance common to all products. Finally, the mean µ_a of this latter distribution has a non-informative prior. In symbols:

α_i ∼ N_{[0,∞)}(a_prod(i), σ_α²),   a_j ∼ N(µ_a, σ_a²),   p(µ_a) ∝ 1.    (9)

Priors for the scale parameters σ_α and σ_a are discussed in Section 2.7.

[10] We use the difference between the 20th and 95th percentiles rather than the 95th percentile itself because α_i and δ_i might reasonably be considered a priori independent, making for easier specification of the model prior.
[11] It could be said that our outlook conforms with the technological approach to modeling of Bernardo and Smith (1994, p. 238), in that we are concerned less with the "'true' mechanisms of the phenomenon under study . . . [than] simply with providing a reliable basis for practical action in predicting . . . the phenomena of interest". In fact an ab initio argument for the Weibull model might be made by postulating some form of "adoption" process for the parts themselves, along the lines of that presented by Norton and Bass (1987), for example. However, without detailed information about end-user behavior to support it (very difficult to obtain in this context), such a construction would amount to little more than "armchair theorizing".
[12] Strictly speaking, truncation on the left should be at a point slightly greater than 0, but the technical elision is of no practical consequence.

2.4 Uncorrelated Errors

The second constituent, ε_it, of the shape process in equation (6) represents deviations in demand from the value specified by λ_it that are uncorrelated over time (for convenience, we refer to these deviations somewhat loosely as "uncorrelated errors"). In the model, ε_it is drawn from a normal distribution with zero mean and a standard deviation specific to part i and period t:

ε_it ∼ N(0, ς_it²).    (10)

Here, ς_it changes to accommodate occasional outliers in part demand, in keeping with the binary selection model described by Congdon (2003, sec. 3.6.1). Using the latter construction, each observation of demand y_it is associated with a latent binary variable v_it ∈ {0, 1}, which identifies y_it as an outlier iff v_it = 1. Observations identified as outliers are drawn from a distribution with the same zero mean, but with a standard deviation four times [13] as large as the standard deviation for non-outlying observations, σ_ε. Thus:

ς_it = (1 + 3 v_it) σ_ε.    (11)

Since the occurrence of an outlier is ipso facto a rare event, the prior for the indicator v_it is a Bernoulli distribution such that the probability that v_it = 1 is 5%. The prior for σ_ε will be detailed in Section 2.7; note that since ς_it is independent of the scale of part i's demand, the same standard deviation parameter σ_ε may be used for all parts. Assuming that ε_it is normally-distributed is a substantial technical convenience, and follows the precedent set by Srinivasan and Mason (1986), who use normal errors in a similar context. Strictly speaking, however, it gives rise to forecast distributions that lack coherence in the sense of McCabe and Martin (2005), since they give support to negative demand values. Fortunately, we have found that in practice, point forecasts (the focus of interest for Sun's supply chain) produced using the mean of the forecast distributions from the model (see Section 4 for details) are invariably positive.
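Marginalizing over v_it, equations (10) and (11) make ε_it a two-component scale mixture: with probability 0.95 its standard deviation is σ_ε and with probability 0.05 it is 4σ_ε, giving a marginal standard deviation of σ_ε √(0.95 + 0.05·16) ≈ 1.32 σ_ε. A quick simulation confirming this (our illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_eps = 1.0
n = 200_000

# Binary outlier indicators v_it with P(v = 1) = 0.05.
v = rng.random(n) < 0.05
# Per equation (11): sigma_eps normally, 4 * sigma_eps for outliers.
scale = (1.0 + 3.0 * v) * sigma_eps
eps = rng.normal(0.0, scale)                      # the "uncorrelated errors"

marginal_sd = eps.std()
theoretical_sd = sigma_eps * np.sqrt(0.95 + 0.05 * 16.0)   # ~1.3229
```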

2.5 Autocorrelated Errors

The quantity x_it in equation (6) represents the value in period t of a latent autoregressive process associated with part i, intended to capture random demand variations that are correlated across time. West and Harrison (1997, p. 300) suggest the use of a time series process of this form as a "catch-all noise model used as a purely empirical representation of residual variation," and we have found that including a latent autoregressive process significantly improves short-term forecasts made with the model. As the name suggests, the value of the autoregressive process in period t is a randomly-perturbed linear combination of the values in the preceding two periods:

x_it = ϕ_i1 x_{i,t−1} + ϕ_i2 x_{i,t−2} + ξ_it,  where ξ_it ∼ N(0, σ_ξ²).    (12)

[13] A scale inflation factor of 3 or 4 is suggested for general use in characterizing discrepant observations by West and Harrison (1997, p. 400).

For identifiability, the terms ξ_it and ε_it are assumed to be independent (see West and Harrison 1997, sec. 2.1), and as a precaution against over-fitting, the standard deviation σ_ξ in equation (12) is fixed at a constant multiple of the standard deviation, σ_ε, used in equation (11) to characterize uncorrelated errors. This multiple constitutes a tuning parameter of the model; we found that setting σ_ξ = 0.8 σ_ε yields good results (as before, the scale-free property of a part's shape process allows the same parameter to be used across parts). Starting values x_i0 and x_{i,−1} of the autoregression are provided informative priors centered around zero: a zero expectation is appropriate for an "error" term, and it seems reasonable a priori to assert that there is little "residual variation" (to use West and Harrison's terminology) at the outset of a part's life. Again, since the shape process of each part is scale-free, the same priors may be used for all parts:

(x_i0, x_{i,−1})ᵀ ∼ N(µ_x0, Σ_x0),  µ_x0 = (0, 0)ᵀ,  Σ_x0 = diag(2, 2).    (13)

The regression coefficients in equation (12) are specified using a hierarchical prior: ϕ_i1 and ϕ_i2 are drawn directly from a multivariate normal distribution that is common across all parts, and the mean and variance of this common distribution are given a non-informative multivariate Jeffreys prior (Sun and Berger 2006). Thus:

(ϕ_i1, ϕ_i2)ᵀ ∼ N(µ_ϕ, Σ_ϕ),  p(µ_ϕ, Σ_ϕ) ∝ |Σ_ϕ|^{−2}.    (14)

2.6 Scale Factor

The prior for the (necessarily positive) scale factor γ_i of part i in equation (6) has the same hierarchical structure as that provided for parameters α_i and δ_i, namely a left-truncated normal distribution whose location parameter is shared with other parts of the same product, governed itself by a normal distribution with non-informative priors (again, see Section 2.7 for the prior of σ_g):

γ_i ∼ N_{[0,∞)}(g_prod(i), σ_γ²),  g_j ∼ N(µ_g, σ_g²),  p(µ_g) ∝ 1.    (15)

Even in combination with the likelihood induced by equation (6), however, this prior determines γ_i with insufficient precision to produce good forecasts: with only a few demands observed for part i, the posterior distribution for γ_i gives undue support to very large values. To overcome this problem, we use "superannuated" parts, whose entire life cycles have been observed, to place additional constraints on the distribution parameters in (15), and thus, indirectly, on the scale factors of current parts, whose life cycles are yet to complete. To see how, consider that conditional on γ_i, the expectation (in period 0) of the total demand for part i is equal to γ_i, since from equation (6):

E[∑_{t=1}^{∞} y_it] = γ_i E[∑_{t=1}^{∞} (λ_it + ε_it + x_it)]
  = γ_i { E[∑_{t=1}^{∞} λ_it] + ∑_{t=1}^{∞} E[ε_it] + ∑_{t=1}^{∞} E[x_it] }
  = γ_i E[∑_{t=1}^{∞} λ_it]    (16)
  = γ_i E[W(1 | α_i, δ_i) − W(0 | α_i, δ_i) + W(2 | α_i, δ_i) − W(1 | α_i, δ_i) + ...]
  = γ_i E[W(∞ | α_i, δ_i) − W(0 | α_i, δ_i)]
  = γ_i (1 − 0),    (17)

where in equation (16) we use the fact that the expectations in period 0 of both error terms are 0, and equation (17) follows from the definition of the Weibull CDF. Therefore, if the entire life cycle of part i has been observed, it is plausible to assert that the observed total demand for that part, s_i, is approximately equal to γ_i. Operationally, we have found that asserting s_i ∼ N(γ_i, [0.2 γ_i]²) for superannuated parts yields good results.
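The telescoping argument in equations (16)-(17) is easy to verify numerically: the life-cycle increments λ_it sum to W(T | α_i, δ_i) − W(0 | α_i, δ_i), which approaches 1 as T grows, so expected lifetime demand collapses to γ_i. A sketch, with illustrative parameter values of our own choosing:

```python
import math

def weibull_cdf(t, alpha, delta):
    """W(t | alpha, delta), per equation (8)."""
    eta = 2.6 / (math.log(alpha + delta) - math.log(alpha))
    k = alpha / 0.22 ** (1.0 / eta)
    return 1.0 - math.exp(-((t / k) ** eta))

alpha, delta, gamma = 3.6, 5.98, 1000.0

# Life-cycle increments lambda_it of equation (7) over a long horizon.
T = 200
lam = [weibull_cdf(t, alpha, delta) - weibull_cdf(t - 1, alpha, delta)
       for t in range(1, T + 1)]

# The sum telescopes to W(T) - W(0); for large T this is essentially 1,
# so the expected total demand is essentially gamma, as in (16)-(17).
total_share = sum(lam)
expected_total_demand = gamma * total_share
```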

2.7 Priors for Scale Parameters

We use non-informative priors for scale parameters of normal and truncated normal distributions throughout the model. Gelman's (2006) paper demonstrates that producing a truly "non-informative" prior for variance parameters is a delicate business, particularly in hierarchical models such as this (where, for example, the popular "weakly informative" inverse-gamma prior of Spiegelhalter, Thomas, Best, Gilks, and Lunn (2003) can lead to degenerate posterior distributions for variances of group parameters). Here, we use a uniform density on the positive half-line as the prior for the standard deviations in (9), which, as Gelman (2006) indicates, is formally equivalent to an inverse-χ² density with −1 degrees of freedom for the corresponding variances. Explicitly, therefore, for σ◦ ∈ {σ_α, σ_a, σ_δ, σ_d, σ_ε, σ_γ, σ_g}, we have p(σ◦) ∝ I(σ◦ > 0), where the indicator function I(·) is equal to 1 if its argument is true, and 0 otherwise.

2.8 Distribution of Part Demand

To complete the specification of the model, we can derive explicitly the conditional distribution of part demand introduced in Section 2.1, which we passed over in Section 2.2. First:

y_it = γ_i(λ_it + x_it) + γ_i ε_it    from (6)

Furthermore:

ε_it ∼ N(0, ς_it²)    from (10)
⇒ γ_i ε_it ∼ N(0, [γ_i ς_it]²)

Therefore, conditional on the model parameters:

y_it ∼ N(γ_i(λ_it + x_it), [γ_i ς_it]²).

Part demand:
  y_it = (λ_it + ε_it + x_it) γ_i  ⇒  y_it ∼ N(γ_i(λ_it + x_it), [γ_i ς_it]²)
Life cycle curve:
  λ_it = W(t | α_i, δ_i) − W(t − 1 | α_i, δ_i)
  α_i ∼ N_{[0,∞)}(a_prod(i), σ_α²),  a_j ∼ N(µ_a, σ_a²)
  δ_i ∼ N_{[0,∞)}(d_prod(i), σ_δ²),  d_j ∼ N(µ_d, σ_d²)
Uncorrelated errors:
  ε_it ∼ N(0, ς_it²),  ς_it = (1 + 3 v_it) σ_ε,  v_it ∼ Bern(0.05)
Autocorrelated errors:
  x_it = ϕ_i1 x_{i,t−1} + ϕ_i2 x_{i,t−2} + ξ_it,  ξ_it ∼ N(0, σ_ξ²),  σ_ξ = 0.8 σ_ε
  (ϕ_i1, ϕ_i2)ᵀ ∼ N(µ_ϕ, Σ_ϕ),  p(µ_ϕ, Σ_ϕ) ∝ |Σ_ϕ|^{−2}
  (x_i0, x_{i,−1})ᵀ ∼ N(µ_x0, Σ_x0),  µ_x0 = (0, 0)ᵀ,  Σ_x0 = diag(2, 2)
Scale factor:
  γ_i ∼ N_{[0,∞)}(g_prod(i), σ_γ²),  g_j ∼ N(µ_g, σ_g²),  s_i ∼ N(γ_i, [0.2 γ_i]²)

Figure 2. Model summary: Unless otherwise stated, location parameters of the form µ◦ and scale parameters σ◦ have non-informative priors p(µ◦) ∝ 1 and p(σ◦) ∝ I(σ◦ > 0), respectively. Indexes i, j and t range over parts, products and periods, respectively.
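The data-level portion of the model collected in Figure 2 can be simulated directly, which is a useful sanity check and a way to generate synthetic demand series. The sketch below (ours, with illustrative rather than calibrated parameter values) draws one part's demand history given fixed part-level parameters:

```python
import math
import numpy as np

rng = np.random.default_rng(42)

def weibull_cdf(t, alpha, delta):
    """W(t | alpha, delta), per equation (8)."""
    eta = 2.6 / (math.log(alpha + delta) - math.log(alpha))
    k = alpha / 0.22 ** (1.0 / eta)
    return 1.0 - math.exp(-((t / k) ** eta))

def simulate_part(alpha, delta, gamma, sigma_eps, phi1, phi2, T):
    """Draw one demand series y_1..y_T from the data model of Figure 2."""
    sigma_xi = 0.8 * sigma_eps                     # tuning choice from Section 2.5
    # Starting values (x_0, x_-1) drawn from their N(0, diag(2, 2)) prior.
    x_prev2, x_prev1 = rng.normal(0.0, math.sqrt(2.0), size=2)
    y = []
    for t in range(1, T + 1):
        lam = weibull_cdf(t, alpha, delta) - weibull_cdf(t - 1, alpha, delta)
        v = rng.random() < 0.05                    # outlier indicator, Bern(0.05)
        eps = rng.normal(0.0, (1.0 + 3.0 * v) * sigma_eps)
        x = phi1 * x_prev1 + phi2 * x_prev2 + rng.normal(0.0, sigma_xi)
        x_prev2, x_prev1 = x_prev1, x
        y.append((lam + eps + x) * gamma)          # equation (6)
    return np.array(y)

demand = simulate_part(alpha=3.6, delta=5.98, gamma=1000.0,
                       sigma_eps=0.02, phi1=0.4, phi2=0.2, T=24)
```

Note how the shape process (λ + ε + x) is scale-free: only the final multiplication by γ distinguishes a high-volume part from a low-volume one, which is what lets the model share the remaining parameters across parts.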

[Figure 3 (a directed graph) appears here: nodes for the unobserved quantities µ_a, σ_a, a_prod(i), α_i, σ_α; µ_d, σ_d, d_prod(i), δ_i, σ_δ; µ_g, σ_g, g_prod(i), γ_i, σ_γ; µ_ϕ, Σ_ϕ, ϕ_i; σ_ε, v_it, ε_it; x_{i,t−1}, x_{i,t−2}, x_it; and λ_it; and for the observed quantities y_it and s_i.]

Figure 3. Conditional independence relationships of the model in Figure 2: Indexes i, j and t range over parts, products and periods, respectively.

2.9 Model Summary

Figure 2 summarizes the model, collecting together the definitions made in this section. The model's hierarchical structure is illustrated by Figure 3, which captures dependencies in the form of a directed graph of the sort described by Rossi, Allenby, and McCulloch (2005, pp. 67-81), for example. Quantities in the model are represented as nodes in the graph, with a node's parents comprising those quantities bearing directly on the definition of the quantity represented by that node. [14] Rossi et al. (ibid.) demonstrate how such a directed graph may be used to derive an explicit expression for the prior distribution along the lines set out in equation (5). We refrain from writing out the prior in full here, as it would contribute much clutter but little additional information, and estimation of the model depends not on the explicit expression of the prior distribution, but on the conditional distributions given in the MCMC sampler in the next section.

[14] Conditional independence assertions involving ranges of index variables, written out schematically in equation (5), are left implicit in the diagram. Buntine's (1994) plates provide a graphical mechanism for expressing such relationships explicitly, but the complexity of this model means that the use of plates here tends to obscure matters, particularly given the presence of the autoregressive process x_i.

3 Estimation

This section discusses the simulation procedure used to approximate the posterior distributions of the model parameters, and summarizes the approximate posteriors produced with the full sample of part demands described in Section 1.

3.1 Gibbs Sampler

As is commonly the case in modern applied Bayesian statistics, estimation of posterior distributions for the model described in the previous section is carried out using a Markov chain Monte Carlo simulator, specifically a Gibbs sampler. Schematic descriptions of Gibbs sampling now abound in the literature; see (Gilks, Richardson, and Spiegelhalter 1996, chp. 1), for example. Briefly, beginning with an arbitrary configuration of the random variables in the model, such a simulator constructs a Markov chain whose states converge to a dependent sample from the joint posterior of those random variables. Each transition in this Markov chain involves drawing a new value of one of the variables from its distribution conditional on the current values of the other variables and the observed data. The individual transitions of this particular sampler are described below. Many of the steps rely on standard results concerning conjugate updating in Bayesian analysis, which may be found in texts such as (Gelman et al. 2003) or (Bernardo and Smith 1994). Where such closed-form updates are not available, we resort to Metropolis-Hastings sampling (also discussed by Gilks et al.); the latter relies on proposed values that are generated using Geweke and Tanizaki's (2003) Taylored chain procedure, details of which are provided in Appendix C.

In the following, each step is introduced by the conditional distribution from which a sample is to be drawn. Variables of which the sampled quantity is conditionally independent are omitted from the conditioning set. We use the abbreviations v_i = (v_{i1}, ..., v_{iT_i}), x_i = (x_{i1}, ..., x_{iT_i}) and y_i = (y_{i1}, ..., y_{iT_i}). Also, for each product j, let parts(j) = {i | prod(i) = j} be the set of associated parts. In the interests of brevity, we specify draws for a_j, σ_α, µ_a and σ_a only; samples for d_j, σ_δ, µ_d and σ_d, and g_j, σ_γ, µ_g and σ_g are generated in an analogous fashion.
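As a concrete (if much simpler) illustration of the scheme, the fragment below runs a two-block Gibbs sampler for the mean and precision of a normal sample, alternately drawing each parameter from its full conditional. The paper's sampler differs only in having many more blocks, some updated by Metropolis-Hastings rather than in closed form; this toy example is ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(7)

# Data from a toy model y_t ~ N(mu, 1/tau), with mu = 5 and tau = 1.
y = rng.normal(5.0, 1.0, size=100)
n = len(y)

# Priors: mu ~ N(0, 100); tau ~ Gamma(1, 1) (shape, rate).
mu, tau = 0.0, 1.0
a0, b0 = 1.0, 1.0
draws = []
for it in range(3000):
    # Full conditional for mu given tau: conjugate normal update.
    s2 = 1.0 / (1.0 / 100.0 + n * tau)
    m = s2 * tau * y.sum()
    mu = rng.normal(m, np.sqrt(s2))
    # Full conditional for tau given mu: conjugate gamma update.
    a = a0 + 0.5 * n
    b = b0 + 0.5 * np.sum((y - mu) ** 2)
    tau = rng.gamma(a, 1.0 / b)        # numpy's gamma takes a scale parameter
    if it >= 500:                      # discard burn-in
        draws.append(mu)

posterior_mean = float(np.mean(draws))
```

After burn-in the retained draws form a (dependent) sample from the joint posterior, and their average recovers the posterior mean of mu, which here is very close to the sample mean of the data.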
γ_i | y_i, σ_ε, α_i, δ_i, x_i, v_i, g_prod(i), σ_γ, s_i: The kernel of the full conditional distribution is given by the expression:

[ ∏_{t=1}^{T_i} N(y_it | γ_i(λ_it + x_it), [γ_i(1 + 3 v_it) σ_ε]²) ] × N_{[0,∞)}(γ_i | g_prod(i), σ_γ²) × ϕ_i,

where ϕ_i = N(s_i | γ_i, [0.2 γ_i]²) if the entire life cycle of part i has been observed, and ϕ_i = 1 otherwise.

In either case, sampling is carried out using the Geweke-Tanizaki Taylored chain to generate a proposal in a Metropolis-Hastings step.

α_i | y_i, γ_i, σ_ε, δ_i, x_i, v_i, a_prod(i), σ_α: The full conditional is proportional to the expression:

[ ∏_{t=1}^{T_i} N(y_it | γ_i(λ_it + x_it), [γ_i(1 + 3 v_it) σ_ε]²) ] × N_{[0,∞)}(α_i | a_prod(i), σ_α²).

This is sampled using a Taylored chain proposal in a Metropolis-Hastings step.

δ_i | y_i, γ_i, σ_ε, α_i, x_i, v_i, d_prod(i), σ_δ: As above, but this time sampling from:

[ ∏_{t=1}^{T_i} N(y_it | γ_i(λ_it + x_it), [γ_i(1 + 3 v_it) σ_ε]²) ] × N_{[0,∞)}(δ_i | d_prod(i), σ_δ²).

vit yit , γi , σε , αi , δi , xit Let: yit = γi (λit + xit ), 2

p0 = (1 − 0.05) × N(yit |yit ,[γi σε ] ), p1 = 0.05 × N(yit |yit ,[4γi σε ]2 ). Sample vit from the Bernoulli distribution with success probability p1 /( p0 + p1 ). xi yi , γi , σε , αi , δi , vi , µ x0 , Σ x0 Following West and Harrison (1997, example 9.6), xi is given by the state vector of the dynamic linear model, or DLM, specified by ( Ft , Gt , Vt , Wt ), for t ∈ 1, . . . , Ti , where: 1 Ft = , 0

Gt =

ϕi1 ϕi2 1

Vt = [(1 + 3vit )σε ]2 ,

,

0

Wt =

σξ2 0

.

0 0

Artificial “observations” for this DLM are equal to yit /γi − λit , for t ∈ 1, . . . , Ti . First and second moments of the multivariate normal prior for the initial state configuration (m0 and C0 in West and Harrison’s formulation) are µ x0 and Σ x0 , respectively.
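The v_it draw above is easy to state in code. The sketch below uses invented numerical values (the part index, γ_i, σ_ε and the fitted mean are illustrative stand-ins, not values from the paper):

```python
import math
import random

def norm_pdf(x, mu, sd):
    """Density of N(mu, sd^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def draw_v(y_it, fitted, gamma_i, sigma_eps, rng):
    """Sample the outlier indicator v_it: with prior weight 0.05 the
    observation is attributed to the inflated-variance component,
    whose standard deviation is four times the usual one."""
    p0 = 0.95 * norm_pdf(y_it, fitted, gamma_i * sigma_eps)        # v_it = 0
    p1 = 0.05 * norm_pdf(y_it, fitted, 4.0 * gamma_i * sigma_eps)  # v_it = 1
    return 1 if rng.random() < p1 / (p0 + p1) else 0

rng = random.Random(42)
# An observation far from its fitted mean is almost surely flagged;
# one close to the mean almost surely is not.
v_far = draw_v(y_it=500.0, fitted=100.0, gamma_i=1.0, sigma_eps=30.0, rng=rng)
v_near = draw_v(y_it=105.0, fitted=100.0, gamma_i=1.0, sigma_eps=30.0, rng=rng)
```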

16

Procedures for sampling the state of such a DLM (generally under the moniker forward filtering/backwards sampling algorithms) are described by Frühwirth-Schnatter (1994), West and Harrison (1997) and Durbin and Koopman (2001), amongst others.

ϕ_i1, ϕ_i2 | x_i, σ_ξ, µ_ϕ, Σ_ϕ: Draw from the posterior distribution of the coefficients in the linear regression x_it ∼ N(ϕ_i1 x_i,t−1 + ϕ_i2 x_i,t−2, σ_ξ²), given a prior (ϕ_i1, ϕ_i2)^⊤ ∼ N(µ_ϕ, Σ_ϕ)—see e.g. (Gelman et al. 2003, chp. 8).

µ_ϕ, Σ_ϕ | ϕ_1, ..., ϕ_N: Conjugate updating for parameters of a multivariate normal distribution.

a_j | {α_k | k ∈ parts(j)}, σ_α: Sampling is carried out using a device due to Griffiths (2004): Specifically, for k ∈ parts(j), let:

\[
\tilde{\alpha}_k = a_j + \sigma_\alpha\, \Phi^{-1}\!\left( \frac{\Phi\!\left(\dfrac{\alpha_k - a_j}{\sigma_\alpha}\right) - \Phi\!\left(\dfrac{-a_j}{\sigma_\alpha}\right)}{1 - \Phi\!\left(\dfrac{-a_j}{\sigma_\alpha}\right)} \right), \tag{18}
\]

where Φ(·) denotes the standard normal cumulative distribution function. Then as Griffiths demonstrates, supposing that α̃_k ∼ N(a_j, σ_α²) and drawing from the conditional distribution a_j | {α̃_k | k ∈ parts(j)}, σ_α (a straightforward application of semi-conjugate updating) is equivalent to drawing from a_j | {α_k | k ∈ parts(j)}, σ_α given that α_k ∼ N_[0,∞)(a_j, σ_α²).

σ_α | α_1, ..., α_N, a_1, ..., a_J: Again using Griffiths’s device, draw from σ_α | α̃_1, ..., α̃_N, given that α̃_i − a_prod(i) ∼ N(0, σ_α²), where α̃_i is defined in equation (18).

µ_a, σ_a | a_1, ..., a_J: Two-step semi-conjugate updating for parameters of a normal distribution, drawing µ_a first from µ_a | a_1, ..., a_J, σ_a, and then σ_a from σ_a | a_1, ..., a_J, µ_a.

σ_ε | y_1, α_1, δ_1, γ_1, x_1, v_1, ..., y_N, α_N, δ_N, γ_N, x_N, v_N: Let r_it = (1 + 3v_it)^{−1}(y_it/γ_i − λ_it − x_it), for i = 1, ..., N, t = 1, ..., T_i. Then r_it ∼ N(0, σ_ε²), and conjugate updating applies.
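The transformation in equation (18) is a probability integral transform, and can be sketched in a few lines using Python's standard-library NormalDist for Φ and Φ⁻¹ (the numerical values in the usage example are illustrative only):

```python
from statistics import NormalDist

_std = NormalDist()  # standard normal: cdf is Phi, inv_cdf is Phi^-1

def griffiths_transform(alpha_k, a_j, sigma_alpha):
    """Map a draw alpha_k from N_[0,inf)(a_j, sigma_alpha^2) to
    alpha~_k, marginally N(a_j, sigma_alpha^2), per equation (18)."""
    lo = _std.cdf(-a_j / sigma_alpha)               # Phi(-a_j / sigma_alpha)
    u = (_std.cdf((alpha_k - a_j) / sigma_alpha) - lo) / (1.0 - lo)
    return a_j + sigma_alpha * _std.inv_cdf(u)

def griffiths_inverse(alpha_tilde, a_j, sigma_alpha):
    """Inverse map: recover the truncated-normal draw from alpha~_k."""
    lo = _std.cdf(-a_j / sigma_alpha)
    u = _std.cdf((alpha_tilde - a_j) / sigma_alpha)
    return a_j + sigma_alpha * _std.inv_cdf(lo + u * (1.0 - lo))
```

The two maps are exact inverses, which is easy to verify numerically; the forward map feeds the untransformed draws into ordinary semi-conjugate normal updating, as described above.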


[Figure 4 appears here. Left panels ("Posterior distribution"): condensed box-and-whisker summaries, with means and standard deviations, of α, δ, γ, λ.1 and y.5 for parts pt.8, pt.10, pt.29 and pt.35; of a for products pd.3, pd.4, pd.5 and pd.8; and of σ_a, σ_alpha and σ_epsilon. Right panels ("Convergence"): Geweke convergence statistics with p-values.]

Figure 4. Posterior distributions and convergence diagnostics for model fit to full sample of demand series: On the left are interquartile ranges and 1.5×interquartile ranges of quantities associated with selected parts (pt.-), products (pd.-) and hyperparameters, together with the mean and standard deviation of the corresponding distribution. On the right are values of Geweke’s convergence statistic; all but three of the values lie between the 5th and 95th percentiles of the standard normal distribution (statistic values and p-values appear on the far right), so convergence is reasonably assured.

3.2 Posterior Estimates and Diagnostics

Figure 4 (inspired by similar figures on e.g. p. 351 of Gelman and Hill 2006) illustrates the results of estimating the model with the full sample of demand histories described in Section 1.1. Here the Gibbs sampler was run in a single chain for 4,000 iterations, with samples from the first 1,200 discarded; no thinning of the remaining samples was performed. On the left of the Figure are displayed posterior distributions for the quantities associated with a random selection of parts ("pt." 8, 10, 29 and 35) and randomly selected products ("pd." 3, 4, 5 and 8), as well as population-level parameters. Parameters α_i, δ_i, γ_i and ϕ_i1 are summarized for the selected parts, and the fitted value of y_it in the fifth period of each part’s life cycle 15 is displayed as "y.5". Also shown are the product-level location parameters a_j for the selected products, as well as scale parameters σ_a, σ_α and σ_ε (labeled "a," "alpha" and "epsilon" in the Figure). Each posterior distribution is summarized graphically by a condensed box-and-whisker plot, 16 with the distribution’s mean and (parenthesized) standard deviation given numerically.

The right-hand side of the Figure plots Geweke’s (1992) convergence diagnostic for each of the quantities in question. Derived from a comparison of the first and last segments of the Markov chain associated with a quantity, Geweke’s statistic z has an asymptotically standard normal distribution if the chain is stationary (i.e. convergence has occurred). On the diagram, values of z are plotted together with the 5th and 95th percentiles of the standard normal distribution. The plots indicate that convergence has been achieved—note that with 27 quantities displayed, we would expect around 3 values of z to fall outside of the percentile bounds even with convergence.
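A simplified version of Geweke's diagnostic can be written directly from its definition. The sketch below compares the means of the first 10% and last 50% of a chain, but uses naive (i.i.d.) standard errors rather than the spectral-density estimates used in practice, so it is only a rough stand-in:

```python
import random

def geweke_z(chain, first=0.1, last=0.5):
    """Crude Geweke convergence statistic: difference of early- and
    late-segment means, scaled by the combined (naive) standard error."""
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[int((1.0 - last) * n):]

    def mean_var(seg):
        m = sum(seg) / len(seg)
        v = sum((x - m) ** 2 for x in seg) / (len(seg) - 1)
        return m, v

    ma, va = mean_var(a)
    mb, vb = mean_var(b)
    return (ma - mb) / (va / len(a) + vb / len(b)) ** 0.5

# For a stationary chain, z should look like a standard normal draw;
# a drifting chain produces a very large |z|.
rng = random.Random(0)
stationary = [rng.gauss(0.0, 1.0) for _ in range(2000)]
z = geweke_z(stationary)
```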

4 Testing Forecast Performance

To test the forecasting effectiveness of the model, we used it to produce forecasts for the full demand series sample described in Section 1.1, and compared its performance with that of three other forecasting methods.

15 By analogy with regression diagnostics, the “fitted value” of y_it is the value predicted by the other random quantities in the model.
16 “Boxes” delimit the interquartile range of the distributions, and “whiskers” extend 1.5 times the interquartile range from the ends of the boxes—see Tukey (1977) for further details.

4.1 Setup

Data used in the test comprised records of period demand for the sample of 45 parts in Section 1.1, which were used in 9 products in all. Together, the life cycles of the parts in the test spanned 54 planning periods. Each part i was associated with a period S_i, the period in which the part was first used (and in which demand for the part was first observed), and a life cycle length L_i, the number of periods during which the part was in use. From the demand records, a 45 × 54 matrix was assembled, with element D_ip equal to the number of units of part i demanded in test period p—D_ip = 0 for p < S_i or p ≥ S_i + L_i. 17

Since all of the forecasting methods require some historical data at the outset of the test to make forecasts, the parts from 3 randomly-chosen products—16 parts in total—were held in reserve. Forecasts were produced for the remaining 29 holdout parts, belonging to 6 products. Forecasts were made using the candidate methods in each of the 54 periods spanned by the test products’ life cycles. The forecast horizon was a single planning period, since this encompassed the lead time of the bulk of the manufacturing parts at issue. In producing the forecasts, we endeavored to mimic as closely as possible the actual demand information available in each forecast period—reserve parts aside, only demands which had occurred prior to the forecast period were provided to the forecasting methods. For the forecasting method based on the model described in this paper (and to a lesser extent for the other methods tested) it was imperative that all demand series provided were aligned, so that the demand recorded in the first period of a part’s life cycle appeared as the first value in the associated demand series (the y_i of Section 2.1) input to the forecasting method, regardless of the period in which the part’s life cycle actually began. Therefore, for each period p = 1, ..., 54, the following procedure was followed:

(1) Define the set C = {i | S_i ≤ p}, which collects “current” parts whose life cycles began on or before p.

(2) For i = 1, ..., 45, let T_i equal the number of periods of demand observed for part i prior to period p:

\[
T_i = \begin{cases}
L_i & \text{if } i \text{ is reserved,} \\
0 & \text{if } i \notin C, \\
\min(L_i,\ p - S_i) & \text{otherwise.}
\end{cases} \tag{19}
\]

(3) Define the set F = {k ∈ C | T_k < L_k} of parts for which forecasts are to be calculated (namely, current parts barring reserved parts and parts whose life cycles concluded before p).

(4) Assemble demand series (y_1, ..., y_45), where for i = 1, ..., 45, y_i is the empty sequence if T_i = 0, and otherwise consists of the sequence (D_i,S_i, ..., D_i,S_i+T_i−1).

(5) Now, using each of the forecasting methods, for each of the parts k in the set F, produce a forecast ŷ_k,T_k+1, which corresponds to the actual part demand recorded as D_kp.

The four candidate forecasting methods are described below. 17

17 Throughout this section, we use the index p to refer to a period within the span of the test, and t to index the demand series for a particular part, i.e. the series y_i given below. Since y_i actually begins in test period S_i, in general t = p − S_i + 1, for a given part i.

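The alignment procedure of steps (1) through (5) reduces to a few lines of code. The sketch below uses a toy three-part example (the S, L values and the reserved set are invented for illustration):

```python
def align(S, L, reserved, p):
    """For forecast period p, compute observed-history lengths T_i
    (equation 19), the current set C and the forecast set F.
    S and L map each part index to its first period and life-cycle length."""
    C = {i for i in S if S[i] <= p}
    T = {}
    for i in S:
        if i in reserved:
            T[i] = L[i]               # reserved parts: full history supplied
        elif i not in C:
            T[i] = 0                  # life cycle has not yet begun
        else:
            T[i] = min(L[i], p - S[i])
    F = {k for k in C if T[k] < L[k]}  # current parts still needing forecasts
    return T, C, F

S = {1: 1, 2: 3, 3: 6}   # first period of use (toy values)
L = {1: 4, 2: 5, 3: 3}   # life-cycle lengths (toy values)
T, C, F = align(S, L, reserved={1}, p=5)
```

In this toy example, part 1 is reserved (so its full history is available but it is never forecast), part 2 is current with two observed periods, and part 3's life cycle has not yet begun.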

Bench

This is an all-but-trivial “benchmark” method that serves as a baseline for comparison. If T_k > 0, it simply repeats the observation from the previous period. For the first period of a part’s life cycle (i.e. when T_k = 0), it uses the mean of the first-period demands for other parts of the same product, if there are any suitable demand histories in the data provided to the method, and the mean of the available first-period demands of all parts otherwise:

\[
\hat{y}_{k,T_k+1} = \begin{cases}
y_{kT_k} & \text{if } T_k > 0, \\
\operatorname{mean}\{ y_{i1} \mid i \in \Pi_k \} & \text{otherwise.}
\end{cases} \tag{20}
\]

Here, the expression “mean S” denotes the arithmetic mean of the set S, and the set Π_k is defined as follows:

\[
\Pi_k = \begin{cases}
P_k & \text{if } P_k \neq \emptyset, \\
H & \text{otherwise,}
\end{cases} \tag{21}
\]

where:

\[
P_k = \{ i \in H \mid \mathrm{prod}(k) = \mathrm{prod}(i) \}, \qquad H = \{ i \mid T_i > 0 \}.
\]
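Equations (20) and (21) translate directly into code. The demand series and product map below are invented for illustration:

```python
def bench_forecast(k, series, prod):
    """Bench forecast for part k, given observed series (a dict mapping
    part -> list of demands) and a part -> product map, per equations
    (20) and (21)."""
    y_k = series.get(k, [])
    if y_k:                                   # T_k > 0: repeat last value
        return y_k[-1]
    H = [i for i in series if series[i]]            # parts with history
    P = [i for i in H if prod[i] == prod[k]]        # ... of the same product
    pool = P if P else H                            # the set Pi_k of eq. (21)
    firsts = [series[i][0] for i in pool]
    return sum(firsts) / len(firsts)

prod = {1: "A", 2: "A", 3: "B"}
series = {1: [10.0, 12.0], 2: [], 3: [7.0]}
f2 = bench_forecast(2, series, prod)  # new part: mean of same-product firsts
f1 = bench_forecast(1, series, prod)  # established part: repeat last value
```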

Judg

This emulates the forecasting method currently used in Sun’s supply chain. It relies on the provision (usually by the Company’s sales and marketing units) of a forecast ŵ_ip for each part i, of total unit demand for the product of which part i is a component, for the quarter into which planning period p falls. 18 Then with Π_k defined as in equation (21), define an attach rate, ρ_k, which for T_k > 0 represents demand for part k in period T_k as a proportion of the forecast quarterly demand of the corresponding product, and for T_k = 0 is the mean of the first-period attach rates for like parts:

\[
\rho_k = \begin{cases}
\operatorname{mean}\{ y_{i1}/\hat{w}_{iS_i} \mid i \in \Pi_k \} & \text{if } T_k = 0, \\
y_{kT_k}/\hat{w}_{k,S_k+T_k-1} & \text{otherwise.}
\end{cases} \tag{22}
\]

The assumption in forecasting with this method is that the attach rate should remain reasonably stable from period to period, so we let:

\[
\hat{y}_{k,T_k+1} = \rho_k\, \hat{w}_{kp}. \tag{23}
\]

ExpS

As an exemplar of the exponential smoothing techniques outlined in Section 1.2, we use the forecast package developed for the R statistical programming environment (Venables and Smith 2002) by Hyndman and Khandakar (2008). An embodiment of the forecasting framework developed by Hyndman, Koehler, Snyder, and Grose (2002)—itself an outgrowth of the structural approach to exponential smoothing detailed at length by Hyndman et al. (2008)—the forecast package provides for the automatic selection, estimation and extrapolation of an exponential smoothing model from a taxonomy of such models that incorporate additive or multiplicative seasonal and error components and/or additive, multiplicative or damped trends. Its performance is demonstrably superior in a number of applications, as Hyndman et al. (2002) relate.

A number of considerations recommended the forecast package as a candidate for comparison: a) The package provides access to state-of-the-art exponential smoothing models, together with sophisticated techniques for selecting, fitting and forecasting with them; b) by operating entirely automatically, the package represents a plausible representative of the best “off the shelf” technology that might be available to operations managers; and c) again, by dint of its automatic operation, the package avoids the possibility that incompetence or invidiousness on our part might affect its performance in the comparison.

Forecasting the value ŷ_k,T_k+1 given the (possibly empty) demand history y_k is very straightforward:

(1) Load the package forecast into R.
(2) Attempt to fit an exponential smoothing model to y_k using the package’s ets(·) function.
(3) If the attempt is unsuccessful (that is, if ets(·) returns an error—almost invariably because y_k is too short), use the forecast produced by applying the Bench method to the same series.
(4) Otherwise, apply the function forecast(·) to the model produced in step 2. The point forecast used for ŷ_k,T_k+1 is the mean component of the list returned by forecast(·). 19

18 For example (with minor abuse of notation), ŵ_i,Q2P1 is a forecast of demand for product prod(i) in quarter Q2. For the test, we used published forecasts of product demand from the Company’s historical records.

Mod

Forecasting with the Bayesian model simply involves subjecting the assembled demand series (y_1, ..., y_45) to the procedure set out in Section 2.1, using the MCMC simulator described in Section 3.1 to produce a sample from the posterior predictive distribution for y_k,T_k+1. In keeping with the approach used in the method ExpS, we use the mean of this sample as the point estimate ŷ_k,T_k+1.

4.2 Results

The exercise described in the previous section resulted in rolling one-step-ahead forecasts for the 29 parts in the holdout set. A preliminary illustration of the relative performance

19 R code for steps 2 – 4 is (roughly) as follows:

    m <- try(ets(yk), silent = TRUE)
    yhat <- if (inherits(m, "try-error")) bench(yk) else mean(forecast(m, h = 1)$mean)

where bench(·) applies the Bench method.

x_it: Value of the latent autoregressive process for part i in period t.
ξ_it: Error associated with x_it.
ϕ_i = (ϕ_i1, ϕ_i2): Coefficients of autoregression for part i.


Notation: Meaning

Scale factor
γ_i: Scale factor associated with part i; equal to expected life-time demand for part i.
g_j: Location of platform-level prior for γ_i, for i ∈ parts(j).
s_i: Observed total demand for part i over its entire life cycle (if available).

Parameters
µ_θ, σ_θ: Parameters (usually mean and std. dev., resp.) of prior distribution for generic parameter θ.

Testing
S_i: The period in which part i was first used.
L_i: The length of part i’s entire life cycle.
p ∈ {1, ..., 54}: Index ranging over planning periods spanned by the test.
D_ip: Element of matrix recording actual part demands, equal to the unit demand for part i during period p.
ŷ_k,T_k+1: A point estimate of y_k,T_k+1.
C: Set of parts whose life cycles began in or before period p.
F: Set of parts to be provided forecasts in period p.
m, m′ ∈ 1, ..., M: Indexes ranging over forecasting methods.

Miscellaneous
∅: The empty set.
mean S: The mean of set S.
I(·): The indicator function, equal to 1 if its argument is true, and 0 otherwise.
x^⊤: The transpose of x.

B Standard Probability Distributions

N(µ, σ²): Normal distribution with mean µ and standard deviation σ.
\[
\mathrm{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[ -\frac{1}{2\sigma^2}(x - \mu)^2 \right]
\]

Weib(λ, k): Weibull distribution with shape λ and scale k.
\[
\mathrm{Weib}(x \mid \lambda, k) = \lambda k^{-\lambda} x^{\lambda - 1} e^{-(x/k)^\lambda}, \quad x \ge 0
\]

Bern(p): The Bernoulli distribution with success probability p.
\[
\mathrm{Bern}(x \mid p) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}
\]

N(µ, Σ): Multivariate normal distribution with mean µ and positive definite d × d covariance matrix Σ.
\[
\mathrm{N}(x \mid \mu, \Sigma) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp\!\left[ -\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu) \right]
\]

N_[0,∞)(µ, σ²): The normal distribution N(µ, σ²), truncated on the left at 0.
\[
\mathrm{N}_{[0,\infty)}(x \mid \mu, \sigma^2) = \frac{\mathrm{N}(x \mid \mu, \sigma^2)}{1 - \Phi(-\mu/\sigma)}, \quad x \ge 0,
\]
where Φ(·) is the standard normal cumulative distribution function.

Exp(λ): The exponential distribution with parameter λ.
\[
\mathrm{Exp}(x \mid \lambda) = \lambda e^{-\lambda x}, \quad x > 0
\]

Mult(n; p_1, ..., p_k): The multinomial distribution with n trials and bin probabilities p_1, ..., p_k.
\[
\mathrm{Mult}(x \mid n; p_1, \ldots, p_k) = \binom{n}{x_1, \ldots, x_k} p_1^{x_1} \cdots p_k^{x_k}, \quad x_j = 0, 1, 2, \ldots, n;\ \textstyle\sum_j x_j = n
\]

Inv–χ²(ν): The inverse chi-squared distribution with ν degrees of freedom.
\[
\mathrm{Inv}\text{–}\chi^2(x \mid \nu) = \frac{2^{-\nu/2}}{\Gamma(\nu/2)}\, x^{-(\nu/2 + 1)} \exp[-1/(2x)], \quad x > 0
\]
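Since the left-truncated normal N_[0,∞) appears throughout the model's priors, a standard inverse-CDF sampler for it makes a useful companion to the table. This sketch uses Python's standard-library NormalDist; the parameter values are illustrative only:

```python
import random
from statistics import NormalDist

_std = NormalDist()  # standard normal: cdf is Phi, inv_cdf is Phi^-1

def sample_trunc_normal(mu, sigma, rng):
    """Draw from N(mu, sigma^2) truncated to [0, inf) by inverting the
    truncated CDF: map U(0,1) onto the untruncated CDF's upper segment."""
    lo = _std.cdf((0.0 - mu) / sigma)   # mass below the truncation point
    u = lo + rng.random() * (1.0 - lo)  # uniform over the retained segment
    return mu + sigma * _std.inv_cdf(u)

rng = random.Random(7)
draws = [sample_trunc_normal(-1.0, 2.0, rng) for _ in range(1000)]
```

Even with a negative location parameter, every draw lands in [0, ∞), as required of the truncated priors for α_i, δ_i and γ_i.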

C The Geweke-Tanizaki (2003) 26 Taylored Chain

With x the current state of the sampler, to produce a proposal x⁺ for a target kernel p(z), let q(z) = log p(z), with q′(z) and q″(z) the first and second derivatives thereof. Proceed by cases:

26 Sampling techniques similar to the Taylored chain are also discussed by Qi and Minka (2002).


Case 1: q″(x) < −ε, where ε is a suitable small constant, such as 0.1. 27 Rewrite the Taylor expansion of q(z) around x:

\[
q(z) \approx q(x) + q'(x)(z - x) + \frac{1}{2} q''(x)(z - x)^2
= q(x) - \frac{q'(x)^2}{2 q''(x)} - \frac{1}{2}\left(-q''(x)\right) \underbrace{\left( z - x + \frac{q'(x)}{q''(x)} \right)^{\!2}}_{\ddagger}
\]

Since q″(x) < 0, the component (‡) of the latter expression constitutes the exponential part of a normal distribution, which implies that the target kernel in the vicinity of x may be approximated by a normal distribution with mean x − q′(x)/q″(x) and standard deviation 1/√(−q″(x)); sample x⁺ accordingly.

Case 2: q″(x) ≥ −ε and q′(x) < 0. Approximate q(z) by a line passing through x and x₁*, the largest mode of q(z) smaller than x:

\[
q(z) \approx q(x_1^*) + \underbrace{\frac{q(x_1^*) - q(x)}{x_1^* - x}\,(z - x_1^*)}_{\ddagger}
\]

In this case, the component (‡) indicates an exponential distribution, and the proposal is: 28

\[
x^+ = \hat{x}_1 + w, \quad \text{where } w \sim \mathrm{Exp}(\lambda_1), \quad
\lambda_1 = \frac{q(x_1^*) - q(x)}{x_1^* - x}, \quad
\hat{x}_1 = x_1^* - 1/\lambda_1.
\]

Case 3: q″(x) ≥ −ε and q′(x) > 0. Approximate q(z) by a line passing through x and x₂*, the smallest mode of q(z) larger than x. The proposal is developed in a manner parallel to that in Case 2:

\[
x^+ = \hat{x}_2 - w, \quad \text{where } w \sim \mathrm{Exp}(\lambda_2), \quad
\lambda_2 = \frac{q(x_2^*) - q(x)}{x_2^* - x}, \quad
\hat{x}_2 = x_2^* - 1/\lambda_2.
\]

Case 4: q″(x) ≥ −ε and q′(x) = 0. In this instance, x⁺ is sampled from a uniform distribution over a range [x₁, x₂], such that x₁ < x < x₂. End points x₁ and x₂ are set to suitable modes of q(·), if they can be found, and to user-supplied values otherwise.

27 By ensuring that |q″(x)| > 0, using ε rather than 0 reduces the occurrence of proposed values that depart too markedly from the current state.
28 The origin of the proposal, x̂₁, is offset from the mode x₁* in order to guarantee irreducibility of the resulting Markov chain; see (Geweke and Tanizaki 2003) for details.
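Case 1 above amounts to a second-order Taylor (Laplace-style) proposal, which is easy to check in code. The target used in the sketch below (a standard normal log-kernel) is purely illustrative:

```python
import random

def case1_proposal(x, q1, q2, rng, eps=0.1):
    """Taylored-chain Case 1: when the log-kernel's curvature q''(x) is
    sufficiently negative, propose from the normal approximation with
    mean x - q'(x)/q''(x) and sd 1/sqrt(-q''(x))."""
    c = q2(x)
    if c >= -eps:
        raise ValueError("Case 1 requires q''(x) < -eps")
    mean = x - q1(x) / c
    sd = 1.0 / (-c) ** 0.5
    return rng.gauss(mean, sd), mean, sd

# Standard normal target: q(z) = -z^2/2, so q'(z) = -z and q''(z) = -1.
# Wherever the chain currently sits, the proposal is exactly N(0, 1),
# i.e. the target itself, since the quadratic approximation is exact here.
rng = random.Random(3)
x_plus, mean, sd = case1_proposal(2.0, lambda z: -z, lambda z: -1.0, rng)
```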

D Forecast Error Metrics

Assume that forecast method m provides forecast ŷ_mit for part i and t ∈ 1, ..., T_i, and that y_it is the corresponding actual value. Then the mean absolute error, root mean square error and mean absolute percentage error resp. for method m and part i are defined as follows: 29

\[
\begin{aligned}
\mathrm{MAE}_{mi} &= \frac{1}{T_i} \sum_{t=1}^{T_i} \left| \hat{y}_{mit} - y_{it} \right| \\
\mathrm{RMSE}_{mi} &= \left[ \frac{1}{T_i} \sum_{t=1}^{T_i} \left( \hat{y}_{mit} - y_{it} \right)^2 \right]^{\frac{1}{2}} \\
\mathrm{MAPE}_{mi} &= 100 \times \frac{1}{T_i} \sum_{t=1}^{T_i} \left| \frac{\hat{y}_{mit} - y_{it}}{y_{it}} \right|
\end{aligned}
\]

Relative versions of the above are defined by ratios to the corresponding error metric for the benchmark forecast method, Bench: 30

\[
\mathrm{RelMAE}_{mi} = \frac{\mathrm{MAE}_{mi}}{\mathrm{MAE}_{\mathrm{Bench}\,i}}, \quad
\mathrm{RelRMSE}_{mi} = \frac{\mathrm{RMSE}_{mi}}{\mathrm{RMSE}_{\mathrm{Bench}\,i}}, \quad
\mathrm{RelMAPE}_{mi} = \frac{\mathrm{MAPE}_{mi}}{\mathrm{MAPE}_{\mathrm{Bench}\,i}}.
\]

Finally, average metrics for a particular method across all parts in the sample are derived using the geometric mean; for example:

\[
\mathrm{RelMAE}_m = \left[ \prod_{i=1}^{N} \mathrm{RelMAE}_{mi} \right]^{\frac{1}{N}}
\]

29 The MAPE is included in the set of metrics used in this paper since it is widely favored by practitioners of forecasting for supply chain management. It should be noted, however, that it is prone to erratic performance, particularly when actual demands are low—see (Makridakis 1993; Koehler 2001; Hyndman and Koehler 2006; Coleman and Swanson 2007) for a discussion.
30 Metric RelMAE_mi is closely related to the mean absolute scaled error of Hyndman and Koehler (2006), and RelRMSE_mi conforms to one definition of Theil’s U2 statistic (Armstrong and Collopy 1992).
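The metrics above are straightforward to compute. The series in the usage example are invented for illustration:

```python
def mae(f, y):
    """Mean absolute error of forecasts f against actuals y."""
    return sum(abs(a - b) for a, b in zip(f, y)) / len(y)

def rmse(f, y):
    """Root mean square error."""
    return (sum((a - b) ** 2 for a, b in zip(f, y)) / len(y)) ** 0.5

def mape(f, y):
    """Mean absolute percentage error (assumes nonzero actuals)."""
    return 100.0 * sum(abs((a - b) / b) for a, b in zip(f, y)) / len(y)

def geo_mean(xs):
    """Geometric mean, used to average relative metrics across parts."""
    p = 1.0
    for x in xs:
        p *= x
    return p ** (1.0 / len(xs))

actual = [100.0, 200.0, 400.0]
method = [110.0, 190.0, 440.0]
bench  = [120.0, 180.0, 300.0]

rel_mae = mae(method, actual) / mae(bench, actual)  # RelMAE for one part
```

A relative value below 1 indicates that the method outperforms Bench on that part; the per-part ratios are then combined across parts with geo_mean, as in the definition of RelMAE_m.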