Estimating time-varying parameters in brand choice models: A semiparametric approach

WEB APPENDIX

Daniel Guhl (a), Bernhard Baumgartner (b), Thomas Kneib (c), and Winfried J. Steiner (d,*)

(a) Postdoctoral researcher, Humboldt University Berlin, Institute of Marketing, School of Business and Economics, Spandauer Straße 1, 10178 Berlin, Germany, E-mail: [email protected]
(b) Professor of Marketing, University of Osnabrück, Department of Marketing, Rolandstraße 8, 49069 Osnabrück, Germany, E-mail: [email protected]
(c) Professor of Statistics, Georg-August-Universität Göttingen, Department of Statistics and Econometrics, Humboldtallee 3, 37073 Göttingen, Germany, E-mail: [email protected]
(d) Chair in Management and Marketing and Professor of Marketing, Clausthal University of Technology, Department of Marketing, Julius-Albert-Straße 2, 38678 Clausthal-Zellerfeld, Germany, E-mail: [email protected]
(*) Corresponding author

Appendix A: Penalized maximum likelihood estimation of semiparametric multinomial logit models

We describe penalized maximum likelihood estimation of the MNL-TVP model, i.e., based on the model equation [1]

$$\eta_{it}^{(c)} = f_0^{(c)}(t) + \sum_{j=1}^{J} x_{itj}^{(c)} f_j(t). \tag{A1}$$

Simplified model versions can then be derived by analogously simplifying the estimation scheme. All nonparametric functions are represented in terms of B-spline basis functions of degree $l$ (de Boor, 2001), i.e.,

$$f_0^{(c)}(t) = \sum_{m=1}^{M} \gamma_{m0}^{(c)} B_{m0}^{l}(t) = \nu_{it0}' \gamma_0^{(c)}, \tag{A2}$$

where $\nu_{it0} = (B_{10}^{l}(t), \ldots, B_{M0}^{l}(t))'$ and $\gamma_0^{(c)} = (\gamma_{10}^{(c)}, \ldots, \gamma_{M0}^{(c)})'$, and

$$x_{itj}^{(c)} f_j(t) = x_{itj}^{(c)} \sum_{m=1}^{M} \gamma_{mj} B_{m}^{l}(t) = \nu_{itj}^{(c)\prime} \gamma_j, \tag{A3}$$

where $\nu_{itj}^{(c)} = (x_{itj}^{(c)} B_{1}^{l}(t), \ldots, x_{itj}^{(c)} B_{M}^{l}(t))'$ and $\gamma_j = (\gamma_{1j}, \ldots, \gamma_{Mj})'$. Stacking all brand-specific predictors in the vector $\eta_{it} = (\eta_{it}^{(1)}, \ldots, \eta_{it}^{(C-1)})'$ allows us to rewrite the model specification in matrix notation as

$$\eta_{it} = V_{it0}\gamma_0 + \sum_{j=1}^{J} V_{itj}\gamma_j, \tag{A4}$$

where

$$V_{it0} = \begin{pmatrix} \nu_{it0}' & & \\ & \ddots & \\ & & \nu_{it0}' \end{pmatrix}, \qquad V_{itj} = \begin{pmatrix} \nu_{itj}^{(1)\prime} \\ \vdots \\ \nu_{itj}^{(C-1)\prime} \end{pmatrix}, \tag{A5}$$

and $\gamma_0 = (\gamma_0^{(1)\prime}, \ldots, \gamma_0^{(C-1)\prime})'$.

The vector of responses is linked to the vector of choice probabilities $\pi_{it} = (\pi_{it}^{(1)}, \ldots, \pi_{it}^{(C-1)})'$ via the response function $h: \mathbb{R}^{C-1} \to \mathbb{R}^{C-1}$ corresponding to the multinomial model, i.e., $h(\eta_{it}) = \pi_{it} = (h^{(1)}(\eta_{it}), \ldots, h^{(C-1)}(\eta_{it}))'$ and

$$h^{(c)}(\eta_{it}) = \frac{\exp(\eta_{it}^{(c)})}{1 + \sum_{s=1}^{C-1} \exp(\eta_{it}^{(s)})}. \tag{A6}$$

[1] We do not discuss random effects here because this extension is straightforward and only makes the notation more complicated.
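The response function of the multinomial model can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `choice_probabilities` is hypothetical, the predictors of the non-reference brands are passed as a plain list, and the predictor of the reference brand is assumed to be fixed at zero, as implied by the "1 +" term in the denominator.

```python
import math


def choice_probabilities(eta):
    """Multinomial logit probabilities for one choice occasion.

    eta: predictors of the C-1 non-reference brands (the reference
    brand's predictor is assumed to be 0, hence the "1 +" in the
    denominator of the response function).
    Returns (probabilities of the C-1 brands, probability of the reference).
    """
    # Shift by the largest predictor (or 0) so exp() cannot overflow;
    # the shift cancels in the ratio.
    shift = max(0.0, max(eta))
    numerators = [math.exp(e - shift) for e in eta]
    reference = math.exp(-shift)
    denominator = reference + sum(numerators)
    return [v / denominator for v in numerators], reference / denominator


# Usage: three non-reference brands with illustrative predictor values.
probs, ref_prob = choice_probabilities([0.5, -0.2, 1.0])
```

The shift is a purely numerical device and does not change the probabilities.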

Finally, stacking all individual observation predictors and design matrices yields the overall model equation $\eta = V_0\gamma_0 + V_1\gamma_1 + \cdots + V_J\gamma_J = V\gamma$, where $\gamma = (\gamma_0^{(1)\prime}, \ldots, \gamma_0^{(C-1)\prime}, \gamma_1', \ldots, \gamma_J')'$ collects all regression parameters and $V = (V_0, V_1, \ldots, V_J)$ is the complete design matrix. Penalized maximum likelihood estimation is based on the penalized (log-)likelihood

$$l_{pen}(\gamma) = l(\gamma) - \sum_{c=1}^{C-1} \lambda_0^{(c)} \gamma_0^{(c)\prime} P_0 \gamma_0^{(c)} - \sum_{j=1}^{J} \lambda_j \gamma_j' P_j \gamma_j = l(\gamma) - \gamma' P \gamma, \tag{A7}$$

where $l(\gamma)$ is the usual multinomial logit log-likelihood, $\lambda_0^{(1)}, \ldots, \lambda_0^{(C-1)}, \lambda_1, \ldots, \lambda_J$ are smoothing parameters determining the overall smoothness of the corresponding function estimate, and $P = \mathrm{blockdiag}(\lambda_0^{(1)} P_0, \ldots, \lambda_0^{(C-1)} P_0, \lambda_1 P_1, \ldots, \lambda_J P_J)$ is the overall penalty matrix for the combined vector $\gamma$.

Penalized likelihood estimates can be obtained by adapting the Fisher scoring algorithm (Fahrmeir & Tutz, 2004) to the penalized likelihood setting. Taking first and second derivatives (with the factor 2 from differentiating the quadratic penalty absorbed into the smoothing parameters) yields the score function $s_{pen}(\gamma) = V'(y - \pi) - P\gamma$ and the penalized Fisher information $F_{pen}(\gamma) = V'WV + P$, where $y = (y_{it})$ is the vector of observed binary choice decisions, $W = \mathrm{blockdiag}(W_{it})$ with $W_{it} = \mathrm{cov}(y_{it})$ and

$$\mathrm{cov}(y_{it}) = \begin{pmatrix}
\pi_{it}^{(1)}(1 - \pi_{it}^{(1)}) & -\pi_{it}^{(1)}\pi_{it}^{(2)} & \cdots & -\pi_{it}^{(1)}\pi_{it}^{(C-1)} \\
-\pi_{it}^{(1)}\pi_{it}^{(2)} & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & -\pi_{it}^{(C-2)}\pi_{it}^{(C-1)} \\
-\pi_{it}^{(1)}\pi_{it}^{(C-1)} & \cdots & -\pi_{it}^{(C-2)}\pi_{it}^{(C-1)} & \pi_{it}^{(C-1)}(1 - \pi_{it}^{(C-1)})
\end{pmatrix}. \tag{A8}$$

Fisher scoring then proceeds by iteratively updating $\gamma$ via

$$\gamma^{(k+1)} = \gamma^{(k)} + F_{pen}^{-1}(\gamma^{(k)})\, s_{pen}(\gamma^{(k)}).$$

Equivalently, this update step can be written in the form of an iteratively reweighted penalized least squares fit

$$\gamma^{(k+1)} = \left(V'W(\gamma^{(k)})V + P\right)^{-1} V'W(\gamma^{(k)})\, \tilde{y}(\gamma^{(k)})$$

with working responses $\tilde{y}(\gamma) = \eta + W^{-1}(y - \pi)$. This can also be interpreted as assuming a working Gaussian model for the working responses $\tilde{y}$, where $\tilde{y} \sim N(V\gamma, W^{-1})$, and estimating $\gamma$ is based on weighted penalized least squares in each iteration.
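The equivalence of the Fisher scoring step and the reweighted least squares fit can be checked in a few lines. The derivation below is a sketch under stated assumptions: $W$, $\pi$, and $\eta = V\gamma^{(k)}$ are all evaluated at the current iterate $\gamma^{(k)}$, the penalized Fisher information is taken as $F_{pen} = V'WV + P$ with score $s_{pen} = V'(y - \pi) - P\gamma$, and $F_{pen}$ is assumed to be invertible.

```latex
\begin{align*}
(V'WV + P)^{-1} V'W \tilde{y}
  &= (V'WV + P)^{-1} V'W \left[ V\gamma^{(k)} + W^{-1}(y - \pi) \right] \\
  &= (V'WV + P)^{-1} \left[ V'WV\gamma^{(k)} + V'(y - \pi) \right] \\
  &= (V'WV + P)^{-1} \left[ (V'WV + P)\gamma^{(k)} + V'(y - \pi) - P\gamma^{(k)} \right] \\
  &= \gamma^{(k)} + F_{pen}^{-1}(\gamma^{(k)})\, s_{pen}(\gamma^{(k)}).
\end{align*}
```

The third line simply adds and subtracts $P\gamma^{(k)}$, which is what turns the least squares solution into "current value plus scoring step".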

This interpretation also forms the basis for determining the smoothing parameters for the various nonparametric functions, employing recent developments that connect nonparametric Gaussian models and mixed models (e.g., Fahrmeir et al., 2004; Ruppert et al., 2003). The key in these approaches is to interpret the parameters of the nonparametric functions as random effects and to relate the penalty to a corresponding random effects distribution. The smoothing parameter $\lambda$ is then a one-to-one transformation of the random effects variance, and methods to estimate this variance (such as restricted maximum likelihood estimation) can be employed to derive estimates for the smoothing parameters.

We will now briefly describe the mixed model representation for a single nonparametric function $f(t)$ (where function and category indices are dropped for simplicity). Utilizing the basis function representation, the function evaluations can be written as $f(t) = \nu'\gamma$, and estimation involves the penalty $\mathrm{pen}(\gamma) = \lambda\gamma'P\gamma$, where $P = D'D$ corresponds to the penalty matrix defined in terms of a first or second order difference matrix

$$D_1 = \begin{pmatrix} -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{pmatrix} \quad \text{or} \quad D_2 = \begin{pmatrix} 1 & -2 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & -2 & 1 \end{pmatrix}. \tag{A9}$$
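To make the difference penalty concrete, the following sketch builds the first and second order difference matrices and the resulting penalty matrix $P = D'D$ with plain Python lists. The function names are hypothetical and chosen only for this illustration; the number of basis coefficients is passed in as `n_coef`.

```python
def difference_matrix(n_coef, order):
    """Return the (n_coef - order) x n_coef difference matrix D as nested lists."""
    if order == 1:
        stencil = [-1, 1]        # rows of the first order difference matrix
    elif order == 2:
        stencil = [1, -2, 1]     # rows of the second order difference matrix
    else:
        raise ValueError("only first and second order differences are sketched")
    rows = []
    for r in range(n_coef - order):
        row = [0] * n_coef
        for offset, value in enumerate(stencil):
            row[r + offset] = value
        rows.append(row)
    return rows


def penalty_matrix(D):
    """Return P = D'D as nested lists."""
    n = len(D[0])
    return [[sum(row[a] * row[b] for row in D) for b in range(n)]
            for a in range(n)]


def penalty(gamma, P, lam):
    """Evaluate pen(gamma) = lam * gamma' P gamma."""
    n = len(gamma)
    return lam * sum(gamma[a] * P[a][b] * gamma[b]
                     for a in range(n) for b in range(n))


# Usage: a linear coefficient sequence is not penalized by second order
# differences, but it is penalized by first order differences.
P1 = penalty_matrix(difference_matrix(4, 1))
P2 = penalty_matrix(difference_matrix(4, 2))
linear = [0.0, 1.0, 2.0, 3.0]
pen_first = penalty(linear, P1, lam=1.0)
pen_second = penalty(linear, P2, lam=1.0)
```

This shows the sense in which a first order penalty shrinks toward a constant function and a second order penalty toward a linear one.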

To obtain a mixed model representation of this situation, we reparametrize the regression parameters as $\gamma = U\beta + Zb$, where $U$ is a two-column matrix containing a linear basis representing the unpenalized part of function $f(t)$, and $Z$ contains the orthogonal deviation from the linear effect. The latter can be obtained from the spectral decomposition of the penalty matrix, utilizing only eigenvectors corresponding to non-zero eigenvalues. Inserting the reparametrization into the model equation yields the following representation of the nonparametric function: $f(t) = \nu'(U\beta + Zb) = u'\beta + z'b$. Similarly, inserting the reparametrization into the penalty, it can be shown that $\beta$ remains unpenalized, corresponding to a fixed effect in the mixed model interpretation. In contrast, the penalty for $b$ turns out to be of the ridge type, i.e., $\mathrm{pen}(b) = \lambda b'b$. This relates to the assumption of i.i.d. Gaussian random effects $b \sim N(0, \sigma^2 I)$, where $\sigma^2 = 1/\lambda$.

In summary, any nonparametric effect in the semiparametric MNL can be represented as the sum of fixed and random effects. Combining this interpretation with the working Gaussian model for the working responses $\tilde{y}$ allows us to perform a REML update for the smoothing parameters within each of the iterations of the Fisher scoring algorithm.
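The ridge form of the penalty can be verified with a short calculation. The following is a sketch under explicit assumptions that go beyond the text: the penalty matrix has the spectral decomposition $P = \Gamma_+ \Omega_+ \Gamma_+'$, where $\Omega_+$ is the diagonal matrix of non-zero eigenvalues and $\Gamma_+$ holds the corresponding orthonormal eigenvectors; $Z$ is scaled as $Z = \Gamma_+ \Omega_+^{-1/2}$ (one possible choice); and the columns of $U$ span the null space of $P$, so that $PU = 0$.

```latex
\begin{align*}
\gamma' P \gamma
  &= (U\beta + Zb)' P (U\beta + Zb) \\
  &= b' Z' P Z b
     && \text{since } PU = 0 \\
  &= b'\, \Omega_+^{-1/2} \Gamma_+' \left( \Gamma_+ \Omega_+ \Gamma_+' \right) \Gamma_+ \Omega_+^{-1/2}\, b \\
  &= b'b
     && \text{since } \Gamma_+'\Gamma_+ = I .
\end{align*}
```

Hence $\lambda\gamma'P\gamma = \lambda b'b$, and $\beta$ does not appear in the penalty at all.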

Appendix B: Additional empirical applications

In this appendix, we present the results of two additional empirical applications. That way, we can provide more general insights about the performance of the different models with versus without time-varying parameters as well as with versus without heterogeneity when applied to different data sets. [2]

[2] Note that we here abstain from adding reference price and brand loyalty terms since we obtained the data sets (which were kindly provided by other researchers) already in preprocessed form for our analysis.

B.1 Detergent data

The A.C. Nielsen scanner panel data set used by Kim, Menzefricke, & Feinberg (2005) contains 6682 liquid detergent purchases of 492 households for 4 brands (A, B, C, and D) [3] over a time period of 96 weeks. The households made at least 7 purchases (14 purchases on average), and the median interpurchase time is 7 weeks. Table B1 contains summary statistics for this data set, and Figure B1 shows weekly time-series plots for shares, prices, and promotional activities.

[3] The real brand names cannot be revealed due to a confidentiality agreement with the original data provider.

Table B1: Summary statistics for the detergent data set (492 households, 96 weeks)

brand   no. purchases   no. purchases   price ($ per unit)    promotion (share of purchases)
name    (estimation)    (validation)    mean      sd          display    feature
A       1015             987            5.087     0.583       0.148      0.233
B        952             943            5.952     0.632       0.135      0.217
C        627             633            5.928     0.515       0.069      0.064
D        774             751            5.426     0.675       0.095      0.118
Total   3368            3314

Similar to ketchup, we here observe a considerable amount of variation with respect to brand shares, prices, and promotional activities over time and across brands. Further, three of the four brands reveal some interesting trends in the evolution of their price level within the considered time span. For estimation and validation, we randomly split the data into two halves. Note that this setup differs from Kim et al. (2005), where observations from the last 6 weeks were kept for validation purposes. Brand D was chosen as the reference brand.

[Figure: panels share, price, display, and feature; x-axis: week; legend: brands A, B, C, D]

Figure B1: Time-series of brand shares, prices, and promotion variables (detergent data)

Table B2 shows the statistical performance for each model in the estimation and validation sample along the measures we already used for the ketchup study. Similar to ketchup, models with heterogeneity and parameter dynamics fit the data better (in- and out-of-sample) as compared to models without heterogeneity and/or parameter dynamics. Also, the parametric MNL-TVP4 model does not improve much over the MNL model and is clearly inferior to the nonparametric models. MXL-VAR and MXL-RVAR are again prone to overfitting (best in-sample fits but out-of-sample worse than the simple MXL). The model with the best predictive performance is the MXL-TVP3 model with random walk dynamics imposed on all parameters, followed by the two “hybrid” models MXL-TVP31 and MXL-TVP32. Exactly the same ordering is obtained in the estimation sample (ignoring MXL-VAR and MXL-RVAR, see above). The estimation of the MXL-TVP4 model did not converge, and hence the model is excluded from further discussion.

Table B2: Fit and predictive validity (detergent data)

Data set: estimation
Model             Log-Lik      Brier score   Spherical score   ARMSE
 1) MNL           -3602.618    -1769.785     2257.762          0.0860
 2) MNL-TVP1      -3522.767    -1717.718     2292.185          0.0740
 3) MNL-TVP2      -3519.176    -1715.990     2293.201          0.0736
 4) MNL-TVP3      -3450.234    -1677.722     2318.509          0.0596
 5) MNL-TVP4      -3591.644    -1758.451     2266.430          0.0845
 6) MXL           -1794.973     -894.985     2857.480          0.0585
 7) MXL-TVP1      -1749.240     -865.536     2875.327          0.0523
 8) MXL-TVP2      -1741.505     -861.287     2878.037          0.0517
 9) MXL-TVP3      -1719.617     -847.983     2887.052          0.0468
10) MXL-TVP31     -1726.560     -851.951     2884.220          0.0473
11) MXL-TVP32     -1726.974     -852.132     2884.172          0.0473
13) MXL-VAR       -1474.795     -790.444     2906.298          0.0390
14) MXL-RVAR      -1448.772     -778.351     2909.457          0.0367

Data set: validation
Model             Log-Lik      Brier score   Spherical score   ARMSE
 1) MNL           -3541.335    -1729.832     2230.213          0.0853
 2) MNL-TVP1      -3498.172    -1693.070     2252.430          0.0800
 3) MNL-TVP2      -3495.623    -1691.255     2253.475          0.0797
 4) MNL-TVP3      -3470.291    -1669.224     2267.843          0.0754
 5) MNL-TVP4      -3538.170    -1724.286     2234.135          0.0852
 6) MXL           -2400.058    -1190.421     2601.103          0.0715
 7) MXL-TVP1      -2367.608    -1176.283     2610.110          0.0673
 8) MXL-TVP2      -2363.134    -1173.631     2611.989          0.0668
 9) MXL-TVP3      -2346.904    -1164.477     2617.790          0.0643
10) MXL-TVP31     -2351.816    -1167.749     2615.462          0.0649
11) MXL-TVP32     -2351.709    -1167.736     2615.510          0.0649
13) MXL-VAR       -3759.388    -1631.584     2381.478          0.0830
14) MXL-RVAR      -4321.185    -1565.313     2408.916          0.0830

Note: Best-fitting model is indicated in bold within each data set (estimation vs. validation) for each performance measure.
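The performance measures in Table B2 are defined in the main paper rather than in this appendix. As a rough orientation only, the sketch below computes the log-likelihood, the Brier (quadratic) score, and the spherical score in common textbook forms, summed over observations and oriented so that larger values are better. That orientation is at least consistent with the signs in the table, but the exact definitions and scaling used for the table are an assumption here, and the function name `scores` is hypothetical.

```python
import math


def scores(probs, chosen):
    """Sum three proper scoring rules over all observations.

    probs:  list of predicted probability vectors (one entry per brand,
            each vector summing to 1).
    chosen: list with the index of the brand actually chosen.
    Returns (log_lik, brier, spherical); larger is better for all three
    under the sign conventions assumed in this sketch.
    """
    log_lik = 0.0
    brier = 0.0
    spherical = 0.0
    for p, c in zip(probs, chosen):
        # Logarithmic score: log of the probability of the chosen brand.
        log_lik += math.log(p[c])
        # Brier score: negative squared distance to the observed choice vector.
        brier -= sum((float(k == c) - pk) ** 2 for k, pk in enumerate(p))
        # Spherical score: chosen probability divided by the vector's length.
        spherical += p[c] / math.sqrt(sum(pk * pk for pk in p))
    return log_lik, brier, spherical


# Usage: two observations with four brands each (illustrative numbers).
example = scores([[0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]], [0, 3])
```

ARMSE is not included because its aggregation level cannot be inferred from this appendix.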

Figure B2 illustrates how parameters evolve over time. Shown are the parameter paths for the simple MXL (dotted black line; no parameter dynamics), the MXL-TVP1 with smoother parameter paths (dashed green line; cubic splines imposed on all parameters), the MXL-TVP3 (dash-dotted red line; random walk dynamics for all parameters), and the MXL-TVP32 (solid blue line; random walks for brand intercepts, cubic splines for covariate effects).

[Figure: panels intercept A, intercept B, intercept C, price, feature, and display; x-axis: week; y-axis: parameter value; legend: MXL, MXL-TVP1, MXL-TVP3, MXL-TVP32]

Figure B2: Estimated parameter paths (detergent data)

Estimated brand intercepts need to be interpreted w.r.t. brand D (the reference brand). As suggested by all four models, brand B has the largest brand value (intrinsic brand utility), followed by brands C, D, and A. Except for the intercept of brand C, all estimated effects vary over time (only the MXL-TVP1 model suggests a slight positive linear trend for the value of brand C). For example, the perceived brand value of brand A drops during the first year and then considerably increases during the second year toward a higher level as compared to the beginning. The intrinsic utility of brand B shows even more fluctuation over time. It seems like there is a seasonal pattern with a full cycle each year. Note that in the two periods with a rather low perceived value of brand B (around weeks 15 and 60), the gap in brand values between brand B and brand C almost vanishes. In periods where the intrinsic utility of brand B is perceived as high (in particular around week 50), on the other hand, brand B has by far the highest brand value. Please also note that these differences, as well as the dynamics in brand values, are not that clearly visible from the brand shares displayed in Figure B1.

Price sensitivity of households increases during the first year, then strongly decreases during the first half of the second year, and finally increases again to about the same level as in the beginning. The effects of feature and display show interesting patterns, too. While the feature effect decreases over time and turns out insignificant after the first half year, the display effect reveals a seasonal pattern. In some periods, the (mean) display effect is not significant either (e.g., between weeks 30 and 50).

A comparison of the estimated parameter paths for the three time-varying parameter models (MXL-TVP1, MXL-TVP3, MXL-TVP32) together with the fit and predictive validity results (see Table B2) lets us conclude that the MXL-TVP1 is not flexible enough to reproduce the high amount of variation in the intercepts for brands A and B. In addition, while the MXL-TVP1 and MXL-TVP32 models suggest nearly the same, smoother parameter paths for the price effect, the large changes in the price sensitivity of households during the first half year are captured still better by the more flexible MXL-TVP3 and are obviously not an artefact. Unlike for ketchup, more flexibility (in the form of the most flexible TVP3 model) seems to pay off here not only for representing time variation in intrinsic brand utilities but also for covariate effects (here especially for the price parameter). Stated otherwise, since the MXL-TVP3 provides the best predictive performance, the higher fluctuation of the price effect as suggested by this model seems robust and not a result of overfitting.

To obtain a better understanding of why the models of Kim et al. (2005) tend to overfit (best in-sample fits across all MNL and MXL models, worst out-of-sample fits across all MXL models, see Table B2), we display in Figure B3 the estimated parameter paths of the MXL-RVAR against the MXL-TVP3. All parameters of the MXL-RVAR model turn out larger in absolute magnitude. Hence they are estimated on a different scale, which is not unusual for HB-MNL models (Huber & Train, 2001). To solve this issue and make the results comparable, we rescaled the estimated parameters of the MXL-RVAR model such that the average price parameters are the same for both models (Huber & Train, 2001).
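The rescaling step can be sketched as follows. This is an illustration under assumptions, not the exact procedure used for the figure: it assumes that a single multiplicative factor, the ratio of the time-averaged price parameters of the two models, is applied to every parameter path of the model to be rescaled. The function name `rescale_paths` and the dictionary layout are hypothetical.

```python
def _mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def rescale_paths(paths, reference_paths, anchor="price"):
    """Rescale all parameter paths by one common factor.

    paths:           dict mapping parameter name -> list of weekly estimates
                     of the model to be rescaled (e.g., MXL-RVAR).
    reference_paths: dict with at least the anchor path of the reference
                     model (e.g., MXL-TVP3).
    anchor:          parameter whose time average is matched (assumed: price).
    Returns (rescaled paths, scale factor).
    """
    factor = _mean(reference_paths[anchor]) / _mean(paths[anchor])
    rescaled = {name: [factor * v for v in values]
                for name, values in paths.items()}
    return rescaled, factor


# Usage with made-up numbers: the price path of the first model is on
# average twice as large in magnitude as that of the reference model.
rescaled, factor = rescale_paths(
    {"price": [-20.0, -10.0], "feature": [2.0, 4.0]},
    {"price": [-6.0, -9.0]},
)
```

Because one common factor is used, ratios between parameters within the rescaled model are unchanged; only the overall scale is aligned.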

[Figure: panels intercept A, intercept B, intercept C, price, feature, and display; x-axis: week; y-axis: parameter value; legend: MXL-RVAR, MXL-TVP3]

Figure B3: Estimated parameter paths (MXL-TVP3 vs. MXL-RVAR; detergent data)

Figure B3 shows that the general courses of the parameter paths are quite similar. Nevertheless, the parameter paths obtained from the MXL-RVAR model (solid grey lines) turn out much more volatile compared to the MXL-TVP3 model (dashed red lines) and suggest a very high week-to-week variation, [4] which obviously leads to an excellent in-sample fit but at the same time hinders the model from doing a reasonable job out-of-sample. Besides, such wiggly parameter paths are hard to interpret from a managerial point of view. [5]

[4] A closer comparison of the parameter dynamics in Figure B3 with Figures 2 and 3 in Kim et al. (2005) reveals strong parallels, e.g., price sensitivities are highest between weeks 30 and 50 and lowest between weeks 70 and 80, despite the different definitions for splitting the data set into an estimation and validation sample.

[5] All of the estimated autoregressive effects resulting from the MXL-RVAR model of Kim et al. (2005) are small or even negative (but larger than −1). Thus, the parameter paths are stationary but with low or negative autocorrelation. This results in extremely volatile week-to-week variation in parameters, and we assume that this is (at least in part) driven by the changes in the weekly panel composition (i.e., not every household purchases in this product category every week). Our approach seems to be more robust regarding this issue.

B.2 Cola data

The second data set used in this appendix is the IRI scanner panel data set from Cosguner, Chan, & Seetharaman (2016). [6] This data set contains 8372 cola purchases of 300 households for 4 brands (Coke, Pepsi, Private Label, Royal Crown) over a period of 104 weeks (1991-1993). The households made at least 7 purchases (28 purchases on average), and the median interpurchase time is 5 weeks. Table B3 contains summary statistics for this data set.

[6] See also Chib, Seetharaman, & Strijnev (2004).

Table B3: Summary statistics for the cola data set (300 households; 104 weeks)

brand           no. purchases   no. purchases   price ($ per 32 oz.)   promotion (share of purchases)
name            (estimation)    (validation)    mean      sd           display    feature
Coke            1054            1101            0.755     0.233        0.186      0.267
Pepsi           1983            2012            0.688     0.181        0.320      0.398
Private Label    432             441            0.537     0.140        0.107      0.095
Royal Crown      678             671            0.714     0.214        0.142      0.151
Total           4147            4225

[Figure: panels share, price, display, and feature; x-axis: week; legend: Coke, Pepsi, Private Label, Royal Crown]

Figure B4: Time-series of brand shares, prices, and promotion variables (cola data)

Figure B4 shows weekly time-series plots for shares, prices, and promotional activities. Like for the ketchup and detergent data, we can observe a considerable amount of variation in shares, prices, and promotional activities over time for the cola brands. However, compared to ketchup and detergent, time trends in shares and/or prices are clearly less pronounced. For estimation and validation, we again randomly split the data into two halves. The Private Label brand is chosen as the reference brand in this application.

Table B4: Fit and predictive validity (cola data)

Data set: estimation
Model             Log-Lik      Brier score   Spherical score   ARMSE
 1) MNL           -3998.212    -2143.913     2825.791          0.0674
 2) MNL-TVP1      -3976.973    -2132.216     2834.305          0.0644
 3) MNL-TVP2      -3976.762    -2132.111     2834.377          0.0644
 4) MNL-TVP3      -3961.951    -2124.380     2839.038          0.0624
 5) MNL-TVP4      -3982.637    -2134.398     2833.454          0.0649
 6) MXL           -2298.415    -1235.506     3432.636          0.0486
 7) MXL-TVP1      -2288.118    -1229.644     3436.265          0.0467
 8) MXL-TVP2      -2288.062    -1229.610     3436.289          0.0467
 9) MXL-TVP3      -2311.467    -1242.464     3429.171          0.0464
10) MXL-TVP31     -2286.101    -1228.287     3437.181          0.0460
11) MXL-TVP32     -2286.045    -1228.235     3437.212          0.0460
13) MXL-VAR       -2175.053    -1206.579     3447.063          0.0399
14) MXL-RVAR      -2214.731    -1202.734     3450.226          0.0387

Data set: validation
Model             Log-Lik      Brier score   Spherical score   ARMSE
 1) MNL           -4059.919    -2180.943     2891.078          0.0689
 2) MNL-TVP1      -4037.147    -2170.768     2898.899          0.0661
 3) MNL-TVP2      -4036.983    -2170.692     2898.936          0.0661
 4) MNL-TVP3      -4034.094    -2168.261     2900.508          0.0657
 5) MNL-TVP4      -4039.916    -2171.906     2898.761          0.0660
 6) MXL           -2839.047    -1507.478     3346.757          0.0572
 7) MXL-TVP1      -2827.771    -1501.443     3351.193          0.0552
 8) MXL-TVP2      -2827.631    -1501.406     3351.215          0.0552
 9) MXL-TVP3      -2831.712    -1503.545     3349.766          0.0555
10) MXL-TVP31     -2827.820    -1501.545     3351.039          0.0552
11) MXL-TVP32     -2827.865    -1501.556     3351.029          0.0552
13) MXL-VAR       -3940.725    -1876.719     3147.624          0.0799
14) MXL-RVAR      -3918.429    -1852.920     3158.583          0.0767

Note: Best-fitting model is indicated in bold within each data set (estimation vs. validation) for each performance measure.

Table B4 reports in- and out-of-sample fit statistics for all models. Note that the MXL-VAR and MXL-RVAR once more suffer from overfitting (best in-sample fits, out-of-sample fits much worse than for the simple MXL), and we therefore do not consider them in the following discussion. Again, adding heterogeneity improves model fit considerably. However, probably due to the lack of clear time trends for shares and prices, the additional improvements in fit and predictive validity from including parameter dynamics turn out somewhat smaller here. Still, the models suggesting the best predictive performance account for both heterogeneity and time-varying parameters. Again, the estimation of the MXL-TVP4 model did not converge, and hence the model is excluded from further discussion.

Interestingly, whereas the MXL-TVP models with cubic splines (MXL-TVP1, MXL-TVP2) as well as the two “hybrid” models with cubic splines for covariate effects and random walk dynamics for brand intercepts (MXL-TVP31, MXL-TVP32) perform more or less equally well in the validation sample, the MXL-TVP3 model with random walk dynamics imposed on all parameter paths performs somewhat worse. Also, the in-sample fit of the MXL-TVP3 is worse as compared to the simple MXL, while all other MXL-TVP variants clearly outperform the simple MXL with respect to in-sample fit. Stated differently, the MXL-TVP3 seems a bit too flexible to reproduce the dynamics in the cola data in the best possible way, and a less flexible dynamic specification like the MXL-TVP1 or MXL-TVP2 seems to be the better choice here.

[Figure: panels intercept Coke, intercept Pepsi, intercept Royal Crown, price, feature, and display; x-axis: week; y-axis: parameter value; legend: MXL, MXL-TVP1, MXL-TVP2, MXL-TVP3]

Figure B5: Estimated parameter paths (cola data)

Figure B5 shows the estimated parameter paths for the simple MXL (dotted black line; no parameter dynamics), the MXL-TVP1 (dashed green line; cubic splines imposed on all parameters), the MXL-TVP2 (dash-dotted red line; cubic splines imposed on all parameters but more knots compared to MXL-TVP1), and MXL-TVP3 (solid blue line; random walk dynamics imposed on all parameters). Note that the estimated parameter paths obtained from the MXL-TVP1 and MXL-TVP2 models virtually coincide (they almost perfectly lie on top of each other); hence both models provide virtually the same results.

The brand intercepts need to be interpreted with respect to the Private Label brand (the reference brand). Accordingly, the estimation results consistently suggest that Pepsi is the brand with the highest intrinsic brand utility, followed by Coke, Royal Crown, and Private Label (independent of whether dynamics are considered or not). Interestingly, the brand value of Coke continuously decreases over time while that of Pepsi continuously increases. Also noteworthy, even the MXL-TVP3 here leads to relatively smooth parameter paths for the brand intercepts of Coke and Pepsi. The price sensitivity of households somewhat increases during the first year and then remains rather stable. Feature and display effects turn out rather flat. The largest difference between the three dynamic models is observed for the feature effect, where the more flexible MXL-TVP3 suggests no dynamics at all while the MXL-TVP1 and MXL-TVP2 models indicate a slightly increasing feature effect over time.

Altogether, the plots in Figure B5 (interpreted together with the statistical model performance measures as reported in Table B4) demonstrate why the less flexible time-varying parameter models like the MXL-TVP1 and MXL-TVP2, which lead to smoother parameter paths, already seem sufficient to model dynamics for the cola data set (even for the brand intercepts here). Note that the MXL-TVP3 model still provides a better out-of-sample fit than the static MXL model since it accounts for the time-varying effects in the brand values of Coke and Pepsi as well as in the price sensitivity of households.

Appendix C: Estimation of brand choice models with time-varying parameters using BayesX

In this appendix, we present (1) BayesX code for fitting a subset of the proposed brand choice models with time-varying parameters and (2) R code for calling the BayesX code conveniently from R. We describe the code using the cola data (see Web Appendix B.2) as an example.

BayesX (Belitz, Brezger, Klein, Kneib, Lang, & Umlauf, 2015) is a software tool for estimating structured additive regression models. Structured additive regression embraces several popular regression models such as generalized additive models (GAM), generalized additive mixed models (GAMM), generalized geoadditive mixed models (GGAMM), dynamic models, varying coefficient models, and geographically weighted regression within a unifying framework. Besides exponential family regression, BayesX also supports non-standard regression situations such as regression for categorical responses (including the brand choice models considered in this paper), hazard regression for continuous survival times, continuous time multi-state models, quantile regression, distributional regression models, and multilevel models. [7] In addition to (the standalone software) BayesX, there is an R package (R2BayesX) that provides an interface to BayesX. It provides convenient tools for model specification as well as suitable classes for printing, summarizing, and plotting results. Lastly, the companion R package BayesXsrc (Adler, Kneib, Lang, Umlauf, & Zeileis, 2013) allows installing BayesX from within R (see Umlauf, Adler, Kneib, Lang, & Zeileis, 2015 for an introduction).

[7] See http://www.uni-goettingen.de/de/bayesx/550513.html for more information.

C.1 BayesX code

We start by describing the BayesX code for the simple MNL model and then show how to modify the code to account for (different specifications of) parameter dynamics as well as heterogeneity. Code-Block C1 illustrates the code for the MNL model, which we save in the MNL.prg file.

Code-Block C1: MNL model (cola data)

    dataset d
    d.infile, maxobs = 10000 using cola.raw
    remlreg r
    delimiter = ;
    logopen using MNL/res.log;
    r.outfile = MNL/res;
    r.regress brand =
      price_catspecific +
      add_catspecific +
      disp_catspecific
      weight w, reference=3 family=multinomialcatsp maxit=100 eps=0.001
      using d;
    logclose;
    delimiter = return;
    drop d r

The code in this file consists of several parts:

(1) We define a data set object with name d. The data set has to be in a “wide” format (i.e., each choice set has only one row).
(2) Next, we load the data set, assuming that cola.raw is available in the same folder as the MNL.prg file. maxobs = 10000 is used for memory allocation and speeds up reading large data sets. It should therefore be adapted for larger data sets (> 10,000 observations), although BayesX will also manage to read larger data sets if the option has not been specified appropriately.
(3) We define the regression object r and specify mixed model based estimation via remlreg.
(4) We also define a new delimiter (;) that enables us to use “return” for line breaks to format the code more clearly.
(5) The output of the estimation (log-files and estimates) is stored according to the specified relative paths. All output files will start with “res” in a model-specific subfolder (here MNL, which needs to be created). A clear folder and file structure facilitates working with the results in the post-estimation step.
(6) The next block is relevant for the model specification. The variable brand, which contains the number of the chosen brand for each observation, is regressed on the covariates price, add, and disp. Note that the variable names must match the names in the data set (excl. brand-specific endings, e.g., price1, price2, etc.). The ending _catspecific is important for specifying generic effects of the covariates (i.e., that effects are the same across alternatives). The weight w is a dummy variable in the data set indicating which observations are used for estimation (w = 1) and validation (w = 0). In this example, we specify alternative 3 as the reference brand, and BayesX automatically adds alternative-specific intercepts for each alternative (except for the reference brand). In general, multinomial logit models are employed when defining family = multinomialcatsp. The maximum number of iterations for estimation, as well as the stopping criterion, are set via maxit and eps. Lastly, we use the data set d for estimation (and validation).
(7) In the end, we stop logging, switch the delimiter back to “return”, and delete the used objects d and r.

Modifications of the simple MNL model are straightforward (see also the reference manual of BayesX). For example, the code for the MNL-TVP2 model is depicted in Code-Block C2. Again, we save the code in a .prg file (e.g., MNLTVP2.prg).

Code-Block C2: MNL-TVP2 model (cola data)

    dataset d
    d.infile, maxobs=10000 using cola.raw
    remlreg r
    delimiter = ;
    logopen using MNLTVP2/res.log;
    r.outfile = MNLTVP2/res;
    r.regress brand =
      price_catspecific * week(psplinerw2, nrknots=52, degree=3) +
      add_catspecific * week(psplinerw2, nrknots=52, degree=3) +
      disp_catspecific * week(psplinerw2, nrknots=52, degree=3) +
      week(psplinerw2, nrknots=52, degree=3)
      weight w, reference=3 family=multinomialcatsp maxit=100 eps=0.001
      using d;
    logclose;
    delimiter = return;
    drop d r

The parameter dynamics are specified via additional interaction terms with splines of the week variable in the data set. In particular, the arguments indicate that we employ a cubic spline (degree=3), with 52 knots (nrknots=52), and second order differences for penalization (psplinerw2). Note that the week variable includes week numbers. Hence, we obtain flexible time-varying paths for each variable over weeks.

If we want an alternative dynamic specification, we only have to change the arguments for each spline. Note that the settings do not necessarily have to be the same for each variable within a model. For the random walk dynamics in the models MNL-TVP3 and MXL-TVP3, we specify week(psplinerw1, nrknots=104, degree=0). Hence, we use a zero-degree spline (degree=0), with as many knots as the number of weeks in the data set (nrknots=104), and first order differences for penalization (psplinerw1).

For models with consumer heterogeneity, we also add interactions. For example, if we want to model heterogeneity in price sensitivity via a random effect, we just have to specify price_catspecific * id(random), where id is an index variable in the data set that identifies the observations that belong to a specific household. Further, if we want to have both heterogeneity and dynamics, we simply add both. For the MXL-TVP3 model, it follows: price_catspecific * week(psplinerw1, nrknots = 104, degree = 0) + price_catspecific * id(random). Note that for dynamics and/or heterogeneity in the intercepts, we would write just id(random) and/or week(psplinerw2, nrknots=31, degree=3). BayesX automatically builds the corresponding terms for each alternative (except for the reference alternative).

The Code-Blocks C3 and C4 show the code for the MXL model and the MXL-TVP2 model, respectively. The codes for all other nonparametric models contained in Table 3 follow directly using the steps mentioned above. We save the code for each model separately in a .prg file with a meaningful name (e.g., MNL.prg). These files will be called subsequently from within R using BayesXsrc.

Code-Block C3: MXL model (cola data)

    dataset d
    d.infile, maxobs=10000 using cola.raw
    remlreg r
    delimiter = ;
    logopen using MXL/res.log;
    r.outfile = MXL/res;
    r.regress brand =
      price_catspecific +
      price_catspecific * id(random) +
      add_catspecific +
      add_catspecific * id(random) +
      disp_catspecific +
      disp_catspecific * id(random) +
      id(random)
      weight w, reference=3 family=multinomialcatsp maxit=100 eps=0.001
      using d;
    logclose;
    delimiter = return;
    drop d r

Code-Block C4: MXL-TVP2 model (cola data)

    dataset d
    d.infile, maxobs=10000 using cola.raw
    remlreg r
    delimiter = ;
    logopen using MXLTVP2/res.log;
    r.outfile = MXLTVP2/res;
    r.regress brand =
      price_catspecific * week(psplinerw2, nrknots=52, degree=3) +
      price_catspecific * id(random) +
      add_catspecific * week(psplinerw2, nrknots=52, degree=3) +
      add_catspecific * id(random) +
      disp_catspecific * week(psplinerw2, nrknots=52, degree=3) +
      disp_catspecific * id(random) +
      week(psplinerw2, nrknots=52, degree=3) +
      id(random)
      weight w, reference=3 family=multinomialcatsp maxit=100 eps=0.001
      using d;
    logclose;
    delimiter = return;
    drop d r
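Because the model variants differ only in how the terms of the regress statement are combined, the formula strings can also be generated programmatically. The following Python sketch is purely illustrative and not part of the original replication files: the function name `regress_formula` and its arguments are hypothetical, and it only reproduces the term syntax that appears in Code-Blocks C3 and C4.

```python
def regress_formula(covariates, week_spline=None, random_effects=False):
    """Compose the right-hand side of a BayesX r.regress statement as a string."""
    terms = []
    for name in covariates:
        base = f"{name}_catspecific"
        if week_spline is None:
            terms.append(base)                              # static generic effect
        else:
            terms.append(f"{base} * week({week_spline})")   # time-varying effect
        if random_effects:
            terms.append(f"{base} * id(random)")            # household heterogeneity
    if week_spline is not None:
        terms.append(f"week({week_spline})")                # time-varying intercepts
    if random_effects:
        terms.append("id(random)")                          # random intercepts
    return " + ".join(terms)


P2 = "psplinerw2, nrknots=52, degree=3"   # spline settings used for the TVP2 models
mxl = regress_formula(["price", "add", "disp"], random_effects=True)
mxl_tvp2 = regress_formula(["price", "add", "disp"], week_spline=P2, random_effects=True)
print(mxl)
print(mxl_tvp2)
```

Passing `"psplinerw1, nrknots=104, degree=0"` as `week_spline` would give the random walk terms described for the TVP3 models.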

C.2 R code

Next, we describe data preparation, model estimation, and post-estimation steps using R and the previously described BayesX code. The data of Cosguner et al. (2016) can be downloaded from the companion website of Management Science.⁸ In particular, we are interested in the file data.dat in the zip folder mnsc.2016.2649-sm.zip. Code-Block C5 shows the R code for preparing the data such that it can be used with BayesX. Specifically, we delete variables that are unnecessary for our research, keep only choices within the cola category, and randomly split the data into two parts for estimation and validation.

⁸ The link to the website is http://pubsonline.informs.org/doi/suppl/10.1287/mnsc.2016.2649. Note that the data are part of the replication files of Cosguner et al. (2016) and usage is hence restricted. Prof. Seethu Seetharaman kindly shared the data with us for our specific analysis.

Code-Block C5: Data preparation (cola data)

    # Setup ===========================================================================
    library("data.table")
    library("psych")
    library("mlogit")

    # clear work space
    rm(list = ls())

    # set working directory to folder with files
    setwd("~/Desktop/replication files/")

    # Load data =======================================================================
    data
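The random split into estimation and validation observations corresponds to the dummy variable w used in the BayesX code (w = 1 for estimation, w = 0 for validation). As a language-neutral illustration, a minimal Python sketch of such a split is given below. The function name, the seed, and the exact 50/50 share are assumptions for illustration and do not reproduce the authors' actual procedure.

```python
import random


def assign_weights(n_obs, share_estimation=0.5, seed=1):
    """Return a list of 0/1 weights: 1 = estimation sample, 0 = validation sample."""
    n_est = round(n_obs * share_estimation)
    weights = [1] * n_est + [0] * (n_obs - n_est)
    random.Random(seed).shuffle(weights)   # random assignment, reproducible via the seed
    return weights


w = assign_weights(8372)   # 8372 = number of cola purchases reported in Appendix B.2
print(sum(w), len(w) - sum(w))
```

The resulting vector would be stored as an additional column of the data set so that the same file serves both estimation and validation.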


Appendix A: Penalized maximum likelihood estimation of semiparametric multinomial logit models

We describe penalized maximum likelihood estimation of the MNL-TVP model, i.e., based on the model equation¹

$$\eta_{it}^{(j)} = f_0^{(j)}(t) + \sum_{k=1}^{K} x_{itk}^{(j)} f_k(t). \tag{A1}$$

Simplified model versions can then be derived by analogously simplifying the estimation scheme. All nonparametric functions are represented in terms of B-spline basis functions of degree $l$ (de Boor, 2001), i.e.,

$$f_0^{(j)}(t) = \sum_{m=1}^{M} \gamma_{m0}^{(j)} B_{m0}^{l}(t) = \nu_{it0}'\gamma_0^{(j)}, \tag{A2}$$

where $\nu_{it0} = (B_{10}^{l}(t), \dots, B_{M0}^{l}(t))'$ and $\gamma_0^{(j)} = (\gamma_{10}^{(j)}, \dots, \gamma_{M0}^{(j)})'$, and

$$x_{itk}^{(j)} f_k(t) = x_{itk}^{(j)} \sum_{m=1}^{M} \gamma_{mk} B_{mk}^{l}(t) = \nu_{itk}^{(j)\prime}\gamma_k, \tag{A3}$$

where $\nu_{itk}^{(j)} = (x_{itk}^{(j)} B_{1k}^{l}(t), \dots, x_{itk}^{(j)} B_{Mk}^{l}(t))'$ and $\gamma_k = (\gamma_{1k}, \dots, \gamma_{Mk})'$. Stacking all brand-specific predictors in the vector $\eta_{it} = (\eta_{it}^{(1)}, \dots, \eta_{it}^{(J-1)})'$ allows us to rewrite the model specification in matrix notation as

$$\eta_{it} = V_{it0}\gamma_0 + \sum_{k=1}^{K} V_{itk}\gamma_k, \tag{A4}$$

where

$$V_{it0} = \begin{pmatrix} \nu_{it0}' & & \\ & \ddots & \\ & & \nu_{it0}' \end{pmatrix}, \qquad V_{itk} = \begin{pmatrix} \nu_{itk}^{(1)\prime} \\ \vdots \\ \nu_{itk}^{(J-1)\prime} \end{pmatrix}, \tag{A5}$$

and $\gamma_0 = (\gamma_0^{(1)\prime}, \dots, \gamma_0^{(J-1)\prime})'$.
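To make the basis function representation more concrete, the following Python sketch evaluates a B-spline basis with the Cox-de Boor recursion and builds one row of the design vectors for an intercept function and for a covariate with a time-varying effect. It is a minimal illustration under stated assumptions: the equidistant knot placement, the covariate value, and the coefficients are hypothetical and are not taken from BayesX or from the data.

```python
def bspline_basis(t, knots, degree):
    """Evaluate all B-spline basis functions of the given degree at t
    using the Cox-de Boor recursion; `knots` must be non-decreasing."""
    n0 = len(knots) - 1
    # degree 0: indicator functions of the half-open knot intervals
    b = [1.0 if knots[m] <= t < knots[m + 1] else 0.0 for m in range(n0)]
    for d in range(1, degree + 1):
        nxt = []
        for m in range(n0 - d):
            left = right = 0.0
            if knots[m + d] > knots[m]:
                left = (t - knots[m]) / (knots[m + d] - knots[m]) * b[m]
            if knots[m + d + 1] > knots[m + 1]:
                right = (knots[m + d + 1] - t) / (knots[m + d + 1] - knots[m + 1]) * b[m + 1]
            nxt.append(left + right)
        b = nxt
    return b


degree = 3
step = 13.0
# equidistant knots covering weeks 1 to 105, extended by `degree` knots on each side
knots = [1.0 + step * i for i in range(-degree, 8 + degree + 1)]

nu_0 = bspline_basis(30.0, knots, degree)            # basis row for an intercept function in week 30
x_price = 0.75                                       # hypothetical covariate value
nu_k = [x_price * b for b in nu_0]                   # basis row scaled by the covariate
gamma_k = [-8.0] * len(nu_k)                         # hypothetical coefficients (constant effect)
effect = sum(v * g for v, g in zip(nu_k, gamma_k))   # x * f_k(t)
print(len(nu_0), effect)
```

With all coefficients equal, the basis functions sum to one inside the knot range, so the time-varying effect collapses to a constant effect, which is why the static models are nested in this representation.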

The vector of responses is linked to the vector of choice probabilities $\pi_{it} = (\pi_{it}^{(1)}, \dots, \pi_{it}^{(J-1)})'$ via the response function $h: \mathbb{R}^{J-1} \to \mathbb{R}^{J-1}$ corresponding to the multinomial model, i.e., $h(\eta_{it}) = \pi_{it} = (h^{(1)}(\eta_{it}), \dots, h^{(J-1)}(\eta_{it}))'$ and

$$h^{(j)}(\eta_{it}) = \frac{\exp(\eta_{it}^{(j)})}{1 + \sum_{s=1}^{J-1} \exp(\eta_{it}^{(s)})}. \tag{A6}$$

¹ We do not discuss random effects here because this extension is straightforward and only makes the notation more complicated.
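The response function in (A6) is the multinomial logit transformation with the predictor of the reference brand fixed at zero. A small Python sketch of this mapping is shown below; the function name and the predictor values are hypothetical, and the shift by the largest predictor is an implementation detail added here for numerical stability rather than something taken from the text.

```python
import math


def choice_probabilities(eta):
    """Map the J-1 predictors of the non-reference brands to J-1 choice
    probabilities as in (A6); the reference brand has predictor 0."""
    shift = max(0.0, max(eta))                 # largest predictor, including the reference brand's 0
    expo = [math.exp(e - shift) for e in eta]
    denom = math.exp(-shift) + sum(expo)       # exp(0 - shift) is the reference brand's term
    return [e / denom for e in expo]


eta_it = [1.8, 2.4, 1.0]                       # hypothetical predictors for three brands
pi_it = choice_probabilities(eta_it)
pi_ref = 1.0 - sum(pi_it)                      # probability of the reference brand
print(pi_it, pi_ref)
```

The probability of the reference brand follows as the remainder, so all J probabilities sum to one.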

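Finally, stacking all individual observation predictors and design matrices yields the overall model equation $\eta = V_0\gamma_0 + V_1\gamma_1 + \dots + V_K\gamma_K = V\gamma$, where $\gamma = (\gamma_0^{(1)\prime}, \dots, \gamma_0^{(J-1)\prime}, \gamma_1', \dots, \gamma_K')'$ collects all regression parameters and $V = (V_0, V_1, \dots, V_K)$ is the complete design matrix.

Penalized maximum likelihood estimation is based on the penalized (log-)likelihood

$$l_{pen}(\gamma) = l(\gamma) - \frac{1}{2}\sum_{j=1}^{J-1} \lambda_0^{(j)} \gamma_0^{(j)\prime} P_0 \gamma_0^{(j)} - \frac{1}{2}\sum_{k=1}^{K} \lambda_k \gamma_k' P_k \gamma_k = l(\gamma) - \frac{1}{2}\gamma' P \gamma, \tag{A7}$$

where $l(\gamma)$ is the usual multinomial logit log-likelihood, $\lambda_0^{(1)}, \dots, \lambda_0^{(J-1)}, \lambda_1, \dots, \lambda_K$ are smoothing parameters determining the overall smoothness of the corresponding function estimates, and $P = \mathrm{blockdiag}(\lambda_0^{(1)} P_0, \dots, \lambda_0^{(J-1)} P_0, \lambda_1 P_1, \dots, \lambda_K P_K)$ is the overall penalty matrix for the combined vector $\gamma$.

Penalized likelihood estimates can be obtained by adapting the Fisher scoring algorithm (Fahrmeir & Tutz, 2004) to the penalized likelihood setting. Taking first and second derivatives yields the score function $s_{pen}(\gamma) = V'(y - \pi) - P\gamma$ and the penalized Fisher information $F_{pen}(\gamma) = V'WV + P$, where $y = (y_{it})$ is the vector of observed binary choice decisions, $\pi = (\pi_{it})$ is the corresponding vector of choice probabilities, $W = \mathrm{blockdiag}(W_{it})$ with $W_{it} = cov(y_{it})$ and

$$cov(y_{it}) = \begin{pmatrix}
\pi_{it}^{(1)}(1 - \pi_{it}^{(1)}) & -\pi_{it}^{(1)}\pi_{it}^{(2)} & \cdots & -\pi_{it}^{(1)}\pi_{it}^{(J-1)} \\
-\pi_{it}^{(1)}\pi_{it}^{(2)} & \ddots & & \vdots \\
\vdots & & \ddots & -\pi_{it}^{(J-2)}\pi_{it}^{(J-1)} \\
-\pi_{it}^{(1)}\pi_{it}^{(J-1)} & \cdots & -\pi_{it}^{(J-2)}\pi_{it}^{(J-1)} & \pi_{it}^{(J-1)}(1 - \pi_{it}^{(J-1)})
\end{pmatrix}.$$

Fisher scoring then proceeds by iteratively updating $\gamma$ via

$$\gamma^{(r+1)} = \gamma^{(r)} + F_{pen}^{-1}(\gamma^{(r)})\, s_{pen}(\gamma^{(r)}). \tag{A8}$$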
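The link between the scoring step and a weighted least squares problem can be made explicit in a few lines. The derivation below is a sketch under the assumption that the penalized Fisher information has the form $V'WV + P$ and that the score is $V'(y - \pi) - P\gamma$; the superscript $(r)$ indexes the iteration, $\eta = V\gamma^{(r)}$ is the current predictor, $W$ and $\pi$ are evaluated at $\gamma^{(r)}$, and $\tilde{y} = \eta + W^{-1}(y - \pi)$ denotes the working response.

```latex
\begin{align*}
\gamma^{(r+1)}
  &= \gamma^{(r)} + (V'WV + P)^{-1}\left[V'(y - \pi) - P\gamma^{(r)}\right] \\
  &= (V'WV + P)^{-1}\left[(V'WV + P)\gamma^{(r)} + V'(y - \pi) - P\gamma^{(r)}\right] \\
  &= (V'WV + P)^{-1}\left[V'WV\gamma^{(r)} + V'(y - \pi)\right] \\
  &= (V'WV + P)^{-1} V'W\left[\eta + W^{-1}(y - \pi)\right]
   = (V'WV + P)^{-1} V'W\,\tilde{y}.
\end{align*}
```

The penalty term cancels in the second line, which is why the penalty enters the final expression only through the matrix that is inverted.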

Equivalently, this update step can be written in the form of an iteratively reweighted least squares fit

$$\gamma^{(r+1)} = \left(V'W(\gamma^{(r)})V + P\right)^{-1} V'W(\gamma^{(r)})\, \tilde{y}(\gamma^{(r)})$$

with working responses $\tilde{y}(\gamma) = \eta + W^{-1}(y - \pi)$. This can also be interpreted as assuming a working Gaussian model for the working responses $\tilde{y}$, where $\tilde{y} \sim N(V\gamma, W^{-1})$, and estimating $\gamma$ by weighted penalized least squares in each iteration.
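The iteratively reweighted least squares step can be sketched in a few lines of Python for the simplest case of two brands (J = 2), where the predictor is scalar and W is diagonal. This is an illustrative toy implementation, not the algorithm as coded in BayesX: the data are hypothetical, the smoothing parameter is fixed at 1 instead of being estimated by REML, the penalty is assumed to enter the update as $V'WV + P$, and the basis is a zero-degree spline with one coefficient per period and a first-order difference penalty (i.e., random walk dynamics for the intercept).

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def penalized_irls(V, y, P, iterations=25):
    """Penalized IRLS for a binary logit: gamma <- (V'WV + P)^{-1} V'W y_tilde."""
    n, p = len(V), len(V[0])
    gamma = [0.0] * p
    for _ in range(iterations):
        eta = [sum(v * g for v, g in zip(row, gamma)) for row in V]
        pi = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [q * (1.0 - q) for q in pi]                        # diagonal of W = cov(y)
        y_tilde = [e + (yi - q) / wi for e, yi, q, wi in zip(eta, y, pi, w)]
        lhs = [[sum(w[i] * V[i][a] * V[i][b] for i in range(n)) + P[a][b]
                for b in range(p)] for a in range(p)]
        rhs = [sum(w[i] * V[i][a] * y_tilde[i] for i in range(n)) for a in range(p)]
        gamma = solve(lhs, rhs)
    return gamma


counts = [1, 2, 4, 3]   # hypothetical purchases of the brand in 4 periods, 5 purchases each
V, y = [], []
for t, c in enumerate(counts):
    for i in range(5):
        V.append([1.0 if a == t else 0.0 for a in range(4)])
        y.append(1.0 if i < c else 0.0)

lam = 1.0               # fixed smoothing parameter (assumption)
D1 = [[-1.0, 1.0, 0.0, 0.0], [0.0, -1.0, 1.0, 0.0], [0.0, 0.0, -1.0, 1.0]]
P = [[lam * sum(D1[r][a] * D1[r][b] for r in range(3)) for b in range(4)] for a in range(4)]
gamma_hat = penalized_irls(V, y, P)
print(gamma_hat)
```

Compared with the unpenalized period-wise logits, the estimated path is pulled toward its neighbours, and a larger value of `lam` would flatten it further.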

This interpretation also forms the basis for determining the smoothing parameters for the various nonparametric functions, employing recent developments that connect nonparametric Gaussian models and mixed models (e.g., Fahrmeir et al., 2004; Ruppert et al., 2003). The key in these approaches is to interpret the parameters of the nonparametric functions as random effects and to relate the penalty to a corresponding random effects distribution. The smoothing parameter $\lambda$ is then a one-to-one transformation of the random effects variance, and methods to estimate this variance (such as restricted maximum likelihood estimation) can be employed to derive estimates for the smoothing parameters.

We will now briefly describe the mixed model representation for a single nonparametric function $f(t)$ (where function and category indices are dropped for simplicity). Utilizing the basis function representation, the function evaluations can be written as $f(t) = \nu'\gamma$, and estimation involves the penalty $\mathrm{pen}(\gamma) = \lambda\gamma'P\gamma$, where $P = D'D$ corresponds to the penalty matrix defined in terms of a first- or second-order difference matrix

$$D_1 = \begin{pmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \end{pmatrix} \quad \text{or} \quad D_2 = \begin{pmatrix} 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & 1 & -2 & 1 \end{pmatrix}. \tag{A9}$$

To obtain a mixed model representation of this situation, we reparametrize the regression parameters as $\gamma = U\beta + Zb$, where $U$ is a two-column matrix containing a linear basis representing the unpenalized part of the function $f(t)$, and $Z$ contains the orthogonal deviation from the linear effect. The latter can be obtained from the spectral decomposition of the penalty matrix, utilizing only eigenvectors corresponding to non-zero eigenvalues. Inserting the reparametrization into the model equation yields the following representation of the nonparametric function: $f(t) = \nu'(U\beta + Zb) = u'\beta + z'b$. Similarly, inserting the reparametrization into the penalty, it can be shown that $\beta$ remains unpenalized, corresponding to a fixed effect in the mixed model interpretation. In contrast, the penalty for $b$ turns out to be of the ridge type, i.e., $\mathrm{pen}(b) = \lambda b'b$. This relates to the assumption of i.i.d. Gaussian random effects $b \sim N(0, \sigma^2 I)$, where $\sigma^2 = 1/\lambda$. In summary, any nonparametric effect in the semiparametric MNL can be represented as the sum of fixed and random effects. Combining this interpretation with the working Gaussian model for the working responses $\tilde{y}$ allows us to perform a REML update for the smoothing parameters within each of the iterations of the Fisher scoring algorithm.
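The role of the unpenalized part can be checked numerically: a first-order difference penalty leaves constant coefficient vectors unpenalized, and a second-order difference penalty additionally leaves linear ones unpenalized. The Python sketch below illustrates this with hypothetical coefficient vectors; the function names are illustrative only.

```python
def difference_matrix(n_coef, order):
    """Return the (n_coef - order) x n_coef difference matrix of order 1 or 2."""
    stencil = {1: [-1.0, 1.0], 2: [1.0, -2.0, 1.0]}[order]
    rows = []
    for r in range(n_coef - order):
        row = [0.0] * n_coef
        row[r:r + len(stencil)] = stencil
        rows.append(row)
    return rows


def penalty(gamma, order, lam=1.0):
    """Evaluate pen(gamma) = lam * gamma' D'D gamma = lam * ||D gamma||^2."""
    D = difference_matrix(len(gamma), order)
    return lam * sum(sum(d * g for d, g in zip(row, gamma)) ** 2 for row in D)


constant = [2.0] * 6                        # constant coefficient path
linear = [0.5 * m for m in range(6)]        # linearly increasing path
wiggly = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]     # strongly fluctuating path

for name, g in [("constant", constant), ("linear", linear), ("wiggly", wiggly)]:
    print(name, penalty(g, 1), penalty(g, 2))
```

Only the fluctuating path is penalized under both orders, which mirrors the split into an unpenalized (fixed) part and a penalized (random) part of the function.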

Appendix B: Additional empirical applications

In this appendix, we present the results of two additional empirical applications. In this way, we can provide more general insights into the performance of the different models with versus without time-varying parameters as well as with versus without heterogeneity when applied to different data sets.²

² Note that we here abstain from adding reference price and brand loyalty terms since we obtained the data sets (which were kindly provided by other researchers) already in preprocessed form for our analysis.

B.1 Detergent Data

The A.C. Nielsen scanner panel data set used by Kim, Menzefricke, & Feinberg (2005) contains 6682 liquid detergent purchases of 492 households for 4 brands (A, B, C, and D)³ over a time period of 96 weeks. The households made at least 7 purchases (14 purchases on average), and the median interpurchase time is 7 weeks. Table B1 contains summary statistics for this data set, and Figure B1 shows weekly time-series plots for shares, prices, and promotional activities.

³ The real brand names cannot be revealed due to a confidentiality agreement with the original data provider.

Table B1: Summary statistics for the detergent data set (492 households, 96 weeks)

| brand name | no. purchases (estimation) | no. purchases (validation) | price ($ per unit), mean | price ($ per unit), sd | promotion (proportion of purchases), display | promotion (proportion of purchases), feature |
|---|---|---|---|---|---|---|
| A | 1015 | 987 | 5.087 | 0.583 | 0.148 | 0.233 |
| B | 952 | 943 | 5.952 | 0.632 | 0.135 | 0.217 |
| C | 627 | 633 | 5.928 | 0.515 | 0.069 | 0.064 |
| D | 774 | 751 | 5.426 | 0.675 | 0.095 | 0.118 |
| Total | 3368 | 3314 | | | | |

Similar to ketchup, we here observe a considerable amount of variation with respect to brand shares, prices, and promotional activities over time and across brands. Further, three of the four brands reveal some interesting trends in the evolution of their price level within the considered time span. For estimation and validation, we randomly split the data into two halves. Note that this setup differs from Kim et al. (2005), where observations from the last 6 weeks were kept for validation purposes. Brand D was chosen as the reference brand.

Figure B1: Time-series of brand shares, prices, and promotion variables (detergent data)
(Panels: share, price, display, feature; x-axis: week; legend: brands A, B, C, D.)

Table B2 shows the statistical performance of each model in the estimation and validation samples along the measures we already used for the ketchup study. Similar to ketchup, models with heterogeneity and parameter dynamics fit the data better (in- and out-of-sample) than models without heterogeneity and/or parameter dynamics. Also, the parametric MNL-TVP4 model does not improve much over the MNL model and is clearly inferior to the nonparametric models. MXL-VAR and MXL-RVAR are again prone to overfitting (best in-sample fits, but out-of-sample worse than the simple MXL). The model with the best predictive performance is the MXL-TVP3 model with random walk dynamics imposed on all parameters, followed by the two “hybrid” models MXL-TVP31 and MXL-TVP32. Exactly the same ordering is obtained in the estimation sample (ignoring MXL-VAR and MXL-RVAR, see above). The estimation of the MXL-TVP4 model did not converge, and hence the model is excluded from further discussion.

Table B2: Fit and predictive validity (detergent data)

Data set: estimation

| Model | Log-Lik | Brier score | Spherical score | ARMSE |
|---|---|---|---|---|
| 1) MNL | -3602.618 | -1769.785 | 2257.762 | 0.0860 |
| 2) MNL-TVP1 | -3522.767 | -1717.718 | 2292.185 | 0.0740 |
| 3) MNL-TVP2 | -3519.176 | -1715.990 | 2293.201 | 0.0736 |
| 4) MNL-TVP3 | -3450.234 | -1677.722 | 2318.509 | 0.0596 |
| 5) MNL-TVP4 | -3591.644 | -1758.451 | 2266.430 | 0.0845 |
| 6) MXL | -1794.973 | -894.985 | 2857.480 | 0.0585 |
| 7) MXL-TVP1 | -1749.240 | -865.536 | 2875.327 | 0.0523 |
| 8) MXL-TVP2 | -1741.505 | -861.287 | 2878.037 | 0.0517 |
| 9) MXL-TVP3 | -1719.617 | -847.983 | 2887.052 | 0.0468 |
| 10) MXL-TVP31 | -1726.560 | -851.951 | 2884.220 | 0.0473 |
| 11) MXL-TVP32 | -1726.974 | -852.132 | 2884.172 | 0.0473 |
| 13) MXL-VAR | -1474.795 | -790.444 | 2906.298 | 0.0390 |
| 14) MXL-RVAR | -1448.772 | -778.351 | 2909.457 | 0.0367 |

Data set: validation

| Model | Log-Lik | Brier score | Spherical score | ARMSE |
|---|---|---|---|---|
| 1) MNL | -3541.335 | -1729.832 | 2230.213 | 0.0853 |
| 2) MNL-TVP1 | -3498.172 | -1693.070 | 2252.430 | 0.0800 |
| 3) MNL-TVP2 | -3495.623 | -1691.255 | 2253.475 | 0.0797 |
| 4) MNL-TVP3 | -3470.291 | -1669.224 | 2267.843 | 0.0754 |
| 5) MNL-TVP4 | -3538.170 | -1724.286 | 2234.135 | 0.0852 |
| 6) MXL | -2400.058 | -1190.421 | 2601.103 | 0.0715 |
| 7) MXL-TVP1 | -2367.608 | -1176.283 | 2610.110 | 0.0673 |
| 8) MXL-TVP2 | -2363.134 | -1173.631 | 2611.989 | 0.0668 |
| 9) MXL-TVP3 | -2346.904 | -1164.477 | 2617.790 | 0.0643 |
| 10) MXL-TVP31 | -2351.816 | -1167.749 | 2615.462 | 0.0649 |
| 11) MXL-TVP32 | -2351.709 | -1167.736 | 2615.510 | 0.0649 |
| 13) MXL-VAR | -3759.388 | -1631.584 | 2381.478 | 0.0830 |
| 14) MXL-RVAR | -4321.185 | -1565.313 | 2408.916 | 0.0830 |

Note: Best-fitting model is indicated in bold within each data set (estimation vs. validation) for each performance measure.
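For readers who want to compute comparable measures for their own predictions, the following Python sketch evaluates the logarithmic, Brier, and spherical scores for individual purchases and sums them over observations. It assumes the common positively oriented forms of these scoring rules (larger values are better), which is consistent with the signs in the table but is an assumption here; the exact definitions used for the reported values, as well as the ARMSE, are not reproduced. All predictions in the example are hypothetical.

```python
import math


def scores(prob, chosen):
    """Return (log score, Brier score, spherical score) for one purchase.
    `prob` holds the predicted choice probabilities of all brands and
    `chosen` is the index of the purchased brand; larger values are better."""
    log_score = math.log(prob[chosen])
    brier = -sum(((1.0 if j == chosen else 0.0) - p) ** 2 for j, p in enumerate(prob))
    spherical = prob[chosen] / math.sqrt(sum(p * p for p in prob))
    return log_score, brier, spherical


# hypothetical predictions for three purchases among four brands: (probabilities, chosen brand)
predictions = [
    ([0.6, 0.2, 0.1, 0.1], 0),
    ([0.25, 0.25, 0.25, 0.25], 2),
    ([0.1, 0.7, 0.1, 0.1], 0),
]
totals = [sum(vals) for vals in zip(*(scores(p, c) for p, c in predictions))]
print(totals)   # summed log, Brier, and spherical scores
```

Summing over purchases, as done here, makes the totals depend on the sample size, so estimation and validation values are best compared across models within the same sample.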

Figure B2 illustrates how parameters evolve over time. Shown are the parameter paths for the simple MXL (dotted black line; no parameter dynamics), the MXL-TVP1 with smoother parameter paths (dashed green line; cubic splines imposed on all parameters), the MXL-TVP3 (dash-dotted red line; random walk dynamics for all parameters), and the MXL-TVP32 (solid blue line; random walks for brand intercepts, cubic splines for covariate effects).

Figure B2: Estimated parameter paths (detergent data)
(Panels: intercept A, intercept B, intercept C, price, feature, display; x-axis: week; y-axis: parameter value; legend: MXL, MXL-TVP1, MXL-TVP3, MXL-TVP32.)

Estimated brand intercepts need to be interpreted w.r.t. brand D (the reference brand). As suggested by all four models, brand B has the largest brand value (intrinsic brand utility), followed by brands C, D, and A. Except for the intercept of brand C, all estimated effects vary over time (only the MXL-TVP1 model suggests a slight positive linear trend for the value of brand C). For example, the perceived brand value of brand A drops during the first year and then considerably increases during the second year toward a higher level than at the beginning. The intrinsic utility of brand B shows even more fluctuation over time. There seems to be a seasonal pattern with a full cycle each year. Note that in the two periods with a rather low perceived value of brand B (around weeks 15 and 60), the gap in brand values between brand B and brand C almost vanishes. In contrast, in periods where the intrinsic utility of brand B is perceived as high (in particular around week 50), brand B has by far the highest brand value. Please also note that these differences, as well as the dynamics in brand values, are not that clearly visible from the brand shares displayed in Figure B1.

Price sensitivity of households increases during the first year, then strongly decreases during the first half of the second year, and finally increases again to about the same level as in the beginning. The effects of feature and display show interesting patterns, too. While the feature effect decreases over time and turns out insignificant after the first half year, the display effect reveals a seasonal pattern. In some periods, the (mean) display effect is not significant either (e.g., between weeks 30 and 50).

A comparison of the estimated parameter paths for the three time-varying parameter models (MXL-TVP1, MXL-TVP3, MXL-TVP32) together with the fit and predictive validity results (see Table B2) lets us conclude that the MXL-TVP1 is not flexible enough to reproduce the high amount of variation in the intercepts for brands A and B. In addition, while the MXL-TVP1 and MXL-TVP32 models suggest nearly the same, smoother parameter paths for the price effect, the large changes in the price sensitivity of households during the first half year are captured still better by the more flexible MXL-TVP3 and are obviously not an artefact. Unlike for ketchup, more flexibility (in the form of the most flexible TVP3 model) seems to pay off here not only for representing time variation in intrinsic brand utilities but also for covariate effects (here especially for the price parameter). Stated otherwise, since the MXL-TVP3 provides the best predictive performance, the higher fluctuation of the price effect suggested by this model seems robust and not a result of overfitting.

To obtain a better understanding of why the models of Kim et al. (2005) tend to overfit (best in-sample fits across all MNL and MXL models, worst out-of-sample fits across all MXL models, see Table B2), we display in Figure B3 the estimated parameter paths of the MXL-RVAR against the MXL-TVP3. All parameters of the MXL-RVAR model turn out larger in absolute magnitude. Hence, they are estimated on a different scale, which is not unusual for HB-MNL models (Huber & Train, 2001). To solve this issue and make the results comparable, we rescaled the estimated parameters of the MXL-RVAR model such that the average price parameters are the same for both models (Huber & Train, 2001).
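The rescaling step can be illustrated with a short Python sketch that multiplies all parameter paths of one model by a common factor so that its average price parameter matches that of the other model. This is a simplified reading of the sentence above, not a reproduction of the procedure in Huber & Train (2001); the function name and all numbers are hypothetical.

```python
def rescale_to_reference(paths, reference_price_path, price_key="price"):
    """Multiply all parameter paths of one model by a common factor so that its
    average price parameter equals that of the reference model."""
    def mean(xs):
        return sum(xs) / len(xs)

    factor = mean(reference_price_path) / mean(paths[price_key])
    rescaled = {name: [factor * v for v in path] for name, path in paths.items()}
    return rescaled, factor


# hypothetical weekly paths (three weeks only) for illustration
rvar = {"price": [-20.0, -16.0, -24.0], "feature": [2.0, -1.0, 3.0]}
tvp3_price = [-9.0, -10.0, -11.0]
rescaled, factor = rescale_to_reference(rvar, tvp3_price)
print(factor, rescaled)
```

Because every path is multiplied by the same factor, ratios between parameters within a model and the shape of each path are preserved; only the overall scale changes.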

Figure B3: Estimated parameter paths (MXL-TVP3 vs. MXL-RVAR; detergent data)
(Panels: intercept A, intercept B, intercept C, price, feature, display; x-axis: week; y-axis: parameter value; legend: MXL-RVAR, MXL-TVP3.)

Figure B3 shows that the general courses of the parameter paths are quite similar. Nevertheless, the parameter paths obtained from the MXL-RVAR model (solid grey lines) turn out much more volatile compared to the MXL-TVP3 model (dashed red lines) and suggest a very high week-to-week variation,⁴ which obviously leads to an excellent in-sample fit but at the same time hinders the model from doing a reasonable job out-of-sample. Besides, such wiggly parameter paths are hard to interpret from a managerial point of view.⁵

⁴ A closer comparison of the parameter dynamics in Figure B3 with Figures 2 and 3 in Kim et al. (2005) reveals large parallels, e.g., price sensitivities are highest between weeks 30 and 50 and lowest between weeks 70 and 80, despite the different definitions for splitting the data set into an estimation and validation sample.

⁵ All of the estimated autoregressive effects resulting from the MXL-RVAR model of Kim et al. (2005) are small or even negative (but larger than −1). Thus, the parameter paths are stationary but with low or negative autocorrelation. This results in extremely volatile week-to-week variation in parameters, and we assume that this is (at least in part) driven by the changes in the weekly panel composition (i.e., not every household purchases in this product category every week). Our approach seems to be more robust regarding this issue.

B.2 Cola data

The second data set used in this appendix is the IRI scanner panel data set from Cosguner, Chan, & Seetharaman (2016).⁶ This data set contains 8372 cola purchases of 300 households for 4 brands (Coke, Pepsi, Private label, Royal Crown) over a period of 104 weeks (1991-1993). The households made at least 7 purchases (28 purchases on average), and the median interpurchase time is 5 weeks. Table B3 contains summary statistics for this data set.

⁶ See also Chib, Seetharaman, & Strijnev (2004).

Table B3: Summary statistics for the cola data set (300 households; 104 weeks)

| brand name | no. purchases (estimation) | no. purchases (validation) | price ($ per 32 oz.), mean | price ($ per 32 oz.), sd | promotion (proportion of purchases), display | promotion (proportion of purchases), feature |
|---|---|---|---|---|---|---|
| Coke | 1054 | 1101 | 0.755 | 0.233 | 0.186 | 0.267 |
| Pepsi | 1983 | 2012 | 0.688 | 0.181 | 0.320 | 0.398 |
| Private Label | 432 | 441 | 0.537 | 0.140 | 0.107 | 0.095 |
| Royal Crown | 678 | 671 | 0.714 | 0.214 | 0.142 | 0.151 |
| Total | 4147 | 4225 | | | | |

Figure B4: Time-series of brand shares, prices, and promotion variables (cola data)
(Panels: share, price, display, feature; x-axis: week; legend: Coke, Pepsi, Private Label, Royal Crown.)

Figure B4 shows weekly time-series plots for shares, prices, and promotional activities. As for the ketchup and detergent data, we can observe a considerable amount of variation in shares, prices, and promotional activities over time for the cola brands. However, compared to ketchup and detergent, time trends in shares and/or prices are clearly less pronounced. For estimation and validation, we again randomly split the data into two halves. The Private Label brand is chosen as the reference brand in this application.

Table B4: Fit and predictive validity (cola data)

Data set: estimation

| Model | Log-Lik | Brier score | Spherical score | ARMSE |
|---|---|---|---|---|
| 1) MNL | -3998.212 | -2143.913 | 2825.791 | 0.0674 |
| 2) MNL-TVP1 | -3976.973 | -2132.216 | 2834.305 | 0.0644 |
| 3) MNL-TVP2 | -3976.762 | -2132.111 | 2834.377 | 0.0644 |
| 4) MNL-TVP3 | -3961.951 | -2124.380 | 2839.038 | 0.0624 |
| 5) MNL-TVP4 | -3982.637 | -2134.398 | 2833.454 | 0.0649 |
| 6) MXL | -2298.415 | -1235.506 | 3432.636 | 0.0486 |
| 7) MXL-TVP1 | -2288.118 | -1229.644 | 3436.265 | 0.0467 |
| 8) MXL-TVP2 | -2288.062 | -1229.610 | 3436.289 | 0.0467 |
| 9) MXL-TVP3 | -2311.467 | -1242.464 | 3429.171 | 0.0464 |
| 10) MXL-TVP31 | -2286.101 | -1228.287 | 3437.181 | 0.0460 |
| 11) MXL-TVP32 | -2286.045 | -1228.235 | 3437.212 | 0.0460 |
| 13) MXL-VAR | -2175.053 | -1206.579 | 3447.063 | 0.0399 |
| 14) MXL-RVAR | -2214.731 | -1202.734 | 3450.226 | 0.0387 |

Data set: validation

| Model | Log-Lik | Brier score | Spherical score | ARMSE |
|---|---|---|---|---|
| 1) MNL | -4059.919 | -2180.943 | 2891.078 | 0.0689 |
| 2) MNL-TVP1 | -4037.147 | -2170.768 | 2898.899 | 0.0661 |
| 3) MNL-TVP2 | -4036.983 | -2170.692 | 2898.936 | 0.0661 |
| 4) MNL-TVP3 | -4034.094 | -2168.261 | 2900.508 | 0.0657 |
| 5) MNL-TVP4 | -4039.916 | -2171.906 | 2898.761 | 0.0660 |
| 6) MXL | -2839.047 | -1507.478 | 3346.757 | 0.0572 |
| 7) MXL-TVP1 | -2827.771 | -1501.443 | 3351.193 | 0.0552 |
| 8) MXL-TVP2 | -2827.631 | -1501.406 | 3351.215 | 0.0552 |
| 9) MXL-TVP3 | -2831.712 | -1503.545 | 3349.766 | 0.0555 |
| 10) MXL-TVP31 | -2827.820 | -1501.545 | 3351.039 | 0.0552 |
| 11) MXL-TVP32 | -2827.865 | -1501.556 | 3351.029 | 0.0552 |
| 13) MXL-VAR | -3940.725 | -1876.719 | 3147.624 | 0.0799 |
| 14) MXL-RVAR | -3918.429 | -1852.920 | 3158.583 | 0.0767 |

Note: Best-fitting model is indicated in bold within each data set (estimation vs. validation) for each performance measure.

Table B4 reports in- and out-of-sample fit statistics for all models. Note that the MXL-VAR and MXL-RVAR once more suffer from overfitting (best in-sample fits, out-of-sample fits much worse than for the simple MXL), and we therefore do not consider them in the following discussion. Again, adding heterogeneity improves model fit considerably. However, probably due to the lack of clear time trends for shares and prices, the additional improvements in fit and predictive validity from including parameter dynamics turn out somewhat smaller here. Still, the models with the best predictive performance account for both heterogeneity and time-varying parameters. Again, the estimation of the MXL-TVP4 model did not converge, and hence the model is excluded from further discussion.

Interestingly, whereas the MXL-TVP models with cubic splines (MXL-TVP1, MXL-TVP2) as well as the two “hybrid” models with cubic splines for covariate effects and random walk dynamics for brand intercepts (MXL-TVP31, MXL-TVP32) perform more or less equally well in the validation sample, the MXL-TVP3 model with random walk dynamics imposed on all parameter paths performs somewhat worse. Also, the in-sample fit of the MXL-TVP3 is worse than that of the simple MXL, while all other MXL-TVP variants clearly outperform the simple MXL with respect to in-sample fit. Stated differently, the MXL-TVP3 seems a bit too flexible to reproduce the dynamics in the cola data in the best possible way, and a less flexible dynamic specification like the MXL-TVP1 or MXL-TVP2 seems to be the better choice here.
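Since the estimation and validation samples differ slightly in size, one simple way to look at overfitting is to divide the log-likelihoods by the number of purchases in each sample and compare the resulting averages. The Python sketch below does this for a few models, using the sample sizes from Table B3 and the log-likelihoods from Table B4; the per-purchase normalization itself is an illustrative device added here and is not a measure reported in the tables.

```python
n_est, n_val = 4147, 4225          # purchases in the estimation and validation samples (Table B3)
log_lik = {                        # (estimation, validation) log-likelihoods from Table B4
    "MXL":      (-2298.415, -2839.047),
    "MXL-TVP2": (-2288.062, -2827.631),
    "MXL-TVP3": (-2311.467, -2831.712),
    "MXL-RVAR": (-2214.731, -3918.429),
}

# average log-likelihood per purchase in each sample
per_obs = {m: (e / n_est, v / n_val) for m, (e, v) in log_lik.items()}
# in-sample minus out-of-sample average: a larger gap indicates stronger overfitting
gap = {m: e - v for m, (e, v) in per_obs.items()}

for m in log_lik:
    print(f"{m:9s} est {per_obs[m][0]:.3f}  val {per_obs[m][1]:.3f}  gap {gap[m]:.3f}")
```

A small gap between the two averages indicates that a model generalizes about as well as it fits, whereas a large gap points to overfitting.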

Figure B5: Estimated parameter paths (cola data)
(Panels: intercept Coke, intercept Pepsi, intercept Royal Crown, price, feature, display; x-axis: week; y-axis: parameter value; legend: MXL, MXL-TVP1, MXL-TVP2, MXL-TVP3.)

Figure B5 shows the estimated parameter paths for the simple MXL (dotted black line; no parameter dynamics), the MXL-TVP1 (dashed green line; cubic splines imposed on all parameters), the MXL-TVP2 (dash-dotted red line; cubic splines imposed on all parameters but more knots compared to MXL-TVP1), and the MXL-TVP3 (solid blue line; random walk dynamics imposed on all parameters). Note that the estimated parameter paths obtained from the MXL-TVP1 and MXL-TVP2 models virtually coincide (they lie almost perfectly on top of each other); hence, both models provide virtually the same results.

The brand intercepts need to be interpreted with respect to the Private Label brand (the reference brand). Accordingly, the estimation results consistently suggest that Pepsi is the brand with the highest intrinsic brand utility, followed by Coke, Royal Crown, and Private Label (independent of whether dynamics are considered or not). Interestingly, the brand value of Coke continuously decreases over time while that of Pepsi continuously increases. Also noteworthy, even the MXL-TVP3 here leads to relatively smooth parameter paths for the brand intercepts of Coke and Pepsi. The price sensitivity of households somewhat increases during the first year and then remains rather stable. Feature and display effects turn out rather flat. The largest difference between the three dynamic models is observed for the feature effect, where the more flexible MXL-TVP3 suggests no dynamics at all while the MXL-TVP1 and MXL-TVP2 models indicate a slightly increasing feature effect over time.

Altogether, the plots in Figure B5 (interpreted together with the statistical model performance measures reported in Table B4) demonstrate why the less flexible time-varying parameter models like the MXL-TVP1 and MXL-TVP2, which lead to smoother parameter paths, already seem sufficient to model dynamics for the cola data set (even for the brand intercepts here). Note that the MXL-TVP3 model still provides a better out-of-sample fit than the static MXL model since it accounts for the time-varying effects in the brand values of Coke and Pepsi as well as in the price sensitivity of households.

Appendix C: Estimation of brand choice models with time-varying parameters using BayesX

In this appendix, we present (1) BayesX code for fitting a subset of the proposed brand choice models with time-varying parameters and (2) R code for calling the BayesX code conveniently from R. We describe the code using the cola data as an example (see Web Appendix B.2).

BayesX (Belitz, Brezger, Klein, Kneib, Lang, & Umlauf, 2015) is a software tool for estimating structured additive regression models. Structured additive regression embraces several popular regression models such as generalized additive models (GAM), generalized additive mixed models (GAMM), generalized geoadditive mixed models (GGAMM), dynamic models, varying coefficient models, and geographically weighted regression within a unifying framework. Besides exponential family regression, BayesX also supports non-standard regression situations such as regression for categorical responses (including the brand choice models considered in this paper), hazard regression for continuous survival times, continuous time multi-state models, quantile regression, distributional regression models, and multilevel models.⁷ In addition to (the standalone software) BayesX, there is an R package (R2BayesX) that provides an interface to BayesX. It provides convenient tools for model specification as well as suitable classes for printing, summarizing, and plotting results. Lastly, the companion R package BayesXsrc (Adler, Kneib, Lang, Umlauf, & Zeileis, 2013) allows installing BayesX from within R (see Umlauf, Adler, Kneib, Lang, & Zeileis, 2015 for an introduction).

⁷ See http://www.uni-goettingen.de/de/bayesx/550513.html for more information.

C.1 BayesX code

We start describing the BayesX code with the simple MNL model and then show how to modify the code to account for (different specifications of) parameter dynamics as well as heterogeneity. Code-Block C1 illustrates the code for the MNL model, which we save in the MNL.prg file.

Code-Block C1: MNL model (cola data)

    dataset d
    d.infile, maxobs = 10000 using cola.raw
    remlreg r
    delimiter = ;
    logopen using MNL/res.log;
    r.outfile = MNL/res;
    r.regress brand =
      price_catspecific +
      add_catspecific +
      disp_catspecific
      weight w, reference=3 family=multinomialcatsp maxit=100 eps=0.001
      using d;
    logclose;
    delimiter = return;
    drop d r

The code in this file consists of several parts:

(1) We define a data set object with the name d. The data set has to be in a “wide” format (i.e., each choice set has only one row).
(2) Next, we load the data set, assuming that cola.raw is available in the same folder as the MNL.prg file. The option maxobs = 10000 is used for memory allocation and speeds up reading large data sets. It should therefore be adapted for larger data sets (> 10,000 observations), although BayesX will also manage to read larger data sets if the option has not been specified appropriately.
(3) We define the regression object r and specify mixed model based estimation via remlreg.
(4) We also define a new delimiter (;) that enables us to use “return” for line breaks to format the code more clearly.
(5) The output of the estimation (log files and estimates) is stored according to the specified relative paths. All output files will start with “res” in a model-specific subfolder (here MNL, which needs to be created). A clear folder and file structure facilitates working with the results in the post-estimation step.
(6) The next block is relevant for the model specification. The variable brand, which contains the number of the chosen brand for each observation, is regressed on the covariates price, add, and disp. Note that the variable names must match the names in the data set (excluding brand-specific endings, e.g., price1, price2, etc.). The ending _catspecific is important for specifying generic effects of the covariates (i.e., that effects are the same across alternatives). The weight w is a dummy variable in the data set indicating which observations are used for estimation (w = 1) and validation (w = 0). In this example, we specify alternative 3 as the reference brand, and BayesX automatically adds alternative-specific intercepts for each alternative (except for the reference brand). In general, multinomial logit models are employed when defining family = multinomialcatsp. The maximum number of iterations for estimation, as well as the stopping criterion, are set via maxit and eps. Lastly, we use the data set d for estimation (and validation).
(7) In the end, we stop logging, switch the delimiter back to “return”, and delete the used objects d and r.

Modifications of the simple MNL model are straightforward (see also the reference manual of BayesX). For example, the code for the MNL-TVP2 model is depicted in Code-Block C2. Again, we save the code in a .prg file (e.g., MNLTVP2.prg).

Code-Block C2: MNL-TVP2 model (cola data)

    dataset d
    d.infile, maxobs=10000 using cola.raw
    remlreg r
    delimiter = ;
    logopen using MNLTVP2/res.log;
    r.outfile = MNLTVP2/res;
    r.regress brand =
      price_catspecific * week(psplinerw2, nrknots=52, degree=3) +
      add_catspecific * week(psplinerw2, nrknots=52, degree=3) +
      disp_catspecific * week(psplinerw2, nrknots=52, degree=3) +
      week(psplinerw2, nrknots=52, degree=3)
      weight w, reference=3 family=multinomialcatsp maxit=100 eps=0.001
      using d;
    logclose;
    delimiter = return;
    drop d r

The parameter dynamics are specified via additional interaction terms with splines of the week variable in the data set. In particular, the arguments indicate that we employ a cubic spline (degree=3) with 52 knots (nrknots=52) and second-order differences for penalization (psplinerw2). Note that the week variable contains week numbers. Hence, we obtain flexible time-varying paths for each variable over weeks. If we want an alternative dynamic specification, we only have to change the arguments of each spline. Note that the settings do not necessarily have to be the same for each variable within a model. For the random walk dynamics in the models MNL-TVP3 and MXL-TVP3, we specify week(psplinerw1, nrknots=104, degree=0). Hence, we use a zero-degree spline (degree=0) with as many knots as there are weeks in the data set (nrknots=104) and first-order differences for penalization (psplinerw1).

For models with consumer heterogeneity, we also add interactions. For example, if we want to model heterogeneity in price sensitivity via a random effect, we just have to specify price_catspecific * id(random), where id is an index variable in the data set that identifies the observations belonging to a specific household. Further, if we want to have both heterogeneity and dynamics, we simply add both. For the MXL-TVP3 model, it follows: price_catspecific * week(psplinerw1, nrknots = 104, degree = 0) + price_catspecific * id(random). Note that for dynamics and/or heterogeneity in the intercepts, we would write just id(random) and/or week(psplinerw2, nrknots=31, degree=3). BayesX automatically builds the corresponding terms for each alternative (except for the reference alternative).

The Code-Blocks C3 and C4 show the code for the MXL model and the MXL-TVP2 model, respectively. The code for all other nonparametric models contained in Table 3 follows directly from the steps described above. We save the code for each model separately in a .prg file with a meaningful name (e.g., MNL.prg). These files will be called subsequently from within R using BayesXsrc.

Code-Block C3: MXL model (cola data)

dataset d
d.infile, maxobs=10000 using cola.raw
remlreg r
delimiter = ;
logopen using MXL/res.log;
r.outfile = MXL/res;
r.regress brand =
  price_catspecific + price_catspecific * id(random) +
  add_catspecific + add_catspecific * id(random) +
  disp_catspecific + disp_catspecific * id(random) +
  id(random)
  weight w, reference=3 family=multinomialcatsp maxit=100 eps=0.001 using d;
logclose;
delimiter = return;
drop d r

Code-Block C4: MXL-TVP2 model (cola data)

dataset d
d.infile, maxobs=10000 using cola.raw
remlreg r
delimiter = ;
logopen using MXLTVP2/res.log;
r.outfile = MXLTVP2/res;
r.regress brand =
  price_catspecific * week(psplinerw2, nrknots=52, degree=3) +
  price_catspecific * id(random) +
  add_catspecific * week(psplinerw2, nrknots=52, degree=3) +
  add_catspecific * id(random) +
  disp_catspecific * week(psplinerw2, nrknots=52, degree=3) +
  disp_catspecific * id(random) +
  week(psplinerw2, nrknots=52, degree=3) +
  id(random)
  weight w, reference=3 family=multinomialcatsp maxit=100 eps=0.001 using d;
logclose;
delimiter = return;
drop d r
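To give some intuition for the spline arguments used in the BayesX code of this section (degree, nrknots, and the difference order implied by psplinerw1 or psplinerw2), the following R sketch builds a B-spline design matrix on equidistant knots together with the corresponding difference penalty. It is only an illustration under stated assumptions: the helper pspline_design, the toy series y, and the fixed smoothing parameter lambda are hypothetical, the knot placement may differ from what BayesX does internally, and remlreg estimates the amount of smoothing within the mixed model framework rather than fixing it by hand.

```r
library(splines)  # ships with R; provides splineDesign()

# Hypothetical helper (not a BayesX function): B-spline design matrix on
# equidistant knots plus the matching difference penalty matrix K = D'D.
pspline_design <- function(x, n_knots, degree, diff_order) {
  inner <- seq(min(x), max(x), length.out = n_knots)  # equidistant inner knots
  step  <- inner[2] - inner[1]
  # pad with `degree` extra knots on each side so the basis covers the range
  knots <- c(min(x) - step * rev(seq_len(degree)),
             inner,
             max(x) + step * seq_len(degree))
  B <- splineDesign(knots, x, ord = degree + 1)
  D <- diff(diag(ncol(B)), differences = diff_order)  # difference operator
  list(B = B, K = crossprod(D))
}

week <- 1:104  # assumed week index, matching the 104 weeks mentioned above

# Analogue of week(psplinerw2, nrknots=52, degree=3)
ps <- pspline_design(week, n_knots = 52, degree = 3, diff_order = 2)
dim(ps$B)  # 104 rows (weeks) and 52 + 3 - 1 = 54 basis functions

# Penalized least squares on a made-up series, with lambda fixed by hand
set.seed(1)
y      <- sin(2 * pi * week / 52) + rnorm(length(week), sd = 0.3)
lambda <- 10
gamma  <- solve(crossprod(ps$B) + lambda * ps$K, crossprod(ps$B, y))
fit    <- drop(ps$B %*% gamma)  # smooth time-varying path

# Random-walk analogue, read here as one coefficient per week that is
# penalized by first-order differences (an interpretation, not BayesX code)
B_rw <- diag(length(week))
K_rw <- crossprod(diff(diag(length(week)), differences = 1))
```

Larger values of lambda pull the fitted path towards a straight line under the second-order penalty and towards a constant under the first-order penalty, which is why the two specifications produce differently shaped parameter paths.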

C.2 R code

Next, we describe data preparation, model estimation, and post-estimation steps using R and the previously described BayesX code. The data of Cosguner et al. (2016) can be downloaded from the companion website of Management Science.8 In particular, we are interested in the file data.dat in the zip folder mnsc.2016.2649-sm.zip. Code-Block C5 shows the R code for preparing the data such that it can be used with BayesX. Specifically, we delete variables that are unnecessary for our research, keep only choices within the cola category, and randomly split the data into two parts for estimation and validation.

8 The link to the website is http://pubsonline.informs.org/doi/suppl/10.1287/mnsc.2016.2649. Note that the data are part of the replication files of Cosguner et al. (2016) and usage is hence restricted. Prof. Seethu Seetharaman kindly shared the data with us for our specific analysis.
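As a small illustration of the target layout that the BayesX code expects (one row per choice set, brand-specific endings such as price1 and price2, and a weight dummy w), the following R sketch reshapes a made-up long-format table into wide format and adds a random estimation/validation split. The toy values, the column names obs and alt, the seed, and the 50/50 split are assumptions for illustration only; this is not the authors' preparation code, which is given in Code-Block C5.

```r
# Toy long-format data (made up; not the cola data): 4 choice sets x 3 brands
long <- data.frame(
  obs   = rep(1:4, each = 3),            # choice set identifier
  alt   = rep(1:3, times = 4),           # brand index
  brand = rep(c(2, 3, 1, 3), each = 3),  # chosen brand, constant per choice set
  price = c(1.09, 0.99, 1.19, 1.05, 0.95, 1.15,
            1.10, 1.00, 1.20, 0.99, 0.89, 1.09),
  disp  = c(0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0)
)

# One row per choice set; sep = "" yields price1, price2, price3, disp1, ...
wide <- reshape(long, idvar = "obs", timevar = "alt",
                v.names = c("price", "disp"), direction = "wide", sep = "")

# Random split into estimation (w = 1) and validation (w = 0) observations
set.seed(123)  # arbitrary seed for the sketch
wide$w <- 0L
wide$w[sample(nrow(wide), size = nrow(wide) / 2)] <- 1L

print(wide, row.names = FALSE)
```

A table of this shape could then be exported as a plain-text file (for example with write.table) and read by d.infile.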

Code-Block C5: Data preparation (cola data)

# Setup ===========================================================================
library("data.table")
library("psych")
library("mlogit")

# clear work space
rm(list = ls())

# set working directory to folder with files
setwd("~/Desktop/replication files/")

# Load data =======================================================================
data