Asymmetric Loss Functions for Forecasting in Criminal Justice Settings

Richard Berk
Department of Statistics
Department of Criminology
University of Pennsylvania

April 19, 2010

Abstract

The statistical procedures typically used for forecasting in criminal justice settings rest on symmetric loss functions. For quantitative response variables, overestimates are treated the same as underestimates. For categorical response variables, it does not matter in which class a case is inaccurately placed. In many criminal justice settings, symmetric costs are not responsive to the needs of stakeholders. It can follow that the forecasts are not responsive either. In this paper, we consider asymmetric loss functions that can lead to forecasting procedures far more sensitive to the real consequences of forecasting errors. Theoretical points are illustrated with examples using criminal justice data of the kind that might be used for "predictive policing."

1 Introduction

Forecasting has been an important part of criminal justice practice since the early 20th century. Much of the initial work was undertaken for parole prediction (Burgess, 1928; Borden, 1928), which later evolved into forecasting re-offending more generally (Glueck and Glueck, 1950). Early in the post-war era, a number of key methodological concerns were raised in a manner that is still insightful (Ohlin and Duncan, 1949; Goodman, 1952; 1953a; 1953b). The need to evaluate forecasting accuracy with real forecasts, not fitted values, is one example. Over the past 30 years, criminal justice forecasting applications have become more varied: forecasts of crime rates (Blumstein and Larson, 1969; Cohen and Land, 1987; Eckberg, 1995), prison populations (Berk et al., 1983; Austin et al., 2007), repeat domestic violence incidents (Berk et al., 2005), and others. New methodological issues continue to surface, and in that context it is not surprising that the performance of criminal justice forecasts has been mixed at best (Farrington, 1987; Gottfredson and Moriarty, 2006). Complaints about "weak models" are an illustration.

In this paper, we consider a somewhat unappreciated statistical feature of criminal justice forecasts: the role of loss functions. The most popular loss functions are symmetric. For quantitative response variables, the direction of the disparity between a forecast and what eventually occurs does not matter. Forecasts that are too small are treated the same as forecasts that are too large. In many policy settings, symmetric loss is not responsive to the costs associated with actions that depend on the forecasts. The recent interest in "predictive policing" (Bratton and Malinowski, 2008) is one instance. Is an overestimate of the number of future robberies in a neighborhood as costly as an underestimate? In one case, too many resources may be allocated to a particular area. In the other case, that area is underserved. The first may be wasteful, while the second may lead to an increase in crime. The consequences could be rather different and the resulting costs rather different as well.

∗ Thanks go to Roger Koenker for several helpful e-mail exchanges on his program in R for quantile regression. Thanks also go to Larry Brown and Andreas Buja for suggestions on statistical inference for quantile regression. The work reported in this paper was supported by a grant from the National Institute of Justice Predictive Policing Demonstration and Evaluation Program.
Yet, this point is overlooked in some very recent and sophisticated criminal justice forecasting (Cohen and Gorr, 2006).

Analogous issues can arise when the response variable is categorical. The class into which an observation is incorrectly forecasted does not matter. If, for example, the outcome to be forecasted is whether an individual on parole or probation will commit a homicide, false positives can have very different consequences from false negatives (Berk et al., 2009). A false positive can mean that an individual is incorrectly labeled as "high risk." A false negative can mean that a homicide that might have been prevented is not. There is no reason to assume that the costs of these two outcomes are even approximately the same. Qualitative outcomes can also play a role in predictive policing.

In the pages ahead, we expand on these illustrations. We look more formally at forecasting loss functions, consider the potential problems associated with symmetric loss, and examine some statistical forecasting procedures that build on asymmetric loss functions. Loss functions for both quantitative and categorical response variables are considered. Several examples from real data are used to illustrate key points. The overall message is straightforward: when the costs of forecasting errors are asymmetric and an asymmetric loss function is properly deployed, forecasts will more accurately inform criminal justice activities than had a symmetric loss function been used instead.

2 An Introduction to Loss Functions

All model-based statistical forecasts have loss functions built in. The loss functions are introduced as part of the model building process and generally carry over to the forecasts that follow. Occasionally, a new loss function is introduced at the forecasting stage.

Consider a quantitative response variable Y and a set of regressors X.1 There is a statistical procedure that computes some function of the regressors, f̂(X), minimizing a particular loss function L(Y, f̂(X)). For squared error, such as used in least squares methods,

L(Y, f̂(X)) = (Y − f̂(X))^2.    (1)

For absolute error, such as used in quantile regression,2

L(Y, f̂(X)) = |Y − f̂(X)|.    (2)

Sometimes these loss functions are minimized directly.3 Least squares methods for quadratic loss are the obvious example. An illustration is shown in Figure 1. Note that the loss for a residual value of 3 is the same as the loss for a residual value of -3.

1 The bold font denotes a matrix. For ease of exposition, we are not including lagged values of the response in X. For this discussion, there is no important loss of generality.
2 Absolute error, also known as L1 loss, comes up in a variety of topics within theoretical statistics (e.g., Bickel and Doksum, 2001: 18-19). Quadratic loss is sometimes called L2 loss.
3 Equations 1 and 2 are notational. They are not expressions for how the actual loss value would be computed.
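The symmetry of equations 1 and 2 is easy to see numerically. A minimal sketch in Python (ours, not the paper's code):

```python
# A sketch of the symmetric losses in equations 1 and 2:
# the sign of the residual does not affect the loss.
def quadratic_loss(residual):
    return residual ** 2

def absolute_loss(residual):
    return abs(residual)

# A residual of 3 and a residual of -3 incur identical loss under both.
assert quadratic_loss(3) == quadratic_loss(-3) == 9
assert absolute_loss(3) == absolute_loss(-3) == 3
```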


Sometimes loss functions are minimized indirectly. Maximum likelihood procedures applied to linear regression with normally distributed disturbances are the obvious example. We are concerned in this paper with the loss function itself, not the methods by which it is minimized. As already noted, there can on occasion be one loss function for the forecasting model and another loss function for the forecast itself. For example, one might use quadratic loss for normal regression and then, once the conditional distribution of the forecast is determined, apply linear loss. Instead of forecasting the conditional mean, one might forecast the conditional median.

[Figure: a quadratic loss function, plotting loss value (0 to 12) against residual values from -3 to 3.]

Figure 1: An Example of a Quadratic Loss Function

When the response variable is categorical, the ideas are the same, but the details differ substantially. Drawing on the exposition of Hastie and his colleagues (2009: 221), consider a categorical response variable G with 1, 2, . . . , K categories, sometimes called classes. There is a function of the regressors, f̂(X), that transforms the values of the regressors into the probability p̂_k of a case falling into class k. There is another function of the regressors, Ĝ(X), that uses p̂_k to place each case in a class. Thus, there can in principle be two different loss functions. With logistic regression, for example, f̂(X) is a linear combination of regressors that computes fitted probabilities. Then, Ĝ(X) assigns a class to each case using some threshold on those fitted probabilities (e.g., assign the case to the class for which it has the highest probability). Two different loss functions can be employed. Alternatively, the procedure goes directly to Ĝ(X). There is no intervening function. Then, there is a single loss function. Nearest neighbor methods as commonly used are one example, although one-step approaches usually can be reformulated, if necessary, as two-step approaches.

Two common loss functions for categorical response variables are

L(G, Ĝ(X)) = I(G ≠ Ĝ(X)),    (3)

where I() denotes an indicator variable,4 and

L(G, p̂(X)) = −2 Σ_{k=1}^{K} I(G = k) log p̂_k(X).    (4)

The first loss function is called "0-1 loss," which codes whether a classification error has been made. The sum of the classification errors is commonly transformed into a measure of forecasting accuracy (e.g., the proportion of cases correctly forecasted).5 The second loss function is −2 × the log likelihood and is called the "deviance" or "cross-entropy loss."6 The deviance is often reported for generalized linear models. For the special case of normal regression, the error sum of squares is the deviance.

Each of the loss functions shown in equations 1 through 4 is an example of symmetric loss. For equation 1, squaring removes the sign of residuals. For equation 2, the sign is removed by computing the absolute value of the residual. In both cases, once the sign is removed, fitted values that are too large are treated the same as fitted values that are too small. For equation 3, all misclassifications are treated the same regardless of the incorrect class into which a case is placed. For equation 4, all classes are treated the same with respect to classification errors because the weight given to each class is solely determined by the probability of placement in that class. And that probability has no necessary connection to the costs of classification errors.

4 In this case, the indicator operator returns a value of 1 when a case does not fall in class k, and 0 otherwise.
5 This measure assumes that the costs of false positives and false negatives are the same.
6 Again, these equations are notational, not computational.
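For concreteness, equations 3 and 4 can be sketched for a single case; this is a hypothetical illustration of ours, with invented class labels and fitted probabilities:

```python
import math

def zero_one_loss(true_class, predicted_class):
    # Equation 3: loss of 1 when the case is misclassified, 0 otherwise.
    return 1 if true_class != predicted_class else 0

def case_deviance(true_class, fitted_probs):
    # Equation 4 for a single case: -2 times the log of the fitted
    # probability of the class the case actually falls in.
    return -2.0 * math.log(fitted_probs[true_class])

# Hypothetical parole outcome with invented fitted probabilities.
probs = {"fail": 0.8, "no fail": 0.2}
assert zero_one_loss("fail", "no fail") == 1
assert zero_one_loss("fail", "fail") == 0
assert abs(case_deviance("fail", probs) + 2.0 * math.log(0.8)) < 1e-12
```

Note that the 0-1 loss is the same regardless of which wrong class is chosen, and the deviance weights classes only by their fitted probabilities: this is the symmetry the text describes.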


Statistical procedures that rest upon the loss functions just introduced, and upon all other common loss functions (e.g., McCullagh and Nelder, 1989: 34; Hastie et al., 2009: 343-350), assume symmetric loss. It follows that symmetric loss functions dominate statistical practice. Forecasting applications are no different. Yet, as Nobel Laureate Clive Granger observed long ago for a quadratic loss function C(e) used in forecasting applications, "The obvious problem with this choice for C(e) is that it is a symmetric function, whereas actual cost functions are often nonsymmetric" (Granger, 1989: 15).7 Granger goes on to explain that it can be difficult for a stakeholder to specify a preferred loss function. And even if that were possible, there were at the time some rather daunting technical obstacles to overcome before asymmetric loss functions could be readily implemented. Over the past 20 years, there have been some important statistical developments that can make asymmetric loss functions practical, and some of these require less of stakeholders than Granger assumed. We turn to a simple but powerful example.

3 Quantile Regression and Beyond

Consider a quantitative random response variable Y. Drawing on Koenker's exposition (2005: Section 3.1), Y has a continuous distribution function so that

F(y) = P(Y ≤ y).    (5)

F(y) is the cumulative distribution function for Y. Then for 0 < τ < 1 and an inverse function F⁻¹(),8

F⁻¹(τ) = inf[y : F(y) ≥ τ].    (6)

The τth quantile is a value of Y: the smallest value y for which the probability that Y is less than or equal to y is at least τ. In the most common instance, τ = .5, which represents the median. Then, less formally stated, the median is the value of y beneath which Y falls with probability .50.

7 Working within econometric traditions, Granger prefers the term cost function to the term loss function, but in this context, these are just two names for the same thing. Likewise, "nonsymmetric" means the same thing as "asymmetric."
8 An inverse function f⁻¹(), when it exists, is defined so that if f(a) = b, then f⁻¹(b) = a.
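Equation 6 can be made concrete with an empirical distribution function. A small sketch (ours; the data are invented):

```python
def empirical_quantile(data, tau):
    # Equation 6 applied to an empirical CDF: the smallest observed y with
    # F(y) >= tau, where F(y) is the proportion of observations <= y.
    ys = sorted(data)
    n = len(ys)
    for i, y in enumerate(ys, start=1):
        if i / n >= tau:
            return y

# With tau = .5 this returns the (lower) median.
assert empirical_quantile([4, 1, 3, 2], 0.5) == 2
assert empirical_quantile([4, 1, 3, 2], 0.75) == 3
```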


All other values of τ have analogous interpretations. Common values of τ are .25, symbolized by Q1, and .75, symbolized by Q3. These are called the "first quartile" and "third quartile" respectively. The median can be called the "second quartile."

Let d_i = y_i − f̂_τ(y_i), where f̂_τ(y_i) is a potential estimate of the τth quantile; d_i is a conventional deviation score, but computed for a quantile, not the mean. The quantile loss function is then

L(Y, f̂_τ(y)) = Σ_{i=1}^{n} d_i (τ − I(d_i < 0)),    (7)

where, as before, I() is an indicator function, and the sum is over all cases in the data set. The indicator variable is equal to 1 if the deviation score is negative, and 0 otherwise. So, the right hand side is equal to the deviation score multiplied by either τ or τ − 1. This is how τ directly affects the loss.

Suppose τ = .50. Then according to equation 7, the loss for deviation score d_i = 3 is 1.5. If the deviation score is d_i = −3, the loss is also 1.5. But suppose that τ = .75. Now the loss for d_i = 3 is 2.25, and the loss for d_i = −3 is .75. When the median is the target, the sign of the deviation score does not matter. The loss is the same. But when the third quartile is the target, the sign matters a lot. Positive deviation scores are given 3 times more weight than negative deviation scores as the sum over the n cases is computed.

In other words, for all quantiles other than the median, the loss function is asymmetric. Moreover, one can alter the weight given to the deviation scores by the choice of τ. Thus, by choosing the third quartile, positive deviation scores are given 3 times the weight of negative deviation scores. Had one chosen the first quartile, positive deviation scores would be given 1/3 the weight of negative deviation scores. Values of τ > .5 make positive deviations more costly than negative deviations. Values of τ < .5 make negative deviations more costly than positive deviations.

Figure 2 illustrates an asymmetric linear loss function when the quantile is less than .50. The loss for negative deviations increases much faster than the loss for positive deviations. Overestimates are more costly than underestimates. Figure 3 illustrates an asymmetric linear loss function when the quantile is greater than .50. The loss for positive deviations increases much faster than the loss for negative deviations. Underestimates are more

7

costly than overestimates.9

[Figure: an asymmetric linear loss function for a quantile less than .50; loss rises much more steeply for negative deviation scores than for positive ones.]

Figure 2: An Example of an Asymmetric Linear Loss Function for τ < .50
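The arithmetic in the text is easy to verify. A sketch of equation 7 for a single deviation score (pure Python; not code from the paper):

```python
def pinball(d, tau):
    # Equation 7 for one deviation score d = y - f_tau(y):
    # d is multiplied by tau when d >= 0 and by (tau - 1) when d < 0.
    return d * (tau - (1.0 if d < 0 else 0.0))

# Median target: the sign of the deviation does not matter.
assert pinball(3, 0.50) == 1.5 and pinball(-3, 0.50) == 1.5
# Third-quartile target: positive deviations get 3 times the weight.
assert pinball(3, 0.75) == 2.25 and pinball(-3, 0.75) == 0.75
```

The assertions reproduce the worked examples in the text for τ = .50 and τ = .75.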

The same principles apply to regression, where the target is a conditional quantile. Such quantiles are computed as a function of predictors so that the forecasts fall on a hyperplane in the space defined by the predictors and the response.10 We are basically back to equation 2. In equation 7, f̂_τ(X) replaces f̂_τ(y), and instead of talking about deviation scores, we talk about residuals. Looked at another way, in broad brush strokes this is just like least squares regression, but it is not the sum of the squared residuals that is minimized. It is the sum of the absolute values of the residuals, such that there is the option of treating negative residuals differently from positive residuals.

9 When τ = .50, the loss function is a "perfect" V such that both arms of the V have the same angle with the horizontal axis.
10 In the simplest case of a single regressor, the conditional quantiles are assumed to fall on a straight line.

[Figure: an asymmetric linear loss function for a quantile greater than .50; loss rises much more steeply for positive deviation scores than for negative ones.]

Figure 3: An Example of an Asymmetric Linear Loss Function for τ > .50

There is a large literature on quantile regression. Gilchrist (2008: 402) points out that quantile regression has its roots in work by Galton in the 1880s, but that Galton's work was largely ignored until more recent papers by Parzen (e.g., Parzen, 1979). It then fell to Koenker and Bassett (1978) and Koenker (2005) to provide a full-blown, operational regression formulation. A number of important extensions followed (e.g., Oberhofer, 1982; Tibshirani, 1996; Kriegler, 2007). The general properties of quantile regression are now well known (Berk, 2004: 26). Britt (2008) offers a criminal justice application in which sentencing disparities are analyzed. In contrast, the literature on forecasting with quantile regression is relatively sparse, and the emphasis has been primarily on the benefits when conditional quantiles are used rather than the conditional mean (Koenker and Zhao, 1996; Tay and Wallius, 2000).

The loss function follows from the conditional quantile to be forecasted. That reasoning legitimately can be reversed: the relative costs of forecasting errors are first determined, from which one arrives at the appropriate quantile to be forecasted. That is the stance taken in this paper.

When quantile regression is used as just described, it is a one-step procedure. A forecast is just a conditional quantile, much like a fitted value. For


the forecast, the outcome is just not known. The same loss function is used for both the fitted values and the forecast. However, one can forecast quantiles in a two-step manner that in some situations provides greater precision. For example, one can apply normal regression to forecast the full conditional distribution of the response and then use as the forecast the conditional third quartile of that conditional distribution (Brown et al., 2005). The first step employs quadratic loss. The second step employs linear loss.
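The two-step idea can be sketched with Python's standard library. The numbers here are illustrative, and the normal-error assumption is ours for the example:

```python
from statistics import NormalDist

# Step 1 (assumed done elsewhere under quadratic loss): a conditional mean
# and residual standard deviation, e.g., from a normal regression.
mu_hat, sigma_hat = 46.9, 5.0   # illustrative values, not the paper's estimates

# Step 2 (linear loss): forecast the conditional third quartile of the
# predictive distribution rather than its mean.
q3_forecast = NormalDist(mu_hat, sigma_hat).inv_cdf(0.75)
```

The forecast shifts above the conditional mean by about two-thirds of a standard deviation, which is the τ = .75 point of the assumed normal predictive distribution.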

3.1 An Illustration

Quantile regression can be a useful forecasting tool for the amount of crime in a particular geographical area. In this instance, the data come from the Central police reporting district in the city of Los Angeles. There are over 4 years of data on the number of violent crimes aggregated up to the week. One might want to forecast the amount of violent crime a week in advance for that reporting district. District commanders often allocate law enforcement resources on a week-to-week basis.

Predictors include earlier crime counts for the Central Reporting District and earlier crime counts for the four contiguous reporting districts: Newton, Rampart, North East, and Hollenbeck. All of the predictors are lagged because the goal is forecasting. From what is known this week and perhaps earlier, the predicted amount of crime can be used to anticipate what may happen the following week. The lagged values for the Central Reporting District can capture important temporal trends. The lagged values for the other four reporting districts can capture spatial dependence, such as offenders working across reporting district boundaries. With longer predictor lags (e.g., a month), more distant forecasting horizons can be employed. In principle, other predictors could be included.

Any of several conventional regression forecasting procedures could be used, such as those from the general linear model, ARIMA models, the generalized additive model, and more. But all assume quadratic loss. Here, this means that crime forecasts that are too high are treated the same as crime forecasts that are too low. In practice, this is probably unreasonable. Forecasts that are too high may lead to an allocation of resources to where they are not really needed. Forecasts that are too low may lead to an increase in crime that could have been prevented. There is no reason to assume that the costs of these outcomes are the same. Indeed, informal discussions with

police officials in several cities clearly indicate that they are not.

One can take an important aspect of such costs into account if it is possible to specify the relative costs of overestimates and underestimates. In many instances, relative costs can be elicited from stakeholders, at least to the satisfaction of those stakeholders (Berk et al., 2009; Berk, 2009). Moreover, if one proceeds with conventional quadratic loss, equal relative costs are automatically being assumed. These also need to be explicitly justified. There is really no way to avoid the issue of relative costs.

Consider the three cost ratios of overestimates to underestimates used earlier: 3 to 1, 1 to 1, and 1 to 3.11 One quantile regression analysis was undertaken for each cost ratio. The first implies fitting the conditional first quartile (Q1|X). The second implies fitting the conditional second quartile (Q2|X), also known as the median. The third implies fitting the conditional third quartile (Q3|X).

Variable      Q1 Coefficients   Q2 Coefficients   Q3 Coefficients
Intercept          15.04             18.98             19.53
Newton              0.28              0.34              0.27
Rampart            -0.01             -0.01              0.01
North East         -0.19              0.12              0.21
Hollenbeck          0.04              0.10              0.24
Central             0.28              0.29              0.31

Table 1: Regression Coefficients for Quantile Regression Forecasting Models of Violent Crime in the Central Reporting District Using One Week Lags for All Predictors (Coefficients with p-values < .05 for the usual null hypothesis in bold font)
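The correspondence between a stated cost ratio and the quantile to fit can be written down directly. A sketch (our formula, inferred from the examples in the text; not given in the paper):

```python
def tau_from_costs(cost_overestimate, cost_underestimate):
    # The quantile implied by the stated relative costs: tau is the share
    # of total cost carried by underestimates (an inferred rule of thumb).
    return cost_underestimate / (cost_overestimate + cost_underestimate)

assert tau_from_costs(3, 1) == 0.25   # overestimates 3x costlier -> Q1
assert tau_from_costs(1, 1) == 0.50   # equal costs -> the median
assert tau_from_costs(1, 3) == 0.75   # underestimates 3x costlier -> Q3
```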

Table 1 shows the three sets of regression coefficients and the results of bootstrapped statistical tests. Each regression coefficient represents the change in the conditional quantile for a unit change in its associated reporting district regressor. For these data, a unit change is one crime. Thus, for every three additional violent crimes per week in the Central Reporting District, there is a little less than one additional violent crime the following week. The lagged values for violent crime in the Newton Reporting District

Any other cost ratios are fair game. For example, using a quantile of .8 would imply relative costs of 1 to 4 for overestimates compared to underestimates.


have a similar interpretation. No causal inferences need be made, in part because the three quantile regressions are not intended to be causal models. Moreover, precisely why there is apparently a relatively strong link between violent crime in the Central Reporting District and violent crime in the Newton Reporting District for all three regressions should be explored, but it is not a salient matter here. The intent is to concentrate on forecasting.

Table 1 also reports the test results for the usual null hypothesis that the parameter value for any given regression coefficient is zero. There are several different methods to arrive at reasonable standard errors for quantile regression coefficients, but no consensus yet on which is best (Koenker, 1994: section 3.10). For these data, the residuals are effectively independent with constant variance, given the past values of the regressors, so that most of the available methods should work in principle. And in fact, all tests conditional on the past values of the regressors lead to effectively the same results.12 It is important to keep in mind, however, that the reported test results can depend heavily on the regression coefficient estimates being unbiased. We make no such claims here, and will return to these issues shortly.

Many of the regression coefficients are much the same for the three regressions, implying roughly parallel sets of fitted values.13 Does the quantile chosen then make a difference in the one-week-ahead fitted values for the Central Police Division? One could reasonably expect to see meaningful differences in at least the fitted values because the intercept estimates in Table 1 vary from a little over 15 crimes to a little less than 20 crimes. Figure 4 shows the Central Reporting District violent crime time series by week with the three sets of fitted values overlaid. Each set of fitted values has a correlation of nearly .70 with its observed crime counts.

The fit quality implies that there is some important structure in the violent crime time series for the Central Reporting District. The violent crime patterns over weeks are not just white noise. This is very important because otherwise one can do no better than forecasting an unconditional quantile (e.g., the overall median).

The fitted values from the Q1 regression are shown in green. The fitted

12 There are a number of very interesting bootstrap options that were also examined (Kocherginsky et al., 2005; He and Hu, 2002; Parzen et al., 1994; Bose and Chatterjee, 2003). Again, conditioning on "history," the results were pretty much the same. One conditions on history because of the forecasting application. Given what is known now, what do we anticipate a week from now?
13 In general, this need not be the case. Regression coefficients can vary substantially depending on the quantile used.


[Figure: time series of weekly violent crime counts (roughly 40 to 120 crimes) over about 400 weeks, with quantile regression fitted values for Q1 (green), Q2 (blue), and Q3 (red) overlaid.]

Figure 4: Time Series of Violent Crime with Three Sets of Fitted Values Overlaid

values from the Q2 regression are shown in blue. They assume equal costs and represent forecasting business as usual. The fitted values from the Q3 regression are shown in red. One can easily see how the fitted values are shifted upward as one moves from Q1 to Q2 to Q3. This is consistent, respectively, with overestimates being 3 times more costly than underestimates for Q1, overestimates being equally costly as underestimates for Q2, and overestimates being 1/3 as costly as underestimates for Q3. The quantile used really matters for the fitted values; therefore, it should matter substantially for forecasts.14

A week's worth of new data was appended to the existing data set, and one forecast for each quantile regression was made using the existing model; the model's parameters were not re-estimated with the new data included. For this demonstration, the one-week-ahead violent crime count is known.

14 The fitted values are not forecasts even though they are one-week-ahead conditional quantiles. The one-week-ahead observed values of the response variable were used to construct them. This is important because fitted values will generally overstate how well a model forecasts.


In practice, it would not be.15 Table 2 includes for each of the three quartiles the forecasted crime counts and estimated upper and lower forecasting bounds. The estimates increase, from 39.9 to 46.9 to 51.5 from Q1 to Q2 to Q3 . For this illustration, the oneweek-ahead crime count is known to be 34.0. Although the estimate using Q1 is the closest to 34.0, the estimate chosen is determined by the relative costs of overestimates to underestimates one imposes. The conventional default value of 46.9 assumes equal costs. If overestimates are 3 times more costly than underestimates, 39.9 is used as the forecast. If overestimates are 1/3rd as costly as underestimates, 51.5 is used as the forecast. The difference between equal costs business as usual and the two other forecasts is roughly 10%. These may be modest differences, but the costs differentials are modest too. More dramatic cost differences will lead to more dramatic differences in forecasts. We will see later a case in which a 20 to 1 cost ratio, rather than 1 to 1, was properly used in a real application. The lower and upper bounds were constructed bootstrapping the quantile regression model. They are used to compute error bars.16 One can assume that 95% of the bootstrap forecasted values fell between the lower and upper bounds. However, the bounds do not represent true confidence intervals because the values forecasted almost certainly contain some bias. Rather, the upper and low bounds indicate the likely range of forecasted values in independent realizations of the data, assuming that the social processes responsible for violent crime in the Central Reporting District do not change systematically between the time the forecast is made, and the time the crime count for the following week materializes. The quantile approach to asymmetric loss functions can be applied with several other statistical procedures sometimes used in forecasting. One pow15

It is also possible to update the model over time (Swanson and White, 1997). As each forecasted observation materializes, that new observation is added to the data set and the model is rebuilt before a new forecast is made. This is an appealing idea in principle, but it raises some tricky issues for statistical inference and the proper discounting of p-values for multiple and dependent statistical tests. Moreover, updating the forecasting model after each new observation becomes available may not always be practical. 16 Standard errors for quantile regression forecasts can be computed in several different ways. Bootstrap methods require the fewest assumptions but will often lead to the widest error bars. Consistent with how the statistical tests for the quantile regression coefficients were computed, we treated earlier values of all regressors as fixed because the forecast is made conditional on “history.” Several alternative bootstrap approaches based on somewhat different formulations produced error bars that were quite similar.

14

Lower Bound Estimate Upper Bound

Q1 Model 37.9 39.9 43.0

Q2 Model 44.9 46.9 48.7

Q3 Model 49.8 51.5 53.3

Table 2: Forecasted Values and Error Bands When the Observed Weekly Value is 34 Crimes

erful example adopts a quantile approach within an analogue of the generalized additive model (Hastie and Tibshirani, 1990). The righthand side can be a linear combination of parametric and/or nonparametric quantile functions linking each regressor to the response variable (Koenker, Ng, and Portnoy, 1994). Figure 5 is an illustration. The data are from the 7th Police District in Washington, D.C.. The observed values, shown with blue circles, are the number of Part I crimes per week. The fitted values are shown in black. They are constructed from three predictors: lagged values of the response, a regressor for non-seasonal trends, and a regression for season cycles. All are implemented in a non-parametric fashion as cubic regression splines from a B-spline basis (Koenker, 2009). The forecast and the bootstrapped 95% forecasting error bands are shown in red. A seasonal pattern over the full time series is apparent with a linear upward trend starting early in 2005. In this instance, the third quantile is fitted. Another good approach, especially when there are a very large number of predictors and highly nonlinear functional forms are anticipated, is machine learning. One option is to use quantiles in stochastic gradient boosting (Kriegler, 2007). Another option is to use quantile random forests (Meinshausen, 2006). Random forests is implemented as usual for quantitative response variables. But in the final step when fitted values are constructed, the conditional quantile is used rather than the conditional mean. Quantile stochastic gradient boosting may well turn out to be the best choice (Kriegler and Berk, 2009), but a discussion is beyond the scope of this paper.17 To date, there seems to be no principled way to build in asymmetric costs for other than linear loss functions. So, there does not seem to be an obvious analogue to quantile regression for squared loss. There is some promising 17

All three procedures — quantile nonparametric regression, quantile stochastic gradient boosting, and quantile random forests — are available in R and work well.

15

[Figure 5 appears here. Plot title: "Multiple B-Spline Quantile Autoregression with Weekly Trends for the 7th District (Tau=.75)." Vertical axis: Number of Crimes Per Week; horizontal axis: Week and Year, 2001 through 2008.]

Figure 5: Time Series of Violent Crime Forecast from Nonparametric Quantile Regression

work, however, within machine learning traditions (Kriegler, 2007).
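The asymmetric linear ("pinball") loss underlying all of these quantile procedures can be made concrete. The sketch below, a minimal illustration in plain NumPy (the function and variable names are ours, not from any package), verifies numerically that the constant forecast minimizing the pinball loss at τ = .75 is the 75th percentile of the data:

```python
import numpy as np

def pinball_loss(y, q, tau):
    """Asymmetric linear loss: underestimates are weighted by tau,
    overestimates by (1 - tau)."""
    resid = y - q
    return np.mean(np.where(resid >= 0, tau * resid, (tau - 1) * resid))

rng = np.random.default_rng(0)
y = rng.poisson(lam=60, size=5000)  # e.g., synthetic weekly crime counts

# Search over candidate constant forecasts.
candidates = np.arange(y.min(), y.max() + 1)
losses = [pinball_loss(y, q, tau=0.75) for q in candidates]
best = candidates[np.argmin(losses)]

# The minimizer coincides (up to discreteness) with the 75th percentile.
print(best, np.percentile(y, 75))
```

This is the sense in which fitting at τ = .75 builds in a 3-to-1 asymmetry: underestimates are penalized three times as heavily as overestimates.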

4  Logistic Regression and Beyond

When a response variable is categorical, the approach can be quite different. For ease of exposition, we focus on the binary case. For a binary response variable Y, we begin with conventional logistic regression, in which

    ln[P/(1 − P)] = Xβ,    (8)

and the loss function is the deviance as defined in equation 4. The goal is to accurately forecast the class into which a case will fall. For logistic regression, there is an easy way to take the costs of forecasting errors into account. Suppose logistic regression is applied to a data set that includes the values of key predictors and the binary outcome of interest. Estimates of the regression coefficients are obtained. Consider now a new observation for which only the values of the predictors are known. The intent is to forecast into which of the two outcome classes the case will fall.
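Solving equation 8 for P gives P = 1/(1 + e^{−Xβ}), the fitted probability on which the forecast for a new case is based. A minimal sketch in plain Python (the coefficient values and predictors are hypothetical, chosen only for illustration):

```python
import math

def fitted_probability(x, beta):
    """Invert the logit in equation 8: ln(P/(1-P)) = Xb  =>  P = 1/(1+exp(-Xb))."""
    xb = sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-xb))

# Hypothetical coefficients: intercept, prior arrests, age at release.
beta = [-1.2, 0.35, -0.04]
new_case = [1.0, 4.0, 27.0]  # predictor values for a new offender
p = fitted_probability(new_case, beta)
print(round(p, 3))  # about 0.293 for these illustrative values
```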

For example, one might be interested in forecasting failure on parole from information available shortly before release from prison. Conventionally, the forecast into one class or the other (e.g., fail or not) depends on which side of the 50-50 midpoint a case falls. If for a given case the model's fitted probability is equal to or greater than .50, the case is assigned to one category. If for that case the fitted probability is less than .50, the case is assigned to the other category. This is analogous to using the median (i.e., Q2) in quantile regression. Misclassification into either class has the same costs. The analogy to quantile regression carries over more broadly. Suppose that in the parole failure analysis an arrest is coded as "1" and no arrest is coded as "0." Then, if one uses a threshold of .75 instead of .50, one has assumed that the cost of falsely forecasting an arrest is 3 times the cost of falsely forecasting no arrest. Likewise, if one uses a threshold of .25, one has assumed that the cost of falsely forecasting an arrest is 1/3rd the cost of falsely forecasting no arrest. But unlike quantile regression, the introduction of asymmetric costs is undertaken in the very last analysis step. Only the forecasted class is affected. The regression coefficients and all other output do not change. If there are no plans to use the regression output for anything other than the forecasts, no important harm may be done. Usually, however, there is at least some reliance on the regression coefficients to help justify the forecasts made. For example, stakeholders might be skeptical about a forecast if an offender's prior record were not strongly related to the chances of failure on parole, other things equal. Probably the earliest multivariate classification procedure that could build in asymmetric costs at the front end is classification and regression trees, also known as CART (Breiman et al., 1984). CART is now familiar to many researchers.
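Before turning to CART, the correspondence between a threshold and an implied cost ratio can be sketched directly. A threshold t implies that falsely forecasting an arrest costs t/(1 − t) times as much as falsely forecasting no arrest (a minimal sketch; the function names are ours):

```python
def implied_cost_ratio(threshold):
    """Cost of falsely forecasting an arrest, relative to the cost of
    falsely forecasting no arrest, implied by a classification threshold."""
    return threshold / (1.0 - threshold)

def forecast_class(p, threshold):
    """Assign the 'arrest' class (1) when the fitted probability reaches
    the threshold, otherwise 'no arrest' (0)."""
    return 1 if p >= threshold else 0

# The thresholds discussed in the text: .25 implies 1/3, .50 implies 1, .75 implies 3.
for t in (0.25, 0.50, 0.75):
    print(t, implied_cost_ratio(t), forecast_class(0.6, t))
```

Note that a case with fitted probability .6 is forecast as an arrest under the .25 and .50 thresholds, but not under the .75 threshold.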
What may not be so widely known is that there are two ways within CART to properly introduce asymmetric costs. One way is to explicitly specify different costs for false positives and false negatives. Another way is to alter the prior probability of the response.18 Either method leads to the same outcome because, somewhat surprisingly, the two methods are mathematically identical in CART. Both serve to weight the fitting function. How this works is discussed in considerable detail elsewhere (Berk, 2008:

The prior probability of the response in CART is in practice usually the marginal distribution of the response in the data being analyzed.


Section 3.5.2). CART is not much used. Among its problems is instability: small differences in the data can lead to trees with very different features. However, CART has become a building block for several very effective machine learning procedures. Random forests is an instructive example. The random forests algorithm takes the following form, substantially quoted from a recent paper by Berk et al. (2009: 209):

1. From a data set with N observations, a probability sample of size n is drawn with replacement.

2. Observations not selected are retained as the out-of-bag (OOB) data that can serve as test data for that tree. Usually, about a third of the observations will be OOB.

3. A random sample of predictors is drawn from the full set that is available. For technical reasons that are beyond the scope of this paper (Breiman, 2001), the sample is typically small (e.g., 3 predictors). The number of sampled predictors is a tuning parameter, but the usual default value for determining the predictor sample size seems to work quite well in practice.

4. Classification and regression trees (CART) is applied (Breiman et al., 1984) to obtain the first partition of the data.

5. Steps 3 and 4 are repeated for all subsequent partitions until further partitions do not improve the model's fit. The result for a categorical outcome is a classification tree.

6. Each terminal node is assigned a class in the usual CART manner. In the simplest case, the class assigned is the class with the greatest number of observations.

7. The OOB data are dropped down the tree, and each observation is assigned the class associated with the terminal node in which that observation falls. The result is the predicted class for each OOB observation for a given tree.

8. Steps 1 through 7 are repeated a large number of times to produce a large number of classification trees. The number of trees is a tuning parameter, but the results are usually not very sensitive for any number of trees of about 500 or more.

9. For each observation, classification is by majority vote over all trees for which that observation was OOB.

The averaging over trees introduces stability that CART lacks. Sampling predictors facilitates construction of nonparametric functional forms that can be highly nonlinear and responsive to subtle features of the data. The use of OOB data means that random forests does not overfit no matter how many trees are grown. No classifiers to date consistently classify and forecast more accurately than random forests (Breiman, 2001; Berk, 2008).

Because random forests builds on CART, an asymmetric loss function can be implemented in a similar manner. In the first random forests step, a stratified sampling procedure can be used. Each outcome class becomes a stratum. One can then sample disproportionately so that, in effect, the prior distribution of the response is altered. When a given class is over-sampled relative to the prior distribution of the response, the prior is shifted toward that class. Misclassifications for this class are then given more weight (see Berk, 2008: Section 5.5).19

To see how this can play out, consider a recent study in which the goal was to forecast parole or probation failures. A failure was defined as committing a homicide or attempted homicide, or being the victim of a homicide or an attempted homicide. In the population of offenders, about 2 in 100 individuals failed by this definition within 18 months of intake. After much discussion, the stakeholders agreed on a 20 to 1 ratio for the costs of false negatives to false positives. Failing to identify a prospective perpetrator or victim was taken to be 20 times more costly than falsely identifying a prospective perpetrator or victim. One can see in Table 3 that, just as intended, there are about 20 false positives for every false negative (987/45).
One false negative has about the same cost as 20 false positives. The random forests procedure correctly forecasts failure on probation or parole with about 77% accuracy and correctly forecasts no failure with about 91% accuracy. The 77% accuracy is especially notable

If one uses the procedure randomForest in R (written by Leo Breiman and Adele Cutler and ported to R by Andy Liaw), there are several other options for taking costs into account. Experience to date suggests that the stratified sampling approach should be preferred.


              Forecast: No   Forecast: Yes   Forecast Error
Actual: No        9972            987             .09
Actual: Yes         45            153             .23

Table 3: Confusion Table for Forecasts of Shooters and Victims Using a 20 to 1 Cost Ratio (Yes = shooter or victim, No = not a shooter or victim)
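The entries in Table 3 can be checked directly: each row's forecast error is the off-diagonal count divided by the row total, and the realized cost ratio is the number of false positives per false negative.

```python
# Counts from Table 3: rows are the actual class, columns the forecast.
tn, fp = 9972, 987   # actual "No":  correct forecasts, false positives
fn, tp = 45, 153     # actual "Yes": false negatives, correct forecasts

error_no = fp / (tn + fp)   # row-wise forecast error for actual "No"
error_yes = fn / (fn + tp)  # row-wise forecast error for actual "Yes"
realized = fp / fn          # false positives per false negative

print(round(error_no, 2), round(error_yes, 2), round(realized, 1))
# → 0.09 0.23 21.9
```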

              Forecast: No   Forecast: Yes   Forecast Error
Actual: No       14208              2            .0001
Actual: Yes        151            101            .60

Table 4: Confusion Table for Forecasts of Shooters and Victims Using the Cost Ratio Implied by the Prior (Yes = shooter or victim, No = not a shooter or victim)

because in the overall population of probationers and parolees, only about 2% fail. Put another way, for the subset of offenders identified by the random forests procedure, approximately 77 out of every 100 offenders will actually fail, compared to approximately 2 out of every 100 for the general population of individuals under supervision.20 The default is to draw an unstratified random sample of observations (with replacement). Whatever the relative costs implied by the prior distribution of the response, those are the relative costs imposed. What are the consequences here? Table 4 contains random forests output for the same data used for Table 3, but with the default of equal a priori costs. Because the prior distribution of the response is so unbalanced and no cost adjustments are made, there are now 151 false negatives for 2 false positives. The realized costs are dramatically upside down. Instead of a 20 to 1 cost ratio of false negatives to false positives, the cost ratio is now about 1 to 75. Most stakeholders would find this reversal of the cost ratio completely inconsistent with how the relative costs of false negatives and false positives are actually viewed. The forecasting consequences that can result from accepting the prior

It is important to emphasize that because of the OOB data, confusion tables show real forecasting accuracy, not goodness of fit. Each tree is grown from one random sample of the observations, with forecasts made for a non-overlapping random sample of the observations.


distribution of the response when it is inappropriate are dramatically represented in the table. Random forests is able to forecast with nearly 100% accuracy which offenders did not fail. But one can forecast the absence of failure with about 98% accuracy using no predictors whatsoever, given the 98% to 2% split in the response. The gain in accuracy is unimpressive. And although forecasting error drops from 9% to virtually 0%, that is not the group that policy makers are most worried about. Random forests was correct over 75% of the time forecasting failures under relative 20 to 1 costs, but with no cost adjustments, random forests is only 40% accurate. Forecasting skill is cut nearly in half. To appreciate what this means, with 20 to 1 relative costs there would be about 25 offenders in every 100 whose failure as a perpetrator or victim would not be predicted. And if not predicted, prevention is made moot. With relative costs determined solely by the prior distribution of the response, there would be approximately 60 offenders in every 100 whose failure as a perpetrator or victim would not be predicted. The number of violent crimes that might have been prevented more than doubles. In short, we see that once again the costs assigned to forecasting errors can dramatically affect the results, and automatically accepting the usual default costs implied by the prior distribution of the response can lead to forecasts that are not responsive to the needs of stakeholders.
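The reversal can be quantified directly from the counts in Tables 3 and 4: accuracy on the failures, and the realized ratios of the two kinds of error.

```python
# False negatives, true positives, and false positives from Tables 3 and 4.
t3 = {"fn": 45, "tp": 153, "fp": 987}   # 20-to-1 costs
t4 = {"fn": 151, "tp": 101, "fp": 2}    # default (prior-implied) costs

acc3 = t3["tp"] / (t3["tp"] + t3["fn"])  # accuracy forecasting failures
acc4 = t4["tp"] / (t4["tp"] + t4["fn"])
ratio3 = t3["fp"] / t3["fn"]  # false positives per false negative
ratio4 = t4["fn"] / t4["fp"]  # false negatives per false positive

print(round(acc3, 2), round(acc4, 2), round(ratio3, 1), round(ratio4, 1))
# → 0.77 0.4 21.9 75.5
```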

5  Conclusions

Not all forecasting errors have the same consequences. For decision makers who will be using forecasts, these different consequences can have real bite. It stands to reason, therefore, that forecasts that will be used should be shaped sensibly and sensitively by the costs of forecasting errors. A key instance is when the costs are asymmetric. We have considered in this paper how asymmetric costs can be introduced into loss functions to affect the forecasts produced. Sometimes different cost ratios can lead to very different forecasts. This is a good thing. Accepting the usual default of equal a priori costs in an unexamined manner is not. In practice, it is important to elicit from stakeholders their views on the relative costs of forecasting errors. If the stakeholders cannot arrive at a single assessment, it can be instructive to work with several cost ratios and provide the stakeholders with the resulting forecasts. They can then decide how to proceed.

If there are no stakeholders when the forecasts are being constructed, it can still make good sense to apply several different cost ratios and produce the forecasts for each. It will often be useful to know how sensitive the forecasts are to a likely range of cost ratios.


References

Austin, J., Clear, T., Duster, T., Greenberg, D.F., Irwin, J., McCoy, C., Mobley, A., Owen, B., and J. Page (2007) Unlocking America: Why and How to Reduce America's Prison Population. Washington, D.C., JFA Institute.

Berk, R.A., He, Y., and S.B. Sorenson (2005) "Developing a Practical Forecasting Screener for Domestic Violence Incidents." Evaluation Review 29(4): 358-383.

Berk, R.A. (2008) Statistical Learning from a Regression Perspective. New York, Springer.

Berk, R.A. (2009) "The Role of Race in Forecasts of Violent Crime." Race and Social Problems 1: 231-242.

Berk, R.A., Messinger, S.L., Rauma, D., and J.E. Berecochea (1983) "Prisons as Self-Regulating Systems: A Comparison of Historical Patterns in California for Male and Female Offenders." Law and Society Review 17(4): 547-586.

Berk, R.A., Sherman, L., Barnes, G., Kurtz, E., and L. Ahlman (2009) "Forecasting Murder within a Population of Probationers and Parolees: A High Stakes Application of Statistical Learning." Journal of the Royal Statistical Society (Series A) 172, part 1: 191-211.

Blumstein, A., and R. Larson (1969) "Models of a Total Criminal Justice System." Operations Research 17(2): 199-232.

Borden, H.G. (1928) "Factors Predicting Parole Success." Journal of the American Institute of Criminal Law and Criminology 19: 328-336.

Bose, A., and S. Chatterjee (2003) "Generalized Bootstrap for Estimators of Minimizers of Convex Functions." Journal of Statistical Planning and Inference 117: 225-239.

Bratton, W.J., and S.W. Malinowski (2008) "Police Performance Management in Practice: Taking COMPSTAT to the Next Level." Policing 2(3): 259-265.


Breiman, L. (2001) "Random Forests." Machine Learning 45: 5-32.

Breiman, L., Friedman, J.H., Olshen, R.A., and C.J. Stone (1984) Classification and Regression Trees. Monterey, CA, Wadsworth Press.

Britt, C.L. (2009) "Modeling the Distribution of Sentence Length Decisions Under a Guideline System: An Application of Quantile Regression Models." Journal of Quantitative Criminology, 4, Online.

Brown, L., Gans, N., Mandelbaum, A., Sakov, A., Shen, H., Zeltyn, S., and L. Zhao (2005) "Statistical Analysis of a Telephone Call Center: A Queueing-Science Perspective." Journal of the American Statistical Association 100(449): 36-50.

Burgess, E.W. (1928) "Factors Determining Success or Failure on Parole," in A.A. Bruce, A.J. Harno, E.W. Burgess, and J. Landesco (eds.) The Working of the Indeterminate Sentence Law and the Parole System in Illinois. Springfield, Illinois, State Board of Parole: 205-249.

Cohen, J., and W.L. Gorr (2006) Development of Crime Forecasting and Mapping Systems for Use by Police. Final Report, U.S. Department of Justice (Grant # 2001-IJ-CX-0018).

Cohen, L.W., and K.C. Land (1987) "Age Structure and Crime: Symmetry Versus Asymmetry and the Projection of Crime Rates Through the 1990s." American Sociological Review 52: 170-183.

Eckberg, D.L. (1995) "Estimates of Early Twentieth-Century U.S. Homicide Rates: An Econometric Forecasting Approach." Demography 32: 1-16.

Gilchrist, Warren (2008) "Regression Revisited." International Statistical Review 76(3): 401-418.

Glueck, S., and E. Glueck (1950) Unraveling Juvenile Delinquency. Cambridge, Mass.: Harvard University Press.

Goodman, L.A. (1952) "Generalizing the Problem of Prediction." American Sociological Review 17: 609-612.

Goodman, L.A. (1953a) "The Use and Validity of a Prediction Instrument. I. A Reformulation of the Use of a Prediction Instrument." American Journal of Sociology 58: 503-510.

Goodman, L.A. (1953b) "II. The Validation of Prediction." American Journal of Sociology 58: 510-512.

Gottfredson, S.D., and L.J. Moriarty (2006) "Statistical Risk Assessment: Old Problems and New Applications." Crime & Delinquency 52(1): 178-200.

Granger, C.W.J. (1989) Forecasting in Business and Economics, second edition. New York: Academic Press.

Hastie, T.J., and R.J. Tibshirani (1990) Generalized Additive Models. New York: Chapman & Hall.

Hastie, T.J., Tibshirani, R.J., and J. Friedman (2009) Elements of Statistical Learning, second edition. New York, Springer.

He, X., and F. Hu (2002) "Markov Chain Marginal Bootstrap." Journal of the American Statistical Association 97(459): 783-795.

Kocherginsky, M., He, X., and Y. Mu (2005) "Practical Confidence Intervals for Regression Quantiles." Journal of Computational and Graphical Statistics 14: 41-55.

Koenker, R.W., and G. Bassett (1978) "Regression Quantiles." Econometrica 46: 33-50.

Koenker, R.W. (2005) Quantile Regression. Cambridge, Cambridge University Press.

Koenker, R.W. (2009) "Quantile Regression in R: A Vignette." cran.r-project.org/web/packages/quantreg/vignettes/rq.pdf.

Koenker, R., Ng, P., and S. Portnoy (1994) "Quantile Smoothing Splines." Biometrika 81: 673-680.

Koenker, R., and Q. Zhao (1996) "Conditional Quantile Estimation and Inference for ARCH Models." Econometric Theory 12: 793-813.

Kriegler, B. (2007) Cost-Sensitive Stochastic Gradient Boosting Within a Quantitative Regression Framework. PhD dissertation, UCLA Department of Statistics.

Kriegler, B., and R.A. Berk (2009) "Estimating the Homeless Population in Los Angeles: An Application of Cost-Sensitive Stochastic Gradient Boosting." Working paper, Department of Statistics, UCLA, under review.

McCullagh, P., and J.A. Nelder (1989) Generalized Linear Models. New York: Chapman & Hall.

Meinshausen, N. (2006) "Quantile Regression Forests." Journal of Machine Learning Research 7: 983-999.

Oberhofer, W. (1982) "The Consistency of Nonlinear Regression Minimizing the L1-Norm." The Annals of Statistics 10(1): 316-319.

Ohlin, L.E., and O.D. Duncan (1949) "The Efficiency of Prediction in Criminology." American Journal of Sociology 54: 441-452.

Parzen, E. (1979) "Nonparametric Statistical Data Modeling." Journal of the American Statistical Association 74: 105-121.

Parzen, M.I., L. Wei, and Z. Ying (1994) "A Resampling Method Based on Pivotal Estimating Functions." Biometrika 81: 341-350.

Swanson, N.R., and H. White (1997) "Forecasting Economic Time Series Using Flexible Versus Fixed Specification and Linear Versus Nonlinear Econometric Models." International Journal of Forecasting 13(4): 439-461.

Tay, A.S., and K.F. Wallis (2000) "Density Forecasting: A Survey." Journal of Forecasting 19: 235-254.

Tibshirani, R.J. (1996) "Regression Shrinkage and Selection Via the LASSO." Journal of the Royal Statistical Society, Series B, 58: 267-288.
