Application of Generalized Pareto Distribution to ... - GeoScienceWorld

0 downloads 0 Views 1MB Size Report
PGA and the residual data do not always follow a lognormal distribution and are instead often better characterized by the generalized Pareto distribution (GPD).
Bulletin of the Seismological Society of America, Vol. 100, No. 1, pp. 87–101, February 2010, doi: 10.1785/0120080265

Application of Generalized Pareto Distribution to Constrain Uncertainty in Peak Ground Accelerations by L. Huyse, R. Chen, and J. A. Stamatakos

Abstract

Probabilistic seismic hazard analysis (PSHA) has become standard practice to characterize earthquake ground-motion hazard and to develop ground-motion inputs for seismic design and performance analyses. One emerging issue is the application of PSHA at low annual exceedance probabilities, particularly the characterization of scatter (aleatory variability) in the recorded ground-motion parameters, including peak ground acceleration (PGA). Lognormal distributions are commonly used to model ground-motion variability. However, a lognormal distribution, when unbounded, can yield a nonzero probability for unrealistically high ground-motion values. In this article, we evaluate the appropriateness of the lognormal assumption for low-probability ground motions by examining the tail behavior of the PGA recordings from the Pacific Earthquake Engineering Research–Next Generation Attenuation of Ground Motions (PEER NGA) database and the PGA residuals using Abrahamson– Silva NGA ground-motion relations. Our analyses show that the tail portion of the PGA and the residual data do not always follow a lognormal distribution and are instead often better characterized by the generalized Pareto distribution (GPD). We propose using a composite distribution model (CDM) that consists of a lognormal distribution (up to a threshold value of ground-motion residual) combined with GPD for the tail region. We demonstrate implications of the CDM in PSHA using a simple example and GPD parameters derived from the residual fit. Our results show that, at low annual exceedance probabilities, the CDM yields considerably lower PGA values than the unbounded lognormal distribution. It also produces smoother hazard curves than truncated lognormal distributions because the PGA increases asymptotically with a decreasing probability level. The presented approach is readily adapted to spectral accelerations and other ground-motion parameters.

Introduction relations (Abrahamson and Silva, 2008) as an example. This two-step approach allows us to demonstrate the universal applicability of peak-over-threshold models to either the raw PGA records of specific events or to the residuals in a regression model with broader applicability (the regression model covers a wider range of earthquakes, distances, and fault mechanisms). We demonstrate that a threshold value marks the tail portion of the PGA or residual data that is better characterized by the generalized Pareto distribution (GPD). Therefore, ground motions are modeled by a composite distribution model (CDM) that consists of a lognormal distribution below a threshold value and GPD for the tail region. The application of the CDM in probabilistic seismic hazard analysis (PSHA) is illustrated using a simple example and GPD parameters derived from the residual fit of the Pacific Earthquake Engineering Research–Next Generation Attenuation of Ground Motions (PEER NGA) Database (Chiou et al., 2008). The hazard

A common feature of probabilistic seismic hazard studies is the use of lognormal distributions to model the groundmotion scatter. Because the lognormal distribution belongs to the Gumbel domain of attraction (Singpurwalla, 1972; Castillo, 1988), decreasing the exceedance probability results in progressively larger ground motions that are in part due to extrapolation into the extreme tails of the lognormal distribution. In this article, we evaluate the appropriateness of the lognormal assumption for low-probability ground motions by examining peak ground accelerations (PGA). As a first example, the PGA records from two well-recorded aftershocks of the 1999 Chi-Chi earthquake are analyzed. This data set was chosen because it provided a sufficiently large sample associated with a single moment magnitude (Mw ) to which peak-over-threshold analysis techniques could be directly applied. In a second step, a peak-overthreshold analysis is performed on the PGA residuals, using the Abrahamson–Silva Next Generation Attenuation (NGA) 87

88

L. Huyse, R. Chen, and J. A. Stamatakos

results using CDM are compared with those obtained using unbounded and truncated lognormal distributions.

Tail Behavior and Peak-Over-Threshold Analysis For large return periods or very low frequencies, an accurate estimation of the tails of the distribution is of primary interest. Generally speaking, two types of statistical models of extremes can be distinguished on the basis of independently and identically distributed (iid) random samples: classical extreme values analysis (only the maxima within a given reference period are included in the analysis) and peak-over-threshold analyses (all data that exceed a given threshold value are included). The drawback of the classical extreme value analysis is that only a single data point per reference period can be used. For instance, in an application where the maximum PGA due to an earthquake is to be estimated, only the largest PGA value for a specific year or decade could be used. Such an approach severely limits the amount of data available to estimate the distribution parameters and therefore introduces significant statistical uncertainty on the distribution parameters. Peak-over-threshold analyses, on the other hand, use data that exceed a given threshold value. Pickands (1975) identified the GPD as the limiting distribution for the excesses for a sufficiently large threshold value.

Generalized Extreme Value Distribution It can be shown that, under conditions traditionally used for PSHA (an ergodic process for the recurrence of ground motions at a given site and the iid assumption; e.g., Cornell, 1968; Cornell, 1971; and Marz and Cornell, 1973; Castillo, 1988; Anderson and Brune, 1999; Anderson and Brune, 2001; Coles, 2001; Castillo et al., 2005), the distribution of maxima converges to one of three limiting distributions, which are more widely known as the Gumbel (type I), Frechet (type II), and Weibull (type III) distributions. All three distribution types have a location (λ) and scale (δ) parameter; the Frechet and Weibull distributions also have a shape parameter (c), where c > 0 for the Frechet and c < 0 for the Weibull. These three distribution types can be further combined into a generalized extreme value distribution (GEVD; also referred to as the von Mises form of the extreme value distribution—see Castillo, 1988):     x  λ 1=c FGEVD x  exp  1  c δ 

  xλ FGumbel x  exp  exp  δ 

c  0:

Statistical uncertainty is associated with each of the distribution parameters λ, δ, and c. If the data set is not too small, the covariance matrix of the parameters can most readily be obtained as the negative inverse of the expected value of the Fisher information matrix (Castillo, 1988).

Generalized Pareto Distribution Pickands (1975) showed that the GPD arises as the limiting distribution of the excesses, X  λ, for a sufficiently large threshold λ. The GPD has the following form:  FGPD x 

1=c 1  1  c xλ δ  ; xλ 1  exp δ ;

c≠0 : c0

(3)

Like the GEVD, the GPD is also characterized by a location (λ), scale (δ), and shape (c) parameter. The tail heaviness index H (Boos, 1984) is a measure of the curvature of the log-exceedance function Lx and is defined as Hx 

L00 x L0 x2

with

Lx   ln1  Fx;

(4)

where Fx is the cumulative distribution function (CDF) of the random variable X. For a GPD, the tail heaviness index is readily shown to be constant and equal to the shape parameter: H  c. In general, the tail heaviness index H benchmarks the tail behavior against an exponential tail. Depending on the value of H (or c), three types of tails can be discerned: light, neutral, and heavy (Fig. 1). When the tail heaviness index is negative, the distribution is bounded, and the maximum attainable value for the random variable X is reached when

c ≠ 0; (1)

where the  denotes that the GEVD is defined only when the term inside the square brackets is positive. When the parameter c is zero, the Gumbel distribution is retrieved as a limiting case of the GEVD:

(2)

Figure 1.

Three types of tail distributions.

Generalized Pareto Distribution to Constrain Uncertainty in Low-Probability Peak Ground Accelerations 

 xmax  λ δ  0⇔xmax  λ  : 1c δ c

(5)

Various estimators for the tail heaviness index, GPD parameters, and their statistical uncertainties exist in the statistical literature (e.g., Hill, 1975; Hosking and Wallis, 1987; Maes, 1995; Caers and Maes, 1998; Hsieh, 1999; Drees et al., 2000). Although the GPD distribution is used to model all large values and the GEVD models all extremes, each of the GPD types is tail equivalent with an extreme value distribution (Table 1).

Application to the Chi-Chi Aftershock PGA Data As a first example, the PGA records from two wellrecorded aftershocks of the 1999 Chi-Chi earthquake are analyzed. These data are part of the PEER NGA database, which is discussed in detail in Chiou et al (2008). Both aftershocks have moment magnitude of 6.2 and a reverse faulting mechanism. Together, there are 564 records. This data set was chosen because it provided a sufficiently large sample associated with aftershocks of the same magnitude and similar source characteristics, and the recording sites have similar crustal characteristics. Peak-overthreshold analysis techniques could be directly applied to data grouped by distance bins without the use of any type of regression modeling, which might influence the tail behavior. However, it should be noted that other factors, such as site conditions and path characteristics, can introduce sample bias. Therefore, the peak-over-threshold analysis on PGA raw data is only exploratory in nature. The empirical distribution of the PGA was analyzed, and the appropriateness of the traditional lognormal distribution model was assessed. To this extent, the Chi-Chi data set was partitioned into bins according to their closest distance to ruptured area (r), and the empirical distribution of the peak ground acceleration was plotted on lognormal probability paper. Figure 2 shows that the data generally follow a lognormal distribution (i.e., a straight line). Because the slope in the lognormal plot is essentially identical for all closestdistance bins, we conclude that the coefficient of variation is independent of the distance r. The data was also plotted on a log-exceedance or tail plot in Figure 3 to become acquainted with its upper tail characteristics. With the exception of the 35–45 km distance bin, the PGA distribution appears to have a finite upper bound.

Table 1 Tail Equivalence between Generalized Pareto and Extreme Value Distributions Tail

Hx

PDF of X  λ

PDF of Extremes

Light Neutral Heavy

− 0 

Beta Exponential Pareto

Weibull (c < 0) Gumbel (c  0) Frechet (c > 0)

89

Although insufficient data exist to draw firm conclusions, it is not unreasonable to surmise that the lack of upper bound in the 35–45 km series is perhaps caused by statistical fluctuations. Figure 3 also reveals that the maximum PGA decreases with increasing distance. Because all data were included in Figure 3 (equivalent to a threshold value λ  0), the exploratory modeling gives only a general indication of the tail behavior, which then must be further quantified. Because it is generally accepted that log (PGA) may be proportional to logr, where r is the closest distance to the rupture plane, equal logr bins were created, and the number of data within each bin is recorded in Figure 4. Given that the tail region typically contains only a few percent of all data, many of the distance range bins do not contain sufficient number of data points for a tail analysis. There are sufficient data in the 50–70 km range, and thus we illustrate the tail fitting procedure using the 130 data points in that distance bin. In a first step, we determine the threshold λ. There are several procedures that can be used to determine the threshold (Caers and Maes, 1998; Coles, 2001; Guillou and Hall, 2001). We adopted a straightforward graphical procedure, based on the conditional mean excess E(PGA  λ j PGA > λ), where E: denotes the expectation operator. If the random variable follows a GPD, then the conditional mean excess is a linear function of the threshold level λ. This leads to a selection procedure for λ, as the start point of the last linear segment of the conditional mean excess curve. Figure 5 shows the conditional mean excess as function of the threshold level λ. The resulting threshold level is λ  0:1275 g. At this threshold, only nine data points exceed this level—a sample size that is perhaps too small to draw meaningful inference. We therefore also consider λ  0:085 g and λ  0:065 g as possible threshold levels. The GPD parameters and other tail statistics that are associated with each of the threshold levels are summarized in Table 2, including number of data points in the tail region (N tail ), PGA upper bound, and the ninety-fifth percentile upper confidence limit. The shape parameter is essentially constant for all three threshold levels. This example thus illustrates the invariance of the tail heaviness index with respect to a shift of the threshold deeper into the tail region. Because the tail heaviness is negative, the GPD is bounded, and an upper bound limit for the PGA can be estimated. Figure 6 shows the Chi-Chi PGA data, as well as the GPD fit and 95% confidence upper limit; the figure also shows the general lognormal model (with parameters calculated from the statistical moments of all the PGA data). We thereby show that, in the tail region, the lognormal model significantly overestimates the PGA for a given reliability level 1  FX x or log-exceedance value. The different shapes of the GPD and the lognormal model illustrate that the two distributions are not tail equivalent (Castillo et al., 1995); the lognormal distribution belongs to the unbounded Gumbel domain of attraction, whereas the data clearly suggest convergence to a bounded Weibull extreme value

90

L. Huyse, R. Chen, and J. A. Stamatakos

Figure 2.

Lognormal probability plot for Chi-Chi peak ground acceleration data.

distribution. The GPD model is able to better capture the curvature in the data. In addition, theoretical considerations support the use of the GPD model beyond the data range instead of relying on the lognormal model (see Pickands, 1975). Some researchers have also used Akaike’s information criterion (Akaike, 1974) to justify the selection of the GPD over the lognormal model. Based on this example, we conclude that, while the unbounded lognormal PGA gives increasingly large groundmotion values as a function of the return period, the GPD model for the 50–70 km distance bin has a finite upper bound on the PGA. For exceedance probabilities less than 7% (i.e., within the tail region), the lognormal model assumption overestimates the maximum PGA. The tail of the lognormal distribution is much longer than that of GPD. Because λ and c are subject to statistical uncertainty, the maximum PGA value will be uncertain as well. The ninetyfifth percentile confidence upper limits listed in Table 2 are first-order approximations. It can be seen that the confidence intervals grow progressively larger with increasing threshold level λ.

Application to PGA Residuals As a second example, we performed a peak-overthreshold analysis on the total residuals of the PGA. The total residual includes both the interevent and intraevent residuals and is defined as ε  lnPGAobserved   lnPGApredicted ;

(6)

where PGApredicted is the peak ground acceleration calculated using a ground-motion model for the chosen earthquake and recording site. In the literature, numerous empirical ground-motion models, often known as ground-motion prediction equations (GMPEs), have been proposed to describe the dependence of ground-motion parameters on earthquake magnitude, sourceto-site distance, and other source, travel-path, and local-site soil and rock characteristics. The most comprehensive and integrated effort to develop GMPEs is the recent NGA project (Power et al., 2008). Five NGA ground-motion models have been developed, including the ground-motion relations of Abrahamson and Silva (2008). The Abrahamson–Silva

Generalized Pareto Distribution to Constrain Uncertainty in Low-Probability Peak Ground Accelerations

Figure 3.

λ  0).

91

Log-exceedance plot for Chi-Chi peak ground acceleration data (for exploratory purposes only because assumed threshold

NGA relations are used in this article to calculate the predicted PGA and the PGA residuals. Conducting a peak-over-threshold analysis on PGA residuals of an attenuation model instead of the PGA records themselves allows combining data from multiple events with different parameters that affect ground-motion characteristics. Because the tail analysis of the residuals implicitly assumes that the statistics of the residuals are unaffected by the specific values of the attenuation model parameters (i.e., the statistics associated with one type of faulting are no different than for another type of faulting), a much larger size sample is obtained in a residual analysis than in a direct assessment of the PGA, and this, in turn, results in narrower confidence bounds on the tail behavior. The PEER NGA database was again used to calculate the PGA residuals using the NGA ground-motion relations developed by Abrahamson and Silva (2008). The peak ground acceleration values are equal to the geometric mean of the two orthogonal horizontal components orientated randomly, as defined in the NGA flatfile documentation.

In residual calculations using the Abrahamson and Silva (2008) ground-motion relations, we excluded the earthquake recordings that were considered not applicable and were excluded by Abrahamson and Silva (2008) based upon their data selection criteria. These excluded earthquake recordings and the bases for exclusion are given in table D-1 of Abrahamson and Silva (2008). In addition, we excluded the four Chi-Chi aftershocks that require a separate distance attenuation model. These earthquakes are identified by earthquake identification numbers 172, 173, 174, and 175 in the NGA flatfile. These earthquakes were excluded mainly because the calculated residuals show considerable bias by distance, and we were not able to clarify the cause. Also, the use of different regression models within the same data set may introduce additional uncertainty in terms of residual distribution. Instead, selective Chi-Chi aftershock recordings were used for peak-over-threshold analysis on the raw PGA data as documented in previous sections of this article. A considerable fraction (110 earthquakes) of the NGA flatfile does not have finite-fault models and, therefore,

92

L. Huyse, R. Chen, and J. A. Stamatakos

Figure 4.

Figure 5.

Number of peak ground acceleration records in each distance bin.

Mean conditional excess plot for 50 < r < 70 km. A threshold λ  0:1275g can be clearly identified as the beginning of the last stable, linear part of this plot.

Generalized Pareto Distribution to Constrain Uncertainty in Low-Probability Peak Ground Accelerations

93

Table 2

Table 3

GPD Parameters and Tail Statistics for the Distance Range 50–70 km of the Two Chi-Chi Mw  6:2 Aftershocks

Tail Statistics of the Regression Residuals of the Abrahamson–Silva (A-S) and Boore–Atkinson Models

Threshold

Scale

Shape

0.065 0.085 0.127

0.064 0.056 0.044

0:376 0:351 0:351

*

N tail

36 25 9

*

PGA Upper Bound

95th Percentile Upper Bound Limit

0.235 0.243 0.252

0.254 0.279 0.278

Number of data points in the tail region.

are missing important source parameters such as rupture distance, Joyner–Boore distance, depth-to-top-of-fault rupture, fault-rupture width, source-to-site azimuth, and hanging-wall or foot-wall indicators. The majority of the earthquakes without finite-fault models are for magnitudes less than 6.0. R. R. Youngs (personal commun., 2007) developed finitefault models for these earthquakes based on simulation of 101 possible rupture planes for each earthquake and supplemented the NGA database with simulated source parameters. These simulated parameters were used by the NGA developers in their ground-motion model development. Therefore, two sets of residuals were considered in peak-over-threshold analysis, one with the simulated finite-fault data provided by R. R. Youngs (personal commun., 2007) and one without. Summary statistics for the tail region of the Abrahamson– Silva model residuals are given in Table 3. The tail plot associated with the selected threshold λ  0:9 for the Abrahamson–Silva model with finite fault is given in Figure 7. The lognormal distribution fitted to all residuals is also indicated. Although there is no overwhelming differ-

Model Description

A-S model with finite fault A-S model without finite fault Boore–Atkinson Model

Threshold

Scale

Shape

Residual Upper Bound

4.3%

0.9

0.35

0:29

2.19

8.1%

0.7

0.31

0:12

3.84

8.8%

0.7

0.29

0:11

3.55

% Tail

*

*

Percentage of data points in the tail region.

ence between the GPD and lognormal distribution assumption within the range of data, considerable differences appear when predictions need to be made beyond the data range. The shape parameter in the GPD model enables better modeling of the curvature in the data. The tail plot along with the lognormal distribution fitted to all residuals for the Abrahamson–Silva model without finite fault is given in Figure 8 for λ  0:7. Figure 8 shows that the GPD model captures the true tail behavior much better than the lognormal distribution does (which was fitted to all residual data). Although the best estimate of the GPD shape parameter is fairly close to zero (see Table 3), it is significantly different from zero (its coefficient of variation is 12%), and the residual distribution is therefore bounded. The threshold selection where the tail begins is not always straightforward (Guillou and Hall, 2001). If the data follow a GPD distribution, the shape parameter of the GPD is

Figure 6. Comparison of generalized Pareto distribution fit with 95% confidence upper bound and lognormal distribution for 50 < r < 70 km.

94

L. Huyse, R. Chen, and J. A. Stamatakos

Figure 7. Comparison of generalized Pareto distribution fit with ninety-fifth percentile upper bound and lognormal distribution for the residual in the Abrahamson–Silva attenuation model with the finite-fault data. invariant with respect to a shift of the threshold λ to higher values. Figure 9 illustrates this property for the Abrahamson– Silva model with finite-fault data. If the threshold is selected too low (e.g., λ < 1:0), a large number of data points that clearly do not belong to the tail (i.e., they are not part of the

last linear part of the mean excess curve in Fig. 5) will be included in the estimation of the shape parameter, and this will result in a biased estimate for the parameter c. Figure 9 shows that the shape parameter c converges as the threshold is increased. On the other hand, if the threshold is moved too

Figure 8. Comparison of generalized Pareto distribution fit with ninety-fifth percentile upper bound and lognormal distribution for the residual in the Abrahamson–Silva attenuation model without the finite-fault data.

Generalized Pareto Distribution to Constrain Uncertainty in Low-Probability Peak Ground Accelerations

95

Figure 9.

Best estimate and ninety-five percent confidence bounds of the generalized Pareto distribution shape parameters as functions of the threshold λ for the residual in the Abrahamson–Silva attenuation model with the finite-fault data.

far into the tail, very few data will remain to estimate the c value, which results in numerically unstable estimates with considerable statistical uncertainty (wide confidence bounds in Fig. 9). In practice, the optimal threshold value achieves a compromise between bias and variance of the estimate (Caers and Maes, 1988).

Implication for Seismic Hazard Analyses Thousands of strong-motion recordings have shown a considerable degree of scatter in ground-motion parameters, including peak ground acceleration. Aleatory variability in ground-motion prediction and the way it is integrated in seismic hazard analyses have significant effects on the calculated hazard. For decades, the existence of ground-motion variability has been recognized (Bender, 1984), and the formulation of hazard calculation has included integration across the scatter in ground motion (Cornell, 1971; Marz and Cornell, 1973; McGuire, 1976). However, in practice, as indicated by Bommer and Abrahamson (2006), ground-motion variability was often neglected, or its effect was artificially reduced by various means in studies carried out in the 1970s and 1980s. Truncating the lognormal distribution at a predefined level of ground motion, often specified as the median ground motion plus a number of standard deviations, is an example of various attempts to reduce the effect of ground-motion variability. Bommer and Abrahamson (2006) conducted extensive review and analysis and emphasized the importance of including ground-motion variability in PSHA. They con-

cluded that neglecting ground-motion variability in earlier studies is the main reason that these studies yielded much lower ground-motion hazard than did the probabilistic hazard studies performed in recent years. Probabilistic seismic hazard calculations should be carried out in the way that integrates across the full range of variability in ground motion. However, such integration should be performed over a probability density function (PDF) that better characterizes the ground-motion distribution than the widely used lognormal distribution. A lognormal distribution, when unbounded, gives a nonzero probability to groundmotion values so high that they are not physically possible. Bounded lognormal distributions, on the other hand, may distort the data, render the shape of the hazard curve artificial, and produce erroneous hazard results, particularly at low annual exceedance probabilities. In addition, the selection of the truncation level is not entirely objective and may be arbitrary. This section demonstrates that probabilistic hazard integration should be performed over a density that is composite: a lognormal distribution supplemented by a GPD in the tail region. In the GPD, as detailed in the previous sections, PGA would gradually approach an upper bound. Assumption of Lognormal Distribution and Probability Calculation To discuss the effects of ground-motion variability and the way it is integrated in PSHA on the calculated hazard, we need to briefly review the basics of PSHA calculations,

96

L. Huyse, R. Chen, and J. A. Stamatakos

starting with the PDF of the lognormal distribution for PGA. This PDF can be written as  fy; μY ; σY  

p 1 e 2πσY PGA



1 yμY 2 2σ2 Y

0

PGA > 0 ; (7) PGA  0

where Y  lnPGA is a random variable with normal distribution that has a mean, μY , and standard deviation, σY . The median and standard deviation are obtained from groundmotion models such as Abrahamson and Silva (2008). For a given earthquake with magnitude m, the probability that the ground motion at distance r exceeds a particular value a0 is the complementary of the CDF and can be written as FY ≥ ln a0 jm; r Z a  12 yμY 2 0 1 p 1 dPGA e 2σY 2πσY PGA 0 Y  lnPGA > 0:

(8)

The integration is performed numerically. Because the CDF of a lognormal distribution can be easily obtained from the CDF of a standard normal distribution, often denoted

as ΦZ, by means of transformation, equation (8) can be written as  FY ≥ ln a0 jm; r  1  ΦZ  1  Φ Y  lnPGA > 0;

ln a0  μY σY



(9)

where Z  ln aσ0 YμY is a standard normal random variable. ΦZ has been extensively tabulated and is readily available in many statistics handbooks and spreadsheet applications. Suppose the site of interest is affected by a total of N earthquake sources, each of which has an associated magnitude Mi , distance Ri , and annual rate of occurrence vi . An earthquake source can produce multiple earthquakes that can occur at various locations of the source. Therefore, Mi and Ri are random variables, each having its own distribution. If fMi m and fRi r are used to denote the PDFs of magnitude and distance variables, respectively, for the ith source, then the total annual probability that the ground motion exceeds a0 is (total probability theorem) PY ≥ ln a0  

N X

vi

ZZ FY ≥ ln a0 jm; rfMi mfRi rdmdr: (10)

i1

The effect of ground-motion variability and its integration is manifested in the calculation of FY ≥ ln a0 jm; r. To demonstrate, we consider a simple case that is similar to the hazard calculation example used by Field (2006). This example assumes a rock site condition. It further assumes

that there are two potential sources, both are vertical strikeslip faults, and both are located at a rupture distance of 15 km from the site (r1  r2  15 km). The first produces an M 5.0 earthquake once every 20 years (m1  5:0, v1  1=20), and the second produces an M 7.0 earthquake once every 300 years (m2  7:0, v2  1=300). For the chosen magnitudes, distances, and rates of occurrence, equation (10) is simplified to give the total annual probability that the ground motion exceeds a0 and is shown as PY ≥ ln a0   v1 F1 Y 1 ≥ ln a0 jm1 ; r1   v2 F2 Y 2 ≥ ln a0 jm2 ; r2 :

(11)

The NGA ground-motion relations developed by Abrahamson and Silva (2008) are used to obtain the median PGAs of 0.0794g (μY 1  2:533) and 0.1636g (μY 2  1:810), respectively, for the M 5.0 and M 7.0 earthquakes. The total standard deviation for lnPGA is calculated to be 0.7449 and 0.5336, respectively for the M 5.0 and M 7.0 earthquakes. F1 Y 1 ≥ ln a0 jm1 ; r1  and F2 Y 2 ≥ ln a0 jm2 ; r2  are calculated for various values of a0 by numerically integrating over the entire lognormal distribution without truncation. The total annual probabilities of exceeding various PGA levels are then calculated using equation (11) and plotted as a hazard curve labeled “Without truncation” in Figure 10. This hazard curve indicates that, at an annual exceedance probability of 107 , the PGA exceeds 2.0g, and at an annual exceedance probability of 108 , the PGA reaches 3.5g. Truncated Lognormal Distribution If the lognormal distribution is truncated at a PGA value of atrun , the PDF needs to be renormalized [divided by μY ] so that it integrates to one Φztrun , where ztrun  ln atrun σY when the maximum PGA (i.e., atrun ) is reached. The probability that a given earthquake with magnitude m produces PGA at distance r exceeding a particular value a0 is then FY ≥ ln a0 jm; r  1 Φztrun   ΦZ  Φztrun  0

PGA ≤ atrun PGA > atrun

:

(12)

Using the same two-source example, the total annual probabilities of exceeding various PGA levels are calculated with the lognormal distribution truncated at various levels. Figure 10 compares hazard curves with no truncation and with truncations at a PGA of median plus 1, 2, 3, and 4 times standard deviation, respectively. The maximum PGA for the various truncation levels is calculated as atrunc  eμY nσY , where n is the number of standard deviations and μY and σY are the mean and standard deviation of lnPGA, as defined in equation (7). Because the maximum PGA is limited for

Generalized Pareto Distribution to Constrain Uncertainty in Low-Probability Peak Ground Accelerations

97

Figure 10.

The total peak ground acceleration exceedance rate for the two earthquake sources described in the text using full range lognormal and truncated lognormal distributions.

each earthquake scenario, the total PGA is also limited and the resulting hazard curves become asymptotic to the maximum PGA. More severe bending of the calculated hazard curve corresponds to the more severe truncation (i.e., smaller n value). Although truncating the lognormal distribution of ground motion at a specified value of uncertainty is a common practice in PSHA, truncation below 3 standard deviations is not supported by the analyses of empirical strongmotion data in this study or other studies (Abrahamson, 2000; Bommer et al., 2004; Bommer and Abrahamson, 2006). As Bommer and Abrahamson (2006) indicated, truncating at a level that is above 3 standard deviations has little effect on the hazard curves for the return periods generally used in engineering design or even at the 10,000-year level employed in the nuclear industry in some cases. This observation is confirmed by Figure 10. Truncation will affect the calculated hazard for smaller annual exceedance probabilities (e.g., below 106 ). Because the selection of truncation levels is somewhat subjective, the calculated ground motions at these longer return periods may not be appropriate. Truncating ground-motion uncertainty potentially means a portion of the recorded strong-motion data that has low probability of occurring is ignored. Yet, it is precisely these low-probability values that establish the hazard at very long return periods. Figure 11 shows deaggregation results with respect to ground-motion uncertainty (ε) for a hypothetical site. At higher probability level or shorter return period, the PDF

of ε is skewed to the left, whereas at higher ground motion (lower probability level or longer return period), the PDF of ε is skewed to the right. Modeling the left tail of the residual is important for low return periods (Fig. 11a, upper figure), whereas it is the right tail that is of interest in the prediction of low-probability events (Fig. 11b). In this example, the ground motion at low annual exceedance probability levels is controlled by uncertainty up to 3 or 4 standard deviations. Therefore, correctly modeling the uncertainty in the highdensity region of the deaggregation diagram in Figure 11b is critical if ground motions associated with very low annual exceedance probabilities are of interest. In the current state of practice, all of the existing PSHA codes assume lognormal distribution of ground motion, and, therefore, the integration for hazard is all based on lognormal distribution. Figures 6, 7, and 8 indicate that significant differences between the recorded data and the assumed lognormal distribution model may exist in the tail region and when extrapolated beyond the data range. Using a composite distribution model (lognormal supplemented by the GPD in the tail region) may result in significantly more accurate estimates of the hazard. Combined Lognormal and Generalized Pareto Distribution We use a composite distribution model that combines lognormal distribution and GPD to model peak ground acceleration. Hazard integration is over the lognormal distribution until a threshold PGA value, aλ , is reached. For the portion

98

L. Huyse, R. Chen, and J. A. Stamatakos GPD. Hazard examples were calculated using two separate sets of GPD parameters: the one with simulated finite-

Figure 11. Deaggregation of peak ground acceleration hazard with respect to ground-motion uncertainty (ε) for a hypothetical site.

with PGA greater than aλ (i.e., the tail region), the integration is performed over the GPD. This is achieved by combining equation (9) and the CDF for GPD (equation 3) and renormalizing the distribution, which yields the probability that a given earthquake with magnitude m producing PGA at distance r exceeding a particular value a0 and is shown as FY ≥ ln a0 jm; r  ΦZ 1  1  ptail  Φz λ  ptail 1  FGPD PGA

lnPGA  μlnPGA ≤ λ ; lnPGA  μlnPGA > λ

(13) ln a μ

where zλ  λσy ln y , with aλ  expλ  μln y , corresponding to PGA when lnPGA  μlnPGA  λ; FGPD PGA  lnPGAμ

λ

lnPGA 1=c ; λ, δ, and c are generalized 1  1  c δ Pareto density parameters defined in equation (3) and given in Table 3 for PGA residuals; and ptail is the fraction of recorded data in the tail region (see Table 3). Again, using the same two-source example, the total annual probabilities of exceeding various PGA levels were calculated using the combined lognormal distribution and

fault data and the one without simulated finite-fault data. Figures 12 and 13 show total hazard curves with lognormal distribution without truncation, lognormal distribution with truncations at the threshold and upper bound PGAs (threshold and upper bound values mark the beginning and end of the GPD tail model, respectively), and the composite distribution model for each case, respectively. The threshold PGA, aλ , is defined in equation (13). The upper bound PGA is calculated similarly but using the best estimate of upper bound value in Table 3 in the place of tail threshold value λ. For the case with finite-fault data, the threshold PGA is calculated to be the median PGA plus 1.12 standard deviations, and the upper bound PGA is the median PGA plus 2.83 standard deviations. For the case without finite-fault data, the threshold PGA is the median PGA plus 0.94 standard deviations, and the upper bound PGAs is the median PGA plus 4.55 times standard deviations. These figures show that hazard estimated using the composite distribution model is lower than hazard estimated using the lognormal distribution without truncation for all annual probabilities below about 102 . The difference becomes increasingly more significant as the probability level decreases, particularly for the set of residuals calculated with simulated finite-fault data (Fig. 12). The composite distribution model predicts higher hazard than the lognormal distribution truncated at the threshold PGA; and, for most annual exceedance probability, it predicts lower hazard than the lognormal distribution truncated at the upper bound PGA. For the set of residuals calculated with the simulated finite-fault data, the difference between predicted hazard level using the composite distribution model and lognormal distribution truncated at upper bound PGA is insignificant. However, the former predicts a smoother hazard curve because the characteristics of the hazard curve are controlled by the smooth transition from the lognormal distribution to GPD and by GPD parameters in the tail region rather than truncation due to an arbitrary cutoff value. We note that the GPD parameter values used in our PSHA example are the best-estimate results from the peak-overthreshold analysis. These estimates are associated with uncertainties as indicated by the coefficient of variation and confidence intervals for scale and shape parameters (see Fig. 9 for an illustration of the confidence bounds on the shape parameter). Variations in these parameters will affect both the level and the shape of the hazard curve. This is demonstrated by comparing the differences between Figures 12 and 13. Although not quantitatively defined, there is also uncertainty associated with threshold λ. This parameter is important because it defines the probability level at which the composite distribution model will begin to differ from the traditional lognormal distribution. Because the confidence interval associated with λ is not estimated, the value obtained for this parameter is considered preliminary. These uncertainties, however, do not affect our general observations that

Generalized Pareto Distribution to Constrain Uncertainty in Low-Probability Peak Ground Accelerations

99

Figure 12.

Comparison of the total peak ground acceleration exceedance rate for the two earthquake sources described in the text using full range lognormal distribution, truncated lognormal distribution, and composite distribution (combined lognormal distribution and generalized Pareto distribution) for Abrahamson–Silva with finite-fault data.

Figure 13.

Comparison of the total peak ground acceleration exceedance rate for the two earthquake sources described in the text using full range lognormal distribution, truncated lognormal distribution, and composite distribution (combined lognormal distribution and generalized Pareto distribution) for Abrahamson–Silva without finite-fault data.

100

L. Huyse, R. Chen, and J. A. Stamatakos

the peak-over-threshold analysis is applicable to PGA residuals, and the calculation of probabilistic seismic hazard should be based upon combined distribution models. To gain confidence in this regard, we performed tail analysis for a set of residuals obtained from an alternative ground-motion model, the Boore–Atkinson NGA equations and their selective NGA earthquake recordings (Boore and Atkinson, 2007, 2008). The tail analysis shows very similar distribution characteristics to the residuals obtained from the Abrahamson–Silva NGA model without finite-fault data, as demonstrated by the GPD parameters in Table 3. There is good agreement between the upper bound on the maximum residual predicted by the Boore–Atkinson model and Abrahamson–Silva model without the finite-fault data.

Conclusion PSHA has been used to characterize earthquake groundmotion hazard and to establish design input for nuclear facilities and other engineering structures. However, the characterization of scatter (aleatory variability) in the recorded ground-motion parameters, including PGA, is subject to considerable debate. The state of practice is to use a lognormal distribution to model ground-motion variability. Although the lognormal distribution assumption works quite well for moderate levels of the exceedance probability, it leads to unrealistically high values of PGA when predictions are made for extremely small exceedance probabilities (e.g., below 106 ). In this article, the validity of this lognormal distribution assumption for the PGA is assessed using statistical peak-over-threshold analysis techniques. Both raw ground-motion recordings from the PEER NGA database and the attenuation model residuals were used to demonstrate the approach. The statistical analysis indicated that although the PGA data or the attenuation residuals generally follow a lognormal distribution, significant deviations from the lognormal model may exist near the upper end of the distribution, where they are better characterized by the GPD. The GPD distribution is also justified on the basis of theoretical considerations because it arises as the limiting distribution in peak-over-threshold analyses. Although the GPD pertains to only a small part of the distribution, the difference between the lognormal and the GPD is important because it is precisely this end of the distribution that controls the estimate of ground motions at low annual exceedance probabilities. The implications of peak-over-threshold analysis in PSHA are demonstrated using a composite distribution model that consists of a lognormal distribution supplemented by the GPD in the upper tail. Results show that, for the small annual exceedance probability considered in the examples, the composite distribution model yields considerably lower PGA values or residuals than the unbounded and physically unrealistic lognormal distribution. The peak-over-threshold modeling approach provides a rational basis for determining the PGA or residual upper bound, wherever it exists. In addition, due to the absence of an arbitrary truncation, the

resulting hazard curves transition gradually toward that upper bound. Realistic modeling of low-probability ground motions will gain increasing significance with renewed interests in nuclear energy production. The GPD parameters derived in this study are only for PGA and are considered preliminary and specific to the PEER NGA data and the ground-motion models employed. Our approach, however, can be applied to more extensive ground-motion data and other groundmotion models so that the GPD parameters can be derived for more general applications. Because of the underlying assumptions for the GPD, the approach does require modification before it can be applied to highly dependent ground-motion data. In addition, different sets of GPD parameters may be required for different tectonic conditions. Nevertheless, the approach can be readily adapted to spectral accelerations that have more engineering significance and other ground-motion parameters, such as peak ground velocity and peak ground displacement. It may also be applicable to ground-motion intensity measures such as the European Macroseismic Scale EMS-98 intensities. Furthermore, the composite distribution model can be implemented in existing or future PSHA software for practical applications.

Data and Resources Peak ground acceleration data can be obtained at the PEER NGA flatfile database (nonpublic version 7.3, 0208-06; http://peer.berkeley.edu/products/nga_flatfiles_dev .html, last accessed January 2008). Coefficients used in the calculation were from a 2007 draft version of Abrahamson–Silva NGA ground motion model. The revised version of the model is published in Abrahamson and Silva (2008) and is available at http://peer.berkeley.edu/products/ Abrahamson-Silva_NGA.html. Simulated finite-fault data were obtained from Dr. R. R. Youngs, 11 November 2007 (AMEC Geomatrix).

Acknowledgments This report was prepared to document work performed by the Center for Nuclear Waste Regulatory Analyses (CNWRA) for the U.S. Nuclear Regulatory Commission (NRC) under Contract No. NRC-02-02-012. The activities reported here were performed on behalf of the NRC Office of Nuclear Material Safety and Safeguards, Division of High-Level Waste Repository Safety. The report is an independent product of the CNWRA and does not necessarily reflect the views or regulatory position of the NRC. The manuscript benefited from discussions with Abou-Bakr Ibrahim, Tianqing Cao, Jon Ake, and Mahendra Shah and review comments from Alan Morris, Budhi Sagar, Sitakanta Mohanty, and two anonymous reviewers. The authors would also like to thank Bob Youngs for providing the simulated finite-fault data and Sarah Gonzalez (formerly at CNWRA, now at NRC) for her contributions to the data collection efforts. Finally, we thank Beverly Street and Lauren Mulverhill for editorial support and manuscript preparation.

Generalized Pareto Distribution to Constrain Uncertainty in Low-Probability Peak Ground Accelerations

References Abrahamson, N. A. (2000). State of the practice of seismic hazard assessment, in Proceedings of GeoEng 2000, Melbourne, Australia, 19–24 November 2000, 1, 659–685. Abrahamson, N. A., and W. J. Silva (2008). Summary of the Abrahamson and Silva NGA ground-motion relations, Earthq. Spectra 24, no. 1, 67–97. Akaike, H. (1974). A new look at the statistical model identification, IEEE Trans. Autom. Control 19, no. 6, 716–723. Anderson, J. G., and J. N. Brune (1999). Probabilistic seismic hazard analysis without the ergodic assumption, Seismol. Res. Lett. 70, no. 1, 19–28. Anderson, J. G., and J. N. Brune (2001). Importance of the ergodic assumption in probabilistic seismic hazard analysis, abstract number S52C0640, American Geophysical Union, Fall Meeting 10–14 December 2001, San Francisco. Bender, B. (1984). Incorporating acceleration variability into seismic hazard analysis, Bull. Seismol. Soc. Am. 74, no. 4, 1451–1462. Bommer, J. J., and N. A. Abrahamson (2006). Why do modern probabilistic seismic-hazard analyses often lead to increased hazard estimates?Bull. Seismol. Soc. Am. 96, no. 6, 1967–1977. Bommer, J. J., N. A. Abrahamson, F. O. Strasser, A. Pecker, P.-Y. Bard, H. Bungum, F. Cottom, D. Faeh, F. Sabetta, F. Scherbaum, and J. Studer (2004). The challenge of defining the upper limits on earthquake ground motions, Seismol. Res. Lett. 75, no. 1, 82–95. Boore, D. M., and G. M. Atkinson (2007). Boore–Atkinson NGA ground motion relations for the geometric mean horizontal component of peak and spectral ground motion parameters, in PEER Report 2007/1, Pacific Earthquake Engineering Research Center, University of California, Berkeley, available at http://peer.berkeley.edu/ publications/peer_reports.html. Boore, D., and G. Atkinson (2008). Ground-motion prediction equations for the average horizontal component of PGA, PGV, and 5%-damped SA at spectral periods between 0.01 s and 10.0 s, Earthq. Spectra 24, 99–138. Boos, D. D. (1984). Using extreme value theory to estimate large percentiles, Technometrics 2, no. 1, 33–39. Caers, J., and M. A. Maes (1998). Identifying tails, bounds and end-points of random variables, Struct. Saf. 20, no. 1, 1–23. Castillo, E. (1988). Extreme Value Theory in Engineering, Statistical Modeling and Decision Science, Academic Press Inc., Boston, Massachusetts, 389 pp. Castillo, E., A. S. Hadi, N. Balakrishnan, and J. M. Sarabia (2005). Extreme Value and Related Models with Applications in Engineering and Science, in Wiley Series in Probability and Statistics, D. J. Balding, N. A. C. Cressie, N. I. Fisher, I. M. Johnstone, J. B. Kadane, G. Molenberghs, L. M. Ryan, D. W. Scott, A. F. M. Smith, and J. L. Teugels (Editors), John Wiley and Sons, Hoboken, New Jersey, 362 pp. Castillo, E., A. S. Hadi, and J. M. Sarabia (1995). Statistical analysis of extreme waves, in Proc. of the 14th ASME Offshore Mechanics and Arctic Engineering Conference, Copenhagen, Denmark, 2, 33–40. Chiou, B., R. Darragh, N. Gregor, and W. Silva (2008). NGA project strongmotion database, Earthq. Spectra 24, no. 1, 23–44. Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values, Springer Series in Statistics, Springer, London, United Kingdom, 224 pp.

101

Cornell, C. A. (1968). Engineering seismic risk analysis, Bull. Seismol. Soc. Am. 58, 1583–1606. Cornell, C. A. (1971). Probabilistic analysis of damage to structures under seismic loads, in Dynamic Waves in Civil Engineering, D. A. Howells, I. P. Haigh, and C. Taylor (Editors), John Wiley and Sons, New York, 473–488. Drees, H., L. de Haan, and S. Resnick (2000). How to make a Hill plot, Ann. Stat. 28, no. 1, 254–274. Field, N. H. (2006). Probabilistic seismic hazard analysis, a primer, http:// epicenter.usc.edu/docs/PSHA_Primer_v2.pdf. Guillou, A., and P. Hall (2001). A diagnostic for selecting the threshold in extreme value statistics, J. Roy. Stat. Soc. Series B, 63, no. 2, 293–305. Hill, B. M. (1975). A simple and general approach to inference about the tail of a distribution, Ann. Stat. 3, 1163–1174. Hosking, B. M., and J. R. Wallis (1987). Parameter and quantile estimation for the generalized Pareto distribution, Technometrics 29, no. 3, 339–346. Hsieh, P.-H. (1999). Robustness of tail estimation, J. Comput. Graph. Stat. 8, no. 2, 318–332. Maes, M. A. (1995). Tail heaviness in structural reliability, Proceedings of ICASP 7, Paris, France, 10–13 July 1995. Marz, H. A., and C. A. Cornell (1973). Seismic risk based on a quadratic magnitude-frequency law, Bull. Seismol. Soc. Am. 63, no. 6, 1999– 2006. McGuire, R. K. (1976). FORTRAN computer program for seismic risk analysis, U.S. Geol. Surv. Open-File Rept. 76–67, 68 pp. Pickands, J. (1975). Statistical inference using extreme order statistics, Ann. Stat. 3, 119–131. Power, M., B. Chiou, N. Abrahamson, Y. Bozorgnia, T. Shants, and C. Roblee (2008). An overview of the NGA project, Earthq. Spectra 24, no. 1, 3–21. Singpurwalla, N. D. (1972). Extreme values from a lognormal law with applications to air pollution problems, Technometrics 19, no. 3, 703–711.

Reliability and Materials Integrity Section Mechanical and Materials Engineering Division Southwest Research Institute San Antonio, Texas (L.H.)

Department of Civil Engineering California State University Chico, California [email protected] (R.C.)

Washington Technical Support Office and Environmental Program Center for Nuclear Waste Regulatory Analyses Geosciences and Engineering Division Southwest Research Institute San Antonio, Texas (J.A.S.)

Manuscript received 5 September 2008