Parametric Bootstrap Goodness-of-Fit Tests for Imperfect Maintenance ...

IEEE TRANSACTIONS ON RELIABILITY, VOL. 65, NO. 3, SEPTEMBER 2016


Parametric Bootstrap Goodness-of-Fit Tests for Imperfect Maintenance Models

Cécile Chauvel, Jean-Yves Dauxois, Laurent Doyen, and Olivier Gaudoin

Abstract—The simultaneous modeling of ageing and maintenance efficiency of repairable systems is a major issue in reliability. Many imperfect maintenance models have been proposed. To analyze a dataset, it is necessary to check whether these models are appropriate. In this paper, we propose a general methodology for testing the goodness of fit of any kind of imperfect maintenance model. Two families of tests are presented, based respectively on martingale residuals and probability integral transforms. The quantiles of the test statistic distributions under the null hypothesis are computed with parametric bootstrap methods. An extensive simulation study is provided, from which we recommend the use of two tests in practice, one from each family. Finally, the tests are applied to several real datasets.

Index Terms—Goodness-of-fit tests, imperfect maintenance, martingale residuals, parametric bootstrap, probability integral transform, repairable systems.

NOMENCLATURE

ABAO    As bad as old.
AD      Anderson–Darling.
AGAN    As good as new.
AIC     Akaike information criterion.
ARA     Arithmetic reduction of age.
BP      Brown–Proschan.
CvM     Cramér–von Mises.
EGP     Extended geometric process.
i.i.d.  Independent and identically distributed.
KS      Kolmogorov–Smirnov.
LLP     Log linear process.
PIT     Probability integral transform.
PLP     Power law process.
QR      Quasi-renewal.
TTT     Total time on test.

NOTATION

$T_i$          Failure times.
$N_t$          Number of failures at time $t$.
$\lambda_t$    Failure intensity.
$\Lambda_t$    Cumulative failure intensity.
$h(t)$         Initial intensity.
$q$            Maintenance effect parameter in QR and EGP models.
$\rho$         Maintenance effect parameter in ARA models.
$L_n$          Likelihood.
$\hat{M}_i$    Martingale residuals.

Manuscript received December 18, 2015; revised May 23, 2016 and April 15, 2016; accepted June 1, 2016. Date of publication August 3, 2016; date of current version August 30, 2016. This work was supported by the French National Agency of Research (ANR), project AMMSI, ANR-2011-BS01-021. Associate Editor: M. Finkelstein.
C. Chauvel, L. Doyen, and O. Gaudoin are with the Laboratoire Jean Kuntzmann, Université Grenoble Alpes, Grenoble 38000, France (e-mail: [email protected]; [email protected]; olivier. [email protected]).
J.-Y. Dauxois is with the Institut de Mathématiques de Toulouse, Université de Toulouse-INSA, Toulouse 31400, France (e-mail: jean-yves.dauxois@insa-toulouse.fr).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TR.2016.2578938

I. INTRODUCTION

MAINTENANCE is carried out on industrial systems throughout their life cycle to keep them in, or restore them to, an operating state, while complying with constraints related to safety, availability, and costs. Maintenance, by providing an essential contribution to the operational system reliability, plays a great part in risk management and constitutes a crucial element in the performance of an industrial installation. That is why the quantitative assessment of maintenance effect and its optimization are major strategic challenges.

After a failure, a corrective maintenance (or repair) is done, then the system is put in operation again. For a given system, the object of the study is the sequence of successive failure or corrective maintenance times. A model for this sequence of recurrent events is a univariate random point process. Other kinds of maintenance exist, such as preventive maintenance, but they are not considered in this paper.

Maintenance naturally has a major impact on reliability. The first basic assumption is to consider that after maintenance, the system is in the same state as just before the failure. This is known as minimal maintenance or ABAO effect. The corresponding random processes are nonhomogeneous Poisson processes. The second basic assumption is to consider that maintenance leaves the system as if it were new. This is known as perfect maintenance or AGAN effect. The corresponding random processes are renewal processes. Reality is between these two extreme cases. This situation is known as imperfect maintenance. Many imperfect maintenance models have been proposed: [1]–[4] and [5] among others.

For each particular application and observed dataset, practitioners are confronted with the problem of choosing the “most appropriate” model among the wide range of available models. More precisely, practitioners need to evaluate whether or not the observations could be realizations of random variables following the assumptions of a given model. This is a classical statistical issue of model selection. A solution is to build a statistical test of hypotheses, called a goodness-of-fit test. Goodness-of-fit tests are tools for checking whether the model fits the data well. These tests are well known for simple situations such as samples: classical techniques are available for determining if a sample can be regarded as drawn

0018-9529 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information.



from i.i.d. random variables [6]. In the reliability area, we are particularly interested in goodness-of-fit tests for the exponential and Weibull distributions. Recent results on these issues can be found, e.g., in [7] and [8]. These techniques can be adapted to build goodness-of-fit tests for AGAN models.

Testing the ABAO assumption is equivalent to building goodness-of-fit tests for nonhomogeneous Poisson processes. The basic idea in this case is to transform the data to get back to the i.i.d. case. For instance, the conditional probability integral transforms [9] and the prequential transforms [10] can be used. Other ideas have been proposed by Park and Kim [11], Zhao and Wang [12], and Lindqvist and Rannestad [13].

To our knowledge, only Lindqvist et al. [14] and Liu et al. [15] have proposed methods for testing the goodness of fit of general imperfect maintenance models. They both consider a transformation of the failure times into uniform random variables. However, these authors neither provided a detailed analysis of their tests nor performed simulation studies to evaluate their performance.

In this paper, we assume that the successive failure times of one repairable system are observed. We propose two families of goodness-of-fit tests for imperfect maintenance models that are based on martingale residuals or probability integral transforms. The quantiles of the test statistics are computed with a parametric bootstrap approach. The methodology is general and can be applied to a wide range of imperfect maintenance models.

Section II gives an overview of some usual imperfect maintenance models. The proposed goodness-of-fit tests are described in Section III. Section IV presents a simulation study evaluating the quality of the tests through their empirical powers. In particular, the new tests are compared with the tests of Lindqvist et al. [14] and Liu et al. [15]. In Section V, the tests are applied to real datasets.
Finally, a conclusion is provided in Section VI.

II. IMPERFECT MAINTENANCE MODELS

Let us assume that we observe the $n$ first failure times of a repairable system, $T_1, T_2, \ldots, T_n$, with $T_0 = 0$. A corrective maintenance, or repair, is made after each failure. The repair duration is not taken into account, or considered to be negligible. Let $N = (N_t)_{t \ge 0}$ denote the counting process of failures. A stochastic model for this process of recurrent events is completely characterized by the intensity $\lambda = (\lambda_t)_{t \ge 0}$ of the failure counting process, defined as [18]

$$\lambda_t = \lim_{\Delta t \to 0} \frac{1}{\Delta t} P\left(N_{t+\Delta t} - N_{t^-} = 1 \mid \mathcal{H}_{t^-}\right), \quad t \ge 0$$

where $\mathcal{H}_{t^-}$ is the past of the process just before $t$.

An imperfect maintenance model is composed of two parts. The first part of the model is the initial intensity, which expresses the intrinsic wear of the system before the first maintenance. Usually, the first failure time is assumed to follow a Weibull distribution, so that the initial intensity is the intensity of a PLP, $h(t) = a b t^{b-1}$ ($a > 0$, $b > 0$), for $t \ge 0$. The intensity of a LLP is also considered in practice: $h(t) = \exp(a + bt)$ ($a, b \in \mathbb{R}$), for $t \ge 0$. In this paper, we assume that the system is wearing, so the initial intensity is an increasing function of time. Hence, we set $b > 1$ for the PLP and $b > 0$ for the LLP.

The second part of the model characterizes the maintenance effect on the system. The repair can be minimal, in which case the maintenance is ABAO and the counting process is a nonhomogeneous Poisson process of intensity $\lambda_t = h(t)$, $t \ge 0$. The repair can also leave the system AGAN. In this situation, the times between two successive failures are i.i.d. and the counting process is a renewal process of intensity $\lambda_t = h(t - T_{N_{t^-}})$, $t \ge 0$. Between these two extreme cases, there is a wide range of models corresponding to imperfect maintenance.

In the Brown–Proschan model [1], each maintenance is AGAN with probability $p \in [0, 1]$ and ABAO with probability $1 - p$. This can be modeled by defining a sequence $(B_k)_{k \in \mathbb{N}}$ of i.i.d. random variables with Bernoulli distribution of parameter $p$, such that $B_k = 1$ if the $k$th maintenance is AGAN and $B_k = 0$ if it is ABAO. The corresponding intensity is [16]

$$\lambda_t = h\left(t - T_{N_{t^-}} + \sum_{j=1}^{N_{t^-}} \left[\prod_{k=j}^{N_{t^-}} (1 - B_k)\right] (T_j - T_{j-1})\right), \quad t \ge 0.$$
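To make the structure of the BP intensity concrete, the following Python sketch evaluates the argument of $h$ (the effective age at time $t$) from the failure times and the repair indicators $B_k$. It is only an illustration under stated assumptions: the function name `bp_virtual_age` and its arguments are hypothetical and do not come from the paper, and the indicators $B_k$, which are not observed in practice, are supplied by hand.

```python
def bp_virtual_age(t, failure_times, b_flags):
    """Argument of h in the Brown-Proschan intensity at time t.

    failure_times: increasing list [T_1, ..., T_m] of the failures before t.
    b_flags: list [B_1, ..., B_m]; 1 if the k-th repair is AGAN, 0 if ABAO.
    (Hypothetical helper written for illustration only.)
    """
    times = [0.0] + list(failure_times)   # prepend T_0 = 0
    m = len(failure_times)                # plays the role of N_{t-}
    age = t - times[m]                    # time elapsed since the last failure
    for j in range(1, m + 1):
        keep = 1
        for k in range(j, m + 1):         # product over k = j, ..., N_{t-}
            keep *= 1 - b_flags[k - 1]
        # the j-th inter-failure time counts only if no AGAN repair followed it
        age += keep * (times[j] - times[j - 1])
    return age
```

With all indicators equal to 0 the function returns $t$ (ABAO case), and with all indicators equal to 1 it returns $t - T_{N_{t^-}}$ (AGAN case), in agreement with the two extreme models described above.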

Another usual imperfect maintenance model is the QR model, which considers that the times between two successive failures decrease or increase geometrically, depending on whether the system wears or rejuvenates [3]. Under the QR model, the interfailure times satisfy

$$T_i - T_{i-1} = q^{i-1} Y_i, \quad i \in \{1, \ldots, n\}$$

where $(Y_i)_{i \in \{1, \ldots, n\}}$ is a sequence of i.i.d. random variables and $q > 0$ is a parameter characterizing the effect of repair. The corresponding counting process is also known as the geometric process [17] and its intensity is

$$\lambda_t = q^{-N_{t^-}}\, h\left(q^{-N_{t^-}} \left(t - T_{N_{t^-}}\right)\right), \quad t \ge 0.$$

The case $q = 1$ corresponds to the AGAN model. If $q \in (0, 1)$, the system deteriorates due to stochastically decreasing interfailure times, and if $q > 1$ the system improves. The latter situation can occur when failed components of the system are replaced by new components built with new technologies.

The geometric growth of the interfailure times is a strong condition that can be softened with the extended geometric processes [5], in which

$$T_i - T_{i-1} = q^{b_i} Y_i, \quad i \in \{1, \ldots, n\}$$

where $(b_i)_{i \in \mathbb{N}^*}$ is a nondecreasing sequence of nonnegative real numbers such that $b_1 = 0$ and $b_i$ tends to infinity as $i$ goes to infinity. For instance, one can choose $b_i = i - 1$ (QR case), $b_i = (i - 1)^{1/2}$ or $b_i = \log(i - 1)$, for $i \in \mathbb{N}^*$.

The last class of imperfect maintenance models presented in this paper is the class of virtual age models [2]. The models of this class depend on a sequence of positive random variables $(A_i)_{i \in \mathbb{N}^*}$ called effective ages. They assume that after the $i$th failure, the system behaves like a new system which has not failed until $A_i$, for $i \in \mathbb{N}^*$. The failure intensity is $\lambda_t = h\left(A_{N_{t^-}} + t - T_{N_{t^-}}\right)$. A virtual age model is defined by a particular expression of the effective ages. For instance, in the ARA$_\infty$ model [4] or Kijima type II model with deterministic constant effect, the repair is supposed to reduce the effective age by a factor $\rho \le 1$.

CHAUVEL et al.: PARAMETRIC BOOTSTRAP GOODNESS-OF-FIT TESTS FOR IMPERFECT MAINTENANCE MODELS

The corresponding intensity is

$$\lambda_t = h\left(t - \rho \sum_{j=0}^{N_{t^-}-1} (1 - \rho)^j\, T_{N_{t^-}-j}\right), \quad t \ge 0. \tag{1}$$
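As a quick numerical check of (1), the sketch below compares the closed-form argument of $h$ with the value obtained from the usual Kijima type II recursion $A_i = (1 - \rho)(A_{i-1} + T_i - T_{i-1})$, $A_0 = 0$. This recursion is an assumption of the sketch: it is the standard way of writing "the repair reduces the effective age by a factor $\rho$" and is not spelled out in the text. The function names are hypothetical.

```python
def ara_inf_age_closed_form(t, failure_times, rho):
    """Argument of h in (1): t - rho * sum_j (1 - rho)^j * T_{N - j}."""
    m = len(failure_times)                       # N_{t-}
    s = sum((1 - rho) ** j * failure_times[m - 1 - j] for j in range(m))
    return t - rho * s


def ara_inf_age_recursive(t, failure_times, rho):
    """Same quantity from the assumed recursion A_i = (1 - rho)(A_{i-1} + X_i)."""
    age, prev = 0.0, 0.0
    for T in failure_times:
        age = (1 - rho) * (age + T - prev)       # effective age just after repair
        prev = T
    return age + t - prev                        # A_{N_{t-}} + t - T_{N_{t-}}
```

For example, with failure times 1.0, 2.5, 4.0, $\rho = 0.3$ and $t = 5$, both functions return 3.128 up to rounding; $\rho = 0$ gives $t$ (ABAO) and $\rho = 1$ gives $t - T_{N_{t^-}}$ (AGAN).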

In the ARA$_1$ model [4] or Kijima type I model with deterministic constant effect, the supplement of age since the last failure is reduced by a factor $\rho \le 1$ and

$$\lambda_t = h\left(t - \rho\, T_{N_{t^-}}\right), \quad t \ge 0.$$

In both ARA models, the value of $\rho$ represents the effect of repair. If $\rho \in (0, 1)$ the repair is efficient, if $\rho = 1$ the repair is optimal (AGAN), if $\rho = 0$ the repair is minimal (ABAO) and if $\rho < 0$ the repair is harmful. In this paper, we will not consider harmful repairs, so that $\rho \in [0, 1]$.

To assess both the intrinsic aging and the repair effect, one has to estimate the parameters of these imperfect maintenance models. This is usually done by the maximum likelihood method. The likelihood function associated with the observation of the $n$ first failure times $T_1, \ldots, T_n$ is [18]

$$L_n = \left[\prod_{i=1}^{n} \lambda_{T_i}\right] \exp\left(-\int_0^{T_n} \lambda_t \, dt\right).$$
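The likelihood above can be evaluated in closed form for the ARA$_\infty$ model with a PLP initial intensity. The Python sketch below computes $\log L_n$ by updating the effective age after each repair; it assumes the Kijima type II recursion $A_i = (1 - \rho)(A_{i-1} + T_i - T_{i-1})$ and uses a hypothetical function name. It only evaluates the log-likelihood; the constrained maximization used in the paper is not reproduced here.

```python
import math


def ara_inf_plp_loglik(times, a, b, rho):
    """log L_n for ARA_inf with PLP initial intensity h(t) = a*b*t**(b-1).

    times: increasing failure times T_1 < ... < T_n.
    (Illustrative sketch; the name and signature are not from the paper.)
    """
    loglik = 0.0
    age = 0.0     # effective age just after the latest repair (A_0 = 0)
    prev = 0.0    # previous failure time (T_0 = 0)
    for T in times:
        x = T - prev
        # log of the intensity just before the failure at T: log h(age + x)
        loglik += math.log(a * b) + (b - 1) * math.log(age + x)
        # minus the integral of the intensity over (prev, T]
        loglik -= a * ((age + x) ** b - age ** b)
        age = (1 - rho) * (age + x)
        prev = T
    return loglik
```

Two sanity checks follow from the text: with $\rho = 0$ the value reduces to the PLP log-likelihood $n \log(ab) + (b - 1) \sum_i \log T_i - a T_n^b$, and with $\rho = 1$ it reduces to a sum of Weibull log-densities of the interfailure times.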

III. PARAMETRIC BOOTSTRAP GOODNESS-OF-FIT TESTS

Given that a point process model is characterized by its intensity, an imperfect maintenance model can be denoted $\mathcal{C} = \{\lambda(\theta),\ \theta \in \Theta \subset \mathbb{R}^d\}$, where $\theta$ is the model parameter. We want to determine if $\mathcal{C}$ is a relevant model for the observed data $T_1, \ldots, T_n$. A goodness-of-fit test is a statistical test of $H_0$: “$\lambda \in \mathcal{C}$” versus $H_1$: “$\lambda \notin \mathcal{C}$”. Usually, the test procedure consists in rejecting the null hypothesis of a good fit if some quantity, the test statistic, is higher than a critical value. This critical value is the quantile of either the exact or the asymptotic distribution of the statistic under $H_0$. So the problem is first to find test statistics which express the gap between the data and the model, and second to determine the distribution of the statistics under $H_0$.

We propose two families of goodness-of-fit tests, based respectively on martingale residuals and probability integral transforms. For each test, the quantiles of the test statistic distribution under $H_0$ are computed with parametric bootstrap methods.

A. Tests Based on Martingale Residuals

Let $\Lambda = (\Lambda_t)_{t \ge 0}$ denote the cumulative intensity of the process $N$, such that $\Lambda_t = \int_0^t \lambda_s \, ds$, for $t \ge 0$. The process $M = (M_t)_{t \ge 0}$ defined by $M = N - \Lambda$ is a mean zero martingale. Then $N$ is close to $\Lambda$ in the sense that the expectation of their difference is null [18]. In our setting, the intensity has a parametric form and is denoted $\lambda(\theta) = (\lambda_t(\theta))_{t \ge 0}$, for $\theta \in \Theta \subset \mathbb{R}^d$. The cumulative intensity is $\Lambda(\theta)$ and the corresponding martingale is $M(\theta) = N - \Lambda(\theta)$.

In practice, the parameter $\theta$ is unknown and the cumulative intensity is estimated from the $n$ first failure times $T_1, \ldots, T_n$. Let $\hat{\theta}$ denote the maximum likelihood estimator of $\theta$. The random variables $\hat{M}_1, \ldots, \hat{M}_n$ defined by

$$\hat{M}_i = N_{T_i} - \Lambda_{T_i}(\hat{\theta}) = i - \Lambda_{T_i}(\hat{\theta}), \quad i \in \{1, \ldots, n\}$$

are called martingale residuals [19]. The martingale property is lost on the $\hat{M}_i$ but $N$ is still expected to be close to $\Lambda(\hat{\theta})$. An illustration of this phenomenon is given in Fig. 1, where the counting process $N$, the real cumulative intensity $\Lambda(\theta)$ (blue dotted line) and the estimated cumulative intensity $\Lambda(\hat{\theta})$ (red dashed line) are plotted over time for a dataset simulated under the ARA$_\infty$ imperfect maintenance model with a PLP initial intensity. From now on, this model will be denoted ARA$_\infty$-PLP. Here, $\theta = (a, b, \rho)$. The intensity of the model is given by (1) and the cumulative intensity is

$$\Lambda_t(\theta) = a \sum_{i=1}^{N_t+1} \left\{ \left(T_i - \rho \sum_{j=0}^{i-2} (1 - \rho)^j\, T_{i-j-1}\right)^{b} - \left(T_{i-1} - \rho \sum_{j=0}^{i-2} (1 - \rho)^j\, T_{i-j-1}\right)^{b} \right\}, \quad t \ge 0$$

where we set $T_{N_t+1} = t$. The data have been generated with $n = 30$ failures, $a = 0.05$, $b = 2.5$, and $\rho = 0.1$. The maximum likelihood estimators of the parameters are $\hat{a} = 0.053$, $\hat{b} = 2.67$, and $\hat{\rho} = 0.15$. The estimated cumulative intensity $\Lambda(\hat{\theta})$ is as close to the counting process as the real cumulative intensity $\Lambda(\theta)$.

Fig. 1. Plots of $N$, $\Lambda(a, b, \rho)$ and $\Lambda(\hat{a}, \hat{b}, \hat{\rho})$ over time for 30 simulated failure times under an ARA$_\infty$-PLP model, with $a = 0.05$, $b = 2.5$, and $\rho = 0.1$.

The first family of goodness-of-fit tests is constructed on measures of discrepancies between $N$ and $\Lambda(\hat{\theta})$. The tests reject the assumption that the model is valid if the two processes are too far apart. We have considered three usual test statistics based on the martingale residuals. The first one is a KS type statistic:

$$\mathrm{KS}_m(\hat{\theta}) = \sup_{i=1,\ldots,n} \left|\hat{M}_i\right| = \sup_{i=1,\ldots,n} \left|i - \Lambda_{T_i}(\hat{\theta})\right|.$$
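The quantities of this subsection can be sketched in a few lines of Python. The code below simulates failure times from an ARA$_\infty$-PLP model by inverting the conditional cumulative intensity, evaluates $\Lambda_{T_i}(\theta)$ at the failure times, and returns the residuals $i - \Lambda_{T_i}(\theta)$ together with their largest absolute value. It is an illustration under stated assumptions: the function names are hypothetical, the effective age follows the recursion $A_i = (1 - \rho)(A_{i-1} + T_i - T_{i-1})$, and the parameters passed to the functions are whatever the user supplies (the true values in the example), whereas the tests of the paper plug in the maximum likelihood estimate $\hat{\theta}$.

```python
import random


def simulate_ara_inf_plp(n, a, b, rho, rng):
    """Simulate n failure times by inversion (illustrative sketch)."""
    times, age, t = [], 0.0, 0.0
    for _ in range(n):
        e = rng.expovariate(1.0)   # increment of the cumulative intensity
        # solve a * ((age + x)**b - age**b) = e for the inter-failure time x
        x = (age ** b + e / a) ** (1.0 / b) - age
        t += x
        times.append(t)
        age = (1 - rho) * (age + x)
    return times


def cumulative_intensity_at_failures(times, a, b, rho):
    """Return [Lambda_{T_1}, ..., Lambda_{T_n}] for the ARA_inf-PLP model."""
    out, total, age, prev = [], 0.0, 0.0, 0.0
    for T in times:
        x = T - prev
        total += a * ((age + x) ** b - age ** b)
        out.append(total)
        age = (1 - rho) * (age + x)
        prev = T
    return out


def ks_martingale(times, a, b, rho):
    """Return (max_i |i - Lambda_{T_i}|, list of residuals)."""
    lam = cumulative_intensity_at_failures(times, a, b, rho)
    residuals = [i - value for i, value in enumerate(lam, start=1)]
    return max(abs(r) for r in residuals), residuals


rng = random.Random(1)
sample = simulate_ara_inf_plp(30, 0.05, 2.5, 0.1, rng)
ks_value, residuals = ks_martingale(sample, 0.05, 2.5, 0.1)
```

The simulation step relies on the fact, recalled in Section III-B, that the increments of the cumulative intensity between successive failures are standard exponential under the model.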



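The second statistic is of CvM type:

$$\mathrm{CvM}_m(\hat{\theta}) = \int_0^{T_n} \left(N_t - \Lambda_t(\hat{\theta})\right)^2 d\Lambda_t(\hat{\theta}).$$

Using a discretization of the time interval $[0, T_n]$, one can easily show that

$$\mathrm{CvM}_m(\hat{\theta}) = -\frac{1}{3} \sum_{i=1}^{n} \left[ \left(i - 1 - \Lambda_{T_i}(\hat{\theta})\right)^3 - \left(i - 1 - \Lambda_{T_{i-1}}(\hat{\theta})\right)^3 \right].$$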
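The discretized expression follows from the fact that $N_t = i - 1$ on each interval $(T_{i-1}, T_i)$, so that the integrand is a polynomial in $\Lambda_t(\hat{\theta})$ there. The Python sketch below, with hypothetical function names, computes the closed form and compares it with a midpoint-rule approximation of the integral taken on the scale of the cumulative intensity. The input is the list of values $\Lambda_{T_1}(\hat{\theta}), \ldots, \Lambda_{T_n}(\hat{\theta})$, however they were obtained.

```python
def cvm_martingale(cum_int):
    """Closed-form CvM_m; cum_int[i-1] = Lambda_{T_i}(theta_hat), i = 1..n."""
    lam = [0.0] + list(cum_int)           # Lambda_{T_0} = 0
    total = 0.0
    for i in range(1, len(lam)):
        total += (i - 1 - lam[i]) ** 3 - (i - 1 - lam[i - 1]) ** 3
    return -total / 3.0


def cvm_martingale_numeric(cum_int, steps=2000):
    """Midpoint-rule approximation of the integral of (N_t - Lambda_t)^2 dLambda_t."""
    lam = [0.0] + list(cum_int)
    total = 0.0
    for i in range(1, len(lam)):
        width = (lam[i] - lam[i - 1]) / steps
        for k in range(steps):
            x = lam[i - 1] + (k + 0.5) * width   # value of Lambda_t at the midpoint
            total += (i - 1 - x) ** 2 * width    # N_t = i - 1 on (T_{i-1}, T_i)
    return total
```

For instance, with cumulative intensities 0.7, 2.1 and 3.0 at the three failure times, the closed form gives 0.9 up to rounding, and the numerical approximation agrees with it to several decimal places.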

We also propose a statistic of AD type:

$$\mathrm{AD}_m(\hat{\theta}) = \int_0^{T_n} \frac{\left(N_t - \Lambda_t(\hat{\theta})\right)^2}{\Lambda_t(\hat{\theta}) \left(n + 1 - \Lambda_t(\hat{\theta})\right)} \, d\Lambda_t(\hat{\theta}).$$

More weight is put on large and low values of the estimated cumulative intensity. A usual choice for AD statistics would be to weight the square discrepancy between $N$ and $\Lambda(\hat{\theta})$ by the inverse of $\Lambda_t(\hat{\theta})\left(n - \Lambda_t(\hat{\theta})\right)$ for $t \in [0, T_n]$. However, because $\Lambda_{T_n}(\hat{\theta}) = n$, the corresponding integral is not defined. With a discretization of $[0, T_n]$, we can show that

$$\mathrm{AD}_m(\hat{\theta}) = \frac{1}{n+1} \sum_{i=2}^{n} \left[ (i - 1)^2 \log \frac{\Lambda_{T_i}(\hat{\theta})}{\Lambda_{T_{i-1}}(\hat{\theta})} - (n + 2 - i)^2 \log \frac{n + 1 - \Lambda_{T_i}(\hat{\theta})}{n + 1 - \Lambda_{T_{i-1}}(\hat{\theta})} \right] - (n + 1) \log\left(1 - \frac{\Lambda_{T_1}(\hat{\theta})}{n+1}\right) - n. \tag{2}$$

The proof is given in the Appendix.

In the different context of survival analysis in presence of covariates, KS and CvM tests on martingale residuals are used to test the fit of the Cox proportional hazards model [20], [21]. To our knowledge, this approach has not yet been used in our context without covariates.

The distributions of the test statistics under the null hypothesis are not standard distributions and, in addition, they may depend on the parameters. Therefore, we will have to evaluate their quantiles by parametric bootstrap.

B. Tests Based on Probability Integral Transform

The second class of tests is based on the random variables $\Lambda_{T_{i+1}}(\theta) - \Lambda_{T_i}(\theta)$, for $i = 0, \ldots, n - 1$. Under $H_0$, these variables are i.i.d. with standard exponential distribution [19]. Let us transform these variables into uniform ones. For $i = 0, \ldots, n - 1$, let $S(\cdot \mid \mathbf{T}_i; \theta)$ denote the reliability function of the interfailure time $T_{i+1} - T_i$ conditionally on $\mathbf{T}_i = (T_1, T_2, \ldots, T_i)$:

$$S(s \mid \mathbf{T}_i; \theta) := P\left(T_{i+1} - T_i > s \mid \mathbf{T}_i; \theta\right) = \exp\left(-\Lambda_{T_i + s}(\theta) + \Lambda_{T_i}(\theta)\right), \quad \text{for } s \ge 0.$$

Let us define the variables $U_i(\theta) = S(T_{i+1} - T_i \mid \mathbf{T}_i; \theta)$ for $i = 0, \ldots, n - 1$. Under the null hypothesis $H_0$: “$\lambda \in \mathcal{C}$”, the $U_i$'s are i.i.d. with standard uniform distribution $\mathcal{U}[0, 1]$. Such a transformation of the interfailure times is usually called PIT and consists in applying a cumulative distribution function (c.d.f.) to a random variable [6, p. 239]. In our case, the c.d.f. is conditional on the past and the transformation is Rosenblatt's [22], so we call it the conditional PIT.

The second class of goodness-of-fit tests that we propose in this article is based on the conditional PIT of the interfailure times. We can expect that the uniformity will no longer hold in case of model misspecification. Therefore, we can test the goodness of fit of an imperfect maintenance model by testing that the transformed interfailure times have a uniform distribution. In applications, $\theta$ is estimated and we consider the KS, CvM, and AD test statistics for testing the uniformity of $U_0(\hat{\theta}), \ldots, U_{n-1}(\hat{\theta})$ [6]. The test statistics are respectively

$$\mathrm{KS}_u(\hat{\theta}) = \sqrt{n} \sup_{x \in [0,1]} \left|F_{n,S}(x) - x\right| = \sqrt{n} \max\left\{ \max_{i=1,\ldots,n} \left(\frac{i}{n} - U_{(i-1)}(\hat{\theta})\right),\ \max_{i=1,\ldots,n} \left(U_{(i-1)}(\hat{\theta}) - \frac{i-1}{n}\right) \right\},$$

$$\mathrm{CvM}_u(\hat{\theta}) = n \int_0^1 \left(F_{n,S}(x) - x\right)^2 dx = \sum_{i=1}^{n} \left(U_{(i-1)}(\hat{\theta}) - \frac{2i - 1}{2n}\right)^2 + \frac{1}{12n}$$

and

$$\mathrm{AD}_u(\hat{\theta}) = n \int_0^1 \frac{\left(F_{n,S}(x) - x\right)^2}{x(1 - x)} \, dx = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1) \left[ \log U_{(i-1)}(\hat{\theta}) + \log\left(1 - U_{(n-i)}(\hat{\theta})\right) \right]$$

where $F_{n,S}$ is the empirical c.d.f. of the random variables $U_i(\hat{\theta})$, and $U_{(0)}(\hat{\theta}) \le U_{(1)}(\hat{\theta}) \le \cdots \le U_{(n-1)}(\hat{\theta})$ are the ordered $U_i(\hat{\theta})$, $i = 0, 1, \ldots, n - 1$.

Liu et al. [15] proposed to perform a goodness-of-fit test by comparing the value of $\mathrm{KS}_u(\hat{\theta})$ to critical values that can be found in classical tables [23] for testing the uniformity of a sample. However, we believe that this approach is questionable because the estimation of $\theta$ should be taken into account: even under $H_0$, the $U_i(\hat{\theta})$ are neither independent nor uniformly distributed.

Under $H_0$, the distributions of the statistics are not known. We have performed a numerical experiment to assess whether these distributions depend on the parameters or not. We have generated 4000 datasets under the ARA$_\infty$-PLP model for $b \in \{1.5, 3\}$, $\rho \in \{0.2, 0.8\}$, $a = 0.05$ and $n = 30$. For each sample, the maximum likelihood estimators of the parameters and the six test statistics $\mathrm{KS}_m(\hat{\theta})$, $\mathrm{CvM}_m(\hat{\theta})$, $\mathrm{AD}_m(\hat{\theta})$, $\mathrm{KS}_u(\hat{\theta})$, $\mathrm{CvM}_u(\hat{\theta})$ and $\mathrm{AD}_u(\hat{\theta})$ are computed. In Fig. 2, the boxplots of the 4000 simulated test statistics are presented. For each test statistic, it is not clear whether $b$ has an influence or not, but the boxplots are clearly different when the value of $\rho$ changes. Thus, this experiment seems to indicate that the distributions of the statistics depend on the model parameters. So the usual tables cannot be used to perform the KS, CvM and AD tests. It is necessary to use a parametric bootstrap approach.

Fig. 2. Boxplots of 4000 simulated test statistics under the ARA$_\infty$-PLP model for different values of $b$ and $\rho$, $a = 0.05$ and $n = 30$. (a) $\mathrm{KS}_m(\hat{\theta})$. (b) $\mathrm{CvM}_m(\hat{\theta})$. (c) $\mathrm{AD}_m(\hat{\theta})$. (d) $\mathrm{KS}_u(\hat{\theta})$. (e) $\mathrm{CvM}_u(\hat{\theta})$. (f) $\mathrm{AD}_u(\hat{\theta})$.

Other transforms of the exponential random variables $\Delta_i(\theta) = \Lambda_{T_i}(\theta) - \Lambda_{T_{i-1}}(\theta)$ into uniform variables can be used. For instance, Lindqvist et al. [14] proposed to apply the K transform [6, p. 431]. This transform is also used in the construction of the TTT plot to visualize a model's goodness of fit. Let $\Delta_{(1)}(\theta) < \Delta_{(2)}(\theta) < \cdots < \Delta_{(n)}(\theta)$ denote the order statistics of $\Delta_1(\theta), \ldots, \Delta_n(\theta)$. Then, under $H_0$, the variables

$$U_i(\theta) = \frac{\sum_{j=1}^{i} (n + 1 - j) \left(\Delta_{(j)}(\theta) - \Delta_{(j-1)}(\theta)\right)}{\sum_{j=1}^{n} \Delta_j(\theta)}$$

are order statistics from a uniform distribution sample, for $i = 1, \ldots, n - 1$. Thus, after estimating $\theta$, Lindqvist et al. [14] computed the AD statistic on the $U_i(\hat{\theta})$ to test the fit of the trend-renewal process. They also suggested using a parametric bootstrap approach to estimate the distribution of the test statistic under $H_0$.
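The conditional PIT statistics of this subsection are straightforward to compute once the cumulative intensities at the failure times are available. The Python sketch below is an illustration with hypothetical function names: `conditional_pit` maps the values $\Lambda_{T_1}, \ldots, \Lambda_{T_n}$ to $U_0, \ldots, U_{n-1}$ through $U_i = \exp(-(\Lambda_{T_{i+1}} - \Lambda_{T_i}))$, and `pit_statistics` evaluates the computational forms of $\mathrm{KS}_u$, $\mathrm{CvM}_u$ and $\mathrm{AD}_u$ given above. As the text stresses, the resulting values must be compared with bootstrap quantiles, not with the classical tables.

```python
import math


def conditional_pit(cum_int):
    """U_i = exp(-(Lambda_{T_{i+1}} - Lambda_{T_i})), i = 0..n-1, Lambda_{T_0} = 0."""
    lam = [0.0] + list(cum_int)
    return [math.exp(-(lam[i + 1] - lam[i])) for i in range(len(cum_int))]


def pit_statistics(u):
    """Return (KS_u, CvM_u, AD_u) for values u strictly between 0 and 1."""
    n = len(u)
    s = sorted(u)                        # s[k] is the order statistic U_(k), k = 0..n-1
    ks = math.sqrt(n) * max(
        max(i / n - s[i - 1] for i in range(1, n + 1)),
        max(s[i - 1] - (i - 1) / n for i in range(1, n + 1)),
    )
    cvm = sum((s[i - 1] - (2 * i - 1) / (2 * n)) ** 2
              for i in range(1, n + 1)) + 1 / (12 * n)
    ad = -n - sum(
        (2 * i - 1) * (math.log(s[i - 1]) + math.log(1 - s[n - i]))
        for i in range(1, n + 1)
    ) / n
    return ks, cvm, ad
```

As a check of the formulas, the evenly spaced values $(2i - 1)/(2n)$ give $\mathrm{CvM}_u = 1/(12n)$ and $\mathrm{KS}_u = 1/(2\sqrt{n})$.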

C. Construction of the Tests by Parametric Bootstrap

To perform the tests, we need to compare the test statistics to quantiles of their distribution under $H_0$. Because these distributions depend on the unknown model parameters, the computation of these quantiles cannot be done directly. We therefore use a bootstrap method. Bootstrap methods, introduced by Efron [24], are included in the large class of resampling methods. The general idea behind these methods is that the observations contain all the information about their distribution, so that no additional assumption is needed. Resampling the data with bootstrap samples gives access to extra information to perform hypothesis testing or to evaluate the accuracy of an estimator. For example, in the latter case, the estimator distribution is needed but, from the initial sample, one observes a unique realization of it. However, the strength of the bootstrap method lies in the fact that this distribution can be approximated by the empirical distribution of the estimators calculated from the bootstrap samples. We refer to [25] and [26] for more information about bootstrap.

In our situation, a parametric model is fitted to the data. So we will use parametric bootstrap goodness-of-fit tests, adapted from the method developed by Stute et al. [27] for i.i.d. random variables. Let $Z(\hat{\theta})$ denote the generic test statistic. In this article, $Z(\hat{\theta})$ will be one of the statistics $\mathrm{KS}_m(\hat{\theta})$, $\mathrm{CvM}_m(\hat{\theta})$, $\mathrm{AD}_m(\hat{\theta})$, $\mathrm{KS}_u(\hat{\theta})$, $\mathrm{CvM}_u(\hat{\theta})$ or $\mathrm{AD}_u(\hat{\theta})$, but other statistics could be used. Under $H_0$, $Z(\hat{\theta})$ is computed from the dataset $T_1, \ldots, T_n$ generated from a point process with intensity $\lambda(\theta)$. We would like to obtain i.i.d. replications of $Z(\hat{\theta})$ to compute the empirical quantiles of the statistic distributions. However, $\theta$ is unknown. Thus, $\theta$ is estimated by $\hat{\theta}$ and we simulate i.i.d. replications $T_1^*, \ldots, T_n^*$ from a point process with intensity $\lambda(\hat{\theta})$. For each replication, the maximum likelihood estimator $\hat{\theta}^*$ and the test statistic $Z^*(\hat{\theta}^*)$ can be computed. The closeness of $\theta$ and $\hat{\theta}$ makes us expect a very low difference between the empirical quantiles of the distributions of $Z(\hat{\theta})$ and $Z^*(\hat{\theta}^*)$. The general procedure for applying the test is described in the following algorithm:

1) Compute the MLE $\hat{\theta}$ of $\theta$ in the class of models $\mathcal{C}$ and compute the statistic $Z(\hat{\theta})$ on the dataset $T_1, \ldots, T_n$.
2) For $i = 1$ to $L$:
   a) Generate $T_{1,i}^*, T_{2,i}^*, \ldots, T_{n,i}^*$ under the model of intensity $\lambda(\hat{\theta}) \in \mathcal{C}$.
   b) Compute $\hat{\theta}_i^*$, the MLE of $\hat{\theta}$ from $T_{1,i}^*, \ldots, T_{n,i}^*$ in the model $\mathcal{C}$.
   c) Compute the statistic $Z_i^* = Z_i^*(\hat{\theta}_i^*)$ from $T_{1,i}^*, \ldots, T_{n,i}^*$ and $\hat{\theta}_i^*$.
3) The hypothesis $H_0$ is rejected at significance level $\alpha$ if $Z(\hat{\theta})$ is higher than the empirical quantile of order $1 - \alpha$ of $Z_1^*, \ldots, Z_L^*$.

The simulations and the computations of Step 2 are the parametric bootstrap part of the algorithm. Under usual regularity conditions, Stute et al. [27] have proved the asymptotic validity of the method in the classical framework of i.i.d. random variables for the usual KS and CvM tests, i.e., statistics built from a difference between two estimators of the c.d.f. Genest and Rémillard [28] have extended the result to the case of independent random vectors, in order to apply it to goodness-of-fit tests for copula models. The authors considered two test families, both corresponding to our test classes.
The first one is based on a KS or a CvM discrepancy between a parametric and an empirical copula and the second one is built with the PIT of the observations. At this point, the theoretical proof of the method has not yet been established in our case because of the dependence between the random variables $T_1, \ldots, T_n$. Further research should be carried out to answer this problem. In the next section, we provide a simulation study to assess the performance of the tests.

IV. SIMULATION STUDY

A. Simulation Design

To assess their quality, the tests are performed on a large number of simulated datasets. The power of a test is estimated by the percentage of rejection of the null hypothesis. First, we have to check that the significance levels of the tests are well respected. The empirical level is the percentage of rejection of $H_0$ when data are simulated according to $H_0$. The empirical level should be close to the theoretical level $\alpha$. Second, when data are simulated according to another model, the percentage of rejection should be as high as possible.
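The bootstrap procedure of Section III-C can be sketched end to end on a deliberately simple case. The Python code below is not the implementation used in the study: to keep the maximum likelihood step in closed form, it assumes a PLP with minimal repair (the ABAO case, $\rho = 0$) as the null model, uses the unconstrained failure-truncated PLP estimators $\hat{b} = n / \sum_{i<n} \log(T_n / T_i)$ and $\hat{a} = n / T_n^{\hat{b}}$, and takes the $\lceil (1 - \alpha) L \rceil$-th smallest bootstrap value as the empirical quantile. All function names are hypothetical.

```python
import math
import random


def simulate_plp(n, a, b, rng):
    """First n failure times of a PLP with minimal repair (time-changed unit Poisson process)."""
    g, times = 0.0, []
    for _ in range(n):
        g += rng.expovariate(1.0)              # Lambda_{T_i} = a * T_i**b
        times.append((g / a) ** (1.0 / b))
    return times


def mle_plp(times):
    """Closed-form MLE of (a, b) for a failure-truncated PLP (needs n >= 2)."""
    n, tn = len(times), times[-1]
    b = n / sum(math.log(tn / t) for t in times[:-1])
    return n / tn ** b, b


def ks_m(times, a, b):
    """KS-type statistic on the martingale residuals i - a * T_i**b."""
    return max(abs(i - a * t ** b) for i, t in enumerate(times, start=1))


def parametric_bootstrap_test(times, n_boot, alpha, rng):
    a_hat, b_hat = mle_plp(times)                          # Step 1
    z = ks_m(times, a_hat, b_hat)
    z_star = []
    for _ in range(n_boot):                                # Step 2
        boot = simulate_plp(len(times), a_hat, b_hat, rng) # 2a: simulate under lambda(theta_hat)
        a_star, b_star = mle_plp(boot)                     # 2b: re-estimate on the bootstrap sample
        z_star.append(ks_m(boot, a_star, b_star))          # 2c: bootstrap statistic
    z_star.sort()
    quantile = z_star[math.ceil((1 - alpha) * n_boot) - 1] # assumed quantile convention
    return z, quantile, z > quantile                       # Step 3


rng = random.Random(0)
data = simulate_plp(30, 0.05, 2.5, rng)
z, quantile, reject = parametric_bootstrap_test(data, 200, 0.05, rng)
```

Repeating the last three lines over many simulated datasets and averaging `reject` gives the empirical level when the data come from the null model, and the empirical power otherwise, which is the design described in this section. For the imperfect maintenance models of the paper, `simulate_plp`, `mle_plp` and `ks_m` would be replaced by their model-specific counterparts, with a numerical maximization of the likelihood.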

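The test level is set to $\alpha = 0.05$. We have considered three null hypotheses, all with PLP initial intensities. Under the first null hypothesis, the maintenance model is ARA$_\infty$ (model $\mathcal{C}_1$), under the second one, the maintenance model is ARA$_1$ (model $\mathcal{C}_2$) and under the third one, the maintenance model is QR (model $\mathcal{C}_3$). These models are denoted respectively ARA$_\infty$-PLP, ARA$_1$-PLP, and QR-PLP. More precisely, we test either

$$H_0: \lambda \in \mathcal{C}_1 := \left\{ a b \left(t - \rho \sum_{j=0}^{N_{t^-}-1} (1 - \rho)^j\, T_{N_{t^-}-j}\right)^{b-1},\ t \ge 0;\ a > 0,\ b \ge 1,\ 0 \le \rho \le 1 \right\} \quad \text{versus} \quad H_1: \lambda \notin \mathcal{C}_1,$$

or

$$H_0: \lambda \in \mathcal{C}_2 := \left\{ a b \left(t - \rho\, T_{N_{t^-}}\right)^{b-1},\ t \ge 0;\ a > 0,\ b \ge 1,\ 0 \le \rho \le 1 \right\} \quad \text{versus} \quad H_1: \lambda \notin \mathcal{C}_2,$$

or

$$H_0: \lambda \in \mathcal{C}_3 := \left\{ q^{-N_{t^-}}\, a b \left(q^{-N_{t^-}} \left(t - T_{N_{t^-}}\right)\right)^{b-1},\ t \ge 0;\ a > 0,\ b \ge 1,\ q \ge 0 \right\} \quad \text{versus} \quad H_1: \lambda \notin \mathcal{C}_3.$$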

The test powers are evaluated by simulating models which are not the model tested in $H_0$. For the maintenance effect, we chose ARA$_1$, ARA$_\infty$, BP, QR, and EGP models. For ARA models, we have set $\rho \in \{0.2, 0.8\}$, representing a low and a strong maintenance efficiency. Similarly, $p \in \{0.2, 0.8\}$ for the BP model. For the QR model, we have set $q \in \{0.8, 0.9, 0.95\}$. Indeed, when $q$ is close to 1, the counting process looks like a renewal process. When $q$ is too small, the interarrival times are too small and a phenomenon of explosion appears. For the EGP model, we have set $q \in \{0.8, 0.9, 0.95\}$ and $b_i = \sqrt{i - 1}$ or $b_i = i - 1$ for $i \in \{1, \ldots, n\}$.

For the initial intensity, we first chose a PLP with scale parameter $a = 0.05$ and shape parameter $b \in \{1.5, 2, 2.5, 3\}$, corresponding to mild to strong wear. We have also considered the LLP initial intensity with $a = -5$ and $b \in \{0.005, 0.01, 0.05, 0.1\}$.

For each of these models, $M = 1500$ datasets are simulated, each composed of $n = 30$ failures. The parameters are estimated by likelihood maximization under the constraints that $b \ge 1$ and $\rho \in [0, 1]$. The constraints ensure that the system is wearing and the repair is efficient. For the evaluation of the quantiles of the statistic distributions, $L = 1000$ bootstrap samples are generated. The $\mathrm{KS}_m$, $\mathrm{CvM}_m$, and $\mathrm{AD}_m$ tests based on the martingale residuals as well as the $\mathrm{KS}_u$, $\mathrm{CvM}_u$, and $\mathrm{AD}_u$ tests based on the conditional PIT are performed.

We have also applied the tests proposed by Liu et al. [15] and Lindqvist et al. [14], described in Section III. For the first one, we found empirical levels close to 0.1%, which is far below the expected 5%. This test is therefore badly calibrated and should not be used. For the second one, the powers are very similar to those of the $\mathrm{AD}_u$ test (difference of 1%).

For the case $H_0$: “$\lambda \in \mathcal{C}_3$”, we have compared the performances of our tests with goodness-of-fit tests for the QR model proposed by Lam [29]. These tests are built on the sequences $(W_i)$ and $(V_i)$, where $V_i = (T_i - T_{i-1})(T_{n-i+1} - T_{n-i})$ and $W_i = (T_{2i} - T_{2i-1})/(T_{2i-1} - T_{2(i-1)})$, for $i \in \{1, \ldots, n/2\}$.


Under the QR model, both sequences are sequences of i.i.d. random variables. Lam proposed to perform rank-based tests on these sequences, such as the turning point test or the difference-sign test. Both tests are asymptotic. We observed that the level of the difference-sign test is not well respected for n = 30 and that the power of the turning point test is very low. Therefore, only the results for the six new tests are presented in the following.

B. Simulation Results

The results of the simulation study are presented in Tables I, II, and III. In Table I, the null hypothesis is that the data come from the ARA∞-PLP model. The subtables present the estimated power of these tests for data simulated according to all the studied models. For instance, the top right corner gives the power of the tests of the ARA∞-PLP model when data are simulated according to an ARA1-PLP model. Similar results for testing the ARA1-PLP and QR-PLP models are given in Tables II and III, respectively. We first draw some general conclusions from the three tables, then examine each table separately.

The empirical levels are given at the top left corner of these tables. In all cases, these empirical levels are close to the theoretical level α = 5%. Hence, the six proposed tests are valid. The other subtables give the estimated powers of the tests. In some cases, the powers are good and even very good. But in several cases, the power is close to the theoretical level, indicating poor power. In a few cases, the power is less than the level, indicating that the corresponding tests are biased. This phenomenon sometimes happens in empirical studies of goodness-of-fit tests, as in [7] and [8]. In these cases, the biased tests should not be used.

Comparing the figures on a given line makes it possible to compare the powers of the six studied tests. For all alternatives except BP, the ADm test is more powerful than the KSm and CvMm tests in the martingale residuals class of tests. The ADu test is the most powerful test in the conditional PIT class of tests except for some alternatives of type QR. In each class of tests, the powers of the CvM and AD tests are very close.

The power results of the goodness-of-fit tests for the ARA∞-PLP model are presented in Table I. When data are simulated under the ARA1-PLP model, the powers are generally very low. Fig. 3(a) illustrates the fit of an ARA∞-PLP model to one dataset generated under the ARA1-PLP model. More precisely, we have generated a dataset with n = 30 failures under the ARA1-PLP model with parameters a = 0.05, b = 2.5, and ρ = 0.8. We have fitted an ARA∞-PLP model to this dataset. The resulting estimates are â = 0.0019, b̂ = 3.07, and ρ̂ = 0.05. The counting process of the failures N and the estimated cumulative intensity of the ARA∞-PLP model, Λ(â, b̂, ρ̂), are plotted over time. There is no obvious discrepancy between the two curves, so it is expected that the martingale residual tests will not be powerful in this case.

A similar phenomenon happens for the martingale residual tests when data are simulated under the BP-PLP model. This is illustrated by Fig. 3(b), where the counting process of a dataset simulated under a BP model and the estimated cumulative intensity of the ARA∞-PLP model are close. The KSm test appears to be the most powerful test in this family. For high values of b, the conditional PIT tests are more powerful than those based on martingale residuals but their powers are still very low for p = 0.8.

When data are generated under the QR-PLP model, the powers of the martingale residuals tests are very good, reaching 100% when q = 0.8. This is explained by the lack of flexibility of the QR model, in which interfailure times grow geometrically. The conditional PIT tests are less powerful but their powers are still acceptable. The QR model is softened with the use of the EGP model, as illustrated in Fig. 3(c) and (d). The powers vary a lot with q. When q tends to 1, the powers of all tests decrease as the model gets closer to a renewal process model, which is the limit of the ARA∞-PLP model [30].

Finally, the tests are globally not powerful for rejecting the ARA∞-PLP model when data are simulated under the ARA∞-LLP model. The PIT tests are slightly more powerful than the martingale-based tests.

The power results of the goodness-of-fit tests for the ARA1-PLP model are given in Table II. The previous comparison of the tests remains valid, that is, the ADm and ADu tests are the most powerful tests of each class, except when data are generated under the BP model. In this case, KSm is the most powerful test based on martingale residuals. For the ARA∞ alternative, the powers of the tests based on martingale residuals are very low for all considered parameters and it is difficult to establish a comparison from these low values. However, when ρ = 0.8, the ADu test is powerful. The low powers can be explained by the high flexibility of the ARA1 model. The model can fit a wide range of situations well, as illustrated in Fig. 4. Therefore, the model is hardly ever rejected.

Finally, the power results of the goodness-of-fit tests for the QR-PLP model are presented in Table III and illustrated in Fig. 5. The ADm test is not powerful except when data are simulated under the ARA1-PLP model. The ADu test is powerful for data generated under the BP and QR-LLP models.
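The simulated dataset behind Fig. 3(a) (n = 30 failures, ARA1-PLP with a = 0.05, b = 2.5, and ρ = 0.8) can be mimicked with a short simulation. The following is a minimal sketch and not the authors' code. It assumes the usual virtual-age formulation of the ARA models from [4] (virtual age (1 − ρ)T_i after the i-th repair for ARA1, and (1 − ρ)(A_{i−1} + X_i) for ARA∞) and a PLP initial intensity h(t) = a b t^(b−1). The function name `simulate_ara_plp`, the labels "ARA1" and "ARAinf", and the seed are illustrative choices.

```python
import math
import random


def simulate_ara_plp(n, a, b, rho, kind="ARA1", seed=0):
    """Simulate n failure times of an ARA-PLP model by inversion.

    Assumed (not restated in this section): cumulative initial intensity
    H(t) = a * t**b, and intensity h(virtual_age + time since last repair).
    Returns (failure_times, cumulative_intensity_at_failure_times),
    the latter evaluated at the TRUE parameters.
    """
    rng = random.Random(seed)
    times, cum_intensity = [], []
    t = 0.0      # calendar time of the last failure
    age = 0.0    # virtual age just after the last repair
    lam = 0.0    # cumulative intensity accumulated so far
    for _ in range(n):
        e = rng.expovariate(1.0)  # H(age + x) - H(age) is Exp(1) distributed
        x = (age ** b + e / a) ** (1.0 / b) - age  # next interfailure time
        lam += a * ((age + x) ** b - age ** b)
        t += x
        times.append(t)
        cum_intensity.append(lam)
        if kind == "ARA1":
            age = (1.0 - rho) * t          # only the last increment is reduced
        elif kind == "ARAinf":
            age = (1.0 - rho) * (age + x)  # the whole virtual age is reduced
        else:
            raise ValueError("kind must be 'ARA1' or 'ARAinf'")
    return times, cum_intensity


# Setting of Fig. 3(a): one ARA1-PLP dataset with n = 30 failures.
times, lam = simulate_ara_plp(30, a=0.05, b=2.5, rho=0.8, kind="ARA1", seed=1)
# Residuals i - Lambda_{T_i} at the true parameters, for i = 1..n.
residuals = [i + 1 - l for i, l in enumerate(lam)]
```

With ρ = 0 both variants coincide (ABAO), and with ρ = 1 the virtual age is reset to 0 after each repair (AGAN). Reproducing the plots of Fig. 3 would additionally require fitting the tested model by maximum likelihood, which this sketch does not attempt.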

C. Conclusion of the Simulation Study

As a conclusion of this simulation study, we recommend testing the goodness of fit of the ARA∞-PLP, ARA1-PLP, or QR-PLP models with both the ADm test built on martingale residuals and the ADu test based on the conditional PIT. It has been shown that the ARA∞ and BP models are close to a renewal process [30], which explains the difficulty of distinguishing one from the other. The ARA1-PLP and QR-PLP models seem to be very flexible as they are able to approach a large range of alternative models. Therefore, these models are hardly ever rejected. The goodness-of-fit tests of the ARA-PLP models are not powerful when data are simulated with an initial intensity of type LLP. We conclude that the proposed tests are better able to detect a discrepancy in the repair effect than in the shape of the intrinsic wear. This conclusion does not hold for QR models because the test powers are higher when the initial intensity is misspecified.

TABLE I POWER RESULTS OF THE GOODNESS-OF-FIT TESTS FOR THE ARA∞-PLP MODEL, n = 30

TABLE II POWER RESULTS OF THE GOODNESS-OF-FIT TESTS FOR THE ARA1-PLP MODEL, n = 30

TABLE III POWER RESULTS OF THE GOODNESS-OF-FIT TESTS FOR THE QR-PLP MODEL, n = 30

Fig. 3. Plots of N (solid lines) and Λ(â, b̂, ρ̂) under the ARA∞-PLP (dashed lines) over time for five datasets simulated under different models with n = 30 failures. (a) Simulation under the ARA1-PLP model with b = 2.5 and ρ = 0.8. Estimations under ARA∞: â = 1.9 × 10⁻³, b̂ = 3.07, and ρ̂ = 0.05. (b) Simulation under the BP-PLP model with b = 2.5 and p = 0.8. Estimations under ARA∞: â = 0.035, b̂ = 2.54, and ρ̂ = 0.81. (c) Simulation under the QR-PLP model with b = 2 and q = 0.9. Estimations under ARA∞: â = 0.01, b̂ = 2.28, and ρ̂ = 2.7 × 10⁻⁸. (d) Simulation under the EGP-PLP model with b = 2 and q = 0.9. Estimations under ARA∞: â = 3.5 × 10⁻⁴, b̂ = 3.76, and ρ̂ = 0.26. (e) Simulation under the ARA∞-LLP model with a = −5, b = 0.05, and ρ = 0.8. Estimations under ARA∞: â = 2 × 10⁻⁴, b̂ = 2.33, and ρ̂ = 0.91.

On the whole, in our simulation study, the powers are not very high. This is partly due to the facts explained above, but it is also because we have chosen a number of simulated failures equal to 30. This moderate number is representative of real-life data. For larger values, the powers should be much higher. To check this, we have tested the goodness of fit of the ARA1-PLP model for data simulated under the BP-PLP model with n = 100. We have also simulated L = 4000 bootstrap samples so that the quantile of the statistic distribution under H0 is more accurately estimated. Finally, we have repeated M = 5000 simulations to calculate the empirical test powers. The results are given in Table IV. For the martingale residuals tests and p = 0.8, the power is as low as for n = 30, indicating that the tests may be biased in this case. But in all other cases, the powers of the tests have increased significantly.

TABLE IV POWER RESULTS OF THE GOODNESS-OF-FIT TESTS FOR THE ARA∞-PLP MODEL WHEN DATA ARE SIMULATED UNDER THE BP-PLP MODEL, n = 100

| BP p | b   | KSm  | CvMm | ADm  | KSu  | CvMu | ADu  |
|------|-----|------|------|------|------|------|------|
| 0.2  | 1.5 | 7.1  | 5.6  | 5.0  | 5.5  | 5.7  | 5.6  |
| 0.2  | 2   | 15.5 | 11.8 | 11.8 | 19.3 | 22.4 | 22.7 |
| 0.2  | 2.5 | 25.5 | 19.5 | 20.3 | 60.0 | 67.4 | 68.7 |
| 0.2  | 3   | 34.7 | 26.3 | 28.2 | 89.8 | 94.0 | 94.4 |
| 0.8  | 1.5 | 2.2  | 3.0  | 2.9  | 50.1 | 53.1 | 53.0 |
| 0.8  | 2   | 1.2  | 1.5  | 1.3  | 64.7 | 67.4 | 66.6 |
| 0.8  | 2.5 | 1.6  | 1.0  | 0.8  | 72.5 | 73.6 | 73.0 |
| 0.8  | 3   | 3.1  | 0.8  | 0.6  | 69.5 | 70.0 | 69.6 |

Fig. 4. Plots of N (solid lines) and Λ(â, b̂, ρ̂) under the ARA1-PLP (dashed lines) over time for five datasets simulated under different models with n = 30 failures. (a) Simulation under the ARA∞-PLP model with b = 2.5 and ρ = 0.8. Estimations under ARA1: â = 0.051, b̂ = 2.35, and ρ̂ = 0.975. (b) Simulation under the BP-PLP model with b = 2.5 and p = 0.8. Estimations under ARA1: â = 0.060, b̂ = 2.16, and ρ̂ = 0.987. (c) Simulation under the QR-PLP model with b = 2.5 and q = 0.9. Estimations under ARA1: â = 0.0016, b̂ = 5.82, and ρ̂ = 0.871. (d) Simulation under the EGP-PLP model with b = 2.5 and q = 0.9. Estimations under ARA1: â = 0.024, b̂ = 3.33, and ρ̂ = 0.981. (e) Simulation under the ARA1-LLP model with a = −5, b = 0.05, and ρ = 0.8. Estimations under ARA1: â = 2.8 × 10⁻⁶, b̂ = 3.22, and ρ̂ = 0.72.

V. APPLICATIONS

In this section, we apply the goodness-of-fit tests for the ARA∞-PLP, ARA1-PLP, and QR-PLP models to several real datasets. For the purpose of illustration, we have applied the six tests but, following our recommendation based on the simulation study, the conclusion can be established with ADm and ADu only. In each case, we give the maximum likelihood estimates of the model parameters and the p-values of the six tests. We recall that a model is rejected at a given level α if the p-value is less than α. To compare the three tested models, we have also calculated the AIC:

$$\mathrm{AIC} = -2 \max_{\theta \in \Theta} \log\left(L_n(\theta)\right) + 2d$$

where d is the number of estimated parameters (here d = 3). The best model among those considered is the one that minimizes the AIC.
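As a small illustration of the criterion, the sketch below evaluates AIC = −2 max log L + 2d and picks the minimizer. The AIC values are those reported for the photocopier data in Table VI. The helper names `aic` and `select_by_aic` and the model labels are hypothetical, and the maximized log-likelihood is backed out from the reported AIC rather than taken from the source.

```python
def aic(max_log_likelihood, d):
    """Akaike information criterion for a model with d estimated parameters."""
    return -2.0 * max_log_likelihood + 2.0 * d


def select_by_aic(aic_by_model):
    """Return the name of the model with the smallest AIC."""
    return min(aic_by_model, key=aic_by_model.get)


# AIC values reported in Table VI (photocopier data), d = 3 for each model.
photocopier_aic = {"ARAinf-PLP": 380.33, "ARA1-PLP": 357.52, "QR-PLP": 349.74}
best = select_by_aic(photocopier_aic)

# Maximized log-likelihood implied by the reported AIC of the selected model.
implied_loglik = (2 * 3 - photocopier_aic[best]) / 2.0
```

Since d = 3 for the three models, ranking by AIC is here the same as ranking by maximized likelihood. The penalty term would only matter when comparing models with different numbers of parameters.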

Fig. 5. Plots of N (solid lines) and Λ(â, b̂, ρ̂) under the QR-PLP (dashed lines) over time for five datasets simulated under different models with n = 30 failures. (a) Simulation under the ARA∞-PLP model with a = 0.05, b = 2.5, and ρ = 0.8. Estimations under QR: â = 0.05, b̂ = 2.22, and q̂ = 0.98. (b) Simulation under the ARA1-PLP model with a = 0.05, b = 2.5, and ρ = 0.8. Estimations under QR: â = 0.28, b̂ = 1.50, and q̂ = 0.94. (c) Simulation under the BP-PLP model with a = 0.05, b = 2.5, and p = 0.8. Estimations under QR: â = 0.12, b̂ = 2.03, and q̂ = 1.01. (d) Simulation under the EGP-PLP model with a = 0.05, b = 2.5, and q = 0.9. Estimations under QR: â = 0.06, b̂ = 2.68, and q̂ = 0.99. (e) Simulation under the QR-LLP model with a = −5, b = 0.05, and q = 0.9. Estimations under QR: â = 0.0005, b̂ = 2.28, and q̂ = 0.91.

A. EDF Data

The first dataset concerns four identical and independent systems in four different EDF coal-fired power stations, denoted S1, S2, S3, and S4. The dates of maintenance of the four systems are collected over a time period of 9 years. The results are given in Table V. For S1 and S2, the p-values are high so none of the three models is rejected. This is illustrated in Fig. 6, where the counting processes are close to the estimated cumulative intensities for S1 and S2. With the AIC, we select the QR model for S1 and the ARA∞ model for S2. For S3 and S4, most p-values are low so all models are rejected. This may be because these data exhibit long time periods without any failure: this phenomenon cannot be captured by our models. For S1 and S4, the estimated values of the parameter ρ of the ARA1 and ARA∞ models are nearly 0, so these models reduce to nonhomogeneous Poisson processes and the corresponding plots coincide in Fig. 6. A nonhomogeneous Poisson process may be a relevant model for S1 but not for S4 because the null hypotheses have been rejected.

TABLE V ESTIMATED PARAMETERS, p-VALUES OF THE GOODNESS-OF-FIT TESTS AND AIC FOR THE ARA∞-PLP, ARA1-PLP, AND QR-PLP MODELS FOR THE EDF DATA

| Syst. id | n  | Model | â            | b̂    | ρ̂/q̂        | KSm  | CvMm | ADm  | KSu  | CvMu | ADu  | AIC    |
|----------|----|-------|--------------|------|------------|------|------|------|------|------|------|--------|
| S1       | 22 | ARA∞  | 1.2 × 10⁻⁴   | 1.51 | 7.6 × 10⁻⁶ | 0.95 | 0.82 | 0.58 | 0.48 | 0.55 | 0.57 | 264.64 |
| S1       | 22 | ARA1  | 1.2 × 10⁻⁴   | 1.51 | 7.3 × 10⁻⁵ | 0.94 | 0.60 | 0.41 | 0.44 | 0.51 | 0.55 | 264.64 |
| S1       | 22 | QR    | 1.3 × 10⁻³   | 1.18 | 0.94       | 0.47 | 0.46 | 0.38 | 0.09 | 0.11 | 0.21 | 263.69 |
| S2       | 23 | ARA∞  | 8.03 × 10⁻⁹  | 2.84 | 0.11       | 0.84 | 0.84 | 0.71 | 0.89 | 0.67 | 0.64 | 276.29 |
| S2       | 23 | ARA1  | 9.14 × 10⁻⁶  | 1.86 | 0.48       | 0.40 | 0.70 | 0.81 | 0.99 | 0.98 | 0.94 | 278.09 |
| S2       | 23 | QR    | 1.87 × 10⁻³  | 1.13 | 0.95       | 0.55 | 0.48 | 0.51 | 0.80 | 0.86 | 0.77 | 280.63 |
| S3       | 16 | ARA∞  | 9.20 × 10⁻¹⁹ | 6.14 | 0.16       | 0.13 | 0.39 | 0.34 | 0.17 | 0.05 | 0.09 | 194.82 |
| S3       | 16 | ARA1  | 7.57 × 10⁻⁷  | 2.08 | 5.0 × 10⁻⁶ | 0.33 | 0.63 | 0.65 | 0.04 | 0.06 | 0.06 | 201.43 |
| S3       | 16 | QR    | 1.17 × 10⁻³  | 1.14 | 0.92       | 0.38 | 0.43 | 0.31 | 0.10 | 0.01 | 0.01 | 205.00 |
| S4       | 31 | ARA∞  | 9.4 × 10⁻⁵   | 1.56 | 1.2 × 10⁻⁶ | 0.11 | 0.04 | 0.05 | 0.07 | 0.02 | 0.03 | 354.18 |
| S4       | 31 | ARA1  | 9.3 × 10⁻⁵   | 1.56 | 3.8 × 10⁻⁵ | 0.15 | 0.02 | 0.02 | 0.05 | 0.01 | 0.03 | 354.17 |
| S4       | 31 | QR    | 4.8 × 10⁻³   | 1.00 | 0.95       | 0.27 | 0.18 | 0.28 | 0.08 | 0.03 | 0.04 | 352.38 |

B. Photocopier Data

The second dataset is composed of the maintenance times of a photocopier [31, p. 314]. During four and a half years after being put in service, the system has been repaired n = 42 times. Doyen [32] considered an ARA1-PLP model on this dataset. The results are given in Table VI. We can notice that the estimate of ρ is close to 0 under the ARA∞-PLP model, corresponding to an ABAO model. This model is rejected by all tests based on the conditional PIT. The other two models are not rejected by any of the six tests. So the choice of the ARA1-PLP model made by Doyen [32] can be considered valid. However, the AIC of the ARA1 model is higher than that of the QR model, so we would recommend using the latter. The nonrejection of the three models by the tests based on martingale residuals is illustrated in Fig. 7, in which the failure counting process is close to the estimated cumulative intensities of all the considered models.

Fig. 6. Plots of N (solid lines), Λ(â, b̂, ρ̂) of the ARA∞-PLP model (dashed blue line), the ARA1-PLP model (dotted red line), and the QR-PLP model (black plain line) over time for the EDF data. (a) S1. (b) S2. (c) S3. (d) S4.

TABLE VI ESTIMATED PARAMETERS, p-VALUES OF THE GOODNESS-OF-FIT TESTS AND AIC FOR THE ARA∞-PLP, ARA1-PLP, AND QR-PLP MODELS FOR THE PHOTOCOPIER DATA

| Model    | â          | b̂   | ρ̂/q̂        | KSm  | CvMm | ADm  | KSu    | CvMu   | ADu    | AIC    |
|----------|------------|-----|------------|------|------|------|--------|--------|--------|--------|
| ARA∞-PLP | 3.9 × 10⁻⁴ | 1.6 | 1.0 × 10⁻⁶ | 0.31 | 0.26 | 0.24 | < 0.01 | < 0.01 | < 0.01 | 380.33 |
| ARA1-PLP | 3.7 × 10⁻⁹ | 4.3 | 0.95       | 0.19 | 0.12 | 0.09 | 0.42   | 0.54   | 0.60   | 357.52 |
| QR-PLP   | 2.3 × 10⁻⁵ | 2.3 | 0.96       | 0.44 | 0.38 | 0.31 | 0.45   | 0.61   | 0.58   | 349.74 |

Fig. 7. Plots of N (solid lines), Λ(â, b̂, ρ̂) of the ARA∞-PLP model (dashed blue line), the ARA1-PLP model (dotted red line), and the QR-PLP model (black plain line) over time for the photocopier data.

C. AMC Data

The last considered example is the AMC Ambassador dataset [33], which is composed of n = 18 failure times of an AMC Ambassador car owned by the Ohio state government. The dataset has been analyzed by several authors including Guida and Pulcini [34] and Corset et al. [35], who performed Bayesian analyses. The results are given in Table VII. None of the tests rejects any of the three considered models, but the p-values of the tests of the ARA∞ model are lower than those of the ARA1 and QR models. The AIC leads to selecting the ARA1-PLP model, as in [35]. In Fig. 8, the failure counting process is plotted over time with the estimated cumulative intensities of the three models. This figure illustrates the fact that the ARA∞ model provides a worse fit than the ARA1 and QR models.

TABLE VII ESTIMATED PARAMETERS, p-VALUES OF THE GOODNESS-OF-FIT TESTS AND AIC FOR THE ARA∞-PLP, ARA1-PLP, AND QR-PLP MODELS FOR THE AMC DATA

| Model | â           | b̂    | ρ̂/q̂  | KSm  | CvMm | ADm  | KSu  | CvMu | ADu  | AIC    |
|-------|-------------|------|------|------|------|------|------|------|------|--------|
| ARA∞  | 2.11 × 10⁻⁹ | 3.58 | 0.25 | 0.24 | 0.12 | 0.15 | 0.18 | 0.18 | 0.17 | 191.36 |
| ARA1  | 1.30 × 10⁻⁷ | 3.10 | 0.90 | 0.95 | 0.97 | 0.98 | 0.96 | 0.86 | 0.89 | 189.99 |
| QR    | 1.07 × 10⁻⁴ | 1.86 | 0.95 | 0.87 | 0.94 | 0.93 | 0.58 | 0.29 | 0.34 | 190.15 |

VI. CONCLUSION

We have presented two classes of goodness-of-fit tests for imperfect maintenance models. The first class, based on martingale residuals, measures a discrepancy between the counting process of failures and the estimated cumulative intensity. The second class of tests is built on the conditional PIT of the interfailure times. The exact and asymptotic distributions of the test statistics in both families are not known. Therefore, their upper quantiles are evaluated with parametric bootstrap techniques. In both families, the AD tests performed well in most simulated cases and we recommend their use in practice.
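The bootstrap logic summarized in this paragraph can be sketched on a deliberately simple null model. The code below is an illustration under stated assumptions, not the authors' implementation: the null model is that interfailure times are i.i.d. exponential (chosen only because its maximum likelihood estimate has a closed form, whereas ARA and QR models need numerical optimization), the statistic is the classical Anderson-Darling distance between the probability integral transforms and the uniform distribution, and the p-value uses the common (k + 1)/(L + 1) convention. All function names are hypothetical.

```python
import math
import random


def ad_statistic(u):
    """Anderson-Darling distance between the sample u and the U(0, 1) law."""
    n = len(u)
    eps = 1e-12  # guard against log(0)
    s = sorted(min(max(v, eps), 1.0 - eps) for v in u)
    total = sum((2 * i + 1) * (math.log(s[i]) + math.log(1.0 - s[n - 1 - i]))
                for i in range(n))
    return -n - total / n


def fit_rate(x):
    """MLE of the rate of i.i.d. exponential interfailure times (toy model)."""
    return len(x) / sum(x)


def pit(x, rate):
    """Probability integral transforms under the fitted toy model."""
    return [1.0 - math.exp(-rate * xi) for xi in x]


def bootstrap_pvalue(x, n_boot=500, seed=0):
    """Parametric bootstrap p-value of the AD test for the toy null model."""
    rng = random.Random(seed)
    n = len(x)
    rate_hat = fit_rate(x)
    observed = ad_statistic(pit(x, rate_hat))
    exceed = 0
    for _ in range(n_boot):
        # Simulate under the fitted model, then re-estimate on each replicate.
        xb = [rng.expovariate(rate_hat) for _ in range(n)]
        if ad_statistic(pit(xb, fit_rate(xb))) >= observed:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)


# Nearly constant interfailure times are far from exponential.
regular_times = [1.0 + 0.01 * i for i in range(30)]
p_value = bootstrap_pvalue(regular_times)
```

Re-estimating the parameter on every bootstrap replicate is what lets the reference distribution account for the fact that the parameters are estimated, which is the reason standard critical values cannot be used directly.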

Fig. 8. Plots of N (solid lines), Λ(â, b̂, ρ̂) of the ARA∞-PLP model (dashed blue line), the ARA1-PLP model (dotted red line), and the QR-PLP model (black plain line) over time for the AMC data.

The methodology has been applied to ARA and QR models but it is completely general and can be applied to any imperfect maintenance model. These tests could also be applied to perfect or minimal repair models, even if they are expected to be less powerful than the exact tests that already exist. Our simulation study seems to indicate that the method is correct but further work is needed to obtain a theoretical validation of the method. We have used the KS, CvM, and AD statistics but other statistics could be of interest. The moderate power of some of these tests means that, for small datasets, it may be very difficult to discriminate between usual imperfect maintenance models. In any case, the ARA1-PLP and QR-PLP models are very flexible and can be adapted to a broad range of data.

One direction for future work is to build a test whose statistic is a combination of the ADm and ADu statistics, such as the maximum or a weighted sum. Indeed, from our simulation study, these two tests seem to complement one another and a combined test could be powerful for detecting a broad range of alternatives. It would also be interesting to obtain asymptotic results for the statistics proposed in this article in order to know their distribution functions and apply other kinds of tests. One could also work on finding other tests that do not require the bootstrap procedure. For further research, it would be of interest to incorporate preventive maintenance as well as to study the case where several identical systems are observed in parallel.

APPENDIX A
PROOF OF (2)

By definition,

$$\mathrm{AD}_m(\hat\theta) = \int_0^{T_n} \frac{\left(N(t) - \Lambda_t(\hat\theta)\right)^2}{\Lambda_t(\hat\theta)\left(n + 1 - \Lambda_t(\hat\theta)\right)}\, d\Lambda_t(\hat\theta) = \sum_{i=1}^{n} \int_{T_{i-1}}^{T_i} \frac{\left(i - 1 - \Lambda_t(\hat\theta)\right)^2}{\Lambda_t(\hat\theta)\left(n + 1 - \Lambda_t(\hat\theta)\right)}\, d\Lambda_t(\hat\theta).$$

With the change of variables $x = \Lambda_t(\hat\theta)$, we have

$$\mathrm{AD}_m(\hat\theta) = \sum_{i=1}^{n} \int_{\Lambda_{T_{i-1}}(\hat\theta)}^{\Lambda_{T_i}(\hat\theta)} \frac{(i - 1 - x)^2}{x(n + 1 - x)}\, dx.$$

By noticing that, for $x \in\, ]0, n[$,

$$\frac{(i - 1 - x)^2}{x(n + 1 - x)} = \frac{(i - 1)^2}{(n + 1)x} + \frac{(n + 2 - i)^2}{(n + 1)(n + 1 - x)} - 1,$$

we have

$$\frac{(i - 1 - x)^2}{x(n + 1 - x)} = \frac{\partial}{\partial x}\left[\frac{(i - 1)^2}{n + 1}\log(x) - \frac{(n + 2 - i)^2}{n + 1}\log(n + 1 - x) - x\right].$$

Summing over i, and using $\Lambda_{T_0}(\hat\theta) = 0$ together with $\Lambda_{T_n}(\hat\theta) = n$, we obtain

$$\mathrm{AD}_m(\hat\theta) = \frac{1}{n + 1}\sum_{i=2}^{n}\left[(i - 1)^2 \log\frac{\Lambda_{T_i}(\hat\theta)}{\Lambda_{T_{i-1}}(\hat\theta)} - (n + 2 - i)^2 \log\frac{n + 1 - \Lambda_{T_i}(\hat\theta)}{n + 1 - \Lambda_{T_{i-1}}(\hat\theta)}\right] - (n + 1)\log\left(1 - \frac{\Lambda_{T_1}(\hat\theta)}{n + 1}\right) - n.$$
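The closed form can be checked numerically against the defining integral. The sketch below is illustrative only: the function names and the input values are made up, and `lam[i-1]` stands for the fitted cumulative intensity at the i-th failure time. The code keeps the last term as −Λ_{T_n} rather than −n, so that it stays valid when the fitted cumulative intensity at T_n differs from n. The i = 1 term is −(n + 1) log(1 − Λ_{T_1}/(n + 1)), obtained by integrating x/(n + 1 − x) from 0.

```python
import math


def adm_closed_form(lam):
    """Closed-form AD_m from lam[i-1] = Lambda_{T_i}(theta_hat), i = 1..n.

    Requires 0 < lam[0] < ... < lam[-1] < n + 1.
    """
    n = len(lam)
    # i = 1 term, plus the telescoped sum of the "-x" terms.
    value = -(n + 1) * math.log(1.0 - lam[0] / (n + 1)) - lam[-1]
    for i in range(2, n + 1):
        lo, hi = lam[i - 2], lam[i - 1]
        value += ((i - 1) ** 2 * math.log(hi / lo)
                  - (n + 2 - i) ** 2 * math.log((n + 1 - hi) / (n + 1 - lo))
                  ) / (n + 1)
    return value


def adm_numeric(lam, steps=20000):
    """Midpoint-rule evaluation of the piecewise integral defining AD_m."""
    n = len(lam)
    bounds = [0.0] + list(lam)
    total = 0.0
    for i in range(1, n + 1):
        lo, hi = bounds[i - 1], bounds[i]
        h = (hi - lo) / steps
        for k in range(steps):
            x = lo + (k + 0.5) * h  # midpoints never hit x = 0
            total += (i - 1 - x) ** 2 / (x * (n + 1 - x)) * h
    return total


# Made-up fitted cumulative intensities for n = 4 failures, ending at n.
example = [0.7, 1.9, 2.4, 4.0]
gap = abs(adm_closed_form(example) - adm_numeric(example))
```

For n = 1 and Λ_{T_1} = 1 the integral reduces to the integral of x/(2 − x) over [0, 1], that is 2 log 2 − 1, which the closed form reproduces.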

This completes the proof.

REFERENCES

[1] M. Brown and F. Proschan, “Imperfect repair,” J. Appl. Probability, vol. 20, pp. 851–859, 1983.
[2] M. Kijima, “Some results for repairable systems with general repair,” J. Appl. Probability, vol. 26, pp. 89–102, 1989.
[3] H. Wang and H. Pham, “A quasi renewal process and its application in imperfect maintenance,” Int. J. Syst. Sci., vol. 27, pp. 1055–1062, 1996.
[4] L. Doyen and O. Gaudoin, “Classes of imperfect repair models based on reduction of failure intensity or virtual age,” Rel. Eng. Syst. Safety, vol. 84, pp. 45–56, 2004.
[5] L. Bordes and S. Mercier, “Extended geometric processes: Semiparametric estimation and application to reliability,” J. Iranian Statist. Soc., vol. 12, pp. 1–34, 2013.
[6] R. D’Agostino and M. Stephens, Goodness-of-Fit-Techniques. Boca Raton, FL, USA: CRC Press, 1986.
[7] N. Henze and S. G. Meintanis, “Recent and classical tests for exponentiality: A partial review with comparisons,” Metrika, vol. 61, pp. 29–45, 2005.
[8] M. Krit, O. Gaudoin, M. Xie, and E. Remy, “Simplified likelihood based goodness-of-fit tests for the Weibull distribution,” Commun. Statist., Simul. Comput., vol. 45, pp. 920–951, 2016.
[9] O. Gaudoin, “CPIT goodness-of-fit tests for the power-law process,” Commun. Statist., Theory Methods, vol. 27, pp. 165–180, 1998.
[10] E. Crétois, M. El Aroui, and O. Gaudoin, “U-plot method for testing the goodness-of-fit of the power-law process,” Commun. Statist., Theory Methods, vol. 28, pp. 1731–1747, 1999.
[11] W. Park and Y. Kim, “Goodness-of-fit tests for the power law process,” IEEE Trans. Rel., vol. 41, no. 1, pp. 107–111, Mar. 1992.
[12] J. Zhao and J. Wang, “A new goodness-of-fit test based on the Laplace statistic for a large class of NHPP models,” Commun. Statist., Simul. Comput., vol. 34, pp. 725–736, 2005.
[13] B. Lindqvist and B. Rannestad, “Monte Carlo exact goodness-of-fit tests for nonhomogeneous Poisson Processes,” Appl. Stochastic Models Business Ind., vol. 27, pp. 329–341, 2011.
[14] B. H. Lindqvist, G. Elvebakk, and K. Heggland, “The trend-renewal process for statistical analysis of repairable systems,” Technometrics, vol. 45, pp. 31–44, 2003.
[15] Y. Liu, H.-Z. Huang, and X. Zhang, “A data-driven approach to selecting imperfect maintenance models,” IEEE Trans. Rel., vol. 61, no. 1, pp. 101–112, Mar. 2012.
[16] L. Doyen, “On the Brown-Proschan model when repair effects are unknown,” Appl. Stochastic Models Business Ind., vol. 27, pp. 600–618, 2011.
[17] Y. Lam, “Geometric processes and replacement problem,” Acta Mathematicae Applicatae, vol. 4, pp. 366–377, 1988.
[18] P. K. Andersen, O. Borgan, R. D. Gill, and N. Keiding, Statistical Models Based on Counting Processes. New York, NY, USA: Springer-Verlag, 1993.
[19] R. J. Cook and J. F. Lawless, The Statistical Analysis of Recurrent Events. New York, NY, USA: Springer, 2007.
[20] D. Y. Lin, L. J. Wei, and Z. Ying, “Checking the Cox model with cumulative sums of martingale-based residuals,” Biometrika, vol. 80, pp. 557–572, 1993.
[21] N. L. Hjort, “Goodness of fit tests in models for life history data based on cumulative hazard rates,” Ann. Statist., vol. 18, pp. 1221–1258, 1990.
[22] M. Rosenblatt, “Remarks on a multivariate transformation,” Ann. Math. Statist., vol. 7, pp. 470–472, 1952.
[23] A. Haldar and S. Mahadevan, Probability, Reliability, and Statistical Methods in Engineering Design. New York, NY, USA: Wiley, 2000.
[24] B. Efron, “Bootstrap methods: Another look at the jackknife,” Ann. Statist., vol. 7, pp. 1–26, 1979.
[25] B. Efron and F. Tibshirani, An Introduction to the Bootstrap. Boca Raton, FL, USA: CRC Press, 1993.
[26] A. Davison and D. Hinkley, Bootstrap Methods and Their Applications. Cambridge, U.K.: Cambridge Univ. Press, 1997.
[27] W. Stute, W. G. Manteiga, and M. P. Quindmil, “Bootstrap based goodness-of-fit-tests,” Metrika, vol. 40, pp. 243–256, 1993.
[28] C. Genest and B. Rémillard, “Validity of the parametric bootstrap for goodness-of-fit testing in semiparametric models,” Annales de l’IHP, vol. 44, pp. 1096–1127, 2008.
[29] Y. Lam, “Nonparametric inference for geometric processes,” Commun. Statist. Theory Methods, vol. 21, pp. 2083–2105, 1992.
[30] G. Last and R. Szekli, “Asymptotic and monotonicity properties of some repairable systems,” Adv. Appl. Probability, vol. 30, pp. 1089–1110, 1998.
[31] D. N. P. Murthy, M. Xie, and R. Jiang, Weibull Models. New York, NY, USA: Wiley, 2003.
[32] L. Doyen, “Asymptotic properties of imperfect repair models and estimation of repair efficiency,” Naval Res. Logistics, vol. 57, pp. 296–307, 2010.
[33] C. W. Ahn, K. C. Chae, and G. M. Clark, “Estimating parameters of the power-law process with two measures of time,” J. Quality Technol., vol. 30, pp. 127–132, 1998.
[34] M. Guida and G. Pulcini, “Bayesian analysis of repairable systems showing a bounded failure intensity,” Rel. Eng. Syst. Safety, vol. 91, pp. 828–838, 2006.
[35] F. Corset, L. Doyen, and O. Gaudoin, “Bayesian analysis of ARA imperfect repair models,” Commun. Statist. Theory Methods, vol. 41, pp. 3915–3941, 2012.

Cécile Chauvel received the Master’s degree in statistics and the Ph.D. degree in applied mathematics from Paris 6 University, France, in 2011 and 2014, respectively. She had a postdoctoral position at Université Grenoble Alpes in 2015. She is currently working as a research engineer in a chemical company.

Jean-Yves Dauxois received the M.Sc. degree in applied mathematics from the University of Toulouse, France, and the Ph.D. degree in applied mathematics from the University of Pau and Pays de l’Adour, France, in 1996. He is a Professor at the Institut National des Sciences Appliquées de Toulouse, University of Toulouse-INSA. His main research interest is statistical inference for lifetime data with applications in biostatistics or reliability. Some of the subjects he has investigated are semiparametric or nonparametric inference, competing risks, length biased data, recurrent events, large sample behavior using counting processes, and martingale techniques.

Laurent Doyen received the Ph.D. degree in applied mathematics from Grenoble University, France, in 2004. He is an Associate Professor at Université Grenoble Alpes, France. His research interests include probability and statistics applied to reliability, ageing, imperfect maintenance modeling, competing risks, and random processes in reliability.

Olivier Gaudoin received the M.Sc. and Ph.D. degrees in applied mathematics from Grenoble University, France, in 1986 and 1990, respectively. He is a Professor at Grenoble Institute of Technology, Université Grenoble Alpes, France. His research interests are probabilistic modeling and statistical analysis for the reliability of complex systems, including ageing and maintenance modeling, competing risks, goodness-of-fit testing, and random processes in reliability.