Prediction of Gross Tree Volume Using Regression Models with Non-Normal Error Distributions

Michael S. Williams and Hans T. Schreuder

ABSTRACT. Previous work in weighted linear regression, where weight functions are used to obtain homogeneous variance on a transformed scale, has often assumed that the errors are normally distributed. In a study of four data sets, three of which were actual data sets with unknown error distributions and one an artificial set with a known error distribution, this assumption was found to be incorrect. Consequently, we tested a transformation of the normal distribution, called the S_U distribution, and compared it with the normal as an alternative. For three of the four data sets studied, the S_U distribution was superior. Prediction intervals and biases for the regression estimators generated using the S_U and normal distributions were also evaluated. Results for the S_U distribution bettered those for the normal distribution in three of the four data sets. For the remaining data set, they were comparable. FOR. SCI. 42(4):419-430.

Additional Key Words: Bias, prediction intervals, S_U distribution.

Literature Review

LINEAR REGRESSION IS WIDELY USED IN FORESTRY to estimate total tree volume (V) as a function of tree diameter squared ($D^2$, taken at 1.37 m or 4.5 ft above the ground) times total tree height (H), or of $D^2$ alone. A commonly used model is

$$V_i = \alpha + \beta D_i^2 H_i + \varepsilon_i \qquad (1)$$

with $\varepsilon_i \sim N[0, \sigma^2 v_i^2]$, $E[\varepsilon_i \varepsilon_j] = (\sigma v_i)^2$ for $i = j$, and $E[\varepsilon_i \varepsilon_j] = 0$ for $i \ne j$. Much work has already been performed on estimating the regression coefficients and weighting functions used to equalize the variance for Equation (1). This work has often assumed normally distributed errors, which has rarely been verified. If the assumption is incorrect, undesirable effects could occur, the most likely being less than optimal estimates of the tolerance and prediction intervals. The aim of this study was to test the normality assumption and, if it was incorrect, to determine an alternative distribution to the normal for describing the distribution of the $\varepsilon_i$ in Equation (1). The performance of the estimators derived from these two distributions was compared.
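As a rough illustration of how Equation (1) can be fit once a weight function is fixed, the sketch below simulates volumes whose error standard deviation is proportional to a power of $D^2H$ and recovers $\alpha$ and $\beta$ by weighted least squares. The function name `fit_weighted_volume_model`, the synthetic range of $D^2H$, the power form of the weights, and every parameter value are assumptions made for illustration; none of them are taken from the study's data.

```python
import numpy as np

def fit_weighted_volume_model(d2h, volume, k):
    """Weighted least squares for V = alpha + beta * D2H + eps.

    Assumes sd(eps_i) is proportional to v_i = d2h**k, so each row is
    divided by v_i before an ordinary least-squares solve.
    """
    v = d2h ** k
    design = np.column_stack([np.ones_like(d2h), d2h]) / v[:, None]
    coef, *_ = np.linalg.lstsq(design, volume / v, rcond=None)
    return coef  # array([alpha_hat, beta_hat])

# Synthetic data; every number below is illustrative only.
rng = np.random.default_rng(0)
d2h = rng.uniform(500.0, 50_000.0, size=2_000)
alpha, beta, sigma, k = 0.01, 3.6e-5, 2.0e-6, 1.0
volume = alpha + beta * d2h + sigma * d2h ** k * rng.standard_normal(d2h.size)

alpha_hat, beta_hat = fit_weighted_volume_model(d2h, volume, k)
print(alpha_hat, beta_hat)
```

Dividing both sides by $v_i$ turns the heteroscedastic problem into an ordinary least-squares problem with constant error variance, which is why a plain `lstsq` call suffices here.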

McClure et al. (1983) tested the distribution of errors in predicting the total volume of white oak (N = 1484) and loblolly pine (N = 5134) trees. They studied the error distributions by diameter class and found the assumption of normally distributed errors not unreasonable for all but the largest diameter classes. Meng and Tsai (1986), Kelly and Beltz (1987), Gregoire and Dyer (1989), and Williams et al. (1993) used weighting functions to equalize the variance of the error distribution when total volume was to be estimated. These models assumed $\varepsilon_i = f(D_i)$ or $\varepsilon_i = f(D_i, H_i)$ and normally distributed errors. In these studies the most robust and widely used weight function has been

$$v_i = (D_i^2 H_i)^k \qquad (2)$$

with k typically in the range from 0.75 to 1.0. Williams and Gregoire (1993) examined the weight function

$$v_i = D_i^{k_1} H_i^{k_2} \qquad (3)$$

and found this model to be preferable to (2) when used in conjunction with (1).

Michael S. Williams and Hans T. Schreuder, Multiresource Inventory Techniques, Rocky Mountain Forest and Range Experiment Station, USDA Forest Service, 240 W. Prospect, Fort Collins, Colorado 80526-2098. Internet: /s=m.williams/[email protected]. Manuscript received June 22, 1995. Accepted October 27, 1995. This article was written by U.S. government employees and is therefore in the public domain.

Forest Science 42(4) 1996 419

Carroll and Ruppert (1988) discuss several aspects of weighting and transformation functions to correct for nonnormality and heteroscedasticity. One weakness of transformations is that while they may correct for nonnormality, there is no guarantee that heteroscedasticity is corrected by the same transformation. Thus, weighting functions and transformations may need to be applied simultaneously. Carroll and Ruppert (1984, 1987, 1988) discuss transformations of both sides (TBS). An example of this method is

$$V_i^{(\lambda)} = (\alpha + \beta D_i^2 H_i)^{(\lambda)} + e_i.$$

Ruppert and Aldershof (1989) point out two reasons for using TBS techniques. First, transforming to symmetry and homoscedasticity results in a more efficient estimate of the parameters. Second, by back-transforming, the TBS technique correctly models the skewness and heteroscedasticity of the original data. This is especially important when estimating conditional responses given the independent variable, such as confidence and prediction intervals. Ruppert and Aldershof (1989) used a TBS method to obtain symmetry and homoscedasticity of the errors. They consider three distinct values for the power transformation. These are $\lambda_s$, $\lambda_h$, and $\lambda_{hs}$, which are the power transformation to symmetry, homoscedasticity, and homoscedasticity and symmetry at the same time. They note that while the first two may exist there is no guarantee that $\lambda_{hs}$ exists and give hypothesis testing methods to determine if $\lambda_{hs}$ exists.

Carroll and Ruppert (1988) also discuss the importance of correctly determining distribution functions and the increased efficiency of maximum likelihood estimation over generalized least squares. They found with nearly normally distributed data that maximum likelihood estimation was up to 25% more efficient than generalized least squares, with increases of about 8% being common. However, when the distribution function was improperly specified, losses in efficiency frequently occurred. They also found that maximum likelihood estimates were more adversely affected than generalized least squares estimates when the variance function was improperly specified or when there were deviations from normality in the direction of positive skewness and kurtosis. Losses in efficiency occurred even for data that appeared to be normally distributed. Carroll and Ruppert emphasize the need to properly specify the distribution function for the errors, yet they do not give a clear prescription of how to deal with nonnormally distributed errors other than to adopt a generalized least squares approach.

Johnson (1949) gives a system of curves based on a translation of the normal distribution. By varying the parameters and transformation function of the translation, a curve of nearly any shape can be attained. The fundamental appeal of this method of translation is that a single distribution can be used to fit a large number of different applications with fewer restrictions on skewness or kurtosis than are imposed by the normal distribution. Another advantage is that a simple transformation function can be used to translate results back to a normal curve. Results based on the normal distribution can then be applied. This is important since simple, exact tests of significance are obtainable only for a very restricted range of problems (Bartlett 1947). Of the three distributions discussed by Johnson (1949), the S_U distribution is appealing because the range of errors requires no upper or lower bounds, which may be unrealistic or hard to provide in some cases, and expressions for all moments are available.

The four-parameter transformation used to translate an S_U distributed random variable to a normal one is (Johnson 1949),

$$z = \gamma + \delta \sinh^{-1}\left(\frac{x - \theta}{\lambda}\right),$$

where $z \sim N(0,1)$ and $\delta$ and $\lambda$ are positive. The parameters $\theta$ and $\lambda$ are location and shape parameters for x respectively. Using this four-parameter transformation, the probability density function is given by

$$f(x) = \frac{\delta}{\sqrt{2\pi}\,\sqrt{(x - \theta)^2 + \lambda^2}}\; e^{-\frac{1}{2}\left(\gamma + \delta \sinh^{-1}\left(\frac{x - \theta}{\lambda}\right)\right)^2}.$$

The $\gamma$ parameter determines the skewness of the distribution. For $\gamma = 0$, the distribution is symmetric about E[z] and kurtosis is solely determined by the $\sinh^{-1}$ function and $\delta$. If $\gamma > 0$, then mean < median < mode, which implies negative skewness. When $\gamma < 0$, the inequalities are reversed and positive skewness results. An example of the S_U transformation is given in Figure 1.
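The S_U translation and density described in this review can be checked numerically. The sketch below codes the four-parameter transformation and the density directly, then compares the density with SciPy's `scipy.stats.johnsonsu`, whose shape arguments `a` and `b` play the roles of $\gamma$ and $\delta$ and whose `loc` and `scale` play the roles of $\theta$ and $\lambda$. The helper names and the parameter values are assumptions chosen for illustration only.

```python
import numpy as np
from scipy import stats

def su_to_normal(x, gamma, delta, theta, lam):
    """Johnson's translation z = gamma + delta * asinh((x - theta) / lam)."""
    return gamma + delta * np.arcsinh((x - theta) / lam)

def su_pdf(x, gamma, delta, theta, lam):
    """S_U density in the four-parameter form (gamma, delta, theta, lam)."""
    z = su_to_normal(x, gamma, delta, theta, lam)
    scale = np.sqrt(2.0 * np.pi) * np.sqrt((x - theta) ** 2 + lam ** 2)
    return delta / scale * np.exp(-0.5 * z ** 2)

# Illustrative parameter values (not estimates from any data set).
gamma, delta, theta, lam = -0.5, 1.5, 0.0, 2.0
x = np.linspace(-10.0, 10.0, 201)

reference = stats.johnsonsu.pdf(x, gamma, delta, loc=theta, scale=lam)
max_abs_diff = float(np.max(np.abs(su_pdf(x, gamma, delta, theta, lam) - reference)))

# Sign of the skewness for negative and positive gamma.
skew_neg_gamma = float(stats.johnsonsu.stats(-0.5, 1.5, moments="s"))
skew_pos_gamma = float(stats.johnsonsu.stats(0.5, 1.5, moments="s"))
print(max_abs_diff, skew_neg_gamma, skew_pos_gamma)
```

The two skewness values illustrate the sign rule for $\gamma$: a negative $\gamma$ gives positive skewness and a positive $\gamma$ gives negative skewness.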


Data Description

Southeastern Loblolly Pine and White Oak Data Sets

The individual sample trees used in our study were measured in a uniform manner by highly trained field crews, and their volumes were computed using methods described by Cost (1978). Since 1963, the Forest Inventory and Analysis (FIA) division of the USDA Forest Service has measured the volumes of individual standing trees on a subsample of all regular Forest Survey sample locations in the southeastern United States (Virginia, North Carolina, South Carolina, Georgia, and Florida). A supplemental sample of felled trees has also been measured at hundreds of active logging operations distributed throughout the southeast. The data sets contain 5,134 loblolly pine trees and 1,484 white oak trees and will be referred to as SEL and WOK respectively.¹

Southern Loblolly Data Set

The southern loblolly data set consists of 14,379 loblolly pine trees measured in Alabama and will be referred to as the SOL data set.² The data set contains D, H, and V. A small number of open-growth trees were discarded because of possible morphological differences from the rest of the data. Trees in this data set have known probabilities of selection so that the actual frequencies in the population can be estimated without bias.

¹ Data were provided by Noel Cost, Project Leader, Southeastern Forest Experiment Station, Asheville, NC.

² Data were provided by Roy C. Beltz, Former Project Leader, Forest Sciences Lab, Starkville, MS.

Heavy Tailed Data Set

A fourth data set was generated using the $D_i^2 H_i$ values from the original WOK data set to see what effect heavier tailed error distributions had on the results derived from the S_U and normal distributions. The $V_i$ values were generated using

$$V_i = 5.5 \times 10^{-n} + 3.6 \times 10^{-5} D_i^2 H_i + e_i$$

where the errors were generated using

$$e_i = 2.0 \times 10^{-7} (D_i^2 H_i)^{1.25} \tan(\pi r - \pi/2)$$

with r ~ Uniform(0.04, 0.96). The regression and variance parameters were selected to be approximately the same as those in the original WOK data set. The $\tan(\pi r - \pi/2)$ term is the inverse transform for the Cauchy distribution, with location and scale parameters set to 0 and 1 respectively; however, the random number, r, was restricted to the interval [0.04, 0.96] to eliminate outliers that can occur with this distribution. This gave a data set, referred to as CHY, which was visually no different from the original data set; however, the kurtosis of the error distribution is approximately 5.72, which is substantially larger than for the other three data sets.

Although the WOK and SEL data sets are not representative of the true populations, because trees were selected with unequal probabilities and from opportunistic samples, there is no reason to expect the error in V for a given $D^2H$ to differ from that of the actual populations. What effects, if any, this could have on the distribution of errors is unknown. We ignore these effects and consider each data set as an independent finite population. More detailed descriptions of the WOK, SEL, and SOL data sets are given in Williams and Gregoire (1993).

[Figure 1. Example of the S_U distribution, showing the axis of e_SU, the scale of x, the probability density function of e_SU, and the median.]

Methods

We took a classical approach to determining what types of distributions could accurately model the errors when estimating total tree volume. Due to the heterogeneous structure of the errors, the weighting function in (2) was used with volume model (1). Maximum likelihood estimation was used to fit the distributions. Mood et al. (1974) describe the optimal asymptotic statistical properties of maximum likelihood estimation, and Gregoire and Dyer (1989), Williams et al. (1993), and Williams and Gregoire (1993) describe in detail the use of these techniques for estimating the parameters for weighting functions. A number of different tests were applied to the residuals to determine if the assumption of normally distributed errors was correct or if the S_U distribution was better for describing the error structure.

Normal Distribution

Assuming normally distributed errors, the maximum log likelihood solution minimizes

$$\min_{\theta_1} \left[ \frac{n}{2} \ln 2\pi + \frac{n}{2} \ln \sigma^2 + \frac{1}{2} \sum_{i=1}^{n} \ln (D_i^2 H_i)^{2k} + \frac{1}{2\sigma^2} \sum_{i=1}^{n} \frac{(V_i - \alpha - \beta D_i^2 H_i)^2}{(D_i^2 H_i)^{2k}} \right] \qquad (4)$$

for the volume model and weighting function given by (1) and (2) respectively. The complete parameter vector is $\theta_1 = (\alpha, \beta, \sigma^2, k)$. The maximum likelihood estimate (MLE) of the parameter vector $\theta_1$ minimizes Equation (4). The minimum of (4) with respect to $\theta_1$ was found using IMSL³ routine DBCOAH. This routine uses a modified Newton's method with an active set strategy to find the minimum of (4) with respect to $\theta_1$ subject to simple bounds. The constraint $\sigma^2 > 0$ was used.
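The minimization of Equation (4) can be sketched in Python as follows. This is not the IMSL routine named above: as a stand-in, the sketch profiles the likelihood, using the fact that for a fixed k the minimizing $\alpha$ and $\beta$ are the weighted least-squares solution and $\sigma^2$ has a closed form, and then searches over k with SciPy's bounded scalar minimizer. The function names, the bounds on k, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_lik(alpha, beta, sigma2, k, d2h, volume):
    """Objective (4): normal negative log-likelihood, variance sigma2 * d2h**(2k)."""
    n = d2h.size
    resid = volume - alpha - beta * d2h
    return (0.5 * n * np.log(2.0 * np.pi)
            + 0.5 * n * np.log(sigma2)
            + k * np.sum(np.log(d2h))  # equals 0.5 * sum(log(d2h**(2k)))
            + 0.5 * np.sum(resid ** 2 / d2h ** (2.0 * k)) / sigma2)

def conditional_mle(k, d2h, volume):
    """For fixed k: alpha, beta by weighted least squares, sigma2 in closed form."""
    v = d2h ** k
    design = np.column_stack([np.ones_like(d2h), d2h]) / v[:, None]
    (alpha, beta), *_ = np.linalg.lstsq(design, volume / v, rcond=None)
    sigma2 = np.mean(((volume - alpha - beta * d2h) / v) ** 2)
    return alpha, beta, sigma2

def fit_normal_mle(d2h, volume, k_bounds=(0.0, 3.0)):
    """Minimize (4) by a bounded one-dimensional search over k (assumed bounds)."""
    def profile(k):
        return neg_log_lik(*conditional_mle(k, d2h, volume), k, d2h, volume)
    k_hat = minimize_scalar(profile, bounds=k_bounds, method="bounded").x
    alpha, beta, sigma2 = conditional_mle(k_hat, d2h, volume)
    return alpha, beta, sigma2, k_hat

# Synthetic data; all values are illustrative, not taken from the study.
rng = np.random.default_rng(1)
d2h = rng.uniform(500.0, 50_000.0, size=3_000)
true_alpha, true_beta, true_sigma2, true_k = 0.01, 3.6e-5, (2.0e-6) ** 2, 1.0
volume = (true_alpha + true_beta * d2h
          + np.sqrt(true_sigma2) * d2h ** true_k * rng.standard_normal(d2h.size))

alpha_hat, beta_hat, sigma2_hat, k_hat = fit_normal_mle(d2h, volume)
print(alpha_hat, beta_hat, sigma2_hat, k_hat)
```

Profiling keeps the search one-dimensional, which sidesteps the very different scales of $\beta$, $\sigma^2$, and k that a joint bounded optimizer would otherwise have to handle.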

S_U Distribution

The proposed formulation of the S_U distribution gives a simultaneous transformation and weighting of the residuals to achieve normally distributed errors. For Equation (1), the variability of the residuals is equalized by dividing $\varepsilon_i$ by $\sqrt{(\sigma (D_i^2 H_i)^k)^2}$, where $\varepsilon_i = V_i - \alpha - \beta D_i^2 H_i$. Then the transformed residuals are distributed with equal variance. When estimating volume, we wish to find transformations of the population residuals, $e'_i = f(e_i)$, such that the $e'_i \sim N(0,1)$. Using the four-parameter transformation of Johnson (1949) we have

$$e'_i = \gamma + \delta \sinh^{-1}\left(\frac{e_i - \theta}{\lambda_i}\right)$$

where $\sinh^{-1}$ is a nondecreasing function of $(e_i - \theta)/\lambda_i$ and $\delta$ and $\lambda_i$ are positive. The parameters $\theta$ and $\lambda_i$ are location and shape parameters respectively. Since $\alpha$ in volume equation (1) is a location parameter, $\theta$ is set to zero to avoid overparameterization. To achieve equal variance of the errors, we chose $\lambda_i = \sigma (D_i^2 H_i)^k$, because $E[e_i e_j] = (\sigma (D_i^2 H_i)^k)^2$ for $i = j$.

When estimating total volume, there is some morphological justification to expect the skewness parameter, $\gamma$, to be different from 0. Intuitively it makes sense to predict volume using $\beta D^2 H$, since $D^2 H$ is proportional to the volume of a cylinder of diameter D and length H, and $\beta$ is a combination of dimensional conversions (inches to feet or centimeters to meters) and a linear taper coefficient. However, it is well known that the solid shape of tree boles changes with time, albeit slowly (Larson 1963). These changes could cause skewness in the error distribution of Equation (1). For this reason we do not constrain $\gamma$ and deal with possible skewness in the error distribution as required. This formulation gives

$$e_i \sim S_U(0, \sigma (D_i^2 H_i)^k, \gamma, \delta).$$

The likelihood function is given by

$$L = \prod_{i=1}^{n} \frac{\delta}{\sqrt{2\pi}\,\sqrt{\varepsilon_i^2 + \lambda_i^2}}\; e^{-\frac{1}{2}\left(\gamma + \delta \sinh^{-1}\left(\varepsilon_i / \lambda_i\right)\right)^2} \qquad (5)$$

with $\varepsilon_i = V_i - \alpha - \beta D_i^2 H_i$, $\lambda_i = \sigma (D_i^2 H_i)^k$, and parameter vector $\theta_2 = (\alpha, \beta, \sigma, k, \gamma, \delta)$. The maximum likelihood estimate, $\hat{\theta}_2$, minimizes $-\ln L$ in Equation (5) and was found using IMSL routine DBCOAG. This routine is similar to the one used to find solutions for the normal distribution, but second derivatives are not required.

³ International Mathematical Statistical Laboratory, 2500 ParkWest Tower One, 2500 City West Boulevard, Houston, TX 77402.

Goodness of Fit Tests and Distribution Comparisons

Tests of the residuals were performed to determine whether the normal or S_U distributions were appropriate distribution functions. The residuals for the normal distribution were

$$e_{Normal} = \frac{V_i - \hat{\alpha} - \hat{\beta} D_i^2 H_i}{\sqrt{\hat{\sigma}^2 (D_i^2 H_i)^{2\hat{k}}}} = \frac{V_i - \hat{\alpha} - \hat{\beta} D_i^2 H_i}{\hat{\sigma} (D_i^2 H_i)^{\hat{k}}}. \qquad (6)$$

The residuals for the S_U distribution were

$$e_{S_U} = \hat{\gamma} + \hat{\delta} \sinh^{-1}\left(\frac{V_i - \hat{\alpha} - \hat{\beta} D_i^2 H_i}{\hat{\sigma} (D_i^2 H_i)^{\hat{k}}}\right) = \hat{\gamma} + \hat{\delta} \sinh^{-1}\left(\frac{\hat{\varepsilon}_i}{\hat{\lambda}_i}\right). \qquad (7)$$

Thus, if the residuals are truly normally or S_U distributed, the transformed residuals given by Equations (6) and (7) will be distributed as unit normals.

A QQ plot (Cleveland 1993) was used as a visual comparison of the errors $e_{S_U}$ and $e_{Normal}$. This method compares the sample quantiles of the residuals against the quantiles of a N(0,1) random variable. Any systematic deviation from a one-to-one relationship indicates that residuals are not distributed normally.

Several tests of normality were made on the residuals. The simplest test compared the first four moments ($\mu$, $\sigma^2$, $\mu_3$, $\mu_4$) of the transformed residuals. If the residuals were normally distributed then the first four moments would be approximately $\mu = 0$, $\sigma^2 = 1$, $\mu_3 = 0$ and $\mu_4 = 3$. The normality test developed by D'Agostino et al. (1990), with an empirical correction of Royston (1991), was used on the residuals. This test calculates the probabilities of the skewness (Pr($\mu_3$)) and kurtosis (Pr($\mu_4$)) being consistent with that of a normal distribution and uses a joint probability (Pr($\chi^2$)), based on a $\chi^2$ test, to determine whether or not to reject the null hypothesis. With this test the source of the problem can be determined if normality is rejected, i.e., too much or too little skewness or kurtosis. The Kolmogorov-Smirnov test (Conover 1980) was also performed. This test is a cumulative distribution test for which we report two-tailed p-values (K-S p-value). The Shapiro-Francia (1972) and Shapiro-Wilk (1965) tests are excellent for checking normality, but are limited to a maximum of 2,000 and 5,000 observations respectively due to the approximation used in their construction. Since two of the four data sets exceed the maximum sample size permissible, these tests were not used.

Once the appropriateness of the distributions had been determined, we evaluated how improved knowledge of the distribution might affect estimation of total volume. We evaluated this by a simulation process, drawing 50,000 samples of size n = 40 from each of the four data sets using simple random sampling. The Horvitz-Thompson-type regression estimator is then used to estimate the population volume using the two regression equations. This gave

[Estimator equation missing from the extracted text.]

where $\pi_i$ is the probability of selection of a tree for the sampling method chosen and $E[V_i \mid D_i^2 H_i]$ is the expected tree volume given the sample $D_i^2 H_i$ value using the normal and S_U distribution results. For both distributions

$$E[V_i \mid D_i^2 H_i] = E[\alpha + \beta D_i^2 H_i] + E[e_i].$$

Since $E[e_i] = 0$ for the normal distribution

$$E[V_i \mid D_i^2 H_i] = \hat{\alpha} + \hat{\beta} D_i^2 H_i \qquad (8)$$

where $\hat{\alpha}$ and $\hat{\beta}$ are the MLE estimates. Since the S_U distribution was fitted with the skewness parameter $\gamma \ne 0$, the expectation of the errors is not zero, i.e., $E[e_i] \ne 0$. The expectation of the error term for the S_U distribution, as given

using both visual and analytical methods. For the visual comparison, the 95% prediction intervals were plotted to give an idea of the magnitude of difference in the prediction interval estimates. For the analytical comparison, the theoretical and achieved coverage rates of the $(1 - \eta)$% prediction intervals were compared, where the achieved coverage rate is the percentage of data points falling within the $(1 - \eta)$% prediction interval. The $(1 - \eta)$% levels chosen for comparison were 50, 80, 85, 90, 95, 97.5, and 99%. The $(1 - \eta)$% prediction intervals for the S_U distribution were derived using the pivotal quantity method (Mood et al. 1974). This gives

P[ot +•Di2 H• +{s(Di2 Hi)k sinh(.-l'V••)