Prediction of Gross Tree Volume Using Regression Models with Non-Normal Error Distributions
Michael S. Williams and Hans T. Schreuder
ABSTRACT. Previous work in weighted linear regression, where weight functions are used to obtain homogeneous variance on a transformed scale, has often assumed that the errors are normally distributed. In a study of four data sets, three of which were actual data sets with unknown error distributions and one an artificial set with a known error distribution, this assumption is incorrect. Consequently, we tested a transformation of the normal distribution, called the Su distribution, and compared it with the normal as an alternative. For three of the four data sets studied, the Su distribution was superior. Prediction intervals and biases for the regression estimators generated using the Su and normal distributions were also evaluated. Results for the Su distribution bettered those for the normal distribution in three of the four data sets. For the remaining data set, they were comparable. For. Sci. 42(4):419-430.
Additional Key Words: Bias, prediction intervals, Su distribution.
LINEAR REGRESSION IS WIDELY USED IN FORESTRY to estimate total tree volume (V) as a function of tree diameter squared ($D^2$, taken at 1.37 m or 4.5 ft above the ground) times total tree height (H), or of $D^2$ alone. A commonly used model is

$$V_i = \alpha + \beta D_i^2 H_i + \varepsilon_i \qquad (1)$$

with $\varepsilon_i \sim N[0, (\sigma v_i)^2]$, $E[\varepsilon_i \varepsilon_j] = (\sigma v_i)^2$ for $i = j$, and $E[\varepsilon_i \varepsilon_j] = 0$ for $i \ne j$. Much work has already been performed on estimating the regression coefficients and weighting functions used to equalize the variance for Equation (1). This work has often assumed normally distributed errors, an assumption that has rarely been verified. If the assumption is incorrect, undesirable effects could occur, the most likely being less than optimal estimates of the tolerance and prediction intervals.

The aim of this study was to test the normality assumption and, if it was incorrect, to determine an alternative distribution to the normal for describing the distribution of the $\varepsilon_i$ in Equation (1). The performance of the estimators derived from these two distributions was compared.

Literature Review
McClure et al. (1983) tested the distribution of errors in predicting the total volume of white oak (N = 1484) and loblolly pine (N = 5134) trees. They studied the error distributions by diameter class and found the assumption of normally distributed errors not unreasonable for all but the largest diameter classes. Meng and Tsai (1986), Kelly and Beltz (1987), Gregoire and Dyer (1989), and Williams et al. (1993) used weighting functions to equalize the variance of the error distribution when total volume was to be estimated. These models assumed $\varepsilon_i = f(D_i)$ or $\varepsilon_i = f(D_i, H_i)$ and normally distributed errors. In these studies the most robust and widely used weight function has been

$$v_i = (D_i^2 H_i)^k \qquad (2)$$

with k typically in the range from 0.75 to 1.0. Williams and Gregoire (1993) examined the weight function

$$v_i = (D_i^{k_1} H_i^{k_2}) \qquad (3)$$
Michael S. Williams and Hans T. Schreuder, Multiresource Inventory Techniques, Rocky Mountain Forest and Range Experiment Station, USDA Forest Service, 240 W. Prospect, Fort Collins, Colorado 80526-2098. Internet: /s=m.williams/[email protected]. Manuscript received June 22, 1995. Accepted October 27, 1995. This article was written by U.S. government employees and is therefore in the public domain.
Forest Science 42(4) 1996 419
and found this model to be preferable to (2) when used in conjunction with (1).

Carroll and Ruppert (1988) discuss several aspects of weighting and transformation functions to correct for nonnormality and heteroscedasticity. One weakness of transformations is that while they may correct for nonnormality, there is no guarantee that heteroscedasticity is corrected by the same transformation. Thus, weighting functions and transformations may need to be applied simultaneously. Carroll and Ruppert (1984, 1987, 1988) discuss transformations of both sides (TBS). An example of this method is
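$$h(V_i, \lambda) = h(\alpha + \beta D_i^2 H_i, \lambda) + e_i,$$

where $h(\cdot, \lambda)$ denotes a transformation, such as a power transformation with parameter $\lambda$, applied to both sides of the model.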
Ruppert and Aldershof (1989) point out two reasons for using TBS techniques. First, transforming to symmetry and homoscedasticity results in a more efficient estimate of the parameters. Second, by back-transforming, the TBS technique correctly models the skewness and heteroscedasticity of the original data. This is especially important when estimating conditional responses given the independent variable, such as confidence and prediction intervals. Ruppert and Aldershof (1989) used a TBS method to obtain symmetry and homoscedasticity of the errors. They consider three distinct values for the power transformation. These are $\lambda_s$, $\lambda_h$, and $\lambda_{hs}$, which are the power transformations to symmetry, to homoscedasticity, and to homoscedasticity and symmetry at the same time. They note that while the first two may exist there is no guarantee that $\lambda_{hs}$ exists and give hypothesis testing methods to determine if $\lambda_{hs}$ exists.

Carroll and Ruppert (1988) also discuss the importance of
correctly determining distribution functions and the increased efficiency of maximum likelihood estimation over generalized least squares. They found with nearly normally distributed data that maximum likelihood estimation was up to 25% more efficient than generalized least squares, with increases of about 8% being common. However, when the distribution function was improperly specified, losses in efficiency frequently occurred. They also found that maximum likelihood estimates were more adversely affected than generalized least squares estimates when the variance function was improperly specified or when there were deviations from normality in the direction of positive skewness and kurtosis. Losses in efficiency occurred even for data that appeared to be normally distributed. Carroll and Ruppert emphasize the need to properly specify the distribution function for the errors, yet they do not give a clear prescription of how to deal with nonnormally distributed errors other than to adopt a generalized least squares approach.

Johnson (1949) gives a system of curves based on a translation of the normal distribution. By varying the parameters and transformation function of the translation, a curve of nearly any shape can be attained. The fundamental appeal of this method of translation is that a single distribution can be used to fit a large number of different applications with fewer restrictions on skewness or kurtosis than are imposed by the normal distribution. Another advantage is that a simple transformation function can be used to translate results back to a normal curve. Results based on the normal distribution can then be applied. This is important since simple, exact tests of significance are obtainable only for a very restricted range of problems (Bartlett 1947). Of the three distributions discussed by Johnson (1949), the Su distribution is appealing because the range of errors requires no upper or lower bounds, which may be unrealistic or hard to provide in some cases, and expressions for all moments are available.

The four-parameter transformation used to translate an Su distributed random variable to a normal one is (Johnson 1949),

$$z = \gamma + \delta \sinh^{-1}\left(\frac{x - \theta}{\lambda}\right),$$

where $z \sim N(0,1)$ and $\delta$ and $\lambda$ are positive. The parameters $\theta$ and $\lambda$ are location and shape parameters for x respectively. Using this four-parameter transformation, the probability density function is given by

$$f(x) = \frac{\delta}{\sqrt{2\pi\left((x - \theta)^2 + \lambda^2\right)}} \exp\left\{-\frac{1}{2}\left[\gamma + \delta \sinh^{-1}\left(\frac{x - \theta}{\lambda}\right)\right]^2\right\}.$$

The $\gamma$ parameter determines the skewness of the distribution. For $\gamma = 0$, the distribution is symmetric about E[z] and kurtosis is solely determined by the $\sinh^{-1}$ function and $\delta$. If $\gamma > 0$, then mean < median < mode, which implies negative skewness. When $\gamma < 0$, the inequalities are reversed and positive skewness results. An example of the Su transformation is given in Figure 1.
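The Su transformation and density described above can be illustrated with a short Python sketch. It assumes the parameterization ($\gamma$, $\delta$, $\theta$, $\lambda$) used in the text; the function names are hypothetical and the parameter values are arbitrary. The closed-form mean is the standard expression for Johnson's Su distribution and is included only to check the ordering mean < median when $\gamma > 0$.

```python
import math


def su_to_normal(x, gamma, delta, theta, lam):
    """Translate an Su variate x to the normal scale: z = gamma + delta*asinh((x-theta)/lam)."""
    return gamma + delta * math.asinh((x - theta) / lam)


def normal_to_su(z, gamma, delta, theta, lam):
    """Inverse translation: recover x from a standard normal value z."""
    return theta + lam * math.sinh((z - gamma) / delta)


def su_pdf(x, gamma, delta, theta, lam):
    """Su density written in the same form as the equation in the text."""
    z = su_to_normal(x, gamma, delta, theta, lam)
    scale = math.sqrt(2.0 * math.pi * ((x - theta) ** 2 + lam ** 2))
    return delta / scale * math.exp(-0.5 * z * z)


def su_median(gamma, delta, theta, lam):
    """The median of x is the image of the normal median z = 0."""
    return normal_to_su(0.0, gamma, delta, theta, lam)


def su_mean(gamma, delta, theta, lam):
    """Standard closed-form mean of Johnson's Su distribution."""
    return theta - lam * math.exp(0.5 / delta ** 2) * math.sinh(gamma / delta)


if __name__ == "__main__":
    # Arbitrary illustrative parameters with gamma > 0 (negative skewness).
    g, d, t, l = 0.5, 1.5, 0.0, 2.0
    print("mean  :", su_mean(g, d, t, l))
    print("median:", su_median(g, d, t, l))
    print("pdf(0):", su_pdf(0.0, g, d, t, l))
```

Because the translation is monotone, quantiles of x follow directly from normal quantiles through `normal_to_su`, which is the property that prediction intervals on the original scale rely on.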
Data Description

Southeastern Loblolly Pine and White Oak Data Sets

The individual sample trees used in our study were measured in a uniform manner by highly trained field crews, and their volumes were computed using methods described by Cost (1978). Since 1963, the Forest Inventory and Analysis (FIA) division of the USDA Forest Service has measured the volumes of individual standing trees on a subsample of all regular Forest Survey sample locations in the southeastern United States (Virginia, North Carolina, South Carolina, Georgia, and Florida). A supplemental sample of felled trees has also been measured at hundreds of active logging operations distributed throughout the southeast. The data sets contain 5,134 loblolly pine trees and 1,484 white oak trees and will be referred to as SEL and WOK respectively.¹

Southern Loblolly Data Set

The southern loblolly data set consists of 14,379 loblolly pine trees measured in Alabama and will be referred to as the SOL data set.² The data set contains D, H, and V. A small number of open growth trees were discarded because of possible morphological differences from the rest of the data. Trees in this data set have known probabilities of selection so that the actual frequencies in the population can be estimated without bias.

¹ Data were provided by Noel Cost, Project Leader, Southeastern Forest Experiment Station, Asheville, NC.
² Data were provided by Roy C. Beltz, Former Project Leader, Forest Sciences Lab, Starkville, MS.

Heavy Tailed Data Set

A fourth data set was generated using the $D_i^2 H_i$ values from the original WOK data set to see what effect heavier tailed error distributions had on the results derived from the Su and normal distributions. The $V_i$ values were generated using

$$V_i = 5.5 \times 10^{-n} + 3.6 \times 10^{-5} D_i^2 H_i + e_i,$$

where the errors were generated using

$$e_i = 2.0 \times 10^{-7} (D_i^2 H_i)^{1.25} \tan(\pi r - \pi/2)$$

with r ~ Uniform(0.04, 0.96). The regression and variance parameters were selected to be approximately the same as those in the original WOK data set. The $\tan(\pi r - \pi/2)$ term is the inverse transform for the Cauchy distribution, with location and scale parameters set to 0 and 1 respectively; however, the random number, r, was restricted to the interval [0.04, 0.96] to eliminate outliers that can occur with this distribution. This gave a data set, referred to as CHY, which was visually no different from the original data set; however, the kurtosis of the error distribution is approximately 5.72, which is substantially larger than for the other three data sets.

Although the WOK and SEL data sets are not representative of the true populations, because trees were selected with unequal probabilities and from opportunistic samples, there is no reason to expect the error in V for a given $D^2 H$ to differ from that of the actual populations. What effects, if any, this could have on the distribution of errors is unknown. We ignore these effects and consider each data set as an independent finite population. More detailed descriptions of the WOK, SEL, and SOL data sets are given in Williams and Gregoire (1993).

Figure 1. Example of the Su distribution. (Labels recoverable from the figure: axis of e_Su; scale of x; probability density function of e_Su; median.)

Methods

We took a classical approach to determining what types of distributions could accurately model the errors when estimating total tree volume. Due to the heterogeneous structure of the errors, the weighting function in (2) was used with volume model (1). Maximum likelihood estimation was used to fit the distributions. Mood et al. (1974) describe the optimal asymptotic statistical properties of maximum likelihood estimation and Gregoire and Dyer (1989), Williams et al. (1993), and Williams and Gregoire (1993) describe in detail the use of these techniques for estimating the parameters for weighting functions. A number of different tests were applied to the residuals to determine if the assumption of normally distributed errors was correct or if the Su distribution was better for describing the error structure.

Normal Distribution

Assuming normally distributed errors, the maximum log likelihood solution minimizes

$$\min_{\theta_1} \left\{ \frac{n}{2}\ln 2\pi + \frac{n}{2}\ln \sigma^2 + \frac{1}{2}\sum_{i=1}^{n} \ln (D_i^2 H_i)^{2k} + \frac{1}{2\sigma^2}\sum_{i=1}^{n} \frac{(V_i - \alpha - \beta D_i^2 H_i)^2}{(D_i^2 H_i)^{2k}} \right\} \qquad (4)$$

for the volume model and weighting function given by (1) and (2) respectively. The complete parameter vector is $\theta_1 = (\alpha, \beta, \sigma^2, k)$. The maximum likelihood estimate (MLE) of the parameter vector $\theta_1$ minimizes Equation (4). The minimum of (4) with respect to $\theta_1$ was found using IMSL³ routine DBCOAH. This routine uses a modified Newton's method with an active set strategy to find the minimum of (4) with respect to $\theta_1$ subject to simple bounds. The constraint $\sigma^2 > 0$ was used.
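As an illustration of how the objective in Equation (4) behaves, the following Python sketch evaluates it and minimizes it on synthetic data. It does not reproduce the IMSL routine DBCOAH: instead of a bounded modified Newton method, it substitutes a profile-likelihood grid search over k, using the fact that for fixed k the minimizing $\alpha$, $\beta$, and $\sigma^2$ have a closed weighted least squares form. The function names, the simulated $D_i^2 H_i$ range, and all parameter values are assumptions chosen for the example and are not taken from the study data.

```python
import math
import random


def neg_log_lik(alpha, beta, sigma2, k, x, v):
    """Objective of Equation (4); x holds D_i^2 H_i values and v the volumes V_i."""
    n = len(x)
    total = 0.5 * n * math.log(2.0 * math.pi) + 0.5 * n * math.log(sigma2)
    for xi, vi in zip(x, v):
        log_w = 2.0 * k * math.log(xi)      # ln (D_i^2 H_i)^(2k)
        resid = vi - alpha - beta * xi
        total += 0.5 * log_w + resid * resid / (2.0 * sigma2 * math.exp(log_w))
    return total


def profile_fit(k, x, v):
    """For a fixed k, return the closed-form minimizers (alpha, beta, sigma2)."""
    w = [xi ** (-2.0 * k) for xi in x]      # inverse-variance weights
    sw = sum(w)
    sx = sum(wi * xi for wi, xi in zip(w, x))
    sv = sum(wi * vi for wi, vi in zip(w, v))
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    sxv = sum(wi * xi * vi for wi, xi, vi in zip(w, x, v))
    beta = (sw * sxv - sx * sv) / (sw * sxx - sx * sx)
    alpha = (sv - beta * sx) / sw
    sse = sum(wi * (vi - alpha - beta * xi) ** 2 for wi, xi, vi in zip(w, x, v))
    return alpha, beta, sse / len(x)


def fit_normal(x, v, k_grid):
    """Grid search over k; returns (objective, alpha, beta, sigma2, k) at the best k."""
    best = None
    for k in k_grid:
        alpha, beta, sigma2 = profile_fit(k, x, v)
        value = neg_log_lik(alpha, beta, sigma2, k, x, v)
        if best is None or value < best[0]:
            best = (value, alpha, beta, sigma2, k)
    return best


def simulate(n, alpha, beta, sigma, k, seed):
    """Synthetic data with normal errors whose standard deviation is sigma * x^k."""
    rng = random.Random(seed)
    x = [rng.uniform(500.0, 50000.0) for _ in range(n)]   # hypothetical D^2 H range
    v = [alpha + beta * xi + rng.gauss(0.0, sigma * xi ** k) for xi in x]
    return x, v


if __name__ == "__main__":
    x, v = simulate(2000, 0.01, 3.6e-5, 1e-5, 0.9, seed=1)
    k_grid = [0.5 + 0.05 * i for i in range(21)]
    print(fit_normal(x, v, k_grid))
```

The grid search trades precision in k for simplicity; a bounded quasi-Newton optimizer over all four parameters would be closer in spirit to the routine described in the text.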
Su Distribution

The proposed formulation of the Su distribution gives a simultaneous transformation and weighting of the residuals to achieve normally distributed errors. For Equation (1), the variability of the residuals is equalized by dividing $\varepsilon_i$ by $\sqrt{(\sigma (D_i^2 H_i)^k)^2}$, where $\varepsilon_i = V_i - \alpha - \beta D_i^2 H_i$. Then the transformed residuals are distributed with equal variance. When estimating volume, we wish to find transformations of the population residuals, $e'_i = f(e_i)$, such that the $e'_i \sim N(0,1)$. Using the four parameter transformation of Johnson (1949) we have

$$e'_i = \gamma + \delta \sinh^{-1}\left(\frac{e_i - \theta}{\lambda_i}\right),$$

where $\sinh^{-1}$ is a nondecreasing function of $(e_i - \theta)/\lambda_i$ and $\delta$ and $\lambda_i$ are positive. The parameters $\theta$ and $\lambda_i$ are location and shape parameters respectively. Since $\alpha$ in volume equation (1) is a location parameter, $\theta$ is set to zero to avoid overparameterization. To achieve equal variance of the errors, we chose $\lambda_i = \sigma (D_i^2 H_i)^k$, because $E[e_i^2] = (\sigma (D_i^2 H_i)^k)^2$ for $i = j$.

When estimating total volume, there is some morphological justification to expect the skewness parameter, $\gamma$, to be different from 0. Intuitively it makes sense to predict volume using $\beta D^2 H$ since $D^2 H$ is proportional to the volume of a cylinder of diameter D and length H and $\beta$ is a combination of dimensional conversions (inches to feet or centimeters to meters) and a linear taper coefficient. However, it is well known that the solid shape of tree boles changes with time, albeit slowly (Larson 1963). These changes could cause skewness in the error distribution of Equation (1). For this reason we do not constrain $\gamma$ and deal with possible skewness in the error distribution as required. This formulation gives

$$e_i \sim S_U(0, \sigma (D_i^2 H_i)^k, \gamma, \delta).$$

The likelihood function is given by

$$L = \prod_{i=1}^{n} \frac{\delta}{\sqrt{2\pi\left(e_i^2 + \lambda_i^2\right)}} \exp\left\{-\frac{1}{2}\left[\gamma + \delta \sinh^{-1}\left(\frac{e_i}{\lambda_i}\right)\right]^2\right\}, \quad e_i = V_i - \alpha - \beta D_i^2 H_i, \quad \lambda_i = \sigma (D_i^2 H_i)^k, \qquad (5)$$

with parameter vector $\theta_2 = (\alpha, \beta, \sigma, k, \gamma, \delta)$. The maximum likelihood estimate, $\hat\theta_2$, minimizes $-\ln L$ in Equation (5) and was found using IMSL routine DBCOAG. This routine is similar to the one used to find solutions for the normal distribution, but second derivatives are not required.

³ International Mathematical Statistical Laboratory, 2500 ParkWest Tower One, 2500 City West Boulevard, Houston, TX 77402.

Goodness of Fit Tests and Distribution Comparisons

Tests of the residuals were performed to determine whether the normal or Su distributions were appropriate distribution functions. The residuals for the normal distribution were

$$e_{Normal} = \frac{V_i - \hat\alpha - \hat\beta D_i^2 H_i}{\sqrt{\hat\sigma^2 (D_i^2 H_i)^{2\hat k}}} = \frac{\hat e_i}{\hat\sigma (D_i^2 H_i)^{\hat k}}. \qquad (6)$$

The residuals for the Su distribution were

$$e_{Su} = \hat\gamma + \hat\delta \sinh^{-1}\left(\frac{V_i - \hat\alpha - \hat\beta D_i^2 H_i}{\hat\sigma (D_i^2 H_i)^{\hat k}}\right) = \hat\gamma + \hat\delta \sinh^{-1}\left(\frac{\hat e_i}{\hat\lambda_i}\right). \qquad (7)$$

Thus, if the residuals are truly normally or Su distributed, the transformed residuals given by Equations (6) and (7) will be distributed as unit normals.

A QQ plot (Cleveland 1993) was used as a visual comparison of the errors $e_{Su}$ and $e_{Normal}$. This method compares the sample quantiles of the residuals against the quantiles of a N(0,1) random variable. Any systematic deviation from a one-to-one relationship indicates that residuals are not distributed normally.

Several tests of normality were made on the residuals. The simplest test compared the first four moments ($\mu$, $\sigma^2$, $\mu_3$, $\mu_4$) of the transformed residuals. If the residuals were normally distributed then the first four moments would be approximately $\mu = 0$, $\sigma^2 = 1$, $\mu_3 = 0$ and $\mu_4 = 3$. The normality test developed by D'Agostino et al. (1990), with an empirical correction of Royston (1991), was used on the residuals. This test calculates the probabilities of the skewness (Pr($\mu_3$)) and kurtosis (Pr($\mu_4$)) being consistent with that of a normal distribution and uses a joint probability (Pr($\chi^2$)), based on a $\chi^2$ test, to determine whether or not to reject the null hypothesis. With this test the source of the problem can be determined if normality is rejected, i.e., too much or too little skewness or kurtosis. The Kolmogorov-Smirnov test (Conover 1980) was also performed. This test is a cumulative distribution test for which we report two-tailed p-values (K-S p-value). The Shapiro-Francia (1972) and Shapiro-Wilk (1965) tests are excellent tests for checking normality, but are limited to a maximum of 2,000 and 5,000 observations respectively due to the approximation used in their construction. Since two of the four data sets exceed the maximum sample size permissible, these tests were not used.

Once the appropriateness of the distributions had been determined, we evaluated how improved knowledge of the distribution might affect estimation of total volume. We evaluated this by a simulation process, drawing 50,000 samples of size n = 40 from each of the four data sets using simple random sampling. The Horvitz-Thompson-type regression estimator is then used to estimate the population volume using the two regression equations. This gave

[Equation not recovered.]

where $\pi_i$ is the probability of selection of a tree for the sampling method chosen and $E[V_i \mid D_i^2 H_i]$ is the expected tree volume given the sample $D_i^2 H_i$ value using the normal and Su distributions results. For both distributions

$$E[V_i \mid D_i^2 H_i] = E[\alpha + \beta D_i^2 H_i] + E[e_i].$$

Since $E[e_i] = 0$ for the normal distribution

$$E[V_i \mid D_i^2 H_i] = \hat\alpha + \hat\beta D_i^2 H_i, \qquad (8)$$

where $\hat\alpha$ and $\hat\beta$ are the MLE estimates. Since the Su distribution was fitted with the skewness parameter $\gamma \ne 0$, the expectation of the errors is not zero, i.e., $E[e_i] \ne 0$. The expectation of the error term for the Su distribution, as given

using both visual and analytical methods. For the visual comparison, the 95% prediction intervals were plotted to give an idea of the magnitude of difference in the prediction interval estimates. For the analytical comparison, the theoretical and achieved coverage rates of the $(1 - \eta)$% prediction intervals were compared, where the achieved coverage rate is the percentage of data points falling within the $(1 - \eta)$% prediction interval. The $(1 - \eta)$% levels chosen for comparison were 50, 80, 85, 90, 95, 97.5, and 99%. The $(1 - \eta)$% prediction intervals for the Su distribution were derived using the pivotal quantity method (Mood et al. 1974). This gives
P[ot +•Di2 H• +{s(Di2 Hi)k sinh(.-l'V••)