Some Nonparametric Statistical Tests for Quick Evaluation of Clinical Data

E. Melvin Gindler

Some rapid statistical tests give (a) rapid answers on how well methods agree and on control chart evaluation (sign and run tests) and (b) evaluation of the distribution of test results (Tukey’s quick test and run test). These tests mainly require counting of data and the use of the given nomograms. An unusual distribution of patient test values (that is, unusual when compared with the generally observed distribution of the data seen in a particular laboratory) may indicate laboratory error, alteration of specimens (as from poor collection and/or storage techniques, such as evaporation), or an unusual patient population.

Rockford School of Medicine, University of Illinois, Rockford, Ill. 61101; Pierce Chemical Co.,¹ Box 117, Rockford, Ill. 61105; and Associated Medical Laboratories, Park Ridge, Ill. 60068.
¹ To which address correspondence should be directed.
Received Dec. 2, 1974; accepted Jan. 2, 1975.
CLINICAL CHEMISTRY, Vol. 21, No. 3, 1975

Statistical tests are commonly used to evaluate data when one substance, such as glucose in blood serum, is determined by two different procedures. The purpose of these tests is to determine if the two procedures give significantly different results. If the statistical tests indicate that there is no significant difference in results found by two procedures, one of which may be a reliable reference procedure and the other a less laborious new procedure, then the new procedure may be substituted for the reference procedure. The statistical tests are most often performed on the differences of the results found by each procedure for each sample.

When two procedures are shown to give significantly different results for the same analyte, it is best to consider the magnitude of the mean difference, with use of a large number of paired data. If this difference does not exceed a few percent of the span of the normal range, it is often possible to use the new procedure without harm, provided that it can be conclusively demonstrated that the difference is essentially constant and a new normal range, characteristic of the new procedure, is established.

When two procedures are to be compared, it is most important that a wide range of specimen values be used, both normal and pathological. If possible, the pathological values should be both above and below the normal range. Much can be learned by comparing subnormally valued specimens alone, normal range valued specimens alone, and above-normal range valued specimens alone. In this way one set of data does not swamp out or compensate for another.

When considering the importance of differences between two procedures it is necessary to take into account the nearness or overlapping of normal and abnormal ranges, such as may be seen in some diseases. In general, greater differences between two procedures may be tolerated if the normal and pathological ranges are far apart. An extreme, but common, example of this is found in those test results in microbiology and immunology that are reported simply as positive or negative. In those situations there are relatively huge differences between normal and pathological concentrations of organisms (generally zero in the normal situation) or of an antibody.

Rigorous statistical treatments of clinically significant differences between methods are still not fully developed, but it is necessary that the relatively primitive mathematical tools now available be used to determine if differences are clinically significant. Any complete consideration of clinical significance must include knowledge and consideration of both normal and pathological ranges, with the possibility of there being more than one pathological range.

There are two general kinds of statistical test: parametric, such as Student’s t-test (1), and nonparametric, such as those described here. The parametric tests generally assume that the numbers under evaluation have a distribution not too different from the gaussian distribution (also called the “normal” distribution). The nonparametric tests make no assumption about the nature of the distribution. The present discussion is limited to those nonparametric procedures that can be carried out in the laboratory within a few minutes with no equipment other than the nomograms given here and an inexpensive slide rule or battery-powered electronic calculator. Other nonparametric tests have been described (2-4).

The chief justification for using the nonparametric statistical tests is that they give reliable results with both usual (such as the gaussian) and unusual distributions, whereas the parametric tests may give unreliable results when the distributions are unusual. (This will be made clear by an example following the discussion of the sign test.) A second justification is that some nonparametric tests can be calculated within a few minutes and may therefore find use in practical situations in the field, where immediate and reliable decisions are required. (Not all nonparametric tests are simple to apply.)

Rapid Evaluation of Control Charts and Agreement between Methods (Figure 1)

Sign Test

Control charts are generally prepared by finding the mean values and multiples of the standard deviation from 20 or more consecutive determinations in the same specimen and then ruling these as lines on a graph containing future dates. Future values are plotted on the graph. In the application of the sign test three kinds of value points are recognized: those at the median, called ties; those above the median, called positive differences; and those below the median, called negative differences. The simple sign test used here ignores the magnitudes of the differences and the ties. [More elaborate nonparametric tests, such as the Wilcoxon signed-ranks test, do consider the magnitudes of the differences (2-4).]

The considerations given in the preceding paragraph also apply when differences between results obtained by two different methods for the determination of the same serum component are considered. Table 1 gives imaginary data for determination of glucose in several sera by two different methods, A and B. In these simple tests only the positive and negative differences are considered and both their magnitudes and the ties (zero differences such as for specimens 2 and 6) are ignored. Thus the differences

+0--+0++

are considered as being

+--+++.

Any group of differences, whether resulting from a control chart (where the difference is that of the found value minus the given mean) or from two different methods as in Table 1, is treated in exactly the same manner in the sign test. Data treated as in Table 1 are called “data pairs.” The sign test becomes most reliable when 20 or more values, including ties, are considered. Its greatest weakness lies in ignoring ties, because a large

Fig. 1. Critical values (5%, two-sided) for the sign test, the run test (test of number of runs), the Tukey quick test statistic (Tq), and the Kolmogorov-Smirnov statistic. In the nomograms, N1 = number of patients in smaller group of data and N2 = number of patients in larger group of data.

Table 1. Comparison of Results by Two Methods, for Eight Specimens

Specimen   Method A     Method B     Difference (Method A - Method B)
           mg/100 ml    mg/100 ml    mg/100 ml
1           66           64            2
2           73           73            0
3           71           76           -5
4          388          402          -14
5          105          101            4
6          139          139            0
7          256          251            5
8          307          300            7

Table 2. Differences Evaluated by the Sign Test

0--+++0-++-0--++0-+0+-+-

Table 3. Differences Evaluated by the Sign Test

0-- --+ + 0---0-- --+0- --0

Table 4. Differences for Some Simulated Data

0, -1, 0, -1, -2, -1, -2, 0, -1, -1, -2, -1, -1, 0, -1, 0, -1, 0, -2, +17

number of ties often indicates that all is well. When there are more than about 50 or 60 values it becomes a highly reliable procedure.

The imaginary data given in Tables 2 and 3 will be evaluated by the sign test. In the data of Table 2 there are 10 positive differences, nine negative differences, and five ties. The ties are ignored and only the sum 19 (= 10 + 9) of positive and negative differences is considered. The nomogram for the sign test indicates that for 19 differences there should be at least four of the rarer sign if there is no significant difference between methods or, as on a control chart, between found and expected results. Because nine is greater than four it is concluded that there is no significant difference.

In the data of Table 3 there are three positive differences, 25 negative differences, and five ties. Again the ties are ignored and only the 3 + 25 = 28 positive and negative differences are considered. The nomogram for the sign test indicates that for 28 differences there should be at least eight of the rarer sign if there is no significant difference between methods or results. Because three is less than eight, it is concluded that there is a significant difference.

In general it may be concluded that when there are about equal numbers of positive and negative differences and the number of differences exceeds 30 or 40, there is probably no significant difference between two sets of numbers. When there are fewer than 20 differences, the sign test result may not be reliable. In this situation the nonparametric Wilcoxon signed-rank test, especially well discussed by Natrella (4), should also be used. The run test, discussed below, may also be used.

Two examples of the effectiveness of the sign test are now given. Some imaginary data differences are shown in Table 4.
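As a cross-check on the two nomogram readings above for Tables 2 and 3, the same decisions can be reached from an exact binomial tail probability. The following Python sketch is an illustration only: it swaps the nomogram of Figure 1 for a computed two-sided p-value, and the function name `sign_test_p_value` is hypothetical, not something taken from the paper.

```python
from math import comb


def sign_test_p_value(n_pos: int, n_neg: int) -> float:
    """Exact two-sided sign-test p-value; ties must already be dropped."""
    n = n_pos + n_neg
    rarer = min(n_pos, n_neg)
    # Chance of seeing `rarer` or fewer of one sign when plus and minus
    # are equally likely (binomial distribution with p = 1/2).
    lower_tail = sum(comb(n, k) for k in range(rarer + 1)) / 2 ** n
    # Double the tail for a two-sided test, capped at 1.
    return min(1.0, 2.0 * lower_tail)


# Sign counts quoted in the text (ties already removed).
table2_p = sign_test_p_value(10, 9)   # Table 2: 10 positive, 9 negative
table3_p = sign_test_p_value(3, 25)   # Table 3: 3 positive, 25 negative

for name, p in [("Table 2", table2_p), ("Table 3", table3_p)]:
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name}: p = {p:.4g} ({verdict} at the 5% level)")
```

Under these assumptions both verdicts agree with the nomogram readings: no significant difference for Table 2 and a significant difference for Table 3.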

Moroney (1) and Henry and Dryer (5) have given examples of the use of Student’s t-test for paired comparisons. It is a test intended to determine if the mean difference found between two procedures is significantly different from an assumed mean difference. This assumed mean difference is usually, but not necessarily always, set at zero. (This is clearly shown in Moroney’s example.) In the use of Student’s t-test the calculated value of t is compared with that given in a graph (1) or table (5) for the given number of samples and level of significance sought (generally set at 5%). It is generally assumed that there is no significant difference between the values of two procedures when the calculated value of t is less than that given in the graph or table. Most textbooks do not adequately discuss the limitations of the Student’s t-test procedure.

The use of Student’s t-test (1, 5) would show that there is no significant difference between the two sets of data from which the differences (x = difference) of Table 4 are taken. $t = \bar{x}\sqrt{n-1}/s$, where $s^2 = (\sum x^2/n) - \bar{x}^2$ and $\bar{x}$ = mean value of x. Here $\bar{x} = 0$, and for 19 differences t does not exceed 2.1 in magnitude if the differences are not significant. If t does not exceed 1.96 with any number of tests and the distribution of the differences does not differ too much from gaussian, then it is concluded that there is no significant difference between the two sets of data. Inspection of Table 4 shows that the data do not approximate the gaussian distribution. Just calculating t without considering the nature of the distribution of Table 4 gives the misleading result t = 0.

The sign test for 13 negative differences and one positive difference (14 differences; the five ties are ignored) shows that there must be at least two values of the rarer sign. Since there is only one value of the rarer sign it is concluded from the sign test that there is a significant difference between the two sets of data from which Table 4 was taken. The erroneous conclusion from the naive use of Student’s t-test, that there is no significant difference, is discarded. Natrella (7) has given criteria for discarding data like the +17 of Table 4, because it is far different from any of the other x values. If +17 is discarded then t = -5.5, which indicates significant difference because its magnitude exceeds the maximum allowable value of 2.11 for the remaining 18 differences.

Examples of the use of Student’s t-test and the sign test to demonstrate agreement between two procedures for the determination of calcium have been given by Gindler and King (8). Using an electronic counter for erythrocytes, of the type in which diluted blood streams through a fine orifice, we have found, using the sign test, that some lots of control blood give values consistently above their stated mean and other lots give values consistently below their stated mean. In every case the values are well within the manufacturer’s stated standard deviation from the mean. In the case of unstable control materials or where several different lots of a control material are used consecutively, a control chart may be prepared by plotting the difference between found and given mean value versus time. Such a control chart indicates both precision and accuracy, if the given mean value is reliable.

Table 5. Serum Sodium Concentrations (mmol/liter)
Ranked values; those of the second set, italicized in the original, are shown in brackets.

136, 138, 138, 139, 139, [140], [142], [142], [143], [143], 145, 148, 148, [149], 150, 150, 151, [152], [152], [155], [155], [156]

Table 6. Division of Data of Table 3 into Two Parts, with Omission of Ties

A   ----/+/ /+/--   (5 runs)
B   /+/---   (3 runs)

Run Test

The run test considers the distribution of positive and negative differences; ideally both should be scattered in a random manner. For example, the sequence ----++++ is different from the sequence +-+-+-+-; the former has two runs and the latter has eight runs. By using the nomogram for the run test for these two sequences it is seen that for N1 = smaller sample size = 4 and N2 = larger sample size = 4, the closest listed situation is N1 = N2 = 5, for which at least four runs would be expected on a random basis. The two-run situation is not random whereas the eight-run situation is random. The two-run situation may be due to some source of bias such as an unstable calibrator solution or bias in the control material. Used along with the sign test, the run test can give valuable insights into evaluation of a procedure or a control chart. In the run test the ties may be ignored.

The run test may be used in other ways, limited only by the imagination of the laboratory worker. An example is its use in situations such as the determination of sodium, in which the upper limit of the normal range (155 mmol/liter) is only about 15% greater than the lower limit (135 mmol/liter). We shall imagine a situation in which serum was collected from 11 patients. Each specimen was divided into two tubes. The first set of tubes was analyzed immediately. The second set of tubes was allowed to stand open during the day so that the concentration of each increased 3% because of evaporation. The second set of tubes was then analyzed for sodium. The data are given in Table 5 in numerical order (a procedure called “ranking”) with the values of the second set italicized. In Table 5, N1 = N2 = 11. For N1 = N2 = 11 the nomogram for the run test shows that there should be at least eight runs if the situation is random, but there are only six runs in Table 5, indicating that there is a difference between the two sets of data.

In practical situations the laboratory has data available that have been accumulated over many months from a particular patient population. (Some laboratories that serve hospitals, nursing homes, and facilities that test healthy people distinguish between their different patient populations.) The data found on a particular day are ranked on a sheet of paper containing approximately the same number of earlier data for the particular population, with the earlier data coming from days that are known to be typical. If the run test shows a situation akin to that in Table 5, then some of the specimens should be reanalyzed, preferably with other reagents and calibrator solutions, and their entire post-collection history considered. This may help prevent release of data that are somewhat in error. The 3% evaporation figure was used here because the maximum allowable error in the determination of sodium is 3%. Control sera will not allow the detection of errors involving mishandling of specimens, such as evaporation in an icebox from loosely stoppered test tubes. Statistical tests often may indicate, but they do not necessarily prove, that such errors have occurred.

Considering the data in Table 2, without the ties, N2 = 10, N1 = 9, and the closest situation in the nomogram is N2 = N1 = 11, for which there should be at least eight runs in the random situation. Actually there are 11 runs, so the distribution is probably random.

Consideration of the data in Table 3, without the ties, shows N2 = 25, N1 = 3. As the nomogram does not go beyond N2 = 20, a simple method of sufficient accuracy for most purposes is to divide the data into about two equal parts, A and B, shown in Table 6. For convenience the ties have been omitted and the runs are separated by slanted lines. For part A, N2 = 12, N1 = 2, for which the closest approximation in the nomogram is N2 = 20, N1 = 2, with at least three runs necessary. For part B, N2 = 14, N1 = 1, for which the nearest approximation in the nomogram is again N2 = 20, N1 = 2, with at least three runs necessary. The run test shows that the data in Table 3 are randomly distributed. The sign test indicated bias. The two tests together indicate that the bias is probably distributed throughout the run. The reader may wish to use the run test to demonstrate that the data in Table 4 are not randomly distributed and that there is possibly bias somewhere in the run.
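The counting step of the run test described above lends itself to a short sketch. The Python below is an illustration under stated assumptions, not the paper's procedure: `count_runs` and `prob_runs_at_most` are hypothetical names, the exact lower-tail probability of the number of runs is substituted for the nomogram, and the Table 5 sequence is an assumed encoding of the ranked values ('-' for the first set of tubes, '+' for the second).

```python
from math import comb


def count_runs(signs: str) -> int:
    """Count runs of '+' and '-' in a sequence, skipping ties ('0')."""
    kept = [s for s in signs if s in "+-"]
    if not kept:
        return 0
    # A new run starts wherever a symbol differs from its predecessor.
    return 1 + sum(1 for a, b in zip(kept, kept[1:]) if a != b)


def prob_runs_at_most(r: int, n1: int, n2: int) -> float:
    """P(number of runs <= r) when n1 and n2 symbols are arranged at random."""
    total = comb(n1 + n2, n1)
    ways = 0
    for runs in range(2, r + 1):
        k, odd = divmod(runs, 2)
        if odd:  # runs = 2k + 1
            ways += (comb(n1 - 1, k - 1) * comb(n2 - 1, k)
                     + comb(n1 - 1, k) * comb(n2 - 1, k - 1))
        else:    # runs = 2k
            ways += 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
    return ways / total


# The two short sequences from the text.
print(count_runs("----++++"), count_runs("+-+-+-+-"))

# Table 2 (ties are skipped automatically): N1 = 9, N2 = 10.
table2 = "0--+++0-++-0--++0-+0+-+-"
runs2 = count_runs(table2)
print(runs2, prob_runs_at_most(runs2, 9, 10))

# Table 5, assumed encoding of the ranked values: N1 = N2 = 11.
table5 = "-----+++++---+---+++++"
runs5 = count_runs(table5)
print(runs5, prob_runs_at_most(runs5, 11, 11))
```

A small lower-tail probability means that so few runs would rarely arise by chance. In this sketch the six runs of Table 5 fall in that category, while the 11 runs of Table 2 do not.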

Evaluation of Distribution of Test Results

Tukey’s Quick Test

This test is not very rigorous. Its main purpose is to see if two sets of samples have about the same average value. Unlike simply calculating the two averages, the quick test involves no arithmetic and indicates if the averages are significantly different. (A statistical test must always indicate if differences are significant.) The greatest weakness of Tukey’s quick test is that it is highly dependent on the accuracy of the highest and lowest values of each set of samples and, in practice, these usually are the values most likely to be in error.

Consider the data given in Table 5. Although they have been ranked, there is no need for ranking in Tukey’s quick test. First the lowest and highest values in each set of data are found. (For convenience they may be marked by asterisks.) The first set of tubes varied between 136* and 151*, and the second set of tubes varied between 140* and 156*. We now count the number of values of the first set that lie below the lowest value of the second set and the number of values of the second set that lie above the highest value of the first set. Inspection of Table 5 shows that five values of the first set are less than 140 and five values of the second set exceed 151. The calculated statistic for Tukey’s quick test is Tq = 5 + 5 = 10. Inspection of the nomogram for Tukey’s quick test for N2 - N1 = 0, N1 = 11 (N2 = number of data in the larger set of samples, N1 = number of data in the smaller set of samples) shows that the two sets of data have significantly different average values if Tq exceeds seven. This test has shown the two sets of data to have significantly different average values.

If one set of data has both the highest and lowest values of the two sets of data, then Tq = 0 and Tukey’s quick test says that there is no significant difference between the averages of the two sets of data. If the lowest value of the second set of data had been 139 instead of 140, then there would be a tie between the 139 value of the first set and the 139 value of the second set. Any tied values are each assigned counts of 1/2 instead of 1.

In general, Tukey’s quick test is probably of limited reliability in the clinical laboratory and will be found to be most reliable when the upper and lower limits of each set of data are not too different from the other data in the set. That is, the upper and lower limits should not be outliers (7). The run test is less influenced by outliers than is Tukey’s quick test.
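The counting rule for Tq can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `tukey_quick_statistic` is hypothetical, and the split of Table 5 into the two sets of tubes is an assumption, reconstructed from the 3% evaporation described in the text.

```python
def tukey_quick_statistic(a, b):
    """Tukey quick-test count Tq for two samples; tied values count 1/2."""
    # One sample must hold the overall highest value and the other the
    # overall lowest; otherwise the statistic is zero.
    if max(a) > max(b) and min(a) > min(b):
        high, low = a, b
    elif max(b) > max(a) and min(b) > min(a):
        high, low = b, a
    else:
        return 0.0
    top, bottom = max(low), min(high)
    # Values of the higher sample above every value of the lower sample.
    upper = sum(1.0 if x > top else 0.5 if x == top else 0.0 for x in high)
    # Values of the lower sample below every value of the higher sample.
    lower = sum(1.0 if x < bottom else 0.5 if x == bottom else 0.0 for x in low)
    return upper + lower


# Assumed split of the Table 5 values (mmol/liter).
first_set = [136, 138, 138, 139, 139, 145, 148, 148, 150, 150, 151]
second_set = [140, 142, 142, 143, 143, 149, 152, 152, 155, 155, 156]

tq = tukey_quick_statistic(first_set, second_set)
print(tq)  # compared in the text with the nomogram limit of seven
```

With these assumed values the count reproduces the Tq = 5 + 5 = 10 of the worked example; the decision itself still rests on the nomogram limit read from Figure 1.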

Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test determines if an observed distribution of data agrees well with the expected distribution. Recent work indicates that it is at least as good as the older chi-square test for this purpose (9) and it has the advantage of requiring much simpler calculations. It is gradually displacing the chi-square test, especially with smaller numbers of data. Unfortunately some books on statistics, a field that has seen many developments in the past 25 years, persist in emphasizing the older tests. The basic purpose of the calculation in the Kolmogorov-Smirnov test is to determine if the difference between the observed and expected cumulative distribution at any value exceeds a fraction given in a table or nomogram.

We shall use the Kolmogorov-Smirnov test here first in a rather restricted form, by only considering the difference between observed and expected cumulative distribution at the value corresponding to the midpoint of the normal range. My observations when using this test with clinical laboratory data have been that it is usually in this region where one is most likely to encounter significant differences between observed and expected distributions. When using the Kolmogorov-Smirnov test, or any other test of distribution, all data of the run should be retained (except for erroneous values), including both normal and pathological values.

The data of Table 5 are considered here. The midpoint of the normal range is 0.5(135 + 155) = 145. At 145 mmol/liter the cumulative distribution of values between 0 and 145 mmol/liter in the first set is found by counting the number of values at and below 145 mmol/liter and the total number of values in the first set. (It is not necessary to rank the values.) There are found to be six values between 0 and 145 mmol/liter in the first set, and the total number of values in the first set is 11. Calculation shows that 6/11 = 0.545. The test of the second set of values will be to determine if 54.5% of the values are in the range 0 to 145 mmol/liter. Inspection of the 11 values of the second set shows that five values of the second set are in the range 0 to 145 mmol/liter. Calculation shows that 5/11 = 0.455. The difference between the observed and expected cumulative distributions is 0.455 - 0.545 = -0.090. For the 11 values of the second set of data it is found, from the nomogram for the Kolmogorov-Smirnov distribution, that the difference between the observed and expected cumulative distribution should not exceed 0.391 if there is no significant difference between them. Thus, at this point the Kolmogorov-Smirnov test has not detected the difference between the first and second sets of data. It is seen that for a small number of data the Tukey quick test and sign test are more sensitive than is the Kolmogorov-Smirnov test for the detection of changes in distribution.
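The restricted, single-point form of the Kolmogorov-Smirnov comparison used above amounts to two counts and a subtraction, which the following Python sketch makes explicit. The helper name `cumulative_fraction` is hypothetical, the split of Table 5 into two sets is an assumption reconstructed from the 3% evaporation described in the text, and the limit 0.391 is simply the nomogram value quoted in the text rather than something computed here.

```python
def cumulative_fraction(values, cut):
    """Fraction of the values lying at or below the cut point."""
    return sum(1 for v in values if v <= cut) / len(values)


# Assumed split of the Table 5 values (mmol/liter).
first_set = [136, 138, 138, 139, 139, 145, 148, 148, 150, 150, 151]
second_set = [140, 142, 142, 143, 143, 149, 152, 152, 155, 155, 156]

midpoint = 0.5 * (135 + 155)                           # midpoint of normal range
expected = cumulative_fraction(first_set, midpoint)    # first set as reference
observed = cumulative_fraction(second_set, midpoint)   # second set under test
difference = observed - expected

critical = 0.391  # nomogram value quoted in the text for 11 values
print(f"expected = {expected:.3f}, observed = {observed:.3f}")
print(f"difference = {difference:.3f}, exceeds limit: {abs(difference) > critical}")
```

Checking only one cut point keeps the arithmetic to a minimum, at the cost of possibly missing a larger difference elsewhere in the distribution.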

The sensitivity of the Kolmogorov-Smirnov test for the detection of changes in distribution increases markedly as the number of data increases, because the allowable difference between found and given cumulative distribution decreases with increase in the number of data (see nomogram). An actual example is presented here, the data being those obtained in the determination of albumin with an SMA 12/60 AutoAnalyzer (Technicon) by use of Pierce “SpecTru BCG” bromcresol green reagent. The samples were fasting blood sera collected in the early morning from patients who had been fasting and sleeping all night. The values found are given in Table 7.

Table 7. Serum Albumin Values^a (g/liter)

42  39  42  37  39  36  37  39  36
39  36  39  32  41  41  38  43  36
34  39  38  35  41  38  39  39  36
40  23  28  38  39  29  41  32  37

^a Data of Swedish-American Hospital, Rockford, Ill.

Inspection of these data shows that virtually no patient has serum albumin concentrations exceeding the 45 g/liter midpoint of the 38 to 51 g/liter normal range, a range that is expected only for ambulatory persons. In this situation it is not possible to apply the Kolmogorov-Smirnov statistical procedure at the midpoint of the normal range. Instead we shall look at the cumulative distribution at the lower extreme, 38 g/liter, of the normal range. A count shows that 27 of the 36 patients had serum albumin concentrations of 38 g/liter or less. We shall now see if the Kolmogorov-Smirnov statistic can detect a 3% increase in each albumin concentration value, as might result from evaporation. If the data in Table 7 are each multiplied by 1.03, 19 of the 36 patients now have values of 38 g/liter or less. We determine that the fraction 27/36 = 0.750 and the fraction 19/36 = 0.528. The difference is 0.750 - 0.528 = 0.222, and the Kolmogorov-Smirnov nomogram gives a maximum allowable difference in cumulative frequency of 0.221. This indicates that there is a significant difference in distribution, which is detected by the Kolmogorov-Smirnov test. (In this presentation an effort has been made to demonstrate both the statistical procedures and their limitations.)

Individual laboratories may determine their own expected distributions of various data. The above example of albumin data could have instead been the comparison of a known distribution, obtained over several weeks, with a distribution found on a single day. It is seen that the shift in distribution was not great, but was detected. (The reader may wish to confirm this as an exercise.)

The Kolmogorov-Smirnov test has other applications in the laboratory. It is not unduly influenced by outliers. It may be used to see if the distribution of data differs significantly from gaussian before the data are evaluated by other statistical methods, especially distribution-dependent methods such as Student’s t-test. It may be used when normal ranges are determined, especially with a large number of data, in order to see if the distribution postulated for the data is actually being followed. (The nonparametric methods proposed in recent years for the determination of normal range have the disadvantage of relying most heavily on the lower and higher values. They essentially use the central 95% of the data as the normal range.)

For laboratory tests in which about 40 or more specimens per day are analyzed, the Kolmogorov-Smirnov procedure may be carried out by using little more than counting as a tool. The counts may be made at several values, such as both ends of the normal range. For more than 40 values the approximate value of the Kolmogorov-Smirnov statistic is $1.36/\sqrt{N}$, where N is the number of values. It may easily be taught to students as a means of monitoring their own work. With smaller numbers of test results the run test and Tukey quick test may be used.

As statistics is rapidly developing in our own day, laboratory workers should be alert to newer and better techniques. Although the sign test is about two centuries old, the other nonparametric procedures discussed here were developed in the twentieth century.

The methods given here may be used to study population variables. For example, a population eating a low-protein diet, such as homeless men, can easily be shown to have significantly lowered serum urea nitrogen concentration when compared with a population of well-fed men in the same area. This, in turn, can lead to a more realistic medical interpretation of the data.
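The albumin comparison and the large-sample rule of thumb discussed above can be put side by side in a short Python sketch. It is offered as an illustration only: the counts 27 and 19 and the limit 0.221 are the figures quoted in the text, the function name `ks_large_sample_critical` is hypothetical, and the rule is taken in its usual form $1.36/\sqrt{N}$ for the 5% level.

```python
from math import sqrt


def ks_large_sample_critical(n: int) -> float:
    """Approximate 5% limit, 1.36 / sqrt(n), meant for n above about 40."""
    return 1.36 / sqrt(n)


# Albumin example: counts at or below 38 g/liter, as quoted in the text.
n = 36
before = 27 / n            # original data
after = 19 / n             # every value raised by 3%
difference = before - after
nomogram_limit = 0.221     # value read from the nomogram in the text

print(f"difference = {difference:.3f}, nomogram limit = {nomogram_limit}")
print(f"exceeds nomogram limit: {difference > nomogram_limit}")

# How the allowable difference shrinks as the number of values grows.
for size in (40, 100, 400):
    print(size, round(ks_large_sample_critical(size), 3))
```

The albumin difference clears the quoted nomogram limit only narrowly, which fits the remark that the shift was not great. The loop shows why larger daily workloads make the test more sensitive.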

References

1. Moroney, M. J., Facts from Figures, 2nd ed., Penguin Books, Baltimore, Md., 1953, pp 227-233. This inexpensive book is an excellent introduction to the older work in the field and to many common procedures.
2. Conover, W. J., Practical Nonparametric Statistics, John Wiley and Sons, New York, N. Y., 1971. This is now perhaps the best introduction to the field and will be of great value to all who deal with numbers. It is clearly written and examples are given. It gives many useful tables.
3. Dixon, W. J., and Massey, F. J., Jr., Introduction to Statistical Analysis, 3rd ed., McGraw-Hill, New York, N. Y., 1969, pp 335-361. In this widely used textbook there are useful examples and problems with answers.
4. Natrella, M. G., Experimental Statistics (National Bureau of Standards Handbook 91), U. S. Govt. Printing Office, Washington, D. C., 1963, pp 16-1 to 16-14, T-78, T-79. This practical manual is far ranging and deserves the widest possible distribution. It contains many practical examples and justifies close study. It should be consulted frequently by all who handle laboratory data or design experiments involving numerical data.
5. Henry, R. J., and Dryer, R. L., Some applications of statistics to clinical chemistry. Stand. Methods Clin. Chem. 4, 205 (1963). See especially pp 231-233, 237. This is a useful, well-known discussion. Some of the material has been discussed in a different manner in reference 6.
6. Henry, R. J., Cannon, D. C., and Winkelman, J. W., Clinical Chemistry, 2nd ed., Harper and Row, Hagerstown, Md., 1974, pp 287-371. In addition to being perhaps the best available book on clinical chemistry, this book contains good discussions on statistics, with recent applications to clinical chemistry.
7. Reference 4, pp 17-1 to 17-6.
8. Gindler, E. M., and King, J. D., Rapid colorimetric determination of calcium in biologic fluids with methylthymol blue. Amer. J. Clin. Pathol. 56, 376-382 (1972).
9. Reference 2, p 295.