Clinical Chemistry 42:4, 604-612 (1996)

Use of neural networks to diagnose acute myocardial infarction. I. Methodology

JØRGEN S. JØRGENSEN,1 J. BOIDEN PEDERSEN,1* and SUSANNE M. PEDERSEN2

We investigated several aspects of using neural networks as a diagnostic tool: the design of an optimal network, the amount of patients' data needed to train the network, the question of training the network optimally while avoiding overfitting, and the influence of redundant variables. The specific clinical problem chosen for illustration was the diagnosis of acute myocardial infarction, given only the electrocardiogram and the concentration of potassium in serum at the time of admission. We found that, in contrast to usual practice, the termination of the training process should be based on the generalization performance and not on the training performance. We also found that a principal component analysis can be used to eliminate redundant variables, thereby reducing the data space. The diagnostic performance of the neural network we used was 78%, superior to that of linear discriminant function analysis but similar to that of quadratic discriminant function analysis.

INDEXING TERMS: diagnosis, computer-assisted • electrocardiogram • potassium • discriminant analysis

During the last decade neural networks have been shown to be a powerful tool in many areas of science, but only in the last few years has the technique emerged in the field of diagnostics [1-7]. With the objective of examining the usefulness of neural networks as a diagnostic tool, we trained several networks by using laboratory data from patients suspected of having acute myocardial infarction (AMI).3 To provide a rapid diagnosis of AMI, we included two early markers of ischemic myocardial injury in the analysis: nine deviations of the electrocardiogram (ECG) and the concentration of serum potassium in blood taken at admission.

An obvious problem when designing a neural network is estimating the number of patients' data needed to train a network containing a specific number of neurons. In a discussion of neural networks, Cicchetti [8] addressed this problem and suggested that the subject:variable ratio should be at least 5:1, according to a paper by Fletcher et al. [9] on linear discriminant function analysis (LDFA). The argument stated was that a low ratio can lead to an artificially high performance because of chance relations in the training set. It is not clear, however, whether this rule has a wider range of applicability. Generally, an accidentally high training performance will yield a low rather than a high generalization performance. A rule of thumb, or even better, an indicator of whether a training set is sufficiently large will therefore be important.

In this work we were concerned with the design of an optimal network and with an examination of the variability of the performance as a function of the network parameters. Thus we investigated whether there exists an optimal number of neurons in a hidden layer and, if so, how it would be related to the number of laboratory variables. We also examined how many patients would be needed to train the network and whether a better performance could be obtained by elimination of redundant data. Finally, we compared the neural network results with those obtained by linear and quadratic discriminant function analysis (LDFA and QDFA). The type of neural network we used is the layered feed-forward neural network. This type of network has proved itself capable of learning, i.e., able to extract complex rules from examples shown to it and to generalize to new data belonging to the same distribution [10].

Materials and Methods

PATIENTS AND SAMPLES

Patients. The data used in this investigation refer to 250 patients with suspected AMI admitted to the coronary care unit at Svendborg Hospital: 125 with AMI and 125 with non-AMI. The diagnosis of AMI was based on the World Health Organization criteria [11], i.e., characteristic chest pain and typical evolutionary electrocardiographic findings, supplemented with serial changes in cardiac enzymes. The procedures followed were in accordance with the Helsinki Declaration of 1975, as revised in 1983.

1 Fysisk Institut, Odense Universitet, DK-5230 Odense M, Denmark.
2 Klinisk Kemisk Afdeling, Svendborg Sygehus, DK-5700 Svendborg, Denmark.
* Author for correspondence. Fax int +45 66158760; e-mail [email protected].
3 Nonstandard abbreviations: AMI, acute myocardial infarction; DFA, discriminant function analysis; ECG, electrocardiogram; LDFA, linear discriminant function analysis; PCA, principal component analysis; and QDFA, quadratic discriminant function analysis.
Received July 5, 1995; accepted January 31, 1996.


Training and test data. A training set of 100 cases was formed by randomly selecting from the two groups 50 cases of AMI and 50 cases of non-AMI. The remaining 150 cases were used as the test set, which also had equal numbers of AMI and non-AMI cases.

Concentration of potassium in serum. Potassium was analyzed with ion-selective electrodes on a Model 761 Monarch 2000 analyzer (Instrumentation Laboratory, Warrington, UK).

NEURAL NETWORKS

The term neural network covers a wide range of research, but one of the simplest and most well-studied types is the so-called layered feed-forward neural network. Like all networks, it consists of several small computational units (neurons) interconnected by axons in specific ways so that information can travel from the output section of one neuron to the input section of the following neuron. The structure of the layered feed-forward network, sketched in Fig. 1, consists of an input layer, zero or more hidden layers, and finally an output layer. The term "hidden" refers to the fact that these layers do not belong to the input/output region of the network and are therefore not observable from a user's point of view. The number of hidden layers as well as the number of neurons within a hidden layer may be varied, but for many applications one hidden layer is sufficient; e.g., all Boolean functions can be represented by one hidden layer. The number of input neurons must equal the number of input data; each marker has its own input channel. The number of output neurons is determined by the level of detail desired of the output. In the present work the desired output is AMI or non-AMI, which requires one or two output neurons.

As illustrated in Fig. 1, the neurons in one layer are fully connected to the neurons in the neighboring layers, but no connections exist between neurons within the same layer or to neurons in nonneighboring layers. The connection between any two neurons n and m is characterized by the weight w_nm, a real number signifying the strength of the connection. The output O_n of neuron n gives an input signal at neuron m equal to O_n·w_nm. Thus the weights describe how the output signal of a neuron is divided into input signals to the neurons in the following layer.


The flow of information in this type of network is one way only. The measured data are fed into the input neurons, the output of the input neurons is fed into the neurons of the first hidden layer, and so forth, ending as the output of the output layer.

Figure 2 illustrates the computational function of a neuron. The first part calculates the weighted sum of all the inputs arising from the outputs of the neurons in the preceding layer. A technical detail is that from this value is subtracted a so-called bias or threshold value, which is specific for neuron m. The bias can be viewed as a weight connecting neuron m with an invisible neuron of constant output equal to -1. The value of the sum is then fed into a transfer function that describes the action or output of the neuron corresponding to a given input. The simplest transfer function is a step function with only two values, -1 and +1, which correspond to firing and nonfiring of the neuron. A continuous transfer function implies that the output can be any value within a specified range. Usually the function is chosen to be linear around 0 and to have a sigmoid shape. We have used a hyperbolic tangent, which gives values in the range -1 to +1.

Training the network is accomplished by tuning the values of all the weights (including the biases mentioned above) so that, when presenting the training set, the difference between the actual output of the network and the desired output is minimized. The simplest method for training the network, i.e., tuning the values of the weights, is error back-propagation, a gradient descent method well known in curve-fitting problems. A description of this method may be found in standard textbooks [10], and this or a similar method is included in any commercial neural network software package. The minimization of differences is performed on a data set called the training set; in this way, the network learns the information contained in the training set. This obviously puts some requirements on the training set and also implies that the minimization has to be controlled in some way. The network should learn the important and underlying dependencies, but it should not learn any spurious ones arising from errors or inconsistencies that are inevitably present in the data. The learning of random dependencies is usually called overfitting.

The networks mentioned in this paper were constructed and trained by use of the neural network simulator program SN2 (Neuristique, Paris, France). The computer used was an HP/Apollo 9000 Series 400 (Hewlett-Packard, Chelmsford, MA).

Fig. 1. Information flow in the feed-forward network (input layer, hidden layers, output layer).

Fig. 2. Computational function of a neuron: a weighted sum followed by a transfer function.
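To make the neuron computation concrete, the following is a minimal sketch of the forward pass in Python (our illustration, not the SN2 program used by the authors): each layer forms the weighted sums of the preceding layer's outputs, subtracts the biases, and applies the hyperbolic-tangent transfer function.

```python
import numpy as np

def forward_pass(x, layers):
    """Propagate one input vector through a layered feed-forward network.

    `layers` is a list of (W, b) pairs: W[m, n] is the weight from neuron n
    of the preceding layer to neuron m, and b[m] is the bias (threshold)
    subtracted from the weighted sum before the transfer function.
    """
    out = x
    for W, b in layers:
        out = np.tanh(W @ out - b)  # weighted sum, minus bias, through tanh
    return out

# Example: 10 input neurons, one hidden layer of 15 neurons, 1 output neuron.
rng = np.random.default_rng(0)
layers = [
    (rng.normal(scale=0.1, size=(15, 10)), np.zeros(15)),
    (rng.normal(scale=0.1, size=(1, 15)), np.zeros(1)),
]
x = rng.normal(size=10)       # one patient's 10 scaled laboratory variables
y = forward_pass(x, layers)   # a value in (-1, +1); > 0 is read as AMI
print(y)
```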

CODING AND TRAINING THE NETWORK

Before training the neural network, one must transform the data into a suitable numeric form. In numerical problems it is always a good strategy to transform the variables to be of order 1, because that will minimize errors of imprecision. The same precautions are appropriate for neural networks, given that training a network is done by minimizing a function.

In the present study we used two potential markers of AMI: the serum concentration of potassium and the ECG, described by nine deviations, i.e., ST-segment abnormalities in leads I, II, III, and V1-V6. Only values that were numerically ≥0.1 mV in leads I, II, III, and V4-V6 and ≥0.2 mV in V1-V3 were recorded; smaller values were set to 0. These data are associated with the diagnosis of AMI or non-AMI. The values of potassium in serum were distributed symmetrically about the average value, and the values were therefore transformed by subtracting the mean (3.99 mmol/L) and dividing by the SD (0.37 mmol/L). Typical values of the ECG measurements are in the range -0.5 mV to +0.5 mV and are thus suitably scaled.

We trained several networks with the number of input neurons equal to the number of laboratory data (10) and a variable number of neurons in a hidden layer. The desired output, AMI or non-AMI, may be represented by a discrete two-valued function that can be modeled by two output neurons, of which only one fires at a time. However, we chose a different approach and used only a single output neuron. The output of this neuron is a number between -1 and +1, with negative numbers being interpreted as non-AMI and positive ones as AMI. Values close to +1 and -1 represent certain diagnoses, and values close to 0 represent very uncertain predictions. In training the network we attempted to get an output of +1 for AMI and -1 for non-AMI. The training of a network typically required 100 000 iterations, which took ~10 min of computer time.
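As a sketch of this coding step (the mean, SD, and cutoff values are taken from the text; the function and variable names are ours, and we read the nine leads as I, II, III, and V1-V6), one patient's raw data could be turned into a 10-component network input as follows:

```python
import numpy as np

K_MEAN, K_SD = 3.99, 0.37  # serum potassium, mmol/L (values from the text)

def encode_patient(potassium, st_deviations, leads):
    """Build the 10-component input vector: scaled potassium plus nine
    ST-segment deviations (mV), with sub-threshold deviations set to 0."""
    k_scaled = (potassium - K_MEAN) / K_SD
    ecg = []
    for lead, st in zip(leads, st_deviations):
        cutoff = 0.2 if lead in ("V1", "V2", "V3") else 0.1
        ecg.append(st if abs(st) >= cutoff else 0.0)
    return np.array([k_scaled] + ecg)

leads = ["I", "II", "III", "V1", "V2", "V3", "V4", "V5", "V6"]
x = encode_patient(4.4, [0.12, 0.0, -0.15, 0.25, 0.1, 0.0, 0.0, 0.3, 0.05], leads)
# The 0.1 mV in V2 and 0.05 mV in V6 fall below their cutoffs and are zeroed.
print(x)
```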

Results and Discussion

OPTIMAL TRAINING STRATEGY

Figures 3 and 4 illustrate typical learning curves. The performance of the network is displayed vs the number of iterations. In each iteration, the current network is applied to the data of one patient in the training set; the network is then adjusted to improve its ability to reproduce the training results. The performance is defined as the percentage of patients for whom the correct diagnosis has been predicted by the network. As expected, the network's ability to reproduce the training set, i.e., the training performance, increases monotonically with the number of training sessions until a maximum value is obtained asymptotically. This maximum represents the amount of information present in the training set that can be learned by the network. Use of a different number of neurons in the hidden layer or a different number of hidden layers may, and most likely will, change this maximum value.

Fig. 3. Performance of a neural network with 15 hidden neurons as a function of the number of iterations in the training process. (○), training performance; (△), generalization performance. The training and the test sets consist of 20 and 150 patients, respectively, with an equal number of AMI and non-AMI cases. Top and bottom panels correspond to two different training sets.

A network with a maximal training performance is not necessarily an optimal network with respect to its ability to generalize on unknown data. During the training session, the network will try to find any relation between the data used in training and the diagnosis, including accidental relations between the data caused by experimental scatter or other more or less random effects. Because such a dependence is characteristic of the particular training set, it will have a negative effect on the predictions of the network when applied to other data sets. This effect is called overfitting. If an accidental dependency is weaker than the main dependencies, it will be learned after the main dependencies in the learning process. Therefore, an optimal learning process should continue until all the wanted main dependencies have been learned, but it should stop before spurious dependencies are being learned. If the random and the main dependencies are equally strong, then there is no way to separate the two and the data are useless for training.

After a number of iterations, we tested and recorded the generalization performance of the networks on the independent test set (data from 150 patients). The result of a test of generalization performance has no influence on the network; the network only learns the training set. Rather, the test set is used to test the trained network's ability to generalize, i.e., to predict the correct diagnosis for unknown data.
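With the single-output coding used here, this performance measure is simply the fraction of patients whose network output falls on the correct side of 0. A minimal sketch (function and variable names are ours):

```python
import numpy as np

def diagnostic_performance(outputs, labels):
    """Percentage of correct diagnoses.

    `outputs` are network outputs in (-1, +1); `labels` are +1 for AMI
    and -1 for non-AMI. An output counts as correct when its sign
    matches the label.
    """
    outputs = np.asarray(outputs)
    labels = np.asarray(labels)
    return 100.0 * np.mean(np.sign(outputs) == labels)

print(diagnostic_performance([0.8, -0.3, 0.1, -0.9], [1, -1, -1, -1]))  # 75.0
```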

Fig. 4. Performance of a neural network with 20 hidden neurons as a function of the number of iterations in the training process. (○), training performance; (△), generalization performance. The training and the test sets consist of 100 and 150 patients, respectively, with an equal number of AMI and non-AMI cases.

The generalization ability of the networks, measured as the performance on an unknown test set, is also shown in Figs. 3 and 4 as a function of the number of iterations in the training process. The general behavior is that the generalization performance first increases until a maximum is reached, then decreases, and finally levels off. This behavior is immediately understandable in terms of the above comments. The maximum represents a network that has learned all the main dependencies present in the data set. The continuing training after the maximum is an overfitting, i.e., a learning of spurious dependencies present in the training set only. This overfitting results in erroneous predictions and thus in a lower generalization ability. Although this kind of behavior is quite general, the explicit location of the maximum and the size of the following decrease of the generalization are strongly dependent on the characteristics of the training set.

Figure 3 illustrates the large variations in the training and generalization performances that can be observed for small data sets. The panels of the figure correspond to two randomly selected training sets consisting of 20 patients, 10 with and 10 without AMI. The different behavior must be ascribed to statistical differences between the two sets. The upper panel has three different levels of learning performance and a very clear maximum of 65% in the generalization performance, corresponding to a training performance of 85%; i.e., 17 of the 20 patients in the training set have been learned. By continuing the training, the network learns 1 more patient (90% training performance), but this leads to a strong decrease in the generalization performance, which falls to 50%. Thus, the extra patient either has the wrong diagnosis or has an atypical pattern of laboratory test values, and exemplifies an instance of overfitting. The performance curves in the bottom panel are different: the learning curve goes fast and smoothly to a maximum value that is not changed by a continuation of the training. The corresponding generalization performance first increases to 60% and then drops only a little, to a constant value of 57%. The fact that the maximum value in this set is lower than that of the former set (and the network behavior in general) can be explained by assuming that the second set contains a mixture of good and bad examples with a strength that cannot be separated in the training process; i.e., this is an example of a data set that is not so usable for training because the main dependencies and the random, spurious dependencies are equally strong and cannot be separated.

The different results for the two sets show that it is possible by chance to obtain good results for a small training set, but the opposite is just as possible. Also, it is impossible to draw any definite conclusions about the main dependencies, although the form of the training curve might be indicative of the quality of the data. Large differences in generalization performance for different training sets are expected only for small data sets (see below).

Figure 4 displays the performance of a network that is trained on a large data set consisting of 100 patients. Again, the test performance increases to a maximum and then decreases because of overfitting. However, the maximum is broader and the decrease is smaller (3%) than for smaller data sets. Moreover, the curves show small fluctuations, indicating that there is no distinct minimum for the cost function used for the training process; i.e., many similar combinations of weights in the network can give rise to the same performance.

On the basis of these observations, we advise training the network to maximum generalization performance by monitoring the generalization during an initial training session, as in Figs. 3 and 4. When the maximum has been located, the network is trained again, starting from the same point as the initial run and for exactly the number of iterations corresponding to the maximum. If the training set is of sufficient quality, such a trained network will have learned all the general dependencies and no, or as few as possible, spurious dependencies. Not all neural network programs have the flexibility that allows this strategy. If the program does not have this option, one must make several training sessions with different values of the program parameters to locate the values that give maximum performance for the given training set.

One might imagine that this training to maximum performance would result in a network that has a maximum performance only with respect to the test set used. This cannot be the case, however, because the network training involves learning the training set only; the test set is used only to determine when to stop the training process. No information from the test set is used in the training. Although this argument should be convincing, we have nevertheless performed an independent and direct test, training an identical network on 100 patients from the previous test set and using the remaining 50 patients and the previous training set to test the performance. When this network was trained to maximum performance, its performance was identical to that of the original network (compare the last two columns of Table 1).
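The two-pass procedure advised above (monitor the generalization performance, then retrain from the same starting point and stop at the located maximum) can be sketched as follows. This is a minimal illustration, not the SN2 procedure: a single tanh neuron stands in for the full network so that the training step stays short, and the checkpoint interval and learning rate are arbitrary choices of ours.

```python
import numpy as np

def performance(w, b, X, y):
    """Percentage of patients whose diagnosis sign(tanh(x·w - b)) is correct."""
    return 100.0 * np.mean(np.sign(np.tanh(X @ w - b)) == y)

def train(w, b, X, y, n_iter, lr=0.01):
    """Plain online gradient descent on a single tanh neuron (squared error)."""
    w = w.copy()
    for i in range(n_iter):
        j = i % len(X)                       # present one patient per iteration
        out = np.tanh(X[j] @ w - b)
        grad = (out - y[j]) * (1.0 - out ** 2)
        w -= lr * grad * X[j]
        b += lr * grad                       # bias enters with a minus sign
    return w, b

def early_stopped_network(X_tr, y_tr, X_te, y_te, max_iter=20_000, step=500, seed=0):
    rng = np.random.default_rng(seed)
    w0, b0 = rng.normal(scale=0.1, size=X_tr.shape[1]), 0.0

    # Pass 1: restart from the same initial weights at each checkpoint and
    # record where the generalization performance peaks.
    best_perf, best_iter = 0.0, 0
    for n in range(step, max_iter + 1, step):
        perf = performance(*train(w0, b0, X_tr, y_tr, n), X_te, y_te)
        if perf > best_perf:
            best_perf, best_iter = perf, n

    # Pass 2: retrain from the same starting point and stop at the maximum.
    return train(w0, b0, X_tr, y_tr, best_iter), best_perf
```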

Table 1. Optimized generalization performance vs network parameters.a

No. of hidden        No. of patients in the training set b
neurons              10 (10)    20 (5)     50 (2)     100      100 c
0 d                  58 ± 5     61 ± 6     60 ± 1     62       69
5                    58 ± 5     62 ± 9     74 ± 2     72       77
10                   59 ± 6     61 ± 6     71 ± 4     75       77
15                   59 ± 5     60 ± 5     66 ± 11    75       76
20                   58 ± 5     60 ± 5     63 ± 9     74       77
30                   58 ± 6     61 ± 5     62 ± 5     75       73

a Based on 10 input neurons and 1 output neuron. The performance of each network was tested on a test set of 150 patients. Results are shown as the mean ± SD % performance of networks containing the indicated numbers of neurons in the hidden layer after optimal training on data sets of the indicated sizes.
b Numbers in parentheses give the number of independently selected training sets of each size; the SDs express the variation among these sets.
c Network trained on 100 patients drawn from the previous test set and tested on the remaining 150 patients (see text).
d No hidden layer; the input neurons are connected directly to the output neuron.

Conclusions

The present work demonstrates that a properly designed and trained neural network can be a useful diagnostic tool, although the full potential of this new technique still has to be explored. Even with the limited amount of data used (ECG and serum potassium at admission), we could obtain the correct diagnosis for 78% of all AMI-suspected patients, including patients with a bundle branch block. However, this high performance is obtained only by using a sufficient number of hidden neurons and patients' data, by applying PCA, and by optimally training the network. Too few hidden neurons will give rise to a lower performance, typically ~60%. Performance also decreases drastically if the training set is smaller than the critical size, i.e., the smallest set that gives optimal performance; our results suggest a critical size on the order of 4N(N + 2) examples, although data of low quality will require a larger training set. The use of more hidden neurons did not require more training examples when the optimal training procedure was used. The performance of the neural network was significantly better than that of LDFA and linear logistic regression (see Table 1) but similar to that of QDFA.

The neural network should be optimally trained by the supervised early-stopping procedure described above. The number of input variables, and thus the amount of numerical work, can be reduced by applying PCA; performance of the network, however, is independent of the number of input variables as long as the main information in the data is retained.

The present work was concerned with some of the fundamental problems that arise in the use of neural networks to solve clinical problems. In the subsequent paper [16] we apply the method described here to a problem of direct clinical relevance: training neural networks to diagnose AMI within 24 h after admission by using different combinations of laboratory data, and then using the results of the training to determine which combination of laboratory data gives the most certain diagnosis and which combination is the most cost-effective.
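The PCA reduction referred to above can be sketched generically (our illustration, not the authors' specific procedure; the component count and names are assumptions): project the scaled input variables onto their leading principal components and feed those to the network.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X (patients x variables) onto the leading
    principal components of the data."""
    Xc = X - X.mean(axis=0)                 # center each variable
    # Rows of Vt are the principal directions, ordered by variance.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, components

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))              # 100 patients, 10 scaled markers
X_reduced, components = pca_reduce(X, n_components=6)
print(X_reduced.shape)                      # (100, 6): fewer network inputs
```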

We thank J. Hertz, NORDITA, and P. Salamon, San Diego, for their comments on the manuscript. We are grateful to Fyns Amt for financial support and to the Danish Natural Science Research Council for a fellowship to J.S.J.

References

1. Bounds DG, Lloyd PJ. Comparison of neural network and other pattern recognition approaches to the diagnosis of low back disorders. Neural Netw 1990;3:583-91.
2. Mulsant BH. A neural network as an approach to clinical diagnosis. MD Comput 1990;7:25-36.
3. Boone JM, Gross GW, Greco-Hunt V. Neural networks in radiologic diagnosis. I. Introduction and illustration. Invest Radiol 1990;25:1012-6.
4. Gross GW, Boone JM, Greco-Hunt V, Greenberg B. Neural networks in radiologic diagnosis. II. Interpretation of neonatal chest radiographs. Invest Radiol 1990;25:1017-23.
5. Astion ML, Wilding P. Application of neural networks to the interpretation of laboratory data in cancer diagnosis. Clin Chem 1992;38:34-8.
6. Baxt WG. Use of an artificial neural network for the diagnosis of myocardial infarction. Ann Intern Med 1991;115:843-8.
7. Furlong JW, Dupuy ME, Heinsimer JA. Neural network analysis of serial cardiac enzyme data. A clinical application of artificial machine intelligence. Am J Clin Pathol 1991;96:134-41.
8. Cicchetti DV. Neural networks and diagnosis in the clinical laboratory: state of the art [Editorial]. Clin Chem 1992;38:9-10.
9. Fletcher JM, Rice WJ, Ray RM. Linear discriminant function analysis in neuropsychological research: some uses and abuses. Cortex 1978;14:564-77.

10. Hertz J, Krogh A, Palmer RG. Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley, 1991:352 pp.
11. World Health Organization, Regional Office for Europe. Ischaemic heart disease registers: report of the Fifth Working Group. Copenhagen: WHO, 1971:54 pp.
12. Jørgensen JS, Pedersen JB. Calculation of the variability of model parameters. Chemom Intell Lab Syst 1994;22:25-35.
13. Murtagh F, Heck A. Multivariate data analysis. Dordrecht, Holland: Reidel, 1987:210 pp.
14. Astion ML, Wilding P. The application of backpropagation neural networks to problems in pathology and laboratory medicine. Arch Pathol Lab Med 1992;116:995-1001.
15. Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine [Review]. Clin Chem 1993;39:561-77.
16. Pedersen SM, Jørgensen JS, Pedersen JB. Use of neural networks to diagnose acute myocardial infarction. II. A clinical application. Clin Chem 1996;42:613-7.