Int. J. Business Intelligence and Data Mining, Vol. 1, No. 3, 2006


Appraisal of companies with Bayesian networks

Priyantha Wijayatunga*, Shigeru Mase and Masanori Nakamura
Department of Mathematical and Computing Sciences, Tokyo Institute of Technology, W8-28, O-okayama, 2-12-1, Meguro-ku, 152-8552 Tokyo, Japan
E-mail: [email protected]
E-mail: [email protected]
E-mail: [email protected]
*Corresponding author

Abstract: Appraisal of companies is an important business activity. We mainly apply Bayesian networks to this classification task for data on Japanese electric companies. First, a few standard statistical techniques are applied. Then Bayesian networks are applied in four steps: (1) implementing a current procedure of economical experts, in which economical variables are clustered and then summarised to compute a score that decides the economical state of the company; (2) the same procedure, but with a data-based clustering of the economical variables; (3) the naive Bayes classifier; and (4) an improved naive Bayes classifier obtained by adjusting the conditional density of each feature variable given the class variable, which is initially obtained by maximum likelihood estimation. The adjustments are made by simulated annealing optimisation. Finally, a sensible way of appraising companies is discussed.

Keywords: dependence; naive Bayes assumption; Bayesian networks; probability adjustment; classification accuracy.

Reference to this paper should be made as follows: Wijayatunga, P., Mase, S. and Nakamura, M. (2006) ‘Appraisal of companies with Bayesian networks’, Int. J. Business Intelligence and Data Mining, Vol. 1, No. 3, pp.329–346.

Biographical notes: Priyantha Wijayatunga received his MSc Degree in Mathematical and Computing Sciences from the Tokyo Institute of Technology, Japan, and is currently working towards a Doctoral Degree there. His research interests include probabilistic reasoning, statistical learning theory and statistical modelling.

Shigeru Mase is a Professor in Statistics in the Department of Mathematical and Computing Sciences, Tokyo Institute of Technology, Japan.

Masanori Nakamura received his MSc Degree in Mathematical and Computing Sciences from the Tokyo Institute of Technology, Japan.

1 Introduction

Appraisal, or credit rating, of companies is a very important business activity that benefits both the companies themselves and potential investors. Usually, it is based on many economical and financial feature variables describing the performance of each company, which are used to predict the class variable: the economical and business state of the company.

Copyright © 2006 Inderscience Enterprises Ltd.


P. Wijayatunga, S. Mase and M. Nakamura

The working data set in this study concerns Japanese electric companies. It consists of data on a set of financial feature variables and a class variable that ranks each company’s financial state. At present, economists, accountants and investment analysts, whom we collectively call economical experts, usually perform credit rating by summarising distinct clusters of feature variables and then computing a score from those summaries. NEEDS-CASMA (Corporate Appraisal System by Multivariate Statistical Analysis), which is used for rating 2276 companies in Japan, is one such example. It summarises 15 financial indices into five indices through principal component analysis. A score is then calculated as a weighted sum of those summary indices, where the weights are determined by a regression analysis.

Bayesian networks, a type of probabilistic graphical model, have become very popular in the Artificial Intelligence community as a tool for reasoning under uncertainty since the seminal work of Pearl (1988). These probabilistic expert systems are defined over a set of random events (variables) and represent the probabilistic relationships among them. Their efficient reasoning (inference) power comes from their ability to model conditional independencies among the events concerned, and from the hyper-graphs derived from them, which allow inference algorithms to be implemented. A Bayesian network is thus a directed acyclic graph whose nodes represent random variables and whose links (arrows) represent probabilistic relationships, where each node is associated with a (conditional) probability distribution. Together, these (conditional) probability distributions represent the assumed joint probability distribution of the whole set of variables (see Section 2). With those inference algorithms, any marginal (conditional) distribution of interest, i.e., the probability distribution of some variable(s) given observations (evidence) on certain other variables, can be calculated efficiently. Thus a Bayesian network acts as a probabilistic expert system.

For classification tasks like ours, i.e., finding the probability distribution of the class variable given observations on some of the feature variables, the application of Bayesian networks is straightforward. In fact, Bayesian networks have been applied in almost every field that deals with random and uncertain contexts. In finance in particular, they have been applied as a general inference tool (Gemela, 2001) and as a classification tool (Baesens et al., 2002; Sarkar and Sriram, 2001; Shenoy and Shenoy, 1999). An important property of Bayesian networks is their transparent, or white box, nature (as opposed to the black box nature of neural networks), which allows efficient communication between the modeller and the application domain expert or user (in our case, economical experts and investors). Further, for discrete variables, Bayesian networks impose no distributional restriction other than the unrestricted multinomial assumption.

However, there are other types of models for the task, among them regression models, neural network models, discriminant analysis and, more recently, support vector machines. We apply some of them to our data set and show their performance. It is difficult to claim that there is a universally best method of classification, as some methods are preferred over others in certain contexts. But Bayesian networks have some advantages over them besides those mentioned above. They are able to make inferences from the available evidence, possibly partial, whereas neural networks and some regression methods may need data on all the feature variables to decide on the class variable.
This is very important when an investor who has only partial data on the feature variable set needs to classify his/her case. Further, regression models often assume mutual independence of the covariates (feature variables), but Bayesian networks do not. For these simple reasons, we are interested in building classifiers using Bayesian networks for our task. Note that there may be remedies for the problems faced by other classification methods, but we are interested in Bayesian networks and in possible improvements to them for classification tasks. We use a special type of Bayesian network called the naive Bayesian network, and variants of it.

In this paper, we first perform the classification task with discriminant analysis, neural networks and linear multiple regression (Venables and Ripley, 1994), without going into depth, using the built-in functions of the statistical language and environment R, which is freely available on the internet (R Development Core Team, 2004). Secondly, we apply Bayesian networks to the task. In this step, we try a few network models, as follows.

• A network model that implements a current classification method of economical experts, in which economical variables are clustered and then summarised to compute a score that decides the economical state of the company. This is done by constructing a Bayesian network on the class variable, the feature variables and their summaries, which are obtained by principal component analysis.

• A network model obtained in the same way, but with the clustering of variables based on the information contained in the data. To identify clusters of feature variables, we draw a hierarchical cluster dendrogram on the set of feature variables using the strength or weakness of the conditional dependencies among them given the class variable.

• The naive Bayes classifier (Domingos and Pazzani, 1997; Friedman et al., 1997; Heckerman, 1997), even though the data indicate a violation of the naive Bayes assumption, i.e., conditional independence of the feature variables given the class variable.

• An improved naive Bayes classifier obtained by simply adjusting the conditional density of each feature variable given the class variable, which is initially obtained by maximum likelihood estimation (or Bayesian estimation with non-informative priors), to increase the classification accuracy. These adjustments are made by a straightforward application of the simulated annealing optimisation function available in R. It may be that the adjusted probabilities capture the previously ignored conditional dependencies among the feature variables, as they give a better classification accuracy for the new naive Bayes model.

Note that this type of improvement is what one sees in neural network or support vector machine training: the learning algorithm tries to minimise the classification error when deciding on the parameters of the model. Thus, this is a simple proposal to improve the naive Bayes classifier without being restricted to the traditional maximum likelihood learning framework. We perform leave-one-out cross-validation on each model, except the last, to identify its classification accuracy. Further, we discuss a heuristic rule that can be more useful to users than classifying their test cases into the predefined categories of the class variable.

2 Bayesian networks as classifiers

2.1 Independence and Bayesian networks

First, let us briefly introduce Bayesian networks and their use as classifiers (for an extensive discussion of the topic, the reader is referred to Cowell et al. (1999) and references therein). Consider a Bayesian network on the discrete finite random variables $X = (X_1, \ldots, X_n)$. Then, under suitable conditions, we have

$$p(x) = \prod_{v=1}^{n} p(x_v \mid x_1, \ldots, x_{v-1}),$$

which is the chain rule of probability, where $p(x_v \mid x_1, \ldots, x_{v-1})$ is the conditional density of $X_v$ given $x_1, \ldots, x_{v-1}$. Suppose that $X_v$ is conditionally independent of $\{X_1, \ldots, X_{v-1}\} \setminus X_{\mathrm{pa}(v)}$ given $X_{\mathrm{pa}(v)}$, where $X_{\mathrm{pa}(v)} \subseteq \{X_1, \ldots, X_{v-1}\}$, i.e., $p(x_v \mid x_1, \ldots, x_{v-1}) = p(x_v \mid x_{\mathrm{pa}(v)})$ holds for $v = 1, \ldots, n$. Note that a conditional independence statement represents an irrelevance of information: in the above case, given $X_{\mathrm{pa}(v)}$, the density of $X_v$ is fully defined irrespective of the realisation of the variable set $\{X_1, \ldots, X_{v-1}\} \setminus X_{\mathrm{pa}(v)}$, for $v = 1, \ldots, n$. When causality is concerned, this means that, given some causes, some of the other causes become irrelevant or neutral. Then we have the factorisation of the probability distribution on $X$, say $P(X)$ with density $p(x)$,

$$p(x) = \prod_{v=1}^{n} p(x_v \mid x_{\mathrm{pa}(v)}).$$

This factorisation can be represented by a directed acyclic graph in which arrows are drawn to represent probabilistic influences among variables. The arrows are drawn from each variable in $X_{\mathrm{pa}(v)}$ to $X_v$ for $v = 1, \ldots, n$, and the variable set $X_{\mathrm{pa}(v)}$ is called the parents of $X_v$. It is assumed that, for each $v \in \{1, \ldots, n\}$, the conditional distribution of $X_v$ given $x_{\mathrm{pa}(v)}$ is multinomial with parameter, say, $\theta_{v \mid \mathrm{pa}(v)}$, written as

$$X_v \mid x_{\mathrm{pa}(v)}, \theta_{v \mid \mathrm{pa}(v)} \sim \mathrm{Mn}(\theta_{v \mid \mathrm{pa}(v)}). \tag{1}$$

It is easy to see that $P(X_v = k \mid x_{\mathrm{pa}(v)}) = E\{(\theta_{v \mid \mathrm{pa}(v)})_k\}$, where the expectation is taken with respect to $\theta_{v \mid \mathrm{pa}(v)}$ and $(\theta_{v \mid \mathrm{pa}(v)})_k$ is its $k$th component; the length of $\theta_{v \mid \mathrm{pa}(v)}$ equals the size of the state space of $X_v$. More simply, when we have no prior knowledge of the parameters $\theta_{v \mid \mathrm{pa}(v)}$ for $v = 1, \ldots, n$ (non-informative priors), maximum likelihood estimation on a given data set yields the conditional probability $P(X_v = k \mid X_{\mathrm{pa}(v)} = t)$ as the corresponding relative frequency, i.e.,

$$P(X_v = k \mid X_{\mathrm{pa}(v)} = t) = \frac{\#(X_v = k, X_{\mathrm{pa}(v)} = t)}{\#(X_{\mathrm{pa}(v)} = t)} \tag{2}$$

where $\#(X_{\mathrm{pa}(v)} = t)$ denotes the number of cases in the training data that have the realisation $t$ for the variable set $X_{\mathrm{pa}(v)}$, and $\#(X_v = k, X_{\mathrm{pa}(v)} = t)$ denotes the number of those cases in which $X_v$ additionally has the realisation $k$. Therefore, given a network structure and the training data, it is quite easy to define (empirically) the joint density of $X$ in terms of the conditional density of each $X_v$ given its parent variable set.
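The relative-frequency estimate of equation (2) can be illustrated with a short sketch. This is only a minimal illustration in Python (the paper's own computations were done in R); the function name `mle_cpt`, the dictionary layout and the toy cases are assumptions introduced here, not part of the original study.

```python
from collections import Counter

def mle_cpt(cases, child, parents):
    """Relative-frequency estimate of P(child = k | parents = t), as in equation (2)."""
    joint = Counter()   # counts #(X_v = k, X_pa(v) = t)
    marg = Counter()    # counts #(X_pa(v) = t)
    for case in cases:
        t = tuple(case[p] for p in parents)
        joint[(case[child], t)] += 1
        marg[t] += 1
    return {(k, t): n / marg[t] for (k, t), n in joint.items()}

# Hypothetical toy data: EC is the class variable, p1 a discretised feature.
toy_cases = [
    {"EC": 1, "p1": 1}, {"EC": 1, "p1": 1}, {"EC": 1, "p1": 2},
    {"EC": 2, "p1": 2}, {"EC": 2, "p1": 3}, {"EC": 3, "p1": 3},
]
cpt = mle_cpt(toy_cases, child="p1", parents=["EC"])
print(cpt[(1, (1,))])  # estimate of P(p1 = 1 | EC = 1)
```

Configurations that never occur in the training data are simply absent from the returned dictionary, which corresponds to an estimated probability of zero; this is one way in which sparse data make the plain maximum likelihood estimate fragile.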


Bayesian networks can be used to represent probabilistic influences, often causal, among a set of random variables. For a given random vector $X$ with observed data matrix $D$, one can fit a Bayesian network: first the probabilistic influence structure must be decided, and then the parameters are learned from the data. As humans are better at identifying the interaction structure of a set of variables, the network structure can be defined with the help of domain experts. However, there are also mathematical methods for learning the network structure. The parameters are then learned as described above. Note that when we have prior knowledge of the parameters of the Bayesian network, we encode it into the prior densities of the parameters and then carry out a posterior analysis to obtain the conditional probability densities $p(x_v \mid x_{\mathrm{pa}(v)})$ for $v \in \{1, \ldots, n\}$.

When deciding the structure of a Bayesian network, or testing its validity if it has been defined by domain experts, one may use the information contained in the data to identify (conditional) (in)dependence relations among the variables. In this way one can also aid the experts in their task. The (conditional) dependence between two variables (given another) is tested by the (conditional) mutual information (Cheng et al., 2002). For three random variables $X_1$, $X_2$ and $X_3$ with state spaces of sizes $r_1$, $r_2$ and $r_3$, respectively, the mutual information (MI) between $X_1$ and $X_2$ and the conditional mutual information (CMI) between $X_1$ and $X_2$ given $X_3$ are defined as follows.

$$\mathrm{MI}(X_1, X_2) = \sum_{x_1, x_2} p(x_1, x_2) \log \frac{p(x_1, x_2)}{p(x_1)\, p(x_2)} \tag{3}$$

$$\mathrm{CMI}(X_1, X_2 \mid X_3) = \sum_{x_1, x_2, x_3} p(x_1, x_2, x_3) \log \frac{p(x_1, x_2 \mid x_3)}{p(x_1 \mid x_3)\, p(x_2 \mid x_3)}. \tag{4}$$

It has been shown (Kullback, 1997) that, under the assumption of conditional independence of $X_1$ and $X_2$ given $X_3$, we have $2N \cdot \mathrm{CMI}(X_1, X_2 \mid X_3) \sim \chi^2_{r_3 (r_1 - 1)(r_2 - 1)}$, where $N$ is the number of cases used to calculate the conditional mutual information. Similarly, under the assumption of independence of $X_1$ and $X_2$, we have $2N \cdot \mathrm{MI}(X_1, X_2) \sim \chi^2_{(r_1 - 1)(r_2 - 1)}$. However, we have observed that researchers often do not use the critical values obtained from the corresponding distributions when they test for dependency, but instead use arbitrary values (Cheng et al., 2002) that can be larger than their theoretical counterparts. It seems that these arbitrary values are used in practice to avoid obtaining densely connected networks.

We can define another measure to test the dependency as follows. The Matsushita distance (Read and Cressie, 1988) between two discrete probability distributions $\Phi$ and $\Psi$, with densities $\phi$ and $\psi$, is

$$M = \left[ \sum_{x} \left\{ \sqrt{\phi(x)} - \sqrt{\psi(x)} \right\}^2 \right]^{1/2}. \tag{5}$$

It is a metric (a true distance); therefore, it can be used to measure the strength, or degree, of a (conditional) dependence relation. To measure the degree of dependency between $X_1$ and $X_2$, we let $\phi(x) = p(x_1, x_2)$ and $\psi(x) = p(x_1)\, p(x_2)$. The smaller the distance, the weaker the dependency, and zero distance implies independence. Since the Matsushita distance is a true distance, the derived measure may reflect the degree of dependency in some sense. Similarly, we can define it for measuring the degree of a conditional dependence relation.
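As an illustration of equations (3) to (5), the following Python sketch computes the empirical mutual information, the conditional mutual information and the Matsushita-distance dependence measure from lists of observed cases. The function names and the toy observations are hypothetical; natural logarithms are used so that $2N \cdot \mathrm{MI}$ can be compared with a chi-square critical value.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical MI(X1, X2) in nats from a list of (x1, x2) observations."""
    n = len(pairs)
    c12 = Counter(pairs)
    c1 = Counter(x1 for x1, _ in pairs)
    c2 = Counter(x2 for _, x2 in pairs)
    # (m/n) * log[(m/n) / ((c1/n)(c2/n))] simplifies to (m/n) * log(m*n / (c1*c2))
    return sum((m / n) * math.log(m * n / (c1[x1] * c2[x2]))
               for (x1, x2), m in c12.items())

def conditional_mutual_information(triples):
    """Empirical CMI(X1, X2 | X3) in nats from (x1, x2, x3) observations."""
    n = len(triples)
    by_x3 = {}
    for x1, x2, x3 in triples:
        by_x3.setdefault(x3, []).append((x1, x2))
    # CMI is the p(x3)-weighted average of the MI within each X3 = x3 slice.
    return sum(len(g) / n * mutual_information(g) for g in by_x3.values())

def matsushita_dependence(pairs):
    """Matsushita distance between p(x1, x2) and p(x1) p(x2), equation (5)."""
    n = len(pairs)
    c12 = Counter(pairs)
    c1 = Counter(x1 for x1, _ in pairs)
    c2 = Counter(x2 for _, x2 in pairs)
    total = 0.0
    for x1 in c1:
        for x2 in c2:
            phi = c12[(x1, x2)] / n
            psi = (c1[x1] / n) * (c2[x2] / n)
            total += (math.sqrt(phi) - math.sqrt(psi)) ** 2
    return math.sqrt(total)

# Hypothetical toy observations of two binary variables.
independent = [(a, b) for a in (1, 2) for b in (1, 2)]
dependent = [(1, 1), (2, 2)] * 10
stat = 2 * len(dependent) * mutual_information(dependent)  # compare with chi-square, 1 d.f.
print(mutual_information(independent), matsushita_dependence(independent))
print(stat, matsushita_dependence(dependent))
```

For the perfectly dependent toy sample the statistic is far above 3.84, the 5% critical value of the chi-square distribution with one degree of freedom, while both measures are zero for the independent sample.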


For a given data set, and possibly any available prior knowledge, one can build a Bayesian network. Thereafter it can be used for inference and classification. For example, it can be used to calculate the conditional distribution of a variable given some of the other variables, i.e., $p(x \mid \text{data on some other variables})$. In a classification task, the other variables are a subset of the feature variables and $X$ is the class variable. The predicted value $\hat{x}$ for $X$ is the one that maximises the above probability, i.e.,

$$\hat{x} := \arg\max_{x} \, p(x \mid \text{data on some other variables}). \tag{6}$$

The above probability can be evaluated by the junction tree algorithm (Cowell et al., 1999), which we do not describe here.

2.2 Naive Bayes’ network classifier

Consider a classification problem where $C$ is the unobservable class variable and $A = (A_1, \ldots, A_n)$ is the observable attribute or feature vector. For observed $A = a$, the state of the class variable is usually estimated by

$$\arg\max_{c} \, p(c \mid a) = \arg\max_{c} \frac{p(a \mid c)\, p(c)}{p(a)} = \arg\max_{c} \, p(a \mid c)\, p(c). \tag{7}$$

As $n$ increases, it may not be possible to evaluate the above equation explicitly, as there may not be enough data for every configuration of the set of variables. For example, if $n = 15$ and each variable has a state space of size 3, then approximately $4 \times 10^7$ cases are needed for each configuration of the set of variables to have at least one case. Clearly, even this number is not sufficient to define the empirical probability $p(a \mid c)$. So, a remedy often used is the so-called naive Bayes’ assumption: all the feature variables are conditionally independent of each other given the class variable (Domingos and Pazzani, 1997; Friedman et al., 1997). With this assumption, we let

$$\hat{c} = \arg\max_{c} \, p(c \mid a) := \arg\max_{c} \prod_{i=1}^{n} p(a_i \mid c)\, p(c). \tag{8}$$

This simplifying assumption can be represented as the Bayesian network shown in Figure 1. The naive Bayes classifier is therefore a special case of a Bayesian network, but it does not need the junction tree algorithm to make predictions.
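Equation (8) can be sketched in a few lines of Python. The sketch below is illustrative only: the names `fit_naive_bayes` and `predict`, the small probability `floor` used for unseen feature values, and the toy cases are all assumptions made here and are not taken from the paper. Working with sums of logarithms avoids numerical underflow of the product, and an unobserved feature (marked `None`) is skipped, which under the naive Bayes assumption amounts to marginalising it out.

```python
import math
from collections import Counter, defaultdict

def fit_naive_bayes(cases, labels):
    """Relative-frequency estimates of p(c) and p(a_i | c)."""
    n = len(labels)
    class_count = Counter(labels)
    prior = {c: m / n for c, m in class_count.items()}
    counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for a, c in zip(cases, labels):
        for i, value in enumerate(a):
            counts[(i, c)][value] += 1
    cond = {key: {v: m / class_count[key[1]] for v, m in ctr.items()}
            for key, ctr in counts.items()}
    return prior, cond

def predict(prior, cond, a, floor=1e-9):
    """arg max_c of log p(c) + sum_i log p(a_i | c); None marks an unobserved feature."""
    best_c, best_score = None, -math.inf
    for c, pc in prior.items():
        score = math.log(pc)
        for i, value in enumerate(a):
            if value is None:      # partial evidence: skip (marginalise out) this feature
                continue
            # `floor` is an assumed guard against log(0) for unseen values
            score += math.log(cond[(i, c)].get(value, floor))
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Hypothetical toy data: two features coded 1..3, class labels 1 and 3.
toy_cases = [(1, 1), (1, 2), (2, 2), (3, 3), (3, 2), (2, 3)]
toy_labels = [1, 1, 1, 3, 3, 3]
prior, cond = fit_naive_bayes(toy_cases, toy_labels)
print(predict(prior, cond, (1, 1)), predict(prior, cond, (3, None)))
```

The second call shows classification from partial evidence, the situation of an investor who holds data on only some of the feature variables.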

3 Description of the data

The data, which concern electric companies in Japan, consist of the class variable called ‘superiority of the company’ (EC), which is the economical experts’ categorisation of the company as ‘risky’, ‘ordinary’ or ‘excellent’, and a set of economical feature variables. Each feature variable has the state space {‘bad’, ‘moderate’, ‘good’} (in fact, the raw data are real valued). Further, as all the variables are ordinal, in our analysis we use the values 1, 2 and 3 for ‘risky’, ‘ordinary’ and ‘excellent’, respectively, for EC, and the values 1, 2 and 3 for ‘bad’, ‘moderate’ and ‘good’, respectively, for the feature variables. The data set consists of a total of 523 cases.

Figure 1 Naive Bayes assumption

For example, the raw data vectors for three companies are as follows, with the values ordered according to the feature variables listed below.

company = Sony (EC = 3): (2.99, 3.31, 2.48, 24.25, 147363, 129.53, 7.32, 0.882, 0.424, −1.38, 37.74, 166.90, 0.9, 31.72, 1.537)
company = Cannon (EC = 3): (9.92, 9.68, 9.69, 51.27, 113160, 190.86, 27.2, 0.671, 0.203, 7.83, 14.57, 23.93, 1.02, 58.94, 3.008)
company = TDK (EC = 2): (2.42, 2.97, 3.63, 74.11, 57713, 400.86, 40.6, 0.511, 0.004, 5.89, −150.50, −141.38, 0.81, 49.48, 0.570)

The datum for each feature variable for each company is official and is obtained in electronic form from ‘NEEDS’, a company financial database of the ‘Nikkei database’, which is not free of charge. Note that a similar printed publication by ‘Nikkei’, named ‘Nikkei Financial Analysis’, is also available in Japan. These sources are in Japanese.

The data for the class variable for all the companies concerned are obtained subjectively from economical experts, who estimate them from their profile analyses of the companies. That is, we take the expert-evaluated subjective state as the data for the class variable for each company. It is often the case that experts do not agree on the value for a given company. As a result, we sometimes have to consider a few estimates by different experts to decide on a value of the class variable for a particular company, and then a suitable value is chosen. Thus, the data for the class variable are imprecise.

The definitions of the feature variables, which are categorised into four groups, are as follows.

• Profit performance (p)
  $p_1 = \dfrac{\text{ordinary profit}}{\text{gross capital}} \times 100$;
  $p_2 = \dfrac{\text{ordinary profit}}{\text{sales}} \times 100$;
  $p_3 = \dfrac{\text{operating profit}}{\text{sales}} \times 100$.

• Financial security (s)
  $s_1 = \dfrac{\text{equity capital}}{\text{gross capital}} \times 100$;
  $s_2$ = free cash flow;
  $s_3 = \dfrac{\text{circulating assets}}{\text{current liabilities}} \times 100$;
  $s_4$ = instant coverage ratio;
  $s_5 = \dfrac{\text{fixed assets}}{\text{equity capital} + \text{fixed liabilities}} \times 100$;
  $s_6 = \dfrac{\text{liability with interest}}{\text{equity capital}} \times 100$.

• Growth potential (g)
  $g_1$ = percentage of increase in sales for the term;
  $g_2$ = percentage of increase in operating profit for the term;
  $g_3$ = percentage of increase in ordinary profit for the term.

• Corporate efficiency and productivity (e)
  $e_1 = \dfrac{\text{sales}}{\text{gross capital}} \times 100$;
  $e_2 = \dfrac{\text{inventory balance at the end of the term}}{\text{sales}} \times 100$;
  $e_3 = \dfrac{\text{ordinary profit}}{\text{number of employees}} \times 100$.

Clearly, there may be many functional relationships among the feature variables. Further, since there are many feature variables with which to predict it, it might be better if the class variable had a state space larger than three. But we have to solve this classification problem with the available information.

4 Computational experiments

4.1 A few standard statistical classification methods

We use the built-in R functions corresponding to standard statistical classification techniques, namely (1) quadratic discriminant analysis (qda), (2) linear discriminant analysis (lda), (3) a neural network (nnet) and (4) linear multiple regression (lm), and perform leave-one-out cross-validation on our data set to obtain the classification results shown in Tables 1 and 2. Note that the categorisation of feature variable values as ‘bad’, ‘moderate’ and ‘good’ is done carefully before they are used in the calculations, because some feature variables have seemingly outlying values.

Table 1 Overall performance of all the methods (for 523 cases)

Test error: number of cases (%)

Method      0             1             −1            2            −2
qda         255 (48.8)    233 (44.5)    21 (4.0)      12 (2.3)     2 (0.4)
lda         423 (80.9)    54 (10.3)     46 (8.8)      0 (0.0)      0 (0.0)
nnet        412 (78.8)    60 (11.5)     51 (9.7)      0 (0.0)      0 (0.0)
lm          286 (54.7)    123 (23.5)    103 (19.7)    4 (0.8)      7 (1.3)
BN-1        329 (62.9)    85 (16.3)     109 (20.8)    0 (0.0)      0 (0.0)
BN-2        366 (70.0)    61 (11.6)     96 (18.4)     0 (0.0)      0 (0.0)
naive-B     353 (67.5)    94 (18.0)     75 (14.3)     0 (0.0)      1 (0.2)
naive-B*    429 (82.0)    47 (9.0)      47 (9.0)      0 (0.0)      0 (0.0)

Table 2 Performance of all the methods according to the true value of EC

Test error: number of cases (%)

            True EC = 1 (59 cases)            True EC = 2 (342 cases)           True EC = 3 (122 cases)
Method      0          −1         −2          0           1          −1         0           1          2
qda         36 (61)    22 (37)    1 (2)       270 (79)    38 (11)    34 (10)    74 (61)     48 (39)    0 (0)
lda         39 (66)    20 (34)    0 (0)       298 (87)    18 (5)     26 (8)     86 (70)     36 (30)    0 (0)
nnet        31 (53)    28 (47)    0 (0)       299 (87)    20 (6)     23 (7)     82 (67)     40 (33)    0 (0)
lm          11 (19)    41 (69)    7 (12)      237 (69)    43 (13)    62 (18)    38 (31)     80 (66)    4 (3)
BN-1        19 (32)    40 (68)    0 (0)       241 (71)    32 (9)     69 (20)    69 (57)     53 (43)    0 (0)
BN-2        0 (0)      59 (100)   0 (0)       305 (89)    0 (0)      37 (11)    61 (50)     61 (50)    0 (0)
naive-B     48 (81)    10 (17)    1 (2)       210 (61)    67 (20)    65 (19)    95 (78)     27 (22)    0 (0)
naive-B*    54 (92)    5 (8)      0 (0)       261 (77)    39 (11)    42 (12)    114 (93)    8 (7)      0 (0)

Table 1 displays how the test errors (test error = true value of EC − estimated value of EC) for all 523 cases, found by cross-validation, are distributed for all the classification methods except the last. In leave-one-out cross-validation, for each case, a statistical model is learned from the rest of the cases, and then the prediction of the model is compared with the true value of EC for that case. Table 2 breaks down the overall results of Table 1. That is, in its three blocks, Table 2 displays how the test errors are distributed for the cases whose true value of EC is 1 (59 cases), 2 (342 cases) and 3 (122 cases), for all the classification methods. Figure 2 illustrates the information contained in Table 2 graphically for some of the methods. The figure is a matrix of stacked bars whose rows correspond to the true value of EC and whose columns correspond to the method.

Figure 2 Stacked-bar matrix: classification performance of method (column) by true value of EC (row). For example, the stacked bar in the first row and first column shows the classification performance of the method ‘lda’ for the cases whose true value of EC = 1 (59 cases). Numbers of cases are stacked in increasing order of the absolute value of the test error. Each stack has blocks for |test error| = 0,1

In the first row (true EC = 1), the numbers of cases with test error 0, −1 and −2 are stacked for each classification method. In the second row (true EC = 2), the cases with test error 0, −1 and 1 are stacked, and in the third row (true EC = 3), those with test error 0, 1 and 2. If any bar has only two blocks, it is missing the last error category. Note from Table 1 that both ‘nnet’ and ‘lda’ have very good overall classification accuracies, whereas ‘lm’ and ‘qda’ do not. Their performances for each true value of EC should be noted separately from Table 2.
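The leave-one-out procedure and the tabulation of test errors used for Tables 1 and 2 can be sketched as follows. This Python sketch is only an illustration of the procedure, not the authors' R code; `leave_one_out_errors` and the majority-class stand-in classifier are hypothetical names introduced here, and any `fit`/`predict` pair with the same shape could be substituted.

```python
from collections import Counter

def leave_one_out_errors(cases, labels, fit, predict):
    """Distribution of (true label - predicted label) under leave-one-out CV."""
    errors = Counter()
    for i in range(len(cases)):
        # Learn the model from every case except the i-th one.
        train_x = cases[:i] + cases[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        errors[labels[i] - predict(model, cases[i])] += 1
    return errors

# Hypothetical stand-in classifier: always predict the most frequent training label.
def fit_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]

def predict_majority(model, case):
    return model

toy_cases = [(1,), (2,), (2,), (3,), (2,), (1,)]
toy_labels = [1, 2, 2, 3, 2, 2]
errors = leave_one_out_errors(toy_cases, toy_labels, fit_majority, predict_majority)
print(sorted(errors.items()))
```

Grouping the same errors by the true label before counting would give the per-class breakdown of the kind reported in Table 2.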

4.2 Bayesian network methods

In this subsection, we describe a few Bayesian network models for our task, namely ‘BN-1’, ‘BN-2’, ‘naive-B’ and ‘naive-B*’. Note that classification is done by applying the junction tree algorithm and equation (6) for the first two networks, whereas the last two use only equation (8). To test the predictive performance of the first three networks, leave-one-out cross-validation is used, whereas for the fourth a different method, described later, is used. The classification results are shown in Tables 1 and 2. The junction tree algorithm and the other computations are implemented in R.

4.2.1 BN-1: clustering of feature variables by the experts

A current method of economical experts for the classification task is to calculate a score for a given company from summaries of the feature variables, which are obtained through principal component analysis. An obvious way to do this with Bayesian networks is to derive the summaries in the same way and then to include them as nodes of the Bayesian network, along with the feature variables and the class variable. Figure 3 shows the network structure obtained in this way. We find the principal components of each of the categories ‘profit performance’ (p), ‘growth potential’ (g), ‘corporate efficiency and productivity’ (e) and ‘financial security’ (s). These are the clusters made by economical experts. It is found that, for each of the categories ‘p’, ‘g’ and ‘e’, the first two principal components are sufficient, but for ‘s’ we cannot reduce the dimension. Therefore, we connect the original variables directly to the class variable for the category ‘s’, whereas the variables in the other categories are connected to the class variable via their respective principal components. Thus, we get the Bayesian network shown in Figure 3, where P1 and P2 are the two principal components of p1, p2 and p3, and so on. The classification results show that it has only a moderate performance.

Figure 3 Bayesian network for implementing a method of economical experts (BN-1)

4.2.2 BN-2: clustering of feature variables using conditional dependency

Next we try to cluster the feature variables according to the conditional dependency among them given EC. For that, we build a hierarchical cluster tree, which arranges strongly conditionally dependent variables into the same cluster. The Matsushita distance, rather than the conditional mutual information, is used; i.e., to measure the strength of the conditional dependence of $X_1$ and $X_2$ given $X_3$, we let $\phi(x) = p(x_1, x_2 \mid x_3)$ and $\psi(x) = p(x_1 \mid x_3)\, p(x_2 \mid x_3)$ in equation (5). The smaller the distance, the weaker the relation, and zero distance implies conditional independence. As the hierarchical cluster tree brings dependent variables together according to the degree of the dependency, the reciprocals of the strengths of the dependencies are used to build it. Figure 4 shows the hierarchical clustering, where the quantity Height in the figure refers to this reciprocal. Note that the clustering is done on the variables, based on their data, rather than on the data cases as is usually done.

Figure 4 Weakness of the conditional dependence relations among feature variables given the class variable

It can be seen that we obtain five clusters, namely {p1, p2, p3, e2, s4}, {s1, s3, s5, s6, e1}, {g1, g2, g3}, {s2} and {e3}, if we cut the cluster tree at a Height of 15. These clusters are somewhat close to the expert-defined clusters. Hierarchical clustering with the conditional mutual information gives slightly different clusters, which are not as close to the expert-defined clusters as those obtained with the Matsushita distance. In a similar way, we build a Bayesian network and perform the cross-validation on our data set. The classification results are included in Tables 1 and 2. Unfortunately, the classification results are not good for this network. In fact, its overall accuracy is superficial: as can be seen from Table 2, it tends to classify many cases as $\hat{ec} = 2$.
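The idea of clustering variables by the reciprocal of their dependence strength and cutting the tree at a given Height can be sketched as below. The paper does not state which linkage rule was used for Figure 4, so single linkage is an assumption made here: under single linkage, the clusters at a cut height are exactly the connected components of the graph that links every pair whose dissimilarity does not exceed that height. The function name and the strengths in the example are hypothetical.

```python
def cut_single_linkage(names, strength, height):
    """Clusters obtained by cutting a single-linkage tree at `height`.

    `strength[(a, b)]` is a dependence strength (e.g. a Matsushita distance);
    its reciprocal is used as the dissimilarity, so strong pairs merge first.
    Pairs missing from `strength` are treated as not linked.
    """
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for (a, b), s in strength.items():
        # Link a and b when their dissimilarity 1/s is at most the cut height.
        if s > 0 and 1.0 / s <= height:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical dependence strengths between five feature variables.
toy_names = ["p1", "p2", "g1", "g2", "e3"]
toy_strength = {("p1", "p2"): 0.5, ("g1", "g2"): 0.4,
                ("p1", "g1"): 0.02, ("p2", "e3"): 0.01}
clusters = cut_single_linkage(toy_names, toy_strength, height=15)
print(clusters)
```

With these illustrative numbers, the pairs with reciprocals 2 and 2.5 fall below the cut and are merged, while the pairs with reciprocals 50 and 100 stay apart.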

4.2.3 Naive-B: naive Bayes classifier

The naive Bayes classifier assumes conditional independence of the feature variables given the class variable, even though most real-world data sets do not show such a property. Nevertheless, it has shown considerable accuracy in prediction. This may not be surprising if the modes of the two conditional densities of the class variable given the feature variables, with and without the assumption, agree quite often, where the latter is often unknown owing to lack of data. In fact, if we could define the required conditional density without any assumption, we would not need another.

We apply the naive Bayes classifier, naive-B, shown in Figure 5, to our data set. From the classification results, based on cross-validation with equation (8), we can see that naive-B has a moderate overall classification accuracy, but for cases whose true value of EC is either 1 or 3, the accuracy is very good. The classification accuracy for the cases whose true value of EC is 2 is not impressive.

Figure 5 Naive Bayesian network for our classification problem (naive-B)

4.2.4 Naive-B*: adjusted naive Bayes classifier

A straightforward way to build a Bayesian network on the set of variables is to learn a structure from the data using mutual and conditional mutual information, or any other method (Cheng et al., 2002; Cowell et al., 1999; Heckerman, 1998). Surprisingly, however, such unrestricted Bayesian networks have been shown not to increase the prediction accuracy in general (Friedman et al., 1997). They may also be densely connected networks. More connections in the network mean that more data are required to estimate its parameters, as the number of parameters in the network becomes larger. Since our data are sparse, we are interested in the naive Bayes network and in extensions of it that keep the network structure the same.

For our data set, using the theoretical properties of (conditional) mutual information, i.e., theoretical critical values to decide on dependencies (see Section 2), yields a network structure in which the feature variables are densely connected among themselves. So we would again face the problem of having no data to estimate its parameters. Even if an arbitrary critical value of the kind used in common practice is used for testing the conditional independencies (some researchers use 0.01 as the critical value (Cheng et al., 2002) irrespective of the size of the training data set, whereas others use even bigger values), we obtain a network with densely connected feature variables. To avoid such problems, one can ignore some of the dependencies when building the network structure. One such example is the so-called tree augmented naive (TAN) Bayes classifier (Friedman et al., 1997), in which each feature variable is allowed to have at most one feature variable as a parent in addition to the class variable. The name derives from the fact that the subgraph induced by the feature variables forms a tree (or several trees). But in some cases the TAN Bayes classifier does not increase the classification accuracy. Therefore, we do not employ it here and instead try the following.

One earlier attempt to enhance the classification accuracy of the naive Bayes classifier was the so-called ‘recursive Bayesian classifier’ (Langley, 1993), in which the usual naive Bayes training was followed by a hierarchical building approach based on misclassifications in the training data. Another was to adjust the class probabilities at classification time using certain numerical weights, which was shown to improve the accuracy (Webb and Pazzani, 1998).


We instead try to increase the classification accuracy of the naive Bayes classifier by altering its conditional probability densities p(ai|ec), i = 1, …, n, obtained by maximum likelihood estimation. (Since we have no prior knowledge of the parameters of the Bayesian network, we use maximum likelihood estimation, as it gives the same results as Bayesian estimation in our case.) The altered probability densities need not reflect conditional independence among the feature variables given the class; therefore, they may suit the classifier better, as the data show such dependencies. The following two steps show how we derive the values of the parameters:

• initialise the probability densities p(ai|ec), i = 1, …, n, and p(ec) by maximum likelihood estimation

• iterate an optimisation procedure that changes p(ai|ec), i = 1, …, n, to obtain the optimal classification accuracy.
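The first step above, initialisation by maximum likelihood, can be sketched as follows for discrete (or discretised) feature variables. This is a minimal illustration in Python rather than the R code used for the paper. The function names `fit_naive_bayes`, `posterior` and `predict` and the toy data are hypothetical, and falling back to p(ec) when no class supports the evidence is an assumption of the sketch.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels):
    """Maximum likelihood estimates of p(ec) and p(a_i | ec) from counts."""
    n = len(labels)
    class_counts = Counter(labels)
    prior = {c: k / n for c, k in class_counts.items()}
    n_features = len(rows[0])
    # cond[i][c][v] = relative frequency of value v of feature i within class c
    cond = [defaultdict(dict) for _ in range(n_features)]
    for i in range(n_features):
        counts = Counter((c, row[i]) for row, c in zip(rows, labels))
        for (c, v), k in counts.items():
            cond[i][c][v] = k / class_counts[c]
    return prior, cond

def posterior(prior, cond, row):
    """p(ec | a_1, ..., a_n) under the naive Bayes factorisation."""
    joint = {}
    for c, p_c in prior.items():
        p = p_c
        for i, v in enumerate(row):
            p *= cond[i][c].get(v, 0.0)  # value unseen in class c: MLE is zero
        joint[c] = p
    total = sum(joint.values())
    if total == 0.0:
        return dict(prior)  # assumed fallback: no class supports the evidence
    return {c: p / total for c, p in joint.items()}

def predict(prior, cond, row):
    """Mode of the posterior, as used by the naive-B classifier."""
    post = posterior(prior, cond, row)
    return max(post, key=post.get)

# Hypothetical toy data: two discretised features, EC in {1, 2, 3}.
rows = [("low", "low"), ("low", "high"), ("high", "high"),
        ("high", "high"), ("low", "high"), ("high", "low")]
labels = [1, 1, 3, 3, 2, 2]

prior, cond = fit_naive_bayes(rows, labels)
post = posterior(prior, cond, ("high", "high"))
print(post, predict(prior, cond, ("high", "high")))
```

On this toy data the posterior for the case ("high", "high") is 0.8 for EC = 3 and 0.2 for EC = 2, so the mode-based prediction is 3.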

Simply, we use the simulated annealing optimisation function available in R for minimising the objective function (our error function),

$$\mathrm{obj}(\beta) := \sum_{i=1}^{N} \left(|ec_i - 2| + 1\right)^2 \left(x_i - ec_i\right)^2$$

where β = (log[p(a1|ec)], …, log[p(an|ec)]) and xi is the prediction of the classifier for the ith case. Note that the predicted vector at the jth iteration, say x(j), is a function of (p(j)(a1|ec), …, p(j)(an|ec)), the densities generated at the jth iteration. Also note that we use all the data as training data to obtain the initial estimates of the conditional probability densities p(ai|ec), i = 1, …, n (by maximum likelihood estimation). Further, we do not alter the density p(ec), as it is desirable that classifications with no evidence on the feature variables be based on the true distribution of the class variable even in this adjusted naive Bayes classifier.

The simulated annealing optimisation procedure is iterated 10,000 times to minimise the objective function. Then the vector of conditional densities (p̂(a1|ec), …, p̂(an|ec)) corresponding to the smallest value of the objective function is chosen. Among the vectors generated, this one gives the best classification performance as measured by the objective function. In some sense, the objective function implicitly tries to capture part of the dependence structure of the feature variables through adjusting (p(a1|ec), …, p(an|ec)). Therefore, we get an enhanced classification accuracy. The classification results are shown in the tables along with those for the other methods. It can be seen that the naive-B* classifier has the best overall classification accuracy and that it is the best for cases whose true value of EC is either 1 or 3. The classification accuracy for cases whose true value of EC is 2 is somewhat lower.

One remark is that our use of all the data to obtain the initial estimates (MLEs) of the vector (p(a1|ec), …, p(an|ec)) does not really contradict the cross-validation concept. This is because we adjust it 10,000 times through the simulated annealing optimisation procedure, and the estimate finally chosen is typically reached only after many iterations. In fact, the simulated annealing algorithm randomly generates a solution at each iteration, and these solutions form a Markov chain (Ansari and Hou, 1997). The solution chosen is thus a random proposal by the algorithm.
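The adjustment step can be sketched as below. In the objective, the weight (|ec − 2| + 1)^2 equals 4 for EC = 1 or 3 and 1 for EC = 2, so a one-step error on a risky or excellent company costs four times as much as one on an ordinary company. The sketch is written in Python with a hand-rolled annealing loop, not the R routine used for the paper (the paper does not name it; `optim` with `method = "SANN"` is a likely candidate, but that is an assumption). The cooling schedule, the Gaussian proposal, the floor for zero frequencies, the absence of renormalisation after each adjustment, the toy data and all function names are likewise assumptions made for illustration.

```python
import copy
import math
import random

CLASSES = (1, 2, 3)  # 1 = risky, 2 = ordinary, 3 = excellent

def mle_log_tables(rows, labels, n_values, floor=1e-6):
    """Logs of the maximum likelihood estimates of p(ec) and p(a_i | ec).

    Zero relative frequencies are floored at `floor` so that the log is finite
    (an assumption of this sketch, not something stated in the paper).
    """
    n_c = {c: labels.count(c) for c in CLASSES}
    log_prior = {c: math.log(n_c[c] / len(labels)) for c in CLASSES}
    beta = []
    for i, m in enumerate(n_values):
        table = {}
        for c in CLASSES:
            freq = [sum(1 for r, ec in zip(rows, labels) if ec == c and r[i] == v) / n_c[c]
                    for v in range(m)]
            table[c] = [math.log(max(f, floor)) for f in freq]
        beta.append(table)
    return log_prior, beta

def predict(log_prior, beta, row):
    """Class with the largest log p(ec) + sum_i beta[i][ec][a_i]."""
    scores = {c: log_prior[c] + sum(beta[i][c][v] for i, v in enumerate(row))
              for c in CLASSES}
    return max(CLASSES, key=scores.get)

def objective(log_prior, beta, rows, labels):
    """obj(beta) = sum_i (|ec_i - 2| + 1)^2 * (x_i - ec_i)^2."""
    return sum((abs(ec - 2) + 1) ** 2 * (predict(log_prior, beta, row) - ec) ** 2
               for row, ec in zip(rows, labels))

def anneal(log_prior, beta0, rows, labels, n_iter=2000, t0=1.0, step=0.5, seed=0):
    """Simulated annealing over beta only; log_prior is never changed."""
    rng = random.Random(seed)
    current = copy.deepcopy(beta0)
    cur_val = objective(log_prior, current, rows, labels)
    best, best_val = copy.deepcopy(current), cur_val
    for k in range(n_iter):
        temp = t0 / math.log(k + math.e)  # assumed cooling schedule
        cand = copy.deepcopy(current)
        i = rng.randrange(len(cand))
        c = rng.choice(CLASSES)
        v = rng.randrange(len(cand[i][c]))
        cand[i][c][v] += rng.gauss(0.0, step)  # random proposal for one entry
        cand_val = objective(log_prior, cand, rows, labels)
        # Metropolis rule: accept moves that are no worse, sometimes accept worse ones.
        if cand_val <= cur_val or rng.random() < math.exp((cur_val - cand_val) / temp):
            current, cur_val = cand, cand_val
            if cur_val < best_val:
                best, best_val = copy.deepcopy(current), cur_val
    return best, best_val  # tables with the smallest objective value seen

# Hypothetical toy data: a_1 takes values in {0, 1, 2} and a_2 in {0, 1}.
rows = [(0, 0), (0, 1), (0, 1), (0, 0), (1, 0), (1, 1), (1, 0), (1, 1), (2, 1), (2, 0)]
labels = [1, 2, 2, 2, 2, 2, 2, 2, 3, 3]

log_prior, beta0 = mle_log_tables(rows, labels, n_values=(3, 2))
init_val = objective(log_prior, beta0, rows, labels)
best_beta, best_val = anneal(log_prior, beta0, rows, labels)
print(init_val, best_val)
```

On this toy data the maximum likelihood tables give an objective value of 4, because the single risky case is predicted as ordinary. No classifier based on these two features can go below 1, so the annealed value lies between 1 and 4.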


4.3 A better rating method

Note that all the above models classify a given company as risky, ordinary or excellent based on the mode of the density p(ec|data). But sometimes this type of rating can be misleading. For example, consider the case of p(ec|data) = 0.11, 0.44 and 0.45 for ec = 1, 2 and 3, respectively. In this case, the user is simply informed that the company is excellent, although the probability of ‘ordinary’ is almost as high, so other information may be useful to the user for deciding the state of the company. So, a better way to inform the user is to give the mean of the density p(ec|data) rather than its mode. Then we get rating = 1 × 0.11 + 2 × 0.44 + 3 × 0.45 = 2.34 for this case. Note that in our domain of classification we do not get a p(ec|data) that is bimodal at EC = 1 and EC = 3, so the mean cannot be misleading. The graphs in Figure 6 show how the mean of EC is distributed for the naive-B* classification. Therefore, we propose a score of 50 × [E{EC|data} − 1], which varies over [0, 100], for the credit rating of companies.

Note that the graphs show some misclassification of companies (relative to the feature variables and according to the model) by the experts when they assign values to the variable EC. The distribution of the expectation of EC when the true value of EC is 1 shows an isolated mode at EC = 2. The cases that fall into this probability mass are classified as $\hat{ec}$ = 1 by the expert, but p(ec|a) should be symmetric about 2 to yield E[EC] = 2 for such a case. For cases in the probability mass near 1, in contrast, p(ec|a) should have most of its mass towards the left, near EC = 1. Similarly, the distribution of the expectation of EC when the true value of EC is 3 shows an isolated mode at EC = 2. Further, the distribution of the expectation of EC when the true value of EC is 2 shows two isolated modes at EC = 1 and EC = 3, and it can be seen that a significant number of cases may be misclassified in this case. This may be because of class noise, as the variable EC is evaluated subjectively by the experts. But if they are not in fact misclassifications, then there must be some attribute noise (feature variable noise) in those cases, as the feature variables do not support the classification by the experts.

Figure 6 Histograms of the means of the densities p(ec|data), where p(.) is the adjusted density
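The mean-based rating and the proposed score can be checked on the example from this section with a few lines of Python. This is only an illustrative sketch; the function names `mean_rating` and `credit_score` are hypothetical.

```python
def mean_rating(posterior):
    """E[EC | data] for a posterior given as {class value: probability}."""
    return sum(ec * p for ec, p in posterior.items())

def credit_score(posterior):
    """Score 50 * (E[EC | data] - 1), which lies in [0, 100] for EC in {1, 2, 3}."""
    return 50.0 * (mean_rating(posterior) - 1.0)

# Example from the text: p(ec | data) for ec = 1, 2, 3.
posterior = {1: 0.11, 2: 0.44, 3: 0.45}

mode = max(posterior, key=posterior.get)  # what a mode-based classifier reports
rating = mean_rating(posterior)
score = credit_score(posterior)
print(mode, round(rating, 2), round(score, 1))
```

The mode reports the company as excellent (3), whereas the mean gives 2.34 and a score of 67, which shows the user that the company sits only a little above ‘ordinary’.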

5 Discussion

By examining Table 1, it can be seen that both the linear discriminant analysis and the neural network model, fitted using the built-in functions in R, give overall classification accuracies very close to that of naive-B*. But the figures are very poor for both the quadratic discriminant analysis and the linear multiple regression models. The models BN-1, BN-2 and the popular naive Bayes (naive-B) have a moderate overall classification performance. Sparseness in the data has affected their performance. Through the principal component nodes, dependencies among subsets of feature variables are introduced; as the subsets become bigger, the sparseness in the data has a greater effect.

Note that a flat classifier, which classifies every case into the major category of the class variable in the data set (i.e., one that classifies every case as $\hat{ec}$ = 2), has the overall performance of (342 (65.39%), 122 (23.33%), 59 (11.28%), 0 (0.00%), 0 (0.00%)) for the test errors corresponding to Table 1. So, overall figures can be misleading for a data set like ours, in which 65.39% of the cases are in one category. This problem is seen in the literature, for example in credit card fraud detection (Chan et al., 1999), where the main interest of the credit card company is to predict the minor category correctly, i.e., to identify as many potentially deceitful applicants as possible. Therefore, from the investors' point of view, it is best to have a classifier with very good classification performance for the cases whose true value of EC is either 1 (risky) or 3 (excellent) (see Table 2). From the companies' point of view, however, a ‘risky’ company (true value of EC = 1) may like to be classified as either ‘ordinary’ ($\hat{ec}$ = 2) or ‘excellent’ ($\hat{ec}$ = 3), whereas an ‘excellent’ company (true value of EC = 3) does not want to be classified into a lower category. Also, in terms of attracting investors, ‘ordinary’ and ‘excellent’ companies may not want to see investors misled by a classifier whose prediction performance is poor for companies whose true economic state is ‘risky’. Further, companies whose true value of EC is 2 will have mixed feelings towards each of the classifiers. Of course, they would have little to say against BN-2, which is a poor classifier overall but classifies 89.18% of them correctly, unless they all ask to be classified as ‘excellent’ companies. Note that if a classifier does so for a certain subset of them, that subset would welcome it, but the others in the group may not. The reverse holds when companies in the middle category are classified as ‘risky’.

Also, as we have mentioned earlier, the data set contains subjective values (economic experts' opinions) for the variable EC rather than precise values, and the state space of EC is ordinal. Therefore, when experts assign each company a state, they are likely to make errors more frequently for companies whose true value of EC is 2 (‘ordinary’, the middle value of the ordinal scale) than for the two extreme states, i.e., 1 (‘risky’) and 3 (‘excellent’). In other words, class noise should be larger for cases with true value of EC = 2 than for the others. Some classifiers may be strongly affected by this noise, whereas others may be less sensitive to it. Thus, it can affect their classification performance, especially for cases with true value of EC = 2.

Note that naive-B* is a very good improvement on the ordinary naive Bayes classifier. But this improvement depends heavily on the objective function chosen and on the optimisation procedure.
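The point about misleading overall figures can be made concrete with a short Python sketch of the flat classifier. The class counts are taken from the flat-classifier figures quoted in this section (342, 122 and 59 cases); which of the two minority counts belongs to EC = 1 and which to EC = 3 is an assumption, and it does not affect the result. The function name `accuracies` is hypothetical.

```python
def accuracies(true_labels, predicted):
    """Return the overall accuracy and the accuracy within each true class."""
    hits = [t == p for t, p in zip(true_labels, predicted)]
    overall = sum(hits) / len(hits)
    per_class = {}
    for c in sorted(set(true_labels)):
        in_class = [h for h, t in zip(hits, true_labels) if t == c]
        per_class[c] = sum(in_class) / len(in_class)
    return overall, per_class

# Class counts consistent with the flat-classifier figures in the text.
# Assigning 122 to EC = 1 and 59 to EC = 3 is an assumption of this sketch.
counts = {1: 122, 2: 342, 3: 59}
true_labels = [c for c, k in counts.items() for _ in range(k)]
flat_prediction = [2] * len(true_labels)  # flat classifier: always 'ordinary'

overall, per_class = accuracies(true_labels, flat_prediction)
print(round(100 * overall, 2), per_class)
```

The flat classifier reaches 65.39% overall accuracy while classifying no risky or excellent company correctly, which is why the per-category figures matter more than the overall one for a data set like this.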

6 Conclusion

Bayesian network classifiers mimicking a current classification method proved to be poor in classification accuracy. This may raise the question of how accurate the experts' current method of classification is. The naive Bayes classifier has a moderate overall classification accuracy, but it performs well when classifying cases whose class variable takes values in the important categories. Generally, our improved naive Bayes classifier can be preferred over the other methods, even though its classification accuracy for cases whose true value of EC is 2 is lower than that of some of the other methods. In fact, the preference for a particular classifier may depend on the user who is going to use it.

In this exercise, we have seen that there can be competing methods for one classification task, some of which are good at classifying cases whose true states are in certain categories while others are better for other categories. In particular, we saw that some traditional classification methods can perform well. One to note is linear discriminant analysis and another is the neural network model, both of which performed well even in their basic form. So, it is always good to consider a few methods. Thereafter, combining classifiers may offset each other's drawbacks. In classification tasks like ours, where the class variable has an ordinal state space, it is more informative for the user to assess a test case by a rule based on the expectation of the class variable given the feature variables. When it comes to the naive Bayes classifier and its improvements, adjusting the conditional probability densities of the feature variables given the class seems to be a promising way.

References

Ansari, N. and Hou, E. (Eds.) (1997) Computational Intelligence for Optimization, Kluwer Academic Publishers, Boston, USA.

Baesens, B., Egmont-Petersen, M., Castelo, R. and Vanthienen, J. (2002) ‘Learning Bayesian networks classifiers for credit scoring using Markov Chain Monte Carlo search’, Proceedings of International Congress on Pattern Recognition, IEEE Computer Society, pp.49–52.

Chan, P.K., Fan, W., Prodromidis, L.A. and Stolfo, S.J. (1999) ‘Distributed data mining in credit card fraud detection’, IEEE Intelligent Systems, Vol. 14, No. 6, pp.67–74.

Cheng, J., Greiner, R., Kelly, J., Bell, D. and Liu, W. (2002) ‘Learning Bayesian networks from data: an information-theory based approach’, Artificial Intelligence, Vol. 137, pp.43–90.

Cowell, R.G., Dawid, A.P., Lauritzen, S.L. and Spiegelhalter, D.J. (1999) Probabilistic Expert Systems, Springer, New York, USA.

Domingos, P. and Pazzani, M. (1997) ‘On the optimality of the simple Bayesian classifier under zero-one loss’, Machine Learning, Vol. 29, pp.103–130.

Friedman, N., Geiger, D. and Goldszmidt, M. (1997) ‘Bayesian network classifiers’, Machine Learning, Vol. 29, pp.131–163.

Gemela, J. (2001) ‘Financial analysis using Bayesian networks’, Applied Stochastic Models in Business and Industry, Vol. 17, pp.57–67.

Heckerman, D. (1997) ‘Bayesian networks for data mining’, Data Mining and Knowledge Discovery, Vol. 1, pp.79–119.

Heckerman, D. (1998) ‘A tutorial on learning with Bayesian networks’, in Jordan, M.I. (Ed.): Learning in Graphical Models, The MIT Press, Massachusetts, USA, pp.301–354.

Kullback, S. (1997) Information Theory and Statistics, Dover Publications Inc., New York, USA, pp.155–188.

Langley, P. (1993) ‘Induction of recursive Bayesian classifiers’, in Brazdil, P.B. (Ed.): Proceedings of European Conference on Machine Learning, Lecture Notes in Artificial Intelligence, Vol. 667, Springer, Berlin, Germany, pp.153–164.

Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Francisco, California, USA.

R Development Core Team (2004) R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, URL http://www.R-project.org.

Read, T.R.C. and Cressie, N.A.C. (1988) Goodness-of-Fit Statistics for Discrete Multivariate Data, Springer-Verlag, New York, USA, pp.110–112.

Sarkar, S. and Sriram, R.S. (2001) ‘Bayesian models for early warning of bank failures’, Management Science, Vol. 47, No. 11, pp.1457–1475.

Shenoy, C. and Shenoy, P.P. (1999) Bayesian Network Models of Portfolio Risk and Return, https://kuscholarworks.ku.edu/dspace/bitstream/1808/161/1/CF99.pdf.

Venables, W.N. and Ripley, B.D. (1994) Modern Applied Statistics with S-Plus, Springer-Verlag, New York, USA.

Webb, G.I. and Pazzani, M.J. (1998) ‘Adjusted probability naive Bayes induction’, in Antoniou, G. and Slaney, J.K. (Eds.): 11th Australian Joint Conference on Artificial Intelligence, Lecture Notes in Computer Science, Vol. 1502, Springer-Verlag, Heidelberg, Germany, pp.285–295.