IEEE TRANSACTIONS ON ELECTRONICS PACKAGING MANUFACTURING, VOL. 23, NO. 4, OCTOBER 2000


Decomposition in Data Mining: An Industrial Case Study

Andrew Kusiak, Member, IEEE

Abstract—Data mining offers tools for discovery of relationships, patterns, and knowledge in large databases. The knowledge extraction process is computationally complex and therefore a subset of all data is normally considered for mining. In this paper, numerous methods for decomposition of data sets are discussed. Decomposition enhances the quality of knowledge extracted from large databases by simplification of the data mining task. The ideas presented are illustrated with examples and an industrial case study. In the case study reported in this paper, a data mining approach is applied to extract knowledge from a data set. The extracted knowledge is used for the prediction and prevention of manufacturing faults in wafers.

Index Terms—Data mining, decision making, decomposition, integrated circuit, quality engineering.

Manuscript received February 7, 2000; revised October 2, 2000. The author is with the Intelligent Systems Laboratory, The University of Iowa, Iowa City, IA 52242-1527 USA (e-mail: [email protected]). Publisher Item Identifier S 1521-334X(00)11075-4.

I. INTRODUCTION

THE VOLUME of data is growing at an unprecedented rate, both in the number of features (attributes) and objects (instances). For example, many databases with genetic information may contain thousands of features for a large number of patients. In technology applications, quantitative (e.g., from sensors) and qualitative (e.g., from the manufacturing environment) data from diverse sources may be linked, thus significantly increasing the number of features. For example, to analyze the quality of wafers in the semiconductor industry, upstream chemistry information proved to be useful. Such information combined with the manufacturing process data results in a large number of features and objects (cases) for which the information is collected.

Data mining offers tools for discovery of patterns, associations, changes, anomalies, rules, and statistically significant structures and events in data. The patterns and hypotheses are automatically extracted from data rather than being formulated by a user, as is done in traditional modeling approaches, e.g., statistical or mathematical programming modeling. As a new discipline, data mining draws from other areas such as statistics, machine learning, databases, and high performance computing.

In many applications, data is automatically generated and therefore the number of objects available for mining can be large. The time needed to extract knowledge from such large data sets is an issue, as it may easily run from seconds to days and beyond. One way to reduce the computational complexity of knowledge discovery with data mining algorithms and of decision making based on the acquired knowledge is to reduce the volume of data to be processed at a time, which can be accomplished by decomposition. In this paper, numerous decomposition approaches are defined and applied for effective knowledge discovery and decision making. Besides easing computation, decomposition offers an added benefit: it facilitates dynamic knowledge extraction that can be coupled with decision making, thus resulting in real-time autonomous systems able to continuously improve their predictive accuracy.

The research reported in this paper is based on developments in machine learning and data mining discussed, for example, in [1]-[3]. Some of the best known learning algorithms are listed next.

ID3: Induction Decision Tree, a supervised learning algorithm developed by Quinlan [4].
AQ15: An inductive learning system that generates decision rules, where the conditional part is a logical formula [5]. Domain knowledge is used to generate new attributes that are not present in the input data.
NaïveBayes: A simple induction algorithm that computes conditional probabilities of the classes. Given an instance, it selects the class with the highest posterior probability [6].
OODG: Oblivious read-Once Decision Graph, an induction algorithm for building oblivious decision graphs using a bottom-up approach [7].
Lazy decision trees: An algorithm for building the best decision tree for every test instance, developed by Friedman et al. [8].
C4.5: The decision-tree induction algorithm by Quinlan [9].
CN2: The direct rule induction algorithm by Clark and Boswell [10]. It combines the best features of ID3 [4] and AQ [5]: it uses pruning techniques similar to those used in ID3 and conditional rules related to those used in AQ.
IB: The instance-based learning algorithms by Aha [11].
OC1: The Oblique decision-tree algorithm by Murthy and Salzberg [12].


T2: The two-level error-minimizing decision tree by Auer et al. [13]. It minimizes the number of errors and discretizes continuous attributes.
LERS: Learning from Examples using Rough Sets [14].

Examples of other algorithms and developments in learning and data mining can be found in the edited volumes by Lin and Cercone [15], Carbonell [16], and Michalski et al. [17], and in the book by Mitchell [3]. For a survey of important applications of machine learning, see [18] and [19]. Most of the rule extraction algorithms developed to date fall into the following classes.

1) Decision tree algorithms, for example, ID3 [4] and C4.5 [9]. A decision tree algorithm derives a tree from a data set (learning mode) that is used to classify objects with unknown decisions (decision making mode).
2) Decision rule algorithms, for example, AQ15 [15]. A particular subcategory in this class includes algorithms based on the rough set theory proposed by Pawlak [20], for example, LERS [14]. The basic concept in rough set theory is called a reduct, which is a set of features that uniquely identifies each object in the training set. The formal reduct definition is provided in Pawlak [21].

Numerous algorithms have been developed for the extraction of decision rules from a training data set and for using the extracted knowledge for decision making. The two classes of algorithms are illustrated with the example presented in the next section.

In this paper, an approach based on the rough set theory is used to identify causes of wafer defects. One of the reasons for choosing the rough set approach over the decision tree approach is the belief that the former is more suitable for the problem considered in this paper. This can be justified by the characteristics of the rules derived by each of the two algorithms. One feature (the parent node of the decision tree) is common to all rules derived by a typical decision tree algorithm. The features included in the rules derived by a rough set algorithm form patterns of different shapes and therefore offer a different predictive power. The example presented in the next section illustrates the results produced by the two classes of algorithms.
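To make the reduct concept mentioned above concrete, the sketch below enumerates minimal feature subsets that still distinguish every object of a small training table. The toy table and the brute-force search are assumptions made for illustration; they are not the data of Fig. 1 or the rough set algorithm used later in the paper.

    from itertools import combinations

    # Toy training table: each row is (F1, F2, F3, decision D).
    # Illustrative data only, not the data set of Fig. 1.
    objects = [
        (1, 0, "A", "good"),
        (1, 1, "A", "bad"),
        (0, 1, "B", "good"),
        (0, 0, "B", "bad"),
    ]
    features = ["F1", "F2", "F3"]

    def distinguishes(subset_idx):
        """True if the chosen feature columns give every object a unique signature."""
        signatures = [tuple(obj[i] for i in subset_idx) for obj in objects]
        return len(set(signatures)) == len(objects)

    # A reduct is a minimal feature subset that still uniquely identifies each object.
    reducts = []
    for size in range(1, len(features) + 1):
        for idx in combinations(range(len(features)), size):
            if distinguishes(idx) and not any(set(r) <= set(idx) for r in reducts):
                reducts.append(idx)

    print([[features[i] for i in r] for r in reducts])   # e.g., [['F1', 'F2'], ['F2', 'F3']]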

Fig. 1. Training data set.

II. EXAMPLE OF RULE EXTRACTION

The example presented next illustrates the rules derived by different algorithms.

Example 1: Consider the training set in Fig. 1 containing data for eight objects (manufacturing batches). Each batch is described by four features, F1-F4, some being process parameters and the remaining ones characterizing each batch of products.

The rules in Fig. 2 are derived from the data in Fig. 1 with the decision tree algorithm [9]. The objects supporting each decision rule are listed after each rule. The elements of the data set included in the rules in Fig. 2 create a pattern in the matrix in Fig. 3. All feature values of column F4 are involved in the rules, which is characteristic of the tree-type algorithms. Another set of rules, extracted with a rough set algorithm, and the corresponding patterns are shown in Figs. 4 and 5. In this case the values of column F4 are only partially involved in the rules. In addition, object 3 appears in two rules, rules 3 and 4.

The existing rule extraction algorithms are computationally complex, often aiming at forming rules with the minimum number of features. Experience indicates that many data sets contain both a large number of features and a large number of objects, as the data is frequently collected automatically. For example, data sets in the semiconductor industry may contain a large number of features describing chemical composition and process-related information, all being continuously collected. Decomposition of such data sets becomes a key issue.

III. DECOMPOSITION IN THE DATA MINING LITERATURE

Decomposition has been discussed in the data mining literature, however, largely in the context of distributed and parallel learning. This paper emphasizes the use of decomposition to enhance decision making rather than learning. One of the most comprehensive sources of papers on distributed learning is the edited volume by Zaki and Ho [22]. Several contributors to this book discussed ways of leveraging parallel and distributed techniques in knowledge discovery, such as data cleaning and preprocessing, transformation, and learning. Grossman et al. [23] outlined fundamental challenges for mining large-scale databases, one of them being the need to develop distributed data mining algorithms. Guo and Sutiwaraphun [24] described a meta-learning concept named Knowledge Probing for distributed data mining. In Knowledge Probing, supervised learning is organized into two stages. At the first stage, a set of base classifiers is learned in parallel from a distributed data set. At the second stage, the relationship between an attribute vector and the class predictions from all of the base classifiers is determined. Zaki [25] discussed a project called SPIDER that uses shared-memory multiprocessor systems (SMPs) to accomplish parallel data mining on distributed data sets.

The above references indicate that distributed data mining is an active area of research. As this research continues, the outcomes are likely to parallel those of single-source data mining: there will be multiple approaches, some based on unique software (e.g., mobile agents), some based on specific hardware (e.g., SMPs), and some involving hybrid methods. Certain approaches may work better for a given application, as in the case of nondistributed data mining. The continuing growth of databases and the increasing success of data mining in solving problems in engineering, business, marketing, finance, and healthcare will determine the future needs for distributed data mining.


Fig. 2. Rule set derived by the C4.5 algorithm.

Fig. 3. Patterns corresponding to rules in Fig. 2.

Fig. 4. Rule set derived by the rough set algorithm.

Fig. 5. Patterns corresponding to the rules in Fig. 4.

Numerous researchers have recognized the importance of decision making in a distributed environment. One approach to mining and combining useful information that is distributed across diverse databases is to apply different machine learning algorithms to discover the patterns exhibited in the data and then combine the computed descriptive representations. Combining multiple classification models has received attention in the literature [26]. Much of the research is concerned with combining models obtained from different subsets of a single data set as a means of increasing accuracy and of integrating distributed information. The Java agents for meta-learning (JAM) system addresses the latter by employing meta-learning techniques [27] that extract higher level knowledge. Integrating the outputs of multiple classifiers via combiners or meta-learners has led to substantial improvements in several difficult pattern recognition problems. In a typical setting investigated, each classifier is trained on data taken or re-sampled from a common data set, or randomly selected partitions thereof, and thus experiences a similar quality of training data. However, in distributed data mining involving heterogeneous databases, the nature, quality, and quantity of data available to each site/classifier may vary substantially, leading to large discrepancies in their performance; integrating classification models derived from distinct and distributed databases is therefore complex. The decomposition concept discussed in this paper is applicable to decision making in both distributed and single-data-source applications.
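As a rough illustration of the model-combination idea discussed above, the sketch below trains base classifiers on disjoint partitions of a single data set and combines their predictions by majority voting. The scikit-learn decision trees and the synthetic data are stand-ins chosen for the example; they are not the JAM system or any algorithm used in this paper.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(90, 5))          # 90 objects, 5 discrete features
    y = (X[:, 0] + X[:, 3] > 2).astype(int)       # synthetic decision

    # Partition the objects into three subsets and learn one base classifier per subset.
    base_models = []
    for part in np.array_split(np.arange(len(X)), 3):
        model = DecisionTreeClassifier(max_depth=3).fit(X[part], y[part])
        base_models.append(model)

    def combined_predict(x):
        """Majority vote over the base classifiers (a simple stand-in for a meta-learner)."""
        votes = [int(m.predict(x.reshape(1, -1))[0]) for m in base_models]
        return max(set(votes), key=votes.count)

    print(combined_predict(X[0]), y[0])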

IV. DECOMPOSITION PRINCIPLES

There are two basic approaches to data mining:
1) direct mining of data sets;
2) mining of transformed data sets.
The first approach is most often applied to data sets that can be processed in a reasonable time by the existing data mining algorithms. Transforming data sets before using a data mining algorithm is intended for large data sets. Perhaps the most useful form of transformation of data sets is decomposition. The decomposition may take place in space and time. The area of decomposition in time is extensive; it has received some coverage in the literature (see [22]) and is beyond the scope of this paper. The following two forms of decomposition in space are considered in this paper.
1) Feature set decomposition: partitioning a data set based on features, e.g., columns of the spreadsheet.
2) Object set decomposition: partitioning a data set based on objects (training examples), e.g., rows of the spreadsheet.
The two types of decomposition offer numerous benefits to data mining:
1) reduction of computational time;
2) increased transparency of the data mining process;
3) suitability for parallel data mining algorithms;
4) increased effectiveness of the decision making based on the extracted knowledge.
The first three benefits are rather obvious and do not require justification. The decision making effectiveness argument will be explored and discussed in more detail in this research. The benefits are due to the fact that a decision is usually made based on a subset of features, according to the algorithm presented next (a code sketch of the matching step follows the algorithm).

Decision Making Algorithm 1
Step 1) Match the feature values of the object with unknown outcome with the conditions of the extracted rules. If no conflict occurs, go to Step 2; otherwise go to Step 3.
Step 2) Assign to the new object the decision of the matching rule(s), and stop.
Step 3) Generate the most likely solution and assign a significance measure to it.
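A minimal sketch of Decision Making Algorithm 1, assuming rules are stored as dictionaries of feature conditions together with a decision and a support count. The rule format and the support-based significance measure are assumptions made for illustration, not the exact representation used by the rule extraction algorithm applied in this paper.

    # Each rule: conditions that must all hold, the decision it supports, and its support.
    rules = [
        {"if": {"F1": 1, "F4": "high"}, "then": "accept", "support": 3},
        {"if": {"F2": 0},               "then": "reject", "support": 1},
    ]

    def decide(obj):
        """Decision Making Algorithm 1: match exact rules first, otherwise approximate."""
        # Step 1: find rules whose every condition matches the object's feature values.
        matched = [r for r in rules
                   if all(obj.get(f) == v for f, v in r["if"].items())]
        decisions = {r["then"] for r in matched}
        if len(decisions) == 1:
            # Step 2: no conflict, assign the decision of the matching rule(s).
            return decisions.pop(), 1.0
        # Step 3: conflict or no match; return the most likely decision with a
        # significance measure (here: share of total support, an assumed metric).
        pool = matched if matched else rules
        best = max(pool, key=lambda r: r["support"])
        total = sum(r["support"] for r in pool)
        return best["then"], best["support"] / total

    print(decide({"F1": 1, "F2": 5, "F4": "high"}))   # exact match -> ('accept', 1.0)
    print(decide({"F1": 0, "F2": 9, "F4": "low"}))    # no match -> approximate decision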


The decision significance measure could be expressed with different metrics, e.g., probability or rule support. For a discussion of various measures of rule significance see [19] and [28]. Note that this decision making algorithm uses rules that have been previously generated with a data mining (learning) algorithm. The object's feature set may not fully match the feature set of any individual decision rule, thus resulting in the generation of an approximate solution (see Step 3 of Decision Making Algorithm 1) or possibly no decision at all. Decomposition of the training data set allows for the extraction of rules with the set of features dictated by the object with unknown outcome. In essence, the algorithm outlined next can be used for decision making (a code sketch follows below).

Decision Making Algorithm 2
Step 1) List all feature values of an object with unknown outcome.
Step 2) For the features of Step 1, extract rules with a data mining algorithm.
Step 3) Match the feature values of the object with unknown outcome with the conditions of the extracted rules. If no conflict occurs, go to Step 4; otherwise go to Step 5.
Step 4) Assign to the new object the decision of the matching rule(s), and stop.
Step 5) If possible, get additional feature values and go to Step 1; otherwise go to Step 6.
Step 6) Generate the most likely solution.

Of course, this algorithm is one of many that could be used for decision making. To increase the accuracy of decision making, the concept of orthogonal algorithms was proposed in [29]. According to the orthogonality concept, the rule-matching algorithm (e.g., Decision Making Algorithm 1 or 2) generates a P (primary) decision that is compared with the C (confirmation) decision generated by an affinity-based algorithm. In fact, multiple primary and confirmation orthogonal algorithms can be used.
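A minimal sketch of Decision Making Algorithm 2, where rules are re-extracted on demand over only the feature columns that are known for the incoming object. The scikit-learn decision tree used as the rule extractor, the pandas data handling, and the toy table are stand-ins assumed for the example, not the entropy-based algorithm or the data used in the paper.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Assumed toy training data (eight objects, not the values of Fig. 6).
    train = pd.DataFrame({
        "F1": [1, 0, 2, 1, 0, 2, 1, 0],
        "F2": [0, 1, 1, 0, 2, 2, 1, 0],
        "F3": [3, 3, 1, 2, 1, 2, 3, 1],
        "D":  [1, 2, 1, 2, 1, 1, 2, 2],
    })

    def decide(known_values, training=train, max_rounds=2):
        """Decision Making Algorithm 2: mine rules only over the known features."""
        for _ in range(max_rounds):
            feats = [f for f in known_values if f in training.columns]   # Step 1
            model = DecisionTreeClassifier(max_depth=2)                  # Step 2 (stand-in learner)
            model.fit(training[feats], training["D"])
            x = pd.DataFrame([{f: known_values[f] for f in feats}])
            pred = model.predict(x)[0]                                   # Steps 3-4
            if model.predict_proba(x).max() == 1.0:                      # unambiguous match: stop
                return int(pred)
            # Step 5: in a real system, additional feature values would be obtained here.
        return int(pred)                                                 # Step 6: most likely solution

    print(decide({"F1": 1, "F3": 3}))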

Fig. 6. Data set with seven features F1-F7 and the decision D.

Fig. 7. Rule set obtained from the data set in Fig. 6.

In the next sections, the two types of data set decomposition will be discussed using the data of Example 2.

Example 2: Consider the data set in Fig. 6 with seven features, F1-F7, and the decision D for eight objects.

V. FEATURE SET DECOMPOSITION

In this section, four modes of feature decomposition are discussed.
1) Content-based decomposition: the feature set is decomposed into mutually exclusive or partially overlapping subsets, with the same decision D used for each subset. The feature origin, availability, and other criteria could drive the content of each feature subset.
2) Intermediate-decision decomposition: in some applications feature values are generated over time. In addition, the downstream features may depend on the upstream features.
3) Feature type decomposition: some of the existing rule extraction algorithms are intended for specific types of features, e.g., discrete-value features.
4) Feature relevance decomposition: features may show various degrees of relevance to the outcome, measured with statistical metrics (e.g., correlation) and context relationships, which are more tacit and difficult to measure (e.g., the impact of outside temperature on computer energy consumption).

A. Content-Based Decomposition

An entropy-based rule extraction algorithm was used to extract rules from the data set in Fig. 6, i.e., eight objects, seven features F1-F7, and the decision D. The rule set is shown in Fig. 7. For the definition of entropy and a discussion of entropy algorithms the reader is referred to [3] and [30]. Stefanowski [28] presented a comprehensive survey of rule extraction algorithms and their performance. The support for the decision rules varies between 1 and 3. For example, rule 1 applies to objects 2, 5, and 8, while rule 2 is supported by object 6.

Using the same entropy algorithm, the two rule sets in Figs. 8 and 9 were generated for the decision D. The first rule set uses the features F1-F3, while the second uses the features F5-F7. Each of the data sets used to extract the rules in Figs. 8 and 9 is a subset of the original data set in Fig. 6. Rather than using the same decision D for each of the training data sets, an alternative approach is proposed next.
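Before moving on to that alternative, the sketch below illustrates the content-based mode just described: the feature set is split into (possibly overlapping) subsets and a rule learner is run on each subset with the same decision D. The scikit-learn tree standing in for the paper's entropy-based rule extractor and the random data are assumptions made for the example.

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(1)
    data = pd.DataFrame(rng.integers(0, 3, size=(8, 7)),
                        columns=[f"F{i}" for i in range(1, 8)])
    data["D"] = rng.integers(1, 3, size=8)        # decision with values 1 and 2

    # Content-based decomposition: partially overlapping feature subsets, same decision D.
    feature_subsets = [["F1", "F2", "F3"], ["F3", "F4", "F5"], ["F5", "F6", "F7"]]

    for subset in feature_subsets:
        learner = DecisionTreeClassifier(max_depth=2)   # stand-in rule extractor
        learner.fit(data[subset], data["D"])
        print(f"Rules extracted from features {subset}:")
        print(export_text(learner, feature_names=subset))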

Fig. 8. Rule set for features F1-F3.

Fig. 9. Rule set for features F5-F7.

B. Intermediate-Decision Decomposition

One of the features of the original data set could be treated as a decision for the first data subset and as a feature for the next. For a set with a large number of features, a cascade of numerous subsets could be created. The cascade concept is illustrated with the rules in Figs. 10 and 11. The cascade decomposition concept may find uses in applications where intermediate decisions are made. The intermediate decision may be indicative of a decision path followed or be a milestone of a decision process in which the feature values are generated over a period of time. The cascade approach to decision making parallels the incremental learning approach used in knowledge extraction [17].

To illustrate the application of the two cascade rule sets in Figs. 10 and 11, consider the following decision making scenario. A decision is to be made for an object with known values of two features; the values of additional features are to be generated from a test to be performed two days later. The type of the test to be ordered depends on the outcome of an intermediate decision. As the two known feature values match rule 3 of Fig. 10, the intermediate decision value Medium is generated. For this value of the intermediate decision, the test producing one particular feature is most likely needed. To generate the necessary decision rule, the training set with that feature and the outcome D needs to be considered by the rule extraction algorithm. The feature value produced by the test invokes decision rule 4 of Fig. 11, thus producing the final decision. Note that the vector of known feature values together with the intermediate decision value Medium does not match any of the rules in Fig. 7 derived from the full data set. The rule set in Fig. 11 was generated with the entropy algorithm used throughout this paper.

The concept of intermediate decisions calls for new rule extraction algorithms that would include a specified set of features in all decision rules. In the case discussed above, the specified set contained the intermediate-decision feature and the feature produced by the test. The above considerations illustrate the power of decomposition in mining large data sets and in decision making. A minimal code sketch of the cascade idea follows.
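In this sketch, under assumed synthetic data, a first-stage learner predicts an intermediate feature from the upstream features, and a second-stage learner uses that prediction together with downstream features to produce the final decision. The scikit-learn trees and the column definitions are stand-ins, not the paper's entropy algorithm or the Fig. 6 data.

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(2)
    df = pd.DataFrame(rng.integers(0, 3, size=(40, 6)),
                      columns=["F1", "F2", "F3", "F4", "F5", "F6"])
    df["F4"] = (df["F1"] + df["F2"]) % 3          # intermediate feature depends on upstream ones
    df["D"] = ((df["F4"] + df["F6"]) > 2).astype(int)

    # Stage 1: treat the intermediate feature F4 as the decision for the upstream subset.
    stage1 = DecisionTreeClassifier(max_depth=3).fit(df[["F1", "F2", "F3"]], df["F4"])

    # Stage 2: use F4 (now as a feature) together with downstream features to predict D.
    stage2 = DecisionTreeClassifier(max_depth=3).fit(df[["F4", "F5", "F6"]], df["D"])

    def cascade_decision(f1, f2, f3, f5, f6):
        """Predict the intermediate value first, then the final decision."""
        f4_hat = stage1.predict(pd.DataFrame([[f1, f2, f3]], columns=["F1", "F2", "F3"]))[0]
        d_hat = stage2.predict(pd.DataFrame([[f4_hat, f5, f6]], columns=["F4", "F5", "F6"]))[0]
        return int(f4_hat), int(d_hat)

    print(cascade_decision(1, 2, 0, 1, 2))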

C. Feature Type Decomposition

In this decomposition mode, features are grouped according to the type that is handled by a particular rule extraction algorithm. The rule set in Fig. 12 is derived from the subset of the data set in Fig. 6 containing the categorical features only, while the rule set in Fig. 13 is extracted from the features with numerical values.

D. Feature Relevance Decomposition

The feature relevance issue is more general than the items previously discussed. The data used for mining is often dictated by its availability or is generated by another application. For example, in industrial applications the data mining sets are often files with statistical process control or design of experiments data. In medical applications, the data may originate from an unrelated clinical study. A question arises whether all the features included in these data sets have any relevance to the partial or final outcomes considered in decomposition. For example, does it make sense to group any two arbitrary features from the data set in Fig. 6? This feature relevance issue becomes especially important in "wide" (many-feature) data sets. A measure of feature relevance would allow the elimination of irrelevant combinations of features, thus reducing the computational effort.

VI. OBJECT SET DECOMPOSITION

A data set that contains a large number of features is referred to as wide, and a data set that includes a large number of objects is referred to as deep. Application type and the cost of obtaining the data are the main factors impacting the width and depth of data sets. The feature set decomposition discussed in the previous section was intended primarily for wide data sets. In this section, decomposition aimed at deep data sets is presented. The following three modes of object set decomposition are considered.
1) Object content decomposition: objects are grouped according to time interval, origin, applicability, and so on.
2) Decision value decomposition: the set of objects is split into subsets according to the decision value.
3) Feature value decomposition: the objects (and possibly features) are partitioned into subsets based on the values of selected features.

A. Object Content Decomposition

The objects from Fig. 6 are arbitrarily decomposed into two groups, {1-4} and {5-8}. The rule sets extracted for each of the two subsets are shown in Figs. 14 and 15.

B. Feature Value Decomposition

In some cases, removal from the training data set of objects (and possibly features) with values outside of the decision set is warranted. The objects in Fig. 6 are separated into two subsets, {1, 4} and {2, 3, 5, 6, 7, 8}, based on the values of selected features. The decision rules extracted from the latter data set are shown in Fig. 16. A short code sketch of the object content and feature value splits follows.
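A minimal sketch of the two object set splits just described, using assumed pandas data in place of the Fig. 6 table; the rule extractor is again a scikit-learn tree standing in for the paper's entropy-based algorithm.

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(3)
    data = pd.DataFrame(rng.integers(0, 3, size=(8, 7)),
                        columns=[f"F{i}" for i in range(1, 8)])
    data["D"] = rng.integers(1, 3, size=8)
    features = [f"F{i}" for i in range(1, 8)]

    def extract_rules(subset, label):
        tree = DecisionTreeClassifier(max_depth=2).fit(subset[features], subset["D"])
        print(label)
        print(export_text(tree, feature_names=features))

    # Object content decomposition: arbitrary row groups (here the first and last four objects).
    extract_rules(data.iloc[:4], "Rules from objects 1-4:")
    extract_rules(data.iloc[4:], "Rules from objects 5-8:")

    # Feature value decomposition: keep only the objects whose selected feature value
    # falls inside an assumed range of interest.
    extract_rules(data[data["F2"] > 0], "Rules from objects with F2 > 0:")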

C. Decision Value Decomposition

Considering the objects corresponding to the decision D = 1 leads to objects 2, 5, 6, and 8 of the data set in Fig. 6 and the rules in Fig. 17. Considering the objects corresponding to the decision D = 2 results in the set of objects 1, 3, 4, and 7 of the data set in Fig. 6 and the rules in Fig. 18.


Fig. 10. Rule set with features F1-F3 and the intermediate decision D = F4.

Fig. 11. Rule set with features F4-F6 and the decision D = [1, 2].

Fig. 12. Rule set with categorical features.

Fig. 13. Rule set with numerical features.

Fig. 14. Rule set with objects 1-4 in Fig. 6 and the decision D.

Fig. 15. Rule set for objects 5-8 in Fig. 6 and the decision D.

Fig. 16. Rule set for objects 2, 3, 5, 6, 7, and 8 in Fig. 6 and the decision D.

Fig. 17. Rule set for the objects 2, 5, 6, 8 corresponding to the decision D = 1.

Fig. 18. Rule set for the objects 1, 3, 4, 7 corresponding to the decision D = 2.

Fig. 19. Wafer disks: (a) roll-off = 0, (b) roll-off > 0, and (c) roll-off > 0 and integrated circuits marked for cutting.

The decision value decomposition concept may be applied in cases where the number of examples supporting one decision outcome (e.g., the positive outcome) is large and the number of negative outcomes is low. Mining the larger data set produces rules with significant support, measured by the number of objects represented by the corresponding rules. This set of decision rules could be used to define the conditions for the positive outcomes. In other words, objects matching the decision rules would be assigned positive outcomes, while the remaining objects would be assigned negative outcomes. A considerable time might be needed to collect a sufficient number of negative examples; during the time the data is being collected, the rules derived from the positive examples could support the decision making process. A brief code sketch of this one-sided use of rules follows.
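A brief sketch, under assumed rule and object representations, of the one-sided scheme described above: rules mined from the plentiful positive examples define the positive class, and any object that matches none of them is assigned the negative outcome.

    # Rules mined from the positive examples only (assumed, illustrative conditions).
    positive_rules = [
        {"F1": "high", "F3": "A"},
        {"F2": "low"},
    ]

    def classify(obj):
        """Positive if any positive rule matches; negative otherwise."""
        for rule in positive_rules:
            if all(obj.get(feature) == value for feature, value in rule.items()):
                return "positive"
        return "negative"

    print(classify({"F1": "high", "F2": "mid", "F3": "A"}))   # positive
    print(classify({"F1": "low",  "F2": "mid", "F3": "B"}))   # negative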

Fig. 20. Data mining controller.

VII. INDUSTRIAL CASE STUDY

One of the processing steps in the semiconductor industry involves polishing wafers. The problem considered here is that of improving the quality of wafers and eliminating waste at the downstream production stages (e.g., at companies such as Intel and Motorola), where the actual integrated circuits are produced. This important production problem, described below, could not be solved with traditional approaches such as quality control, design of experiments, and operations research.

Fig. 21. Illustrative rules.

Fig. 22. Rules for the modified data set.

Before a wafer is transformed into an integrated circuit, it undergoes an elaborate manufacturing process, with polishing being one of the final steps. For unknown reasons, in some wafers a portion of the surface around the wafer's edge becomes round rather than flat, resulting in the so-called roll-off. The roll-off surface is undesirable because it reduces the area available for the creation of chips. The roll-off (see Fig. 19) results in waste of material that has undergone a lengthy and expensive manufacturing process. The existing statistical and optimization approaches applied to this production problem have not produced satisfactory results.

The ultimate goal of the research discussed in this paper is to develop the data-mining controller shown in Fig. 20. Using the features of fabricated wafers (e.g., material properties, X-ray information), the data-mining controller will determine the values of control parameters such as temperature, pressure, type of polishing pads, etc. The preliminary analysis of the data provided by an industrial company indicates that the production problem can be solved with the proposed data mining approach. The nature of the problem is illustrated with the features of the preliminary data set that was collected at the company for quality control purposes. The models and algorithms to be developed in this research will be used to determine the set of features needed for high accuracy decisions.

A. Data Set

The data set includes 40 000 observations, each containing 15 features (material, process, and operations characteristics) denoted F1-F15, and two output features related to the quality of the wafers.
F1: Year.
F2: Sequential number.
F3: Date/Time.
F4: Lot number.
F5: Wafer specification.
F6: Operator name.
F7: Rough pad lot number.
F8: Intermediate pad lot number.
F9: Polisher identification number.
F10: Rough pad life time.
F11: Intermediate pad life time.
F12: Vacuum used on rough turn-table.
F13: Vacuum used on intermediate turn-table.
F14: Average removal.
F15: Sigma removal.
The two output features are Average roll-off and Sigma roll-off.

B. Computational Results

The preliminary data set had been collected for the purpose of statistical process control rather than data mining. The data file was created at different pieces of processing equipment, on different days and shifts, and for different operators. However, due to the effort already invested in the data collection, new requirements for more systematic data collection and organization could not be considered. The fact that numerous prior design of experiments and statistical process control efforts did not identify specific causes of the quality problem was a hindrance. As processing the entire data file was not possible, the object decomposition approach discussed in this paper was used to create a training set of 1000 examples representing the entire database. The data set was decomposed based on a threshold measure, and the training set was randomly selected from among the partitions (a sketch of this sampling step is given below).
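A rough sketch of how such a training set could be assembled, assuming pandas data with an average roll-off column; the threshold cut points and the equal-per-partition sampling are assumptions made for illustration, since the paper does not specify the exact threshold measure.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(4)
    # Stand-in for the wafer data: 40 000 observations with an average roll-off column.
    wafers = pd.DataFrame({
        "Wafer_specification": rng.integers(0, 4, size=40_000),
        "Rough_pad_life_time": rng.uniform(0, 100, size=40_000),
        "Avg_roll_off": rng.exponential(0.5, size=40_000),
    })

    # Object set decomposition by a threshold measure on the outcome (assumed cut points).
    bins = [-np.inf, 0.2, 1.0, np.inf]
    wafers["partition"] = pd.cut(wafers["Avg_roll_off"], bins=bins, labels=["low", "mid", "high"])

    # Randomly select a training set of about 1000 examples spread across the partitions.
    per_part = 1000 // wafers["partition"].nunique()
    training = (wafers.groupby("partition", observed=True)
                      .sample(n=per_part, random_state=0)
                      .reset_index(drop=True))
    print(training["partition"].value_counts())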


Different experiments have been performed to link the process, material, and operational factors with specific quality levels of wafers. Some of these experiments and their results are discussed next. In one of the experiments, rules were extracted from data sets with high and low (four-level) degrees of granularity of the feature Wafer_specification. The four rules in Fig. 21 illustrate the nature of the relationships between the control parameters and the process outcome (Roll_off), reflecting the product quality, for the first experiment. The desired value of Roll_off is zero; the greater the deviation from zero, the poorer the quality of the wafer. Decreasing the granularity of the feature Wafer_specification did not change the nature of the decision rules, as illustrated in Fig. 22. There might be two possible reasons behind the insensitivity of the above rules.
1) The detailed symbols associated with the feature Wafer_specification do not correlate well with the chemical and other properties of the wafer that contribute to the unacceptable levels of Roll_off.
2) Features other than Wafer_specification may result in excessive values of Roll_off.
The preliminary results are encouraging. First, the learning phase has produced only exact rules, which implies that there exists a definite link between the process, material, and operational factors and the wafer quality. Second, the prediction algorithms discussed in [29] generate accurate decisions. It appears that the goal of building the data mining controller of Fig. 20 will be accomplished in the near future.

VIII. CONCLUSION

The computational complexity and robustness of knowledge extraction from large data sets and of decision making can be enhanced by decomposition. Data decomposition and structuring for effective decision making were emphasized in this paper. The proposed ideas were illustrated with examples and an industrial case study. A data mining approach was used to solve a quality engineering problem in the semiconductor industry. The proposed approach linked material, process, and operations parameters with the level of product quality. These relationships, captured in the form of decision rules, were used to improve the quality of products. The results derived did not require any experimentation with the manufacturing process; rather, actual material, process, and operations data were collected. The latter is an asset of data mining, especially in situations where experimentation is costly or not feasible.

REFERENCES

[1] M. J. A. Berry and G. Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support. New York: Wiley, 1997.
[2] R. Groth, Data Mining: A Hands-On Approach for Business Professionals. Upper Saddle River, NJ: Prentice Hall, 1998.
[3] T. Mitchell, Machine Learning. New York: McGraw-Hill, 1997.
[4] J. R. Quinlan, "Induction of decision trees," Mach. Learn., vol. 1, no. 1, pp. 81-106, 1986.
[5] R. S. Michalski, I. Mozetic, J. Hong, and N. Lavrac, "The multi-purpose incremental learning system AQ15 and its testing application to three medical domains," in Proceedings of the 5th National Conference on Artificial Intelligence. Palo Alto, CA: AAAI, 1986, pp. 1041-1045.
[6] P. Domingos and M. Pazzani, "Beyond independence: Conditions for the optimality of the simple Bayesian classifier," in Machine Learning: Proceedings of the Thirteenth International Conference. Los Altos, CA: Morgan Kaufmann, 1996, pp. 105-112.
[7] R. Kohavi, "Wrappers for performance enhancement and oblivious decision graphs," Ph.D. dissertation, Comput. Sci. Dept., Stanford Univ., Stanford, CA, 1995.
[8] J. Friedman, Y. Yun, and R. Kohavi, "Lazy decision trees," in Proceedings of the Thirteenth National Conference on Artificial Intelligence. Boston, MA: AAAI Press and MIT, 1996.
[9] J. R. Quinlan, C4.5: Programs for Machine Learning. Los Altos, CA: Morgan Kaufmann, 1993.
[10] P. Clark and R. Boswell, "The CN2 induction algorithm," Machine Learning, vol. 3, no. 4, pp. 261-283, 1989.
[11] D. W. Aha, "Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms," Int. J. Man-Mach. Stud., vol. 36, no. 2, pp. 267-287, 1992.
[12] S. K. Murthy and S. Salzberg, "A system for the induction of oblique decision trees," J. Artif. Intell. Res., vol. 2, no. 1, pp. 1-33, 1994.
[13] P. Auer, R. Holte, and W. Maass, "Theory and application of agnostic PAC-learning with small decision trees," in ECML-95: Proceedings of the 8th European Conference on Machine Learning, A. Prieditis and S. Russell, Eds. New York: Springer-Verlag, 1995.
[14] J. W. Grzymala-Busse, "A new version of the rule induction system LERS," Fund. Inform., vol. 31, pp. 27-39, 1997.
[15] T. Y. Lin and N. Cercone, Eds., Rough Sets and Data Mining. Boston, MA: Kluwer, 2000.
[16] J. G. Carbonell, Ed., Machine Learning: Paradigms and Methods. Cambridge, MA: MIT Press, 1990.
[17] R. S. Michalski, I. Bratko, and M. Kubat, Machine Learning and Data Mining: Methods and Applications. New York: Wiley, 1998.
[18] P. Langley and H. A. Simon, "Applications of machine learning and rule induction," Commun. ACM, vol. 38, no. 11, pp. 55-64, 1995.
[19] A. Kusiak, Computational Intelligence in Design and Manufacturing. New York: Wiley, 2000.
[20] Z. Pawlak, "Rough sets," Int. J. Inform. Comput. Sci., vol. 11, no. 5, pp. 341-356, 1982.
[21] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data. Boston, MA: Kluwer, 1991.
[22] M. J. Zaki and C.-T. Ho, Eds., Large-Scale Parallel Data Mining. New York: Springer-Verlag, 2000.
[23] R. Grossman, S. Kasif, R. Moore, D. Rocke, and J. Ullman, "Data mining research: Opportunities and challenges. Report of three NSF workshops on mining large, massive, and distributed data," Tech. Rep., http://www.ncdm.uic.edu/M3D-final-report.htm, 1999.
[24] Y. Guo and J. Sutiwaraphun, "Knowledge probing in distributed data mining," in Proc. 4th Int. Conf. Knowledge Discovery Data Mining, 1998, http://www.eecs.wsu.edu/~hillol/kdd98ws.html.
[25] M. J. Zaki, C.-T. Ho, and R. Agrawal, "Scalable parallel classification for data mining on shared-memory multiprocessors," in Proc. IEEE Int. Conf. Data Eng., Sydney, Australia, 1999, pp. 198-205, http://www.cs.rpi.edu/~zaki/papers.html#WKDD99.
[26] T. G. Dietterich, "Machine learning research: Four current directions," AI Mag., vol. 18, no. 4, pp. 97-136, 1997.
[27] S. Stolfo, A. Prodromidis, S. Tselepis, W. Lee, W. Fan, and P. Chan, "JAM: Java agents for meta-learning over distributed databases," in Proc. 4th Int. Conf. Knowledge Discovery Data Mining, 1997, pp. 74-81.
[28] J. Stefanowski, "On rough set based approaches to induction of decision rules," in Rough Sets in Knowledge Discovery, L. Polkowski and A. Skowron, Eds. Heidelberg, Germany: Physica-Verlag, 1998, pp. 501-529.
[29] A. Kusiak, J. A. Kern, K. H. Kernstine, and T. L. Tseng, "Autonomous decision-making: A data mining approach," IEEE Trans. Inform. Technol. Biomed., to be published.
[30] P. Clark and R. Boswell, "Rule induction with CN2: Some recent improvements," in Proceedings of the Fifth European Working Session on Learning, EWSL-91, Y. Kodratoff, Ed. Berlin, Germany: Springer-Verlag, 1991, pp. 151-163.


Andrew Kusiak (M'90) is a Professor of Industrial Engineering at the University of Iowa, Iowa City. He is interested in applications of computational intelligence and optimization in product development, manufacturing, medical informatics, and medical technology. He has published research papers in journals sponsored by AAAI, ASME, IEEE, IIE, INFORMS, ESOR, IFIP, IFAC, IPE, ISPE, and SME. He speaks frequently at international meetings, conducts professional seminars, and consults for industrial corporations. He serves on the editorial boards of 18 journals, edits book series, and is the Editor-in-Chief of the Journal of Intelligent Manufacturing.
