Component-based Data Mining Frameworks

Fernando Berzal, Ignacio Blanco, Juan-Carlos Cubero, and Nicolas Marin

OLAP vs. OLTP in the middle tier.

Both researchers and practitioners accept that decision support systems (DSSs) have specific needs that cannot be properly addressed by conventional information systems. Online Transaction Processing (OLTP) systems work with relatively small chunks of information at a time, while DSS applications require the analysis of huge amounts of data. Online Analytical Processing (OLAP) [1], data mining [3], and data warehouses [7] emerged during the last decade to fulfill the expectations of executives, managers, and analysts (also known as knowledge workers).

We have also witnessed the flourishing of component-based frameworks during the last few years (see Communications’ special section on object-oriented application frameworks, Oct. 1997 and Communications’ special section on component-based enterprise frameworks, Oct. 2000). These frameworks are intended to help developers build increasingly complex systems, enhancing productivity and promoting component reuse in well-defined patterns. Nowadays those systems are widely used in enterprises throughout the world, but they usually provide only low-level information processing capabilities, since they are OLTP-application-oriented. Here, we propose the development of custom-tailored component-based frameworks to solve DSS problems, although this approach can be extended to a wide range of scientific applications.

A Component-based Data Mining Framework

An open framework for the development of data mining algorithms and DSSs should include capabilities to analyze huge datasets, cluster data, build classification models, and extract associations and patterns from input data.

The conceptual model for such a system is shown in Figure 1. The data miner (the user of the system) has to analyze large datasets and he or she needs to make use of data mining tools to perform a task. Data is gathered and data mining algorithms are used in order to build knowledge models that summarize the input data. Those models may provide the information our user needs, or they may just suggest new ways to explore the available data. Moreover, those knowledge models could be used as input to other mining algorithms in order to solve second-order data mining problems. Both knowledge models and dataset metadata might be stored for later use in the back-end database (for instance, the Object Pool).

Component-based frameworks such as Enterprise JavaBeans (see java.sun.com/products/j2ee) and Microsoft .NET (see www.microsoft.com/com/net) are based on a common architectural pattern, a.k.a. the Enterprise Component Framework [5]. A simplified representation of this pattern is shown in Figure 2. This pattern, modeled as a parameterized collaboration in UML [6], contains six classifier roles, which are depicted as rectangles:

• Client. Any entity that requests a service from a component in the framework. These requests could be performed using special-purpose data mining query languages, for example, OLE DB for Data Mining (see www.microsoft.com/data/oledb). Instead of calling the component directly, the client internally uses a pair of proxies that relay calls from the client to the component. This level of indirection is, however, hidden

COMMUNICATIONS OF THE ACM

December 2002/Vol. 45, No. 12

97

Technical Opinion

from the client perspective; it makes location transparency possible and, when needed, it supports message interception.
• Factory proxy. Performs object factory operations common to all framework components (such as create or find) and facilitates class methods.
• Remote proxy. Handles operations specific to each kind of component (such as inspection, parameter setting, and so forth) and facilitates instance methods.
• Component. Both datasets and knowledge models are components in a data mining framework. Data mining algorithms could also be considered components in their own right, but they are only used through factory proxies to build knowledge models.
• Container. Represents the framework’s runtime environment, and holds the components and both proxy roles. This containment is shown by aggregation links in Figure 2. The container supports distributed computing services, such as security, interprocess communication, persistence, and hot deployment. Transactions are also supported by enterprise frameworks (such as EJB/.NET frameworks) but are not needed in a data mining environment. Such an environment, however, needs scheduling, monitoring, and notification mechanisms to manage data mining tasks. Commercial application servers may be suitable for e-business applications, but they lack the stricter control of computing resources that data mining systems require. This shortage is also applicable to a wide range of scientific applications by extension. Custom-tailored frameworks should include capabilities such as CPU/memory/database usage monitoring, resource discovery, reconfigurable load balancing, and a higher degree of freedom for programmers to implement distributed algorithms.
• Persistence service. Permits the storage and retrieval of framework components. This persistence service can be managed and coordinated by the container. Dataset metadata, discovered knowledge models, and user session information are all candidates to be saved for future use in the back-end database (the Object Pool). This back-end database is also useful to provide a reliable computing environment (preserving the system state against power outages, for example).

Figure 1. A conceptual model for DSSs. [Diagram labels: User; Data Mining System; Graphical User Interface; Datasets; Knowledge Models; Classifiers; Data Mining Algorithm; Persistence Service; JDBC Wrapper; ODBC Wrapper; ASCII Wrapper; SQL Server DBMS; Oracle DBMS; ASCII Files; Back-end Database (Object Pool).]

Design Principles

We believe every component-based data mining framework should focus its design efforts on two major objectives:

• Transparency. Both for users and programmers. Users do not need to be aware of the underlying complexity of the system, while programmers should be able to create new framework components just by implementing a core set of well-defined interfaces.
• Usability [4]. Like any other software system, data mining systems are used by people, and system usability is critical for user acceptance, given that knowledge workers are not usually knowledgeable about computers. Users should be able to store anything they can reuse in future sessions, or even share the information they obtain. This groupware focus is especially important in data mining applications, where the discovered knowledge must be properly represented and communicated.

Component-based Dataset Modeling

Let us consider, for example, the case of assembling the datasets


that are used as input to build knowledge models. These datasets may come from heterogeneous information sources.

Data mining tools usually work with tables in the relational sense. Each table contains a set of fixed-width tuples that can be obtained either from relational databases or from any other information source (ASCII or XML files, for example). All tabular datasets have a set of columns (also called attributes). Each one of them has a unique identifier and an associated data type (strings, numbers, dates, and so forth). A flexible tool should allow the specification of order relationships among attribute values and the grouping of attribute values to define concept hierarchies.

Figure 2. A simplified representation of the Enterprise Component Framework. [Diagram labels: Client; Factory proxy; Remote proxy; Component; Container; Persistence Service; Object Pool.]

A data mining system should also be capable of performing heterogeneous queries over different databases and information sources. The independently retrieved datasets, in fact, might be processed further in order to join them with other datasets (data integration), to standardize concept representations and eliminate redundancies (data cleaning), to compute aggregations (data summarizing), or just to discard part of them (data filtering).

All the aforementioned operations involving datasets can be performed using powerful formal models and query languages. However, typical users are not prepared to use such models and languages to define the customized datasets they need. They would probably reject a system that requires them to learn any complex formalism. In order to improve system acceptance, we propose a bottom-up approach. A family of dataset-building components should provide users with all the primitives they need to build their own datasets from the available data sources:

• Wrappers are responsible for providing uniform access to different data sources. Data stored as sets of tables in relational databases can be retrieved by performing standard SQL queries through call-level interfaces such as JDBC and ODBC. Data stored in other formats (locally as ASCII/XML files or remotely in DSTP servers [see www.ncdm.iuc.edu/dstp], for example) require specific wrappers. In fact, any accessible information source requires its own suitable wrapper.
• Joiners are used to join multiple datasets. They allow the user to combine information coming from different sources. Joiners are also useful to include lookup fields in a given dataset (as in data warehouse star schemas) and to specify relationships between two datasets from the same source (for example, master/detail relationships).
• Aggregators summarize datasets in order to provide a higher-level view of the available data. Aggregations are useful in a wide range of OLAP applications, where trends are much more interesting than particular details. Common aggregation functions include MAX, MIN, TOP, BOTTOM, COUNT, SUM, and AVG.
• Filters perform a selection over the input dataset to obtain subsets of the original input dataset. In data mining applications, filters can be used to perform samplings, to build training datasets (such as when using cross-validation in classification problems), or just to select the data we are interested in for further processing.
• Transformers are also needed to modify dataset columns. Encoders are used to encode input data, for example, providing a uniform encoding schema to deal with data coming from different sources. Encoders are useful to ensure that real-world entities are always represented in the same


way, even when represented differently in different data sources. In fact, the same entity could even have several representations in a given data source. This kind of component is useful for data cleaning and integration. Extenders are used to append new fields to a given dataset. Those fields, known as calculated fields, are useful for managing dates and converting measurement units. The value of a calculated field is completely determined by the other field values in the same tuple. A calculated field could be specified using a simple arithmetic expression (including operators such as +, -, *, and /) or even more complicated algorithms (involving if statements and table lookups, for example).

These components can be combined easily in tree-like structures to build highly personalized datasets. These datasets are amenable to standard query optimization techniques, which improves system performance. Using our approach, even computer-illiterate users are able to use complex data mining systems by linking dataset modeling components.

Commercial enterprise application servers (the containers in the framework pattern) are currently restricted to OLTP applications, and we believe it is time for system architects to focus on higher-level information processing capabilities in order to take advantage of the vast computing


resources available with current corporate intranets. Although we have proposed a component-based framework model for building a data mining system here, our approach is extensible to any CPU-intensive computing application.
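
The dataset-building components described under Component-based Dataset Modeling can be illustrated with a short sketch. The Python code below is only one possible reading of that family of components: the class names (CsvWrapper, Filter, Joiner, Extender, Aggregator), the dictionary-per-tuple representation, and the sample sales data are all assumptions made for this example and do not come from any existing framework. It shows components linked in a tree-like structure, where each node pulls rows from its children.

```python
"""Hypothetical dataset-building components; every name here is illustrative."""
import csv
import io
from typing import Any, Callable, Dict, Iterator

Row = Dict[str, Any]


class Dataset:
    """Common interface: every component yields its tuples as dictionaries."""

    def rows(self) -> Iterator[Row]:
        raise NotImplementedError


class CsvWrapper(Dataset):
    """Wrapper: uniform access to delimited ASCII text with a header row."""

    def __init__(self, text: str) -> None:
        self.text = text

    def rows(self) -> Iterator[Row]:
        yield from csv.DictReader(io.StringIO(self.text))


class Filter(Dataset):
    """Filter: keeps only the rows that satisfy a predicate."""

    def __init__(self, source: Dataset, predicate: Callable[[Row], bool]) -> None:
        self.source, self.predicate = source, predicate

    def rows(self) -> Iterator[Row]:
        return (r for r in self.source.rows() if self.predicate(r))


class Joiner(Dataset):
    """Joiner: adds lookup fields from a second dataset through a shared key."""

    def __init__(self, left: Dataset, right: Dataset, key: str) -> None:
        self.left, self.right, self.key = left, right, key

    def rows(self) -> Iterator[Row]:
        lookup = {r[self.key]: r for r in self.right.rows()}
        for r in self.left.rows():
            if r[self.key] in lookup:  # inner join: unmatched rows are dropped
                yield {**lookup[r[self.key]], **r}


class Extender(Dataset):
    """Transformer: appends a calculated field derived from the same tuple."""

    def __init__(self, source: Dataset, name: str, expression: Callable[[Row], Any]) -> None:
        self.source, self.name, self.expression = source, name, expression

    def rows(self) -> Iterator[Row]:
        for r in self.source.rows():
            yield {**r, self.name: self.expression(r)}


class Aggregator(Dataset):
    """Aggregator: groups rows by one column and applies SUM to another."""

    def __init__(self, source: Dataset, group_by: str, column: str) -> None:
        self.source, self.group_by, self.column = source, group_by, column

    def rows(self) -> Iterator[Row]:
        totals: Dict[Any, float] = {}
        for r in self.source.rows():
            group = r[self.group_by]
            totals[group] = totals.get(group, 0.0) + float(r[self.column])
        for group, total in totals.items():
            yield {self.group_by: group, "sum_" + self.column: total}


# Made-up sample data standing in for two heterogeneous sources.
SALES = "product,units,unit_price\nA,2,10.0\nB,1,5.0\nA,3,10.0\nC,0,7.5\n"
PRODUCTS = "product,category\nA,tools\nB,toys\nC,toys\n"

# The tree: wrappers at the leaves, an aggregator at the root.
sold = Filter(CsvWrapper(SALES), lambda r: int(r["units"]) > 0)
joined = Joiner(sold, CsvWrapper(PRODUCTS), key="product")
priced = Extender(joined, "amount", lambda r: int(r["units"]) * float(r["unit_price"]))
tree = Aggregator(priced, group_by="category", column="amount")

result = list(tree.rows())
print(result)
# [{'category': 'tools', 'sum_amount': 50.0}, {'category': 'toys', 'sum_amount': 5.0}]
```

Because every component exposes the same rows() interface, a new wrapper or transformer can be added without touching the others, which is the kind of transparency for programmers that the design principles call for. Rows are produced lazily in this sketch; a real system would presumably push filters and joins down to the data source where possible.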
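
The six classifier roles of the Enterprise Component Framework pattern can also be sketched in a few lines. The Python code below is a deliberately simplified, single-process illustration: all class and method names are hypothetical, nothing here reproduces the EJB or .NET APIs, and the Object Pool is replaced by an in-memory dictionary. It shows a client that never touches a component directly, a factory proxy for class-level operations (create, find), a remote proxy for instance-level operations, and a container that can intercept every relayed call.

```python
"""Hypothetical sketch of the six roles; none of these names come from EJB or .NET."""
from typing import Any, Dict, List, Optional, Tuple


class KnowledgeModel:
    """Component: a knowledge model with tunable parameters."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.parameters: Dict[str, Any] = {}

    def set_parameter(self, key: str, value: Any) -> None:
        self.parameters[key] = value


class PersistenceService:
    """Persistence service: an in-memory stand-in for the Object Pool."""

    def __init__(self) -> None:
        self.pool: Dict[str, KnowledgeModel] = {}

    def store(self, component: KnowledgeModel) -> None:
        self.pool[component.name] = component

    def retrieve(self, name: str) -> Optional[KnowledgeModel]:
        return self.pool.get(name)


class Container:
    """Container: holds the components and records every relayed call."""

    def __init__(self, persistence: PersistenceService) -> None:
        self.persistence = persistence
        self.components: Dict[str, KnowledgeModel] = {}
        self.call_log: List[Tuple[str, str]] = []

    def intercept(self, component_name: str, operation: str) -> None:
        # Message interception: a hook for monitoring or notification.
        self.call_log.append((component_name, operation))


class RemoteProxy:
    """Remote proxy: relays instance-level operations to one component."""

    def __init__(self, container: Container, name: str) -> None:
        self._container, self._name = container, name

    def set_parameter(self, key: str, value: Any) -> None:
        self._container.intercept(self._name, "set_parameter")
        self._container.components[self._name].set_parameter(key, value)

    def inspect(self) -> Dict[str, Any]:
        self._container.intercept(self._name, "inspect")
        return dict(self._container.components[self._name].parameters)


class FactoryProxy:
    """Factory proxy: class-level operations such as create and find."""

    def __init__(self, container: Container) -> None:
        self._container = container

    def create(self, name: str) -> RemoteProxy:
        component = KnowledgeModel(name)
        self._container.components[name] = component
        self._container.persistence.store(component)
        return RemoteProxy(self._container, name)

    def find(self, name: str) -> RemoteProxy:
        if name not in self._container.components:
            raise KeyError(name)
        return RemoteProxy(self._container, name)


# Client: works only through the pair of proxies.
container = Container(PersistenceService())
factory = FactoryProxy(container)
model = factory.create("churn_tree")
model.set_parameter("max_depth", 4)
snapshot = factory.find("churn_tree").inspect()
print(snapshot)  # {'max_depth': 4}
```

In a distributed deployment the proxies would marshal calls across process boundaries; here they only forward them, which is enough to show where location transparency and interception fit into the pattern.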

References

1. Chaudhuri, S. and Dayal, U. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, Mar. 1997.
2. Codd, E.F., Codd, S.B., and Salley, C.T. Providing OLAP to User-Analysts: An IT Mandate. Hyperion Solutions Corporation, Sunnyvale, CA, 1998; www.hyperion.com/products/whitepapers.
3. Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, 2000.
4. Juristo, N., Windl, H., and Constantine, L. (Eds.) Usability engineering. IEEE Software 18, 1 (Jan./Feb. 2001).
5. Kobryn, C. Modeling components and frameworks with UML. Commun. ACM 43, 10 (Oct. 2000), 31–38.
6. Object Management Group. OMG Unified Modeling Language Specification, 1.4, 1st Ed. (Sept. 2001); www.omg.org/technology/uml/.
7. Widom, J. Research problems in data warehousing. In Proceedings of the 1995 International Conference on Information and Knowledge Management (CIKM'95), Nov. 29–Dec. 2, 1995, Baltimore, MD.

Fernando Berzal ([email protected]) is a researcher at the Intelligent Databases and Information Systems research group in the Department of Computer Science and Artificial Intelligence at the University of Granada, Spain.

Ignacio Blanco ([email protected]) is an assistant professor at the University of Almeria, Spain.

Juan-Carlos Cubero (jc.cubero@decsai.ugr.es) is an associate professor in the Department of Computer Science and Artificial Intelligence at the University of Granada, Spain.

Nicolas Marin ([email protected]) is an assistant professor in the Department of Computer Science and Artificial Intelligence at the University of Granada, Spain.

© 2002 ACM 0002-0782/02/1200 $5.00