A Generic Datamining System. Basic Design and Implementation Guidelines∗

Botía, J. A.† [email protected]

Garijo, M. and Velasco, J. R.‡ {mga,juanra}@gsi.dit.upm.es

Skarmeta, A. F.§ [email protected]

Abstract

The aim of this work is to study the engineering of a generic datamining system, generic in the sense that it must try to integrate as many learning algorithms as possible. At the same time, the system must be capable of generating, by means of meta-learning, a decision mechanism, and thus of choosing the most adequate algorithm for each datamining task depending on basic features of the data set, the requirements of the user and the background knowledge acquired in previous datamining sessions. Obviously, to afford the integration of such a number of learning algorithms, the ideal processing platform must be distributed, for the sake of the system's scalability. The different challenges that appear are analyzed. The first one is the engineering of a distributed system that assures the scalability needed to integrate a potentially large number of machine learning algorithms. Another important problem is the definition of a common functionality for all machine learning algorithms, to ease their integration and management. The most important task, however, is metalearning, because algorithm features, source data features, user requirements and metrics have to be formally defined. Besides, different machine learning performance metrics have to be stated and combined.

Keywords: generic datamining, meta-learning, distributed system engineering, multiagent platform, algorithm reusability.

1 Introduction

Data mining information systems, or KDD (Knowledge Discovery in Databases) [18, 5] systems, represent one of the most important challenges of this decade. Both the advances in computer science and the progressive, unstoppable reduction in the price of massive data storage devices have made it possible for many enterprises, public offices, institutions, etc. to keep great amounts of data in electronic format. All that information reflects general events, orders, customer relations, and relevant economic data in general. However, there is implicit information that is not visible in a trivial way. That is due, on the one hand, to the fact that such information may be based on complex relations between different fields of the database and, on the other hand, to the great volume of data available, which can only be managed computationally.

∗ This work was developed while the author was visiting the Technical University of Madrid.
† Departamento de Ciencias de la Computación, Universidad de Alcalá.
‡ Departamento de Ingeniería Telemática, Universidad Politécnica de Madrid.
§ Departamento de Informática, Inteligencia Artificial y Electrónica, Universidad de Murcia.


This article studies the engineering of a generic data mining system that shall integrate as many learning algorithms as possible. Using AI-based trading [?], it should be possible to decide, at each moment, the most appropriate algorithm to use, according to the characteristics of the data and the morphology of the information the user desires to extract. The different design problems that appear in this particular application are depicted in section 2. CORBA, the most popular software architecture for distributed and object-oriented processing, is considered as a possible design and implementation platform in section 3.1. Then, an alternative conceptual and implementation tool is suggested in section 3.2: a methodology along with a tool to develop multi agent systems. In section ??, the main ideas of the application design are outlined. Section 4 is dedicated to very preliminary considerations about the metalearning process in the context of this particular application. Finally, a summary of our initial conclusions is provided in section 6.

2 Application Threads

2.1 Using a Distributed Processing System

The Software Engineering and Distributed Systems fields are continuously getting closer. Computers are getting cheaper and cheaper, and bandwidth availability in open systems is growing every day. These two reasons have made Grosch's Law [4, pg. 588]1 obsolete. Taking both factors into account, ISO and ITU-T have outlined the basic set of characteristics of an open distributed system by means of the ODP (Open Distributed Processing) [1] standard. That document has been the starting point for the OMG (Object Management Group) to specify its software architecture [8] for open and distributed object processing. All these reasons motivate the building of a distributed KDD system. The KDD system must be distributed to accomplish scalability: it must be taken into account that the number of machine learning algorithms to integrate in the system is potentially unlimited, and the more algorithms the system is able to manage, the more adequate the KDD service given to final users. Besides being distributed, the KDD system must be open, to allow for cooperative data mining tasks with other KDD systems, whether on a free or on a pay-per-learning basis.

2.2 Integration and Reusability

Incorporating a new machine learning algorithm into the system not only implies taking into account all the problems that come with the compilation of external software, but also mapping a general machine learning functionality onto the particular implementation's set of function calls. All machine learning algorithms follow a common basic functionality. That is why it is possible to define a few generic calls to the subtasks of the machine learning service. These generic calls will be the common basic interface to all machine learning algorithms in the system. The person dealing with the integration task must use the function calls offered by the new implementation to redesign and reimplement the algorithm behind the common interface that the KDD system imposes. This strategy gives the system

1 Grosch's Law states, in one of its possible interpretations, that it is more interesting to use one big computer than a small one.

independence from both the concrete algorithm and its particular implementation. A similar approach has been developed in [17]; however, there the particular algorithm to be used at each data mining session has to be determined by the user. The present work goes one step further, proposing the automation of that process by means of meta-learning.
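To make this concrete, the following is a minimal Java sketch of what such a common interface might look like. The interface and type names are our own illustration, not the actual calls of the system described here; each integrated algorithm would be wrapped behind these generic operations.

    // Illustrative common interface for integrated learning algorithms.
    // Minimal marker types stand in for the real data structures.
    interface DataSource {}
    interface Model {}

    public interface LearningService {
        // Bind the algorithm to a (possibly preprocessed) data source.
        void setDataSource(DataSource source);

        // Set an algorithm-specific parameter by name.
        void setParameter(String name, String value);

        // Run the learning subtask and return the discovered knowledge.
        Model learn();
    }

Integrating a new implementation would then reduce to mapping its native function calls onto these few operations.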

2.3 Automatic Configuration of the KDD process

The integration problem, outlined in the previous section, is an off-line task. However, each KDD process configuration must be executed on-line. That is, the system must be capable of doing the data mining by automatically choosing the most suitable learning algorithm. It can now be seen that we have two-level learning. The specific-level learning is the one accomplished by the data mining task proper. The general-level learning is, however, a meta-learning: it is used to refine the process by which the most adequate algorithm is decided for each data mining session. The meta-learning process is not trivial, because many details take part in it. Some of them are depicted here, and a sketch of a corresponding algorithm descriptor follows the list:

• Available resources: the resources available on the machine that runs a particular algorithm should influence its adequacy at a given moment. Those resources could be CPU workload, massive storage capacity available, etc.

• Available bandwidth: in a non-parallelized, distributed KDD system there is an important bottleneck, namely the great amount of source data to deal with. The approach in our system is to move that data to the machine on which the chosen algorithm is running, so we consider neither parallelizing the task nor mobile code. Hence the bandwidth available along the path between the point where the data is initially located and the machine that runs the algorithm must be considered.

• Accuracy of the discovered knowledge: machine learning algorithms differ from each other in features like noise endurance and efficiency at producing new knowledge. There could be users who care more about the accuracy of the discovered knowledge than about the efficiency of discovering it. Other users may be more interested in a quick look at the data in order to make a fast decision.

• Topology of the discovered knowledge: there are many machine learning algorithms, and the topology of the newly produced knowledge depends on the concrete algorithm being used. Some algorithms produce decision trees. Others produce classifier functions. Another group discovers quantitative relations between variables. The format of the new knowledge may be a constraint given by the user.

From this point, it can be seen that each particular implementation of a given machine learning algorithm must be given a set of features characterizing its demands on resources and its quality of service.
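As a sketch under these assumptions, such a descriptor could be as simple as the following Java class; the fields and their units are hypothetical, chosen only to mirror the four factors above.

    // Hypothetical descriptor of an algorithm implementation's demands
    // on resources and its quality of service; all names are illustrative.
    public class AlgorithmFeatures {
        public enum Topology { DECISION_TREE, CLASSIFIER_FUNCTION, QUANTITATIVE_RELATIONS }

        public double cpuLoad;          // current workload of the hosting machine (0..1)
        public long   freeStorage;      // massive storage available, in bytes
        public double bandwidthToData;  // estimated throughput from the data's location
        public double expectedAccuracy; // typical accuracy on comparable data (0..1)
        public Topology topology;       // shape of the knowledge the algorithm produces
    }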

3 Development Issues

3.1 OMA as a reference architecture

OMA (Object Management Architecture), from the OMG, states at a high abstraction level all the characteristics that a software system must exhibit to accomplish distributed and object-oriented computation. Inside the general architecture, CORBA (Common Object Request Broker Architecture) [8] is the most important component, and it describes the central part of it: the ORB (Object Request Broker), which makes all services available to the system's clients. The rest of the OMA components deal with questions like common facilities for all objects, specific application interfaces, application objects, etc.

Integrating and Reusing in CORBA CORBA's software integration mechanism is based on IDL (Interface Definition Language) [9]. This is a common syntax for defining interfaces in a declarative way, which must be observed by every client importing services from the system and every server exporting services to it. The Stub is used by the client to import services. That Stub is code produced by the IDL compiler from a previous IDL declaration of the services the client wants to import. The code is produced in the language in which the client is to be developed (e.g. C, C++, Java, Smalltalk, Cobol, etc.). Server programming is analogous. First of all, the Skeleton is needed. The Skeleton is produced along with the Stub and contains empty calls to the offered services, in the language in which the server is to be programmed. Then, the programmer must code the services proper. Component integration and reusability with CORBA only needs a little reverse engineering to map the original function/procedure calls to IDL ones.
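As an illustration, assuming an IDL declaration for a hypothetical Learner service, a Java client would follow the standard Java IDL pattern sketched below. Learner and LearnerHelper are the classes the IDL compiler would generate from that assumed declaration; they are not part of any real product, and learn() is an operation of our assumed interface.

    import org.omg.CORBA.ORB;
    import org.omg.CosNaming.NameComponent;
    import org.omg.CosNaming.NamingContext;
    import org.omg.CosNaming.NamingContextHelper;

    public class LearnerClient {
        public static void main(String[] args) throws Exception {
            // The ORB mediates every remote invocation.
            ORB orb = ORB.init(args, null);

            // Locate the naming service and resolve the hypothetical Learner object.
            NamingContext naming = NamingContextHelper.narrow(
                    orb.resolve_initial_references("NameService"));
            NameComponent[] path = { new NameComponent("Learner", "") };
            Learner learner = LearnerHelper.narrow(naming.resolve(path));

            // Invoke through the generated stub as if the object were local.
            System.out.println(learner.learn("customer_table"));
        }
    }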

Conclusions CORBA is the de facto standard for object-oriented distributed computation. It makes the integration of different applications, coded in almost any programming language, considerably easier. It also allows reusability in a feasible way. But above all, it makes it possible to build large and scalable systems. However, this technology is not yet fully developed, and even less widespread. For example, only a few ORB implementations are available. The most popular are Orbix and OrbixWeb from IONA and VisiBroker from Visigenic, which integrate code from C++, Java and Java, respectively. All those products are distributed by modules, and are not appropriate for a university research project.

3.2 An alternative approach

Multi Agent Platforms

For the last decade, research interest in AI has focused mainly on knowledge-based distributed systems. More precisely, multi agent systems (MAS) have experienced important advances in recent years. These systems are made up of autonomous, cooperating agents [16], each of them playing a specific role in their society, and give the designer a conceptual framework for the interoperability of heterogeneous systems. The basic characteristics [7] of those systems are the following:

• Modularity: a MAS-based system is built up from basic blocks which cover all the software/hardware subsystems that compose the whole system.

• Encapsulation: software components are encapsulated behind a common interface, avoiding the heterogeneity problem.

• Cooperation: mechanisms suitable for intelligent cooperation should be supported, to allow complex interactions among the components in the framework.

• Distribution: module location should not be a constraint. Modules must be able to be distributed following any requirements.

• Openness: cooperation between heterogeneous components must also be understood as cooperation among heterogeneous systems, to allow heterogeneous MAS to cooperate.

• Ease of use: CORBA-based systems are neither so easy nor so comfortable to use; they are really quite complex and demand a considerable initial effort from the programmer. MAS incorporate powerful abstraction mechanisms that ease development.

The MAS approach has just been presented as an alternative to OMA. In what follows, we propose MAST, a particular example of a tool for developing multi agent systems, as the platform with which to develop the generic KDD system mentioned earlier.

MAST: the Tool for Developing Multi Agent Systems MAST (Multi Agent Systems Tool) [7] is a toolbox for developing distributed processing systems in an agent-oriented fashion. It provides mechanisms to specify individual agents, with both offered and needed services and particular goals. It also incorporates the definition of agent groups, tools for knowledge representation and exchange, and coordination mechanisms between agents.

Mix Architecture MIX [10, 13] is the reference architecture for all systems developed with MAST. It presents two basic models: the agent model and the network model. The former defines an agent as a set of elements, rendered as a data structure in the sketch below:

• Services: functionality offered to other agents.

• Goals: functions that an agent carries out in its own interest (not as a result of a request from another agent).

• Resources: information about external resources like services, libraries, ontologies, etc.

• Internal Objects: data structures shared by all the processes that the agent can launch to carry out service requests or to achieve goals.

• Control: specification of how service requests are handled by the agent.
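A Java rendering of that agent model might look like the following; the field names are ours, chosen only to mirror the five elements, and do not reproduce MAST's actual data structures.

    import java.util.List;
    import java.util.Map;

    // Hypothetical mirror of the MIX agent model's five elements.
    public class AgentDefinition {
        public List<String> services;                // functionality offered to other agents
        public List<String> goals;                   // functions carried out in self-interest
        public List<String> resources;               // external services, libraries, ontologies
        public Map<String, Object> internalObjects;  // data shared by the agent's processes
        public String controlPolicy;                 // how service requests are handled
    }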

The network model defines two distinct agent types: network agents, which offer network management services, and application agents, which offer their own services depending on their particular role. In addition, the model is structured in three levels: the interface, message and transport layers. The first offers an API, for both C++ and Java, that provides communication facilities between agents by means of message passing. The second level offers services to manage addresses and the intentionality and body of each message. The transport level offers the basic functionality for sending and receiving messages over TCP/IP.

Defining Agents with MAST ADL (Agent Definition Language) [13, pg. 39-47] is a declarative language integrated in MAST for defining the externals of an agent. The internals of an agent depend on each particular implementation: two agents sharing the same ADL definition could be implemented differently, even in different programming languages. ADL allows agents to be defined in a hierarchical way, using the well-known concepts of class and inheritance. Besides, the services offered, the goals to achieve, the resources needed, the internal data structures and the possible policies to apply can all be specified with ADL. Examples of agent definitions using ADL can be seen in Chapter 8 of [13].

Exchanging Information Objects with MAST Once the agents that compose the system have been defined, we should be able to exchange knowledge among them. CKRL (Common Knowledge Representation Language) [3] is the language used in MAST for declaring the knowledge objects to be sent, received and managed by agents. It was developed by the MLT Consortium under ESPRIT project 2154. Examples of the use of CKRL accompany the ones illustrating the use of ADL.

A Multiagent System for the Generic KDD System The engineering of a generic data mining system is a research task being carried out under the M2D2 project.2 MAST is being used as the development platform, and MAS-CommonKADS [6] as the methodology for the design of the multiagent system. Architecturally, the generic data mining system defines the following agent roles:

• User agent: collects from the interface all the parameters that compound a request for the data mining service, makes the request and waits for the results, which must be conveniently shown to the user. This agent should offer facilities to manage those results, depending on their topology, to help the user understand them.

• Federation of naming service agents: agents capable of locating specific machine learning agents must be present in the system, and the number of machine learning algorithms potentially integrated in the system could be very high. Both reasons make it necessary to state a federation of agents offering naming services, to assure the extensibility of the set of machine learning agents.

2 This project is funded by the Spanish Government under CICYT project TIC 97-1343-C02-01.


• Negotiation agent: once all the machine learning services available for the user request have been located, this agent must start a negotiation process with all the machine learning agents offering them, to decide which one will serve the request.

• Machine learning agents: each of these agents encapsulates a machine learning algorithm, together with the dynamic features describing the learning service it is offering at the moment.

• Control agent: after the negotiation process, and given the chosen machine learning agent, the control agent makes a task plan for coordinating and controlling the whole machine learning process.

Figure 1: An example data mining session in the generic data mining system

A possible situation of the system is depicted in figure 1. In that setting, the user agent makes a service request, S, to the trading agent. The trading agent asks the suitable3 naming service agent for the given service. In this particular example, that agent finds out that no one in its hierarchy offers the service, so the request is forwarded to another naming service agent (in this case by the former, although it could also have been forwarded by the trading agent). The next naming service agent that receives the request knows three agents, A, B and C, which offer the service. Now the trading agent receives that information, along with the particular features of each service offer, and makes a decision4 about which agent should carry on with the requested service. Agent A is chosen in this example. After that, the trading agent informs the control agent, which makes its task plan and starts the machine learning process, sending service primitives to the chosen machine learning agent. Once the learning process has finished, the results are returned to the trading agent, which passes them to the user agent and feeds them back into its trading decision process to improve the meta-learning.

3 The suitability of a naming service agent should be understood in terms of lower-level design questions, such as whether the request conveys some information about the location of possibly adequate services, or the place the naming service agent occupies in a possible agent hierarchy.
4 The search for possible valid service agents may be exhaustive or partial, depending on the policies applied to the trading process.
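The heart of this session, the trader's choice among the offers collected through the naming federation, can be summarized in the following Java sketch. All the types and the scoring rule are assumptions made for illustration; the real negotiation would be message-based among the agents described above.

    import java.util.Comparator;
    import java.util.List;

    public class TradingSketch {
        interface Model {}
        interface LearningAgent {
            double scoreOffer(String service);      // dynamic features condensed to a score
            Model run(String service, String data); // execute the learning task
        }

        // Pick the best offer among the candidates located by the naming
        // agents, run the service, and keep the outcome for metalearning.
        Model serve(String service, String data, List<LearningAgent> candidates) {
            LearningAgent best = candidates.stream()
                    .max(Comparator.comparingDouble(a -> a.scoreOffer(service)))
                    .orElseThrow(() -> new IllegalStateException("no agent offers " + service));
            Model result = best.run(service, data);
            feedback(best, result); // refine future trading decisions
            return result;
        }

        void feedback(LearningAgent chosen, Model result) { /* metalearning hook */ }
    }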

Figure 2: Functional description of the dynamic data mining approach

4 What's Metalearning Here

Before defining data mining in our particular context, let us have a look at the whole KDD process. Figure 2 describes our own KDD process at a very high abstraction level. First of all, there are two parallel tasks: preprocessing the data, and recovering the user's request about the data mining task. They are supposed to be orthogonal. The MetaData gives preliminary measures about the source data. The User Goal informs about the user's intentions, and the QoS about the user's preferences. The trading process takes these three parameters to make a decision about which machine learning algorithm could be adequate, and to state a control over it. After this decision is made, the learning process can start. It will produce a data model for the user, and feedback on accuracy and other metrics for improving the metalearning at the trading process.
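At the level of signatures, the trading step of figure 2 could be sketched in Java as follows; the record shapes are assumptions about what the three inputs and the two outputs might contain, not the project's settled design.

    // Illustrative shapes for the inputs and outputs of the trading process.
    record MetaData(int numClasses, int numAttributes, long numExamples) {}
    enum UserGoal { DESCRIPTION, PREDICTION }
    record QoS(double minAccuracy, long deadlineMillis, String resultMorphology) {}
    record TradingDecision(String algorithmId, String controlPlan) {}

    interface Trader {
        // Three orthogonal inputs produce an algorithm choice plus a control plan.
        TradingDecision decide(MetaData meta, UserGoal goal, QoS qos);

        // Metrics measured after learning feed the metalearning loop.
        void feedback(TradingDecision decision, double measuredAccuracy, long elapsedMillis);
    }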

4.1 Parameters for Defining the Data Mining Service

MetaData A previous summarization of the source data may be useful for deciding on the most adequate learning algorithm. In the Statlog project [14], various types of measures are considered: simple, statistical, and information-entropy related. Simple measures can be worked out immediately: parameters like the number of classes, the number of attributes and the number of examples are enumerated there. Statistical measures carry a very high computational cost, which makes them inadequate for the typical data sizes we are working with. The other group of measures, based on information entropy, is very interesting and shall be considered in the future.
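For the simple measures, and for the entropy of the class distribution as one example of the entropy-related group, a sketch could look like this. The dataset representation (each example as a list of strings whose last element is the class label) is an assumption; the code reuses the MetaData record sketched in the previous section.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class SimpleMeasures {
        // Count occurrences of each class label (assumed to be the last value).
        static Map<String, Integer> classCounts(List<List<String>> examples) {
            Map<String, Integer> counts = new HashMap<>();
            for (List<String> ex : examples)
                counts.merge(ex.get(ex.size() - 1), 1, Integer::sum);
            return counts;
        }

        // Simple measures: number of classes, attributes and examples.
        static MetaData summarize(List<List<String>> examples) {
            int attributes = examples.get(0).size() - 1;
            return new MetaData(classCounts(examples).size(), attributes, examples.size());
        }

        // Entropy of the class distribution, in bits.
        static double classEntropy(List<List<String>> examples) {
            double n = examples.size(), h = 0.0;
            for (int c : classCounts(examples).values()) {
                double p = c / n;
                h -= p * (Math.log(p) / Math.log(2));
            }
            return h;
        }
    }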

User Goals Basic ideas about user goals are taken from [19]. In that overview, two main data mining goals are outlined: description and prediction. The concrete tasks for achieving them are summarized in the following table:

    Description                  Prediction
    Classification               Regression
    Clustering                   Change Detection
    Summarization                Deviation Detection
    Modelling of Dependencies

We consider this table a starting point for allowing the user to specify his or her basic intention of data manipulation.

QoS Allowing the user to indicate what data mining to perform on the data is not enough: the system should also take into account the way the data mining is to be performed. The concept of QoS (Quality of Service) is used here and, as we see it, is defined by three parameters:

• Accuracy: the desired quality of the new knowledge may or may not be important for the user. This parameter offers the user the possibility of requiring accuracy in a fuzzy way, as sketched below.

• Temporal restrictions: related to the former parameter, through this parameter the system takes into account the user's requirements about the interactivity of the learning process.

• Morphology of results: with this parameter, the user can state preferences about decision trees, visual maps that cluster the data in some way, sets of rules, concrete histograms, etc.

Once those three parameters are well specified, and given a mechanism for negotiating with the intelligent agents which machine learning algorithm is the best for the learning process, the trading process can start and choose the most suitable algorithm, along with a control mechanism over it to carry out the learning. Once the results are produced, feedback is provided to the trading process so that, by means of metalearning, it can improve its decision process.
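As a toy illustration of the fuzzy accuracy requirement mentioned above, the user could ask for "high" accuracy, and the trader would score each candidate's expected accuracy with a membership function rather than a hard threshold. The breakpoints here are arbitrary, chosen only for the example.

    class FuzzyAccuracy {
        // Membership of x in the fuzzy set "high accuracy": 0 up to 0.7,
        // rising linearly to 1 at 0.9 and above.
        static double high(double x) {
            if (x <= 0.7) return 0.0;
            if (x >= 0.9) return 1.0;
            return (x - 0.7) / 0.2;
        }
    }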

4.2 A Bootstrap Approach to Metalearning: Basic Ideas

Automating the choice of the most suitable machine learning algorithm to apply in each session is a difficult design task. We have observed three valid approaches to it. The first works with off-line decision rules, based on others' results and on our own previous experience with the use of the algorithms, to construct some kind of expert system for dealing with the choice.

Figure 3: The Concept of Bootstrap Metalearning

The second approach considered is learning by observing the user's preferences about algorithms. This approach could work well in a system in which the user, possibly an expert one, has all the control, so that the metalearning process draws its own conclusions from the user's actions. The third approach is based on reinforcement learning. Again, the user's actions are important, because he has to reward or punish the system's decisions in some way. We propose a new way of data mining metalearning that merges these three partially valid approaches on a bootstrap basis. The concept is illustrated in figure 3. The Knowledge Base is used, in read-only mode, by the trading process. The user model takes its entries exclusively from the user's actions, whereas the evolutionary set of rules, the so-called expert system, gets its entries from the results of the learning process.
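A schematic combination of the three knowledge sources of figure 3 could be the following; treating them as additive voters with equal weight is purely an assumption made for illustration.

    import java.util.Comparator;
    import java.util.List;

    class BootstrapTrader {
        interface Scorer { double score(String algorithmId); }

        final Scorer staticKb;       // read-only, hand-crafted decision rules
        final Scorer userModel;      // fed exclusively by observed user actions
        final Scorer evolvingRules;  // the "expert system", fed by learning results

        BootstrapTrader(Scorer staticKb, Scorer userModel, Scorer evolvingRules) {
            this.staticKb = staticKb;
            this.userModel = userModel;
            this.evolvingRules = evolvingRules;
        }

        // Each knowledge source votes; the best aggregated score wins.
        String choose(List<String> candidates) {
            return candidates.stream()
                    .max(Comparator.comparingDouble(a ->
                            staticKb.score(a) + userModel.score(a) + evolvingRules.score(a)))
                    .orElseThrow();
        }
    }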

5 Related Work

This work focuses mainly on metalearning, although many particular issues are addressed in it; in any case, there is an important design task concerning the multiagent system. An interesting project based on Java agents is JAM [15]. In JAM, agents are distributed depending on where the multiple data sources are located. Each agent does its own data mining, and then other, metalearning agents combine all the results. These agents are used for fraud and intrusion detection. A similar approach is observed in the PADMA [11] architecture, in which data mining agents carry out the discovery process in parallel without merging the results. The integration of different machine learning tools has been outlined previously in the KEPLER system [17], which introduces the concept of extensibility of a data mining system, in the sense of integrating any machine learning algorithm into the system. It is

based on the concept of "plug-in"; however, it does not incorporate a decision mechanism to choose among those algorithms for a given data mining session. Also related are the works in Statlog [14], the MLT (Machine Learning Toolbox) project (ESPRIT-2154) and MLC++ [12]. The Statlog project was intended to compare statistical approaches with machine learning and neural networks, in order to obtain some basic guidelines for deciding on the best possible uses of each algorithm. In the MLT project, a set of machine learning algorithms was developed and compared in performance. MLC++ provides a set of C++ classes for supervised machine learning and will be integrated into our system. It has already been used in MineSet [2], a data mining system which focuses mainly on the visualization of results. Metalearning is also outlined in the aforementioned JAM, where it must be understood in terms of combining the different classifiers previously obtained by each data mining agent.

6 Conclusions

For the engineering of a generic data mining system able to integrate any implementation of any machine learning algorithm, a distributed processing platform is needed in order to assure scalability. CORBA has been signaled here as the most popular framework for object-oriented distributed processing. However, it is not adequate for this project, because its technology is not yet very accessible. Instead, we propose MAST as a powerful tool for developing, in a modular way, a potentially large distributed system. Besides, MAS-CommonKADS is used as a methodology for developing knowledge-based systems in an agent-oriented manner.

7 Future Work

Although the basic guidelines have been outlined, there is still a lot of work to be done. The parameters defining the datamining service must be made operative. A basic mechanism for integrating machine learning algorithms, totally or partially automated, must be defined and implemented, along with some criteria for dimensioning the resources of the distributed system. A trading mechanism must be implemented, and it could be convenient to define search policies for machine learning algorithms, delimiting functional and authoritative domains. All the ideas about metalearning must be tested and refined, and the bootstrap must end with the most suitable metalearning strategy, or maybe a hybrid one.

References

[1] ISO/IEC CD 10746-1. Basic Reference Model of Open Distributed Processing - Part 1: Overview and Guide to Use. July 1994.

[2] Cliff Brunk, James Kelly, and Ron Kohavi. MineSet: An integrated system for data mining. In David Heckerman, Heikki Mannila, Daryl Pregibon, and Ramasamy Uthurusamy, editors, The Third International Conference on Knowledge Discovery & Data Mining. AAAI Press, August 1997.

[3] K. Causse, Marc Csernel, and Jorg-Uwe Kietz. Final specifications of the Common Knowledge Representation Language of the MLToolbox. Document ITM 2.2, MLT Consortium, ESPRIT project 2154, March 1992.

[4] Robert P. Cerveny and Kenneth E. Knight. Grosch's Law. In Anthony Ralston and Edwin D. Reilly, editors, Encyclopedia of Computer Science, Third Edition. Van Nostrand Reinhold, New York, 1993.

[5] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI Press/The MIT Press, 1996.

[6] Carlos A. Iglesias Fernandez. Definicion de una Metodologia para el Desarrollo de Sistemas Multiagente. PhD thesis, Dep. Ingenieria de Sistemas Telematicos, E.T.S.I. Telecomunicacion, Universidad Politecnica de Madrid, February 1998.

[7] Jose C. Gonzalez, Juan R. Velasco, Carlos A. Iglesias, Jaime Alvarez, and Andres Escobero. A multiagent architecture for symbolic-connectionist integration. Technical Report MIX/WP1/UPM/3.2, Dep. Ingenieria de Sistemas Telematicos, E.T.S.I. Telecomunicacion, Universidad Politecnica de Madrid, November 1995.

[8] Object Management Group. The Common Object Request Broker: Architecture and Specification. Technical report, Object Management Group, July 1995.

[9] Object Management Group. OMG IDL Syntax and Semantics. Technical report, Object Management Group, May 1997.

[10] Carlos A. Iglesias, Jose C. Gonzalez, and Juan R. Velasco. MIX: A general purpose multiagent architecture. In M. Wooldridge, K. Fischer, P. Gmytrasiewicz, N. R. Jennings, J. P. Muller, and M. Tambe, editors, Proceedings of the IJCAI'95 Workshop on Agent Theories, Architectures and Languages, pages 216-224, Montreal, Canada, August 1995. (An extended version of this paper appeared in Intelligent Agents II: Agent Theories, Architectures, and Languages, Springer Verlag, 1996, pages 251-266.)

[11] Hillol Kargupta, Ilker Hamzaoglu, and Brian Stafford. Scalable, distributed data mining - an agent architecture. In David Heckerman, Heikki Mannila, Daryl Pregibon, and Ramasamy Uthurusamy, editors, The Third International Conference on Knowledge Discovery & Data Mining. AAAI Press, August 1997.

[12] R. Kohavi, G. John, R. Long, D. Manley, and K. Pfleger. MLC++: A machine learning library in C++. In Tools with Artificial Intelligence, page 249.

[13] Luis Magdalena. Analysis of hybrid models: Fuzzy logic/neural nets. Technical Report MIX/WP2/UPM/1.2, Dep. Ingenieria de Sistemas Telematicos, E.T.S.I. Telecomunicacion, Universidad Politecnica de Madrid, March 1995.

[14] Donald Michie, David J. Spiegelhalter, and Charles C. Taylor, editors. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.

[15] Salvatore Stolfo, Andreas L. Prodromidis, Shelley Tselepis, Wenke Lee, and Dave W. Fan. JAM: Java agents for meta-learning over distributed databases. In David Heckerman, Heikki Mannila, Daryl Pregibon, and Ramasamy Uthurusamy, editors, The Third International Conference on Knowledge Discovery & Data Mining. AAAI Press, August 1997.

[16] Robin Smith. Software agents technology. In PAAM 96. The Practical Application Company, April 1996.

[17] Stefan Wrobel, Dietrich Wettschereck, Edgar Sommer, and Werner Emde. Extensibility in data mining systems. In Evangelos Simoudis, Jiawei Han, and Usama Fayyad, editors, The Second International Conference on Knowledge Discovery & Data Mining. AAAI Press, August 1996.

[18] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. Data Mining and Its Applications: A General Overview. In Evangelos Simoudis, Jiawei Han, and Usama Fayyad, editors, The Second International Conference on Knowledge Discovery & Data Mining. AAAI Press, August 1996.

[19] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From Data Mining to Knowledge Discovery: An Overview. In Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI Press, 1996.