Interacting with IDL: The Adaptive Visual Interface

M. F. Costabile, F. Esposito, G. Semeraro, N. Fanizzi, S. Ferilli
Dipartimento di Informatica, Università di Bari
Via Orabona 4, 70125 Bari, Italy
{costabile,esposito,semeraro,fanizzi,ferilli}@di.uniba.it

Abstract. IDL (Intelligent Digital Library) is a prototypical intelligent digital library service that is currently being developed at the University of Bari. Its characterizing features include a retrieval engine and several facilities available to library users. In this paper, we present the web-based visual environment we have developed with the aim of improving user-library interaction. The IDL environment is equipped with novel visual tools that are primarily intended for inexperienced users, who represent most of the users that usually access digital libraries. Machine Learning techniques have been exploited in IDL for document analysis, classification, and understanding, as well as for building a user modeling module, which is the basic component for providing IDL with user interface adaptivity. This feature is also discussed in the paper.

C. Nikolaou, C. Stephanidis (Eds.): ECDL’98, LNCS 1513, pp. 515-534, 1998. © Springer-Verlag Berlin Heidelberg 1998

1 Introduction and Motivation

The rapid advance of computing power and networked connectivity is drawing increasing attention to interconnected digital libraries and to the functions they must support in order to cope with the wide variety of users who access, retrieve, and display information from such systems, and with the nature of the stored information, which is distributed over various sources that differ in type, form, and content. Users need to understand easily the kinds of objects they have access to and how they can retrieve and organize them in ways that permit rapid decisions about what is relevant and which patterns exist among objects. Users also need to manipulate the retrieved information in order to incorporate it into their specific tasks. As a consequence, digital libraries must provide enhanced user interfaces that support this intensive interaction between users and information. In this context, conventional interfaces, based on a view of information retrieval as an isolated task in which the user formulates a query against a homogeneous collection to obtain matching documents, are completely out of date. Indeed, this view does not correspond to the reality of users working with both digital and physical libraries, for several reasons. For example, users are often unable to formulate specific questions; they realize what they are trying to ask, and how to ask it, by browsing the system. This process has been called progressive querying in [9] and iterative
query refinement in [24]. Moreover, users often consult multiple sources with different contents, forms, and methods of access. Several authors agree that users interacting with a huge amount of unknown and varied information find meta-information on the following aspects of the stored data extremely useful [24]: 1) content, that is, what information is stored in the source; 2) provenance, which refers to how the information in the source is generated and maintained, whether it is a public source or a personal archive, how frequently it is maintained, etc.; 3) form, i.e. the schemes for the items in the source, including their attributes and the types of values for these attributes; 4) functionality, which concerns the capabilities of the access services, such as the kinds of search supported and their performance properties; 5) usage statistics, that is, statistics about source usage, including previous use by the same user or by other users.

IDL (Intelligent Digital Library) is a prototypical intelligent digital library service that is currently being developed at the University of Bari [11; 28]. Its characterizing features include a retrieval engine and several functionalities that are available to the library users. IDL exploits machine learning techniques (hence the adjective "intelligent") for document analysis, classification, and understanding, as already discussed in previous works [12; 15; 29]. One of the goals of our work is to investigate effective ways of endowing the interaction environment with appropriate representations of some meta-information, particularly about content, in order to provide users with proper cues for locating the desired data. The various paradigms for representing content range from a textual description of what is stored in the information source to structured representations using some knowledge representation language.
Our choice is to exploit visual techniques, whose main advantage is the capability of shifting load from the user’s cognitive system to the perceptual system. We describe the web-based visual environment we have developed for IDL in order to improve user-library interaction. We focus the presentation on some novel visual tools that allow the representation of meta-information that can help users retrieve data of interest. Moreover, we discuss the user modeling module, which is the basic component for providing user interface adaptivity. This feature is achieved by automatically classifying the user through machine learning techniques based on decision trees.

The paper is organized as follows. An overview of the main features and architecture of IDL is given in Section 2. Section 3 presents the visual interaction environment of IDL, while Section 4 illustrates how the adaptivity of the interface is achieved through machine learning techniques. Related work is reported in Section 5, while Section 6 concludes the paper and outlines future work.

2 IDL: A General View

According to Lesk, "a digital library is not merely a collection of electronic information" [19]. It is "a distributed technology environment that dramatically
reduces barriers to the creation, dissemination, manipulation, storage, integration and reuse of information by individuals and groups" [18]. On the basis of this definition, we developed IDL as a prototypical digital library service whose primary goal is to provide a common infrastructure that eases the process of creating, updating, searching, and managing corporate digital libraries. Here, the word corporate means that the different libraries are not necessarily perceived by the user as a single federated library, as in the Illinois Digital Library Initiative Project [27]. Nevertheless, all the libraries share common mechanisms for searching information, updating content, controlling user access, charging users, etc., independently of the meaning and the internal representation of information items in each digital library. Indeed, the IDL project focuses on the development of effective middleware services for digital libraries, and on their interoperability across heterogeneous hardware and software platforms [3].

The main features of IDL are strictly related to the library functions of 1. collection, 2. organization, 3. access:

1. support for information capture - supervised learning systems are used to overcome the problem of cheaply and effectively freeing information items from the physical medium on which they are stored;
2. support for semantic indexing - again, supervised learning systems are used to automatically perform the tasks of document classification and document understanding (reconstruction of the logical structure of a document), which are necessary steps for indexing information items according to their content;
3. support for content understanding and interface adaptivity - IDL provides users with an added-value service, which helps novice users to understand the content and the organization of a digital library through a suitable visual environment, and supports skilled users (supposed to be familiar with the digital library) in retrieving desired information items easily and quickly by means of an appropriate interface modality. Interface adaptivity is achieved through automated user classification based on machine learning techniques.

IDL is a digital library service. Its architecture is the typical client/server architecture of a hypertextual service on the Internet. More specifically, the current version of IDL adopts a thin-client stateful architecture [21]. In such a model, the application runs with only one program resident on the user’s personal computer: a Web browser. Moreover, there is no need to store data locally, so no DBMS is present on the client side of the architecture. This justifies the attribute thin-client given to the model. Data are simply fetched from the library host and then presented to the user through HTML screens. These screens are dynamically generated by means of Java applets, since their content must mirror the current content of the repository on the library host or must be generated according to the user’s choices. Furthermore, the architecture is called stateful since it is characterized by the presence of a Learning Server that, besides the other services related to document management, is able to infer interaction models concerning several classes of users from data collected in log files and managed by the IDL Application Server. A description of how the IDL Learning Server performs this task is given in Section 4.

The reasons that led us to adopt a thin-client stateful architecture, rather than a fat-client model, are several:
• for users who do not have a permanent Internet connection, the cost of using IDL is just that of a telephone call to the Internet Service Provider (ISP), not to the remote host of the library;
• there is no software to be downloaded, maintained, or updated on the user’s PC, with the exception of the client browser;
• there is no need to download/upload data from/to the server of the library;
• the user can enter IDL from any personal computer connected to the Internet through either a permanent or a dial-up connection.

A thorough description of the architecture of IDL is reported in [28]. IDL is programmed in several languages, ranging from C and C++ to Java, and it exploits the various services offered by the World Wide Web.

3 Web-Based Interaction Environment of IDL

In this section we present the interaction environment of IDL, which remote users may access on the WWW. This environment has evolved from the web interface described in [28] and has been enriched with new visual tools, namely the topic map and the tree-based interface, in order to help a wide variety of users search, browse, and select information from the library sources.

The first interface developed was essentially form-based and allowed users to carry out the following general activities: 1) creation/deletion of a digital library; 2) management of specific digital libraries; 3) browsing/querying of a selected digital library.

These activities can be performed by three different kinds of persons, who interact with IDL according to the role they play in the system and, as a consequence, according to the access rights they hold. Thus, we can identify and define a hierarchy of roles, whose prerogatives range from the mere usage of the digital libraries (e.g., querying and/or retrieving by content the documents of interest) up to the global management of the whole corporate system, possibly including structural modifications to it.

At the top level of this hierarchy we find the Library Administrator, who is unique and is at the head of the system in its entirety. The Administrator’s fundamental task consists in managing and supervising access to the various libraries involved in the system. In particular, the Administrator is the only person who has the power to allow a new library to join the service or, conversely, to eliminate an existing one (activity 1 above).

Each digital library involved in the system has its own manager, who holds the second role in the hierarchy and is called the Library’s Custodian or Librarian (throughout the paper we assume the user is male).
Each Librarian is responsible for his own digital library and receives from the Library Administrator a password with which he can enter the system to perform his tasks. This is necessary for
the sake of security, since Librarians have the power to modify both the content and the structure of the libraries they manage, i.e. to add, delete, or update not only documents belonging to any class in the library, but even the classes themselves, with all their search indexes (attributes) or the definition of each search index.

At the bottom of the hierarchy we find the Generic User (user for short), who is any person entering the system through the Internet with the aim of consulting the available digital libraries. The user can query the library in a number of ways in order to retrieve the documents he is interested in. Then, if desired, the user can display any of the documents found, in digital format. Of course, the user cannot change anything in the system, except for local copies of the documents. Each user automatically receives a personal identification code when accessing the system for the first time, and this id identifies the user in all future interaction sessions.

Even though the form-based interface is Web-based and turns out to be powerful and flexible, in that it permits a search by a combination of index fields, it is more appropriate for users who are already acquainted with the library structure and also have some information about the library content. By observing casual users interacting with our prototype, we realized that users often performed queries whose result was null, simply because they had no idea of the kind of documents stored in the library. Therefore, we decided to enrich the IDL interaction environment by developing novel visual tools that allow users to easily grasp the nature of the information stored in the available sources and the possible patterns among the objects, so that they can make rapid decisions about what they really need and how to get it.
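The role hierarchy described in this section can be pictured as an ordered set of access levels. The Python sketch below is illustrative only: the `Role` enumeration, the activity names, and the assumption that each role inherits the rights of the roles below it are introduced here for the example and are not taken from the IDL implementation. It also ignores the fact that a Librarian is responsible only for his own library.

```python
from enum import IntEnum


class Role(IntEnum):
    """Roles ordered from least to most privileged (ordering is an assumption)."""
    USER = 1
    LIBRARIAN = 2
    ADMINISTRATOR = 3


# Minimum role needed for each activity; the activity names are hypothetical.
REQUIRED_ROLE = {
    "query_library": Role.USER,
    "view_document": Role.USER,
    "update_documents": Role.LIBRARIAN,
    "update_classes": Role.LIBRARIAN,
    "create_library": Role.ADMINISTRATOR,
    "delete_library": Role.ADMINISTRATOR,
}


def is_allowed(role, activity):
    """Return True if the given role may perform the activity."""
    return role >= REQUIRED_ROLE[activity]
```

Under these assumptions, `is_allowed(Role.USER, "update_classes")` is false, while `is_allowed(Role.ADMINISTRATOR, "delete_library")` is true.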

3.1 The Topic Map

One of the new features of the IDL environment that users appreciate most is the possibility of getting a rapid overview of the content of the data stored in a library through the topic map. Such a visualization is actually an interactive dynamic map (interactive map for short), as proposed in [33]. An interactive map gives a global view of either the semantic content of a set of documents or the set of documents themselves. The semantic content reflects the topics contained within the set of documents and the way they are organized to relate to each other; it is represented by a thesaurus that is built automatically from a full-text analysis.

Interactive maps exploit the metaphor of exploring a geographic territory. A collection of topics, as well as a collection of documents, is considered to be a geographical territory that contains resources, which metaphorically represent either topics or documents; maps of these territories can be drawn, where regions, cities, and roads are used to convey the structure of the set of documents: a region represents a set of topics (documents), and the size of the region reflects the number of topics (documents) in that region. Similarly, the distance between two cities reflects the similarity relationship between them: if two cities are close to each other, then the
topics (documents) are strongly related (for example, documents have related contents).

Topic maps are very effective since they provide an overview of the topics identified in a collection of documents, their importance, and the similarities and correlations among them. The regions of the map are the classes of the thesaurus; each class contains a set of topics represented by cities on the map. Roads between cities represent relationships between topics. In this way, topic maps provide at a glance the semantic information about a large number of documents. Moreover, they allow users to perform some queries by direct manipulation of the visual representation. Document maps represent collections of documents generated from a user query, which may be issued on the topic map by selecting regions, cities, and roads. The cities of these maps are documents, and they are laid out so that similar or highly correlated documents are placed close to each other.

In order to generate the topic map in IDL, we need to identify the set of topics or descriptors defining the semantic content of the documents stored in one of the corporate digital libraries; such topics constitute the library thesaurus. There are several thesauri used in the information retrieval literature; most of them are built manually and their descriptors are selected depending on specific goals. An example is Roget’s thesaurus, which contains general descriptors. When building the library thesaurus, we have used standard techniques, also taking into account the type of documents currently stored in the library [32].
In AI_in_DL, one of the libraries in the current IDL prototype for which we provide a topic map, the documents are scientific papers that have been published in the journal IEEE Transactions on Pattern Analysis and Machine Intelligence (pami), in the Proceedings of the International Symposium on Methodologies for Intelligent Systems (ismis), and in the Proceedings of the International Conference on Machine Learning (icml). Therefore, we have used the INSPEC thesaurus, which contains specific terms in the field of Artificial Intelligence. This thesaurus contains 629 keywords, which are either single words or expressions made up of several words (up to five).

We have represented documents and keywords (topics) by vectors, which is common practice in information retrieval [17; 25]. The coordinates of the document vectors and those of the topic vectors are computed in the following way: the coordinate $d_i$ of the vector representing document $D$ is 1 if the topic $T_i$ was found in $D$, and 0 otherwise; the coordinate $t_i$ of the vector representing topic $T$ is 1 if document $D_i$ contains $T$, and 0 otherwise. A number of correlations can be easily computed from these vectors and then visualized in the topic or document map. In particular, for the topic map we are interested in the number of documents to which a topic is assigned (the so-called term frequency), and also in the correlation between pairs of topics, that is, the number of documents to which both topics of the pair are assigned. Moreover, clustering techniques are applied to the descriptors of the thesaurus in order to generate classes of similar descriptors, which will be visualized close together in a region of the topic map. As in [25; 33], the similarity between two topics is computed by the following formula:

\[ \mathrm{Sim}(T_i, T_k) = \frac{N_{T_i T_k}}{N_{T_i} + N_{T_k} - N_{T_i T_k}} \]

where:
$N_{T_i T_k}$ is the number of documents to which both topics $T_i$ and $T_k$ are assigned,
$N_{T_i}$ is the number of documents to which topic $T_i$ is assigned,
$N_{T_k}$ is the number of documents to which topic $T_k$ is assigned.

The thesaurus is then partitioned into a set of classes $A_1, A_2, \ldots, A_p$, where each $A_i$ contains descriptors that are similar, and $p$ is a user-settable parameter. In the AI_in_DL library of the current IDL prototype, $p$ has been set to 5, and the partition has been computed very simply. As centroids of the five classes, we have chosen the five topics in our thesaurus with maximum term frequency; they are: learning, classification, framework, noise, and training. For every other topic in the thesaurus, we have used the above formula to compute its similarity with the centroid of each class, and we have assigned the topic to the class with the highest similarity. Then, we have added a sixth class, the special class "miscellaneous", gathering topics dissimilar to any other topic.

Two remarks are worth making about the generation of the library thesaurus and its class partitioning. The first is that, if the documents stored in the library were of a general nature, we could have used other well-known techniques for building the thesaurus as well as for computing the relevance of documents with respect to topics. In our current research, the main interest is in effectively representing the topics and the related documents once a classification has been performed, rather than in identifying new document classification techniques, for which we rely on established research. The other remark is that the generation of the thesaurus is computationally expensive, but it needs to be performed from scratch only once. If the number of documents is large, it is unlikely that a new document will change the classes of the thesaurus.
Therefore, adding a new document to the initial collection will only require recomputing the correlations. The classes will be recomputed only after a large number of documents has been added.
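The similarity measure and the centroid-based partition described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the IDL code: each topic is represented by the set of documents it is assigned to (equivalent to the binary topic vector), the sample `topic_docs` data are invented, ties are broken by centroid order, and the rule that a topic with zero similarity to every centroid goes to "miscellaneous" is an assumption, since the text does not state the exact criterion.

```python
def similarity(docs_i, docs_k):
    """Sim(Ti, Tk) = N_TiTk / (N_Ti + N_Tk - N_TiTk) on sets of document ids."""
    both = len(docs_i & docs_k)
    union = len(docs_i) + len(docs_k) - both
    return both / union if union else 0.0


def partition(topic_docs, p):
    """Partition topics into p classes plus a 'miscellaneous' class.

    topic_docs maps each topic to the set of documents it is assigned to.
    """
    # Centroids: the p topics with maximum term frequency (ties broken by name).
    by_freq = sorted(topic_docs, key=lambda t: (-len(topic_docs[t]), t))
    centroids = by_freq[:p]
    classes = {c: [c] for c in centroids}
    classes["miscellaneous"] = []
    for topic in by_freq[p:]:
        sims = {c: similarity(topic_docs[topic], topic_docs[c]) for c in centroids}
        best = max(centroids, key=lambda c: sims[c])
        # Assumed rule: no overlap with any centroid means "miscellaneous".
        target = best if sims[best] > 0.0 else "miscellaneous"
        classes[target].append(topic)
    return classes


# Invented sample data: topic -> ids of the documents it is assigned to.
topic_docs = {
    "learning": {1, 2, 3, 4},
    "classification": {3, 5, 6},
    "decision tree": {1, 2},
    "noise": {5},
    "ontology": {9},
}
classes = partition(topic_docs, p=2)
```

With `p=2`, the two most frequent topics become centroids and every other topic joins the centroid it shares most documents with, relative to the size of their union.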

3.2 Interacting with the Topic Map

As already mentioned, our design of topic maps borrows some ideas from the interactive maps proposed in [33]. However, in the IDL environment we have used color-based coding techniques and added several widgets to that initial design, which make the overall visualization more effective and provide adequate mechanisms for flexible interaction in a data-intensive context such as that of an online digital library. In the IDL topic map, cities represent topics of the thesaurus, and a region represents a set of topics, i.e., the topics in a class $A_i$. The distance between two cities reflects the similarity relationship between them: if two cities are close to each other, then the topics are strongly related. As a novel feature, we have adopted a color-based technique to code the importance of a topic, which depends on the number of documents
that topic has been assigned to. Therefore, the rectangle used to represent a city is drawn in an appropriate color. Fig. 1 shows the topic map for the AI_in_DL library of the current version of IDL. As we can see, there are six regions on the map, in which topics are concentrated around the region centroid. Topics are visualized in ten colors that range from light blue to red, where light blue is used to represent the least important topics and red the most important ones (unfortunately, colors are not distinguishable in the grey-level figures included in this paper). The color scale reproduced in the widget labelled TOPICS, shown at the top left corner in Fig. 1, has two important functions: 1) it shows the colors used and the progression from the least to the most important one (the least important is at the left of the scale, represented by the lowest column); 2) it is a very useful interaction mechanism that works as a range slider [30], giving users the possibility of quickly filtering the information on the map. Indeed, the user can filter out the less important topics from the map by simply moving the slider from left to right with the mouse.

Fig. 1. Topic Map

Filters of this kind are very useful, especially when the map is
very cluttered, as often happens in digital libraries that, by definition, contain huge amounts of information. Fig. 2 is similar to Fig. 1, but the TOPICS range slider has been moved 3 positions to the right, thus eliminating the less important topics (and all their links); the resulting map is much less cluttered. A similar color-based technique has been adopted for coding a relation between pairs of topics through a road (i.e., a link) connecting two topics on the map. In our design, the link between two topics has a different color depending on the importance of the link, that is, the number of documents assigned to both linked topics. The colors are similar to those used for topic importance, and a similar widget labelled LINKS, also visible in Fig. 1, acts as a filter on the map. This reduces the number of links shown on the map, so that users may concentrate on the most important links.

Fig. 2. A topic map in which less important topics have been hidden in order to make it less cluttered

The topic map is visualized in a small window on the screen; hence, in a general
overview many topics are hidden. A zoom mechanism is essential for allowing users to browse properly. We designed a widget that gives feedback about which portion of the map is currently zoomed with reference to the whole map, in a way that is similar to a moving lens. This widget is located next to the LINKS widget in the area above the map. It is made of two rectangles: a black one, which represents the whole map, and a red one, which indicates the portion of the map currently visualized. In the situation in Fig. 1, the whole map is shown, so that the border of the red rectangle overlaps the border of the rectangle representing the whole map. By shifting the slider below this rectangle to the right, it is possible to reduce the area of the graph that is visualized. If we look at Fig. 3, we see that the range slider has been shifted to the right, and consequently the red rectangle is much smaller than the rectangle representing the whole map. Indeed, the topic map now visualizes a zoomed portion of the whole map shown in Fig. 1.

Fig. 3. A zoomed portion of the topic map with the composition of a query

As indicated by the red rectangle
in the widget, the visualized area is part of the top left region of the map in Fig. 1. The topics of this region are now much more visible than in Fig. 1, and the red rectangle provides useful feedback on where we are in the context of the whole map. The user can browse the zoomed map and visualize the area of interest by acting on the scroll bars at the bottom and at the right of the window showing the topic map.

Fig. 3 also illustrates the query facility that has been implemented in the topic map. We allow users to perform some queries by direct manipulation of the map. In the area above the topic map, we see six buttons next to the zoom widget. These buttons, together with the pull-down menu, are used for performing a query. Comparing this area with the same area in Fig. 1, we see that in Fig. 1 only the button "Compose Query" is enabled. By clicking on this button, the user can start composing a query. Once this is done, all the other buttons are enabled, and the user can compose the query by simply clicking on a topic in the map to select it and on the buttons "AND", "OR", "(", ")", if they are needed. The pull-down menu to the right of the buttons is used for choosing the document class to be queried. The default item is "All", standing for all classes. The classes are listed in a pull-down menu, since they can be modified by the Librarian and the menu is dynamically generated. During its composition, the query is shown in a dedicated area, as we can see in Fig. 3, where the string "MACHINE LEARNING" OR "DECISION TREE" is visible between the widget area and the map. This string is also editable. With this query, the user asks for all documents containing either of these keywords. The user can now submit the query by clicking on the button "Submit Query".
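A query string such as "MACHINE LEARNING" OR "DECISION TREE" can be evaluated against the topics assigned to a document with a small recursive-descent evaluator. The Python sketch below is only an illustration of the idea: the function names are hypothetical, the assumption that AND binds more tightly than OR is not stated in the text, and nothing here reflects how the IDL retrieval engine actually processes queries.

```python
import re

# A token is a quoted topic, an operator, or a parenthesis.
TOKEN = re.compile(r'\s*(?:"([^"]*)"|(AND|OR|\(|\)))')


def tokenize(query):
    tokens, pos = [], 0
    query = query.strip()
    while pos < len(query):
        m = TOKEN.match(query, pos)
        if not m:
            raise ValueError(f"unexpected text at position {pos}")
        if m.group(1) is not None:
            tokens.append(("TOPIC", m.group(1).lower()))
        else:
            tokens.append((m.group(2), None))
        pos = m.end()
    return tokens


def matches(query, doc_topics):
    """Return True if a document with the given topics satisfies the query."""
    tokens = tokenize(query)
    topics = {t.lower() for t in doc_topics}
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            advance()
            rhs = parse_and()  # always parse the operand, then combine
            value = value or rhs
        return value

    def parse_and():
        value = parse_atom()
        while peek() == "AND":
            advance()
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        if peek() == "TOPIC":
            return advance()[1] in topics
        if peek() == "(":
            advance()
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        raise ValueError("expected a topic or '('")

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result
```

For instance, `matches('"MACHINE LEARNING" OR "DECISION TREE"', {"decision tree"})` holds, whereas the same query with AND does not.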
The results are visualized in a document map, where the documents are shown through icons whose color indicates the relevance of each retrieved document to the query, in a way similar to what has been done for the topics. By clicking on a document icon, the user can see the details of that document.
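Both the topic colors and the filtering performed by the TOPICS and LINKS sliders amount to mapping a document count onto a ten-step scale and hiding everything below a threshold. The Python sketch below illustrates this under stated assumptions: the linear bucketing in `color_level` is a guess (the text does not say how counts are mapped to the ten colors), the function names are hypothetical, and the sample counts are invented.

```python
def color_level(count, max_count, levels=10):
    """Map a document count onto 0..levels-1 (0 = least important).

    Linear bucketing is an assumption made for this sketch.
    """
    if max_count <= 0:
        return 0
    return min(levels - 1, count * levels // max_count)


def visible_topics(topic_counts, slider_position, levels=10):
    """Topics that remain on the map once the slider hides lower levels."""
    max_count = max(topic_counts.values(), default=0)
    return {
        topic for topic, count in topic_counts.items()
        if color_level(count, max_count, levels) >= slider_position
    }


def visible_links(link_counts, shown_topics, slider_position=0, levels=10):
    """Links whose endpoints are both shown and whose level passes the filter."""
    max_count = max(link_counts.values(), default=0)
    return {
        pair for pair, count in link_counts.items()
        if pair[0] in shown_topics and pair[1] in shown_topics
        and color_level(count, max_count, levels) >= slider_position
    }


# Invented counts: documents per topic and documents shared by a pair of topics.
topic_counts = {"learning": 40, "classification": 25, "noise": 12, "ontology": 2}
link_counts = {("learning", "classification"): 10, ("learning", "ontology"): 1}

shown = visible_topics(topic_counts, slider_position=3)
links = visible_links(link_counts, shown)
```

Moving the slider three positions to the right corresponds here to `slider_position=3`, which hides the topics in the three lowest levels together with their links.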

3.3 Tree-Based Interface

The tree-based interface provides another visual modality for both browsing IDL and performing queries. The user navigates IDL along a tree structure, starting from the root and expanding the tree step by step, so that at each node of the tree the user can decide whether or not to explore that path further. Initially, the system shows only the root of the tree, namely the root node IDL. When a node is selected, a pop-up menu appears with two items: the first item expands the selected node, and the second provides an explanation of the meaning of the node, in order to guide the user in his choice. When the root node IDL is expanded, its children show the digital libraries available in IDL.

Node IDL in Fig. 4 is no longer expandable; the expandable nodes are shown in a blue rectangle with the label inside. If the user now selects another node among those that are still expandable, the pop-up menu appears again and the user can expand the selected node. Fig. 4 shows a situation in which the user has expanded the node AI_in_DL. In this way he has implicitly selected this library, and the classes of documents have been displayed as a further level of the tree. Then, the user has selected the class "icml" and expanded this node, so that all available indexes on this class of documents are now displayed. The user may now perform a query by entering appropriate values for one or more of these indexes. The search values are entered through a pop-up window that appears once the user clicks on a specific index node. Once the user has entered the search values for the selected indexes, he can submit the query by clicking on the button "Submit" at the bottom right corner of the window.
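The step-by-step navigation described above can be modelled with a small tree of expandable nodes. The Python sketch below is illustrative only: the `Node` class and the `render` function are hypothetical, and the "Author" index is invented for the example (the text names only the library AI_in_DL, the classes pami, ismis, and icml, and a Title index in the caption of Fig. 4).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    explanation: str = ""  # text shown by the second pop-up menu item
    children: list = field(default_factory=list)
    expanded: bool = False

    def expand(self):
        """First pop-up menu item: show the children of this node."""
        self.expanded = True
        return self.children


def render(node, depth=0):
    """Return the currently visible part of the tree, one line per node."""
    lines = ["  " * depth + node.label]
    if node.expanded:
        for child in node.children:
            lines.extend(render(child, depth + 1))
    return lines


# Levels: root -> libraries -> document classes -> search indexes.
icml = Node("icml", children=[Node("Title"), Node("Author")])
library = Node("AI_in_DL", children=[Node("pami"), Node("ismis"), icml])
root = Node("IDL", "Root of the digital library service", [library])

initial_view = render(root)   # only the root is visible
root.expand()
library.expand()
icml.expand()
full_view = render(root)      # libraries, classes, and the indexes of icml
```

Only expanded nodes contribute their children to the view, which mirrors the way the user decides at each node whether to explore a path further.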

Fig. 4. Query submission with search values for Title index

Several options are available for improving the usability of this interface. As an example, the query can be visualized in different ways by clicking on some buttons at the bottom of the window. Such a query visualization is useful feedback for the user.

The three interfaces described above, namely the form-based, the topic map, and the tree-based interface, make up the IDL interaction environment. Thanks to the IDL adaptivity feature, which is described in the next section, the system proposes to each user the interface that it considers the most appropriate for him. Of course, users are free to switch to another interface at any time.

4 Interface Adaptivity through IDL Learning Server

A fundamental problem to cope with when developing a system used by several users is to make it adaptive to the various kinds of users that can be recognized, with the aim of improving the overall usability of the system. A prototype of an intelligent component, working in a client/server way and able to automatically classify a user, is currently embedded in the Application Server of IDL. In the overall architecture of IDL, this prototype is part of the Learning Server.

The IDL Learning Server can be defined as a suite of learning systems that can be exploited concurrently by clients for performing several tasks, such as document analysis/classification/understanding and inference of user models. Here we focus on this last task, intended as the task of inferring user profiles by means of supervised learning methods.

In fact, each user of a system has particular capabilities, skills, knowledge, preferences, and goals. This is particularly true in the case of a service meant to be publicly available on the WWW, like IDL. The reasons why users consult IDL can be the most disparate, from real needs of bibliographic search to checking the spelling of a word. After all, each user has his own profile; thus, when using the system, he will behave differently from any other user. Of course it is impossible, even for an intelligent system, to recognize each single user in order to adapt its behavior to him. Nevertheless, it is desirable that an intelligent system be able to understand which kind of user it is interacting with and try to help him by making the accomplishment of his goal easier (through contextual help, explanations, suitable interaction modalities, etc.). As a consequence, one of the main problems concerns the definition of classes of users meaningful for the system, and the identification of the features that properly describe each user and characterize the kind of interaction.
As to the task of inferring user models, the main function of the Learning Server is to automatically assign each IDL user to one of a set of predefined classes, on the basis of information drawn from real interaction sessions with IDL. In the human-computer interaction literature, this activity is known as interaction modeling [2]. This approach takes advantage of Machine Learning methods [20], since interaction modeling can be cast as a supervised learning problem: some user interactions with IDL are treated as training examples for a learning system, whose goal is to induce a theory for classifying IDL users.
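To make the supervised-learning framing concrete, the following minimal sketch treats each interaction summary as a labeled training example and searches for the best threshold test on a single numeric attribute. It uses plain information gain, which is a simplification and not necessarily the criterion applied inside the actual learning component; the attribute name `avg_daily_connection` and all values in `TRAINING` are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def best_threshold(examples, attribute):
    """Return (threshold, information_gain) for the best binary split.

    examples: list of (features_dict, class_label) pairs.
    Returns (None, 0.0) if no split improves on the unsplit set.
    """
    labels = [label for _, label in examples]
    base = entropy(labels)
    values = sorted({features[attribute] for features, _ in examples})
    best = (None, 0.0)
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [l for f, l in examples if f[attribute] <= threshold]
        right = [l for f, l in examples if f[attribute] > threshold]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical training examples (values invented for illustration only).
TRAINING = [
    ({"avg_daily_connection": 0.02}, "Novice"),
    ({"avg_daily_connection": 0.05}, "Novice"),
    ({"avg_daily_connection": 0.20}, "Expert"),
    ({"avg_daily_connection": 0.35}, "Expert"),
]

threshold, gain = best_threshold(TRAINING, "avg_daily_connection")
```

A full decision-tree learner would apply such a search recursively over all attributes; the sketch only shows a single step on one attribute.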

The classification performed by the Learning Server can be exploited in several ways. In IDL, it is used to associate each class of users with an interface suited to the user's degree of familiarity with the system, with the aim of speeding up the process of understanding the organization and the content of the chosen digital library and of properly assisting the user in retrieving the desired information. Among all the possible generic users of IDL, we defined three classes, namely Novice, Expert and Teacher. Over time, a user may become more familiar with the system, which must therefore be able to track potential changes in the class the user belongs to. This requires the system to register and identify the user. Each new user is required to fill in a digital form with personal data, after which he receives an identity code (User ID) that he will use whenever he enters IDL again. Correspondingly, the IDL Application Server creates a log file and associates it with each User ID; all the interactions of that user with IDL are stored in this file. By examining the data stored in the log file generated for each user during the interaction sessions, it is possible to extract characteristics that are useful for recognizing the user. Most of the identified characteristics turned out to be application dependent, while only a few turned out to be system dependent. For instance, relevant characteristics are those concerning the way in which users exploit the capabilities of the IDL search engine, such as the date and time at which a session begins, the class of documents chosen, the search indexes chosen, the criterion for sorting the search results, the number of documents returned by the search, and the types of errors made during the interaction with IDL.
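As an illustration of how characteristics might be extracted from a per-user log, the sketch below summarizes a list of log records as a dictionary of numeric attributes. The record format (`DATE EVENT DOCUMENT_CLASS`), the event names `LOGIN`, `QUERY` and `BAD_QUERY`, and the exact definition of each attribute are assumptions made for this example; the actual IDL log format is not described here.

```python
from collections import Counter


def extract_features(log_lines):
    """Summarise one user's log as a dict of numeric attributes.

    Each line is assumed to read 'YYYY-MM-DD EVENT DOCUMENT_CLASS'
    (a hypothetical format, not the real IDL log layout).
    """
    days = set()            # distinct dates on which the user was active
    sessions = 0            # number of LOGIN events
    queries = Counter()     # all queries, per document class
    bad_queries = Counter() # erroneous queries, per document class
    for line in log_lines:
        date, event, doc_class = line.split()
        days.add(date)
        if event == "LOGIN":
            sessions += 1
        elif event in ("QUERY", "BAD_QUERY"):
            queries[doc_class] += 1
            if event == "BAD_QUERY":
                bad_queries[doc_class] += 1
    # Assumed definition: connections per active day.
    features = {"avg_daily_connection": sessions / len(days) if days else 0.0}
    # Assumed definition: share of queries on a class that were erroneous.
    for doc_class, total in queries.items():
        features[f"freq_bad_query_{doc_class}"] = bad_queries[doc_class] / total
    return features


# Hypothetical log for one User ID.
LOG = [
    "1997-06-01 LOGIN -",
    "1997-06-01 QUERY Springer",
    "1997-06-01 BAD_QUERY Springer",
    "1997-06-02 LOGIN -",
    "1997-06-02 QUERY Pami",
]

features = extract_features(LOG)
```

Each resulting dictionary plays the role of one attribute vector; pairing it with a class label yields a training example for the learning system.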
Data stored in the log files are then exploited to train a learning system in order to induce a decision tree and a set of rules that make up the theory used by the system to autonomously classify the users interacting with IDL. Fig. 5 illustrates the IDL Learning Server which, when a user connects to IDL, consults the available rules and compares them to the set of characteristics extracted from the log file. Fig. 6 shows the scheme of the IDL Learning Server, which is based on C4.5/C4.5RULES [23]. It has been customized to work in batch mode to infer the classification theory from the log files of the set of users whose

[Fig. 5. IDL Learning Server: a user log is passed to the Learning Server, which consults the rule set and returns the user class.]

[Fig. 6. The scheme of IDL Learning Server: the training set (log1, log2, ..., logn) and the Attribute Description File are given to C4.5/C4.5Rules, which produces the rule set.]

interactions are selected to train the system. Furthermore, we are currently investigating the possibility of using incremental learning systems [13], which avoid the drawback of restarting the learning process from scratch each time new log examples become available. A preliminary experiment on the classification of IDL users consisted in collecting 500 examples of interactions and generating from them a training set for the system component C4.5. As stated above, we identified three fundamental classes of users, namely Novice, Expert and Teacher. Each log file is used to derive the values taken by the attributes that describe user interactions. We considered 126 attributes. Therefore, each training example consists of 126 values and is labeled with one of the three classes Novice, Expert, Teacher. The details of the classification experiment are given in [14] and are not reported here for lack of space. The system component C4.5RULES examines the decision tree produced by C4.5 (depicted in Fig. 7) and generates a set of production rules (Fig. 8) of the form L → R, where the left-hand side L is a conjunction of boolean tests based on the attributes, while the right-hand side R denotes a class. One of the classes is also designated as the default, in order to classify those examples for which no rule's left-hand side is satisfied. Rule generation is done with the aim of improving the comprehensibility of the induced results.
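The way such a rule set could be consulted at classification time can be sketched as follows. Each rule pairs a conjunction of threshold tests (the left-hand side) with a class (the right-hand side), and a default class is returned when no left-hand side is satisfied. The two Expert rules mirror tests that appear in the printed rule output; the attribute identifiers, the choice of `Novice` as default class, the first-match evaluation order, and the treatment of missing attributes as 0.0 are assumptions of this sketch.

```python
import operator

# Comparison operators allowed in a rule test.
OPS = {">": operator.gt, "<=": operator.le}

# Each rule: (list of (attribute, operator, threshold) tests, class).
# Attribute identifiers are hypothetical shorthand for the printed names.
RULES = [
    ([("avg_daily_connection", ">", 0.09),
      ("freq_bad_query_Springer", ">", 0.5)], "Expert"),
    ([("avg_daily_connection", ">", 0.09),
      ("freq_bad_query_Pami", ">", 0.5)], "Expert"),
]

# Assumed default; the source does not state which class is the default.
DEFAULT_CLASS = "Novice"


def classify(features, rules=RULES, default=DEFAULT_CLASS):
    """Return the class of the first rule whose tests all hold, else the default."""
    for tests, label in rules:
        # Missing attributes are treated as 0.0 (an assumption of this sketch).
        if all(OPS[op](features.get(attr, 0.0), threshold)
               for attr, op, threshold in tests):
            return label
    return default


user_class = classify({"avg_daily_connection": 0.2,
                       "freq_bad_query_Springer": 0.8})
```

Keeping the rules as data rather than code means a newly induced rule set can replace the old one without changing the classifier itself.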
Indeed, in our experiment, C4.5RULES

C4.5 [release 5] decision tree generator Sun Jun 2 16:04:09 1997
------------------------------------------------------------------
Read 500 cases (126 attributes) from DF.data
Decision Tree:
Average daily connection 0.09 :
| Freq of BadQuery on class Springer of DB AI_in_DL > 0.5 : Expert (122.0)
| Freq of BadQuery on class Springer of DB AI_in_DL 0.09
Freq of BadQuery on class Springer of DB AI_in_DL > 0.5
-> class Expert [98.9%]
Average daily connection > 0.09
Freq of BadQuery on class Pami of DB AI_in_DL > 0.5
-> class Expert [98.8%]
Average daily connection > 0.09
Freq of BadQuery on class Pami of DB AI_in_DL