An Overview of Practical Applications of Information Filtering1

Alan F. Smeaton, School of Computer Applications, Dublin City University, Dublin 9, IRELAND. [email protected]
Humphrey Sorensen, Computer Science Department, University College, Cork, IRELAND. [email protected]

1. Introduction

Information filtering is a field of study which has undergone a recent upsurge of interest, driven primarily by the increasing volumes of electronic information arising on the Internet and on intranets. The main concerns of information filtering (IF) lie with the routing of transient electronic documents to relevant users (and the filtering out of irrelevant data), the prioritising of such information, document categorisation and document sharing among collaborating users. Information filtering systems operate on a dynamic stream of information and, while research into IF techniques is itself relatively recent, the techniques employed in operational systems are not necessarily so. This is because IF has several important precedents. Specifically, it has been acknowledged that the longer-established field of information retrieval (IR) has much to offer in terms of strategies, representations and evaluation techniques employable within IF. Also, independently developed techniques of learning, language processing and user modelling are finding use within IF, as indeed they are within IR. Thus, IF developers have not begun from a standing start, and many practical examples of IF software systems are currently in existence. In the workshop on Practical Applications of Information Filtering held as part of the First International Workshop on Practical Aspects of Knowledge Management (PAKM) in Basel, Switzerland (October 1996), a collection of papers was presented describing some of these IF systems. While the techniques employed for user and document representation differ from system to system, while the architectures of the systems may differ, and while the application domains may differ, the systems described herein offer collective proof that information filtering has reached a level of considerable practical applicability.
In the next section of this paper we present a brief summary of each system presented, and in section 3 we draw together some overall conclusions.

2. The IF Systems

The workshop was opened by Andreas Goppold who, in his paper “Data Processing Infrastructures for Knowledge Filtering: The TLSI Approach”, examined the problems faced by knowledge workers in accessing information, particularly on the Internet and World Wide Web. These problems relate specifically to the time devoted to access, the structure (or lack thereof) of the stored knowledge versus the ease of requirement specification, and the difficulty of content extraction from stored data. The paper briefly examined some current solution strategies to these problems, before moving on to the TLSI approach. This approach is based on a virtual machine (VM) model similar to the Java machine. More importantly, it utilises many existing approaches and utilities in producing a working filter. The paper went on to investigate and evaluate the use of such utilities - many of which have been disregarded for filtering applications - for tasks such as string and language processing, pattern matching, data visualisation, process monitoring and backtracking, all of which Goppold argues are of central relevance to the task at hand.

1 Presented as part of the First International Workshop on Practical Aspects of Knowledge Management, Basel, Switzerland, October 30-31 1996.

In “Adaptive Filtering of Electronic Mail”, Joachim Sodeberg and Bernard Merialdo described a component of the MISTRAL system used for mail filtering. The filtering involves prioritising incoming mail, so that important messages can be flagged for the user's benefit. The technique used is Latent Semantic Indexing (LSI), which aims to match concepts between text documents (rather than matching keywords directly) by statistically deriving abstract concepts from term occurrences. The LSI approach was described and experimental evaluations were given. Two distinct experiments were covered, differing in the mode of learning through user feedback. The first experiment showed that with 'active' learning - where the user profile is continuously updated according to user evaluation of filtered documents - the system achieves a 10% improvement over random filtering for a typical set of users. The second experiment showed that, once a useful concept space is built from a user profile, learning at more discrete intervals does not greatly reduce system accuracy.

“Personalized Information Filtering” by Albert Schappert and Jurgen Kleinhans covered the InfoSphere system and InfoBricks, developed at Siemens AG.
The former is an in-house personal filter which can be applied to mail and news filtering, as well as having a number of other potential uses. One of the strengths of the Siemens approach - as embodied in the use of InfoBricks - is that a collection of simple software components can be combined for easy construction of new applications and for modification or extension of existing systems. Architectural clarity, as opposed to innovative filtering or retrieval methods, is thus an overriding factor. For example, the vector space (VS) approach to filtering and retrieval is adopted, using weighted keywords, required/reject markers, regular expressions and source information. Similarly, simple relevance feedback techniques are employed. It is, however, in the use of InfoBrick components to support multiple information sources, processing strategies, presentation styles and information transmission modes that the practical benefits become apparent.

In “Information Filtering and Information Retrieval in Engineering - The PRISE Project”, Christiane Foertsch describes the PRISE project, which aims to assist automotive engineers in the acquisition of knowledge contributing to design strategies. As such, it involves information retrieval (of existing design documents) and filtering (of ongoing related design issues). The requirements are that multiple sources and types of document be accessible, that there be low cognitive overhead for the engineers involved, and that any access-rights policies be enforced. The author describes sophisticated user models and document models to capture the complex information needs and document access mechanisms pertinent within an engineering environment. A novel query model based on the use of query trees is described, together with a convincing case for its applicability to this domain.
Finally, the user interface, which directly supports the user models, document models and query models mentioned above, demonstrates the practicality of the PRISE system.

The emphasis within the INFOrmer filtering project, as described in “Personal Profiling with the INFOrmer Filtering Agent” by Humphrey Sorensen, Adrian O'Riordan and Colm O'Riordan, is on higher accuracy in modelling an individual's interests within a user profile and in determining the relevance of incoming documents. The authors believe that keyword-based strategies are too error-prone, while knowledge-based or programmed methods involve too much cognitive overhead. They thus employ a semantic network representation of the user profile and incoming documents, which emphasises phrases over keywords. Basic linguistic processing is employed, though more elaborate natural language techniques are rejected as unproven for filtering. The paper covers issues of profile/document representation with semantic networks, relevance rating of documents and network updating due to relevance feedback. User interface issues using the WWW are discussed, as is the use of the Usenet domain for system evaluation.

“An Integrated System for Filtering News and Managing Distributed Data” by Gianni Amati, Daniela D'Aloisi, Vittorio Giannini and Flavio Ubaldini describes two related systems for the capture of user interests and their use for information filtering. The first system, ProFile, is used for filtering Usenet News and utilises a generalised probabilistic model, with the user profile being a vector of positively/negatively weighted terms extracted from relevance assessments performed by the user on training data. The system embodies learning via relevance feedback. The second system, InfoAgent, embodies an agent-oriented approach to retrieving data from distributed and heterogeneous archives and repositories. The paper describes both systems, emphasising that they provide a single point of access to multiple information sources, while being distinct in several important respects - such as the learning of retrieval strategies. The representation and learning models of ProFile are formally defined, as are the query and surfing modes employed in InfoAgent.

In “Filtering News and WWW Pages with The BORGES Information Filtering Tool”, Alan F.
Smeaton describes the BORGES project, a multi-partner filtering project aimed at users within a library setting, the application domain being Usenet News and WWW pages. The paper described the experiences of one of the project partners in enhancing an earlier version of the system. BORGES utilises the SMART retrieval engine, thus employing weighted keywords and phrases, together with inverted file structures, for document representation. The author described several extensions which he, as a researcher, considered worthwhile for the original system: relevance feedback for term re-weighting; manual assignment of term weights by users; Boolean profile specification; phrases and/or adjacency constraints; and user-controlled or user-transparent profile expansion. He found, however, that users were more concerned with usable features than with leading-edge technology. The compromise extension - profile term/phrase specification with term disambiguation - is described, and use of the system is assessed.

“Automatic Attribution of ACP 127 Messages” by Diana Marras, Laurent Enel and Bernard Merialdo described mechanisms for filtering in the Sachem message-handling system used by the French Navy. In this system, the ACP 127 message format for military electronic mail is utilised, with an MCA field specifying which among 17 categories of users should receive a message. The filtering system was to automate the allocation of the MCA field based on message content, deal with incomplete message classifications, and possibly validate existing MCA settings. Two filtering methods were tested. The vector space method, while reasonably accurate, was computationally intensive. The Latent Semantic Indexing (LSI) method, used earlier by Sodeberg and Merialdo, reduced the concept space and the computation involved, while yielding accuracy of 70% for category classification and 60% for MCA classification.
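As a rough illustration of the kind of vector space method tested for MCA assignment, message routing can be sketched as a nearest-centroid classifier over term-frequency vectors. This is a sketch only, not the Sachem implementation: the categories, training messages and test message below are all invented.

```python
import math
from collections import Counter

def vec(text):
    """Raw term-frequency vector for a message."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroids(training):
    """One summed term vector per category, built from labelled messages."""
    out = {}
    for category, text in training:
        out.setdefault(category, Counter()).update(vec(text))
    return out

def classify(cents, text):
    """Route a message to the category whose centroid it most resembles."""
    v = vec(text)
    return max(cents, key=lambda c: cosine(cents[c], v))

# Invented two-category training set standing in for MCA user categories.
training = [
    ("operations", "patrol route course speed rendezvous"),
    ("operations", "course change patrol sector"),
    ("logistics", "fuel supply resupply stores delivery"),
    ("logistics", "stores fuel delivery schedule"),
]
cents = centroids(training)
print(classify(cents, "request fuel resupply at next delivery"))  # → logistics
```

Scoring every message against every category centroid in full term space is exactly the computational cost the paper notes; LSI reduces it by comparing in a much smaller concept space instead.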
“Hierarchical News Filtering” by Stuart Keane, Viranga Ratnaike and Ross Wilkinson emphasised a user model for information filtering in which a user's information interests are viewed as distinct concepts competing for attention. The model is thus perceived as having potential in developing an electronic newspaper, with a fixed number of articles being chosen from a predefined subject set. The paper defines a three-level hierarchy of concepts, concept indicators and terms (words/phrases), with weights acting as association measures. From this, a neural network model is constructed and used to evaluate the appropriateness of incoming articles (once again from Usenet) to the concepts of interest to a user. The paper covers issues of profile network construction and update, together with message evaluation. This system - with and without learning - is compared with a flat neural model in which no distinct concepts exist. It is found that the hierarchical model outperforms the flat one with high consistency, though the mode of learning employed produced no significant improvements.

“Application of Custom Computing Hardware to Internet News Filtering” by Bernard Gunther and George Milne outlined a novel technique employing custom hardware, using field-programmable gate arrays (FPGAs), to cope with the computationally intensive processes of word matching and score accumulation arising in text filtering applications. Using programmable components also permits easy modification of the filter engine through user feedback. The authors describe a SPACE search machine built upon these principles. A user profile is specified as a hierarchy (up to four levels) of terms, organised by term significance. The specifics of the evaluation algorithm, both for individual profile search terms and for a document as a whole, are given, and the circuits for word composition are detailed. The performance of the system is impressive compared to pure software implementations, and the authors identify mechanisms for overcoming communication bottlenecks.
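The word-matching and score-accumulation step that the SPACE machine realises in hardware can be approximated in software as follows. The profile terms and level weights here are invented for illustration; the real system evaluates such profiles in FPGA circuits rather than in a word-by-word software loop.

```python
# Profile terms organised by significance level, most significant first,
# mirroring the up-to-four-level hierarchy described for the SPACE machine.
# The terms and level weights below are invented for illustration.
PROFILE = [
    (8, {"filtering"}),            # level 1: most significant terms
    (4, {"usenet", "news"}),       # level 2
    (2, {"profile", "feedback"}),  # level 3
    (1, {"internet"}),             # level 4: least significant
]

def score_document(words):
    """Accumulate a relevance score as a stream of words is scanned:
    each occurrence of a profile term adds that term's level weight."""
    score = 0
    for w in words:
        for weight, terms in PROFILE:
            if w in terms:
                score += weight
                break
    return score

doc = "custom hardware for usenet news filtering with user feedback".split()
print(score_document(doc))  # 4 + 4 + 8 + 2 = 18
```

In the FPGA version, the per-term comparisons run in parallel as a document streams through the device, which is what makes the approach attractive for high-volume news feeds.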

3. Summary

The systems summarised in this paper encompass a broad spectrum of current information filtering work, with an emphasis on its practical applicability to end-users. Part of the call for submissions to the PAIF workshop emphasised the scope to concentrate on practical applications of filtering technology. From the papers, several general themes may be identified.

First, all but one of the applications involve text filtering - a reflection of the fact that, at present, text has proved to be the medium most amenable to content-based analysis. A second feature is that many of the applications relate to Usenet News or electronic mail, clearly because these represent the primary dynamic information sources for academic researchers. Unfortunately, because the signal-to-noise ratio tends to be very low for these sources - a fact acknowledged by Smeaton - the results tend to be diminished. All the systems reported here are mono-lingual and all ignore the structure of the documents being filtered, whether within-document structure or document-document links. This is generally true of IR technology as well. Despite the multiplicity of text filtering techniques described here, the accuracy attainable by filtering technology is still quite limited, though filtering evaluation is almost exclusively in terms of precision, which is difficult to estimate accurately outside the environment of test collections.

Another feature of the information filtering systems described is that many attempt to incorporate existing technologies - from the text retrieval domain, general text processing, etc. - rather than build specific filtering architectures and attempt to find optimal filtering techniques. This, of course, is the primary reason that practical information filtering software has come into the public domain in such a short time, though some information filtering research also draws influences from other areas besides IR, such as AI, NLP and databases.
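The evaluation point can be made concrete: precision is computable from the filtered output alone once it is judged for relevance, whereas recall requires relevance judgements over the entire incoming stream, which in practice only a test collection provides. A minimal sketch, with invented document identifiers:

```python
def precision_recall(delivered, relevant):
    """Precision: fraction of delivered documents that are relevant.
    Recall: fraction of all relevant documents that were delivered.
    Computing recall requires knowing `relevant` over the whole stream,
    which outside a test collection is rarely available."""
    delivered, relevant = set(delivered), set(relevant)
    hits = len(delivered & relevant)
    precision = hits / len(delivered) if delivered else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented example: 4 documents delivered, 6 relevant in the full stream.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d2", "d5", "d6", "d7", "d8"])
print(p, r)  # 0.5 0.333...
```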


There are, of course, several novel technologies also described herein, including novel filtering models, special-purpose architectures and interactive multimedia filters. Most of the systems described suffer from the “ranked list syndrome”: they do not structure their filtered output in any way other than as a ranked list, as is done in most IR systems, and most have a WWW interface. It is interesting to observe the influence that the WWW has had on IF research, and these systems are testament to the fact that research in filtering is a broadening field, with valuable contributions coming from many areas of computing. If we were to make a plea for where we believe IF research should head, we would summarise it as follows:

• Sources of information to be filtered … concentrate on cross-lingual information and on sources where the “documents” are structured or linked in some way, or have a natural “stream”, rather than always defaulting to the document metaphor.
• Address “why” people want IF … make progress on evaluation, moving beyond precision-recall.
• How should IF be done … matching should be based on a complex and dynamic user model in which the profile is only part of the input; this will cause efficiency to become an issue in IF.
• The output of the IF process should consist of more than a ranked list, by structuring and visualising it in some way.

While we recognise that these are desiderata, we believe they are well-motivated and achievable for information filtering systems.

References

The proceedings of the Workshop on Practical Applications of Information Filtering will be published during 1997.
