Conceptual Relevance Feedback for the Retrieval of Business Cases

Catherine Baudin
Price Waterhouse Technology Centre
[email protected]

Scott Waterman
Price Waterhouse Technology Centre
[email protected]

Adriana Mutolo
Price Waterhouse Global Core Competencies
[email protected]

Abstract

Sharing business experience, such as client engagements, proposals or best practices, is an important part of the knowledge management task within large consulting organizations. The goal of a conceptual indexing language such as the International Business Language™ is to provide a common language for the identification of business improvement ideas. Such domain languages are based on an ontology of business concepts designed to highlight similarities between companies so as to facilitate the comparison of different business practices. One problem with using such a language is that it requires extensive training to map a query from plain English to the conceptual model. We describe Conceptual Relevance Feedback, a technique to help users retrieve and compare business cases. Rather than trying to translate a user query into the language from scratch, conceptual relevance feedback starts from a set of query terms and from at least one relevant case. It extends the search by finding those concepts from the business ontology in the descriptions of the relevant documents that best map to the terms used in the query. This focuses the retrieval of similar documents on those that are relevant in the context of a specific information need. Conceptual relevance feedback combines term categorization techniques based on term frequencies with conceptual retrieval based on high-level descriptions of cases in the business domain.

1. Introduction

Sharing business experience, such as client engagements, proposals or best practices, is an important part of the knowledge management task within a consulting firm. In the business domain, cases are usually stories about how a corporation reengineered one or several business processes, what methodology or technology was key to the success (or failure) of the operation, and what criteria were used to measure the results. Managing and sharing this type of information certainly requires filing electronic versions of these stories, but it also requires a high-level indexing scheme to highlight important aspects of the cases and their similarities. Conceptual languages, such as the International Business Language™ (IBL) (Emerson, 1996), provide a common language for the identification of business improvement ideas. For instance, the IBL language breaks business processes into activities that are shared by all companies, regardless of industry, thus enabling comparison between companies that may be structured or operated differently.

Although such a conceptual language is a powerful tool to facilitate information sharing among professionals in a large organization such as Price Waterhouse, its effective use for the retrieval of indexed business cases requires extensive training. In particular, it requires users to understand the structure of the language and, most of all, to be able to formulate their information needs using the controlled vocabulary of the large ontology (more than 3000 categories in the IBL language) that provides the common ground for describing business information.

There are two issues associated with using a domain ontology to index documents. The first is that the content of each document added to the business case repository must be described using this language. This process is currently performed manually by information specialists in charge of building and maintaining this shared information corpus. Although we are looking into categorization techniques to facilitate the indexing task, this paper does not address this problem. Instead, we focus on the information access problem and on how to effectively use a powerful domain language to retrieve cases describing complex business reengineering problems.

One way to facilitate the formulation of a user's information request is to let him/her browse the domain ontology and interactively select terms. However, this is a difficult process for the average user, given that a detailed ontology contains many hundreds of terms and average queries require selecting and combining several concepts. Another way of facilitating information access would be to automatically analyze the text of the query and translate it into the conceptual language. Although this is an interesting prospect, such direct mapping is hard to implement. Instead, we take a midway approach. Rather than trying to translate a user query into the language from scratch, we use Conceptual Relevance Feedback, a method that starts from a set of query terms and from at least one relevant case, and then finds documents similar to the relevant case based on its conceptual description. The method has three steps. First, the user finds one or more relevant documents using full-text search. The system then determines what part of the relevant documents' conceptual descriptions is most related to the words used in the query. Finally, it extends the search using the conceptual description of the relevant documents to retrieve similar cases. This technique combines conventional information retrieval technology, term categorization techniques based on word frequencies, and conceptual retrieval of cases based on a high-level domain language.

Section 2 briefly describes the International Business Language™ and how it can be used to describe business cases. Section 3 discusses an example showing how conceptual relevance feedback is used to search a corporate database containing more than 6000 business cases. Section 4 describes a term categorization method used to create a bridge between the words in a query and the terms in the business ontology. Section 5 discusses key issues and points to directions for future research.

2. Using the IBL Business Ontology to Describe Business Cases: Examples

The IBL domain language is based on the "Value Chain" concept (Porter, 1990). The concepts of this language are defined in a business ontology that groups the business vocabulary into different classes of knowledge: business processes (what type of activity is being referred to in a given business case), enablers (what methodology or technology was used to reengineer the process), and measures (what criteria were used to measure the results of a reengineering process). For instance, consider the following summarized business stories:

Case 1: "Company X, a petroleum company, had a problem with its purchasing operation and implemented a new order tracking system using an Intranet. This resulted in a 60% improvement in order processing time."

Case 2: "Company Y, a high-tech company, implemented a new electronic requisition system and found that it reduced the purchase cycle time by 50%."

Although these cases use different vocabulary (purchasing system vs. requisition system) and refer to companies in different industries, they are similar in that they both refer to the same business process (the Orders and Purchasing process) and in that they used similar enablers to reengineer this process.
One way to describe these cases using the IBL ontology is to index them with concepts related to the business process "Process Orders" (PO), and with the key technologies used to reengineer these practices: the "EDI (Electronic Data Interchange)" and "Centralization/Decentralization" enablers. Similarly, the criteria used to measure the results in each case fall under the "Purchase Orders (Volume/Frequency)" measure. This leads to the following conceptual description common to both cases:

Processes: "PO-Order Materials/Supplies", "PO-Receive Materials/Supplies"
Enablers: "S-Centralization/Decentralization", "T-EDI (Electronic Data Interchange)"
Measures: "PO-Purchase Orders (Volume/Frequency)"
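For concreteness, such a description can be held in a small record keyed by the IBL knowledge classes. The sketch below is illustrative only (the actual repository is a Lotus Notes database whose schema is not described at this level); the class and field names are ours, while the concept labels are those of the example above.

```python
from dataclasses import dataclass, field

@dataclass
class CaseDescription:
    """IBL conceptual index of one business case (hypothetical schema)."""
    processes: set[str] = field(default_factory=set)  # business processes
    enablers: set[str] = field(default_factory=set)   # reengineering enablers
    measures: set[str] = field(default_factory=set)   # result measures

# The description shared by Case 1 and Case 2:
purchasing_case = CaseDescription(
    processes={"PO-Order Materials/Supplies", "PO-Receive Materials/Supplies"},
    enablers={"S-Centralization/Decentralization",
              "T-EDI (Electronic Data Interchange)"},
    measures={"PO-Purchase Orders (Volume/Frequency)"},
)
```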

Starting from such a description is key to retrieving similar cases that might be missed by a search based on the words in a query. However, most documents in our collection of business cases are rarely as concise and focused as the two examples presented above. They can be large documents describing how several business processes were reengineered over a period of several months. In this case, a document refers to more than one subject and maps to different parts of the business ontology. Therefore, the similarity between different conceptual descriptions must be considered in the context of a specific information need. With this perspective, matching a document's conceptual description to other indexed documents is first a matter of defining what portion of the description in the relevant document really relates to the current query. The next section presents an overview of Conceptual Relevance Feedback and shows an example.

3. Conceptual Relevance Feedback

In the Information Retrieval literature, Relevance Feedback is a technique by which a user first searches a set of documents using a conventional search method, usually full-text search, then provides feedback to the system by pointing to a relevant document. The retrieval system then extends the search by finding documents similar to the relevant document. In these systems, the document representation is a set of words from the text (Rocchio, 1971). Conceptual Relevance Feedback operates on the same principle; however, instead of using the words in the text to extend the search, the system uses (parts of) the conceptual description associated with the relevant document. The resulting retrieval system has three components: a corpus of indexed documents that can be accessed by professionals in different locations (in our case at Price Waterhouse, these documents are grouped in a Lotus Notes database); a domain ontology representing the conceptual language (in our case, the IBL ontology); and a dictionary relating important domain terms to the domain ontology, automatically generated from a training set of indexed documents (see section 4).

The following scenario illustrates the Conceptual Relevance Feedback process.

Original request: "Our client, company X, is looking for case examples where an Intranet has been implemented and used successfully to facilitate electronic requisitioning. They would like to learn from and benchmark other companies who have already been through this process and have realized the benefits."

1. Term Selection: The user selects a set of terms to search a corporate repository of proposals, best practices, benchmarking studies and client engagements. These documents are grouped in a shared Lotus Notes database that can be searched using a full-text search engine. In this case, the query terms "electronic requisitioning" and "Intranet" are submitted to the full-text search engine. No document is retrieved. The user then tries "electronic requisitioning" alone and gets 2 documents.

2. User Feedback: One of these hits seems relevant: it describes a project whose goal was to redesign all business processes in a given company. Part of this document describes the redesign of the general purchasing process and tells how this company used a networking system to implement an electronic requisition system. The other part describes the implementation of new financing and logistics modules. The conceptual description of this document in the IBL ontology is shown in Figure 1.

Industry:
  MN-Mfg (Food/Beverage)
Processes:
  FM-Process Accounts Payable
  PO-Manage Vendors/Contractor Relationships [*]
  PO-Order Materials/Supplies [*]
  PO-Procure Materials/Supplies
  PO-Qualify & Select Vendors/Contractors
  LD-Receive Materials/Supplies
Enablers:
  C-Empowerment
  C-Training
  S-Centralization/Decentralization [*]
  T-EDI (Electronic Data Interchange) [*]
  T-SAP
Measures:
  PO-Headcount
  PO-Process Steps (Number)
  PO-Purchase Order (Volume/Frequency)
  PO-Purchasing (Cost)
  PO-Suppliers (Number)

Figure 1: IBL description of a process reengineering case.

3. Contextualization: The question is: what part of the retrieved conceptual description is related to the user query? More specifically in this example, what part of the relevant conceptual description is most relevant to the query term "electronic requisitioning"? At this point, the system uses a dictionary of categorized domain terms. This dictionary records relations between the domain terms and the concepts in the IBL ontology. Each term is associated with a weight that indicates how important this term is to a particular IBL concept. For instance, the term "electronic requisitioning" is associated with a list of IBL business processes, enablers and measures ordered by decreasing weight. One "right" generalization of the query is obtained by intersecting the concepts related to the query terms with the IBL concepts in the relevant document's description. In this case, it indicates that only two processes and two enablers (marked [*] in Figure 1) are related to the query term.

4. Reformulation: The system looks for cases that match the following description:

Processes:
  PO-Manage Vendors/Contractor Relationships (weight 13)
  PO-Order Materials/Supplies (weight 24)
and Enablers:
  S-Centralization/Decentralization (weight 13)
  T-EDI (Electronic Data Interchange) (weight 11)
and Terms: electronic requisition
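The contextualization step can be sketched as below, assuming the term/category dictionary of section 4 is available as a mapping from each term to its weighted IBL concepts. Function and variable names are illustrative rather than the deployed system's API; the dictionary entry mirrors the weights shown in the reformulated query.

```python
def contextualize(query_terms, doc_concepts, term_category_dict):
    """Keep only those concepts from the relevant document's IBL
    description that the dictionary also relates to a query term,
    accumulating the term/concept weights."""
    focused = {}  # concept -> accumulated weight
    for term in query_terms:
        for concept, weight in term_category_dict.get(term, {}).items():
            if concept in doc_concepts:
                focused[concept] = focused.get(concept, 0) + weight
    return focused

# Illustrative dictionary entry for the scenario's query term:
term_category_dict = {
    "electronic requisitioning": {
        "PO-Manage Vendors/Contractor Relationships": 13,
        "PO-Order Materials/Supplies": 24,
        "S-Centralization/Decentralization": 13,
        "T-EDI (Electronic Data Interchange)": 11,
    },
}
# Abridged conceptual description of the relevant document (Figure 1):
doc_concepts = {
    "FM-Process Accounts Payable", "PO-Manage Vendors/Contractor Relationships",
    "PO-Order Materials/Supplies", "PO-Procure Materials/Supplies",
    "C-Empowerment", "C-Training", "S-Centralization/Decentralization",
    "T-EDI (Electronic Data Interchange)", "T-SAP", "PO-Headcount",
}
print(contextualize(["electronic requisitioning"], doc_concepts, term_category_dict))
# -> exactly the two processes and two enablers selected in the scenario
```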

Submitting a new query based on this description to the retrieval engine retrieves documents that are described with at least one of the two selected processes (with a preference for those that have both processes, or for those associated with the process that has the highest weight) and with at least one of the selected enablers (with a preference for the cases that have both enablers). The result of this new retrieval is 41 documents, ordered by decreasing degree of similarity with the representation associated with the relevant document. Of these, the first 10 documents are client engagements and benchmarking studies describing electronic purchasing and procurement systems that use various networking technologies, and they are thus relevant. The conceptual representation enables users to access relevant cases even when those cases contain none of the query terms.
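The ranking formula is not spelled out at this level of detail, but one plausible reading of the preference scheme just described is a weighted-overlap score: a document must match at least one selected process and one selected enabler, and it ranks higher the more matching concepts it has and the heavier they are. The sketch below is an assumption along those lines, not the production scoring function.

```python
def score(doc_concepts, wanted_processes, wanted_enablers):
    """Return None if the document misses the mandatory parts of the
    reformulated query; otherwise sum the weights of matched concepts."""
    p_hits = [w for c, w in wanted_processes.items() if c in doc_concepts]
    e_hits = [w for c, w in wanted_enablers.items() if c in doc_concepts]
    if not p_hits or not e_hits:
        return None  # fails the mandatory part of the query
    return sum(p_hits) + sum(e_hits)

wanted_processes = {"PO-Manage Vendors/Contractor Relationships": 13,
                    "PO-Order Materials/Supplies": 24}
wanted_enablers = {"S-Centralization/Decentralization": 13,
                   "T-EDI (Electronic Data Interchange)": 11}
# A document carrying both processes and both enablers scores 13+24+13+11 = 61
# and ranks ahead of one matching only "PO-Order Materials/Supplies" plus
# one enabler (e.g. 24 + 11 = 35).
```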

The key phase in this scenario is selecting the part of the case description that relates to the user query. This selection is done using a dictionary containing the main domain terms and their links to the domain ontology. The next section describes how this dictionary is automatically generated using a training sample of categorized documents.

4. Term Categorization: Bridging the Gap Between Words and the Domain Ontology

The goal of the Term Categorizer (called CONIX, for conceptual indexer) is to learn a set of descriptive terms for each concept in the domain ontology, using a corpus of pre-indexed documents as training examples. The term/category relation dictionary generated with this method is currently used in a stand-alone system to help users find IBL concepts in our database of business cases. For instance, for the term "knowledge management", the output is a ranked list of categories in which this term occurs significantly more frequently than by chance. In this example, the system shows that the term "knowledge management" is associated with the business process "Develop & Maintain Knowledge Management Process" and with the concepts "Knowledge Sharing" and "Knowledge Creation" (see Figure 2).

Figure 2. A set of categories related to the term "knowledge management"

The stars shown next to the retrieved categories in Figure 2 give an idea of the degree of relation between the query term and the category (four stars is very related, one star is weakly related).

The term/category dictionary is trained on a set of pre-indexed documents. The output is a table of categorized terms, with an entry for every pair consisting of a term, the category it was found in, and a number of weights indicating the importance of the relationship between the term and the category. In order to be more specific than a word-by-word dictionary could be, the terms used here are phrasal terms as well as significant single words. Phrasal terms are used because multi-word expressions relate more specifically to a given topic in textual material than single terms do. For instance, the individual words "world", "wide" or "web" do not by themselves bring to mind the Internet, HTTP, electronic communication or the like, but the phrase "World Wide Web" does.

The creation of the term/category dictionary has two phases. In the first phase, the documents are processed through a technical term extractor to identify candidate groups of words that might constitute a term. The second phase takes as input the set of identified terms and computes a measure of how discriminating each term is for each concept of the domain ontology. Finally, the terms which have a very low frequency (frequency one), or whose degree of relation to the concepts in the ontology is too low, are eliminated, while the remaining terms are integrated into the term/category dictionary. Sections a and b describe these two steps; section c presents a rough evaluation of the term categorization component.

a. Technical Term Extraction

Recent years have seen various approaches to extracting word collocations and technical terms from a document. Some of these approaches are based on statistical tests, particularly mutual information (Hindle & Rooth, 1993), while others mostly rely on the detection of syntactic patterns in a text (Justeson & Katz, 1995). Some hybrid methods have also been tried (Chen & Chen, 1994). For this application, we use a simple two-step hybrid method. The first step overgenerates considerably, while the second step filters spurious terms based on the strength of their relationship to categories in the domain ontology. This guarantees that the selected phrases are significant to the category retrieval task.

To find good descriptive terms for the domain ontology concepts, we first use a text analysis approach to detect possible groups of words based on syntactic patterns. The text is first part-of-speech tagged (Brill, 1993). Then a simple set of NP-like patterns is matched against the tags (as in Justeson and Katz, 1995). In addition, we also use a set of orthographic patterns to recognize phrasal terms. For instance, company names, product names, acronyms and the like are usually represented in a text as capitalized phrases. This capitalization serves to identify potential index terms that might not be recognized by the POS tag patterns. Each of the resulting phrases is then segmented into all possible right-branching subphrases. This is an overgenerating, expedient approach, which avoids the complex task of analyzing the internal NP structure (McDonald, 1982; Levi, 1978) but which has proved sufficient for the task, thanks to the second stage of filtering.
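For illustration, the right-branching segmentation amounts to enumerating the suffixes of each candidate phrase; the sketch below reconstructs only that step (POS tagging and NP pattern matching happen upstream, via Brill's tagger and Justeson & Katz-style patterns, and are not shown). The function name and example are ours.

```python
def right_branching_subphrases(phrase: str) -> list[str]:
    """Enumerate a phrase and all of its right-branching subphrases
    (i.e., its word suffixes), deliberately overgenerating candidates."""
    words = phrase.split()
    return [" ".join(words[i:]) for i in range(len(words))]

print(right_branching_subphrases("corporate knowledge management process"))
# ['corporate knowledge management process', 'knowledge management process',
#  'management process', 'process']
```

Spurious candidates produced this way are cheap, because the categorization stage described next discards any that fail to associate strongly with a category.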

b. Term Categorization

The second stage of building the index accomplishes two tasks: it computes the degree of relevance each term has to the category set, and it allows us to filter out those terms which are not highly relevant to the categories. For each pair consisting of a potential index term and a category, we compute the degree to which the association is greater than could be expected by chance. The association implies that the term and the category are conceptually related.

Three different statistics are used to test the association between a term T and a category C. The first is a Student's t-test, applied to the counts of documents in which T appears. Since the Student's t is a good test for small sample sizes, it is well suited to the extremely sparse distributions one finds in linguistic data. This t-test serves as a base cutoff, indicating whether the distribution of the term T in C differs significantly from the norm. Our cutoff corresponds to a 95% level of significance that the distribution is not uniform between C and ~C. We have found, however, that for low-frequency terms the Student's t does not perform well in ranking the results by the relevance of the association. For this task, we have found that the log-likelihood ratio, G², has better performance (see Dunning (1993) for a discussion of log-likelihood versus other χ²-distributed tests). We use two different versions of the log-likelihood ratio: one based on the count of documents in which T has occurred, Ld, and the other based on the count of occurrences of T itself, Lf.

The two statistics give complementary information about the use of T. The document count measure gives us a general picture of the distinct use of a word: if T is used in many or all of the documents in C, and in few or none outside of C, then we can clearly say T is relevant to C. However, if T appears in only a few documents of C, it is interesting to know whether it appears many times in those documents, or simply once or twice. This separation is related to work in other IR tasks dealing with document length normalization for term frequency (Singhal, Buckley and Mitra, 1996). In general, we trust the Ld statistic more: because a document about a single subject may be of varying length, containing the relevant vocabulary a number of times, the Lf statistic may be unfairly biased. However, in those cases where Ld is marginal or low and Lf remains high, we know that at least some documents in the relevant set used the term T much more than usual, an indication that it should be considered relevant to the category.
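For the document-count version Ld, Dunning's (1993) G² statistic can be computed from a 2x2 contingency table of document counts, as in the minimal sketch below; the function name and the example counts are ours, for illustration only.

```python
from math import log

def g2(k11, k12, k21, k22):
    """Log-likelihood ratio G^2 for a 2x2 table (Dunning, 1993):
    k11 = docs in C containing T,   k12 = docs outside C containing T,
    k21 = docs in C without T,      k22 = docs outside C without T.
    Larger values mean T's split between C and ~C is less likely chance."""
    def h(*ks):  # sum of k*ln(k), with the convention 0*ln(0) = 0
        return sum(k * log(k) for k in ks if k > 0)
    n = k11 + k12 + k21 + k22
    return 2 * (h(k11, k12, k21, k22) + h(n)
                - h(k11 + k12, k21 + k22)   # row totals: T present / absent
                - h(k11 + k21, k12 + k22))  # column totals: in C / not in C

# A term found in 40 of the 50 documents indexed under C but in only
# 60 of the 2950 remaining documents is strongly associated with C:
print(g2(40, 60, 10, 2890))  # large G^2
```

The Lf variant is the same computation applied to occurrence counts of T rather than document counts.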
c. Relevance Evaluations

We used a training set of 3000 documents from our best-practices database of business cases. Each document had been previously indexed using the IBL categories (a document may be associated with 10 to 15 concepts from the domain ontology). The resulting term/category dictionary, after relevance determination and filtering, contained 939,000 term/category pairs for 1200 concepts from the IBL language, averaging 783 relevant terms per category.

In order to test the adequacy of the computed relevance judgements, we tested a set of query results against human judgement. For each category retrieved by the system, each judge was asked to rate it as highly relevant, somewhat related, or not relevant at all. The graph displayed in Figure 3 compares these judgements with the computed relevance scores, averaged across all judges and queries. It shows that the average score for the highly relevant results is well above the score of the moderately relevant set.

[Figure 3 plot omitted: relevance score by judged category (Not-at-all, OK, High).]

Figure 3. Distributions of the scores of high and moderate relevance categories, for all queries combined.

Figure 4. Distribution of the difference in the means of high and moderate relevance categories, per query.

One can see from the figure that there is a large variation in the relevance scores across queries. Because of this large variation between the result sets for separate queries, it is important to separate the difference between highly relevant and moderately relevant categories for each query. Figure 4 shows a distribution of the difference in the means between the highly relevant scores and the moderate scores, as reported for each judged query. A high value here shows that the highly relevant categories for a query usually score well above those judged moderately relevant. These results show that the relevance functions we have chosen correspond quite well with human judgement of the relation between term and category. This result supports the validity of the selection of relevant concepts in the contextualization phase of conceptual relevance feedback.

5. Discussion and Future Work

Unlike Relevance Feedback techniques that rely on similarities between documents based on the words used in the text (Salton and Buckley, 1990; Rocchio, 1971), our method relies on abstract document representations based on a high-level language from a business ontology. To make this approach work, conceptual similarities between documents must be considered in the context of a specific query. To this end, we combined a term categorization technique with a conceptual retrieval technique that matches high-level document descriptions against documents in our corporate corpus of online business information. We have tested the ability of the system to find relationships between single technical terms and our domain ontology (see previous section), but we still need to conduct more tests to evaluate the quality of the documents retrieved by the conceptual relevance feedback mechanism.

Our concept retrieval method still needs to be refined. In particular, in the current version user queries are matched against the term/category dictionary using exact matches. For instance, in our example, "electronic requisition" was part of the training set of documents and was included in the dictionary, but the term "online requisitioning" was not. However, both terms "online" and "requisitioning" are in the dictionary, and both are associated with the "EDI (Electronic Data Interchange)" enabler, a relevant concept in the domain ontology for describing "electronic requisitioning" systems. In the future we will need to be able to segment user queries so as to determine what parts of the ontology are relevant to different segments.

Another critical area of improvement is the matching of conceptual document representations. In the example presented in section 3, the system builds a new query using the "relevant" part of a document's conceptual description. However, in our domain ontology many concepts are highly related. For instance, the concepts "Electronic Fund Transfer" and "Electronic Data Interchange" have many documents in common. Once the relevant terms have been identified, the query matcher should be able to extend the search by adding to the reformulated query those concepts from the domain ontology that are related to the selected concepts (Baudin et al., 1992).

We are also investigating a tighter integration between word-based document similarities and conceptual document similarities. Rissland and Daniels (1995) use a combination of case-based reasoning techniques and word-based relevance feedback to facilitate document retrieval. Our approach is somewhat complementary to theirs. Rissland first has users formulate their information needs using a conceptual language for case-based retrieval in a small collection of indexed cases; the content of the relevant documents retrieved is then used by the INQUERY IR engine (Allan et al., 1995) to retrieve similar cases, based on words from the relevant documents, in larger repositories of non-indexed documents. However, the problem with conceptual languages based on large domain ontologies is the formulation of the query as a combination of abstract concepts. In complex cases, particularly in the business domain, this is a knowledge representation task which requires extensive training.

Finally, we have also been experimenting with helping users navigate the domain ontology using the term/category dictionary generated by the CONIX term categorizer. Users can enter a technical term and the system points to the IBL concepts that are related to this term, based on the term's frequency in documents indexed with these concepts. We have also used this technique to access categories in a number of other analyst databases, which use different ontologies. This has been working well for single-term queries that map to one or two domain concepts, but the approach still needs to be adapted to deal with complex queries, where results from the category retrieval need to be combined.

6. Conclusion

We have presented a method to facilitate the use of detailed conceptual languages to describe the content of business documents. Our method uses term categorization techniques to focus the search for documents similar to a given case in the context of a given information need. The conceptual description of the target relevant document is used to focus the selection of relevant concepts in our domain ontology. Once the system circumscribes a region of the business ontology that relates to the user question, documents similar to a given case can be retrieved. Our repository of business cases currently contains over 6000 indexed cases shared through a Lotus Notes database. The approach we describe treats the conceptual language as an internal representation to be used by the retrieval engine rather than directly by the user. This has the advantage of shielding the user from the high-level concepts in the ontology.
While this has obvious benefits for the simplicity of use of the retrieval system, whether to hide the language from users or to educate them about it is a debatable issue that we are still investigating.

References

Allan, J., Ballesteros, L., Callan, J. P., Croft, W. B., and Lu, Z. 1995. "Recent Experiments with INQUERY." In Proceedings of the Fourth Text REtrieval Conference (TREC-4).

Baudin, C., Gevins, J., Baya, V., and Mabogunje, A. 1992. "Dedal: Using Domain Concepts to Index Engineering Design Information." In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society.

Brill, E. 1993. A Corpus-Based Approach to Language Learning. Ph.D. thesis.

Chen, K.-H., and Chen, H.-H. 1994. "Extracting Noun Phrases from Large-Scale Texts: A Hybrid Approach and Its Automatic Evaluation." In Proceedings of the 32nd Annual Meeting of the ACL, Las Cruces, New Mexico.

Church, K., and Dagan, I. 1994. "Termight: Identifying and Translating Technical Terminology." In Proceedings of the Fourth Conference on Applied Natural Language Processing, Stuttgart, Germany: Association for Computational Linguistics.

Dunning, T. 1993. "Accurate Methods for the Statistics of Surprise and Coincidence." Computational Linguistics, 19(1).

Emerson, J. C. 1996. "Price Waterhouse's Knowledge View: Setting the Stage for Serving Clients in the Year 2000." Emerson's Professional Services Review, March/April issue.

Justeson, J., and Katz, S. 1995. "Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text." Natural Language Engineering, 1(1).

Larkey, L. S., and Croft, W. B. 1996. "Combining Classifiers in Text Categorization." In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland.

Levi, J. N. 1978. The Syntax and Semantics of Complex Nominals. Academic Press, New York.

McDonald, D. B. 1982. "Understanding Noun Compounds." Technical Report CS-82-102, Carnegie Mellon University.

Porter, M. E. 1990. Competitive Advantage. The Free Press, Collier Macmillan Publishers, London.

Rissland, E. L., and Daniels, J. J. 1995. "Using CBR to Drive IR." In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95), 400-407. Montreal, Canada.

Rocchio, J. J. 1971. "Relevance Feedback in Information Retrieval." In The SMART Retrieval System: Experiments in Automatic Document Processing, chapter 14, pages 313-323. Prentice-Hall.

Salton, G., and Buckley, C. 1990. "Improving Retrieval Performance by Relevance Feedback." Journal of the American Society for Information Science, 41(4):288-297.

Singhal, A., Buckley, C., and Mitra, M. 1996. "Pivoted Document Length Normalization." In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland.

Su, K.-Y., Wu, M.-W., and Chang, J.-S. 1994. "A Corpus-Based Approach to Automatic Compound Extraction." In Proceedings of the 32nd Annual Meeting of the ACL, Las Cruces, New Mexico.

Voutilainen, A. 1993. "NPtool: A Detector of English Noun Phrases." In Proceedings of the Workshop on Very Large Corpora, Columbus, Ohio.