Survey Paper for CSC 444 (Logical Foundation of AI), University of Rochester, Fall 2007 Instructed By: Lenhart K. Schubert

Using Logic Proofs to justify answers in an Open-Domain Question-Answering system

Naushad UzZaman, [email protected]

Abstract: In an open-domain question answering (QA) system, users can ask any type of question (e.g. factoid, list, definition) on any topic, from sports to medicine, news, programming languages, tea, weather, or cooking. To understand complex questions and to find accurate answers for them, we need deeper text understanding that relates the question, the answer and world knowledge, so that we can justify whether an answer is suitable in the context of the given question. Although some QA systems can answer simple fact-seeking questions, they cannot answer complex questions that require deeper understanding of text very accurately [Voorhees 1999]. D. Moldovan and his group at Language Computer Corporation [LCC] tried to take state-of-the-art QA systems one step ahead by introducing a logic proof system into QA. For the logic proof they convert world knowledge, the question and the prospective answer into logic forms and then perform rigorous logic proofs to justify the answer. In this survey we start by describing what a question answering system is. We then describe how a question answering system works and give examples of different QA systems; at that point it should be clear to the reader that a question answering system needs deeper text understanding to generate answers. Next we present the QA system by LCC [Moldovan et al 2002a] and [Moldovan et al 2003], focusing on the logic proof system used to justify answers, and we describe the performance issues and error analysis of the system [Moldovan et al 2002b], in particular why it fails to answer some questions. The Text REtrieval Conference [TREC], co-sponsored by the National Institute of Standards and Technology (NIST) and the U.S. Department of Defense, introduced the evaluation of open-domain question answering systems, and most research results in open-domain QA are based on TREC evaluations [TREC]. Each year TREC also publishes an overview of the types of questions the QA systems had to answer that year and how each team performed, so a comparative study is available in those overview papers. Moldovan's group introduced the logic proof into their QA system in 2002/2003 [Moldovan et al 2002a, 2003]. To compare their system with others, we present the evaluation of different QA systems at TREC 2003 [Voorhees 2003], as well as their more recent performance relative to other groups [Dang et al 2006]. Finally, we comment on the extra benefits LCC's QA system gains from adding the logic proof module and on how this addition makes the system suitable as a next-generation question answering system.


Question Answering

Question Answering is an information retrieval application that gives specific answers to a user's question rather than retrieving documents (or web pages) that match the keywords of a given query, as search engines such as Google and Yahoo do. QA (Question Answering) can be divided into two types of problems: (i) closed-domain QA and (ii) open-domain QA. Active research on closed-domain QA started in the 1960s [wiki-QA]. Closed-domain QA is relatively easy and manageable in scale compared to open-domain QA, because it answers questions from a specific domain, so the system can be trained on the expected types of questions. We can compare closed-domain QA systems with expert systems [1]. Closed-domain QA has also moved beyond plain expert systems; it now uses advanced text-processing tools such as semantic analysis to pinpoint the question and answer the user more accurately. Closed-domain QA is very appealing in the medical domain, or in any specific domain where human experts are expensive, where it is hard for humans to answer questions quickly, or where it is too costly to keep many human experts available to serve clients. On the other hand, open-domain QA is more appealing and more challenging. In an open-domain QA system, questions can be on any topic, so it is not feasible to build it as an expert system the way closed-domain QA is built. We have to deal with any type of question (factoid [2], list, definition, etc.) on any subject, from sports to medicine, news, programming languages, tea, weather, cooking, literally everything. For open-domain QA the source of information is usually the web, which is the largest collection of resources in the world. A state-of-the-art open-domain QA system retrieves documents from the web that could contain the answer to the given query, tries to extract the answer from the retrieved pages, ranks the answers, and finally gives the user a specific answer or perhaps a list of answers.

How does a Question-Answering system work?

There are many ways a question-answering system can be implemented, but the design depends on the type of question. For example, factoid questions (e.g. When is the international mother language day?) are handled differently from list questions (e.g. How many villages in Bangladesh were affected by Sidr?) or definition questions (e.g. Where is Dhaka?). But the basic steps in most question answering systems are the same: process the question, retrieve documents based on the question, and finally extract the answer from the retrieved documents. At an abstract level, the task of a search engine is to process the query (here, the question) and retrieve documents based on the question.

[1] Expert systems: An expert system, also known as a knowledge-based system, is a computer program that contains the knowledge and analytical skills of one or more human experts in a specific subject. This class of program was first developed by researchers in artificial intelligence during the 1960s and 1970s and applied commercially throughout the 1980s. The most common form of expert system is a program with a set of rules that analyzes information (usually supplied by the user) about a specific class of problems and recommends one or more courses of action; it may also provide mathematical analysis of the problem. The expert system uses what appear to be reasoning capabilities to reach conclusions.

[2] Factoid: a brief or trivial item of news or information. In the TREC QA competition, factoid answers usually have to be returned within a limited number of words, e.g. 50 or 200.


Although it may seem that a QA system just adds an extra step of extracting answers from the retrieved documents, this step is actually far more complex than it appears. A QA system needs deeper text understanding to answer a question; it cannot get by without understanding the text. A search engine, by contrast, can successfully return documents that contain keywords from the query without fully understanding what the query means. To handle this challenge, a QA system has to design its modules very carefully starting from the first step, question processing. We will show how a QA system works by describing one particular system.
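To make this three-step design concrete, here is a minimal sketch of a generic QA pipeline in Python. Every function and data structure below is a hypothetical stand-in for illustration only, not a component of any particular system discussed in this paper.

# Minimal sketch of a generic QA pipeline: question processing, passage retrieval,
# answer extraction and ranking. All components are hypothetical simplifications.
from dataclasses import dataclass, field
from typing import List

STOP_WORDS = {"how", "did", "what", "is", "the", "a", "an", "of", "in", "for"}

@dataclass
class Question:
    text: str
    keywords: List[str] = field(default_factory=list)
    answer_type: str = "UNKNOWN"   # e.g. MONEY, PERSON, DATE in a real system

def process_question(text: str) -> Question:
    """Question processing: strip stop words to get keywords; a real system also guesses the answer type."""
    keywords = [w for w in text.rstrip("?").split() if w.lower() not in STOP_WORDS]
    return Question(text=text, keywords=keywords)

def retrieve_passages(question: Question, collection: List[str]) -> List[str]:
    """Passage retrieval: keep passages that contain all question keywords."""
    return [p for p in collection
            if all(k.lower() in p.lower() for k in question.keywords)]

def rank_answers(question: Question, passages: List[str]) -> List[str]:
    """Answer extraction/ranking: here, simply rank whole passages by keyword count."""
    return sorted(passages,
                  key=lambda p: sum(p.lower().count(k.lower()) for k in question.keywords),
                  reverse=True)

def answer(text: str, collection: List[str]) -> str:
    q = process_question(text)
    ranked = rank_answers(q, retrieve_passages(q, collection))
    return ranked[0] if ranked else "No answer found."

A real system replaces each of these steps with the much richer modules described in the following sections.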

Current Work on Question Answering

“TREC introduced the first question-answering track in TREC-8 (1999). The goal of the track is to foster research on systems that retrieves answers rather than documents in response to a question, with particular emphasis on system that can function in unrestricted domains.” [Voorhees 2003]. TREC has been successful in fostering research in this area, and the state of the art in question answering can be found in the TREC question-answering track papers. TREC also organizes a question-answering competition every year; most open-domain QA systems participate in it, and from the overviews and evaluations we can get an idea of the state-of-the-art techniques in QA and of which systems perform better than others. We picked a few systems by searching the web; here are brief comments on some of them before going into the details of our target system. [Zheng 2002] uses several search engines to retrieve documents and then applies its own techniques to extract answers from the retrieved documents; it also uses online machine translation systems to serve a few other languages. A demo is available online at [answerbus]. [answerbus] returns results very quickly and can answer trivial questions easily, but it does not always answer complex questions well, because it extracts answers from documents retrieved by search engines without a deep understanding of what the text means. [Ittycheriah and Roukos 2006] extract answers by searching for occurrences of the answer candidate on the web, then re-ranking answer candidate windows using a statistical machine translation dictionary, and finally using lexical patterns from supervised training pairs. [Soricut and Brill 2004] use FAQ pages to extract FAQ-related questions and try to answer non-factoid questions, which have not been the focus of the TREC QA competitions. There are many other QA systems, and they approach the problem very differently from one another. In this survey paper our focus is the benefit of adding a logic proof system to a question answering system, so we will not describe many different QA systems; rather, we will focus on the details of the system that uses a logic proof system and on the benefit it gains from adding the logic proof module. Coming back to the different QA systems and their evaluations in the QA overview papers [Voorhees 1999, 2003; Dang et al 2006]: if we check the results in these papers or the online demos (e.g. the commercial QA system Ask.com [Ask.com], Language Computer Corporation’s QA system [LCC], MIT’s QA system START [start-mit], Arizona State University’s QA system [asu-qa], or [inferret]), we can see that there have recently been many advancements in this field, but many problems remain to be solved. Some of these unsolved problems include understanding the relation between the


question and the answer, extracting the exact answer correctly, considering the semantic and syntactic roles of words (i.e. understanding what a word means in its specific context), and justifying answers, to name just a few. In this survey paper we describe one QA system that introduces a new technique to address one of these problems. Language Computer Corporation [LCC]'s QA system is the focus of this paper; it introduced a logic proof system into QA to justify whether an answer is correct in context. In the next two sections we explain the system in detail, with all of its modules and the errors in each module.

LCC’s Question-Answering System

In this section we describe LCC’s question-answering system, based on the descriptions in [Moldovan et al 2002a], [Moldovan et al 2002b] and [Moldovan et al 2003]. The QA system, named PowerAnswer, has three main parts: question processing, document retrieval and answer extraction. Each part consists of smaller modules that work together to generate answers for a given question. The same modules are used in two different architectures: a serial architecture and an architecture with feedback. In the serial architecture, a question passes through each module sequentially. The feedback architecture uses three loops (a passage retrieval loop, a lexico-semantic loop and a logic proving loop): it takes feedback from the output, and if the output is below a threshold it runs the modules again with some relaxation (more keywords, additional world knowledge, etc.) to find a better answer. We will describe how the feedback is used and how it improves performance in parallel with the serial system, i.e. after the modules where feedback is applied. Before that, we describe the different modules of the system together with an evaluation of each module with respect to the entire system.

Figure 1: Architecture of baseline serial system (without feedback)


As shown in Figure 1, the baseline system has 10 modules that perform the different tasks of the complete question-answering system. The first five modules perform question processing, the next two perform document and passage processing, and the last three perform answer processing.

M1: This module corrects spelling mistakes, if there are any. If the wh-terms are not at the beginning of the question, this module rephrases the question by moving them to the front, e.g. “Rotary engine cars were made by what company?” is changed to “What company were rotary engine cars made by?” This module accounts for 1.9% of the errors of the entire system.

M2: The input question is parsed and transformed into an internal representation capturing question concepts and binary dependencies between those concepts; e.g. for “How much could you rent a Volkswagen bug for in 1966?” it captures the binary dependency between the concepts rent and 1966. This module also identifies and removes stop words. It accounts for 5.2% of the system's errors, most of which are due to incorrect parsing (4.5%). The total error from pre-processing (modules M1 and M2) is therefore 7.1%.

M3: Mapping certain question dependencies onto a WordNet-based answer type hierarchy disambiguates the semantic category of the expected answer. For example, the dependency between How much and rent in the previous example is used to derive the expected answer type Money. The answer type is passed to subsequent modules to identify the possible answers (all monetary values in this case). This is one of the most important modules in the whole system: if you look for car names instead of monetary values in the above example, then even if you retrieve the correct document you will end up answering incorrectly. This module has the largest error in the system, 36.4%, and when it fails it is nearly impossible to extract the correct answer.

M4: Depending on part-of-speech information, a subset of the question concepts is selected as keywords for accessing the document collection. A passage retrieval engine accepts Boolean queries built from the selected keywords, e.g. Volkswagen AND bug, and returns passages that contain all the keywords specified in the Boolean query. Errors in keyword selection propagate to the following steps, because only documents containing all the keywords are retrieved. This module accounts for 8.9% of the total errors.

M5: Before the Boolean queries are sent for retrieval, the keywords are expanded with morphological, lexical and semantic alternations, so that documents containing the same keywords in other forms can also be retrieved; for example, rented is expanded into rent. This is a crucial module for the reason stated in M4, and it has the second highest error in the system, 25.7%, which is huge and drags down overall performance. These keywords are used in Boolean queries, so if the right keywords are not chosen, the documents containing the answer are never retrieved. The other problem is that with too many keywords a retrieved document must contain all of them, which makes it hard to get enough documents from which to extract the answer. It is clear that it is hard to get the keywords right in the serial system.
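To make M3-M5 a little more concrete, the sketch below derives an expected answer type from question cues and builds a Boolean query over expanded keywords. The hint table and expansion lists are invented stand-ins; the real system uses a WordNet-based answer type hierarchy and morphological, lexical and semantic alternations.

# Illustrative sketch of answer-type detection (M3) and Boolean query building
# with keyword expansion (M4/M5). All tables below are hypothetical stand-ins.
ANSWER_TYPE_HINTS = {
    ("how", "much"): "MONEY",
    ("how", "many"): "QUANTITY",
    ("who",): "PERSON",
    ("when",): "DATE",
}

EXPANSIONS = {
    "rent": ["rent", "rented", "renting", "rental"],
    "bug": ["bug", "beetle"],
}

def expected_answer_type(question: str) -> str:
    words = question.lower().rstrip("?").split()
    for cue, answer_type in ANSWER_TYPE_HINTS.items():
        if all(w in words for w in cue):
            return answer_type
    return "UNKNOWN"

def boolean_query(keywords):
    """AND over keywords, OR over each keyword's alternations."""
    groups = ["(" + " OR ".join(EXPANSIONS.get(k, [k])) + ")" for k in keywords]
    return " AND ".join(groups)

print(expected_answer_type("How much could you rent a Volkswagen bug for in 1966?"))
# -> MONEY
print(boolean_query(["rent", "Volkswagen", "bug", "1966"]))
# -> (rent OR rented OR renting OR rental) AND (Volkswagen) AND (bug OR beetle) AND (1966)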
In the feedback system, they take advantage of the feedback and reselect the keywords. For example, if the system cannot retrieve a number of documents above a threshold, i.e. too few come back, it drops a few keywords; if too many come back, it adds more keywords and passes the result to the next step. This dropping and adding of keywords is done in the M4 loop; if the system still cannot get enough results, then in this module it generates alternative keywords based on WordNet and tries again.
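A minimal sketch of this threshold-driven relaxation follows; the retrieval function, thresholds and keyword ordering are hypothetical, but the add/drop logic mirrors the loop just described.

# Sketch of the passage-retrieval feedback loop: drop keywords when too few
# passages come back, add optional keywords when too many come back.
def retrieve_with_feedback(keywords, optional_keywords, retrieve,
                           too_few=10, too_many=500, max_rounds=5):
    passages = retrieve(keywords)
    for _ in range(max_rounds):
        if too_few <= len(passages) <= too_many:
            break                                    # acceptable number of passages
        if len(passages) < too_few and len(keywords) > 1:
            keywords = keywords[:-1]                 # relax: drop the least important keyword
        elif len(passages) > too_many and optional_keywords:
            keywords = keywords + [optional_keywords.pop(0)]  # tighten: add a keyword
        else:
            break                                    # nothing left to adjust
        passages = retrieve(keywords)
    return passages, keywords

If this loop still cannot produce a usable set of passages, the lexico-semantic loop described above would substitute WordNet-based alternations for the keywords and try again.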


M6: The retrieval engine returns the documents containing all keywords specified in the Boolean queries. The documents are then further restricted to smaller text passages in which all keywords are located close to one another. Each retrieved passage also includes additional text (an extra line) before the earliest and after the latest keyword occurrence; in this way we can capture the expected answer that we might otherwise miss. For example, for the query “What is the name of the CEO of Apple computers?” we might retrieve a passage like “Steve Jobs, CEO of Apple Computers, … …”; if we extracted only the text spanning the two keywords, we could miss the name in the retrieved passage. This module accounts for 1.6% of the errors, the lowest together with the next module. It is also responsible for the first feedback in the feedback system: depending on the number of passages it retrieves, it adds or removes keywords and resends the Boolean query to the retrieval engine.

M7: This module refines the retrieved results for higher precision. Passages that do not satisfy the semantic constraints specified in the question are discarded. More specifically, given the expected answer type (e.g. money, temperature, names), the module keeps only those passages from the output of the previous module that contain something of that type. It has a 1.6% error rate, the lowest together with the previous module, which shows that their Named Entity Recognizer recognizes very accurately which word is of which kind. The real problem is selecting the answer type in the first place, which is M3, the module with the largest error in the entire system.

M8: The search for answers within the retrieved passages is limited to the passages that contain the expected answer type; this module then identifies the candidate answers. It accounts for 8% of the total system error.

M9: Once the candidate answers are found, they need to be ranked. Each candidate answer receives a relevance score according to lexical and proximity features, such as the distance between keywords or the occurrence of the candidate answer within the text passage. The candidates are then sorted by score in decreasing order. This module accounts for 6.3% of the errors.

M10: Finally the answer is formulated. The system selects the candidate answer with the highest relevance score. The final answers are either generated internally or are fragments of text extracted from the passages around the best candidate answer. The error of answer formulation is 4.4%.

These are the general modules used by a state-of-the-art question answering system. In LCC's QA system, question processing consists of modules M1-M5, document and passage processing (retrieval) of M6-M7, and answer processing (extraction) of M8-M10. In the feedback system explained in [Moldovan et al 2002b], and later adopted in the general system [Moldovan et al 2002a], an extra module is used: the Logic Prover. Its task is to justify, by a rigorous proof, that the expected answer is justified in this context; details follow in the next section, and it is also the focus of this survey paper. The architecture of the feedback system, which uses all the feedback loops, is shown in Figure 2 below.
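The following sketch illustrates, in very reduced form, the filtering and ranking performed by M7-M9: keep only passages containing an entity of the expected answer type, then score candidates by how close together the question keywords occur. The toy entity recognizer and the scoring function are invented for illustration and are not LCC's components.

# Sketch of M7-M9: filter candidates by expected answer type, then rank by
# keyword proximity. The "entity recognizer" below only knows dollar amounts.
import re

def find_entities(passage, expected_type):
    if expected_type == "MONEY":
        return re.findall(r"\$\d+(?:\.\d+)?", passage)
    return []

def proximity_score(passage, keywords):
    """Higher when the question keywords occur close together in the passage."""
    words = passage.lower().split()
    positions = [i for i, w in enumerate(words) if any(k.lower() in w for k in keywords)]
    if len(positions) < 2:
        return 0.0
    return 1.0 / (max(positions) - min(positions))

def rank_candidates(passages, keywords, expected_type):
    candidates = []
    for passage in passages:
        for entity in find_entities(passage, expected_type):    # M7/M8: type filter
            candidates.append((entity, proximity_score(passage, keywords)))
    return sorted(candidates, key=lambda c: c[1], reverse=True)  # M9: ranking

passages = ["In 1966 you could rent a Volkswagen bug for $1 a day.",
            "Volkswagen sold the bug worldwide in 1966."]
print(rank_candidates(passages, ["rent", "Volkswagen", "bug", "1966"], "MONEY"))
# -> [('$1', 0.1666...)]; the second passage has no MONEY entity and is discarded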
In the final feedback loop, the system tries to prove the answer; if the proof fails, it goes back to the earlier modules, expands the keywords with semantically related alternations, and runs the subsequent modules again.
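Compressed into a few lines of Python, this outermost loop might look as follows; the prover interface, the keyword-expansion step and the pipeline function are hypothetical placeholders, not COGEX's actual API.

# Sketch of the logic-proving feedback loop: accept a candidate answer only if the
# prover can justify it; otherwise relax the keywords and rerun retrieval/extraction.
def answer_with_justification(question, keywords, pipeline, prover,
                              expand_with_alternations, max_rounds=3):
    for _ in range(max_rounds):
        for candidate in pipeline(question, keywords):       # modules M4-M10
            if prover.justify(question, candidate):          # logic-proving loop
                return candidate
        keywords = expand_with_alternations(keywords)        # lexico-semantic relaxation
    return None                                              # no justified answer found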


Figure 2: Architecture with feedback system

Here we show two tables from [Moldovan et al 2002b] that give the errors of each module and the improvement obtained after adding the feedback loops. These two tables summarize what we have described in this section.

Table 1: Distribution of errors per system module

Table 2: Impact of feedback on precision.

Need for Deeper Text Understanding!

At this point we can see that a general question-answering system has question processing, document and passage processing (retrieval), and answer processing (extraction) components. The modules described above (M1-M10) are the kinds of basic modules found in the state of the art of the literature.


But we need deeper text understanding, and a deeper relationship between question and answer, to decide whether an answer is correct in a given context. Given the complexity of the questions asked in the TREC competitions, it is hard to answer all of them with these state-of-the-art techniques, so each team tries to come up with new techniques that can boost its performance. The bottom line for improving performance is understanding the text and the relationship between question and answer. LCC introduced a novel component into its QA system that helps it understand the relationship between question and answer better and find the best answers: the Logic Prover [3]. We introduced the logic prover briefly in the previous section; in the next section we describe it in more detail. LCC named their logic prover “COGEX (from the permutation of first two syllables of the verb excogitate [4]), which uniformly codifies the question and answer text, as well as world knowledge resources, in order to use its inference engine to verify and extract any lexical relationships between the question and its candidate answers.” [Moldovan et al 2003]. Since this paper is mostly about LCC's logic prover, we will use the name COGEX in most places to refer to it.

LCC’s Logic Prover, COGEX

The LCC logic prover, COGEX, renders a deep understanding of the relationship between the question text and the candidate answer text. COGEX uses the Logic Form (LF) of the question, the answer and world knowledge; the LF captures syntax-based relationships such as syntactic objects, syntactic subjects, prepositional attachments, complex nominals, and adverbial/adjectival adjuncts provided by the logic representation of text. In addition to the LF representations of question and answers, COGEX uses lexical chains to connect world knowledge axioms with the question and answers. In their eXtended WordNet (XWN), an axiom is the LF expression of a synset and its gloss. A major challenge in QA is that an answer is very often expressed in words different from the question keywords. The lexical chains module addresses this problem and improves performance by (i) increasing document retrieval recall and (ii) improving answer extraction by providing the much-needed world knowledge axioms that link question keywords with answer concepts. With this deep understanding of text, COGEX can effectively and efficiently re-rank the answers by their correctness and ultimately discard incorrect answers. Given all these axioms, the next step is the automated reasoning system itself. The base of COGEX is Otter, an automated reasoning system developed at Argonne National Laboratory; LCC made the modifications required to fit it into the QA environment. Hyperresolution and paramodulation are the basis of the inference rule sets. Hyperresolution is an inference mechanism that performs multiple binary resolution steps in one, where binary resolution looks for a positive literal in one clause and the negative form of the same literal in another clause, so that both can be cancelled out, resulting in a new inferred clause. Paramodulation introduces the notion of equality substitution, so that axioms representing equality in the proof do not need to be explicitly included in the axiom lists.

[3] Logic Prover and Logic Proof system are used interchangeably in this paper; they mean the same thing.

[4] Excogitate: to think out, plan or devise.


Additionally, like hyperresolution, paramodulation combines multiple substitution steps into one. The concept of hyperresolution is the same as the resolution we learned in class, and paramodulation is the same paramodulation we covered in class; the difference is that multiple steps are combined into one. The search strategy they use is the Set of Support strategy, the same as the one we learned in class; this is a complete proof system, so it is expected to find the answer when sufficient information is available. The following, extracted from [Moldovan et al 2002a], explains the strategy: “The search strategy used is the Set of Support Strategy, which partitions the axioms used during the course of a proof into those that have support and those that are considered auxiliary. The axioms with support are placed in the Set of Support (SOS) list and are intended to guide the proof. The auxiliary axioms are placed in the Usable list and are used to help the SOS infer new clauses. This strategy restricts the search such that a new clause is inferred if and only if one of its parent clauses comes from the Set of Support. The axioms that are placed in the SOS are the candidate answers, the question negated (to invoke the proof by contradiction), axioms related to linking named entities to answer types, and axioms related to decomposing conjunctions, possessives, and complex nominals. Axioms placed in the Usable list are the WordNet axioms and other axioms based outside world knowledge. The Logic Prover will continue trying to find a proof until one of two conditions is met; either the Set of Support becomes empty or a refutation is found.”
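To give a flavour of refutation search with a Set of Support, here is a heavily simplified propositional sketch in Python. COGEX itself works over first-order logic forms with hyperresolution and paramodulation on top of Otter, so this illustrates only the shape of the strategy, not LCC's implementation; the example clauses at the end are invented.

# A minimal propositional sketch of resolution with the Set of Support strategy.
# Clauses are frozensets of string literals ("p" / "~p").

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """Return all binary resolvents of two propositional clauses."""
    resolvents = []
    for lit in c1:
        if negate(lit) in c2:
            resolvents.append(frozenset((c1 - {lit}) | (c2 - {negate(lit)})))
    return resolvents

def sos_refutation(sos, usable, max_steps=10000):
    """Try to derive the empty clause; every new clause must have a parent in the SOS."""
    sos, usable = list(sos), list(usable)
    seen = set(sos) | set(usable)
    steps = 0
    while sos and steps < max_steps:
        given = sos.pop(0)                        # pick a supported clause
        for other in usable + sos + [given]:
            for resolvent in resolve(given, other):
                steps += 1
                if not resolvent:                 # empty clause: contradiction found
                    return True
                if resolvent not in seen:
                    seen.add(resolvent)
                    sos.append(resolvent)         # resolvents inherit support
        usable.append(given)
    return False                                  # SOS exhausted, no proof

# Toy illustration (hypothetical clauses, not the paper's axioms):
# world knowledge: committing suicide implies dying; candidate answer: suicide_hitler;
# negated question (for proof by contradiction): ~die_hitler.
usable = [frozenset({"~suicide_hitler", "die_hitler"})]
sos = [frozenset({"suicide_hitler"}), frozenset({"~die_hitler"})]
print(sos_refutation(sos, usable))                # True: the candidate answer is justified

A resolvent is kept only if one of its parents has support, which is exactly the restriction described in the quoted passage; the proof succeeds when the empty clause (a contradiction) is derived.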

Example of How COGEX Works, with Explanation

The basic architecture of COGEX is shown in Figure 3. The first step is to run modules M1 to M10, which yields the candidate answers, the world knowledge and everything else needed; these are then converted into Logic Forms. ALF refers to a candidate answer in Answer Logic Form; QLF refers to the question in Logic Form; NLP axioms are axioms that represent equivalence classes of linguistic patterns; XWN axioms and lexical chains are the world knowledge axioms supplied by eXtended WordNet, converted into logic form.

Figure 3: COGEX Architecture

We now show an example given in [Moldovan et al 2002a], with explanation.


Question: How did Adolf Hitler die?
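The full logic forms and the proof trace for this example are given in [Moldovan et al 2002a]. As a rough illustration of the flavour of the LF notation only (these are not the exact forms from the paper), the question, a hypothetical candidate answer sentence such as “Hitler committed suicide in his bunker”, and a linking world-knowledge axiom might be written along these lines:

Question LF (illustrative): Adolf_NN(x1) & Hitler_NN(x2) & nn_NNC(x3,x1,x2) & die_VB(e1,x3) & manner_AT(e1,x4)
Candidate answer LF (illustrative): Hitler_NN(x1) & commit_VB(e1,x1,x2) & suicide_NN(x2) & bunker_NN(x3) & in_IN(e1,x3)
Lexical-chain axiom (illustrative): commit_VB(e1,x1,x2) & suicide_NN(x2) -> die_VB(e1,x1)

The prover negates the question and places it in the Set of Support together with the candidate answer; if a refutation is found using such axioms, the candidate answer is considered justified and the manner of death (“by committing suicide” in this illustration) can be reported.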


Performance improvement by adding COGEX, the Logic Prover, to the QA system

COGEX was added to LCC's state-of-the-art QA system in 2002. They ran the system both with and without COGEX and found that with COGEX the performance of the entire system improves considerably. The following table shows the improvement.

Table 3: Performance over 500 TREC 2002 questions


The performance improvement from adding COGEX to the QA system is 30.9% (98/317), which is impressive. It could still do better; the failures that remain are primarily due to a lack of linguistic axioms.

Larger Picture: Comparison of LCC’s QA system with other QA systems

We have seen how LCC's QA system works in detail, and how the addition of one module, the Logic Prover, improved the performance of the whole system. But every evaluation we have seen so far was internal to LCC's QA system; the question remains how LCC's QA system performs compared to other QA systems. Our focus in this paper is the logic proof system, and LCC added the COGEX logic prover in 2002, so in this section we show the evaluation of all QA systems in TREC 2003 [Voorhees 2003]. The TREC 2003 question-answering track [Voorhees 2003] contained two tasks, the passage task and the main task. In the passage task the system had to return a single text snippet in response to factoid questions; the evaluation metric was the number of snippets that contained the correct answer. The main task contained three different types of questions: factoid questions (e.g. When is the international mother language day?), list questions (e.g. How many villages in Bangladesh were affected by Sidr?) and definition questions (e.g. Where is Dhaka?).

Evaluation of Passage Task

The table below shows that LCC's QA system performed far better than the other systems in the competition.

Table 5: Evaluation scores for the best passages task run from each group that submitted a passage run [Voorhees 2003]

Evaluation of Main Task

As explained before, the main task consists of three different categories of questions, each evaluated in a different way. The final score was calculated by the following formula:

Final score of Main task = 1/2 * Factoid score + 1/4 * List score + 1/4 * Definition score
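For example, a hypothetical run with a factoid score of 0.700, a list score of 0.400 and a definition score of 0.440 would receive a final score of 1/2 * 0.700 + 1/4 * 0.400 + 1/4 * 0.440 = 0.350 + 0.100 + 0.110 = 0.560, so factoid accuracy dominates the final ranking.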


Table 6: Component scores and the final combined scores for main task runs. Scores are given for the best run from each of the top 15 groups [Voorhees 2003]

From these evaluations it is clear that in both tasks LCC's QA system performed better than the other systems. We also checked the recent performance of LCC's QA system at TREC 2006 [Dang et al 2006] to verify that LCC still performs better than other systems, and the evaluations in [Dang et al 2006] show that it does. The evaluations from third-party papers [Voorhees 2003; Dang et al 2006] thus show that LCC's QA system performs very well. We tried the system on a few complex examples (e.g. list questions like “List 20 countries that produce coffee”), but it did not give the expected results, which surprised us since [Voorhees 2003] reports that teams answered this type of question. We assume that the system that competed in TREC and the online demo are not the same, or perhaps the online demo only handles factoid questions.

Comment on adding a Logic Prover in a QA system

Automated logic proof systems are accurate if implemented correctly, but they are expensive, both in processing time and in a high failure rate when the input axioms are insufficient. If we try the online demo, we can see that it sometimes takes longer than other QA systems to return an answer. Assuming that with improvements in computing power this time can be reduced in the future, the question remains: is it worth it? The evaluation above showed that the logic prover improves performance by about 30% on the TREC 2002 questions, and the previous section showed how LCC's QA system compares to other systems in the TREC competitions: LCC performs consistently better than the other QA systems at TREC [Voorhees 2003; Dang et al 2006]. Even though it performs better than other systems, and the TREC questions are complex and of different types, it is tested on a limited collection of 3GB of data containing only about 1 million documents.


The initial work on the successful search engine Google is explained in [Brin and Page 1998]. During the late 90s, TREC focused on information retrieval, and its benchmark text corpus was only 20GB, compared to Google's 147GB corpus of 24 million web pages at the time. So a system that works well on the TREC collection often does not produce good results on the web, because the collection is too small compared to the real world. The challenge on the web is that it contains a lot of junk and spam, so if a generic QA system simply retrieves documents that contain the keywords and extracts an answer from those documents, it is likely to end up answering something that does not mean what the user expects. A Logic Prover addresses exactly this problem: because of its deep understanding of the texts and its rigorous proof system, it can discard unexpected answers and promote the exact answer in the ranking. For these reasons we believe the inclusion of a Logic Prover in a QA system was a novel idea, and that it will help LCC's QA system perform far better than other QA systems in the future, in real open-domain QA on the web, unless other QA systems adopt something similar.

Conclusion

In this survey paper we started by explaining question answering systems, and open-domain QA in detail, and then briefly explained the modules of LCC's QA system. After that we focused on the Logic Prover, an integral module of LCC's QA system that justifies an answer by performing a rigorous logic proof over the question, the candidate answer and world knowledge in Logic Form. We also showed how LCC's QA system performed compared to other QA systems in the TREC competitions. One comment we would like to make in conclusion is that LCC's QA system is highly modularized, which lets them add modules easily and run experiments with or without a particular module; this was very helpful for adding or removing any particular component. Finally, we commented specifically on their Logic Prover and on the benefit of adopting it in a QA system. We think that for real open-domain QA on the web, the Logic Prover will make a notable contribution to LCC's QA system, both in giving better answers and in discarding the junk of the web as answers.

References:

[Voorhees 1999] E. M. Voorhees. The TREC-8 Question Answering Track Report. Proceedings of TREC-8, pages 77-82, Gaithersburg, Maryland, NIST, 1999.

[LCC] Language Computer Corporation.

[Moldovan et al 2002a] D. Moldovan, S. Harabagiu, R. Girju, P. Morarescu, F. Lacatusu, A. Novischi, A. Badulescu and O. Bolohan. LCC Tools for Question Answering. Proceedings of the TREC-2002 Conference, 2002.

[Moldovan et al 2002b] D. Moldovan, M. Paşca, S. Harabagiu and M. Surdeanu. Performance Issues and Error Analysis in an Open-Domain Question Answering System. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002.

[Moldovan et al 2003] D. Moldovan, C. Clark and S. Harabagiu. COGEX: A Logic Prover for Question Answering. Proceedings of HLT-NAACL 2003.

[TREC] Text REtrieval Conference (TREC).

[Voorhees 2003] E. M. Voorhees. Overview of the TREC 2003 Question Answering Track. Text REtrieval Conference (TREC), 2003.

[Dang et al 2006] H. T. Dang, J. Lin and D. Kelly. Overview of the TREC 2006 Question Answering Track. Text REtrieval Conference (TREC), 2006.

[wiki-QA] Question answering - Wikipedia.

[Zheng 2002] Z. Zheng. AnswerBus Question Answering System. Human Language Technology Conference (HLT 2002), San Diego, CA, March 24-27, 2002.

[answerbus] AnswerBus Question Answering System online demo.

[Soricut and Brill 2004] R. Soricut and E. Brill. Automatic Question Answering: Beyond the Factoid. 2004.

[Ittycheriah and Roukos 2006] A. Ittycheriah and S. Roukos. IBM's Statistical Question Answering System - TREC-11. 2006.

[Ask.com] Question Answering system Ask.com.

[start-mit] START, Natural Language Question Answering System.

[asu-qa] Arizona State University's QA system.

[inferret]

[Brin and Page 1998] S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 1998.