Resolving Ambiguous Preposition Phrase for Text Mining Applications

Hejab M. Alfawareh

Shaidah Jusoh

College of Computer Science & Information Systems Najran University Najran, Saudi Arabia [email protected]

College of Computer Science & Information Systems Najran University Najran, Saudi Arabia [email protected]

Abstract—Text mining is one of the research areas of computational intelligence. The main goal of a text mining tool is to discover knowledge that is embedded in unstructured text. The first step of text mining is to extract facts from the text. However, building a robust text mining tool is very complex, because this first step requires the tool to process natural language. The major challenge in any natural language is ambiguity, which may occur at the lexical and phrase levels. This paper addresses the ambiguity that occurs in the preposition phrase and presents a new technique for resolving it. The technique has been developed by applying possibility theory, fuzzy sets, and context knowledge. It has been implemented and tested using a set of test cases, and promising results were obtained.

Keywords-text mining; ambiguity; context knowledge

I. INTRODUCTION

Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in text documents. Text mining represents a significant step forward from text retrieval. It is a relatively new and vibrant research area that is changing the emphasis in text-based information technologies from the level of retrieval and extraction to the level of analysis and exploration. Text mining tools are technologies capable of answering sophisticated questions and performing text searches with an element of intelligence. Typical text mining tasks include text categorization, text clustering, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling [1].

To make text mining tasks successful, the presented texts should be analyzed using natural language processing (NLP) techniques. The use of NLP techniques enables text mining tools to get closer to the semantics of a text source [2]. This is especially important when the text mining tool is expected to discover knowledge from texts. NLP is thus a foundation for text mining and a critical part of a text mining system.

However, the main challenge in NLP is that natural language is always ambiguous. One word may have many different meanings, and one phrase can be interpreted in many ways. According to [3], one type of grammatical ambiguity occurs when a phrase or clause can behave as either an object complement or an adverbial phrase. Take the example 'she put the cookies on the stove'. The phrase 'on the stove' can be interpreted as an object complement to the direct object 'cookies' or as an adverbial phrase modifying the event 'put'. The main goal of text mining is to extract unambiguous facts from ambiguous sentences. This paper addresses the ambiguity problem that occurs in a preposition phrase (PP) and presents a new approach for resolving it. The approach has been developed by applying possibility theory, fuzzy sets, and context knowledge to extract unambiguous facts.

This paper is organized as follows. Section II presents previous work related to resolving PP ambiguity. Section III presents our proposed approach. Section IV presents the implementation of the approach. Results and analysis are discussed in Section V.

II. PREVIOUS WORK

Prepositions are often among the most frequent words in a language. For example, based on the British National Corpus (BNC) [4], four of the ten most frequent words in English are prepositions (of, to, in, and for). Despite their frequency, however, they are notoriously difficult to master, even for humans [5]. PP attachment research has undergone a number of significant paradigm shifts over the last three decades, and has been a target of interest in theoretical syntax, AI, psycholinguistics, statistical NLP, and statistical parsing [6]. Two large areas of research on the syntactic aspects of prepositions are (a) PP attachment and (b) prepositions in multiword expressions. Determining the correct attachment site for a PP is one of the major sources of ambiguity in natural language parsing and analysis. Early research on PP attachment focused on the development of heuristics intended to model human processing strategies, based on analysis of the competing parse trees independent of lexical or discourse context [7].

As the research community has grown, many researchers have attempted to resolve PP attachment ambiguity from many different angles. A significant shift in NLP research on PP attachment was brought about by [8], who were the harbingers of statistical NLP and large-scale empirical evaluation. Researchers have tried to tackle the problem with a variety of smoothing methods and machine learning algorithms, including backed-off estimation, instance-based learning, log-linear models, maximum entropy learning, decision trees, neural networks, and boosting, as well as corpus-based methods. However, statistical approaches are not appropriate or adequate for inferring prepositional phrase attachments in cognitive modeling systems, as human cognition is generally not a completely statistical process [9].

III. PROPOSED APPROACH

A. Framework

The framework of the proposed approach consists of 2 external components and 3 internal components. The external components include a human user and a fact knowledge-base. The internal components are the subject knowledge store, possibility theory and fuzzy grammar, and NLP techniques. The components are illustrated in Figure 1. The internal components together act as the unambiguous fact extraction processor. The input is a sentence and the output is an unambiguous fact.

Figure 1. The approach framework

Human User: A human user is a person who is responsible for representing subject knowledge using knowledge representation schemes. The represented knowledge about the context of the subject matter is then stored in the knowledge store. It is utilized by the unambiguous fact extraction processor for resolving ambiguity problems.

Subject Knowledge Store: It is a knowledge-base used to store the represented context knowledge of the subject matter. The knowledge is obtained from the human user or from preceding sentences. In this work, only two types of knowledge are stored: knowledge about entities, and knowledge about events or relationships between entities.

Possibility Theory/Fuzzy Grammar: Fuzzy grammar is used during the parsing process. Possibility theory is applied to the context knowledge to select the most possible fact represented by a sentence.

NLP Techniques: The NLP techniques involved in this work include syntactic processing and semantic processing. Syntactic processing is concerned with parsing a sentence and obtaining a syntactic structure, or parse tree. Top-down and bottom-up parsing techniques have been used in the syntactic processing. Semantic processing is concerned with interpreting the syntactic structure and attaching semantics to it; the lambda reduction technique has been used for this purpose.

Sentence: A sentence is treated as the input to the fact extraction processor. Each sentence is processed syntactically and semantically before an unambiguous fact can be produced.

Fact Knowledge-base: It is conceptually a knowledge-base in which all extracted unambiguous facts are stored. The knowledge-base will be used in the next step of a text mining system.

B. Knowledge Representation of Unambiguous Fact Extraction

As described above, the knowledge store contains knowledge about entities and events. The knowledge (K) is represented as a graph which consists of nodes and edges. The knowledge store can be represented as a graph K in the form of

K = {N, E}    (1)

where N is a set of nodes, N = {n1, n2, ..., nm}, with m a finite integer, and E is a set of edges, E = {e1, e2, ..., em}. Each edge is drawn as an arrow indicating a relationship between two nodes: a start node and an end node. A node (n) represents an entity. An event or relationship between two nodes is represented by an edge (e). A sample of knowledge about entities and events is given in Figure 2. Every node is a member of the node set, n ∈ N, and every edge is a member of the edge set, e ∈ E. For every pair of nodes, there may be one or more edges between them to represent possible relationships. Each e ∈ E is associated with a pair (l, p), where l denotes the relationship and p is the plausibility value of the relationship between the two entities represented by the two nodes. For example, for the sentence 'The police shot the robber in the shop', there will be nodes for 'police', 'robber', and 'shop'. There may be two edges between the node robber and the node shop, representing 'shot in' or 'inside'. Figure 3 illustrates two different relationships for the pair of nodes robber and shop.
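The graph K = {N, E} with (l, p) pairs on its edges can be sketched as a small data structure. This is a minimal illustration in Python, not the authors' implementation; the class names, the helper `build_robbery_graph`, the restriction of p to [0, 1], and the plausibility values 0.7 and 0.4 are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    start: str           # start node (an entity)
    end: str             # end node (an entity)
    label: str           # l: name of the relationship or event
    plausibility: float  # p: plausibility value, assumed to lie in [0, 1]


@dataclass
class KnowledgeGraph:
    nodes: set = field(default_factory=set)    # N
    edges: list = field(default_factory=list)  # E

    def add_edge(self, start, end, label, plausibility):
        if not 0.0 <= plausibility <= 1.0:
            raise ValueError("plausibility must lie in [0, 1]")
        self.nodes.update((start, end))
        self.edges.append(Edge(start, end, label, plausibility))

    def edges_between(self, start, end):
        """Return every edge from start to end (there may be several)."""
        return [e for e in self.edges if e.start == start and e.end == end]


def build_robbery_graph():
    """Nodes and edges for 'The police shot the robber in the shop'."""
    k = KnowledgeGraph()
    k.nodes.add("police")
    # The two plausibility values below are made up for illustration.
    k.add_edge("robber", "shop", "shot-in", 0.7)
    k.add_edge("robber", "shop", "inside", 0.4)
    return k
```

Calling `build_robbery_graph().edges_between("robber", "shop")` returns both candidate relationships, which mirrors the situation in Figure 3 where one pair of nodes carries two edges.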

Figure 2. An example of a graph representing nodes and edges in the subject knowledge-base

Figure 3. Two nodes with edges representing two possible relationships between the nodes

C. Methodology

The methodology of the proposed approach is divided into 8 steps. Figure 4 presents the steps graphically. The internal components of the proposed framework are utilized and integrated into the steps.

Figure 4. The steps of unambiguous fact extraction technique

Step 1: Parsing Sentence With Fuzzy Grammar. In this step, a sentence is processed to recognize its parts of speech and its syntactic structure. To utilize possibility theory and the subject context knowledge stored in the knowledge-base, a fuzzy grammar is created and used. The output of this process is a single parse tree or multiple parse trees.

Step 2: Handling Multiple Parsing. In this work we have modified the Earley algorithm to handle multiple parsing, so that whenever a sentence has more than one possible syntactic structure, more than one parse tree is generated. Technically, this step is integrated into the step of parsing a sentence with fuzzy grammar.

Step 3: Resolving Ambiguous Fact During Parsing. Resolving ambiguous facts is the heart of the unambiguous fact extraction technique. In this step, knowledge about the subject context and possibility theory are integrated to resolve ambiguous facts during the parsing process.

Step 4: Semantic Attachment. The output of the three previous steps is one or more parse trees. Semantic attachment is then conducted on each obtained parse tree to assign the correct semantics to each of its constituents.

Step 5: Converting a Parse Tree Into a Graph. A parse tree with semantic attachment is then converted into a graph. This step is needed so that pattern matching can be conducted against the graphs stored in the knowledge store.

Step 6: Matching Graphs. In this step, a pattern matching technique is used. The graphs created from the parse trees are matched against existing graphs in the subject context knowledge-base by searching the knowledge-base for the same graph pattern. When the same pattern exists, the graph is matched.

Step 7: Graph Selection. When the pattern search succeeds and a matching graph is identified, the graph is taken as a solution. When there is more than one parse tree, more than one graph may match the graphs in the subject context knowledge-base. The meaning of a sentence should be represented by only one graph, so the most possible graph is selected. Graph selection involves calculating the plausibility value of each graph and taking the graph that has the highest plausibility value.

Step 8: Fact Representation. After a graph has been selected, it is converted into a formal knowledge representation. In this work, predicate calculus has been used as the knowledge representation for the unambiguous fact, and the resulting predicate is stored in the fact knowledge-base. At this point the meaning of the sentence is represented by a graph, and the most possible graph has been selected. To convert the graph into predicate calculus, two things are important: the nodes, and the relationship name associated with the edge. Each node is converted into an atom of predicate calculus, while the relationship name is converted into a relation of predicate calculus. The result of converting the graph in Figure 5 into predicate calculus can be represented as shot-in(robber, shop).

Figure 5. An example of a selected graph which will be converted into predicate
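Steps 7 and 8 can be sketched as two small functions: one that keeps the candidate graph with the highest plausibility value, and one that turns its edges into predicates. This is only an illustrative Python sketch under stated assumptions: the dictionary layout, the function names `select_graph` and `to_predicate`, and the plausibility values 0.7 and 0.4 are hypothetical and do not come from the paper.

```python
def select_graph(candidates):
    """Step 7: return the candidate with the highest plausibility value."""
    if not candidates:
        raise ValueError("no candidate graph matched the knowledge store")
    return max(candidates, key=lambda g: g["plausibility"])


def to_predicate(graph):
    """Step 8: render each (start, label, end) edge as label(start, end)."""
    facts = [f"{label}({start}, {end})" for start, label, end in graph["edges"]]
    # Several facts from one sentence are joined with '^' (conjunction).
    return " ^ ".join(facts)


# Two candidate readings of 'The police shot the robber in the shop'.
# The plausibility values are assumptions chosen for illustration.
candidates = [
    {"edges": [("robber", "shot-in", "shop")], "plausibility": 0.7},
    {"edges": [("robber", "inside", "shop")], "plausibility": 0.4},
]

selected = select_graph(candidates)
print(to_predicate(selected))  # prints: shot-in(robber, shop)
```

The relationship name becomes the relation and the two nodes become its arguments, so a graph with more than one edge yields a conjunction of predicates.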

IV. IMPLEMENTATION

The theories presented in the methodology section were implemented and tested experimentally. The experimental process was divided into two parts: test case preparation and experiment environment setup. In this work, the test cases were designed and created by the authors. A test case consists of a collection of sentences. Because the technique utilizes human knowledge of a subject, the subject must be defined first when creating data sets. The identified subject contexts include police, robbery, business, education, laboratory, market, baby, livestock, housekeeping work, and so on. Each test case consists of ambiguous and unambiguous sentences, and each sentence may contain ambiguous and unambiguous words. A test case contains between 4 and 8 sentences. Each sentence contains fewer than ten words. The sentences are free from conjunctions such as 'and', 'but', 'or', and 'so', and from interjections such as "oohhh" or "ahh".

The proposed approach has been implemented in the C language. As discussed in the methodology section, when a sentence is ambiguous, more than one parse tree is generated. Grammatical ambiguity can cause ambiguity in parsing. In this case, more than one rule may be applied when a parser parses a PP, as shown below:

:- Verb [] [] :- Verb [] [] :- Noun Phrase :- Preposition Phrase : - Preposition Phrase

From the above grammar rules, a PP can be parsed as an object complement or as an adverbial phrase. This process produces multiple parse trees from one sentence. Consequently, it can be assumed that when a sentence contains a PP, the sentence has grammatical or structural ambiguity.

To evaluate the proposed approach, 50 test cases have been tested. In this paper, only 1 test case is presented. The context of the test case presented in this paper is a robbery case. The human knowledge about the crime subject has been stored in the subject knowledge store. The knowledge was represented using a graph representation. The following sentences are used in the test case.

Sentence 1: The shopper heard a shot sound.
Sentence 2: The shopper saw a robber with a gun.
Sentence 3: He called a police station.
Sentence 4: The police came to the place.
Sentence 5: The police ran into the shop.
Sentence 6: The police shot the robber in the shop.
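To make the structural ambiguity concrete, the two attachment readings of Sentence 6 can be enumerated with a toy grammar. The sketch below is in Python and uses a simple memoized span parser rather than the modified Earley algorithm and fuzzy grammar of the paper; the rule set, the lexicon, and the function names `parse_all` and `pp_attachment` are assumptions introduced for illustration.

```python
from functools import lru_cache

# Hypothetical binary grammar: a PP may attach to a VP (adverbial reading)
# or to an NP (object complement reading).
BINARY_RULES = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("VP", "PP")],
    "NP": [("Det", "N"), ("NP", "PP")],
    "PP": [("P", "NP")],
}
LEXICON = {
    "the": "Det", "police": "N", "robber": "N", "shop": "N",
    "shot": "V", "in": "P",
}


def parse_all(words, goal="S"):
    """Return every parse tree of `words` rooted in `goal` as nested tuples."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def trees(symbol, i, j):
        # All trees for `symbol` spanning words[i:j].
        if j - i == 1:
            if LEXICON.get(words[i]) == symbol:
                return ((symbol, words[i]),)
            return ()
        found = []
        for left, right in BINARY_RULES.get(symbol, []):
            for k in range(i + 1, j):  # every split point of the span
                for lt in trees(left, i, k):
                    for rt in trees(right, k, j):
                        found.append((symbol, lt, rt))
        return tuple(found)

    return list(trees(goal, 0, len(words)))


def pp_attachment(tree):
    """For a sentence ending in one PP, report where that PP attaches."""
    _, _, vp = tree
    return "verb" if vp[1][0] == "VP" else "noun"


sentence = "the police shot the robber in the shop".split()
for tree in parse_all(sentence):
    print(pp_attachment(tree))
```

With this grammar the sentence receives exactly two trees, one per attachment site, while the same sentence without 'in the shop' receives one. Each tree would then be converted to a graph and scored against the knowledge store.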

V. RESULT AND ANALYSIS

Figure 6 presents the results of the test case, and the extracted unambiguous facts are shown in Figure 7. The column "Type of Ambiguity" presents the results of the steps of parsing with fuzzy grammar and handling multiple parsing. The columns "Graph/Tree (1)" and "Graph/Tree (2)" present the results of the steps for resolving structural ambiguity, converting a parse tree into a graph, and graph matching. The column "Selected Graph" presents the result of the graph selection step. The column "Unambiguous Fact" presents the result of the fact representation step. The position of a sentence in the test case is indicated by "Sentence No". "Type of Ambiguity" records whether the sentence is ambiguous or unambiguous. If the sentence is unambiguous, it is recorded as 'none'; if it is ambiguous, there are two possible types of ambiguity: structural ambiguity only, or structural and fact ambiguity. Each graph is assigned a possibility value. The technique for assigning possibility values has been described in the previous section. If a sentence is not ambiguous, no possibility value is assigned. For example, Sentence no. 1 is unambiguous, so no possibility value is assigned to its graph.

Figure 6. An example of a test case result

Figure 7. Unambiguous fact of sentences

In analyzing the obtained results, preposition phrases have been classified into two groups: certain and uncertain. In this work, an uncertain preposition phrase is a preposition phrase that contains a preposition such as 'on', 'in', 'with', 'below', or 'behind'. A certain preposition phrase is a preposition phrase that contains a preposition such as 'onto', 'into', 'of', or 'to', which is unlikely to cause ambiguity in the facts. By classifying PPs, we can differentiate between a sentence that has both structural ambiguity and fact ambiguity and a sentence that has only structural ambiguity. If a preposition phrase in a sentence belongs to the uncertain group, the sentence contains a structural ambiguity that causes a fact ambiguity. If a preposition phrase belongs to the certain group, the sentence contains a structural ambiguity that does not cause a fact ambiguity.

Fact ambiguity occurs when a sentence may have more than one meaning, and it is caused by structural ambiguity. For example, consider the PP 'in the shop' in the sentence 'the police shot the robber in the shop'. The phrase can modify either the verb 'shot' or the noun 'robber'. The parser would generate two parse trees, and two alternative meanings can be interpreted. Consequently, two possible facts can be extracted from one sentence.

Consider Sentence no. 2 in the given test case. The sentence has two types of ambiguity: structural ambiguity and fact ambiguity. Each generated graph is assigned a possibility value: Graph (1) is assigned 0.25 and Graph (2) is assigned 0.80. The plausibility value of Graph (2) is higher than that of Graph (1) because, using the rationality and feasibility values from the crime subject knowledge-base, the processor calculated that it is more plausible for a robber to have a gun than for a shopper to see the robber through a gun. Using the max operator of fuzzy sets, max(0.25, 0.80) = 0.80, so it is most possible that the robber has a gun, and Graph (2) is selected. Graph (2) represents the unambiguous fact of the sentence. As explained in the methodology section, the selected graph is converted into predicate calculus before it is stored in the fact knowledge-base. The unambiguous facts extracted from Sentence 2 are represented as saw(shopper, robber) ^ has(robber, gun). Note that there are two facts extracted from the sentence. In comparison, only one fact is extracted from Sentence no. 1, namely heard(shopper, shot-sound).
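The certain/uncertain grouping can be sketched as a simple lookup that predicts the "Type of Ambiguity" label from the prepositions in a sentence. This Python sketch is only an approximation under stated assumptions: the two word sets contain just the examples named in the text (which are given as "such as" lists, not exhaustive ones), the function name `ambiguity_type` is hypothetical, and tokens are matched without part-of-speech tagging, so a word like 'to' used as an infinitive marker would be misclassified.

```python
# Example prepositions named in the text; the real groups may be larger.
UNCERTAIN_PREPOSITIONS = {"on", "in", "with", "below", "behind"}
CERTAIN_PREPOSITIONS = {"onto", "into", "of", "to"}


def ambiguity_type(sentence):
    """Predict the ambiguity label of a sentence from its prepositions."""
    words = sentence.lower().rstrip(".").split()
    if any(w in UNCERTAIN_PREPOSITIONS for w in words):
        # An uncertain PP gives a structural ambiguity that changes the fact.
        return "structural and fact"
    if any(w in CERTAIN_PREPOSITIONS for w in words):
        # A certain PP gives a structural ambiguity only.
        return "structural"
    return "none"


print(ambiguity_type("The shopper heard a shot sound."))
print(ambiguity_type("The shopper saw a robber with a gun."))
print(ambiguity_type("The police ran into the shop."))
```

For Sentence 1 and Sentence 2 this reproduces the labels stated in the analysis ('none' and 'structural and fact'); for the remaining sentences it only applies the stated rule and is not a transcription of Figure 6.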

VI. CONCLUSION

This research seeks a new approach for resolving the ambiguity problem that exists in the preposition phrase. The approach is based on the integration of NLP techniques, possibility theory, fuzzy sets, and a context knowledge-based approach. The knowledge-based approach was utilized for the implementation of context knowledge. Possibility theory, fuzzy sets, and context knowledge have been utilized in selecting the most possible fact from many possible facts. The proposed approach has been implemented. One of the obtained results has been presented and discussed in this paper. The overall results indicate that the approach is viable.

REFERENCES

[1] J. Redfearn, "Text mining," JISC, pp. 1–2, 2006.
[2] S. Jusoh and H. M. Alfawareh, "Agent-based knowledge mining architecture," in Proceedings of the 2009 International Conference on Computer Engineering and Applications, IACSIT. Manila, Philippines: World Academic Union, June 2009, pp. 602–606.
[3] D. Kies, Modern English Grammar, 2009 (accessed Jan 23, 2009), http://papyr.com/hypertextbooks/grammar/.
[4] L. Burnard, Reference Guide for the British National Corpus. Oxford, UK: Oxford University Computing Services, 2000.
[5] M. Chodorow, J. Tetreault, and N. Han, "Detection of grammatical errors involving prepositions," in Proceedings of the 4th ACL-SIGSEM Workshop on Prepositions, 2007, pp. 25–30.
[6] T. Baldwin, V. Kordoni, and A. Villavicencio, "Prepositions in applications: A survey and introduction to the special issue," Computational Linguistics, vol. 35, no. 2, pp. 119–149, 2009.
[7] L. Frazier, "On comprehending sentences: Syntactic parsing strategies," Ph.D. dissertation, University of Connecticut, 1979.
[8] D. Hindle and M. Rooth, "Structural ambiguity and lexical relations," Computational Linguistics, vol. 19, no. 1, pp. 103–120, 1993.
[9] G. Botterill and P. Carruthers, The Philosophy of Psychology. Cambridge, UK: Cambridge University Press, 1999.