Question Systematization using Templates

Proceedings of the 10th INDIACom; INDIACom-2016; IEEE Conference ID: 37465. 2016 3rd International Conference on “Computing for Sustainable Global Development”, 16th - 18th March, 2016, Bharati Vidyapeeth's Institute of Computer Applications and Management (BVICAM), New Delhi (INDIA)

Komal Pawar
Student, G. H. Raisoni College of Engineering, Nagpur, India
Email Id: [email protected]

Urmila Shrawankar
IEEE Member, G. H. Raisoni College of Engineering, Nagpur, India
Email Id: [email protected]

Abstract – The question is a crucial construct of natural language. A systematic, error-free question is a basic need of many natural language applications. Much research has focused on the formation of statements, while the formation of systematic questions has received less attention. This work addresses that issue through a systematization process that uses a template based approach, supported by a dictionary approach and a Maximum Entropy based POS tagging technique. Systematization aims to re-form a proper question from an erroneous input question by correcting errors in word order and word spelling and by removing ambiguous synonyms. The work deals with domain specific WH-questions in English and additionally handles imperative questions. The template based approach rests on the key concept of ‘Question Templates’, which are designed with human intelligence using detailed knowledge of the various language constructs, their grammar and the domain specific questionnaire. This work is useful in various fields, for example in academics to set question papers, to assist learners of English, and to produce intermediate output for complex systems such as question answering systems that retrieve the correct answer from a huge dataset.

Keywords – POS tagger; question templates; systematization; template based approach; WH-questions

Copy Right © INDIACom-2016; ISSN 0973-7529; ISBN 978-93-80544-20-5

I. INTRODUCTION

Natural language processing (NLP) is a major research area. NLP drives a number of natural language (NL) applications and plays a vital role in research, since language is the most powerful medium for smooth and effective communication. A language comprises different language constructs. Research has paid more attention to the formation of proper statements, known as assertive sentences, than to the interrogative construct, known as a question in English. The question is an important construct of a language, since communication usually begins with a question, and a correct question leads to a correct answer. Systems under natural language processing are mainly divided into two categories: (i) open domain systems and (ii) closed domain systems [1][2][3]. Open domain systems are domain independent and hold a large collection of data from various fields. Such systems include questions from a large variety of domains and focus on different topics. Their advantage is that they cover a large discussion area, but deep reasoning is not possible because they deal with various parameters of language. Closed domain systems deal with questions in a specific domain having a limited amount of focused and structured information. Because closed domain systems collect questions from one specific domain, they are flexible in maintaining the domain specific dataset and dictionaries required by the system.

Systematization addresses this less explored issue. Systematization refers to the process of systematic question reformation when the question text entered by the user contains structural irregularities and ambiguities. These errors must be removed because they affect the semantics of a question and make it difficult to interpret. Misinterpretation of question text may cause complex systems such as question answering (QA) systems to malfunction, or it may create difficulties in the learning process of new learners. This work considers three major parameters of question text, which usually contain the key errors: (i) word order, (ii) word spelling and (iii) ambiguous word synonyms. Questions are mainly categorized into two types, WH-type and Yes/No type. This work focuses on WH-questions. It uses a template based approach accompanied by NLP techniques, general and domain specific dictionaries, and mapping rules for the overall implementation of the systematization process.

The template based approach is built on the concept of ‘Question Templates’, which are designed with human intelligence. A template offers an effective way to represent different language constructs. It refers to the standard pattern to be followed by a particular language construct [6][7]. Interrogative constructs likewise have standard grammatical patterns for framing the different categories of question in English. On this basis, questions that disobey these patterns are referred to as irregular. The systematization process starts with the user question taken as input. This question may contain irregularities such as improper word order, spelling errors and ambiguous synonyms, which generally lead to misinterpretation of the question. Systematization removes these irregularities and gives as output a systematic question that is both syntactically and semantically correct. The final output is also a simple, clear question.

The rest of the paper is organized as follows. Section II discusses the motivation behind this work and previous contributions made towards this topic. Section III details the system model for the systematization process and explains its components. Section IV presents an analysis of the different techniques used to implement systematization. Section V gives a brief explanation of the expected result. Finally, Section VI concludes the paper and provides a roadmap for future work.

II. MOTIVATION AND CONTRIBUTIONS

The idea of systematization is derived from the variety of work carried out in the field of natural language processing. The categories of systems working under natural language processing are given in [1][2]. These papers describe two categories of system, open domain and closed domain. The characteristics of each, along with a comparison between them, are given in [3][4]. Different language constructs and the importance of question constructs are explained in [5], which describes the role of input in the development of WH-questions starting from child language and especially stresses the need for structured and systematic question formation. A technical definition of templates and the role they play in natural language processing is explained in detail in [6] through the concept of requirement templates. That paper develops an automated approach for checking conformance to requirements templates using natural language processing, and also explains the need for requirement templates in software engineering. The template based approach and its application to question answering systems is explained in [7][8][9]. A comparison of the template based approach with other question generation techniques is given in [10]. The role of templates in the template based approach is explained in [7]. The importance of NLP techniques such as tokenization and their utility for language processing systems is specified in [6].
A spelling error correction approach based on string transformation is explained in [11]. Similarly, [12] explains the utility of synonym checking in applications such as string transformation and information retrieval. The application of WordNet to synonym finding in a semantic based approach is given in [13]. WordNet as a lexical database is explained in [14][15]. The role of word synonyms along with phrase synonyms is described in [16]. The concept of POS tagging and its applicability to language constructs is explained in detail in [6]. Various POS tagging approaches and their comparison are given in [17][18]. A Maximum Entropy based model for part-of-speech tagging with the Penn Treebank tag set, and its advantages for language processing, is explained in [19]. The use of POS tagging in query-to-question translation and FAQ retrieval is explained in [20][21]. POS tagging applied to Twitter is explained in [22].

Fig. 1. System Model for Systematization Process

III. SYSTEM MODEL

The system model for the systematization process is shown in Fig. 1. It shows the three main phases of the process and their components:

• Question text pre-processing
• Question Reformation
• Question set updation

All three phases, with their corresponding sub-phases, are explained in the following subsections.

A. Question text pre-processing

Pre-processing is the first phase in the implementation of the system model and plays a vital role in systematic question formation. Its first task is to create the dataset, and its second task is to remove the irregularities and errors present in the input question.



1) Dataset creation: Dataset creation comprises the task of designing question templates. The templates are designed manually, based on human intelligence and creativity. Designing them requires detailed knowledge of language constructs, mainly the interrogative constructs, along with their syntax and semantics. It also requires careful observation of the questions present in the question set, which is domain specific since this is a closed domain system. A number of templates are created manually to implement the template based approach. An example of a question and its template is shown below.

Question => What are the different services provided by the internet?
Template => WH-word + auxiliary-verb + determiner + adjective + noun (plural) + verb + preposition + noun
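As an illustration only, the example template above can be written as an ordered list of slots and checked against a tagged question. The sketch below is not the authors' implementation. The names TEMPLATE, SLOT_TAGS and conforms are hypothetical, the mapping from slot labels to Penn Treebank tags is an assumption, and the tags of the example question are assigned by hand rather than produced by a tagger. Because the template has eight slots while the question has nine words, the sketch also assumes that a noun slot may absorb one leading determiner ("the internet").

```python
# Slot labels copied from the example template in the text.
TEMPLATE = [
    "WH-word", "auxiliary-verb", "determiner", "adjective",
    "noun (plural)", "verb", "preposition", "noun",
]

# Assumed mapping from each slot label to Penn Treebank tags (not from the paper).
SLOT_TAGS = {
    "WH-word": {"WP", "WDT", "WRB"},
    "auxiliary-verb": {"VBP", "VBZ", "VBD", "MD"},
    "determiner": {"DT"},
    "adjective": {"JJ"},
    "noun (plural)": {"NNS"},
    "verb": {"VB", "VBD", "VBG", "VBN"},
    "preposition": {"IN"},
    "noun": {"NN", "NNP"},
}


def conforms(tagged, template=TEMPLATE):
    """Return True if the (word, tag) pairs fill the template slots in order."""
    i = 0
    for slot in template:
        # Assumption: a noun slot may absorb one leading determiner ("the internet").
        if slot.startswith("noun") and i < len(tagged) and tagged[i][1] == "DT":
            i += 1
        if i >= len(tagged) or tagged[i][1] not in SLOT_TAGS[slot]:
            return False
        i += 1
    return i == len(tagged)  # no words may be left over


# Hand-assigned tags for the example question (a real system would use a POS tagger).
EXAMPLE = [
    ("what", "WP"), ("are", "VBP"), ("the", "DT"), ("different", "JJ"),
    ("services", "NNS"), ("provided", "VBN"), ("by", "IN"),
    ("the", "DT"), ("internet", "NN"),
]
# The same words with the first two swapped, i.e. an irregular word order.
IRREGULAR = [EXAMPLE[1], EXAMPLE[0]] + EXAMPLE[2:]

print(conforms(EXAMPLE))    # True
print(conforms(IRREGULAR))  # False
```

Keeping the slot labels separate from the tag mapping means that a new template only needs a new list of labels, which matches the idea of designing templates manually from the domain specific question set.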

2) Removal of Irregularities: Dataset creation is followed by the sub-phases below, which remove the errors and irregularities present in the input question text.

a) Tokenization: Tokenization splits the input question text into tokens and removes useless symbols such as extra spaces and punctuation marks. This important NLP technique is the basis for all the techniques applied afterwards.

b) Spelling error correction: Spelling error correction removes the spelling mistakes made by the user while typing the question text, which is required to make the question semantically correct. For this purpose, domain specific as well as common word dictionaries are maintained in the dataset.

c) Synonym checking: Synonyms can create ambiguities in the interpretation of a question. Systematization aims to frame unambiguous and simple questions that are easy to understand. This is implemented with the help of domain specific keyword dictionaries that are located in the dataset and built with the help of WordNet [13][14][15].

d) Phrase replacement: Along with WH-questions, the systematization process also considers imperative questions. These questions always start with an imperative word or phrase. The output of the systematization process is defined to be a WH-question, so the imperative word or phrase must be converted into a WH-word to form a WH-type question. This is implemented with mapping rules present in the dataset. The mapping rules are designed manually on the basis of the domain specific question set.

B. Question Reformation

Question reformation combines two sub-phases: POS tagging and tagged word rearrangement. This phase forms the systematic question from the improper question text. The sub-phases are explained below.

1) POS tagging: The systematization process needs the part of speech of each individual word. This requirement is fulfilled by POS tagging, an important NLP technique that plays a vital role in this work. For this purpose the system uses the Stanford POS Tagger [6], which is built on the Maximum Entropy algorithm [19] and the Penn Treebank tag set. This is the first step in question reformation. It generates tags giving the part of speech of each word for the rearrangement sub-phase that follows.

2) Tagged word rearrangement: In this sub-phase, the words are arranged in the slots of a question template as required, on the basis of the tags given by the POS tagger. This rearrangement forms the systematic question, and the precision of the system depends on it.

C. Question set updation

The last task is to keep the system updated. For this, a basic check is performed, and depending on its yes or no outcome the new question is added to the question set. If the newly added question has a different pattern from the existing questions, a new template is designed and used for further reference. This phase helps to make the system progressively more intelligent.

IV. ANALYSIS OF TECHNIQUES

The techniques used in the systematization process are analysed as follows:

• Tokenization – A standard technique removes all special symbols except ‘.’ (full stop), which is kept because domain specific questions may contain it.
• Spelling error correction – A discriminative model is used for correcting spelling errors. It works on candidate generation at the character level.
• Synonym checking – It is implemented with a dictionary based approach and uses domain specific dictionaries to list the different synonyms.
• Phrase replacement – It uses a rule based approach and makes use of mapping rules.
• POS Tagging – This standard NLP technique is implemented with the Stanford POS Tagger [6], which is built on the Maximum Entropy algorithm [19] and the Penn Treebank tag set, and is found to be effective for tagging individual words.

V. EXPECTED RESULTS

Fig. 2 gives a brief idea of the expected result of the system.

Fig. 2. Example of Systematization Process

The input to the system is a single WH or imperative question. This input question may contain a word order error, meaning that the words are incorrectly arranged and do not follow the grammatical rules of interrogative constructs. If such an error exists in the input, the system will reconstruct the proper question by placing each word at its correct position on the basis of the part-of-speech tags given by the POS tagger. The rearrangement is achieved by using manually designed question templates. If the words of the input contain spelling errors, these will be corrected using general as well as domain specific dictionaries, which the system maintains on the basis of a detailed analysis of the domain specific questionnaire. The input may also contain ambiguous synonyms, especially of nouns, which may lead to misinterpretation of the question. This work is expected to replace such misleading synonyms with simple words using the domain specific dictionary. The system is expected to remove all of the above errors effectively and accurately. Imperative questions will also be converted into WH-questions using mapping rules. Finally, the system generates as output a single WH-question that is a flawless systematic question.

VI. CONCLUSION

This work provides a system model for re-forming systematic, error-free questions from erroneous questions. Systematization is a combined approach that can remove three major types of irregularity in a stepwise manner. Two of these are the incorrect order of words and the spelling errors present in the input question, and the third is the presence of ambiguous synonyms that may cause misinterpretation of the question. The template based approach is the basic idea behind this work and is found to be effective, since this is a closed domain system that deals with domain specific questions. Further, the system can be effectively trained using a variety of templates designed with human intelligence. The Stanford POS tagger tags individual words with their parts of speech and thus increases the accuracy of the system. WordNet and domain specific dictionaries help to extend the usability of this work to multiple domains. In future, this work can be extended to cover Yes/No questions in addition to WH-questions. It could also be extended to handle questions containing the lingo words that users commonly type in input text.

REFERENCES

[1] Itziar Aldabe, Montse Maritxalar, “Semantic Similarity Measures for the Generation of Science Tests in Basque”, IEEE Transactions on Learning Technologies, vol. 7, no. 4, pp. 375-387, December 2014.
[2] Payal Biswas, Aditi Sharan, Nidhi Malik, “A framework for restricted domain question answering system”, IEEE International Issues and Challenges in Intelligent Computing Techniques (ICICT) conference, pp. 613-620, 2014.
[3] Shubhangi Tirpude, Dr. A. S. Alvi, “Closed Domain Keyword based Question Answering System for Legal Documents of IPC Sections & Indian Laws”, International Journal of Innovative Research in Computer and Communication, pp. 5299-5311, 2015.
[4] Mohammad Reza Kangavari, Samira Ghandchi, Manak Golpour, “Information Retrieval: Improving Question Answering Systems by Query Reformulation and Answer Validation”, International conference on Scholarly and Scientific Research & Innovation, pp. 215-222, 2008.
[5] Anthony Goodwin, Deborah Fein and Letitia Naigles, “The role of maternal input in the development of wh-question comprehension in autism and typical development”, Journal of Child Language, pp. 32-63, 2015.
[6] Chetan Arora, Mehrdad Sabetzadeh, Lionel Briand and Frank Zimmer, “Automated Checking of Conformance to Requirements Templates using Natural Language Processing”, IEEE Transactions on Software Engineering, vol. 2, no. 7, May 2015.
[7] Tilani Gunawardena, Nishara Pathirana, Medhavi Lokuhetti, Roshan Ragel, and Sampath Deegalla, “Performance Evaluation Techniques for an Automatic Question Answering System”, International Journal of Machine Learning and Computing, pp. 294-300, 2015.
[8] Andrea Andrenucci, Eriks Sneiders, “Automated Question Answering: Review of the Main Approaches”, IEEE Third International Conference on Information Technology and Applications Proceedings (ICITA), 2005.
[9] Eriks Sneiders, “Automated Question Answering Using Question Templates that Cover the Conceptual Model of the Database”, Proceedings of the 6th International Conference on Applications of Natural Language to Information Systems, pp. 235-239, 2002.
[10] Sheetal Rakangor, Dr. Y. R. Ghodasara, “Literature Review of Automatic Question Generation Systems”, ACM International Journal of Scientific and Research Publications, pp. 1-5, January 2015.
[11] Ziqi Wang, Gu Xu, Hang Li, and Ming Zhang, “A Probabilistic Approach to String Transformation”, IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 5, pp. 1063-1075, May 2014.
[12] Santosh Kumar Ray, Shailendra Singh, B. P. Joshi, “A semantic approach for question classification using WordNet and Wikipedia”, Elsevier International Journal of Pattern Recognition Letters, pp. 1935-1943, October 2010.
[13] Magnini, B., Negri, M., Prevete, R., Tanev, “A WordNet-Based Approach to Named Entities Recognition”, Proceedings of the workshop on Building and using semantic networks, pp. 1-7, 2002.
[14] Fellbaum C. ed., “WordNet – An Electronic Lexical Database”, MIT Press, 1998.
[15] G. A. Miller, “WordNet: A Lexical Database for English”, Magazine of Comm. of the ACM, pp. 39-41, 1995.
[16] Tao Cheng, Hady W. Lauw, and Stelios Paparizos, “Entity Synonyms for Structured Web Search”, IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 10, pp. 1862-1875, October 2012.
[17] Jeffrey C. Reynar and Adwait Ratnaparkhi, “A Maximum Entropy Approach to Identifying Sentence Boundaries”, ACM Proceedings of the fifth conference on Applied natural language processing ANLC '97, pp. 16-19, 1997.
[18] Sanjay K Dwivedi, Vaishali Singh, “Research and reviews in question answering system”, Elsevier International Conference on Computational Intelligence: Modeling Techniques and Applications (CIMTA), pp. 417-424, 2013.
[19] Huang Heyan, Zhang Xiaofei, “Part-of-Speech Tagger Based on Maximum Entropy Model”, 2nd IEEE International Conference on Computer Science and Information Technology ICCSIT, pp. 26-29, 2009.
[20] Maryam Nafari, Chris Weaver, “Query2Question: Translating Visualization Interaction into Natural Language”, IEEE Transactions on Visualization and Computer Graphics, vol. 21, no. 6, pp. 756-769, June 2015.
[21] Chung-Hsien Wu, Jui-Feng Yeh, and Yu-Sheng Lai, “Semantic Segment Extraction and Matching for Internet FAQ Retrieval”, IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 7, pp. 930-940, July 2006.
[22] Leon Derczynski, Alan Ritter, Sam Clark, Kalina Bontcheva, “Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data”, Proceedings of the International Conference on Recent Advances in Natural Language Processing, pp. 198–206, September 2013.