Information Retrieval System and Machine Translation - Science Direct

19 downloads 31581 Views 166KB Size Report
Published by Elsevier B.V. This is an open access article under the CC .... If query words not found in the bi-lingual dictionary then transliteration must use.
Available online at www.sciencedirect.com

ScienceDirect Procedia Computer Science 78 (2016) 845 – 850

International Conference on Information Security and Privacy (ICISP2015), 11-12 December 2015, Nagpur, INDIA

Information Retrieval System and Machine Translation: A Review Mangala Madankara, Dr.M.B.Chandakb, Nekita Chavhanc a

Research Scholar, G.H.Raisoni College of Engineering Nagpur, Maharastra, India b Professor, Ramdeobaba College of Engineering Nagpur, Maharastra, India c Assistant Professor, G.H.Raisoni College of Engineering Nagpur, Maharastra, India

Abstract In this paper we introduce some of the most important areas of information retrieval i.e. Cross-lingual Information Retrieval (CLIR), Multi-lingual Information Retrieval (MLIR), Machine translation approaches and techniques. In today’s growing nation, local database storage and retrieval is essential for the developing countries. CLIR deals with asking queries in one language and demand for retrieving documents in another language. MLIR deals with asking questions in one or more languages and retrieving documents in one or more different languages. Machine translation plays an important role to achieve the CLIR and MLIR system. © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license © 2016 The Authors. Published by Elsevier B.V. (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-reviewunder under responsibility of organizing committee the ICISP2015. Peer-review responsibility of organizing committee of theofICISP2015 Keywords: Information Retrieval; Cross Language Information Retrieval; Multilanguage Information Retrieval; Multilingual Indian Language; Machine Translation ____________________________________________________________________________________________________________________

1. Introduction Information-Retrieval [IR] is the act of database storing, searching and retrieving process of information that matches a user’s request1. Internet (digital world) is no longer monolingual as the non- English content (Hindi, Marathi) is growing rapidly. With an increasingly globalized economy, the ability to find information in other languages is becoming a mandatory task. The diversity of languages is becoming barrier to understand and acquainted in digital era. So the information-retrieval (IR) has been an essential area of research in these days. It has been found that when user gets the services in local languages, it has been majorly accepted and used. ________ *Mangala S. Madankar. Tel.:+91 9156494042. E-mail address: [email protected]

1877-0509 © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of organizing committee of the ICISP2015 doi:10.1016/j.procs.2016.02.071

846

Mangala Madankar et al. / Procedia Computer Science 78 (2016) 845 – 850

In future internet (digital world) is no more monolingual, as the Hindi, Marathi contents are also growing rapidly. Most crucial challenges in Cross lingual information retrieval and Multilingual Information Retrieval are the retrieval of relevant information for a query fired in native language. As World Wide Web is expanding and the content on internet of languages other than English is also increasing rapidly compared to English. From the past few years Hindi and Marathi content has also increased rapidly on the Digital Web. All major Government departments, news papers, publication houses designed their web sites in Hindi or Marathi Language. The Significance of national borders in terms of trade and information exchange is reducing by globalization. The third most widely-spoken language in the world is Hindi. And in Maharashtra Marathi is also most widely spoken language. As our country is diversified by various languages and only 12% of people is aware of about English language. Information Retrieval in Hindi, Marathi and English language is getting popularity. These days Google provides transliteration in Hindi, Kannada, Bengali, Nepali, Punjabi, Gujarati, Kannada, Urdu, Malayalam, Marathi, Tamil and Telugu, it offers searching in 13 languages. Through multi-lingual Information-Retrieval process society get the benefit such that user can try to access the information in this native language and retrieve the information in same language without knowing in which language the information is stored in database, and hence it will be very effective research area. IR system can be effectively used where people from all walks of life are involved in application areas like e-governance agriculture, rural health, education, national resource planning, disaster management, information kiosks etc. As far as development in IR with respect to Indian languages is concerned, a lot work is going on particularly in the field of information retrieval. Research is also going on in other related areas as well such as NLP, machine translation etc. Various regional languages have been taken into consideration by researchers for IR. Even government organization like TDIL (Technology Development for Indian Languages) has made significant contributions for standardization of Indian Languages on web2. In the proceeding section, we present the various developments in Indian Information Retrieval, Cross language Information Retrieval, Multilanguage Information Retrieval, Query Processing and NLP system. Literature survey is classified according to different areas of NLP. 2. Cross Lingual Information Retrieval CLIR enables the users to recover the set of documents different than the language of the query. It allows the user to enter their query in one language and regain the set of documents in the other languages. The main advantage of CLIR is that the user can search the information without limited by the linguistic barriers. In Cross-language information retrieval the language of the query is different from the language of the document. CLIR system is a system in which a user is not restricted to only one language, it can formulate query in one language and then system returns the documents in the other language, Since in CLIR the language of query and the documents both can be translated. CLIR same as the bilingual system simplifies the searching process for multi-lingual users and enables those peoples who know only single language to provide queries in their language and then get benefits from translators for retrieving the documents of the other languages3. Followings are the terms in CLIR system. There are various methods to translate query, document or both. There are three primary tools for translations are dictionaries, machine translation systems and parallel corpora. Query translation, typically uses either dictionary based or corpus based translation. Document translation for the most part only uses machine translation. Morphological Analyzer, Transliteration and Word sense disambiguation are the major part of machine translation.

2.1. Machine Translation Machine Translation is one of the parts of language processing within Computational Linguistic. The machinetranslation method translates either the document or query by using a machine translation system. The use of computers to automate some of the tasks or the entire task of translating between human languages machine Translation (MT) cab be refers. Google currently supports searching in 78 languages as well as provide machine translation services for certain languages. However, from the end user’s perspective these search engines are essentially a database of monolingual search engines. None of the big search engines have incorporated MLIR technology as a service. The main disadvantage of Machine Translation is computationally expensive. In situations

Mangala Madankar et al. / Procedia Computer Science 78 (2016) 845 – 850

847

where there is a large collection of documents or when searching for documents on the web, machine translation is impractical1.

2.2. Bilingual dictionary For translating text and words from one language to another, bilingual dictionary can be used. Bilingual dictionaries are used in a dictionary-based approach. By looking up terms in a bilingual dictionary, queries are translated. Some queries are also translated using some or all of the translated terms. Because of its simplicity and the wide availability of machine-readable dictionaries this is the most popular approach4. In paper 5 query translation using bi-lingual dictionaries is discussed. By using a simple rule based approach transliteration of words which are not found in the dictionary is done. 2.3. Parallel Corpora Corpus based translation typically gives much better performance, as compared to dictionary based. The formation of parallel corpora is complicated and quite expensive. It can be enormously complex to find parallelcorpora for certain languages or that are large enough to be of use. The main problems with both corpus-based and dictionary- based translation are coverage and quality. Poor class corpora and dictionaries can greatly reduce the performance of a system1. Bilingual machine readable dictionaries are more widely available than parallel-corpora.

2.4. Morphological Analyzer Analyzing morphology of given text is called as Morphological Analyzer, which is a software component. It senses or generates morphemes of an input word. A. Muley et al. introduced Ruled Bases Approach for morphological analysis for Marathi Language6. This system can be used in Gender Recognition and has been developed to find a root word of a given word. P. Gawade et al. designed the morphological analyzer for Marathi, an inflectional language and also a grammatical structure i.e. a parsed tree7. The morphological analyzer checks its impact on their performance by combining with statistical POS tagger and Chunker so as to confirm its usability as a foundation for NLP applications. 2.5. Transliteration If query words not found in the bi-lingual dictionary then transliteration must use. For the transliteration, rulebased approach can be used for the language like Devanagari as it is a phonetic script, here in paper9 transliteration from Hindi/Marathi to English is done. Again one can use a segment based transliteration approach for transliteration from English to Hindi. For resultant transliteration/translation for query, iterative page rank style algorithm which is based on term co-occurrence information, produce the most feasible translation. In many Indian language documents the actual Indian word is used as-it is without translation. For ex. to explain Vaishnavdevi Travel, it is quite common to also use Vaishnavdevi Yatra in the database i.e, the actual Hindi letter transliterated in English. So, this motivated us to try a full transliteration of the source query without translation. 2.6. Word sense disambiguation In word sense disambiguation, the sense of a word is inferred based on the company it kept i.e based on the words with which it co-occurs. Similarly, the words in a query provide important idea for choosing the right translations/ transliterations, although it is less in number. For example, for a query “nadi jal”, here the translation for nadi refer as {river} and the translations for jal refer to {water, to burn}. Here, based on the context, we can see that the choice of translation for the second word is water since it is more likely to co-occur with river.

848

Mangala Madankar et al. / Procedia Computer Science 78 (2016) 845 – 850

3. Machine Translation Approaches MT approaches are classified into three categories: rule-based, corpus-based, Dictionary based and Example based 3.1. Rule-based By using a simple rule based approach, transliteration of words which are not found in the dictionary can be translated. Rule Based Machine Translation (RBMT) has much to do with the syntactic, morphological, and semantic information about the source and target language. Linguistic rules are made over this information. Also millions of bilingual dictionaries for the language pair are used. RBMT is deal to provide wide variety of linguistic phenomena and is extensible and maintainable. However, add difficulty to the system exceptions in grammar and hence, the research process requires high investment. Anglabharati and Anubharati are some examples of rule based machine translation system from English to Hindi and other Indian Languages. The main task of RBMT is to convert source language (semantic and syntactic) structures to target language (semantic and syntactic) structures 10. The methodology could have several approaches shown in figure 1.

Fig. 1. Different methods of Rule Based Machine Translation.

3.2. Corpus-based A corpus-based approach analyzes large document collections comparable or parallel-corpus to construct a statistical translation model. To overcome the problem of knowledge acquisition problem of rule based machine translation, Corpus based machine translation also refers as data driven machine translation is an alternative approach for machine translation. Corpus-based MT uses, as it name points, a bilingual parallel corpus to obtain knowledge for new incoming translation. A large amount of raw data in the form of parallel corpora is used in CBMT. This raw data contains text and their translations. These corpora are used for acquiring translation knowledge. Example-based Machine Translation Approach is one kind of Corpus based approach11. 3.3. Dictionary based machine translation This method of translation is based on entries of a language dictionary. To develop the translated verse the word’s equivalent is used. Machine-readable or electronic dictionaries are the base of the first generation of machine translation. To some extent this method is still can translate of phrases but not sentences fully. Finally, on the basis

Mangala Madankar et al. / Procedia Computer Science 78 (2016) 845 – 850

849

of more or less utilizes bilingual dictionaries with grammatical rules most of the translation approaches are developed. 3.4. Example based machine translation Example-based machine translation is achieved by its use of bilingual-corpus with parallel texts as its main knowledge, in which translation by analogy is the main idea. An EBMT system point to point mapping is done. It takes a group of sentences in the source language and corresponding translations produce of each sentence in the target language. These examples are used to translate similar types of sentences of source language to the target language. In EBMT, there are four tasks: example acquisition, example base and management, example application and synthesis. At the basic of example-based machine translation is the idea of translation by analogy. The rule of translation by analogy is encoded to example-based machine translation11. 4. Conclusion Above are the various techniques of machine translation for information retrieval in multilingual and cross lingual information retrieval. Cross-lingual and Multi-lingual IR provides new paradigms in searching documents through different varieties of languages across the world and this can be the baseline for searching not only among two languages but also in multiple. Machine translation has been an active research subfield of artificial intelligence and information retrieval system for years. Machine translation (MT) is a hard problem, because natural languages are highly complex. It is difficult to say that one approach would be sufficient to handle the translation process as the languages are evolutionary in nature. Above are the machine translation approaches carried out of various CLIR and MLIR systems. This paper summaries for reviewing not all but some of the latest researches in the area of crosslingual and multi-lingual IR and Machine Translation. References 1. PothulaSujatha and P. Dhavachelvan. A Review on the Cross and Multilingual Information Retrieval. International Journal of Web & Semantic Technology (IJWesT) Vol 2 Issue 4; October 2011. 2. M.S.Madankar. A Review on Information Retrieval in Indian Multilingual Languages. International Journal of Advanced Research in Computer Science and Software Engineering. Volume 5 Issue 3; March 2015. 3. Monika Sharma, Dr. Sudha Morwal. Refinement of Search Results of the Google using Cross Lingual Reference Technique and GPS. International Journal of Emerging Research in Management &Technology ISSN: 2278-9359 (Volume-4, Issue-4); April 2015. 4. Savita C. Mayanale, S. S. Pawar. Survey on Indian CLIR and MT systems in Marathi Language. International Journal of Computer Applications Technology and Research. Volume 4– Issue 7, 579 - 583, ISSN: 2319–8656; 2015. 5. Tan Xu1 and Douglas W. Oard. Maryland: English-Hindi CLIR. FIRE; 2008. 6. Aditi Muley et al. Morphological Analysis for a given text In Marathi language. International Journal of Computer Science & Communication Network, Vol 4(1),13-17; 2014. 7. Gaikwad, Pratiksha Gawade Deepika Madhavi Jayshree, and Sharvari Jadhav Rahul Ambekar. "Morphological Analyzer for Marathi using NLP; 2013. 8. Yilu Zhou, Jialun Qin, Hsinchun Chen, Jay F. Nunamaker. Multilingual Web Retrieval: An Experiment on a Multilingual Business Intelligence Portal. Proceedings of the 38th Hawaii International Conference on System Sciences pp-1-10; 2005. 9. Nilesh Padariya, Manoj Chinnakotla, Ajay Nagesh, Om P. Damani. Evaluation of Hindi to English, Marathi to English and English to Hindi CLIR at FIRE; 2008. 10. Sneha Tripathi1 and Juran Krishna Sarkhel. Approaches to machine translation. Annals of Library and Information Studies Vol. 57, pp. 388-393; December 2010. 11.M. D. Okpor. Machine Translation Approaches: Issues and Challenges. IJCSI International Journal of Computer Science Issues, Vol. 11, Issue 5, No 2, ISSN (Print): 1694-0814 | ISSN (Online): 1694-0784; September 2014. 12. Antony P. J. Machine Translation Approaches and Survey for Indian Languages. Computational Linguistics and Chinese Language Processing Vol. 18, No. 1, pp. 47-78 ,© The Association for Computational Linguistics and Chinese Language Processing; March 2013. 13. Nikolaos Ampazis, Helen Iakovaki. Cross-Language Information Retrieval using Latent Semantic Indexing and Self-organizing Maps. IEEE, pp-751-755; 2004. 14. Mohammad Shamsul Arefin*, Yasu. Multilingual Content Management in Web Environment. Chittagong University of Engineering & Technology, Bangladesh, IEEE; 2011. 15. B.Ashwin Kumar. Profound Survey on Cross Language Information Retrieval Methods (CLIR). Second International Conference on Advanced Computing & Communication Technologies, IEEE Computer Society, pp-64-68; 2012. 16. Bao-Quoc Ho, Van B. Dang, Minh V. Luong and Thuy T.B. Dong. English-Vietnamese Cross-Language Information. IEEE, pp-107-113; 2008.

850

Mangala Madankar et al. / Procedia Computer Science 78 (2016) 845 – 850

17. Zhao Rongying, “Visual analysis on the research of cross-language information retrieval”, IEEE, pp-107-113; 2008. 18. Sumam Mary Idicula, David Peter, S. A Multilingual Query Processing System using Software Agents. Journal of Digital Information Management _ Volume 5 Number 6 pp-385-390; December 2007. 19. Bin Xue. Research on Multi-agents Information Retrieval System Based on Intelligent Evolution. 2nd International Conference on Computer Science and Network Technology, pp-1042-1045. 20. Ashish Almeida, Pushpak Bhattacharyya. Using Morphology to Improve Marathi Monolingual Information Retrieval. FIRE; 2008.