
ISSN: 2278-5183 www.ijcdsonline.com

International Journal of Computers and Distributed Systems Vol. No.2, Issue 1, December 2012

TRANSFORMATION OF MULTIPLE ENGLISH TEXT SENTENCES TO VOCAL SANSKRIT USING RULE BASED TECHNIQUE

Mr. Uday C. Patkar, Research Scholar, Dept. of IT
Prof. Prakash R. Devale, Professor & Head, IT Dept.
Prof. Dr. Suhas H. Patil, Professor & Head, Comp. Dept.

Bharati Vidyapeeth Deemed University College of Engineering, Pune, India



ABSTRACT
Language is a very important part of communication. It is how people share ideas, knowledge and feelings. Many different languages are spoken in the world, among which English is the global language, and most information is available in English. In India, several regional languages are spoken. Sanskrit is regarded as the mother of the native languages of India. A great store of knowledge on subjects such as medicine, mathematics, geography, geology, astronomy, philosophy and many others has been kept alive in Sanskrit lore for thousands of years. Artificial Intelligence is very useful in providing people with a machine that understands the diverse languages spoken by the common man. It presents users with an interface with which they feel more comfortable, and it has been deployed to solve real-world problems. Since the early 1960s, researchers have been building and using computer systems that can translate from one language to another without extensive human intervention. Such a translator enables people who speak different languages to share ideas and communicate with one another. In this paper we introduce a mechanism that translates multiple English sentences, including question sentences, into Sanskrit text and then converts that text to speech.

General Terms
Rule based approach

Keywords
Rule based dictionary approach, Parser, bilingual dictionary, Formant Synthesizer

1. INTRODUCTION
In India, many languages are spoken by people from different states. English is a global language, but it is understood by relatively few people in India, even though it continues to be the de facto link language for administration and education. The internet is a medium for information retrieval, and the information on it is mostly available in English, so there is a need for a translator that can convert English sentences into native languages. Sanskrit can act as an interlingua for translation to and from Indian languages.
Sanskrit is regarded as the mother language from which other Indian languages evolved, and it is among the most ancient languages of India. Sanskrit is used in Indian states such as Haryana, Delhi, Rajasthan, and Jammu and Kashmir, and outside India in countries such as the UK, USA, Bangladesh, Canada, UAE, Singapore, Kenya, Fiji and Malaysia. Sanskrit belongs to the Indo-Iranian family of languages and is a historical language of India. We therefore introduce a mechanism that translates multiple English sentences, including question sentences, into Sanskrit text and then converts that text to speech.

Machine translation is one of the most important applications of Natural Language Processing. It helps people from different places to understand an unknown language without the aid of a human translator. The present module concerns the machine translation domain of Natural Language Processing. This area of Artificial Intelligence is very useful in providing people with a machine that understands the diverse languages spoken by common people. The Source Language (SL) is the language to be translated, and the Target Language (TL) is the language into which it is translated. While translating, the syntactic and semantic structure of both the source language and the target language should be considered.

There are different techniques for machine translation. To decode the meaning of the source text in its entirety, the translator must interpret and analyse the text, a process that requires deep knowledge of the grammar, semantics, syntax, idioms, etc., of both the source language and the target language. The same in-depth knowledge is needed to re-encode the meaning in the target language. The major machine translation techniques are:
Statistical Machine Translation Technique (SMT)


www.cirworld.c om



Example-based machine translation (EBMT)
Rule Based Machine Translation Technique

2. Statistical Machine Translation
According to [6], statistical machine translation (SMT) is a machine translation paradigm in which translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. SMT is a corpus based approach, and a massive parallel corpus is required to train an SMT system. SMT systems are built on two probabilistic models: a language model and a translation model. The advantage of an SMT system is that linguistic knowledge is not required to build it. The difficulty lies in creating the massive parallel corpus. SMT systems work well for machine translation from English to European languages because the word order is almost preserved in such translations. For machine translation from English to Indian languages, the parallel corpora must first be pre-processed (by changing the word order) before the SMT system is trained on them.

3. Example Based Machine Translation
Example based machine translation (EBMT) is a corpus based approach without any statistical models. Example based systems are trained with a parallel corpus of example sentences, similar to SMT systems, but they generally do not learn from the corpus. They store the parallel corpus and use matching algorithms to search for and retrieve sentences. The translation memory (TM) is one such example based machine translation system. Translation memories are built to aid human translators by serving as an assisting tool for translation. Their advantages are that they are easy to implement and that linguistic knowledge is not required. Translation memories are not only used for translation purposes; they are also useful for dictionary search of words, idioms, proverbs, etc.

4. Rule Based Machine Translation
A rule based machine translation system translates the source text into the target text by applying a set of linguistic rules. Three techniques of machine translation are applicable to rule based systems: Direct, Interlingua and Transfer based. A rule based system is developed from hand coded translation rules. It requires good linguistic knowledge to write the rules, and a bilingual dictionary is also needed. Other MT systems such as SMT and EBMT require a huge parallel corpus for training, which is not readily available for Indian languages. The sources of parallel corpora are the internet and printed texts. Parallel texts are not widely available on the internet, and in multilingual textbooks the alignment of sentences varies widely. Rule based systems, on the other hand, are well suited to translation from English to Indian languages because a bilingual dictionary can be collected more easily than a parallel corpus, and the rules can be written well with the help of linguists. The rule based system that has been developed follows the transfer based approach with reordering rules. The drawback of a rule based system is that it is confined to its rules, and the rules must evolve as the language changes over time.

4.1 Proposed System
Before turning to the translation technique, we compare and analyze the differences between the two languages, which is a prerequisite. Four major parameters need to be considered for the translation of this language pair: essence, tense, number and translational equivalence. The essence of English is that it has evolved; it is therefore a natural language. Sanskrit was formulated by sages such as Panini and is therefore regarded here as an artificial or synthetic language. English has twelve tense forms in all: the three primary tenses, Past, Present and Future, each have an Indefinite, Continuous, Perfect and Perfect Continuous form.
Sanskrit has primarily six tenses and moods: Present, Past, Future, Order, Blessing and Inspiration. English has two numbers, Singular and Plural, whereas Sanskrit has three: Singular, Dual and Plural [2, 3, 4, 5].

In general, the model consists of an array of translation rules that translate a source sentence into a target sentence, which is the frame of a Rule based Machine Translation System. The approach is simple and effective. The rules are framed keeping in view the grammar of both the source and the target language (Translational Equivalence) [1].

In this paper we use the dictionary rule based approach. In this technique, the words of the source sentence are first taken separately. The POS (Part Of Speech) tag and dependency information for these words is obtained with the help of a parser. Using this information, source morph features are assigned to each word. The corresponding target structure is then generated with the help of the transfer link rule file. Finally, the target sentence is generated with the help of the word dictionary and the morphological dictionary.

System Design
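The stages described in Section 4.1 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the roots and POS tags are supplied by hand in place of a real parser, the names `TRANSFER_RULES`, `LEXICON` and `translate` are hypothetical, the single transfer rule assumes a subject-verb-object to subject-object-verb reordering, and the three lexicon entries are toy data written in IAST romanization with no sandhi applied.

```python
# Toy sketch of the dictionary rule based pipeline:
# hand-tagged roots -> transfer rule (reorder + morph features) -> lexicon lookup.
# All names and data are illustrative assumptions, not taken from the paper.

# Transfer rule table: source POS pattern -> list of (source position, target morph feature).
# Assumption: an English subject-verb-object clause is emitted as subject-object-verb.
TRANSFER_RULES = {
    ("NN", "VBZ", "NN"): [(0, "nom.sg"), (2, "acc.sg"), (1, "pres.3sg")],
}

# Combined word and morphological dictionary: (English root, feature) -> Sanskrit form.
LEXICON = {
    ("boy", "nom.sg"): "bālakaḥ",
    ("fruit", "acc.sg"): "phalam",
    ("eat", "pres.3sg"): "khādati",
}


def translate(tagged_roots):
    """Translate a list of (root, POS tag) pairs using one matching transfer rule."""
    pattern = tuple(tag for _, tag in tagged_roots)
    rule = TRANSFER_RULES.get(pattern)
    if rule is None:
        raise ValueError(f"no transfer rule for pattern {pattern}")
    words = []
    for position, feature in rule:
        root, _ = tagged_roots[position]
        words.append(LEXICON[(root, feature)])
    return " ".join(words)


if __name__ == "__main__":
    # "The boy eats fruit", with the determiner dropped, stemmed and tagged by hand.
    print(translate([("boy", "NN"), ("eat", "VBZ"), ("fruit", "NN")]))
```

Keeping the morph feature inside the transfer rule is one possible design choice: it lets the same English noun map to different Sanskrit case forms depending on its position in the source pattern.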




Fig 2: System design diagram

4.2 Parser
A parser is an algorithm that produces a syntactic structure for a given input. The parser is the first component of the rule based machine translation system and is used on the source (English) side. It serves four main purposes in the machine translation system.

The parser is used for syntactic analysis of the English sentence, producing the parse tree structure of the sentence according to a context free grammar. An example of parsing is shown in the figure below:

Fig 1: Parse tree of source sentence.

This tree structure is required for reordering the source (English) sentence with respect to the target (Sanskrit) sentence by transfer rules.

The parser is used for Parts of Speech (POS) tagging of the English sentence, giving the English words and their corresponding POS tags. These POS tagged words are used to search for the target equivalent of each English word in the bilingual dictionary, to synthesize the morphology of the Sanskrit words, and to reorder the English text with respect to the Sanskrit text.

The parser is used for stemming the words of the English sentence to obtain their root words. The English root words obtained after stemming are used to find the equivalent Sanskrit words in the bilingual dictionary.

The parser is used for morphological analysis of the words in the English sentence. The morphological information of the English words is used in morphological synthesis of the equivalent Sanskrit words.

5. English to Sanskrit Transliteration (Labelling)
Transliteration is the process of labelling the text of one language in the script of another. In English to Sanskrit transliteration, the English text is replaced with Sanskrit text while preserving the spelling. In machine translation, proper nouns such as person names and place names (named entities) may not have equivalent Sanskrit words in the bilingual dictionary. In such cases, the translation system will not produce good output. Such words should not be translated; instead they have




to be transliterated. Transliteration is invoked after parsing the English sentence, because only after parsing can proper nouns be identified, by POS tagging them as NNP (proper noun, singular) or NNPS (proper noun, plural). A simpler heuristic is to look for words that begin with a capital letter. Any word with the 'NNP' or 'NNPS' POS category is transliterated directly without entering the other translation modules.

6. English-Sanskrit Bilingual Dictionary
A dictionary contains words and their corresponding meanings. A bilingual dictionary has the words in one language and their meanings in the other. The English-Sanskrit bilingual dictionary is used in the machine translation system to translate English words into equivalent Sanskrit words. The pre-processed bilingual dictionary is loaded into a database hosted on a MySQL server. Based on the POS categories, the dictionary is separated into seven different databases: Noun, Verb, Adjective, Adverb, Pronoun, Preposition and General. The names of the first six indicate their content, while the General category stores all other POS categories such as conjunctions, interjections, determiners, particles, cardinals, etc.

7. Sanskrit Morphological Generator
The morphological synthesizer adds morphology to the words. A bi-directional Morphological Generator cum Morphological Analyzer has been developed for Sanskrit to synthesize the morphology of Sanskrit words. A Finite State Transducer (FST) is used to model the morphology, and the orthographic rules of Sanskrit are written for it. This FST based Sanskrit morphological synthesizer is used in the rule based machine translation system for English to Sanskrit translation. The morphological information about the English words is transferred to the Sanskrit words. The parser is used to stem the English words of the input sentence and to obtain their morphologically analyzed information.
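The routing described in Sections 5 and 6, where proper nouns bypass translation and every other word is looked up in one of seven POS-specific dictionaries, can be sketched as follows. This is an assumption-laden illustration rather than the system's actual code: the function name `route` is hypothetical, and the mapping from tag prefixes to dictionary names assumes Penn Treebank style tags, which the paper only implies through its use of NNP and NNPS.

```python
# Sketch of routing a POS-tagged English word either to transliteration
# or to one of the seven POS-specific dictionaries named in Section 6.
# The tag-prefix mapping below is an assumption (Penn Treebank style tags).

PROPER_NOUN_TAGS = {"NNP", "NNPS"}

# Checked in order; the first matching prefix decides the dictionary.
TABLE_BY_PREFIX = [
    ("NN", "Noun"),
    ("VB", "Verb"),
    ("JJ", "Adjective"),
    ("RB", "Adverb"),
    ("PRP", "Pronoun"),
    ("IN", "Preposition"),
]


def route(pos_tag):
    """Return the module or dictionary that should handle a word with this tag."""
    if pos_tag in PROPER_NOUN_TAGS:
        # Proper nouns are transliterated directly and skip dictionary lookup.
        return "Transliteration"
    for prefix, table in TABLE_BY_PREFIX:
        if pos_tag.startswith(prefix):
            return table
    # Conjunctions, interjections, determiners, particles, cardinals, etc.
    return "General"


if __name__ == "__main__":
    for tag in ["NNP", "NNS", "VBZ", "JJ", "PRP$", "IN", "CC", "RP"]:
        print(tag, "->", route(tag))
```

Checking the proper-noun tags before the prefix table matters here, because NNP and NNPS would otherwise match the "NN" prefix and be sent to the Noun dictionary.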
The equivalent Sanskrit words are extracted from the English to Sanskrit bilingual dictionary.

7.1 Reordering by Transfer Rules
In machine translation, reordering denotes the change in the syntactic structure of the source text with respect to the target text. Reordering can be machine learned, executed by rules, or both. In this system, reordering by rules is used to rearrange the English sentence into the order of the Sanskrit sentence. Reordering forms the last component of the machine translation system. The syntactic information of the English sentence obtained from the parser is checked for a match in the database of reordering rules. If the syntactic pattern of the English sentence matches a source rule, the corresponding Sanskrit rule is taken and the source parse tree is modified with respect to the target rule. When the pattern matches, the transfer rule is applied to the child nodes of all branches in the parse tree. The system output is then syntactically reordered to suit the Sanskrit language. The reordered output, after morphological generation of the Sanskrit words, is displayed as the final output of the machine translation system. Once the target text is obtained, it is converted into a waveform and played through an output device such as speakers. Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database.

8. Conclusion & Future scope
Our main aim is to translate English sentences, including question sentences, into Sanskrit sentences and to synthesize speech from them. The dictionary based approach of the rule based method is feasible; it requires dictionaries of both languages along with morphological databases. The system can be used as it stands for developing speech systems. A smart and intelligent speech to speech translation system with human computer interaction would have a high impact on society.
The system can be further enhanced by using a massive bilingual dictionary database for a better choice of words. The major part of the morphology is covered; more morphological categories can be handled, and further reordering rules can be added.

9. REFERENCES
[1] Shachi Dave et al., "Interlingua-based English-Hindi Machine Translation and Language Divergence", Machine Translation, Vol. 16, Issue 4, pp. 251-304, 2001.
[2] Promila Bahadur, A. K. Jain, D. S. Chauhan, "EtranS-English to Sanskrit Machine Translation", ICWET 2012, Bombay, ACM, 2012.
[3] "English to Sanskrit machine translation semantic mapper", International Journal of Engineering Science and Technology, Vol. 2(10), 2010, pp. 5313-5318.
[4] Vaishali Barkade, "English to Sanskrit Machine Translator Lexical Parser and Semantic Mapper", National Conference on Information and Communication Technology, NCICT-IOJ, 2010.




[5] Vimal Mishra, "Approach of English to Sanskrit machine translation based on case-based reasoning, artificial neural networks and translation rules", International Journal of Knowledge Engineering and Soft Data Paradigms, 2010-12-01.
[6] R. Harshawardhan, Mridula Sara Augustine, K. P. Soman, "Advanced English - Malayalam Translation Memory for Natural Language Processing Applications", in Proc. of Nat. Conf. on Indian Language Computing (NCILC), February 2011.
