Voice Translation on Windows & Linux machine

International Journal of Innovations & Advancement in Computer Science IJIACS ISSN 2347 – 8616 Volume 3, Issue 10 December 2014

Voice Translation on Windows & Linux Machine

Mukesh Yadav
Computer Engg. Department, PIIT, New Panvel, India

Gayatri Hegde
Computer Engg. Department, PIIT, New Panvel, India

ABSTRACT Voice translation tools provide services to convert words into a target language using natural language processing. Parsing methods concentrate on capturing keywords and translating them into the target language. Current techniques to optimize processing time are template matching, indexing of frequently used words using probability search, and a session-based cache. In this paper, we propose a model that optimizes processing time and increases the throughput of voice translation services using these techniques, and we develop a system to translate one language into another. The languages used are Hindi and English. The input is the user's voice in either of the two languages.

Keywords Sphinx, Text-to-speech, Natural Language Processing, voice translation, speech recognition, speech synthesis, template matching, probability search, session-based cache.

INTRODUCTION Language is a means of communication between two individuals, with words and sentences understood, recollected and reciprocated in the same format by the other individual. Language is a basic form of written or spoken communication. This matters especially in a nation like India, where the language and dialect change with region, so we need a translation layer that can eliminate the linguistic barrier. Improvements in existing systems concern the recognition rate (expressed by word or by sentence), the time the system needs to accommodate new speakers, dependence on or independence from the speaker, the size of the recognizable vocabulary, and decision and recognition time. The methods and algorithms used so far are discriminant analysis methods based on Bayesian discrimination, Hidden Markov Models, Dynamic Programming with the Dynamic Time Warping (DTW) algorithm, and Neural Networks. An alternative dynamic programming (DTW) implementation for speech recognition was also consulted in the development of this project. In this paper, we propose a model for voice translation on laptop or personal computer systems. This model is implemented on the Windows and Linux operating systems.

PROBLEM STATEMENT During communication it is essential to recognize a language correctly. Recognition is often not the final goal: the result of recognition is an intermediate output of the system, used as input for another module. Therefore, if the recognition is not correct, the modules that rely on its result may not perform further operations correctly or may produce only partially correct results.

PROPOSED MODEL In the proposed system the user's voice is taken as input through the microphone of the Windows machine and analyzed by the speech recognizer, in which the acoustic signals captured by the microphone are converted into a set of meaningful words. When speech is produced as a sequence of words, language models or artificial grammars are used to restrict the combination of words. The words obtained are then parsed in the natural language parsing model, information is extracted from the open-source library, and the input text is matched with the database text. The output text is then converted to voice by the speech synthesizer, and the output voice is heard by the user.

Fig 1: Architecture diagram for voice translation (source speech → speech recognizer → NLU parsing model → information extractor → statistical language generation → phrase/word translator → speech synthesizer → target speech)


HISTORY In 1983, NEC Corporation demonstrated speech translation as a concept exhibit at the ITU Telecom World (Telecom '83). The first individual generally credited with developing and deploying a commercialized speech translation system capable of translating continuous free speech is Robert Palmquist, with his release of an English-Spanish large-vocabulary system in 1997. This effort was funded in part by the Office of Naval Research. To further develop and deploy speech translation systems, in 2001 he formed SpeechGear, which has broad patents covering speech translation systems. In 1999, the C-Star-2 consortium demonstrated speech-to-speech translation of five languages: English, Japanese, Italian, Korean, and German. In 2003, SpeechGear developed and deployed the world's first commercial mobile device with on-board Japanese-to-English speech translation. One of the first translation systems using a mobile phone, "Interpreter", was released by SpeechGear in 2004. In 2006, NEC developed another mobile device with on-board Japanese-to-English speech translation. Another speech translation service using a mobile phone, "shabette honyaku", was released by ATR-Trek in 2007. In 2009, SpeechGear released version 4.0 of their Compadre: Interact speech translation product, which provides instant translation of conversations between English and approximately 35 other languages. Today, there are a number of speech translation applications for smartphones, e.g. Jibbigo, which offers a self-contained mobile app in eight language pairs on Apple's App Store and the Android Market.

TEXT-TO-SPEECH SYSTEMS The first synthetic speech was produced as early as the late 18th century. The machine was built in wood and leather and was very complicated to use for generating audible speech. It was constructed by Wolfgang von Kempelen and had great importance in the early studies of phonetics. In the early 20th century it became possible to use electricity to create synthetic speech; the first known electric speech synthesizer was the "Voder", which its creator Homer Dudley showed to a broader audience in 1939 at the World's Fair in New York. One of the pioneers of the development of speech synthesis in Sweden was Gunnar Fant. During the 1950s he was responsible for the development of the first Swedish speech synthesis, OVE (Orator Verbis Electris). At that time only Walter Lawrence's Parametric Artificial Talker (PAT) could compete with OVE in speech quality. OVE and PAT were text-to-speech systems using formant synthesis. The greatest improvements in naturalness of speech came during the last ten years. The first voices used for ReadSpeaker back in 2001 were produced using diphone synthesis. The voices are sampled from real recorded speech and split into phonemes, the small units of human speech. This was the first example of concatenation synthesis; however, such voices still have an artificial, synthetic sound. Diphone voices are still used for some smaller languages, and they are widely used to speech-enable handheld computers and mobile phones because of their limited resource consumption, in both memory and CPU.

NATURAL LANGUAGE PROCESSING The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published his famous article "Computing Machinery and Intelligence", which proposed what is now called the Turing test as a criterion of intelligence. This criterion depends on the ability of a computer program to impersonate a human in a real-time written conversation with a human judge, sufficiently well that the judge is unable to distinguish reliably, on the basis of the conversational content alone, between the program and a real human. Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?". During the 1970s many programmers began to write 'conceptual ontologies', which structured real-world information into computer-understandable data. Examples are MARGIE (Schank, 1975), SAM (Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976), QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units (Lehnert, 1981). During this time, many chatterbots were written, including PARRY, Racter, and Jabberwacky. Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. This was due both to the steady increase in computational power resulting from Moore's law and to the gradual lessening of the dominance of Chomskyan theories of linguistics (e.g. transformational grammar), whose theoretical underpinnings discouraged the sort of corpus linguistics that underlies the machine-learning approach to language processing.[3] Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to the existing hand-written rules. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Such models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks. Many of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems, which was (and often continues to be) a major limitation in their success. As a result, a great deal of research has gone into methods of learning more effectively from limited amounts of data. Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms are able to learn from data that has not been hand-annotated with the desired answers, or from a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.

COMPARISON While AT&T Bell Laboratories developed a primitive device that could recognize speech in the 1940s, researchers knew that the widespread use of speech recognition would depend on the ability to perceive subtle and complex verbal input accurately and consistently. Existing systems lack quality-of-service attributes in their voice translation services. It is possible to implement voice translation services; however, such services are few and slow. This is because most of these models concentrate mainly on language interpretation and language generation. They fail to take into consideration the large amount of back-end processing that takes place during translation. Most translation methods make use of customized dictionaries to find the translated words. However, searching for relevant words and synonyms in such large dictionaries is slow and time-consuming; it also depends on the content of the sentence being translated. Voice translation combines technologies in the areas of automatic speech recognition, understanding and text-to-speech synthesis. The tight coupling of speech recognition and understanding effectively mitigates the effects of speech recognition errors and of the non-grammatical inputs common in conversational colloquial speech on the quality of the translated output, resulting in a robust system for limited domains. Currently, speech translation technology is available as products that instantly translate free-form multilingual conversations. These systems instantly translate continuous speech. Challenges in accomplishing this include speaker-dependent variations in style of speaking or pronunciation, which have to be dealt with in order to provide high-quality translation for all users. Moreover, speech recognition systems must be able to handle external factors such as acoustic noise or speech by other speakers in real-world use of speech translation systems. Because the user does not understand the target language when speech translation is used, a method "must be provided for the user to check whether the translation is correct, by such means as translating it again back into the user's language". In order to achieve the goal of erasing the language barrier worldwide, multiple languages have to be supported. This requires speech corpora, bilingual corpora and text corpora for each of the estimated 6,000 languages said to exist on our planet today.

REQUIREMENT ANALYSIS
Feasibility analysis
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis the feasibility study of the proposed system is carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements of the system is essential. The three key considerations involved in the feasibility analysis are:
- Economical feasibility
- Technical feasibility
- Social feasibility

Economical feasibility This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, and the expenditures must be justified. The developed system is well within the budget; this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.

Technical feasibility This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would in turn place high demands on the client. The developed system must therefore have modest requirements, so that only minimal or no changes are needed for implementing it.

Social feasibility This aspect of the study checks the level of acceptance of the system by the user. It includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to offer constructive criticism, which is welcomed, as he is the final user of the system.

HARDWARE USED
Client side:
- Processor: Intel(R) Core(TM) i3 CPU M370
- Processor clock speed: 2.40 GHz
- Memory: 8 GB RAM (6.43 GB usable)
- System type: 64-bit operating system
Server side:
- Processor: Genuine Intel Centrino® Duo CPU T2250
- Processor clock speed: 2.73 GHz
- Disk memory: 53 GB
- System type: 32-bit operating system

SOFTWARE USED
- Client system model: Dell Studio 1550
- Client operating system: Windows 7 Home Premium SP1
- Server system model: LGP1
- Server operating system: Linux (Ubuntu 12.10)
- Server device name: Mukesh-P1-5XXXE1
- Integrated development environment: Eclipse IDE
- JDK: Java Development Kit
- Voice output software: Dhwani

OVERALL SYSTEM ARCHITECTURE The voice translation model consists of four main components, namely: speech recognition, natural language interpretation and analysis, sentence generation, and text-to-speech synthesis. The optimization is provided by the natural language interpretation and analysis module, which is further divided into four parts, namely: template matching, indexing of frequently used words, session-based cache, and translation to the target language.[3][6][7]

Speech recognition component Speech recognition is the process of converting an acoustic signal captured by the microphone into a set of meaningful words. Speech recognition systems can be characterized by many parameters such as speaking mode, speaking style, speaking accent and signal-to-noise ratio; the system takes the speaking accent of the speaker into consideration. Recognition is more difficult when vocabularies are large or contain many similar-sounding words. To implement this speech recognition module, Sphinx, a speech recognition system written entirely in the Java programming language, is used.[3][6][7]

API USED
Sphinx-4: a speech recognizer written in Java. Sphinx-4 is a refurbished version of the Sphinx engine that provides a more flexible framework for speech recognition, written entirely in the Java programming language.
FreeTTS: a text-to-speech synthesizer in Java. FreeTTS is a speech synthesis system written entirely in the Java programming language. It is based upon Flite, a small run-time speech synthesis engine developed at Carnegie Mellon University.
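As an illustration of how the recognizer API can be driven, the following is a minimal live-mode sketch using the classic Sphinx-4 classes named above. It assumes a Sphinx-4 XML configuration file ("config.xml" here, a placeholder name) that defines "recognizer" and "microphone" components with an appropriate acoustic model, dictionary and grammar; it is not the exact code of the described system.

// Minimal Sphinx-4 live-recognition sketch (illustrative only).
// Assumes a config.xml resource defining "recognizer" and "microphone".
import edu.cmu.sphinx.frontend.util.Microphone;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

public class RecognizerSketch {
    public static void main(String[] args) {
        // Load component definitions (acoustic model, dictionary, grammar).
        ConfigurationManager cm =
                new ConfigurationManager(RecognizerSketch.class.getResource("config.xml"));
        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();

        Microphone microphone = (Microphone) cm.lookup("microphone");
        if (microphone.startRecording()) {
            // Each call blocks until an utterance has been decoded.
            Result result = recognizer.recognize();
            if (result != null) {
                System.out.println("Recognized: " + result.getBestFinalResultNoFiller());
            }
        }
        recognizer.deallocate();
    }
}

The recognized text returned here is what the later modules (parsing, template matching and translation) operate on.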

SYSTEM ANALYSIS A system is characterized by how it responds to input signals. In general, a system has one or more input signals and one or more output signals; we use MISO (Multiple Inputs, Single Output). Our software provides an easy-to-use graphical user interface (GUI). It provides an option to select the source language, recognizes the speech, displays the spoken words, and then proceeds with further processing.

TECHNOLOGIES USED
- Eclipse, a Java-based development platform.
- Sphinx, an application programming interface for converting voice to text.
- FreeTTS, an application programming interface used for converting text to voice.
- Dhwani, software used to speak the output text in the selected language.
- Ubuntu 12.10, the OS installed at the server side, used to support and run the Dhwani software.
- Microsoft Word, used to store the pronunciations of the different words used, together with a dictionary file.
- Windows 7, the OS installed at the client side.


Fig 2: System Model [3][6][7]


Template matching Template matching checks the source input for commonly used phrases or sentences. Every language consists of a set of commonly spoken words or sentences. Converting such sentences to the target language and replacing when required drastically reduces the processing time needed for translating the sentences. Consider the statement “How are you?” as a commonly used sentence. The translated text output of this sentence is stored in a relational table. If a person speaks this sentence, then it will be directly translated instead of disassembling the words and analyzing them. [3][6][7]
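To make the template-matching step concrete, here is a hedged Java sketch. The phrase table is shown as an in-memory map with illustrative entries; in the described system it would be backed by the relational table that stores common phrases and their translations.

// Illustrative template-matching sketch: look up a whole sentence before
// falling back to word-by-word analysis. Entries are examples only.
import java.util.HashMap;
import java.util.Map;

public class TemplateMatcher {
    private final Map<String, String> phraseTable = new HashMap<>();

    public TemplateMatcher() {
        // Commonly spoken sentences and their stored translations (illustrative).
        phraseTable.put("how are you", "आप कैसे हैं");
        phraseTable.put("thank you", "धन्यवाद");
    }

    /** Returns the stored translation, or null if the sentence is not a known template. */
    public String match(String sentence) {
        String key = sentence.toLowerCase().replaceAll("[?.!,]", "").trim();
        return phraseTable.get(key);
    }
}

If match() returns a non-null result, the sentence is translated directly; otherwise it falls through to the word-level pipeline described below.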

Indexing frequently used words The large lexical database of words adds to the time complexity of the process. The words are indexed based on the number of times a particular word has been used, and the probability search algorithm is used to index the words in the database. In probability search the most probable element is brought toward the beginning: when a key is found, it is swapped with the previous key, so a key that is accessed very often gradually moves to the front of the list. The efficiency of probability search increases as more and more words are translated and indexed. The presence of a single while loop makes the time complexity of this algorithm O(n); the best case occurs when the word being searched for is at the beginning, and the worst case when it is at the end.[3][6][7]
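The following Java sketch illustrates this "swap with the previous key" heuristic as described above; the Entry word/translation pair and the list-backed index are illustrative stand-ins for the database table.

// Illustrative probability-search (transpose) index: a linear scan that,
// on a hit, swaps the found entry one step toward the front.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ProbabilitySearchIndex {
    static final class Entry {
        final String word;
        final String translation;
        Entry(String word, String translation) { this.word = word; this.translation = translation; }
    }

    private final List<Entry> entries = new ArrayList<>();

    public void add(String word, String translation) {
        entries.add(new Entry(word, translation));
    }

    /** Single while loop, hence the O(n) behaviour noted above. */
    public String lookup(String word) {
        int i = 0;
        while (i < entries.size()) {
            if (entries.get(i).word.equals(word)) {
                Entry found = entries.get(i);
                if (i > 0) {
                    Collections.swap(entries, i, i - 1); // frequently used words drift to the front
                }
                return found.translation;
            }
            i++;
        }
        return null; // worst case: word at the end, or not present at all
    }
}

Repeated lookups of the same word therefore get cheaper over time, which is exactly the effect the indexing step relies on.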

Session-based cache The system maintains a session-based cache for each user requesting the service. This works along the lines of a web cache, which caches web pages to reduce bandwidth usage, server load, and perceived lag. It is assumed that when a user engages in a conversation, there are bound to be multiple repetitions of certain words. Based on this assumption, we cache such words along with their translated text so that server processing time is saved. [3][6][7]
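A minimal sketch of such a per-session cache is shown below, assuming a bounded least-recently-used map of word to translation; the class name and capacity are illustrative, not taken from the described system.

// Illustrative per-session word/translation cache using an access-ordered
// LinkedHashMap that evicts the least recently used entry when full.
import java.util.LinkedHashMap;
import java.util.Map;

public class SessionCache {
    private static final int MAX_ENTRIES = 256; // illustrative capacity

    private final Map<String, String> cache =
            new LinkedHashMap<String, String>(MAX_ENTRIES, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

    public String get(String word) { return cache.get(word); }

    public void put(String word, String translation) { cache.put(word, translation); }
}

Each user session would hold its own SessionCache instance, so words repeated within one conversation are translated only once.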

Translation to target language
After the sentence passes through the first three phases of language interpretation and analysis, the final phase is translation to the target language. [3][6][7]

Sentence generation
The collective set of translated words is converted into a meaningful sentence in the target language. Sentence generation is the natural language processing task of generating natural language from a logical form (a set of translated words). [3][7]

Text-to-speech synthesis
The text-to-speech synthesizer converts the sentence obtained from the sentence generation module into human speech in the target language. This is done using the FreeTTS software.

SPHINX
To implement the speech recognition module, Sphinx, a speech recognition system, is used. Sphinx-4 is a state-of-the-art speech recognition system written entirely in the Java programming language. It was created through a joint collaboration between the Sphinx group at Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett-Packard (HP), with contributions from the University of California at Santa Cruz (UCSC) and the Massachusetts Institute of Technology (MIT). Sphinx-4 started out as a port of Sphinx-3 to the Java programming language, but evolved into a recognizer designed to be much more flexible than Sphinx-3, thus becoming an excellent platform for speech research. [2]
- Capabilities: live mode and batch mode speech recognizers, capable of recognizing discrete and continuous speech.
- Performance: Sphinx-4 is a very flexible system capable of performing many different types of recognition tasks. As such, it is difficult to characterize its performance and accuracy with just a few simple numbers.
Fig. 3 shows the general flow of Sphinx-4; each block is described below:
1) Input: The process starts with the voice input of the user, captured by the microphone.
2) Configuration manager: The configuration file is used to set all variables. These options are loaded by the configuration manager as the first step in any program.
3) Front end and feature: When the recognizer starts up, it constructs the front end according to the configuration specified by the user. It generates feature vectors from the input using the same process used during training.


Fig. 3: Basic flow diagram of Sphinx[3][7]


4) Decoder: The decoder constructs the search manager, which in turn initializes the scorer, the pruner and the active list. The search manager uses the features and the search graph to find the best path fit.
Sphinx-4 currently implements a token-passing algorithm. Each time the search arrives at the next state in the graph, a token is created. A token points to the previous token, as well as to the next state. The active list keeps track of all the currently active paths through the search graph by storing the last token of each path. A token holds the score of its path at that particular point in the search. To perform pruning, we simply prune the tokens in the active list. When the application asks the recognizer to perform recognition, the search manager asks the scorer to score each token in the active list against the next feature vector obtained from the front end. This gives a new score for each of the active paths. The pruner then prunes the tokens (i.e., active paths) using certain heuristics. Each surviving path is then expanded to the next states, where a new token is created for each next state. The process repeats until no more feature vectors can be obtained from the front end for scoring, which usually means that there is no more input speech data.[3][6][7]
Result: In the final step, the result is passed back to the application as a series of recognized words. Once the initial configuration is complete, the recognition process can repeat without reinitializing everything. [3][7]
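The score-prune-grow loop described above can be summarized in the following conceptual Java sketch. The Token, FrontEnd, Scorer, Pruner and Graph types here are simplified stand-ins, not the actual Sphinx-4 classes.

// Conceptual token-passing loop (simplified; not the Sphinx-4 API).
import java.util.ArrayList;
import java.util.List;

class Token {
    final Token predecessor;   // path history
    final int state;           // current search-graph state
    double score;              // path score so far
    Token(Token predecessor, int state, double score) {
        this.predecessor = predecessor; this.state = state; this.score = score;
    }
}

public class TokenPassingSketch {
    interface FrontEnd { double[] nextFeature(); }                 // null when speech ends
    interface Scorer   { double score(Token t, double[] feature); }
    interface Pruner   { List<Token> prune(List<Token> active); }
    interface Graph    { List<Integer> successors(int state); }

    static Token recognize(FrontEnd frontEnd, Scorer scorer, Pruner pruner, Graph graph, Token initial) {
        List<Token> activeList = new ArrayList<>();
        activeList.add(initial);
        double[] feature;
        while ((feature = frontEnd.nextFeature()) != null) {
            // 1) score every active path against the new feature vector
            for (Token t : activeList) {
                t.score = scorer.score(t, feature);
            }
            // 2) prune unlikely paths
            List<Token> survivors = pruner.prune(activeList);
            // 3) grow each surviving path into its successor states
            List<Token> grown = new ArrayList<>();
            for (Token t : survivors) {
                for (int next : graph.successors(t.state)) {
                    grown.add(new Token(t, next, t.score));
                }
            }
            activeList = grown;
        }
        // the best-scoring final token traces back the recognized word sequence
        Token best = null;
        for (Token t : activeList) {
            if (best == null || t.score > best.score) best = t;
        }
        return best;
    }
}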

TEXT-TO-SPEECH SYNTHESIS The text-to-speech synthesizer converts the sentence obtained from the sentence generation module into human speech in the target language. This is done using the FreeTTS software. FreeTTS is an open-source speech synthesis system written entirely in the Java programming language. It is based upon Flite and is an implementation of Sun's Java Speech API. FreeTTS supports end-of-speech markers; Gnopernicus uses these in a number of places: to know when text should and should not be interrupted, to better concatenate speech, and to sequence speech in different voices. Benchmarks conducted by Sun in 2002 on Solaris showed that FreeTTS ran two to three times faster than Flite at the time. FreeTTS was built by the Speech Integration Group of Sun Microsystems Laboratories: Willie Walker, Manager and Principal Investigator; Paul Lamere, Staff Engineer; Philip Kwok, Member of Technical Staff.[1] The core components of the FreeTTS architecture are as follows:

- Voice thread: The voice thread consists of a set of utterance processors, which perform the creation, processing, and annotation of an utterance structure.
- Utterance processors: The utterance structure is a temporary object created by the voice for each audio wave it generates. The voice initializes the utterance structure with the input text and then passes it to a set of utterance processors in sequence. Each utterance processor adds additional items to the utterance structure in a hierarchical and relational manner. For example, one utterance processor creates a relation in the utterance structure consisting of items holding the words of the input text. Another utterance processor creates a relation consisting of items describing the syllables of the words, with each syllable item pointing back to the individual word items created by the first utterance processor.


Fig. 4: FreeTTS architecture [1]
- Voice data: Voice data sets are closely linked with the voice thread. These data sets are used by each of the utterance processors.
- Output thread: The output thread is responsible for synthesizing an utterance into audio data and directing the converted audio data to the appropriate audio playback mechanism.

Possible uses of FreeTTS:
- JSAPI 1.0 synthesizer: FreeTTS provides partial support for the Java Speech API (JSAPI) 1.0 specification.
- Remote TTS server: FreeTTS can serve as a back-end text-to-speech engine that works with a speech/telephony system, or does the "heavy lifting" for a wireless PDA; the FreeTTS client/server demo shows how this can be done.
- Desktop TTS engine: FreeTTS can be used as a workstation/desktop TTS engine; for example, the FreeTTS Emacspeak demo works right out of the box with Emacspeak.
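As a concrete illustration of driving FreeTTS from Java, the following minimal sketch speaks a generated sentence. It assumes the bundled "kevin16" voice is on the classpath; the sentence is a placeholder for the output of the sentence generation module.

// Minimal FreeTTS usage sketch (illustrative only).
import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;

public class SpeakSketch {
    public static void main(String[] args) {
        VoiceManager voiceManager = VoiceManager.getInstance();
        Voice voice = voiceManager.getVoice("kevin16");
        if (voice == null) {
            System.err.println("Voice not found; check the FreeTTS voice jars on the classpath.");
            return;
        }
        voice.allocate();               // load the voice data
        voice.speak("How are you?");    // synthesize and play the sentence
        voice.deallocate();             // release resources
    }
}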


INTERFACE DESIGN
User interfaces: The Swing application is the primary user interface. It consists of a window with several command buttons, depending on the use required by the user. The user can select the language to be detected and the target language; a minimal sketch of such an interface is given below.
Hardware interfaces: none used.
Software interfaces: The GUI for the application is developed using Java Swing.
Communication interfaces: The system's translation server is accessed via an HTTP connection.
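The sketch below shows the kind of Swing window described above, with a translation-direction selector and command buttons. Component names, labels and the button actions are illustrative placeholders, not the actual application code.

// Illustrative Swing interface: language selection plus command buttons.
import javax.swing.JButton;
import javax.swing.JComboBox;
import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.swing.SwingUtilities;

public class TranslationMenu {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("Voice Translation");
            JPanel panel = new JPanel();
            // translation direction, as offered by the menu described in the flow charts
            panel.add(new JComboBox<>(new String[] {"English to Hindi", "Hindi to English"}));
            JButton speak = new JButton("Speak");
            JButton close = new JButton("Close");
            speak.addActionListener(e -> { /* start recognition for the selected direction */ });
            close.addActionListener(e -> frame.dispose());
            panel.add(speak);
            panel.add(close);
            frame.add(panel);
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.pack();
            frame.setVisible(true);
        });
    }
}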

ALGORITHM
Previously, the dynamic time warping (DTW) algorithm [Sakoe, H. & S. Chiba] and discriminant analysis methods based on Bayesian discrimination were used by researchers. The DTW algorithm is useful for isolated-word recognition with a limited dictionary; it remains an easy-to-implement algorithm, open to improvements and appropriate for applications that need simple word recognition (telephones, cars, computers, security systems, etc.). For fluent speech recognition, hidden Markov models are used; they overcome the drawbacks of the DTW algorithm. [4]

Hidden Markov Models
Modern general-purpose voice translation systems are based on hidden Markov models (HMMs). These are statistical models that output a sequence of symbols or quantities. HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal. Speech can be thought of as a Markov model for many stochastic purposes. Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, the hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors consist of cepstral coefficients, obtained by taking a Fourier transform of a short time window of speech, decorrelating the spectrum using a cosine transform, and then taking the first (most significant) coefficients. The hidden Markov model tends to have in each state a statistical distribution that is a mixture of diagonal-covariance Gaussians, which gives a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, has a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individually trained hidden Markov models for the separate words and phonemes. Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would typically use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model, which includes both the acoustic and language model information, and combining it statically beforehand (the finite state transducer, or FST, approach).

The hidden Markov model is a variant of a finite state machine having a set of hidden states Q, an output alphabet (observations) O, transition probabilities A, output (emission) probabilities B, and initial state probabilities Π. The current state is not observable; instead, each state produces an output with a certain probability (B). Usually the states Q and outputs O are understood, so an HMM is said to be a triple (A, B, Π):
Hidden states: Q = {q_i}, i = 1, ..., N.
Transition probabilities: A = {a_ij = P(q_j at t+1 | q_i at t)}, where P(a | b) is the conditional probability of a given b, and t = 1, ..., T is time. Informally, a_ij is the probability that the next state is q_j given that the current state is q_i.
Observations (symbols): O = {o_k}, k = 1, ..., M.
Emission probabilities: B = {b_ik = b_i(o_k) = P(o_k | q_i)}, where o_k is in O. Informally, b_ik is the probability that the output is o_k given that the current state is q_i.
Initial state probabilities: Π = {p_i = P(q_i at t = 1)}.


Fig. 5: Hidden Markov model
The model is characterized by the complete set of parameters Λ = {A, B, Π}. [4][5]
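Given the parameters Λ = {A, B, Π} defined above, the Viterbi decoding mentioned in the text can be sketched as follows. Indices follow the notation of the formal definition (states q_i, observations o_k); the model and observation sequence supplied to the method are assumed to come from elsewhere.

// Illustrative Viterbi decoding for an HMM (A, B, Π), using log probabilities.
public class ViterbiSketch {
    /** Returns the most likely hidden-state sequence for the observation indices. */
    static int[] viterbi(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = pi.length, t = obs.length;
        double[][] delta = new double[t][n]; // best log-probability ending in each state
        int[][] psi = new int[t][n];         // back-pointers
        for (int i = 0; i < n; i++) {
            delta[0][i] = Math.log(pi[i]) + Math.log(b[i][obs[0]]);
        }
        for (int s = 1; s < t; s++) {
            for (int j = 0; j < n; j++) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int i = 0; i < n; i++) {
                    double cand = delta[s - 1][i] + Math.log(a[i][j]);
                    if (cand > best) { best = cand; arg = i; }
                }
                delta[s][j] = best + Math.log(b[j][obs[s]]);
                psi[s][j] = arg;
            }
        }
        // backtrack from the best final state
        int[] path = new int[t];
        for (int j = 1; j < n; j++) {
            if (delta[t - 1][j] > delta[t - 1][path[t - 1]]) path[t - 1] = j;
        }
        for (int s = t - 1; s > 0; s--) {
            path[s - 1] = psi[s][path[s]];
        }
        return path;
    }
}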

FLOW CHART FOR FUNCTIONS
Flow chart for voice translation menu
In Fig. 6, the process starts with the voice translation menu. Then the loading of all recognizing and synthesizing components begins. After this, the synthesizers are initialized and a message is displayed that the software is ready to use. The user then selects one of the options on the menu, i.e. English to Hindi translation, Hindi to English translation, or closing the window. If the user selects English to Hindi translation, control goes to the connector named 1; if the user selects Hindi to English translation, control goes to the connector for that flow.
Fig 6: Flow chart for Voice Translation menu


Flow chart for English to Hindi translation (at client side)
In Fig. 7, the user speaks in English. If the voice is recognized, the words are searched for in the English dictionary; otherwise the user must speak again. If the words are found in the dictionary, they are fetched and displayed on the screen; otherwise "words not found" is displayed and the process ends. After the English words are displayed, a connection to the server is attempted. If the connection succeeds, the further processing starts; otherwise the connection attempt is repeated.
Fig 7: Flow chart for English to Hindi translation at client side

Flow chart for English to Hindi translation (at server side)
In Fig. 8, the process takes place at the server side. The server waits for a client on port number 5000 (say). If a client is connected, the user types a random input from the keyboard; otherwise the server keeps waiting for a connection. The server then receives the translated English words from the client side, and they are displayed on the server side. The Google API translation then starts and checks the Hindi dictionary. If the words are found, the output is displayed in Hindi; otherwise a message that the words were not found is shown and the process ends. The Dhwani software is then started, which takes the Hindi text as input and speaks it out for the user to hear. Then the process ends.
Fig 8: Flow chart for English to Hindi translation at server side

Flow chart for Hindi to English translation
In Fig. 9, the user speaks in Hindi, which is recognized by the recognizer. If it is not recognized, the user has to speak again; otherwise a search in the Hindi dictionary starts. If the words are found, they are fetched and displayed on the screen; otherwise "words not found" is displayed and the process ends. The Hindi to English translator process then starts, and the translated Hindi words are displayed in English. Then the speaking unit is initialized; it speaks out the output language, i.e. English, and the process ends.
Fig 9: Flow chart for Hindi to English translation
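The client-server exchange that these flow charts describe can be illustrated with the following hedged Java sketch: the client sends the recognized English text to the server on port 5000, and the server hands the received text on to the translation and Dhwani steps, which are shown only as placeholder comments. The host name and class names are illustrative assumptions, not the actual project code.

// Illustrative socket exchange between the client and server sides.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class TranslationServerSketch {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(5000)) {
            System.out.println("Server waiting for client on port 5000 ...");
            try (Socket client = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream(), "UTF-8"))) {
                String english = in.readLine();          // recognized words sent by the client
                System.out.println("Received: " + english);
                // 1) translate the English text to Hindi (e.g. via the translation API)
                // 2) pass the Hindi text to Dhwani so it is spoken out to the user
            }
        }
    }
}

class TranslationClientSketch {
    static void send(String recognizedEnglish) throws Exception {
        try (Socket socket = new Socket("localhost", 5000);   // server address is illustrative
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            out.println(recognizedEnglish);
        }
    }
}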



OUTPUT SCREENS
Voice (in English) recognition at client side

Fig. 10: Voice (in English) recognition at client side

Output voice (in Hindi) at server side


Fig. 11: Output voice (in Hindi) at server side

Voice (in Hindi) recognition

Fig. 12: Voice (in Hindi) recognition

Voice (in English) output

Fig. 13: Voice (in English) output

APPLICATIONS Voice translation is applicable to various domains such as education, call center services, broadcast channels, and telephonic communication. This system could be implemented on mobile phones as well, but currently there is no supporting API for the current mobile operating systems such as Android or Mac OS.


CONCLUSION We have successfully developed a system which takes the user's voice as input and gives machine voice as output. The input and output may be Hindi or English, depending upon the user's choice. The dictionary is constructed by adding both general words and words spoken locally by the user, in order to improve processing time and throughput. This work used a combination of different software packages such as Sphinx, FreeTTS and Dhwani, and various algorithms were studied. The advantage of this model is that it reduces the processing time and thereby increases the throughput. The drawback is that it requires an active internet connection for transferring data from the client side to the server side in order to translate and produce the machine output voice. Considerable progress is still required to remove the barrier to effective communication between individuals.

REFERENCES
[1] Willie Walker, Paul Lamere, Philip Kwok, "FreeTTS - A Performance Case Study", August 2002.
[2] Willie Walker, Paul Lamere, Philip Kwok, "Sphinx - A Performance Case Study", August 2004.
[3] Yen Chun Lin, "An Optimized Approach to Voice Translation on Mobile Phones", 2010.


[4] Titus Felix Fortuna, "Dynamic Programming Algorithms in Speech Recognition", 2008, pp. 94-99.
[5] Srinivas Bangalore, Vivek Kumar Rangarajan Sridhar, Prakash Kolan, Ladan Golipour, Aura Jimenez, "Real-time Incremental Speech-to-Speech Translation of Dialogs", 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 437-445.
[6] Parul Mann, Chaitrali Morde, Madhuri Sankpal (Information Technology Department, Mumbai University), "Voice Translation on Mobile Phones", International Journal of Engineering Science and Innovative Technology (IJESIT), Vol. 2, Issue 2, March 2013, pp. 517-520.
[7] K. K. Devadkar, Radha Shankarmani, Shailesh Kotian, Ashish Khadpe (Dept. of Information Technology, Sardar Patel Institute of Technology, Mumbai, India), "An Optimized Approach to Voice Translation on Mobile Phones".