Transliteration System for English to Sinhala Machine Translation

ICIIS07

Budditha Hettige
Department of Statistics and Computer Science, Faculty of Applied Sciences, University of Sri Jayewardenepura

&

Asoka S. Karunananda
Faculty of Information Technology, University of Moratuwa, Sri Lanka

ICIIS-2007:Transliteration System for English to Sinhala Machine Translation

Overview

• What is Machine Translation
• Problems in Machine Translation
• Machine Transliteration
• Sinhala & English Language
• Existing Approaches and Methods
• Proposed approach: Design
• Modules
• Conclusion and further works
• Demonstration


What is Machine Translation?

Machine translation (MT) is the process of automatically translating text from one natural language into another.


Machine Translation Process

[Diagram] Source language sentence → Source language analysis (source language dictionary) → Translation (bilingual dictionary) → Target language generation (target language dictionary) → Target language sentence


Source language analysis


• Morphological analysis: the source language morphological analyzer processes the given sentence word by word and returns morphological information for each word.
• Syntax analysis: the source language parser identifies the syntax of the given source language sentence.


Translation


The translator translates each source language word into the target language.


Target language generation


• Morphological generation: the target language morphological analyzer/generator generates the appropriate target language words with their grammatical information.
• Syntax generation: the target language parser generates the sentence in the target language.


Problems in Machine Translation

• Out-of-vocabulary words
 – Words that are not in the dictionary
• Proper noun translation
 – Example: Mahinda Rajapaksha
• Handling technical terms
 – Example: Pentium IV Processor
• Multiword expressions
 – Example: oil cake (කැවුම්)
• Semantic and pragmatic problems


What is Machine Transliteration?


Machine transliteration is a method for automatically converting words in one language into phonetically equivalent words in another language. For example, the English word ‘machine’ is transliterated into Sinhala as මැෂින්.

Why Machine Transliteration


Machine transliteration can be used to:
• Solve the out-of-vocabulary problem
• Translate proper nouns
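As a minimal sketch of the out-of-vocabulary case, a translator can fall back to transliteration when a word is missing from the bilingual dictionary. The names `translate_word`, `TOY_DICTIONARY` and `toy_transliterate` are hypothetical and are not taken from the system described in these slides; the single dictionary entry reuses the oil cake example given earlier.

```python
# Hypothetical toy bilingual dictionary with one entry from the slides.
TOY_DICTIONARY = {"oil cake": "කැවුම්"}


def toy_transliterate(word):
    """Stand-in for a real transliteration module: it only marks the word."""
    return "<" + word + ">"


def translate_word(word, bilingual_dict, transliterate):
    """Look the word up first; transliterate only when it is out of vocabulary."""
    key = word.lower()
    if key in bilingual_dict:
        return bilingual_dict[key]
    return transliterate(word)


if __name__ == "__main__":
    print(translate_word("Oil cake", TOY_DICTIONARY, toy_transliterate))
    print(translate_word("Pentium", TOY_DICTIONARY, toy_transliterate))
```

In this sketch the dictionary always wins, so transliteration is applied only to words such as proper nouns and technical terms that the dictionary does not cover.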


Design: English to Sinhala Machine Translation System

[Diagram] Components shown, from input to output:
• English Sentence
• English Morphological analyzer
• English Dictionary
• English Parser
• Transliteration
• Translator
• Bilingual Dictionary
• Intermediate Editor
• Sinhala Morphological analyzer
• Sinhala Dictionary
• Sinhala Parser
• Sinhala Sentence


Transliteration Approaches


• Grapheme-based transliteration – direct orthographical mapping from source graphemes to target graphemes

• Phoneme-based transliteration – based on pronunciation or the source phoneme rather than spelling or source grapheme

• Hybrid and correspondence-based transliteration – uses both of the above approaches
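The difference between the first two approaches can be sketched with the letter ‘c’, which sounds different in ‘cat’ and ‘city’. The tables below are toy assumptions built only from the consonant chart in these slides (k → ක්, s → ස), and the function names and the tiny pronunciation list are hypothetical.

```python
# Grapheme-based: one fixed Sinhala choice per English letter (toy assumption).
GRAPHEME_TO_SINHALA = {"c": "ක්"}

# Phoneme-based: the Sinhala letter is chosen from the pronunciation.
PHONEME_TO_SINHALA = {"k": "ක්", "s": "ස"}

# Hypothetical pronunciation lexicon covering just the two example words.
PRONUNCIATIONS = {
    "cat": ["k", "æ", "t"],
    "city": ["s", "ɪ", "t", "i"],
}


def first_letter_grapheme_based(word):
    """Map the first letter directly from its spelling."""
    return GRAPHEME_TO_SINHALA[word[0].lower()]


def first_letter_phoneme_based(word):
    """Map the first sound, looked up in the pronunciation lexicon."""
    first_phoneme = PRONUNCIATIONS[word.lower()][0]
    return PHONEME_TO_SINHALA[first_phoneme]


if __name__ == "__main__":
    for w in ("cat", "city"):
        print(w, first_letter_grapheme_based(w), first_letter_phoneme_based(w))
```

The grapheme-based lookup returns the same Sinhala letter for both words, while the phoneme-based lookup separates them, at the cost of needing a pronunciation source.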


Types of Transliterations


• Forward Transliteration – Transliteration of a name from its native script to a foreign one

• Backward Transliteration – Restoration of a previously transliterated name to its native script


English and Sinhala language


English Language
• The English alphabet contains 26 letters, 5 of which are vowels

Sinhala Language
• The Sinhala alphabet consists of 61 letters: 18 vowels, 41 consonants and 2 semi-consonants
• These letters represent 40 sounds: 14 vowel sounds and 26 consonant sounds


Phonetic Relation between English and Sinhala


• The two languages are fundamentally different from each other
• There are no strokes in the English language
• Spoken and written English are equivalent, but written and spoken Sinhala differ
• Diphthongs are not used in written Sinhala


Disambiguation




• Two English sounds, ‘ʌ’ and ‘ə’, are represented by one Sinhala letter, ‘a’ (අ)
• The English International Phonetic Alphabet (IPA) has two sounds, ‘ɪ’ and ‘i’, but Sinhala uses one letter, ‘e’ (ඉ), for both
• No diphthongs are used in the Sinhala language, so representing these sounds is difficult
• Two sounds, ‘v’ and ‘w’, are represented by one Sinhala letter, ‘w’ (ව)
• Sinhala has no direct sound for the English letters q, x and z
• The large number of irregular English word pronunciations is also difficult to handle
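The many-to-one mappings listed above can be sketched as a lookup table. Inverting the table shows why going back from Sinhala to the English sound is ambiguous. The names `SOUND_TO_SINHALA` and `invert` are hypothetical, and the table holds only the three cases named on this slide.

```python
from collections import defaultdict

# Several English sounds share one Sinhala letter (cases from the slide).
SOUND_TO_SINHALA = {
    "ʌ": "අ",
    "ə": "අ",
    "ɪ": "ඉ",
    "i": "ඉ",
    "v": "ව",
    "w": "ව",
}


def invert(mapping):
    """Group the English sounds by the Sinhala letter they map to."""
    inverse = defaultdict(set)
    for sound, letter in mapping.items():
        inverse[letter].add(sound)
    return dict(inverse)


SINHALA_TO_SOUNDS = invert(SOUND_TO_SINHALA)

if __name__ == "__main__":
    for letter, sounds in SINHALA_TO_SOUNDS.items():
        print(letter, sorted(sounds))
```

Each Sinhala letter in the inverted table maps back to two English sounds, so the forward direction loses information that a backward transliteration cannot recover from the letter alone.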


Available Approaches


• Dictionary writers have used a number of methods for English to Sinhala transliteration
• Phonetic-based transliteration method – based on International Phonetic Alphabet (IPA) sounds
• Non-phonetic-based transliteration method – based on letters


Transliteration Approaches

English | Malalasekara | Rathna | Godage
Aback | D’nela | w[D]nela | tnela
Binocular | nb’fkdlahq,aD | Ìfk[d]lHq,[¾] | nhsfkdlahq,¾
Quota | laõDWÜD | lafjdag | lafjdagd
Volcano | fjd,a’flbkaDW | fj[d],,aflafkda | fjd,aflafkda
Xenophobia | ’fizkaD*aDWì | fi[z]k[d]f*daìh | fifkdaf*daìwd
Zero | ’iazbD¾DW | [z]iSfrda | isfrda


Proposed Approach to English to Sinhala Transliteration

• Letter-based transliteration approach
• Use Finite State Automaton (FSA)
• Two types of transliteration models are developed
 – Type 1: Original English text, e.g. Computer
 – Type 2: Sinhala words written using English letters, e.g. Ambepussa


English to Sinhala Transliteration for Original English Text (Type1)


• Letter-based transliteration approach
• Use Finite State Automaton (FSA)
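A letter-based automaton can be sketched as a scanner that holds one pending letter as its state, so that letter pairs such as ‘th’, ‘ck’ and ‘ng’ are emitted as single units. This is a simplified illustration, not the automaton used by the authors: the function name `consonant_segments` and both tables are assumptions, filled with a few consonant mappings from the charts in these slides, and vowels are passed through unmapped.

```python
# Single-letter consonant mappings (subset of the consonant chart).
SINGLE = {
    "p": "ප්", "b": "බ්", "t": "ට්", "d": "ඩ්", "k": "ක්", "c": "ක්",
    "g": "ග්", "f": "ෆ්", "v": "ව්", "s": "ස", "h": "හ්", "m": "ම්",
    "n": "න්", "l": "ල", "r": "ර",
}

# Two-letter units that must be read together (subset of the same chart).
DIGRAPH = {"th": "ත්", "ch": "ච්", "ck": "ක්", "ng": "ං", "ll": "ල", "ss": "ස"}
DIGRAPH_STARTS = {pair[0] for pair in DIGRAPH}


def consonant_segments(word):
    """Return (grapheme, Sinhala) pairs; letters with no mapping get None."""
    segments = []
    pending = ""  # state: "" is the start state, else a letter awaiting its partner

    def emit(grapheme):
        segments.append((grapheme, DIGRAPH.get(grapheme, SINGLE.get(grapheme))))

    for ch in word.lower():
        if pending:
            if pending + ch in DIGRAPH:
                emit(pending + ch)  # the pair is complete, emit it as one unit
                pending = ""
                continue
            emit(pending)  # no pair formed, emit the held letter alone
            pending = ""
        if ch in DIGRAPH_STARTS:
            pending = ch  # wait for the next letter before deciding
        else:
            emit(ch)
    if pending:
        emit(pending)  # flush a letter left over at the end of the word
    return segments


if __name__ == "__main__":
    print(consonant_segments("thick"))
    print(consonant_segments("ring"))
```

Holding a single pending letter keeps the number of states finite, which is what makes this kind of letter scanning expressible as an automaton. A complete model would also need vowel handling and the context rules that this sketch leaves out.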


IPA Chart for English Vowels

IPA | English | Sinhala | Examples
ɑː | a | ආ | father
ɪ | i | ඉ | sit
ɪ | y | ඉ | city
iː | ee | ඊ | see
ɛ | e | එ | bed
ɜː | ir | ඒ | bird
æ | a | ඇ | lad, cat, ran
ʌ | u, ou | අ (විවෘත) | run, enough
ɒ | o, a | ඔ | not, wasp
ɔː | aw, au | ඕ | law, caught
ʊ | u, oo | උ | put, wood
uː | oo, ou | ඌ | soon, through
ə | a | අ (සංවෘත) | about
ə | er | අ (සංවෘත) | winner


IPA Chart for English Consonants

IPA | English | Sinhala | Examples
p | p | ප් | pen, spin, tip
b | b | බ් | but, web
t | t | ට් | two, sting, bet
d | d | ඩ් | do, odd
tʃ | ch, t | ච් | chair, nature, teach
dʒ | g, j, dge | ජ් | gin, joy, edge
k | c, k, q, ck | ක් | cat, kill, skin, queen, thick
ɡ | g | ග් | go, get, beg
f | f, gh | ෆ් | fool, enough, leaf
v | v, ve | ව් | voice, have
θ | th | ත් | thing, teeth
ð | th, the | ද් | this, breathe, father
s | s, c, ss | ස | see, city, pass
z | z, se | ස | zoo, rose
ʒ | s, ge | ස | pleasure, beige
h | h | හ් | ham
m | m | ම් | man, ham
n | n | න් | no, tin
ŋ | ng | ං | singer, ring
l | l, ll | ල | left, bell
ɹ | r | ර | run, very
w | w | ව | we
j | y | ය | yes
ʍ | wh | ව | what


FST for Type 1 transliteration

[Figure: state diagram of the finite state transducer for Type 1 transliteration. States shown: A, B, C, D, V1 to V4, C1 to C8, D1 and D2. Transitions are labelled with English letters.]

Consonants: l0 = {b, c, d, f, g, h, j, k, l, m, n, p, q, r, s, t, v, w, x, y, z}


English to Sinhala Transliteration for Sinhala words written using English Letters (Type2)


• Letter-based transliteration approach
• Use Finite State Automaton (FSA)


Sinhala Transliteration alphabet for Type 2

Sinhala | Eng | Sinhala | Eng | Sinhala | Eng
අ | a | ඞ | nga | ප | pa
ආ | aa | ඟ | nnga | ඵ | pha
ඇ | ae | ච | ca | බ | ba
ඈ | aee | ඡ | cha | භ | bha
ඉ | i | ජ | ja | ම | ma
ඊ | ii | ඣ | jha | ඹ | mba
උ | u | ඤ | nya | ය | ya
ඌ | uu | ඥ | jnya | ර | ra
ඏ | | ඦ | ndja | ල | la
ඐ | | ට | tta | ව | va
සෘ | ŗ | ඨ | ttha | ශ | sha
සෲ | ŗ | ඩ | daa | ෂ | ssa
එ | e | ඪ | daha | ස | sa
ඒ | ee | ණ | nna | හ | ha
ඓ | ai | ඬ | nnda | ඝ | gha
ඖ | au | ද | da | ඳ | nda
ක | ka | ධ | dha | ග | ga
ඛ | kha | න | na | |
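A Type 2 conversion can be sketched as a left-to-right scan that pairs each consonant with the vowel that follows it. This is only an illustration under stated assumptions: the function name `transliterate_type2` is hypothetical, the tables cover a small subset of the alphabet, and the dependent vowel signs and the hal kirima mark are standard Unicode Sinhala code points that do not appear in the slide itself.

```python
# Base consonants (romanized without the inherent 'a'); subset of the alphabet.
CONSONANTS = {
    "k": "ක", "g": "ග", "d": "ද", "n": "න", "p": "ප", "b": "බ", "m": "ම",
    "y": "ය", "r": "ර", "l": "ල", "v": "ව", "s": "ස", "h": "හ",
}

# Independent vowel letters, used when no consonant precedes the vowel.
INDEPENDENT = {
    "a": "අ", "aa": "ආ", "i": "ඉ", "ii": "ඊ",
    "u": "උ", "uu": "ඌ", "e": "එ", "ee": "ඒ",
}

# Dependent vowel signs (assumed from Unicode); 'a' is inherent in the consonant.
SIGNS = {
    "a": "", "aa": "\u0dcf", "i": "\u0dd2", "ii": "\u0dd3",
    "u": "\u0dd4", "uu": "\u0dd6", "e": "\u0dd9", "ee": "\u0dda",
}

HAL = "\u0dca"  # marks a consonant that has no following vowel


def match_vowel(text, pos):
    """Return the longest vowel key starting at pos, or None."""
    for length in (2, 1):
        chunk = text[pos:pos + length]
        if len(chunk) == length and chunk in INDEPENDENT:
            return chunk
    return None


def transliterate_type2(word):
    """Convert romanized Sinhala to Sinhala script for the supported subset."""
    text = word.lower()
    out = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch in CONSONANTS:
            pos += 1
            vowel = match_vowel(text, pos)
            if vowel is None:
                out.append(CONSONANTS[ch] + HAL)
            else:
                out.append(CONSONANTS[ch] + SIGNS[vowel])
                pos += len(vowel)
        else:
            vowel = match_vowel(text, pos)
            if vowel is None:
                raise ValueError("unsupported letter: " + ch)
            out.append(INDEPENDENT[vowel])
            pos += len(vowel)
    return "".join(out)


if __name__ == "__main__":
    print(transliterate_type2("kamala"))
```

Multi-letter consonants in the alphabet, such as mba, nnda and ssa, would need the same longest-match treatment on the consonant side, which this sketch omits.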


FST for Type 2 transliteration

[Figure: state diagram of the finite state transducer for Type 2 transliteration. States shown: A, B, C, D, V1 to V7, C1 to C7 and D1 to D4. Transitions are labelled with English letters and with the letter sets L1 and L2.]

Vowels: L1 = {a, e, i, o, u, Ǐ, ŕ}, L2 = {a, e, i}

Consonants: L1 = {k, g, c, j, t, d, b, m, y, r, f, v, s, h, l, n, p}, L2 = {k, g, c, j, t, d, b, s, p}


Approach in Practice


Demonstration


Conclusion


• Handling the pronunciation of English words is a critical problem in English to Sinhala transliteration.
 – The English letter ‘a’ represents different sounds in Sinhala, ‘අ’, ‘ඇ’ and ‘ඈ’ (ago – අගෝ, America – ඇමෙරිකා and ant – ඈන්ටි)
• The same English letter sequence can have different pronunciations
 – The two words ‘father’ and ‘fathom’ pronounce ‘fath’ differently
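The effect of this ambiguity on a purely letter-based system can be sketched by counting how many readings a word has when every ‘a’ may be any of the three Sinhala vowels named above. The names `A_CANDIDATES` and `candidate_count` are hypothetical, and the count ignores every letter other than ‘a’.

```python
# The three Sinhala vowels the slide lists for the English letter 'a'.
A_CANDIDATES = ["අ", "ඇ", "ඈ"]


def candidate_count(word):
    """Number of readings if each 'a' is resolved independently."""
    return len(A_CANDIDATES) ** word.lower().count("a")


if __name__ == "__main__":
    for w in ("ago", "America"):
        print(w, candidate_count(w))
```

Without pronunciation information the system has to pick one reading per letter by rule, which is the motivation for bringing IPA into the system.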


Further work


Incorporating English IPA into the system


Thank you!
