Systematic Processing of Long Sentences in Rule Based Portuguese-Chinese Machine Translation

Francisco Oliveira, Fai Wong, and Iok-Sai Hong

Faculty of Science and Technology, University of Macau. Av. Padre Tomás Pereira, Taipa, Macao
{olifran,derekfw,ma66536}@umac.mo

Abstract. Translation quality and parsing efficiency are often disappointing when Rule based Machine Translation systems deal with long sentences. Due to the complicated syntactic structure of the language, many ambiguous parse trees can be generated during the translation process, and it is not easy to select the most suitable parse tree for generating the correct translation. This paper presents an approach to parse and translate long sentences efficiently in application to Rule based Portuguese-Chinese Machine Translation. A systematic approach that breaks long sentences into shorter fragments based on patterns, clauses, conjunctions, and punctuation is considered to improve the performance of the parsing analysis. In addition, Constraint Synchronous Grammar is used to model both source and target languages simultaneously at the parsing stage to further reduce ambiguities and improve parsing efficiency.

Keywords: Rule based Machine Translation, Sentence Partitioning, Constraint Synchronous Grammar.

A. Gelbukh (Ed.): CICLing 2010, LNCS 6008, pp. 417–426, 2010. © Springer-Verlag Berlin Heidelberg 2010

1 Introduction

Most Rule based Machine Translation (MT) systems [1] can generate reasonable translations for short sentences. For long sentences, however, the story is quite different. The parsing time is directly affected by the analysis required to determine the correct syntactic parse tree from several ambiguous trees. Moreover, MT systems are more likely to fail in the analysis and to produce poor translation results. In a study of over 2000 Portuguese sentences extracted online from a government department [2], we found an average length of 19 words per sentence. Furthermore, many of the sentences are very long. Some have no punctuation except the full stop at the end, while others consist of several related fragments separated by excessive punctuation. This shows that, in most cases, MT systems need to deal with long sentences.

Recent literature offers different approaches to overcome the problem of efficiency and to improve the translation quality. Researchers have focused on breaking down complex and long sentences into several fragments based on a set of defined criteria.

Some proposed the use of punctuation marks and conjunction words as partition delimiters. Jin et al. [3] and Xiong et al. [4] focused on partitioning long Chinese sentences based on commas. Li et al. [5] considered more types of punctuation in conjunction with a hierarchical parsing approach to tackle the problem. Although this concept is simple to implement, it can easily partition fragments incorrectly and lead to poor translation results.

Others proposed splitting long sentences into specific sequences of words that can be grouped together into grammatical constituents (noun, verb, adjective, clause, and other phrases). Shallow parsing, instead of full parsing, is then applied to each group of words identified, denoted as a chunk by Abney [6]. The main purpose is to reduce the analysis needed to decide the correct syntactic structure of a sentence, remove ambiguous cases in advance, and increase the efficiency as well as the translation quality. Different authors considered different types of chunks according to their interests. Garrido-Alenda et al. [7] proposed an MT system based on a partial transfer translation engine that relies on shallow parsing for structure transfer between the language pair. Yang [8] proposed a preprocessing module for chunking phrases in a Chinese-Korean MT system.

Still others defined syntactic patterns for sentence partitioning. Kim et al. [9] defined a set of manually constructed pattern information to accomplish the task. To acquire patterns automatically, Kim et al. [10] applied Support Vector Machines, and Kim et al. [11] used Maximum Entropy to learn and identify fragments of long sentences.

Each of these approaches has its strengths and weaknesses in application to sentence partitioning. Combining these methods seems a promising way to avoid the intrinsic obstacles of each approach.
This paper presents different criteria defined for partitioning long sentences and their systematic execution in application to a Rule based Portuguese-Chinese MT system. Our strategy divides sentence partitioning into three stages. In the first stage, patterns including date, time, numbering, and phrases that have a specific sequence order are used as the starting point to identify special fragments. In the second stage, partitioning is accomplished based on punctuation, conjunction words, and phrase delimiters. Finally, all the fragments are shallow parsed to identify Noun Phrases (NP) before the full parsing and the generation of the target language.

The parsing of the MT system is based on Constraint Synchronous Grammar (CSG) [12], which is used to model the syntactic structures of two languages simultaneously. In order to perform the necessary disambiguation during the parsing stage, feature constraints are defined for each CSG rule. Due to these characteristics, our MT system does not require another set of conversion rules to change the source parse tree into the target one. As a consequence, it can reduce errors during the transfer process, reduce the parsing time, and strengthen the relationship between the parser and the generation modules.

This paper is organized as follows. Section 2 gives the details of each criterion considered in the identification of suitable fragments in long sentences. The whole process for partitioning long sentences in application to the Rule based Portuguese-Chinese MT system is presented in Section 3. The evaluation is discussed in Section 4, and a conclusion follows in Section 5.
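
As an illustration only, the first two partitioning stages described in this section can be sketched in Python. The regular expressions, the conjunction list, the placeholder scheme, and the function name `partition` are all assumptions made for this sketch; the system itself writes its patterns as CSG rules, and the third stage (shallow parsing of Noun Phrases) is not covered here.

```python
import re

# Hypothetical stage-1 patterns (illustrative regexes, not the paper's CSG rules).
PATTERNS = [
    ("Time", re.compile(r"\b\d{1,2}:\d{2}\b")),
    ("Date", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")),
]

# Hypothetical stage-2 delimiters: punctuation marks, or whitespace that is
# directly followed by one of a few Portuguese conjunctions.
DELIMITER = re.compile(r"\s*[,;:]\s*|\s+(?=(?:e|mas|ou|porque)\s)")

PLACEHOLDER = re.compile(r"<<(\d+)>>")


def partition(sentence):
    """Return (fragments, protected) for one sentence.

    Stage 1 replaces every pattern match with a placeholder so that its
    internal symbols (e.g. the colon in a time) are not treated as delimiters.
    Stage 2 splits the remaining text on punctuation and conjunctions.
    """
    protected = []  # (label, original text) for each special fragment

    def protect(label):
        def replace(match):
            protected.append((label, match.group(0)))
            return f"<<{len(protected) - 1}>>"
        return replace

    text = sentence
    for label, pattern in PATTERNS:
        text = pattern.sub(protect(label), text)

    # Drop the final full stop, split, and discard empty pieces.
    pieces = [p for p in DELIMITER.split(text.rstrip(". ")) if p]

    # Put the protected text back into each fragment.
    def restore(match):
        return protected[int(match.group(1))][1]

    fragments = [PLACEHOLDER.sub(restore, p) for p in pieces]
    return fragments, protected


if __name__ == "__main__":
    example = ("A reunião começa às 10:30, mas o relatório será entregue "
               "em 12/05/2009 e a sessão termina às 18:00.")
    fragments, protected = partition(example)
    for fragment in fragments:
        print(fragment)
```

Running the pattern stage first is what keeps a time such as 10:30 intact in this sketch; splitting on punctuation alone would cut it at the colon, which is the kind of wrong partitioning that purely punctuation-based methods are prone to.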

2 Criteria in the Identification of Fragments in Long Sentences

In order to improve the quality of sentence partitioning for the Portuguese language, three criteria have been studied and established.

2.1 Specific Pattern Rules

The first criterion is based on the identification of patterns. If a pattern matches exactly, it is believed that a high quality translation can be guaranteed for the fragment identified. Each pattern is written in Constraint Synchronous Grammar [12], a variation of synchronous grammar based on Context Free Grammar. Each production rule models both the source and the target sentential patterns and describes their relationship. An example of a CSG pattern is shown below.

Time → Number1 Symbol1 Number2 { [Number1 Symbol1 Number2] ; Symbol1 = “:” & 0