An Integrated Deterministic and Nondeterministic Inference Algorithm for Sequential Labeling

Yu-Chieh Wu1, Yue-Shi Lee3, Jie-Chi Yang2, and Show-Jane Yen3

1 Finance Department and School of Communication, Ming Chuan University, No. 250 Zhong Shan N. Rd., Sec. 5, Taipei 111, Taiwan
[email protected]
2 Graduate Institute of Network Learning Technology, National Central University, No. 300, Jhong-Da Rd., Jhongli City, Taoyuan County 32001, Taiwan, R.O.C.
[email protected]
3 Department of Computer Science and Information Engineering, Ming Chuan University, No. 5, De-Ming Rd., Gweishan District, Taoyuan 333, Taiwan, R.O.C.
{leeys, sjyen}@mcu.edu.tw

Abstract. In this paper, we present a new search algorithm for sequential labeling tasks based on the conditional Markov models (CMMs) framework. Unlike conventional beam search, our method traverses all possible incoming arcs and also considers the "local best" so far of each previous node. Furthermore, we propose two heuristics to meet the efficiency requirement. To demonstrate the effect of our method, six varied and large-scale sequential labeling tasks were conducted in the experiments. In addition, we compare our method to the Viterbi and beam search approaches. The experimental results show that our method yields not only a substantial improvement in runtime efficiency but also slightly better accuracy. In short, our method achieves an F(β) rate of 94.49 on the well-known CoNLL-2000 chunking task.

Keywords: L2-regularization, part-of-speech tagging, support vector machines, machine learning

1 Introduction

Sequential chunk labeling aims at finding non-recursive chunk fragments in a sentence. Arbitrary phrase chunking and named entity recognition are well-known instances. Over the past few years, structured learning methods such as conditional random fields (CRFs) [8] and maximum-margin Markov networks (M3N) [16] have shown great accuracy in many natural language learning tasks. Structured learners also have the advantage of taking the entire structure into consideration instead of a limited history. In contrast, the goal of local-classifier-based approaches (e.g., maximum entropy models) is to learn to predict labels with fixed context window features. These methods need to encode the history explicitly to inform the learners. Although high-order features (i.e., a large history) may be useful for prediction, the runtime is often intractable for large-scale and large-category tasks. However, the training time of local-classifier-based methods is very efficient since the training instances can be treated independently. Support vector machines (SVMs), which are among the state-of-the-art supervised learning algorithms, have been widely employed as local classifiers in many sequential labeling tasks [7, 5, 19]. In particular, a linear kernel SVM can now be trained in linear time [4]. Even though local-classifier-based approaches suffer from the label-bias problem [8], training a linear kernel SVM is not difficult even for large-scale and large-category data. By means of so-called one-versus-all or one-versus-one multiclass SVM training, the learning process can be decomposed into a set of independent binary tasks.

In this paper, we present a hybrid deterministic and nondeterministic inference algorithm based on the conditional Markov model (CMMs) framework. The algorithm makes use of the "so-far" locally optimal incoming information and traverses all possible incoming arcs to predict the current label. A modified Viterbi search can then be adopted to find the optimal label sequence in this manner. To exploit the characteristics of sequential chunk labeling tasks, we propose two heuristics to enhance efficiency and performance. One automatically constructs the connections between chunk tags, while the other centralizes the computation effort. To demonstrate our method, we conduct experiments on six well-known chunking tasks. We also compare our method with different inference strategies.

2 Frameworks of Conditional Markov Models

The goal of conditional Markov models (CMMs) is to assign the tag sequence with the maximum conditional probability given the observation sequence,

$$P(s_1, s_2, \ldots, s_n \mid o_1, o_2, \ldots, o_n)$$

where $s_i$ is the tag of word $i$. For first-order left-to-right CMMs, the chain rule decomposes the probability function as:

$$P(s_1, s_2, \ldots, s_n \mid o_1, o_2, \ldots, o_n) = \prod_{i=1}^{n} P(s_i \mid s_{i-1}, o_i) \qquad (1)$$

Based on the above setting, one can employ a local classifier to predict $P(s_i \mid s_{i-1}, o_i)$, and the optimal tag sequence can be found using conventional Viterbi search. Fig. 1 illustrates the graphs of CMMs of various orders (zero, first, second, and the proposed second order). The chain probability decompositions of the four CMM types in Fig. 1 are as follows:

$$P(s, o) = \prod_{i=1}^{n} P(s_i \mid o_i) \qquad (2)$$

$$P(s, o) = \prod_{i=2}^{n} P(s_i \mid o_i, s_{i-1}) \qquad (3)$$

$$P(s, o) = \prod_{i=3}^{n} P(s_i \mid o_i, s_{i-1}, s_{i-2}) \qquad (4)$$

$$P(s, o) = \prod_{i=3}^{n} P(s_i \mid o_i, s_{i-1}, \hat{s}_{i-1}) \qquad (5)$$

Equations (2), (3), and (4) are the standard zero-, first-, and second-order decompositions, while equation (5) is the proposed hybrid second-order CMM decomposition, which will be discussed in the next section. The above decompositions merge the transition and emission probability into a single function. McCallum et al. [9] further combined the locally trained maximum entropy model with the inferred transition score. However, our conditional support vector Markov models define a different chain probability. We replace the original transition probability with a transition validity score, i.e.,

$$P(s, o) = \prod_{i=2}^{n} \tilde{P}(s_i \mid s_{i-1}) \, P(s_i \mid o_i) \qquad (6)$$

$$P(s, o) = \prod_{i=3}^{n} \tilde{P}(s_i \mid s_{i-1}) \, P(s_i \mid o_i, s_{i-1}, \hat{s}_{i-1}) \qquad (7)$$

Fig. 1. Various orders of conditional Markov models: (a) the standard zero-order CMMs, (b) first-order CMMs, (c) second-order CMMs, and (d) the proposed second-order CMMs.

The transition validity score is merely a Boolean flag which indicates the relationship between two neighboring labels. Equations (6) and (7) are the zero-order and our second-order chain probabilities. In the following, we introduce the proposed inference algorithm and show how to obtain the transition validity score automatically, regardless of the chunk representation used.

2.1 Hybrid Deterministic and Nondeterministic Inference Algorithm

In general, incorporating high-order information (a rich history) can improve accuracy. However, the inference time of high-order CMMs typically scales exponentially with the history length K. For example, to compute the best incoming arc for $s_i$ in a second-order CMM, one needs to enumerate all combinations of $s_{i-1}$ and $s_{i-2}$. To reduce this curse of high-order information, we present an approximate inference algorithm that has the same computational time complexity as first-order Markov models. The method considers all combinations between the previous node and the current node, while the remaining history is kept greedy. More specifically, for each previous node $s_{i-1}$, we traverse the optimal incoming path to incrementally determine $s_{i-2}, s_{i-3}, \ldots, s_{i-k+1}$, instead of enumerating all combinations of the history. This can also be viewed as a variant of the traditional Viterbi algorithm. Fig. 2 illustrates the proposed inference algorithm. As in most local-classifier-based approaches, the history information is known in advance during training. At test time, the algorithm determines the optimal label sequence by three factors: the accumulated probability, the transition validity score, and the output of the local classifier. The local classifier (SVM) should generate different decisions when the given history changes. The algorithm iteratively stores the optimal incoming arcs in order to trace the local best history. By following this line, the local classifier can concentrate on comparing all possible label pairs (previous and current nodes). Finally, the algorithm traces back the path to determine the output label sequence.

3 Speed-up Heuristics

3.1 Automatic Chunk Relation Construction

In this paper, we argue that the transition probability is not useful in our CMMs framework. Nevertheless, one important property of sequential chunk labeling is that there is only one phrase type in a chunk. For example, if the previous word is tagged as the beginning of a noun phrase (B-NP), the current word must not be the end of another phrase type (E-VP, E-PP, etc.). Therefore, we only model relationships between chunk tags to generate valid phrase structures. In other words, we eliminate the influence of the tag transition probability. Hence the local classifiers play the critical role in the Markov process. Wu et al. [19] presented an automatic chunk pair relation construction algorithm which can handle the so-called IOB1/IOB2/IOE1/IOE2 [7, 17] chunk representation structures in either the left-to-right or right-to-left direction. However, its main limitation is that it only works well on IOB or IOE tags rather than more complex phrase structures, for example SBIE. As reported by [20], the use of rich chunk representations, such as second-begin and third-begin of chunk, yielded better accuracy in most Chinese word segmentation tasks. Such representations are very useful for Chinese proper names, which usually contain more than five characters. To remedy this, we extend the observations of [19] and propose a more general way to automatically build the valid chunk tag relations. As in the previous literature, our method is applicable in either the left-to-right or right-to-left direction. The main concept is to find the valid relationships between leading tags (for example B/I/E/S) and project them to all chunk tags. The method is summarized below; a code sketch follows the list.

1. Enumerate all possible leading tags.
2. Initialize all pairs of leading tags with no relation.
3. Scan the training data and classify the leading tag pairs into three categories: (1) the same phrase but different leading tags (e.g., B-NP, I-NP); (2) the same phrase and the same leading tag (e.g., I-NP, I-NP); (3) different phrases and different leading tags (e.g., S-NP, B-VP).
4. Construct the chunk pair validity matrix by projecting the leading tag relations.
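The following is a minimal Python sketch of this construction, under our own assumptions about the data format (a corpus given as sequences of chunk tags such as "B-NP", "I-NP", "O"); the helper names are hypothetical and not taken from the original implementation.

```python
from collections import defaultdict

def split_tag(tag):
    """Split a chunk tag into (leading tag, phrase type), e.g. 'B-NP' -> ('B', 'NP')."""
    lead, _, phrase = tag.partition("-")
    return lead, phrase or None   # tags without a phrase part (e.g. 'O') keep None

def join_tag(lead, phrase):
    return f"{lead}-{phrase}" if phrase else lead

def build_validity_matrix(tagged_sentences):
    """Learn valid (previous tag, current tag) pairs from training chunk-tag sequences."""
    same_phrase, cross_phrase = set(), set()
    phrases = set()
    for tags in tagged_sentences:
        for prev, cur in zip(tags, tags[1:]):
            (pl, pp), (cl, cp) = split_tag(prev), split_tag(cur)
            phrases.update(p for p in (pp, cp) if p)
            if pp is not None and pp == cp:
                same_phrase.add((pl, cl))        # e.g. (B, I), (I, E): stay inside one phrase
            else:
                cross_phrase.add((pl, cl))       # e.g. (E, B), (S, O): phrase boundary

    # Project the leading-tag relations onto every concrete chunk-tag pair.
    validity = defaultdict(int)                  # 0 = invalid unless explicitly set below
    for pl, cl in same_phrase:
        for p in phrases:                        # same phrase type on both sides
            validity[(join_tag(pl, p), join_tag(cl, p))] = 1
    for pl, cl in cross_phrase:
        # leading tags without a phrase part ('O') expand only to themselves
        prev_tags = [join_tag(pl, p) for p in phrases] if pl != "O" else ["O"]
        cur_tags = [join_tag(cl, p) for p in phrases] if cl != "O" else ["O"]
        for pt in prev_tags:
            for ct in cur_tags:
                validity[(pt, ct)] = 1
    return validity
```

For example, if the corpus contains B-NP followed by I-NP, the pair (B-NP, I-NP) becomes valid, and by projection so does (B-VP, I-VP), while a pair such as (B-NP, I-VP) stays 0.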

We use a Boolean value to represent the relation score of a pair of chunk tags (0: invalid, 1: valid). That is, $\tilde{P}(s_i \mid s_{i-1})$ is derived via the above approach.

Initialization:
$$\delta_0(k) := 1, \qquad \psi_0(k) := 0$$

Recursion:
$$\delta_{t+1}(s_j) := \max_{i=1 \ldots m} \delta_t(s_i) \, \tilde{p}(s_j \mid s_i) \, p(s_j \mid s_i, \hat{s}_{i-1}, o_j), \quad \text{where } \hat{s}_{i-1} := \arg\max_{h=1 \ldots m} \psi_i(s_h)$$
$$\psi_{t+1}(s_j) := \arg\max_{i=1 \ldots m} \psi_t(s_i) \, \tilde{p}(s_j \mid s_i) \, p(s_j \mid s_i, \hat{s}_{i-1}, o_j)$$

Termination:
$$\delta_T := \max_{i=1 \ldots m} \delta_{T-1}(s_i)$$

Trace back:
$$\psi_t(s_i) := \arg\max_{j=1 \ldots m} \psi_{t+1}(s_j) \, \tilde{p}(s_j \mid s_i)$$

Fig. 2. The proposed hybrid 2-order inference algorithm
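Below is a minimal Python sketch of this hybrid decoding, under our own assumptions about the interfaces: `local_prob(t, cur, prev, prev2)` stands in for the classifier score $p(s_j \mid s_i, \hat{s}_{i-1}, o_t)$ and `valid[prev][cur]` for the Boolean transition validity score $\tilde{p}$; both names are hypothetical, and the code is an illustration of the idea rather than the authors' implementation.

```python
import numpy as np

def hybrid_inference(n_positions, n_labels, local_prob, valid):
    """Hybrid deterministic/nondeterministic decoding (a sketch of Fig. 2)."""
    delta = np.ones((n_positions, n_labels))        # accumulated scores, delta_0(k) := 1
    back = np.zeros((n_positions, n_labels), int)   # best incoming arc (backpointer)

    for t in range(1, n_positions):
        for cur in range(n_labels):
            best_score, best_prev = 0.0, 0
            for prev in range(n_labels):
                if not valid[prev][cur]:
                    continue                        # skip invalid chunk-tag pairs
                # Greedy history: the label two steps back is read off the locally
                # best incoming arc stored for `prev`, not enumerated.
                prev2 = back[t - 1][prev]
                score = delta[t - 1][prev] * local_prob(t, cur, prev, prev2)
                if score > best_score:
                    best_score, best_prev = score, prev
            delta[t][cur] = best_score
            back[t][cur] = best_prev

    # Trace back from the best final label.
    labels = [int(np.argmax(delta[-1]))]
    for t in range(n_positions - 1, 0, -1):
        labels.append(int(back[t][labels[-1]]))
    return labels[::-1]
```

The loop over (prev, cur) pairs is O(m^2) per position, as in first-order Viterbi; the second-order history adds no extra factor because prev2 is read from the stored backpointer rather than enumerated.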

3.2 Speed-up Local Classifiers

As described in Section 2, it is necessary to take all the previous states into account. Directly retrieving the prediction scores from the classifiers might be slow, especially when the feature set is large. However, most feature weights need to be accessed only once before moving to the next word position. We use the following techniques to speed up the computation of accessing feature weights.

For second-order CMMs, the feature set can be decomposed into three parts:

$$F = f_c \cup f_{p1} \cup f_{p2}$$

where $f_c$ is the fixed feature set, and $f_{p1}$ and $f_{p2}$ are the state features of the previous word and of the word two positions before. The decision function of a conventional linear kernel support vector machine has the following form for an input x:

$$g(x; C_i) = \sum_{j=1}^{|F|} w_{ij} \cdot x_j + b_i$$

where $w_{ij}$ is the trained weight of feature j for class i, and $b_i$ is the constant (bias) of class i. By decomposing the features as above, we can rewrite the decision function as:

$$g(x; C_i) = b_i + \sum_{j \in f_c} w_{ij} \cdot x_j + \sum_{j \in f_{p1}} w_{ij} \cdot x_j + \sum_{j \in f_{p2}} w_{ij} \cdot x_j \qquad (8)$$

For each category, the first two terms in Eq. (8) are independent of the previously predicted states, while the remaining terms are the sums of weights of the first-order and second-order state features. Therefore, we can always cache the values of the first two terms and add only the remaining terms during inference. In this paper, we simply adopt the predicted states of the previous one and two words as features, and hence $|f_{p1}| = |f_{p2}| = 1$. This technique is also applicable to maximum entropy models, where the category constant $b_i$ is not needed. In addition, the proposed heuristic can be generalized to K-order CMMs. Assume the testing time complexity of a full second-order CMM is $O(C^2 \cdot |F|)$, where C is the number of categories. With the above caching, the testing time complexity is reduced to $O(|f_c| + C^2 \cdot (|f_{p1}| + |f_{p2}|))$, where $|F| = |f_c| + |f_{p1}| + |f_{p2}|$. The time complexity of our hybrid second-order CMMs is $O(|f_c| + |f_{p2}| + C \cdot |f_{p1}|)$.
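A minimal sketch of this caching idea, assuming a sparse binary feature encoding and hypothetical array names (w, b, fixed_features); it illustrates Eq. (8) under our own assumptions rather than reproducing the authors' implementation.

```python
import numpy as np

def cached_scores(w, b, fixed_features):
    """Precompute, for every class, the part of Eq. (8) that does not depend on
    previously predicted states: b_i plus the sum over the fixed features f_c."""
    # w: (n_classes, n_features) weight matrix; b: (n_classes,) bias vector;
    # fixed_features: list of active feature indices from the fixed window f_c.
    return b + w[:, fixed_features].sum(axis=1)

def full_scores(base, w, prev_state_feature, prev2_state_feature):
    """Add the two state-feature weights (|f_p1| = |f_p2| = 1) to the cached part.

    Re-evaluated for each candidate (previous, second-previous) label pair during
    inference; only two weight lookups per class are needed.
    """
    return base + w[:, prev_state_feature] + w[:, prev2_state_feature]
```

During decoding, `cached_scores` is computed once per word, while `full_scores` is recomputed per candidate history, so the expensive pass over $f_c$ is never repeated for different label pairs.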

4 Experiments

Six well-known sequential labeling tasks are used to evaluate our method, namely, CoNLL-2000 syntactic chunking, base chunking, CoNLL-2003 English NER, BioText NER, Chinese POS tagging, and Chinese word segmentation. Table 1 shows the statistics of the six datasets.

The CoNLL-2000 chunking task is well known and has been widely evaluated in previous work [13, 14, 7, 19, 3]. The training data was derived from Treebank WSJ sections 15-18, while section 20 was used for testing. The goal is to find the non-recursive phrase structures in a sentence, such as noun phrases (NP), verb phrases (VP), etc. There are 11 phrase types in this dataset. We follow the previous best settings for SVMs [7, 19]: the IOE2 scheme is used to represent the phrase structure and the data is tagged in the backward direction. Second, the base chunking dataset is an extension of CoNLL-2000 phrase recognition in which the goal is to find the base phrases of the full parse tree. WSJ sections 2-21 were used for training, while section 23 was used for testing. We re-trained the Brill tagger on the same training set to label POS tags for both the training and testing data.

The training and testing data for Chinese POS tagging is mainly derived from the Penn Chinese Treebank 5.0. Ninety percent of the data is used for training, while the remaining 10% is used for testing. However, Chinese POS tagging is very different from classical English POS tagging in that there is no word boundary information in Chinese text. To handle this, [10] presented a transformation method that encodes each Chinese character with IOB-like tags. For example, the tag B-ADJ marks the first character of a Chinese word whose POS tag is ADJ (adjective). In this task, we simply use IOB2 to represent the chunk structure; as a result, there are 64 chunk tags.

Similar to the Chinese POS tagging task, the target of Chinese word segmentation (WS) is even simpler: the sequential tagger learns to determine whether a Chinese character is the beginning or interior of a word. As discussed in [20], using a more complex chunk representation brings better segmentation accuracy on most Chinese word segmentation benchmarks. By following this line, we apply the six tags B, BI, I, IE, E, and S to represent a Chinese word. BI and IE are the interior position right after the begin and the interior position right before the end of a chunk, while B/I/E/S indicate the begin/interior/end/single-character word; both character-level encodings are sketched after these dataset descriptions.

We employed the officially provided training/testing data from CoNLL-2003 to run the experiments on the NER task. The BioText NER dataset is a small-scale biomedical named entity recognition task [15]; we randomly split 75% of the data for training, while the remaining 25% was used for testing. For the above six tasks, we do not include any external resources such as gazetteers for NER. This makes it easier to see the impact of each search algorithm under a fully supervised learning framework.
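The following is a small Python sketch, under our own assumptions, of how word-level POS annotations can be projected onto characters with IOB2-style tags (the Chinese POS tagging setting) and of the six-tag word segmentation encoding (B, BI, I, IE, E, S); the function names are hypothetical, and the tie-breaking for very short words is our assumption.

```python
def encode_pos_iob2(words_with_pos):
    """Project word-level POS tags onto characters, e.g. a two-character
    adjective becomes [B-ADJ, I-ADJ]."""
    tags = []
    for word, pos in words_with_pos:
        for k, _ in enumerate(word):
            tags.append(f"B-{pos}" if k == 0 else f"I-{pos}")
    return tags

def encode_ws_six_tags(words):
    """Encode a segmented sentence with the six tags B, BI, I, IE, E, S."""
    tags = []
    for word in words:
        n = len(word)
        if n == 1:
            tags.append("S")
            continue
        for k in range(n):
            if k == 0:
                tags.append("B")
            elif k == n - 1:
                tags.append("E")
            elif k == 1:
                tags.append("BI")   # interior character right after the begin
            elif k == n - 2:
                tags.append("IE")   # interior character right before the end
            else:
                tags.append("I")
    return tags
```

For example, encode_pos_iob2([("美丽", "ADJ")]) yields ["B-ADJ", "I-ADJ"], and a five-character word is encoded as B, BI, I, IE, E.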

Table 1. Statistics of the evaluated sequential labeling tasks

Dataset             | Split    | # of examples | # of sentences | Encoded chunk tags | # of total categories | Feature threshold    | Learner parameters
CoNLL-2000          | Training | 220663        | 8935           | I/E tags           | 11*2+1=23             | f=2                  | σ=1000; C=0.1
CoNLL-2000          | Testing  | 49389         | 2011           |                    |                       |                      |
Base Chunking       | Training | 950028        | 39832          | I/E tags           | 20*2+1=41             | f=2 (SVM); f=4 (ME)  | σ=1000; C=0.1
Base Chunking       | Testing  | 56684         | 2416           |                    |                       |                      |
CoNLL-2003          | Training | 204567        | 14987          | B/I tags           | 4*2+1=9               | f=2                  | σ=1000; C=0.1
CoNLL-2003          | Testing  | 46666         | 3684           |                    |                       |                      |
BioText             | Training | 20874         | 773            | B/I tags           | 10*2+1=21             | f=2                  | σ=1000; C=0.1
BioText             | Testing  | 5250          | 193            |                    |                       |                      |
Chinese POS tagging | Training | 745215        | 16909          | B/I tags           | 32*2=64               | f=2; f=4 (ME)        | σ=1000; C=1
Chinese POS tagging | Testing  | 81674         | 1878           |                    |                       |                      |
Chinese WS          | Training | 2.64M         | 57275          | S/B/I/E/BI/IE tags | 6                     | f=2; f=10 (ME)       | σ=1000; C=1
Chinese WS          | Testing  | 357989        | 7512           |                    |                       |                      |

C is the regularization parameter of SVM; σ is the Gaussian prior of the maximum entropy model; f is the feature cutting threshold.

4.1 Settings

Since a couple of the tasks are similar, for example CoNLL-2000 and base chunking, we adopt three feature types for the six datasets. We replicated the coordinate sub-gradient descent optimization approach [4] with L2-SVM as the learner. Basically, SVM was designed for binary classification problems; to port it to multiclass problems, we adopted the well-known one-versus-all (OVA) method. One good property of OVA is that the parameter estimation of each binary classifier can be carried out independently. This is particularly useful for tasks which produce a very large number of features and categories. To obtain probability outputs from the SVM, we employ the sigmoid function with fixed parameters A = -2 and B = 0, as noted by [11]. On the other hand, the maximum entropy model used in this paper was derived from [6]. As shown in Table 1, the parameters of SVM and ME were the same for the six tasks. However, ME is not scalable to large-scale and large-category tasks, such as base chunking and Chinese WS, where more than 80M feature weights need to be estimated per category. Hence, we incrementally increased the feature threshold until the feature size could be handled in our environment. Although one can increase the hardware capacity to cover this shortage, it does not fundamentally solve the problem.
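A minimal sketch of the OVA setup and the fixed-parameter sigmoid mapping from SVM margins to probabilities, assuming scikit-learn's LinearSVC as a stand-in for the L2-SVM trainer used in the paper; the wrapper names are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_ova(X, y, n_classes, C=0.1):
    """One-versus-all training: one independent binary L2-SVM per class."""
    return [LinearSVC(C=C, loss="squared_hinge").fit(X, (y == c).astype(int))
            for c in range(n_classes)]

def ova_probabilities(models, X, A=-2.0, B=0.0):
    """Map each class's margin f to a probability with the fixed sigmoid
    P = 1 / (1 + exp(A * f + B)), i.e. Platt-style scaling with A = -2, B = 0."""
    margins = np.column_stack([m.decision_function(X) for m in models])
    return 1.0 / (1.0 + np.exp(A * margins + B))
```

These per-class probabilities are what the inference algorithm consumes as the local classifier scores; in this sketch they are not renormalized across classes.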

4.2 Overall Results

The first experiment evaluates the benefit of incorporating higher-order information and the proposed inference algorithm. We also compared with greedy search and beam search with various beam widths (B = 10, 50, 250) for all tasks. The overall experimental results are summarized in Table 2. Each row indicates the corresponding inference algorithm; the beam search was applied with different beam widths. The term "2-order" denotes the proposed hybrid deterministic and nondeterministic inference algorithm. Setting the learners aside, the inference algorithm of MXPOST is beam search-based and combines the previously predicted states as features. Clearly, the 2-order inference method appears to be very suitable for SVM: it achieves the best accuracy on four datasets. In contrast, beam search with ME yields results very competitive with the optimal algorithm. In most cases, SVM with beam search yields performance close (but not equal) to 2-order inference, and ME with beam search shows even closer accuracy. On CoNLL-2003, ME with no history (i.e., zero-order) achieves better performance than the first-order and full second-order CMMs. To examine whether the differences among these search algorithms are significant, we performed the s-test [18] and the McNemar test. In most cases, there is a statistically significant difference between our 2-order inference method and beam search. In the BioText and Chinese POS tagging tasks, our method and beam search have a significant difference (p