2017 3rd International Conference on Electrical Information and Communication Technology (EICT), 7-9 December 2017, Khulna, Bangladesh

Convolutional Recurrent Neural Network for Question Answering

M. M. Arefin Zaman
Department of Computer Science & Engineering
Rajshahi University of Engineering & Technology
Rajshahi-6204, Bangladesh
Email: [email protected]

Abstract—Question answering has become a very important task for natural language understanding, as most natural language processing problems can be posed as a question answering problem. The recurrent neural network (RNN) is a standard baseline model for various sequence prediction tasks, including question answering. Recurrent networks can represent global information over long periods of time, but they do not preserve local information very well. To address this problem we propose a model that combines recurrent and convolutional networks and can be trained end to end using backpropagation. Our experiments on the bAbI dataset demonstrate that this model achieves a significant improvement over an RNN model on the question answering task.

Keywords—Deep learning, convolutional neural network, recurrent neural network, natural language processing, question answering

I. INTRODUCTION

Question Answering (QA) is a complex and difficult natural language processing task, as it requires understanding context and reasoning through a long sequence of sentences. A QA system is given a story and a question, and it is supposed to find the correct answer by going through the relevant facts in the story. Various tasks in Natural Language Processing (NLP) can be presented as a question answering problem [1]. Moreover, QA can be used to develop dialogue systems and chatbots [2]. Improving QA will therefore enable ubiquitous personal assistants (e.g. Siri, Cortana) to become more robust and helpful. QA has been studied in the field of information retrieval for a long period of time, but the datasets used to evaluate QA systems were very small, and traditional approaches typically involved rule-based algorithms or linear classifiers over hand-engineered features. With the advent of more expressive models like deep neural networks, larger datasets were required. Hermann et al. proposed a large-scale dataset for QA [3]. Since then, more attempts have been made to create large datasets that test the critical reasoning ability of computers [4] [5]. Weston et al. proposed a set of artificial tasks called the bAbI tasks, which are used for evaluating the model in this paper [6].


Sadia Zaman Mishu
Department of Computer Science & Engineering
Rajshahi University of Engineering & Technology
Rajshahi-6204, Bangladesh
Email: [email protected]

A Recurrent Neural Network (RNN) is a class of neural network that contains recurrent connections between hidden layers in consecutive timesteps. This type of architecture is well suited to predicting temporal or sequential data (e.g. speech, text), and RNNs and their variants have become a standard architecture for various natural language processing tasks [7] [8]. In principle, it is possible to select an answer from all the words of the vocabulary based on a joint representation of the story and the question by using a recurrent neural network language model [9]. A Convolutional Neural Network (CNN) is a kind of feed-forward neural network, primarily inspired by the human visual cortex, that contains consecutive convolutional and subsampling layers. Although CNNs helped revolutionize the computer vision field [10], they have also started to impact the natural language processing domain in recent years. Dos Santos and Gatti achieved state-of-the-art results in single-sentence sentiment analysis using a CNN [11], and CNNs have also been used for sentence classification [12] and sentence modeling [13]. Since an RNN can capture long-term dependencies well, it can be augmented with a more localized feature detector like a CNN to improve its performance. This idea of using convolutional and recurrent networks together has already been applied to machine translation [14] and music classification [15]. We propose a similar kind of model for question answering, which obtains better results than a single RNN model.

We discuss the relevant background study in Section II. We then describe the dataset with examples in Section III, present our proposed model for question answering in Section IV, report the experimental results in Section V and finally conclude in Section VI.

II. BACKGROUND STUDY

A. Word Embedding

Word embedding creates a distributed representation of words in which semantically similar words are mapped to nearby points in the vector space.

Fig. 1. A general approach for learning word embedding [21]

Fig. 2. A recurrent neural network unfolded through time [24]

Vector space models like tf-idf or n-grams have traditionally been used to create fixed-length vector representations of words. Recently, embedding models like word2vec [16] or GloVe [17] have become popular for embedding words into low-dimensional vectors. The idea of a distributed representation of concepts is inspired by human cognition and was proposed during the connectionist resurgence of the 1980s [18] [19]. Bengio et al. showed that a neural network with distributed representations can surpass standard n-gram models on the language modeling task [20]. Mikolov et al. proposed the Continuous Bag-of-Words (CBOW) and Skip-Gram models for learning word embeddings from raw text [16]. Fig. 1 shows a well-known framework for word embedding. Here, every word is represented by a column of a matrix W; the i-th column of W corresponds to the i-th word in the vocabulary. The task is to predict the next word given the context, using the sum or concatenation of the vectors corresponding to the context words. After training, words with similar meaning usually have similar values in many dimensions of the vector. For example, word vectors capture semantic similarity [22]: the nearest neighbours of Tiger include Leopard, Dhole, Warthog, Rhinoceros and Lion.

Differences between word vectors also represent different relations between words, such as gender, number, degree, etc. [23]:

v_king − v_man ≈ v_queen − v_woman   (gender)
v_france − v_french ≈ v_mexico − v_spanish   (language)
v_car − v_cars ≈ v_apple − v_apples   (number)

These properties have led to a wide range of uses of word vectors in language modeling and understanding tasks.
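As a concrete illustration (not part of the original paper), such analogies can be queried directly from pretrained vectors with the gensim library; the vector file name below is a placeholder for any word2vec-format file available locally.

```python
# Minimal sketch of the king - man + woman analogy on pretrained vectors.
from gensim.models import KeyedVectors

# Placeholder path: any word2vec-format vector file can be used here.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# v_king - v_man + v_woman should land near v_queen.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```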

B. Recurrent Neural Network

A neural network is called recurrent if it has one or more cycles: if it is possible to follow a path from a unit back to itself, the network is referred to as a recurrent network [25]. The recurrent architecture generally used in modern recurrent networks is the one shown in Fig. 2, called the Elman network [26]. This kind of network allows hidden-layer information from the previous timestep to propagate to the hidden nodes of future timesteps. The hidden state at timestep t is a function of the previous hidden state and the input at the current timestep:

s_t = f(U·x_t + W·s_{t−1})    (1)

The output o_t at timestep t is then calculated as

o_t = f(V·s_t)    (2)

Here,
x_t = input at timestep t
s_t = hidden state at timestep t
o_t = output at timestep t
f = an activation function such as σ or tanh
U = weight matrix between input and hidden state
W = recurrent weight matrix between hidden states
V = weight matrix between hidden state and output
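As a concrete illustration (our own sketch, not part of the original paper), equations (1) and (2) amount to the following NumPy step; the dimensions used below are arbitrary placeholders.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, s_prev, U, W, V):
    """One forward step of the simple recurrent network in eqs. (1) and (2)."""
    s_t = np.tanh(U @ x_t + W @ s_prev)  # eq. (1): new hidden state
    o_t = softmax(V @ s_t)               # eq. (2): output at timestep t
    return s_t, o_t

# Toy dimensions: 10-dimensional input, 20-dimensional hidden state, 5 outputs.
rng = np.random.default_rng(0)
U = rng.normal(size=(20, 10))
W = rng.normal(size=(20, 20))
V = rng.normal(size=(5, 20))
s, x = np.zeros(20), rng.normal(size=10)
s, o = rnn_step(x, s, U, W, V)
```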

In equation (1), the hidden state is a function of the previous context and the current input. Due to this property, an RNN is theoretically capable of learning long-term dependencies across timesteps. In practice, however, the gradient becomes very small after a certain number of timesteps [27]. This problem was addressed by Hochreiter and Schmidhuber [28], who invented a new way of computing the recurrent unit called long short-term memory (LSTM).

1) RNN language model: Language modeling is an important problem of Artificial Intelligence (AI). A statistical language model captures the hidden statistical characteristics of the distribution of word sequences in a document. Language models have very important applications in speech recognition, machine translation, natural language processing, etc.

Fig. 3. Convolution operation using a 3×3 kernel [31]

Fig. 4. Convolutional Neural Network for sentence classification [12]

The goal of a word-level language model is to give the probability of the next word given all the previous words, i.e., P(w_n | w_{n−1}, ..., w_2, w_1). For example, P(red | The color of roses is) = ? A neural-network-based language model was proposed in a seminal paper by Bengio et al., which used a feedforward neural network [20]. Later, Mikolov et al. proposed a recurrent neural network based language model [9]. This model and its variants are now used in many sequence prediction problems, including natural language tasks such as translation and text classification.

C. Convolutional Neural Network

Convolutional networks are a family of neural networks primarily inspired by the human vision system. These networks are especially suitable for processing data that has an n-dimensional grid-like structure (e.g. images). The inspiration behind CNNs goes back to the experiments of Hubel and Wiesel, who showed that the visual cortex has a special hierarchical architecture that builds a representation from the raw image signal [29]. CNNs trained end to end were used for document recognition in the 1990s, which demonstrated their usefulness [30], and in recent years they have become a standard in the field of computer vision. The working procedure of a CNN is first shown using an image example, as it is more intuitive this way; CNNs for language processing are then illustrated. Usually, convolutional networks have two major kinds of layers, explained below.

1) Convolution layer: A convolution layer takes an input, convolves it with different kernels or filters and creates new activation maps (i.e. feature maps). The values at the different positions of a kernel act as the trainable weights of the network. The convolution operation can be expressed as an elementwise multiplication of the kernel with a portion of the input of the same size as the kernel, followed by a summation, repeated for all positions in the input. The operation is illustrated in Fig. 3. The convolved features then go through a non-linearity (e.g. ReLU) before being passed to the next layer.

There are four hyperparameters to choose for a convolution layer:

K = number of filters
F = spatial extent of the filters
S = stride
P = amount of zero padding
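To make the operation concrete, a plain NumPy sketch of a "valid" convolution with stride 1 and no zero padding is given below (our own illustration; the 5×5 input and 3×3 kernel are example values, not the exact matrices of Fig. 3). The output spatial size follows the usual formula (W − F + 2P)/S + 1.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' convolution: elementwise-multiply the kernel with each patch of
    the input and sum the result, for every position (stride 1, no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(conv2d_valid(image, kernel))  # 3x3 feature map: (5 - 3 + 0)/1 + 1 = 3
```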

2) Pooling layer: This is also called a subsampling layer, because the job of a pooling layer is to reduce the spatial size of its input so that the number of parameters is also reduced. It keeps the computational complexity low and helps prevent overfitting by reducing the number of parameters to be trained. This layer has two hyperparameters, neither of which is trainable: F, the spatial extent of the filters, and S, the stride. Generally, max or average pooling is used, which means taking the maximum or average value within a certain window.

For language processing, a document or sequence of sentences is generally represented as a matrix in which each row corresponds to a token (e.g. a word). Each row is a word vector, typically obtained using the word embedding techniques mentioned earlier, and the number of columns is the dimension of the word vector. The width of a filter is the same as the number of columns of the input matrix, while its length can be a small odd number. How a CNN works for NLP is shown in Fig. 4.

III. THE DATASET

We have performed experiments on the synthetic question answering tasks collectively known as the Facebook bAbI dataset [6]. This dataset tests a model's ability to retrieve relevant facts and reason over them. It consists of 20 different tasks, where each task tests a specific ability such as counting, time reasoning or basic induction/deduction.

Fig. 5. Proposed model for question answering.

Every task consists of a set of statements (a story), a question and an answer (typically a single word). The answer is available at training time but must be predicted at test time. Only a subset of the statements contains necessary facts; the others are just noise, and the model must be able to ignore them to reach an answer. There are two versions of the training data: the first has 1,000 training samples per task and the second is larger, with 10,000 samples per task. Our model is trained on the second version. Some examples from the dataset are given below:

Task 7: Counting
1. Daniel picked up the football.
2. Daniel dropped the football.
3. Daniel got the milk.
4. Daniel took the apple.
How many objects is Daniel holding? A: two

Task 14: Time Reasoning
1. In the afternoon Julie went to the park.
2. Yesterday Julie was at school.
3. Julie went to the cinema this evening.
Where was Julie before the park? A: school

Task 16: Basic Induction
1. Lily is a swan.
2. Lily is white.
3. Bernhard is green.
4. Greg is a swan.
What color is Greg? A: white
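For reference, the raw bAbI files follow a simple line format: each line starts with a sentence index that resets to 1 at the beginning of a new story, and question lines carry a tab-separated answer and supporting-fact ids. A minimal parser sketch under that assumption (our own helper, not part of the dataset release or this paper) is shown below.

```python
def parse_babi(lines):
    """Parse bAbI-format lines into (story, question, answer) samples."""
    samples, story = [], []
    for line in lines:
        idx, text = line.strip().split(' ', 1)
        if int(idx) == 1:
            story = []                    # index reset => a new story begins
        if '\t' in text:                  # question line: question \t answer \t supports
            question, answer, _ = text.split('\t')
            samples.append((list(story), question.strip(), answer.strip()))
        else:                             # ordinary statement line
            story.append(text)
    return samples

# Usage (file name assumed): samples = parse_babi(open('qa7_counting_train.txt'))
```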

Although performing well on this synthetic dataset does not guarantee that a model will do the same on real-world datasets, a question answering model needs to ace these tasks in order to perform well on more challenging ones.

IV. THE MODEL

Since a CNN contains local connections to a fixed number of words, it represents local information within its context by creating higher-level features from it. Moreover, the semantic representations created by a CNN are translation invariant. A kernel k ∈ R^(c·d) considers c words as context and generates a more abstract feature from them. This can often act as an n-gram feature detector [32].

Multiple kernels of different sizes are applied to the input matrix, and only the most important features are selected using max-pooling. However, a CNN cannot learn long-term dependencies and its context size is fixed. An RNN, on the other hand, updates its state from the current input and the previous state at each timestep, thus preserving global information, so it can model the interdependence between words over an arbitrary number of timesteps. For this reason, recurrent networks with LSTM or GRU cells are most commonly used for processing text data. To take advantage of both worlds, we propose a new model, illustrated in Fig. 5. Here, the story and the question are first converted into word vectors using an embedding layer. Both embeddings are then fed to RNNs with 50 LSTM units each, and the outputs of these layers are merged into a single representation. Another RNN takes this merged representation as input and produces a fixed-length vector. This vector goes through a convolutional layer composed of 50 filters of size 3 and a max-pooling layer. Finally, a fully connected softmax layer over the whole vocabulary gives the index of the answer. Dropout [33] is applied to the embedding and convolutional layers to prevent overfitting; this is not shown in the figure.
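A minimal Keras sketch of this architecture is given below to make the description concrete. It is our reading of the model, not the authors' released code: the vocabulary size, sequence lengths, embedding dimension and dropout rate are assumed placeholder values, and where the description is ambiguous (the "fixed-length vector" fed to the convolution) the sketch keeps the sequence outputs of the second LSTM so that the 1-D convolution has a time axis to slide over.

```python
# Sketch of the proposed C-RNN, under the assumptions stated above.
from tensorflow.keras.layers import (Input, Embedding, LSTM, Dropout, concatenate,
                                     Conv1D, GlobalMaxPooling1D, Dense)
from tensorflow.keras.models import Model

vocab_size = 40              # assumed: bAbI vocabularies are small
story_len, query_len = 68, 4 # assumed maximum sequence lengths
embed_dim = 50               # assumed embedding dimension

# Story branch: embed the story tokens and encode them with an LSTM.
story_in = Input(shape=(story_len,))
story_emb = Dropout(0.3)(Embedding(vocab_size, embed_dim)(story_in))
story_enc = LSTM(50, return_sequences=True)(story_emb)

# Question branch: embed the question tokens and encode them with an LSTM.
query_in = Input(shape=(query_len,))
query_emb = Dropout(0.3)(Embedding(vocab_size, embed_dim)(query_in))
query_enc = LSTM(50, return_sequences=True)(query_emb)

# Merge the two encodings and pass them through a second recurrent layer.
merged = concatenate([story_enc, query_enc], axis=1)
merged_enc = LSTM(50, return_sequences=True)(merged)

# Convolution over the recurrent outputs: 50 filters of width 3, then max-pooling.
conv = Dropout(0.3)(Conv1D(filters=50, kernel_size=3, activation='relu')(merged_enc))
pooled = GlobalMaxPooling1D()(conv)

# Fully connected softmax over the vocabulary to pick the answer word.
answer = Dense(vocab_size, activation='softmax')(pooled)

model = Model(inputs=[story_in, query_in], outputs=answer)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```

Concatenating the two encodings along the time axis is one plausible reading of the "merge" step in Fig. 5; averaging the vectors or concatenating final states would be equally consistent with the description.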

V. EXPERIMENTS

The model is evaluated on the bAbI question answering dataset. For training, 10,000 samples (story, question and answer tuples) are used per task, and the model is evaluated on 1,000 test samples per task. All experiments are run on CPU only (Intel Core i5, 2.5 GHz); no GPU is used. The model is built and run with the Keras deep learning library (TensorFlow backend) in a Jupyter notebook, and is trained using backpropagation with the Adam optimizer [34].
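Training in this setup reduces to a standard Keras fit/evaluate call on the vectorised bAbI tuples; the batch size, number of epochs and validation split below are illustrative assumptions, not values reported in the paper.

```python
# Assumes `model` from the architecture sketch above, plus integer-encoded,
# padded story/question arrays and one-hot answer vectors.
model.fit([stories_train, queries_train], answers_train,
          batch_size=32, epochs=40, validation_split=0.05)
test_loss, test_acc = model.evaluate([stories_test, queries_test], answers_test)
print('test accuracy: %.2f%%' % (100 * test_acc))
```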

TABLE I
TEST SET ACCURACY (%) OF THE LSTM BASELINE AND OUR PROPOSED MODEL

Task                            LSTM    C-RNN
1: Single Supporting Fact         50     88.9
2: Two Supporting Facts           20     45.0
3: Three Supporting Facts         20     36.9
4: Two Argument Relations         61     77.1
5: Three Argument Relations       70     99.2
6: Yes/No Questions               48     83.0
7: Counting                       49     93.8
8: Lists/Sets                     45     78.2
9: Simple Negation                64     88.5
10: Indefinite Knowledge          44     71.1
11: Basic Coreference             72     95.1
12: Conjunction                   74     95.0
13: Compound Coreference          94     95.2
14: Time Reasoning                27     41.5
15: Basic Deduction               21     54.2
16: Basic Induction               23     48.6
17: Positional Reasoning          51     95.0
18: Size Reasoning                52     96.4
19: Path Finding                   8     17.8
20: Agents Motivations            91     98.4
Mean Accuracy                     49     74.95

A. Quantitative Result

The experimental results are presented in Table I. From the table, we can see that our proposed Convolutional Recurrent Neural Network (C-RNN) outperforms the LSTM baseline proposed in [6] on every task. The mean accuracy of the LSTM model is 49%, while the mean accuracy of C-RNN is 74.95%, an improvement of 25.95 percentage points. A task is said to be solved if an accuracy of 95% or higher is achieved [6]; by this standard, our model solves 7 of the 20 tasks. The results on the 20 tasks are also shown in Fig. 6 by plotting the corresponding accuracy scores. On task 13, task 19 and task 20 both models perform very closely, but on several tasks (task 1, task 6, task 7, task 17 and task 18) our model exceeds the LSTM baseline by a large margin.

VI. CONCLUSION

In this paper, we proposed a convolutional recurrent neural network model for the question answering task. The model feeds the outputs of an LSTM to a convolutional layer to better detect short-term dependencies between words. Our experiments on the bAbI question answering dataset demonstrate that the proposed model improves accuracy compared to an LSTM model, with an improvement of 25.95 percentage points in average accuracy on the test set. However, there is still much to do. Our model performs better with a large training set, but it takes more time to train. Using external memory and attention along with this model could be very effective for answering questions from long stories and could also help reduce the required amount of training data. We want to further explore different memory-augmented architectures such as [35] [1].

Fig. 6. Accuracy (%) comparison of the LSTM baseline and our proposed model on every task in the bAbI dataset

REFERENCES

[1] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher, "Ask me anything: Dynamic memory networks for natural language processing," in International Conference on Machine Learning, pp. 1378-1387, 2016.
[2] S. Quarteroni and S. Manandhar, "A chatbot-based interactive question answering system," Decalog 2007, p. 83, 2007.
[3] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom, "Teaching machines to read and comprehend," in Advances in Neural Information Processing Systems, pp. 1693-1701, 2015.
[4] M. Richardson, C. J. Burges, and E. Renshaw, "MCTest: A challenge dataset for the open-domain machine comprehension of text," in EMNLP, vol. 3, p. 4, 2013.
[5] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," arXiv preprint arXiv:1606.05250, 2016.
[6] J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merrienboer, A. Joulin, and T. Mikolov, "Towards AI-complete question answering: A set of prerequisite toy tasks," arXiv preprint arXiv:1502.05698, 2015.
[7] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, pp. 3104-3112, 2014.
[8] D. Tang, B. Qin, and T. Liu, "Document modeling with gated recurrent neural network for sentiment classification," in EMNLP, pp. 1422-1432, 2015.
[9] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, "Recurrent neural network based language model," in Interspeech, vol. 2, p. 3, 2010.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
[11] C. N. Dos Santos and M. Gatti, "Deep convolutional neural networks for sentiment analysis of short texts," in COLING, pp. 69-78, 2014.
[12] Y. Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882, 2014.
[13] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," arXiv preprint arXiv:1404.2188, 2014.
[14] P. Dakwale and C. Monz, "Convolutional over recurrent encoder for neural machine translation," The Prague Bulletin of Mathematical Linguistics, vol. 108, no. 1, pp. 37-48, 2017.
[15] K. Choi, G. Fazekas, M. Sandler, and K. Cho, "Convolutional recurrent neural networks for music classification," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 2392-2396, IEEE, 2017.
[16] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.
[17] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, 2014.
[18] G. E. Hinton, "Learning distributed representations of concepts," in Proceedings of the Eighth Annual Conference of the Cognitive Science Society, vol. 1, p. 12, Amherst, MA, 1986.
[19] D. E. Rumelhart, J. L. McClelland, P. R. Group, et al., Parallel Distributed Processing, vol. 1. MIT Press, Cambridge, MA, USA, 1987.
[20] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, no. Feb, pp. 1137-1155, 2003.
[21] Q. Le and T. Mikolov, "Distributed representations of sentences and documents," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1188-1196, 2014.
[22] P. Liang, "Natural language understanding: Foundations and state-of-the-art." http://icml.cc/2015/tutorials/icml2015-nlu-tutorial.pdf.
[23] T. Mikolov, W.-t. Yih, and G. Zweig, "Linguistic regularities in continuous space word representations," in HLT-NAACL, vol. 13, pp. 746-751, 2013.
[24] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[25] M. I. Jordan, "Serial order: A parallel distributed processing approach," Advances in Psychology, vol. 121, pp. 471-495, 1997.
[26] J. L. Elman, "Finding structure in time," Cognitive Science, vol. 14, no. 2, pp. 179-211, 1990.
[27] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157-166, 1994.
[28] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[29] D. H. Hubel and T. N. Wiesel, "Receptive fields of single neurones in the cat's striate cortex," The Journal of Physiology, vol. 148, no. 3, pp. 574-591, 1959.
[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[31] D. Britz, "Understanding convolutional neural networks for NLP." http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/.
[32] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proceedings of the 25th International Conference on Machine Learning, ICML '08, New York, NY, USA, pp. 160-167, ACM, 2008.
[33] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, 2014.
[34] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[35] J. Weston, S. Chopra, and A. Bordes, "Memory networks," arXiv preprint arXiv:1410.3916, 2014.