Pointer Networks: And their uses
Oriol Vinyals, Navdeep Jaitly (Google Brain); Meire Fortunato (UC Berkeley)
Presented by: Priyansh Trivedi
Sequence to Sequence Models
Neural Turing Machines
[Word cloud of NLP tasks: Sentiment Analysis, Image Captioning, Argument Mining, Structured Question Answering, Textual Entailment, Semantic Parsing, Word Embeddings, Language Modeling, Paraphrasing, Text-based Question Answering, Named Entity Recognition, Machine Translation]
For NLP, recent RNNs are … Sike!
Varying Output Labels: A New Family of Problems
The target vocabulary of the output labels is usually fixed.
[Example: translating to German; the output labels come from a fixed German vocabulary, e.g. "Englisch".]
The target vocabulary of the output labels is usually fixed. But what if it is not?
Convex Hull
[Figure: a set of points X on a plane; Y, its convex hull]
X: And the lord speaketh, let there be heaven, and there were Donuts .
Y: /start /end
But what if it is not? X: What is the Polish word for wreaths? Y: t4 - t4
... a festival called Wianki (Polish for Wreaths) have become a tradition and a yearly event in the programme of cultural events in Warsaw. The festival traces its roots to a peaceful pagan ritual where maidens would float their wreaths of herbs on the water to predict when they would be married, and to whom ...
How to model this problem? Let the output labels consist of pointers to the input.
x = { x1, x2, x3, x4, x5, … xn-1, xn } y = { 2, 3, n-1 }
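Concretely, the formulation above can be sketched in a few lines (the toy sequences here are illustrative, not from the paper):

```python
# Pointer-style targets: each output "label" is an index into the input,
# so the output vocabulary automatically grows and shrinks with the input.
x = ["x1", "x2", "x3", "x4", "x5", "x6"]  # input sequence (any length)
y = [2, 3, 5]                             # 1-based pointers, as in the slide

selected = [x[i - 1] for i in y]
print(selected)  # ['x2', 'x3', 'x5']
```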
Which problems can be modeled this way?
➔ Combinatorial Problems: Convex Hull; Traveling Salesman Problem
➔ Annotation-Based NLP: Named Entity Recognition; Argument Mining; Reading Comprehension …
➔ What else?
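For the combinatorial case, the target pointers can be generated by a classical algorithm. A sketch of convex-hull targets using Andrew's monotone chain (my own illustration, not the paper's data pipeline):

```python
def convex_hull_indices(points):
    """Return indices (pointers) into `points` forming the convex hull, CCW."""
    idx = sorted(range(len(points)), key=lambda i: points[i])

    def cross(o, a, b):
        (ox, oy), (ax, ay), (bx, by) = points[o], points[a], points[b]
        return (ax - ox) * (by - oy) - (ay - oy) * (bx - ox)

    def half(order):
        # Build one half of the hull, popping points that make a non-left turn.
        h = []
        for i in order:
            while len(h) >= 2 and cross(h[-2], h[-1], i) <= 0:
                h.pop()
            h.append(i)
        return h[:-1]

    return half(idx) + half(idx[::-1])

pts = [(0, 0), (2, 0), (1, 1), (2, 2), (0, 2)]
print(convex_hull_indices(pts))  # [0, 1, 3, 4] -- the interior point 2 is skipped
```

The labels are indices into the input point list, exactly the pointer-output shape described above.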
Can’t Sequence to Sequence models solve these problems?
Sequence to WHAT?
Sequence to Sequence
1. Encode the inputs via an encoder RNN.
2. Once the whole input has been read, the final hidden state of the encoder is used as a context vector for the decoder.
3. Using the context, the decoder outputs the desired labels.
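The encoding step can be sketched with a toy numpy RNN (random, untrained weights; purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
Wx = rng.normal(size=(d_h, d_in)) * 0.1   # input-to-hidden weights
Wh = rng.normal(size=(d_h, d_h)) * 0.1    # hidden-to-hidden weights

def encode(xs):
    """Run the encoder RNN; the final hidden state is the context vector."""
    h = np.zeros(d_h)
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

short = [rng.normal(size=d_in) for _ in range(3)]
long = [rng.normal(size=d_in) for _ in range(30)]
# The context is the same fixed size for any input length; the decoder
# would then be conditioned on this single vector.
assert encode(short).shape == encode(long).shape == (d_h,)
```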
Limitations (w.r.t. our problem)
➔ The dictionary of output labels has to be fixed.
➔ Even if we take a superset of labels, we cannot constrain its usable subset per input example.
Limitations (Generic)
➔ Fixed context vector size for inputs of varying length.
➔ Diminishing influence of earlier states on the later states.
Attention Mechanism A quick refresher.
Blend the encoder states to propagate extra information to the decoder.
Attention Mechanism
➔ Create a weighted combination of all input states. (Different weights for different positions in the output labels.)
➔ Major performance boost for data with longer-term dependencies.
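The weighted combination can be sketched as follows, assuming additive content-based scoring with random, illustrative weights:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(enc_states, dec_state, W1, W2, v):
    # Score each encoder state against the current decoder state,
    # then blend the encoder states with the resulting weights.
    scores = np.array([v @ np.tanh(W1 @ e + W2 @ dec_state) for e in enc_states])
    a = softmax(scores)                              # one weight per input position
    context = (a[:, None] * enc_states).sum(axis=0)  # blended encoder states
    return a, context

rng = np.random.default_rng(1)
d = 6
enc = rng.normal(size=(5, d))   # 5 encoder states
dec = rng.normal(size=d)        # current decoder state
W1, W2, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
a, ctx = attend(enc, dec, W1, W2, v)
# The weights sum to 1 and differ at every decoder step, because `dec` changes.
```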
Visualized Attention
Attention can be visualized, for insight. It is interpreted as focus: what part of the input is relevant for this part of the output?
… instead of blending the encoder states to propagate extra information to the decoder, use the attention vector as pointers to the input elements …
Pointer Networks
Formally, RNNs with Attention
Pointer Networks
In detail An encoding RNN converts the input sequence to a code. That is fed to the decoding network. It produces a vector that modulates a content-based attention mechanism over inputs. The output of the attention mechanism is a softmax distribution with dictionary size equal to the length of the input.
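A sketch of that last step, with random, illustrative weights: unlike standard attention, the softmax is not blended back into a context vector; it *is* the output distribution, over input positions.

```python
import numpy as np

def pointer_distribution(enc_states, dec_state, W1, W2, v):
    """Softmax over input positions: the 'dictionary' size equals len(enc_states)."""
    scores = np.array([v @ np.tanh(W1 @ e + W2 @ dec_state) for e in enc_states])
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(2)
d, n = 6, 7                      # hidden size, input length
enc = rng.normal(size=(n, d))    # encoder states, one per input element
dec = rng.normal(size=d)         # decoder state at the current output step
W1, W2, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)

p = pointer_distribution(enc, dec, W1, W2, v)
pointer = int(p.argmax())        # predicted input index at this decoding step
assert p.shape == (n,) and 0 <= pointer < n
```

A longer input simply yields a longer `p`, which is why the output dictionary never needs to be fixed in advance.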
Advantages
➔ Essentially, the output is a subset of the input; most varying-output-label problems can be modelled this way.
➔ Generalizes well (w.r.t. input size).
Ambiguities / Disadvantages
➔ The cost function
Experiments
Toy Problem
Input: a sequence of integers < 5, followed by a sequence of integers between 5 and 10, followed by another sequence of integers < 5. All three sequences have variable length.
Output: find the middle sequence (point to where it starts and ends).
Toy Problem
x = { 4, 1, 2, 3, 1, 1, 6, 9, 10, 8, 7, 3, 1, 1 }   y = { /start, /end }
x = { 1, 2, 2, 1, 5, 9, 7, 1 }   y = { /start, /end }
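Under the task description above (treating the middle sequence as the run of values of at least 5, which matches both examples), the target pointers can be computed directly; a sketch with 1-based indices:

```python
def middle_span(xs):
    """Return 1-based (/start, /end) pointers to the middle (large-value) run."""
    start = next(i for i, v in enumerate(xs, 1) if v >= 5)
    end = max(i for i, v in enumerate(xs, 1) if v >= 5)
    return start, end

x1 = [4, 1, 2, 3, 1, 1, 6, 9, 10, 8, 7, 3, 1, 1]
x2 = [1, 2, 2, 1, 5, 9, 7, 1]
print(middle_span(x1), middle_span(x2))  # (7, 11) (5, 7)
```

The network's two output steps are trained to point at exactly these two positions.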
Toy Problem: 96% accuracy
Convex Hull
➔ Good performance on variable-length inputs.
➔ Generalizes to unseen sequence lengths.
Traveling Salesman Problem
Basically, you have a plane...
Basically, you have a plane map...
Traveling Salesman Problem
Does not generalize that well to inputs of unseen length.
Other Uses
Machine Comprehension using Match-LSTM and Answer Pointer (ICLR, 2017)
Was the state of the art, decades ago in late 2016, on the SQuAD task.
Here’s my Point: Argumentation Mining with Pointer Networks (Rejected: ICLR, 2017)
Thank you for your ATTENTION!
See what I did there?
References
➔ Original Paper: https://arxiv.org/abs/1506.03134
➔ Further Reading: This blog post (authored by Dev Nag). Took an experiment from the blog; some quotes too.
➔ Credits:
  Slide 2, 23, 24: http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/
  Slide 3: https://arxiv.org/pdf/1410.5401.pdf
  Slide 8: generated with the help of http://www.wordclouds.com
  Slide 9: https://math.stackexchange.com/questions/1093289/standard-cartesian-plane
  Slide 10: www.donut5k.com
  Slide 11: text snippet from the SQuAD dataset (https://rajpurkar.github.io/SQuAD-explorer/)
  Slide 16: https://arxiv.org/pdf/1406.1078.pdf
  Slide 39: https://venturebeat.com/2011/11/16/the-elder-scrolls-v-skyrim-ships-7m-copies-actual-sales-unclear/