Pointer Networks and Their Uses

Pointer Networks and Their Uses
Oriol Vinyals, Navdeep Jaitly (Google Brain), Meire Fortunato (UC Berkeley)

Presented by: Priyansh Trivedi

Sequence to Sequence Models

Neural Turing Machines

(Word cloud of NLP tasks: Sentiment Analysis, Image Captioning, Argument Mining, Structured Question Answering, Textual Entailment, Semantic Parsing, Word Embeddings, Language Modeling, Paraphrasing, Text-based Question Answering, Named Entity Recognition, Machine Translation)

For NLP, recent RNNs do it all… Sike!

Varying Output Labels: A New Family of Problems

The target vocabulary of the output labels is usually fixed.


(Figure: an English-to-German translation example, output word "Englisch"; the output vocabulary is fixed to German words.)

The target vocabulary of the output labels is usually fixed. But what if it is not?

Convex Hull

The target vocabulary of the output labels is usually fixed. But what if it is not?

X: And the lord speaketh, let there be heaven, and there were Donuts.
Y: /start … /end

But what if it is not?
X: What is the Polish word for wreaths?
Y: t4 - t4

... a festival called Wianki (Polish for Wreaths) have become a tradition and a yearly event in the programme of cultural events in Warsaw. The festival traces its roots to a peaceful pagan ritual where maidens would float their wreaths of herbs on the water to predict when they would be married, and to whom ...

How to model this problem? Let the output labels consist of pointers to the input.

x = { x1, x2, x3, x4, x5, …, xn-1, xn }
y = { 2, 3, n-1 }
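As a concrete (hypothetical) illustration of this formulation: the labels are nothing more than indices into the input sequence, so the output dictionary grows and shrinks with every example. A minimal sketch, using zero-based indices for convenience:

```python
# Hypothetical illustration of pointer-style labels (not from the paper):
# the targets are indices into the input, so the "vocabulary" is just
# range(len(x)) and changes with every example.
x = ["let", "there", "be", "heaven", ",", "and", "there", "were", "Donuts"]
y = [8]                      # a single pointer, here at the token "Donuts"
print([x[i] for i in y])     # -> ['Donuts']
```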

Which problems can be modeled this way?
➔ Combinatorial problems: Convex Hull; Traveling Salesman Problem
➔ Annotation-based NLP: Named Entity Recognition; Argument Mining; Reading Comprehension …
➔ What else?

Can’t Sequence to Sequence models solve these problems?

Sequence to WHAT?

Sequence to Sequence
1. Encode the inputs via an encoder RNN.
2. Once the whole input has been read, the final hidden state of the encoder is used as a context vector for the decoder.
3. Using the context, the decoder outputs the desired labels.
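A minimal numpy sketch of these three steps, assuming toy dimensions, a vanilla RNN cell in place of the LSTMs used in practice, and shared encoder/decoder weights for brevity:

```python
# Toy sequence-to-sequence sketch: encode, hand over the final hidden state,
# decode into a FIXED output vocabulary. All sizes and weights are made up.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, vocab_out = 4, 8, 10                      # hypothetical toy sizes

W_xh = rng.standard_normal((d_h, d_in)) * 0.1        # input  -> hidden
W_hh = rng.standard_normal((d_h, d_h)) * 0.1         # hidden -> hidden
W_hy = rng.standard_normal((vocab_out, d_h)) * 0.1   # hidden -> output logits

def rnn_step(x, h):
    """One vanilla RNN step: h' = tanh(W_xh x + W_hh h)."""
    return np.tanh(W_xh @ x + W_hh @ h)

def seq2seq(inputs, n_out):
    # 1. Encode the inputs with the encoder RNN.
    h = np.zeros(d_h)
    for x in inputs:
        h = rnn_step(x, h)
    # 2. The final hidden state becomes the decoder's context.
    h_dec = h
    # 3. Decode: each step emits a label from the fixed output vocabulary.
    outputs = []
    for _ in range(n_out):
        h_dec = rnn_step(np.zeros(d_in), h_dec)   # feedback of previous output omitted
        logits = W_hy @ h_dec
        outputs.append(int(np.argmax(logits)))
    return outputs

print(seq2seq([rng.standard_normal(d_in) for _ in range(5)], n_out=3))
```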

Limitations (w.r.t. our problem)
➔ The dictionary of output labels has to be fixed.
➔ Even if we take a superset of labels, cannot constrain its usable subsets per input example.

Limitations (Generic)
➔ Fixed context vector size for varied input.
➔ Diminishing influence of previous states on the later states.

Attention Mechanism: a quick refresher.

Blend the encoder states to propagate extra information to the decoder.

Attention Mechanism
➔ Create a weighted combination of all input states (different weights for different positions in the output labels).
➔ Major performance boost for data with longer-term dependencies.
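A minimal sketch of that weighted combination, assuming additive, content-based scoring (Bahdanau-style); the dimensions and weights here are made-up toys:

```python
# Toy content-based attention: score every encoder state against the current
# decoder state, softmax the scores, and blend the encoder states with them.
import numpy as np

rng = np.random.default_rng(0)
d_h = 8
enc_states = rng.standard_normal((6, d_h))    # one hidden state per input token
dec_state = rng.standard_normal(d_h)          # decoder state for this output step

W1 = rng.standard_normal((d_h, d_h)) * 0.1
W2 = rng.standard_normal((d_h, d_h)) * 0.1
v  = rng.standard_normal(d_h) * 0.1

# Alignment scores: u_j = v^T tanh(W1 e_j + W2 d)
scores = np.array([v @ np.tanh(W1 @ e + W2 @ dec_state) for e in enc_states])
weights = np.exp(scores - scores.max())
weights /= weights.sum()                      # softmax: different weights per output step

context = weights @ enc_states                # the blended "extra information"
print(weights.round(3), context.shape)
```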

Visualized Attention
Attention can be visualized for insight. It is interpreted as focus: what part of the input is relevant for this part of the output?

… instead of blending the encoder states to propagate extra information to the decoder, use the attention vector as pointers to the input elements …

Pointer Networks

Formally, RNNs with Attention

Pointer Networks

In detail: An encoding RNN converts the input sequence to a code that is fed to the decoding network. At each output step, the decoder produces a vector that modulates a content-based attention mechanism over the inputs. The output of the attention mechanism is a softmax distribution with dictionary size equal to the length of the input.
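Putting that description into a short sketch (same toy setup as the attention snippet earlier, one hypothetical decoding step): the attention scores are computed exactly as before, but the softmax over input positions is used directly as the output distribution, so the "dictionary" is the set of input indices.

```python
# One pointer-network decoding step (toy sketch, not the paper's code):
# compute content-based attention over the inputs and treat the resulting
# softmax itself as the prediction, i.e. a distribution over input positions.
import numpy as np

rng = np.random.default_rng(1)
d_h, n = 8, 6
enc_states = rng.standard_normal((n, d_h))   # e_1 ... e_n from the encoder
dec_state = rng.standard_normal(d_h)         # decoder state d_i at this step

W1 = rng.standard_normal((d_h, d_h)) * 0.1
W2 = rng.standard_normal((d_h, d_h)) * 0.1
v  = rng.standard_normal(d_h) * 0.1

# u^i_j = v^T tanh(W1 e_j + W2 d_i), then the prediction is softmax(u^i)
u = np.array([v @ np.tanh(W1 @ e + W2 @ dec_state) for e in enc_states])
p = np.exp(u - u.max())
p /= p.sum()                                 # dictionary size == len(inputs)

pointer = int(np.argmax(p))                  # the predicted label is an input index
print(f"points at input position {pointer}; distribution {p.round(3)}")
# Training minimizes the cross-entropy of p against the true input index.
```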

Ambiguities
➔ The cost function
➔ Disadvantages

Advantages
➔ Essentially, the output is a subset of the input; most varying-output-label problems can be modelled this way.
➔ Generalizes well (w.r.t. input size).

Experiments

Toy Problem
Input: a sequence of integers < 5, followed by a sequence of integers between 5 and 10, followed by another sequence of integers < 5. All three sequences have variable length.
Output: Find the middle sequence.

Toy Problem
x = { 4, 1, 2, 3, 1, 1, 6, 9, 10, 8, 7, 3, 1, 1 }   y = { /start, /end }
x = { 1, 2, 2, 1, 5, 9, 7, 1 }   y = { /start, /end }
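A possible generator for this toy task (my reconstruction from the slide; the exact ranges and the /start, /end target encoding are assumptions):

```python
# Toy-task generator: low numbers, a middle run of high numbers, low numbers
# again; the target marks where the middle run starts and ends.
import random

def make_example(max_len=6):
    low1 = [random.randint(1, 4) for _ in range(random.randint(1, max_len))]
    mid  = [random.randint(6, 10) for _ in range(random.randint(1, max_len))]
    low2 = [random.randint(1, 4) for _ in range(random.randint(1, max_len))]
    x = low1 + mid + low2
    start = len(low1)               # pointer for /start (0-based index into x)
    end = len(low1) + len(mid) - 1  # pointer for /end
    return x, (start, end)

x, y = make_example()
print(x, y)
```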

Toy Problem

96% accuracy

Convex Hull

Convex Hull

➔ Good performance on variable-length inputs
➔ Generalizes to unseen sequence lengths
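Worth noting that the ground truth for this task is itself a sequence of pointers. A small illustrative sketch (assuming scipy is available; this is not the paper's data pipeline):

```python
# The convex hull of a point set is naturally expressed as indices into the
# input: scipy returns exactly that, which is what a pointer network predicts.
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
points = rng.random((10, 2))       # 10 random points in the unit square
hull = ConvexHull(points)
print(hull.vertices)               # indices into `points`, in counter-clockwise order
```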

Traveling Salesman Problem

Basically, you have a plane...

Traveling Salesman Problem

Basically, you have a plane map...

Traveling Salesman Problem
Does not generalize as well to inputs of unseen length.

Other Uses

Machine Comprehension Using Match-LSTM and Answer Pointer (ICLR 2017): was the state of the art on the SQuAD task, decades ago in late 2016.

Here’s my Point: Argumentation Mining with Pointer Networks (Rejected: ICLR, 2017)

Thank you for your ATTENTION!

See what I did there?

References

➔ Original Paper: https://arxiv.org/abs/1506.03134
➔ Further Reading: this blog post (authored by Dev Nag); an experiment and some quotes were taken from it.



Credits:
➔ Slides 2, 23, 24: http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/
➔ Slide 3: https://arxiv.org/pdf/1410.5401.pdf
➔ Slide 8: generated with the help of http://www.wordclouds.com
➔ Slide 9: https://math.stackexchange.com/questions/1093289/standard-cartesian-plane
➔ Slide 10: www.donut5k.com
➔ Slide 11: text snippet from the SQuAD dataset (https://rajpurkar.github.io/SQuAD-explorer/)
➔ Slide 16: https://arxiv.org/pdf/1406.1078.pdf
➔ Slide 39: https://venturebeat.com/2011/11/16/the-elder-scrolls-v-skyrim-ships-7m-copies-actual-sales-unclear/