Sequence Models and Attention Mechanism
Basic Models
Let’s say you want to input a French sentence like "Jane visite l’Afrique en septembre" and translate it into the English sentence "Jane is visiting Africa in September". How can you train a neural network to input the sequence x and output the sequence y?
First, let’s build an LSTM as the encoder network. We feed the network one word at a time from the original French sentence. After ingesting the input sequence, the RNN outputs a vector that represents the input sentence. We then build a decoder network that takes this vector from the encoder and outputs the translated English words, one at a time.
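As a minimal sketch of this encoder-decoder idea (assuming a Keras-style API; the vocabulary sizes, embedding size, and hidden size below are made-up values, not ones from the course):

```python
# Minimal sketch of an LSTM encoder-decoder for translation (Keras / TF2).
# All sizes below are illustrative assumptions.
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

fr_vocab, en_vocab, emb_dim, hidden = 10000, 10000, 128, 256

# Encoder: reads the French sentence one word at a time and summarizes it
# in its final hidden and cell states.
enc_in = Input(shape=(None,))
enc_emb = Embedding(fr_vocab, emb_dim)(enc_in)
_, h, c = LSTM(hidden, return_state=True)(enc_emb)

# Decoder: starts from the encoder's states and predicts the English sentence
# one word at a time (teacher forcing during training).
dec_in = Input(shape=(None,))
dec_emb = Embedding(en_vocab, emb_dim)(dec_in)
dec_out, _, _ = LSTM(hidden, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[h, c])
dec_probs = Dense(en_vocab, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], dec_probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```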
Similarly, we can use the same encoder-decoder architecture for image captioning. The encoder becomes a CNN (e.g., AlexNet), and we replace its last layer (the softmax) with an RNN decoder. This model works quite well, especially for generating captions that are not too long.
Machine translation as building a conditional language model
Since the decoder in a machine translation model is an RNN, it generates text probabilistically, meaning it can produce multiple English translations for the same French sentence:
Jane visite l’Afrique en septembre.
-> Jane is visiting Africa in September.
-> Jane is going to be visiting Africa in September.
-> In September, Jane will visit Africa.
-> Her African friend welcomed Jane in September.
However, we don’t want to output a random English translation; we want to output the best, most likely English translation. In the example above, the first translation is clearly the most accurate, while the others are suboptimal.
But how can we control this generation process? We can frame it as a conditional probability problem: given an input French sentence $x$, find the English sentence $y = (y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle})$ that maximizes the conditional probability $P(y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle} \mid x)$. In this sense, machine translation is a conditional language model: an ordinary language model estimates $P(y)$, whereas here the decoder is conditioned on the encoded input sentence.
A naive approach is greedy search: at each step, pick the single most likely next word. But maximizing each word individually does not maximize the probability of the whole sentence. For example, the model may rate "Jane is going" higher than "Jane is visiting", because "going" is a more commonly seen word in English, even though the sentence that starts with "Jane is visiting" is the better translation overall.
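To make this concrete, here is a tiny worked illustration (the probabilities are made up purely for illustration). Suppose the model assigns

$$P(\text{going} \mid x, \text{"Jane is"}) = 0.40 > 0.30 = P(\text{visiting} \mid x, \text{"Jane is"}),$$

so greedy search picks "going". But if the probability of completing the sentence is $0.05$ after "Jane is going" versus $0.20$ after "Jane is visiting", and $P(\text{"Jane is"} \mid x) = 0.10$, then

$$P(\text{"Jane is going}\ldots\text{"} \mid x) = 0.10 \times 0.40 \times 0.05 = 0.002 \;<\; 0.006 = 0.10 \times 0.30 \times 0.20 = P(\text{"Jane is visiting}\ldots\text{"} \mid x).$$

The locally greedy choice loses once we account for the whole sentence.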
Instead, what we need is a search algorithm that finds the following value for us:

$$\arg\max_{y} \; P(y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle} \mid x)$$
Beam Search Algorithm
Let’s explain the beam search algorithm using our running example. In the first step, the decoder outputs the probability of the first English word, $P(y^{\langle 1 \rangle} \mid x)$ (the output comes from a softmax layer over the 10,000 words in the vocabulary). Beam search keeps a number of candidates given by its beam width. If beam_width = 3, then beam search considers three candidates at a time.
Let’s say that when evaluating the first word, it finds that "in", "jane", and "september" are the three most likely choices. Beam search then keeps all three of these words in memory as candidates to try.
For each of these three choices, we consider what the second word should be: after "in", the second word might be "a", "aaron", ..., "september", ..., or "zulu". To evaluate the probability of each second word, we hard-wire the first decoder output to "in" and read off $P(y^{\langle 2 \rangle} \mid x, \text{"in"})$ from the network's softmax. By hard-wiring $y^{\langle 1 \rangle}$, the probability of the first two words together is

$$P(y^{\langle 1 \rangle}, y^{\langle 2 \rangle} \mid x) = P(y^{\langle 1 \rangle} \mid x) \, P(y^{\langle 2 \rangle} \mid x, y^{\langle 1 \rangle})$$
Note that the goal of the second step of beam search is to find the pair of first and second words that is most likely overall, not just the second word that is most likely on its own.
We repeat the same process to find the most likely second words after "jane" and after "september". Let’s say we have 10,000 words in the dictionary. In step 2, with a beam width of 3, we need to evaluate 3 × 10,000 = 30,000 options (10,000 for each candidate first word). We then keep the top 3 pairs according to the formula above (the ones with the highest conditional probability win).
For example, in step 2, "september" turns out to be the most likely word to follow "in". For "jane", the model picks "is" and "visits". Note that since beam_width = 3 and these three pairs fill the beam, "september" is dropped as a candidate first word.
For step 3, we simply repeat the same process. The three candidate prefixes are now "in september", "jane is", and "jane visits".
The final outcome of this process, adding one word at a time, is that beam search settles on "Jane visits Africa in September. <EOS>", terminating at the end-of-sentence symbol.
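Here is a compact sketch of the procedure just described. The function `next_word_log_probs` is a hypothetical stand-in for the decoder's softmax output (log probabilities of each vocabulary word given the input sentence and the partial translation); working in log space anticipates the refinement discussed in the next section.

```python
import heapq

def beam_search(x, next_word_log_probs, vocab, beam_width=3, max_len=30, eos="<EOS>"):
    """Keep the beam_width most likely partial translations at each step.

    next_word_log_probs(x, prefix) -> {word: log P(word | x, prefix)}
    (hypothetical interface standing in for the RNN decoder's softmax).
    """
    beams = [([], 0.0)]                      # (prefix words, cumulative log prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = next_word_log_probs(x, prefix)
            for word in vocab:               # beam_width * |vocab| options per step
                candidates.append((prefix + [word], score + log_probs[word]))
        # keep only the top beam_width candidates by total log probability
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
        still_open = []
        for prefix, score in beams:
            (completed if prefix[-1] == eos else still_open).append((prefix, score))
        beams = still_open
        if not beams:                        # every surviving candidate has ended
            break
    best = max(completed + beams, key=lambda c: c[1])
    return best[0]
```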
Refinements to Beam Search
Notice that the objective we maximize is the product of a sequence of probabilities:

$$\arg\max_{y} \; \prod_{t=1}^{T_y} P(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle})$$
These probabilities are all numbers less than one, and multiplying many of them together yields a tiny number, which can cause numerical underflow: a value too small for your computer's floating-point representation to store accurately.
In practice, instead of maximizing this product, we take logs:

$$\arg\max_{y} \; \sum_{t=1}^{T_y} \log P(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle})$$
The log of a product becomes a sum of logs, and maximizing this sum of log probabilities gives the same result when selecting the most likely sentence. There is one more refinement: both objectives above unnaturally prefer short translations, because every extra word multiplies in another probability less than one (or adds another negative log term). To counter this, instead of taking the argmax of the raw sum of log probabilities, we normalize by the number of words in the translation, usually softened with an exponent $\alpha$ (a hyperparameter, often around 0.7):

$$\arg\max_{y} \; \frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle})$$
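A short sketch of this normalized score, with made-up numbers purely for illustration:

```python
import math

def normalized_score(log_probs, alpha=0.7):
    """Length-normalized log-likelihood of one candidate translation.

    log_probs: [log P(y<t> | x, y<1..t-1>) for each output word t].
    Dividing by Ty ** alpha softens the penalty longer sentences would
    otherwise pay for having more negative log terms.
    """
    Ty = len(log_probs)
    return sum(log_probs) / (Ty ** alpha)

# Hypothetical numbers: the 3-word candidate loses on the raw sum simply
# because it has more terms, but wins after length normalization.
short = [math.log(0.20), math.log(0.30)]
longer = [math.log(0.25), math.log(0.30), math.log(0.40)]
print(sum(short), sum(longer))                           # raw sums of log probs
print(normalized_score(short), normalized_score(longer))
```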
Error analysis on Beam Search
As you can see, beam search is an approximate (heuristic) search algorithm, so it doesn’t always output the most likely sentence; it only keeps track of the top 3, 10, or 100 possibilities. So what if beam search makes a mistake? We need some error analysis to figure out whether it is the beam search algorithm or the RNN model that is causing the problems.
Let’s reuse our running example from above. We use $y^*$ to denote the human translation and $\hat{y}$ to denote the algorithm's output:
Jane visite l’Afrique en septembre.
-> Human: Jane visits Africa in September.
-> Algorithm: Jane visited Africa last September.
We use the RNN model's softmax outputs to compute $P(y^* \mid x)$ and $P(\hat{y} \mid x)$, and compare the two values:
- Case 1: $P(y^* \mid x) > P(\hat{y} \mid x)$. Beam search chose $\hat{y}$, but $y^*$ attains a higher probability under the model. Conclusion: beam search is at fault.
- Case 2: $P(y^* \mid x) \le P(\hat{y} \mid x)$. $y^*$ is a better translation than $\hat{y}$, but the RNN predicted the opposite. Conclusion: the RNN model is at fault.
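The comparison can be written as a tiny decision rule. Here `log_prob(x, y)` is a hypothetical helper that scores a full sentence with the RNN model; it is not from the course code.

```python
def attribute_error(x, y_star, y_hat, log_prob):
    """Decide whether beam search or the RNN model is at fault for one example.

    y_star: the human translation; y_hat: the translation beam search produced.
    log_prob(x, y) -> log P(y | x) under the RNN model (hypothetical helper).
    """
    if log_prob(x, y_star) > log_prob(x, y_hat):
        # The model prefers the human translation, yet beam search failed to
        # find it: the search procedure is at fault.
        return "beam search at fault"
    else:
        # The model scores its own worse output at least as high as the human
        # translation: the RNN model is at fault.
        return "RNN model at fault"
```

Tallying these verdicts over many dev-set errors tells you whether it is more worthwhile to increase the beam width or to keep improving the RNN model.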
Attention model intuition
We’ve been using an encoder-decoder architecture for machine translation, where one RNN reads in a sentence and a different one outputs the translation. There’s a modification to this called the attention model that makes all of this work much better.
The way our encoder has worked so far is to read in the whole sentence, memorize all the words, and store them in its activations; the decoder network then takes over to generate the English translation. However, this is very different from how a human translator works. A human translator usually reads the first part of the sentence and translates a few words, then looks at a few more words, generates a few more words, and so on.
It turns out that this RNN-based encoder-decoder architecture does not perform very well on long sentences: the BLEU score drops quickly once sentences grow beyond roughly 20 or 30 words, because the RNN has difficulty memorizing a very long sentence.
Before we introduce the attention mechanism, let’s first set up our new network architecture. Let’s say we use a bidirectional RNN for the translation job. Instead of doing a word-for-word translation (outputting an English word directly for each French word), we only use this bidirectional RNN to compute a set of features, the activations $a^{\langle t' \rangle}$, for every word in the input sentence.
Now the question is: if we don’t look at the whole sentence, which part of the input French sentence should we look at when trying to generate the first English word? For each input word, we look at the words before and after it, and our attention model computes a set of attention weights for the nearby words, representing how much attention we should pay to each of them.
For example, we’re going to use $\alpha^{\langle 1,1 \rangle}$ to denote how much attention the first English word should pay to the first French word, $\alpha^{\langle 1,2 \rangle}$ for the second French word, and so on.
Once we have gathered the attention weights over the nearby words, we combine them into a context that is fed into the other RNN (the decoder) for translating the first word. We then repeat the same process to generate the second, third, and remaining words.
Attention model in detail
Let’s now formalize that intuition into the exact details of how you would implement an attention model. Since we have a bidirectional RNN, we use:
- $\overrightarrow{a}^{\langle t' \rangle}$: hidden state of the forward direction (the pre-attention LSTM).
- $\overleftarrow{a}^{\langle t' \rangle}$: hidden state of the backward direction (the pre-attention LSTM).
We concatenate the two into a single feature vector per input word, $a^{\langle t' \rangle} = \big(\overrightarrow{a}^{\langle t' \rangle}, \overleftarrow{a}^{\langle t' \rangle}\big)$.
Next, we have a forward-only, single-direction RNN (the post-attention decoder) with state $s^{\langle t \rangle}$. We use $y^{\langle t \rangle}$ to denote the translated word at timestep $t$ and $c^{\langle t \rangle}$ to denote the input context at that timestep.
We use the following formula for the amount of attention $\alpha^{\langle t, t' \rangle}$ that output word $y^{\langle t \rangle}$ should pay to the input activation $a^{\langle t' \rangle}$ (a softmax over the input positions, so the weights for each output step sum to one):

$$\alpha^{\langle t, t' \rangle} = \frac{\exp\big(e^{\langle t, t' \rangle}\big)}{\sum_{t''=1}^{T_x} \exp\big(e^{\langle t, t'' \rangle}\big)}$$
Next, how do we compute the scores $e^{\langle t, t' \rangle}$? The intuition is that how much attention to pay to the activation $a^{\langle t' \rangle}$ should depend on $a^{\langle t' \rangle}$ itself and on the decoder's hidden state from the previous timestep, $s^{\langle t-1 \rangle}$. Since we don’t know the exact function, we let a small neural network (a dense layer fed with $s^{\langle t-1 \rangle}$ and $a^{\langle t' \rangle}$) learn it, trained jointly with the rest of the model by backpropagation.
Now we have everything needed to compute the context vector: $c^{\langle t \rangle}$ is the attention-weighted sum of the input features,

$$c^{\langle t \rangle} = \sum_{t'=1}^{T_x} \alpha^{\langle t, t' \rangle} \, a^{\langle t' \rangle}$$
Note that one downside of this approach is its quadratic cost: with $T_x$ input words and $T_y$ output words, we have to compute $T_x \times T_y$ attention weights.
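A small NumPy sketch of these two formulas (all shapes are illustrative); note that the weight matrix has $T_y \times T_x$ entries, which is exactly where the quadratic cost comes from:

```python
import numpy as np

Tx, Ty, n_a = 5, 4, 8                        # illustrative sizes
a = np.random.randn(Tx, n_a)                 # encoder features a<t'>
e = np.random.randn(Ty, Tx)                  # scores e<t, t'> from the small network

# alpha<t, t'>: softmax over t', so each output step's weights sum to one
alpha = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)    # shape (Ty, Tx)

# context<t> = sum over t' of alpha<t, t'> * a<t'>
context = alpha @ a                          # shape (Ty, n_a)

print(alpha.shape, alpha.sum(axis=1))        # Ty x Tx weights, each row sums to 1
```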
To piece everything together, here is a diagram that visualizes the computation of the context vector:
- First, we use a RepeatVector node to copy $s^{\langle t-1 \rangle}$ $T_x$ times.
- Then we use a Concatenation node to concatenate $s^{\langle t-1 \rangle}$ and $a^{\langle t' \rangle}$.
- The concatenation of $s^{\langle t-1 \rangle}$ and $a^{\langle t' \rangle}$ is fed into a Dense layer, which computes $e^{\langle t, t' \rangle}$.
- $e^{\langle t, t' \rangle}$ is then passed through a softmax to compute $\alpha^{\langle t, t' \rangle}$.

Note that the diagram doesn’t explicitly show the variable $e^{\langle t, t' \rangle}$; it sits above the Dense layer and below the softmax layer in the diagram.
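Here is a minimal sketch of how these four steps might be wired up as one attention step in Keras (TensorFlow 2 assumed; the layer sizes `Tx`, `n_a`, and `n_s` are made-up values, not ones from the notes):

```python
from tensorflow.keras.layers import RepeatVector, Concatenate, Dense, Softmax, Dot

Tx, n_a, n_s = 30, 32, 64      # input length, encoder state size, decoder state size

repeator     = RepeatVector(Tx)              # copies s<t-1> Tx times
concatenator = Concatenate(axis=-1)          # joins s<t-1> with each a<t'>
densor1      = Dense(10, activation="tanh")  # small network scoring each position
densor2      = Dense(1, activation="relu")   # outputs e<t, t'>
activator    = Softmax(axis=1)               # alpha<t, t'>: softmax over the Tx positions
dotor        = Dot(axes=1)                   # context<t> = sum_t' alpha<t, t'> * a<t'>

def one_step_attention(a, s_prev):
    """a: (batch, Tx, 2*n_a) bidirectional encoder states; s_prev: (batch, n_s)."""
    s_prev = repeator(s_prev)            # (batch, Tx, n_s)
    concat = concatenator([a, s_prev])   # (batch, Tx, 2*n_a + n_s)
    e = densor1(concat)                  # (batch, Tx, 10)
    energies = densor2(e)                # (batch, Tx, 1)    -- the e<t, t'> values
    alphas = activator(energies)         # (batch, Tx, 1)    -- attention weights
    context = dotor([alphas, a])         # (batch, 1, 2*n_a) -- weighted sum of a<t'>
    return context
```

This context vector is then fed, one decoder timestep at a time, into the post-attention RNN with state $s^{\langle t \rangle}$ described above.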