# Recurrent Neural Network

### Sequence Data Notations

• $x^{\langle i \rangle}$, 表示输入$x$中的第$i$个token
• $y^{\langle i \rangle}$, 表示输出$y$中的第$i$个token
• $x^{(i)\langle t \rangle}$，表示第$i$条输入样本中的第$t$个token
• $y^{(i)\langle t \rangle}$，表示第$i$条输出样本中的第$t$个token
• $n_x$，表示某个输入token向量长度
• $n_y$，表示某个输出token长度
• $x_5^{(2)\langle 3 \rangle\langle 4 \rangle}$, 表示第二条输入样本中，第三个layer中第4个token向量中的第五个元素

x: "Harry Potter and Hermione Granger invented a new spell."
y:  1 1 0 1 1 0 0 0 0


### Recurrent Neural Network

RNN的输入是一组sequence data，seguence中的每个$x^{\langle i \rangle}$会通过某一系列运算产生一个输出$y^{\langle i \rangle}$，并且该时间片上的输入除了有$x^{\langle i \rangle}$之外，还有可能来自前一个时间片的输出$a^{\langle i-1 \rangle}$，如下图所示

x.shape = (n_x, m, T_x)
a.shape = (n_a, m, T_a)


x_i = x[:,:,t]
a_i = a[:,:,t]


Numpy的实现如下

def rnn_cell_forward(xt, a_prev, parameters):
"""
Arguments:
xt -- your input data at timestep "t", numpy array of shape (n_x, m).
a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
parameters -- python dictionary containing:
Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
ba --  Bias, numpy array of shape (n_a, 1)
by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
Returns:
a_next -- next hidden state, of shape (n_a, m)
yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
cache -- tuple of values needed for the backward pass, contains (a_next, a_prev, xt, parameters)
"""

# Retrieve parameters from "parameters"
Wax = parameters["Wax"]
Waa = parameters["Waa"]
Wya = parameters["Wya"]
ba = parameters["ba"]
by = parameters["by"]

# compute next activation state using the formula given above
a_next = np.tanh(np.dot(Wax, xt) + np.dot(Waa, a_prev) + ba)
# compute output of the current cell using the formula given above
yt_pred = softmax(np.dot(Wya, a_next) + by)
# store values you need for backward propagation in cache
cache = (a_next, a_prev, xt, parameters)

return a_next, yt_pred, cache

def rnn_forward(x, a0, parameters):
"""
Arguments:
x -- Input data for every time-step, of shape (n_x, m, T_x).
a0 -- Initial hidden state, of shape (n_a, m)
parameters -- python dictionary containing:
Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
ba --  Bias numpy array of shape (n_a, 1)
by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

Returns:
a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
y_pred -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
caches -- tuple of values needed for the backward pass, contains (list of caches, x)
"""

# Initialize "caches" which will contain the list of all caches
caches = []

# Retrieve dimensions from shapes of x and parameters["Wya"]
n_x, m, T_x = x.shape
n_y, n_a = parameters["Wya"].shape

a = np.zeros((n_a, m, T_x))
y_pred = np.zeros((n_y, m, T_x))

a_next = a0

# loop over all time-steps
for t in range(T_x):
# Update next hidden state, compute the prediction, get the cache
a_next, yt_pred, cache = rnn_cell_forward(x[:,:,t], a_next, parameters)
# Save the value of the new "next" hidden state in a
a[:,:,t] = a_next
# Save the value of the prediction in y
y_pred[:,:,t] = yt_pred
# Append "cache" to "caches"
caches.append(cache)

# store values needed for backward propagation in cache
caches = (caches, x)

return a, y_pred, caches


### Language Model

Language Model有很多种，其输入为一个句子，输出为预测结果，通常以概率形式表示。假如我们的字典有10000个单词，输入文本如下

Cats average 15 hours of sleep a day. <EOS>


1. 另$x^{\langle 1 \rangle}$和$a^{\langle 0 \rangle}$均为0，输出$\hat y^{\langle 1 \rangle}$是一个softmax结果，表示字典中每个单词出现的概率，是一个[10000,1]的向量，由于未经训练，每个单词出现的概率均为1/10000
2. 接下来我们用真实$y^{\langle 1 \rangle}$（”Cats”在字典中出现的概率）和 $a^{\langle 1 \rangle}$作为下一层的输入，得到$\hat y^{\langle 2 \rangle}$，其含义为当给定前一个单词为”Cats”时，当前单词是字典中各个单词的概率即 $P(?? |Cats)$，因此$\hat y^{\langle 2 \rangle}$也是[10000,1]的。注意到，此时的$x^{\langle 2 \rangle} = y^{\langle 1 \rangle}$
3. 类似的，第三层的输入为真实结果$y^{\langle 2 \rangle}$，即$P(average |Cats)$，和$a^{\langle 2 \rangle}$，输出为$\hat y^{\langle 2 \rangle}$，表示$P(?? |Cats average)$。同理，此时$x^{\langle 3 \rangle} = y^{\langle 2 \rangle}$
4. 重复上述步骤，直到走到EOS的位置

### 梯度消失

The cat, which already ate ... , was full
The cats, which already ate ..., were full


### GRU

GRU(Gated Recurrent Uinit)被设计用来解决上述问题，其核心思想是为每个token引入一个GRU unit - $c^{\langle t \rangle}$，计算方式如下

Tha cat,    which   already   ate ...,   was    full.
c[t]=1                               c[t]=1
g[t]=1  g[t]=0  g[t]=0    g[t]=0 ... g[t]=0


### LSTM

Long Short Term Memory(LSTM)是另一种通过建立前后token链接来解决梯度消失问题的方法，相比GRU更为流行一些。和GRU不同的是

1. LSTM使用$a^{\langle {t-1} \rangle}$来计算 $\hat c^{\langle t \rangle}$和$\Gamma_u ^{\langle t \rangle}$
2. LSTM使用两个gate来控制$c^{\langle t \rangle}$，一个前面提到的$\Gamma_u ^{\langle t \rangle}$，另一个是forget gate - $\Gamma_f ^{\langle t \rangle}$
3. LSTM使用了一个output gate来控制$a^{\langle t \rangle}$

Learn Gate

Learn Gate首先将short-term memroy($STM_{(t-1)}$)和$E_t$进行combine，然后将结果和一个ignore vector进行element-wise的相乘来决定矩阵中那些元素需要保留，哪些舍弃。这个ignore vector同样是通过$STM_{(t-1)}$和$E_t$生成，只是非线性函数用了sigmoid来限制输出的值域。

Forget Gate

Forget Gate用来控制long-term memory中哪些保留哪些舍弃，具体做法是$LTM_{t-1}$乘以一个forget factor$f(t)$。

Remember Gate

Remember Gate将上面两个gate的输出进行相加

Use Gate

Use Gate的输入来自Learn Gate和Forget Gate，组合方式如下

LSTM的numpy实现

def lstm_cell_forward(xt, a_prev, c_prev, parameters):
"""
Arguments:
xt -- your input data at timestep "t", numpy array of shape (n_x, m).
a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
c_prev -- Memory state at timestep "t-1", numpy array of shape (n_a, m)
parameters -- python dictionary containing:
Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
Wi -- Weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
bi -- Bias of the update gate, numpy array of shape (n_a, 1)
Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
bc --  Bias of the first "tanh", numpy array of shape (n_a, 1)
Wo -- Weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
bo --  Bias of the output gate, numpy array of shape (n_a, 1)
Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

Returns:
a_next -- next hidden state, of shape (n_a, m)
c_next -- next memory state, of shape (n_a, m)
yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
cache -- tuple of values needed for the backward pass, contains (a_next, c_next, a_prev, c_prev, xt, parameters)

Note: ft/it/ot stand for the forget/update/output gates, cct stands for the candidate value (c tilde),
c stands for the cell state (memory)
"""

# Retrieve parameters from "parameters"
Wf = parameters["Wf"] # forget gate weight
bf = parameters["bf"]
Wi = parameters["Wi"] # update gate weight (notice the variable name)
bi = parameters["bi"] # (notice the variable name)
Wc = parameters["Wc"] # candidate value weight
bc = parameters["bc"]
Wo = parameters["Wo"] # output gate weight
bo = parameters["bo"]
Wy = parameters["Wy"] # prediction weight
by = parameters["by"]

# Retrieve dimensions from shapes of xt and Wy
n_x, m = xt.shape
n_y, n_a = Wy.shape

concat = np.zeros((n_a + n_x, m))
concat[: n_a, :] = a_prev
concat[n_a :, :] = xt

# Compute values for ft, it, cct, c_next, ot, a_next
ft = sigmoid(np.dot(Wf, concat) + bf)
it = sigmoid(np.dot(Wi, concat) + bi)
cct = np.tanh(np.dot(Wc, concat) + bc)
c_next = ft * c_prev + it * cct
ot = sigmoid(np.dot(Wo, concat) + bo)
a_next = ot * np.tanh(c_next)

# Compute prediction of the LSTM cell
yt_pred = softmax(np.dot(Wy, a_next) + by)

# store values needed for backward propagation in cache
cache = (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)

return a_next, c_next, yt_pred, cache

def lstm_forward(x, a0, parameters):
"""
Arguments:
x -- Input data for every time-step, of shape (n_x, m, T_x).
a0 -- Initial hidden state, of shape (n_a, m)
parameters -- python dictionary containing:
Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
Wi -- Weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
bi -- Bias of the update gate, numpy array of shape (n_a, 1)
Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
bc -- Bias of the first "tanh", numpy array of shape (n_a, 1)
Wo -- Weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
bo -- Bias of the output gate, numpy array of shape (n_a, 1)
Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

Returns:
a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
y -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
c -- The value of the cell state, numpy array of shape (n_a, m, T_x)
caches -- tuple of values needed for the backward pass, contains (list of all the caches, x)
"""

# Initialize "caches", which will track the list of all the caches
caches = []

### START CODE HERE ###
Wy = parameters['Wy'] # saving parameters['Wy'] in a local variable in case students use Wy instead of parameters['Wy']
# Retrieve dimensions from shapes of x and parameters['Wy']
n_x, m, T_x = x.shape
n_y, n_a = parameters["Wy"].shape

# initialize "a", "c" and "y" with zeros
a = np.zeros((n_a, m, T_x))
c = np.zeros((n_a, m, T_x))
y = np.zeros((n_y, m, T_x))

# Initialize a_next and c_next
a_next = a0
c_next = np.zeros(a_next.shape)

# loop over all time-steps
for t in range(T_x):
# Get the 2D slice 'xt' from the 3D input 'x' at time step 't'
xt = x[:,:,t]
# Update next hidden state, next memory state, compute the prediction, get the cache
a_next, c_next, yt, cache = lstm_cell_forward(x[:, :, t], a_next, c_next, parameters)
# Save the value of the new "next" hidden state in a
a[:,:,t] = a_next
# Save the value of the next cell state
c[:,:,t]  = c_next
# Save the value of the prediction in y
y[:,:,t] = yt
# Append the cache into cache
caches.append(cache)

# store values needed for backward propagation in cache
caches = (caches, x)

return a, y, c, caches
`