Recurrent Neural Network


Sequence Data Notations

  • $x^{\langle i \rangle}$: the $i$-th token of the input $x$
  • $y^{\langle i \rangle}$: the $i$-th token of the output $y$
  • $x^{(i)\langle t \rangle}$: the $t$-th token of the $i$-th input example
  • $y^{(i)\langle t \rangle}$: the $t$-th token of the $i$-th output example
  • $n_x$: the length of an input token vector
  • $n_y$: the length of an output token
  • $x^{[3](2)\langle 4 \rangle}_5$: the 5th element of the 4th token vector in the 3rd layer of the 2nd input example

Take text input as an example. Suppose we have a dictionary of 10,000 words and a piece of text, and the task is to find the words in the text that are people's names. The output is a vector in which 1 marks a name:

x: "Harry Potter and Hermione Granger invented a new spell."
y:  1 1 0 1 1 0 0 0 0

We use $x^{\langle i \rangle}$ to denote each word in the sentence above, so $x^{\langle 1 \rangle}$ is Harry, $x^{\langle 2 \rangle}$ is Potter, and so on. Each token is a one-hot vector over the dictionary; suppose the word "and" is the 5th entry in our dictionary, then $x^{\langle 3 \rangle}$ (the token for "and") is

$x^{\langle 3 \rangle} = [0,0,0,0,1,0,\dots,0]$

Note that this is conventionally written as a column vector, i.e. $x^{\langle i \rangle}$ has shape $[10000, 1]$.

In practice, $x^{\langle t \rangle}$ is often a 2D tensor, because we usually feed in a mini-batch of $m$ training examples at once. Suppose $m = 20$; we then have 20 column vectors, which we can stack horizontally into a 2D matrix. In the example above, the RNN's input tensor at a single time step therefore has shape $[10000, 20]$.
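
To make the shapes concrete, here is a minimal numpy sketch (my own illustration, not part of the original text) that builds one-hot column vectors and stacks a hypothetical mini-batch of m = 20 tokens into the [10000, 20] input described above:

import numpy as np

vocab_size = 10000

def one_hot(index, vocab_size=vocab_size):
    # column vector of shape (vocab_size, 1) with a single 1 at `index`
    v = np.zeros((vocab_size, 1))
    v[index] = 1
    return v

# hypothetical dictionary indices, one token per example in a mini-batch of m = 20;
# stacking the columns horizontally gives the (10000, 20) input at one time step
indices = np.random.randint(0, vocab_size, size=20)
x_t = np.hstack([one_hot(i) for i in indices])
print(x_t.shape)   # (10000, 20)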

Correspondingly, the $y$ for the sentence above is shown below, where $y^{\langle i \rangle}$ is the probability that the word is a name:

y=[1,1,0,1,1,0,0,0,0]

Recurrent Neural Network

The input to an RNN is a sequence. Each $x^{\langle i \rangle}$ in the sequence goes through a series of computations to produce an output $y^{\langle i \rangle}$, and at each time step the input may include, besides $x^{\langle t \rangle}$, the hidden state $a^{\langle t-1 \rangle}$ from the previous time step, as shown in the figure below.

In the figure, $T$ denotes the time step and $a^{\langle T \rangle}$ is the hidden state at time $T$. Letting $a^{\langle 0 \rangle} = 0$, $a^{\langle 1 \rangle}$ and $\hat{y}^{\langle 1 \rangle}$ are computed as

$$a^{\langle 1 \rangle} = g(W_{aa}a^{\langle 0 \rangle} + W_{ax}x^{\langle 1 \rangle} + b_a) \qquad \hat{y}^{\langle 1 \rangle} = g(W_{ya}a^{\langle 1 \rangle} + b_y)$$

For $a^{\langle t \rangle}$, common choices of the activation function $g$ are $\tanh$ or ReLU; for $\hat{y}^{\langle t \rangle}$, a sigmoid can be used. Generalizing:

$$a^{\langle t \rangle} = g(W_{aa}a^{\langle t-1 \rangle} + W_{ax}x^{\langle t \rangle} + b_a) \qquad \hat{y}^{\langle t \rangle} = g(W_{ya}a^{\langle t \rangle} + b_y)$$

For simplicity, we can merge $W_{aa}$ and $W_{ax}$. Suppose $W_{aa}$ is $[100, 100]$ and $W_{ax}$ is $[100, 10000]$ (typically $W_{ax}$ is the wider one). Placing $W_{ax}$ to the right of $W_{aa}$, i.e. $[W_{aa} \mid W_{ax}]$, gives a merged matrix $W_a$ of shape $[100, 10100]$. Having merged the weights into $W_a$, we also need to stack $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$ in the same way, but vertically instead of horizontally: $[a^{\langle t-1 \rangle}; x^{\langle t \rangle}]$ is a $[10100, 1]$ vector. The equations become

$$a^{\langle t \rangle} = g(W_a[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_a) \qquad \hat{y}^{\langle t \rangle} = g(W_y a^{\langle t \rangle} + b_y)$$
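
As a quick sanity check (a sketch of my own, not from the original text), the merged formulation gives exactly the same pre-activation as the two separate weight matrices:

import numpy as np

n_a, n_x = 100, 10000
Waa = np.random.randn(n_a, n_a)           # [100, 100]
Wax = np.random.randn(n_a, n_x)           # [100, 10000]
a_prev = np.random.randn(n_a, 1)          # a^<t-1>
xt = np.random.randn(n_x, 1)              # x^<t>

Wa = np.hstack([Waa, Wax])                # [Waa | Wax], shape (100, 10100)
stacked = np.vstack([a_prev, xt])         # [a^<t-1>; x^<t>], shape (10100, 1)

# both formulations give the same result
print(np.allclose(Waa @ a_prev + Wax @ xt, Wa @ stacked))   # True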

Thus the parameters to learn are concentrated in $W_a$, $b_a$, $W_y$, and $b_y$. In practice, $x$ and $a$ are usually 3D tensors:

x.shape = (n_x, m, T_x)
a.shape = (n_a, m, T_x)

where n_x is the length of the one-hot vector of $x$ (the dictionary size, e.g. 10000), m is the batch size, i.e. the number of examples (sentences), and T_x is the number of time steps per example (the number of tokens in the sentence). $x^{\langle t \rangle}$ (or $a^{\langle t \rangle}$) is the 2D slice at time step $t$:

x_t = x[:,:,t]
a_t = a[:,:,t]

Loss Function

As we saw in the previous section, for each training example the output $\hat{y}^{(i)\langle t \rangle}$ produced at each word is a vector, just as in a classification problem, so the loss for a single word can be the logistic-regression (cross-entropy) loss

$$\mathcal{L}^{\langle t \rangle}(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle}) = -y^{\langle t \rangle}\log\hat{y}^{\langle t \rangle} - (1 - y^{\langle t \rangle})\log(1 - \hat{y}^{\langle t \rangle})$$

For the whole example (sentence), the loss is then the sum of the per-word losses:

$$\mathcal{L}(\hat{y}, y) = \sum_{t=1}^{T} \mathcal{L}^{\langle t \rangle}(\hat{y}^{\langle t \rangle}, y^{\langle t \rangle})$$
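
Below is a small numpy sketch (my own illustration, using the name-detection labels from the first example and made-up predictions) of the per-word cross-entropy loss summed over one sentence:

import numpy as np

def sequence_loss(y_hat, y):
    # per-word logistic loss, summed over the time steps of one example
    eps = 1e-12                           # avoid log(0)
    per_step = -y * np.log(y_hat + eps) - (1 - y) * np.log(1 - y_hat + eps)
    return per_step.sum()

y = np.array([1, 1, 0, 1, 1, 0, 0, 0, 0])                        # true labels
y_hat = np.array([0.9, 0.8, 0.2, 0.7, 0.9, 0.1, 0.1, 0.2, 0.1])  # hypothetical predictions
print(sequence_loss(y_hat, y))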

The backward-propagation process is illustrated below.

Different RNN Architectures

Beyond the architecture described above, RNNs can take on different structures depending on the application, as shown in the figure below.

Language Model

There are many kinds of language models. The input is a sentence and the output is a prediction, usually expressed as probabilities. Suppose our dictionary has 10,000 words and the input text is

Cats average 15 hours of sleep a day. <EOS>

We can build our model on the RNN described earlier, as shown in the figure below.

Each cell has the structure shown in the following figure:

  1. $x^{\langle 1 \rangle}$ and $a^{\langle 0 \rangle}$ are both 0. The output $\hat{y}^{\langle 1 \rangle}$ is a softmax vector giving the probability of every word in the dictionary, i.e. a $[10000, 1]$ vector; before training, each word has probability 1/10000.
  2. Next we feed the true $y^{\langle 1 \rangle}$ (the one-hot vector for "Cats") together with $a^{\langle 1 \rangle}$ into the next step and obtain $\hat{y}^{\langle 2 \rangle}$, which gives the probability of each dictionary word being the current word given that the previous word is "Cats", i.e. $P(?? \mid \text{Cats})$; $\hat{y}^{\langle 2 \rangle}$ is also $[10000, 1]$. Note that here $x^{\langle 2 \rangle} = y^{\langle 1 \rangle}$.
  3. Similarly, the inputs of the third step are the true $y^{\langle 2 \rangle}$, i.e. $P(\text{average} \mid \text{Cats})$, and $a^{\langle 2 \rangle}$; the output is $\hat{y}^{\langle 3 \rangle}$, representing $P(?? \mid \text{Cats average})$. Likewise, $x^{\langle 3 \rangle} = y^{\langle 2 \rangle}$.
  4. Repeat the steps above until the <EOS> position is reached.

The RNN model above can thus predict the next word from the words that precede it.
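
As an illustration (a sketch under my own assumptions, not code from the original post), once such a model is trained it can generate text by repeatedly feeding the word it just produced back in as the next input. The sketch below reuses rnn_cell_forward from Appendix #1 and assumes hypothetical parameters and idx_to_word mappings:

import numpy as np

def sample_sentence(parameters, idx_to_word, max_len=20, eos="<EOS>"):
    # greedy generation: start with x^<1> = 0, a^<0> = 0 and feed y^<t> back as x^<t+1>
    n_a, n_x = parameters["Wax"].shape    # hidden-state size, vocabulary size
    x = np.zeros((n_x, 1))
    a = np.zeros((n_a, 1))
    words = []
    for _ in range(max_len):
        a, y_pred, _ = rnn_cell_forward(x, a, parameters)
        idx = int(np.argmax(y_pred))      # greedy pick; one could also sample from y_pred
        word = idx_to_word[idx]
        if word == eos:
            break
        words.append(word)
        x = np.zeros((n_x, 1))            # x^<t+1> = one-hot vector of the chosen word
        x[idx] = 1
    return " ".join(words)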

Vanishing Gradients

Note that the RNN model above predicts each subsequent word from the words before it. In long sentences, however, related words can be separated by a long span, for example by a relative clause in English:

The cat, which already ate ... , was full
The cats, which already ate ..., were full

In the example above, a long relative clause sits between cat ... was and cats ... were. When the RNN predicts was or were, the subject (cat or cats) appears much earlier in the sequence, which hurts the prediction (if the RNN could recognize that the subject is cat/cats, the probability assigned to was/were should improve). Inside the RNN this manifests during back-propagation: because the unrolled network is very deep, gradients vanish, so back-propagation cannot effectively reach and update the weights associated with cat or cats.

GRU

The GRU (Gated Recurrent Unit) was designed to address this problem. Its core idea is to introduce a memory cell $c^{\langle t \rangle}$ at each token, computed as follows:

$$\tilde{c}^{\langle t \rangle} = \tanh(W_c[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)$$
$$\Gamma_u^{\langle t \rangle} = \sigma(W_u[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)$$
$$c^{\langle t \rangle} = \Gamma_u^{\langle t \rangle} * \tilde{c}^{\langle t \rangle} + (1 - \Gamma_u^{\langle t \rangle}) * c^{\langle t-1 \rangle}$$

Here $\Gamma_u^{\langle t \rangle}$ controls whether the value of $c^{\langle t \rangle}$ is updated; $\sigma$ is usually the sigmoid function, so the entries of $\Gamma_u^{\langle t \rangle}$ are close to 0 or 1; $*$ denotes element-wise multiplication.

Returning to the example above, suppose the $c^{\langle t \rangle}$ value associated with cat/cats is 0 or 1, where 1 means the subject is singular and 0 means it is plural. Then, until was/were is computed, the value of $c^{\langle t \rangle}$ is kept around as a reference, and it is kept by controlling $\Gamma_u^{\langle t \rangle}$:

The cat,    which   already   ate ...,   was    full.
    c[t]=1                               c[t]=1
    g[t]=1  g[t]=0  g[t]=0    g[t]=0 ... g[t]=0

As you can see, whenever $\Gamma_u^{\langle t \rangle} = 0$ we get $c^{\langle t \rangle} = c^{\langle t-1 \rangle} = a^{\langle t-1 \rangle}$, so the earlier information is carried all the way forward.

Note that $c^{\langle t \rangle}$, $\tilde{c}^{\langle t \rangle}$, and $\Gamma_u^{\langle t \rangle}$ are all vectors, and the entries of $\Gamma_u^{\langle t \rangle}$ are close to 0 or 1, so the multiplication in the last equation above is element-wise; this is how $\Gamma_u^{\langle t \rangle}$ plays the role of a gate.
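
To make the GRU equations concrete, here is a minimal numpy sketch of a simplified GRU cell (my own illustration, assuming a parameters dict with Wc, bc, Wu, bu, where Wc and Wu have shape (n_c, n_c + n_x)):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(xt, c_prev, parameters):
    # xt: (n_x, m), c_prev: (n_c, m)
    concat = np.concatenate([c_prev, xt], axis=0)                            # [c^<t-1>, x^<t>]
    c_cand = np.tanh(np.dot(parameters["Wc"], concat) + parameters["bc"])    # c~^<t>
    gamma_u = sigmoid(np.dot(parameters["Wu"], concat) + parameters["bu"])   # update gate
    c_next = gamma_u * c_cand + (1.0 - gamma_u) * c_prev                     # element-wise gating
    return c_next                          # in this simplified GRU, a^<t> = c^<t>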

LSTM

Long Short-Term Memory (LSTM) is another way of building links between earlier and later tokens to address the vanishing-gradient problem, and it is somewhat more popular than the GRU. It differs from the GRU as follows:

  1. LSTM uses $a^{\langle t-1 \rangle}$ to compute $\tilde{c}^{\langle t \rangle}$ and $\Gamma_u^{\langle t \rangle}$
  2. LSTM uses two gates to control $c^{\langle t \rangle}$: the update gate $\Gamma_u^{\langle t \rangle}$ mentioned above and a forget gate $\Gamma_f^{\langle t \rangle}$
  3. LSTM uses an output gate $\Gamma_o^{\langle t \rangle}$ to control $a^{\langle t \rangle}$

$$\tilde{c}^{\langle t \rangle} = \tanh(W_c[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)$$
$$\Gamma_u^{\langle t \rangle} = \sigma(W_u[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)$$
$$\Gamma_f^{\langle t \rangle} = \sigma(W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f)$$
$$c^{\langle t \rangle} = \Gamma_u^{\langle t \rangle} * \tilde{c}^{\langle t \rangle} + \Gamma_f^{\langle t \rangle} * c^{\langle t-1 \rangle}$$
$$\Gamma_o^{\langle t \rangle} = \sigma(W_o[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o)$$
$$a^{\langle t \rangle} = \Gamma_o^{\langle t \rangle} * \tanh(c^{\langle t \rangle})$$

The LSTM equations above introduce three gate functions. Although there are more steps, the logic remains clear, and the cell integrates easily into an RNN. The figure below shows the computation unit of an RNN with an LSTM cell.

Each LSTM cell is differentiable: it contains four kinds of operations (addition, multiplication, $\tanh$, and sigmoid), each of which is differentiable. Chaining the LSTM cells together, the RNN model becomes:

The red line above traces how $c^{\langle t \rangle}$ is remembered: by controlling the gates we can make, for example, $c^{\langle 3 \rangle} = c^{\langle 1 \rangle}$, caching earlier information and thereby mitigating the vanishing-gradient problem.

Another, more intuitive way to understand the LSTM is to view the LSTM cell as a combination of four gates:

Chaining the RNN cells together then gives:

Learn Gate

The Learn Gate first combines the short-term memory $STM_{t-1}$ with the event $E_t$, and then multiplies the result element-wise with an ignore vector that decides which elements to keep and which to discard. The ignore vector is also generated from $STM_{t-1}$ and $E_t$, except that the nonlinearity is a sigmoid, which restricts the output range.

The gate's output is $N_t * i_t$, where $N_t$ and $i_t$ are

$$N_t = \tanh(W_n[STM_{t-1}, E_t] + b_n)$$
$$i_t = \sigma(W_i[STM_{t-1}, E_t] + b_i)$$

Forget Gate

The Forget Gate controls which parts of the long-term memory to keep and which to discard. Concretely, $LTM_{t-1}$ is multiplied by a forget factor $f_t$,

where $f_t$ is computed as

$$f_t = \sigma(W_f[STM_{t-1}, E_t] + b_f)$$

Remember Gate

The Remember Gate adds the outputs of the two gates above, producing the new long-term memory $LTM_t = LTM_{t-1} * f_t + N_t * i_t$.

Use Gate

The Use Gate takes its inputs from the Learn Gate and the Forget Gate and combines them as follows, producing the new short-term memory $STM_t = U_t * V_t$,

where $U_t$ and $V_t$ are computed as

$$U_t = \tanh(W_u(LTM_{t-1} * f_t) + b_u)$$
$$V_t = \sigma(W_v[STM_{t-1}, E_t] + b_v)$$

Putting the four gates together, we obtain the following, which matches the structure described earlier.
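
The following numpy sketch (my own, with hypothetical weight names Wn, Wi, Wf, Wu, Wv and matching biases) wires the four gates above into one step, mirroring the LSTM update written earlier:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates_forward(E_t, LTM_prev, STM_prev, p):
    concat = np.concatenate([STM_prev, E_t], axis=0)   # [STM_{t-1}, E_t]
    # Learn Gate: candidate memory N_t masked by the ignore vector i_t
    N_t = np.tanh(p["Wn"] @ concat + p["bn"])
    i_t = sigmoid(p["Wi"] @ concat + p["bi"])
    # Forget Gate: forget factor f_t applied to the old long-term memory
    f_t = sigmoid(p["Wf"] @ concat + p["bf"])
    # Remember Gate: new long-term memory
    LTM_t = LTM_prev * f_t + N_t * i_t
    # Use Gate: new short-term memory (the cell's output)
    U_t = np.tanh(p["Wu"] @ (LTM_prev * f_t) + p["bu"])
    V_t = sigmoid(p["Wv"] @ concat + p["bv"])
    STM_t = U_t * V_t
    return LTM_t, STM_t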

Resources

Appendix #1: RNN implementation in numpy

import numpy as np

def softmax(z):
    # column-wise softmax used by the forward passes below
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, parameters):
    """
    Arguments:
    xt -- your input data at timestep "t", numpy array of shape (n_x, m).
    a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        ba --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, a_prev, xt, parameters)
    """
    
    # Retrieve parameters from "parameters"
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]
    
    # compute next activation state using the formula given above
    a_next = np.tanh(np.dot(Wax, xt) + np.dot(Waa, a_prev) + ba)
    # compute output of the current cell using the formula given above
    yt_pred = softmax(np.dot(Wya, a_next) + by)
    # store values you need for backward propagation in cache
    cache = (a_next, a_prev, xt, parameters)
    
    return a_next, yt_pred, cache

def rnn_forward(x, a0, parameters):
    """
    Arguments:
    x -- Input data for every time-step, of shape (n_x, m, T_x).
    a0 -- Initial hidden state, of shape (n_a, m)
    parameters -- python dictionary containing:
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        ba --  Bias numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

    Returns:
    a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
    y_pred -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
    caches -- tuple of values needed for the backward pass, contains (list of caches, x)
    """
    
    # Initialize "caches" which will contain the list of all caches
    caches = []
    
    # Retrieve dimensions from shapes of x and parameters["Wya"]
    n_x, m, T_x = x.shape
    n_y, n_a = parameters["Wya"].shape
    
    a = np.zeros((n_a, m, T_x))
    y_pred = np.zeros((n_y, m, T_x))

    a_next = a0
    
    # loop over all time-steps
    for t in range(T_x):
        # Update next hidden state, compute the prediction, get the cache
        a_next, yt_pred, cache = rnn_cell_forward(x[:,:,t], a_next, parameters)
        # Save the value of the new "next" hidden state in a
        a[:,:,t] = a_next
        # Save the value of the prediction in y
        y_pred[:,:,t] = yt_pred
        # Append "cache" to "caches"
        caches.append(cache)
    
    # store values needed for backward propagation in cache
    caches = (caches, x)
    
    return a, y_pred, caches

Appendix #2: LSTM implementation in numpy

def sigmoid(z):
    # element-wise sigmoid used by the gate computations below
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(xt, a_prev, c_prev, parameters):
    """
    Arguments:
    xt -- your input data at timestep "t", numpy array of shape (n_x, m).
    a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
    c_prev -- Memory state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
                        Wi -- Weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        bi -- Bias of the update gate, numpy array of shape (n_a, 1)
                        Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
                        bc --  Bias of the first "tanh", numpy array of shape (n_a, 1)
                        Wo -- Weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        bo --  Bias of the output gate, numpy array of shape (n_a, 1)
                        Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
                        
    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    c_next -- next memory state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, c_next, a_prev, c_prev, xt, parameters)
    
    Note: ft/it/ot stand for the forget/update/output gates, cct stands for the candidate value (c tilde),
          c stands for the cell state (memory)
    """

    # Retrieve parameters from "parameters"
    Wf = parameters["Wf"] # forget gate weight
    bf = parameters["bf"]
    Wi = parameters["Wi"] # update gate weight (notice the variable name)
    bi = parameters["bi"] # (notice the variable name)
    Wc = parameters["Wc"] # candidate value weight
    bc = parameters["bc"]
    Wo = parameters["Wo"] # output gate weight
    bo = parameters["bo"]
    Wy = parameters["Wy"] # prediction weight
    by = parameters["by"]
    
    # Retrieve dimensions from shapes of xt and Wy
    n_x, m = xt.shape
    n_y, n_a = Wy.shape

    concat = np.zeros((n_a + n_x, m))
    concat[: n_a, :] = a_prev
    concat[n_a :, :] = xt

    # Compute values for ft, it, cct, c_next, ot, a_next
    ft = sigmoid(np.dot(Wf, concat) + bf)
    it = sigmoid(np.dot(Wi, concat) + bi)
    cct = np.tanh(np.dot(Wc, concat) + bc)
    c_next = ft * c_prev + it * cct
    ot = sigmoid(np.dot(Wo, concat) + bo)
    a_next = ot * np.tanh(c_next)
    
    # Compute prediction of the LSTM cell
    yt_pred = softmax(np.dot(Wy, a_next) + by)

    # store values needed for backward propagation in cache
    cache = (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)

    return a_next, c_next, yt_pred, cache

def lstm_forward(x, a0, parameters):
    """
    Arguments:
    x -- Input data for every time-step, of shape (n_x, m, T_x).
    a0 -- Initial hidden state, of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
                        Wi -- Weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        bi -- Bias of the update gate, numpy array of shape (n_a, 1)
                        Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
                        bc -- Bias of the first "tanh", numpy array of shape (n_a, 1)
                        Wo -- Weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        bo -- Bias of the output gate, numpy array of shape (n_a, 1)
                        Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
                        
    Returns:
    a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
    y -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
    c -- The value of the cell state, numpy array of shape (n_a, m, T_x)
    caches -- tuple of values needed for the backward pass, contains (list of all the caches, x)
    """

    # Initialize "caches", which will track the list of all the caches
    caches = []
    
    # Retrieve dimensions from shapes of x and parameters["Wy"]
    n_x, m, T_x = x.shape
    n_y, n_a = parameters["Wy"].shape
    
    # initialize "a", "c" and "y" with zeros
    a = np.zeros((n_a, m, T_x))
    c = np.zeros((n_a, m, T_x))
    y = np.zeros((n_y, m, T_x))
    
    # Initialize a_next and c_next
    a_next = a0
    c_next = np.zeros(a_next.shape)
    
    # loop over all time-steps
    for t in range(T_x):
        # Get the 2D slice 'xt' from the 3D input 'x' at time step 't'
        xt = x[:,:,t]
        # Update next hidden state, next memory state, compute the prediction, get the cache
        a_next, c_next, yt, cache = lstm_cell_forward(xt, a_next, c_next, parameters)
        # Save the value of the new "next" hidden state in a
        a[:,:,t] = a_next
        # Save the value of the next cell state
        c[:,:,t]  = c_next
        # Save the value of the prediction in y
        y[:,:,t] = yt
        # Append the cache into cache
        caches.append(cache)
        
    
    # store values needed for backward propagation in cache
    caches = (caches, x)

    return a, y, c, caches