CS231n Assignment3 遇到的问题

实现基于2019年版的课程
主要记录遇到的问题

我的assignment的github仓库，包含全部的代码和notebook。

Image Captioning with RNNs

本部分主要是实现RNN的基础版本。即如下的RNN，不过需要注意的是在代码中实现时要注意矩阵相乘的顺序。

首先是每次time step时的forward的backward，这里比较简单，按照上图公式implement一下就ok了

def rnn_step_forward(x, prev_h, Wx, Wh, b):
    next_h, cache = None, None
    next_h = np.tanh(np.dot(prev_h, Wh) + np.dot(x, Wx) + b)
    cache = Wx, Wh, next_h, prev_h, x, b
    return next_h, cache


def rnn_step_backward(dnext_h, cache):
    dx, dprev_h, dWx, dWh, db = None, None, None, None, None
    Wx, Wh, next_h, prev_h, x, b = cache
    dmid = (1 - np.square(next_h)) * dnext_h
    dprev_h = np.dot(dmid, Wh.T)
    dx = np.dot(dmid, Wx.T)
    dWh = np.dot(prev_h.T, dmid)
    dWx = np.dot(x.T, dmid)
    db = np.sum(dmid, axis=0)
    return dx, dprev_h, dWx, dWh, db

然后是在一定time sequence上的forward和backward，forward就是多层step forward的叠加，backward计算梯度就是将每次对x的梯度持续回传，将对W权值矩阵的梯度叠加即可。

def rnn_forward(x, h0, Wx, Wh, b):
    h, cache = None, None
    N, T, D = x.shape
    cache = []
    h = np.zeros((N, T, h0.shape[1]))
    for i in range(T):
        h0, c = rnn_step_forward(x[:, i, :], h0, Wx, Wh, b)
        h[:, i] += h0
        cache.append(c)
    return h, cache


def rnn_backward(dh, cache):
    dx, dh0, dWx, dWh, db = None, None, None, None, None
    N, T, H = dh.shape
    D = cache[0][0].shape[0]
    dx = np.zeros((N, T, D))
    dh0 = np.zeros((N, H))
    dWx = np.zeros((D, H))
    dWh = np.zeros((H, H))
    db = np.zeros((H,))

    for i in reversed(range(T)):
        dx[:, i], dh0, dWx_mid, dWh_mid, db_mid = rnn_step_backward(dh[:, i] + dh0, cache.pop())
        dWx += dWx_mid
        dWh += dWh_mid
        db += db_mid
    return dx, dh0, dWx, dWh, db

实现这些基本核心组件后，还需要实现的就是根据word 生成 embedding，这里使用的是类似于查询的方法，有一个对应的生成embedding的矩阵，这个也是可以学习的。forward很简单，就是类似的查询，backward的实现我遇到了实现上的问题，最后借鉴了一下别人的code。:)

def word_embedding_forward(x, W):
    out, cache = None, None
    out = W[x]
    cache = x, W
    return out, cache


def word_embedding_backward(dout, cache):
    dW = None
    x, W = cache
    dW = np.zeros_like(W)
    N, T, D = dout.shape
    np.add.at(dW, x.flatten(), dout.reshape(-1, D))     # 这里借鉴了一下别人的代码
    return dW

最后要实现的就是class RNN了。这个部分只要认真看代码中的提升，注意下细节根据流程和之前实现好的模块实现即可了。

forward 函数，位于rnn.py rnn类内。这里的处理是将caption分为两部分：captions_in除了最后一个单词外，所有内容都将被输入到RNN；而captions_out只不包含第一个单词。这就是期望RNN生成的东西。它们彼此相对偏移一个，因为RNN在接收到单词t之后会产生单词（t + 1）。 captions_in的第一个元素将是START caption，我们的期望是captions_out的第一个元素将是第一个单词。

def loss(self, features, captions):
    """
    Compute training-time loss for the RNN. We input image features and
    ground-truth captions for those images, and use an RNN (or LSTM) to compute
    loss and gradients on all parameters.

    Inputs:
    - features: Input image features, of shape (N, D)
    - captions: Ground-truth captions; an integer array of shape (N, T) where
      each element is in the range 0 <= y[i, t] < V

    Returns a tuple of:
    - loss: Scalar loss
    - grads: Dictionary of gradients parallel to self.params
    """
    # Cut captions into two pieces: captions_in has everything but the last word
    # and will be input to the RNN; captions_out has everything but the first
    # word and this is what we will expect the RNN to generate. These are offset
    # by one relative to each other because the RNN should produce word (t+1)
    # after receiving word t. The first element of captions_in will be the START
    # token, and the first element of captions_out will be the first word.
    captions_in = captions[:, :-1]
    captions_out = captions[:, 1:]

    # You'll need this
    mask = (captions_out != self._null)

    # Weight and bias for the affine transform from image features to initial
    # hidden state
    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']

    # Word embedding matrix
    W_embed = self.params['W_embed']

    # Input-to-hidden, hidden-to-hidden, and biases for the RNN
    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']

    # Weight and bias for the hidden-to-vocab transformation.
    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

    loss, grads = 0.0, {}
    ############################################################################
    # TODO: Implement the forward and backward passes for the CaptioningRNN.   #
    # In the forward pass you will need to do the following:                   #
    # (1) Use an affine transformation to compute the initial hidden state     #
    #     from the image features. This should produce an array of shape (N, H)#
    # (2) Use a word embedding layer to transform the words in captions_in     #
    #     from indices to vectors, giving an array of shape (N, T, W).         #
    # (3) Use either a vanilla RNN or LSTM (depending on self.cell_type) to    #
    #     process the sequence of input word vectors and produce hidden state  #
    #     vectors for all timesteps, producing an array of shape (N, T, H).    #
    # (4) Use a (temporal) affine transformation to compute scores over the    #
    #     vocabulary at every timestep using the hidden states, giving an      #
    #     array of shape (N, T, V).                                            #
    # (5) Use (temporal) softmax to compute loss using captions_out, ignoring  #
    #     the points where the output word is <NULL> using the mask above.     #
    #                                                                          #
    # In the backward pass you will need to compute the gradient of the loss   #
    # with respect to all model parameters. Use the loss and grads variables   #
    # defined above to store loss and gradients; grads[k] should give the      #
    # gradients for self.params[k].                                            #
    #                                                                          #
    # Note also that you are allowed to make use of functions from layers.py   #
    # in your implementation, if needed.                                       #
    ############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    caches = []
    out, cache = affine_forward(features, W_proj, b_proj)
    caches.append(cache)
    word_in, cache = word_embedding_forward(captions_in, W_embed)
    caches.append(cache)
    if self.cell_type == 'rnn':  # must rnn or lstm
        out, cache = rnn_forward(word_in, out, Wx, Wh, b)
    else:
        out, cache = lstm_forward(word_in, out, Wx, Wh, b)
    caches.append(cache)
    out, cache = temporal_affine_forward(out, W_vocab, b_vocab)
    caches.append(cache)

    # backward
    loss, dx = temporal_softmax_loss(out, captions_out, mask)

    dx, grads['W_vocab'], grads['b_vocab'] = temporal_affine_backward(dx, caches.pop())
    if self.cell_type == 'rnn':
        d_caption, dx, grads['Wx'], grads['Wh'], grads['b'] = rnn_backward(dx, caches.pop())
    else:
        d_caption, dx, grads['Wx'], grads['Wh'], grads['b'] = lstm_backward(dx, caches.pop())
    grads['W_embed'] = word_embedding_backward(d_caption, caches.pop())
    _, grads['W_proj'], grads['b_proj'] = affine_backward(dx, caches.pop())

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    return loss, grads

sample 函数，位于rnn.py rnn类内。其要实现的效果如下图所示。

def sample(self, features, max_length=30):
    """
    Run a test-time forward pass for the model, sampling captions for input
    feature vectors.

    At each timestep, we embed the current word, pass it and the previous hidden
    state to the RNN to get the next hidden state, use the hidden state to get
    scores for all vocab words, and choose the word with the highest score as
    the next word. The initial hidden state is computed by applying an affine
    transform to the input image features, and the initial word is the <START>
    token.

    For LSTMs you will also have to keep track of the cell state; in that case
    the initial cell state should be zero.

    Inputs:
    - features: Array of input image features of shape (N, D).
    - max_length: Maximum length T of generated captions.

    Returns:
    - captions: Array of shape (N, max_length) giving sampled captions,
      where each element is an integer in the range [0, V). The first element
      of captions should be the first sampled word, not the <START> token.
    """
    N = features.shape[0]
    captions = self._null * np.ones((N, max_length), dtype=np.int32)

    # Unpack parameters
    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
    W_embed = self.params['W_embed']
    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']

    ###########################################################################
    # TODO: Implement test-time sampling for the model. You will need to      #
    # initialize the hidden state of the RNN by applying the learned affine   #
    # transform to the input image features. The first word that you feed to  #
    # the RNN should be the <START> token; its value is stored in the         #
    # variable self._start. At each timestep you will need to do to:          #
    # (1) Embed the previous word using the learned word embeddings           #
    # (2) Make an RNN step using the previous hidden state and the embedded   #
    #     current word to get the next hidden state.                          #
    # (3) Apply the learned affine transformation to the next hidden state to #
    #     get scores for all words in the vocabulary                          #
    # (4) Select the word with the highest score as the next word, writing it #
    #     (the word index) to the appropriate slot in the captions variable   #
    #                                                                         #
    # For simplicity, you do not need to stop generating after an <END> token #
    # is sampled, but you can if you want to.                                 #
    #                                                                         #
    # HINT: You will not be able to use the rnn_forward or lstm_forward       #
    # functions; you'll need to call rnn_step_forward or lstm_step_forward in #
    # a loop.                                                                 #
    #                                                                         #
    # NOTE: we are still working over minibatches in this function. Also if   #
    # you are using an LSTM, initialize the first cell state to zeros.        #
    ###########################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    next_h, _ = affine_forward(features, W_proj, b_proj)
    next_c = np.zeros((N, W_proj.shape[1]))
    word = self._start * np.ones((N,), dtype=np.int32)
    # generate start token
    for i in range(max_length):
        word, _ = word_embedding_forward(word, W_embed)
        # embed the word to vector
        if self.cell_type == 'rnn':
            next_h, _ = rnn_step_forward(word, next_h, Wx, Wh, b)
        else:
            next_h, next_c, _ = lstm_step_forward(word, next_h, next_c, Wx, Wh, b)

        out, _ = affine_forward(next_h, W_vocab, b_vocab)
        # get the output
        word = out.argmax(axis=1)
        # sample
        captions[:, i] = word
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################
    return captions

Image Captioning with LSTMs

本部分主要就是将vanilla RNN变为LSTM在重复上述的任务。

首先是forward，根据公式敲一下就ok了。

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
    """
    Forward pass for a single timestep of an LSTM.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Note that a sigmoid() function has already been provided for you in this file.

    Inputs:
    - x: Input data, of shape (N, D)
    - prev_h: Previous hidden state, of shape (N, H)
    - prev_c: previous cell state, of shape (N, H)
    - Wx: Input-to-hidden weights, of shape (D, 4H)
    - Wh: Hidden-to-hidden weights, of shape (H, 4H)
    - b: Biases, of shape (4H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - next_c: Next cell state, of shape (N, H)
    - cache: Tuple of values needed for backward pass.
    """
    next_h, next_c, cache = None, None, None
    N, H = prev_h.shape
    a = np.dot(x, Wx) + np.dot(prev_h, Wh) + b
    i = sigmoid(a[:, :H])
    f = sigmoid(a[:, H:2*H])
    o = sigmoid(a[:, 2*H:3*H])
    g = np.tanh(a[:, 3*H:])
    next_c = f * prev_c + i * g
    next_h = o * np.tanh(next_c)
    cache = i, f, o, g, next_c, Wh, Wx, prev_c, prev_h, x

    return next_h, next_c, cache

经历过那么多次求导，出现的问题越来越少了，实现起来也比较顺畅，下面给出我整理后的求导过程。

然后就是敲一下代码了。

def lstm_step_backward(dnext_h, dnext_c, cache):
    """
    Backward pass for a single timestep of an LSTM.

    Inputs:
    - dnext_h: Gradients of next hidden state, of shape (N, H)
    - dnext_c: Gradients of next cell state, of shape (N, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data, of shape (N, D)
    - dprev_h: Gradient of previous hidden state, of shape (N, H)
    - dprev_c: Gradient of previous cell state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    dx, dprev_h, dprev_c, dWx, dWh, db = None, None, None, None, None, None

    i, f, o, g, next_c, Wh, Wx, prev_c, prev_h, x = cache
    dprev_c = dnext_c * f + dnext_h * o * f * (1 - np.tanh(next_c)**2)
    dc = dnext_c + (1 - np.tanh(next_c)**2) * o * dnext_h     # 这里遇到了问题
    di = dc * g * i * (1 - i)
    df = dc * prev_c * f * (1 - f)
    do = dnext_h * np.tanh(next_c) * o * (1 - o)
    dg = dc * i * (1 - g**2)

    da = np.hstack((di, df, do, dg))
    dprev_h = np.dot(da, Wh.T)
    dx = np.dot(da, Wx.T)
    db = np.sum(da, axis=0)
    dWx = np.dot(x.T, da)
    dWh = np.dot(prev_h.T, da)

    return dx, dprev_h, dprev_c, dWx, dWh, db

接下来就是对于一个sequence而不是单独的time step使用LSTM了。首先是forward

def lstm_forward(x, h0, Wx, Wh, b):
    """
    Forward pass for an LSTM over an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The LSTM uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the LSTM forward, we return the hidden states for all timesteps.

    Note that the initial cell state is passed as input, but the initial cell
    state is set to zero. Also note that the cell state is not returned; it is
    an internal variable to the LSTM and is not accessed from outside.

    Inputs:
    - x: Input data of shape (N, T, D)
    - h0: Initial hidden state of shape (N, H)
    - Wx: Weights for input-to-hidden connections, of shape (D, 4H)
    - Wh: Weights for hidden-to-hidden connections, of shape (H, 4H)
    - b: Biases of shape (4H,)

    Returns a tuple of:
    - h: Hidden states for all timesteps of all sequences, of shape (N, T, H)
    - cache: Values needed for the backward pass.
    """
    h, cache = None, None
    N, T, D = x.shape
    N, H = h0.shape
    h = np.zeros((N, T, H))
    cache = []
    c0 = np.zeros_like(h0)
    for i in range(T):
        h0, c0, c = lstm_step_forward(x[:, i, :], h0, c0, Wx, Wh, b)
        h[:, i, :] = h0
        cache.append(c)

    return h, cache

需要注意的是，在backward时，如何传递梯度：up stream的当前time step的loss + 上一个step step传回来的loss

代码如下：

def lstm_backward(dh, cache):
    """
    Backward pass for an LSTM over an entire sequence of data.]

    Inputs:
    - dh: Upstream gradients of hidden states, of shape (N, T, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data of shape (N, T, D)
    - dh0: Gradient of initial hidden state of shape (N, H)
    - dWx: Gradient of input-to-hidden weight matrix of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weight matrix of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    dx, dh0, dWx, dWh, db = None, None, None, None, None

    N, T, H = dh.shape
    _, D = cache[0][-1].shape    # cache[0][-1]对应x
    dx = np.zeros((N, T, D))
    dc = np.zeros((N, H))
    dWx = np.zeros((D, 4 * H))
    dWh = np.zeros((H, 4 * H))
    db = np.zeros((4 * H,))
    dh0 = np.zeros((N, H))
    for i in reversed(range(T)):
        dx[:, i, :], dh0, dc, dWx_, dWh_, db_ = lstm_step_backward(dh[:, i, :] + dh0, dc, cache.pop())
        db += db_
        dWx += dWx_
        dWh += dWh_

    return dx, dh0, dWx, dWh, db

剩下的部分就比较ez了，只需对于上一个task的代码稍作修改即可，具体的代码就不再列出。

Network Visualization (PyTorch)

本部分主要实现Saliency Maps, Fooling image, Class visualization 三种的实现本质都是计算输入图片的梯度，Saliency Map是直接将关于正确label的loss的梯度的绝对值显示出来；Fooling Image 则是利用梯度信息，将一个A类别的输入变为网络识别为B类别；Class Visualization则是输入一个噪声，使用gradient ascent利用梯度信息将该图片在期望变成的类别下，神经网络的输出的期望类别的class score最大。由于使用pytorch，可以直接计算grad，整体实现比较简单，理解好这几个过程就可以了。代码部分不再单独给出，详情见我的github仓库。

在代码运行时，可能会遇到使用numpy.load()函数报错的情况，提示需要将allow_pickle=Ture，此时只需要在该np.load的参数内加入‘allow_pickle=True’即可。具体细节可见cs231n/data_utils.py/load_imagenet_val

Style Transfer

本部分就是实现style transfer的部分了！难度不高，核心就是实现好3个loss，并将整个流程梳理下来即可。实现后真的非常好玩！

首先是Style Transfer（2016）整体的框图

从图中就可以看出，包含两个loss，分别是：style image每层的每个filter的activation map的gram matrix（用来衡量相似性）和input image的gram matrix之间产生的loss（Style Loss）和 input image和content image之间的差异性（Content Loss）。最后为了使生成的图片更加真实，会加入一个正则项，这里使用的是Total-variation regularization。所以最终的loss就是：**Style Loss+Content Loss + Total-variation regularization **。之后使用这个loss计算输入图片的梯度，并使用Adam或者SDG等方法更新输入图片即可。

提取特征的网络使用的是squeeze net，因为其模型参数少，运算快且性能适中。

Content Loss

比较简单，如框图所示，就是将input image和content image在某一层的activation map做一下reshape，从1*C*HW变为1\C*(H*W)即可。相减后逐元素平方再求和即可。代码如下：

def content_loss(content_weight, content_current, content_original):
"""
Compute the content loss for style transfer.
Inputs:
- content_weight: Scalar giving the weighting for the content loss.
- content_current: features of the current image; this is a PyTorch Tensor of shape
  (1, C_l, H_l, W_l).
- content_target: features of the content image, Tensor with shape (1, C_l, H_l, W_l).

Returns:
- scalar content loss
"""
    loss = content_weight * torch.sum(torch.square(content_current.squeeze() - content_original.squeeze()))
    pass
    return loss

Style Loss

稍微复杂一点，核心就是计算一个Gram matrix，该矩阵用来衡量similarity。下图为Gram matrix计算的示意图。

左边是某一层的activation map，之后将其变为C*(H*W)，所有行向量直接做內积，就有了C*C的gram matrix。

代码如下：

def gram_matrix(features, normalize=True):
    """
    Compute the Gram matrix from features.
    Inputs:
    - features: PyTorch Tensor of shape (N, C, H, W) giving features for
      a batch of N images.
    - normalize: optional, whether to normalize the Gram matrix
        If True, divide the Gram matrix by the number of neurons (H * W * C)

    Returns:
    - gram: PyTorch Tensor of shape (N, C, C) giving the
      (optionally normalized) Gram matrices for the N input images.
    """
    N,C,H,W = features.size()
    new_features = features.reshape((N,C,H*W))
    gram_mat = torch.zeros(N,C,C)
    gram_mat = torch.bmm(new_features, new_features.permute(0,2,1))
    if normalize:
        return gram_mat / (H*W*C)
    else:
        return gram_matm

之后就是如框图所示，将style image各层的gram matrix和input image各层的gram matrix相减后逐元素平方再求和即可。代码如下：

def style_loss(feats, style_layers, style_targets, style_weights):
    """
    Computes the style loss at a set of layers.
    Inputs:
    - feats: list of the features at every layer of the current image, as produced by
      the extract_features function.
    - style_layers: List of layer indices into feats giving the layers to include in the
      style loss.
    - style_targets: List of the same length as style_layers, where style_targets[i] is
      a PyTorch Tensor giving the Gram matrix of the source style image computed at
      layer style_layers[i].
    - style_weights: List of the same length as style_layers, where style_weights[i]
      is a scalar giving the weight for the style loss at layer style_layers[i].

    Returns:
    - style_loss: A PyTorch Tensor holding a scalar giving the style loss.
    """
    tensor = torch.tensor(())
    loss = tensor.new_zeros(1)
    for i in range(len(style_layers)):
        gram_mat = gram_matrix(feats[style_layers[i]])
        loss += style_weights[i] * torch.sum((gram_mat - style_targets[i]).square())
    return loss

Total-variation regularization

该项是一个正则化项，公式如下：

代码如下：

def tv_loss(img, tv_weight):
    """
    Compute total variation loss.
    Inputs:
    - img: PyTorch Variable of shape (1, 3, H, W) holding an input image.
    - tv_weight: Scalar giving the weight w_t to use for the TV loss.

    Returns:
    - loss: PyTorch Variable holding a scalar giving the total variation loss
      for img weighted by tv_weight.
    """
    loss = tv_weight * (torch.sum((img[0,:,1:,:] - img[0,:,:-1,:]).square()) + torch.sum((img[0,:,:,1:] - img[0,:,:,:-1]).square()))
    return loss

Over ALL

完成的上面的部分后就是整合了，note book中的代码已经给出。不再赘述。值得注意的是，使用同样的代码，我们还可以完成Feature Inversion和texture synthesis这两个任务。对于texture synthesis，只需将content loss置零即可。对于Feature Inversion只需将style loss置零即可。下面是一些自己测试的结果。

Style Transfer Left: Style img and Input img. Right: result(200 iteration).

texture synthesis Left: Style img. Right: result(200 iteration).

GAN

本部分的GAN实现了报过最原始的2014年 GoodFellow的原始GAN，以及Last Square GAN（优化了损失函数）和Deep Convolutional GANs。整体来讲使用pytorch建立模型和训练都比较简单，note book中主要实现的就是discriminator和generator的loss，将这个实现好即可。这部分的难度不是很高，就不再列出了，具体代码和结果可以参考我的github仓库。

END

至此，整个CS231n的课程就结束啦！为期了一个多月，中间因为各种事前耽误了几天，本来计划1个月内就搞定的。感觉CS231n课程整体来讲的难度适中，在数学推倒部分设计的稍微少了一些，但可以建立很多intuition的东西，总之还是收获颇丰的。

最大的收获就是他的assignment了，有一定难度的同时也极大的加深了对于这些知识的理解。

TODO：

下一步可能会写一个GAN相关的小总结。
看一些more mathematic的东西