CS231n Assignment2 遇到的问题

实现基于2019年版的课程
主要记录遇到的问题

我的assignment的github仓库，包含全部的代码和notebook。

Fully-connected Neural Network

相比于Assignment1，对于整个网络的实现进行了进一步的封装，可以实现任意深度，大小的MLP。

值得一看的代码部分！solver.py中实现调用更新规则函数（在optim.py中实现），实现的很有趣！

基本思路为：使用getattr获取定义在optim.py中定义好的update rule函数！我是第一次见这种写法，感觉很巧妙，值得学习一波:)

当然，要先import optim。

Core Code(extract):

self.update_rule = kwargs.pop('update_rule', 'sgd')
...
   # Make sure the update rule exists, then replace the string
   # name with the actual function
   if not hasattr(optim, self.update_rule):
       raise ValueError('Invalid update_rule "%s"' % self.update_rule)
self.update_rule = getattr(optim, self.update_rule)

其余部分的实现（如：affine，ReLU的forward和backward；优化算法）注意好细节后都比较容易实现，因为比较复杂的代码框架已经提供好了。

值得注意的是，官方github仓库中的关于课程内容的markdown笔记值得一读。课程github仓库.

Batch Normalization

本部分的实现我遇到了一些问题，在完成上也花费了很多时间，主要遇到的问题还是导数的求取，以及如何将其变化为numpy数组的形式。

The Gradient of Batch Normalization

由于本部分想知道自己的代码是正确与否需要变为代码后，进行numerical check才能验证。所以，在第一次求导后，我花了很长时间debug，但最后发现是导数求错了。。。

求导中还是遇到了不少的问题，尤其是在使用链式法则的时候遇到了问题。看来是好久没有好好推公式了QuQ，于是重新复习了一下chain rule和矩阵求导之类的；并重新推到了一下公式。NOTE：以下推导可能并不非常严谨（部分可能不符合矩阵相乘维数），但作为示意和实现代码足够了。

首先是正向（forward pass）的公式。

然后是backward计算梯度。这里在计算关于x的偏导时，我遇到了一些问题，很容易丢掉一个导数项；画出变量之间的关系图可以很好地解决这个问题。推导如下：

有了这些部分，就可以实现第一个函数batchnorm_backward()和batchnorm_forward()了！

实际上，上式还可以继续化简，化简的结果更加简洁，省去很多中间变量。

至此，batch normalization的公式推导部分就OK了，下面就是代码实现了。

这里需要注意的是，由于在predict的时候，一般是没有batch数据的，所以此时没法直接使用batch normalization，所以一种方法就是利用训练时得到的均值和方差来作为predict时的均值和方差。其更新方法使用momentum更新为：

running_mean = momentum * running_mean + (1 - momentum) * sample_mean
running_var = momentum * running_var + (1 - momentum) * sample_var

Code Implement of Batch Normalization

forward没啥可说的，很easy，分开train与test即可

def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Forward pass for batch normalization.

    During training the sample mean and (uncorrected) sample variance are
    computed from minibatch statistics and used to normalize the incoming data.
    During training we also keep an exponentially decaying running mean of the
    mean and variance of each feature, and these averages are used to normalize
    data at test-time.

    At each timestep we update the running averages for mean and variance using
    an exponential decay based on the momentum parameter:

    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

    Note that the batch normalization paper suggests a different test-time
    behavior: they compute sample mean and variance for each feature using a
    large number of training images rather than using a running average. For
    this implementation we have chosen to use running averages instead since
    they do not require an additional estimation step; the torch7
    implementation of batch normalization also uses running averages.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift paremeter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)

    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':
        sample_mean = np.mean(x, axis=0)
        sample_var = np.var(x, axis=0)

        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var

        x_norm = (x - sample_mean) / np.sqrt(sample_var + eps)
        out = gamma * x_norm + beta
        cache = x, x_norm, sample_mean, sample_var, gamma, beta, eps
    elif mode == 'test':
        x_norm = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_norm + beta
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache

backward实现两个版本，我是一个根据未化简公式来计算，另一个是根据化简后公式来计算，对比后明显可以看到化简后大量减少了运算次数，可以达到原来速度的3倍。

未化简公式版：

def batchnorm_backward(dout, cache):
    """
    Backward pass for batch normalization.

    For this implementation, you should write out a computation graph for
    batch normalization on paper and propagate gradients backward through
    intermediate nodes.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    x, x_norm, sample_mean, sample_var, gamma, beta, eps = cache
    N, D = x_norm.shape

    dbeta = np.sum(dout, axis=0)
    dgamma = np.sum(x_norm * dout, axis=0)
    dx_norm = dout * gamma
    dL_dvar = -0.5 * np.sum(dx_norm * (x - sample_mean), axis=0) * np.power(sample_var + eps, -1.5)
    # add L-->y-->x_hat-->x_i
    dx = dx_norm / np.sqrt(sample_var + eps)
    # add L-->mean-->x_i
    dx += (-1/N) * np.sum(dx_norm / np.sqrt(sample_var + eps), axis=0) + dL_dvar * np.sum(-2*(x - sample_mean)/N, axis=0)
    # add L-->var-->x_i
    dx += (2 / N) * (x - sample_mean) * dL_dvar

    return dx, dgamma, dbeta

化简公式版：

def batchnorm_backward_alt(dout, cache):
    """
    Alternative backward pass for batch normalization.

    For this implementation you should work out the derivatives for the batch
    normalizaton backward pass on paper and simplify as much as possible. You
    should be able to derive a simple expression for the backward pass. 
    See the jupyter notebook for more hints.
     
    Note: This implementation should expect to receive the same cache variable
    as batchnorm_backward, but might not use all of the values in the cache.

    Inputs / outputs: Same as batchnorm_backward
    """
    dx, dgamma, dbeta = None, None, None
    x, x_hat, sample_mean, sample_var, gamma, beta, eps = cache
    N, D = x_hat.shape
    mid = 1 / np.sqrt(sample_var + eps)
    dbeta = np.sum(dout, axis=0)
    dgamma = np.sum(x_hat * dout, axis=0)
    dxhat = dout * gamma
    dx = (1 / N) * mid * (N * dxhat - np.sum(dxhat, axis=0) - x_hat * np.sum(dxhat * x_hat, axis=0))

    return dx, dgamma, dbeta

Layer Normalization

LN按照如下公式来输出，实际上就是把BN倒过来。。

LN的操作类似于将BN做了一个“转置”，对同一层网络的输出做一个标准化。注意，同一层的输出是单个图片的输出，比如对于一个batch为32的神经网络训练，会有32个均值和方差被得出，每个均值和方差都是由单个图片的所有channel之间做一个标准化。这么操作，就使得LN不受batch size的影响。

在代码的实现上只需将所有的相关矩阵装置一下就OK啦，即对于转置过来的输入x做BN即可！注意要保证输出的维数正确。

Layer Normalization Forward：

def layernorm_forward(x, gamma, beta, ln_param):
    """
    Forward pass for layer normalization.

    During both training and test-time, the incoming data is normalized per data-point,
    before being scaled by gamma and beta parameters identical to that of batch normalization.
    
    Note that in contrast to batch normalization, the behavior during train and test-time for
    layer normalization are identical, and we do not need to keep track of running averages
    of any sort.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift paremeter of shape (D,)
    - ln_param: Dictionary with the following keys:
        - eps: Constant for numeric stability

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    out, cache = None, None
    sample_var = np.var(x.T, axis=0)
    x_norm = (x.T - sample_mean) / np.sqrt(sample_var + eps)
    out = gamma * x_norm.T + beta
    cache = x, x_norm.T, sample_mean, sample_var, gamma, beta, eps
    return out, cache

Layer Normalization Backward：这里我实现了两个版本，分别是基于化简后的公式和未化简后的公式，均通过测试。

def layernorm_backward(dout, cache):
    """
    Backward pass for layer normalization.

    For this implementation, you can heavily rely on the work you've done already
    for batch normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from layernorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None

    x, x_hat, sample_mean, sample_var, gamma, beta, eps = cache
    N, D = x_hat.shape

    mid = 1 / np.sqrt(sample_var + eps)
    dbeta = np.sum(dout, axis=0)
    dgamma = np.sum(x_hat * dout, axis=0)

    dxhat = dout * gamma
    dxhat = dxhat.T
    x_hat = x_hat.T
    dx = (1 / D) * mid * (D * dxhat - np.sum(dxhat, axis=0) - x_hat * np.sum(dxhat * x_hat, axis=0))
    dx = dx.T

    #####################################################################################
    #    Another vision of  LN backward (based on the origin vision of bn backward)     #
    #####################################################################################
    # x, x_norm, sample_mean, sample_var, gamma, beta, eps = cache
    # N, D = x_norm.shape
    #
    # dbeta = np.sum(dout, axis=0)
    # dgamma = np.sum(x_norm * dout, axis=0)
    #
    # x = x.T
    # dout = dout.T
    #
    # dx_norm = dout.T * gamma
    # dx_norm = dx_norm.T
    # dL_dvar = -0.5 * np.sum(dx_norm * (x - sample_mean), axis=0) * np.power(sample_var + eps, -1.5)
    # # add L-->y-->x_hat-->x_i
    # dx = dx_norm / np.sqrt(sample_var + eps)
    # # add L-->mean-->x_i
    # dx += (-1/D) * np.sum(dx_norm / np.sqrt(sample_var + eps), axis=0) + dL_dvar * np.sum(-2*(x - sample_mean)/N, axis=0)
    # # add L-->var-->x_i
    # dx += (2 / D) * (x - sample_mean) * dL_dvar
    # dx = dx.T

    return dx, dgamma, dbeta

BN vs LN

BN

首先是BN，BN是通过mini-batch来对相应的activation做规范化操作，使得输出的各个维度的均值为0，方差为1（标准化）。而最后的“scale and shift”，即加入一个放射变换，则是为了让因训练所需而“刻意”加入的BN能够有可能还原最初的输入，同时也缓解因为数据可能会因此丢失了一些信息，所以再加上beta和gama来恢复原始数据，这里beta和gama是可学习的。

BN的好处：

(1) 减轻了对参数、权重初始化的依赖。

(2) 训练更快，可以使用更高的学习率。

(3) BN一定程度上增加了泛化能力。

BN的缺点：

batch normalization依赖于batch的大小，当batch值很小时，计算的均值和方差不稳定。会引入很多噪声误差，若网络队伍差很敏感，则会难以训练和收敛。

这一个特性，导致batch normalization不适合以下的几种场景。

(1)batch非常小，比如训练资源有限无法应用较大的batch。

(2)RNN，因为它是一个动态的网络结构，即输入的size是不固定的，同一个batch中训练实例有长有短，无法根据BN的公式进行标准化。

关于Normalization的有效的原因：

Batch Normalization调整了数据的分布，不考虑激活函数，它让每一层的输出归一化到了均值为0方差为1的分布，这保证了梯度的有效性，目前大部分资料都这样解释，比如BN的原始论文认为的缓解了Internal Covariate Shift(ICS)问题。加入了BN的反向传播过程中，就不易出现梯度消失或梯度爆炸，梯度将始终保持在一个合理的范围内。而这样带来的好处就是，基于梯度的训练过程可以更加有效的进行，即加快收敛速度，减轻梯度消失或爆炸导致的无法训练的问题。

LN

BN 的一个缺点是需要较大的 batchsize 才能合理估训练数据的均值和方差，这在计算资源比较有限的时候往往不能达到，同时它也很难应用在数据长度不同的 RNN 模型上。Layer Normalization (LN) 的一个优势是不需要批训练，在单条数据内部就能归一化，他是针对于per datapoint的更新。

整体而言，LN用于RNN效果比较明显，但是在CNN上，不如BN。

Dropout

Dropout的代码部分非常简单，material中已经给出了代码实现，只需要实现一下forward和backw以及更新一下计算loss的函数即可。需要注意的是增加loss的部分，这里我使用caches当做堆栈存储前向计算loss时产生的caches，这样反向传播时只需要依次pop并根据网络结构计算梯度即可。本部分代码位于：cs231n/classifiers/fc_net.py。该loss位于为FullyConnectedNet类内。

def loss(self, X, y=None):
    """
    Compute loss and gradient for the fully-connected net.

    Input / output: Same as TwoLayerNet above.
    """
    X = X.astype(self.dtype)
    mode = 'test' if y is None else 'train'

    # Set train/test mode for batchnorm params and dropout param since they
    # behave differently during training and testing.
    if self.use_dropout:
        self.dropout_param['mode'] = mode
    if self.normalization=='batchnorm':
        for bn_param in self.bn_params:
            bn_param['mode'] = mode
    scores = None

    caches = []
    scores = X
    for i in range(self.num_layers):
        W = self.params['W' + str(i+1)]
        b = self.params['b' + str(i+1)]
        if i == self.num_layers - 1:
            scores, cache = affine_forward(scores, W, b)
        else:
            if self.normalization is None:
                scores, cache = affine_relu_forward(scores, W, b)
            elif self.normalization == "batchnorm":
                gamma = self.params['gamma' + str(i + 1)]
                beta = self.params['beta' + str(i + 1)]
                scores, cache = affine_bn_relu_forward(scores, W, b, gamma, beta, self.bn_params[i])
            elif self.normalization == "layernorm":
                gamma = self.params['gamma' + str(i + 1)]
                beta = self.params['beta' + str(i + 1)]
                scores, cache = affine_ln_relu_forward(scores, W, b, gamma, beta, self.bn_params[i])
            else:
                cache = None
        caches.append(cache)
        if self.use_dropout and i != self.num_layers-1:
            scores, cache = dropout_forward(scores, self.dropout_param)
            caches.append(cache)

    # If test mode return early
    if mode == 'test':
        return scores

    loss, grads = 0.0, {}
    reg = self.reg
    loss, dx = softmax_loss(scores, y)
    for i in reversed(range(self.num_layers)):
        w = 'W' + str(i + 1)
        b = 'b' + str(i + 1)
        gamma = 'gamma' + str(i + 1)
        beta = 'beta' + str(i + 1)
        loss += 0.5 * reg * np.sum(W * W)  # add reg term
        if i == self.num_layers - 1:
            dx, grads[w], grads[b] = affine_backward(dx, caches.pop())
        else:
            if self.use_dropout:
                dx = dropout_backward(dx, caches.pop())
            if self.normalization is None:
                dx, grads[w], grads[b] = affine_relu_backward(dx, caches.pop())
            if self.normalization == 'batchnorm':
                dx, grads[w], grads[b], grads[gamma], grads[beta] = affine_bn_relu_backward(dx, caches.pop())
            if self.normalization == 'layernorm':
                dx, grads[w], grads[b], grads[gamma], grads[beta] = affine_ln_relu_backward(dx, caches.pop())
        grads[w] += reg * self.params[w]

return loss, grads

Convolutional Networks

这部分就是实现CNN了！核心就是实现好卷积层和pooling层。同时也修改batch normalization以便适用于CNN，同时增加group normalization。

卷积层

由于在实现时并不需要考虑计算复杂度和时间复杂度，我使用了最简单直接的方法，在forward时，同官方给的note一样，每次更新一个激活神经元的值，即使用4层循环嵌套，直观的实现卷积的过程。TODO：向量化方法。

def conv_forward_naive(x, w, b, conv_param):
    """
    A naive implementation of the forward pass for a convolutional layer.

    The input consists of N data points, each with C channels, height H and
    width W. We convolve each input with F different filters, where each filter
    spans all C channels and has height HH and width WW.

    Input:
    - x: Input data of shape (N, C, H, W)
    - w: Filter weights of shape (F, C, HH, WW)
    - b: Biases, of shape (F,)
    - conv_param: A dictionary with the following keys:
      - 'stride': The number of pixels between adjacent receptive fields in the
        horizontal and vertical directions.
      - 'pad': The number of pixels that will be used to zero-pad the input. 
        

    During padding, 'pad' zeros should be placed symmetrically (i.e equally on both sides)
    along the height and width axes of the input. Be careful not to modfiy the original
    input x directly.

    Returns a tuple of:
    - out: Output data, of shape (N, F, H', W') where H' and W' are given by
      H' = 1 + (H + 2 * pad - HH) / stride
      W' = 1 + (W + 2 * pad - WW) / stride
    - cache: (x, w, b, conv_param)
    """
    out = None

    pass
    pad = conv_param['pad']
    stride = conv_param['stride']
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant', constant_values=0)
    N, C, H, W = x.shape
    F, C, HH, WW = w.shape
    H_out = int(1 + (H + 2 * pad - HH) / stride)
    W_out = int(1 + (W + 2 * pad - WW) / stride)
    out = np.zeros((N, F, H_out, W_out))
    for n in range(N):
        for f in range(F):
            for h in range(H_out):
                for w_mid in range(W_out):
                    out[n, f, h, w_mid] = np.sum(
                        x_pad[n, :, h * stride:h * stride + HH, w_mid * stride:w_mid * stride + WW] * w[f, :, :, :]) + b[f]

    cache = (x, w, b, conv_param)
    return out, cache

在实现backward时，我也写出了简单情况下更新的公式，并根据这个最简单的展开形式以此来反向求梯度。如下图所示

def conv_backward_naive(dout, cache):
    """
    A naive implementation of the backward pass for a convolutional layer.

    Inputs:
    - dout: Upstream derivatives.
    - cache: A tuple of (x, w, b, conv_param) as in conv_forward_naive

    Returns a tuple of:
    - dx: Gradient with respect to x
    - dw: Gradient with respect to w
    - db: Gradient with respect to b
    """
    dx, dw, db = None, None, None
    
    x, w, b, conv_param = cache
    pad = conv_param['pad']
    stride = conv_param['stride']
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant', constant_values=0)
    N, F, H_out, W_out = dout.shape
    F, C, HH, WW = w.shape
    N, C, H, W = x.shape
    dx_pad = np.zeros_like(x_pad)
    dw = np.zeros_like(w)
    db = np.sum(dout, (0, 2, 3))
    for n in range(N):
        for f in range(F):
            for h_mid in range(H_out):
                for w_mid in range(W_out):
                    window = x_pad[n, :, stride * h_mid:stride * h_mid + HH, stride * w_mid:stride * w_mid + WW]
                    dx_pad[n, :, stride * h_mid:stride * h_mid + HH, stride * w_mid:stride * w_mid + WW] += \
                        dout[n, f, h_mid, w_mid] * w[f, :, :, :]
                    dw[f, :, :, :] += window * dout[n, f, h_mid, w_mid]
    dx = dx_pad[:, :, pad:pad + H, pad:pad + W]

    return dx, dw, db

Max Pooling

forward很简单，只需取respect field中最大的即可；backward时，将取最大值的位置处的梯度直接回传，其余置一即可。比较简单。

def max_pool_forward_naive(x, pool_param):
    """
    A naive implementation of the forward pass for a max-pooling layer.

    Inputs:
    - x: Input data, of shape (N, C, H, W)
    - pool_param: dictionary with the following keys:
      - 'pool_height': The height of each pooling region
      - 'pool_width': The width of each pooling region
      - 'stride': The distance between adjacent pooling regions

    No padding is necessary here. Output size is given by 

    Returns a tuple of:
    - out: Output data, of shape (N, C, H', W') where H' and W' are given by
      H' = 1 + (H - pool_height) / stride
      W' = 1 + (W - pool_width) / stride
    - cache: (x, pool_param)
    """
    out = None
    pool_height = pool_param['pool_height']
    pool_width = pool_param['pool_width']
    stride = pool_param['stride']
    N, C, H, W = x.shape
    H_out = int(1 + (H - pool_height) / stride)
    W_out = int(1 + (W - pool_width) / stride)
    out = np.zeros((N, C, H_out, W_out))
    for n in range(N):
        for f in range(C):
            for h in range(H_out):
                for w_mid in range(W_out):
                    out[n, f, h, w_mid] = np.max(
                        x[n, f, h * stride:h * stride + pool_height, w_mid * stride:w_mid * stride + pool_width])

    cache = (x, pool_param)
    return out, cache

def max_pool_backward_naive(dout, cache):
    """
    A naive implementation of the backward pass for a max-pooling layer.

    Inputs:
    - dout: Upstream derivatives
    - cache: A tuple of (x, pool_param) as in the forward pass.

    Returns:
    - dx: Gradient with respect to x
    """
    dx = None
    x, pool_param = cache
    pool_height = pool_param['pool_height']
    pool_width = pool_param['pool_width']
    stride = pool_param['stride']
    N, C, H_out, W_out = dout.shape
    dx = np.zeros_like(x)

    for n in range(N):
        for f in range(C):
            for h_mid in range(H_out):
                for w_mid in range(W_out):
                    window = x[n, f, stride * h_mid:stride * h_mid + pool_height,
                             stride * w_mid:stride * w_mid + pool_width]
                    mask = window == np.max(window)
                    dx[n, f, stride * h_mid:stride * h_mid + pool_height,
                    stride * w_mid:stride * w_mid + pool_width] = mask * dout[n, f, h_mid, w_mid]

    return dx

Spatial Batch Normalization

实现起来非常简单，只需要重新reshape输入，之后使用之前实现过的正常的Batch Normalization就OK了，代码请看我的github仓库，这部分没有遇到问题。

Group Normalization

forward只需要稍微修改正常的Batch Normalization即可。

def spatial_groupnorm_forward(x, gamma, beta, G, gn_param):
    out, cache = None, None
    eps = gn_param.get('eps', 1e-5)
    N, C, H, W = x.shape
    x_new = x.reshape((N, G, C // G, H, W))
    mean = np.mean(x_new, axis=(2, 3, 4), keepdims=True)
    var = np.var(x_new, axis=(2, 3, 4), keepdims=True)

    x_norm = (x_new - mean) / np.sqrt(var + eps)
    x_norm = x_norm.reshape((N, C, H, W))
    gamma_new = gamma.reshape((1, C, 1, 1))
    beta_new = beta.reshape((1, C, 1, 1))
    out = gamma_new * x_norm + beta_new
    cache = G, x, x_norm, mean, var, gamma, beta, eps

    return out, cache

backward也并不复杂，本质上的求导与正常的batch normalization一致，不过在多个导数求和时，需要注意怎么进行sum。这里如果想要通过Gradient check也有一个小坑。。。 ps: 这里我调试了很久。。

def spatial_groupnorm_backward(dout, cache):
    dx, dgamma, dbeta = None, None, None
    N, C, H, W = dout.shape
    G, x, x_norm, mean, var, gamma, beta, eps = cache

    dgamma = np.sum(dout * x_norm, axis=(0, 2, 3)).reshape(1, C, 1, 1)
    x = x.reshape(N, G, C // G, H, W)
    # 这里想通过Gradientcheck必须需要将其reshape为(1, C, 1, 1)
    dbeta = np.sum(dout, axis=(0, 2, 3)).reshape(1, C, 1, 1)

    dx_norm = (dout * gamma).reshape(N, G, C // G, H, W)
    mean = mean.reshape(N, G, 1, 1, 1)
    var = var.reshape(N, G, 1, 1, 1)
    dL_dvar = -0.5 * np.sum(dx_norm * (x - mean), axis=(2, 3, 4)) * np.power(var.squeeze() + eps, -1.5)
    dL_dvar = dL_dvar.reshape(N, G, 1, 1, 1)

    mid = H * W * C // G
    # add L-->y-->x_hat-->x_i
    dx = dx_norm / np.sqrt(var + eps)
    # add L-->mean-->x_i
    dx += ((-1 / mid) * np.sum(dx_norm / np.sqrt(var + eps), axis=(2, 3, 4))).reshape(N, G, 1, 1, 1) + dL_dvar * (
        np.sum(-2 * (x - mean) / mid, axis=(2, 3, 4))).reshape(N, G, 1, 1, 1)
    # add L-->var-->x_i
    dx += (2 / mid) * (x - mean) * dL_dvar
    dx = dx.reshape((N, C, H, W))
    return dx, dgamma, dbeta

PyTorch on CIFAR-10

这部分比较简单，我在实现时没有遇到问题。偷了懒，没有实现最后的CIFAR-10 open-ended challenge。