Problems Encountered in CS231n Assignment 1

  • The implementation is based on the 2019 version of the course.
  • These notes mainly record the problems I ran into.

Softmax Implementation

Whether implementing the softmax or the SVM loss function, the problems encountered are quite similar: they mainly concern deriving the gradients and using numpy. Since the softmax implementation is slightly more involved, only the issues from the softmax implementation are recorded here.

Gradient

The core work when using SGD is computing the gradient of the softmax loss with respect to the weights W. The course does not give the derivation, so it is derived here.

(Figure: derivation of the softmax gradient with respect to W.)
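
A sketch of the derivation in standard notation (the symbols $s$ and $p$ are my own; the result matches the code below). For a single example $x_i$ with label $y_i$, let $s = x_i W$ be the score vector and $p$ the vector of softmax probabilities:

$$
p_j = \frac{e^{s_j}}{\sum_k e^{s_k}},\qquad L_i = -\log p_{y_i}.
$$

Differentiating through the softmax gives

$$
\frac{\partial L_i}{\partial s_j} = p_j - \mathbf{1}[j = y_i],\qquad
\frac{\partial L_i}{\partial W_{:,\,j}} = \bigl(p_j - \mathbf{1}[j = y_i]\bigr)\,x_i^{\mathsf T},
$$

since $s_j = x_i\,W_{:,\,j}$. Averaging over the $N$ training examples and adding the regularization term $\tfrac{1}{2}\,\mathrm{reg}\,\lVert W\rVert^2$ contributes the final $\mathrm{reg}\cdot W$ term to the gradient, exactly as in the code.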

Numeric Stability Trick

To avoid numerical instability, a correction term is added to the scores when computing the loss; it has no effect on the gradient.

(Figures: the original softmax expression and the numerically stabilized version.)
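
Concretely, this is the standard max-subtraction trick: with $m = \max_k s_k$,

$$
\frac{e^{s_j}}{\sum_k e^{s_k}} \;=\; \frac{e^{s_j - m}}{\sum_k e^{s_k - m}},
$$

so the probabilities (and therefore the loss and the gradient) are unchanged, while every exponent is now at most $0$ and the exponentials can no longer overflow.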

Implementation with numpy

Here is the naive version of the code. A diagram of how the computation proceeds was already given in the derivation above; the naive version essentially just follows the derived formulas.

def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    num_classes = W.shape[1]
    num_train = X.shape[0]

    for i in range(num_train):
        scores = X[i].dot(W)
        scores -= np.max(scores)  # numeric stability trick: shift so the largest score is 0
        scores = np.exp(scores)

        sum_scores = np.sum(scores)  # could be folded into the next line to save a temporary
        P = scores / sum_scores
        L = -np.log(P)

        loss += L[y[i]]

        for j in range(num_classes):  # gradient, split into the two cases
            if j == y[i]:
                dW[:, j] += (-1 + P[y[i]]) * X[i].T
            else:
                dW[:, j] += P[j] * X[i].T

    dW /= num_train
    dW += reg * W
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)

    return loss, dW
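
As a quick sanity check, the analytic gradient can be compared against a finite-difference estimate. The following is a minimal sketch with made-up shapes and random data; the only thing it assumes is the softmax_loss_naive function above:

```python
import numpy as np

np.random.seed(0)
W = 0.01 * np.random.randn(10, 3)   # D = 10 features, C = 3 classes
X = np.random.randn(5, 10)          # N = 5 examples
y = np.random.randint(3, size=5)

loss, dW = softmax_loss_naive(W, X, y, reg=0.1)

# Finite-difference check of a single entry of dW
h = 1e-5
i, j = 2, 1
W[i, j] += h
loss_plus, _ = softmax_loss_naive(W, X, y, reg=0.1)
W[i, j] -= 2 * h
loss_minus, _ = softmax_loss_naive(W, X, y, reg=0.1)
W[i, j] += h   # restore W

numeric = (loss_plus - loss_minus) / (2 * h)
print(numeric, dW[i, j])   # the two values should agree to several decimal places
```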

Vectorized Version

The vectorized version. There are quite a few details to watch out for here; the full code first.

def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.

    Inputs and outputs are the same as softmax_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    num_classes = W.shape[1]
    num_train = X.shape[0]

    scores = X @ W  # (N, C)
    scores -= np.max(scores, axis=1, keepdims=True)
    scores = np.exp(scores)
    sum_scores = np.sum(scores, axis=1, keepdims=True)

    P = scores / sum_scores
    L = -np.log(P)
    loss += np.sum(L[np.arange(num_train), y])

    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)

    # Compute the gradient
    mid = np.zeros_like(P)            # zero matrix with the same shape as P
    mid[np.arange(num_train), y] = 1  # put a 1 at every (i, y[i]) to build the indicator used in the gradient
    dW = X.T @ (-mid + P)
    dW = dW / num_train + reg * W

    return loss, dW
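
With both versions written, it is worth checking that they agree. A quick sketch (X_dev, y_dev and W are placeholder names for whatever data and weights are at hand, not identifiers taken from the assignment code):

```python
loss_naive, grad_naive = softmax_loss_naive(W, X_dev, y_dev, 0.000005)
loss_vec, grad_vec = softmax_loss_vectorized(W, X_dev, y_dev, 0.000005)
print(abs(loss_naive - loss_vec))                    # should be essentially 0
print(np.linalg.norm(grad_naive - grad_vec, 'fro'))  # should be essentially 0
```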

First of all, it is important to draw a picture showing how the quantities in the computation relate to one another and where each of them comes from, as illustrated below.

(Figure: shapes of the quantities in the vectorized computation.)
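
In place of the figure, here is a small standalone sketch with made-up sizes that makes the shapes explicit:

```python
import numpy as np

N, D, C = 5, 4, 3                                 # batch size, feature dim, number of classes
X = np.random.randn(N, D)
W = np.random.randn(D, C)
y = np.random.randint(C, size=N)

scores = X @ W                                    # (N, C)
scores -= np.max(scores, axis=1, keepdims=True)   # the max has shape (N, 1) and broadcasts
scores = np.exp(scores)
P = scores / np.sum(scores, axis=1, keepdims=True)  # (N, C), each row sums to 1

mid = np.zeros_like(P)
mid[np.arange(N), y] = 1                          # (N, C), one-hot rows
dW = X.T @ (P - mid)                              # (D, C), the same shape as W
assert dW.shape == W.shape
```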

The first detail is in the Numeric Stability Trick: the largest element of each input row has to be found, and it is important to pass keepdims=True.

This controls the shape of the returned array, so that the result has shape (N, 1).

scores -= np.max(scores, axis=1, keepdims=True)

The sum needs the same treatment; otherwise the normalization step will not broadcast correctly.
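
A tiny illustration of why keepdims matters (made-up numbers):

```python
import numpy as np

scores = np.arange(6, dtype=float).reshape(2, 3)   # shape (2, 3)

m = np.max(scores, axis=1)                         # shape (2,)
# scores - m  ->  ValueError: operands could not be broadcast together

m = np.max(scores, axis=1, keepdims=True)          # shape (2, 1)
print(scores - m)                                  # broadcasts to (2, 3) as intended
```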

An Important Trick!!!

I ran into quite a few problems here, mainly because I am not very fluent with numpy yet :( so I am writing them down.

The difference between L[np.arange(num_train), y] and L[:, y]!

My first version of the code used the latter, since the goal is to pick, from every row i, the element in column y[i], and the latter seemed the natural way to do it. In fact, L[:, y] returns, from every row i, the elements L[i, y[j]] for all j!

A concrete comparison of the two is given in the code sketch at the end of this subsection.

L[np.arange(num_train), y], on the other hand, works like this: if np.arange(num_train) is viewed as a list, it must have the same length as y! The two arrays are then iterated in parallel, and each step returns the element whose row index comes from the first array and whose column index comes from the second, as illustrated in the sketch below.

So, for what we need here, only L[np.arange(num_train), y] gives the right result.
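
A small sketch of the two indexing patterns (made-up numbers):

```python
import numpy as np

L = np.arange(12).reshape(3, 4)   # pretend num_train = 3, num_classes = 4
y = np.array([2, 0, 3])           # the correct class of each row

print(L[:, y])                    # shape (3, 3): columns 2, 0, 3 of EVERY row
print(L[np.arange(3), y])         # shape (3,): [L[0, 2], L[1, 0], L[2, 3]] == [2, 4, 11]
```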

Two-Layer Neural Network

The work in this part is quite similar to the previous parts; the main problem I ran into was, once again, how to handle the derivatives.

Since I hit some problems here as well, part of the derivation is given below.

First, the structure of the network: an affine layer, a ReLU, a second affine layer, and a softmax loss on top.

Gradient

(Figures: derivation of the gradients for the two-layer network.)
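
In outline (a reconstruction consistent with the code below; $Y$ is the $N\times C$ one-hot matrix of the labels $y$, and the factor $1/N$ is applied exactly once):

$$
H = \max(0,\; XW_1 + b_1),\qquad S = HW_2 + b_2,\qquad
P_{ic} = \frac{e^{S_{ic}}}{\sum_k e^{S_{ik}}},
$$

$$
L = -\frac{1}{N}\sum_i \log P_{i,\,y_i} \;+\; \mathrm{reg}\,\bigl(\lVert W_1\rVert^2 + \lVert W_2\rVert^2\bigr).
$$

Backpropagating:

$$
\frac{\partial L}{\partial S} = \frac{1}{N}\,(P - Y),\qquad
\frac{\partial L}{\partial W_2} = H^{\mathsf T}\,\frac{\partial L}{\partial S} + 2\,\mathrm{reg}\,W_2,\qquad
\frac{\partial L}{\partial b_2} = \sum_i \Bigl(\frac{\partial L}{\partial S}\Bigr)_{i,:},
$$

$$
\frac{\partial L}{\partial H} = \frac{\partial L}{\partial S}\,W_2^{\mathsf T}
\;\text{(zeroed where $H = 0$, i.e. where the ReLU was inactive)},\qquad
\frac{\partial L}{\partial W_1} = X^{\mathsf T}\,\frac{\partial L}{\partial H} + 2\,\mathrm{reg}\,W_1,\qquad
\frac{\partial L}{\partial b_1} = \sum_i \Bigl(\frac{\partial L}{\partial H}\Bigr)_{i,:}.
$$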

Implementation with numpy

The implementation is given below. The main difficulty is computing the loss; the SGD and predict functions that follow are very simple and caused me no trouble, so only the loss and gradient computation, where I did run into problems, is shown here.

def loss(self, X, y=None, reg=0.0):
    """
    Compute the loss and gradients for a two layer fully connected neural
    network.

    Inputs:
    - X: Input data of shape (N, D). Each X[i] is a training sample.
    - y: Vector of training labels. y[i] is the label for X[i], and each y[i] is
      an integer in the range 0 <= y[i] < C. This parameter is optional; if it
      is not passed then we only return scores, and if it is passed then we
      instead return the loss and gradients.
    - reg: Regularization strength.

    Returns:
    If y is None, return a matrix scores of shape (N, C) where scores[i, c] is
    the score for class c on input X[i].

    If y is not None, instead return a tuple of:
    - loss: Loss (data loss and regularization loss) for this batch of training
      samples.
    - grads: Dictionary mapping parameter names to gradients of those parameters
      with respect to the loss function; has the same keys as self.params.
    """
    # Unpack variables from the params dictionary
    W1, b1 = self.params['W1'], self.params['b1']
    W2, b2 = self.params['W2'], self.params['b2']
    N, D = X.shape

    # Compute the forward pass
    scores = None

    h = np.maximum(X @ W1 + b1, 0)
    scores = h @ W2 + b2

    # If the targets are not given then jump out, we're done
    if y is None:
        return scores

    # Compute the loss
    loss = None
    scores = np.exp(scores)
    sum_scores = np.sum(scores, axis=1, keepdims=True)

    P = scores / sum_scores
    L = -np.log(P)
    loss = np.sum(L[np.arange(N), y])

    loss /= N
    loss += 1 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))

    # Backward pass: compute gradients
    grads = {}
    # Gradients for W2 and b2
    dscore = P
    dscore[np.arange(N), y] -= 1
    dscore /= N  # careful here: divide by N exactly once when computing the gradients;
                 # this took a long time to debug
    grads['W2'] = h.T @ dscore + 2 * reg * W2
    grads['b2'] = np.sum(dscore, axis=0)
    # Gradients for W1 and b1
    dh = dscore @ W2.T  # derivative of the loss with respect to h
    dh[h <= 0] = 0      # now dh is the derivative with respect to X @ W1 + b1
    grads['W1'] = X.T @ dh + 2 * reg * W1
    grads['b1'] = np.sum(dh, axis=0)

    return loss, grads

Basically, following the formulas while paying attention to matrix dimensions and the details is enough; whenever something was unclear, drawing a picture solved it.

Note that the division by the number of inputs N has to be done exactly once.

I did not notice this at first and divided by N in every gradient computation, which introduced errors; debugging this took embarrassingly long.

Parameter Tuning

I got a bit lazy and did not finish this part.

Others

I did not run into many problems in the remaining parts (k-Nearest Neighbor classifier, SVM, Higher Level Representations: Image Features). For the full code, see my GitHub repository: https://github.com/canVa4/CS231n-Assignments