Problems Encountered in CS231n Assignment 1

  • The implementation is based on the 2019 version of the course.
  • These notes mainly record the problems I ran into.

Softmax Implementation

Whether implementing the softmax or the SVM loss function, the problems encountered are quite similar: they mainly concern deriving the gradients and using numpy. Since the softmax implementation is slightly more involved, only the issues from the softmax implementation are recorded here.

Gradient

The core work when using SGD is computing the gradient of the softmax loss with respect to the weights W. The course does not give the derivation, so it is derived here.

(Figure: derivation of the softmax gradient with respect to W.)
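
A sketch of the derivation in standard notation (the symbols $s$ and $p$ are my own; the result matches the code below). For a single example $x_i$ with label $y_i$, let $s = x_i W$ be the score vector and $p$ the vector of softmax probabilities:

$$
p_j = \frac{e^{s_j}}{\sum_k e^{s_k}},\qquad L_i = -\log p_{y_i}.
$$

Differentiating through the softmax gives

$$
\frac{\partial L_i}{\partial s_j} = p_j - \mathbf{1}[j = y_i],\qquad
\frac{\partial L_i}{\partial W_{:,\,j}} = \bigl(p_j - \mathbf{1}[j = y_i]\bigr)\,x_i^{\mathsf T},
$$

since $s_j = x_i\,W_{:,\,j}$. Averaging over the $N$ training examples and adding the regularization term $\tfrac{1}{2}\,\mathrm{reg}\,\lVert W\rVert^2$ contributes the final $\mathrm{reg}\cdot W$ term to the gradient, exactly as in the code.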

Numeric Stability Trick

To avoid numerical instability, a correction term is added to the scores when computing the loss; it has no effect on the gradient.

(Figures: the original softmax expression and the numerically stabilized version.)
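
Concretely, this is the standard max-subtraction trick: with $m = \max_k s_k$,

$$
\frac{e^{s_j}}{\sum_k e^{s_k}} \;=\; \frac{e^{s_j - m}}{\sum_k e^{s_k - m}},
$$

so the probabilities (and therefore the loss and the gradient) are unchanged, while every exponent is now at most $0$ and the exponentials can no longer overflow.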

Implementation with numpy

Here is the naive version of the code. A diagram of how the computation proceeds was already given in the derivation above; the naive version essentially just follows the derived formulas.

def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    num_classes = W.shape[1]
    num_train = X.shape[0]

    for i in range(num_train):
        scores = X[i].dot(W)
        scores -= np.max(scores)  # numeric stability trick: shift so the largest score is 0
        scores = np.exp(scores)

        sum_scores = np.sum(scores)  # could be folded into the next line to save a temporary
        P = scores / sum_scores
        L = -np.log(P)

        loss += L[y[i]]

        for j in range(num_classes):  # gradient, split into the two cases
            if j == y[i]:
                dW[:, j] += (-1 + P[y[i]]) * X[i].T
            else:
                dW[:, j] += P[j] * X[i].T

    dW /= num_train
    dW += reg * W
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)

    return loss, dW
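
As a quick sanity check, the analytic gradient can be compared against a finite-difference estimate. The following is a minimal sketch with made-up shapes and random data; the only thing it assumes is the softmax_loss_naive function above:

```python
import numpy as np

np.random.seed(0)
W = 0.01 * np.random.randn(10, 3)   # D = 10 features, C = 3 classes
X = np.random.randn(5, 10)          # N = 5 examples
y = np.random.randint(3, size=5)

loss, dW = softmax_loss_naive(W, X, y, reg=0.1)

# Finite-difference check of a single entry of dW
h = 1e-5
i, j = 2, 1
W[i, j] += h
loss_plus, _ = softmax_loss_naive(W, X, y, reg=0.1)
W[i, j] -= 2 * h
loss_minus, _ = softmax_loss_naive(W, X, y, reg=0.1)
W[i, j] += h   # restore W

numeric = (loss_plus - loss_minus) / (2 * h)
print(numeric, dW[i, j])   # the two values should agree to several decimal places
```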

Vectorized Version

The vectorized version. There are quite a few details to watch out for here; the full code first.

def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.

    Inputs and outputs are the same as softmax_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    num_classes = W.shape[1]
    num_train = X.shape[0]

    scores = X @ W  # (N, C)
    scores -= np.max(scores, axis=1, keepdims=True)
    scores = np.exp(scores)
    sum_scores = np.sum(scores, axis=1, keepdims=True)

    P = scores / sum_scores
    L = -np.log(P)
    loss += np.sum(L[np.arange(num_train), y])

    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)

    # Compute the gradient
    mid = np.zeros_like(P)            # zero matrix with the same shape as P
    mid[np.arange(num_train), y] = 1  # put a 1 at every (i, y[i]) to build the indicator used in the gradient
    dW = X.T @ (-mid + P)
    dW = dW / num_train + reg * W

    return loss, dW
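
With both versions written, it is worth checking that they agree. A quick sketch (X_dev, y_dev and W are placeholder names for whatever data and weights are at hand, not identifiers taken from the assignment code):

```python
loss_naive, grad_naive = softmax_loss_naive(W, X_dev, y_dev, 0.000005)
loss_vec, grad_vec = softmax_loss_vectorized(W, X_dev, y_dev, 0.000005)
print(abs(loss_naive - loss_vec))                    # should be essentially 0
print(np.linalg.norm(grad_naive - grad_vec, 'fro'))  # should be essentially 0
```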

First of all, it is important to draw a picture showing how the quantities in the computation relate to one another and where each of them comes from, as illustrated below.

(Figure: shapes of the quantities in the vectorized computation.)
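
In place of the figure, here is a small standalone sketch with made-up sizes that makes the shapes explicit:

```python
import numpy as np

N, D, C = 5, 4, 3                                 # batch size, feature dim, number of classes
X = np.random.randn(N, D)
W = np.random.randn(D, C)
y = np.random.randint(C, size=N)

scores = X @ W                                    # (N, C)
scores -= np.max(scores, axis=1, keepdims=True)   # the max has shape (N, 1) and broadcasts
scores = np.exp(scores)
P = scores / np.sum(scores, axis=1, keepdims=True)  # (N, C), each row sums to 1

mid = np.zeros_like(P)
mid[np.arange(N), y] = 1                          # (N, C), one-hot rows
dW = X.T @ (P - mid)                              # (D, C), the same shape as W
assert dW.shape == W.shape
```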

The first detail is in the Numeric Stability Trick: the largest element of each input row has to be found, and it is important to pass keepdims=True.

This controls the shape of the returned array, so that the result has shape (N, 1).

scores -= np.max(scores, axis=1, keepdims=True)

The sum needs the same treatment; otherwise the normalization step will not broadcast correctly.
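
A tiny illustration of why keepdims matters (made-up numbers):

```python
import numpy as np

scores = np.arange(6, dtype=float).reshape(2, 3)   # shape (2, 3)

m = np.max(scores, axis=1)                         # shape (2,)
# scores - m  ->  ValueError: operands could not be broadcast together

m = np.max(scores, axis=1, keepdims=True)          # shape (2, 1)
print(scores - m)                                  # broadcasts to (2, 3) as intended
```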

An Important Trick!!!

I ran into quite a few problems here, mainly because I am not very fluent with numpy yet :( so I am writing them down.

The difference between L[np.arange(num_train), y] and L[:, y]!

My first version of the code used the latter, since the goal is to pick, from every row i, the element in column y[i], and the latter seemed the natural way to do it. In fact, L[:, y] returns, from every row i, the elements L[i, y[j]] for all j!

A concrete comparison of the two is given in the code sketch at the end of this subsection.

L[np.arange(num_train), y], on the other hand, works like this: if np.arange(num_train) is viewed as a list, it must have the same length as y! The two arrays are then iterated in parallel, and each step returns the element whose row index comes from the first array and whose column index comes from the second, as illustrated in the sketch below.

So, for what we need here, only L[np.arange(num_train), y] gives the right result.
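
A small sketch of the two indexing patterns (made-up numbers):

```python
import numpy as np

L = np.arange(12).reshape(3, 4)   # pretend num_train = 3, num_classes = 4
y = np.array([2, 0, 3])           # the correct class of each row

print(L[:, y])                    # shape (3, 3): columns 2, 0, 3 of EVERY row
print(L[np.arange(3), y])         # shape (3,): [L[0, 2], L[1, 0], L[2, 3]] == [2, 4, 11]
```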

Two-Layer Neural Network

The work in this part is quite similar to the previous parts; the main problem I ran into was, once again, how to handle the derivatives.

Since I hit some problems here as well, part of the derivation is given below.

First, the structure of the network: an affine layer, a ReLU, a second affine layer, and a softmax loss on top.

Gradient

(Figures: derivation of the gradients for the two-layer network.)
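
In outline (a reconstruction consistent with the code below; $Y$ is the $N\times C$ one-hot matrix of the labels $y$, and the factor $1/N$ is applied exactly once):

$$
H = \max(0,\; XW_1 + b_1),\qquad S = HW_2 + b_2,\qquad
P_{ic} = \frac{e^{S_{ic}}}{\sum_k e^{S_{ik}}},
$$

$$
L = -\frac{1}{N}\sum_i \log P_{i,\,y_i} \;+\; \mathrm{reg}\,\bigl(\lVert W_1\rVert^2 + \lVert W_2\rVert^2\bigr).
$$

Backpropagating:

$$
\frac{\partial L}{\partial S} = \frac{1}{N}\,(P - Y),\qquad
\frac{\partial L}{\partial W_2} = H^{\mathsf T}\,\frac{\partial L}{\partial S} + 2\,\mathrm{reg}\,W_2,\qquad
\frac{\partial L}{\partial b_2} = \sum_i \Bigl(\frac{\partial L}{\partial S}\Bigr)_{i,:},
$$

$$
\frac{\partial L}{\partial H} = \frac{\partial L}{\partial S}\,W_2^{\mathsf T}
\;\text{(zeroed where $H = 0$, i.e. where the ReLU was inactive)},\qquad
\frac{\partial L}{\partial W_1} = X^{\mathsf T}\,\frac{\partial L}{\partial H} + 2\,\mathrm{reg}\,W_1,\qquad
\frac{\partial L}{\partial b_1} = \sum_i \Bigl(\frac{\partial L}{\partial H}\Bigr)_{i,:}.
$$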

Implementation with numpy

The implementation is given below. The main difficulty is computing the loss; the SGD and predict functions that follow are very simple and caused me no trouble, so only the loss and gradient computation, where I did run into problems, is shown here.

def loss(self, X, y=None, reg=0.0):
    """
    Compute the loss and gradients for a two layer fully connected neural
    network.

    Inputs:
    - X: Input data of shape (N, D). Each X[i] is a training sample.
    - y: Vector of training labels. y[i] is the label for X[i], and each y[i] is
      an integer in the range 0 <= y[i] < C. This parameter is optional; if it
      is not passed then we only return scores, and if it is passed then we
      instead return the loss and gradients.
    - reg: Regularization strength.

    Returns:
    If y is None, return a matrix scores of shape (N, C) where scores[i, c] is
    the score for class c on input X[i].

    If y is not None, instead return a tuple of:
    - loss: Loss (data loss and regularization loss) for this batch of training
      samples.
    - grads: Dictionary mapping parameter names to gradients of those parameters
      with respect to the loss function; has the same keys as self.params.
    """
    # Unpack variables from the params dictionary
    W1, b1 = self.params['W1'], self.params['b1']
    W2, b2 = self.params['W2'], self.params['b2']
    N, D = X.shape

    # Compute the forward pass
    scores = None

    h = np.maximum(X @ W1 + b1, 0)
    scores = h @ W2 + b2

    # If the targets are not given then jump out, we're done
    if y is None:
        return scores

    # Compute the loss
    loss = None
    scores = np.exp(scores)
    sum_scores = np.sum(scores, axis=1, keepdims=True)

    P = scores / sum_scores
    L = -np.log(P)
    loss = np.sum(L[np.arange(N), y])

    loss /= N
    loss += 1 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))

    # Backward pass: compute gradients
    grads = {}
    # Gradients for W2 and b2
    dscore = P
    dscore[np.arange(N), y] -= 1
    dscore /= N  # careful here: divide by N exactly once when computing the gradients;
                 # this took a long time to debug
    grads['W2'] = h.T @ dscore + 2 * reg * W2
    grads['b2'] = np.sum(dscore, axis=0)
    # Gradients for W1 and b1
    dh = dscore @ W2.T  # derivative of the loss with respect to h
    dh[h <= 0] = 0      # now dh is the derivative with respect to X @ W1 + b1
    grads['W1'] = X.T @ dh + 2 * reg * W1
    grads['b1'] = np.sum(dh, axis=0)

    return loss, grads

Basically, following the formulas while paying attention to matrix dimensions and the details is enough; whenever something was unclear, drawing a picture solved it.

Note that the division by the number of inputs N has to be done exactly once.

I did not notice this at first and divided by N in every gradient computation, which introduced errors; debugging this took embarrassingly long.

Parameter Tuning

I got a bit lazy and did not finish this part.

Others

I did not run into many problems in the remaining parts (k-Nearest Neighbor classifier, SVM, Higher Level Representations: Image Features). For the full code, see my GitHub repository: https://github.com/canVa4/CS231n-Assignments