[ML / DL] Backpropagation in softmax function

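In this post I want to cover a topic I spent a great deal of time thinking about and searching for: how do you backpropagate through softmax? Admittedly, it may not be a very useful(?) topic. PyTorch provides nn.CrossEntropyLoss, which applies log softmax internally, so there is no need to compute the downstream gradient of the softmax step yourself. For more on that, please refer to the post below.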

https://steady-programming.tistory.com/91

[Pytorch] nn.CrossEntropyLoss and nn.NLLLoss


So why bother covering it anyway? I have recently been working on a project that implements the forward and backward propagation of each component of a neural network. (PRs are always welcome!)

https://github.com/bohyunshin/deep-learning-with-pure-numpy/tree/master


Implementing the pieces one by one, I realized that training a classifier requires a softmax layer at the end, and that I would therefore have to implement its gradient myself... That is, the cross entropy loss produces the very first upstream gradient ($\dfrac{\partial L}{\partial \hat{y}}$), and the softmax layer has to combine it with its own local gradient and pass the result downstream.

\[ \dfrac{\partial L}{\partial z} = \dfrac{\partial L}{\partial \hat{y}} \dfrac{\partial \hat{y}}{\partial z} \]

Internally this requires $\dfrac{\partial \hat{y}}{\partial z}$, which means working out the backpropagation of softmax. Without further ado, let's derive the gradient of softmax and implement it in code. The finished code is available below.

https://github.com/bohyunshin/deep-learning-with-pure-numpy/blob/master/src/tools/activations.py#L92


Overview

Consider a setting with $C$ class labels. Softmax is defined as follows.

\[ a_i = \dfrac{\exp(z_i)}{ \sum^C_{k=1} \exp(z_k) } \]

Looking closely at the formula, a $C$-dimensional input is reduced to a single scalar: the one value $a_i$ is built from all of the inputs $z_1, \cdots, z_C$. The gradient therefore has to include $\dfrac{\partial a_i}{\partial z_1}, \cdots, \dfrac{\partial a_i}{\partial z_C}$. Doing this for every output $a_1, \cdots, a_C$ means computing a $C \times C$ Jacobian matrix. That is for a single data point; with $n$ data points, the Jacobians stack into an $n \times C \times C$ array.

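To summarize: passing through the softmax layer produces an $n \times C$ matrix, but the local gradient with respect to the input is not an $n \times C$ matrix; it is an $n \times C \times C$ array. This comes from the nature of softmax, where each activation value depends on $C$ inputs, unlike the logistic function, where each output uses a single input. It is very important to have a rough picture of the shape the gradient will take before diving in.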
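To make these shapes concrete, here is a minimal sketch that estimates the Jacobian by central differences, so it does not rely on any closed-form derivative. The helper names `softmax` and `numerical_jacobian`, the step size `eps`, and the random logits are assumptions made for illustration; they are not part of the project code.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax for logits of shape (n, C)."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def numerical_jacobian(z_row, eps=1e-6):
    """Central-difference Jacobian for one sample: J[k, j] ~ d a_k / d z_j."""
    C = z_row.shape[0]
    J = np.zeros((C, C))
    for j in range(C):
        step = np.zeros(C)
        step[j] = eps
        plus = softmax((z_row + step)[None, :])[0]
        minus = softmax((z_row - step)[None, :])[0]
        J[:, j] = (plus - minus) / (2 * eps)
    return J


rng = np.random.default_rng(0)
n, C = 4, 3
z = rng.normal(size=(n, C))

a = softmax(z)
jacobians = np.stack([numerical_jacobian(z[i]) for i in range(n)])

print(a.shape)          # (4, 3)
print(jacobians.shape)  # (4, 3, 3)
```

Each sample contributes its own $C \times C$ block, which is where the extra axis comes from.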

Local gradient in softmax

Define the $k$-th probability of the $i$-th data point as follows.

\[ \hat{y}_{ik} = \dfrac{\exp(z_{ik})}{\sum^{C}_{j=1} \exp(z_{ij})} \]

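As we saw above, the upstream gradient $\dfrac{\partial L}{\partial \hat{y}_{ik}}$ flows in, and what we want is the local gradient $\dfrac{\partial \hat{y}_{ik}}{\partial z_{ij}}$. As a trick, we first compute $\dfrac{\partial \log \hat{y}_{ik}}{\partial z_{ij}}$, because the chain rule applied to $\log \hat{y}_{ik}$ gives the following relation.

\[ \dfrac{\partial \log \hat{y}_{ik}}{\partial z_{ij}} = \dfrac{1}{\hat{y}_{ik}} \dfrac{\partial \hat{y}_{ik}}{\partial z_{ij}} \]

In other words, computing $\dfrac{\partial \log \hat{y}_{ik}}{\partial z_{ij}}$ and multiplying it by $\hat{y}_{ik}$ yields the gradient we are after.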
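Spelled out from the definition of $\hat{y}_{ik}$, the derivation goes roughly as follows. This is a reconstruction from the definitions in this post; the summation index $l$ and the indicator notation $\mathbb{1}[\cdot]$ are notation chosen here.

\[ \log \hat{y}_{ik} = z_{ik} - \log \sum^{C}_{l=1} \exp(z_{il}) \]

\[ \dfrac{\partial \log \hat{y}_{ik}}{\partial z_{ij}} = \mathbb{1}[k = j] - \dfrac{\exp(z_{ij})}{\sum^{C}_{l=1} \exp(z_{il})} = \mathbb{1}[k = j] - \hat{y}_{ij} \]

\[ \dfrac{\partial \hat{y}_{ik}}{\partial z_{ij}} = \hat{y}_{ik} \left( \mathbb{1}[k = j] - \hat{y}_{ij} \right) \]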

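When $\log \hat{y}_{ik}$ is differentiated with respect to $z_{ij}$, the result depends on whether $z_{ik}$ and $z_{ij}$ are the same variable, i.e. whether $k = j$, so it is written with an indicator function. As noted earlier, the gradient for the $i$-th data point is a $C \times C$ Jacobian matrix, so let's look at that next.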
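Written out entry by entry, with rows indexed by the output $\hat{y}_{ik}$ and columns by the input $z_{ij}$, the Jacobian of the $i$-th data point should look like this. The symbol $J_i$ is a name introduced here, and each entry is $\hat{y}_{ik} \left( \mathbb{1}[k = j] - \hat{y}_{ij} \right)$.

\[ J_i = \begin{pmatrix} \dfrac{\partial \hat{y}_{i1}}{\partial z_{i1}} & \cdots & \dfrac{\partial \hat{y}_{i1}}{\partial z_{iC}} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial \hat{y}_{iC}}{\partial z_{i1}} & \cdots & \dfrac{\partial \hat{y}_{iC}}{\partial z_{iC}} \end{pmatrix} = \begin{pmatrix} \hat{y}_{i1}(1 - \hat{y}_{i1}) & \cdots & -\hat{y}_{i1}\hat{y}_{iC} \\ \vdots & \ddots & \vdots \\ -\hat{y}_{iC}\hat{y}_{i1} & \cdots & \hat{y}_{iC}(1 - \hat{y}_{iC}) \end{pmatrix} \]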

The matrix derived last can be expanded as follows.
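One way to write the expansion is as an elementwise (Hadamard) product of two $C \times C$ matrices. This is presumably the intended form, since it mirrors the `left * right` product in the numpy code later in the post; the symbols $J_i$, $\odot$ and $I_C$ (the $C \times C$ identity) are notation assumed here.

\[ J_i = \begin{pmatrix} \hat{y}_{i1} & \cdots & \hat{y}_{i1} \\ \vdots & & \vdots \\ \hat{y}_{iC} & \cdots & \hat{y}_{iC} \end{pmatrix} \odot \left( I_C - \begin{pmatrix} \hat{y}_{i1} & \cdots & \hat{y}_{iC} \\ \vdots & & \vdots \\ \hat{y}_{i1} & \cdots & \hat{y}_{iC} \end{pmatrix} \right) \]

The entry in row $k$, column $j$ is $\hat{y}_{ik} \left( \mathbb{1}[k = j] - \hat{y}_{ij} \right)$, i.e. $\dfrac{\partial \hat{y}_{ik}}{\partial z_{ij}}$.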

The reason for expanding it this way is that this form is convenient when implementing with numpy. The derivation assumes the $i$-th data point; with $n$ data points, the Jacobian becomes an $n \times C \times C$ array.
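As a rough sketch of how the batched Jacobian could be built with broadcasting instead of a Python loop: the function name `softmax_jacobian` and the example probabilities are made up for illustration, and the project code itself uses `np.tile` inside a loop.

```python
import numpy as np


def softmax_jacobian(y_pred):
    """Batched softmax Jacobian.

    y_pred: (n, C) softmax outputs.
    Returns (n, C, C) with J[i, k, j] = y_ik * (1[k == j] - y_ij).
    """
    n, C = y_pred.shape
    left = y_pred[:, :, None]                            # (n, C, 1): y_ik along rows
    right = np.eye(C)[None, :, :] - y_pred[:, None, :]   # (n, C, C): 1[k == j] - y_ij
    return left * right


y_pred = np.array([[0.2, 0.3, 0.5],
                   [0.7, 0.2, 0.1]])
J = softmax_jacobian(y_pred)

print(J.shape)  # (2, 3, 3)
# Same matrix as the familiar diag(y) - y y^T form
print(np.allclose(J[0], np.diag(y_pred[0]) - np.outer(y_pred[0], y_pred[0])))  # True
```

Because each row of `y_pred` sums to 1, every row of each Jacobian sums to 0, which makes for a cheap sanity check.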

Backpropagation in softmax

Softmax is the final layer, and the probabilities it outputs are fed into the cross entropy loss function to compute the loss. First, let's revisit the definition of the cross entropy loss.

\[ L = -\dfrac{1}{n} \sum^{n}_{i=1} \sum^{C}_{j=1} y_{ij} \log \hat{y}_{ij} \]

Let's derive the very first gradient.

\[ \dfrac{\partial L}{\partial \hat{y}_{ij}} = -\dfrac{1}{n} \dfrac{y_{ij}}{\hat{y}_{ij}} \]
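In numpy, this upstream gradient could be computed roughly as below. The function names `cross_entropy` and `cross_entropy_grad` and the toy arrays are hypothetical and only illustrate the formula; one-hot labels are assumed.

```python
import numpy as np


def cross_entropy(y_true, y_pred):
    """Mean cross entropy over n samples; both inputs have shape (n, C)."""
    n = y_true.shape[0]
    return -np.sum(y_true * np.log(y_pred)) / n


def cross_entropy_grad(y_true, y_pred):
    """Upstream gradient dL/dy_pred of shape (n, C): -(1/n) * y_true / y_pred."""
    n = y_true.shape[0]
    return -y_true / (n * y_pred)


# One-hot labels (assumed) and some predicted probabilities
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
y_pred = np.array([[0.5, 0.3, 0.2],
                   [0.1, 0.2, 0.7]])

upstream = cross_entropy_grad(y_true, y_pred)
print(upstream.shape)  # (2, 3)
print(upstream[0, 0])  # -1.0, i.e. -(1/2) * 1 / 0.5
```

With one-hot labels, only the entry of the true class is non-zero in each row.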

This gradient becomes the upstream gradient flowing into the softmax layer. We saw above that the local gradient is an $n \times C \times C$ array. So how do we combine the upstream gradient and the local gradient into the downstream gradient? The downstream gradient must have the same shape as the logits, i.e. it must be an $n \times C$ matrix. Let's look at its definition.

\[ \dfrac{\partial L}{\partial z_{ik}} = \sum^{C}_{j=1} \dfrac{\partial \hat{y}_{ij}}{\partial z_{ik}} \dfrac{\partial L}{\partial \hat{y}_{ij}} \]

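Note that $z_{ik}$ took part in computing every one of $\hat{y}_{i1}, \cdots, \hat{y}_{iC}$, so the gradients through each of these predictions must all be summed. Each prediction comes with its own upstream gradient, so each term is multiplied by it before summing. Looking closely, only the index $j$ of $\hat{y}_{ij}$ varies in the local gradient. In other words, the computation runs down one column (column $k$) of the local gradient derived above.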

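Likewise, the index of the upstream gradient varies only over the class labels of the $i$-th data point, so suitably combining the $i$-th row of the upstream gradient with the local gradient of the $i$-th data point gives the downstream gradient of the $i$-th data point, as follows.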
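In matrix form this is a row vector times the Jacobian. The following is a reconstruction consistent with the sum over $j$ above, where $J_i$ (a name introduced here) denotes the $C \times C$ Jacobian of the $i$-th data point with entry $\dfrac{\partial \hat{y}_{ij}}{\partial z_{ik}}$ in row $j$, column $k$.

\[ \begin{pmatrix} \dfrac{\partial L}{\partial z_{i1}} & \cdots & \dfrac{\partial L}{\partial z_{iC}} \end{pmatrix} = \begin{pmatrix} \dfrac{\partial L}{\partial \hat{y}_{i1}} & \cdots & \dfrac{\partial L}{\partial \hat{y}_{iC}} \end{pmatrix} J_i \]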

The result is a $C$-dimensional downstream gradient. Repeating this for all $n$ data points completes the $n \times C$ downstream gradient.
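The whole chain can be checked numerically. The sketch below assumes one-hot labels and random logits (all variable names are illustrative) and uses `np.einsum` for the per-sample vector-Jacobian product. With one-hot labels the result should match the well-known simplification $(\hat{y} - y)/n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C = 5, 4
z = rng.normal(size=(n, C))                     # logits
y_true = np.eye(C)[rng.integers(0, C, size=n)]  # one-hot labels (assumed)

# forward: numerically stable softmax
e = np.exp(z - z.max(axis=1, keepdims=True))
y_pred = e / e.sum(axis=1, keepdims=True)

# upstream gradient from cross entropy, shape (n, C)
upstream = -y_true / (n * y_pred)

# local gradient: jacobian[i, j, k] = d y_pred[i, j] / d z[i, k], shape (n, C, C)
jacobian = y_pred[:, :, None] * (np.eye(C)[None, :, :] - y_pred[:, None, :])

# downstream[i, k] = sum_j upstream[i, j] * jacobian[i, j, k]
downstream = np.einsum("ij,ijk->ik", upstream, jacobian)

print(downstream.shape)                                # (5, 4)
print(np.allclose(downstream, (y_pred - y_true) / n))  # True
```

This identity is why a fused softmax plus cross entropy never needs to form the $n \times C \times C$ array explicitly.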

Numpy code

Let's look at the code. First, the forward method of softmax.

import numpy as np


class Softmax:
    def __init__(self):
        pass

    def forward(self, x):
        """
        params
        ------
        x: np.ndarray (n, n_label)

        returns
        -------
        y_pred: np.ndarray (n, n_label)
        """
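        self.logit = x
        # subtract the row-wise max before exp to avoid overflow
        x = np.exp(x - x.max(axis=1).reshape(-1,1))
        # normalize each row so that it sums to 1
        y_pred = x / x.sum(axis=1).reshape(-1,1)
        # cache the output; backward builds the jacobian from it
        self.y_pred = y_pred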
        return y_pred
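In `forward`, the row-wise maximum is subtracted before `np.exp`. Here is a small sketch of why this matters, using deliberately large logits chosen for illustration:

```python
import numpy as np

z = np.array([[1000.0, 1001.0, 1002.0]])

# Naive softmax: exp(1000) overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

# Shifted softmax: the largest exponent is 0, so nothing overflows
shifted = np.exp(z - z.max(axis=1, keepdims=True))
stable = shifted / shifted.sum(axis=1, keepdims=True)

print(np.isnan(naive).all())  # True
print(stable.round(3))        # approximately [[0.09  0.245 0.665]]
```

The shift does not change the result, because the factor `exp(-max)` appears in both the numerator and the denominator and cancels.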

The input is assumed to be an $n \times C$ matrix of logits. Before taking exp, the row-wise maximum is subtracted; think of this as a step that prevents overflow. Next, let's look at the gradient.

class Softmax:
    def __init__(self):
        pass

    def forward(self, x):
        ...

    def backward(self, dx_out):
        """
        Step 1
        Calculate jacobian matrix (L, L)
        dyhat_i1/dz_i1 ... dyhat_i1/dz_iL
        ...
        dyhat_iL/dz_i1 ... dyhat_iL/dz_iL

        Note that for each data point, L x L jacobian matrix should be calculated. (N, L, L)

        Step 2
        Calculate downstream gradient
        dL/dz_ik = sum_j dyhat_ij/dz_ik x dL/dyhat_ij

        Note that because it is downstream gradient, its dimension is (N, L)

        params
        ------
        dx_out: np.ndarray (n, n_label)

        returns
        -------
        out_grad: np.ndarray (n, n_label)
        """
        # step 1
        n, L = dx_out.shape
        jacobian = np.zeros((n, L, L))
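        I = np.identity(L)
        for i in range(n):
            y_pred_i = self.y_pred[i]
            # right[k, j] = 1[k == j] - yhat_ij
            right = I - np.tile(y_pred_i, L).reshape(L, L)
            # left[k, j] = yhat_ik (row k repeats yhat_ik)
            left = np.tile(y_pred_i, L).reshape(L, L).T
            # elementwise product: jacobian[i][k, j] = dyhat_ik/dz_ij
            jacobian[i] = left * right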

        # step 2
        dx_in = np.zeros((n, L))
        for i in range(n):
            dL_dz_ij = np.dot(dx_out[i], jacobian[i]) # (L,)
            dx_in[i] = dL_dz_ij

        return dx_in

The first step computes the $n \times C \times C$ Jacobian. `np.tile` is used to build the matrices of repeated values. The second step takes the matrix product of the upstream gradient `dx_out` and the local gradient `jacobian`. Here `dx_out[i]` is the upstream gradient of the i-th data point, a $1 \times C$ row vector, and `jacobian[i]` is the Jacobian of the i-th data point, a $C \times C$ matrix. Their product is a $1 \times C$ vector, which is the downstream gradient of the i-th data point.
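As a possible alternative, the vector-Jacobian product can be expanded algebraically so that the $n \times C \times C$ array is never materialized: $\dfrac{\partial L}{\partial z_{ik}} = \hat{y}_{ik} \left( g_{ik} - \sum_j g_{ij} \hat{y}_{ij} \right)$, where $g$ is the upstream gradient. The class name `SoftmaxVectorized` below is hypothetical and is not part of the repository; the sketch compares it against the explicit per-sample Jacobian on random inputs.

```python
import numpy as np


class SoftmaxVectorized:
    """Hypothetical loop-free variant of the Softmax layer above."""

    def forward(self, x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        self.y_pred = e / e.sum(axis=1, keepdims=True)
        return self.y_pred

    def backward(self, dx_out):
        y = self.y_pred
        # inner[i] = sum_j dx_out[i, j] * y[i, j], kept as shape (n, 1)
        inner = (dx_out * y).sum(axis=1, keepdims=True)
        # dz[i, k] = y[i, k] * (dx_out[i, k] - inner[i])
        return y * (dx_out - inner)


rng = np.random.default_rng(1)
x = rng.normal(size=(6, 4))   # logits
g = rng.normal(size=(6, 4))   # arbitrary upstream gradient

layer = SoftmaxVectorized()
y = layer.forward(x)
fast = layer.backward(g)

# Reference: explicit per-sample Jacobian, diag(y) - y y^T
slow = np.stack([g[i] @ (np.diag(y[i]) - np.outer(y[i], y[i])) for i in range(6)])

print(fast.shape)                # (6, 4)
print(np.allclose(fast, slow))   # True
```

The loop-based version is easier to match against the derivation, while this form uses memory proportional to $n \times C$ rather than $n \times C \times C$.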

Conclusion

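We derived the local gradient of the softmax layer by hand and combined it with the upstream gradient flowing in from the cross entropy loss to obtain the downstream gradient. Since nn.CrossEntropyLoss does all of this internally, it is tempting to think you should just use it, haha, but deriving everything rigorously and writing the code made it much easier to understand. Next, I plan to build an MLP or CNN with this softmax layer and compare the results against PyTorch.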

Reference

https://towardsdatascience.com/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1
