Cross Entropy Loss

Many deep learning tasks involve classification, where a model outputs a probability for each possible label and the goal is to predict the correct label for a given input. Mathematically, this means assigning the highest probability to the correct label. These probabilities are produced by the softmax function.
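
To make this concrete, here is a minimal numpy sketch of softmax (the max-subtraction is only a standard numerical-stability trick and does not change the result):

import numpy as np

def softmax(logits):
    # subtract the row-wise max for numerical stability; the output is unchanged
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    # normalize so the entries are non-negative and sum to 1
    return exp / exp.sum(axis=-1, keepdims=True)

softmax(np.array([1.0, 2.0, 3.0]))  # array([0.09003057, 0.24472847, 0.66524096])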

The softmax function outputs a vector \(\hat{y}\) that represents the estimated conditional probabilities of each class given an input \(x\); for example, \(\hat{y}_1 = P(y=\textrm{car}\ |\ x)\). Assume we have \(k\) examples \(x^{(i)}\) with corresponding labels \(y^{(i)}\). Treating the examples as independent, the likelihood of the labels given the inputs can be expressed succinctly as

\[ P(Y\ |\ X) = \prod^{k}_{i=1} P(y^{(i)} | \ x^{(i)}) \]

Our goal is to maximize \(P(Y\ |\ X)\). This is equivalent to minimizing the negative log-likelihood \( -\textrm{log}\ P(Y\ |\ X) = \sum^{k}_{i=1} -\textrm{log}\ P(y^{(i)}\ |\ x^{(i)}) \).
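
For a single example with a one-hot label vector \(y\) over \(q\) classes (\(y_j\) equals 1 for the correct class and 0 otherwise), each term in this sum can be written as

\[ -\textrm{log}\ P(y\ |\ x) = -\sum^{q}_{j=1} y_j\ \textrm{log}\ \hat{y}_j \]

which is just the negative log of the probability assigned to the correct class.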

This loss function is called the cross-entropy loss, and it is widely used in classification tasks. Our objective is to reduce its value, which is equivalent to maximizing the predicted probability of the correct label.

To see why this works, let's walk through a toy example. Suppose we have three classes, and our model produces a vector of three probabilities for each input.

import numpy as np

# two probability vectors over three classes, one per input
y_hat = np.array([[0.1, 0.3, 0.6], [0.2, 0.3, 0.5]])
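
Each row of y_hat is a valid probability distribution over the three classes, which we can sanity-check by summing:

y_hat.sum(axis=1)  # array([1., 1.])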

Each label is represented as the index of the correct class in the corresponding row of y_hat, so indexing with these labels gives us the probability the model assigns to the correct label of each input.

y = np.array([0, 2])
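
With numpy's integer-array indexing, these label indices pick out, for each row, the probability assigned to the correct class (a quick check of the indexing used below):

y_hat[[0, 1], y]  # array([0.1, 0.5])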

Then, we implement the cross-entropy loss function as:

def cross_entropy(y_hat, y):
    # pick each row's probability for its correct label and take the negative log
    return -np.log(y_hat[range(len(y_hat)), y])

Finally, we calculate the loss value for our given probability vectors:

cross_entropy(y_hat, y)

The result is array([2.30258509, 0.69314718]). In the first probability vector [0.1, 0.3, 0.6], the correct label is at index 0, but the model gives its highest probability to index 2 and only \(0.1\) to the correct label, hence the larger loss value. In the second probability vector [0.2, 0.3, 0.5], the prediction is right: the highest probability goes to index 2, which matches the label, hence the smaller loss value.
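
In practice, these per-example values are usually reduced to a single scalar, for instance by averaging over the batch (a common convention rather than something the function above does by itself):

cross_entropy(y_hat, y).mean()  # (2.30258509 + 0.69314718) / 2 ≈ 1.4979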