Recall that the cross-entropy $H(P, Q)$ is the expected number of bits required to encode a message derived from a distribution $P$ into an encoding optimized for a distribution $Q$. We express this as the expected information density, under $Q$, of events drawn from $P$:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x) = \mathbb{E}_{x \sim P}\left[-\log Q(x)\right]$$
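
As a quick sanity check, here is a minimal NumPy sketch (the two distributions are made up for illustration, and it uses base-2 logarithms so the result reads as bits):

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) * log2(Q(x)): expected bits needed to encode
    events drawn from P using a code optimized for Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

p = np.array([0.5, 0.25, 0.25])   # "true" distribution P
q = np.array([0.25, 0.5, 0.25])   # distribution Q the code is optimized for
print(cross_entropy(p, p))        # 1.5  -> Shannon entropy of P
print(cross_entropy(p, q))        # 1.75 -> larger, because Q != P
```
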
Consider a classification problem with two or more mutually exclusive labels. For each labeled example, we have a one-hot encoding $y$ of the true label. Meanwhile, the output of a classifier is a probability distribution $\hat{y}$ over the labels. In an ideal world, our classifier would output the same distribution as the true labels. If we could quantify how much it fails to do so, we would have a very natural loss function for our classifier.
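
To make the two objects concrete, here is a tiny sketch (the label count, class index, and predicted probabilities are all hypothetical):

```python
import numpy as np

num_labels = 3
true_label = 1                        # hypothetical class index
y = np.eye(num_labels)[true_label]    # one-hot encoding of the true label: [0., 1., 0.]

y_hat = np.array([0.2, 0.7, 0.1])     # hypothetical classifier output: probabilities summing to 1
print(y, y_hat)
```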

In the event that $P$ and $Q$ are identically distributed, $H(P, Q)$ is minimized (and becomes regular old Shannon entropy $H(P)$). The more dissimilar $P$ and $Q$ are, the larger $H(P, Q)$ becomes. Hence, cross-entropy becomes an excellent proxy for the difference between the true distribution and our classifier’s predictions. So we can ask: how many bits of information does it take to encode the true label $y$ into an encoding optimized for the classifier’s output $\hat{y}$? This is the cross-entropy loss,

$$L(y, \hat{y}) = H(y, \hat{y}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
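
A minimal sketch of the loss for a single example, assuming the natural logarithm (the usual convention for the loss, even though the bit-counting story above uses base 2) and the same hypothetical vectors as before:

```python
import numpy as np

def cross_entropy_loss(y, y_hat):
    """L(y, y_hat) = -sum_k y_k * log(y_hat_k)."""
    return -np.sum(y * np.log(y_hat))

y     = np.array([0.0, 1.0, 0.0])    # one-hot true label
y_hat = np.array([0.2, 0.7, 0.1])    # classifier's predicted distribution
print(cross_entropy_loss(y, y_hat))  # -ln(0.7) ≈ 0.357
```

Because $y$ is one-hot, only the term for the true label survives, so the loss is simply the negative log-probability the classifier assigned to the correct class.
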
This can also be expressed, without the explicit summation, in terms of vector operations (with the logarithm applied elementwise):

$$L(y, \hat{y}) = -y \cdot \log \hat{y}$$
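
The same number, computed with a dot product instead of an explicit sum (same hypothetical vectors as above):

```python
import numpy as np

y     = np.array([0.0, 1.0, 0.0])
y_hat = np.array([0.2, 0.7, 0.1])

loss = -y @ np.log(y_hat)   # the dot product replaces the sum over labels
print(loss)                 # ≈ 0.357, identical to the summation form
```
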
Note that here the sum is over the $K$ possible labels. If we wish to generalize this to $N$ examples, then $y$ and $\hat{y}$ become the target and prediction matrices $Y$ and $\hat{Y}$, respectively, with one row per example:

$$L(Y, \hat{Y}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} Y_{nk} \log \hat{Y}_{nk}$$
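
A sketch of the batched version, using a hypothetical batch of $N = 2$ examples with $K = 3$ labels (note that this sums over the batch; many libraries report the mean instead):

```python
import numpy as np

Y = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0]])       # one-hot targets, one row per example
Y_hat = np.array([[0.2, 0.7, 0.1],
                  [0.6, 0.3, 0.1]])   # predicted distributions, one row per example

loss = -np.sum(Y * np.log(Y_hat))     # double sum over examples and labels
print(loss)                           # -ln(0.7) - ln(0.6) ≈ 0.868
```
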
While it is possible to completely do away with both sums, the result is unintuitive, so usually this is expressed as a single summation over the examples, where $Y_n$ and $\hat{Y}_n$ denote the $n$-th rows:

$$L(Y, \hat{Y}) = -\sum_{n=1}^{N} Y_n \cdot \log \hat{Y}_n$$
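
Both the single-summation form and a fully "sum-free" form (one way to write it is with a trace, which is where the unintuitiveness comes in) give the same number as the double sum above; here is a sketch with the same hypothetical matrices:

```python
import numpy as np

Y = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0]])
Y_hat = np.array([[0.2, 0.7, 0.1],
                  [0.6, 0.3, 0.1]])

# Single sum over examples: the inner sum over labels becomes a row-wise dot product.
single_sum = -np.sum(np.einsum('nk,nk->n', Y, np.log(Y_hat)))

# No explicit sums at all: one option is the trace of Y log(Y_hat)^T (correct, but harder to read).
no_sums = -np.trace(Y @ np.log(Y_hat).T)

print(single_sum, no_sums)   # both ≈ 0.868
```
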
Cross-entropy loss is typically the preferred loss function for classification problems. In the case of binary classification, this reduces to the simplified binary cross-entropy expression $L(y, \hat{y}) = -\left[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right]$, where $y \in \{0, 1\}$ and $\hat{y}$ is the predicted probability of the positive class; it is the same quantity, just written out for two classes. A special version of binary cross-entropy loss is used as a loss function for autoencoders.
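
To illustrate that the binary form really is the same thing, here is a sketch (with hypothetical values) comparing the simplified expression against the full two-class cross-entropy:

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    """Simplified two-class form: y is 0 or 1, y_hat is the predicted P(label = 1)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cross_entropy_loss(y_vec, y_hat_vec):
    return -np.sum(y_vec * np.log(y_hat_vec))

y, y_hat = 1, 0.7
print(binary_cross_entropy(y, y_hat))              # -ln(0.7) ≈ 0.357
print(cross_entropy_loss(np.array([0.0, 1.0]),     # same value: a binary label is
                         np.array([0.3, 0.7])))    # just a 2-class one-hot vector
```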