Recall that the cross-entropy of a distribution $Q$ relative to a distribution $P$ is defined as

$$H(P, Q) = -\sum_{x} P(x) \log Q(x).$$
Consider a classification problem with two or more mutually exclusive labels. For each labeled example, we have a one-hot encoding of the true label. Meanwhile, the output of a classifier is a probability distribution over the labels. In an ideal world, our classifier would output, for each example, exactly the one-hot distribution of its true label. If we could quantify how far it falls short of doing so, we would have a very natural loss function for our classifier, and the cross-entropy between the label distribution $P$ and the classifier's output $Q$ is exactly such a measure.
In the event that $P$ and $Q$ are identically distributed, the cross-entropy reduces to the entropy of $P$:

$$H(P, Q) = H(P).$$

In general $H(P, Q) \geq H(P)$, with equality only when the two distributions agree; since a one-hot $P$ has zero entropy, the loss reaches its minimum of zero exactly when the classifier puts all of its probability mass on the true label.
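As a quick numerical illustration (a sketch of my own, not from the original text; the distributions and the helper `cross_entropy` are made up for the example), the definition can be evaluated directly with NumPy:

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(P, Q) = -sum_x P(x) * log Q(x) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Clip q to avoid log(0) when the model assigns zero probability to a class.
    return -np.sum(p * np.log(np.clip(q, eps, 1.0)))

p = np.array([1.0, 0.0, 0.0])      # one-hot encoding of the true label
q = np.array([0.7, 0.2, 0.1])      # a classifier's predicted distribution
print(cross_entropy(p, q))         # -log(0.7), about 0.357
print(cross_entropy(p, p))         # 0.0: P = Q, and a one-hot P has zero entropy
```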
Written out for a dataset of $N$ examples and $C$ classes, with $y_{i,c}$ the one-hot encoding of the true label of example $i$ and $\hat{y}_{i,c}$ the predicted probability of class $c$, the cross-entropy loss is

$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}.$$

This can also be expressed, without the explicit summation over classes, in terms of vector operations, with the logarithm applied elementwise:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \mathbf{y}_i \cdot \log \hat{\mathbf{y}}_i.$$

Note that here the sum is over the examples, not over the classes; the inner sum has been absorbed into the dot product. While it is possible to completely do away with both sums, for instance by stacking the labels and predictions into $N \times C$ matrices and taking an elementwise product followed by a full reduction, the result is unintuitive, so the loss is usually expressed with the single summation over examples shown above.
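To make the different forms concrete, here is a small NumPy sketch (my own illustration with made-up numbers, not from the text) showing that the explicit double sum, the per-example dot products, and the fully reduced matrix form all give the same loss:

```python
import numpy as np

# Made-up batch: N = 3 examples, C = 4 classes.
Y = np.array([[1, 0, 0, 0],                    # one-hot true labels
              [0, 0, 1, 0],
              [0, 1, 0, 0]], dtype=float)
Y_hat = np.array([[0.70, 0.10, 0.10, 0.10],    # predicted probabilities (rows sum to 1)
                  [0.05, 0.05, 0.80, 0.10],
                  [0.20, 0.60, 0.10, 0.10]])

N, C = Y.shape
log_Y_hat = np.log(Y_hat)

# Explicit double sum over examples and classes.
loss_double = -sum(Y[i, c] * log_Y_hat[i, c] for i in range(N) for c in range(C)) / N

# Single sum over examples; the sum over classes becomes a per-row dot product.
loss_single = -np.mean([Y[i] @ log_Y_hat[i] for i in range(N)])

# No explicit sums at all: elementwise product and a full reduction (terse but unintuitive).
loss_matrix = -np.sum(Y * log_Y_hat) / N

print(loss_double, loss_single, loss_matrix)   # all three agree
```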
Cross-entropy loss is typically the preferred loss function for classification problems. In the case of binary classification, with a single label $y \in \{0, 1\}$ and a single predicted probability $\hat{y}$, the expression simplifies to the binary cross-entropy loss $-\left[\,y \log \hat{y} + (1 - y) \log(1 - \hat{y})\,\right]$, but it is the same quantity. A special version of binary cross-entropy loss is also used as a loss function for autoencoders.
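As a small sketch of the binary case (again my own illustration with made-up numbers; the helper `binary_cross_entropy` is not from the text), the simplified expression gives exactly the same result as the general two-class formula:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """-[y*log(y_hat) + (1-y)*log(1-y_hat)], averaged over examples."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y = np.array([1.0, 0.0, 1.0])       # true binary labels
y_hat = np.array([0.9, 0.2, 0.6])   # predicted probabilities of the positive class
print(binary_cross_entropy(y, y_hat))

# The same numbers through the general two-class cross-entropy: identical result.
Y = np.stack([1.0 - y, y], axis=1)              # one-hot rows [P(class 0), P(class 1)]
Y_hat = np.stack([1.0 - y_hat, y_hat], axis=1)
print(-np.mean(np.sum(Y * np.log(Y_hat), axis=1)))
```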