Definition

Binary Cross-Entropy (also called log loss) measures the difference between the true labels and the predicted probabilities in binary classification.

It penalizes predictions that are confident but wrong much more than predictions that are uncertain.


Formula

For a dataset with $n$ samples:

$L = -\frac{1}{n} \sum_{i=1}^n \Big[ y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i) \Big]$

where $\log$ is the natural logarithm and:

  • $y_i \in \{0,1\}$: true label
  • $\hat{p}_i \in (0,1)$: predicted probability of class 1
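
As a minimal sketch, the formula can be written directly with NumPy. The function name `binary_cross_entropy`, the clipping constant `eps`, and the sample inputs are illustrative assumptions, not part of the definition; clipping is one common way to avoid evaluating $\log(0)$ when a predicted probability is exactly 0 or 1.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy using the natural logarithm."""
    y = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs; eps is an illustrative choice.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Made-up labels and predictions, for illustration only.
loss = binary_cross_entropy([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.4])
print(round(loss, 4))  # about 0.299
```

The four per-sample terms here are $-\log(0.9)$, $-\log(0.8)$, $-\log(0.7)$ and $-\log(0.6)$, and the result is their average.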

Intuition

  • If the true label is 1, the per-sample loss reduces to:
    • $L = -\log(\hat{p})$ → The closer $\hat{p}$ is to 1, the smaller the loss.
  • If the true label is 0, the per-sample loss reduces to:
    • $L = -\log(1-\hat{p})$ → The closer $\hat{p}$ is to 0, the smaller the loss.

Example

Suppose true label $y=1$:

  • If the model predicts $\hat{p}=0.9$:
    • $L = -\log(0.9) \approx 0.105$ → Good prediction (low loss).
  • If the model predicts $\hat{p}=0.1$:
    • $L = -\log(0.1) \approx 2.303$ → Bad prediction (high loss).
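
The two cases above can be checked numerically with a short sketch. The helper name `per_sample_loss` is hypothetical, and the extra probabilities 0.5 and 0.01 are added only to show how quickly the loss grows as the prediction moves away from the true label.

```python
import math

def per_sample_loss(y, p_hat):
    """Binary cross-entropy for a single sample (natural logarithm)."""
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

y = 1  # true label
for p_hat in (0.9, 0.5, 0.1, 0.01):
    print(f"p_hat={p_hat}: loss={per_sample_loss(y, p_hat):.3f}")
# p_hat=0.9: loss=0.105
# p_hat=0.5: loss=0.693
# p_hat=0.1: loss=2.303
# p_hat=0.01: loss=4.605
```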

Why It’s Useful

  1. Probabilistic output: Scores predicted probabilities directly, not just hard class decisions.
  2. Heavy penalty for confident errors: The loss grows without bound as the predicted probability of the true class approaches 0.
  3. Differentiable: Smooth in $\hat{p}$, so it can be minimized with gradient descent.
  4. Connection to information theory: Minimizing BCE is equivalent to minimizing the Kullback–Leibler (KL) divergence between the true and predicted distributions, because the entropy of the true labels does not depend on the model.
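
The information-theory connection in point 4 rests on the identity $H(q, p) = H(q) + \mathrm{KL}(q \,\|\, p)$ for a target distribution $q$ and a predicted distribution $p$. The sketch below checks it numerically for Bernoulli distributions; the function names and the test values are illustrative assumptions. For hard labels ($q \in \{0, 1\}$) the entropy $H(q)$ is zero, so cross-entropy and KL divergence coincide.

```python
import math

def bernoulli_entropy(q):
    # Terms with zero probability are skipped, i.e. 0 * log(0) is treated as 0.
    return -sum(t * math.log(t) for t in (q, 1 - q) if t > 0)

def bernoulli_cross_entropy(q, p):
    return -(q * math.log(p) + (1 - q) * math.log(1 - p))

def bernoulli_kl(q, p):
    pairs = ((q, p), (1 - q, 1 - p))
    return sum(t * math.log(t / s) for t, s in pairs if t > 0)

p = 0.8  # predicted probability of class 1
for q in (1.0, 0.0, 0.3):  # two hard labels and one soft target
    lhs = bernoulli_cross_entropy(q, p)
    rhs = bernoulli_entropy(q) + bernoulli_kl(q, p)
    assert math.isclose(lhs, rhs)
```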

Applications

  • Logistic Regression (binary classification).
  • Neural Networks (binary classification tasks, final sigmoid layer + BCE).
  • Deep Learning: BCE is a common default loss for binary tasks such as fraud detection, churn prediction, and medical diagnosis.
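
To illustrate the sigmoid + BCE pairing, here is a minimal sketch of logistic regression trained with plain gradient descent in NumPy. The synthetic data, the learning rate, and the iteration count are assumptions chosen for illustration, not recommended settings. It relies on the fact that, for a sigmoid output, the gradient of the mean BCE with respect to each logit is $(\hat{p}_i - y_i)/n$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data (made up for illustration): label is 1 when x > 0.
x = rng.normal(size=200)
y = (x > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, p, eps=1e-12):
    # Clip to avoid log(0); eps is an illustrative choice.
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

w, b = 0.0, 0.0   # model parameters
lr = 0.5          # learning rate (assumed value)
initial_loss = bce(y, sigmoid(w * x + b))

for _ in range(200):
    p = sigmoid(w * x + b)
    # Gradients of the mean BCE with respect to w and b.
    grad_w = float(np.mean((p - y) * x))
    grad_b = float(np.mean(p - y))
    w -= lr * grad_w
    b -= lr * grad_b

final_loss = bce(y, sigmoid(w * x + b))
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

With all parameters at zero the model predicts 0.5 for every sample, so the starting loss is $\log 2 \approx 0.693$; training should push it well below that on this separable toy data.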

In short:
Binary Cross-Entropy measures how well predicted probabilities match true binary labels.

  • Correct confident predictions → small loss.
  • Wrong confident predictions → large loss.