Deep Learning 2 — Classification
:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →
French and Chinese versions are available from the home page.
:::
In the previous chapter, we did regression — predicting a number. Here we move to classification — predicting a category. The big surprise of this chapter, and its beauty: going from one to the other requires almost nothing. The gradient keeps the same form.
Why this chapter?
You'll see:
- the logistic neuron (linear regression + sigmoid);
- the intuitive explanation of cross-entropy and its derivation from the Bernoulli distribution;
- the fact that the gradient keeps the same form;
- the PyTorch pitfalls: `BCELoss` vs `BCEWithLogitsLoss` vs `CrossEntropyLoss`;
- multiclass classification: softmax + `CrossEntropyLoss`.
From linear to logistic
For binary classification, the target is $y \in \{0, 1\}$. We want a probability $\hat{y} \approx P(y = 1 \mid x)$, so an output in $(0, 1)$.

Solution: apply the sigmoid function to the linear output:

$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^\top x + b$$

$\sigma$ squashes any real $z$ to $(0, 1)$. This is the logistic neuron.
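A quick numeric check of the squashing behaviour, as a minimal sketch using only the standard library:

```python
import math

def sigmoid(z: float) -> float:
    # sigma(z) = 1 / (1 + exp(-z)): maps any real to (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# large negative -> near 0, zero -> 0.5, large positive -> near 1
outputs = [sigmoid(z) for z in (-10.0, 0.0, 10.0)]
```

Whatever the magnitude of the linear output, the result is always a valid probability.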
Cross-entropy as "surprise"
The loss for classification is the cross-entropy:

$$\mathcal{L}(y, \hat{y}) = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]$$
The intuition: cross-entropy measures the "surprise" of the model facing the truth.
- If $y = 1$ and $\hat{y} \to 1$: no surprise, $\mathcal{L} \to 0$.
- If $y = 1$ and $\hat{y} \to 0$: huge surprise, $\mathcal{L} \to +\infty$.
- Symmetric for $y = 0$.
The more confident the model is in the right answer, the smaller the loss; the more confident it is in the wrong answer, the more the loss explodes. This asymmetry is what pushes the model to learn calibrated probabilities.
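The "surprise" reading is easy to verify numerically; a small sketch of binary cross-entropy for a single example:

```python
import math

def bce(y: int, y_hat: float) -> float:
    # binary cross-entropy for one example:
    # -[y log(y_hat) + (1 - y) log(1 - y_hat)]
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

confident_right = bce(1, 0.99)  # tiny loss
unsure          = bce(1, 0.5)   # moderate loss, log(2)
confident_wrong = bce(1, 0.01)  # the loss explodes
```

Confident and right costs almost nothing; confident and wrong costs hundreds of times more.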
Derivation from Bernoulli
Why this exact formula? It comes from maximum likelihood. If we model $y$ as a Bernoulli variable with parameter $\hat{y}$:

$$P(y \mid x) = \hat{y}^{\,y} (1 - \hat{y})^{1 - y}$$

The likelihood of the observations is the product over the dataset. Take the negative log (to get a function to minimise) and we land exactly on cross-entropy.
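Written out for $n$ independent observations, the step from likelihood to loss is:

$$-\log \prod_{i=1}^{n} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i} = -\sum_{i=1}^{n} \left[\, y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \,\right]$$

which is exactly the cross-entropy summed over the dataset.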
The gradient keeps the same form
This is the magical moment of the chapter. For the linear neuron:

$$\frac{\partial \mathcal{L}}{\partial w_j} = (\hat{y} - y)\, x_j$$

For the logistic neuron, after computation:

$$\frac{\partial \mathcal{L}}{\partial w_j} = (\hat{y} - y)\, x_j$$

Strictly the same formula. The only difference is in the definition of $\hat{y}$:

- linear: $\hat{y} = w^\top x + b$
- logistic: $\hat{y} = \sigma(w^\top x + b)$
Practical consequence: to turn a `LinearNeuron` into a `LogisticNeuron`, just change `forward()` to apply the sigmoid. Everything else stays identical.
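To make that claim concrete, here is a minimal from-scratch sketch (hypothetical class name, plain Python lists instead of tensors): the gradient step uses the shared formula $(\hat{y} - y)\,x_j$, and the sigmoid in `forward()` is the only classification-specific line.

```python
import math

class LogisticNeuron:
    """Linear neuron + sigmoid. Only forward() differs from a linear neuron."""

    def __init__(self, m: int):
        self.w = [0.0] * m
        self.b = 0.0

    def forward(self, x):
        z = sum(wj * xj for wj, xj in zip(self.w, x)) + self.b
        return 1.0 / (1.0 + math.exp(-z))  # <- the only change vs linear

    def grad_step(self, x, y, lr=0.1):
        # same gradient form as the linear neuron:
        # dL/dw_j = (y_hat - y) * x_j,  dL/db = (y_hat - y)
        err = self.forward(x) - y
        self.w = [wj - lr * err * xj for wj, xj in zip(self.w, x)]
        self.b -= lr * err

# tiny demo: separate two opposite points
neuron = LogisticNeuron(2)
for _ in range(200):
    neuron.grad_step([1.0, 1.0], 1)
    neuron.grad_step([-1.0, -1.0], 0)
p_pos = neuron.forward([1.0, 1.0])    # should approach 1
p_neg = neuron.forward([-1.0, -1.0])  # should approach 0
```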
PyTorch version
```python
model = nn.Linear(m, 1)                     # just the logits, no Sigmoid
criterion = nn.BCEWithLogitsLoss()          # applies sigmoid + cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
```
Classic pitfalls
PyTorch offers several model + loss combinations that look similar but aren't equivalent:
| If the loss is... | The model must output... |
|---|---|
| `nn.BCELoss` | a probability (with `Sigmoid` at the end) |
| `nn.BCEWithLogitsLoss` | a logit (no `Sigmoid`) |
| `nn.CrossEntropyLoss` (multiclass) | a vector of logits (no `Softmax`) |
:::warning Pitfall #1
Never put `Sigmoid` in the model AND use `BCEWithLogitsLoss`. The sigmoid would be applied twice, and the model wouldn't learn.
:::
`BCEWithLogitsLoss` is recommended: it is numerically more stable than `Sigmoid` + `BCELoss`, because it fuses the sigmoid and the log into a single log-sum-exp computation.
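The instability is easy to provoke. In the sketch below (assuming a recent PyTorch), a saturated sigmoid kills the gradient on the `BCELoss` path, while `BCEWithLogitsLoss` keeps a usable learning signal:

```python
import torch
import torch.nn as nn

target = torch.tensor([[0.0]])

# stable path: feed the raw logit to BCEWithLogitsLoss
logit_a = torch.tensor([[100.0]], requires_grad=True)
nn.BCEWithLogitsLoss()(logit_a, target).backward()

# unstable path: sigmoid(100) rounds to exactly 1.0 in float32,
# its local derivative sigma * (1 - sigma) is 0, and the gradient dies
logit_b = torch.tensor([[100.0]], requires_grad=True)
nn.BCELoss()(torch.sigmoid(logit_b), target).backward()

grad_stable = logit_a.grad.item()  # ~1.0: learning signal survives
grad_dead = logit_b.grad.item()    # 0.0: no learning signal
```

A model stuck in this saturated regime with `Sigmoid` + `BCELoss` receives zero gradient and cannot recover.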
Predicting the class
After training, to go from logit to class:
```python
model.eval()
with torch.no_grad():
    logits = model(X_test_t)
    proba = torch.sigmoid(logits)    # logit → probability
    y_hat = (proba >= 0.5).float()   # probability → class
```
Multiclass: softmax + CrossEntropyLoss
For $C$ classes, the last layer has $C$ neurons and produces a vector of logits $z = (z_1, \dots, z_C)$.

The generalisation of the sigmoid is the softmax:

$$\hat{y}_k = \text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{C} e^{z_j}}$$

All probabilities are positive and sum to 1.
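A minimal softmax sketch with only the standard library, including the usual max-subtraction trick for numerical stability (subtracting a constant from every logit does not change the result):

```python
import math

def softmax(z):
    # shift by the max for numerical stability, then normalise
    m = max(z)
    exps = [math.exp(zk - m) for zk in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # positive values summing to 1
```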
The multiclass cross-entropy is simply:

$$\mathcal{L} = -\log \hat{y}_c$$

minus the log of the predicted probability for the true class $c$.
```python
model = nn.Sequential(
    nn.Linear(m, 64),
    nn.ReLU(),
    nn.Linear(64, C),                  # multiclass logits
)
criterion = nn.CrossEntropyLoss()      # applies LogSoftmax + cross-entropy
```
Expected shape of y
| Task | Model output | y shape | Type | Loss |
|---|---|---|---|---|
| Regression | (n, 1) | (n, 1) | float32 | MSELoss |
| Binary classification | (n, 1) | (n, 1) | float32 | BCEWithLogitsLoss |
| Multiclass classification | (n, C) | (n,) | long | CrossEntropyLoss |
:::warning Multiclass y
For `CrossEntropyLoss`, `y` is a 1D vector of class indices (not one-hot). Type `long`, not `float`.
:::
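A quick shape-and-dtype check (assuming PyTorch is available; `n` and `C` are arbitrary here):

```python
import torch
import torch.nn as nn

n, C = 4, 3
logits = torch.randn(n, C)       # model output: one logit per class, shape (n, C)
y = torch.tensor([0, 2, 1, 2])   # class indices, shape (n,), dtype long, NOT one-hot
loss = nn.CrossEntropyLoss()(logits, y)

ok_dtype = y.dtype == torch.long
ok_scalar = loss.dim() == 0      # the loss reduces to a scalar by default
```

Passing a one-hot float tensor of shape `(n, C)` here, or forgetting `long`, is one of the most common multiclass bugs.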
Predicting the class
```python
with torch.no_grad():
    logits = model(X_test_t)
    y_hat = torch.argmax(logits, dim=1)  # most probable class
```
argmax is taken over the class dimension. No need to apply softmax explicitly: softmax is strictly increasing in each logit, so it preserves the ordering.
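That order-preservation is easy to verify on random logits:

```python
import torch

logits = torch.randn(8, 5)  # 8 examples, 5 classes
same = torch.equal(
    torch.argmax(logits, dim=1),
    torch.argmax(torch.softmax(logits, dim=1), dim=1),
)
# the predicted class is identical with or without the softmax
```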