
Deep Learning 2 — Classification

:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →

French and Chinese versions are available from the home page.
:::

In the previous chapter, we did regression — predicting a number. Here we move to classification — predicting a category. The big surprise of this chapter, and its beauty: going from one to the other requires almost nothing. The gradient keeps the same form.

Why this chapter?

You'll see:

  • the logistic neuron (linear regression + sigmoid);
  • the intuitive explanation of cross-entropy and its derivation from the Bernoulli distribution;
  • the fact that the gradient keeps the same form $\frac{1}{n} X^T (u - y)$;
  • the PyTorch pitfalls: BCELoss vs BCEWithLogitsLoss vs CrossEntropyLoss;
  • multiclass classification: softmax + CrossEntropyLoss.

From linear to logistic

For binary classification, the target is $y \in \{0, 1\}$. We want a probability $\hat{p} = P(y = 1 \mid X)$, so an output in $(0, 1)$.

Solution: apply the sigmoid function to the linear output.

$$z = X w + b, \quad u = \sigma(z) = \frac{1}{1 + e^{-z}}$$

$\sigma(z)$ squashes any real number into $(0, 1)$. This is the logistic neuron.
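
A quick check of the squashing behaviour (a minimal sketch using torch):

```python
import torch

z = torch.tensor([-5.0, 0.0, 5.0])
print(torch.sigmoid(z))  # tensor([0.0067, 0.5000, 0.9933]): everything lands in (0, 1)
```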

Cross-entropy as "surprise"

The loss for binary classification is the cross-entropy:

$$E_i = -\big[ y_i \log u_i + (1 - y_i) \log(1 - u_i) \big]$$

The intuition: cross-entropy measures the "surprise" of the model facing the truth.

  • If $y_i = 1$ and $u_i \to 1$: no surprise, $E_i = -\log 1 = 0$.
  • If $y_i = 1$ and $u_i \to 0$: huge surprise, $E_i \to +\infty$.
  • Symmetric for $y_i = 0$.

The more confident the model is in the right answer, the smaller the loss. The more confident it is in the wrong answer, the more the loss explodes. This asymmetry is what pushes the model to learn calibrated probabilities.
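
To make this concrete, a small numeric check for $y = 1$ (note that binary_cross_entropy expects probabilities, not logits):

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])
for u in (0.99, 0.6, 0.01):
    loss = F.binary_cross_entropy(torch.tensor([u]), y)
    print(f"u = {u}: loss = {loss.item():.2f}")
# u = 0.99: loss = 0.01  (confident and right: almost no surprise)
# u = 0.6:  loss = 0.51  (hesitant: moderate surprise)
# u = 0.01: loss = 4.61  (confident and wrong: the loss explodes)
```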

Derivation from Bernoulli

Why this exact formula? It comes from maximum likelihood. If we model $y$ as a Bernoulli variable with parameter $p$:

$$P(y \mid p) = p^y (1-p)^{1-y}$$

The likelihood of the observations is the product of these per-example probabilities. Take the negative log (to get a function to minimise) and we land exactly on cross-entropy.
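
Spelled out, with $u_i$ the predicted probability for example $i$:

$$-\log \prod_{i=1}^{n} u_i^{y_i} (1 - u_i)^{1 - y_i} = -\sum_{i=1}^{n} \big[ y_i \log u_i + (1 - y_i) \log(1 - u_i) \big] = \sum_{i=1}^{n} E_i$$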

The gradient keeps the same form

This is the magical moment of the chapter. For the linear neuron:

$$\frac{\partial E}{\partial w} = \frac{1}{n} X^T (u - y)$$

For the logistic neuron, the computation looks worse at first, but the sigmoid's derivative $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$ cancels exactly against the denominators coming from the logs in the cross-entropy, and we get:

$$\frac{\partial E}{\partial w} = \frac{1}{n} X^T (u - y)$$

Strictly the same formula. The only difference is in the definition of $u$:

  • linear: $u = X w + b$
  • logistic: $u = \sigma(X w + b)$

Practical consequence: to turn a LinearNeuron into a LogisticNeuron, just change forward() to apply sigmoid. Everything else stays identical.
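
A minimal sketch of that change (the LinearNeuron below is a stand-in with the interface assumed from the previous chapter; the real class may differ):

```python
import torch

class LinearNeuron:
    """Stand-in for the previous chapter's linear neuron (assumed interface)."""
    def __init__(self, m):
        self.w = torch.zeros(m, 1)
        self.b = torch.zeros(1)

    def forward(self, X):
        return X @ self.w + self.b  # u = Xw + b

class LogisticNeuron(LinearNeuron):
    def forward(self, X):
        return torch.sigmoid(super().forward(X))  # u = sigma(Xw + b); the only change
```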

PyTorch version

```python
import torch
from torch import nn

model = nn.Linear(m, 1)             # just the logits, no Sigmoid (m = number of features)
criterion = nn.BCEWithLogitsLoss()  # applies sigmoid + cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
```
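
For completeness, a minimal full-batch training loop with this pairing (X_t and y_t are hypothetical float32 tensors of shapes (n, m) and (n, 1)):

```python
for epoch in range(100):
    logits = model(X_t)            # raw logits; no sigmoid in the model
    loss = criterion(logits, y_t)  # the sigmoid happens inside BCEWithLogitsLoss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```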

Classic pitfalls

PyTorch offers several model + loss combinations that look similar but aren't equivalent:

| If the loss is... | The model must output... |
| --- | --- |
| nn.BCELoss | a probability (with Sigmoid at the end) |
| nn.BCEWithLogitsLoss | a logit (no Sigmoid) |
| nn.CrossEntropyLoss (multiclass) | a vector of logits (no Softmax) |

:::warning Pitfall #1
Never put Sigmoid in the model AND use BCEWithLogitsLoss. The sigmoid would be applied twice, and the model wouldn't learn.
:::

BCEWithLogitsLoss is the recommended choice: it is numerically more stable than Sigmoid followed by BCELoss, because it computes the loss directly from the logit instead of passing through a saturated sigmoid.
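
A quick way to see the instability (a small sketch; the extreme logit is a made-up value chosen to make float32 sigmoid underflow):

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-200.0])  # extreme logit
p = torch.sigmoid(z)        # underflows to exactly 0.0 in float32
print(torch.log(p))         # tensor([-inf]): naive sigmoid + log blows up

# the fused version works on the logit directly and stays finite
print(F.binary_cross_entropy_with_logits(z, torch.tensor([1.0])))  # tensor(200.)
```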

Predicting the class

After training, to go from logit to class:

```python
model.eval()
with torch.no_grad():
    logits = model(X_test_t)
    proba = torch.sigmoid(logits)   # logit → probability
    y_hat = (proba >= 0.5).float()  # probability → class
```

Multiclass: softmax + CrossEntropyLoss

For $C$ classes, the last layer has $C$ neurons and produces a vector of logits:

$$z = (z_0, z_1, \dots, z_{C-1})$$

The generalisation of sigmoid is the softmax:

$$P(y = c \mid X) = \frac{e^{z_c}}{\sum_k e^{z_k}}$$

All probabilities are positive and sum to 1.

The multiclass cross-entropy is simply:

$$E_i = -\log P(y = y_i \mid X_i)$$

Minus the log of the predicted probability for the true class.
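
To see both formulas at once, a small sketch comparing the hand computation with PyTorch's built-in (the logits and label below are made-up values):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # one example, C = 3 classes
y = torch.tensor([0])                      # true class: 0

# by hand: softmax, then minus the log of the true-class probability
p = torch.softmax(logits, dim=1)
manual = -torch.log(p[0, y[0]])

# built-in: works directly on the raw logits
builtin = F.cross_entropy(logits, y)
print(manual.item(), builtin.item())       # same value
```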

```python
model = nn.Sequential(
    nn.Linear(m, 64),
    nn.ReLU(),
    nn.Linear(64, C),              # multiclass logits
)
criterion = nn.CrossEntropyLoss()  # applies LogSoftmax + cross-entropy
```

Expected shape of y

| Task | Model output | y shape | y dtype | Loss |
| --- | --- | --- | --- | --- |
| Regression | (n, 1) | (n, 1) | float32 | MSELoss |
| Binary classification | (n, 1) | (n, 1) | float32 | BCEWithLogitsLoss |
| Multiclass classification | (n, C) | (n,) | long | CrossEntropyLoss |

:::warning Multiclass y
For CrossEntropyLoss, $y$ is a 1D vector of class indices (not one-hot). Type long, not float.
:::
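
A small sketch of the expected target format (the labels are made-up values):

```python
import torch
import torch.nn.functional as F

y = torch.tensor([2, 0, 1])      # shape (n,), dtype torch.int64 ("long")
print(y.shape, y.dtype)          # torch.Size([3]) torch.int64

# if your labels arrive one-hot, convert them back to indices:
y_onehot = F.one_hot(y, num_classes=3)
y_back = y_onehot.argmax(dim=1)  # (n,) long again
```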

Predicting the class (multiclass)

```python
with torch.no_grad():
    logits = model(X_test_t)
    y_hat = torch.argmax(logits, dim=1)  # most probable class
```

argmax over the class dimension: no need to apply softmax explicitly, since softmax is monotonically increasing and therefore preserves the ordering of the logits.


Full notebook on Kaggle (forkable) →