Deep Learning 2 — Classification
:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →
French and Chinese versions are available from the home page.
:::
In the previous chapter, we did regression — predicting a number. Here we move to classification — predicting a category. The big surprise of this chapter, and its beauty: going from one to the other requires almost nothing. The gradient keeps the same form.
Why this chapter?
You'll see:
- the logistic neuron (linear regression + sigmoid);
- the intuitive explanation of cross-entropy and its derivation from the Bernoulli distribution;
- the fact that the gradient keeps the same form;
- the PyTorch pitfalls: `BCELoss` vs `BCEWithLogitsLoss` vs `CrossEntropyLoss`;
- multiclass classification: softmax + `CrossEntropyLoss`.
From linear to logistic
For binary classification, the target is $y \in \{0, 1\}$. We want a probability $\hat{y} \approx P(y = 1 \mid x)$, so an output in $(0, 1)$.

Solution: apply the sigmoid function to the linear output:

$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^\top x + b$$

$\sigma$ squashes any real $z$ to $(0, 1)$. This is the logistic neuron.
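A quick numeric check of the squashing behaviour, as a minimal sketch using only the standard library:

```python
import math

def sigmoid(z: float) -> float:
    # sigma(z) = 1 / (1 + exp(-z)): maps any real to (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# large negative -> near 0, zero -> 0.5, large positive -> near 1
outputs = [sigmoid(z) for z in (-10.0, 0.0, 10.0)]
```

Whatever the magnitude of the linear output, the result is always a valid probability.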
Cross-entropy as "surprise"
The loss for classification is the cross-entropy:

$$\mathcal{L}(y, \hat{y}) = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]$$
The intuition: cross-entropy measures the "surprise" of the model facing the truth.
- If $y = 1$ and $\hat{y} \to 1$: no surprise, $\mathcal{L} \to 0$.
- If $y = 1$ and $\hat{y} \to 0$: huge surprise, $\mathcal{L} \to +\infty$.
- Symmetric for $y = 0$.
The more confident the model is in the right answer, the smaller the loss; the more confident it is in the wrong answer, the more the loss explodes. This asymmetry is what pushes the model to learn calibrated probabilities.
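The "surprise" reading is easy to verify numerically; a small sketch of binary cross-entropy for a single example:

```python
import math

def bce(y: int, y_hat: float) -> float:
    # binary cross-entropy for one example:
    # -[y log(y_hat) + (1 - y) log(1 - y_hat)]
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

confident_right = bce(1, 0.99)  # tiny loss
unsure          = bce(1, 0.5)   # moderate loss, log(2)
confident_wrong = bce(1, 0.01)  # the loss explodes
```

Confident and right costs almost nothing; confident and wrong costs hundreds of times more.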
Derivation from Bernoulli
Why this exact formula? It comes from maximum likelihood. If we model $y$ as a Bernoulli variable with parameter $\hat{y}$:

$$P(y \mid x) = \hat{y}^{\,y} (1 - \hat{y})^{1 - y}$$

The likelihood of the observations is the product over the dataset. Take the negative log (to get a function to minimise) and we land exactly on cross-entropy.
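Written out for $n$ independent observations, the step from likelihood to loss is:

$$-\log \prod_{i=1}^{n} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i} = -\sum_{i=1}^{n} \left[\, y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \,\right]$$

which is exactly the cross-entropy summed over the dataset.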
The gradient keeps the same form
This is the magical moment of the chapter. For the linear neuron:

$$\frac{\partial \mathcal{L}}{\partial w_j} = (\hat{y} - y)\, x_j$$

For the logistic neuron, after computation:

$$\frac{\partial \mathcal{L}}{\partial w_j} = (\hat{y} - y)\, x_j$$

Strictly the same formula. The only difference is in the definition of $\hat{y}$:

- linear: $\hat{y} = w^\top x + b$
- logistic: $\hat{y} = \sigma(w^\top x + b)$
Practical consequence: to turn a `LinearNeuron` into a `LogisticNeuron`, just change `forward()` to apply the sigmoid. Everything else stays identical.
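To make that claim concrete, here is a minimal from-scratch sketch (hypothetical class name, plain Python lists instead of tensors): the gradient step uses the shared formula $(\hat{y} - y)\,x_j$, and the sigmoid in `forward()` is the only classification-specific line.

```python
import math

class LogisticNeuron:
    """Linear neuron + sigmoid. Only forward() differs from a linear neuron."""

    def __init__(self, m: int):
        self.w = [0.0] * m
        self.b = 0.0

    def forward(self, x):
        z = sum(wj * xj for wj, xj in zip(self.w, x)) + self.b
        return 1.0 / (1.0 + math.exp(-z))  # <- the only change vs linear

    def grad_step(self, x, y, lr=0.1):
        # same gradient form as the linear neuron:
        # dL/dw_j = (y_hat - y) * x_j,  dL/db = (y_hat - y)
        err = self.forward(x) - y
        self.w = [wj - lr * err * xj for wj, xj in zip(self.w, x)]
        self.b -= lr * err

# tiny demo: separate two opposite points
neuron = LogisticNeuron(2)
for _ in range(200):
    neuron.grad_step([1.0, 1.0], 1)
    neuron.grad_step([-1.0, -1.0], 0)
p_pos = neuron.forward([1.0, 1.0])    # should approach 1
p_neg = neuron.forward([-1.0, -1.0])  # should approach 0
```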
PyTorch version
```python
model = nn.Linear(m, 1)                     # just the logits, no Sigmoid
criterion = nn.BCEWithLogitsLoss()          # applies sigmoid + cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
```
Classic pitfalls
PyTorch offers several model + loss combinations that look similar but aren't equivalent:
| If the loss is... | The model must output... |
|---|---|
| `nn.BCELoss` | a probability (with `Sigmoid` at the end) |
| `nn.BCEWithLogitsLoss` | a logit (no `Sigmoid`) |
| `nn.CrossEntropyLoss` (multiclass) | a vector of logits (no `Softmax`) |
:::warning Pitfall #1
Never put `Sigmoid` in the model AND use `BCEWithLogitsLoss`. The sigmoid would be applied twice, and the model wouldn't learn.
:::
`BCEWithLogitsLoss` is recommended: it is numerically more stable than `Sigmoid` + `BCELoss`, because it fuses the sigmoid and the log into a single log-sum-exp computation.
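The instability is easy to provoke. In the sketch below (assuming a recent PyTorch), a saturated sigmoid kills the gradient on the `BCELoss` path, while `BCEWithLogitsLoss` keeps a usable learning signal:

```python
import torch
import torch.nn as nn

target = torch.tensor([[0.0]])

# stable path: feed the raw logit to BCEWithLogitsLoss
logit_a = torch.tensor([[100.0]], requires_grad=True)
nn.BCEWithLogitsLoss()(logit_a, target).backward()

# unstable path: sigmoid(100) rounds to exactly 1.0 in float32,
# its local derivative sigma * (1 - sigma) is 0, and the gradient dies
logit_b = torch.tensor([[100.0]], requires_grad=True)
nn.BCELoss()(torch.sigmoid(logit_b), target).backward()

grad_stable = logit_a.grad.item()  # ~1.0: learning signal survives
grad_dead = logit_b.grad.item()    # 0.0: no learning signal
```

A model stuck in this saturated regime with `Sigmoid` + `BCELoss` receives zero gradient and cannot recover.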
Predicting the class
After training, to go from logit to class:
```python
model.eval()
with torch.no_grad():
    logits = model(X_test_t)
    proba = torch.sigmoid(logits)    # logit → probability
    y_hat = (proba >= 0.5).float()   # probability → class
```
Multiclass: softmax + CrossEntropyLoss
For $C$ classes, the last layer has $C$ neurons and produces a vector of logits $z = (z_1, \dots, z_C)$.

The generalisation of the sigmoid is the softmax:

$$\hat{y}_k = \text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{C} e^{z_j}}$$

All probabilities are positive and sum to 1.
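A minimal softmax sketch with only the standard library, including the usual max-subtraction trick for numerical stability (subtracting a constant from every logit does not change the result):

```python
import math

def softmax(z):
    # shift by the max for numerical stability, then normalise
    m = max(z)
    exps = [math.exp(zk - m) for zk in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # positive values summing to 1
```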
The multiclass cross-entropy is simply:

$$\mathcal{L} = -\log \hat{y}_c$$

minus the log of the predicted probability for the true class $c$.
```python
model = nn.Sequential(
    nn.Linear(m, 64),
    nn.ReLU(),
    nn.Linear(64, C),                  # multiclass logits
)
criterion = nn.CrossEntropyLoss()      # applies LogSoftmax + cross-entropy
```
Expected shape of y
| Task | Model output | y shape | Type | Loss |
|---|---|---|---|---|
| Regression | (n, 1) | (n, 1) | float32 | MSELoss |
| Binary classification | (n, 1) | (n, 1) | float32 | BCEWithLogitsLoss |
| Multiclass classification | (n, C) | (n,) | long | CrossEntropyLoss |
:::warning Multiclass y
For `CrossEntropyLoss`, `y` is a 1D vector of class indices (not one-hot). Type `long`, not `float`.
:::
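A quick shape-and-dtype check (assuming PyTorch is available; `n` and `C` are arbitrary here):

```python
import torch
import torch.nn as nn

n, C = 4, 3
logits = torch.randn(n, C)       # model output: one logit per class, shape (n, C)
y = torch.tensor([0, 2, 1, 2])   # class indices, shape (n,), dtype long, NOT one-hot
loss = nn.CrossEntropyLoss()(logits, y)

ok_dtype = y.dtype == torch.long
ok_scalar = loss.dim() == 0      # the loss reduces to a scalar by default
```

Passing a one-hot float tensor of shape `(n, C)` here, or forgetting `long`, is one of the most common multiclass bugs.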
Predicting the class
```python
with torch.no_grad():
    logits = model(X_test_t)
    y_hat = torch.argmax(logits, dim=1)  # most probable class
```
argmax is taken over the class dimension. No need to apply softmax explicitly: softmax is strictly increasing in each logit, so it preserves the ordering.
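That order-preservation is easy to verify on random logits:

```python
import torch

logits = torch.randn(8, 5)  # 8 examples, 5 classes
same = torch.equal(
    torch.argmax(logits, dim=1),
    torch.argmax(torch.softmax(logits, dim=1), dim=1),
)
# the predicted class is identical with or without the softmax
```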