Deep Learning 2 — Classification
In the previous chapter we built a single artificial neuron that performed linear regression. We derived the gradient of the mean squared error by hand, implemented gradient descent from scratch, and then rebuilt the same model in PyTorch using nn.Linear and an MSE loss. The neuron produced a real number, an unbounded prediction of a continuous quantity.
We now turn to the second great family of supervised problems: classification. The target is no longer a real number but a category — malignant or benign tumor, Adelie or Chinstrap or Gentoo penguin. The output of the model must therefore live in a different space, and the loss function used during the previous chapter is no longer appropriate. The objective of this chapter is to extend the linear neuron into a logistic neuron for binary classification, derive its gradient, reimplement it from scratch and then in PyTorch, and finally generalize the construction to multilayer networks and to multiclass problems with the softmax/cross-entropy pair.
By the end of the chapter we will have a clear view of the four building blocks that classification adds to the regression toolbox: the sigmoid (and its multiclass cousin, the softmax), the binary cross-entropy (and its multiclass cousin, the categorical cross-entropy), the convention that PyTorch loss functions consume logits rather than probabilities, and the way these objects compose with the multilayer skeleton inherited from the previous chapter.
Binary classification and the logistic neuron
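In binary classification the target variable takes only two values:
$$y \in \{0, 1\}$$
A naive idea would be to ask the model to output exactly $0$ or $1$. This is the perceptron rule:
$$\hat{y} = \begin{cases} 1 & \text{if } \mathbf{w}^\top \mathbf{x} + b \ge 0 \\ 0 & \text{otherwise} \end{cases}$$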
The perceptron is conceptually attractive — it produces a hard decision and a clean linear boundary — but it has two defects that disqualify it in modern practice. First, the output is not a probability: we have no way to express "I am 90% confident that this tumor is malignant" rather than "I am 51% confident". Second, the step function is not differentiable, so we cannot train it with gradient descent. We need a smoother, probabilistic version of this threshold.
Probabilistic modeling: the Bernoulli law
A binary variable is naturally modeled by a Bernoulli law. For a parameter $p \in [0, 1]$:
$$P(y = 1) = p, \qquad P(y = 0) = 1 - p$$
The two cases can be merged into a single compact expression:
$$P(y) = p^{y} (1 - p)^{1 - y}$$
This formula is the workhorse of binary probabilistic modeling: when $y = 1$ it returns $p$, when $y = 0$ it returns $1 - p$. The remaining task is to learn $p$ from data and, more precisely, to let $p$ depend on the input $\mathbf{x}$, since we expect the probability of being malignant to depend on the features of the tumor.
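The compact expression can be checked directly in a few lines. This is a minimal sketch; the function name bernoulli_pmf and the value p = 0.9 are illustrative choices, not part of the chapter's code.

```python
def bernoulli_pmf(y: int, p: float) -> float:
    """Compact Bernoulli formula p^y * (1 - p)^(1 - y), for y in {0, 1}."""
    return (p ** y) * ((1.0 - p) ** (1 - y))


p = 0.9                          # hypothetical probability of the positive class
prob_one = bernoulli_pmf(1, p)   # the exponent of (1 - p) is 0, so this is p
prob_zero = bernoulli_pmf(0, p)  # the exponent of p is 0, so this is 1 - p
print(prob_one, prob_zero)
```

The exponents act as switches: exactly one of the two factors is raised to the power 1, the other to the power 0.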
From probability to likelihood
In supervised learning the data are observed and fixed. The probability $P(y)$ is therefore reinterpreted as a function of the parameters of the model, not of the data. We call this function the likelihood.
For a single observation:
$$\mathcal{L}(p) = p^{y} (1 - p)^{1 - y}$$
For a dataset of $n$ independent observations:
$$\mathcal{L} = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$
where $p_i$ is the probability of the positive class assigned by the model to example $i$. The principle of maximum likelihood states that we should choose the parameters that make the observed data as plausible as possible, in other words, that maximize $\mathcal{L}$.
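To make the maximum likelihood principle concrete, consider the simplest possible model, in which the same probability $p$ is used for every example. The sketch below scans a grid of candidate values and keeps the one with the highest likelihood; the eight observations are made up for illustration.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # hypothetical binary observations
grid = np.linspace(0.01, 0.99, 99)       # candidate values of p, step 0.01

# Likelihood of the whole dataset for each candidate p (product over examples).
likelihood = np.array([np.prod(p ** y * (1 - p) ** (1 - y)) for p in grid])

p_best = grid[np.argmax(likelihood)]
print(p_best, y.mean())
```

For this constant-probability model the maximizer coincides with the observed fraction of positive examples, which is what the grid search recovers.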
Predicting a probability with a neuron
To turn $p$ into a function of $\mathbf{x}$, we reuse the affine combination from the previous chapter:
$$z = \mathbf{w}^\top \mathbf{x} + b = \sum_{j=1}^{m} w_j x_j + b$$
This quantity is a real number, not a probability. We squash it into the unit interval with the logistic function, also called the sigmoid:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
The sigmoid is monotone increasing, takes values in $(0, 1)$, equals $1/2$ at $z = 0$, and saturates exponentially fast at $0$ and $1$ for large negative and large positive inputs respectively. Composing the affine layer with the sigmoid gives the logistic neuron:
$$u = \sigma(z) = \sigma(\mathbf{w}^\top \mathbf{x} + b)$$
The logistic neuron is no longer a class predictor but a probability estimator. The hard decision is recovered afterwards by thresholding at, for example, $0.5$.
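The forward pass of the logistic neuron fits in a few lines of NumPy. The weights, bias and inputs below are arbitrary values chosen so that the three rows land at the center and in the two tails of the sigmoid; they are assumptions for illustration only.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))


w = np.array([2.0, -1.0])      # hypothetical weights
b = 0.5                        # hypothetical bias
X = np.array([[0.0, 0.5],      # z = 0.0
              [3.0, 1.0],      # z = 5.5
              [-3.0, 1.0]])    # z = -6.5

z = X @ w + b                  # affine combination, one value per row
u = sigmoid(z)                 # estimated probability of the positive class
y_hat = (u >= 0.5).astype(int) # hard decision by thresholding at 0.5
print(z, u, y_hat)
```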
Likelihood, log-likelihood and binary cross-entropy
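Substituting the neuron output $u_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)$ into the Bernoulli likelihood gives:
$$\mathcal{L}(\mathbf{w}, b) = \prod_{i=1}^{n} u_i^{y_i} (1 - u_i)^{1 - y_i}$$
This product of probabilities is numerically dangerous: with a thousand examples each contributing a factor smaller than $1$, the result quickly underflows to zero. We therefore work with the log-likelihood, which has the same maximizer but is a sum rather than a product:
$$\log \mathcal{L}(\mathbf{w}, b) = \sum_{i=1}^{n} \left[ y_i \log u_i + (1 - y_i) \log (1 - u_i) \right]$$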
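The underflow is easy to reproduce. The sketch below uses 5000 made-up likelihood factors drawn between 0.4 and 0.6 (an assumption chosen so that the product is far below the smallest positive float64) and compares the raw product with the sum of logarithms.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(0.4, 0.6, size=5000)  # hypothetical per-example factors

product = np.prod(probs)           # too small for float64: underflows to 0.0
log_sum = np.sum(np.log(probs))    # same information, but a finite number

print(product, log_sum)
```

Once the product has collapsed to zero, every parameter setting looks equally bad; the sum of logs keeps the comparison meaningful.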
By convention, machine learning algorithms minimize a cost rather than maximize a likelihood. We therefore flip the sign and average over the $n$ examples to obtain the binary cross-entropy:
$$E(\mathbf{w}, b) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log u_i + (1 - y_i) \log (1 - u_i) \right]$$
Three names, one object. The function above is variously called binary cross-entropy, log-loss, or negative log-likelihood of a Bernoulli model. All three names refer to the same expression. Whichever the textbook uses, you should recognize the same formula underneath.
This loss replaces the MSE of the linear neuron. It is finite and differentiable for $u \in (0, 1)$, it equals $0$ when the model assigns probability $1$ to the correct class, and it diverges to $+\infty$ when the model assigns probability $0$ to the correct class: a very strong penalty on confident mistakes.
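A direct NumPy transcription of the binary cross-entropy looks as follows. This is a sketch, not the chapter's reference implementation: the function name and the clipping constant eps = 1e-12 are assumptions, the latter being one common way to keep the logarithms finite when a prediction is exactly 0 or 1.

```python
import numpy as np


def binary_cross_entropy(y, u, eps=1e-12):
    """Mean binary cross-entropy between targets y and probabilities u."""
    u = np.clip(u, eps, 1.0 - eps)   # keep log() away from 0
    return -np.mean(y * np.log(u) + (1 - y) * np.log(1 - u))


y = np.array([1.0, 0.0, 1.0])
confident_right = binary_cross_entropy(y, np.array([0.99, 0.01, 0.99]))
confident_wrong = binary_cross_entropy(y, np.array([0.01, 0.99, 0.01]))
saturated_wrong = binary_cross_entropy(y, np.array([0.0, 1.0, 0.0]))

print(confident_right, confident_wrong, saturated_wrong)
```

The loss is small for confident correct predictions, large for confident wrong ones, and stays finite even in the fully saturated case thanks to the clipping.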
Gradient of the logistic neuron
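To train the logistic neuron we need the gradient of $E$ with respect to $\mathbf{w}$ and $b$. Take a single example and trace the computation through three blocks:
$$(\mathbf{x}, \mathbf{w}, b) \;\longrightarrow\; z \;\longrightarrow\; u \;\longrightarrow\; E$$
with
$$z = \sum_{j=1}^{m} w_j x_j + b, \qquad u = \sigma(z), \qquad E = -\left[ y \log u + (1 - y) \log (1 - u) \right]$$
The chain rule gives, for a single weight $w_j$:
$$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial u} \cdot \frac{\partial u}{\partial z} \cdot \frac{\partial z}{\partial w_j}$$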
The three factors correspond to loss vs. output, activation vs. pre-activation, and pre-activation vs. weight.
Computing each factor
Factor 1: derivative of the loss with respect to $u$. Differentiating term by term:
$$\frac{\partial E}{\partial u} = -\frac{y}{u} + \frac{1 - y}{1 - u}$$
The first term comes from $-y \log u$, the second from $-(1 - y) \log (1 - u)$ via the chain rule on $1 - u$.
Factor 2: derivative of the sigmoid. A classical identity states that the derivative of the sigmoid can be written using the sigmoid itself:
$$\frac{\partial u}{\partial z} = \sigma'(z) = \sigma(z) \left( 1 - \sigma(z) \right) = u (1 - u)$$
This is one of the great practical advantages of the sigmoid: once the forward pass has computed $u$, the backward pass needs no extra exponentials.
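The identity for the derivative of the sigmoid follows in two lines from the definition $\sigma(z) = 1 / (1 + e^{-z})$. The derivation below is a short sketch of that step:

```latex
\begin{aligned}
\sigma'(z) &= \frac{d}{dz}\left(1 + e^{-z}\right)^{-1}
            = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}} \\
           &= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
            = \sigma(z)\,\bigl(1 - \sigma(z)\bigr),
\end{aligned}
```

where the last equality uses $1 - \sigma(z) = e^{-z} / (1 + e^{-z})$.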
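Factor 3: derivative of the affine layer. Since $z = \sum_{j} w_j x_j + b$:
$$\frac{\partial z}{\partial w_j} = x_j, \qquad \frac{\partial z}{\partial b} = 1$$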
Magic cancellation
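Multiplying the three factors:
$$\frac{\partial E}{\partial w_j} = \left( -\frac{y}{u} + \frac{1 - y}{1 - u} \right) u (1 - u)\, x_j$$
Distributing inside the parentheses:
$$\frac{\partial E}{\partial w_j} = \left[ -y (1 - u) + (1 - y)\, u \right] x_j = \left( -y + yu + u - yu \right) x_j$$
The final result is therefore astonishingly simple:
$$\frac{\partial E}{\partial w_j} = (u - y)\, x_j, \qquad \frac{\partial E}{\partial b} = u - y$$
In vector and batch form:
$$\nabla_{\mathbf{w}} E = \frac{1}{n} X^\top (\mathbf{u} - \mathbf{y}), \qquad \frac{\partial E}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (u_i - y_i)$$
Key formula. For the logistic neuron with binary cross-entropy loss, the gradient takes the same shape as for linear regression with MSE loss: $\nabla_{\mathbf{w}} E = \frac{1}{n} X^\top (\mathbf{u} - \mathbf{y})$. The only difference is that here $u = \sigma(z)$ instead of $u = z$.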
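A hand-derived gradient is worth verifying numerically before it goes into a training loop. The sketch below compares the closed-form expressions X.T @ (u - y) / n and mean(u - y) with central finite differences of the loss; the random data, the step h = 1e-6 and the helper names are assumptions made for this check.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(w, b, X, y):
    """Mean binary cross-entropy of the logistic neuron (w, b) on (X, y)."""
    u = sigmoid(X @ w + b)
    return -np.mean(y * np.log(u) + (1 - y) * np.log(1 - u))


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                      # hypothetical inputs
y = rng.integers(0, 2, size=20).astype(float)     # hypothetical binary targets
w = rng.normal(size=3)
b = 0.1

# Closed-form gradient from the derivation.
u = sigmoid(X @ w + b)
grad_w = X.T @ (u - y) / len(y)
grad_b = np.mean(u - y)

# Central finite differences, one coordinate at a time.
h = 1e-6
num_grad_w = np.array([
    (bce(w + h * e, b, X, y) - bce(w - h * e, b, X, y)) / (2 * h)
    for e in np.eye(3)
])
num_grad_b = (bce(w, b + h, X, y) - bce(w, b - h, X, y)) / (2 * h)

max_err = max(np.max(np.abs(grad_w - num_grad_w)), abs(grad_b - num_grad_b))
print(max_err)
```

If the two gradients disagreed by more than rounding noise, the derivation or its implementation would contain a mistake.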
Why this similarity is not a coincidence
The cancellation we just performed is a special case of a more general phenomenon. Whenever the activation function is the canonical link of the exponential-family distribution we are modeling, the gradient of the negative log-likelihood collapses to "(prediction $-$ target) times input". The MSE/identity pair (Gaussian model) and the cross-entropy/sigmoid pair (Bernoulli model) are two instances of the same scheme; in the multiclass case we will see a third, with cross-entropy and softmax. This is why your training loop barely changes between the linear and the logistic neuron: only the definition of $u$ changes; the rest of the algorithm (gradient assembly, parameter update, batching) is strictly identical.
A logistic neuron from scratch
Armed with this gradient, we can rewrite the LinearNeuron class of the previous chapter into a LogisticNeuron class. The structural changes are minimal:
    import numpy as np

    class LogisticNeuron:
        def __init__(self):
            self.w = None
            self.b = 0.0
            self.history = []

        def forward(self, X):
            # affine combination followed by the sigmoid
            z = X @ self.w + self.b
            return 1.0 / (1.0 + np.exp(-z))

        def predict(self, X, threshold=0.5):
            # hard class labels obtained by thresholding the probabilities
            u = self.forward(X)
            return (u >= threshold).astype(int)

        def fit(self, X, y, lr=0.1, epochs=100, mode="batch", batch_size=32):
            n, m = X.shape
            if self.w is None:
                self.w = np.zeros(m)
            # ... gradient descent loop, identical to the linear neuron,
            # using the formulas dE/dw = X.T @ (u - y) / n and dE/db = mean(u - y)
Three things changed compared to the linear neuron of chapter 1:
- The forward method now applies the sigmoid: 1 / (1 + exp(-z)) instead of z.
- The cost function used to monitor convergence is the binary cross-entropy rather than the MSE. In practice we add a small $\varepsilon$ inside the logs to avoid log(0) when the sigmoid saturates.
- A new predict method thresholds the probabilities at $0.5$ to produce hard class labels.
The gradient assembly (grad_w = X.T @ (u - y) / n, grad_b = (u - y).mean()) and the descent itself (self.w -= lr * grad_w) are character-for-character identical to the regression case. This is the practical payoff of the similarity we proved above.
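For readers who want to see the elided loop written out, here is one possible completion restricted to full-batch gradient descent. It is a sketch under stated assumptions: the class name LogisticNeuronSketch, the clipping constant eps and the synthetic two-blob dataset are illustrative and do not come from chapter 1.

```python
import numpy as np


class LogisticNeuronSketch:
    def __init__(self):
        self.w = None
        self.b = 0.0
        self.history = []

    def forward(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))

    def predict(self, X, threshold=0.5):
        return (self.forward(X) >= threshold).astype(int)

    def fit(self, X, y, lr=0.1, epochs=100, eps=1e-12):
        n, m = X.shape
        if self.w is None:
            self.w = np.zeros(m)
        for _ in range(epochs):
            u = self.forward(X)
            # Monitor the binary cross-entropy (clipped to avoid log(0)).
            uc = np.clip(u, eps, 1.0 - eps)
            self.history.append(-np.mean(y * np.log(uc) + (1 - y) * np.log(1 - uc)))
            # Same gradient assembly and update as in the regression case.
            self.w -= lr * (X.T @ (u - y) / n)
            self.b -= lr * np.mean(u - y)
        return self


# Hypothetical toy data: two Gaussian blobs, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

model = LogisticNeuronSketch().fit(X, y, lr=0.5, epochs=200)
train_acc = np.mean(model.predict(X) == y)
print(train_acc, model.history[0], model.history[-1])
```

With zero initial weights the first recorded loss is $\log 2$, the cross-entropy of a model that predicts $0.5$ everywhere, and it decreases from there.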
A typical use on the cancer_mini dataset reads:
    import pandas as pd
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv(".../cancer_mini.csv")
    X = df.drop(columns=["diagnosis"]).to_numpy()
    y = df["diagnosis"].to_numpy()

    # stratified split, then standardization fitted on the training set only
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    model = LogisticNeuron()
    model.fit(X_train, y_train, mode="batch", epochs=100, lr=0.5)
    y_hat = model.predict(X_test)
    print(accuracy_score(y_test, y_hat))
A few practical reflexes are worth highlighting here. We stratify the train/test split on $y$ so that both subsets preserve the class balance, which matters when the positive class is rare. We standardize the inputs on the training set and apply the same transformation to the test set, never the other way around. And we evaluate with classification metrics (accuracy, confusion matrix, precision, recall) rather than with a regression metric like MSE.
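The reason for looking beyond accuracy shows up as soon as the positive class is rare. The ten labels below are made up for illustration; the metric functions are the standard ones from sklearn.metrics.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical imbalanced test set: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)        # rows: true class, columns: predicted
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print(accuracy, precision, recall)
print(cm)
```

Here the accuracy looks comfortable while only half of the positive cases are actually found, which is exactly the information the confusion matrix and the recall expose.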
The logistic neuron in PyTorch
The next step is to abandon the from-scratch implementation and let PyTorch take care of the autograd machinery. The logistic neuron in PyTorch reuses the nn.Linear layer of the previous chapter; only the loss and the post-processing change.
Two ways of writing it
Option 1 — explicit sigmoid with nn.BCELoss. We can build a model that returns probabilities directly and feed those into the binary cross-entropy:
    import torch
    import torch.nn as nn

    # m is the number of input features
    model = nn.Sequential(
        nn.Linear(m, 1),
        nn.Sigmoid(),
    )
    criterion = nn.BCELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
Here model(X) directly returns $u$ and criterion(u, y) computes the binary cross-entropy.
Option 2 (recommended): raw logits with nn.BCEWithLogitsLoss. We do not put a sigmoid at the end of the model. The model returns the logits $z$, and BCEWithLogitsLoss applies the sigmoid and the cross-entropy together, in a numerically stable way:
    model = nn.Linear(m, 1)
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
Pitfall: BCELoss vs. BCEWithLogitsLoss. BCELoss expects probabilities in $[0, 1]$; BCEWithLogitsLoss expects raw logits in $\mathbb{R}$ and applies the sigmoid internally using the log-sum-exp trick. The second version is the right default: it is numerically more stable, it avoids the "I forgot the sigmoid" bug, and it is faster because the forward and the loss are fused. Never stack nn.Sigmoid followed by BCEWithLogitsLoss: you would apply the sigmoid twice, so the loss would only ever see "logits" confined to $(0, 1)$, that is, probabilities stuck between $0.5$ and about $0.73$.
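The two losses can be compared on a handful of values. The sketch below first checks that they agree on moderate logits, then feeds an extreme logit of -200 with a positive target; the tensors are arbitrary illustrative values. It relies on the behavior described in the PyTorch documentation, where BCELoss clamps its log terms at -100.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[-2.0], [0.5], [3.0]])
targets = torch.tensor([[0.0], [1.0], [1.0]])

# On moderate logits both routes compute the same binary cross-entropy.
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Extreme case: sigmoid(-200) rounds to 0 in float32, so the probability
# route loses the information about how wrong the prediction is.
extreme = torch.tensor([[-200.0]])
positive = torch.tensor([[1.0]])
loss_extreme_logits = nn.BCEWithLogitsLoss()(extreme, positive)
loss_extreme_probs = nn.BCELoss()(torch.sigmoid(extreme), positive)

print(loss_from_probs.item(), loss_from_logits.item())
print(loss_extreme_logits.item(), loss_extreme_probs.item())
```

The logit-based loss keeps growing linearly with the size of the mistake, whereas the probability-based loss saturates at its clamp value.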
Data preparation
As in regression, we convert NumPy arrays to float32 tensors. For binary classification, the target tensor must have shape $(n, 1)$ to match the model output:
    X_train = torch.tensor(X_train, dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
    X_test = torch.tensor(X_test, dtype=torch.float32)
    y_test = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)
Training loop
The training loop is character-for-character identical to the regression loop of chapter 1:
    epochs = 200
    loss_history = []
    for _ in range(epochs):
        optimizer.zero_grad()              # reset accumulated gradients
        logits = model(X_train)            # forward pass
        loss = criterion(logits, y_train)
        loss.backward()                    # backward pass
        optimizer.step()                   # parameter update
        loss_history.append(loss.item())
The only difference is the type of criterion. The four-step ritual zero_grad -> forward -> backward -> step does not change.
From logits to predicted classes
Because the model returns logits and not probabilities, we need an explicit conversion at evaluation time:
    model.eval()
    with torch.no_grad():
        logits = model(X_test)
        proba = torch.sigmoid(logits)
        y_hat = (proba >= 0.5).float()
We then send the predictions back to NumPy and compute the usual sklearn metrics:
    from sklearn.metrics import accuracy_score, confusion_matrix

    y_hat_np = y_hat.cpu().numpy().reshape(-1)
    y_test_np = y_test.cpu().numpy().reshape(-1)
    print("Accuracy:", accuracy_score(y_test_np, y_hat_np))
    print("Confusion matrix:\n", confusion_matrix(y_test_np, y_hat_np))
The model.eval() call disables training-only behaviors such as dropout and batch-norm running-stats updates. In our minimal example it is not strictly necessary, but it is the right reflex to acquire from day one. The with torch.no_grad() context tells autograd not to record operations, which saves memory and time at inference.
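The pieces above (model, loss, loop, evaluation) can be assembled into one self-contained script. Since cancer_mini is not available here, the sketch uses a synthetic two-blob dataset as a stand-in; the data, the seed and the number of epochs are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a binary dataset: two Gaussian blobs.
n, m = 200, 2
X = torch.cat([torch.randn(n // 2, m) - 2.0, torch.randn(n // 2, m) + 2.0])
y = torch.cat([torch.zeros(n // 2, 1), torch.ones(n // 2, 1)])  # shape (n, 1)

model = nn.Linear(m, 1)                      # returns logits
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss_history = []
for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
    loss_history.append(loss.item())

# Evaluation: explicit sigmoid, then threshold, without gradient tracking.
model.eval()
with torch.no_grad():
    proba = torch.sigmoid(model(X))
    y_hat = (proba >= 0.5).float()

accuracy = (y_hat == y).float().mean().item()
print(loss_history[0], loss_history[-1], accuracy)
```

Tensors created inside the torch.no_grad() block carry no autograd history, which can be confirmed by inspecting proba.requires_grad.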
Side-by-side correspondence
Listing the changes between the from-scratch and the PyTorch versions clarifies what autograd actually buys us:
- forward: from-scratch writes $u = \sigma(X \mathbf{w} + b)$ explicitly; in PyTorch, model(X) returns the logit, and torch.sigmoid (or the loss) applies the sigmoid;
- gradient: from-scratch performs the manual chain-rule computation we did above; in PyTorch, loss.backward() does it for us, in any depth and shape;
- update: from-scratch writes w -= lr * grad_w; in PyTorch, optimizer.step() walks the parameter list of the model and applies the rule of the chosen optimizer.
Multilayer networks for classification
Once the single neuron is in place, going multilayer is essentially free. We stack nn.Linear layers separated by nonlinear activations — the very construction we used in the linear chapter, except that we will end the network with a logit head and a binary cross-entropy loss.
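Conceptually, the network performs:
$$\mathbf{x} \;\to\; \text{Linear} \;\to\; \text{ReLU} \;\to\; \text{Linear} \;\to\; \text{ReLU} \;\to\; \text{Linear} \;\to\; z \ \text{(logit)}$$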
Why ReLU between layers and no activation at the end? Because:
- intermediate non-linearities are what allow the network to represent functions that a single neuron could not — without them, the composition of linear layers collapses back into a single linear layer;
- the final layer must produce a logit, not a probability, so that the loss BCEWithLogitsLoss can apply the sigmoid internally with full numerical stability.
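The collapse of stacked linear layers can be verified numerically. In the sketch below, the layer sizes and random weights are arbitrary; the point is that two affine maps applied in sequence equal a single affine map with weight matrix W2 @ W1 and bias W2 @ b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # first layer: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # second layer: 4 -> 2
x = rng.normal(size=3)

# Two linear layers with no activation in between ...
two_layers = W2 @ (W1 @ x + b1) + b2

# ... are exactly one linear layer with merged parameters.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

# Inserting a ReLU breaks this equivalence in general.
with_relu = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

print(two_layers, one_layer, with_relu)
```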
A typical implementation reads:
    model = nn.Sequential(
        nn.Linear(m, 64),
        nn.ReLU(),
        nn.Linear(64, 32),
        nn.ReLU(),
        nn.Linear(32, 1),   # logits, no activation
    )
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
The nn.Sequential wrapper is convenient for purely sequential architectures. For more complex designs — branches, skip connections, residual blocks — one usually subclasses nn.Module:
    class MLPClassifier(nn.Module):
        def __init__(self, m, hidden=(64, 32)):
            super().__init__()
            self.fc1 = nn.Linear(m, hidden[0])
            self.fc2 = nn.Linear(hidden[0], hidden[1])
            self.head = nn.Linear(hidden[1], 1)

        def forward(self, x):
            x = torch.relu(self.fc1(x))
            x = torch.relu(self.fc2(x))
            return self.head(x)   # logits
The training loop is unchanged with respect to the single-neuron case. This is the central message of the PyTorch ecosystem: gradients and updates are plumbing that you write once and reuse, while what changes from one experiment to the next is the model and the loss.
Multiclass classification
Most real classification problems are not binary. Penguins come in three species, digits in ten, ImageNet classes in a thousand. The target now takes more than two values:
$$y \in \{0, 1, \dots, C - 1\}$$
and we want to estimate, for each input $\mathbf{x}$, the full discrete distribution
$$P(y = c \mid \mathbf{x}), \qquad c = 0, 1, \dots, C - 1$$
The construction generalizes the binary case along three axes: the output dimension, the activation function, and the loss.
Logits, softmax, and cross-entropy
The network now ends with $C$ output neurons, one per class, and the last layer is again linear:
nn.Linear(h, C)
The output is a vector of real scores called logits:
$$\mathbf{z} = (z_0, z_1, \dots, z_{C-1}) \in \mathbb{R}^{C}$$
Logits are unbounded and do not yet form a probability distribution. To turn them into probabilities we use the softmax function:
$$p_c = \mathrm{softmax}(\mathbf{z})_c = \frac{e^{z_c}}{\sum_{k=0}^{C-1} e^{z_k}}$$
The softmax exponentiates each score and renormalizes so that the components sum to $1$. It generalizes the sigmoid: with $C = 2$ classes, the softmax of $(z_0, z_1)$ is mathematically equivalent to applying a sigmoid to the difference $z_1 - z_0$.
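Both properties, normalization and the two-class equivalence with the sigmoid, can be checked numerically. The implementation below is a sketch that subtracts the maximum logit before exponentiating, a standard stabilization that leaves the result unchanged; the logit values are arbitrary.

```python
import numpy as np


def softmax(z):
    """Softmax along the last axis, shifted by the max for stability."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))


p = softmax(np.array([2.0, -1.0, 0.5]))      # three classes
z2 = np.array([0.3, 1.7])                    # two classes
p2 = softmax(z2)
via_sigmoid = sigmoid(z2[1] - z2[0])         # sigmoid of the logit difference
big = softmax(np.array([1000.0, 1001.0]))    # would overflow without the shift

print(p.sum(), p2[1], via_sigmoid, big)
```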
The corresponding loss is the categorical cross-entropy, defined per example as
$$E = -\sum_{c=0}^{C-1} \mathbb{1}[y = c] \, \log p_c = -\log p_{y}$$
It picks out the probability assigned to the correct class and takes its negative logarithm, a strict extension of the binary case.
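Computed from logits, the per-example loss is most safely written with log-probabilities rather than probabilities. The function below is a NumPy sketch of that computation; its name and the two example rows are assumptions for illustration.

```python
import numpy as np


def cross_entropy_from_logits(logits, y):
    """Per-example categorical cross-entropy for integer targets y."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    # log-softmax: z_c - log(sum_k exp(z_k)), computed on shifted logits
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # pick the log-probability of the correct class in each row
    return -log_probs[np.arange(len(y)), y]


logits = np.array([[2.0, 0.0, -1.0],    # favors class 0
                   [0.0, 0.0, 0.0]])    # uniform over the three classes
y = np.array([0, 2])
losses = cross_entropy_from_logits(logits, y)
print(losses)
```

A uniform prediction over three classes costs $\log 3$, the multiclass analogue of the $\log 2$ starting point seen in the binary case.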
Target encoding: integer indices, not one-hot
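Two conventions exist for representing the target in a multiclass problem: a vector of integer class indices, or a one-hot matrix. PyTorch's nn.CrossEntropyLoss is normally used with the integer-index form:
- $y$ is a vector of shape $(n,)$, not $(n, C)$;
- $y_i \in \{0, 1, \dots, C - 1\}$;
- $y$ has dtype torch.long.
This is one of the most common beginner mistakes: feeding a one-hot-encoded integer $y$ to CrossEntropyLoss produces a cryptic shape or dtype error. (Recent PyTorch versions also accept floating-point class probabilities of shape $(n, C)$ as targets, but integer indices remain the standard form for hard labels.) Use LabelEncoder from sklearn to convert string labels to integers before turning them into tensors.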
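The conversion from string labels to a target tensor takes two calls. The four species names below are a made-up sample; LabelEncoder assigns indices following the sorted order of the class names.

```python
import numpy as np
import torch
from sklearn.preprocessing import LabelEncoder

species = np.array(["Gentoo", "Adelie", "Chinstrap", "Adelie"])  # hypothetical sample

encoder = LabelEncoder()
y_int = encoder.fit_transform(species)            # integer class indices
y_tensor = torch.tensor(y_int, dtype=torch.long)  # dtype expected for index targets

print(encoder.classes_, y_int, y_tensor.shape, y_tensor.dtype)
```

Keeping the fitted encoder around is useful later: encoder.inverse_transform maps predicted indices back to species names.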
CrossEntropyLoss expects logits
The most important convention to remember:
Pitfall: CrossEntropyLoss expects logits, not softmax outputs. Just like BCEWithLogitsLoss in the binary case, nn.CrossEntropyLoss applies the softmax internally (via LogSoftmax + NLLLoss), in a numerically stable way using the log-sum-exp trick. Do not apply torch.softmax or nn.Softmax at the end of your model: if you do, the softmax is applied twice, the loss is squeezed into a narrow range that cannot approach zero, and training slows down or stalls.
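Both halves of this pitfall can be observed on random logits. The sketch below checks that CrossEntropyLoss matches LogSoftmax followed by NLLLoss, then computes the loss on softmax outputs to show the effect of the double softmax; the logits and targets are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(5, 3) * 4.0          # hypothetical logits, 5 examples, 3 classes
y = torch.tensor([0, 2, 1, 1, 0])         # integer-index targets

# CrossEntropyLoss on logits equals NLLLoss on log-softmax outputs.
ce = nn.CrossEntropyLoss()(logits, y)
manual = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), y)

# The mistake: passing probabilities, so that softmax is applied twice.
double = nn.CrossEntropyLoss()(torch.softmax(logits, dim=1), y)

print(ce.item(), manual.item(), double.item())
```

With three classes the inputs of the second softmax lie in $[0, 1]$, so the per-example loss is trapped between $\log(1 + 2/e) \approx 0.55$ and $\log(e + 2) \approx 1.55$, however good or bad the model is.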
A typical multiclass model therefore reads:
    model = nn.Sequential(
        nn.Linear(m, 64),
        nn.ReLU(),
        nn.Linear(64, 32),
        nn.ReLU(),
        nn.Linear(32, C),   # multiclass logits, no softmax
    )
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
Predicting a class
After training, the predicted class is the argmax of the logits along the class dimension:
    model.eval()
    with torch.no_grad():
        logits = model(X_test)
        y_hat = torch.argmax(logits, dim=1)
Mathematically:
$$\hat{y} = \arg\max_{c} \, z_c$$
The argmax of the logits and the argmax of the softmax probabilities are the same — softmax is a monotone transformation — so we can skip the softmax altogether at inference time, which is faster and more stable.
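The equality of the two argmax operations is easy to confirm on random scores. The logits below are arbitrary; the softmax is written inline with the usual max-shift.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 3)) * 3.0                # hypothetical logits

shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

from_logits = logits.argmax(axis=1)
from_probs = probs.argmax(axis=1)
same = np.array_equal(from_logits, from_probs)
print(from_logits, from_probs, same)
```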
A complete recipe for penguins_mini
Putting everything together, a multiclass classifier on the penguins_mini dataset goes through the following stages:
- Load the data and separate $X$ from $y$. The target column contains strings such as "Adelie", "Chinstrap", "Gentoo".
- Encode the target with LabelEncoder, producing integer labels in $\{0, 1, 2\}$.
- Stratified train/test split with test_size=0.2, random_state=42, stratify=y.
- Standardize inputs with StandardScaler (fit_transform on train, transform on test).
- Convert: X_* to torch.float32, y_* to torch.long.
- Build a Sequential model with one or two hidden layers ending in nn.Linear(h, C).
- Use nn.CrossEntropyLoss() and an optimizer (SGD or Adam).
- Run the standard training loop; store the loss in loss_history.
- At test time, compute logits, take argmax(dim=1), and report accuracy and confusion matrix.
The structure of the loop is unchanged. What changes between binary and multiclass is essentially the dimension of the last layer (1 vs. $C$), the dtype of $y$ (float32 vs. long), the loss class (BCEWithLogitsLoss vs. CrossEntropyLoss), and the prediction rule (threshold vs. argmax).
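The recipe can be rehearsed end to end before touching the real file. The sketch below replaces penguins_mini with three synthetic Gaussian blobs carrying the same species names; the data, the hidden width of 16, the learning rate and the epoch count are all assumptions made for this illustration.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)
torch.manual_seed(0)

# Hypothetical stand-in for penguins_mini: one blob per species.
centers = {"Adelie": (-3.0, 0.0), "Chinstrap": (0.0, 3.0), "Gentoo": (3.0, 0.0)}
X = np.vstack([rng.normal(c, 1.0, size=(60, 2)) for c in centers.values()])
labels = np.repeat(list(centers.keys()), 60)

# Encode, split (stratified), standardize (fit on train only).
y = LabelEncoder().fit_transform(labels)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert: float32 inputs, long targets.
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.long)
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test, dtype=torch.long)

C = 3
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, C))  # logits
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

loss_history = []
for _ in range(300):
    optimizer.zero_grad()
    loss = criterion(model(X_train_t), y_train_t)
    loss.backward()
    optimizer.step()
    loss_history.append(loss.item())

model.eval()
with torch.no_grad():
    y_hat = torch.argmax(model(X_test_t), dim=1)
accuracy = (y_hat == y_test_t).float().mean().item()
print(loss_history[0], loss_history[-1], accuracy)
```

Swapping the synthetic block for a pd.read_csv call on the real dataset should leave the rest of the script unchanged, apart from the input dimension of the first layer.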
Summary
The cluster of ideas introduced in this chapter forms the foundation of every classification model you will meet later in the course — from the convolutional networks of the next chapters to large language models trained with cross-entropy on next-token prediction. To synthesize:
- Binary classification estimates $P(y = 1 \mid \mathbf{x})$ via a logistic neuron $u = \sigma(\mathbf{w}^\top \mathbf{x} + b)$, trained by minimizing the binary cross-entropy, which is the negative log-likelihood of a Bernoulli model.
- The gradient of the binary cross-entropy with respect to the weights collapses, after a beautiful cancellation, to the same $\frac{1}{n} X^\top (\mathbf{u} - \mathbf{y})$ as in linear regression. The training loop is therefore identical; only the definition of $u$ changes.
- In PyTorch, the recommended pattern is nn.Linear(m, 1) with nn.BCEWithLogitsLoss, which combines sigmoid and cross-entropy in a numerically stable way.
- A multilayer classifier simply stacks nn.Linear + ReLU blocks before a logit head; the loss and the loop do not change.
- Multiclass classification generalizes the construction with $C$ output neurons, the softmax activation, the categorical cross-entropy, and the convention that PyTorch's CrossEntropyLoss consumes raw logits and integer-index targets.
Exercises
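Exercise 1: Gradient by hand. Starting from $E = -\left[ y \log u + (1 - y) \log (1 - u) \right]$ with $u = \sigma(z)$ and $z = \mathbf{w}^\top \mathbf{x} + b$:
- Compute $\partial E / \partial u$.
- Compute $\partial u / \partial z$.
- Compute $\partial z / \partial w_j$ and $\partial z / \partial b$.
- Apply the chain rule and check that you recover $\partial E / \partial w_j = (u - y)\, x_j$.
- Write the batch version of $\nabla_{\mathbf{w}} E$ and $\partial E / \partial b$ in matrix form.
Exercise 2: Logistic neuron from scratch. Take your LinearNeuron class from chapter 1 and turn it into a LogisticNeuron. The three changes are: apply a sigmoid in forward, use binary cross-entropy in the loss tracker (with a small $\varepsilon$ added inside the logs to avoid log(0)), and add a predict(X, threshold=0.5) method. Test it on cancer_mini with a stratified train/test split and StandardScaler normalization.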
Exercise 3 — PyTorch logistic neuron. Reproduce exercise 2 in PyTorch using nn.Linear(m, 1) and nn.BCEWithLogitsLoss. Compare the test accuracy and the loss curve to those of your from-scratch implementation. They should agree up to small differences due to initialization and optimizer choice.
Exercise 4 — Multilayer network. Replace the single nn.Linear(m, 1) with a multilayer classifier
Linear(m, 64) -> ReLU -> Linear(64, 32) -> ReLU -> Linear(32, 1)
trained on cancer_mini with BCEWithLogitsLoss. Does adding depth help on this dataset? Plot the loss curves of both models on the same axes.
Exercise 5: Multiclass on penguins_mini. Follow the nine-step recipe from the previous section. Build a model with one hidden layer of $h$ units and an output layer of size $C = 3$. Use LabelEncoder to encode the species, CrossEntropyLoss as the loss, and argmax(dim=1) as the prediction rule. Report the accuracy and the confusion matrix on the test set.
Exercise 6 — BCELoss vs. BCEWithLogitsLoss. Build two equivalent models on cancer_mini: model A is nn.Sequential(nn.Linear(m, 1), nn.Sigmoid()) with BCELoss; model B is nn.Linear(m, 1) with BCEWithLogitsLoss. Train both with the same data, optimizer and seed. Verify that the loss curves are visually identical, and check at extreme initial weights what happens to the gradients (the BCELoss version may underflow; the BCEWithLogitsLoss version does not).
Exercise 7 — Forgetting the softmax / forgetting the sigmoid. On penguins_mini, deliberately add a nn.Softmax(dim=1) at the end of your multiclass model and keep CrossEntropyLoss. Train, plot the loss, and compare with the version without softmax. Explain in two or three sentences what goes wrong.
Going further
- PyTorch documentation: nn.BCEWithLogitsLoss and nn.CrossEntropyLoss. Both pages contain a precise description of the computation, the input/target shape conventions, and the rationale for the log-sum-exp trick. Reading them once carefully avoids most beginner confusion. https://pytorch.org/docs/stable/nn.html#loss-functions
- Goodfellow, Bengio, Courville: Deep Learning, chapter 6 ("Deep Feedforward Networks"). The reference textbook for the conceptual scaffolding of MLPs: choice of activation, choice of output unit, choice of cost function, and the generic maximum likelihood perspective that unifies regression and classification.
- Andrej Karpathy: "A Recipe for Training Neural Networks". A short and dense blog post listing the practical reflexes that separate working neural networks from broken ones: visualize your data first, start with a small model, fix one thing at a time, scrutinize the loss curve, and so on. Required reading before tackling the convolutional chapters that follow. https://karpathy.github.io/2019/04/25/recipe/
- Christopher Olah: "Calculus on Computational Graphs: Backpropagation". A very visual introduction to the chain rule on graphs, which makes it crystal clear why autograd works and why the gradient cancellation we observed in this chapter is structural rather than accidental.
- scikit-learn: LogisticRegression user guide. A complementary view from the classical-statistics side: regularization, multinomial vs. one-vs-rest, solver choice. Useful as a sanity check whenever you are tempted to deploy a 50-million-parameter network on a dataset where logistic regression would do.