
Deep Learning 2 — Classification

In the previous chapter we built a single artificial neuron that performed linear regression. We derived the gradient of the mean squared error by hand, implemented gradient descent from scratch, and then rebuilt the same model in PyTorch using nn.Linear and an MSE loss. The neuron produced a real number, an unbounded prediction of a continuous quantity.

We now turn to the second great family of supervised problems: classification. The target is no longer a real number but a category — malignant or benign tumor, Adelie or Chinstrap or Gentoo penguin. The output of the model must therefore live in a different space, and the loss function used during the previous chapter is no longer appropriate. The objective of this chapter is to extend the linear neuron into a logistic neuron for binary classification, derive its gradient, reimplement it from scratch and then in PyTorch, and finally generalize the construction to multilayer networks and to multiclass problems with the softmax/cross-entropy pair.

By the end of the chapter we will have a clear view of the four building blocks that classification adds to the regression toolbox: the sigmoid (and its multiclass cousin, the softmax), the binary cross-entropy (and its multiclass cousin, the categorical cross-entropy), the convention that PyTorch loss functions consume logits rather than probabilities, and the way these objects compose with the multilayer skeleton inherited from the previous chapter.

Binary classification and the logistic neuron

In binary classification the target variable takes only two values:

y \in \{0, 1\}

A naive idea would be to ask the model to output exactly $0$ or $1$ . This is the perceptron rule:

\hat{y} = \begin{cases} 1 & \text{if } Xw + b \ge 0 \\ 0 & \text{otherwise} \end{cases}

The perceptron is conceptually attractive (it produces a hard decision and a clean linear boundary), but it has two defects that disqualify it in modern practice. First, the output is not a probability: we have no way to distinguish "I am 90% confident that this tumor is malignant" from "I am 51% confident". Second, the step function has a zero derivative everywhere except at the threshold, where it is not differentiable at all, so gradient descent receives no useful signal. We need a smoother, probabilistic version of this threshold.

Probabilistic modeling: the Bernoulli law

A binary variable is naturally modeled by a Bernoulli law. For a parameter $p \in [0, 1]$ :

P(y = 1) = p, \qquad P(y = 0) = 1 - p

The two cases can be merged into a single compact expression:

P(y \mid p) = p^{y} (1 - p)^{1 - y}

This formula is the workhorse of binary probabilistic modeling: when $y = 1$ it returns $p$ , when $y = 0$ it returns $1 - p$ . The remaining task is to learn $p$ from data — and, more precisely, to let $p$ depend on the input $X$ , since we expect the probability of being malignant to depend on the features of the tumor.

From probability to likelihood

In supervised learning the data $(X_i, y_i)$ are observed and fixed. The probability $P(y \mid p)$ is therefore reinterpreted as a function of the parameters of the model, not of the data. We call this function the likelihood.

For a single observation:

\mathcal{L}(p \mid y) = P(y \mid p)

For a dataset of $n$ independent observations:

\mathcal{L} = \prod_{i=1}^{n} P(y_i \mid p_i)

where $p_i$ is the probability assigned by the model to example $i$ . The principle of maximum likelihood states that we should choose the parameters that make the observed data as plausible as possible — in other words, that maximize $\mathcal{L}$ .
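As a quick sanity check of the maximum-likelihood principle, the sketch below assumes the simplest possible model: a single constant $p$ shared by all examples, with no dependence on $X$. The toy labels and the helper name are illustrative assumptions, not part of the course material.

```python
import math

def bernoulli_log_likelihood(p, ys):
    """Log-likelihood of i.i.d. Bernoulli labels ys under a constant p."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in ys)

# Toy labels (made up for illustration): 7 positives out of 10.
ys = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

# Scan candidate values of p strictly inside (0, 1).
grid = [k / 100 for k in range(1, 100)]
p_hat = max(grid, key=lambda p: bernoulli_log_likelihood(p, ys))

print(p_hat)               # maximizer found on the grid
print(sum(ys) / len(ys))   # empirical frequency of the positive class
```

In this constant-$p$ setting the grid maximizer coincides with the empirical frequency of the positive class, which is what maximum likelihood predicts for a Bernoulli model.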

Predicting a probability with a neuron

To turn $p$ into a function of $X$ , we reuse the affine combination from the previous chapter:

z = X w + b

This quantity is a real number, not a probability. We squash it into the unit interval with the logistic function, also called the sigmoid:

\sigma(z) = \frac{1}{1 + e^{-z}}

The sigmoid is monotone increasing, takes values in $(0, 1)$ , equals $1/2$ at $z = 0$ , and saturates exponentially fast at $0$ and $1$ for large negative and positive inputs. Composing the affine layer with the sigmoid gives the logistic neuron:

p = P(y = 1 \mid X) = \sigma(X w + b)

The logistic neuron is no longer a class predictor but a probability estimator. The hard decision is recovered afterwards by thresholding $p$ at, for example, $0.5$ .
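The properties listed above can be checked numerically. The following is a minimal sketch of a sigmoid that stays well behaved for very large inputs; the two-branch rewriting is a common trick and an assumption of this example, not something the chapter prescribes.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function for array inputs."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the textbook formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use sigma(z) = e^z / (1 + e^z) so that exp never overflows.
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

z = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
p = sigmoid(z)
print(p)

# Hard decisions are recovered by thresholding at 0.5.
print((p >= 0.5).astype(int))
```

Mathematically the sigmoid never reaches $0$ or $1$, but in floating point it does saturate to exactly those values for extreme inputs, which is worth keeping in mind whenever a logarithm is applied to its output.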

Likelihood, log-likelihood and binary cross-entropy

Substituting the neuron output into the Bernoulli likelihood gives:

\mathcal{L}(w, b) = \prod_{i=1}^{n} \left[\sigma(X_i w + b)\right]^{y_i} \left[1 - \sigma(X_i w + b)\right]^{1 - y_i}

This product of probabilities is numerically dangerous: with a thousand examples each contributing a factor smaller than $1$ , the result quickly underflows to zero. We therefore work with the log-likelihood, which has the same maximizer but is a sum rather than a product:

\log \mathcal{L}(w, b) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
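The underflow is easy to reproduce. The sketch below uses hypothetical likelihood factors (2000 examples, each with probability $0.5$) and compares the raw product with the sum of logs.

```python
import math

# Hypothetical per-example likelihood factors: 2000 examples, each with probability 0.5.
factors = [0.5] * 2000

likelihood = math.prod(factors)                      # product of probabilities
log_likelihood = sum(math.log(f) for f in factors)   # sum of log-probabilities

print(likelihood)       # the product has underflowed in float64
print(log_likelihood)   # the sum of logs is an ordinary negative number
```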

By convention, machine learning algorithms minimize a cost rather than maximize a likelihood. We therefore flip the sign and average over the $n$ examples (dividing by $n$ does not move the optimum), which gives the binary cross-entropy:

E = - \frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]

Three names, one object. The function above is variously called binary cross-entropy, log-loss, or negative log-likelihood of a Bernoulli model. All three names refer to the same expression. Whichever the textbook uses, you should recognize the same $-y \log p - (1 - y) \log(1 - p)$ underneath.

This loss replaces the MSE of the linear neuron. It is finite and differentiable for $p \in (0, 1)$, it tends to $0$ as the model assigns probability $1$ to the correct class, and it diverges to $+\infty$ as the model assigns probability $0$ to the correct class: confident mistakes are penalized very heavily, which pushes the model toward honest probability estimates.
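To make these properties concrete, here is a minimal sketch of the binary cross-entropy in NumPy. Clipping the probabilities with a small `eps` is one possible way (an assumption of this example) to keep the logarithms finite; the toy probabilities are made up.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy; eps keeps the logs finite when p hits 0 or 1."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

y = np.array([1, 0, 1, 0])

confident_right = binary_cross_entropy(y, [0.99, 0.01, 0.99, 0.01])
hesitant        = binary_cross_entropy(y, [0.60, 0.40, 0.60, 0.40])
confident_wrong = binary_cross_entropy(y, [0.01, 0.99, 0.01, 0.99])

print(confident_right, hesitant, confident_wrong)
```

The three values are ordered as expected: a confident correct model pays almost nothing, a hesitant one pays a moderate cost, and a confident wrong one pays by far the most.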

Gradient of the logistic neuron

To train the logistic neuron we need the gradient of $E$ with respect to $w$ and $b$ . Take a single example $(X_i, y_i)$ and trace the computation through three blocks:

w \;\longrightarrow\; z_i \;\longrightarrow\; u_i \;\longrightarrow\; E_i

with

z_i = X_i w + b, \qquad u_i = \sigma(z_i), \qquad E_i = -\left[y_i \log(u_i) + (1 - y_i) \log(1 - u_i)\right]

The chain rule gives, for a single weight $w_j$ :

\frac{\partial E_i}{\partial w_j} = \frac{\partial E_i}{\partial u_i} \cdot \frac{\partial u_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial w_j}

The three factors correspond to loss vs. output, activation vs. pre-activation, and pre-activation vs. weight.

Computing each factor

Factor 1 — derivative of the loss with respect to $u_i$ . Differentiating term by term:

\frac{\partial E_i}{\partial u_i} = -\frac{y_i}{u_i} + \frac{1 - y_i}{1 - u_i}

The first term comes from $-y_i \log(u_i)$ , the second from $-(1 - y_i) \log(1 - u_i)$ via the chain rule on $\log(1 - u_i)$ .

Factor 2 — derivative of the sigmoid. A classical identity states that the derivative of the sigmoid can be written using the sigmoid itself:

\frac{\partial u_i}{\partial z_i} = u_i (1 - u_i)

This is one of the great practical advantages of the sigmoid: once the forward pass has computed $u_i$ , the backward pass needs no extra exponentials.
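The identity can be verified without any calculus by comparing it with a central finite difference. This is only a numerical sanity check on a handful of arbitrarily chosen points, not a proof.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

h = 1e-6
gaps = []
for z in (-3.0, -0.5, 0.0, 1.0, 4.0):
    u = sigmoid(z)
    analytic = u * (1.0 - u)                                # identity from the text
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central difference
    gaps.append(abs(analytic - numeric))
    print(f"z={z:+.1f}  analytic={analytic:.8f}  numeric={numeric:.8f}")

max_gap = max(gaps)
```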

Factor 3 — derivative of the affine layer. Since $z_i = \sum_j x_{ij} w_j + b$ :

\frac{\partial z_i}{\partial w_j} = x_{ij}, \qquad \frac{\partial z_i}{\partial b} = 1

Magic cancellation

Multiplying the three factors:

\frac{\partial E_i}{\partial w_j} = \left( -\frac{y_i}{u_i} + \frac{1 - y_i}{1 - u_i} \right) \cdot u_i (1 - u_i) \cdot x_{ij}

Distributing $u_i (1 - u_i)$ inside the parentheses:

\left( -\frac{y_i}{u_i} + \frac{1 - y_i}{1 - u_i} \right) u_i (1 - u_i) = -y_i (1 - u_i) + (1 - y_i) u_i = u_i - y_i

The final result is therefore astonishingly simple:

\frac{\partial E_i}{\partial w_j} = (u_i - y_i) \, x_{ij}

In vector and batch form:

\frac{\partial E}{\partial w} = \frac{1}{n} X^{T} (u - y), \qquad \frac{\partial E}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (u_i - y_i)

Key formula. For the logistic neuron with binary cross-entropy loss, the gradient takes the same shape as for linear regression with MSE loss (up to a constant factor that depends on how the MSE is normalized): $\nabla_w E = \tfrac{1}{n} X^{T}(u - y), \qquad \nabla_b E = \tfrac{1}{n} \sum_i (u_i - y_i).$ The only difference is that here $u = \sigma(Xw + b)$ instead of $u = Xw + b$.
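A hand-derived gradient is worth checking against finite differences before it goes into a training loop. The sketch below does this on synthetic data; the sizes, the seed and the helper names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
X = rng.normal(size=(n, m))                      # synthetic inputs (illustration only)
y = rng.integers(0, 2, size=n).astype(float)     # synthetic 0/1 targets
w = rng.normal(size=m)
b = 0.3

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(w, b):
    u = sigmoid(X @ w + b)
    return -np.mean(y * np.log(u) + (1 - y) * np.log(1 - u))

# Analytic gradient from the key formula.
u = sigmoid(X @ w + b)
grad_w = X.T @ (u - y) / n
grad_b = np.mean(u - y)

# Central finite differences, one coordinate at a time.
h = 1e-6
num_w = np.array([(bce(w + h * e, b) - bce(w - h * e, b)) / (2 * h) for e in np.eye(m)])
num_b = (bce(w, b + h) - bce(w, b - h)) / (2 * h)

print(np.max(np.abs(grad_w - num_w)), abs(grad_b - num_b))
```

Both gaps should be tiny compared with the gradient itself; a large gap would point to an algebra or indexing mistake.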

Why this similarity is not a coincidence

The cancellation we just performed is a special case of a more general phenomenon. Whenever the output activation is the inverse of the canonical link of the exponential-family distribution we are modeling, the gradient of the negative log-likelihood collapses to "(prediction $-$ target) times input". The MSE/identity pair (Gaussian model) and the cross-entropy/sigmoid pair (Bernoulli model) are two instances of the same scheme; in the multiclass case we will see a third, with cross-entropy and softmax. This is why your training loop barely changes between the linear and the logistic neuron: only the definition of $u$ changes, while the rest of the algorithm (gradient assembly, parameter update, batching) is strictly identical.

A logistic neuron from scratch

Armed with this gradient, we can rewrite the LinearNeuron class of the previous chapter into a LogisticNeuron class. The structural changes are minimal:

import numpy as np

class LogisticNeuron:
    def __init__(self):
        self.w = None
        self.b = 0.0
        self.history = []

    def forward(self, X):
        z = X @ self.w + self.b
        return 1.0 / (1.0 + np.exp(-z))

    def predict(self, X, threshold=0.5):
        u = self.forward(X)
        return (u >= threshold).astype(int)

    def fit(self, X, y, lr=0.1, epochs=100, mode="batch", batch_size=32):
        n, m = X.shape
        if self.w is None:
            self.w = np.zeros(m)
        # ... gradient descent loop, identical to the linear neuron,
        # using the formulas dE/dw = X.T @ (u - y) / n and dE/db = mean(u - y)

Three things changed compared to the linear neuron of chapter 1:

The forward method now applies the sigmoid: 1 / (1 + exp(-z)) instead of z.
The cost function used to monitor convergence is the binary cross-entropy $E = -\tfrac{1}{n} \sum_i [y_i \log(u_i) + (1 - y_i) \log(1 - u_i)]$ rather than the MSE. In practice we add a small $\varepsilon$ inside the logs to avoid log(0) when the sigmoid saturates.
A new predict method thresholds the probabilities at $0.5$ to produce hard class labels.

The gradient assembly (grad_w = X.T @ (u - y) / n, grad_b = (u - y).mean()) and the descent itself (self.w -= lr * grad_w) are character-for-character identical to the regression case. This is the practical payoff of the similarity we proved above.
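For readers who do not have the chapter 1 class at hand, here is a minimal self-contained sketch of the training loop in full-batch mode only. The class name `LogisticNeuronBatch`, the clipping constant and the synthetic data are assumptions of this example; the mini-batch and stochastic modes of the original class are deliberately left out.

```python
import numpy as np

class LogisticNeuronBatch:
    """Full-batch variant only; the other modes of the chapter 1 class are omitted."""

    def __init__(self):
        self.w = None
        self.b = 0.0
        self.history = []

    def forward(self, X):
        z = X @ self.w + self.b
        return 1.0 / (1.0 + np.exp(-z))

    def predict(self, X, threshold=0.5):
        return (self.forward(X) >= threshold).astype(int)

    def fit(self, X, y, lr=0.1, epochs=100, eps=1e-12):
        n, m = X.shape
        if self.w is None:
            self.w = np.zeros(m)
        for _ in range(epochs):
            u = self.forward(X)
            # Monitor the binary cross-entropy, clipping u to keep the logs finite.
            uc = np.clip(u, eps, 1.0 - eps)
            self.history.append(float(-np.mean(y * np.log(uc) + (1 - y) * np.log(1 - uc))))
            # Gradient formulas derived earlier in the chapter.
            grad_w = X.T @ (u - y) / n
            grad_b = np.mean(u - y)
            self.w -= lr * grad_w
            self.b -= lr * grad_b
        return self

# Synthetic, linearly separable data (an assumption made for the demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

model = LogisticNeuronBatch().fit(X, y, lr=0.5, epochs=200)
acc = np.mean(model.predict(X) == y)
print(model.history[0], model.history[-1], acc)
```

With zero initial weights every prediction starts at $0.5$, so the first recorded loss is $\log 2$; the loss then decreases as the weights align with the separating direction.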

A typical use on the cancer_mini dataset reads:

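import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv(".../cancer_mini.csv")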
X = df.drop(columns=["diagnosis"]).to_numpy()
y = df["diagnosis"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticNeuron()
model.fit(X_train, y_train, mode="batch", epochs=100, lr=0.5)
y_hat = model.predict(X_test)
print(accuracy_score(y_test, y_hat))

A few practical reflexes are worth highlighting here. We stratify the train/test split on $y$ so that both subsets preserve the class balance, which matters when the positive class is rare. We fit the standardization on the training set only and apply the same transformation to the test set, never the other way around, so that no information from the test set leaks into training. And we evaluate with classification metrics (accuracy, confusion matrix, precision, recall) rather than with a regression metric like MSE.
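The reason accuracy alone is not enough on imbalanced data can be shown with a few lines of plain Python. The labels below are made up, and the helper `confusion_counts` is a hypothetical stand-in for the scikit-learn metrics mentioned above.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tn = pairs.count((0, 0))
    fp = pairs.count((0, 1))
    fn = pairs.count((1, 0))
    tp = pairs.count((1, 1))
    return tn, fp, fn, tp

# Imbalanced toy labels (made up): 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100            # a useless model that always answers "negative"

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if tp + fn else 0.0
# Precision is undefined when nothing is predicted positive; 0.0 is used here by convention.
precision = tp / (tp + fp) if tp + fp else 0.0

print(accuracy, precision, recall)
```

The always-negative model reaches a high accuracy while detecting none of the positive cases, which is exactly what recall and the confusion matrix reveal.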

The logistic neuron in PyTorch

The next step is to abandon the from-scratch implementation and let PyTorch take care of the autograd machinery. The logistic neuron in PyTorch reuses the nn.Linear layer of the previous chapter; only the loss and the post-processing change.

Two ways of writing it

Option 1 — explicit sigmoid with nn.BCELoss. We can build a model that returns probabilities directly and feed those into the binary cross-entropy:

import torch
from torch import nn

# m is the number of input features, e.g. X_train.shape[1]
model = nn.Sequential(
    nn.Linear(m, 1),
    nn.Sigmoid()
)

criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

Here model(X) directly returns $u \in (0, 1)$ and criterion(u, y) computes the binary cross-entropy.

Option 2 — recommended, raw logits with nn.BCEWithLogitsLoss. We do not put a sigmoid at the end of the model. The model returns the logits $z = Xw + b$ , and BCEWithLogitsLoss applies the sigmoid and the cross-entropy together, in a numerically stable way:

model = nn.Linear(m, 1)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

Pitfall: BCELoss vs. BCEWithLogitsLoss. BCELoss expects probabilities in $(0, 1)$; BCEWithLogitsLoss expects raw logits in $\mathbb{R}$ and applies the sigmoid internally, using the log-sum-exp trick. The second version is the right default: it is numerically more stable and it avoids the "I forgot the sigmoid" bug. Never stack nn.Sigmoid followed by BCEWithLogitsLoss: the sigmoid would be applied twice, so the loss would see effective probabilities confined to roughly $(0.5, 0.73)$ and training would degrade without raising any error.
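The stability argument can be illustrated in NumPy. The sketch below compares the naive "sigmoid then log" computation with the standard rewriting $\max(z, 0) - z y + \log(1 + e^{-|z|})$; this is the usual textbook formulation, and PyTorch's internal implementation may differ in detail.

```python
import numpy as np

def bce_naive(z, y):
    """Sigmoid first, then cross-entropy: breaks down when the sigmoid saturates."""
    u = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(u) + (1 - y) * np.log(1 - u))

def bce_with_logits(z, y):
    """Stable per-example loss: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z)))

z = np.array([-2.0, 0.5, 3.0])
y = np.array([0.0, 1.0, 1.0])
print(bce_naive(z, y), bce_with_logits(z, y))   # the two agree for moderate logits

# Confidently wrong prediction: logit +50 for a negative example.
z_bad, y_bad = np.array([50.0]), np.array([0.0])
with np.errstate(divide="ignore"):
    naive_bad = bce_naive(z_bad, y_bad)         # the sigmoid rounds to 1.0, so log(1 - u) blows up
stable_bad = bce_with_logits(z_bad, y_bad)      # stays finite, close to the logit itself
print(naive_bad, stable_bad)
```

For a saturated logit the naive version returns an infinite loss, while the rewritten version returns a finite value that still carries a usable gradient.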

Data preparation

As in regression, we convert NumPy arrays to float32 tensors. For binary classification, the target tensor must have shape $(n, 1)$ to match the model output:

X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
X_test  = torch.tensor(X_test,  dtype=torch.float32)
y_test  = torch.tensor(y_test,  dtype=torch.float32).view(-1, 1)
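The shape requirement is easy to test in isolation. In the sketch below the tensors are dummies, and the behaviour shown (a `ValueError` on mismatched shapes) is what recent PyTorch versions do, as far as I am aware.

```python
import torch
from torch import nn

criterion = nn.BCEWithLogitsLoss()
logits = torch.zeros(8, 1)          # what nn.Linear(m, 1) returns for 8 examples
y_flat = torch.ones(8)              # shape (8,): does not match the logits
y_col = y_flat.view(-1, 1)          # shape (8, 1): matches the logits

try:
    criterion(logits, y_flat)
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print("rejected:", err)

loss = criterion(logits, y_col)
print(loss.item())                  # log(2), since every logit is zero
```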

Training loop

The training loop is character-for-character identical to the regression loop of chapter 1:

epochs = 200
loss_history = []

for _ in range(epochs):
    optimizer.zero_grad()
    logits = model(X_train)
    loss = criterion(logits, y_train)
    loss.backward()
    optimizer.step()
    loss_history.append(loss.item())

The only difference is the type of criterion. The four-step ritual zero_grad -> forward -> backward -> step does not change.

From logits to predicted classes

Because the model returns logits and not probabilities, we need an explicit conversion at evaluation time:

model.eval()
with torch.no_grad():
    logits = model(X_test)
    proba  = torch.sigmoid(logits)
    y_hat  = (proba >= 0.5).float()

We then send the predictions back to NumPy and compute the usual sklearn metrics:

y_hat_np  = y_hat.cpu().numpy().reshape(-1)
y_test_np = y_test.cpu().numpy().reshape(-1)

print("Accuracy:", accuracy_score(y_test_np, y_hat_np))
print("Confusion matrix:\n", confusion_matrix(y_test_np, y_hat_np))

The model.eval() call disables training-only behaviors such as dropout and batch-norm running-stats updates. In our minimal example it is not strictly necessary, but it is the right reflex to acquire from day one. The with torch.no_grad() context tells autograd not to record operations, which saves memory and time at inference.
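The fragments of this section can be assembled into one runnable script. The sketch below uses synthetic data in place of cancer_mini (the features, the hidden weight vector `true_w` and the sizes are all assumptions) and computes accuracy directly in PyTorch rather than with sklearn.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-in for standardized features and 0/1 targets (not cancer_mini).
m = 4
true_w = torch.tensor([[1.5], [-2.0], [0.5], [0.0]])
X_train, X_test = torch.randn(200, m), torch.randn(100, m)
y_train = (X_train @ true_w > 0).float()     # shape (200, 1)
y_test = (X_test @ true_w > 0).float()       # shape (100, 1)

model = nn.Linear(m, 1)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss_history = []
for _ in range(200):
    optimizer.zero_grad()                        # reset accumulated gradients
    loss = criterion(model(X_train), y_train)    # forward pass on raw logits
    loss.backward()                              # backward pass
    optimizer.step()                             # parameter update
    loss_history.append(loss.item())

model.eval()
with torch.no_grad():
    proba = torch.sigmoid(model(X_test))         # logits -> probabilities
    y_hat = (proba >= 0.5).float()               # probabilities -> hard labels
    accuracy = (y_hat == y_test).float().mean().item()

print(f"first loss {loss_history[0]:.3f}, last loss {loss_history[-1]:.3f}, accuracy {accuracy:.3f}")
```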

Side-by-side correspondence

Listing the changes between the from-scratch and the PyTorch versions clarifies what autograd actually buys us:

forward: from-scratch writes $u = \sigma(Xw + b)$ explicitly; in PyTorch, model(X) returns the logit, and torch.sigmoid (or the loss) applies the sigmoid;
gradient: from-scratch performs the manual chain-rule computation we did above; in PyTorch, loss.backward() does it for us, in any depth and shape;
update: from-scratch writes w -= lr * grad_w; in PyTorch, optimizer.step() walks the parameter list of the model and applies the rule of the chosen optimizer.

Multilayer networks for classification

Once the single neuron is in place, going multilayer is essentially free. We stack nn.Linear layers separated by nonlinear activations, the very construction we used in the previous chapter, except that we now end the network with a logit head and a binary cross-entropy loss.

Conceptually, the network performs:

X \;\xrightarrow{\text{Linear + ReLU}}\; h_1 \;\xrightarrow{\text{Linear + ReLU}}\; h_2 \;\xrightarrow{\text{Linear}}\; z

Why ReLU between layers and no activation at the end? Because:

intermediate non-linearities are what allow the network to represent functions that a single neuron could not — without them, the composition of linear layers collapses back into a single linear layer;
the final layer must produce a logit, not a probability, so that the loss BCEWithLogitsLoss can apply the sigmoid internally with full numerical stability.
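The collapse of stacked linear layers can be verified numerically. The sketch below uses random weights (arbitrary shapes and seed) and shows that two affine layers without activation equal one affine layer, while inserting a ReLU breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                       # 5 examples, 4 features

# Two stacked affine layers with no activation in between.
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 1)), rng.normal(size=1)
two_layers = (x @ W1 + b1) @ W2 + b2

# The same map written as a single affine layer.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b
print(np.allclose(two_layers, one_layer))

# With a ReLU in between, the single-layer rewriting no longer reproduces the output.
with_relu = np.maximum(x @ W1 + b1, 0.0) @ W2 + b2
print(np.allclose(with_relu, one_layer))
```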

A typical implementation reads:

model = nn.Sequential(
    nn.Linear(m, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1),  # logits, no activation
)

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

The nn.Sequential wrapper is convenient for purely sequential architectures. For more complex designs — branches, skip connections, residual blocks — one usually subclasses nn.Module:

class MLPClassifier(nn.Module):
    def __init__(self, m, hidden=(64, 32)):
        super().__init__()
        self.fc1 = nn.Linear(m, hidden[0])
        self.fc2 = nn.Linear(hidden[0], hidden[1])
        self.head = nn.Linear(hidden[1], 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.head(x)  # logits

The training loop is unchanged with respect to the single-neuron case. This is the central message of the PyTorch ecosystem: gradients and updates are plumbing that you write once and reuse, while what changes from one experiment to the next is the model and the loss.

Multiclass classification

Most real classification problems are not binary. Penguins come in three species, digits in ten, ImageNet classes in a thousand. The target now takes more than two values:

y \in \{0, 1, \dots, C - 1\}

and we want to estimate, for each input $X$ , the full discrete distribution

P(y = c \mid X), \quad c = 0, \dots, C - 1

The construction generalizes the binary case along three axes: the output dimension, the activation function, and the loss.

Logits, softmax, and cross-entropy

The network now ends with $C$ output neurons — one per class — and the last layer is again linear:

nn.Linear(h, C)

The output is a vector of $C$ real scores called logits:

z = (z_0, z_1, \dots, z_{C-1})

Logits are unbounded and do not yet form a probability distribution. To turn them into probabilities we use the softmax function:

P(y = c \mid X) = \frac{e^{z_c}}{\sum_{k=0}^{C-1} e^{z_k}}

The softmax exponentiates each score and renormalizes so that the components sum to $1$. It generalizes the sigmoid: with $C = 2$ classes, the probability that the softmax assigns to class $1$ is $e^{z_1} / (e^{z_0} + e^{z_1}) = 1 / (1 + e^{-(z_1 - z_0)})$, which is exactly the sigmoid applied to the difference $z_1 - z_0$.
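Both claims (rows summing to one, and the two-class link with the sigmoid) can be checked in a few lines. Subtracting the row maximum before exponentiating is a standard stabilization and an addition of this sketch; the logits are arbitrary.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max leaves the result unchanged but avoids overflow."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

z = np.array([[0.3, 1.7], [2.0, -1.0], [1000.0, 1001.0]])   # two-class logits
p = softmax(z)

print(p.sum(axis=1))                    # each row sums to 1
print(p[:, 1])                          # P(y = 1) from the softmax
print(sigmoid(z[:, 1] - z[:, 0]))       # sigmoid of the logit difference
```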

The corresponding loss is the categorical cross-entropy, defined per example as

E_i = -\log P(y = y_i \mid X_i) = -\log \frac{e^{z_{i,y_i}}}{\sum_k e^{z_{i,k}}}

It picks out the probability assigned to the correct class and takes its negative logarithm — a strict extension of the binary case.
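A from-scratch version of this loss makes the "pick the correct class, take minus the log" recipe explicit. The sketch below averages over the batch and computes the log-softmax through the log-sum-exp trick; the function name and the toy logits are illustrative assumptions.

```python
import numpy as np

def cross_entropy_from_logits(z, y):
    """Mean categorical cross-entropy for logits z of shape (n, C) and integer labels y of shape (n,)."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y)
    shifted = z - z.max(axis=1, keepdims=True)
    # Log-softmax via the log-sum-exp trick.
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the correct class in each row.
    return float(-log_probs[np.arange(len(y)), y].mean())

z = np.array([[2.0, 0.5, -1.0],
              [0.0, 0.0, 0.0],
              [-1.0, 3.0, 0.2]])
y = np.array([0, 2, 1])
print(cross_entropy_from_logits(z, y))

# Uniform logits give the loss of a uniform guess over C classes, log(C).
print(cross_entropy_from_logits([[0.0, 0.0, 0.0]], [1]), np.log(3))
```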

Target encoding: integer indices, not one-hot

Two conventions exist for representing the target $y$ in a multiclass problem: a vector of integer class indices, or a one-hot matrix. PyTorch's nn.CrossEntropyLoss is normally used with the integer-index form:

$y$ is a vector of shape $(n,)$ — not $(n, C)$ ;
$y_i \in \{0, \dots, C - 1\}$ ;
$y$ has dtype torch.long.

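This is one of the most common beginner mistakes: feeding a one-hot-encoded integer $y$ of shape $(n, C)$ to CrossEntropyLoss raises an error that can look cryptic. (Recent PyTorch versions also accept a float tensor of class probabilities of shape $(n, C)$, but integer indices remain the usual convention.) Use LabelEncoder from sklearn to convert string labels to integers before turning them into tensors.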
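The difference between the two encodings is easiest to see on a toy example. The sketch below uses `np.unique` as a dependency-light stand-in for LabelEncoder (both should yield the classes in sorted order); the species list is made up.

```python
import numpy as np

species = np.array(["Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo"])  # toy labels

# Sorted unique classes and, for each example, the index of its class.
classes, y = np.unique(species, return_inverse=True)

print(classes)        # class names in alphabetical order
print(y, y.shape)     # integer indices of shape (n,)

# One-hot form, shown only for contrast with the integer-index form.
one_hot = np.eye(len(classes), dtype=int)[y]
print(one_hot.shape)  # (n, C)
```

The integer vector `y` is the form to convert to a `torch.long` tensor; the one-hot matrix carries the same information in a shape the loss does not expect by default.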

`CrossEntropyLoss` expects logits

The most important convention to remember:

Pitfall: CrossEntropyLoss expects logits, not softmax outputs. Just like BCEWithLogitsLoss in the binary case, nn.CrossEntropyLoss applies the softmax internally (it is equivalent to LogSoftmax followed by NLLLoss), in a numerically stable way using the log-sum-exp trick. Do not apply torch.softmax or nn.Softmax at the end of your model: if you do, the softmax is applied twice, the inputs to the loss are confined to $[0, 1]$, the loss can no longer fall below $\log(1 + (C - 1)/e)$, and the gradients shrink, so training slows down or stalls without any error message.
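The effect of a doubled softmax can be quantified with a small NumPy experiment. The sketch below reimplements the loss from scratch (it does not call PyTorch) and feeds it either raw logits or softmax outputs treated as if they were logits; the logit values are arbitrary.

```python
import numpy as np

def cross_entropy_from_logits(z, y):
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(y)), y].mean())

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

C = 3
y = np.array([0])
for scale in (1.0, 10.0, 100.0):
    z = np.array([[scale, 0.0, 0.0]])                     # increasingly confident, correct logits
    correct = cross_entropy_from_logits(z, y)             # logits fed to the loss
    doubled = cross_entropy_from_logits(softmax(z), y)    # softmax output fed as if it were logits
    print(f"scale={scale:6.1f}  correct={correct:.4f}  doubled={doubled:.4f}")

floor = np.log(1.0 + (C - 1) / np.e)
print("floor for the doubled version:", floor)
```

However confident the model becomes, the doubled version cannot go below the floor, whereas the correct version approaches zero.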

A typical multiclass model therefore reads:

model = nn.Sequential(
    nn.Linear(m, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, C),  # multiclass logits, no softmax
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

Predicting a class

After training, the predicted class is the argmax of the logits along the class dimension:

model.eval()
with torch.no_grad():
    logits = model(X_test)
    y_hat  = torch.argmax(logits, dim=1)

Mathematically:

\hat{y} = \arg\max_{c} z_c

The argmax of the logits and the argmax of the softmax probabilities are the same — softmax is a monotone transformation — so we can skip the softmax altogether at inference time, which is faster and more stable.
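A short check on random logits (synthetic, with an arbitrary seed) confirms that the two argmax computations select the same class.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(scale=3.0, size=(6, 3))     # 6 examples, 3 classes (synthetic)

e = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)

pred_from_logits = logits.argmax(axis=1)
pred_from_probs = probs.argmax(axis=1)
print(pred_from_logits)
print(pred_from_probs)
```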

A complete recipe for `penguins_mini`

Putting everything together, a multiclass classifier on the penguins_mini dataset goes through the following stages:

1. Load the data and separate $X$ from $y$. The target column contains strings such as "Adelie", "Chinstrap", "Gentoo".
2. Encode the target with LabelEncoder, producing integer labels in $\{0, 1, 2\}$.
3. Stratified train/test split with test_size=0.2, random_state=42, stratify=y.
4. Standardize inputs with StandardScaler (fit_transform on train, transform on test).
5. Convert: X_* to torch.float32, y_* to torch.long.
6. Build a Sequential model with one or two hidden layers ending in nn.Linear(h, C).
7. Use nn.CrossEntropyLoss() and an optimizer (SGD or Adam).
8. Run the standard training loop; store the loss in loss_history.
9. At test time, compute logits, take argmax(dim=1), and report accuracy and confusion matrix.
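The stages above can be sketched end to end as follows. Since the penguins_mini file is not reproduced here, the example replaces the loading step with three synthetic, well-separated blobs; the blob centers, layer sizes, learning rate and epoch count are all assumptions chosen for illustration.

```python
import numpy as np
import torch
from torch import nn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)
torch.manual_seed(0)

# Load + encode: synthetic stand-in for penguins_mini, then string labels -> integers.
names = np.array(["Adelie", "Chinstrap", "Gentoo"])
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = rng.integers(0, 3, size=300)
X = centers[labels] + rng.normal(scale=0.7, size=(300, 2))
y = LabelEncoder().fit_transform(names[labels])          # integers in {0, 1, 2}

# Stratified split, then standardization fitted on the training set only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_tr, X_te = scaler.fit_transform(X_tr), scaler.transform(X_te)

# Tensor conversion: float32 inputs, long targets of shape (n,).
X_tr, X_te = torch.tensor(X_tr, dtype=torch.float32), torch.tensor(X_te, dtype=torch.float32)
y_tr, y_te = torch.tensor(y_tr, dtype=torch.long), torch.tensor(y_te, dtype=torch.long)

# Model with a logit head of size C, cross-entropy on raw logits.
C = 3
model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, C))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Standard training loop.
loss_history = []
for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(model(X_tr), y_tr)
    loss.backward()
    optimizer.step()
    loss_history.append(loss.item())

# Evaluation: argmax over the class dimension, accuracy and confusion matrix.
model.eval()
with torch.no_grad():
    y_hat = torch.argmax(model(X_te), dim=1)
accuracy = (y_hat == y_te).float().mean().item()
confusion = torch.zeros(C, C, dtype=torch.long)
for t, p in zip(y_te.tolist(), y_hat.tolist()):
    confusion[t, p] += 1                                  # rows: true class, columns: predicted
print(accuracy)
print(confusion)
```

On the real dataset only the loading step changes: the features and the species column would come from the CSV file instead of the synthetic blobs.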

The structure of the loop is unchanged. What changes between binary and multiclass is essentially the dimension of the last layer (1 vs. $C$ ), the dtype of $y$ (float32 vs. long), the loss class (BCEWithLogitsLoss vs. CrossEntropyLoss), and the prediction rule (threshold vs. argmax).

Summary

The cluster of ideas introduced in this chapter forms the foundation of every classification model you will meet later in the course — from the convolutional networks of the next chapters to large language models trained with cross-entropy on next-token prediction. To synthesize:

Binary classification estimates $P(y = 1 \mid X)$ via a logistic neuron $\sigma(Xw + b)$ , trained by minimizing the binary cross-entropy, which is the negative log-likelihood of a Bernoulli model.
The gradient of the binary cross-entropy with respect to the weights collapses, after a beautiful cancellation, to the same $\tfrac{1}{n} X^T (u - y)$ as in linear regression. The training loop is therefore identical; only the definition of $u$ changes.
In PyTorch, the recommended pattern is nn.Linear(m, 1) with nn.BCEWithLogitsLoss, which combines sigmoid and cross-entropy in a numerically stable way.
A multilayer classifier simply stacks nn.Linear + ReLU blocks before a logit head; the loss and the loop do not change.
Multiclass classification generalizes the construction with $C$ output neurons, the softmax activation, the categorical cross-entropy, and the convention that PyTorch's CrossEntropyLoss consumes raw logits and integer-index targets.

Exercises

Exercise 1 — Gradient by hand. Starting from $E_i = -[y_i \log u_i + (1 - y_i) \log(1 - u_i)]$ with $u_i = \sigma(z_i)$ and $z_i = X_i w + b$ :

Compute $\partial E_i / \partial u_i$ .
Compute $\partial u_i / \partial z_i$ .
Compute $\partial z_i / \partial w_j$ and $\partial z_i / \partial b$ .
Apply the chain rule and check that you recover $\partial E_i / \partial w_j = (u_i - y_i) x_{ij}$ .
Write the batch version $\partial E / \partial w$ and $\partial E / \partial b$ in matrix form.

Exercise 2 — Logistic neuron from scratch. Take your LinearNeuron class from chapter 1 and turn it into a LogisticNeuron. The three changes are: apply a sigmoid in forward, use binary cross-entropy in the loss tracker (with a small $\varepsilon$ added inside the logs to avoid log(0)), and add a predict(X, threshold=0.5) method. Test it on cancer_mini with a stratified train/test split and StandardScaler normalization.

Exercise 3 — PyTorch logistic neuron. Reproduce exercise 2 in PyTorch using nn.Linear(m, 1) and nn.BCEWithLogitsLoss. Compare the test accuracy and the loss curve to those of your from-scratch implementation. They should agree up to small differences due to initialization and optimizer choice.

Exercise 4 — Multilayer network. Replace the single nn.Linear(m, 1) with a multilayer classifier Linear(m, 64) -> ReLU -> Linear(64, 32) -> ReLU -> Linear(32, 1) trained on cancer_mini with BCEWithLogitsLoss. Does adding depth help on this dataset? Plot the loss curves of both models on the same axes.

Exercise 5: Multiclass on penguins_mini. Follow the nine-step recipe given for penguins_mini in the multiclass section. Build a model with one hidden layer of $32$ units and an output layer of size $C = 3$. Use LabelEncoder to encode the species, CrossEntropyLoss as the loss, and argmax(dim=1) as the prediction rule. Report the accuracy and the confusion matrix on the test set.

Exercise 6: BCELoss vs. BCEWithLogitsLoss. Build two equivalent models on cancer_mini: model A is nn.Sequential(nn.Linear(m, 1), nn.Sigmoid()) with BCELoss; model B is nn.Linear(m, 1) with BCEWithLogitsLoss. Train both with the same data, optimizer and seed. Verify that the loss curves are visually identical, and check what happens to the gradients at extreme initial weights (in the BCELoss version the sigmoid may saturate to exactly $0$ or $1$, so the gradients vanish; the BCEWithLogitsLoss version does not have this problem).

Exercise 7: One softmax too many. On penguins_mini, deliberately add an nn.Softmax(dim=1) at the end of your multiclass model and keep CrossEntropyLoss. Train, plot the loss, and compare with the version without softmax. Explain in two or three sentences what goes wrong.

Going further

PyTorch documentation — nn.BCEWithLogitsLoss and nn.CrossEntropyLoss. Both pages contain a precise description of the computation, the input/target shape conventions, and the rationale for the log-sum-exp trick. Reading them once carefully avoids most beginner confusion. https://pytorch.org/docs/stable/nn.html#loss-functions
Goodfellow, Bengio, Courville — Deep Learning, chapter 6 ("Deep Feedforward Networks"). The reference textbook for the conceptual scaffolding of MLPs: choice of activation, choice of output unit, choice of cost function, and the generic maximum likelihood perspective that unifies regression and classification.
Andrej Karpathy — "A Recipe for Training Neural Networks". A short and dense blog post listing the practical reflexes that separate working neural networks from broken ones: visualize your data first, start with a small model, fix one thing at a time, scrutinize the loss curve, and so on. Required reading before tackling the convolutional chapters that follow. https://karpathy.github.io/2019/04/25/recipe/
Christopher Olah — "Calculus on Computational Graphs: Backpropagation". A very visual introduction to the chain rule on graphs, which makes it crystal clear why autograd works and why the gradient cancellation we observed in this chapter is structural rather than accidental.
scikit-learn — LogisticRegression user guide. A complementary view from the classical-statistics side: regularization, multinomial vs. one-vs-rest, solver choice. Useful as a sanity check whenever you are tempted to deploy a 50-million-parameter network on a dataset where logistic regression would do.