
Deep Learning 3 — Convolutional Networks, Part 1

The two previous chapters built up a small but complete deep-learning vocabulary: a linear neuron, the chain rule, stochastic gradient descent, the cross-entropy loss, and a multi-layer perceptron (MLP) able to classify images of handwritten digits with reasonable accuracy. Yet the way the MLP "sees" an image is profoundly unsatisfying. To feed a 28 by 28 picture into nn.Linear, we had to flatten it into a vector of 784 numbers, throwing away every notion of which pixel sat next to which. The first hidden neuron treated pixel $(0, 0)$ and pixel $(13, 14)$ as completely interchangeable inputs. A digit shifted two pixels to the right became, from the network's point of view, an entirely different example.

This chapter introduces the architecture that fixed those defects and quietly revolutionised computer vision: the convolutional neural network (CNN). We will see how the operation of 2D convolution exploits two structural properties of natural images — locality and translation equivariance — to drastically cut the number of parameters while improving generalisation. We will translate every concept into PyTorch (nn.Conv2d, nn.MaxPool2d, nn.Flatten, nn.BatchNorm2d), introduce the all-important (B, C, H, W) tensor convention, and train a small CNN end-to-end on MNIST. By the end of the chapter you will be able to read a CNN architecture diagram, predict the shape of every intermediate tensor, and write a clean PyTorch model both with nn.Sequential and with a class inheriting from nn.Module.

From MLP to CNN: why locality matters

An MLP applied to an image of size $H \times W$ collapses the spatial grid into a vector of length $H \cdot W$ and connects it to the first hidden layer through a dense matrix. If the hidden layer has $h$ neurons, that matrix already contains $H \cdot W \cdot h$ weights: for a 28 by 28 image and a modest $h = 100$, this is 78 400 parameters in a single layer. For a 224 by 224 colour image of the kind ImageNet uses, the three colour channels add a further factor of 3, and the same matrix balloons to $224 \cdot 224 \cdot 3 \cdot 100 \approx 15$ million weights in the first layer alone. Worse, every one of those weights has to be learned independently: nothing in the model architecture says that the pixel at position $(i, j)$ is geometrically related to its neighbour at $(i, j+1)$.

Two observations about real images suggest a much better design. The first is locality: the pieces of structure that matter — edges, corners, textures, small motifs — extend over a few neighbouring pixels at a time, not across the entire image. To detect an edge in the upper-left corner you do not need to look at the lower-right corner. The second is translation equivariance: a vertical edge is a vertical edge whether it appears in the top-left or the bottom-right of the picture. The detector that recognises it should therefore be the same detector everywhere, not a different one for every position.

A convolutional layer hard-codes both properties into the architecture. Each neuron in the layer is connected only to a small spatial neighbourhood of its input — typically a 3 by 3 or 5 by 5 window — and the same set of weights is reused at every position of the image. This is the principle of weight sharing. Its consequences are dramatic: far fewer parameters than an MLP, much better generalisation, and a built-in robustness to small translations of the input.
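To make the saving from weight sharing concrete, the following sketch counts the parameters of a dense layer and of a convolutional layer. The helper names dense_params and conv2d_params, and the choice of 64 filters, are illustrative assumptions rather than anything defined in this chapter.

```python
def dense_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def conv2d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights plus biases of a square 2D convolution.

    Each filter holds in_channels * k * k weights and one bias. The count
    does not depend on the image size, because the weights are shared
    across all spatial positions.
    """
    return out_channels * (in_channels * kernel_size * kernel_size + 1)


# Dense layer on a flattened 224 x 224 RGB image with 100 hidden units.
print(dense_params(224 * 224 * 3, 100))   # 15052900

# Convolution with 64 filters of size 3 x 3 on the same RGB image
# (64 is an arbitrary, illustrative number of filters).
print(conv2d_params(3, 64, 3))            # 1792
```

The convolution's count would stay at 1792 for an image of any size, whereas the dense layer grows linearly with the number of pixels.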

The 2D convolution operation

A 2D convolution applies a small filter (also called a kernel) to an input image in order to produce a feature map. At each position the filter observes a local neighbourhood, computes a weighted sum of the pixels it covers (plus a bias), and produces one output activation. Sliding the filter over every valid position of the image yields a complete feature map, which is itself a 2D array.

Formally, if $X$ is the input and $K$ a kernel of size $k \times k$, the output at position $(i, j)$ is

$$Y_{i,j} = \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} K_{u, v} \, X_{i+u, \, j+v} + b.$$

The kernel weights $K_{u, v}$ and the bias $b$ are the trainable parameters of the layer. They are exactly the same regardless of the position $(i, j)$ at which the kernel is applied; this is the formal expression of weight sharing. A single 3 by 3 kernel has only 9 weights plus 1 bias, yet it produces an output value at every spatial position of the image.
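The formula above can be transcribed almost literally into nested loops. The sketch below is a deliberately naive, single-channel version with no padding and a stride of 1; the function name conv2d_single and the edge-detecting kernel are illustrative assumptions. Note that, like the formula, it does not flip the kernel, so strictly speaking it computes a cross-correlation, which is what deep-learning libraries call convolution.

```python
def conv2d_single(X, K, b=0.0):
    """Naive single-channel 2D convolution (no padding, stride 1).

    X is a list of rows, K a square k x k kernel, b the bias.
    """
    H, W = len(X), len(X[0])
    k = len(K)
    out = []
    for i in range(H - k + 1):            # every valid vertical position
        row = []
        for j in range(W - k + 1):        # every valid horizontal position
            s = b
            for u in range(k):
                for v in range(k):
                    s += K[u][v] * X[i + u][j + v]
            row.append(s)
        out.append(row)
    return out


# A 4 x 4 image with a vertical edge between columns 1 and 2.
X = [[0, 0, 1, 1] for _ in range(4)]

# Hand-written kernel that responds to dark-to-bright vertical edges.
K = [[-1, 0, 1],
     [-1, 0, 1],
     [-1, 0, 1]]

print(conv2d_single(X, K))   # [[3.0, 3.0], [3.0, 3.0]]
```

The same nine weights are applied at each of the four output positions, which is weight sharing in its simplest form. On a uniform image the same kernel would return zeros everywhere.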

A convolutional layer does not learn just one filter, it learns several in parallel. Each filter specialises in a different pattern — one might react to vertical edges, another to horizontal edges, a third to a particular texture — and each produces its own feature map. The number of filters is called the number of output channels.

Three hyperparameters control the geometry of the operation. The kernel size, written $K$ from here on (it is the $k$ of the formula above, not the kernel matrix itself), sets the spatial extent of the filter. The stride $S$ sets the step by which the filter moves between two consecutive applications: a stride of 1 produces an output at every pixel, a stride of 2 skips every other position and halves the spatial resolution along each axis. The padding $P$ adds $P$ rows and columns of zeros on each side of the input so that the filter can also be applied near the borders. Without padding, the output is slightly smaller than the input; with $P = (K-1)/2$ for an odd kernel size and $S = 1$, the output keeps the same spatial size as the input ("same" padding).

These parameters interact through one piece of arithmetic worth memorising:

Convolution arithmetic: output size formula. For an input of spatial size $H$, kernel size $K$, padding $P$ and stride $S$, the output spatial size is $H_{\text{out}} = \left\lfloor \frac{H - K + 2P}{S} \right\rfloor + 1$. With $K = 3$, $P = 1$, $S = 1$, the size is preserved: $H_{\text{out}} = H$. With $K = 3$, $P = 0$, $S = 1$, two rows and two columns are lost: $H_{\text{out}} = H - 2$.
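The formula is short enough to keep as a helper function while designing a network. The sketch below assumes square inputs and kernels; the name conv_output_size is a hypothetical helper, not a PyTorch function.

```python
def conv_output_size(h: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output spatial size for input size h, kernel k, padding p, stride s."""
    return (h - k + 2 * p) // s + 1


print(conv_output_size(28, k=3, p=1, s=1))   # 28  (size preserved)
print(conv_output_size(28, k=3, p=0, s=1))   # 26  (two rows/columns lost)
print(conv_output_size(28, k=3, p=1, s=2))   # 14  (stride 2 halves the size)
print(conv_output_size(28, k=2, p=0, s=2))   # 14  (a 2 x 2 pooling window obeys the same formula)
```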

Tensor shapes: the (B, C, H, W) convention

Before any code can be written we need to agree on how images are stored as tensors. PyTorch — and most deep learning frameworks — represents image data with a strict four-dimensional convention.

Shape pivot rule. A batch of images is always stored as a 4D tensor of shape $(B, \, C, \, H, \, W)$, where $B$ is the batch size, $C$ the number of channels, and $H \times W$ the spatial size. nn.Conv2d never accepts a 2D tensor: it expects this 4D layout (recent PyTorch versions also accept a single unbatched image of shape $(C, H, W)$, but the channel axis is still mandatory). Every reshape, every unsqueeze, every transpose you write should aim to land precisely on this layout.

For a grayscale image like an MNIST digit there is one channel, so $C = 1$. For an RGB image there are three, so $C = 3$. Channels are the second axis in PyTorch, not the last as in the arrays returned by Matplotlib or Pillow. This channels-first convention is the source of one of the most common beginner bugs: reading an RGB image with plt.imread, getting an array of shape $(H, W, 3)$, and feeding it directly to a CNN. For a batch of such images stacked into an array of shape $(N, H, W, 3)$, the fix is a transpose:

X = np.transpose(X, (0, 3, 1, 2))   # (N, H, W, 3) -> (N, 3, H, W)

For a grayscale dataset stored as a flat CSV, the typical pipeline is to reshape into 2D images and then add the missing channel dimension explicitly:

X = X.reshape(n, 28, 28)            # (N, 28, 28)
X_t = torch.tensor(X, dtype=torch.float32).unsqueeze(1)
                                    # (N, 1, 28, 28)

unsqueeze(1) does not modify any value; it merely inserts a singleton dimension at position 1 to bring the tensor in line with the (B, C, H, W) rule. Forgetting this call is, in our experience, the most frequent error when moving from MLPs to CNNs.

A Conv2d layer applied to such a tensor produces an output of shape $(B, C_{\text{out}}, H_{\text{out}}, W_{\text{out}})$, where $C_{\text{out}}$ is the number of filters declared in the layer.
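A quick way to check this rule is to push a random tensor through a layer and print the shapes. The sketch below uses arbitrary illustrative sizes (a batch of 8, 16 filters, no padding); it also shows that the learned weights are themselves stored as a 4D tensor.

```python
import torch
import torch.nn as nn

# A batch of 8 random "grayscale images", already in (B, C, H, W) layout.
x = torch.randn(8, 1, 28, 28)

# 16 filters of size 3 x 3, no padding (sizes chosen for illustration only).
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)
y = conv(x)

print(x.shape)             # torch.Size([8, 1, 28, 28])
print(y.shape)             # torch.Size([8, 16, 26, 26])
print(conv.weight.shape)   # torch.Size([16, 1, 3, 3])  -> (C_out, C_in, K, K)
print(conv.bias.shape)     # torch.Size([16])           -> one bias per filter
```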

Pooling: controlled spatial reduction

Stacking convolutions alone keeps the spatial resolution roughly constant. To build a deep network that progressively summarises the image into a compact representation suitable for classification, we need an operation that reduces the spatial size while preserving the most informative features. That operation is pooling.

Max pooling is the most common variant. A 2 by 2 max-pooling layer slides a 2 by 2 window over the feature map and keeps, in each window, only the maximum value. The output is therefore half as large along each axis, $(H, W) \to (H/2, W/2)$, and contains four times fewer values. The number of channels is unchanged. Average pooling does the same with the mean instead of the max; it is less common in practice but appears in some classical architectures.

Pooling has three desirable effects. It quickly reduces the computational cost of subsequent layers, since each 2 by 2 pooling divides the number of pixels to process by four. It enlarges the receptive field (the region of the input image that influences a given output activation) without adding new parameters. And because the maximum is unchanged by small displacements within the pooling window, it makes the representation slightly robust to small translations of the input.

In PyTorch, max pooling is one line:

nn.MaxPool2d(kernel_size=2, stride=2)

When stride is omitted it defaults to kernel_size, so nn.MaxPool2d(2) is equivalent and the windows do not overlap.
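To see what the layer computes, here is a minimal hand-written version of 2 by 2 max pooling with stride 2 on a single feature map. The function name max_pool_2x2 and the numbers in the example are illustrative assumptions; the sketch also assumes that the height and width are even.

```python
def max_pool_2x2(fmap):
    """2 x 2 max pooling with stride 2 on one feature map (even H and W assumed)."""
    H, W = len(fmap), len(fmap[0])
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, W, 2)]          # non-overlapping windows, left to right
        for i in range(0, H, 2)            # non-overlapping windows, top to bottom
    ]


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool_2x2(fmap))   # [[4, 2], [2, 8]]
```

The 16 input values are summarised by 4 outputs, one per window.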

The receptive field and the classical CNN architecture

The receptive field of a unit deep in the network is the region of the input image that can possibly influence its activation. After a single 3 by 3 convolution, each output unit sees a 3 by 3 region of the input. After two stacked 3 by 3 convolutions, the receptive field grows to 5 by 5: the second layer combines 3 by 3 windows of activations that themselves see 3 by 3 patches. A 2 by 2 max-pool with stride 2 enlarges the field by only one pixel, but it doubles the distance, measured in input pixels, between neighbouring units, so every later convolution widens the receptive field twice as fast. By stacking convolutions and pooling, the receptive field of the deepest units grows quickly enough to cover the entire image, but only after several layers.
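The growth of the receptive field can be tracked with a two-variable recurrence: the current field size, and the step (in input pixels) between neighbouring units, which is multiplied by each stride. The sketch below is one common way to write it, assuming square kernels and no dilation; receptive_field is a hypothetical helper name.

```python
def receptive_field(layers):
    """Side length, in input pixels, of the receptive field of one output unit.

    layers is a list of (kernel_size, stride) pairs, ordered from input to output.
    Convolutions and pooling layers are treated alike.
    """
    r, step = 1, 1                 # a raw pixel sees itself; neighbours are 1 pixel apart
    for k, s in layers:
        r += (k - 1) * step        # the layer widens the field by (k - 1) steps
        step *= s                  # a stride multiplies the step between neighbouring units
    return r


print(receptive_field([(3, 1)]))                    # 3  one 3 x 3 convolution
print(receptive_field([(3, 1), (3, 1)]))            # 5  two stacked 3 x 3 convolutions
print(receptive_field([(3, 1), (2, 2)]))            # 4  convolution, then 2 x 2 max-pool
print(receptive_field([(3, 1), (2, 2), (3, 1)]))    # 8  a convolution after the pool adds 4, not 2
```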

This observation justifies the classical CNN architecture used in textbooks since LeCun's LeNet-5 in 1998:

Input (B, 1, H, W)
 -> [Conv2d -> ReLU -> MaxPool2d]   x N times
 -> Flatten
 -> [Linear -> ReLU]                x M times
 -> Linear (logits)

The convolutional stage extracts a hierarchy of local features, from low-level edges and textures to high-level motifs and parts. The flattening operation pivots the 4D tensor into a 2D matrix of shape $(B, C \cdot H \cdot W)$ . The dense stage then performs the classification proper, exactly as in a regular MLP, on top of this learned representation.

Building a small CNN in PyTorch with nn.Sequential

It is time to put the bricks together. Our first model will use nn.Sequential, the same convenience container we already met for MLPs. The vocabulary is essentially identical — only the building blocks change.

A 2D convolution is created with

nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)

where in_channels must match the channel dimension of the input tensor and out_channels is the number of filters the layer will learn. nn.ReLU() introduces the non-linearity. Without it, a stack of convolutions would collapse into a single linear (affine) operator, and the depth would buy nothing. nn.MaxPool2d(2) halves $H$ and $W$. nn.Flatten() collapses (B, C, H, W) into (B, C * H * W) so that an nn.Linear layer can produce the final logits. The classification loss is nn.CrossEntropyLoss(), which expects raw logits of shape (B, n_classes) and integer labels of shape (B,).

For MNIST, with a single 16-filter convolution and one pooling, a complete model needs only five layers:

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # (B, 16, 28, 28)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # (B, 16, 14, 14)
    nn.Flatten(),                                 # (B, 16*14*14)
    nn.Linear(16 * 14 * 14, 10),
)

The comments after each line track the shape of the running tensor. With $K = 3$ and $P = 1$ the convolution preserves the spatial size, so the input $(B, 1, 28, 28)$ becomes $(B, 16, 28, 28)$. The pool divides $H$ and $W$ by two, giving $(B, 16, 14, 14)$. The flatten produces $(B, 3136)$, which is also the value we pass to the Linear layer's in_features. Predicting these shapes by hand, with the convolution-arithmetic formula, is a habit worth cultivating: in the CNN world, mistakes typically reveal themselves as cryptic shape-mismatch errors at the boundary between the convolutional stage and the linear head.
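Hand predictions are easy to verify by sending a dummy batch through the model one layer at a time. The following sketch rebuilds the five-layer model so that it is self-contained; the batch size of 2 and the variable name shapes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 10),
)

x = torch.zeros(2, 1, 28, 28)        # dummy batch of two blank images
shapes = []
with torch.no_grad():                # no gradients needed for a shape check
    for layer in model:              # nn.Sequential can be iterated layer by layer
        x = layer(x)
        shapes.append(tuple(x.shape))
        print(f"{layer.__class__.__name__:<10} -> {tuple(x.shape)}")
```

If the in_features of the Linear layer were wrong, the loop would fail exactly at that layer, with the previous line of output showing the shape that was actually produced.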

Training a CNN on MNIST end-to-end

The end-to-end training script for MNIST follows the same pattern we used for MLPs, with three additions: the data is reshaped into images and given a channel dimension, the model is a CNN, and we keep an eye on the shapes.

import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv("mnist_small.csv")

X = df.drop(columns="label").to_numpy()
y = df["label"].to_numpy()

n = X.shape[0]
X = X.reshape(n, 28, 28) / 255.0           # (N, 28, 28), normalised

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train_t = torch.tensor(X_train, dtype=torch.float32).unsqueeze(1)
y_train_t = torch.tensor(y_train, dtype=torch.long)
X_test_t  = torch.tensor(X_test,  dtype=torch.float32).unsqueeze(1)
y_test_t  = torch.tensor(y_test,  dtype=torch.long)

train_loader = DataLoader(
    TensorDataset(X_train_t, y_train_t),
    batch_size=64, shuffle=True,
)

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 10),
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):
    epoch_loss = 0.0
    for Xb, yb in train_loader:
        optimizer.zero_grad()
        logits = model(Xb)
        loss = criterion(logits, yb)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f"epoch {epoch} - loss {epoch_loss / len(train_loader):.4f}")

model.eval()   # inference mode; matters once BatchNorm or dropout layers are present
with torch.no_grad():
    logits = model(X_test_t)
    y_hat = torch.argmax(logits, dim=1).numpy()

print("Accuracy:", accuracy_score(y_test, y_hat))
print(confusion_matrix(y_test, y_hat))

A few observations are in order. The pixel values are divided by 255 so that they lie in $[0, 1]$; CNNs train far better on normalised inputs than on raw byte-valued images. Labels given as class indices must be torch.long, never float, because CrossEntropyLoss uses them to select the logit of the correct class. The DataLoader only wraps the training data: the test set is small enough to be evaluated in a single forward pass without batching. Inside the evaluation block, torch.no_grad() disables the autograd machinery, saving memory and speeding up inference.

On mnist_small.csv this minimal CNN reaches roughly 97 to 98 percent accuracy after a few dozen epochs — a clear improvement over the MLP baseline, with fewer parameters and more graceful behaviour on translated digits.

Stacking convolutional layers

A single convolution is enough to grasp the mechanism, but a real CNN is deep: it stacks several convolutional blocks before the dense head. The principle is straightforward — the out_channels of one layer becomes the in_channels of the next.

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # (B, 16, 28, 28)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # (B, 16, 14, 14)
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # (B, 32, 14, 14)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # (B, 32, 7, 7)
    nn.Flatten(),                                 # (B, 32*7*7)
    nn.Linear(32 * 7 * 7, 10),
)

The number of channels typically increases as we go deeper (16, then 32, then 64, then 128). Intuitively, the deeper layers must encode richer combinations of low-level features, so they need more "vocabulary" (more filters) to express them. Meanwhile each pooling halves $H$ and $W$, dividing the number of spatial positions by four; doubling the channels at the same time means the number of activations per layer shrinks by a factor of two per block rather than four (here from $16 \cdot 14 \cdot 14 = 3136$ to $32 \cdot 7 \cdot 7 = 1568$).

Two errors are particularly common when stacking. The first is forgetting to update in_channels of the next convolution to match the previous layer's out_channels. The second is mis-computing the in_features of the final Linear layer. The convolution-arithmetic formula above is your ally; use it methodically, tracking the spatial size after each block. With $K = 3$, $P = 1$, two consecutive 2-by-2 max-pools turn 28 by 28 into 7 by 7, so the linear layer must accept $32 \times 7 \times 7 = 1568$ inputs.
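The bookkeeping described above can be automated. The sketch below assumes a stack of identical [Conv2d -> ReLU -> MaxPool2d] blocks with square kernels and applies the output-size formula block by block; the helper names conv_out and flattened_features are hypothetical.

```python
def conv_out(h, k, p=0, s=1):
    """Output size formula: floor((h - k + 2p) / s) + 1."""
    return (h - k + 2 * p) // s + 1


def flattened_features(h, w, channels, k=3, p=1, pool=2):
    """in_features of the Linear head after a stack of conv/pool blocks.

    channels lists the out_channels of each [Conv2d(k, padding=p) -> ReLU -> MaxPool2d(pool)] block.
    """
    for _ in channels:
        h, w = conv_out(h, k, p), conv_out(w, k, p)                    # convolution
        h, w = conv_out(h, pool, 0, pool), conv_out(w, pool, 0, pool)  # pooling
    return channels[-1] * h * w


print(flattened_features(28, 28, [16]))       # 3136  one block:  16 * 14 * 14
print(flattened_features(28, 28, [16, 32]))   # 1568  two blocks: 32 * 7 * 7
```

Only the last entry of channels enters the product, because each convolution replaces the channel count of its input with its own out_channels.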

BatchNorm2d: stabilising training in deep CNNs

Once the network has more than two or three convolutional blocks, training with plain SGD becomes noticeably more delicate: the loss oscillates, the learning rate has to be carefully tuned, and the model occasionally fails to converge at all. Batch normalisation, introduced by Ioffe and Szegedy in 2015, is a small but powerful trick that fixes most of those issues. The idea is to renormalise the activations of each channel so that, within each minibatch, they have zero mean and unit variance:

$$\hat{x} = \frac{x - \mu_{\text{batch}}}{\sqrt{\sigma^2_{\text{batch}} + \varepsilon}}, \qquad y = \gamma \hat{x} + \beta,$$

where $\gamma$ and $\beta$ are two learnable parameters per channel that allow the network to recover any scale and offset it actually needs. In PyTorch the layer is called nn.BatchNorm2d and it is conventionally inserted right after the convolution, before the activation:

nn.Conv2d(16, 32, kernel_size=3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),

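The effect is typically a markedly smoother loss curve, robustness to a wider range of learning rates, and faster convergence. The usual intuition is that batch normalisation keeps the distribution of activations stable from layer to layer, which limits the pathological gradients that can otherwise build up during back-propagation; the precise mechanism is still debated. One practical consequence deserves attention: during training the layer normalises with the statistics of the current minibatch, whereas at inference it uses running averages accumulated during training, so call model.eval() before evaluating and model.train() before resuming training.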
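The normalisation formula can be checked against nn.BatchNorm2d directly. In the sketch below the tensor sizes and the offset and scale of the fake activations are arbitrary; it relies on the fact that a freshly created layer starts with $\gamma = 1$ and $\beta = 0$, and that in training mode the statistics are computed per channel over the batch and both spatial axes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 4, 8, 8) * 3.0 + 5.0    # fake activations: mean about 5, std about 3

bn = nn.BatchNorm2d(4)
bn.train()                                  # training mode: use the statistics of this batch
y = bn(x)

# Manual version of the formula: one mean and one variance per channel.
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
y_manual = (x - mu) / torch.sqrt(var + bn.eps)   # gamma = 1, beta = 0 at initialisation

print(torch.allclose(y, y_manual, atol=1e-4))    # the two results agree
print(y.mean(dim=(0, 2, 3)))                     # close to 0 for every channel
print(y.std(dim=(0, 2, 3), unbiased=False))      # close to 1 for every channel
```

After bn.eval() the same call would instead use the running averages stored in bn.running_mean and bn.running_var, so its output would generally differ from y_manual.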

Alternative syntax: nn.Module classes

nn.Sequential is convenient for strictly linear pipelines, but real architectures — those with skip connections, branches, or layers reused at different depths — quickly outgrow it. The standard PyTorch idiom is to define a class that inherits from nn.Module, declare every layer in __init__, and describe the data flow in forward.

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool  = nn.MaxPool2d(2)
        self.relu  = nn.ReLU()
        self.fc    = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))   # (B, 16, 14, 14)
        x = self.pool(self.relu(self.conv2(x)))   # (B, 32,  7,  7)
        x = torch.flatten(x, start_dim=1)         # (B, 32*7*7)
        x = self.fc(x)
        return x

The two methods cooperate. __init__ is executed once when the model is instantiated; it creates the layers and registers them as attributes, which causes PyTorch to track their parameters automatically and to move them to the right device when model.to(device) is called. forward is executed at every forward pass; it specifies the order of operations, can include if statements, can reuse the same layer multiple times (note how self.pool and self.relu are applied twice), and can implement non-sequential connections — all things that nn.Sequential cannot express.

Note one important subtlety. The flatten operation appears here as torch.flatten(x, start_dim=1) rather than as a layer. This is a stateless functional call, equivalent to nn.Flatten(), but written inline in the forward pass. Both styles are perfectly valid; choose the one that reads more naturally in the surrounding code.
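The equivalence of the two styles is easy to confirm on a random tensor. The shape used here matches the output of the second block of SimpleCNN, with an arbitrary batch size of 8.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 32, 7, 7)

a = torch.flatten(x, start_dim=1)   # functional call, as used inside forward
b = nn.Flatten()(x)                 # module version; start_dim is 1 by default

print(a.shape)              # torch.Size([8, 1568])
print(torch.equal(a, b))    # True
```

In both cases the batch axis is left untouched, which is why start_dim is 1 rather than 0.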

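Once the model is defined, training proceeds exactly as before. Here we also swap SGD for the Adam optimiser, a common default that usually needs less learning-rate tuning; the training loop itself is unchanged: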

model = SimpleCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

Moving training to the GPU

CNNs are dramatically faster on GPUs than on CPUs — typically by one or two orders of magnitude. PyTorch does not move anything implicitly; the choice of device must be explicit. The conventional pattern, valid for any model and any dataset, is the following. First, decide on a device:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Then move the model to that device once and for all:

model = model.to(device)

Finally, inside the training loop, move each minibatch to the same device:

for Xb, yb in train_loader:
    Xb = Xb.to(device)
    yb = yb.to(device)
    optimizer.zero_grad()
    logits = model(Xb)
    loss = criterion(logits, yb)
    loss.backward()
    optimizer.step()

The model and the data must live on the same device — a CPU tensor fed to a GPU model triggers an immediate runtime error. For evaluation, the same logic applies, with the additional precaution of disabling gradients:

model.eval()
with torch.no_grad():
    logits = model(X_test_t.to(device))
    y_hat = torch.argmax(logits, dim=1)

y_hat = y_hat.detach().cpu().numpy()

The triple .detach().cpu().numpy() is the canonical way to bring a tensor back to NumPy: detach it from the autograd graph, move it to the CPU, and convert it. Forgetting either of the first two steps is a frequent source of confusing errors when computing scikit-learn metrics on GPU outputs.
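These steps are often wrapped in a small helper so that evaluation code does not have to repeat them. The sketch below is one possible version, not part of PyTorch: the name predict, its batch_size argument, and the stand-in linear model are assumptions made for illustration. It evaluates in batches, which also helps when the test set does not fit in GPU memory at once, and it runs unchanged on a machine without a GPU.

```python
import torch
import torch.nn as nn


def predict(model, X, device, batch_size=256):
    """Return predicted class indices as a NumPy array, evaluating X in batches."""
    model.eval()                                   # inference behaviour for BatchNorm/dropout
    preds = []
    with torch.no_grad():                          # no autograd graph, so no detach needed
        for start in range(0, X.shape[0], batch_size):
            xb = X[start:start + batch_size].to(device)
            preds.append(torch.argmax(model(xb), dim=1).cpu())
    return torch.cat(preds).numpy()


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in model and random data, only to exercise the helper.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
X = torch.rand(600, 1, 28, 28)

y_hat = predict(model, X, device)
print(y_hat.shape)   # (600,)
```

The returned array lives on the CPU and can be passed directly to scikit-learn functions such as accuracy_score.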

Exercises

Convolution arithmetic. Apply the formula $H_{\text{out}} = \lfloor (H - K + 2P)/S \rfloor + 1$ to determine the output spatial size of the following layers, given an input of size $32 \times 32$: (a) Conv2d(3, 16, kernel_size=3, padding=1, stride=1); (b) Conv2d(3, 16, kernel_size=5, padding=0, stride=1); (c) Conv2d(3, 16, kernel_size=3, padding=1, stride=2); (d) MaxPool2d(2). For each layer, also state the output number of channels.
Reshape pattern for MNIST. Starting from a NumPy array X of shape (N, 784) containing flattened MNIST digits and a vector of integer labels y, write the four lines of code that produce, in order, an array of shape (N, 28, 28), then a normalised float tensor of shape (N, 1, 28, 28), then a long tensor of labels, then a DataLoader with a batch size of 32. Compare your snippet with the canonical pipeline in the chapter.
Counting parameters. For the small MNIST CNN of the chapter — Conv2d(1, 16, 3, padding=1) -> ReLU -> MaxPool2d(2) -> Flatten -> Linear(16*14*14, 10) — count the total number of trainable parameters. Compare with an MLP that flattens the 784-pixel image and maps it through a single Linear(784, 100) -> ReLU -> Linear(100, 10) head. Which model has more parameters? Which one would you expect to generalise better?
Stacking and pooling depth. You are designing a CNN for 64 by 64 grayscale images. You want each convolution to preserve the spatial size and each pooling to halve it. How many MaxPool2d(2) layers can you afford before the spatial map shrinks to 1 by 1? After three such poolings with output channel counts of 32, 64, 128, what is the input dimension of the final Linear layer?
From Sequential to Module. Take the two-convolution nn.Sequential model of the chapter and rewrite it as a class SimpleCNN(nn.Module) with explicit __init__ and forward methods. Use torch.flatten(x, start_dim=1) instead of nn.Flatten(). Verify on a single forward pass with a random tensor of shape (8, 1, 28, 28) that the output has shape (8, 10).
Adding BatchNorm. Insert a nn.BatchNorm2d after each convolution of the previous exercise, before the ReLU. Train on mnist_small.csv for a fixed number of epochs and compare the loss curves with and without batch normalisation. What do you observe about the smoothness of the curve and the final accuracy?
Receptive field. Compute the receptive field of a single output unit at the deepest layer of the architecture Conv2d(K=3) -> MaxPool2d(2) -> Conv2d(K=3) -> MaxPool2d(2) -> Conv2d(K=3). Express the answer in pixels of the input image. Does this receptive field cover the whole image for a 28 by 28 input?
GPU pipeline. Adapt the CIFAR-10 starter from the notebook (RGB images of shape (N, 32, 32, 3)) to run on GPU. Pay particular attention to the channels-last to channels-first transpose, and to placing both the model and every minibatch on the same device.

Going further

The classical reference for the topics introduced here is the Stanford CS231n course, "Convolutional Neural Networks for Visual Recognition", whose lecture notes — freely available online — give an exceptionally clear treatment of convolution arithmetic, pooling, receptive fields and architectural design choices. They are an excellent companion to this chapter for any reader who wishes to go further.

Plain stacks of convolutions like the ones built in this chapter do not scale to arbitrary depth: past a certain number of layers they become very hard to train, a difficulty commonly attributed to vanishing gradients. The breakthrough that unlocked truly deep CNNs was the residual network of He, Zhang, Ren and Sun, "Deep Residual Learning for Image Recognition" (CVPR 2016). By introducing skip connections that let each block learn a residual correction rather than the full transformation, ResNet made it routine to train networks with 50, 101 or even 152 layers. ResNet and its descendants are the workhorses of modern computer vision.

In practice, you rarely implement ResNet from scratch. The torchvision library ships pre-built and pre-trained versions of all the standard architectures — ResNet, VGG, DenseNet, EfficientNet, Vision Transformers — together with the canonical image datasets (MNIST, CIFAR, ImageNet) and a rich set of image transformations. A single line such as torchvision.models.resnet18(weights="DEFAULT") returns a fully trained ImageNet classifier ready to fine-tune on your own data. The next chapter will pick up exactly there.