Deep Learning 3 — Convolutional networks (1/3)
:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →
French and Chinese versions available from the home page.
:::
First contact with convolutional neural networks (CNNs). The main conceptual challenge of this chapter: PyTorch's BCHW tensor convention. Once that rule is well understood, the rest follows naturally.
Why this chapter?
You'll learn:

- why a CNN rather than an MLP for images;
- the `(B, C, H, W)` pivot rule PyTorch expects for images;
- the bricks `Conv2d`, `MaxPool2d`, `Flatten`;
- how to stack several conv layers;
- the `nn.Module` syntax (the professional way to write a model);
- GPU usage with `.to(device)`.
Why a CNN?
An MLP treats an image as a 1D vector (flattened into H×W×C values). Three problems:
- Loss of spatial proximity: for the MLP, pixel (i, j) and pixel (i, j+1) are as foreign as pixel (0, 0) and pixel (27, 27).
- Huge parameter count: for a 224×224 RGB image, ~150,000 input features. With 100 hidden neurons, ~15 million weights for the first layer alone.
- No translation invariance: if the object shifts by a few pixels, the network no longer recognises it.
The CNN solves all three by exploiting two ideas:
- Local connectivity: a neuron only looks at a small neighbourhood (3×3 pixels), not the whole image.
- Weight sharing: the same 3×3 filter is applied at all positions. Far fewer parameters (see the comparison below), plus free translation invariance.
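To make the parameter-count argument concrete, here is a minimal sketch comparing the first dense layer of an MLP with a convolutional layer (numbers match the 224×224 RGB example above; the choice of 100 filters is illustrative):

```python
import torch.nn as nn

# First dense layer of an MLP on a flattened 224×224 RGB image
fc = nn.Linear(224 * 224 * 3, 100)
print(sum(p.numel() for p in fc.parameters()))    # 15052900 parameters

# One hundred 3×3 filters over 3 input channels, shared across all positions
conv = nn.Conv2d(3, 100, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))  # 2800 parameters
```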
The golden rule: (B, C, H, W)
PyTorch always expects a 4D tensor `(batch, channels, height, width)` for convolutions. Period.
Dimension breakdown:
| Dim | Meaning | MNIST | CIFAR-10 |
|---|---|---|---|
| B | number of images processed in parallel | 64 | 64 |
| C | number of channels | 1 (grey) | 3 (RGB) |
| H | height in pixels | 28 | 32 |
| W | width in pixels | 28 | 32 |
This convention is called BCHW (or NCHW). PyTorch enforces it for GPU performance reasons.
:::warning matplotlib uses a different convention
matplotlib (and numpy in general) expects images in HWC (height, width, channels). This difference will force a few permutations for RGB images.
:::
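For example, displaying a single image stored in CHW (a minimal sketch with a random fake image; `plt.imshow` expects HWC):

```python
import matplotlib.pyplot as plt
import torch

img = torch.rand(3, 32, 32)        # a fake RGB image in CHW
plt.imshow(img.permute(1, 2, 0))   # CHW → HWC, the convention matplotlib expects
plt.show()
```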
For every image-tensor manipulation, print its shape with `print(X.shape)`. It's the reflex that prevents 90% of bugs.
Reshaping images
MNIST (1 channel, stored flat)
```python
X = df.drop(columns='label').to_numpy()     # (N, 784)
X = X.reshape(-1, 1, 28, 28) / 255.0        # (N, 1, 28, 28) — already BCHW
X_t = torch.tensor(X, dtype=torch.float32)
```
CIFAR-10 (3 channels, stored as HWC)
```python
X = df.drop(columns='label').to_numpy()  # (N, 3072)
X = X.reshape(-1, 32, 32, 3) / 255.0     # (N, H, W, C) — image convention
X = X.transpose(0, 3, 1, 2)              # (N, C, H, W) — permute for PyTorch
X_t = torch.tensor(X, dtype=torch.float32)
```
The only difference: for RGB, we permute axes from HWC to CHW.
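If you prefer doing the permutation on the tensor side, torch's `permute` does the same job (a minimal variant of the snippet above; `X_flat` stands for the raw `(N, 3072)` array):

```python
X_hwc = torch.tensor(X_flat.reshape(-1, 32, 32, 3) / 255.0,
                     dtype=torch.float32)     # (N, H, W, C)
X_t = X_hwc.permute(0, 3, 1, 2).contiguous()  # (N, C, H, W)
```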
Brick 1: nn.Conv2d
A convolution applies a filter (kernel) of small size, sliding it over the image. At each position, it computes a weighted sum of the covered pixels. Each filter learns to detect a spatial pattern — edge, corner, texture.
```python
nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
```
- `in_channels`: input channels (1 for MNIST, 3 for CIFAR).
- `out_channels`: number of learned filters (= output channels).
- `kernel_size`: filter size (typically 3 or 5).
- `padding=1` with `kernel_size=3` keeps H×W unchanged.
Shape: `(B, C_in, H, W) → (B, C_out, H_out, W_out)`, with `H_out = (H + 2*padding - kernel_size) // stride + 1` (and likewise for `W_out`).
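A quick shape check against the formula above (a sketch on a fake MNIST-sized batch):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
x = torch.randn(64, 1, 28, 28)   # (B, C, H, W)
print(conv(x).shape)             # torch.Size([64, 8, 28, 28]); padding=1 keeps 28×28
```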
Brick 2: nn.ReLU
As in an MLP, we insert a non-linear activation after each convolution. Without it, stacked Conv2d layers collapse into a single equivalent Conv2d. ReLU is the standard activation.
```python
nn.ReLU()
```
Brick 3: nn.MaxPool2d
Pooling reduces the spatial size of the image while keeping the useful information. MaxPool2d(2) halves height and width by keeping the max value of each 2×2 patch.
```python
nn.MaxPool2d(kernel_size=2)
```
Effect: `(B, C, H, W) → (B, C, H/2, W/2)`. Channel count unchanged.
Benefits: fewer parameters in later layers, robustness to small translations, focus on strong patterns.
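Checking the effect on shapes (a sketch, continuing the fake batch above):

```python
pool = nn.MaxPool2d(kernel_size=2)
x = torch.randn(64, 8, 28, 28)
print(pool(x).shape)   # torch.Size([64, 8, 14, 14]); H and W halved, C unchanged
```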
Brick 4: nn.Flatten + nn.Linear
After a few Conv → ReLU → Pool blocks, we flatten the 4D tensor to 2D for a dense classification layer:
```python
nn.Flatten()                  # (B, C, H, W) → (B, C*H*W)
nn.Linear(C*H*W, n_classes)   # logits
```
No activation after the final Linear — CrossEntropyLoss handles softmax internally.
Typical CNN architecture
```python
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # (N, 1, 28, 28) → (N, 8, 28, 28)
    nn.ReLU(),
    nn.MaxPool2d(2),                            # → (N, 8, 14, 14)
    nn.Flatten(),                               # → (N, 8*14*14)
    nn.Linear(8 * 14 * 14, 10),                 # → (N, 10) logits
)
```
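A shape sanity check on a fake MNIST batch:

```python
x = torch.randn(64, 1, 28, 28)
print(model(x).shape)   # torch.Size([64, 10]); one logit per class
```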
Stacking multiple blocks
For a more powerful model, stack Conv → ReLU → Pool blocks. Rule: a layer's `out_channels` becomes the next layer's `in_channels`.
```python
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # → 16×16
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # → 8×8
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # → 4×4
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 10),
)
```
Channel count typically grows (16 → 32 → 64) — we gain expressivity at the cost of resolution.
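The same sanity check, this time with a CIFAR-10-sized batch for the stack above:

```python
x = torch.randn(64, 3, 32, 32)
print(model(x).shape)   # torch.Size([64, 10])
```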
nn.Module syntax
For richer architectures, define a class inheriting from nn.Module:
```python
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))  # (N, 1, 28, 28) → (N, 16, 14, 14)
        x = self.pool(self.relu(self.conv2(x)))  # → (N, 32, 7, 7)
        x = torch.flatten(x, start_dim=1)        # → (N, 32*7*7)
        x = self.fc(x)                           # → (N, 10) logits
        return x
```
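Instantiation and the forward pass work exactly like with nn.Sequential:

```python
model = SimpleCNN()
x = torch.randn(64, 1, 28, 28)
print(model(x).shape)   # torch.Size([64, 10])
```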
This is the standard pattern in practice. It allows branching, residual connections, layer sharing, etc.
Using the GPU
PyTorch never moves tensors or models between devices automatically. Everything must be explicit.
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```
In the training loop, move each batch to the device as it is consumed:
```python
for Xb, yb in train_loader:
    Xb = Xb.to(device)
    yb = yb.to(device)

    optimizer.zero_grad()
    logits = model(Xb)
    loss = criterion(logits, yb)
    loss.backward()
    optimizer.step()
```
At the end, to compute metrics with scikit-learn or matplotlib, bring tensors back to CPU:
```python
y_hat_np = y_hat.cpu().numpy()
```
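For example, computing validation accuracy with scikit-learn (a sketch assuming `logits` and `yb` come from an evaluation batch like the one in the loop above):

```python
from sklearn.metrics import accuracy_score

y_hat = logits.argmax(dim=1)   # predicted classes, still on the GPU
acc = accuracy_score(yb.cpu().numpy(), y_hat.cpu().numpy())
print(acc)
```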