Deep Learning 3 — Convolutional networks (1/3)
:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →
French and Chinese versions available from the home page.
:::
First contact with convolutional neural networks (CNNs). The main conceptual challenge of this chapter: PyTorch's BCHW tensor convention. Once that rule is well understood, the rest follows naturally.
Why this chapter?
You'll learn:

- why a CNN rather than an MLP for images;
- the `(B, C, H, W)` pivot rule PyTorch expects for images;
- the bricks `Conv2d`, `MaxPool2d`, `Flatten`;
- how to stack several conv layers;
- the `nn.Module` syntax (the professional way to write a model);
- GPU usage with `.to(device)`.
Why a CNN?
An MLP treats an image as a 1D vector (flattened into H×W×C values). Three problems:
- Loss of spatial proximity: for the MLP, pixel (i, j) and pixel (i, j+1) are as foreign as pixel (0, 0) and pixel (27, 27).
- Huge parameter count: for a 224×224 RGB image, ~150,000 input features. With 100 hidden neurons, ~15 million weights for the first layer alone.
- No translation invariance: if the object shifts by a few pixels, the network no longer recognises it.
The CNN solves all three by exploiting two ideas:
- Local connectivity: a neuron only looks at a small neighbourhood (3×3 pixels), not the whole image.
- Weight sharing: the same 3×3 filter is applied at all positions. Far fewer parameters (see the comparison below), plus free translation invariance.
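To make the parameter-count argument concrete, here is a minimal sketch comparing the first dense layer of an MLP with a convolutional layer (numbers match the 224×224 RGB example above; the choice of 100 filters is illustrative):

```python
import torch.nn as nn

# First dense layer of an MLP on a flattened 224×224 RGB image
fc = nn.Linear(224 * 224 * 3, 100)
print(sum(p.numel() for p in fc.parameters()))    # 15052900 parameters

# One hundred 3×3 filters over 3 input channels, shared across all positions
conv = nn.Conv2d(3, 100, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))  # 2800 parameters
```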
The golden rule: (B, C, H, W)
PyTorch always expects a 4D tensor `(batch, channels, height, width)` for convolutions. Period.
Dimension breakdown:
| Dim | Meaning | MNIST | CIFAR-10 |
|---|---|---|---|
| B | number of images processed in parallel | 64 | 64 |
| C | number of channels | 1 (grey) | 3 (RGB) |
| H | height in pixels | 28 | 32 |
| W | width in pixels | 28 | 32 |
This convention is called BCHW (or NCHW). PyTorch enforces it for GPU performance reasons.
:::warning matplotlib uses a different convention
matplotlib (and numpy in general) expects images in HWC (height, width, channels). This difference will force a few permutations for RGB images.
:::
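For example, displaying a single image stored in CHW (a minimal sketch with a random fake image; `plt.imshow` expects HWC):

```python
import matplotlib.pyplot as plt
import torch

img = torch.rand(3, 32, 32)        # a fake RGB image in CHW
plt.imshow(img.permute(1, 2, 0))   # CHW → HWC, the convention matplotlib expects
plt.show()
```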
For every image-tensor manipulation, print its shape with `print(X.shape)`. It's the reflex that prevents 90% of bugs.
Reshaping images
MNIST (1 channel, stored flat)
```python
X = df.drop(columns='label').to_numpy()     # (N, 784)
X = X.reshape(-1, 1, 28, 28) / 255.0        # (N, 1, 28, 28) — already BCHW
X_t = torch.tensor(X, dtype=torch.float32)
```
CIFAR-10 (3 channels, stored as HWC)
```python
X = df.drop(columns='label').to_numpy()  # (N, 3072)
X = X.reshape(-1, 32, 32, 3) / 255.0     # (N, H, W, C) — image convention
X = X.transpose(0, 3, 1, 2)              # (N, C, H, W) — permute for PyTorch
X_t = torch.tensor(X, dtype=torch.float32)
```
The only difference: for RGB, we permute axes from HWC to CHW.
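If you prefer doing the permutation on the tensor side, torch's `permute` does the same job (a minimal variant of the snippet above; `X_flat` stands for the raw `(N, 3072)` array):

```python
X_hwc = torch.tensor(X_flat.reshape(-1, 32, 32, 3) / 255.0,
                     dtype=torch.float32)     # (N, H, W, C)
X_t = X_hwc.permute(0, 3, 1, 2).contiguous()  # (N, C, H, W)
```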
Brick 1: nn.Conv2d
A convolution applies a filter (kernel) of small size, sliding it over the image. At each position, it computes a weighted sum of the covered pixels. Each filter learns to detect a spatial pattern — edge, corner, texture.
```python
nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
```
- `in_channels`: input channels (1 for MNIST, 3 for CIFAR).
- `out_channels`: number of learned filters (= output channels).
- `kernel_size`: filter size (typically 3 or 5).
- `padding=1` with `kernel_size=3` keeps H×W unchanged.
Shape: `(B, C_in, H, W) → (B, C_out, H_out, W_out)`, with `H_out = (H + 2*padding - kernel_size) // stride + 1` (and likewise for `W_out`).
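A quick shape check against the formula above (a sketch on a fake MNIST-sized batch):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
x = torch.randn(64, 1, 28, 28)   # (B, C, H, W)
print(conv(x).shape)             # torch.Size([64, 8, 28, 28]); padding=1 keeps 28×28
```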
Brick 2: nn.ReLU
As in an MLP, we insert a non-linear activation after each convolution. Without it, stacked Conv2d layers collapse into a single equivalent Conv2d. ReLU is the standard activation.
```python
nn.ReLU()
```
Brick 3: nn.MaxPool2d
Pooling reduces the spatial size of the image while keeping the useful information. MaxPool2d(2) halves height and width by keeping the max value of each 2×2 patch.
```python
nn.MaxPool2d(kernel_size=2)
```
Effect: `(B, C, H, W) → (B, C, H/2, W/2)`. Channel count unchanged.
Benefits: fewer parameters in later layers, robustness to small translations, focus on strong patterns.
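Checking the effect on shapes (a sketch, continuing the fake batch above):

```python
pool = nn.MaxPool2d(kernel_size=2)
x = torch.randn(64, 8, 28, 28)
print(pool(x).shape)   # torch.Size([64, 8, 14, 14]); H and W halved, C unchanged
```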
Brick 4: nn.Flatten + nn.Linear
After a few Conv → ReLU → Pool blocks, we flatten the 4D tensor to 2D for a dense classification layer:
```python
nn.Flatten()                  # (B, C, H, W) → (B, C*H*W)
nn.Linear(C*H*W, n_classes)   # logits
```
No activation after the final Linear — CrossEntropyLoss handles softmax internally.
Typical CNN architecture
```python
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # (N, 1, 28, 28) → (N, 8, 28, 28)
    nn.ReLU(),
    nn.MaxPool2d(2),                            # → (N, 8, 14, 14)
    nn.Flatten(),                               # → (N, 8*14*14)
    nn.Linear(8 * 14 * 14, 10),                 # → (N, 10) logits
)
```
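A shape sanity check on a fake MNIST batch:

```python
x = torch.randn(64, 1, 28, 28)
print(model(x).shape)   # torch.Size([64, 10]); one logit per class
```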
Stacking multiple blocks
For a more powerful model, stack Conv → ReLU → Pool blocks. Rule: a layer's `out_channels` becomes the next layer's `in_channels`.
```python
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # → 16×16
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # → 8×8
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # → 4×4
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 10),
)
```
Channel count typically grows (16 → 32 → 64) — we gain expressivity at the cost of resolution.
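The same sanity check, this time with a CIFAR-10-sized batch for the stack above:

```python
x = torch.randn(64, 3, 32, 32)
print(model(x).shape)   # torch.Size([64, 10])
```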
nn.Module syntax
For richer architectures, define a class inheriting from nn.Module:
```python
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))  # (N, 1, 28, 28) → (N, 16, 14, 14)
        x = self.pool(self.relu(self.conv2(x)))  # → (N, 32, 7, 7)
        x = torch.flatten(x, start_dim=1)        # → (N, 32*7*7)
        x = self.fc(x)                           # → (N, 10) logits
        return x
```
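Instantiation and the forward pass work exactly like with nn.Sequential:

```python
model = SimpleCNN()
x = torch.randn(64, 1, 28, 28)
print(model(x).shape)   # torch.Size([64, 10])
```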
This is the standard pattern in practice. It allows branching, residual connections, layer sharing, etc.
Using the GPU
PyTorch never moves tensors or models between devices automatically. Everything must be explicit.
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```
In the training loop, move each batch to the device as it is consumed:
```python
for Xb, yb in train_loader:
    Xb = Xb.to(device)
    yb = yb.to(device)

    optimizer.zero_grad()
    logits = model(Xb)
    loss = criterion(logits, yb)
    loss.backward()
    optimizer.step()
```
At the end, to compute metrics with scikit-learn or matplotlib, bring tensors back to CPU:
```python
y_hat_np = y_hat.cpu().numpy()
```
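For example, computing validation accuracy with scikit-learn (a sketch assuming `logits` and `yb` come from an evaluation batch like the one in the loop above):

```python
from sklearn.metrics import accuracy_score

y_hat = logits.argmax(dim=1)   # predicted classes, still on the GPU
acc = accuracy_score(yb.cpu().numpy(), y_hat.cpu().numpy())
print(acc)
```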