
Deep Learning 3 — Convolutional networks (1/3)

:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →

French and Chinese versions available from the home page.
:::

First contact with convolutional neural networks (CNNs). The main conceptual challenge of this chapter: PyTorch's BCHW tensor convention. Once that rule is well understood, the rest follows naturally.

Why this chapter?

You'll learn:

  • why a CNN rather than an MLP for images;
  • the (B, C, H, W) pivot rule PyTorch expects for images;
  • the bricks Conv2d, MaxPool2d, Flatten;
  • how to stack several conv layers;
  • the nn.Module syntax (the professional way to write a model);
  • GPU usage with .to(device).

Why a CNN?

An MLP treats an image as a 1D vector (flattening $(28, 28)$ into $(784,)$). Three problems:

  1. Loss of spatial proximity: for the MLP, pixel $(1,1)$ and pixel $(1,2)$ are as foreign to each other as pixel $(1,1)$ and pixel $(28,28)$.
  2. Huge parameter count: a 224×224 RGB image has ~150,000 input features. With 100 hidden neurons, that is ~15 million weights for the first layer alone (see the parameter-count sketch below).
  3. No translation invariance: if the object shifts by a few pixels, the network no longer recognises it.

The CNN solves all three by exploiting two ideas:

  • Local connectivity: a neuron only looks at a small neighbourhood (3×3 pixels), not the whole image.
  • Weight sharing: the same 3×3 filter is applied at all positions. Far fewer parameters, plus free translation invariance.
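
To make the parameter-count point concrete, a minimal sketch (the 224×224 RGB input and 100 hidden neurons come from the list above; giving the conv layer 100 filters is purely for comparison):

import torch.nn as nn

# First MLP layer on a flattened 224×224 RGB image, 100 hidden neurons
mlp_layer = nn.Linear(224 * 224 * 3, 100)
# One conv layer with 100 shared 3×3 filters on the same image
conv_layer = nn.Conv2d(in_channels=3, out_channels=100, kernel_size=3)

n_mlp = sum(p.numel() for p in mlp_layer.parameters())
n_conv = sum(p.numel() for p in conv_layer.parameters())
print(f"Linear: {n_mlp:,} parameters")   # 15,052,900
print(f"Conv2d: {n_conv:,} parameters")  # 2,800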

The golden rule: (B, C, H, W)

PyTorch always expects a 4D tensor (batch, channels, height, width) for convolutions. Period.

Dimension breakdown:

| Dim | Meaning | MNIST | CIFAR-10 |
| --- | --- | --- | --- |
| B | number of images processed in parallel | 64 | 64 |
| C | number of channels | 1 (grey) | 3 (RGB) |
| H | height in pixels | 28 | 32 |
| W | width in pixels | 28 | 32 |

This convention is called BCHW (or NCHW). PyTorch enforces it for GPU performance reasons.
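
A common consequence: a single image of shape (C, H, W) needs an explicit batch dimension before it can go through a convolution. A quick sketch:

import torch

img = torch.rand(1, 28, 28)    # one MNIST-like image, (C, H, W)
batch = img.unsqueeze(0)       # (1, 1, 28, 28): now BCHW
print(batch.shape)             # torch.Size([1, 1, 28, 28])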

:::warning matplotlib uses a different convention
matplotlib (and numpy in general) expects images in HWC (height, width, channels). This difference will force a few permutations for RGB images.
:::
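
In practice this means a .permute before plotting. A minimal sketch with a random RGB image:

import matplotlib.pyplot as plt
import torch

img = torch.rand(3, 32, 32)       # CHW, as PyTorch stores it
plt.imshow(img.permute(1, 2, 0))  # CHW → HWC, as matplotlib expects
plt.show()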

For every image-tensor manipulation, print its shape with print(X.shape). It's the reflex that prevents 90% of bugs.

Reshaping images

MNIST (1 channel, stored flat)

X = df.drop(columns='label').to_numpy() # (N, 784)
X = X.reshape(-1, 1, 28, 28) / 255.0 # (N, 1, 28, 28) — already BCHW
X_t = torch.tensor(X, dtype=torch.float32)

CIFAR-10 (3 channels, stored as HWC)

X = df.drop(columns='label').to_numpy() # (N, 3072)
X = X.reshape(-1, 32, 32, 3) / 255.0 # (N, H, W, C) — image convention
X = X.transpose(0, 3, 1, 2) # (N, C, H, W) — permute for PyTorch
X_t = torch.tensor(X, dtype=torch.float32)

The only difference: for RGB, we permute axes from HWC to CHW.
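
If the data is already a torch tensor, the same reordering is written with .permute; a sketch equivalent to the numpy transpose above (X_hwc stands for the (N, 32, 32, 3) array before the transpose):

X_t = torch.tensor(X_hwc, dtype=torch.float32)  # (N, 32, 32, 3), HWC
X_t = X_t.permute(0, 3, 1, 2)                   # (N, 3, 32, 32), BCHW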

Brick 1: nn.Conv2d

A convolution applies a filter (kernel) of small size, sliding it over the image. At each position, it computes a weighted sum of the covered pixels. Each filter learns to detect a spatial pattern — edge, corner, texture.

nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
  • in_channels: input channels (1 for MNIST, 3 for CIFAR).
  • out_channels: number of learned filters (= output channels).
  • kernel_size: filter size (typically 3 or 5).
  • padding=1 with kernel_size=3 keeps H×W unchanged.

Shape: $(B, C_\text{in}, H, W) \to (B, C_\text{out}, H_\text{out}, W_\text{out})$.
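
The spatial output size follows $H_\text{out} = \lfloor (H + 2p - k) / s \rfloor + 1$ (with $p$ = padding, $k$ = kernel_size, $s$ = stride), which is why padding=1 with kernel_size=3 leaves H×W unchanged. A quick shape check:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
x = torch.randn(64, 1, 28, 28)   # (B, C_in, H, W)
print(conv(x).shape)             # torch.Size([64, 8, 28, 28])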

Brick 2: nn.ReLU

As in an MLP, we insert a non-linear activation after each convolution. Without it, stacked Conv2d layers collapse into a single equivalent convolution. ReLU is the standard activation.

nn.ReLU()

Brick 3: nn.MaxPool2d

Pooling reduces the spatial size of the image while keeping the useful information. MaxPool2d(2) halves height and width by keeping the max value of each 2×2 patch.

nn.MaxPool2d(kernel_size=2)

Effect: $(B, C, H, W) \to (B, C, H/2, W/2)$. Channel count unchanged.
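
A quick check of that effect:

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)
x = torch.randn(64, 8, 28, 28)
print(pool(x).shape)             # torch.Size([64, 8, 14, 14])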

Benefits: fewer parameters in later layers, robustness to small translations, focus on strong patterns.

Brick 4: nn.Flatten + nn.Linear

After a few Conv → ReLU → Pool blocks, we flatten the 4D tensor to 2D for a dense classification layer:

nn.Flatten() # (B, C, H, W) → (B, C*H*W)
nn.Linear(C*H*W, n_classes) # logits

No activation after the final Linear: CrossEntropyLoss handles the softmax internally.

Typical CNN architecture

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # (N, 1, 28, 28) → (N, 8, 28, 28)
    nn.ReLU(),
    nn.MaxPool2d(2),                            # → (N, 8, 14, 14)
    nn.Flatten(),                               # → (N, 8*14*14)
    nn.Linear(8 * 14 * 14, 10),                 # → (N, 10) logits
)
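
A sanity check with a random MNIST-sized batch (a sketch reusing the model above):

x = torch.randn(64, 1, 28, 28)
print(model(x).shape)            # torch.Size([64, 10])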

Stacking multiple blocks

For a more powerful model, stack Conv → ReLU → Pool blocks. Rule: a layer's out_channels becomes the next layer's in_channels.

nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # → 16×16
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # → 8×8
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # → 4×4
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 10),
)

Channel count typically grows (16 → 32 → 64) — we gain expressivity at the cost of resolution.
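
Rather than hand-computing the 64 * 4 * 4 fed to the final Linear, the size can be inferred with a dummy forward pass through the convolutional part. A sketch (splitting the conv blocks into a separate features module is an illustrative choice, not part of the model above):

import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
with torch.no_grad():
    n_features = features(torch.zeros(1, 3, 32, 32)).numel()
print(n_features)                # 1024 = 64 * 4 * 4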

nn.Module syntax

For richer architectures, define a class inheriting from nn.Module:

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))   # (B, 1, 28, 28) → (B, 16, 14, 14)
        x = self.pool(self.relu(self.conv2(x)))   # → (B, 32, 7, 7)
        x = torch.flatten(x, start_dim=1)         # → (B, 32*7*7)
        x = self.fc(x)                            # → (B, 10) logits
        return x

This is the standard pattern in practice. It allows branching, residual connections, layer sharing, etc.
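
Usage is identical to nn.Sequential; calling the model invokes forward under the hood:

model = SimpleCNN()
x = torch.randn(64, 1, 28, 28)
logits = model(x)                # equivalent to model.forward(x)
print(logits.shape)              # torch.Size([64, 10])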

Using the GPU

PyTorch never moves tensors between devices automatically: every transfer must be explicit.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
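
To confirm where the model now lives, check the device of any of its parameters:

print(next(model.parameters()).device)   # cuda:0 if a GPU was found, else cpu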

In the training loop, move each batch to the device as you go:

for Xb, yb in train_loader:
    Xb = Xb.to(device)           # batch must live on the same device as the model
    yb = yb.to(device)
    optimizer.zero_grad()
    logits = model(Xb)
    loss = criterion(logits, yb)
    loss.backward()
    optimizer.step()

At the end, to compute metrics with scikit-learn or matplotlib, bring tensors back to CPU:

y_hat_np = y_hat.cpu().numpy()
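
For instance, collecting test-set predictions (a sketch; test_loader is assumed to exist alongside train_loader):

model.eval()
all_preds = []
with torch.no_grad():
    for Xb, yb in test_loader:
        logits = model(Xb.to(device))
        all_preds.append(logits.argmax(dim=1).cpu())   # back to CPU right away
y_hat_np = torch.cat(all_preds).numpy()                # ready for scikit-learn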

Full notebook on Kaggle (forkable) →