Deep Learning 4 — Convolutional networks (2/3)
:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →

French and Chinese versions available from the home page.
:::
In the previous chapter, we trained CNNs on CSV-stored datasets (MNIST, CIFAR-10). That was practical to start with, but in real life images almost always come from PNG/JPG files. We also cover two essential techniques for stabilising and regularising deep CNNs: Batch Normalization and Dropout.
Why this chapter?
You'll learn:

- to load images from disk with `ImageFolder`;
- to speed up the data pipeline (`num_workers`, `pin_memory`);
- about Batch Normalization: why and where to place it;
- about Dropout to reduce overfitting.
ImageFolder: organisation by folders
PyTorch expects a very simple organisation: one folder per class.
```
dataset/
├── train/
│   ├── 0/
│   │   ├── img_001.png
│   │   └── img_002.png
│   ├── 1/
│   └── ...
└── test/
    ├── 0/
    └── ...
```
The subfolder name acts as the label. torchvision.datasets.ImageFolder does all the work.
```python
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.ToTensor()  # PIL [0,255] → tensor [0,1] (C, H, W)
])

train_dataset = datasets.ImageFolder(root='dataset/train', transform=transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
```
The magic of ToTensor()
`ToTensor()` does three things at once:

- converts the PIL Image to a PyTorch tensor;
- divides by 255 → values in [0, 1];
- reorders to CHW from HWC.

Practical consequence: with `ImageFolder` + `ToTensor()`, you never need reshape or transpose again. Batches arrive directly in (B, C, H, W) format.
| Source | Reshape needed? |
|---|---|
| CSV (chapter 3) | yes |
| ImageFolder (here) | no, already BCHW thanks to ToTensor() |
Speeding up loading
When training "feels slow" on GPU, the cause is rarely the model. It's usually the data pipeline. Three levers:
num_workers
By default, num_workers=0: everything happens in the main process. With num_workers > 0, separate processes prepare batches in parallel.
```python
DataLoader(train_dataset, batch_size=256, shuffle=True, num_workers=4)
```
Usual value: 2-4 on Kaggle, 8 on a beefy machine.
pin_memory
With pin_memory=True, tensors are allocated in pinned (page-locked) memory, which speeds up transfer to GPU.
```python
DataLoader(..., pin_memory=True)

for Xb, yb in train_loader:
    Xb = Xb.to(device, non_blocking=True)
```
persistent_workers
Avoids recreating workers at each epoch. Useful when epochs are short.
```python
DataLoader(..., num_workers=4, persistent_workers=True)
```
Recap
```python
DataLoader(
    train_dataset,
    batch_size=128,
    shuffle=True,
    num_workers=4,
    persistent_workers=True,
    pin_memory=True,
)
```
The goal: the GPU should never wait for a batch.
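To convince yourself that workers actually overlap loading with consumption, here's a small self-contained experiment (not from the notebook) that simulates an expensive `__getitem__` and compares worker counts. Exact timings depend entirely on the machine:

```python
import time
import torch
from torch.utils.data import Dataset, DataLoader

class SlowDataset(Dataset):
    """Simulates an expensive __getitem__ (e.g. JPEG decode from disk)."""
    def __len__(self):
        return 64

    def __getitem__(self, i):
        time.sleep(0.005)  # pretend decoding takes ~5 ms per image
        return torch.randn(3, 28, 28), 0

for workers in (0, 2):
    loader = DataLoader(SlowDataset(), batch_size=16, num_workers=workers)
    start = time.time()
    n_batches = sum(1 for _ in loader)  # drain the loader
    print(f"num_workers={workers}: {n_batches} batches in {time.time() - start:.2f}s")
```

With `num_workers=0` the main process pays every 5 ms itself; with workers, that cost happens in parallel in the background.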
Batch Normalization
The problem: at each gradient step, weights change, so the activation distributions feeding the next layers also shift. The model has to constantly readjust — slow and unstable training.
The solution: normalise activations per batch. For each channel, recentre (mean ≈ 0) and rescale (variance ≈ 1), then apply a learned affine transformation (two parameters γ and β per channel).
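A quick sketch to make this concrete: in train mode, `nn.BatchNorm2d` matches a manual per-channel normalisation over the (batch, height, width) dimensions. At initialisation γ = 1 and β = 0, so the affine step is the identity:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4, 5, 5)  # batch of 8, 4 channels

bn = nn.BatchNorm2d(4)       # gamma=1, beta=0 at init
bn.train()
y = bn(x)

# Manual per-channel normalisation over (batch, height, width)
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(y, manual, atol=1e-5))  # True
```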
Benefits
- much more stable training;
- faster convergence (often allows doubling or tripling the `lr`);
- less sensitivity to initialisation;
- mild regularisation effect.
Where to place BN
Standard placement: after the convolution, before the activation.
Conv2d → BatchNorm2d → ReLU
```python
self.conv = nn.Conv2d(32, 64, kernel_size=3, padding=1)
self.bn = nn.BatchNorm2d(64)  # 64 channels out of the conv
self.relu = nn.ReLU()
```
PyTorch automatically handles the dual mode:

- in `model.train()`: statistics computed on the current batch;
- in `model.eval()`: averaged statistics accumulated during training.
:::warning Always call eval() at evaluation
Without model.eval(), the model would keep using current-batch stats — changing predictions. Form the habit.
:::
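A small sketch to verify the habit matters: after a few training batches on shifted data, the same input gives different outputs in the two modes, because `eval()` switches to the accumulated running statistics:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)

# A few "training" batches update the running statistics
bn.train()
for _ in range(10):
    bn(torch.randn(16, 3, 8, 8) * 2 + 1)  # data with mean≈1, std≈2

x = torch.randn(4, 3, 8, 8)
bn.eval()
y_eval = bn(x)    # uses accumulated running mean/var
bn.train()
y_train = bn(x)   # uses this batch's own statistics

print(torch.allclose(y_eval, y_train))  # False: the two modes really differ
```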
Dropout
The problem: a deep network may overfit by leaning too heavily on certain neuron combinations, creating fragile pathways.
The solution: during training, randomly disable a fraction p of neurons at each step (typically p = 0.5 for dense layers, a smaller value for conv layers). The network learns not to depend on any single neuron.
At evaluation, dropout is automatically disabled by model.eval().
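A minimal sketch of both behaviours: in train mode roughly a fraction p of values are zeroed and the survivors are scaled by 1/(1-p) to keep the expected activation unchanged; in eval mode dropout is the identity:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
y = drop(x)
print((y == 0).float().mean())  # ≈ 0.5: about half the values zeroed
print(y.max())                  # tensor(2.): survivors scaled by 1/(1-p) = 2

drop.eval()
print(torch.equal(drop(x), x))  # True: dropout is a no-op at eval
```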
Where to place Dropout
Mainly in dense layers (Linear) at the end of the network.
Linear → ReLU → Dropout
```python
self.fc = nn.Linear(512, 256)
self.dropout = nn.Dropout(p=0.5)
```
More rarely after conv layers (BN already plays a regularising role there).
BN, Dropout, or both?
| Technique | When to use |
|---|---|
| BatchNorm only | Modern standard in CNNs. Sufficient in many cases. |
| Dropout only | When BatchNorm causes problems (small batches, RNNs, some transformers). |
| Both | Compatible. BN in the conv part, Dropout in the dense part. |
These two techniques are as essential as ReLU in the deep learning toolkit.
Typical modern CNN architecture
```python
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x
```
This is the skeleton found in most modern "homemade" architectures.
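As a sanity check of the `128 * 4 * 4` flatten size, one can trace a CIFAR-10-sized input through the same feature stack: each `MaxPool2d(2)` halves the spatial size, so 32 → 16 → 8 → 4:

```python
import torch
import torch.nn as nn

# Same feature extractor as the class above, rebuilt standalone
features = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),
)

x = torch.randn(2, 3, 32, 32)  # batch of 2 CIFAR-10-sized images
out = features(x)
print(out.shape)  # torch.Size([2, 128, 4, 4]) → Flatten yields 128 * 4 * 4 = 2048
```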