Deep Learning 4 — Convolutional networks (2/3)
:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →

French and Chinese versions available from the home page.
:::
In the previous chapter, we trained CNNs on CSV-stored datasets (MNIST, CIFAR-10). That was practical to start with, but in real life images almost always come from PNG/JPG files. We also cover two essential techniques for stabilising and regularising deep CNNs: Batch Normalization and Dropout.
Why this chapter?
You'll learn:

- to load images from disk with `ImageFolder`;
- to speed up the data pipeline (`num_workers`, `pin_memory`);
- about Batch Normalization: why and where to place it;
- about Dropout to reduce overfitting.
ImageFolder: organisation by folders
PyTorch expects a very simple organisation: one folder per class.
```
dataset/
├── train/
│   ├── 0/
│   │   ├── img_001.png
│   │   └── img_002.png
│   ├── 1/
│   └── ...
└── test/
    ├── 0/
    └── ...
```
The subfolder name acts as the label. torchvision.datasets.ImageFolder does all the work.
```python
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.ToTensor()  # PIL [0,255] → tensor [0,1] (C, H, W)
])

train_dataset = datasets.ImageFolder(root='dataset/train', transform=transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
```
The magic of ToTensor()
`ToTensor()` does three things at once:

- converts the PIL Image to a PyTorch tensor;
- divides by 255 → values in [0, 1];
- reorders to CHW from HWC.

Practical consequence: with `ImageFolder` + `ToTensor()`, you never need reshape or transpose again. Batches arrive directly in (B, C, H, W) format.
| Source | Reshape needed? |
|---|---|
| CSV (chapter 3) | yes |
| ImageFolder (here) | no, already BCHW thanks to ToTensor() |
Speeding up loading
When training "feels slow" on GPU, the cause is rarely the model. It's usually the data pipeline. Three levers:
num_workers
By default, num_workers=0: everything happens in the main process. With num_workers > 0, separate processes prepare batches in parallel.
```python
DataLoader(train_dataset, batch_size=256, shuffle=True, num_workers=4)
```
Usual value: 2-4 on Kaggle, 8 on a beefy machine.
pin_memory
With pin_memory=True, tensors are allocated in pinned (page-locked) memory, which speeds up transfer to GPU.
```python
DataLoader(..., pin_memory=True)

for Xb, yb in train_loader:
    Xb = Xb.to(device, non_blocking=True)
```
persistent_workers
Avoids recreating workers at each epoch. Useful when epochs are short.
```python
DataLoader(..., num_workers=4, persistent_workers=True)
```
Recap
```python
DataLoader(
    train_dataset,
    batch_size=128,
    shuffle=True,
    num_workers=4,
    persistent_workers=True,
    pin_memory=True,
)
```
The goal: the GPU should never wait for a batch.
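To convince yourself that workers actually overlap loading with consumption, here's a small self-contained experiment (not from the notebook) that simulates an expensive `__getitem__` and compares worker counts. Exact timings depend entirely on the machine:

```python
import time
import torch
from torch.utils.data import Dataset, DataLoader

class SlowDataset(Dataset):
    """Simulates an expensive __getitem__ (e.g. JPEG decode from disk)."""
    def __len__(self):
        return 64

    def __getitem__(self, i):
        time.sleep(0.005)  # pretend decoding takes ~5 ms per image
        return torch.randn(3, 28, 28), 0

for workers in (0, 2):
    loader = DataLoader(SlowDataset(), batch_size=16, num_workers=workers)
    start = time.time()
    n_batches = sum(1 for _ in loader)  # drain the loader
    print(f"num_workers={workers}: {n_batches} batches in {time.time() - start:.2f}s")
```

With `num_workers=0` the main process pays every 5 ms itself; with workers, that cost happens in parallel in the background.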
Batch Normalization
The problem: at each gradient step, weights change, so the activation distributions feeding the next layers also shift. The model has to constantly readjust — slow and unstable training.
The solution: normalise activations per batch. For each channel, recentre (mean ≈ 0) and rescale (variance ≈ 1), then apply a learned affine transformation (two parameters γ and β per channel).
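A quick sketch to make this concrete: in train mode, `nn.BatchNorm2d` matches a manual per-channel normalisation over the (batch, height, width) dimensions. At initialisation γ = 1 and β = 0, so the affine step is the identity:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4, 5, 5)  # batch of 8, 4 channels

bn = nn.BatchNorm2d(4)       # gamma=1, beta=0 at init
bn.train()
y = bn(x)

# Manual per-channel normalisation over (batch, height, width)
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(y, manual, atol=1e-5))  # True
```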
Benefits
- much more stable training;
- faster convergence (often allows doubling or tripling the `lr`);
- less sensitivity to initialisation;
- mild regularisation effect.
Where to place BN
Standard placement: after the convolution, before the activation.
Conv2d → BatchNorm2d → ReLU
```python
self.conv = nn.Conv2d(32, 64, kernel_size=3, padding=1)
self.bn = nn.BatchNorm2d(64)  # 64 channels out of the conv
self.relu = nn.ReLU()
```
PyTorch automatically handles the dual mode:

- in `model.train()`: statistics computed on the current batch;
- in `model.eval()`: averaged statistics accumulated during training.
:::warning Always call eval() at evaluation
Without model.eval(), the model would keep using current-batch stats — changing predictions. Form the habit.
:::
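A small sketch to verify the habit matters: after a few training batches on shifted data, the same input gives different outputs in the two modes, because `eval()` switches to the accumulated running statistics:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)

# A few "training" batches update the running statistics
bn.train()
for _ in range(10):
    bn(torch.randn(16, 3, 8, 8) * 2 + 1)  # data with mean≈1, std≈2

x = torch.randn(4, 3, 8, 8)
bn.eval()
y_eval = bn(x)    # uses accumulated running mean/var
bn.train()
y_train = bn(x)   # uses this batch's own statistics

print(torch.allclose(y_eval, y_train))  # False: the two modes really differ
```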
Dropout
The problem: a deep network may overfit by leaning too heavily on certain neuron combinations, creating fragile pathways.
The solution: during training, randomly disable a fraction p of neurons at each step (typically p = 0.5 for dense layers, a smaller value for conv layers). The network learns not to depend on any single neuron.
At evaluation, dropout is automatically disabled by model.eval().
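A minimal sketch of both behaviours: in train mode roughly a fraction p of values are zeroed and the survivors are scaled by 1/(1-p) to keep the expected activation unchanged; in eval mode dropout is the identity:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
y = drop(x)
print((y == 0).float().mean())  # ≈ 0.5: about half the values zeroed
print(y.max())                  # tensor(2.): survivors scaled by 1/(1-p) = 2

drop.eval()
print(torch.equal(drop(x), x))  # True: dropout is a no-op at eval
```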
Where to place Dropout
Mainly in dense layers (Linear) at the end of the network.
Linear → ReLU → Dropout
```python
self.fc = nn.Linear(512, 256)
self.dropout = nn.Dropout(p=0.5)
```
More rarely after conv layers (BN already plays a regularising role there).
BN, Dropout, or both?
| Technique | When to use |
|---|---|
| BatchNorm only | Modern standard in CNNs. Sufficient in many cases. |
| Dropout only | When BatchNorm causes problems (small batches, RNNs, some transformers). |
| Both | Compatible. BN in the conv part, Dropout in the dense part. |
These two techniques are as essential as ReLU in the deep learning toolkit.
Typical modern CNN architecture
```python
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x
```
This is the skeleton found in most modern "homemade" architectures.
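As a sanity check of the `128 * 4 * 4` flatten size, one can trace a CIFAR-10-sized input through the same feature stack: each `MaxPool2d(2)` halves the spatial size, so 32 → 16 → 8 → 4:

```python
import torch
import torch.nn as nn

# Same feature extractor as the class above, rebuilt standalone
features = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),
)

x = torch.randn(2, 3, 32, 32)  # batch of 2 CIFAR-10-sized images
out = features(x)
print(out.shape)  # torch.Size([2, 128, 4, 4]) → Flatten yields 128 * 4 * 4 = 2048
```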