Deep learning 4 — Improving a CNN: data pipeline, regularisation and transfer learning

The previous chapter left us with a working object: a small convolutional network capable of classifying MNIST digits with respectable accuracy, by stacking two or three Conv2d — ReLU — MaxPool2d blocks followed by a linear classifier. On MNIST, this recipe is enough. On problems that are barely more demanding — Fashion-MNIST, and especially CIFAR-10 — it begins to run out of breath. Three symptoms appear quickly, and they are the real subjects of this chapter.

The first is a plumbing problem: on a GPU, the network ends up waiting for data. Reading PNG files from disk, decoding them, applying transforms, and finally transferring the batches to GPU memory can take longer than the forward and backward passes themselves. The second is a training stability problem: as the network grows deeper, the activation distributions drift, the learning rate becomes finicky, and convergence slows down. The third is a generalisation problem: the network memorises the training set and loses ground on the test set. In the background, a more fundamental observation looms: on CIFAR-10, training a randomly initialised network from scratch is a wasteful strategy when, for free, we have access to models pre-trained on more than a million ImageNet images.

This chapter introduces, in this order: the mechanics of the PyTorch DataLoader and the levers that accelerate the data pipeline; Batch Normalization as a training stabiliser; Dropout as a regulariser; data augmentation as a structural regulariser; learning rate scheduling; and finally transfer learning — how to recycle a ResNet or a VGG trained on ImageNet for a ten-class classification problem. By the end of the chapter, you should be able to put together a CNN that comfortably exceeds 80% accuracy on CIFAR-10 without heroic effort.

The data pipeline: Dataset, DataLoader, and bottlenecks

When the data no longer fits in memory, or when it arrives as PNG files on disk, we can no longer load a single global tensor before the training loop starts. We need a mechanism that reads each image on demand, applies the preprocessing, and groups examples into mini-batches. PyTorch splits this mechanism into two complementary objects.

Dataset and ImageFolder

A PyTorch Dataset is essentially an object with two methods: __len__() (how many examples?) and __getitem__(i) (what should I return for example $i$?). For images, torchvision ships a ready-to-use implementation called ImageFolder. It assumes a class-per-subfolder layout:

dataset/
├── train/
│   ├── 0/        # all images of class 0
│   ├── 1/        # all images of class 1
│   └── ...
└── test/
    ├── 0/
    ├── 1/
    └── ...

The folder name becomes the label. ImageFolder walks the tree at instantiation, indexes the files, and returns a (PIL_image, int_label) pair on demand. At that stage the images are still in PIL format, with integer values, and not necessarily at the size expected by the network. We therefore plug a transformation pipeline in front of it.

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
])

train_dataset = datasets.ImageFolder(root="dataset/train", transform=transform)
test_dataset  = datasets.ImageFolder(root="dataset/test",  transform=transform)

ToTensor() does two essential things: it converts the PIL image (H, W, C) into a PyTorch tensor (C, H, W), and it divides by 255 so that values land in $[0, 1]$. If the image is already grayscale, the channel dimension is added automatically and the output has shape (1, H, W). Compared with the CSV-based MNIST loader of the previous chapter, no manual reshape and no unsqueeze(1) are needed: the transformation pipeline takes care of the channel dimension on its own.
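
To make the two-method contract of a Dataset concrete, here is a minimal sketch of a custom Dataset that also reproduces by hand the two things ToTensor() does (axis reordering and division by 255). The class name ArrayImageDataset and the random uint8 images are illustrative assumptions; they are not part of torchvision.

```python
import torch
from torch.utils.data import Dataset


class ArrayImageDataset(Dataset):
    """Hypothetical in-memory dataset: uint8 images stored as (N, H, W, C)."""

    def __init__(self, images_u8, labels):
        self.images_u8 = images_u8
        self.labels = labels

    def __len__(self):
        # How many examples?
        return len(self.labels)

    def __getitem__(self, i):
        img = self.images_u8[i]                  # (H, W, C), values 0..255
        x = img.permute(2, 0, 1).float() / 255   # (C, H, W), values in [0, 1]
        return x, int(self.labels[i])


# Synthetic stand-in data: 5 grayscale images of size 28x28.
images = torch.randint(0, 256, (5, 28, 28, 1), dtype=torch.uint8)
labels = torch.tensor([0, 1, 2, 3, 4])
dataset = ArrayImageDataset(images, labels)

x0, y0 = dataset[0]
print(len(dataset), x0.shape, y0)
```

Any object of this kind can be handed to a DataLoader exactly like an ImageFolder.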

DataLoader and mini-batches

A Dataset does not know how to group examples into batches. That is the job of the DataLoader, which takes a Dataset and orchestrates four things: building mini-batches, shuffling the data between epochs, parallelising the loading, and managing the copy strategy to the GPU. The minimal version is unsurprising:

train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True)
test_loader  = DataLoader(test_dataset,  batch_size=256, shuffle=False)

Iterating over train_loader yields batches (Xb, yb) of shape ((B, C, H, W), (B,)), where $B$ is the batch size. It is good practice to verify these shapes once with next(iter(train_loader)) before plugging the loader into a training loop — many bugs in image pipelines (a misplaced Grayscale, a forgotten ToTensor, an off-by-one resize) are caught in seconds by printing the shape of the first batch.
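
The shape check described above can be sketched as follows. The tensors are synthetic stand-ins (an assumption made so that the snippet runs without any image folder); with real data, the same three print calls apply to train_loader.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for an image dataset: 1000 grayscale 28x28 images, 10 classes.
X = torch.rand(1000, 1, 28, 28)
y = torch.randint(0, 10, (1000,))
loader = DataLoader(TensorDataset(X, y), batch_size=256, shuffle=True)

Xb, yb = next(iter(loader))
print(Xb.shape, Xb.dtype)                 # expected layout: (B, C, H, W), float32
print(yb.shape, yb.dtype)                 # expected layout: (B,), int64
print(Xb.min().item(), Xb.max().item())   # values should lie in [0, 1]
```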

Why the GPU waits

Once the model lives on the GPU, the bottleneck is rarely arithmetic. A modern GPU can chew through a CIFAR-10 batch in milliseconds; what kills throughput is feeding it. Reading a PNG file from disk, decoding it, applying transforms (resizing, colour conversion, augmentation), and finally copying the batch to GPU memory all happen on the CPU. If num_workers=0, the default, all of this happens in the main Python process, sequentially with the training step. The GPU computes a batch, then waits idle while the CPU prepares the next one. With nvidia-smi you would see GPU utilisation hovering at 30–50% — and that is the symptom of a starved pipeline.

The first lever is num_workers: spawning $k$ parallel processes that prepare batches in advance.

train_loader = DataLoader(
    train_dataset,
    batch_size=256,
    shuffle=True,
    num_workers=4,
)

A reasonable starting point is 2 to 4. Going higher trades CPU time and memory against latency, and beyond a certain point the contention costs more than the parallelism. The pragmatic rule is to time one epoch with num_workers = 0, 2, 4, 8 and keep the value that minimises wall time on your machine.
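
A possible way to run this benchmark is sketched below. The helper name time_one_pass and the synthetic dataset are assumptions for illustration; the loop body only counts images, so the figure measures the loading side alone, not a full training epoch.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_one_pass(dataset, num_workers, batch_size=256):
    """Return (seconds, images seen) for one full iteration over the dataset."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        num_workers=num_workers)
    start = time.perf_counter()
    n_seen = 0
    for Xb, yb in loader:
        n_seen += Xb.shape[0]      # stand-in for the real training step
    elapsed = time.perf_counter() - start
    return elapsed, n_seen


# Synthetic stand-in dataset; replace with the real ImageFolder dataset.
dataset = TensorDataset(torch.rand(2048, 3, 32, 32),
                        torch.randint(0, 10, (2048,)))
elapsed, n_seen = time_one_pass(dataset, num_workers=0)
print(f"num_workers=0: {elapsed:.3f} s for {n_seen} images")
```

To compare settings, call time_one_pass for each value in (0, 2, 4, 8). On platforms that start worker processes with spawn (Windows, macOS), that loop should sit under an `if __name__ == "__main__":` guard.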

The second lever is pin_memory=True. PyTorch then allocates the prepared batches into pinned (page-locked) memory, which the CUDA driver can transfer to the GPU through DMA without an intermediate copy. The transfer itself becomes faster.

The third lever is non_blocking=True in the .to() call inside the loop. With pinned memory, the copy can overlap with computation:

for Xb, yb in train_loader:
    Xb = Xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    ...

The CPU launches the copy and continues without waiting; by the time the GPU finishes the previous step, the next batch is already on board. Without pin_memory=True, non_blocking=True is essentially a no-op.

A fourth lever, useful when epochs are short (Fashion-MNIST, MNIST), is persistent_workers=True. By default, PyTorch tears down and re-spawns the workers at every epoch; with persistent workers, they stay alive between epochs, saving the spin-up overhead. A typical "fast" configuration thus reads:

train_loader = DataLoader(
    train_dataset,
    batch_size=256,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
    persistent_workers=True,
)

Pitfall — over-tuning the loader. More workers is not always better. On a small dataset whose images already fit in OS page cache, num_workers=0 is sometimes faster than num_workers=8 because the IPC overhead dominates. And pin_memory=True only helps when you are actually transferring batches to a GPU — on CPU-only training it just consumes pinned memory for nothing. Always benchmark before committing.
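
One way to avoid hard-coding these flags is to derive them from the hardware in use. The sketch below assumes a hypothetical helper make_loader and synthetic data. It enables pin_memory only when the target device is CUDA, and persistent_workers only when there are workers to keep alive (DataLoader rejects persistent_workers=True with num_workers=0).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, device, batch_size=256, shuffle=True, num_workers=0):
    """Build a DataLoader whose options match the hardware actually in use."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        num_workers=num_workers,
        # Pinned memory only pays off when batches are copied to a CUDA device.
        pin_memory=(device.type == "cuda"),
        # persistent_workers=True is only valid when num_workers > 0.
        persistent_workers=(num_workers > 0),
    )


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = TensorDataset(torch.rand(512, 3, 32, 32), torch.randint(0, 10, (512,)))
loader = make_loader(dataset, device, num_workers=0)

for Xb, yb in loader:
    # Harmless on CPU; on GPU it allows the copy to overlap with computation.
    Xb = Xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    break
```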

Cost of the transforms

Transforms run on the CPU. Some are cheap (ToTensor, Normalize); others are expensive (Resize to a much larger size, colour conversions, complex random augmentations). Two reflexes help: do not resize images that are already the right size, and do not apply Grayscale to images that already have a single channel. The augmentation pipeline that we will introduce shortly should be added on the training loader only; the test loader keeps a deterministic, minimal pipeline.

Batch Normalization: stabilising training

The deeper a network goes, the more the distributions of intermediate activations drift during training — what Ioffe and Szegedy called internal covariate shift. Each layer is constantly chasing a moving target, and the optimiser becomes very sensitive to the learning rate. Batch Normalization (BN) is the standard counter-measure.

Principle

For each mini-batch, BN normalises the activations of a layer so that they have mean zero and unit variance, then re-scales and re-shifts them with two learned parameters $\gamma$ and $\beta$:

$$\hat{x} = \frac{x - \mu_{\mathrm{batch}}}{\sqrt{\sigma^2_{\mathrm{batch}} + \epsilon}}, \qquad y = \gamma \hat{x} + \beta.$$

The normalisation kills the drift; the affine transform $(\gamma, \beta)$ gives the network back the ability to represent any output distribution it needs. In a CNN, BN is applied per channel: each feature map gets its own pair of statistics, computed over the batch and the two spatial dimensions. The PyTorch class nn.BatchNorm2d(num_features) takes as its argument the number of output channels of the previous convolution.
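
As a sanity check of the formula, the sketch below recomputes the normalisation by hand on a random tensor and compares it with nn.BatchNorm2d in training mode. The tensor sizes are arbitrary assumptions. A freshly created layer has $\gamma = 1$ and $\beta = 0$, so its output should coincide with $\hat{x}$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 4, 4)          # (B, C, H, W)

# Per-channel statistics: reduce over the batch and the spatial dimensions.
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
eps = 1e-5
x_hat = (x - mu) / torch.sqrt(var + eps)

# A freshly created BatchNorm2d has gamma = 1 and beta = 0, so y = x_hat.
bn = nn.BatchNorm2d(3, eps=eps)
bn.train()
y = bn(x)

print(torch.allclose(y, x_hat, atol=1e-5))
```

Note that the batch variance used for normalisation is the biased one (division by the number of elements), hence unbiased=False in the manual computation.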

Where to place it

The textbook placement is after the convolution and before the activation:

Conv2d → BatchNorm2d → ReLU

A typical block thus reads:

self.conv = nn.Conv2d(32, 64, kernel_size=3, padding=1)
self.bn   = nn.BatchNorm2d(64)
self.relu = nn.ReLU()

with the forward:

x = self.relu(self.bn(self.conv(x)))

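The bias of Conv2d becomes redundant when followed by BN (BN subtracts the batch mean, which cancels any constant offset, and has its own learnable shift $\beta$), so practitioners often pass bias=False to the convolution. The effect on accuracy is negligible; the saving is one parameter per output channel, so the win is mostly cosmetic.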
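
The block can be packaged as a small factory function. The name conv_bn_relu is a hypothetical helper, not a PyTorch API, and the channel counts simply reuse the 32 to 64 example above.

```python
import torch
import torch.nn as nn


def conv_bn_relu(in_channels, out_channels):
    """Hypothetical helper: one Conv2d -> BatchNorm2d -> ReLU block."""
    return nn.Sequential(
        # bias=False: the shift is already provided by BN's beta.
        nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(),
    )


block = conv_bn_relu(32, 64)
out = block(torch.randn(4, 32, 16, 16))
n_params = sum(p.numel() for p in block.parameters())
print(out.shape, n_params)
```

With bias=False the block holds the 32 × 64 × 9 convolution weights plus the 2 × 64 BN parameters, and nothing else.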

Train mode vs eval mode

This is the most common gotcha with BN. During training, BN uses the batch statistics $(\mu_{\mathrm{batch}}, \sigma^2_{\mathrm{batch}})$. In parallel, it maintains a running average of these statistics. At evaluation time, BN switches to those running statistics instead, because at inference we typically want a deterministic answer, independent of which other examples happen to share the batch. The switch happens when you call model.eval(); back to training-time behaviour with model.train().

Pitfall — forgetting model.eval(). If you run inference while the model is still in training mode, BN normalises the test batch by its own statistics. With a balanced batch, the answer is roughly correct. With a homogeneous batch (all class 0, for instance), the in-batch statistics are wildly wrong and accuracy collapses. The same trap exists for dropout, which we will see next. Wrap every evaluation block in model.eval() followed by with torch.no_grad():.
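
A minimal evaluation helper following this advice might look like the sketch below. The function name evaluate, the tiny stand-in model and the synthetic data are assumptions chosen only to make the snippet runnable.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, device):
    """Return the accuracy of `model` on `loader`, with BN and dropout in eval mode."""
    model.eval()                       # BN uses running stats, dropout is disabled
    correct, total = 0, 0
    with torch.no_grad():              # no gradient tracking during evaluation
        for Xb, yb in loader:
            Xb, yb = Xb.to(device), yb.to(device)
            pred = model(Xb).argmax(dim=1)
            correct += (pred == yb).sum().item()
            total += yb.shape[0]
    return correct / total


# Tiny stand-in model and synthetic data, only to exercise the function.
device = torch.device("cpu")
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
                      nn.Flatten(), nn.Dropout(0.5), nn.Linear(8 * 8 * 8, 10))
data = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))
acc = evaluate(model, DataLoader(data, batch_size=16), device)
model.train()                          # switch back before resuming training
print(f"accuracy: {acc:.3f}")
```

The helper leaves the model in eval mode, so the caller has to switch back to model.train() before the next training epoch.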

Dropout: regularising the classifier

A network with millions of parameters trained on a few tens of thousands of images will overfit unless something stops it. Dropout is the simplest and most popular tool for this.

Principle

During training, each neuron in a dropout layer is randomly switched off with probability $p$. Each forward pass therefore uses a different random subnetwork; the backward pass sends gradient only through the active neurons. The intuition is that the network cannot afford to depend on a single neuron (that neuron might be off in the next iteration) and is forced to spread the representation across multiple paths. At inference time, dropout is disabled and all neurons are kept, so the expected activation level must be made to match between the two modes. PyTorch uses the "inverted" convention: the surviving activations are scaled up by $1/(1-p)$ during training, and the layer is a plain identity at inference. The scaling is handled automatically; you only have to remember to switch to model.eval().
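
The behaviour of nn.Dropout in the two modes can be observed directly. In PyTorch the layer rescales the surviving activations by $1/(1-p)$ in training mode and acts as the identity in eval mode. The sketch below, on a vector of ones (an arbitrary choice), makes both visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.5
drop = nn.Dropout(p=p)
x = torch.ones(10_000)

drop.train()
y_train = drop(x)     # each entry is either 0 or 1 / (1 - p)

drop.eval()
y_eval = drop(x)      # identity: nothing dropped, nothing rescaled

print(y_train.unique())        # the two values taken in training mode
print(y_train.mean().item())   # close to 1 on average
print(torch.equal(y_eval, x))
```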

Where to place it

Dropout is mostly used in fully connected layers, where overfitting hits hardest. Convolutional layers, with their weight sharing, already act as a regulariser of sorts; dropping neurons there can hurt more than it helps. The standard pattern is:

Linear → ReLU → Dropout

with nn.Dropout(p=0.5) for hidden layers and p=0.2 to 0.3 if you want a lighter touch:

self.fc1     = nn.Linear(64 * 8 * 8, 256)
self.dropout = nn.Dropout(p=0.5)
self.fc2     = nn.Linear(256, num_classes)

with forward:

x = self.relu(self.fc1(x))
x = self.dropout(x)
x = self.fc2(x)

Pitfall — dropout right before the output. Putting dropout immediately before the final Linear of a classifier is fine. Putting it between the final logits and the softmax (or, equivalently, dropping logits) makes no sense — you would be randomly silencing class predictions. Dropout belongs in the hidden representation, not at the output.

A better CNN for CIFAR-10

CIFAR-10 contains 60 000 RGB images of size 32×32 distributed across ten classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), with 50 000 for training and 10 000 for testing. It is a sweet spot: small enough to iterate quickly, but rich enough that a naive CNN will plateau around 70% accuracy and the regularisation tricks of this chapter make a measurable difference.

A bare baseline, with three convolutions and no regularisation, looks like this:

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.conv3 = nn.Conv2d(32, 64, 3, padding=1)
        self.pool  = nn.MaxPool2d(2)
        self.relu  = nn.ReLU()
        self.flatten = nn.Flatten()
        self.fc    = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))   # 32 -> 16
        x = self.relu(self.conv2(x))
        x = self.pool(self.relu(self.conv3(x)))   # 16 -> 8
        return self.fc(self.flatten(x))

Trained with Adam at $10^{-3}$ for ten epochs, this baseline lands somewhere between 65% and 70% test accuracy, with a clear gap between training and test loss — the classic signature of overfitting.
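
The 64 * 8 * 8 input size of the linear layer follows from the usual output-size formula for convolutions and pooling, $\lfloor (n + 2p - k) / s \rfloor + 1$. The small trace below (the helper name conv_out is an assumption) replays the spatial sizes of the baseline layer by layer.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer (floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 32                                        # CIFAR-10 input: 32x32
size = conv_out(size, kernel=3, padding=1)       # conv1 keeps 32
size = conv_out(size, kernel=2, stride=2)        # pool: 32 -> 16
size = conv_out(size, kernel=3, padding=1)       # conv2 keeps 16
size = conv_out(size, kernel=3, padding=1)       # conv3 keeps 16
size = conv_out(size, kernel=2, stride=2)        # pool: 16 -> 8

flat_features = 64 * size * size                 # 64 channels after conv3
print(size, flat_features)
```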

We now apply BN inside the convolutional blocks and Dropout in the classifier:

class SimpleCNN_BN_DO(nn.Module):
    def __init__(self, num_classes=10, dropout_p=0.5):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.bn1   = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.bn2   = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn3   = nn.BatchNorm2d(64)

        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(2)
        self.flatten = nn.Flatten()

        self.fc1 = nn.Linear(64 * 8 * 8, 256)
        self.dropout = nn.Dropout(p=dropout_p)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.pool(self.relu(self.bn1(self.conv1(x))))   # 32 -> 16
        x = self.relu(self.bn2(self.conv2(x)))
        x = self.pool(self.relu(self.bn3(self.conv3(x))))   # 16 -> 8
        x = self.flatten(x)
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

The training loop is unchanged. Two things improve immediately. The training loss decreases more rapidly during the first epochs — BN stabilises the gradients and lets the optimiser take effective steps from the start. And the test accuracy after ten epochs jumps by several points, while the train/test gap narrows — Dropout is doing its job.

To complete the picture, the input transform should normalise the images with the per-channel statistics of CIFAR-10:

transform = transforms.Compose([
    transforms.Resize((32, 32)),   # safeguard only; drop it if images are already 32x32
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.4914, 0.4822, 0.4465),
                         std=(0.2470, 0.2435, 0.2616)),
])

This standardisation is independent from BN: BN normalises intermediate activations across a batch, while transforms.Normalize standardises the inputs using fixed dataset statistics. They complement each other.
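
Per-channel statistics of this kind are obtained by averaging over all training images and all pixels. The sketch below shows the computation on a synthetic tensor (an assumption standing in for the real training set) and checks that the standardised data ends up with mean 0 and standard deviation 1 per channel.

```python
import torch

torch.manual_seed(0)
# Synthetic stand-in for a training set already converted by ToTensor():
# N images, 3 channels, values in [0, 1].
images = torch.rand(500, 3, 32, 32)

# One mean and one standard deviation per channel, over all images and pixels.
mean = images.mean(dim=(0, 2, 3))
std = images.std(dim=(0, 2, 3))

# What Normalize(mean, std) computes for each image: (x - mean) / std.
normalised = (images - mean[None, :, None, None]) / std[None, :, None, None]

print(mean, std)
print(normalised.mean(dim=(0, 2, 3)), normalised.std(dim=(0, 2, 3)))
```

The statistics should be computed on the training set only and then reused unchanged for the test set.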

Data augmentation: regularising through the data

BN and Dropout are model-side regularisers. Data augmentation is data-side: at every epoch, each training image is shown to the network in a slightly different form — flipped horizontally, randomly cropped after a small padding, perhaps with a small colour jitter. The training set effectively becomes infinite, and the model learns invariances that we know to be true (a horse is still a horse if you flip the image left-right) instead of memorising pixel-perfect copies.

In torchvision, augmentation is just additional steps in the transform pipeline of the training loader. The two cheapest and most effective augmentations on CIFAR-10 are random crops with reflective padding and random horizontal flips:

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4, padding_mode="reflect"),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.4914, 0.4822, 0.4465),
                         std=(0.2470, 0.2435, 0.2616)),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.4914, 0.4822, 0.4465),
                         std=(0.2470, 0.2435, 0.2616)),
])

The test transform stays minimal and deterministic — we want to evaluate on the genuine test images, not on augmented variants of them. Adding these two augmentations alone typically buys 4 to 6 points of test accuracy on CIFAR-10 with the same architecture and the same number of epochs.
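
To see what these two augmentations do to a tensor, here is a hand-rolled sketch of the same idea: reflect-pad by 4 pixels, crop a random 32×32 window, and flip with probability 0.5. The function random_crop_flip is an illustrative assumption and not torchvision's implementation; in practice the transforms above are the ones to use.

```python
import torch
import torch.nn.functional as F


def random_crop_flip(img, padding=4, p_flip=0.5):
    """Hand-rolled sketch: reflect-pad, crop back to the original size, maybe flip."""
    c, h, w = img.shape
    # Add a batch dimension so that reflect padding applies to the last two axes.
    padded = F.pad(img.unsqueeze(0), (padding,) * 4, mode="reflect").squeeze(0)
    top = torch.randint(0, 2 * padding + 1, (1,)).item()
    left = torch.randint(0, 2 * padding + 1, (1,)).item()
    out = padded[:, top:top + h, left:left + w]
    if torch.rand(1).item() < p_flip:
        out = torch.flip(out, dims=[2])      # mirror along the width axis
    return out


torch.manual_seed(0)
img = torch.rand(3, 32, 32)
views = [random_crop_flip(img) for _ in range(4)]   # four random views of one image
print([v.shape for v in views])
```

Every call returns a tensor of the original size, shifted by at most 4 pixels in each direction, which is why the network sees a slightly different image at every epoch.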

Other useful augmentations are transforms.ColorJitter(brightness=0.1, contrast=0.1), transforms.RandomRotation(degrees=10), and the more aggressive transforms.RandAugment() introduced in recent versions. The general principle: only augment with transformations that preserve the label. A digit 6 rotated by 180° becomes a 9; a horizontal flip of B is no longer B. On CIFAR-10 a horizontal flip is harmless because both orientations are equally plausible in the natural distribution, but on traffic signs it would be catastrophic.

Pitfall — augmentation on the test set. Applying RandomHorizontalFlip "to be consistent" on the test loader silently injects randomness into your evaluation, making accuracy non-reproducible from one run to the next. Worse, it changes the meaning of the score. Always keep the test pipeline deterministic.

Learning rate scheduling

The learning rate $\eta$ is the most important hyperparameter of training. A fixed value is a compromise: large enough to make progress at the beginning, small enough not to oscillate around the minimum at the end. Scheduling lets us have both — start with a large $\eta$ and decrease it gradually as training progresses.

PyTorch offers a family of schedulers in torch.optim.lr_scheduler. Two stand out for image classification.

StepLR multiplies $\eta$ by a factor gamma every step_size epochs (with gamma=0.1, the learning rate is divided by 10 at each step):

optimizer = torch.optim.SGD(model.parameters(), lr=1e-1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(epochs):
    train_one_epoch(...)
    scheduler.step()

It is simple and widely used in classical ResNet recipes (e.g. drops at epoch 30 and 60 for a 90-epoch training).

CosineAnnealingLR smoothly anneals $\eta$ along a cosine from its initial value down to (almost) zero over a fixed horizon T_max:

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

Cosine annealing has become the default in many modern pipelines. It avoids the abrupt drops of StepLR and tends to give marginally better final accuracy without any extra tuning.
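
Both schedules have a simple closed form, which the pure-Python sketch below evaluates at a few epochs. The function names step_lr and cosine_lr are assumptions; the formulas are the step decay $\eta_0 \gamma^{\lfloor t / s \rfloor}$ and the cosine $\eta_{\min} + \tfrac{1}{2}(\eta_0 - \eta_{\min})(1 + \cos(\pi t / T_{\max}))$.

```python
import math


def step_lr(lr0, epoch, step_size, gamma):
    """Learning rate under step decay: multiplied by gamma every step_size epochs."""
    return lr0 * gamma ** (epoch // step_size)


def cosine_lr(lr0, epoch, t_max, eta_min=0.0):
    """Learning rate under cosine annealing from lr0 down to eta_min."""
    return eta_min + (lr0 - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2


epochs = 30
for epoch in (0, 10, 15, 20, 29):
    print(epoch,
          round(step_lr(0.1, epoch, step_size=10, gamma=0.1), 6),
          round(cosine_lr(0.1, epoch, t_max=epochs), 6))
```

Halfway through the horizon the cosine schedule is at exactly half the initial learning rate, whereas the step schedule has already dropped by a factor of 10.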

The scheduler must be advanced once per epoch, after that epoch's optimizer.step() calls. A common bug is to call scheduler.step() inside the inner batch loop: with step_size or T_max expressed in epochs, the LR then collapses far too quickly.
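
The placement of the two step() calls can be sketched on a toy problem. The linear model and the random data below are stand-ins (assumptions) for the CNN and the CIFAR-10 loader; only the loop structure matters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(20, 2)                      # stand-in for the CNN
X, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
loss_fn = nn.CrossEntropyLoss()

epochs = 5
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

lr_history = []
for epoch in range(epochs):
    lr_history.append(optimizer.param_groups[0]["lr"])   # LR used in this epoch
    for i in range(0, len(X), 64):                       # inner batch loop
        optimizer.zero_grad()
        loss = loss_fn(model(X[i:i + 64]), y[i:i + 64])
        loss.backward()
        optimizer.step()                                 # once per batch
    scheduler.step()                                     # once per epoch

print(lr_history)
```

Logging the learning rate at the start of each epoch, as done here, is a cheap way to confirm that the schedule evolves at the intended pace.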

Transfer learning: recycling a pre-trained model

Training a CNN from scratch on a small dataset is a losing proposition. The first convolutional layers of any well-trained image network learn very generic filters — edges, colour blobs, simple textures — that are essentially the same whatever the task. Why re-learn them on 50 000 CIFAR images when models trained on millions of ImageNet images already know them?

Transfer learning consists of taking a network pre-trained on a large dataset (typically ImageNet, 1.28 million images, 1000 classes) and reusing its learned features on our smaller problem. Two standard variants exist.

Feature extraction (frozen backbone)

We freeze all the parameters of the pre-trained network — they will not be updated by the optimiser — and replace only the final classification head with a fresh Linear layer for our number of classes. Only that head is trained. This is the cheapest variant, and it is often surprisingly effective when the source and target distributions are not too different.

import torchvision.models as models

backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

for p in backbone.parameters():
    p.requires_grad = False

backbone.fc = nn.Linear(512, 10)        # only this layer is trainable

backbone = backbone.to(device)
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)

Note two details. First, backbone.fc is the existing final layer of resnet18; we overwrite it with a new layer whose parameters have requires_grad=True by default, so they will be trained while the rest stays frozen. Second, we pass backbone.fc.parameters() (not backbone.parameters()) to the optimiser: there is no point in handing it frozen parameters, although PyTorch handles that case correctly if you forget.
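
A quick way to verify that the freeze worked is to count trainable parameters. The sketch below uses a hypothetical TinyBackbone as a stand-in for the pre-trained network, so that nothing has to be downloaded; the helper count_parameters is likewise an assumption and works unchanged on a real resnet18.

```python
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pre-trained network with a final `fc` layer."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc = nn.Linear(512, 1000)

    def forward(self, x):
        return self.fc(self.features(x))


def count_parameters(model):
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


backbone = TinyBackbone()
for p in backbone.parameters():
    p.requires_grad = False              # freeze everything
backbone.fc = nn.Linear(512, 10)         # fresh head, trainable by default

trainable, total = count_parameters(backbone)
print(trainable, total)
```

After the swap, the only trainable parameters are the 512 × 10 weights and 10 biases of the new head.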

Because ResNet was trained on 224×224 RGB inputs normalised with ImageNet statistics, the input transform must match:

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),
                         std=(0.229, 0.224, 0.225)),
])

These are the ImageNet statistics, not the CIFAR-10 ones. Mixing them up is a common mistake — a frozen backbone expects the exact input distribution it was trained on.

Fine-tuning (unfrozen backbone)

We unfreeze the entire network, or part of it, and continue training with a smaller learning rate, typically $10^{-4}$ or $10^{-5}$. The pre-trained features are gently adapted to the target distribution. The result is almost always better than feature extraction, at the cost of more compute.

for p in backbone.parameters():
    p.requires_grad = True

optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)

A refinement is differential learning rates: a tiny LR for the early layers (which already know what they are doing), a moderate LR for the middle, and a larger LR for the head (which started from random initialisation):

optimizer = torch.optim.Adam([
    {"params": backbone.layer1.parameters(), "lr": 1e-5},
    {"params": backbone.layer2.parameters(), "lr": 1e-5},
    {"params": backbone.layer3.parameters(), "lr": 1e-4},
    {"params": backbone.layer4.parameters(), "lr": 1e-4},
    {"params": backbone.fc.parameters(),     "lr": 1e-3},
])
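
An optimiser only updates the parameters it was given. With explicit groups like the ones above, anything not listed keeps its pre-trained values even when requires_grad is True; in torchvision's ResNet this is the case for the stem (conv1 and bn1), which sits before layer1. The sketch below shows a possible coverage check on a small stand-in model (an assumption; the same check applies to backbone).

```python
import torch
import torch.nn as nn

# Hypothetical stand-in with the same kind of structure: a stem, a block, a head.
model = nn.ModuleDict({
    "stem": nn.Conv2d(3, 8, kernel_size=3, padding=1),
    "layer1": nn.Conv2d(8, 8, kernel_size=3, padding=1),
    "fc": nn.Linear(8, 10),
})

optimizer = torch.optim.Adam([
    {"params": model["layer1"].parameters(), "lr": 1e-5},
    {"params": model["fc"].parameters(),     "lr": 1e-3},
])

# Identify the trainable parameters that no group covers.
covered = {id(p) for group in optimizer.param_groups for p in group["params"]}
missing = [name for name, p in model.named_parameters()
           if p.requires_grad and id(p) not in covered]
print(missing)   # parameters that will never be updated
```

Whether leaving the stem untouched is desirable depends on the task; the point of the check is to make that choice explicit rather than accidental.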

Pitfall — frozen BatchNorm during fine-tuning. When you call model.train() to fine-tune, the BN layers of the backbone go back to training mode and start updating their running statistics with your tiny CIFAR batches, even if you set requires_grad=False on their parameters. Those statistics may diverge from the well-tuned ImageNet ones and degrade accuracy. The clean fix is to manually set the BN layers to eval mode after model.train():

model.train()
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.eval()

The same problem applies, in reverse, when you forget to call model.train() at the start of fine-tuning: BN then uses (potentially stale) running statistics during training, which prevents the network from learning batch-normalised representations.
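
The effect of this fix can be checked on a toy model: after a forward pass in training mode, the running mean of a BN layer kept in eval mode should be unchanged. The helper name train_with_frozen_bn and the small stand-in network are assumptions for illustration.

```python
import torch
import torch.nn as nn


def train_with_frozen_bn(model):
    """Put the model in training mode but keep every BatchNorm2d layer in eval mode."""
    model.train()
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()
    return model


torch.manual_seed(0)
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8),
                      nn.ReLU(), nn.Dropout(0.5))      # stand-in backbone
bn = model[1]
before = bn.running_mean.clone()

train_with_frozen_bn(model)
model(torch.randn(16, 3, 8, 8) + 5.0)     # a batch with shifted statistics

print(torch.equal(bn.running_mean, before))   # running statistics untouched
print(model.training, bn.training, model[3].training)
```

Because any later call to model.train() puts the BN layers back in training mode, the helper has to be called again at the start of each training epoch.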

Pitfall — mismatched input pipelines. Pre-trained torchvision models expect inputs in [0, 1] followed by ImageNet normalisation, in RGB order, at 224×224. Feeding them grayscale images, or BGR-ordered tensors, or images normalised with CIFAR statistics, silently destroys their accuracy. The first thing to check when transfer learning underperforms is the input pipeline.

The general rule of thumb: with a few thousand images per class, feature extraction is enough. With tens of thousands, fine-tuning gives a clearer win. Below a thousand, augmentation becomes critical and the choice of pre-trained backbone matters more than fine-tuning depth.

Summary: composing a solid training run

A modern CIFAR-10 training script combines all the levers of this chapter:

an ImageFolder dataset with a transform pipeline that includes augmentations on the training side and a deterministic pipeline on the test side;
a DataLoader configured with num_workers > 0, pin_memory=True, and non_blocking=True in the loop, so that the GPU is never starved;
a CNN with BatchNorm2d after each convolution and Dropout in the classifier;
an optimiser (Adam or SGD with momentum) coupled with a CosineAnnealingLR scheduler over the full training horizon;
or, if compute is tight, a pre-trained ResNet from torchvision.models, fine-tuned with differential learning rates.

These five moves take a 65%-accuracy baseline comfortably past 80% on CIFAR-10, and towards 90% when the pre-trained ResNet route is taken, with no architectural innovation: only good engineering of the pipeline, the regularisation, and the optimisation schedule.

Exercises

Exercise 1 — Measuring the impact of the pipeline

On Fashion-MNIST PNG, measure the total time of one epoch in four configurations:

num_workers=0;
num_workers=2;
num_workers=4, pin_memory=True;
configuration 3 + non_blocking=True in the loop.

Time the full training loop with time.perf_counter(). Plot a bar chart of the times. Conclude: where is the bottleneck, and what gain can you expect on this particular task?

Exercise 2 — BN + Dropout on CIFAR-10

Take the SimpleCNN of the previous chapter adapted to CIFAR-10 (3 input channels, 10 classes). Build three variants:

A — without BN, without Dropout;
B — with BatchNorm after each convolution;
C — with BatchNorm + Dropout p=0.5 in the classifier.

Train each variant for 10 epochs with Adam, LR $10^{-3}$, batch size 256. Compare the loss curves and the test accuracy. Comment: which lever brings the clearest gain? Is there overfitting in variant A?

Exercise 3 — Data augmentation

Starting from variant C of Exercise 2, add an augmentation pipeline to the training set (RandomCrop with padding 4, RandomHorizontalFlip). The test set stays unchanged (ToTensor + Normalize only). Retrain and compare the test accuracy with the non-augmented version. Plot side by side the train/test curves of both runs to visualise the reduction in the generalisation gap.

Exercise 4 — Learning rate scheduling

Take the model of Exercise 3 and train for 30 epochs in two conditions:

constant LR at $10^{-3}$;
CosineAnnealingLR(optimizer, T_max=30) initialised at $10^{-3}$.

Plot the LR curve and the validation loss. At which epoch does the scheduler bring the largest gain?

Exercise 5 — Transfer learning on CIFAR-10

Load a pre-trained resnet18 from torchvision.models. Freeze all parameters, replace model.fc with an nn.Linear(512, 10), and train only the new head (on images resized to 224×224 and normalised with ImageNet statistics). Train for 5 epochs with Adam, LR $10^{-3}$, batch size 64. Measure the test accuracy and compare it to your best CNN trained from scratch. How many epochs from scratch would it take to match this performance?

Exercise 6 — Fine-tuning

Continue from Exercise 5: now unfreeze the entire network and train for 5 more epochs with Adam at LR $10^{-4}$. What marginal gain does this fine-tuning bring? Then try with a differential LR (LR $10^{-5}$ for layer1 and layer2, $10^{-4}$ for layer3 and layer4, $10^{-3}$ for fc). Is the gain significant?

Exercise 7 — Mode pitfalls

For each of the following cases, indicate what happens and why:

You forget to call model.eval() before evaluation, and the test batch only contains images of a single class.
During fine-tuning, you correctly call model.train(), but the frozen BN layers of the backbone keep updating their running statistics on your small batches.
You apply RandomHorizontalFlip on both train and test "to be consistent".
You use transforms.Normalize with ImageNet statistics on a model trained from scratch on CIFAR-10.

Going further

torchvision.models documentation — pytorch.org/vision/stable/models.html lists every available architecture, the pre-trained weights, and their ImageNet top-1/top-5 performance. The reference page when picking a backbone.
torchvision.transforms documentation — pytorch.org/vision/stable/transforms.html. Since torchvision 0.15 (the release paired with PyTorch 2.0) it includes the new transforms.v2 API, which is faster and supports bounding boxes and segmentation masks.
Albumentations — albumentations.ai is a popular alternative augmentation library, faster than torchvision.transforms for some operations and richer in geometric augmentations. Standard in Kaggle competitions.
Sergey Ioffe, Christian Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, ICML 2015 — the foundational paper, surprisingly readable.
Nitish Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014 — the original paper, worth reading for the motivation and the experiments.
Kaiming He et al., Deep Residual Learning for Image Recognition, CVPR 2016 — the ResNet paper, which combines BN, residual connections, and deep training. The model family from which most pre-trained ResNets descend.
Andrej Karpathy, A Recipe for Training Neural Networks — karpathy.github.io/2019/04/25/recipe. A pragmatic, widely-read guide to the pitfalls and good habits of CNN training.
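Leslie N. Smith, Cyclical Learning Rates for Training Neural Networks, WACV 2017 — introduces cyclical learning rate policies (the basis of CyclicLR); the related one-cycle policy behind OneCycleLR comes from Smith's later work. Both are still core to the fastai pipelines.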
Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, chapter 14 — covers data augmentation, transfer learning and fine-tuning from a practical angle.
Jeremy Howard, Sylvain Gugger, Deep Learning for Coders with fastai and PyTorch — the entire book is built around transfer learning as the default strategy, with a strong emphasis on cyclical LR schedulers and aggressive data augmentation.

In the next chapter, we stay on images but leave the territory of generic classification for two more structured problems: object detection (where are the objects in the image, and which class?) and semantic segmentation (which class does each pixel belong to?). The convolutional blocks and the regularisation techniques we have just laid out remain central; what changes is the head of the network and the loss function.