# Deep Learning 5 — Convolutional networks (3/3)
:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →
French and Chinese versions are available from the home page.
:::
Today, hardly anyone trains CNNs from scratch on real data: we start from a pre-trained model. This final chapter covers transfer learning, then opens the door to object detection with YOLO.
## Why this chapter?
You'll see:
- the model zoo: ResNet, VGG, MobileNet, EfficientNet;
- foundation models as the modern evolution of the zoo;
- the two transfer learning strategies: feature extraction and fine-tuning;
- the move from classification to object detection;
- the idea and use of YOLO.
## The model zoo
Rather than building an architecture from scratch, we lean on a model zoo of pre-trained networks. In vision, the standard source is `torchvision.models`:
| Family | Variants | Characteristic |
|---|---|---|
| ResNet | 18, 50, 101, 152 | Residual connections, the workhorse |
| VGG | 11, 16, 19 | Very deep stacking, many parameters |
| DenseNet | 121, 161, 201 | Dense connections between layers |
| MobileNet | V2, V3 | Optimised for mobile / edge |
| EfficientNet | B0 to B7 | Optimised efficiency / accuracy ratio |
Each was trained on ImageNet (~1.2M images, 1000 classes) — a huge compute cost we benefit from for free.
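Loading a member of the zoo takes one line per model. Here is a minimal sketch (assuming a recent torchvision; `list_models` needs >= 0.14):

```python
from torchvision import models

# Browse the available architectures, then load a few pre-trained members
print(models.list_models()[:5])

resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
mobilenet = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.IMAGENET1K_V1)
efficientnet = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
```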
## Foundation models
The ImageNet zoo is the ancestor of a broader idea: foundation models. Trained on massive datasets (often via self-supervision), they serve as starting points for dozens of tasks, with very little task-specific data.
- Vision: CLIP, DINO, SAM
- Text: GPT, Llama, Mistral
- Multimodal: where much of the current activity is concentrated
## Transfer learning: the central idea
The early layers of a CNN learn general representations (edges, textures, simple patterns); the later layers learn task-specific decisions.
Consequence: if we have an ImageNet-trained model and want to classify medical X-rays or plants, we can reuse the early layers and only retrain the later ones.
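To see this split concretely, here is a minimal sketch that lists the top-level blocks of a ResNet-18: everything up to `avgpool` is generic feature extraction, and the final `fc` layer is the ImageNet-specific head we will replace.

```python
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Top-level blocks, from generic early features to the task-specific head
for name, _ in model.named_children():
    print(name)  # conv1, bn1, relu, maxpool, layer1..layer4, avgpool, fc
```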
### Two strategies
#### 1. Feature extraction (freezing)
We freeze the entire pre-trained model, replace only the last layer, and train just that layer.
```python
import torch
from torch import nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze everything
for param in model.parameters():
    param.requires_grad = False

# Replace the head (1000 ImageNet classes → new n_classes)
model.fc = nn.Linear(model.fc.in_features, n_classes)  # n_classes: your task's class count

# Only model.fc parameters are trainable
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```
Pros: very fast, low overfitting risk, works even with very little data.
#### 2. Fine-tuning
We unfreeze everything (or just the later layers) and continue training with a smaller learning rate than usual, so as not to destroy the good representations already learned.
```python
# Unfreeze everything and continue training with a small learning rate
for param in model.parameters():
    param.requires_grad = True

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # small lr!
```
Pros: better performance than feature extraction. Cons: more costly, more risky (overfitting if little data).
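A middle ground between the two strategies, matching the "unfreeze later layers first" advice in the table below, is to unfreeze only the last block and the head. A sketch, reusing the ResNet-18 `model` from above:

```python
# Partial unfreeze: train only the last residual block (layer4) and the head (fc)
for param in model.parameters():
    param.requires_grad = False
for param in model.layer4.parameters():
    param.requires_grad = True
for param in model.fc.parameters():
    param.requires_grad = True

# Optimise only the parameters that still require gradients
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```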
### When to use which strategy?
| Situation | Strategy |
|---|---|
| Very little data (< 1,000 images) | Feature extraction (full freeze) |
| Lots of data (> 10,000 images) | Full fine-tuning |
| Domain very different from ImageNet | Fine-tuning, unfreeze later layers first |
| Embedded production | Fine-tune a MobileNet or EfficientNet |
## Object detection
### Classification vs detection
Classification answers "what's in this image?" with a single label.
Detection answers "which objects are present, where are they, and of what type?" with potentially many outputs per image.
A bounding box is defined by its position and size, typically centre coordinates plus width and height: (x, y, w, h). The network must predict, for each object: a class, a box, and a confidence score.
### Why a classifier alone isn't enough
A classification CNN progressively compresses the image through MaxPool, loses fine spatial information, and ends with a single global output. Good for recognising, bad for localising.
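This compression is easy to visualise. An illustrative sketch using forward hooks on a ResNet-18 to print the spatial resolution at each stage:

```python
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()

# Print each stage's output shape: the spatial grid keeps shrinking
for name in ["maxpool", "layer1", "layer2", "layer3", "layer4", "avgpool"]:
    getattr(model, name).register_forward_hook(
        lambda mod, inp, out, n=name: print(n, tuple(out.shape))
    )

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
# maxpool (1, 64, 56, 56) ... layer4 (1, 512, 7, 7), avgpool (1, 512, 1, 1)
```

By `layer4`, a 224×224 image has been reduced to a 7×7 grid, and `avgpool` collapses even that to 1×1: the "where" is gone.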
### The YOLO idea
YOLO (You Only Look Once, Redmon et al., 2016) proposes a simple revolution:
Do all detection in a single forward pass of the network.
Instead of cropping and reclassifying, YOLO:
- processes the whole image at once;
- implicitly cuts it into a grid (e.g. 13×13);
- for each grid cell, directly predicts possible classes, several candidate boxes, and their confidence scores.
A single forward pass produces all detections. As a result, YOLO can run in real time (30+ FPS on a modern GPU).
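To make the grid idea concrete, here is a back-of-the-envelope count of what one forward pass produces. The numbers below (grid size, boxes per cell, class count) are illustrative assumptions, not fixed by YOLO:

```python
S = 13   # grid size: S x S cells, as in the example above
B = 3    # candidate boxes per cell
C = 80   # number of classes (e.g. COCO)

# Each box carries (x, y, w, h, confidence) = 5 values, plus C class scores
values_per_cell = B * (5 + C)
print(S * S, "cells ->", S * S * values_per_cell, "values per forward pass")
```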
### YOLO format and Ultralytics training
A YOLO-format dataset:
```
dataset_yolo/
├── images/
│   ├── train/
│   └── val/
├── labels/
│   ├── train/
│   └── val/
└── data.yaml
```
Each image has a .txt file with the same base name in labels/, in the format:

```
class_id x_centre y_centre width height
```

All coordinates are normalised between 0 and 1, relative to the image width and height.
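For example, a hypothetical label file for an image containing a single object of class 0 could read:

```
0 0.512 0.431 0.220 0.310
```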
#### Training with Ultralytics
```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # 'n' = nano, the smallest variant

model.train(
    data="data.yaml",
    imgsz=640,
    epochs=20,
    batch=16,
)
```
The yolov8n.pt checkpoint ships pre-trained (on the COCO detection dataset), so we again benefit from transfer learning for free.
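After training, a sketch of the typical next steps (the image path is a placeholder):

```python
# Evaluate on the val split: reports mAP50, mAP50-95, precision and recall
metrics = model.val()

# Run inference on a new image and visualise the predicted boxes
results = model.predict("path/to/image.jpg", conf=0.25)
results[0].show()
```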
### Detection metrics
| Metric | Description |
|---|---|
| mAP50 | Mean Average Precision with IoU threshold 0.5 |
| mAP50-95 | Average over IoU thresholds 0.5 to 0.95 (more demanding) |
| Precision / recall | Per class |
IoU (Intersection over Union) measures overlap between predicted and ground truth boxes. A detection is considered correct if IoU ≥ threshold.
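IoU is simple to compute by hand. A minimal sketch for axis-aligned boxes, assuming an (x1, y1, x2, y2) corner convention (not the YOLO label format):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7 -> ~0.143
```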
## Wrap-up
You now have the foundations for most modern computer vision problems:
- understanding how a CNN learns spatial representations;
- choosing between feature extraction and fine-tuning based on data quantity;
- distinguishing classification and detection;
- the reflex of "start from a pre-trained model" rather than reinventing.
To go further: segmentation (Mask R-CNN, U-Net), generation (Stable Diffusion), vision-language (CLIP, BLIP), vision transformers (ViT, DETR).