# Deep Learning 5 — Convolutional networks (3/3)
:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →
French and Chinese versions are available from the home page.
:::
Today, hardly anyone trains CNNs from scratch on real data: we start from a pre-trained model. This final chapter covers transfer learning, then opens the door to object detection with YOLO.
## Why this chapter?
You'll see:
- the model zoo: ResNet, VGG, MobileNet, EfficientNet;
- foundation models as the modern evolution of the zoo;
- the two transfer learning strategies: feature extraction and fine-tuning;
- the move from classification to object detection;
- the idea and use of YOLO.
## The model zoo
Rather than building an architecture from scratch, we lean on a model zoo of pre-trained networks. In vision, the standard source is `torchvision.models`:
| Family | Variants | Characteristic |
|---|---|---|
| ResNet | 18, 50, 101, 152 | Residual connections, the workhorse |
| VGG | 11, 16, 19 | Very deep stacking, many parameters |
| DenseNet | 121, 161, 201 | Dense connections between layers |
| MobileNet | V2, V3 | Optimised for mobile / edge |
| EfficientNet | B0 to B7 | Optimised efficiency / accuracy ratio |
Each was trained on ImageNet (~1.2M images, 1000 classes) — a huge compute cost we benefit from for free.
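Loading a member of the zoo takes one line per model. Here is a minimal sketch (assuming a recent torchvision; `list_models` needs >= 0.14):

```python
from torchvision import models

# Browse the available architectures, then load a few pre-trained members
print(models.list_models()[:5])

resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
mobilenet = models.mobilenet_v3_small(weights=models.MobileNet_V3_Small_Weights.IMAGENET1K_V1)
efficientnet = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
```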
## Foundation models
The ImageNet zoo is the ancestor of a broader idea: foundation models. Trained on massive datasets (often via self-supervision), they serve as starting points for dozens of tasks, with very little task-specific data.
- Vision: CLIP, DINO, SAM
- Text: GPT, Llama, Mistral
- Multimodal: where much of the current activity is concentrated
## Transfer learning: the central idea
The early layers of a CNN learn general representations (edges, textures, simple patterns); the later layers learn task-specific decisions.
Consequence: if we have an ImageNet-trained model and want to classify medical X-rays or plants, we can reuse the early layers and only retrain the later ones.
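To see this split concretely, here is a minimal sketch that lists the top-level blocks of a ResNet-18: everything up to `avgpool` is generic feature extraction, and the final `fc` layer is the ImageNet-specific head we will replace.

```python
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Top-level blocks, from generic early features to the task-specific head
for name, _ in model.named_children():
    print(name)  # conv1, bn1, relu, maxpool, layer1..layer4, avgpool, fc
```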
### Two strategies
#### 1. Feature extraction (freezing)
We freeze the entire pre-trained model, replace only the last layer, and train just that layer.
```python
import torch
from torch import nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze everything
for param in model.parameters():
    param.requires_grad = False

# Replace the head (1000 ImageNet classes → new n_classes)
model.fc = nn.Linear(model.fc.in_features, n_classes)  # n_classes: your task's class count

# Only model.fc parameters are trainable
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```
Pros: very fast, low overfitting risk, works even with very little data.
#### 2. Fine-tuning
We unfreeze everything (or just the later layers) and continue training with a smaller learning rate than usual, so as not to destroy the good representations already learned.
```python
# Unfreeze everything and continue training with a small learning rate
for param in model.parameters():
    param.requires_grad = True

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # small lr!
```
Pros: better performance than feature extraction. Cons: more costly, more risky (overfitting if little data).
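A middle ground between the two strategies, matching the "unfreeze later layers first" advice in the table below, is to unfreeze only the last block and the head. A sketch, reusing the ResNet-18 `model` from above:

```python
# Partial unfreeze: train only the last residual block (layer4) and the head (fc)
for param in model.parameters():
    param.requires_grad = False
for param in model.layer4.parameters():
    param.requires_grad = True
for param in model.fc.parameters():
    param.requires_grad = True

# Optimise only the parameters that still require gradients
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```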
### When to use which strategy?
| Situation | Strategy |
|---|---|
| Very little data (< 1,000 images) | Feature extraction (full freeze) |
| Lots of data (> 10,000 images) | Full fine-tuning |
| Domain very different from ImageNet | Fine-tuning, unfreeze later layers first |
| Embedded production | Fine-tune a MobileNet or EfficientNet |
## Object detection
### Classification vs detection
Classification answers "what's in this image?" with a single label.
Detection answers "which objects are present, where are they, and of what type?" with potentially many outputs per image.
A bounding box is defined by its position and size, typically centre coordinates plus width and height: (x, y, w, h). The network must predict, for each object: a class, a box, and a confidence score.
### Why a classifier alone isn't enough
A classification CNN progressively compresses the image through MaxPool, loses fine spatial information, and ends with a single global output. Good for recognising, bad for localising.
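This compression is easy to visualise. An illustrative sketch using forward hooks on a ResNet-18 to print the spatial resolution at each stage:

```python
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()

# Print each stage's output shape: the spatial grid keeps shrinking
for name in ["maxpool", "layer1", "layer2", "layer3", "layer4", "avgpool"]:
    getattr(model, name).register_forward_hook(
        lambda mod, inp, out, n=name: print(n, tuple(out.shape))
    )

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
# maxpool (1, 64, 56, 56) ... layer4 (1, 512, 7, 7), avgpool (1, 512, 1, 1)
```

By `layer4`, a 224×224 image has been reduced to a 7×7 grid, and `avgpool` collapses even that to 1×1: the "where" is gone.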
### The YOLO idea
YOLO (You Only Look Once, Redmon et al., 2016) proposes a simple revolution:
Do all detection in a single forward pass of the network.
Instead of cropping and reclassifying, YOLO:
- processes the whole image at once;
- implicitly cuts it into a grid (e.g. 13×13);
- for each grid cell, directly predicts possible classes, several candidate boxes, and their confidence scores.
A single forward pass produces all detections. As a result, YOLO can run in real time (30+ FPS on a modern GPU).
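To make the grid idea concrete, here is a back-of-the-envelope count of what one forward pass produces. The numbers below (grid size, boxes per cell, class count) are illustrative assumptions, not fixed by YOLO:

```python
S = 13   # grid size: S x S cells, as in the example above
B = 3    # candidate boxes per cell
C = 80   # number of classes (e.g. COCO)

# Each box carries (x, y, w, h, confidence) = 5 values, plus C class scores
values_per_cell = B * (5 + C)
print(S * S, "cells ->", S * S * values_per_cell, "values per forward pass")
```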
### YOLO format and Ultralytics training
A YOLO-format dataset:
```
dataset_yolo/
├── images/
│   ├── train/
│   └── val/
├── labels/
│   ├── train/
│   └── val/
└── data.yaml
```
Each image has a .txt file with the same base name in labels/, in the format:

```
class_id x_centre y_centre width height
```

All coordinates are normalised between 0 and 1, relative to the image width and height.
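For example, a hypothetical label file for an image containing a single object of class 0 could read:

```
0 0.512 0.431 0.220 0.310
```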
#### Training with Ultralytics
```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # 'n' = nano, the smallest variant

model.train(
    data="data.yaml",
    imgsz=640,
    epochs=20,
    batch=16,
)
```
The yolov8n.pt checkpoint ships pre-trained (on the COCO detection dataset), so we again benefit from transfer learning for free.
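After training, a sketch of the typical next steps (the image path is a placeholder):

```python
# Evaluate on the val split: reports mAP50, mAP50-95, precision and recall
metrics = model.val()

# Run inference on a new image and visualise the predicted boxes
results = model.predict("path/to/image.jpg", conf=0.25)
results[0].show()
```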
### Detection metrics
| Metric | Description |
|---|---|
| mAP50 | Mean Average Precision with IoU threshold 0.5 |
| mAP50-95 | Average over IoU thresholds 0.5 to 0.95 (more demanding) |
| Precision / recall | Per class |
IoU (Intersection over Union) measures overlap between predicted and ground truth boxes. A detection is considered correct if IoU ≥ threshold.
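IoU is simple to compute by hand. A minimal sketch for axis-aligned boxes, assuming an (x1, y1, x2, y2) corner convention (not the YOLO label format):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7 -> ~0.143
```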
## Wrap-up
You now have the foundations for most modern computer vision problems:
- understanding how a CNN learns spatial representations;
- choosing between feature extraction and fine-tuning based on data quantity;
- distinguishing classification and detection;
- the reflex of "start from a pre-trained model" rather than reinventing.
To go further: segmentation (Mask R-CNN, U-Net), generation (Stable Diffusion), vision-language (CLIP, BLIP), vision transformers (ViT, DETR).