
Deep Learning 5 — Convolutional networks (3/3)

:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →

French and Chinese versions available from the home page.
:::

Today, hardly anyone trains CNNs from scratch on real data: we start from a pre-trained model. This last chapter covers transfer learning, then opens onto object detection with YOLO.

Why this chapter?

You'll see:

  • the model zoo: ResNet, VGG, MobileNet, EfficientNet;
  • foundation models as the modern evolution of the zoo;
  • the two transfer learning strategies: feature extraction and fine-tuning;
  • the move from classification to object detection;
  • the idea and use of YOLO.

The model zoo

Rather than building an architecture from scratch, we lean on a model zoo of pre-trained networks. In vision, the standard is torchvision.models:

| Family | Variants | Characteristic |
|---|---|---|
| ResNet | 18, 50, 101, 152 | Residual connections, the workhorse |
| VGG | 11, 16, 19 | Very deep stacking, many parameters |
| DenseNet | 121, 161, 201 | Dense connections between layers |
| MobileNet | V2, V3 | Optimised for mobile / edge |
| EfficientNet | B0 to B7 | Optimised efficiency / accuracy ratio |

Each was trained on ImageNet (~1.2M images, 1000 classes) — a huge compute cost we benefit from for free.

Foundation models

The ImageNet zoo is the ancestor of a broader idea: foundation models. Trained on massive datasets (often via self-supervision), they serve as starting points for dozens of tasks, with very little task-specific data.

  • Vision: CLIP, DINO, SAM
  • Text: GPT, Llama, Mistral
  • Multimodal: where much of the current activity is concentrated

Transfer learning: the central idea

The early layers of a CNN learn general representations (edges, textures, simple patterns); the later layers learn task-specific decisions.

Consequence: if we have an ImageNet-trained model and want to classify medical X-rays or plants, we can reuse the early layers and only retrain the later ones.

Two strategies

1. Feature extraction (freezing)

We freeze the entire pre-trained model, replace only the last layer, and train just that layer.

```python
import torch
from torch import nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze everything
for param in model.parameters():
    param.requires_grad = False

# Replace the head (1000 ImageNet classes → new n_classes)
model.fc = nn.Linear(model.fc.in_features, n_classes)

# Only model.fc parameters are trainable
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```

Pros: very fast, low overfitting risk, works even with very little data.

2. Fine-tuning

We unfreeze everything (or just the later layers) and continue training with a smaller learning rate than usual, to avoid breaking the good learned representations.

```python
for param in model.parameters():
    param.requires_grad = True

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # small lr!
```

Pros: better performance than feature extraction. Cons: more costly, more risky (overfitting if little data).

When to use which strategy?

| Situation | Strategy |
|---|---|
| Very little data (< 1000) | Feature extraction (full freeze) |
| Lots of data (> 10k) | Full fine-tuning |
| Domain very different from ImageNet | Fine-tuning, unfreeze later layers first |
| Embedded production | Fine-tune a MobileNet or EfficientNet |

Object detection

Classification vs detection

Classification answers "what's in this image?" with a single label.

Detection answers "which objects are present, where are they, and of what type?" with potentially many outputs per image.

A bounding box is defined by its position (x, y) and size (w, h). The network must predict, for each object: class, box, and a confidence score.

Why a classifier alone isn't enough

A classification CNN progressively compresses the image through MaxPool, loses fine spatial information, and ends with a single global output. Good for recognising, bad for localising.

The YOLO idea

YOLO (You Only Look Once, Redmon et al., 2016) proposes a simple revolution:

Do all detection in a single forward pass of the network.

Instead of cropping and reclassifying, YOLO:

  • processes the whole image at once;
  • implicitly cuts it into a grid (e.g. 13×13);
  • for each grid cell, directly predicts possible classes, several candidate boxes, and their confidence scores.

A single forward pass produces all detections. As a result, YOLO can run in real time (30+ FPS on a modern GPU).
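To make the grid idea concrete, here is a pure-Python sketch of how a cell "owns" an object (the 13×13 grid and the box centre below are illustrative, not YOLO's actual implementation):

```python
def owning_cell(x_centre, y_centre, S=13):
    """Map a normalised box centre (0..1) to its (col, row) grid cell.

    In YOLO, the cell containing the object's centre is responsible
    for predicting that object's box and class.
    """
    col = min(int(x_centre * S), S - 1)  # clamp in case centre == 1.0
    row = min(int(y_centre * S), S - 1)
    return col, row

# An object centred slightly right of the image centre:
print(owning_cell(0.55, 0.5))  # → (7, 6)
```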

YOLO format and Ultralytics training

A YOLO-format dataset:

```
dataset_yolo/
├── images/
│   ├── train/
│   └── val/
├── labels/
│   ├── train/
│   └── val/
└── data.yaml
```

Each image has a .txt file (same base name) in labels/, with format:

```
class_id x_centre y_centre width height
```

Coordinates normalised between 0 and 1.
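For example, converting a pixel-space corner box to this format (a sketch; `to_yolo` is a hypothetical helper, not part of any library):

```python
def to_yolo(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space corner box to one YOLO label line."""
    x_centre = (x_min + x_max) / 2 / img_w   # centre, normalised to [0, 1]
    y_centre = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w          # size, normalised to [0, 1]
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_centre:.6f} {y_centre:.6f} {width:.6f} {height:.6f}"

# A box from (50, 100) to (150, 300) px in a 640×480 image, class 0:
print(to_yolo(0, 50, 100, 150, 300, 640, 480))
# → 0 0.156250 0.416667 0.156250 0.416667
```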

Training with Ultralytics

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # 'n' = nano, the smallest variant

model.train(
    data="data.yaml",
    imgsz=640,
    epochs=20,
    batch=16,
)
```

The yolov8n.pt weights are pre-trained on the COCO detection dataset — free transfer learning.

Detection metrics

| Metric | Description |
|---|---|
| mAP50 | Mean Average Precision at IoU threshold 0.5 |
| mAP50-95 | Averaged over IoU thresholds 0.5 to 0.95 (more demanding) |
| Precision / recall | Computed per class |

IoU (Intersection over Union) measures overlap between predicted and ground truth boxes. A detection is considered correct if IoU ≥ threshold.
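A minimal IoU computation on corner-format boxes (x_min, y_min, x_max, y_max), as a pure-Python sketch:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes don't overlap)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Two 2×2 boxes overlapping on a 1×1 square: IoU = 1 / (4 + 4 - 1)
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # → 0.14285714285714285
```

With a threshold of 0.5 (as in mAP50), this pair would count as a miss.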

Wrap-up

You now have the foundations for most modern computer vision problems:

  • understanding how a CNN learns spatial representations;
  • choosing between feature extraction and fine-tuning based on data quantity;
  • distinguishing classification and detection;
  • the reflex of "start from a pre-trained model" rather than reinventing.

To go further: segmentation (Mask R-CNN, U-Net), generation (Stable Diffusion), vision-language (CLIP, BLIP), vision transformers (ViT, DETR).


Full notebook on Kaggle (forkable) →