DL · Chapter 5

Deep Learning 5 — Convolutional Networks and Image Processing, Part 3

This final chapter on convolutional networks closes the cycle started in Chapter DL-3. We have seen how a CNN learns local features layer by layer (DL-3), how regularization techniques such as Batch Normalization and Dropout stabilize training (DL-4), and how to scale these architectures up to colour images of meaningful resolution. In this chapter we move beyond building a network from scratch and ask a more pragmatic question: how do practitioners actually solve real vision problems today?

The answer revolves around two ideas that have reshaped the field over the last decade. The first is transfer learning: instead of training a fresh network on every new dataset, we recycle the immense knowledge already crystallized in models pre-trained on giant generic corpora. The second is end-to-end object detection: we move from telling the network what is in an image to asking it where the objects are and what kind they are, all in a single forward pass. Both shifts have a common philosophical flavour — they trade brute-force training for clever reuse and integration.

The case study that runs through this chapter is the Intel Image Classification dataset, a collection of RGB photographs (150×150 pixels in the original files, resized to 224×224 in our pipeline) labelled across six natural-scene categories (buildings, forest, glacier, mountain, sea, street). It is a typical "real-world" problem: the resolution is modest, the classes overlap visually, and a small CNN trained from scratch quickly hits a performance ceiling. We will use it first as a benchmark for transfer learning with ResNet, then transition to the broader question of object detection with YOLO.

A baseline CNN on Intel Image Classification

Before invoking the heavy artillery of pre-trained models, it is instructive to train a modest CNN from scratch and observe its limits. This baseline serves three purposes: it forces us to set up a clean data pipeline, it surfaces the practical issues of working at 224×224, and it provides a yardstick against which transfer learning gains can be measured.

Loading images with ImageFolder

PyTorch's torchvision.datasets.ImageFolder is a small marvel of convenience. Pointing it at a directory whose subfolders are named after classes is enough to obtain a fully indexed dataset — labels are inferred from folder names, and images are loaded lazily.

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_dataset = datasets.ImageFolder(
    root="/kaggle/input/intel-image-classification/seg_train/seg_train",
    transform=transform,
)
test_dataset = datasets.ImageFolder(
    root="/kaggle/input/intel-image-classification/seg_test/seg_test",
    transform=transform,
)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                          num_workers=2, pin_memory=True)
test_loader  = DataLoader(test_dataset,  batch_size=32, shuffle=False,
                          num_workers=2, pin_memory=True)

A few practical remarks deserve attention. Resize((224, 224)) ensures every image arrives at the network with the same spatial extent. This is necessary because the DataLoader can only stack same-sized tensors into a batch, and because a fully connected head applied after a flatten expects a fixed input size. ToTensor() does double duty: it reorders the image from HWC (PIL or NumPy) to CHW (PyTorch) and rescales pixel values from [0, 255] to [0, 1]. Finally, pin_memory=True and num_workers=2 are GPU-friendly defaults: the first keeps batches in page-locked host memory so that transfers to the GPU are faster, and the second prepares batches in two parallel worker processes.

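Memory hygiene. Before launching a heavy training run, it is good practice to free memory explicitly. A garbage-collection pass followed by a CUDA cache flush settles the matter (empty_cache() only releases cached blocks that no live tensor still uses):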
import gc, torch
gc.collect()
torch.cuda.empty_cache()

A small CNN with global average pooling

Our baseline replaces the gigantic flatten-then-dense pattern of LeNet-style networks with a much leaner head built on global average pooling. The intuition: at the end of the convolutional stack, each feature map already represents a learned semantic cue averaged over the spatial extent. Pooling it to a single number per channel collapses the tensor to a compact descriptor and dramatically reduces the parameter count of the classifier.

import torch
import torch.nn as nn

class SimpleCNN_BN_DO(nn.Module):
    def __init__(self, num_classes=6, dropout_p=0.5):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.bn1   = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.bn2   = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 64, 3, padding=1)
        self.bn3   = nn.BatchNorm2d(64)
        self.relu  = nn.ReLU()
        self.pool  = nn.MaxPool2d(2)
        self.gap   = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(p=dropout_p)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.pool(self.relu(self.bn1(self.conv1(x))))   # 224 -> 112
        x = self.relu(self.bn2(self.conv2(x)))              # 112
        x = self.pool(self.relu(self.bn3(self.conv3(x))))   # 112 -> 56
        x = self.gap(x)                                     # -> (B,64,1,1)
        x = torch.flatten(x, 1)                             # -> (B,64)
        x = self.dropout(x)
        return self.fc(x)

AdaptiveAvgPool2d((1, 1)) is the key trick here. Whatever the spatial size of the input feature map, it averages it down to a single 1×1 grid, yielding a tensor of shape (B, C, 1, 1). After flattening, the classifier only needs to learn a 64 -> num_classes mapping — far fewer parameters than the 64*56*56 -> num_classes matrix that a naive flatten would have demanded.
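The saving can be checked with simple arithmetic. The sketch below counts weights and biases of the final linear layer in both designs; the helper name linear_params is introduced here for illustration only.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

num_classes = 6
channels, height, width = 64, 56, 56   # shape of the last feature map in the baseline

gap_head = linear_params(channels, num_classes)
flatten_head = linear_params(channels * height * width, num_classes)

print(f"GAP head:     {gap_head:>9,} parameters")
print(f"Flatten head: {flatten_head:>9,} parameters")
print(f"Ratio:        {flatten_head / gap_head:,.0f}x")
```

With these shapes the pooled head has a few hundred parameters, whereas the flattened alternative would need more than a million.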

Trained for four epochs with Adam and cross-entropy, this network reaches a respectable but unspectacular accuracy. The point of the exercise is not to win a Kaggle leaderboard but to set the stage: we are about to see how a pre-trained network can beat this baseline with a fraction of the training budget.
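For reference, a training run of this kind can be sketched as follows. This is a minimal loop under stated assumptions, not the exact script behind the chapter's results: the function names are ours, and the random tensors and tiny linear model are stand-ins so that the snippet runs anywhere. In practice you would pass the real train_loader, test_loader, and the baseline model.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, optimizer, criterion, device):
    """One pass over the training data; returns the mean loss per sample."""
    model.train()
    running_loss = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * images.size(0)
    return running_loss / len(loader.dataset)

@torch.no_grad()
def evaluate(model, loader, device):
    """Fraction of correctly classified samples."""
    model.eval()
    correct = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        correct += (model(images).argmax(dim=1) == labels).sum().item()
    return correct / len(loader.dataset)

# Stand-in data and model (assumptions for illustration only)
torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
fake_images = torch.randn(64, 3, 32, 32)
fake_labels = torch.randint(0, 6, (64,))
loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=16, shuffle=True)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 6)).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

losses = [train_one_epoch(model, loader, optimizer, criterion, device) for _ in range(4)]
accuracy = evaluate(model, loader, device)
print(losses, accuracy)
```

Switching between model.train() and model.eval() matters for the baseline, because Batch Normalization and Dropout behave differently in the two modes.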

Model zoos, foundation models, and transfer learning

The model zoo

A model zoo is a curated collection of neural networks, pre-trained by their authors on large public datasets and made available for direct download. Every major deep learning framework ships such a zoo; in PyTorch, it lives under torchvision.models.

Why a zoo matters. A pre-trained model bundles three things that took substantial engineering effort and GPU time to produce:

a validated architecture,

weights already optimized on a meaningful task,

benchmark performance that gives a clear baseline.

Typical members of the vision zoo include ResNet (in many depths: 18, 34, 50, 101, 152), VGG (a deep classic, now mostly historical), DenseNet, MobileNet (mobile-optimized), and EfficientNet (scaled along three axes simultaneously). Each comes with weights trained on ImageNet, a dataset of roughly 1.3 million images across 1000 classes — large enough that the network has internalized a genuinely general visual vocabulary.

The slogan to remember is simple: do not start from scratch.

From model zoos to foundation models

The zoo idea has been generalized in recent years into the concept of foundation models. A foundation model is trained on vast amounts of data, often through self-supervised or weakly supervised objectives, and is designed to be adapted to many downstream tasks. ImageNet-pretrained CNNs were an early, small-scale incarnation of this idea. Modern foundation models — CLIP, DINO, SAM in vision; LLaMA, GPT, Claude in language — push the same logic to far larger scales, with far stronger transfer abilities.

The practical takeaway is that the workflow is the same regardless of scale: train (or download) a general-purpose model, then adapt it cheaply to your specific task.

The principle of transfer learning

Transfer learning is the technique that operationalizes this insight. Its central observation comes from years of probing trained CNNs:

Early layers learn general representations (edges, textures, colour blobs), while later layers learn task-specific decisions (the difference between a cat and a dog).

If the early layers are general, why retrain them? We can simply reuse them and concentrate our training budget on the parts of the network that genuinely need to specialize.

Two classic strategies

Two strategies dominate practice. They sit on a spectrum from "minimal intervention" to "maximal adaptation".

Feature extraction. Freeze the entire pre-trained network and replace only the final classification layer. Train just that new head on your data. This approach is fast, frugal in data, and unlikely to overfit, but it caps the achievable performance because the backbone cannot adapt to your domain.

Fine-tuning. Unfreeze some (or all) layers of the pre-trained network and let them adjust to the new task, usually with a lower learning rate to preserve their hard-won knowledge. Fine-tuning yields better performance and finer domain adaptation, at the cost of longer training and a higher risk of overfitting on small datasets.

A pragmatic rule of thumb: start with feature extraction to get a quick baseline, then progressively unfreeze the deepest blocks if you have enough data and compute.

Implementation with PyTorch

Loading a ResNet-18 from the zoo and adapting it to a 6-class problem takes only a few lines:

from torchvision import models
import torch.nn as nn

weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.fc = nn.Linear(model.fc.in_features, 6)   # new head: 6 scene classes

Note the weights=... argument: since torchvision 0.13, the older pretrained=True is deprecated in favour of explicit weight enums, which carry information about the dataset, normalization statistics, and license.

To enter feature extraction mode, freeze every parameter and re-enable gradients only on the new head:

for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True

Then optimize only the head:

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

To switch to partial fine-tuning, unfreeze the last residual block (layer4) along with the head:

for name, param in model.named_parameters():
    # Train layer4 and the head; keep every other layer frozen
    param.requires_grad = name.startswith(("layer4", "fc"))

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-4,        # smaller LR to preserve pre-trained features
)

Notice the deliberate drop from 1e-3 to 1e-4 for fine-tuning. Pre-trained weights are precious; pushing them with a large learning rate can wipe out the features learned during pre-training in a few badly aimed gradient steps.

Normalization: respecting the model's diet

A subtle but consequential point: a pre-trained model expects inputs that look like the data it was trained on. ImageNet-trained networks expect RGB images normalized with the ImageNet statistics:

transforms.Normalize(mean=(0.485, 0.456, 0.406),
                     std=(0.229, 0.224, 0.225))

Forgetting this normalization is one of the most common transfer learning mistakes. The network will still produce outputs, but its internal representations will be operating outside their trained range and accuracy will silently suffer. When training from scratch, by contrast, any consistent normalization (or even none, if ToTensor() is enough) will work — the network will simply learn its own input scale.

Yesterday and today

The whole landscape of modern AI can be read as the same trick repeated at increasing scales:

Yesterday	Today
CNN on ImageNet	Foundation Models
Transfer learning	Adaptation / fine-tuning
Feature extraction	Prompting, adapters, LoRA

The fundamental principle is unchanged: learn a general representation, then adapt it efficiently.

Visualizing a pre-trained model

Curious about what's actually inside resnet18? Two complementary tools help. A simple print(model) lists every module in textual form. For a graphical view of the computational graph, torchviz traces a forward pass and renders it:

!pip install torchviz

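from torchviz import make_dot   # rendering also requires the Graphviz system package
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

x = torch.randn(1, 3, 224, 224, device=device)   # dummy input batch
y = model(x)
make_dot(y, params=dict(model.named_parameters()))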

The resulting graph reveals the elegant skip-connection structure that makes ResNets trainable at depth — convolutional blocks whose outputs are added back to their inputs, allowing gradients to flow unimpeded through dozens of layers.
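The skip connection itself is only one line of code. The block below is a simplified sketch of the idea, assuming equal input and output channels and no downsampling; it is not torchvision's exact BasicBlock, and the class name is ours.

```python
import torch
import torch.nn as nn

class TinyResidualBlock(nn.Module):
    """Simplified residual block: two 3x3 convolutions plus an identity shortcut."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection: add the input back

block = TinyResidualBlock(64)
x = torch.randn(2, 64, 56, 56)
y = block(x)
print(x.shape, y.shape)
```

Because the input is added to the output, the gradient with respect to x always contains a direct path that bypasses the two convolutions.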

Object detection: from "what" to "where + what"

So far, every model we have built answers a single question: what is in this image? This is classification. But many real applications need a richer answer: an autonomous car needs to know not only that there is a pedestrian in view, but where the pedestrian is and how many of them there are. This is object detection, and it is a qualitatively harder problem.

Classification vs. detection

In classification, the output is a single label per image:

image -> [class]

In detection, the output is a list of pairs, each containing a class and a bounding box:

image -> [(class, box), (class, box), ...]

A bounding box is conventionally encoded as a position (x, y) and a size (width, height), sometimes with the position referring to the centre and sometimes to the top-left corner.
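Since both conventions appear in practice, small conversion helpers are worth having. The functions below are a minimal sketch with names chosen for this illustration; coordinates are in pixels with the origin at the top-left corner.

```python
def center_to_corners(x_center, y_center, width, height):
    """(centre x, centre y, w, h) -> (x_min, y_min, x_max, y_max)."""
    x_min = x_center - width / 2
    y_min = y_center - height / 2
    return x_min, y_min, x_min + width, y_min + height

def top_left_to_center(x_min, y_min, width, height):
    """(top-left x, top-left y, w, h) -> (centre x, centre y, w, h)."""
    return x_min + width / 2, y_min + height / 2, width, height

corners = center_to_corners(100.0, 80.0, 40.0, 20.0)
center = top_left_to_center(80.0, 70.0, 40.0, 20.0)
print(corners)   # (80.0, 70.0, 120.0, 90.0)
print(center)    # (100.0, 80.0, 40.0, 20.0)
```

Mixing up the two conventions is a frequent source of boxes that look shifted by half their size.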

Why a classification CNN is not enough

A CNN trained for classification has been engineered to throw away spatial information. Each pooling layer halves the resolution; each stride-2 convolution does the same; and the global pooling that we praised earlier discards whatever spatial structure remains. By the time the classifier fires, the network has compressed the entire image into a single description.

This makes a CNN excellent at recognizing but bad at localizing. Asking a vanilla ResNet "where is the dog?" is asking the wrong question to the wrong tool.

The naive approach: sliding windows

A historically natural idea is to convert a classifier into a detector by scanning the image with a sliding window: extract a small region, resize it to the network's expected input size, classify it, slide the window by a few pixels, and repeat. Doing this at multiple scales and aspect ratios eventually finds objects.

The arithmetic is unforgiving. A 600×600 image scanned with a 50-pixel stride at four scales already requires several hundred forward passes, and the count climbs into the thousands once the stride is tightened or several aspect ratios are added. Real-time video, which demands tens of frames per second, is out of reach. Worse, the approach scales badly: doubling the resolution roughly quadruples the number of windows.

Sliding windows are conceptually simple, but practically unusable for any modern application.
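The cost is easy to estimate. The sketch below counts square windows on a square image; the four window sizes are assumptions chosen for illustration, and aspect ratios other than 1:1 would multiply the totals further.

```python
def windows_per_scale(image_size: int, window_size: int, stride: int) -> int:
    """Number of square windows of one size that fit on a square image."""
    if window_size > image_size:
        return 0
    per_axis = (image_size - window_size) // stride + 1
    return per_axis * per_axis

def total_windows(image_size, window_sizes, stride):
    """Total forward passes needed: one per window, summed over all scales."""
    return sum(windows_per_scale(image_size, w, stride) for w in window_sizes)

window_sizes = (50, 100, 200, 400)   # assumed scales, in pixels

coarse = total_windows(600, window_sizes, stride=50)
fine = total_windows(600, window_sizes, stride=10)
doubled = total_windows(1200, window_sizes, stride=50)

print(coarse, fine, doubled)   # 371 7859 1835
```

A coarse stride keeps the count in the hundreds but risks missing objects between windows; a finer stride or a larger image pushes it into the thousands.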

Why "transposing" a CNN is hard

A more sophisticated thought is to repurpose the convolutional stack itself for dense prediction. After all, a convolutional layer is already applied at every spatial position. The challenge is that detection demands several things simultaneously:

a dense prediction at each spatial location,
across multiple scales to handle objects of different sizes,
with multiple candidate boxes per location to handle overlaps,
and a way to suppress redundant detections.
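The last requirement is usually met with non-maximum suppression (NMS), which relies on the intersection over union (IoU) between boxes. The following is a plain-Python sketch of the greedy variant, written for readability rather than speed; the 0.5 threshold and the example boxes are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by decreasing score, keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept

boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 260, 260)]
scores = [0.90, 0.75, 0.60]
kept = non_max_suppression(boxes, scores)
print(kept)   # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68 and is dropped as a duplicate; the third box is far away and survives.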

A naive attempt to attach detection heads to a classifier produces messy, slow, and unstable systems. Two-stage detectors like Faster R-CNN solve these issues with a region proposal network, but they remain relatively heavy. The breakthrough that brought detection to real-time speeds came from a different direction.

The key idea of YOLO

YOLO stands for You Only Look Once, and the name is the algorithm in a slogan:

Perform detection through the network in a single forward pass.

Instead of cropping the image and classifying each region, YOLO processes the image in one go, divides it implicitly into a regular grid, and predicts directly, for each grid cell:

the classes of any object whose centre falls in that cell,
the bounding boxes of those objects,
a confidence score for each prediction.

The conceptual diagram is delightfully simple:

image -> CNN -> tensor of predictions (classes + boxes)
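To see what such a prediction tensor looks like, the sketch below uses the layout of the original YOLO paper (a 7×7 grid, 2 boxes per cell, 20 classes). Later versions, including YOLOv8, organize their outputs differently, so treat this as an illustration of the principle only; the helper name is ours.

```python
S, B, C = 7, 2, 20   # original YOLO: 7x7 grid, 2 boxes per cell, 20 classes

values_per_cell = B * 5 + C          # each box: x, y, w, h, confidence
output_shape = (S, S, values_per_cell)

def responsible_cell(x_center, y_center, grid_size=S):
    """Grid cell (row, col) that contains a normalized object centre."""
    col = min(int(x_center * grid_size), grid_size - 1)
    row = min(int(y_center * grid_size), grid_size - 1)
    return row, col

cell = responsible_cell(0.52, 0.31)
print(output_shape)   # (7, 7, 30)
print(cell)           # (2, 3)
```

During training, only the cell containing an object's centre is asked to predict that object, which is what turns detection into a fixed-size regression target.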

Why YOLO is fast

YOLO's speed comes from its uncompromising design:

a single forward pass, regardless of how many objects are in the image,
no sliding window, no region proposals,
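the network is trained end to end, with localization and classification optimized jointly.

The paradigm shift is profound. We no longer "move the CNN over the image"; we project the detection task into the network's output space. The network's output tensor is the detection: there is no separate per-region classification step, only light post-processing (confidence thresholding and, in most versions, non-maximum suppression) to remove duplicates.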

Training YOLO on a Kaggle dataset

The Ultralytics implementation of YOLO (we use version 8 here; newer releases follow the same API) has made training a custom detector accessible to anyone with a few hundred annotated images. We sketch the workflow on a Kaggle dataset prepared in YOLO format.

The YOLO data format

A YOLO dataset has a fixed directory structure:

dataset_yolo/
|-- images/
|   |-- train/
|   |-- val/
|   `-- test/        (optional)
|-- labels/
|   |-- train/
|   |-- val/
|   `-- test/        (optional)
`-- data.yaml

Each image has a same-named .txt label file in labels/. Each line of the label file describes one object:

class_id  x_center  y_center  width  height

with all coordinates normalized between 0 and 1 relative to the image size, and the origin at the top-left corner. The data.yaml file lists the class names and points to the train/val/test directories.
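The conversion between pixel boxes and label lines is a frequent source of bugs, so here is a minimal sketch of both directions. The function names, the example box, and the class name in the data.yaml text are hypothetical; the path simply reuses the placeholder dataset location from this chapter.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space corner box into one YOLO label line."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

def from_yolo_line(line, img_w, img_h):
    """Parse a label line back into (class_id, x_min, y_min, x_max, y_max) in pixels."""
    class_id, xc, yc, w, h = line.split()
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    return int(class_id), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2

line = to_yolo_line(0, 160, 120, 480, 360, img_w=640, img_h=480)
box = from_yolo_line(line, img_w=640, img_h=480)
print(line)   # 0 0.500000 0.500000 0.500000 0.500000
print(box)    # (0, 160.0, 120.0, 480.0, 360.0)

# What a matching data.yaml might contain (class name is a made-up example)
DATA_YAML = """\
path: /kaggle/input/my-dataset-yolo
train: images/train
val: images/val
names:
  0: pothole
"""
```

Note that width and height are normalized by different quantities, so a square box on a non-square image does not have equal normalized sides.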

Annotating with Roboflow

In practice, hand-writing label files is tedious and error-prone. Roboflow is a widely used annotation platform: it lets you upload images, draw boxes (or masks, or keypoints) in a browser, manage train/val splits, and export directly in YOLO format. Many "turnkey" Kaggle datasets in YOLO format originate from Roboflow exports.

The typical workflow is:

import images,
annotate (or import existing annotations),
choose the YOLOv8 export,
download the zip or push it to Kaggle.
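After downloading an export, it can be worth checking that images and labels line up before spending GPU time. The helper below is a sketch that assumes the images/ and labels/ layout described earlier; its name and the tiny temporary dataset are ours.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def find_unlabelled_images(dataset_root, split="train"):
    """Return image files in images/<split> with no same-named .txt in labels/<split>."""
    image_dir = Path(dataset_root) / "images" / split
    label_dir = Path(dataset_root) / "labels" / split
    missing = []
    for image_path in sorted(image_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        if not (label_dir / f"{image_path.stem}.txt").exists():
            missing.append(image_path.name)
    return missing

# Tiny throwaway dataset to demonstrate the check
with tempfile.TemporaryDirectory() as root:
    for sub in ("images/train", "labels/train"):
        (Path(root) / sub).mkdir(parents=True)
    (Path(root) / "images/train/a.jpg").touch()
    (Path(root) / "images/train/b.jpg").touch()
    (Path(root) / "labels/train/a.txt").write_text("0 0.5 0.5 0.2 0.2\n")
    missing = find_unlabelled_images(root)

print(missing)   # ['b.jpg']
```

A non-empty result is not necessarily an error, since an image without a label file is typically treated as a background image containing no objects, but an unexpectedly long list usually points to a broken export.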

Installing and training

Ultralytics provides a clean Python API and CLI:

!pip install -q ultralytics

from ultralytics import YOLO
import torch

print("CUDA available:", torch.cuda.is_available())

model = YOLO("yolov8n.pt")   # 'n' = nano, smallest model

model.train(
    data="/kaggle/input/my-dataset-yolo/data.yaml",
    imgsz=640,
    epochs=20,
    batch=16,
    workers=2,
)

yolov8n.pt is the nano variant — small enough to fit comfortably in a Kaggle GPU's memory and fast enough to train in minutes on a small dataset. Larger variants (s, m, l, x) trade speed for accuracy.

Evaluation and prediction

Validation reports the standard detection metrics:

metrics = model.val(
    data="/kaggle/input/my-dataset-yolo/data.yaml",
    imgsz=640,
)
print(metrics)

The numbers to know are:

mAP50: mean Average Precision at an IoU threshold of 0.5, a relatively forgiving metric.
mAP50-95: mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05, much stricter and the standard COCO metric.
precision and recall for each class.
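The difference between the two mAP variants comes down to how tightly a predicted box must match the ground truth to count as a true positive. The sketch below illustrates only that matching criterion, not the full precision-recall computation; the function name and the two IoU values are illustrative.

```python
# The ten IoU thresholds used by mAP50-95: 0.50, 0.55, ..., 0.95
COCO_IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def thresholds_passed(iou_value):
    """IoU thresholds at which a detection with this IoU counts as a true positive."""
    return [t for t in COCO_IOU_THRESHOLDS if iou_value >= t]

loose_box = thresholds_passed(0.72)   # decent localization
tight_box = thresholds_passed(0.96)   # near-perfect localization

print(len(loose_box), len(tight_box))   # 5 10
```

A box with an IoU of 0.72 is a full success for mAP50 but counts at only half of the thresholds averaged by mAP50-95, which is why the second number is always the lower of the two.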

Running predictions on new images writes annotated outputs to runs/detect/predict/:

model.predict(
    source="/kaggle/input/my-dataset-yolo/images/val",
    imgsz=640,
    conf=0.25,
    save=True,
)

Practical advice on Kaggle

A few rules of thumb worth memorizing:

start with yolov8n.pt, imgsz=640, batch=8 or 16,
if you hit a CUDA out-of-memory error, halve batch first, then drop imgsz to 512,
YOLO is dramatically faster than any sliding-window approach — you can train and validate on a meaningful dataset in a single Kaggle session.
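The out-of-memory rule above can be written down as an explicit fallback schedule. The generator below is a hypothetical helper, not part of Ultralytics; the minimum batch size of 2 is our own assumption.

```python
def fallback_configs(batch=16, imgsz=640, min_batch=2, small_imgsz=512):
    """Yield (batch, imgsz) pairs to try in order: halve the batch first,
    then fall back to the smaller image size as a last resort."""
    current = batch
    while current >= min_batch:
        yield current, imgsz
        current //= 2
    yield min_batch, small_imgsz

configs = list(fallback_configs())
print(configs)   # [(16, 640), (8, 640), (4, 640), (2, 640), (2, 512)]
```

Each pair would be passed to model.train(batch=..., imgsz=...) in turn, moving to the next one only when the previous attempt fails with a CUDA out-of-memory error.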

Closing thoughts

This chapter closes the convolutional networks cycle on a forward-looking note. We started with a small CNN built from scratch on a six-class scene classification task, achieved a baseline, and then watched a pre-trained ResNet-18 surpass it with minimal effort thanks to transfer learning. We then changed registers entirely, moving from "what is in this image?" to "where are the objects in this image?", and discovered that the right answer to the second question is not a clever extension of the first but a fundamentally different architecture — one that predicts the entire detection in a single pass.

The guiding lesson is one of leverage. The deep learning practitioner of 2026 rarely starts from a blank slate: they stand on the shoulders of pre-trained backbones, foundation models, and turn-key frameworks like Ultralytics. The skill is not in re-implementing every layer but in choosing the right tool, adapting it carefully to one's data, and respecting the small but consequential details — normalization statistics, learning rates, batch sizes — that separate a successful transfer from a silent failure.

Exercises

Exercise 1 - Baseline CNN on Intel Image Classification. Set up the ImageFolder pipeline on the Intel dataset (224x224 RGB, batch size 32). Train the SimpleCNN_BN_DO baseline for four epochs with Adam (lr=1e-3). Report the test accuracy and the confusion matrix. What classes does the model confuse the most, and does that match your intuition about the dataset?

Exercise 2 - ResNet from scratch. Replace the baseline with models.resnet18(weights=None) (random initialization) and adapt the fc head to six classes. Train for four epochs with the same optimizer. Compare the test accuracy to the baseline. Is the gain - or the loss - what you expected?

Exercise 3 - Transfer learning, feature extraction. Load resnet18 with ResNet18_Weights.DEFAULT, freeze the backbone, and train only the new fc head for three epochs. Use the ImageNet normalization statistics in your transforms. Report the accuracy and contrast it with the previous two experiments. How much wall-clock time did you save?

Exercise 4 - Light fine-tuning. Starting from the model trained in Exercise 3, unfreeze layer4 along with fc and continue training for two epochs with lr=1e-4. Why is a smaller learning rate appropriate here? Report the final accuracy.

Exercise 5 - Visualization. Use print(model) and torchviz.make_dot to inspect the architecture of resnet18. Identify a residual connection in the textual print and locate it in the graphical representation.

Exercise 6 - Object detection. Choose a small Roboflow-exported YOLO dataset on Kaggle (for example, a face-mask or pothole detection dataset). Install ultralytics, train yolov8n.pt for 20 epochs, evaluate mAP50 and mAP50-95, and run predictions on a handful of validation images. Open the saved annotated images in runs/detect/predict/ and qualitatively assess the results.

Exercise 7 - Reflective question. A colleague proposes to "convert a classifier into a detector by sliding a window". Estimate the number of forward passes their approach would require on a 1280x720 image with a 32-pixel stride and three scales, and compare it with the single forward pass of YOLO. What does this tell you about the value of architectural innovation, beyond raw computational power?

Going further

Original ResNet paper. He, Zhang, Ren, Sun, Deep Residual Learning for Image Recognition, CVPR 2016. The foundational reference for residual connections.
Original YOLO paper. Redmon, Divvala, Girshick, Farhadi, You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016. The original idea, still readable and inspiring.
Ultralytics documentation. https://docs.ultralytics.com - the canonical reference for YOLOv8 training, evaluation, and deployment.
PyTorch transfer learning tutorial. https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html - a hands-on counterpart to this chapter.
torchvision model zoo. https://pytorch.org/vision/stable/models.html - the full catalogue of available architectures and weights.
Roboflow Universe. https://universe.roboflow.com - a public repository of YOLO-formatted datasets, ideal for prototyping.
Foundation models survey. Bommasani et al., On the Opportunities and Risks of Foundation Models, Stanford CRFM, 2021. A long but illuminating perspective on where transfer learning is heading.
CLIP and DINO. Radford et al. (CLIP, 2021) and Caron et al. (DINO, 2021) are good entry points to modern vision foundation models, where the boundary between classification, detection, and representation learning starts to dissolve.