
Deep Learning 1 — Linear neurons

:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →

French and Chinese versions are available from the home page.
:::

First contact with neural networks, starting from the simplest case: a linear neuron. Don't expect deep learning here — we're building the foundations that will serve every later chapter.

Why this chapter?

Before CNNs and Transformers, we need to understand what a neuron is and how it learns. You'll see:

  • what a linear neuron is and its link to linear regression;
  • gradient descent coded by hand;
  • why we need to normalise inputs;
  • an introduction to PyTorch: tensors, nn.Linear, optimisers, DataLoader;
  • the importance of non-linear activation functions.

The linear neuron

It is exactly the linear regression from ML chapter 2, written in a different form:

$$u = Xw + b$$

with $X \in \mathbb{R}^{n \times m}$ the inputs, $w \in \mathbb{R}^m$ the weights, $b$ the bias, and $u$ the output.

Learning means finding the right ww and bb that minimise a loss function. For regression:

$$E(w, b) = \frac{1}{n}\sum_{i=1}^n (y_i - u_i)^2$$

This is the MSE, already encountered.
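
In NumPy, the forward pass and the loss take one line each. A minimal sketch (shapes, data, and variable names are illustrative, not from the chapter's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))    # n = 100 samples, m = 3 features
y = rng.normal(size=100)         # targets
w = rng.normal(size=3)           # weights, initialised randomly
b = 0.0                          # bias

u = X @ w + b                    # forward pass: u = Xw + b
E = np.mean((y - u) ** 2)        # the MSE loss E(w, b)
```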

Gradient descent

When the analytical solution doesn't exist or is too costly, we adjust the parameters step by step in the direction that lowers $E$:

$$w \leftarrow w - \eta \, \frac{\partial E}{\partial w}, \quad b \leftarrow b - \eta \, \frac{\partial E}{\partial b}$$

where $\eta$ is the learning rate.

For linear regression, the calculation gives:

$$\frac{\partial E}{\partial w} = \frac{1}{n}\, X^T (u - y), \quad \frac{\partial E}{\partial b} = \frac{1}{n} \sum_{i=1}^n (u_i - y_i)$$

(The factor of 2 from differentiating the square has been absorbed into the learning rate, as is common.)

The term $(u - y)$ is the prediction error. The gradient propagates this error back to $w$ via $X^T$.
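
Here is batch gradient descent written by hand, continuing the NumPy sketch above (the learning rate and iteration count are illustrative):

```python
eta = 0.1                              # learning rate
for _ in range(200):
    u = X @ w + b                      # forward pass
    grad_w = X.T @ (u - y) / len(y)    # ∂E/∂w (factor of 2 absorbed into eta)
    grad_b = np.mean(u - y)            # ∂E/∂b
    w = w - eta * grad_w               # update step
    b = b - eta * grad_b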

Three variants

  • Batch: gradient computed on the full dataset at each step. Slow on large datasets.
  • SGD (Stochastic): one example. Fast but noisy.
  • Minibatch: a subset (32, 64, ...). The compromise used in practice (see the sketch below).
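
A hand-rolled minibatch version of the same loop, continuing the NumPy sketch (batch_size and epochs are arbitrary choices):

```python
batch_size, epochs = 32, 20
for _ in range(epochs):
    idx = rng.permutation(len(y))            # reshuffle the data each epoch
    for start in range(0, len(y), batch_size):
        sel = idx[start:start + batch_size]  # indices of the current minibatch
        Xb, yb = X[sel], y[sel]
        u = Xb @ w + b
        w = w - eta * Xb.T @ (u - yb) / len(yb)
        b = b - eta * np.mean(u - yb)
```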

Why normalise?

Without normalisation, large-scale variables dominate the gradient descent dynamics and force a tiny learning rate. Convergence becomes very slow.

For linear regression, the curvature of the loss in $w$ is proportional to $\sum_i x_i^2$. If $x$ has large values, the curvature is large and $\eta$ must be small.
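
For a single feature this is a one-line computation:

$$
E(w, b) = \frac{1}{n}\sum_{i=1}^n \big(y_i - (w x_i + b)\big)^2
\qquad\Longrightarrow\qquad
\frac{\partial^2 E}{\partial w^2} = \frac{2}{n}\sum_{i=1}^n x_i^2
$$

A feature expressed in large units inflates $\sum_i x_i^2$, and the largest usable $\eta$ shrinks with it.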

Solution: StandardScaler brings all variables to mean 0 and standard deviation 1.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

:::warning fit_transform vs transform
Fit only on the train set (fit_transform), then apply the same transformation to the test set (transform). Otherwise: data leakage.
:::

PyTorch: autograd and optimisers

PyTorch automates two things we used to do by hand:

  • gradient computation (autograd);
  • parameter updates (optimisers: SGD, Adam, etc.).

PyTorch tensors are the equivalent of NumPy arrays, with the added ability to store a gradient and run on GPU.
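
A minimal autograd demonstration (the values are arbitrary):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (x ** 2).sum()   # a scalar function of x
loss.backward()         # autograd computes d(loss)/dx
print(x.grad)           # tensor([2., 4., 6.]), i.e. 2x
```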

```python
import torch
import torch.nn as nn

# NumPy → tensor conversion
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
```

.view(-1, 1) shapes y as (n, 1) to match the output of an nn.Linear(m, 1) layer.

Linear neuron in PyTorch

```python
model = nn.Linear(m, 1)                                   # one linear neuron
criterion = nn.MSELoss()                                  # the MSE loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # gradient descent
```

The standard training loop

```python
for _ in range(epochs):
    optimizer.zero_grad()               # 1. zero the gradients
    y_hat = model(X_train_t)            # 2. forward pass
    loss = criterion(y_hat, y_train_t)  #    compute the loss
    loss.backward()                     # 3. autograd computes all gradients
    optimizer.step()                    # 4. update the parameters
```
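
loss.backward() fills the .grad field of every parameter, and optimizer.step() applies the update. For plain SGD (no momentum or weight decay), the step amounts to the manual rule from earlier; a simplified sketch of its effect:

```python
with torch.no_grad():           # the updates themselves must not be tracked by autograd
    for p in model.parameters():
        p -= 0.1 * p.grad       # w ← w − η ∂E/∂w, with η = lr = 0.1
```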

Memorise these 4 steps — they reappear in every later chapter.

DataLoader: automatic minibatching

To avoid manually selecting minibatches:

```python
from torch.utils.data import TensorDataset, DataLoader

train_dataset = TensorDataset(X_train_t, y_train_t)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
```

```python
for _ in range(epochs):
    for Xb, yb in train_loader:         # one minibatch at a time
        optimizer.zero_grad()
        y_hat = model(Xb)
        loss = criterion(y_hat, yb)
        loss.backward()
        optimizer.step()
```

shuffle=True reshuffles the data at each epoch, which is almost always what you want during training.
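
For evaluation the order doesn't matter, so shuffle=False, and gradient tracking can be turned off. A sketch, assuming X_test (scaled as above) and y_test exist:

```python
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)
test_loader = DataLoader(TensorDataset(X_test_t, y_test_t), batch_size=32, shuffle=False)

model.eval()                    # evaluation mode (matters once dropout/batch-norm appear)
total = 0.0
with torch.no_grad():           # no gradients needed at evaluation time
    for Xb, yb in test_loader:
        total += criterion(model(Xb), yb).item() * len(Xb)
print(total / len(test_loader.dataset))   # test MSE
```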

Stacking and non-linearity

Stacking two nn.Linear layers doesn't make a more powerful model — algebraically, two linear transformations compose into a single linear transformation:

$$u = (X W_1 + b_1) W_2 + b_2 = X (W_1 W_2) + (b_1 W_2 + b_2)$$
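
We can check this collapse numerically. A quick sketch (sizes are arbitrary; note that nn.Linear stores weights transposed relative to the formula above, hence the product W2 @ W1):

```python
import torch
import torch.nn as nn

stacked = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 1))

merged = nn.Linear(4, 1)              # the single equivalent linear layer
with torch.no_grad():
    W1, b1 = stacked[0].weight, stacked[0].bias
    W2, b2 = stacked[1].weight, stacked[1].bias
    merged.weight.copy_(W2 @ W1)      # composed weights
    merged.bias.copy_(W2 @ b1 + b2)   # composed bias

x = torch.randn(5, 4)
print(torch.allclose(stacked(x), merged(x), atol=1e-6))   # True
```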

To go further, we must insert a non-linearity between layers.

Three classic activations

  • Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$, output in $(0, 1)$. Saturates for large $|z|$.
  • Tanh: $\tanh(z)$, output in $(-1, 1)$, centred at 0.
  • ReLU: $\max(0, z)$. No saturation for $z > 0$. The default activation in modern deep learning.

```python
model = nn.Sequential(
    nn.Linear(m, 16),
    nn.ReLU(),        # the non-linearity between the two linear layers
    nn.Linear(16, 1),
)
```
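
Applying the three activations to the same values makes their ranges concrete (a small sketch):

```python
z = torch.linspace(-3, 3, 5)   # tensor([-3.0, -1.5, 0.0, 1.5, 3.0])
print(torch.sigmoid(z))        # every value in (0, 1)
print(torch.tanh(z))           # every value in (-1, 1), centred at 0
print(torch.relu(z))           # 0 for z < 0, identity for z > 0
```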

Depth only makes sense with non-linearity.


Full notebook on Kaggle (forkable) →