
Deep Learning 1 — Linear neurons

:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →

French and Chinese versions are available from the home page.
:::

First contact with neural networks, starting from the simplest case: a linear neuron. Don't expect deep learning here — we're building the foundations that will serve every later chapter.

Why this chapter?

Before CNNs and Transformers, we need to understand what a neuron is and how it learns. You'll see:

  • what a linear neuron is and its link to linear regression;
  • gradient descent coded by hand;
  • why we need to normalise inputs;
  • an introduction to PyTorch: tensors, nn.Linear, optimisers, DataLoader;
  • the importance of non-linear activation functions.

The linear neuron

It is exactly the linear regression from ML chapter 2, written in a different form:

$$u = Xw + b$$

with $X \in \mathbb{R}^{n \times m}$ the inputs, $w \in \mathbb{R}^m$ the weights, $b$ the bias, and $u$ the output.

Learning means finding the right ww and bb that minimise a loss function. For regression:

$$E(w, b) = \frac{1}{n}\sum_{i=1}^n (y_i - u_i)^2$$

This is the MSE, already encountered.
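
In NumPy, the forward pass and the loss take one line each. A minimal sketch (shapes, data, and variable names are illustrative, not from the chapter's dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))    # n = 100 samples, m = 3 features
y = rng.normal(size=100)         # targets
w = rng.normal(size=3)           # weights, initialised randomly
b = 0.0                          # bias

u = X @ w + b                    # forward pass: u = Xw + b
E = np.mean((y - u) ** 2)        # the MSE loss E(w, b)
```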

Gradient descent

When the analytical solution doesn't exist or is too costly, we adjust the parameters step by step in the direction that lowers $E$:

$$w \leftarrow w - \eta \, \frac{\partial E}{\partial w}, \quad b \leftarrow b - \eta \, \frac{\partial E}{\partial b}$$

where $\eta$ is the learning rate.

For linear regression, the calculation gives:

$$\frac{\partial E}{\partial w} = \frac{1}{n}\, X^T (u - y), \quad \frac{\partial E}{\partial b} = \frac{1}{n} \sum_{i=1}^n (u_i - y_i)$$

(The factor of 2 from differentiating the square has been absorbed into the learning rate, as is common.)

The term $(u - y)$ is the prediction error. The gradient propagates this error back to $w$ via $X^T$.
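
Here is batch gradient descent written by hand, continuing the NumPy sketch above (the learning rate and iteration count are illustrative):

```python
eta = 0.1                              # learning rate
for _ in range(200):
    u = X @ w + b                      # forward pass
    grad_w = X.T @ (u - y) / len(y)    # ∂E/∂w (factor of 2 absorbed into eta)
    grad_b = np.mean(u - y)            # ∂E/∂b
    w = w - eta * grad_w               # update step
    b = b - eta * grad_b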

Three variants

  • Batch: gradient computed on the full dataset at each step. Slow on large datasets.
  • SGD (Stochastic): one example. Fast but noisy.
  • Minibatch: a subset (32, 64, ...). The compromise used in practice (see the sketch below).
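
A hand-rolled minibatch version of the same loop, continuing the NumPy sketch (batch_size and epochs are arbitrary choices):

```python
batch_size, epochs = 32, 20
for _ in range(epochs):
    idx = rng.permutation(len(y))            # reshuffle the data each epoch
    for start in range(0, len(y), batch_size):
        sel = idx[start:start + batch_size]  # indices of the current minibatch
        Xb, yb = X[sel], y[sel]
        u = Xb @ w + b
        w = w - eta * Xb.T @ (u - yb) / len(yb)
        b = b - eta * np.mean(u - yb)
```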

Why normalise?

Without normalisation, large-scale variables dominate the gradient descent dynamics and force a tiny learning rate. Convergence becomes very slow.

For linear regression, the curvature of the loss in $w$ is proportional to $\sum_i x_i^2$. If $x$ has large values, the curvature is large and $\eta$ must be small.
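
For a single feature this is a one-line computation:

$$
E(w, b) = \frac{1}{n}\sum_{i=1}^n \big(y_i - (w x_i + b)\big)^2
\qquad\Longrightarrow\qquad
\frac{\partial^2 E}{\partial w^2} = \frac{2}{n}\sum_{i=1}^n x_i^2
$$

A feature expressed in large units inflates $\sum_i x_i^2$, and the largest usable $\eta$ shrinks with it.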

Solution: StandardScaler brings all variables to mean 0 and standard deviation 1.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```

:::warning fit_transform vs transform
Fit only on the train set (fit_transform), then apply the same transformation to the test set (transform). Otherwise: data leakage.
:::

PyTorch: autograd and optimisers

PyTorch automates two things we used to do by hand:

  • gradient computation (autograd);
  • parameter updates (optimisers: SGD, Adam, etc.).

PyTorch tensors are the equivalent of NumPy arrays, with the added ability to store a gradient and run on GPU.
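
A minimal autograd demonstration (the values are arbitrary):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (x ** 2).sum()   # a scalar function of x
loss.backward()         # autograd computes d(loss)/dx
print(x.grad)           # tensor([2., 4., 6.]), i.e. 2x
```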

```python
import torch
import torch.nn as nn

# NumPy → tensor conversion
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
```

.view(-1, 1) shapes y as (n, 1) to match the output of an nn.Linear(m, 1) layer.

Linear neuron in PyTorch

```python
model = nn.Linear(m, 1)                                   # one linear neuron
criterion = nn.MSELoss()                                  # the MSE loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # gradient descent
```

The standard training loop

```python
for _ in range(epochs):
    optimizer.zero_grad()               # 1. zero the gradients
    y_hat = model(X_train_t)            # 2. forward pass
    loss = criterion(y_hat, y_train_t)  #    compute the loss
    loss.backward()                     # 3. autograd computes all gradients
    optimizer.step()                    # 4. update the parameters
```
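
loss.backward() fills the .grad field of every parameter, and optimizer.step() applies the update. For plain SGD (no momentum or weight decay), the step amounts to the manual rule from earlier; a simplified sketch of its effect:

```python
with torch.no_grad():           # the updates themselves must not be tracked by autograd
    for p in model.parameters():
        p -= 0.1 * p.grad       # w ← w − η ∂E/∂w, with η = lr = 0.1
```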

Memorise these 4 steps — they reappear in every later chapter.

DataLoader: automatic minibatching

To avoid manually selecting minibatches:

```python
from torch.utils.data import TensorDataset, DataLoader

train_dataset = TensorDataset(X_train_t, y_train_t)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
```

```python
for _ in range(epochs):
    for Xb, yb in train_loader:         # one minibatch at a time
        optimizer.zero_grad()
        y_hat = model(Xb)
        loss = criterion(y_hat, yb)
        loss.backward()
        optimizer.step()
```

shuffle=True reshuffles the data at each epoch, which is almost always what you want during training.
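
For evaluation the order doesn't matter, so shuffle=False, and gradient tracking can be turned off. A sketch, assuming X_test (scaled as above) and y_test exist:

```python
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)
test_loader = DataLoader(TensorDataset(X_test_t, y_test_t), batch_size=32, shuffle=False)

model.eval()                    # evaluation mode (matters once dropout/batch-norm appear)
total = 0.0
with torch.no_grad():           # no gradients needed at evaluation time
    for Xb, yb in test_loader:
        total += criterion(model(Xb), yb).item() * len(Xb)
print(total / len(test_loader.dataset))   # test MSE
```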

Stacking and non-linearity

Stacking two nn.Linear layers doesn't make a more powerful model — algebraically, two linear transformations compose into a single linear transformation:

$$u = (X W_1 + b_1) W_2 + b_2 = X (W_1 W_2) + (b_1 W_2 + b_2)$$
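
We can check this collapse numerically. A quick sketch (sizes are arbitrary; note that nn.Linear stores weights transposed relative to the formula above, hence the product W2 @ W1):

```python
import torch
import torch.nn as nn

stacked = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 1))

merged = nn.Linear(4, 1)              # the single equivalent linear layer
with torch.no_grad():
    W1, b1 = stacked[0].weight, stacked[0].bias
    W2, b2 = stacked[1].weight, stacked[1].bias
    merged.weight.copy_(W2 @ W1)      # composed weights
    merged.bias.copy_(W2 @ b1 + b2)   # composed bias

x = torch.randn(5, 4)
print(torch.allclose(stacked(x), merged(x), atol=1e-6))   # True
```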

To go further, we must insert a non-linearity between layers.

Three classic activations

  • Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$, output in $(0, 1)$. Saturates for large $|z|$.
  • Tanh: $\tanh(z)$, output in $(-1, 1)$, centred at 0.
  • ReLU: $\max(0, z)$. No saturation for $z > 0$. The default activation in modern deep learning.

```python
model = nn.Sequential(
    nn.Linear(m, 16),
    nn.ReLU(),        # the non-linearity between the two linear layers
    nn.Linear(16, 1),
)
```
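
Applying the three activations to the same values makes their ranges concrete (a small sketch):

```python
z = torch.linspace(-3, 3, 5)   # tensor([-3.0, -1.5, 0.0, 1.5, 3.0])
print(torch.sigmoid(z))        # every value in (0, 1)
print(torch.tanh(z))           # every value in (-1, 1), centred at 0
print(torch.relu(z))           # 0 for z < 0, identity for z > 0
```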

Depth only makes sense with non-linearity.


Full notebook on Kaggle (forkable) →