# Deep Learning 1 — Linear neurons
:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →

French and Chinese versions are available from the home page.
:::
First contact with neural networks, starting from the simplest case: a linear neuron. Don't expect deep learning here — we're building the foundations that will serve every later chapter.
## Why this chapter?
Before CNNs and Transformers, we need to understand what a neuron is and how it learns. You'll see:
- what a linear neuron is and its link to linear regression;
- gradient descent coded by hand;
- why we need to normalise inputs;
- an introduction to PyTorch: tensors, `nn.Linear`, optimisers, `DataLoader`;
- the importance of non-linear activation functions.
## The linear neuron
It's exactly the linear regression from chapter 2 ML, in another form:

$$\hat{y} = \sum_{j=1}^{m} w_j x_j + b$$

with $x_j$ the inputs, $w_j$ the weights, $b$ the bias, $\hat{y}$ the output.

Learning means finding the right $w_j$ and $b$ that minimise a loss function. For regression:

$$L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

This is the MSE, already encountered.
## Gradient descent
When the analytical solution doesn't exist or is too costly, we adjust parameters step by step in the direction that lowers $L$:

$$\theta \leftarrow \theta - \eta \, \frac{\partial L}{\partial \theta}$$

where $\eta$ is the learning rate.

For linear regression, the calculation gives:

$$\frac{\partial L}{\partial w_j} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)\, x_{ij}, \qquad \frac{\partial L}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$

The term $(\hat{y}_i - y_i)$ is the prediction error. The gradient propagates this error to $w_j$ via $x_{ij}$.
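These gradient formulas can be coded by hand in a few lines of NumPy. A minimal sketch on synthetic data (the data, learning rate, and step count are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))
true_w, true_b = np.array([2.0, -1.0, 0.5]), 0.3
y = X @ true_w + true_b + 0.01 * rng.normal(size=n)

w, b = np.zeros(m), 0.0
eta = 0.1
for _ in range(500):
    y_hat = X @ w + b
    err = y_hat - y                 # prediction error (ŷ_i - y_i)
    grad_w = 2 / n * X.T @ err      # ∂L/∂w_j = (2/n) Σ_i err_i · x_ij
    grad_b = 2 / n * err.sum()      # ∂L/∂b   = (2/n) Σ_i err_i
    w -= eta * grad_w
    b -= eta * grad_b
```

After a few hundred steps, `w` and `b` land close to the true parameters used to generate the data.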
### Three variants
- Batch: gradient computed on the full dataset at each step. Slow on large datasets.
- SGD (Stochastic): one example. Fast but noisy.
- Minibatch: a subset (32, 64, ...). The compromise used in practice.
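The minibatch variant amounts to shuffling the indices once per epoch, then slicing. A sketch (batch size of 32 and the toy data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0])

batch_size = 32
sizes = []
idx = rng.permutation(len(X))            # reshuffle once per epoch
for start in range(0, len(X), batch_size):
    batch = idx[start:start + batch_size]
    Xb, yb = X[batch], y[batch]          # one minibatch
    sizes.append(len(Xb))
    # ...compute the gradient on (Xb, yb) only, then update the parameters...
```

With 100 samples, this yields three batches of 32 and one final batch of 4. PyTorch's `DataLoader`, introduced below, automates exactly this bookkeeping.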
## Why normalise?
Without normalisation, large-scale variables dominate the gradient descent dynamics and force a tiny learning rate. Convergence becomes very slow.
For linear regression, the curvature of the loss in $w_j$ is proportional to $\frac{1}{n}\sum_i x_{ij}^2$. If $x_j$ takes large values, the curvature is large and $\eta$ must be small.
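A quick numerical illustration (synthetic data, with scales chosen for the example): when one feature is roughly 1000× larger than the other, its gradient component dominates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)              # scale ~1
x2 = 1000 * rng.normal(size=n)       # scale ~1000
X = np.stack([x1, x2], axis=1)
y = x1 + x2

w = np.zeros(2)
err = X @ w - y
grad = 2 / n * X.T @ err             # MSE gradient at w = 0
# |grad[1]| is several orders of magnitude larger than |grad[0]|:
# a learning rate small enough for w2 barely moves w1 at all.
```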
Solution: `StandardScaler` brings all variables to mean 0 and standard deviation 1.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
:::warning fit_transform vs transform
Fit only on train (`fit_transform`), then apply to test (`transform`). Otherwise: data leakage.
:::
## PyTorch: autograd and optimisers
PyTorch automates two things we used to do by hand:
- gradient computation (`autograd`);
- parameter updates (optimisers: SGD, Adam, etc.).
PyTorch tensors are the equivalent of NumPy arrays, with the added ability to store a gradient and run on GPU.
```python
import torch
import torch.nn as nn

# NumPy → tensor conversion
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
```

`.view(-1, 1)` shapes `y` as `(n, 1)` to match the output of an `nn.Linear(m, 1)` layer.
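A minimal illustration of autograd on a scalar tensor: PyTorch records the operations, and `backward()` fills in the gradient.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x       # y = x² + 2x
y.backward()             # autograd computes dy/dx and stores it in x.grad
print(x.grad)            # dy/dx = 2x + 2 = 8 at x = 3
```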
### Linear neuron in PyTorch
```python
model = nn.Linear(m, 1)                                  # m inputs, 1 output
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
```
### The standard training loop
```python
for _ in range(epochs):
    optimizer.zero_grad()               # 1. zero the gradients
    y_hat = model(X_train_t)            # 2. forward pass
    loss = criterion(y_hat, y_train_t)
    loss.backward()                     # 3. autograd computes all gradients
    optimizer.step()                    # 4. update parameters
```
Memorise these 4 steps — they reappear in every later chapter.
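Putting the four steps together end to end on synthetic data (the shapes, targets, and hyperparameters are illustrative), the loss drops to near zero and the learned weights recover the generating ones:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, m = 200, 3
X = torch.randn(n, m)
w_true = torch.tensor([[2.0], [-1.0], [0.5]])
y = X @ w_true + 0.3                    # noiseless linear target

model = nn.Linear(m, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(200):
    optimizer.zero_grad()               # 1. zero the gradients
    y_hat = model(X)                    # 2. forward pass
    loss = criterion(y_hat, y)
    loss.backward()                     # 3. backward pass
    optimizer.step()                    # 4. parameter update
```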
## `DataLoader`: automatic minibatching
To avoid manually selecting minibatches:
```python
from torch.utils.data import TensorDataset, DataLoader

train_dataset = TensorDataset(X_train_t, y_train_t)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

for _ in range(epochs):
    for Xb, yb in train_loader:
        optimizer.zero_grad()
        y_hat = model(Xb)
        loss = criterion(y_hat, yb)
        loss.backward()
        optimizer.step()
```
`shuffle=True` reshuffles the data at each epoch — almost always what you want during training.
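A quick look at what the loader actually yields (100 samples and `batch_size=32` are assumptions for the example): three full batches, then the remainder.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.randn(100, 3)
y = torch.randn(100, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

sizes = [len(Xb) for Xb, yb in loader]
print(sizes)                 # three batches of 32, then the remaining 4
```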
## Stacking and non-linearity
Stacking two `nn.Linear` layers doesn't make a more powerful model — algebraically, two linear transformations compose into a single linear transformation:

$$W_2 (W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (W_2 W_1)\,\mathbf{x} + (W_2 \mathbf{b}_1 + \mathbf{b}_2)$$
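This collapse can be checked numerically: two stacked `nn.Linear` layers with no activation match a single linear map built from the composed weights (layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
f = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))   # no activation between

W1, b1 = f[0].weight, f[0].bias
W2, b2 = f[1].weight, f[1].bias
W = W2 @ W1                          # composed weight (W₂W₁)
b = W2 @ b1 + b2                     # composed bias (W₂b₁ + b₂)

x = torch.randn(5, 4)
with torch.no_grad():
    same = torch.allclose(f(x), x @ W.T + b, atol=1e-6)
print(same)                          # True — the stack is one linear map
```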
To go further, we must insert a non-linearity between layers.
### Three classic activations
- Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$, output in $(0, 1)$. Saturates for large $|z|$.
- Tanh: $\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, output in $(-1, 1)$, centred at 0.
- ReLU: $\mathrm{ReLU}(z) = \max(0, z)$. No saturation for $z > 0$. The default activation in modern deep learning.
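All three are available directly in PyTorch; a few sanity values as a sketch:

```python
import torch

z = torch.tensor([-2.0, 0.0, 2.0])
print(torch.sigmoid(z))   # ≈ [0.119, 0.500, 0.881] — squashed into (0, 1)
print(torch.tanh(z))      # ≈ [-0.964, 0.000, 0.964] — centred at 0
print(torch.relu(z))      # [0., 0., 2.] — negatives clipped to 0
```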
```python
model = nn.Sequential(
    nn.Linear(m, 16),
    nn.ReLU(),            # non-linearity
    nn.Linear(16, 1),
)
```
Depth only makes sense with non-linearity.