# Machine Learning 2 — Regression
:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →
French and Chinese versions available from the home page.
:::
The heart of the course starts here. We tackle regression: predicting a numerical value from other values. We first code linear regression by hand to understand the inner workings, then move to scikit-learn for speed.
## Why this chapter?
Regression is the first major family of ML problems. You'll learn:
- linear regression by least squares and gradient descent;
- metrics for evaluating numerical predictions;
- train/test split and cross-validation;
- polynomial regression and Ridge regularisation;
- k-nearest-neighbours and the importance of normalisation.
## Linear regression

The simplest idea: find a line $\hat{y} = ax + b$ that passes as close as possible to a cloud of points $(x_i, y_i)$.
We want the line to minimise the residuals: the gaps $y_i - \hat{y}_i$ between observed and predicted values.
### The least squares method

We choose $a$ and $b$ to minimise the sum of squared residuals:

$$
L(a, b) = \sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr)^2
$$
Why squared? Two reasons: it penalises large errors more strongly, and it yields a clean analytical solution.
### The solution

Setting the partial derivatives to zero gives:

$$
a = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b = \bar{y} - a\,\bar{x}
$$

The line always passes through the mean point $(\bar{x}, \bar{y})$.
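As a sanity check on these formulas, here is a minimal NumPy sketch that computes $a$ and $b$ directly (the arrays `x` and `y` are made-up illustrative data):

```python
import numpy as np

# Hypothetical 1-D data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()
a = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b = y_bar - a * x_bar
print(a, b)  # slope and intercept of the least-squares line
```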
## Gradient descent

When there are many variables (or for the more complex models you'll see later), a closed-form solution is often unavailable or too costly to compute. We turn to gradient descent: progressively adjust the parameters $\theta$ in the direction that lowers the loss $L$:

$$
\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta)
$$

where $\eta$ is the learning rate. Too small and learning crawls; too large and the loss diverges. The right value is often found by trial and error.
Three variants depending on how many examples are used per update:
- Batch: all examples at each step. Accurate but slow on large datasets.
- SGD (Stochastic): one example. Fast but noisy.
- Minibatch: a subset (typically 32 or 64). The compromise used in practice.
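To make the update rule concrete, here is a minimal batch gradient descent sketch for the line $\hat{y} = ax + b$, reusing the toy `x` and `y` arrays from above (the learning rate and iteration count are illustrative choices, not tuned values):

```python
import numpy as np

a, b = 0.0, 0.0  # start from an arbitrary line
eta = 0.01       # learning rate (illustrative)

for _ in range(1000):
    y_hat = a * x + b
    # Gradients of the sum of squared residuals with respect to a and b
    grad_a = -2 * np.sum((y - y_hat) * x)
    grad_b = -2 * np.sum(y - y_hat)
    a -= eta * grad_a
    b -= eta * grad_b

print(a, b)  # converges towards the least-squares solution
```

Replacing `np.sum` over all points with a single random point (SGD) or a small random subset (minibatch) gives the other two variants.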
## Evaluation metrics
Once trained, how do we know if the model predicts well? Several metrics:
| Metric | Formula | Interpretation |
|---|---|---|
| MAE | $\frac{1}{n}\sum_i \lvert y_i - \hat{y}_i \rvert$ | average error, in the units of $y$ |
| MSE | $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ | penalises large errors more |
| RMSE | $\sqrt{\mathrm{MSE}}$ | like MSE but in the units of $y$ |
| MAPE | $\frac{1}{n}\sum_i \lvert (y_i - \hat{y}_i)/y_i \rvert$ | mean relative error, often shown as a percentage |
| R² | $1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ | share of variance explained |
R² is 1 for a perfect model, 0 for a model that does no better than predicting the mean, and can be negative for a model worse than the mean.
:::warning MAE vs RMSE
If MAE and RMSE differ strongly, suspect outliers: the square in RMSE inflates their contribution, while MAE is more robust to them.
:::
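All of these metrics are available in scikit-learn. A quick sketch, assuming `y_true` and `y_pred` are arrays of observed and predicted values (hypothetical names):

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # back in the units of y
mape = mean_absolute_percentage_error(y_true, y_pred)  # a fraction, not a %
r2 = r2_score(y_true, y_pred)
```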
## Train/test and cross-validation
Evaluating a model on the data used to train it is like grading a student on the past papers they memorised. Always reserve part of the data for testing.
```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```
To avoid depending on a single split, $k$-fold cross-validation trains $k$ times on $k$ different splits and averages the scores:
```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation, scoring each held-out fold with R²
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(scores.mean(), scores.std())
```
## Polynomial regression and regularisation

When the relationship isn't linear, we add powers of $x$ as new variables:

$$
\hat{y} = a_0 + a_1 x + a_2 x^2 + \dots + a_d x^d
$$

The model stays linear in its coefficients but becomes curved. The higher the degree $d$, the more the model can fit the data — including noise. This is overfitting.
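In scikit-learn, this expansion is handled by `PolynomialFeatures`. A minimal sketch, assuming `X_train` is a 2-D array with a single column (degree 3 is an illustrative choice):

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Expand x into [x, x^2, x^3], then fit an ordinary linear regression on top
poly = PolynomialFeatures(degree=3, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
model = LinearRegression().fit(X_train_poly, y_train)
```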
### Ridge: penalising large coefficients

To prevent coefficients from growing wild, we add a penalty term to the loss:

$$
L(a) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j} a_j^2
$$

This is Ridge regression. The parameter $\alpha$ controls the regularisation strength: the larger $\alpha$, the smaller the coefficients are kept.
```python
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)  # alpha = regularisation strength
```
Close variants: Lasso (absolute value penalty, performs variable selection), ElasticNet (Ridge + Lasso combined).
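For reference, the corresponding scikit-learn estimators (the `alpha` and `l1_ratio` values here are illustrative, not recommendations):

```python
from sklearn.linear_model import Lasso, ElasticNet

lasso = Lasso(alpha=0.1)                    # L1 penalty: can zero out coefficients
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)  # blend of L1 and L2 penalties
```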
## k-nearest neighbours (k-NN)

The k-NN regressor is a very different kind of method: there is no global formula. To predict a new point, we look at its $k$ closest neighbours in the training set and average their target values.
```python
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors=5)  # average the 5 nearest neighbours
```
Choosing $k$ is a tradeoff: small $k$ → flexible model but noise-sensitive; large $k$ → smoother predictions but may miss local details.
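A common way to pick $k$ is to compare a few candidate values by cross-validation. A sketch reusing `X` and `y` (the candidate values are illustrative):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Compare several k by 5-fold cross-validated R²
for k in (1, 3, 5, 10, 20):
    model = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f"k={k}: R² = {scores.mean():.3f} ± {scores.std():.3f}")
```

In practice you would scale the features first, for the reason explained in the next section; the Pipeline at the end of the chapter makes that safe inside cross-validation.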
### Normalisation: essential for k-NN
k-NN relies entirely on distances. If one variable has a 1000× larger scale than another, it dominates the distance computation.
Solution: put all variables on the same scale.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```
:::warning Data leakage trap
Fit the scaler on the training set only (`fit_transform`), then apply it to the test set (`transform`). Otherwise you leak information from the test set into training.
:::
## scikit-learn Pipeline
To avoid mistakes and chain steps cleanly:
```python
from sklearn.pipeline import Pipeline

# Chain the steps; fit() and predict() run them in order
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KNeighborsRegressor(n_neighbors=5)),
])
pipe.fit(X_train, y_train)
y_hat = pipe.predict(X_test)
```
The Pipeline ensures the scaler is fit only on train, even during cross-validation.
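Because the whole pipeline is refit on each training fold, it can be passed directly to `cross_val_score`. A minimal sketch, reusing `X` and `y`:

```python
from sklearn.model_selection import cross_val_score

# The scaler is re-fit inside every fold, so nothing leaks into the validation data
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(scores.mean(), scores.std())
```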