
Machine Learning 2 — Regression

:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →

French and Chinese versions available from the home page.
:::

The heart of the course starts here. We tackle regression: predicting a numerical value from other values. We first code linear regression by hand to understand the inner workings, then move to scikit-learn for speed.

Why this chapter?

Regression is the first major family of ML problems. You'll learn:

  • linear regression by least squares and gradient descent;
  • metrics for evaluating numerical predictions;
  • train/test split and cross-validation;
  • polynomial regression and Ridge regularisation;
  • k-nearest-neighbours and the importance of normalisation.

Linear regression

The simplest idea: find a line that passes as close as possible to a cloud of points.

$$\hat{y} = a x + b$$

We want the line to minimise the residuals $e_i = y_i - (a x_i + b)$, the gap between the observed and the predicted value.

The least squares method

We choose $a$ and $b$ to minimise the sum of squared residuals:

$$J(a, b) = \sum_{i=1}^{n} (y_i - a x_i - b)^2$$

Why squared? Two reasons: it penalises large errors more strongly, and it yields a clean analytical solution.

The solution

Setting the partial derivatives to zero gives:

$$a = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}, \quad b = \bar{y} - a\,\bar{x}$$

The line always passes through the mean point $(\bar{x}, \bar{y})$.
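As a sanity check, the closed-form coefficients can be computed directly with NumPy. A minimal sketch on made-up data (the arrays below are purely illustrative):

```python
import numpy as np

# Hypothetical 1-D data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.4, 9.9])

# Closed-form least squares: a = Cov(x, y) / Var(x), b = ybar - a * xbar
a = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b = y.mean() - a * x.mean()
print(a, b)
```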

Gradient descent

When there are many variables (or for the more complex models you'll see later), this simple analytical solution no longer applies. We turn to gradient descent: progressively adjust the parameters in the direction that lowers the loss.

$$w \leftarrow w - \eta \, \frac{\partial J}{\partial w}$$

where $\eta$ is the learning rate. Too small, learning crawls; too large, the loss diverges. The right choice is often found by trial and error.
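Here is a minimal NumPy sketch of this update rule applied to the line $\hat{y} = a x + b$ (the data, learning rate and iteration count are illustrative assumptions, not values from the course):

```python
import numpy as np

# Hypothetical 1-D data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.4, 9.9])

a, b = 0.0, 0.0   # initial parameters
eta = 0.01        # learning rate

for _ in range(2000):
    y_hat = a * x + b
    # Gradients of J(a, b) = sum of squared residuals
    grad_a = -2 * np.sum((y - y_hat) * x)
    grad_b = -2 * np.sum(y - y_hat)
    a -= eta * grad_a
    b -= eta * grad_b

print(a, b)
```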

Three variants depending on how many examples are used per update:

  • Batch: all examples at each step. Accurate but slow on large datasets.
  • SGD (Stochastic): one example. Fast but noisy.
  • Minibatch: a subset (typically 32 or 64). The compromise used in practice.

Evaluation metrics

Once trained, how do we know if the model predicts well? Several metrics:

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| MAE | $\frac{1}{n}\sum \lvert y_i - \hat{y}_i \rvert$ | average absolute error, in units of $y$ |
| MSE | $\frac{1}{n}\sum (y_i - \hat{y}_i)^2$ | penalises large errors more |
| RMSE | $\sqrt{\mathrm{MSE}}$ | like MSE but in units of $y$ |
| MAPE | $\frac{1}{n}\sum \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert$ | relative error, often reported as a percentage |
| $R^2$ | $1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$ | share of variance explained |

$R^2$ is 1 for a perfect model, 0 for a model that does no better than predicting the mean, and can be negative for a model worse than the mean.
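These metrics don't need to be coded by hand: scikit-learn provides them in sklearn.metrics. A minimal sketch on made-up predictions:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)

# Hypothetical observed and predicted values for illustration
y_test = np.array([3.0, 5.0, 7.5, 10.0])
y_hat = np.array([2.8, 5.4, 7.0, 11.0])

print("MAE :", mean_absolute_error(y_test, y_hat))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_hat)))
print("MAPE:", mean_absolute_percentage_error(y_test, y_hat))
print("R2  :", r2_score(y_test, y_hat))
```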

:::warning MAE vs RMSE If MAE and RMSE diverge strongly, there are outliers. RMSE blows up due to the square; MAE handles them better. :::

Train/test and cross-validation

Evaluating a model on the data used to train it is like grading a student on the past papers they memorised. Always reserve part of the data for testing.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

To avoid depending on a single split, k-fold cross-validation trains $k$ times on different splits and averages the scores:

```python
from sklearn.model_selection import cross_val_score

# model is any scikit-learn estimator (e.g. LinearRegression())
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(scores.mean(), scores.std())
```

Polynomial regression and regularisation

When the relationship $y(x)$ isn't linear, we add powers of $x$ as new variables:

$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \dots$$

The model stays linear in its coefficients but becomes curved. The higher the degree, the more the model can fit the data — including noise. This is overfitting.
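In scikit-learn, the usual way is to chain PolynomialFeatures with LinearRegression. A minimal sketch on synthetic data (the degree and the dataset are illustrative assumptions):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical noisy non-linear data for illustration
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.5, size=50)

# Degree-3 polynomial regression: powers of x are added as new features
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict([[1.5]]))
```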

Ridge: penalising large coefficients

To prevent coefficients from growing wild, we add a penalty term to the loss:

$$J(w) = \sum_i (y_i - \hat{y}_i)^2 + \alpha \sum_j w_j^2$$

This is Ridge regression. The parameter $\alpha$ controls the regularisation strength: a larger $\alpha$ keeps the coefficients smaller.

```python
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
```

Close variants: Lasso (absolute value penalty, performs variable selection), ElasticNet (Ridge + Lasso combined).
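To see the shrinkage effect of $\alpha$, one can fit the same polynomial model with increasing regularisation and watch the norm of the coefficients drop. A sketch on synthetic data (the degree and the alpha values are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Hypothetical noisy data for illustration
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.5, size=50)

# Larger alpha shrinks the polynomial coefficients toward zero
for alpha in [0.01, 1.0, 100.0]:
    model = make_pipeline(PolynomialFeatures(degree=5), Ridge(alpha=alpha))
    model.fit(X, y)
    coefs = model.named_steps['ridge'].coef_
    print(alpha, np.linalg.norm(coefs))
```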

k-nearest neighbours (k-NN)

The k-NN regressor is a very different method: no global formula, we look at the $k$ closest neighbours in the training set and average their target values.

$$\hat{y}(x) = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x)} y_i$$

```python
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors=5)
```

Choosing $k$ is a tradeoff: small $k$ → flexible model but noise-sensitive; large $k$ → smoother predictions but may miss details.
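One common way to pick $k$ is to compare cross-validated scores over a small grid. A sketch on a synthetic dataset (the dataset and the values of $k$ tried are arbitrary assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical synthetic dataset for illustration
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

# Cross-validated R² for several values of k
for k in [1, 3, 5, 10, 25, 50]:
    model = KNeighborsRegressor(n_neighbors=k)
    score = cross_val_score(model, X, y, cv=5, scoring='r2').mean()
    print(k, round(score, 3))
```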

Normalisation: essential for k-NN

k-NN relies entirely on distances. If one variable has a 1000× larger scale than another, it dominates the distance computation.

Solution: put all variables on the same scale.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

:::warning Data leakage trap Fit the scaler only on train (fit_transform), then apply to test (transform). Otherwise you leak information from test into training. :::

scikit-learn Pipeline

To avoid mistakes and chain steps cleanly:

```python
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KNeighborsRegressor(n_neighbors=5)),
])

pipe.fit(X_train, y_train)
y_hat = pipe.predict(X_test)
```

The Pipeline ensures the scaler is fit only on train, even during cross-validation.
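For example, passing the pipeline itself to cross_val_score re-fits the scaler on each training fold (a short sketch assuming the pipe, X and y defined above):

```python
from sklearn.model_selection import cross_val_score

# The scaler is re-fit on each training fold, so no test information leaks in
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(scores.mean())
```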


Full notebook on Kaggle (forkable) →