
Machine Learning 2 — Regression

:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →

French and Chinese versions available from the home page.
:::

The heart of the course starts here. We tackle regression: predicting a numerical value from other values. We first code linear regression by hand to understand the inner workings, then move to scikit-learn for speed.

Why this chapter?

Regression is the first major family of ML problems. You'll learn:

  • linear regression by least squares and gradient descent;
  • metrics for evaluating numerical predictions;
  • train/test split and cross-validation;
  • polynomial regression and Ridge regularisation;
  • k-nearest-neighbours and the importance of normalisation.

Linear regression

The simplest idea: find a line that passes as close as possible to a cloud of points.

$$\hat{y} = a x + b$$

We want the line to minimise the residuals $e_i = y_i - (a x_i + b)$, the gap between the observed and the predicted value.

The least squares method

We choose $a$ and $b$ to minimise the sum of squared residuals:

$$J(a, b) = \sum_{i=1}^{n} (y_i - a x_i - b)^2$$

Why squared? Two reasons: it penalises large errors more strongly, and it yields a clean analytical solution.

The solution

Setting the partial derivatives to zero gives:

$$a = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}, \quad b = \bar{y} - a\,\bar{x}$$

The line always passes through the mean point $(\bar{x}, \bar{y})$.
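As a sanity check, the closed-form coefficients can be computed directly with NumPy. A minimal sketch on made-up data (the arrays below are purely illustrative):

```python
import numpy as np

# Hypothetical 1-D data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.4, 9.9])

# Closed-form least squares: a = Cov(x, y) / Var(x), b = ybar - a * xbar
a = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b = y.mean() - a * x.mean()
print(a, b)
```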

Gradient descent

When there are many variables (or for the more complex models you'll see later), this simple analytical solution no longer applies. We turn to gradient descent: progressively adjust the parameters in the direction that lowers the loss.

$$w \leftarrow w - \eta \, \frac{\partial J}{\partial w}$$

where $\eta$ is the learning rate. Too small, learning crawls; too large, the loss diverges. The right choice is often found by trial and error.
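Here is a minimal NumPy sketch of this update rule applied to the line $\hat{y} = a x + b$ (the data, learning rate and iteration count are illustrative assumptions, not values from the course):

```python
import numpy as np

# Hypothetical 1-D data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.4, 9.9])

a, b = 0.0, 0.0   # initial parameters
eta = 0.01        # learning rate

for _ in range(2000):
    y_hat = a * x + b
    # Gradients of J(a, b) = sum of squared residuals
    grad_a = -2 * np.sum((y - y_hat) * x)
    grad_b = -2 * np.sum(y - y_hat)
    a -= eta * grad_a
    b -= eta * grad_b

print(a, b)
```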

Three variants depending on how many examples are used per update:

  • Batch: all examples at each step. Accurate but slow on large datasets.
  • SGD (Stochastic): one example. Fast but noisy.
  • Minibatch: a subset (typically 32 or 64). The compromise used in practice.

Evaluation metrics

Once trained, how do we know if the model predicts well? Several metrics:

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| MAE | $\frac{1}{n}\sum \lvert y_i - \hat{y}_i \rvert$ | average absolute error, in units of $y$ |
| MSE | $\frac{1}{n}\sum (y_i - \hat{y}_i)^2$ | penalises large errors more |
| RMSE | $\sqrt{\mathrm{MSE}}$ | like MSE but in units of $y$ |
| MAPE | $\frac{1}{n}\sum \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert$ | relative error, often reported as a percentage |
| $R^2$ | $1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$ | share of variance explained |

$R^2$ is 1 for a perfect model, 0 for a model that does no better than predicting the mean, and can be negative for a model worse than the mean.
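These metrics don't need to be coded by hand: scikit-learn provides them in sklearn.metrics. A minimal sketch on made-up predictions:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)

# Hypothetical observed and predicted values for illustration
y_test = np.array([3.0, 5.0, 7.5, 10.0])
y_hat = np.array([2.8, 5.4, 7.0, 11.0])

print("MAE :", mean_absolute_error(y_test, y_hat))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_hat)))
print("MAPE:", mean_absolute_percentage_error(y_test, y_hat))
print("R2  :", r2_score(y_test, y_hat))
```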

:::warning MAE vs RMSE If MAE and RMSE diverge strongly, there are outliers. RMSE blows up due to the square; MAE handles them better. :::

Train/test and cross-validation

Evaluating a model on the data used to train it is like grading a student on the past papers they memorised. Always reserve part of the data for testing.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

To avoid depending on a single split, k-fold cross-validation trains $k$ times on different splits and averages the scores:

```python
from sklearn.model_selection import cross_val_score

# model is any scikit-learn estimator (e.g. LinearRegression())
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(scores.mean(), scores.std())
```

Polynomial regression and regularisation

When the relationship $y(x)$ isn't linear, we add powers of $x$ as new variables:

$$\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \dots$$

The model stays linear in its coefficients but becomes curved. The higher the degree, the more the model can fit the data — including noise. This is overfitting.
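In scikit-learn, the usual way is to chain PolynomialFeatures with LinearRegression. A minimal sketch on synthetic data (the degree and the dataset are illustrative assumptions):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical noisy non-linear data for illustration
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.5, size=50)

# Degree-3 polynomial regression: powers of x are added as new features
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict([[1.5]]))
```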

Ridge: penalising large coefficients

To prevent coefficients from growing wild, we add a penalty term to the loss:

$$J(w) = \sum_i (y_i - \hat{y}_i)^2 + \alpha \sum_j w_j^2$$

This is Ridge regression. The parameter $\alpha$ controls the regularisation strength: a larger $\alpha$ keeps the coefficients smaller.

```python
from sklearn.linear_model import Ridge

model = Ridge(alpha=1.0)
```

Close variants: Lasso (absolute value penalty, performs variable selection), ElasticNet (Ridge + Lasso combined).
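To see the shrinkage effect of $\alpha$, one can fit the same polynomial model with increasing regularisation and watch the norm of the coefficients drop. A sketch on synthetic data (the degree and the alpha values are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Hypothetical noisy data for illustration
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.5, size=50)

# Larger alpha shrinks the polynomial coefficients toward zero
for alpha in [0.01, 1.0, 100.0]:
    model = make_pipeline(PolynomialFeatures(degree=5), Ridge(alpha=alpha))
    model.fit(X, y)
    coefs = model.named_steps['ridge'].coef_
    print(alpha, np.linalg.norm(coefs))
```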

k-nearest neighbours (k-NN)

The k-NN regressor is a very different method: no global formula, we look at the $k$ closest neighbours in the training set and average their target values.

$$\hat{y}(x) = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x)} y_i$$

```python
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors=5)
```

Choosing $k$ is a tradeoff: small $k$ → flexible model but noise-sensitive; large $k$ → smoother predictions but may miss details.
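One common way to pick $k$ is to compare cross-validated scores over a small grid. A sketch on a synthetic dataset (the dataset and the values of $k$ tried are arbitrary assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical synthetic dataset for illustration
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

# Cross-validated R² for several values of k
for k in [1, 3, 5, 10, 25, 50]:
    model = KNeighborsRegressor(n_neighbors=k)
    score = cross_val_score(model, X, y, cv=5, scoring='r2').mean()
    print(k, round(score, 3))
```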

Normalisation: essential for k-NN

k-NN relies entirely on distances. If one variable has a 1000× larger scale than another, it dominates the distance computation.

Solution: put all variables on the same scale.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

:::warning Data leakage trap Fit the scaler only on train (fit_transform), then apply to test (transform). Otherwise you leak information from test into training. :::

scikit-learn Pipeline

To avoid mistakes and chain steps cleanly:

```python
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KNeighborsRegressor(n_neighbors=5)),
])

pipe.fit(X_train, y_train)
y_hat = pipe.predict(X_test)
```

The Pipeline ensures the scaler is fit only on train, even during cross-validation.
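For example, passing the pipeline itself to cross_val_score re-fits the scaler on each training fold (a short sketch assuming the pipe, X and y defined above):

```python
from sklearn.model_selection import cross_val_score

# The scaler is re-fit on each training fold, so no test information leaks in
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(scores.mean())
```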


Full notebook on Kaggle (forkable) →