ML · Chapter 2

Machine Learning 2 — Regression

After tackling classification in the first chapter, we now turn to regression, the second great family of supervised learning problems. Where classification asks which class does this observation belong to?, regression asks which numerical value should we predict?. The target variable is no longer a label drawn from a finite set, but a continuous quantity: the price of a house, the age of a mollusc, the consumption of a vehicle, the temperature tomorrow afternoon. The change of nature in the target has profound consequences on how models are formulated, how they are trained, and how their quality is measured.

This chapter starts from the most elementary regression model — the least-squares straight line — and builds upwards. We derive the closed-form coefficients of simple linear regression by hand, then generalise to several explanatory variables with scikit-learn. We introduce the metrics that quantify prediction quality (MAE, MSE, RMSE, MAPE, R²), the train/test discipline that protects us from optimistic illusions, and the cross-validation procedure that stabilises performance estimates. We then meet two non-linear models — polynomial regression and the k-nearest-neighbours regressor — and discover that some algorithms are extremely sensitive to feature scale. The chapter closes on Pipeline, the abstraction that ties pre-processing and modelling together into a single, leak-proof object.

The running case studies are deliberately small and visual. The co2_mini dataset relates fuel consumption to CO₂ emissions; abalone_mini describes physical measurements of abalones whose age is encoded by the number of rings on the shell; house_mini collects sale prices of houses in King County. These datasets are large enough to be interesting and small enough to be plotted entirely on screen, which is exactly what we want when learning a method.

Least-squares linear regression

The simplest possible model

We are given paired observations $(x_i, y_i)$ for $i = 1, \dots, n$ and we look for a straight line

\hat{y} = a x + b

that approximates the data as well as possible. Two questions immediately follow: what does as well as possible mean, and how do we compute the coefficients $a$ and $b$ that achieve it?

The standard answer is the least-squares criterion. For each observation, we measure the residual — the vertical signed distance between the actual value and the prediction:

e_i = y_i - (a x_i + b),

and we choose $a, b$ to minimise the sum of the squared residuals:

\min_{a,b} \; J(a,b) = \sum_{i=1}^{n} \bigl(y_i - a x_i - b\bigr)^2.

Why squares? Squaring makes errors positive, penalises large gaps more strongly than small ones, and yields a smooth differentiable cost function. The minimisation problem has a unique closed-form solution.

Deriving the coefficients

The cost $J(a,b)$ is a quadratic, convex function of two unknowns. Its global minimum is reached where both partial derivatives vanish.

Differentiating with respect to $b$ gives

\frac{\partial J}{\partial b} = -2 \sum_{i=1}^{n} \bigl(y_i - a x_i - b\bigr) = 0,

which, after dividing by $n$ , yields the elegant identity

\bar{y} = a \bar{x} + b \quad\Longrightarrow\quad b = \bar{y} - a \bar{x}.

In other words, the regression line always passes through the centroid $(\bar{x}, \bar{y})$ of the cloud. Differentiating with respect to $a$ and substituting the expression for $b$ produces the slope:

a = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}.

Statistical interpretation. The numerator measures how $x$ and $y$ vary together; the denominator measures the dispersion of $x$ . The slope is large when $x$ explains the variations of $y$ well.

Implementing it by hand

Implementing the formulas in Python is a one-page affair. The class below mirrors the scikit-learn API — a constructor, a fit method, a predict method — but uses only NumPy.

import numpy as np

class MyLinearRegression:
    def __init__(self):
        self.a = None
        self.b = None

    def fit(self, x, y):
        x, y = np.asarray(x), np.asarray(y)
        x_mean, y_mean = x.mean(), y.mean()
        self.a = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
        self.b = y_mean - self.a * x_mean
        return self

    def predict(self, x):
        return self.a * np.asarray(x) + self.b

Tested on co2_mini with x = consumption and y = emissions, the class produces a slope and an intercept identical to those of sklearn.linear_model.LinearRegression. The exercise of writing it once is well worth the time: it demystifies what fit and predict actually do.

Plotting a line on a cloud with Plotly

Plotly Express creates a figure containing one or more traces. To overlay the fitted line on the original cloud, we add a second trace to the existing figure rather than building a new one:

fig = px.scatter(x=x, y=y)
fig.add_scatter(x=x, y=y_hat, name="Predicted values")
fig.show()

The figure keeps the same axes and scale, which is exactly what we need for a faithful visual comparison.

Residuals: the model's after-image

Once a model is fitted, plotting its residuals $r_i = y_i - \hat{y}_i$ as a function of $x$ is one of the most informative diagnostics available.

Reading a residual plot. A well-specified linear model produces residuals that scatter randomly around zero, with no visible trend and roughly constant dispersion. Any visible curve, funnel, or block of same-sign residuals is a clue that the linear assumption is too narrow, that outliers distort the fit, or that the variance is not constant.

A positive residual signals that the model underestimates the truth; a negative one, that it overestimates. The sum of residuals over the training set is mechanically zero — a direct consequence of the first normal equation.

Metrics: how wrong are we?

Once predictions $\hat{y}$ are available, we want a single number that summarises how wrong the model is. Several metrics exist, each emphasising a different aspect of the error.

MAE, MSE, RMSE, MAPE

The Mean Absolute Error averages the absolute residuals:

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|.

It is expressed in the same units as $y$ and is comparatively robust to outliers. The Mean Squared Error

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2

penalises large errors quadratically; its square root, the Root Mean Squared Error

\mathrm{RMSE} = \sqrt{\mathrm{MSE}},

restores the natural units while keeping the strong sensitivity to outliers. Finally, the Mean Absolute Percentage Error

\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^n \left| \frac{y_i - \hat{y}_i}{y_i} \right|

reports a relative error, which is convenient when targets span several orders of magnitude — but be wary when some $y_i$ are close to zero.

MAE versus RMSE. When errors are compact, the two metrics agree. When a few observations carry very large errors, RMSE jumps while MAE stays moderate: the gap between them is itself a diagnostic.

The R² score

The coefficient of determination compares the model to a trivial baseline that always predicts the mean $\bar{y}$ :

R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2} = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}.

A perfect fit gives $R^2 = 1$ . Predicting the mean yields $R^2 = 0$ . A model worse than the mean produces a negative $R^2$ — perfectly possible on a held-out test set, even if surprising at first.

Warning. A high $R^2$ is no guarantee of a good predictive model. It does not detect overfitting and ignores the business context. Always read $R^2$ alongside MAE/RMSE and a residual plot.

Computing metrics in practice

The four functions below are imported from sklearn.metrics:

from sklearn.metrics import (
    mean_absolute_error, mean_squared_error,
    mean_absolute_percentage_error, r2_score,
)

mae  = mean_absolute_error(y, y_hat)
mse  = mean_squared_error(y, y_hat)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y, y_hat)
r2   = r2_score(y, y_hat)

Reporting the four together gives a balanced view: an absolute floor (MAE), a squared-error figure (RMSE), a relative figure (MAPE), and a normalised gain over the mean (R²).

Multiple linear regression with `scikit-learn`

Two variables and a plane

Real problems rarely involve a single explanatory variable. With two predictors $x_1$ and $x_2$ the model becomes

\hat{y} = a_1 x_1 + a_2 x_2 + b,

and in three dimensions $(x_1, x_2, \hat{y})$ traces a plane. Training consists in choosing $a_1, a_2, b$ to minimise the same kind of squared error as before. The interpretation of each coefficient gains an important nuance:

$a_1$ measures the effect of $x_1$ for equal $x_2$ , and conversely. Multiple regression separates the contributions of several factors held simultaneously into the model.

The `abalone_mini` dataset

We illustrate this with abalones — sea molluscs whose age is approximated by the number of growth rings. The target Rings is to be predicted from Length, Diameter, Height and Weight. A 3D scatter of Length, Weight and Rings already suggests a positive but noisy relationship.

Calling `LinearRegression`

The scikit-learn workflow always follows the same three-step pattern:

from sklearn.linear_model import LinearRegression

X = df[['Length', 'Weight']]   # 2D feature matrix
y = df['Rings']                # 1D target

model = LinearRegression()
model.fit(X, y)
y_hat = model.predict(X)

print(model.coef_, model.intercept_)

Shape rule. X must always be two-dimensional, even with a single feature (use double brackets: df[['Length']]). y must be one-dimensional.

After fit, the attribute model.coef_ holds the vector $(\beta_1, \beta_2, \dots, \beta_p)$ and model.intercept_ holds $\beta_0$ . Predicting on the very same data we trained on, as in the snippet above, is purely illustrative — Section Train/test explains why a held-out evaluation is mandatory.

General case with $n$ variables

With $n$ features the model reads

\hat{y} = b + \sum_{j=1}^{n} a_j x_j,

or in matrix form $\hat{y} = X a + b$ . Stacking a column of ones into $X$ to absorb the intercept, the least-squares solution admits the closed form

a = (X^\top X)^{-1} X^\top y.

This normal equation generalises the one-variable formula and works perfectly for small problems. For very large datasets, or when $X^\top X$ is poorly conditioned, sklearn switches to numerical solvers behind the scenes — the user-facing API does not change.

Train/test split: the discipline of generalisation

Why two sets are needed

Imagine an exam where the questions are exactly the ones repeated in class. A perfect mark would measure memorisation, not understanding. Evaluating a model on the data used to train it produces the same illusion: the score reported is optimistic and misleading.

Generalisation. A model is judged by its behaviour on data it has never seen during training, because that is what we will use it for in production.

`train_test_split`

The function sklearn.model_selection.train_test_split shuffles the rows, then partitions them into two disjoint sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
)

model = LinearRegression()
model.fit(X_train, y_train)
y_hat = model.predict(X_test)

test_size=0.2 reserves 20 % of the rows for evaluation; random_state=42 makes the split reproducible. The four returned arrays must be used in the obvious way: X_train, y_train for fit; X_test, y_test for the metric calls.

Gotcha. Comparing the metrics on X_train with those on X_test is the simplest detector of overfitting. A large gap (good on train, poor on test) is the textbook symptom.

Cross-validation

A single split is not the end of the story. If you change random_state, the metrics typically wobble — sometimes substantially. The evaluation depends strongly on the random draw, and a single partition can be either too easy or too hard.

$k$ -fold cross-validation addresses this by repeating the experiment $k$ times. The data is divided into $k$ folds of comparable size; for each iteration, one fold serves as test set and the other $k-1$ folds as training set. We obtain $k$ scores that are summarised by their mean (overall performance) and standard deviation (stability).

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())

Negative scores? By convention cross_val_score always maximises. Error metrics — which we want to minimise — are therefore returned in negative form: "neg_mean_absolute_error", "neg_root_mean_squared_error", etc. To recover the actual error magnitude, simply flip the sign: mae = -scores.

A low standard deviation across folds indicates a stable model whose performance does not depend on a lucky draw. A large standard deviation is a warning that the chosen estimator might be overly sensitive to which observations end up in the training fold.

Polynomial regression: linearity in disguise

Simple linear regression assumes a relationship of the form $\hat{y} = ax + b$ . Many physical relationships are not strictly linear — think of the link between Length and Weight of an abalone, where weight grows faster than length (closer to a cubic law). A simple, powerful trick is to augment the feature space with powers of the original variable:

\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2.

This is polynomial regression, but observe carefully: the model remains linear in the coefficients $\beta_0, \beta_1, \beta_2$ . Everything we already know about least-squares estimation, normal equations, and LinearRegression therefore applies unchanged.

In practice, we add the new column manually or with PolynomialFeatures:

df['Length2'] = df['Length'] ** 2
X = df[['Length', 'Length2']]
y = df['Weight']

model = LinearRegression()
model.fit(X, y)
y_hat = model.predict(X)

Visualising the result on a 2D plot (Length vs. Weight) usually reveals a clearly curved fit, much closer to the cloud than the straight line of degree 1. A complementary 3D plot, with Length on $x$ , Length2 on $y$ and Weight on $z$ , shows that the curve is in fact a plane in the augmented feature space — exactly what we expect from a model that is linear in its augmented features.

Degree as a hyperparameter. Higher polynomial degrees fit training data more tightly but quickly start to oscillate between points. The right degree is found by cross-validation — never by eye on the training set.

$k$ -Nearest Neighbours regression

A non-parametric, local view

The k-Nearest Neighbours regressor (k-NN) takes an entirely different stance. It does not try to learn a global formula relating features to target. To predict the value at a new point $x$ , it simply looks for the $k$ training observations closest to $x$ and returns the average of their target values:

\hat{y}(x) = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x)} y_i,

where $\mathcal{N}_k(x)$ denotes the set of the $k$ nearest neighbours of $x$ .

Non-parametric and local. k-NN stores the training data and consults it at prediction time. There is no fixed set of weights to learn. The model adapts locally to the structure of each region.

The role of distance

The notion of neighbour depends entirely on a chosen distance. By default, sklearn uses the Minkowski distance

d(x, x') = \left( \sum_{j=1}^p |x_j - x'_j|^p \right)^{1/p},

which collapses to the Euclidean distance for $p=2$ and to the Manhattan distance for $p=1$ . Optionally, neighbours can be weighted by inverse distance so that closer points exert more influence:

\hat{y}(x) = \frac{\sum_{i \in \mathcal{N}_k(x)} w_i\, y_i}{\sum_{i \in \mathcal{N}_k(x)} w_i}, \qquad w_i = \frac{1}{d(x, x_i)}.

Using `KNeighborsRegressor`

from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train, y_train)
y_hat = model.predict(X_test)

Three hyperparameters deserve attention. n_neighbors controls the bias-variance trade-off: a small $k$ produces a flexible model that hugs noise, a large $k$ produces a smooth model that risks underfitting. metric lets you change the distance. weights="distance" activates the inverse-distance weighting described above.

Sensitivity to feature scale

The problem

The k-NN regressor is acutely sensitive to scale because every prediction routes through a distance computation. Suppose two features behave very differently:

sqft_living ranges from 300 to 4000,
bathrooms ranges from 1 to 5.

Without rescaling, a difference of 500 in living area dwarfs any conceivable difference in bathrooms. The neighbours are selected almost solely on surface area, even though the number of bathrooms is informative. The notion of nearest neighbour loses its meaning.

Distance-based methods need rescaling. k-NN, SVR, k-means, PCA — anything whose internal arithmetic compares two observations through a distance or an inner product is affected. Linear regression, in contrast, is invariant to per-feature scaling (the coefficients adjust); but its interpretation can still benefit from comparable units.

Three scalers

scikit-learn ships several scalers with the same fit/transform interface. The most common is standardisation:

x_{\text{scaled}} = \frac{x - \mu}{\sigma}.

After standardisation each feature has zero mean and unit variance.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

Critical detail. Use fit_transform on the training set only, then transform on the test set. Calling fit on the test set would let test statistics leak into the training procedure — an information leak that inflates the apparent test score.

Two siblings serve different purposes: MinMaxScaler rescales each feature to a fixed interval (typically $[0,1]$ ); RobustScaler uses the median and the inter-quartile range, making it less sensitive to outliers than StandardScaler. The reverse transformation is available through scaler.inverse_transform, which is useful when you need to display predictions back in the original units.

`Pipeline`: pre-processing without leaks

A Pipeline chains several processing steps and a final estimator into a single object. The contract is simple: when you call fit, each preceding step is fit_transform-ed in order; when you call predict, each step is transform-ed and the last estimator predicts. This brings three concrete benefits.

First, methodological correctness. With a hand-rolled fit_transform / transform you can easily forget that the scaler must be fitted on X_train only. Inside a pipeline, the rule is enforced automatically — for the initial split and for every fold of cross-validation.

Second, simplicity. Three lines instead of a paragraph:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  KNeighborsRegressor(n_neighbors=5)),
])

pipe.fit(X_train, y_train)
y_hat = pipe.predict(X_test)

Third, composability with cross-validation. Passing pipe directly to cross_val_score re-runs the whole chain on each fold — scaler refitted, k-NN retrained, evaluation done — without any glue code on our side.

The named steps remain accessible:

pipe.named_steps["scaler"]  # the trained StandardScaler
pipe.named_steps["model"]   # the trained KNeighborsRegressor

This is convenient for inspecting the learned mean and standard deviation, or for retrieving a learned coefficient vector at the end of the pipeline.

Rule of thumb. As soon as a model needs preprocessing — and almost every model does — write it as a pipeline from the start. The few extra lines pay for themselves the first time you switch to cross-validation or to a grid search.

Wrap-up

In a single chapter we have travelled from the closed-form solution of a one-variable linear regression to a fully pipelined k-NN model with cross-validated normalisation. The trail of ideas is worth retracing: a model is a function family parametrised by coefficients; we choose the coefficients by minimising a cost; we judge the model with metrics on data it has never seen; we stabilise that judgement with cross-validation; we extend the model family by augmenting features (polynomials) or by switching paradigm altogether (k-NN); and we keep our experiments honest by wrapping pre-processing into a Pipeline. The same skeleton — fit, predict, evaluate, cross-validate, scale, pipeline — will reappear unchanged in every subsequent chapter, with new estimators inserted in the modelling slot.

Exercises

Exercise 1 — Implement a simple linear regression

The objective is to implement a simple linear regression manually, without using scikit-learn, in order to understand the calculation of the coefficients. We consider $\hat{y} = ax + b$ . Write a MyLinearRegression class containing:

a constructor __init__,
a method fit(x, y) that computes the coefficients $a$ and $b$ using the closed-form formulas,
a method predict(x) that returns the predicted values $\hat{y}$ .

Constraints: do not use scikit-learn; only Python and NumPy are allowed. The inputs x and y may be lists, Pandas Series or NumPy arrays. Test your class on co2_mini by predicting co2_emissions from consumption, and overlay the fitted line on the scatter plot using Plotly.

Exercise 2 — Simple linear regression on Abalone

We now want to predict the weight (Weight) of an abalone from its length (Length).

Load the Abalone dataset.
Train a LinearRegression to predict Weight from Length.
Compute the predicted values.
Visualise on the same graph the actual values and the values predicted by the model.
Plot the residuals as a function of Length. What do you observe? Comment on the suitability of a linear model.

Exercise 3 — Multiple regression on Abalone

Predict Rings from Length and Weight with LinearRegression, then with all available features. For each model, report model.coef_, model.intercept_ and the four metrics MAE, RMSE, MAPE, R². Visualise the result of the two-variable model in 3D by overlaying the predicted plane on the cloud of (Length, Weight, Rings).

Exercise 4 — Train/test split and cross-validation

Repeat the previous exercise, but this time split the data with train_test_split(test_size=0.2, random_state=42), train the model on X_train only and report metrics on X_test. Then evaluate the same model with cross_val_score(cv=5, scoring="r2") and report the mean and standard deviation of the five scores. Comment on the difference between the two protocols.

Exercise 5 — Polynomial regression

Improve the prediction of Weight from Length for the Abalone dataset by introducing a polynomial feature.

Load the Abalone dataset.
Add a column Length2 = Length ** 2.
Build the feature matrix X from [Length, Length2].
Train a LinearRegression to predict Weight.
Compute the predicted values.
Produce a 3D visualisation with Length on $x$ , Length2 on $y$ , Weight (real and predicted) on $z$ .
Produce a 2D visualisation Length vs. Weight with both real and predicted values.

What is the gain in MAE and R² compared to the linear model of Exercise 2?

Exercise 6 — k-NN regression on Abalone

Predict the weight of an abalone from its length with a KNeighborsRegressor. Visualise the actual versus predicted curves. Then predict Rings from all the other features with a k-NN and report the metrics on a held-out test set. Try $k \in \{1, 3, 5, 11, 25\}$ and observe how the metrics evolve.

Exercise 7 — House prices, linear and k-NN

Explore the house_mini dataset (dimensions, statistics, distribution of price).

Predict price with a LinearRegression. Use a train/test split, report metrics, plot sqft_living versus actual and predicted price, and inspect the residuals.
Repeat the exercise with a KNeighborsRegressor. Comment on the comparative performance.

Exercise 8 — Normalisation

Normalise the features of the previous exercise with a StandardScaler and re-train the k-NN. What changes in the metrics? Why is the linear regression essentially insensitive to the rescaling, while the k-NN is not?

Exercise 9 — Pipeline

Re-do Exercise 8 using a Pipeline that chains a StandardScaler with a KNeighborsRegressor. Verify that the cross-validated R² obtained with cross_val_score(pipe, X, y, cv=5) is consistent with the held-out evaluation. Why is the pipeline approach safer than fitting the scaler manually?

Exercise 10 — Linear vs. k-NN with cross-validation

Compare the performance of LinearRegression and a pipelined k-NN on house_mini using 5-fold cross-validation with scoring="neg_mean_absolute_error". Report the mean and standard deviation of the absolute error for both models. Which one would you recommend, and why?

Going further

scikit-learn — Linear regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
scikit-learn — KNeighborsRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
scikit-learn — Polynomial features: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
scikit-learn — Pipelines and composite estimators: https://scikit-learn.org/stable/modules/compose.html
scikit-learn — Cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html
scikit-learn — Preprocessing data: https://scikit-learn.org/stable/modules/preprocessing.html
scipy.optimize.curve_fit for non-linear least squares: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (3rd ed., O'Reilly, 2022) — Chapter 4 on training models complements this chapter with regularised variants (Ridge, Lasso, Elastic-Net) and the gradient-descent perspective on least squares.