ML · Chapter 2
Machine Learning 2 — Regression
After tackling classification in the first chapter, we now turn to regression, the second great family of supervised learning problems. Where classification asks which class does this observation belong to?, regression asks which numerical value should we predict?. The target variable is no longer a label drawn from a finite set, but a continuous quantity: the price of a house, the age of a mollusc, the consumption of a vehicle, the temperature tomorrow afternoon. The change of nature in the target has profound consequences on how models are formulated, how they are trained, and how their quality is measured.
This chapter starts from the most elementary regression model — the least-squares straight line — and builds upwards. We derive the closed-form coefficients of simple linear regression by hand, then generalise to several explanatory variables with scikit-learn. We introduce the metrics that quantify prediction quality (MAE, MSE, RMSE, MAPE, R²), the train/test discipline that protects us from optimistic illusions, and the cross-validation procedure that stabilises performance estimates. We then meet two non-linear models — polynomial regression and the k-nearest-neighbours regressor — and discover that some algorithms are extremely sensitive to feature scale. The chapter closes on Pipeline, the abstraction that ties pre-processing and modelling together into a single, leak-proof object.
The running case studies are deliberately small and visual. The co2_mini dataset relates fuel consumption to CO₂ emissions; abalone_mini describes physical measurements of abalones whose age is encoded by the number of rings on the shell; house_mini collects sale prices of houses in King County. These datasets are large enough to be interesting and small enough to be plotted entirely on screen, which is exactly what we want when learning a method.
Least-squares linear regression
The simplest possible model
We are given paired observations for and we look for a straight line
that approximates the data as well as possible. Two questions immediately follow: what does as well as possible mean, and how do we compute the coefficients and that achieve it?
The standard answer is the least-squares criterion. For each observation, we measure the residual — the vertical signed distance between the actual value and the prediction:
and we choose to minimise the sum of the squared residuals:
Why squares? Squaring makes errors positive, penalises large gaps more strongly than small ones, and yields a smooth differentiable cost function. The minimisation problem has a unique closed-form solution.
Deriving the coefficients
The cost is a quadratic, convex function of two unknowns. Its global minimum is reached where both partial derivatives vanish.
Differentiating with respect to gives
which, after dividing by , yields the elegant identity
In other words, the regression line always passes through the centroid of the cloud. Differentiating with respect to and substituting the expression for produces the slope:
Statistical interpretation. The numerator measures how and vary together; the denominator measures the dispersion of . The slope is large when explains the variations of well.
Implementing it by hand
Implementing the formulas in Python is a one-page affair. The class below mirrors the scikit-learn API — a constructor, a fit method, a predict method — but uses only NumPy.
import numpy as np class MyLinearRegression: def __init__(self): self.a = None self.b = None def fit(self, x, y): x, y = np.asarray(x), np.asarray(y) x_mean, y_mean = x.mean(), y.mean() self.a = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum() self.b = y_mean - self.a * x_mean return self def predict(self, x): return self.a * np.asarray(x) + self.b
Tested on co2_mini with x = consumption and y = emissions, the class produces a slope and an intercept identical to those of sklearn.linear_model.LinearRegression. The exercise of writing it once is well worth the time: it demystifies what fit and predict actually do.
Plotting a line on a cloud with Plotly
Plotly Express creates a figure containing one or more traces. To overlay the fitted line on the original cloud, we add a second trace to the existing figure rather than building a new one:
fig = px.scatter(x=x, y=y) fig.add_scatter(x=x, y=y_hat, name="Predicted values") fig.show()
The figure keeps the same axes and scale, which is exactly what we need for a faithful visual comparison.
Residuals: the model's after-image
Once a model is fitted, plotting its residuals as a function of is one of the most informative diagnostics available.
Reading a residual plot. A well-specified linear model produces residuals that scatter randomly around zero, with no visible trend and roughly constant dispersion. Any visible curve, funnel, or block of same-sign residuals is a clue that the linear assumption is too narrow, that outliers distort the fit, or that the variance is not constant.
A positive residual signals that the model underestimates the truth; a negative one, that it overestimates. The sum of residuals over the training set is mechanically zero — a direct consequence of the first normal equation.
Metrics: how wrong are we?
Once predictions are available, we want a single number that summarises how wrong the model is. Several metrics exist, each emphasising a different aspect of the error.
MAE, MSE, RMSE, MAPE
The Mean Absolute Error averages the absolute residuals:
It is expressed in the same units as and is comparatively robust to outliers. The Mean Squared Error
penalises large errors quadratically; its square root, the Root Mean Squared Error
restores the natural units while keeping the strong sensitivity to outliers. Finally, the Mean Absolute Percentage Error
reports a relative error, which is convenient when targets span several orders of magnitude — but be wary when some are close to zero.
MAE versus RMSE. When errors are compact, the two metrics agree. When a few observations carry very large errors, RMSE jumps while MAE stays moderate: the gap between them is itself a diagnostic.
The R² score
The coefficient of determination compares the model to a trivial baseline that always predicts the mean :
A perfect fit gives . Predicting the mean yields . A model worse than the mean produces a negative — perfectly possible on a held-out test set, even if surprising at first.
Warning. A high is no guarantee of a good predictive model. It does not detect overfitting and ignores the business context. Always read alongside MAE/RMSE and a residual plot.
Computing metrics in practice
The four functions below are imported from sklearn.metrics:
from sklearn.metrics import ( mean_absolute_error, mean_squared_error, mean_absolute_percentage_error, r2_score, ) mae = mean_absolute_error(y, y_hat) mse = mean_squared_error(y, y_hat) rmse = np.sqrt(mse) mape = mean_absolute_percentage_error(y, y_hat) r2 = r2_score(y, y_hat)
Reporting the four together gives a balanced view: an absolute floor (MAE), a squared-error figure (RMSE), a relative figure (MAPE), and a normalised gain over the mean (R²).
Multiple linear regression with scikit-learn
Two variables and a plane
Real problems rarely involve a single explanatory variable. With two predictors and the model becomes
and in three dimensions traces a plane. Training consists in choosing to minimise the same kind of squared error as before. The interpretation of each coefficient gains an important nuance:
measures the effect of for equal , and conversely. Multiple regression separates the contributions of several factors held simultaneously into the model.
The abalone_mini dataset
We illustrate this with abalones — sea molluscs whose age is approximated by the number of growth rings. The target Rings is to be predicted from Length, Diameter, Height and Weight. A 3D scatter of Length, Weight and Rings already suggests a positive but noisy relationship.
Calling LinearRegression
The scikit-learn workflow always follows the same three-step pattern:
from sklearn.linear_model import LinearRegression X = df[['Length', 'Weight']] # 2D feature matrix y = df['Rings'] # 1D target model = LinearRegression() model.fit(X, y) y_hat = model.predict(X) print(model.coef_, model.intercept_)
Shape rule.
Xmust always be two-dimensional, even with a single feature (use double brackets:df[['Length']]).ymust be one-dimensional.
After fit, the attribute model.coef_ holds the vector and model.intercept_ holds . Predicting on the very same data we trained on, as in the snippet above, is purely illustrative — Section Train/test explains why a held-out evaluation is mandatory.
General case with variables
With features the model reads
or in matrix form . Stacking a column of ones into to absorb the intercept, the least-squares solution admits the closed form
This normal equation generalises the one-variable formula and works perfectly for small problems. For very large datasets, or when is poorly conditioned, sklearn switches to numerical solvers behind the scenes — the user-facing API does not change.
Train/test split: the discipline of generalisation
Why two sets are needed
Imagine an exam where the questions are exactly the ones repeated in class. A perfect mark would measure memorisation, not understanding. Evaluating a model on the data used to train it produces the same illusion: the score reported is optimistic and misleading.
Generalisation. A model is judged by its behaviour on data it has never seen during training, because that is what we will use it for in production.
train_test_split
The function sklearn.model_selection.train_test_split shuffles the rows, then partitions them into two disjoint sets:
from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42, ) model = LinearRegression() model.fit(X_train, y_train) y_hat = model.predict(X_test)
test_size=0.2 reserves 20 % of the rows for evaluation; random_state=42 makes the split reproducible. The four returned arrays must be used in the obvious way: X_train, y_train for fit; X_test, y_test for the metric calls.
Gotcha. Comparing the metrics on
X_trainwith those onX_testis the simplest detector of overfitting. A large gap (good on train, poor on test) is the textbook symptom.
Cross-validation
A single split is not the end of the story. If you change random_state, the metrics typically wobble — sometimes substantially. The evaluation depends strongly on the random draw, and a single partition can be either too easy or too hard.
-fold cross-validation addresses this by repeating the experiment times. The data is divided into folds of comparable size; for each iteration, one fold serves as test set and the other folds as training set. We obtain scores that are summarised by their mean (overall performance) and standard deviation (stability).
from sklearn.model_selection import cross_val_score scores = cross_val_score(model, X, y, cv=5, scoring="r2") print(scores.mean(), scores.std())
Negative scores? By convention
cross_val_scorealways maximises. Error metrics — which we want to minimise — are therefore returned in negative form:"neg_mean_absolute_error","neg_root_mean_squared_error", etc. To recover the actual error magnitude, simply flip the sign:mae = -scores.
A low standard deviation across folds indicates a stable model whose performance does not depend on a lucky draw. A large standard deviation is a warning that the chosen estimator might be overly sensitive to which observations end up in the training fold.
Polynomial regression: linearity in disguise
Simple linear regression assumes a relationship of the form . Many physical relationships are not strictly linear — think of the link between Length and Weight of an abalone, where weight grows faster than length (closer to a cubic law). A simple, powerful trick is to augment the feature space with powers of the original variable:
This is polynomial regression, but observe carefully: the model remains linear in the coefficients . Everything we already know about least-squares estimation, normal equations, and LinearRegression therefore applies unchanged.
In practice, we add the new column manually or with PolynomialFeatures:
df['Length2'] = df['Length'] ** 2 X = df[['Length', 'Length2']] y = df['Weight'] model = LinearRegression() model.fit(X, y) y_hat = model.predict(X)
Visualising the result on a 2D plot (Length vs. Weight) usually reveals a clearly curved fit, much closer to the cloud than the straight line of degree 1. A complementary 3D plot, with Length on , Length2 on and Weight on , shows that the curve is in fact a plane in the augmented feature space — exactly what we expect from a model that is linear in its augmented features.
Degree as a hyperparameter. Higher polynomial degrees fit training data more tightly but quickly start to oscillate between points. The right degree is found by cross-validation — never by eye on the training set.
-Nearest Neighbours regression
A non-parametric, local view
The k-Nearest Neighbours regressor (k-NN) takes an entirely different stance. It does not try to learn a global formula relating features to target. To predict the value at a new point , it simply looks for the training observations closest to and returns the average of their target values:
where denotes the set of the nearest neighbours of .
Non-parametric and local. k-NN stores the training data and consults it at prediction time. There is no fixed set of weights to learn. The model adapts locally to the structure of each region.
The role of distance
The notion of neighbour depends entirely on a chosen distance. By default, sklearn uses the Minkowski distance
which collapses to the Euclidean distance for and to the Manhattan distance for . Optionally, neighbours can be weighted by inverse distance so that closer points exert more influence:
Using KNeighborsRegressor
from sklearn.neighbors import KNeighborsRegressor model = KNeighborsRegressor(n_neighbors=5) model.fit(X_train, y_train) y_hat = model.predict(X_test)
Three hyperparameters deserve attention. n_neighbors controls the bias-variance trade-off: a small produces a flexible model that hugs noise, a large produces a smooth model that risks underfitting. metric lets you change the distance. weights="distance" activates the inverse-distance weighting described above.
Sensitivity to feature scale
The problem
The k-NN regressor is acutely sensitive to scale because every prediction routes through a distance computation. Suppose two features behave very differently:
sqft_livingranges from 300 to 4000,bathroomsranges from 1 to 5.
Without rescaling, a difference of 500 in living area dwarfs any conceivable difference in bathrooms. The neighbours are selected almost solely on surface area, even though the number of bathrooms is informative. The notion of nearest neighbour loses its meaning.
Distance-based methods need rescaling. k-NN, SVR, k-means, PCA — anything whose internal arithmetic compares two observations through a distance or an inner product is affected. Linear regression, in contrast, is invariant to per-feature scaling (the coefficients adjust); but its interpretation can still benefit from comparable units.
Three scalers
scikit-learn ships several scalers with the same fit/transform interface. The most common is standardisation:
After standardisation each feature has zero mean and unit variance.
from sklearn.preprocessing import StandardScaler scaler = StandardScaler() X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.transform(X_test)
Critical detail. Use
fit_transformon the training set only, thentransformon the test set. Callingfiton the test set would let test statistics leak into the training procedure — an information leak that inflates the apparent test score.
Two siblings serve different purposes: MinMaxScaler rescales each feature to a fixed interval (typically ); RobustScaler uses the median and the inter-quartile range, making it less sensitive to outliers than StandardScaler. The reverse transformation is available through scaler.inverse_transform, which is useful when you need to display predictions back in the original units.
Pipeline: pre-processing without leaks
A Pipeline chains several processing steps and a final estimator into a single object. The contract is simple: when you call fit, each preceding step is fit_transform-ed in order; when you call predict, each step is transform-ed and the last estimator predicts. This brings three concrete benefits.
First, methodological correctness. With a hand-rolled fit_transform / transform you can easily forget that the scaler must be fitted on X_train only. Inside a pipeline, the rule is enforced automatically — for the initial split and for every fold of cross-validation.
Second, simplicity. Three lines instead of a paragraph:
from sklearn.pipeline import Pipeline from sklearn.preprocessing import StandardScaler from sklearn.neighbors import KNeighborsRegressor pipe = Pipeline([ ("scaler", StandardScaler()), ("model", KNeighborsRegressor(n_neighbors=5)), ]) pipe.fit(X_train, y_train) y_hat = pipe.predict(X_test)
Third, composability with cross-validation. Passing pipe directly to cross_val_score re-runs the whole chain on each fold — scaler refitted, k-NN retrained, evaluation done — without any glue code on our side.
The named steps remain accessible:
pipe.named_steps["scaler"] # the trained StandardScaler pipe.named_steps["model"] # the trained KNeighborsRegressor
This is convenient for inspecting the learned mean and standard deviation, or for retrieving a learned coefficient vector at the end of the pipeline.
Rule of thumb. As soon as a model needs preprocessing — and almost every model does — write it as a pipeline from the start. The few extra lines pay for themselves the first time you switch to cross-validation or to a grid search.
Wrap-up
In a single chapter we have travelled from the closed-form solution of a one-variable linear regression to a fully pipelined k-NN model with cross-validated normalisation. The trail of ideas is worth retracing: a model is a function family parametrised by coefficients; we choose the coefficients by minimising a cost; we judge the model with metrics on data it has never seen; we stabilise that judgement with cross-validation; we extend the model family by augmenting features (polynomials) or by switching paradigm altogether (k-NN); and we keep our experiments honest by wrapping pre-processing into a Pipeline. The same skeleton — fit, predict, evaluate, cross-validate, scale, pipeline — will reappear unchanged in every subsequent chapter, with new estimators inserted in the modelling slot.
Exercises
Exercise 1 — Implement a simple linear regression
The objective is to implement a simple linear regression manually, without using scikit-learn, in order to understand the calculation of the coefficients. We consider . Write a MyLinearRegression class containing:
- a constructor
__init__, - a method
fit(x, y)that computes the coefficients and using the closed-form formulas, - a method
predict(x)that returns the predicted values .
Constraints: do not use scikit-learn; only Python and NumPy are allowed. The inputs x and y may be lists, Pandas Series or NumPy arrays. Test your class on co2_mini by predicting co2_emissions from consumption, and overlay the fitted line on the scatter plot using Plotly.
Exercise 2 — Simple linear regression on Abalone
We now want to predict the weight (Weight) of an abalone from its length (Length).
- Load the Abalone dataset.
- Train a
LinearRegressionto predictWeightfromLength. - Compute the predicted values.
- Visualise on the same graph the actual values and the values predicted by the model.
- Plot the residuals as a function of
Length. What do you observe? Comment on the suitability of a linear model.
Exercise 3 — Multiple regression on Abalone
Predict Rings from Length and Weight with LinearRegression, then with all available features. For each model, report model.coef_, model.intercept_ and the four metrics MAE, RMSE, MAPE, R². Visualise the result of the two-variable model in 3D by overlaying the predicted plane on the cloud of (Length, Weight, Rings).
Exercise 4 — Train/test split and cross-validation
Repeat the previous exercise, but this time split the data with train_test_split(test_size=0.2, random_state=42), train the model on X_train only and report metrics on X_test. Then evaluate the same model with cross_val_score(cv=5, scoring="r2") and report the mean and standard deviation of the five scores. Comment on the difference between the two protocols.
Exercise 5 — Polynomial regression
Improve the prediction of Weight from Length for the Abalone dataset by introducing a polynomial feature.
- Load the Abalone dataset.
- Add a column
Length2 = Length ** 2. - Build the feature matrix
Xfrom[Length, Length2]. - Train a
LinearRegressionto predictWeight. - Compute the predicted values.
- Produce a 3D visualisation with
Lengthon ,Length2on ,Weight(real and predicted) on . - Produce a 2D visualisation
Lengthvs.Weightwith both real and predicted values.
What is the gain in MAE and R² compared to the linear model of Exercise 2?
Exercise 6 — k-NN regression on Abalone
Predict the weight of an abalone from its length with a KNeighborsRegressor. Visualise the actual versus predicted curves. Then predict Rings from all the other features with a k-NN and report the metrics on a held-out test set. Try and observe how the metrics evolve.
Exercise 7 — House prices, linear and k-NN
Explore the house_mini dataset (dimensions, statistics, distribution of price).
- Predict
pricewith aLinearRegression. Use a train/test split, report metrics, plotsqft_livingversus actual and predicted price, and inspect the residuals. - Repeat the exercise with a
KNeighborsRegressor. Comment on the comparative performance.
Exercise 8 — Normalisation
Normalise the features of the previous exercise with a StandardScaler and re-train the k-NN. What changes in the metrics? Why is the linear regression essentially insensitive to the rescaling, while the k-NN is not?
Exercise 9 — Pipeline
Re-do Exercise 8 using a Pipeline that chains a StandardScaler with a KNeighborsRegressor. Verify that the cross-validated R² obtained with cross_val_score(pipe, X, y, cv=5) is consistent with the held-out evaluation. Why is the pipeline approach safer than fitting the scaler manually?
Exercise 10 — Linear vs. k-NN with cross-validation
Compare the performance of LinearRegression and a pipelined k-NN on house_mini using 5-fold cross-validation with scoring="neg_mean_absolute_error". Report the mean and standard deviation of the absolute error for both models. Which one would you recommend, and why?
Going further
- scikit-learn — Linear regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
- scikit-learn — KNeighborsRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
- scikit-learn — Polynomial features: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
- scikit-learn — Pipelines and composite estimators: https://scikit-learn.org/stable/modules/compose.html
- scikit-learn — Cross-validation: https://scikit-learn.org/stable/modules/cross_validation.html
- scikit-learn — Preprocessing data: https://scikit-learn.org/stable/modules/preprocessing.html
- scipy.optimize.curve_fit for non-linear least squares: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html
- Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (3rd ed., O'Reilly, 2022) — Chapter 4 on training models complements this chapter with regularised variants (Ridge, Lasso, Elastic-Net) and the gradient-descent perspective on least squares.