
Machine Learning 5 — Synthesis Projects

The four previous chapters have introduced, step by step, the conceptual machinery and the practical tooling of classical machine learning: the workflow of preparing a dataset (chapter 1), the family of regression models with their geometric and probabilistic interpretations (chapter 2), the family of classifiers with their dedicated metrics (chapter 3), and finally the unsupervised toolkit of dimensionality reduction and clustering (chapter 4). Each notion was illustrated on small didactic datasets (titanic_mini, cancer_mini, abalone_mini, iris) chosen for their pedagogical clarity rather than for their realism.

This final chapter takes the opposite stance. It presents four synthesis projects built on much larger and messier real-world datasets. The work is no longer guided cell by cell: each project is a brief that you must turn into a complete pipeline, from the first inspection of the raw file to the final cross-validated score. The chapter is therefore the exercise list itself, organised by project. Each section opens with the problem statement, describes the data, recommends a methodological path through the chapters of this manual, lists the classical pitfalls that the dataset is famous for, and proposes a short menu of models worth trying. Reference solution skeletons are kept deliberately minimal: they show the shape of the code rather than fill it in.

The four datasets have been chosen to cover, between them, every situation encountered in the previous chapters. Mercedes-Benz Greener Manufacturing is a high-dimensional regression problem with hundreds of binary features and a handful of unordered categorical codes — a perfect stress test for one-hot encoding and regularisation. Stroke Prediction is a strongly imbalanced binary classification problem on tabular medical data and forces a careful reading of metrics beyond accuracy. House Prices — Ames is the canonical regression project on real estate, with eighty heterogeneous features, real and disguised missing values, and a long-tailed target. MNIST is the historical benchmark of handwritten digit recognition, the gateway from classical machine learning into the world of image data, and a natural bridge towards the deep learning chapters that follow.

A common thread runs through the four projects: a serious ML pipeline is judged by the protocol that produced it, not by a single accuracy figure. A held-out test set, a reproducible random seed, a cross-validation loop on the training portion, and a critical look at the errors made by the model are non-negotiable. The point of these projects is to build that reflex.

Project 1 — Mercedes-Benz Greener Manufacturing

Problem and dataset

The first project is drawn from the Kaggle competition Mercedes-Benz Greener Manufacturing. Mercedes-Benz operates a test bench on which each newly assembled vehicle is subjected to a battery of validation tests before leaving the factory. The duration of these tests depends on the technical configuration of the vehicle: the engine type, the transmission, the assortment of options and electronic components installed. Long tests consume time, energy and money. The industrial goal is therefore to predict the test time y from the configuration alone so that costly bench time can be planned, batched, or in some cases avoided altogether.

The training file mercedes_test.csv contains roughly four thousand vehicle configurations. Each row is one configuration. The target y is a continuous, strictly positive real number — the test time in seconds — which makes the project a regression task.

The features split into two very different groups. A first group of eight columns named X0, X1, X2, X3, X4, X5, X6 and X8 encodes high-level categorical attributes (engine type, transmission type, mechanical or electronic component family, design variation) using letter codes such as a, b, c. The second, much larger group of nearly four hundred columns named X10 through X385 is purely binary ($0$ or $1$): each column is a flag indicating the presence or absence of a specific option, sub-component or compatibility marker.

This shape (a handful of categorical columns, hundreds of binary columns, a few thousand rows) is the signature of a high-dimensional, sparse regression problem with industrial encoding.

Recommended approach

The first methodological choice concerns the encoding of the eight categorical columns. Letter codes a, b, c carry no natural order, so the only honest option is one-hot encoding. Mixing one-hot dummies with the four hundred existing binary columns is harmless: the resulting feature matrix simply becomes a sparse $0/1$ matrix with several hundred columns, exactly the regime in which regularised linear models excel.

The second choice is constant-column pruning. Several of the binary columns take the same value in every row of the training set (always $0$ or always $1$). They contribute nothing to prediction, inflate the feature count, and slow down training. A VarianceThreshold(threshold=0.0) from sklearn.feature_selection, or simply df.loc[:, df.nunique() > 1], removes them in one line.
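The two one-liners mentioned above can be compared on a toy frame. This is a minimal sketch: the four columns and their values are hypothetical and only borrow the X10-style naming of the real file.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame: two informative flags and two constant ones.
toy = pd.DataFrame({
    "X10": [0, 1, 0, 1],
    "X11": [0, 0, 0, 0],   # always 0
    "X12": [1, 1, 0, 1],
    "X13": [1, 1, 1, 1],   # always 1
})

# Option 1: pandas, keep columns with more than one distinct value.
kept_pandas = toy.loc[:, toy.nunique() > 1]

# Option 2: scikit-learn, drop features whose variance is exactly zero.
selector = VarianceThreshold(threshold=0.0)
kept_sklearn = selector.fit_transform(toy)
kept_names = toy.columns[selector.get_support()].tolist()

print(kept_pandas.columns.tolist())
print(kept_names)
```

Inside a Pipeline the scikit-learn version is preferable, because the set of constant columns is then learned on each training fold rather than on the full dataset.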

The third choice is the model. Three families deserve to be tried.

A linear baseline with an L2 penalty (Ridge) is the natural starting point. With several hundred features and only a few thousand observations, the unpenalised least-squares estimator is unstable; the ridge penalty trades a small bias for a large reduction in variance and produces a usable score essentially out of the box. An L1-penalised variant (Lasso) is interesting too because it performs implicit feature selection on the binary columns.
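The difference between the two penalties is easy to see on synthetic data. The sketch below assumes a hypothetical matrix of 50 random binary flags of which only the first three influence the target; the alpha values are arbitrary illustrations, not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_rows, n_flags = 200, 50

# Hypothetical binary design matrix; only the first three flags matter.
X = rng.integers(0, 2, size=(n_rows, n_flags)).astype(float)
true_coef = np.zeros(n_flags)
true_coef[:3] = [3.0, -2.0, 1.5]
y = 100.0 + X @ true_coef + rng.normal(scale=0.1, size=n_rows)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks every coefficient but leaves them non-zero;
# Lasso sets the coefficients of useless flags exactly to zero.
n_ridge = int(np.count_nonzero(ridge.coef_))
n_lasso = int(np.count_nonzero(lasso.coef_))
print("non-zero coefficients, ridge:", n_ridge, "lasso:", n_lasso)
```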

A random forest is a sensible non-linear alternative. Forests are insensitive to feature scaling, handle binary inputs without preprocessing, and reveal the most discriminative options through their feature importances.

A gradient boosting model — GradientBoostingRegressor, LGBMRegressor or XGBRegressor — is, on this dataset, typically the winner. Boosted trees handle high-dimensional binary inputs gracefully and tend to lift the cross-validated $R^2$ significantly above ridge regression.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

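# X is assumed to hold the feature columns only: drop the target y and the Kaggle ID column first.
cat_cols = ["X0", "X1", "X2", "X3", "X4", "X5", "X6", "X8"]
num_cols = [c for c in X.columns if c not in cat_cols]  # the binary 0/1 flags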

pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("num", "passthrough", num_cols),
])

pipe = Pipeline([
    ("pre", pre),
    ("var", VarianceThreshold(0.0)),
    ("model", Ridge(alpha=1.0)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")

Classical pitfalls

The first pitfall is to forget the handle_unknown="ignore" argument of OneHotEncoder. The Kaggle test set contains category levels that do not appear in the training set; without the option, the pipeline crashes at predict time.

The second pitfall is outliers in the target. A small number of rows in mercedes_test.csv have suspiciously large values of y (above two hundred seconds). Inspect a histogram of y before fitting; a log-transform of the target is often beneficial because it reduces the weight of these rows in the squared-error loss.

The third pitfall is score evaluation. The Kaggle metric for this competition is the $R^2$ score. Use scoring="r2" in cross_val_score; do not report a mean squared error in seconds-squared and expect it to be comparable to the leaderboard.

The fourth pitfall is the sparsity trap. With four hundred binary features and four thousand rows, a deep tree can easily memorise spurious patterns. Cap max_depth between $4$ and $8$ for boosted trees, and rely on cross-validation to choose it.
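Choosing max_depth by cross-validation can be sketched with GridSearchCV. The data below is a small synthetic stand-in generated with make_regression, and n_estimators is kept low only so that the example runs quickly; both are assumptions of this sketch.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Mercedes feature matrix.
X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8]},
    cv=3,
    scoring="r2",   # same metric as the leaderboard
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
print("best max_depth:", best_depth)
print("mean CV R^2 per depth:", search.cv_results_["mean_test_score"])
```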

Going further

Kaggle competition: https://www.kaggle.com/c/mercedes-benz-greener-manufacturing
sklearn.preprocessing.OneHotEncoder, VarianceThreshold.
sklearn.linear_model.Ridge and Lasso.
LightGBM and XGBoost regressors.

Project 2 — Stroke Prediction

Problem and dataset

The Kaggle dataset Stroke Prediction Dataset (stroke.csv, published by fedesoriano) collects roughly five thousand patient records, each annotated with whether the individual has suffered a stroke. The target stroke is binary: $1$ for an occurrence, $0$ otherwise. The aim of the project is to build a classifier that estimates the probability of a stroke from demographic, medical and behavioural variables.

Apart from a pure identifier, id, which carries no predictive value and must be dropped before any model is fitted, the columns split into five thematic blocks. Two demographic variables, gender (categorical) and age (continuous), describe who the patient is, with age widely known to be the single most informative feature. The medical history block contains two binary indicators, hypertension and heart_disease, both of which become more frequent with age. The social and professional situation is captured by ever_married, work_type (with the modalities Private, Self-employed, Govt_job, children, Never_worked) and Residence_type (urban or rural); these act as proxies for socio-economic context. Lifestyle is summarised by smoking_status with four categories: never smoked, formerly smoked, smokes, and Unknown, the last of which is in fact a disguised missing value. Finally, two biometric variables complete the picture: avg_glucose_level and bmi, the latter containing genuine missing values that must be imputed.

The defining feature of this dataset is its class imbalance: only around five per cent of records correspond to a stroke. Naively maximising accuracy is therefore meaningless — predicting "no stroke" for everyone already scores ninety-five per cent.

Recommended approach

The first preprocessing decision concerns the Unknown value of smoking_status. Keeping it as a category in its own right preserves all the information; it will simply produce a one-hot column that the model can use or ignore. Replacing it by the mode of the other categories is a defensible shortcut but should be documented.

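The second decision concerns the bmi column. Its missing values are usually treated as uninformative, but the assumption is worth checking by comparing the stroke rate among rows with and without bmi. A median imputation with SimpleImputer(strategy="median") is the standard and robust choice; passing add_indicator=True additionally keeps a flag for the imputed rows in case missingness turns out to carry signal.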

The third decision concerns the encoding of categorical columns. OneHotEncoder(handle_unknown="ignore") applied through a ColumnTransformer is again the safe option. The numerical columns age, avg_glucose_level and bmi benefit from a StandardScaler if a logistic regression is used, although tree-based models do not require it.

The fourth and most important decision is how to fight class imbalance. Three lines of attack are available, and they are complementary rather than mutually exclusive.

Class weights. Many scikit-learn classifiers (LogisticRegression, RandomForestClassifier, SVC, among others) accept class_weight="balanced", which weights each sample's loss inversely to the frequency of its class. This is the simplest and often most effective lever.
Resampling. The imbalanced-learn library provides SMOTE (synthetic over-sampling of the minority class) and RandomUnderSampler (under-sampling of the majority). They must be applied inside the cross-validation loop, never on the entire dataset prior to splitting, otherwise information leaks from validation folds into training.
Threshold tuning. The default decision threshold of $0.5$ on the predicted probability is rarely optimal under imbalance. Compute the precision–recall curve on the validation set and pick the threshold that hits the desired recall on the positive class.
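The threshold-tuning step can be sketched as follows. The data is a synthetic imbalanced problem from make_classification (about five per cent positives), and target_recall is a hypothetical requirement chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: roughly 5% positives, as in the stroke data.
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.95], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds;
# recall[i] is the recall obtained when predicting 1 for proba >= thresholds[i].
precision, recall, thresholds = precision_recall_curve(y_val, proba)

target_recall = 0.80                      # hypothetical requirement
ok = recall[:-1] >= target_recall         # thresholds that still meet it
chosen = thresholds[ok].max()             # the strictest such threshold

y_pred = (proba >= chosen).astype(int)
achieved_recall = recall_score(y_val, y_pred)
print("threshold:", chosen, "recall:", achieved_recall)
```

In a real project the threshold should be chosen on a validation fold and then frozen before the test set is touched.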

The fifth decision is the metric. Accuracy is misleading. Use the AUC of the ROC curve for global comparison, the F1-score of the positive class for a single summary number, and read the confusion matrix explicitly to make sense of the trade-off.
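Why accuracy misleads is easiest to see on a classifier that never predicts a stroke. The sketch below uses hypothetical random labels with a five per cent positive rate and scikit-learn's DummyClassifier as the majority-class baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, roc_auc_score)

rng = np.random.default_rng(0)
y_true = (rng.random(5000) < 0.05).astype(int)   # hypothetical labels, ~5% positives
X_dummy = np.zeros((5000, 1))                    # features are irrelevant here

baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)
y_score = baseline.predict_proba(X_dummy)[:, 1]

acc = accuracy_score(y_true, y_pred)             # high, yet no stroke is ever detected
f1 = f1_score(y_true, y_pred, zero_division=0)   # F1 of the positive class
auc = roc_auc_score(y_true, y_score)             # constant scores carry no ranking
cm = confusion_matrix(y_true, y_pred)            # rows: true class, columns: predicted

print(acc, f1, auc)
print(cm)
```

The second column of the confusion matrix is entirely zero: every positive case ends up as a false negative, which accuracy alone does not reveal.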

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, classification_report

num_cols = ["age", "avg_glucose_level", "bmi"]
bin_cols = ["hypertension", "heart_disease"]  # already 0/1, passed through unchanged
cat_cols = ["gender", "ever_married", "work_type",
            "Residence_type", "smoking_status"]

# Columns that are not listed (here: id) are dropped by the default remainder="drop".
pre = ColumnTransformer([
    ("num", Pipeline([
        ("imp", SimpleImputer(strategy="median")),
        ("sc",  StandardScaler())]), num_cols),
    ("bin", "passthrough", bin_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

clf = Pipeline([("pre", pre),
                ("model", LogisticRegression(class_weight="balanced",
                                              max_iter=1000))])

A logistic regression with balanced class weights is the right baseline: it is fast and fully interpretable through its coefficients. Its probabilities rank patients well, but the reweighting inflates them relative to the true stroke rate, so they must be recalibrated before being read as absolute risks. A RandomForestClassifier(class_weight="balanced") and a gradient boosting model (GradientBoostingClassifier, or LGBMClassifier(class_weight="balanced")) are the natural follow-ups when raw AUC matters more than interpretability.

Classical pitfalls

The first pitfall is to keep the id column among the features. It carries no signal: any correlation with the target is accidental and only adds noise.

The second pitfall is to rely on accuracy. A model that predicts the majority class for every patient scores around $95\%$ accuracy and is medically useless. Always report AUC and the confusion matrix.

The third pitfall is to resample before splitting. The order of operations matters. The split into train/test (and within the training portion, the split into folds) must come first; SMOTE or random under-sampling are applied only to the training portion.
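The correct order of operations can be sketched with a manual cross-validation loop. To stay within scikit-learn and NumPy, this sketch swaps SMOTE for plain random over-sampling of the minority class; oversample_minority is a hypothetical helper written for this example, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.95], random_state=0)
rng = np.random.default_rng(0)

def oversample_minority(X_tr, y_tr, rng):
    """Duplicate random positive rows until both classes have equal size."""
    pos = np.flatnonzero(y_tr == 1)
    neg = np.flatnonzero(y_tr == 0)
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    return X_tr[idx], y_tr[idx]

aucs = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for tr_idx, val_idx in cv.split(X, y):
    # Resample the training fold only; the validation fold keeps its natural imbalance.
    X_bal, y_bal = oversample_minority(X[tr_idx], y[tr_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    aucs.append(roc_auc_score(y[val_idx], model.predict_proba(X[val_idx])[:, 1]))

print("AUC per fold:", np.round(aucs, 3))
```

With imbalanced-learn installed, the same guarantee is obtained by placing the sampler inside that library's resampling-aware Pipeline and passing the pipeline to cross_val_score.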

The fourth pitfall is to ignore calibration. If the downstream use of the model is to set a clinical threshold, the predicted probabilities must be trustworthy. CalibratedClassifierCV wraps any classifier and recalibrates its probabilities on a held-out fold.
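The effect of recalibration can be checked by comparing the mean predicted probability with the observed positive rate. The sketch below is run on synthetic data; the choice of method="sigmoid" and cv=5 is an assumption made for illustration.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Balanced class weights help recall but push predicted probabilities upward.
raw = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# The wrapper refits the model on internal folds and learns a mapping
# from its scores to probabilities on the part of the data left out each time.
calibrated = CalibratedClassifierCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    method="sigmoid", cv=5,
).fit(X_tr, y_tr)

prevalence = y_te.mean()
mean_raw = raw.predict_proba(X_te)[:, 1].mean()
mean_cal = calibrated.predict_proba(X_te)[:, 1].mean()
print("observed rate:", prevalence)
print("mean predicted, raw:", mean_raw, "calibrated:", mean_cal)
```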

Going further

Kaggle dataset: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
imbalanced-learn — SMOTE, RandomUnderSampler, Pipeline aware of resampling.
sklearn.linear_model.LogisticRegression, metrics.roc_auc_score, metrics.precision_recall_curve.
sklearn.calibration.CalibratedClassifierCV.

Project 3 — House Prices: Ames Housing

Problem and dataset

The third project is built on the Kaggle competition House Prices: Advanced Regression Techniques, which uses the Ames Housing dataset assembled by Dean De Cock. Each row in house_prices.csv corresponds to a single residential sale in Ames, Iowa, between 2006 and 2010. The target SalePrice is a strictly positive real number, the final transaction price in dollars. The dataset is the canonical playground for advanced regression: with around 1,460 observations and 79 explanatory variables, it lies in the regime where every methodological decision matters.

The variables span the entire descriptive surface of a property. Location is captured by Neighborhood (a high-cardinality categorical) and MSZoning. The land is described by LotArea, LotFrontage (with missing values), LotShape, LandContour, LotConfig and LandSlope. The building is described by MSSubClass (a numeric column that is in fact categorical), BldgType and HouseStyle. The overall quality of the property is summarised by two ordinal variables that play a central role: OverallQual and OverallCond, both rated on a one-to-ten scale.

A second block describes the physical extent of the property: GrLivArea (above-ground living area), TotalBsmtSF, 1stFlrSF, 2ndFlrSF, the bedroom and bathroom counts, the basement variables (BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, …), and the garage variables (GarageType, GarageYrBlt, GarageCars, GarageArea, GarageQual, GarageCond).

Then come construction and renovation years (YearBuilt, YearRemodAdd), exterior and material characteristics (Exterior1st, Exterior2nd, RoofStyle, RoofMatl, Foundation, MasVnrType, MasVnrArea), comfort and utilities (Heating, HeatingQC, CentralAir, Electrical, Utilities), outdoor amenities (WoodDeckSF, OpenPorchSF, EnclosedPorch, ScreenPorch, PoolArea), and finally the sale context (MoSold, YrSold, SaleType, SaleCondition).

The dataset is famous for the subtlety of its missing values. A missing BsmtQual does not mean an unknown quality; it means the house has no basement. The same logic applies to GarageType, PoolQC, Fence, MiscFeature, FireplaceQu and several other columns. Imputing them with the mode would silently destroy information.

Recommended approach

The pipeline must address, in order: the target transformation, the separation of column types, the handling of missing values with their domain-specific meaning, the encoding of ordinal versus nominal categoricals, and finally the choice of model.

The target SalePrice is right-skewed and spans more than an order of magnitude. A log transformation stabilises the variance and brings the distribution closer to a normal one. Either apply np.log1p(y) and predict the log price (then exponentiate at the end with np.expm1), or wrap the regressor in TransformedTargetRegressor(func=np.log1p, inverse_func=np.expm1). The Kaggle metric is the RMSE between the logarithm of the predicted price and the logarithm of the observed price; at these price levels log1p and log are numerically indistinguishable.

Missing values must be handled column by column. For columns whose absence is structural — BsmtQual, GarageType, PoolQC, Fence, MiscFeature, FireplaceQu, Alley, MasVnrType — replace NaN with the explicit string "None". For genuinely missing numerical values such as LotFrontage and MasVnrArea, use a median imputation, optionally grouped by neighbourhood for LotFrontage. For GarageYrBlt, encode the absence by the year of construction of the house, or by zero, depending on the model used.
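The three imputation rules above can be sketched in pandas on a toy frame. The rows are hypothetical and only reuse the real column names; the fallback of GarageYrBlt to YearBuilt is one of the two options mentioned in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the real Ames column names.
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown"],
    "LotFrontage":  [70.0, np.nan, 80.0, 50.0, np.nan],
    "BsmtQual":     ["Gd", np.nan, "TA", np.nan, "Ex"],
    "GarageType":   ["Attchd", "Detchd", np.nan, np.nan, "Attchd"],
    "GarageYrBlt":  [2003.0, 1976.0, np.nan, np.nan, 1999.0],
    "YearBuilt":    [2003, 1976, 1950, 1920, 1999],
})

# 1. Structural absence: NaN means "no basement" or "no garage".
structural = ["BsmtQual", "GarageType"]
toy[structural] = toy[structural].fillna("None")

# 2. Genuinely missing numeric value: median of the same neighbourhood.
neighbourhood_median = toy.groupby("Neighborhood")["LotFrontage"].transform("median")
toy["LotFrontage"] = toy["LotFrontage"].fillna(neighbourhood_median)

# 3. No garage: fall back to the construction year of the house.
toy["GarageYrBlt"] = toy["GarageYrBlt"].fillna(toy["YearBuilt"])

print(toy)
```

In a cross-validated pipeline the neighbourhood medians should be computed on the training rows only and then applied to the validation rows.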

The ordinal categoricals carry a natural order that the encoding should respect. OverallQual and OverallCond are already stored as integers from one to ten and can be used as they are. The quality codes (BsmtQual, KitchenQual, ExterQual, HeatingQC, …) should be mapped to integers respecting that order, for instance Po=1, Fa=2, TA=3, Gd=4, Ex=5. The remaining nominal categoricals (Neighborhood, MSZoning, BldgType, …) go through OneHotEncoder(handle_unknown="ignore").
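The ordinal mapping can be written as a plain dictionary lookup. In this sketch the level "None" mapped to 0 (for a structurally absent basement) is an assumption added for illustration, as are the toy rows.

```python
import pandas as pd

# Order taken from the text; "None" -> 0 is an assumption of this sketch.
quality_map = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

toy = pd.DataFrame({
    "KitchenQual": ["Gd", "TA", "Ex"],
    "BsmtQual":    ["TA", None, "Fa"],   # None: the house has no basement
})

for col in ["KitchenQual", "BsmtQual"]:
    # Fill the structural absence first, then map codes to ordered integers.
    toy[col] = toy[col].fillna("None").map(quality_map)

print(toy)
```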

from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge, Lasso, ElasticNet
import numpy as np

reg = TransformedTargetRegressor(
    regressor=Ridge(alpha=10.0),
    func=np.log1p,
    inverse_func=np.expm1,
)

Three model families are worth comparing. A regularised linear regression — Ridge, Lasso, or ElasticNet — produces an excellent log-RMSE once the preprocessing is correct, and the linear coefficients are interpretable. A random forest captures non-linear interactions but tends to under-perform tuned linear models on this dataset because of the moderate sample size. A gradient boosting machine — GradientBoostingRegressor, LGBMRegressor, or XGBRegressor — typically gives the strongest single-model score, especially with mild hyperparameter tuning (n_estimators between 500 and 2000, small learning_rate, max_depth between 3 and 6).

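A strategy that works well on Kaggle is to combine a regularised linear model with a boosted-tree model, either by averaging their predictions or by stacking them under a small meta-model (StackingRegressor). The two families make different mistakes, so the combination is often better than either alone.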
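Combining a linear model with a boosted-tree model can be sketched with scikit-learn's StackingRegressor. The data is synthetic, and the choice of RidgeCV as meta-model as well as all hyperparameter values are assumptions of this example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed Ames matrix.
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=10.0)),
        ("gbr", GradientBoostingRegressor(n_estimators=100, max_depth=3,
                                          random_state=0)),
    ],
    final_estimator=RidgeCV(),   # learns how to weight the two base models
    cv=5,                        # base predictions come from out-of-fold fits
)

scores = cross_val_score(stack, X, y, cv=3, scoring="r2")
print("stacked R^2 per fold:", scores)
```

The internal cv argument matters: the meta-model is trained on out-of-fold predictions, which prevents it from simply trusting whichever base model overfits the most.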

Classical pitfalls

The first pitfall is to apply a generic SimpleImputer(strategy="most_frequent") on every column. Doing so collapses the structural meaning of NaN for the basement, garage, pool and fence columns and silently destroys explanatory power.

The second pitfall is to forget the log transformation of the target. Without it, the model spends its capacity fitting the most expensive houses, the residuals are heteroscedastic, and the leaderboard score is far from optimal.

The third pitfall is to mishandle MSSubClass, which is stored as an integer but is in fact a categorical building-type code. Cast it to string before encoding.

The fourth pitfall is to evaluate the model with the standard RMSE rather than the log-RMSE. Always evaluate on the same scale on which the leaderboard is computed.
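A log-RMSE metric is a few lines of NumPy. The function name log_rmse and the two example prices are hypothetical; make_scorer is the scikit-learn helper that turns such a function into a scoring object.

```python
import numpy as np
from sklearn.metrics import make_scorer

def log_rmse(y_true, y_pred):
    """RMSE between log-prices; both arrays must be strictly positive."""
    diff = np.log(np.asarray(y_pred)) - np.log(np.asarray(y_true))
    return float(np.sqrt(np.mean(diff ** 2)))

# greater_is_better=False makes scikit-learn negate the value so that
# "higher is better" still holds inside cross_val_score or GridSearchCV.
log_rmse_scorer = make_scorer(log_rmse, greater_is_better=False)

y_true = np.array([100_000.0, 200_000.0])   # hypothetical prices
y_pred = np.array([110_000.0, 180_000.0])

plain_rmse = float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
print("RMSE in dollars:", plain_rmse)
print("log-RMSE:", log_rmse(y_true, y_pred))
```

The two numbers live on different scales: the first is in dollars and dominated by expensive houses, the second measures relative error and is the one comparable to the leaderboard.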

The fifth pitfall is the presence of outliers. Two well-known sales of properties with GrLivArea above 4,000 square feet but extremely low prices distort linear models. Removing them from the training set is a documented and accepted practice.
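Filtering such rows is a one-line boolean mask in pandas. The toy rows below are hypothetical, and the price cut-off of 300,000 dollars is an assumption chosen so that a large and genuinely expensive house is kept.

```python
import pandas as pd

# Hypothetical training rows: two very large houses sold at low prices,
# one large house sold at a high price, and one ordinary house.
train = pd.DataFrame({
    "GrLivArea": [1500, 4700, 5600, 4300],
    "SalePrice": [180_000, 185_000, 160_000, 750_000],
})

# Flag only the rows that are both very large and unusually cheap.
is_outlier = (train["GrLivArea"] > 4000) & (train["SalePrice"] < 300_000)
train_clean = train.loc[~is_outlier].reset_index(drop=True)

print("removed rows:", int(is_outlier.sum()))
print(train_clean)
```

The filter applies to the training set only; test rows are never removed.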

Going further

Kaggle competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Original paper: De Cock, D. (2011). Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education, 19(3).
sklearn.compose.TransformedTargetRegressor.
sklearn.linear_model.ElasticNet, Ridge, Lasso.
sklearn.ensemble.StackingRegressor.

Project 4 — MNIST Handwritten Digits

Problem and dataset

The fourth and final project is the historical benchmark of pattern recognition: MNIST, the dataset of handwritten digits compiled by Yann LeCun, Corinna Cortes and Christopher Burges from earlier NIST scans. The version used here is the CSV conversion published on Kaggle by oddrationale. Each row of mnist.csv is a single handwritten digit, size-normalised and centred in a $28 \times 28$ grayscale image. Pixel intensities are integers between $0$ (background, rendered black by the gray colormap) and $255$ (full ink, rendered white).

The CSV layout puts the label first: the column label holds the digit between $0$ and $9$, followed by one column per pixel named by spatial coordinates (1x1, 1x2, …, 28x28), for a total of $784$ feature columns. The standard split provides 60,000 training images and 10,000 test images.

This project is a multi-class classification problem with ten balanced classes. It is the natural bridge from the tabular world of chapters 1–3 to the image-and-pixel world that the deep learning chapters will explore.

Loading, reshaping and visualising

A few lines of code separate the pixels from the label, reshape an image, and display it. These are the operations to internalise before any model is fitted.

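import matplotlib.pyplot as plt

# df is assumed to be the DataFrame read from mnist.csv
X = df.drop(columns="label").to_numpy()   # pixel matrix, one row per image
y = df["label"].to_numpy()                # digit labels 0-9

image = X[0].reshape(28, 28)              # flat 784-vector back to a 28 x 28 grid

plt.imshow(image, cmap="gray")
plt.title(y[0])
plt.axis("off")
plt.show()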

Visualising several digits at once with subplots is a simple but invaluable habit: it lets you check that the labels match the images, that no row is corrupted, and that the dataset has been loaded correctly.

fig, axes = plt.subplots(2, 5, figsize=(10, 4))
# axes.flat walks the 2 x 5 grid row by row, so panel k shows image k
for k, ax in enumerate(axes.flat):
    ax.imshow(X[k].reshape(28, 28), cmap="gray")
    ax.set_title(y[k])
    ax.axis("off")
plt.tight_layout()
plt.show()

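Pixel values are typically rescaled to the unit interval by dividing by $255$. This matters for gradient-based methods (logistic regression, neural networks), whose convergence and effective regularisation strength depend on the scale of the inputs. It leaves KNN unchanged, because every distance is divided by the same constant, and it is harmless for tree-based methods.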

X = X / 255.0

Recommended approach

A reasonable progression on MNIST traverses three families of classical models, each chosen to highlight a different idea from the previous chapters.

A K-nearest-neighbours classifier with $k=3$ or $k=5$ on the raw normalised pixels reaches around $97\%$ test accuracy. It is the simplest possible baseline and a beautiful illustration of the principle that, in pixel space, similar digits are close to each other. Its drawback is its inference cost: it stores the entire training set and computes 60,000 distances per prediction.

A multinomial (softmax) logistic regression, LogisticRegression(solver="lbfgs", max_iter=1000), reaches around $92\%$ accuracy. With the lbfgs solver the multinomial formulation is the default for multi-class targets, and the explicit multi_class="multinomial" argument is deprecated in recent scikit-learn releases. The model is small (one weight vector per class) and lends itself to a striking visualisation: each class's coefficient vector, reshaped to $28 \times 28$, looks like a caricatural template of the corresponding digit.
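The reshaping of coefficient vectors into per-class templates can be sketched without downloading MNIST. As a stand-in, the example uses scikit-learn's bundled load_digits dataset, whose images are $8 \times 8$ with intensities from 0 to 16; on the real data the reshape would be (10, 28, 28) instead.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Stand-in dataset: 8 x 8 digits bundled with scikit-learn (not MNIST itself).
digits = load_digits()
X = digits.data / 16.0          # intensities rescaled to [0, 1]
y = digits.target

clf = LogisticRegression(max_iter=2000).fit(X, y)

# coef_ has one row of 64 weights per class; reshape each row to an image.
templates = clf.coef_.reshape(10, 8, 8)
print(templates.shape)
```

Passing templates[d] to plt.imshow with a diverging colormap shows which pixels argue for digit d (positive weights) and which argue against it (negative weights).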

A random forest with several hundred trees and no depth limit reaches around $97\%$ accuracy. A gradient-boosted ensemble (LGBMClassifier, XGBClassifier) reaches similar territory at higher cost.

The next conceptual step — convolutional neural networks reaching $99.5\%$ and beyond — belongs to the deep learning chapters of this manual.

A final classical idea worth exploring on MNIST is dimensionality reduction before classification. Applying PCA(n_components=50) to the 784-pixel input retains roughly 80 to 85 per cent of the variance (reaching $95\%$ takes about 150 components) and cuts the cost of KNN prediction or logistic regression training by an order of magnitude. KNN accuracy is essentially unchanged, while logistic regression may lose a little. This is a direct application of chapter 4.

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("pca", PCA(n_components=50)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
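Rather than fixing the number of components in advance, the cumulative explained variance can guide the choice. The sketch below again uses the bundled $8 \times 8$ load_digits data as a stand-in for MNIST, so the resulting count applies to that dataset only; the $95\%$ target is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in dataset: 64-pixel digits bundled with scikit-learn (not MNIST itself).
X = load_digits().data / 16.0

pca = PCA().fit(X)                                   # keep every component
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%.
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("components needed for 95% of the variance:", n_95)
```

scikit-learn offers a shortcut for the same idea: PCA(n_components=0.95) selects the number of components from a variance fraction.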

Classical pitfalls

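The first pitfall is to skip pixel normalisation when using gradient-based models such as logistic regression or neural networks. On raw inputs ranging from $0$ to $255$ the optimiser converges slowly, often hitting max_iter, and the effective strength of the regularisation changes. KNN is the exception: dividing every pixel by the same constant rescales all distances equally and leaves the neighbours unchanged.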

The second pitfall is to evaluate accuracy on the training set and read it as a generalisation score. MNIST is clean enough that flexible models such as KNN with a small $k$ or an unpruned random forest reach close to $100\%$ training accuracy. Always hold out a test set or use cross-validation.

The third pitfall is to examine only the global accuracy. The confusion matrix on MNIST is far more informative: digits $4$ and $9$ are confused, $3$ and $5$ are confused, $7$ and $1$ are confused. These confusions are systematic and tell you much more about the model than a single number.
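Extracting the most frequent confusion from the matrix takes a few lines. The sketch runs on the bundled $8 \times 8$ load_digits data as a stand-in for MNIST, so the pair it reports is specific to that dataset and to this split.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset: 8 x 8 digits bundled with scikit-learn (not MNIST itself).
digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data / 16.0, digits.target,
    test_size=0.3, stratify=digits.target, random_state=0)

y_pred = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr).predict(X_te)

# Rows are true digits, columns are predicted digits.
cm = confusion_matrix(y_te, y_pred, labels=np.arange(10))

# Zero the diagonal so that only the errors remain, then find the largest cell.
errors = cm.copy()
np.fill_diagonal(errors, 0)
true_digit, predicted_digit = np.unravel_index(errors.argmax(), errors.shape)

print(cm)
print("most frequent confusion: true", true_digit, "predicted as", predicted_digit)
```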

The fourth pitfall is inference cost. A KNN with $k=3$ on 60,000 training images is too slow for production. Reduce dimensionality with PCA, or use a BallTree index, or move to a parametric model such as logistic regression or a neural network.

Going further

Kaggle dataset: https://www.kaggle.com/datasets/oddrationale/mnist-in-csv
The original MNIST page: http://yann.lecun.com/exdb/mnist/
sklearn.neighbors.KNeighborsClassifier, sklearn.linear_model.LogisticRegression.
sklearn.decomposition.PCA.
The companion Kaggle competition Digit Recognizer — same data, leaderboard format: https://www.kaggle.com/c/digit-recognizer.

Closing the chapter

The four projects above span the methodological space of supervised machine learning. Mercedes-Benz drills high-dimensional sparse regression and the importance of regularisation. Stroke Prediction drills imbalanced classification and the inadequacy of accuracy as a sole metric. Ames Housing drills domain-aware preprocessing, target transformation and the discipline of model stacking. MNIST drills image-shaped tabular data, dimensionality reduction, and sets the stage for the deep-learning chapters to come.

Treat each project as a self-contained brief. Build the pipeline yourself, with an explicit train/test split and a Pipeline object that bundles preprocessing and model. Keep your random seed fixed. Cross-validate before you tune. Read the residuals or the confusion matrix before you celebrate the score. The number on the leaderboard is the by-product of a clean protocol, not the goal of the exercise.