Machine Learning 5 — Synthesis exercises
:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →
French and Chinese versions available from the home page.
:::
Four synthesis exercises to put everything into practice. Each dataset stresses a specific skill from earlier chapters. The notebook provides reference solutions: solid baselines rather than Kaggle-optimal entries, clean skeletons you can refine.
Mercedes-Benz Greener Manufacturing
Type: high-dimensional regression.
The dataset has ~370 mixed features (categoricals encoded as letters, plus binary flags) used to predict `y`, a performance score from a test bench.
What's expected
- Load the dataset, check the target and column types.
- Build a `ColumnTransformer`: `OneHotEncoder` for categoricals, `VarianceThreshold` to drop near-constant binaries.
- Train a Ridge or `GradientBoostingRegressor` in a `Pipeline`.
- Evaluate with MAE / RMSE / R² on a test split.
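The steps above can be sketched as follows. This is a minimal skeleton with synthetic stand-in data (the column names `X0`, `X1`, `b1`, `b2` are illustrative, not the real Mercedes columns):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in: letter-coded categoricals plus binaries, one near-constant.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "X0": rng.choice(list("abc"), n),
    "X1": rng.choice(list("xyz"), n),
    "b1": rng.integers(0, 2, n),
    "b2": (rng.random(n) < 0.01).astype(int),  # near-constant binary
})
y = rng.normal(100, 10, n)

pre = ColumnTransformer([
    # One-hot the categoricals; ignore unseen categories at predict time.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["X0", "X1"]),
    # Drop binary columns whose variance is (near) zero.
    ("bin", VarianceThreshold(threshold=0.01), ["b1", "b2"]),
])

model = Pipeline([("pre", pre), ("reg", Ridge(alpha=1.0))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
print("MAE:", mean_absolute_error(y_te, pred))
print("R²:", r2_score(y_te, pred))
```

On the real data, the same skeleton applies once the letter-coded and binary columns are listed explicitly.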
Why this exercise?
It's the archetype of a high-dimensional dataset where preprocessing takes longer than the model. Variable selection (via VarianceThreshold or Lasso) plays a key role.
Stroke Prediction
Type: imbalanced classification + missing values.
Predict stroke occurrence from demographic, medical and lifestyle factors. ~5,000 observations, only ~5% positive.
What's expected
- Load, drop `id`. Check the imbalance.
- Handle `bmi` (real NaNs) and `smoking_status='Unknown'` (disguised NaN).
- Pipeline with `SimpleImputer` + `OneHotEncoder` + `StandardScaler`.
- Model with `class_weight='balanced'`, or an `imblearn` pipeline with SMOTE.
- Evaluate with confusion matrix, precision/recall/F1, and PR curve.
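A minimal sketch of this pipeline, using synthetic stand-in data (only three of the real columns are mimicked here; the `'Unknown'`-to-NaN trick and `class_weight='balanced'` are the important parts):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "age": rng.normal(50, 15, n),
    "bmi": np.where(rng.random(n) < 0.05, np.nan, rng.normal(27, 5, n)),
    "smoking_status": rng.choice(["never", "smokes", "Unknown"], n),
})
y = (rng.random(n) < 0.05).astype(int)  # ~5% positives, like the real data

# Turn the disguised NaN into a real one so the imputer handles both cases.
df["smoking_status"] = df["smoking_status"].replace("Unknown", np.nan)

pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]), ["age", "bmi"]),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh", OneHotEncoder(handle_unknown="ignore"))]),
     ["smoking_status"]),
])

clf = Pipeline([
    ("pre", pre),
    # class_weight='balanced' reweights the rare positive class.
    ("lr", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

X_tr, X_te, y_tr, y_te = train_test_split(df, y, stratify=y, random_state=0)
clf.fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te)))
print(classification_report(y_te, clf.predict(X_te), zero_division=0))
```

To use SMOTE instead, swap the outer `Pipeline` for `imblearn.pipeline.Pipeline` and insert a `SMOTE()` step between preprocessing and the model.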
Why this exercise?
The archetype of strong imbalance where accuracy is meaningless. An FN (missed stroke) costs much more than an FP (false alarm). The choice of threshold and validation metric becomes critical.
House Prices — Ames
Type: complex regression with mixed numerical/categorical features.
Predict SalePrice from ~80 characteristics of houses in Ames, Iowa. Many structural missing values (NaN means "no garage", "no basement", etc.).
What's expected
- Load, drop `Id`.
- Pipeline: `SimpleImputer` with median (num) and `'missing'` (cat), One-Hot, scale.
- Model: `GradientBoostingRegressor` or XGBoost.
- Evaluate using RMSE on log(SalePrice), the Kaggle competition metric, which penalises relative errors better.
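The log-target evaluation above can be sketched like this: fit on `log1p` of the price and measure RMSE in log space. Data here is synthetic (lognormal-ish prices), just to show the mechanics:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
# Synthetic prices: exponential of a linear signal plus noise.
price = np.exp(12 + 0.3 * X[:, 0] + rng.normal(0, 0.2, 400))

X_tr, X_te, y_tr, y_te = train_test_split(X, price, random_state=0)

model = GradientBoostingRegressor(random_state=0)
model.fit(X_tr, np.log1p(y_tr))          # train on the log target
pred_log = model.predict(X_te)

# RMSE in log space: large relative errors on cheap and expensive
# houses are penalised comparably.
rmse_log = np.sqrt(mean_squared_error(np.log1p(y_te), pred_log))
print("RMSE on log(SalePrice):", rmse_log)
```

Remember to apply `np.expm1` to predictions if you need prices back in the original scale.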
Going further
A few feature engineering ideas that significantly improve the score:
- `TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF`: total surface is the most predictive variable.
- `Age = YrSold - YearBuilt`, `RemodAge = YrSold - YearRemodAdd`.
- `HasGarage = GarageArea > 0`, `HasPool = PoolArea > 0`.
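These feature-engineering ideas translate directly into a small pandas helper, assuming the standard Ames column names from the Kaggle competition:

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the engineered features described above (copy, don't mutate)."""
    out = df.copy()
    out["TotalSF"] = out["TotalBsmtSF"] + out["1stFlrSF"] + out["2ndFlrSF"]
    out["Age"] = out["YrSold"] - out["YearBuilt"]
    out["RemodAge"] = out["YrSold"] - out["YearRemodAdd"]
    out["HasGarage"] = (out["GarageArea"] > 0).astype(int)
    out["HasPool"] = (out["PoolArea"] > 0).astype(int)
    return out

# Tiny example row to check the arithmetic.
toy = pd.DataFrame({
    "TotalBsmtSF": [800], "1stFlrSF": [1000], "2ndFlrSF": [500],
    "YrSold": [2008], "YearBuilt": [1990], "YearRemodAdd": [2000],
    "GarageArea": [400], "PoolArea": [0],
})
print(add_features(toy)[["TotalSF", "Age", "RemodAge", "HasGarage", "HasPool"]])
# TotalSF=2300, Age=18, RemodAge=8, HasGarage=1, HasPool=0
```

Apply it to both train and test sets before the preprocessing pipeline.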
MNIST
Type: image classification, gateway to vision.
10,000 28×28 images of handwritten digits, to classify into 10 classes.
What's expected
- Load a subsample (10,000 rows is enough for k-NN or RF).
- Visualise a few digits with `imshow`.
- Normalise pixels to `[0, 1]`.
- Train a `RandomForestClassifier` or `KNeighborsClassifier`.
- Evaluate: accuracy + 10×10 confusion matrix (very informative).
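A self-contained sketch of this workflow. For a quick runnable demo it uses scikit-learn's built-in 8×8 digits rather than the full 28×28 MNIST; swap in `fetch_openml("mnist_784")` for the real dataset:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X = X / X.max()                       # normalise pixels to [0, 1]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_te)

print("accuracy:", accuracy_score(y_te, pred))
cm = confusion_matrix(y_te, pred)     # 10×10: rows = true digit, cols = predicted
print(cm)
```

The off-diagonal cells of `cm` are where the model confuses one digit for another, which is exactly what the "Why this exercise?" section below discusses.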
Why this exercise?
To show that "tabular" ML models work surprisingly well on simple images (~95% accuracy with an RF on flat pixels). But also to motivate the move to deep learning and CNNs in the next course: a properly tuned CNN climbs above 99% on MNIST without much effort.
The confusion matrix reveals the classic confusions: 4↔9, 3↔5, 7↔1, pairs of visually similar digits.