
Machine Learning 5 — Synthesis exercises

:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →

French and Chinese versions available from the home page.
:::

Four synthesis exercises to put everything into practice. Each dataset stresses a specific skill from earlier chapters. The notebook provides reference solutions: solid baselines rather than Kaggle-optimal submissions, clean skeletons you can refine.

Mercedes-Benz Greener Manufacturing

Type: high-dimensional regression.

The dataset has ~370 mixed features (categoricals encoded as letters plus binary flags) to predict y, the time a car spends on the test bench.

What's expected

  1. Load the dataset, check the target and column types.
  2. Build a ColumnTransformer: OneHotEncoder for categoricals, VarianceThreshold to drop near-constant binaries.
  3. Train a Ridge or GradientBoostingRegressor in a Pipeline.
  4. Evaluate with MAE / RMSE / R² on a test split (see the sketch below).
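
A minimal sketch of steps 2–4, assuming the Kaggle train.csv with an ID column; the Ridge alpha and the variance threshold are illustrative choices, and categorical columns are detected by dtype rather than listed by hand.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("train.csv")                         # Mercedes train file (path assumed)
X, y = df.drop(columns=["ID", "y"]), df["y"]          # "ID" column name assumed

cat_cols = X.select_dtypes(include="object").columns  # letter-encoded categoricals
bin_cols = X.select_dtypes(exclude="object").columns  # the binary flags

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("bin", VarianceThreshold(threshold=0.01), bin_cols),  # drop near-constant binaries
])

model = Pipeline([("prep", preprocess), ("reg", Ridge(alpha=10.0))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

print("MAE :", mean_absolute_error(y_te, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_te, pred)))
print("R²  :", r2_score(y_te, pred))
```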

Why this exercise?

It's the archetype of a high-dimensional dataset where preprocessing takes more work than the model itself. Feature selection (via VarianceThreshold or Lasso) plays a key role.

Stroke Prediction

Type: imbalanced classification + missing values.

Predict stroke occurrence from demographic, medical and lifestyle factors. ~5,000 observations, only ~5% positive.

What's expected

  1. Load, drop id. Check the imbalance.
  2. Handle bmi (real NaNs) and smoking_status='Unknown' (disguised NaN).
  3. Pipeline with SimpleImputer + OneHotEncoder + StandardScaler.
  4. Model with class_weight='balanced', or imblearn pipeline with SMOTE.
  5. Evaluate with confusion matrix, precision/recall/F1, and PR curve (see the sketch below).
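
A possible skeleton for steps 2–5, assuming the usual Kaggle file name and column names (id, bmi, smoking_status, stroke). Logistic regression with class_weight='balanced' stands in here for the imblearn/SMOTE variant.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import PrecisionRecallDisplay, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("healthcare-dataset-stroke-data.csv").drop(columns=["id"])  # file name assumed
df["smoking_status"] = df["smoking_status"].replace("Unknown", np.nan)       # disguised NaN → real NaN

X, y = df.drop(columns=["stroke"]), df["stroke"]
num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

preprocess = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),     # fills bmi
                      ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

clf = Pipeline([("prep", preprocess),
                ("model", LogisticRegression(max_iter=1000, class_weight="balanced"))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

print(confusion_matrix(y_te, pred))
print(classification_report(y_te, pred, digits=3))
PrecisionRecallDisplay.from_estimator(clf, X_te, y_te)                # PR curve
plt.show()
```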

Why this exercise?

The archetype of strong imbalance where accuracy is meaningless: with only ~5% positives, a model that always predicts "no stroke" already scores ~95% accuracy while missing every stroke. An FN (missed stroke) costs much more than an FP (false alarm). The choice of threshold and validation metric becomes critical.

House Prices — Ames

Type: complex regression with mixed numerical/categorical features.

Predict SalePrice from ~80 characteristics of houses in Ames, Iowa. Many structural missing values (NaN means "no garage", "no basement", etc.).

What's expected

  1. Load, drop Id.
  2. Pipeline: SimpleImputer (median for numerical columns, constant 'missing' for categoricals), One-Hot encoding, scaling.
  3. Model: GradientBoostingRegressor or XGBoost.
  4. Evaluate using RMSE on log(SalePrice) — the Kaggle competition metric, which penalises relative rather than absolute errors (see the sketch below).
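
A sketch of the full pipeline, assuming the Kaggle train.csv with Id and SalePrice columns. Training directly on log1p(SalePrice) means the test RMSE is computed on the log scale, matching the competition metric.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("train.csv").drop(columns=["Id"])                  # Ames train file (path assumed)
X, y = df.drop(columns=["SalePrice"]), np.log1p(df["SalePrice"])    # train on the log target

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

preprocess = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="constant", fill_value="missing")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

reg = Pipeline([("prep", preprocess), ("gbr", GradientBoostingRegressor(random_state=42))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
reg.fit(X_tr, y_tr)
rmse_log = np.sqrt(mean_squared_error(y_te, reg.predict(X_te)))
print("RMSE on log(SalePrice):", rmse_log)
```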

Going further

A few feature engineering ideas that significantly improve the score (sketched in code after the list):

  • TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF: total floor area is the most predictive variable.
  • Age = YrSold - YearBuilt, RemodAge = YrSold - YearRemodAdd.
  • HasGarage = GarageArea > 0, HasPool = PoolArea > 0.
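
In code, these are only a handful of lines; the column names are the Ames ones from the bullets above, and the file path is an assumption.

```python
import pandas as pd

df = pd.read_csv("train.csv")                        # same Ames training file as above
df["TotalSF"]   = df["TotalBsmtSF"] + df["1stFlrSF"] + df["2ndFlrSF"]  # total floor area
df["Age"]       = df["YrSold"] - df["YearBuilt"]
df["RemodAge"]  = df["YrSold"] - df["YearRemodAdd"]
df["HasGarage"] = (df["GarageArea"] > 0).astype(int)
df["HasPool"]   = (df["PoolArea"] > 0).astype(int)
```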

MNIST

Type: image classification, gateway to vision.

28×28 greyscale images of handwritten digits, to classify into 10 classes (70,000 images in the full dataset).

What's expected

  1. Load a subsample (10,000 rows is enough for k-NN or RF).
  2. Visualise a few digits with imshow.
  3. Normalise pixels to [0, 1].
  4. Train a RandomForestClassifier or KNeighborsClassifier.
  5. Evaluate: accuracy + 10×10 confusion matrix (very informative); see the sketch below.
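
A minimal sketch using the mnist_784 dataset from OpenML; the 10,000-row subsample, the split and the forest size are illustrative choices, not the notebook's exact values.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = fetch_openml("mnist_784", as_frame=False, return_X_y=True)
X, y = X[:10_000] / 255.0, y[:10_000]                # subsample + scale pixels to [0, 1]

plt.imshow(X[0].reshape(28, 28), cmap="gray")        # visualise one digit
plt.title(f"label = {y[0]}")
plt.show()

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("Accuracy:", accuracy_score(y_te, pred))
print(confusion_matrix(y_te, pred))                  # 10×10: look at 4↔9, 3↔5, 7↔1
```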

Why this exercise?

To show that "tabular" ML models work surprisingly well on simple images (~95% accuracy with an RF on flat pixels). But also to motivate the move to deep learning and CNNs in the next course: a properly tuned CNN climbs above 99% on MNIST without much effort.

The confusion matrix reveals the classic confusions between visually similar digits: 4↔9, 3↔5, 7↔1.


Full notebook on Kaggle (forkable) →