
Machine Learning 4 — Data preparation

:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →

French and Chinese versions available from the home page.
:::

In real life, datasets are rarely clean. Missing values, categorical variables, imbalanced classes, target leakage... This chapter covers everything needed to turn raw data into something a model can digest.

Why this chapter?

You'll learn:

  • to handle missing values (drop vs impute);
  • to encode categorical variables (One-Hot, ordinal, native CatBoost);
  • to use ColumnTransformer + Pipeline: the industrial pattern;
  • to recognise and avoid target leakage;
  • to handle imbalanced classes (SMOTE, class_weight, PR curve);
  • to tune hyperparameters with GridSearchCV.

Missing values

First reflex: locate the NaNs.

df.isna().sum() # NaN count per column
df.isna().any(axis=1).sum() # rows with at least one NaN

Two main strategies:

1. Drop incomplete rows (dropna). Simple, but you lose data and may introduce bias if missing values are not random.

2. Impute (replace with a plausible value). For numeric: median (more outlier-robust than the mean). For categorical: mode (most frequent value).

from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')
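A minimal usage sketch, assuming X_train/X_test splits and a num_features list of numeric column names (fit on train only, then transform both, so no test statistics leak in):

# Learn the medians on the training set, apply them to both splits
X_train[num_features] = num_imputer.fit_transform(X_train[num_features])
X_test[num_features] = num_imputer.transform(X_test[num_features])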

Encoding categorical variables

ML algorithms expect numbers. A column species = "Adelie" or island = "Biscoe" must be converted.

One-Hot Encoding

For a variable with k categories, create k binary columns. pd.get_dummies is the easiest way:

df_encoded = pd.get_dummies(df, columns=['island'])

But in production, prefer scikit-learn's OneHotEncoder with handle_unknown='ignore': if an unseen category appears at test time, the model doesn't crash.
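A minimal sketch, assuming a DataFrame df with the island column (scikit-learn ≥ 1.2 for sparse_output):

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
island_ohe = ohe.fit_transform(df[['island']])  # one binary column per island
print(ohe.get_feature_names_out())              # e.g. ['island_Biscoe', ...]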

:::warning No integer encoding without thought
Encoding categories as 1, 2, 3 implies an order and a distance between them. That's wrong for islands, colours, etc. Always use One-Hot for nominal variables.
:::
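For genuinely ordered categories, integer codes do make sense. A sketch with scikit-learn's OrdinalEncoder and a hypothetical size column:

from sklearn.preprocessing import OrdinalEncoder

# Explicit order: 'small' < 'medium' < 'large' maps to 0 < 1 < 2
ord_enc = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df[['size']] = ord_enc.fit_transform(df[['size']])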

ColumnTransformer + Pipeline

The industrial pattern to handle numerical and categorical features in the same Pipeline:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

num_features = X.select_dtypes(include='number').columns.tolist()
cat_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

preprocessor = ColumnTransformer([
    ('num', num_pipe, num_features),
    ('cat', cat_pipe, cat_features),
])

model = Pipeline([
    ('prep', preprocessor),
    ('classifier', RandomForestClassifier()),
])

Benefits: preprocessing is fitted only on the training folds (no data leakage), unseen categories are handled at prediction time, the code stays clean, and the whole pipeline works with cross_val_score and GridSearchCV, as in the example below.
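For example, the whole pipeline cross-validates in one call; the preprocessing is re-fitted inside each training fold:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())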

Target leakage: the trap

Imagine a Student Performance dataset that contains G1 and G2 (term grades) as features, and G3 (final grade) as target.

First reflex: throw all columns at the model. R² obtained: 0.95.

Too good to be true. G1 and G2 are almost the target — knowing the first two terms trivially predicts the final grade.

This is target leakage: we included a variable that wouldn't actually be available at prediction time. Without G1 and G2, R² drops to 0.2-0.3 — the real performance.
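A sketch of the check, assuming a regression pipeline reg and the Student Performance features in a DataFrame X:

from sklearn.model_selection import cross_val_score

r2_leaky = cross_val_score(reg, X, y, cv=5, scoring='r2').mean()  # ~0.95
r2_honest = cross_val_score(
    reg, X.drop(columns=['G1', 'G2']), y, cv=5, scoring='r2'
).mean()  # ~0.2-0.3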

:::warning The reflex to develop
For each variable, ask yourself: "Would this variable be available when I want to predict for real?" If not, drop it.
:::

Imbalanced classes

On a credit card fraud dataset, only 0.17% of transactions are fraudulent. A model that always says "not fraud" has 99.83% accuracy and is useless.

Three strategies

1. Undersampling: reduce the majority class.

from imblearn.under_sampling import RandomUnderSampler
sampler = RandomUnderSampler()
X_res, y_res = sampler.fit_resample(X, y)

2. Oversampling: SMOTE generates synthetic minority-class points by interpolating between nearest neighbours. Resample the training set only; the test set must keep the real class distribution.

from imblearn.over_sampling import SMOTE
sampler = SMOTE()
X_res, y_res = sampler.fit_resample(X, y)

3. Weighting: tell the model to penalise errors on the minority class more.

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(class_weight='balanced')

For XGBoost, the equivalent parameter is scale_pos_weight = n_negative / n_positive.
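A sketch, assuming a binary y_train with 1 for the minority (fraud) class:

from xgboost import XGBClassifier

ratio = (y_train == 0).sum() / (y_train == 1).sum()  # n_negative / n_positive
model = XGBClassifier(scale_pos_weight=ratio)
model.fit(X_train, y_train)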

Adapted metrics

Accuracy is useless on imbalanced classes. Prefer:

  • Precision/Recall/F1 per class (see classification_report)
  • the Precision-Recall curve rather than ROC (ROC looks over-optimistic when false positives are diluted by the mass of true negatives)
  • an explicit confusion matrix (all three shown in the sketch below)
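A sketch, assuming a fitted classifier clf that exposes predict_proba:

from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_curve, average_precision_score)

y_pred = clf.predict(X_test)
y_scores = clf.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
precision, recall, _ = precision_recall_curve(y_test, y_scores)
print('AP:', average_precision_score(y_test, y_scores))  # area under the PR curve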

Stratify in train_test_split

To keep the class proportions in train and test:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

GridSearchCV: tuning hyperparameters

Instead of setting max_depth=5 by guesswork, GridSearchCV systematically tries all combinations from a grid and returns the best by cross-validation:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
}

grid = GridSearchCV(
    RandomForestClassifier(class_weight='balanced'),
    param_grid,
    scoring='f1',  # F1 instead of accuracy on imbalanced classes
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_)

For very large grids, RandomizedSearchCV samples a fixed number of combinations instead of enumerating them all:
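A sketch, sampling 20 combinations from distributions instead of walking a full grid:

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': randint(100, 500),  # any integer in [100, 500)
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': randint(2, 11),
}
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight='balanced'),
    param_dist, n_iter=20, scoring='f1', cv=5, n_jobs=-1, random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)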

CatBoost: native categorical boosting

When the dataset is dominated by categorical variables (Mushrooms, Adult Census), CatBoost handles them natively; no manual One-Hot encoding is needed.

from catboost import CatBoostClassifier

# cat_indices: positions (or names) of the categorical columns in X_train
cat_indices = [X_train.columns.get_loc(c) for c in cat_features]

model = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6, verbose=False)
model.fit(X_train, y_train, cat_features=cat_indices)

Instead of One-Hot, CatBoost uses regularised target statistics per category, computed on random orderings to prevent information leakage.
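A toy illustration of the idea (not CatBoost's exact internals): each row is encoded using only the targets of rows that precede it in a random permutation, plus a smoothing prior, so a row's own target never leaks into its encoding.

import numpy as np

def ordered_target_stats(categories, target, prior=0.5, a=1.0):
    """Encode each row from the target mean of *previous* same-category rows."""
    rng = np.random.default_rng(42)
    order = rng.permutation(len(categories))
    sums, counts = {}, {}
    encoded = np.empty(len(categories), dtype=float)
    for i in order:
        c = categories[i]
        # Smoothed mean of targets seen so far for this category
        encoded[i] = (sums.get(c, 0.0) + a * prior) / (counts.get(c, 0) + a)
        sums[c] = sums.get(c, 0.0) + target[i]
        counts[c] = counts.get(c, 0) + 1
    return encoded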


Full notebook on Kaggle (forkable) →