Machine Learning 4 — Data preparation
:::tip Kaggle notebook
The full executable code for this chapter is available on Kaggle. French and Chinese versions are available from the home page.
:::
In real life, datasets are rarely clean. Missing values, categorical variables, imbalanced classes, target leakage... This chapter covers everything needed to turn raw data into something a model can digest.
Why this chapter?
You'll learn:
- to handle missing values (drop vs impute);
- to encode categorical variables (One-Hot, ordinal, native CatBoost);
- to use `ColumnTransformer` + `Pipeline`: the industrial pattern;
- to recognise and avoid target leakage;
- to handle imbalanced classes (SMOTE, `class_weight`, PR curve);
- to tune hyperparameters with `GridSearchCV`.
Missing values
First reflex: locate the NaNs.
```python
df.isna().sum()               # NaN count per column
df.isna().any(axis=1).sum()   # rows with at least one NaN
```
Two main strategies:
1. Drop incomplete rows (dropna). Simple, but you lose data and may introduce bias if missing values are not random.
2. Impute (replace with a plausible value). For numeric: median (more outlier-robust than the mean). For categorical: mode (most frequent value).
```python
from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')
```
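A minimal, self-contained sketch of the two imputers on a toy frame (the column names and values here are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with a gap in each column type
df = pd.DataFrame({
    'bill_length': [39.1, np.nan, 45.2, 40.3],
    'island': ['Biscoe', 'Dream', np.nan, 'Biscoe'],
})

num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')

# fit_transform learns the statistic, then fills the gaps
df['bill_length'] = num_imputer.fit_transform(df[['bill_length']]).ravel()
df['island'] = cat_imputer.fit_transform(df[['island']]).ravel()

print(df['bill_length'].tolist())  # median 40.3 fills the NaN
print(df['island'].tolist())       # mode 'Biscoe' fills the NaN
```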
Encoding categorical variables
ML algorithms expect numbers. A column species = "Adelie" or island = "Biscoe" must be converted.
One-Hot Encoding
For a variable with *k* distinct modalities, One-Hot encoding creates *k* binary columns, one per modality. `pd.get_dummies` is the easiest:
```python
df_encoded = pd.get_dummies(df, columns=['island'])
```
But in production, prefer scikit-learn's OneHotEncoder with handle_unknown='ignore': if an unseen category appears at test time, the model doesn't crash.
:::warning No integer encoding without thought
Encoding modalities as 1, 2, 3 suggests an order or distance between categories. That's wrong for islands, colours, etc. Always use One-Hot for nominal variables.
:::
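For variables that *do* carry a genuine order (sizes, grades, ratings), scikit-learn's `OrdinalEncoder` with an explicit `categories` list is appropriate. A short sketch on a made-up size column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered variable: shirt sizes carry a meaningful order
df = pd.DataFrame({'size': ['S', 'L', 'M', 'S']})

# Pass the order explicitly; never rely on alphabetical order
enc = OrdinalEncoder(categories=[['S', 'M', 'L']])
print(enc.fit_transform(df[['size']]).ravel())  # [0. 2. 1. 0.]
```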
ColumnTransformer + Pipeline
The industrial pattern to handle numerical and categorical features in the same Pipeline:
```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_features = X.select_dtypes(include='number').columns.tolist()
cat_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

preprocessor = ColumnTransformer([
    ('num', num_pipe, num_features),
    ('cat', cat_pipe, cat_features),
])

model = Pipeline([
    ('prep', preprocessor),
    ('classifier', RandomForestClassifier()),
])
```
Benefits: no data leakage, handles unseen categories, clean code, and it works directly with `cross_val_score` and `GridSearchCV`.
Target leakage: the trap
Imagine a Student Performance dataset that contains G1 and G2 (term grades) as features, and G3 (final grade) as target.
First reflex: throw all columns at the model. R² obtained: 0.95.
Too good to be true. G1 and G2 are almost the target — knowing the first two terms trivially predicts the final grade.
This is target leakage: we included a variable that wouldn't actually be available at prediction time. Without G1 and G2, R² drops to 0.2-0.3 — the real performance.
:::warning The reflex to develop
For each variable, ask yourself: "Would this variable be available when I want to predict for real?" If not, drop it.
:::
Imbalanced classes
On a credit card fraud dataset, only 0.17% of transactions are fraudulent. A model that always says "not fraud" has 99.83% accuracy and is useless.
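The accuracy trap is easy to reproduce with scikit-learn's `DummyClassifier` on a synthetic sample (the 0.2% fraud rate below is invented to mimic the problem):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical 1000-transaction sample, 2 frauds
y = np.array([0] * 998 + [1] * 2)
X = np.zeros((1000, 1))  # features don't matter for a constant predictor

clf = DummyClassifier(strategy='most_frequent').fit(X, y)
pred = clf.predict(X)  # always predicts "not fraud"

print(accuracy_score(y, pred))  # 0.998 -- looks great
print(recall_score(y, pred))    # 0.0 -- catches zero frauds
```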
Three strategies
1. Undersampling: reduce the majority class.
```python
from imblearn.under_sampling import RandomUnderSampler

sampler = RandomUnderSampler()
X_res, y_res = sampler.fit_resample(X, y)
```
2. Oversampling: SMOTE generates synthetic points by interpolation between neighbours.
```python
from imblearn.over_sampling import SMOTE

sampler = SMOTE()
X_res, y_res = sampler.fit_resample(X, y)
```
3. Weighting: tell the model to penalise errors on the minority class more.
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(class_weight='balanced')
```
For XGBoost, the equivalent is `scale_pos_weight = n_negative / n_positive`.
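A minimal sketch of computing that ratio (the class counts are invented for illustration); the result would be passed as `scale_pos_weight` to an XGBoost classifier:

```python
import numpy as np

# Hypothetical imbalanced labels: 990 negatives, 10 positives
y = np.array([0] * 990 + [1] * 10)

n_negative = int((y == 0).sum())
n_positive = int((y == 1).sum())
scale_pos_weight = n_negative / n_positive
print(scale_pos_weight)  # 99.0 -- errors on the minority class weigh 99x more
```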
Adapted metrics
Accuracy is useless on imbalanced classes. Prefer:
- Precision/Recall/F1 per class (cf. `classification_report`)
- PR curve rather than ROC (ROC is over-optimistic when false positives are diluted by the large negative class)
- Explicit confusion matrix
Stratify in train_test_split
To keep the class proportions in train and test:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```
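A quick check on synthetic labels that `stratify=y` preserves the class ratio in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 90 + [1] * 10)  # 10% minority class
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())  # 0.1 in both splits
```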
GridSearchCV: tuning hyperparameters
Instead of setting max_depth=5 by guesswork, GridSearchCV systematically tries all combinations from a grid and returns the best by cross-validation:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
}

grid = GridSearchCV(
    RandomForestClassifier(class_weight='balanced'),
    param_grid,
    scoring='f1',   # F1 instead of accuracy on imbalanced classes
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
```
For very large grids, RandomizedSearchCV samples instead of enumerating.
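A sketch of the `RandomizedSearchCV` variant on a synthetic dataset, using `scipy.stats.randint` as a sampling distribution and `n_iter` to bound the budget (the grid values here are illustrative):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        'n_estimators': randint(50, 300),  # a distribution, not a list
        'max_depth': [None, 5, 10, 20],    # lists are sampled uniformly
    },
    n_iter=10,  # only 10 combinations tried, however large the space
    cv=3,
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```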
CatBoost: native categorical boosting
When the dataset is dominated by categorical variables (Mushrooms, Adult Census), CatBoost handles them natively, no manual One-Hot encoding needed.
```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6, verbose=False)
model.fit(X_train, y_train, cat_features=cat_indices)
```
Instead of One-Hot, CatBoost uses regularised target statistics per category, computed on random orderings to prevent information leakage.
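CatBoost's actual implementation is more involved, but the core idea of an *ordered* target statistic can be sketched in a few lines: each row is encoded using only the labels of rows that precede it in a random permutation, so a row never sees its own target. This is a simplified illustration, not CatBoost's code:

```python
import numpy as np

def ordered_target_stat(categories, targets, prior=0.5, a=1.0, seed=0):
    """Encode each row with a smoothed running mean of the target,
    computed only over earlier rows in a random permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))
    sums, counts = {}, {}
    encoded = np.empty(len(categories))
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + a * prior) / (n + a)  # smoothed toward the prior
        sums[c] = s + targets[i]                # update stats AFTER encoding
        counts[c] = n + 1
    return encoded

cats = ['red', 'red', 'blue', 'red', 'blue']
ys = [1, 0, 1, 1, 0]
print(ordered_target_stat(cats, ys))
```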