Machine Learning 4 — Data Preparation

The previous three chapters built models on datasets that had been carefully curated for pedagogical purposes: complete numeric tables, balanced classes, and target variables ready to be predicted. Real datasets rarely arrive in such an obliging form. They contain missing values, mix numeric and categorical columns of disparate scales, and frequently exhibit class imbalance where the event of interest accounts for a tiny fraction of the rows. Most practical machine-learning projects devote the bulk of their effort not to choosing exotic algorithms, but to turning a raw table into a clean, properly encoded, well-balanced dataset on which a standard model can shine.

This chapter walks through the canonical preprocessing pipeline using a sequence of richer datasets than those of the introductory chapters. The Palmer Penguins dataset opens the discussion of missing values and basic categorical encoding. Titanic, with its mix of variable types and well-known missing-age problem, lets us bring those tools together. Mushrooms pushes the conversation towards purely categorical data and motivates CatBoost. The Student Performance and Adult Census datasets exemplify mixed numeric–categorical tables of moderate dimension. Finally, Credit Card Fraud and Telecom Churn confront us with the central difficulty of imbalanced classes and the techniques developed to address it: stratified splits, undersampling, oversampling, SMOTE, class weighting and threshold tuning.

Visualising classes before modelling

Before any preprocessing decision, it is good practice to look at the data, paying particular attention to how well the classes separate along each variable. The Penguins dataset is ideal for this exercise because its three species (Adelie, Chinstrap, Gentoo) admit clean morphological signatures.

import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(
    data=df,
    x="bill_length_mm",
    hue="species",
    bins=30,
    kde=True,
    element="step",
)
plt.title("Distribution of bill length by species")
plt.show()

Coloured histograms, boxplots and violin plots all answer the same question — do the classes separate along this variable? — with different visual contracts. The histogram emphasises overlap; the boxplot emphasises medians and outliers; the violin plot reveals the full shape of each distribution. A scatter plot of two variables coloured by class provides a 2D view of the joint separation, while sns.pairplot(df, hue="species") shows every pair at once.

sns.pairplot(
    df,
    vars=["bill_length_mm", "bill_depth_mm",
          "flipper_length_mm", "body_mass_g"],
    hue="species",
    diag_kind="kde",
)

For three variables, a 3D scatter plot with Plotly is often more revealing than three separate 2D plots. Unlike Matplotlib, the resulting figure is interactive — the reader can rotate the cloud and visually identify the planes that would separate the species.

import plotly.express as px

fig = px.scatter_3d(
    df,
    x="bill_length_mm", y="bill_depth_mm", z="flipper_length_mm",
    color="species", opacity=0.8,
)
fig.update_traces(marker=dict(size=5))
fig.show()

Spending ten minutes on these visualisations before fitting any model is rarely wasted: they reveal whether a problem is "easy" (well-separated clusters), "hard" (heavy overlap), or pathological (a single dominant class), and they often suggest which variables deserve the closest preprocessing care.

Handling missing values

Real datasets are full of holes. The Penguins dataset deliberately preserves a handful of incomplete rows so we can practise on them. The first task is always to identify the missing values before deciding what to do about them.

df.isna().sum()              # per-column count of NaN
df.isna().any(axis=1).sum()  # number of incomplete rows

The isna() method returns a Boolean DataFrame in which True marks each missing cell. Summing it column-wise reveals which variables are affected and by how much. In Penguins, the morphological columns (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) and sex carry a few NaN values — typically because the field measurement could not be recorded.
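
As a minimal sketch, the two counts can be tried on a hypothetical four-row table (the values below are invented for illustration and are not taken from Penguins):

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; values are illustrative only.
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "body_mass_g": [3750.0, 3800.0, np.nan, 3450.0],
    "sex": ["Male", None, "Female", "Female"],
})

missing_per_column = toy.isna().sum()                 # one count per column
incomplete_rows = int(toy.isna().any(axis=1).sum())   # rows with at least one gap

print(missing_per_column)
print("incomplete rows:", incomplete_rows)            # 2
```

The second row has two gaps but is counted once, which is why the row count can be smaller than the sum of the per-column counts.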

Three families of strategies are available.

Deletion. When the proportion of missing values is small and the dataset is large enough to absorb the loss, the simplest option is to drop the offending rows:

df = df.dropna()

For Penguins, this strategy removes a handful of rows out of ~344, which is acceptable. On a dataset like Titanic, where roughly 20% of the Age values are missing, dropping rows would discard a significant fraction of the signal and is rarely a good idea.

Constant imputation. Replacing missing values by a fixed scalar — 0, the column mean, the median, or a sentinel like -1 — is fast and interpretable. The median is robust to outliers and is the default choice for most numeric columns:

df["Age"] = df["Age"].fillna(df["Age"].median())

For categorical columns, the mode (most frequent modality) plays the same role.

Model-based imputation. When the relationships between variables carry information about the missing entries, a small predictive model can fill the holes more intelligently. The k-nearest-neighbours imputer is the classical example:

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
df_num_imputed = imputer.fit_transform(df_numeric)

For each missing entry, KNNImputer finds the five nearest training rows that have a value for that feature, measuring distance with a NaN-aware Euclidean metric over the features both rows have observed, and fills the gap with the average of those neighbours' values. It is usually more accurate than a global mean when the variables are correlated, but it is more costly and only applicable to numeric columns.
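
The behaviour can be checked by hand on a toy matrix. This is a sketch under stated assumptions: the array X_toy is invented, and n_neighbors=2 is chosen only to keep the arithmetic short.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: the second column is exactly twice the first,
# and the last row is missing its second value.
X_toy = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
])

knn = KNNImputer(n_neighbors=2)
X_knn = knn.fit_transform(X_toy)

# The two nearest rows (judged on the observed first column) are [3, 6] and [2, 4],
# so the gap is filled with the mean of 6 and 4.
print(X_knn[3, 1])                # 5.0
print(np.nanmean(X_toy[:, 1]))    # 4.0, what a global-mean fill would give
```

Neither value recovers the 8 that the underlying relationship suggests, but the neighbour-based fill lands closer than the global mean because it uses the correlation between the two columns.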

The choice of imputation strategy matters more than it seems. Imputing with the global mean shrinks variance and biases downstream correlation estimates; imputing the median preserves the median but distorts higher moments. In a serious project, you should fit the imputer only on the training set to avoid leakage from the test set into the imputed values.
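
A leakage-free version of median imputation might look like the following sketch. The column values are hypothetical, and SimpleImputer stands in for whichever imputer the project uses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical numeric column with gaps; values are illustrative only.
ages = pd.DataFrame(
    {"Age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 2.0, 27.0, 14.0, 58.0]}
)

ages_train, ages_test = train_test_split(ages, test_size=0.3, random_state=0)

imputer = SimpleImputer(strategy="median")
imputer.fit(ages_train)                        # statistic computed on training rows only
train_filled = imputer.transform(ages_train)
test_filled = imputer.transform(ages_test)     # test gaps reuse the training median

print("training median:", imputer.statistics_[0])
```

The test rows never influence the statistic, so the evaluation reflects what would happen on genuinely unseen data.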

Converting non-numeric variables

Most classical machine-learning algorithms (linear models, k-NN, neural networks, scikit-learn's tree-based models) operate on numeric arrays. Categorical columns must therefore be encoded before training. The dtypes attribute of a DataFrame quickly reveals which columns are textual:

df.dtypes
cat_features = df.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
num_features = [c for c in df.columns if c not in cat_features]

Manual mapping with `map`

When a variable takes a small number of well-identified modalities and you want explicit control over the encoding — particularly when the values carry an order — a dictionary passed to Series.map is the clearest approach:

df["sex_num"] = df["sex"].map({"Male": 0, "Female": 1})
df["species_num"] = df["species"].map({"Adelie": 0, "Chinstrap": 1, "Gentoo": 2})

This produces a single integer column per categorical variable. It is appropriate for binary variables and for target variables in classification problems (where the order is meaningless and arbitrary integers do not mislead the model). It is not appropriate for non-ordinal explanatory variables with more than two modalities, because the resulting integers would suggest a false ordering — the model would assume that Gentoo is "twice" Chinstrap, which is nonsense.
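
One practical caveat with Series.map is that any value absent from the dictionary silently becomes NaN. The sketch below uses a hypothetical column with an inconsistent spelling to show how to detect this.

```python
import pandas as pd

# Hypothetical column with an inconsistent spelling in the last row.
sex = pd.Series(["Male", "Female", "Female", "male"])

encoded = sex.map({"Male": 0, "Female": 1})
print(encoded.tolist())        # [0.0, 1.0, 1.0, nan]

# Values absent from the dictionary become NaN, so list them explicitly.
unmapped = sex[encoded.isna()].unique().tolist()
print(unmapped)                # ['male']
```

Checking for unmapped values right after the call avoids introducing new missing values by accident.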

One-hot encoding with `get_dummies`

For non-ordinal categorical variables with more than two modalities, the standard solution is one-hot encoding: create one binary column per modality.

X = pd.get_dummies(X, columns=["island", "sex"])

A column island with three modalities Biscoe, Dream, Torgersen becomes three columns island_Biscoe, island_Dream, island_Torgersen, each holding a binary indicator (zeros and ones, or False and True in recent pandas versions unless dtype=int is passed). The model now sees three separate indicators, with no spurious order. The price to pay is the explosion of the column count: a variable with $k$ modalities produces $k$ new columns.
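
A small sketch on a toy frame (four invented rows reusing the island names from the text) makes the resulting columns concrete; the drop_first option is shown as one way to keep $k - 1$ columns instead of $k$.

```python
import pandas as pd

# Toy frame with the three island names used in the text.
toy = pd.DataFrame({"island": ["Biscoe", "Dream", "Torgersen", "Dream"]})

# dtype=int forces 0/1 output regardless of the pandas version.
onehot = pd.get_dummies(toy, columns=["island"], dtype=int)
print(onehot.columns.tolist())
# ['island_Biscoe', 'island_Dream', 'island_Torgersen']

# drop_first=True removes the first indicator, which the others already determine.
reduced = pd.get_dummies(toy, columns=["island"], dtype=int, drop_first=True)
print(reduced.columns.tolist())
# ['island_Dream', 'island_Torgersen']
```

Dropping one indicator matters mostly for linear models, where the full set of $k$ columns is perfectly collinear with the intercept; tree-based models are indifferent to it.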

For datasets dominated by categorical variables, the dimensionality cost can become prohibitive. The Mushrooms dataset, for example, contains 22 categorical columns. Inspecting their cardinality is a useful preliminary step:

for col in df.columns:
    print(f"{col:25s}: {df[col].nunique()} distinct values")

In Mushrooms, most columns carry between 2 and 12 modalities, but veil-type is constant (a single value across all rows). Such constant columns carry no information and should simply be dropped before encoding — otherwise they create a redundant column of ones.

A practical recipe: split your columns into cat_features and num_features using select_dtypes, drop constant or near-constant columns, encode the rest, and concatenate. For very high-cardinality categories (say, more than 50 modalities), one-hot encoding becomes impractical and target encoding or CatBoost are usually preferable.
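
The first steps of this recipe can be sketched as follows, on a hypothetical all-categorical frame in which the column named veil plays the role of a constant column:

```python
import pandas as pd

# Hypothetical all-categorical frame; "veil" is constant across rows.
toy = pd.DataFrame({
    "cap": ["x", "b", "x", "f"],
    "veil": ["p", "p", "p", "p"],
    "odor": ["a", "n", "n", "a"],
})

cardinality = toy.nunique()
constant_cols = cardinality[cardinality <= 1].index.tolist()
print(constant_cols)                      # ['veil']

# Encode what remains: 3 modalities for cap + 2 for odor = 5 indicator columns.
encoded = pd.get_dummies(toy.drop(columns=constant_cols), dtype=int)
print(encoded.shape)                      # (4, 5)
```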

Tree-based methods and multiclass classification

Decision trees and their ensembles handle multiclass classification natively, with no need to decompose the problem into one-versus-rest or one-versus-one binary problems. At each internal node the algorithm searches for the split that most increases class purity (measured by Gini or entropy), regardless of how many classes are involved. Leaves contain a class distribution, and the predicted class is simply the majority of the leaf reached by the observation.
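
To see why the number of classes does not matter, it helps to compute the Gini impurity $1 - \sum_k p_k^2$ by hand. The helper names gini and split_impurity below are illustrative and are not part of scikit-learn.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(left, right):
    """Size-weighted Gini impurity of the two children of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Three balanced classes: impurity at the root is 1 - 3 * (1/3)^2 = 2/3.
parent = np.array([0, 0, 1, 1, 2, 2])
print(gini(parent))

# A split isolating class 0 leaves a pure child and a two-class child:
# (2/6) * 0 + (4/6) * 0.5 = 1/3.
left, right = parent[:2], parent[2:]
print(split_impurity(left, right))
```

The same formula applies whether there are two classes or twenty, which is why no one-versus-rest decomposition is needed.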

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)

forest = RandomForestClassifier(n_estimators=200, max_depth=5)
forest.fit(X_train, y_train)

On a properly preprocessed Penguins or Titanic dataset, a random forest trained with sensible defaults already reaches very high accuracy. The art that follows lies in making the preprocessing as transparent and reusable as possible — which leads us naturally to pipelines and column transformers.

Mushrooms and CatBoost

The Mushrooms dataset is striking in two ways: every explanatory variable is categorical, and the classification problem is almost trivial — most modern algorithms reach 100% accuracy. After dropping the constant veil-type column and one-hot encoding the rest, a random forest with shallow trees solves the problem cleanly:

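import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv(".../mushrooms.csv")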
X = df.drop(columns=["class", "veil-type"])
y = df["class"].to_numpy()
X = pd.get_dummies(X, columns=X.columns)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier(n_estimators=200, max_depth=5)
model.fit(X_train, y_train)

Even when the test accuracy is perfect, you should hesitate before claiming you have "solved" the problem. The dataset was built from a botanical key — the model is essentially memorising the very rules that defined the labels. Real foraging is far less forgiving, and a model that is slightly wrong about a deadly Amanita is not a model anyone should trust.

CatBoost offers an interesting alternative for tables dominated by categorical variables. Where get_dummies mechanically explodes the dimensionality, CatBoost handles categorical columns internally using ordered target statistics, a variant of target encoding designed to mitigate leakage. The resulting model is often more compact and can be more accurate, and it accepts string columns directly:

from catboost import CatBoostClassifier

X = df.drop(columns=["class"])
y = df["class"].to_numpy()
cat_features = X.columns.tolist()  # everything is categorical

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = CatBoostClassifier(verbose=False)
model.fit(X_train, y_train, cat_features=cat_features)

The cat_features argument tells CatBoost which columns to treat as categorical. The model can then be applied to the test set without any manual encoding step, which is both convenient and less error-prone — a single misalignment between the columns of pd.get_dummies on the train and test sets is a classic source of bugs.
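
The misalignment mentioned above is easy to reproduce. In this sketch the two frames are hypothetical, and reindexing on the training columns is shown as one common remedy when get_dummies has been applied separately to each subset.

```python
import pandas as pd

# Hypothetical train/test frames: "Torgersen" never appears in the test rows.
train = pd.DataFrame({"island": ["Biscoe", "Dream", "Torgersen"]})
test = pd.DataFrame({"island": ["Dream", "Biscoe"]})

train_enc = pd.get_dummies(train, columns=["island"], dtype=int)
test_enc = pd.get_dummies(test, columns=["island"], dtype=int)
print(test_enc.columns.tolist())          # the Torgersen indicator is missing

# Reindex the test matrix on the training columns, filling absent indicators with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

Encoding the full table before splitting, as in the snippets of this chapter, avoids the problem for one-hot indicators because they do not depend on the target.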

Mixed numeric–categorical tables: the Student dataset

The Student Performance dataset records demographic, family, academic and behavioural information about Portuguese high-school students, with the target variable G3 (final-year grade out of 20). It is the prototypical mixed table: a handful of numeric columns (age, studytime, absences, …) coexist with many categorical columns (school, sex, address, Mjob, Fjob, internet, …).

select_dtypes is the cleanest way to separate the two:

cat_features = X.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
num_features = [c for c in X.columns if c not in cat_features]

For a CatBoost regressor, this single split is all the preprocessing we need:

from catboost import CatBoostRegressor

model = CatBoostRegressor(verbose=False)
model.fit(X_train, y_train, cat_features=cat_features)

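For an XGBoost regressor, which does not accept string columns by default (recent releases offer opt-in categorical support through enable_categorical=True on columns of pandas category dtype), the simplest route is to one-hot encode beforehand and redo the split on the encoded table:

from xgboost import XGBRegressor

X = pd.get_dummies(X, columns=cat_features)
# Split again: the earlier X_train still holds the raw string columns.
# Pass the same random_state to both splits to compare models on identical rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = XGBRegressor()
model.fit(X_train, y_train)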

A direct comparison between the two on the Student dataset is instructive: CatBoost typically reaches a slightly lower mean absolute error than XGBoost, with markedly less code. The lesson is not that one library beats the other (XGBoost is excellent) but that matching the algorithm to the structure of the data saves effort and can improve accuracy.

The Student dataset contains two numeric columns, G1 and G2 (intermediate grades during the year), that are very strongly correlated with the target G3. Leaving them in the input gives a misleadingly low error: the model is essentially copying G2. A more honest evaluation, at least when the prediction must be made before any grade is known, drops them. This is a lightweight version of the classical target leakage trap.

Correlations and exploratory analysis

Once the numeric block of a mixed dataset has been identified, a heatmap of correlations is a quick way to inspect redundancy and target dependence.

corr = df[num_features].corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()

The corr() method returns a square matrix with values in $[-1, +1]$. The colour map coolwarm, centred at zero, makes positive correlations red and negative correlations blue, with a pale neutral tone for no linear correlation (which does not by itself imply independence). For ranking redundancy, plotting corr.abs() is more readable: only the strength of the relationship matters.

When two variables are nearly perfectly correlated, one of them carries no additional information. sns.clustermap(corr.abs()) reorders rows and columns so that correlated variables sit next to each other, revealing blocks of mutually informative features that you may want to merge or prune.

Three correlation coefficients are commonly used. Pearson measures linear relationships between continuous variables and is sensitive to outliers. Spearman measures monotonic (not necessarily linear) relationships by working on ranks, which makes it robust to outliers and applicable to ordinal variables. Kendall's tau also measures monotonic association via concordant and discordant pairs and is preferred on small samples or with many ties. The default corr() in pandas uses Pearson; pass method="spearman" or method="kendall" to switch.
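
The difference between linear and monotonic association shows up clearly on a synthetic example. The cubic relationship below is an arbitrary choice made for illustration.

```python
import numpy as np
import pandas as pd

x = np.arange(1, 11, dtype=float)
toy = pd.DataFrame({"x": x, "y": x ** 3})   # monotonic but strongly non-linear

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")    # below 1: the relationship is not linear
print(f"Spearman: {spearman:.3f}")   # 1.000: the ranks agree perfectly
```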

Imbalanced classes

So far we have implicitly assumed that the classes are roughly balanced. The Credit Card Fraud Detection dataset shatters this assumption: out of 284 807 transactions, only 492 are fraudulent, roughly 0.17% of the data. In this regime, accuracy stops being a useful metric.

A model that systematically predicts "no fraud" is right 99.83% of the time and detects exactly zero frauds. Its accuracy looks excellent, but its recall on the positive class is zero and the model is useless. Whenever classes are imbalanced, the right metrics to monitor are precision, recall and F1-score on the minority class, supplemented by the confusion matrix and the ROC AUC.
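
These figures can be reproduced without loading the dataset, using only the class counts quoted above. The label vector below is synthetic and the always-negative prediction is a deliberately trivial baseline.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Label vector with the class counts quoted in the text (492 frauds out of 284 807).
n_total, n_fraud = 284_807, 492
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

y_pred = np.zeros(n_total, dtype=int)     # a "model" that always answers "no fraud"

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
print(f"accuracy: {acc:.4f}")             # 0.9983
print(f"recall:   {rec:.4f}")             # 0.0000
```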

A first attempt with a default decision tree on the raw fraud data already exposes the problem:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y,
)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_hat = model.predict(X_test)

print(classification_report(y_test, y_hat))
print(confusion_matrix(y_test, y_hat))

Note the stratify=y argument: it forces train_test_split to preserve the class proportions in both subsets. Without it, the number of frauds in the test set fluctuates from one split to the next, and on a smaller dataset with 0.17% positives a 20% test set might end up with only a handful of frauds, possibly none, making evaluation unstable or meaningless. stratify should be a default reflex on imbalanced classification.
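
The effect can be observed on synthetic labels. The sketch assumes 100 positives among 10 000 rows, with a placeholder feature column that plays no role.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 100 positives among 10 000 rows (1%); features are placeholders.
y = np.zeros(10_000, dtype=int)
y[:100] = 1
X = np.arange(10_000).reshape(-1, 1)

_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("positives in test, plain split:     ", y_test_plain.sum())   # varies with the seed
print("positives in test, stratified split:", y_test_strat.sum())   # 20
```

With stratification the test set receives exactly 20% of the positives whatever the seed, whereas the plain split leaves that number to chance.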

Resampling techniques

The standard library to deal with class imbalance in scikit-learn workflows is imbalanced-learn (imblearn), which exposes resamplers compatible with the scikit-learn API and provides its own Pipeline (necessary because resamplers change the number of rows and return a new y along with X during fit, which the regular scikit-learn Pipeline does not support).

!pip install -U imbalanced-learn

Undersampling

Undersampling reduces the majority class by random subsampling so that the two classes appear in equal numbers during training:

from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

model = Pipeline([
    ("rus", RandomUnderSampler()),
    ("dtc", DecisionTreeClassifier()),
])
model.fit(X_train, y_train)

The advantage is simplicity and speed: training is fast because the resampled training set is small. The disadvantage is that we throw away most of the majority-class data — and with it potentially useful information. Undersampling is typically a good first move on very large imbalanced datasets, where the majority class still has plenty of representatives after subsampling.

Oversampling and SMOTE

Oversampling takes the opposite approach: it augments the minority class until the classes are balanced. The naive form duplicates minority observations, which prevents the loss of majority data but tends to overfit (the same fraud is seen many times during training). SMOTE (Synthetic Minority Oversampling Technique) is the standard remedy. Instead of duplicating, SMOTE generates synthetic minority points by interpolating between an existing minority observation and one of its nearest minority neighbours.

from imblearn.over_sampling import SMOTE

model = Pipeline([
    ("smote", SMOTE()),
    ("dtc", DecisionTreeClassifier()),
])
model.fit(X_train, y_train)

The new points lie inside the convex hull of the minority class in feature space, which usually yields a more diverse training set than mere duplication. SMOTE is implemented in imblearn.over_sampling.SMOTE and integrates seamlessly into the imblearn pipeline.
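
The interpolation step itself fits in a few lines of NumPy. The function smote_like below is a simplified illustration written for this chapter, not the imblearn implementation, which offers further options such as the sampling strategy and the number of neighbours.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest points

    base = rng.integers(0, len(X_min), size=n_new)               # random minority points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]    # one neighbour for each
    lam = rng.random((n_new, 1))                                 # position on the segment
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Hypothetical minority class of six 2-D points.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
X_new = smote_like(X_min, n_new=10)
print(X_new.shape)                                 # (10, 2)
```

Each synthetic point lies on a segment joining two existing minority points, so none can fall outside the region the minority class already occupies.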

Resampling must be applied inside the cross-validation fold, not before. Resampling the entire training set and then splitting it would let synthetic minority points generated from a particular observation leak into the validation fold containing that observation, inflating the score. The imblearn pipeline handles this correctly by re-running the resampler on each training fold.

Class weighting

An alternative to resampling is to leave the data alone and instead reweight the loss function so that errors on the minority class are penalised more heavily. Most scikit-learn classifiers expose a class_weight parameter for this purpose:

DecisionTreeClassifier(class_weight="balanced")
RandomForestClassifier(class_weight="balanced")

With "balanced", the weight assigned to each class is inversely proportional to its frequency, so that the total weight of each class is equal. The construction of the tree (Gini or entropy computations) is modified accordingly, and the resulting splits favour the minority class. CatBoost provides the equivalent option auto_class_weights="Balanced".
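
The weights behind the "balanced" option follow the formula n_samples / (n_classes * class_count). As a quick check on toy labels (90 negatives and 10 positives, chosen for round numbers), the manual computation can be compared with scikit-learn's helper:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 90 negatives and 10 positives.
y = np.array([0] * 90 + [1] * 10)

# "balanced" heuristic: n_samples / (n_classes * count of each class).
counts = np.bincount(y)
manual = len(y) / (2 * counts)
auto = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)

print(manual)             # approximately [0.556, 5.0]
print(manual * counts)    # [50. 50.]: each class carries the same total weight
```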

The advantage of class weighting is that no synthetic data is ever introduced, which makes the procedure simpler and more reproducible. The disadvantage is that the optimisation problem may converge less cleanly, especially on linear models with strong imbalance.

Threshold tuning

A complementary lever, often underused, is the decision threshold applied to predicted probabilities. By default, scikit-learn classifiers predict the positive class of a binary problem when predict_proba(...)[:, 1] > 0.5. On an imbalanced problem, this default is rarely optimal: a model trained on imbalanced data tends to output probabilities concentrated near zero, and lowering the threshold can trade some precision for a substantial gain in recall.

y_proba = model.predict_proba(X_test)[:, 1]
threshold = 0.1
y_hat = (y_proba > threshold).astype(int)
print(classification_report(y_test, y_hat))

In a fraud-detection setting, missing a fraud (false negative) is far more costly than incorrectly flagging a legitimate transaction (false positive), so lowering the threshold is usually desirable. On Credit Card Fraud, an undersampled decision tree with threshold = 0.1 typically reaches a recall above 0.85 on the test set, at the price of a moderate drop in precision. Note that the threshold only has an effect if the tree outputs graded probabilities: a fully grown tree ends in pure leaves and predicts only 0 or 1, so limit its growth (max_depth, min_samples_leaf) or use an ensemble. Conversely, a SMOTE-trained model often produces probabilities that lean too eagerly towards the positive class, and raising the threshold (e.g. to $0.7$) is the appropriate corrective move.
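
One point worth checking before tuning a threshold on a decision tree is whether its probabilities are graded at all. The sketch below uses synthetic data from make_classification as a stand-in for the fraud table (an assumption made so that the example is self-contained).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the fraud table (assumption).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# An unpruned tree ends in pure leaves, so its probabilities are only 0 or 1
# and every threshold strictly between 0 and 1 yields the same predictions.
full_proba = full_tree.predict_proba(X_te)[:, 1]
print(np.unique(full_proba))

# A depth-limited tree typically keeps mixed leaves, hence graded probabilities.
small_proba = small_tree.predict_proba(X_te)[:, 1]
print(np.unique(small_proba))
```

If the first line prints only 0 and 1, moving the threshold changes nothing and the tree needs to be constrained first.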

Resampling, class weighting and threshold tuning are three independent levers. Use them in combination: a stratified split first, then a resampler or class_weight to fight the imbalance during training, and finally a threshold chosen on the validation set to optimise the metric you actually care about — F1, recall at fixed precision, or expected business cost.
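
Choosing the threshold on a validation set can be sketched with precision_recall_curve. The data are synthetic and the logistic regression is an arbitrary stand-in for any classifier exposing predict_proba; maximising F1 is only one possible criterion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption: roughly 5% positives).
X, y = make_classification(n_samples=4000, n_features=10, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y,
                                            random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_val = clf.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds; drop the last point.
precision, recall, thresholds = precision_recall_curve(y_val, proba_val)
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)   # guard against division by zero

best_threshold = thresholds[np.argmax(f1)]
f1_best = f1_score(y_val, (proba_val >= best_threshold).astype(int))
f1_default = f1_score(y_val, (proba_val >= 0.5).astype(int))
print(f"best threshold {best_threshold:.3f}: F1 {f1_best:.3f} (0.5 gives {f1_default:.3f})")
```

The selected threshold should then be frozen and applied once to the test set, so that the reported score is not tuned on the data used to measure it.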

The Telecom Churn dataset

Customer churn, the event of a customer leaving a service, is a less extreme but still imbalanced problem: in the Telecom Churn dataset, roughly 14% of customers have churned. The dataset mixes numeric usage statistics (account length, total day minutes, total eve calls) and categorical fields (State, International plan, Voice mail plan, Churn). The recipe combines techniques from previous sections:

df = pd.read_csv(".../churn.csv")
X = df.drop(columns=["Churn"])
y = df["Churn"]

cat_features = X.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
X = pd.get_dummies(X, columns=cat_features)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y,
)

model = RandomForestClassifier(class_weight="balanced")
model.fit(X_train, y_train)

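CatBoost makes this even more compact, since we can skip the get_dummies step entirely (the split must then be made on the original, un-encoded X so that the cat_features columns still exist):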

model = CatBoostClassifier(auto_class_weights="Balanced")
model.fit(X_train, y_train, cat_features=cat_features, verbose=False)

On Churn, class_weight="balanced" typically lifts the recall on the positive class (churners) from around 0.5 to around 0.7, with a moderate cost in overall accuracy — exactly the kind of trade-off a marketing team is happy to make.

The Adult Census dataset

The Adult Census Income dataset (also called Census Income) contains roughly 32 000 rows from the US Census, with the binary target income taking values <=50K or >50K. It mixes numeric variables (age, education.num, capital.gain, capital.loss, hours.per.week, fnlwgt) and categorical variables (workclass, education, marital.status, occupation, relationship, race, sex, native.country).

A clean baseline reuses everything we have built so far:

df = pd.read_csv(".../adult.csv")
df["income"] = df["income"].map({"<=50K": 0, ">50K": 1})

X = df.drop(columns=["income"])
y = df["income"]

cat_features = X.select_dtypes(include=["object", "category", "bool"]).columns.tolist()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = CatBoostClassifier(verbose=False)
model.fit(X_train, y_train, cat_features=cat_features)

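Switching to XGBoost requires the now-familiar one-hot step, followed by a fresh split on the encoded table:

from xgboost import XGBClassifier

X = pd.get_dummies(X, columns=cat_features)
# Split again so that X_train holds the encoded columns.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = XGBClassifier()
model.fit(X_train, y_train)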

Both reach an accuracy in the high 80s and provide a respectable starting point for the more nuanced metrics — precision and recall on >50K, ROC AUC, calibration — that one would monitor in a real analysis of inequality predictors.

ColumnTransformer and Pipeline: the production-grade workflow

In a serious project, the ad hoc preprocessing snippets shown above should be replaced by a single, fittable, reusable object. ColumnTransformer applies different preprocessing pipelines to different subsets of columns; Pipeline chains preprocessing and modelling so that the entire workflow can be cross-validated, grid-searched, and serialised as a single artefact.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", num_pipe, num_features),
    ("cat", cat_pipe, cat_features),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf",  RandomForestClassifier(n_estimators=200)),
])

model.fit(X_train, y_train)

Three properties make this design powerful. No leakage: scalers and imputers are fitted only on the training fold inside cross-validation. Reusability: the same object can transform new data with model.predict(X_new) without re-running any of the snippets above. Composability: every component is a hyperparameter to be tuned.
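
The first two properties can be seen on a small synthetic table. Everything in this sketch is hypothetical: the column names age and plan, the target definition, and the unseen category "enterprise".

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed table with gaps; all names and values are illustrative.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "plan": rng.choice(["basic", "plus", "pro"], n),
})
demo.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
target = (demo["plan"] == "pro").astype(int).to_numpy()

prep = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["age"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["plan"]),
])
pipe = Pipeline([("prep", prep),
                 ("clf", RandomForestClassifier(n_estimators=50, random_state=0))])

# Each fold refits the imputers, scaler and encoder on its own training part.
scores = cross_val_score(pipe, demo, target, cv=5)
print(scores.mean())

# A category unseen during fit is encoded as all zeros instead of raising an error.
pipe.fit(demo, target)
new_row = pd.DataFrame({"age": [np.nan], "plan": ["enterprise"]})
prediction = pipe.predict(new_row)
print(prediction)
```

The new row contains both a missing value and an unknown category, yet a single predict call handles it because every preprocessing step travels with the model.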

GridSearchCV couples to the pipeline naturally, and the step__parameter double-underscore syntax exposes inner hyperparameters for tuning:

from sklearn.model_selection import GridSearchCV

grid = {
    "clf__n_estimators": [100, 200, 400],
    "clf__max_depth":    [None, 5, 10],
}
search = GridSearchCV(model, grid, cv=5, scoring="f1")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

The grid is explored exhaustively across the cross-validation folds, the best combination is retained, and search.best_estimator_ is a fully refitted pipeline ready for prediction. Note that scoring="f1" assumes a binary target whose positive class is labelled 1; for a multiclass target, use an averaged variant such as "f1_macro". On imbalanced problems, swap the regular Pipeline for imblearn.pipeline.Pipeline and insert a SMOTE or a RandomUnderSampler step before the classifier.

Exercises

Exercise 1 — Missing values in Penguins. Load the Penguins dataset and identify the columns that contain missing values using df.isna().sum() and df.isna().any(axis=1).sum(). Decide which strategy is most appropriate (deletion, mean/median imputation, KNN imputation), justify your choice, and apply it.

Exercise 2 — Decision tree on Penguins. After cleaning, separate species from the explanatory variables, encode the categorical variables (sex, island), and train a DecisionTreeClassifier. Evaluate the accuracy on a test set and visualise the tree with plot_tree. Comment on the rules learnt.

Exercise 3 — Random forest on Penguins. Train a RandomForestClassifier(n_estimators=200, max_depth=5) on the same data. Compare its performance with the single decision tree of Exercise 2.

Exercise 4 — Survival on the Titanic. Predict the Survived column of the Titanic dataset with a model of your choice. Pay particular attention to the Age column (missing values), the Sex and Embarked columns (encoding), and report accuracy, precision, recall and the confusion matrix.

Exercise 5 — Mushrooms with get_dummies and a random forest. Drop the veil-type column, one-hot encode the rest, and train a random forest. Achieve at least 99% accuracy. Comment on whether you would trust this model in real foraging.

Exercise 6 — Mushrooms with CatBoost. Repeat Exercise 5 using CatBoostClassifier with cat_features=X.columns.tolist(). Compare training time and accuracy with the random forest.

Exercise 7 — Student grades, CatBoost vs XGBoost. Predict G3 on the Student dataset, first with CatBoostRegressor and then with XGBRegressor (using pd.get_dummies for the latter). Compare MAE and RMSE. Repeat after dropping G1 and G2 and discuss target leakage.

Exercise 8 — Correlation heatmap. Compute the Pearson correlation matrix of the numeric columns of the Student dataset and display it with sns.heatmap and sns.clustermap (try both corr and corr.abs()). Identify the most correlated pairs and comment on potential redundancy.

Exercise 9 — Fraud detection baseline. Train a DecisionTreeClassifier on Credit Card Fraud with a stratified train/test split. Report the confusion matrix and the per-class precision and recall. Explain why the accuracy is misleading.

Exercise 10 — Resampling and threshold. Build an imblearn.pipeline.Pipeline that combines RandomUnderSampler (or SMOTE) and a DecisionTreeClassifier on Credit Card Fraud. Plot precision and recall as a function of the decision threshold and pick the threshold that maximises the F1-score on the validation set.

Exercise 11 — Class weights on Churn. On the Telecom Churn dataset, compare a RandomForestClassifier() and a RandomForestClassifier(class_weight="balanced"). Report the recall on the positive class for both. Repeat with CatBoostClassifier(auto_class_weights="Balanced").

Exercise 12 — Adult income. Predict income on the Adult Census dataset using a CatBoost classifier. Inspect the confusion matrix and the per-class metrics. Try a class_weight/auto_class_weights variant and a threshold sweep — does either improve recall on >50K?

Exercise 13 — A reusable pipeline. Build a ColumnTransformer that imputes the median for numeric columns and the most frequent value for categorical columns, then scales the numerics and one-hot encodes the categoricals. Plug it into a Pipeline with a RandomForestClassifier and tune n_estimators and max_depth with GridSearchCV. Apply the resulting pipeline to Titanic, Adult and Churn without any per-dataset glue code.

Going further

pandas missing-data guide — isna, dropna, fillna.
sklearn.impute.SimpleImputer and KNNImputer — built-in imputation strategies.
sklearn.preprocessing — StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder, OrdinalEncoder.
pandas.get_dummies — one-hot encoding from a DataFrame.
sklearn.compose.ColumnTransformer — heterogeneous preprocessing across columns.
sklearn.pipeline.Pipeline — chaining preprocessing and modelling.
sklearn.model_selection.GridSearchCV and StratifiedKFold — hyperparameter search with stratified folds.
imbalanced-learn documentation — RandomUnderSampler, RandomOverSampler, SMOTE, SMOTEENN, and the imblearn pipeline.
CatBoost documentation — native handling of categorical variables, auto_class_weights, GPU training.