ML · Chapter 3
Machine Learning 3 — Classification
After two chapters dedicated to regression, we now turn to the second great family of supervised problems: classification. Where regression seeks to predict a continuous quantity, classification seeks to assign each observation to one of a finite set of categories. The simplest and most common case is binary classification, where the target variable takes only two values, traditionally encoded as and . Predicting whether a passenger of the Titanic survived, whether a tumour is benign or malignant, or whether an email is spam are all instances of this fundamental problem.
This chapter introduces classification through a progression of models of growing sophistication. We start with the Bernoulli Naive Bayes classifier, whose probabilistic formulation makes it an ideal pedagogical entry point. We then leave the realm of regression metrics behind and develop the proper toolkit for evaluating classifiers — accuracy, the confusion matrix, precision, recall, F1-score, the ROC curve and the AUC — together with the notion of a decision threshold applied to predicted probabilities. The second half of the chapter is devoted to decision trees and the powerful ensemble methods built on top of them: random forests, gradient boosting, and the modern industrial-strength libraries LightGBM and XGBoost.
Throughout the chapter we use two small didactic datasets. The titanic_mini dataset reduces the classical Titanic problem to three fully binary explanatory variables (Sex, Children, FirstClass) and a binary target (Survived). It is well suited to illustrate the inner mechanics of probability-based and tree-based models. The cancer_mini dataset, derived from the Breast Cancer Wisconsin study, contains continuous numerical features (radius, perimeter, area, texture) and a binary diagnosis target. It allows us to visualise how decision trees handle continuous variables through learnt thresholds.
From data to empirical probabilities
Before introducing any model, it is instructive to compute survival probabilities directly from the data. This empirical exercise reveals that a binary target offers a particularly elegant property: averaging a column yields a proportion, which is itself an estimate of a probability.
In pandas, selecting subgroups of a DataFrame relies on boolean conditions combined with the bitwise operators & (AND), | (OR) and ~ (NOT). Each individual condition must be parenthesised. For example:
femmes_1ere = df[(df["Sex"] == 1) & (df["FirstClass"] == 1)] p_survie = femmes_1ere["Survived"].mean()
The expression df["Sex"] == 1 produces a Boolean Series, and the resulting subset of the DataFrame is itself a DataFrame on which we can compute statistics. Because Survived is encoded as , taking its mean over a subgroup yields an estimate of . Repeating the operation for "men not in first class" provides a sharply contrasting empirical probability, foreshadowing the predictive content of these variables.
Manipulating empirical probabilities by hand prior to fitting a model is an excellent way to anchor intuition. A model that predicts a survival probability of 0.9 for first-class women aligns with what the raw counts already suggest; a model that predicts 0.4 should be questioned.
The Naive Bayes classifier
Bayes theorem and the conditional independence assumption
The Naive Bayes classifier is rooted in Bayes' theorem. For an observation and a class :
The qualifier naive refers to a simplifying — and admittedly strong — assumption: the explanatory variables are assumed to be conditionally independent given the class. Under this assumption,
and the posterior probability factorises as
Classification is then performed by the Maximum A Posteriori (MAP) rule:
Bernoulli Naive Bayes for binary features
When all explanatory variables are binary, as in titanic_mini, the natural choice for the per-feature distribution is a Bernoulli law:
For an observation ,
The parameters are estimated from the training data. To prevent the model from assigning a zero probability to a feature value that never appeared with a particular class — an unforgiving consequence of the product form — we use Laplace smoothing:
where is the number of training examples of class for which , is the total number of training examples of class , and is the smoothing parameter (controlled by the alpha argument in scikit-learn).
Smoothing acts as a soft prior: with , every count is incremented by one, which is equivalent to assuming that we have already seen one example of every combination.
The classical default is known as Laplace smoothing, while smaller values such as correspond to Lidstone smoothing. The smaller is, the more the estimate trusts the raw counts; the larger is, the closer the estimate moves to the uniform . On a small dataset like titanic_mini, smoothing makes a tangible difference: without it, a single perfect-correlation cell in the training set would force a probability to be exactly or and therefore dominate the entire prediction.
Implementation with scikit-learn
The model is exposed as BernoulliNB in sklearn.naive_bayes:
from sklearn.naive_bayes import BernoulliNB model = BernoulliNB(alpha=1.0) model.fit(X_train, y_train) y_hat = model.predict(X_test)
A first attempt to evaluate the predictions with the regression metrics from chapter 2 — mean_absolute_error, mean_squared_error, r2_score — produces numbers, but their interpretation is awkward. The MAE on predictions is mechanically equal to the proportion of misclassified observations, the RMSE is its square root, and mean_absolute_percentage_error divides by zeros whenever the true label is . These metrics are simply not the right language for a classification task; we need a different vocabulary.
Classification metrics
Accuracy
The most immediate metric is accuracy, the proportion of correctly classified observations:
When labels are encoded as , the equivalent and revealing rewriting
shows that accuracy is exactly one minus the error rate. In code:
from sklearn.metrics import accuracy_score accuracy_score(y_test, y_hat)
Confusion matrix
Accuracy provides a single global number but says nothing about the nature of the errors. The confusion matrix disaggregates predictions by their true and predicted class:
The four cells are the true negatives, false positives, false negatives and true positives. They are the foundation on which every other metric is built.
from sklearn.metrics import confusion_matrix confusion_matrix(y_test, y_hat)
Precision, recall, F1
From the confusion matrix we derive two complementary metrics. Precision measures the reliability of positive predictions:
Among the observations the model declares positive, what fraction is actually positive? Precision matters when false positives are costly — for instance, when each positive prediction triggers an expensive intervention.
Recall (also called sensitivity) measures the model's ability to detect positives:
Among the truly positive observations, what fraction does the model successfully retrieve? Recall matters when false negatives are costly — typical in medical screening, where missing a sick patient is far worse than calling back a healthy one.
The two quantities pull in opposite directions: a model that predicts "positive" for everything achieves perfect recall but dismal precision, and vice-versa. The F1-score is their harmonic mean and rewards models that balance both:
The harmonic mean penalises imbalance: is high only if precision and recall are both high. A model with precision and recall obtains , far below the arithmetic mean.
Classification report
scikit-learn bundles all per-class metrics into a single readable summary:
from sklearn.metrics import classification_report print(classification_report(y_test, y_hat))
The report gives precision, recall and F1-score for each class, plus the support (the number of true examples of each class), and aggregate averages. It is the de-facto standard one-glance evaluation of a classifier.
Probability scores and decision thresholds
predict_proba
Most classifiers do not directly emit a class label; they emit a probability for each class, from which a label is then derived. For the Titanic problem,
In scikit-learn, the predict_proba method returns these probabilities as an array of shape :
probas = model.predict_proba(X_test) y_score = probas[:, 1] # probability of the positive class
Thresholding
The final binary prediction is obtained by comparing the score to a decision threshold :
By default scikit-learn uses , but this is an arbitrary convention. Choosing a different threshold reshapes the trade-off between detection and false alarms:
t = 0.7 y_hat = (y_score >= t).astype(int)
A high threshold produces few positive predictions, which yields few false positives but more false negatives. A low threshold produces many positive predictions, which yields many true positives but also many false alarms. The threshold is therefore not a property of the model but a decision made at deployment time, depending on the relative costs of the two error types.
ROC curve and AUC
To compare classifiers without committing to a particular threshold, we examine their behaviour across all possible thresholds. For each we compute the true positive rate (also called recall) and the false positive rate:
Plotting the points as varies from down to traces the ROC curve (Receiver Operating Characteristic). The point corresponds to a very high threshold (no positive prediction), the point to a very low threshold (everything predicted positive). The diagonal is the curve of a random classifier; the closer the curve hugs the upper-left corner, the better the model.
The AUC (Area Under the Curve) condenses the ROC into a single scalar in . A central interpretation is:
The AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example.
Conventional reading grid: random, – correct, – good, very good.
In code:
from sklearn.metrics import roc_curve, roc_auc_score y_score = model.predict_proba(X_test)[:, 1] fpr, tpr, thresholds = roc_curve(y_test, y_score) auc = roc_auc_score(y_test, y_score)
Note that the ROC curve and the AUC are computed from the probability scores, not from the binary predictions — they capture the quality of the underlying ranking, which is independent of any specific threshold.
Why a curve and not a single number
The ROC curve makes explicit a fact that accuracy hides: a classifier is in reality a family of decisions, parameterised by . Two models can have identical accuracy at yet behave radically differently when the application requires a more conservative or more liberal decision. The ROC curve answers the question "what is the achievable trade-off between TPR and FPR?" without committing to a particular operating point. The choice of operating point is then made after the model has been trained, in light of the relative costs of false positives and false negatives, and possibly informed by an explicit cost function.
Decision trees
Principle
A decision tree classifies an observation by asking a succession of simple questions about its features. At the root, the algorithm chooses the question that best separates the data into homogeneous subgroups; the procedure is then repeated recursively on each subgroup until a stopping criterion is met. Each terminal node — a leaf — stores a class prediction (the majority class of the training points that fell into it) and a probability (the proportion of the positive class in the leaf).
Gini impurity and the choice of a split
The quality of a split is measured by an impurity criterion. For a binary target with in a node, the Gini index is
It vanishes when the node is pure ( or ) and is maximal at . To choose a split, the algorithm enumerates candidate variables . For each one, the dataset is partitioned according to the values of , the Gini impurity is computed in each subgroup, and the weighted average impurity after the split is
The variable that minimises is selected as the split rule of the node. On titanic_mini, comparing the Gini after splitting on Sex, on Children and on FirstClass allows us to identify which variable should be tested at the root of the tree — typically Sex, which is by far the most informative single feature on Titanic survival.
Stopping criteria and overfitting
Recursion stops in three situations: when a node is perfectly pure (); when no candidate split reduces the impurity any further; or when a structural constraint is reached. The most useful structural constraints in scikit-learn are:
max_depth: maximum depth of the tree;min_samples_split: minimum number of observations required to split a node;min_samples_leaf: minimum number of observations in any leaf.
These hyperparameters are essential to prevent the tree from learning the training set by heart. Without them, a sufficiently deep tree memorises every observation, achieving zero training error but generalising poorly. This phenomenon is the textbook example of overfitting.
Continuous variables
Decision trees are not restricted to categorical features. For a numerical variable such as texture in cancer_mini, the algorithm sorts the observed values, considers candidate thresholds between consecutive values, and selects the one that minimises the impurity of the split. The resulting rule has the form texture <= 19.2, with a threshold learned automatically from the data. Even when the input is binary , scikit-learn treats the feature as numerical, which is why the displayed splits often read Sex <= 0.5.
Implementation with scikit-learn
from sklearn.tree import DecisionTreeClassifier model = DecisionTreeClassifier(max_depth=3) model.fit(X_train, y_train)
Two visualisations help inspect the model. The graphical view shows the tree as a diagram:
from sklearn.tree import plot_tree import matplotlib.pyplot as plt plt.figure(figsize=(14, 6)) plot_tree(model, feature_names=X.columns, class_names=["0", "1"], filled=True, rounded=True, impurity=True) plt.show()
Each node displays its split rule, Gini index, sample count, class distribution, and majority class (encoded by colour). The textual view gives the rules in hierarchical form:
from sklearn.tree import export_text rules = export_text(model, feature_names=list(X.columns)) print(rules)
Cross-validation for max_depth
Choosing max_depth is a classical bias-variance trade-off. Shallow trees underfit; deep trees overfit. To select a depth that generalises well, we evaluate each candidate value with cross-validation:
from sklearn.model_selection import cross_val_score cv_scores = [] for d in range(1, 30): model = DecisionTreeClassifier(max_depth=d) scores = cross_val_score(model, X, y, cv=10, scoring="accuracy") cv_scores.append(scores.mean()) plt.plot(cv_scores)
The resulting curve typically rises, reaches a plateau, and may decline as the tree begins to overfit. The selected depth is the smallest one at which the plateau is reached, following the principle of parsimony.
Random forests
A single decision tree has low bias but high variance: small perturbations in the training data can produce very different trees. Random forests address this weakness by aggregating a large number of trees, each trained on a slightly different version of the data and the variables. The final prediction is obtained by majority vote (for classification) or by averaging probabilities; the variance of the aggregate is much smaller than that of an individual tree, while the low bias is preserved.
Two sources of randomness are combined. First, each tree is fitted on a bootstrap sample of the dataset — a draw with replacement of the same size as the original — so that every tree sees a different mixture of observations. Second, at each node of each tree, only a random subset of the variables is considered as candidate for the split. This second mechanism prevents a single dominant feature from systematically being selected at the top of every tree, which would defeat the diversity gained from the bootstrap.
from sklearn.ensemble import RandomForestClassifier model = RandomForestClassifier( n_estimators=200, max_depth=5, ) model.fit(X_train, y_train)
The main hyperparameters are n_estimators (the number of trees), max_depth (the depth of each tree), max_features (the number of variables tested at each node), min_samples_leaf (the minimum size of a leaf) and bootstrap (whether to bootstrap observations). In practice, increasing n_estimators stabilises the model up to a saturation, and max_depth remains the main lever to control overfitting.
Random forests are the proverbial "first thing to try" on a tabular dataset: they are robust to noise, accept heterogeneous variables without preprocessing, and require little tuning to deliver respectable performance.
Out-of-bag evaluation
A pleasant side-effect of bootstrap sampling is that, for any given tree, roughly one third of the training observations are not used to fit it (they were not drawn in the bootstrap sample). These so-called out-of-bag observations can serve as a free validation set: each observation is predicted by the trees that did not see it during training, and the aggregated prediction is compared to the true label. The resulting OOB score, returned by RandomForestClassifier(..., oob_score=True).oob_score_, provides an estimate of generalisation performance without splitting the data.
Gradient boosting
Where a random forest builds many trees independently and in parallel, gradient boosting builds them sequentially, each tree correcting the errors of the previous ones. The intuition is the following: an initial simple model produces an imperfect prediction; we examine the residual errors; a new small tree is fitted to correct them; the global prediction is updated; the process repeats. After many rounds, the cumulative effect of the small corrections yields a very accurate model.
A central hyperparameter is the learning rate, which scales each correction. A high learning rate produces fast progress but risks overshooting; a low learning rate produces gentle corrections that combine into a smoother, more robust model — at the cost of needing more trees to reach a given level of performance. The standard recipe of gradient boosting therefore favours modest learning rates and a large number of trees.
from sklearn.ensemble import GradientBoostingClassifier model = GradientBoostingClassifier( n_estimators=200, learning_rate=0.05, max_depth=5, )
Comparing a random forest and a gradient boosting model with cross-validation on cancer_mini typically shows that both reach high accuracy, with gradient boosting often a hair ahead — though the standard deviations across folds are small enough that the comparison can vary from one run to the next.
A note on the loss function
For binary classification, gradient boosting minimises the logistic loss (also called log-loss):
where is the predicted probability of the positive class. Each new tree is fitted to approximate the negative gradient of this loss with respect to the current prediction — hence the name gradient boosting. This perspective explains why the method generalises far beyond squared error: any differentiable loss can in principle be plugged in.
LightGBM and XGBoost
Two libraries have become industry standards for gradient boosting on tabular data: LightGBM and XGBoost. Both implement the same core idea but with substantial engineering refinements that improve speed, memory footprint and predictive performance.
LightGBM (Light Gradient Boosting Machine) grows trees leaf-wise rather than level-wise: at every step it expands the leaf that yields the largest reduction of the loss, rather than completing a full level of the tree. This strategy converges faster and often performs better with fewer trees, but is more prone to overfitting if its complexity is not controlled. Its key hyperparameter is num_leaves — alongside the usual n_estimators, learning_rate, max_depth, and the subsampling parameters subsample (rows) and colsample_bytree (columns).
from lightgbm import LGBMClassifier model = LGBMClassifier( n_estimators=300, learning_rate=0.05, verbosity=-1, )
XGBoost (eXtreme Gradient Boosting) keeps the level-wise tree construction but adds explicit regularisation of the trees (parameters reg_alpha and reg_lambda for the and penalties), a more precise second-order optimisation of the corrections, and a highly tuned implementation that scales gracefully to large datasets. It also offers GPU acceleration via tree_method="gpu_hist", which dramatically reduces training time on large corpora.
from xgboost import XGBClassifier model = XGBClassifier( n_estimators=300, learning_rate=0.05, max_depth=3, )
A typical comparison, run with 10-fold cross-validation on cancer_mini, places the four ensemble methods (random forest, sklearn gradient boosting, LightGBM, XGBoost) within a narrow band of accuracy. On larger or noisier datasets the gap usually widens in favour of LightGBM or XGBoost, which is why these two libraries dominate Kaggle leaderboards on tabular problems.
When in doubt, start with a random forest as a sanity check, then move to LightGBM or XGBoost if you need that extra percentage point of accuracy. Save the bulk of your tuning effort for the most promising of the two.
Exercises
Exercise 1 — Empirical computation of survival probabilities. Using
titanic_miniand Boolean filtering inpandas, compute empirically (i) the survival probability of a woman travelling in first class, and (ii) the survival probability of a man not travelling in first class. Compare the two values and comment. Hint: sinceSurvivedis binary, its mean over a subgroup is the empirical probability you seek.
Exercise 2 — Bernoulli Naive Bayes on Titanic. Train a
BernoulliNB(alpha=1.0)model to predictSurvived. Evaluate the predictions using regression metrics (mean_absolute_error,mean_squared_error,mean_absolute_percentage_error,r2_score) and discuss whether they are meaningful in this setting. Then propose a more appropriate evaluation by computing, by hand if necessary, the proportion of correctly classified observations.
Exercise 3 — Accuracy and classification report. Reuse the Naive Bayes model of Exercise 2 and evaluate it with
accuracy_scoreandclassification_report. Read off precision, recall, F1-score and support for both classes. Comment on the asymmetry between the two classes.
Exercise 4 — TPR at threshold 0.7 and ROC curve. From
y_testandy_score = model.predict_proba(X_test)[:, 1], do the following: (1) compute fromy_testand the prediction at threshold ; (2) also compute ; (3) plot the ROC curve fromy_testandy_scoreand superimpose the point corresponding to the threshold ; (4) display the AUC. Use the formulas and together withnp.sum((y_test == 1) & (y_hat == 1))for the counts.
Exercise 5 — Gini index and the root of the tree. On
titanic_mini, (1) compute the global Gini index of the dataset; (2) compute the Gini index after splitting on each of the variablesSex,Children,FirstClass; (3) compare the values; (4) deduce the variable chosen as the root of the tree. The split onFirstClassis provided as a worked example in the chapter; reproduce the same calculation forSexandChildren.
Exercise 6 — Decision tree on Titanic. Fit a
DecisionTreeClassifier(max_depth=3)ontitanic_mini. Display the tree graphically withplot_treeand as text withexport_text. Comment on the rules learnt and confront them with the empirical probabilities of Exercise 1.
Exercise 7 — Decision tree on
cancer_mini. Fit aDecisionTreeClassifier(max_depth=3)oncancer_mini. Display the tree and the rules. Identify the threshold learnt for the most important continuous variable.
Exercise 8 — Effect of
max_depth. Oncancer_mini, train decision trees with several values ofmax_depth(for instance ). Observe how the accuracy on the test set evolves and comment on overfitting.
Exercise 9 — Cross-validation of
max_depth. Repeat Exercise 8 withcross_val_score(model, X, y, cv=10, scoring="accuracy")formax_depthranging from to . Plot the average score against the depth and identify the smallest depth at which performance saturates.
Exercise 10 — Random forest. Train a
RandomForestClassifier(n_estimators=200, max_depth=5)oncancer_miniand/ortitanic_mini. Compare its accuracy and classification report with those of a single decision tree.
Exercise 11 — Random forest vs gradient boosting. On
cancer_mini, compareRandomForestClassifier(n_estimators=200, max_depth=5)andGradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=5)using 10-fold cross-validation. Report the mean and standard deviation of the accuracy for each model and comment.
Exercise 12 — LightGBM and XGBoost. Extend Exercise 11 by also evaluating
LGBMClassifier(n_estimators=300, learning_rate=0.05, verbosity=-1)andXGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=3). Rank the four ensemble methods oncancer_miniby mean cross-validated accuracy and comment on the spread of results across folds.
Going further
sklearn.naive_bayes.BernoulliNB— Bernoulli Naive Bayes classifier.sklearn.tree.DecisionTreeClassifier— single decision tree.sklearn.tree.plot_treeandexport_text— tree visualisation utilities.sklearn.ensemble.RandomForestClassifier— random forest.sklearn.ensemble.GradientBoostingClassifier— scikit-learn's reference implementation of gradient boosting.- LightGBM documentation and the
LGBMClassifierAPI. - XGBoost documentation and the
XGBClassifierAPI. sklearn.metrics.classification_report,confusion_matrix,roc_curve,roc_auc_score— the metrics toolkit.