# Machine Learning 3 — Classification
:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: [Open →](https://www.kaggle.com/code/pyim59/en-machine-learning-3-classification-v3-1)

French and Chinese versions are available from the home page.
:::
After regression, we tackle classification: predicting a category rather than a numerical value. Titanic survival, medical diagnosis, penguin species: many problems have a discrete output.
## Why this chapter?
You'll discover:
- the Naive Bayes classifier and empirical probability;
- the confusion matrix and classification-specific metrics (precision, recall, F1);
- the role of the decision threshold and the ROC and precision-recall curves;
- decision trees with the Gini index;
- ensemble methods: random forests and gradient boosting.
## Naive Bayes
The Naive Bayes classifier predicts the most probable class using Bayes' theorem:

$$P(Y = k \mid X_1, \dots, X_p) = \frac{P(X_1, \dots, X_p \mid Y = k) \, P(Y = k)}{P(X_1, \dots, X_p)}$$

The word "naive" comes from the strong assumption that the explanatory variables are conditionally independent given the class:

$$P(X_1, \dots, X_p \mid Y = k) = \prod_{j=1}^{p} P(X_j \mid Y = k)$$

This assumption is false in practice, but the approximation works surprisingly well and the model is very fast.
```python
from sklearn.naive_bayes import BernoulliNB

# Bernoulli Naive Bayes: suited to binary (0/1) features.
# alpha=1.0 is Laplace smoothing, which avoids zero probabilities.
model = BernoulliNB(alpha=1.0)
model.fit(X_train, y_train)
y_hat = model.predict(X_test)
```
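To make the "empirical probability" idea from the chapter overview concrete, here is a minimal sketch (the toy 0/1 dataset is invented for illustration): `BernoulliNB` essentially counts, so its smoothed per-class feature frequencies can be reproduced by hand.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary dataset, invented for illustration: two 0/1 features, one 0/1 label.
X_toy = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 1]])
y_toy = np.array([1, 1, 0, 0, 1, 0])

# Empirical P(X_j = 1 | Y = 1) with Laplace smoothing (alpha = 1):
# (count of 1s within class 1 + alpha) / (size of class 1 + 2 * alpha)
alpha = 1.0
n_pos = (y_toy == 1).sum()
p_hand = (X_toy[y_toy == 1].sum(axis=0) + alpha) / (n_pos + 2 * alpha)

nb = BernoulliNB(alpha=alpha).fit(X_toy, y_toy)
# feature_log_prob_[1] stores log P(X_j = 1 | Y = 1) for class 1
print(np.allclose(np.exp(nb.feature_log_prob_[1]), p_hand))  # True
```

Prediction then just multiplies these per-feature probabilities with the class prior, following the formulas above.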
## The confusion matrix
Accuracy (the rate of correct classifications) is intuitive but misleading. Imagine a spam detector that always says "not spam": if 99% of emails are legitimate, its accuracy is 99%... and it is useless.
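Here is a minimal sketch of that trap, using scikit-learn's `DummyClassifier` on synthetic labels (the 99/1 split is assumed for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)  # ~1% spam, as in the example above
X = np.zeros((10_000, 1))                    # features are irrelevant here

# Always predicts the majority class ("not spam")...
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
print(accuracy_score(y, dummy.predict(X)))   # ~0.99, yet it catches zero spam
```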
To understand where a model goes wrong, we use the confusion matrix:
$$\begin{array}{c|cc}
 & \text{Predicted } 0 & \text{Predicted } 1 \\
\hline
\text{Actual } 0 & TN & FP \\
\text{Actual } 1 & FN & TP
\end{array}$$

- **TP** (true positive): said yes, was yes.
- **FP** (false positive): false alarm.
- **FN** (false negative): missed detection.
- **TN** (true negative): correctly rejected.

## Precision, recall, F1

Three essential metrics derived from the matrix:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

- **High precision** = few false alarms (useful when a false positive is costly).
- **High recall** = few missed detections (useful when a false negative is dangerous).
- **F1** = a compromise between the two.

```python
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_hat))
print(confusion_matrix(y_test, y_hat))
```

## Probabilities and decision threshold

Most classifiers predict a **probability** $\hat{p} = P(Y=1 \mid X)$. The class is then decided by comparing it to a **threshold** $t$ (0.5 by default):

$$\hat{y} = \begin{cases} 1 & \text{if } \hat{p} \geq t \\ 0 & \text{otherwise} \end{cases}$$

```python
probas = model.predict_proba(X_test)
y_score = probas[:, 1]                # probability of the positive class
y_hat = (y_score >= 0.7).astype(int)  # threshold at 0.7
```

Increasing the threshold → fewer predicted positives, **fewer FPs, more FNs**. Decreasing the threshold → more predicted positives, **more FPs, fewer FNs**.

## ROC curve and AUC

The **ROC curve** plots the true positive rate $\mathrm{TPR}(t)$ against the false positive rate $\mathrm{FPR}(t)$ for all possible thresholds $t$:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \quad \mathrm{FPR} = \frac{FP}{FP + TN}$$

The **AUC** (*Area Under the Curve*) summarises the curve as a single number between 0 and 1.

| AUC | Quality |
|---|---|
| 0.5 | random |
| 0.7-0.8 | decent |
| 0.8-0.9 | good |
| > 0.9 | very good |

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_score = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_score)
fpr, tpr, _ = roc_curve(y_test, y_score)
```

## Precision-recall curve

The ROC curve can be misleading on **imbalanced classes**: when negatives vastly outnumber positives, the FPR stays small even if the model's positive predictions are mostly wrong. The **precision-recall curve** is more honest in that case.

```python
from sklearn.metrics import precision_recall_curve, average_precision_score

precisions, recalls, thresholds = precision_recall_curve(y_test, y_score)
ap = average_precision_score(y_test, y_score)
```

Use it whenever the positive class is rare (fraud, disease, etc.).

## Decision trees

A **decision tree** classifies by asking a series of simple questions about the variables.

### The Gini index

To measure the impurity of a node (how mixed its classes are), we use the **Gini index**:

$$G = 2 \, p \, (1 - p)$$

where $p$ is the proportion of the positive class in the node. $G = 0$ if the node is pure, $G = 0.5$ if it's a 50/50 split. At each node, the algorithm picks the split that **minimises the weighted impurity** of the two subgroups.

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)
```

Key hyperparameters:

- `max_depth`: maximum depth. Small = simple model; large = risk of overfitting.
- `min_samples_leaf`: minimum number of samples per leaf.
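To see the Gini-chosen questions a fitted tree actually asks, scikit-learn's `export_text` prints it as nested rules; a quick sketch, assuming `X_train` is a pandas DataFrame (otherwise pass any list of column names):

```python
from sklearn.tree import export_text

# One line per node: the splitting feature, its threshold, and the leaf classes.
print(export_text(model, feature_names=list(X_train.columns)))
```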
## Random forests

A single tree is unstable. A **random forest** trains many trees on bootstrap samples, with a random subset of variables considered at each split, and **averages** their predictions.

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, max_depth=5)
```

Robust, needs little tuning, and also provides a **feature importance** measure (`model.feature_importances_`).

## Gradient boosting

The opposite philosophy to forests: train trees **sequentially**, each new tree correcting the errors of the previous ones.

```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=3,
)
```

The `learning_rate` controls the strength of each correction: small (0.05) = stable convergence; large (0.3) = risk of overfitting.

| Random forest | Gradient boosting |
|---|---|
| independent trees | sequential trees |
| average / vote | cumulated corrections |
| robust | more performant but tuning-sensitive |

Modern, more efficient libraries: **XGBoost**, **LightGBM**, **CatBoost** (covered in the next chapter).

---

[**Full notebook on Kaggle (forkable) →**](https://www.kaggle.com/code/pyim59/en-machine-learning-3-classification-v3-1)