# Machine Learning 1 — Starter
:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →

French and Chinese versions available from the home page.
:::
The first chapter of the course. We revisit the Python tools we'll use throughout the journey, and end on object-oriented programming — an important detour because in the next chapter we'll code a linear regression by hand, as a class.
## Why this chapter?
In Machine Learning, we spend 80% of our time wrangling data and 20% training models. This first chapter gives you the building blocks for that wrangling:
- NumPy for numerical computing (vectors, matrices, vectorised operations);
- pandas for reading CSV files and tabular manipulation;
- Seaborn and Plotly for visualisation;
- Python classes to structure models.
## NumPy: vectorised numerical computing
A NumPy array is the equivalent of a Python list, but much faster for numerical computing: it stores elements of a single type in contiguous memory and runs operations in optimised C code.
```python
import numpy as np

x = np.array([1, 2, 3, 4])      # 1D vector, shape (4,)
X = np.array([[1, 2], [3, 4]])  # 2D matrix, shape (2, 2)
```
The .shape attribute gives the dimensions. It's the first thing to check when handling an array.
### Vectorised operations
NumPy applies operations element by element, without a Python loop:
```python
x + 1   # adds 1 to each element
x * 2   # multiplies each element by 2
x ** 2  # squares each element
np.sum(x), np.mean(x), np.std(x)  # aggregations
```
This is shorter to write and much faster than a `for` loop.
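To make the equivalence concrete, here is a small sketch comparing a Python loop with its vectorised counterpart (the array values are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3, 4])

# The loop version: explicit Python iteration, one element at a time
squares_loop = np.array([v ** 2 for v in x])

# The vectorised version: a single call into optimised C code
squares_vec = x ** 2

assert np.array_equal(squares_loop, squares_vec)  # same result, far less code
```

On arrays of realistic size (thousands of rows and more), the vectorised form is typically orders of magnitude faster.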
### Matrix product and the @ operator
Matrix multiplication — used everywhere in ML — uses the @ operator:
```python
X = np.array([[1, 2], [3, 4], [5, 6]])  # (3, 2)
w = np.array([0.5, 1.0])                # (2,)
X @ w                                   # (3,) — matrix product
```
:::warning Classic gotcha
`X * w` does an element-wise product (with broadcasting), not a matrix product. Always use `@` when you mean the matrix product X·w.
:::
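A quick way to see the difference is to compare the shapes of the two results, reusing the arrays above:

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)
w = np.array([0.5, 1.0])                # shape (2,)

elementwise = X * w  # broadcasting: w multiplies each row → shape (3, 2)
matmul = X @ w       # matrix product → shape (3,)

print(elementwise.shape)  # (3, 2)
print(matmul.shape)       # (3,)
print(matmul)             # [2.5 5.5 8.5]
```

Checking `.shape` after an operation like this is the fastest way to catch the `*`/`@` mix-up before it silently corrupts a computation.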
## pandas: tabular manipulation
pandas is the bridge between data files (CSV, Excel) and Python code. The central structure is the DataFrame, the code equivalent of an Excel sheet.
```python
import pandas as pd

df = pd.read_csv('data.csv')
df.head()      # first rows
df.shape       # (n_rows, n_columns)
df.info()      # types, missing values, memory
df.describe()  # min, max, mean, std per column
```
### Selecting columns
```python
df['co2']                    # one column → Series (1D)
df[['co2']]                  # same column, as DataFrame (2D)
df[['consumption', 'co2']]   # multiple columns
```
The Series vs DataFrame distinction matters: scikit-learn often expects 2D input for explanatory variables, hence the `[[...]]` you'll see often.
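Here is the distinction on a toy DataFrame (the column names and values stand in for the course data and are purely illustrative):

```python
import pandas as pd

# Toy data standing in for the course dataset
df = pd.DataFrame({'consumption': [5.1, 6.3, 7.0],
                   'co2': [120, 145, 160]})

s = df['co2']     # Series, 1D
X = df[['co2']]   # DataFrame, 2D — the shape scikit-learn expects for features

print(type(s).__name__, s.shape)  # Series (3,)
print(type(X).__name__, X.shape)  # DataFrame (3, 1)
```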
### Quick statistics
```python
df['co2'].mean()                 # or np.mean(df['co2'])
df['co2'].std()                  # standard deviation
df['co2'].min(), df['co2'].max()
df['co2'].describe()             # all at once
```
## Visualising: Seaborn and Plotly
Two complementary libraries:
- Seaborn: statistical plots in one line (boxplot, scatter, KDE), DataFrame-oriented syntax, static output.
- Plotly Express: interactive plots (zoom, hover, pan), great for exploration.
```python
import seaborn as sns
import plotly.express as px

sns.boxplot(y=df['co2'])                                   # static
px.scatter(df, x='consumption', y='co2', trendline='ols')  # interactive
```
A boxplot summarises a distribution with five numbers: minimum, first quartile Q1 (25%), median, third quartile Q3 (75%), maximum. The box spans from Q1 to Q3 — the interquartile range (IQR), which contains 50% of the values. Isolated points are outliers (beyond Q1 − 1.5 × IQR or Q3 + 1.5 × IQR).
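The same five-number summary can be computed by hand with NumPy (toy values; the quartile interpolation used here may differ slightly from what a given plotting library uses):

```python
import numpy as np

values = np.array([2, 4, 4, 5, 7, 9, 11, 30])  # toy data with one extreme value

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Outlier fences drawn (implicitly) by the boxplot
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

outliers = values[(values < low_fence) | (values > high_fence)]
print(outliers)  # the isolated points a boxplot would draw separately
```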
## Python classes — OOP
Before the end of the chapter, we revisit object-oriented programming. Why? Because in the next chapter, we'll code a linear neuron as a class with fit, predict, history — exactly like scikit-learn does.
```python
class Calculator:
    def __init__(self):
        self.memory = 10  # instance attribute

    def add(self, x):
        self.memory = self.memory + x  # method that modifies state

calc = Calculator()
calc.add(5)
print(calc.memory)  # 15
```
Three things to remember:
- `__init__` is the constructor, called automatically when an object is created.
- `self` represents the current instance. Always the first parameter of methods.
- Attributes (`self.memory`) keep the object's state between calls.
This structure becomes the standard pattern for the models you'll code: `__init__` to initialise, `fit` to learn, `predict` to predict.
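To see the pattern in action, here is a minimal, deliberately trivial sketch — a hypothetical model that "learns" by memorising the mean of the targets (the class name and behaviour are illustrative, not the linear regression from the next chapter):

```python
import numpy as np

class MeanModel:
    """Trivial model following the __init__ / fit / predict pattern."""

    def __init__(self):
        self.mean_ = None   # learned state, filled in by fit
        self.history = []   # record of training steps

    def fit(self, X, y):
        self.mean_ = np.mean(y)        # "learning" = memorising the mean of y
        self.history.append(self.mean_)
        return self                    # returning self allows chaining, as in scikit-learn

    def predict(self, X):
        # Predict the same constant for every row of X
        return np.full(len(X), self.mean_)

model = MeanModel().fit(np.zeros((4, 1)), np.array([1.0, 2.0, 3.0, 4.0]))
print(model.predict(np.zeros((2, 1))))  # [2.5 2.5]
```

The point is the shape of the class, not the model: swap the body of `fit` and `predict` for real learning logic and you get the structure used throughout the rest of the course.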