# Machine Learning 1 — Starter
:::tip Kaggle notebook
The full executable code for this chapter is on Kaggle: Open →

French and Chinese versions available from the home page.
:::
The first chapter of the course. We revisit the Python tools we'll use throughout the journey, and end on object-oriented programming — an important detour because in the next chapter we'll code a linear regression by hand, as a class.
## Why this chapter?
In Machine Learning, we spend 80% of our time wrangling data and 20% training models. This first chapter gives you the building blocks for that wrangling:
- NumPy for numerical computing (vectors, matrices, vectorised operations);
- pandas for reading CSV files and tabular manipulation;
- Seaborn and Plotly for visualisation;
- Python classes to structure models.
## NumPy: vectorised numerical computing
A NumPy array is the equivalent of a Python list, but much faster for numerical computing: it stores elements of a single type in contiguous memory and runs operations in optimised C code.
```python
import numpy as np

x = np.array([1, 2, 3, 4])      # 1D vector, shape (4,)
X = np.array([[1, 2], [3, 4]])  # 2D matrix, shape (2, 2)
```
The .shape attribute gives the dimensions. It's the first thing to check when handling an array.
### Vectorised operations
NumPy applies operations element by element, without a Python loop:
```python
x + 1   # adds 1 to each element
x * 2   # multiplies each element by 2
x ** 2  # squares each element
np.sum(x), np.mean(x), np.std(x)  # aggregations
```
This is shorter to write and much faster than a `for` loop.
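To make the equivalence concrete, here is a small sketch comparing a Python loop with its vectorised counterpart (the array values are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3, 4])

# The loop version: explicit Python iteration, one element at a time
squares_loop = np.array([v ** 2 for v in x])

# The vectorised version: a single call into optimised C code
squares_vec = x ** 2

assert np.array_equal(squares_loop, squares_vec)  # same result, far less code
```

On arrays of realistic size (thousands of rows and more), the vectorised form is typically orders of magnitude faster.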
### Matrix product and the @ operator
Matrix multiplication — used everywhere in ML — uses the @ operator:
```python
X = np.array([[1, 2], [3, 4], [5, 6]])  # (3, 2)
w = np.array([0.5, 1.0])                # (2,)
X @ w                                   # (3,) — matrix product
```
:::warning Classic gotcha
`X * w` does an element-wise product (with broadcasting), not a matrix product. Always use `@` when you mean the matrix product X·w.
:::
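A quick way to see the difference is to compare the shapes of the two results, reusing the arrays above:

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)
w = np.array([0.5, 1.0])                # shape (2,)

elementwise = X * w  # broadcasting: w multiplies each row → shape (3, 2)
matmul = X @ w       # matrix product → shape (3,)

print(elementwise.shape)  # (3, 2)
print(matmul.shape)       # (3,)
print(matmul)             # [2.5 5.5 8.5]
```

Checking `.shape` after an operation like this is the fastest way to catch the `*`/`@` mix-up before it silently corrupts a computation.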
## pandas: tabular manipulation
pandas is the bridge between data files (CSV, Excel) and Python code. The central structure is the DataFrame, the code equivalent of an Excel sheet.
```python
import pandas as pd

df = pd.read_csv('data.csv')
df.head()      # first rows
df.shape       # (n_rows, n_columns)
df.info()      # types, missing values, memory
df.describe()  # min, max, mean, std per column
```
### Selecting columns
```python
df['co2']                    # one column → Series (1D)
df[['co2']]                  # same column, as DataFrame (2D)
df[['consumption', 'co2']]   # multiple columns
```
The Series vs DataFrame distinction matters: scikit-learn often expects 2D input for explanatory variables, hence the `[[...]]` you'll see often.
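Here is the distinction on a toy DataFrame (the column names and values stand in for the course data and are purely illustrative):

```python
import pandas as pd

# Toy data standing in for the course dataset
df = pd.DataFrame({'consumption': [5.1, 6.3, 7.0],
                   'co2': [120, 145, 160]})

s = df['co2']     # Series, 1D
X = df[['co2']]   # DataFrame, 2D — the shape scikit-learn expects for features

print(type(s).__name__, s.shape)  # Series (3,)
print(type(X).__name__, X.shape)  # DataFrame (3, 1)
```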
### Quick statistics
```python
df['co2'].mean()                 # or np.mean(df['co2'])
df['co2'].std()                  # standard deviation
df['co2'].min(), df['co2'].max()
df['co2'].describe()             # all at once
```
## Visualising: Seaborn and Plotly
Two complementary libraries:
- Seaborn: statistical plots in one line (boxplot, scatter, KDE), DataFrame-oriented syntax, static output.
- Plotly Express: interactive plots (zoom, hover, pan), great for exploration.
```python
import seaborn as sns
import plotly.express as px

sns.boxplot(y=df['co2'])                                   # static
px.scatter(df, x='consumption', y='co2', trendline='ols')  # interactive
```
A boxplot summarises a distribution with five numbers: minimum, first quartile Q1 (25%), median, third quartile Q3 (75%), maximum. The box spans from Q1 to Q3 — the interquartile range (IQR), which contains 50% of the values. Isolated points are outliers (beyond Q1 − 1.5 × IQR or Q3 + 1.5 × IQR).
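The same five-number summary can be computed by hand with NumPy (toy values; the quartile interpolation used here may differ slightly from what a given plotting library uses):

```python
import numpy as np

values = np.array([2, 4, 4, 5, 7, 9, 11, 30])  # toy data with one extreme value

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Outlier fences drawn (implicitly) by the boxplot
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

outliers = values[(values < low_fence) | (values > high_fence)]
print(outliers)  # the isolated points a boxplot would draw separately
```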
## Python classes — OOP
Before the end of the chapter, we revisit object-oriented programming. Why? Because in the next chapter, we'll code a linear neuron as a class with fit, predict, history — exactly like scikit-learn does.
```python
class Calculator:
    def __init__(self):
        self.memory = 10  # instance attribute

    def add(self, x):
        self.memory = self.memory + x  # method that modifies state

calc = Calculator()
calc.add(5)
print(calc.memory)  # 15
```
Three things to remember:
- `__init__` is the constructor, called automatically when an object is created.
- `self` represents the current instance. Always the first parameter of methods.
- Attributes (`self.memory`) keep the object's state between calls.
This structure becomes the standard pattern for the models you'll code: `__init__` to initialise, `fit` to learn, `predict` to predict.
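To see the pattern in action, here is a minimal, deliberately trivial sketch — a hypothetical model that "learns" by memorising the mean of the targets (the class name and behaviour are illustrative, not the linear regression from the next chapter):

```python
import numpy as np

class MeanModel:
    """Trivial model following the __init__ / fit / predict pattern."""

    def __init__(self):
        self.mean_ = None   # learned state, filled in by fit
        self.history = []   # record of training steps

    def fit(self, X, y):
        self.mean_ = np.mean(y)        # "learning" = memorising the mean of y
        self.history.append(self.mean_)
        return self                    # returning self allows chaining, as in scikit-learn

    def predict(self, X):
        # Predict the same constant for every row of X
        return np.full(len(X), self.mean_)

model = MeanModel().fit(np.zeros((4, 1)), np.array([1.0, 2.0, 3.0, 4.0]))
print(model.predict(np.zeros((2, 1))))  # [2.5 2.5]
```

The point is the shape of the class, not the model: swap the body of `fit` and `predict` for real learning logic and you get the structure used throughout the rest of the course.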