

Machine Learning 1 — First steps with Python and data


This first chapter sets the scene for the entire machine learning track. Before we discuss models, predictions or learning algorithms, we need a working environment in which numerical data can be loaded, inspected, transformed and visualised. The tools that will accompany us throughout the course — Python, Jupyter, NumPy, pandas, Seaborn and Plotly — are introduced here in their simplest form, just enough to manipulate a real dataset with confidence.

The chapter is deliberately practical. It assumes that you are already familiar with elementary programming and walks through the specific idioms that data scientists use every day: array shapes and broadcasting in NumPy, column selection and descriptive statistics in pandas, exploratory plots with Seaborn and interactive figures with Plotly Express. We close on a brief reminder of object-oriented programming in Python, which will be useful as soon as we start writing custom estimators.

The dataset used to illustrate every section is co2_mini, a small extract of real vehicle measurements containing two numerical variables: fuel consumption (in litres per 100 km) and CO₂ emissions (in grams per kilometre). The relationship between these two quantities is almost linear, which makes it an ideal companion for the regression chapter to follow.

The Kaggle and Jupyter environment

All notebooks in this course are designed to be opened on Kaggle, an online platform dedicated to data science and machine learning. Kaggle removes the burden of local installation: a Python kernel runs in the browser, the most common scientific libraries (NumPy, pandas, scikit-learn, Plotly) are pre-installed, and datasets can be attached to a notebook in two clicks. Once a dataset is attached, its files become available under a standard path of the form /kaggle/input/<dataset-name>/, and any file produced during execution is written to /kaggle/working/. The input directory is read-only, which guarantees that the original data are never accidentally modified.
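As a quick check of the directory layout described above, the attached files can be listed from a code cell. The sketch below is only an illustration: the helper name list_files is hypothetical, and the function returns an empty list when the directory does not exist (for example when the notebook runs outside Kaggle).

```python
from pathlib import Path


def list_files(root):
    """Return the sorted paths of all files found under `root`, recursively."""
    root = Path(root)
    if not root.exists():  # e.g. when running outside Kaggle
        return []
    return sorted(p for p in root.rglob("*") if p.is_file())


# On Kaggle, attached datasets are mounted read-only under /kaggle/input.
for path in list_files("/kaggle/input"):
    print(path)
```

The printed paths can be passed as they are to a file-reading function.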

A Kaggle notebook is a Jupyter notebook. It alternates two kinds of cells: code cells, which contain executable Python, and Markdown cells, which carry the explanations, the equations in LaTeX and the section titles. The value of the last expression in a code cell is displayed automatically below it, which avoids systematic calls to print. Cells are executed with Shift + Enter (run and move on), Ctrl + Enter (run in place) or Alt + Enter (run and insert a new cell). Cells can be executed in any order, and the state of the kernel reflects the order in which they were actually run, not their position on the page: a good reflex is therefore to re-run the entire notebook from top to bottom after every significant change, to avoid silent inconsistencies.

Markdown essentials. Headings are written with #, ##, ###; bold and italic with **...** and *...*; inline math with $ax+b$ and display math with $$ \hat{y} = ax + b $$. Images and lists follow the standard Markdown syntax.

The libraries used throughout the chapter are imported once, with the conventional aliases:

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    import plotly.express as px
    import seaborn as sns

These five names — np, plt, pd, px, sns — are universal in the Python data ecosystem. Sticking to them makes your code immediately readable to any other practitioner.

Python essentials

Python is a dynamically typed language: a variable is just a name attached to a value, and the type travels with the value. The four scalar types we will meet most often are integers (int), floating-point numbers (float), strings (str) and booleans (bool).

    a = 3          # int
    b = 2.5        # float
    c = "Python"   # str
    d = True       # bool

Ordered collections are stored in lists, with zero-based indexing and slicing.

    values = [1, 2, 3, 4, 5]
    values[0]      # first element
    values[-1]     # last element
    values[1:4]    # sublist

Iteration over a collection uses a for loop. To iterate over a sequence of integers, the range(start, stop, step) builder is the standard tool: it produces values from start (included) up to stop (excluded), with the given step. Wrapping range(...) in list(...) materialises the sequence as a regular list.

    for v in values:
        print(v)

    for i in range(0, 10, 2):
        print(i)

    list(range(0, 10, 2))

A while loop repeats a block as long as a condition holds. The loop variable must be updated inside the block, otherwise the loop never terminates.

    count = 0
    while count < 5:
        print(count)
        count += 1

Conditional logic uses if, elif and else, and combines elementary tests with the boolean operators and, or, not. Indentation is syntactically required in Python: the four-space block under an if is what defines the body of the conditional, not a pair of braces.

    if a > 0:
        print("a is positive")
    elif a == 0:
        print("a is zero")
    else:
        print("a is negative")

    x = 7
    x > 0 and x < 10   # True
    x < 0 or x > 5     # True
    not (x == 7)       # False

Functions are introduced with the def keyword. They group reusable logic and make the rest of the code easier to read. The return statement specifies the value sent back to the caller.

    def square(x):
        return x ** 2

Finally, advanced functionality is delivered by external modules, imported with import and usually given a short alias. Comments start with # and are ignored by the interpreter.

Indentation rule. Every block opened by if, for, while, def or class must be indented consistently, typically with four spaces. Mixing tabs and spaces is a frequent source of indentation errors: Python 3 rejects an inconsistent mix with a TabError, which is a subclass of IndentationError.

Vectors and matrices with NumPy

NumPy is the foundation of the scientific Python stack. It provides a homogeneous, efficient n-dimensional array object, the ndarray, which we use here for one-dimensional vectors and two-dimensional matrices, together with a rich library of vectorised operations that replace explicit Python loops.

    x = np.array([1, 2, 3, 4])
    X = np.array([
        [1, 2],
        [3, 4],
        [5, 6],
    ])

The fundamental property of an array is its shape, returned by the shape attribute. For a vector the shape is (n,); for a matrix it is (n_rows, n_columns). Inspecting shapes before performing an operation is the single best habit to acquire: many numerical bugs in Python boil down to a shape mismatch.

    x.shape   # (4,)
    X.shape   # (3, 2)

The shape of an array can be changed without altering its values, as long as the total number of elements remains constant. This is what reshape does. A common pattern is to turn a flat vector into a matrix in order to feed a model that expects a 2D input.

    x = np.arange(1, 11)   # vector of 10 elements
    X = x.reshape(5, 2)    # 5 x 2 matrix, same values
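The sketch below illustrates the two points made above: the element count must be preserved, and a flat vector can be turned into a single-column matrix. The -1 placeholder is an addition not used elsewhere in this chapter: it asks NumPy to infer that dimension from the number of elements.

```python
import numpy as np

x = np.arange(1, 11)        # shape (10,)
X_col = x.reshape(-1, 1)    # shape (10, 1): -1 lets NumPy infer the row count
X_two = x.reshape(5, 2)     # shape (5, 2)

print(x.shape, X_col.shape, X_two.shape)

# The element count must be preserved: 10 values cannot fill a 3 x 4 matrix.
try:
    x.reshape(3, 4)
except ValueError as err:
    print("reshape failed:", err)
```

The single-column form is the one to reach for when a model expects a 2D input but only one feature is available.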

NumPy provides a complete suite of reduction operators — sum, mean, standard deviation, minimum, maximum. On a matrix, the axis argument controls along which dimension the reduction is performed. The convention to remember is that axis=0 collapses the rows and therefore produces one value per column, while axis=1 collapses the columns and produces one value per row.

    np.sum(x)
    np.mean(x)
    np.sum(X, axis=0)   # sum per column
    np.sum(X, axis=1)   # sum per row

Axis convention. axis=0 walks down the rows (result has one entry per column). axis=1 walks across the columns (result has one entry per row). The intuition: the named axis is the one that disappears.
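A small check of this convention, on a 3 x 2 matrix chosen for illustration, is to compare the shape of the result with the shape of the input:

```python
import numpy as np

X = np.array([[1, 2],
              [3, 4],
              [5, 6]])        # shape (3, 2)

col_sums = X.sum(axis=0)      # axis 0 (length 3) disappears -> shape (2,)
row_sums = X.sum(axis=1)      # axis 1 (length 2) disappears -> shape (3,)

print(col_sums.shape, col_sums.tolist())   # (2,) [9, 12]
print(row_sums.shape, row_sums.tolist())   # (3,) [3, 7, 11]
```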

The dot product of two vectors and the matrix–vector product are the two operations that underpin linear models. The dot product

$$ u \cdot v = \sum_i u_i v_i $$

is computed with np.dot(u, v). The matrix–vector product, written $\hat{y} = Xw$ in linear regression, is computed with the dedicated @ operator.

    u = np.array([1, 2, 3])
    v = np.array([4, 5, 6])
    np.dot(u, v)

    X = np.array([[1, 2], [3, 4], [5, 6]])
    w = np.array([0.5, 1.0])
    y_hat = X @ w

It is essential to distinguish * from @. The star * performs element-wise multiplication: it multiplies each entry of one array by the entry at the same position in the other (with broadcasting if the shapes differ). The at-sign @ performs the matrix product in the algebraic sense.

    X * w   # element-wise product (w is broadcast across the rows of X)
    X @ w   # matrix-vector product
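To make the distinction concrete, the following sketch applies both operators to the same X and w as in the previous example and compares the shapes. It also checks one way of relating the two results: summing each row of the element-wise product gives the matrix–vector product.

```python
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)
w = np.array([0.5, 1.0])                 # shape (2,)

elementwise = X * w   # w is broadcast over the rows -> shape (3, 2)
product = X @ w       # matrix-vector product        -> shape (3,)

print(elementwise.shape, product.shape)

# Summing each row of the element-wise product recovers the matrix product.
assert np.allclose(elementwise.sum(axis=1), product)
```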

Most arithmetic operations between an array and a scalar are likewise vectorised, which means they apply elementwise without an explicit loop.

    x + 1
    x * 2
    x ** 2

These vectorised expressions are not just shorter than for loops: on large arrays they are typically one to two orders of magnitude faster, because NumPy delegates the actual work to compiled C routines.
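The speed difference can be measured directly. The sketch below times an explicit loop against the vectorised expression with the standard timeit module; the array size and the function names are arbitrary choices, and the measured ratio depends on the machine.

```python
import timeit

import numpy as np

x = np.arange(100_000, dtype=np.float64)


def squares_loop(values):
    # Explicit Python loop: one interpreter step per element.
    return [v ** 2 for v in values]


def squares_vectorised(values):
    # Single NumPy expression: the loop runs in compiled code.
    return values ** 2


# Both routes must give the same numbers before timings are compared.
assert np.allclose(squares_loop(x), squares_vectorised(x))

t_loop = timeit.timeit(lambda: squares_loop(x), number=5)
t_vec = timeit.timeit(lambda: squares_vectorised(x), number=5)
print(f"loop: {t_loop:.4f} s, vectorised: {t_vec:.4f} s")
```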

DataFrames with pandas

While NumPy reasons in terms of pure numerical arrays, pandas introduces the DataFrame, a tabular structure with named columns and an index, that mirrors the way data scientists think about a dataset. A DataFrame is the natural object returned by reading a CSV file.

df = pd.read_csv('/kaggle/input/mini-datasets/co2_mini.csv')

pd.read_csv assumes a comma separator by default and infers the data type of each column. When the file uses another separator, decimal mark or encoding, the parameters sep, decimal and encoding must be set explicitly. On a Kaggle notebook, the path to a CSV file can be copied directly from the Data panel by hovering over the file name.
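The effect of sep and decimal is easiest to see on a tiny file. The sketch below reads a made-up, semicolon-separated extract with decimal commas from an in-memory string (the two rows are hypothetical values, not taken from co2_mini), first with the default settings and then with explicit parameters.

```python
import io

import pandas as pd

# Hypothetical file content: ';' as separator, ',' as decimal mark.
raw = "consumption;co2\n8,5;196\n10,2;235\n"

# Default settings: the comma is treated as the separator, so the columns are wrong.
wrong = pd.read_csv(io.StringIO(raw))
print(list(wrong.columns))

# Explicit settings: two numerical columns, as intended.
df_demo = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")
print(df_demo.dtypes)
print(df_demo)
```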

The first reflex after a read_csv is to look at the head of the table, to confirm that the column names, the values and the inferred types are what we expected:

    df.head()     # first 5 rows
    df.head(10)   # first 10 rows

If head() shows misaligned columns, suspicious NaN values or numbers parsed as strings, the cure is almost always to revisit the parameters of read_csv.

The co2_mini dataset has two columns: consumption (litres per 100 km) and co2 (grams of CO₂ per kilometre). It is a subset of a public dataset of vehicle measurements, restricted to vehicles running on premium gasoline, in order to expose a clean, almost linear relationship between fuel consumption and CO₂ emissions.

Inspecting the structure

A small set of attributes and methods is enough to grasp the structure of a freshly loaded DataFrame:

    df.shape      # (n_rows, n_columns)
    df.shape[0]   # number of rows
    df.shape[1]   # number of columns
    df.columns    # column names
    df.info()     # types, non-null counts, memory usage

df.info() is particularly useful for spotting missing values (a column with fewer non-null entries than the total number of rows), non-numeric columns (object dtype) and any need for type conversion before modelling.

Selecting columns

Columns can be extracted with bracket notation. The subtlety, which has consequences in machine learning, is that single brackets return a one-dimensional Series, while double brackets return a two-dimensional DataFrame containing the requested columns.

    df['co2']                    # Series (1D)
    df[['co2']]                  # DataFrame (2D, one column)
    df[['consumption', 'co2']]   # DataFrame (2D, two columns)

Many scikit-learn estimators expect a 2D feature matrix as input, which is why we will systematically use the df[[...]] form for features and the df[...] form for the target. A more concise alternative, dot notation df.co2, also works but only when the column name has no space, no special character and does not collide with an existing pandas attribute. The bracket form is therefore safer.

Series vs DataFrame. df['col'] is a 1D Series; df[['col']] is a 2D DataFrame. In machine learning, features X are typically a DataFrame while the target y is a Series.
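The difference is visible in the shapes. The sketch below builds a three-row DataFrame with the same column names as co2_mini (the values are invented for the example) and extracts a target and a feature matrix in the two forms.

```python
import pandas as pd

# Invented values, same column names as co2_mini.
df_demo = pd.DataFrame({
    "consumption": [8.5, 10.2, 12.0],
    "co2": [196, 235, 276],
})

y = df_demo["co2"]             # single brackets -> Series
X = df_demo[["consumption"]]   # double brackets -> DataFrame

print(type(y).__name__, y.shape)   # Series (3,)
print(type(X).__name__, X.shape)   # DataFrame (3, 1)
```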

Descriptive statistics

Every classical descriptive statistic is available both as a method on a pandas object and as a top-level NumPy function. The two routes are close, but the pandas methods are aware of missing values and integrate naturally with the rest of the pandas API. One default differs: the pandas std method uses the sample formula (ddof=1, division by n - 1), whereas np.std defaults to the population formula (ddof=0, division by n).

    df['co2'].sum()    # or np.sum(df['co2'])
    df['co2'].mean()   # or np.mean(df['co2'])
    df['co2'].std()    # sample standard deviation (ddof=1); np.std defaults to ddof=0
    df['co2'].min()    # or np.min(df['co2'])
    df['co2'].max()    # or np.max(df['co2'])
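Two details are worth checking when comparing the pandas and NumPy routes: the default of the standard deviation and the handling of missing values. The sketch below uses a short made-up series; converting to a plain array with to_numpy() makes sure the NumPy defaults are the ones being tested.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # made-up values

sample_std = s.std()                    # pandas default: ddof=1 (divides by n - 1)
population_std = np.std(s.to_numpy())   # NumPy default: ddof=0 (divides by n)
print(sample_std, population_std)

# Passing ddof explicitly makes the two routes agree.
assert np.isclose(s.std(ddof=0), population_std)

# Missing values: pandas skips NaN by default, a plain NumPy reduction does not.
t = pd.Series([1.0, np.nan, 3.0])
print(t.mean())                 # 2.0
print(np.mean(t.to_numpy()))    # nan
```

On a large dataset the two standard deviations are nearly identical; the difference matters mostly on small samples.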

The most useful one-liner for a first look at numerical columns is describe(), which returns a small summary table:

df.describe()

For each numerical column, describe reports the count of non-null values, the mean, the standard deviation, the minimum, the three quartiles ($Q_1$ at 25 %, the median $Q_2$ at 50 %, $Q_3$ at 75 %) and the maximum. The quartiles split the ordered observations into four equal parts, and the difference

$$ \mathrm{IQR} = Q_3 - Q_1, $$

called the interquartile range, measures the dispersion of the central half of the data. Together, these numbers reveal the central tendency of a variable, its dispersion and the presence of outliers.

Visualising distributions and relationships with Seaborn

Seaborn is a statistical visualisation library built on top of Matplotlib. Its DataFrame-aware syntax makes it the natural first choice for exploratory plots. Four families of plots are sufficient at this stage: boxplots, histograms, density curves and scatter plots.

Boxplot

A boxplot condenses a numerical distribution into five summary statistics: minimum, $Q_1$, median, $Q_3$, maximum. The box covers the interquartile range and contains the central 50 % of the data; the line inside the box marks the median; the whiskers extend to the most extreme non-outlier values.

    plt.figure(figsize=(6, 4))
    sns.boxplot(y=df['co2'])
    plt.title('Boxplot of the CO2 variable')
    plt.show()

A boxplot makes dispersion, asymmetry and the position of potential outliers visible at a glance. By convention, an observation $x$ is flagged as an outlier when

$$ x < Q_1 - 1.5 \times \mathrm{IQR} \quad \text{or} \quad x > Q_3 + 1.5 \times \mathrm{IQR}, $$

and is then plotted as an individual point. Outliers may correspond to genuine rare events, to recording errors, or to data-entry mistakes: they should never be removed automatically without a contextual analysis.
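The rule above can be applied by hand with the pandas quantile method. The following sketch uses eight invented values, one of which is deliberately extreme; it only flags the candidate outliers and does not remove them.

```python
import pandas as pd

values = pd.Series([10, 12, 12, 13, 13, 14, 15, 40])   # invented data

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # lower fence
upper = q3 + 1.5 * iqr   # upper fence

# Observations outside the fences are flagged, not deleted.
outliers = values[(values < lower) | (values > upper)]
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences: [{lower}, {upper}]")
print(outliers.tolist())   # [40]
```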

Histogram and kernel density estimate

The histogram approximates a distribution by counting how many observations fall in each of a series of consecutive intervals (the bins). The number of bins controls the granularity: too few bins flatten the distribution, too many turn random fluctuations into spurious peaks.

    plt.figure(figsize=(6, 4))
    sns.histplot(data=df, x='co2', bins=100)
    plt.title('Distribution of the CO2 variable')
    plt.show()
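The counting step behind a histogram, and the effect of the number of bins, can be explored without drawing anything. The sketch below uses np.histogram on a synthetic sample (normally distributed values chosen for the example, not the co2_mini data) and reports, for three bin counts, the height of the tallest bar and the number of empty bins.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=250.0, scale=40.0, size=500)   # synthetic values

summary = {}
for bins in (5, 30, 200):
    # counts[i] is the number of observations between edges[i] and edges[i + 1].
    counts, edges = np.histogram(sample, bins=bins)
    summary[bins] = counts
    empty = int((counts == 0).sum())
    print(f"{bins:>3} bins: tallest bar = {counts.max()}, empty bins = {empty}")
```

With very few bins every bar is tall and the shape is coarse; with very many bins the bars become short and irregular.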

The kernel density estimate (KDE) offers a continuous, smooth alternative. It does not partition the axis into bins; it returns a curve whose total area is one and whose height at a given point is proportional to the local concentration of observations.

    plt.figure(figsize=(6, 4))
    sns.kdeplot(data=df, x='co2', fill=True)
    plt.title('Density estimation (KDE)')
    plt.show()
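To see why the area under a KDE curve is one, it helps to build a density estimate by hand. The sketch below is a minimal Gaussian KDE written with NumPy only: it averages one Gaussian bump per observation. It is not Seaborn's implementation; the function name, the synthetic sample and the hand-picked bandwidth are all assumptions made for the example.

```python
import numpy as np


def gaussian_kde_curve(sample, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one per observation."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)


rng = np.random.default_rng(0)
sample = rng.normal(loc=250.0, scale=40.0, size=300)   # synthetic values

# Evaluation grid that extends well beyond the data on both sides.
grid = np.linspace(sample.min() - 200, sample.max() + 200, 2001)
density = gaussian_kde_curve(sample, grid, bandwidth=15.0)

# Riemann-sum approximation of the area under the curve.
area = np.sum(density) * (grid[1] - grid[0])
print(f"area under the curve: {area:.3f}")
```

Each bump has area one, so their average does too; a larger bandwidth gives a smoother curve, a smaller one a more jagged curve.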

The two representations are complementary, and Seaborn even allows them to be superimposed by passing kde=True to histplot.

    plt.figure(figsize=(6, 4))
    sns.histplot(data=df, x='co2', bins=100, kde=True)
    plt.title('Histogram with density estimation')
    plt.show()

Violin plot

A violin plot combines the strengths of the boxplot and the KDE: a mirrored density curve gives the shape of the distribution, while an inner box reproduces the median and quartiles. Wide regions of the violin signal high concentration; narrow regions correspond to rare values.

    plt.figure(figsize=(6, 4))
    sns.violinplot(y=df['co2'], inner='box')
    plt.title('Violin plot of the CO2 variable')
    plt.show()

Scatter plot

The scatter plot moves from the analysis of one variable to the analysis of a pair of variables. Each observation is drawn as a point at coordinates $(x_i, y_i)$, where $x$ is an explanatory variable and $y$ a target variable. The shape of the resulting cloud answers three questions at once: is there a relationship, is it linear, and how dispersed is it?

    plt.figure(figsize=(6, 4))
    sns.scatterplot(data=df, x='consumption', y='co2')
    plt.title('Relationship between consumption and CO2')
    plt.show()

For the co2_mini dataset, the scatter plot of co2 against consumption is essentially a straight band rising from the bottom-left to the top-right corner — exactly the kind of pattern that linear regression is designed to model.

Why these four plots? Boxplot, histogram, KDE and scatter plot together cover the structure of a single variable (centre, dispersion, shape) and the relationship between two variables. They are the visual checklist that should precede any modelling decision.

Interactive figures with Plotly Express

Plotly Express (imported as px) is a high-level interface to Plotly. It produces interactive figures from a DataFrame in a single call: hovering over a point reveals its exact value, the figure can be panned and zoomed, and the result can be embedded in a dashboard. The same four basic chart types are available.

    fig = px.histogram(df, x='co2', nbins=100)
    fig.show()

    fig = px.box(df, y='co2')
    fig.show()

    fig = px.violin(df, y='co2', box=True, points='outliers')
    fig.show()

    fig = px.scatter(df, x='consumption', y='co2')
    fig.show()

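A particularly useful Plotly Express option for a first look at a regression problem is trendline='ols', which adds an ordinary-least-squares trend line on top of a scatter plot. Plotly Express relies on the statsmodels package to compute this fit, so statsmodels must be available in the environment. We will use this option again in the next chapter as a sanity check.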

    fig = px.scatter(df, x='consumption', y='co2', trendline='ols')
    fig.show()
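The slope and intercept of such a trend line can be cross-checked numerically. The sketch below uses np.polyfit with degree 1, which also solves a least-squares problem; it is a stand-in for the fit drawn by Plotly, not the routine Plotly calls. The four points are invented so that they lie exactly on the line co2 = 23 × consumption + 2.

```python
import numpy as np

# Invented points lying exactly on co2 = 23 * consumption + 2.
consumption = np.array([6.0, 8.0, 10.0, 12.0])
co2 = np.array([140.0, 186.0, 232.0, 278.0])

# Degree-1 least-squares fit: returns the slope first, then the intercept.
slope, intercept = np.polyfit(consumption, co2, deg=1)
print(f"co2 = {slope:.2f} * consumption + {intercept:.2f}")

# Prediction for a new, hypothetical consumption value.
print(slope * 9.0 + intercept)
```

On real data the points scatter around the line, and the fitted coefficients are the ones that minimise the sum of squared vertical distances.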

Plotly Express separates the creation of a figure (px.histogram, px.scatter, …) from its aesthetic adjustment (update_traces, update_layout). This separation is convenient because it lets us tune transparency, marker outlines or figure size after the fact, without rewriting the full plotting call.

    fig = px.histogram(df, x='co2', nbins=50)
    fig.update_traces(opacity=0.7, marker_line_color='black', marker_line_width=1)
    fig.update_layout(width=600, height=400)
    fig.show()

Seaborn and Plotly are not in competition: Seaborn produces clean static figures suitable for printed reports, Plotly produces interactive figures suitable for exploration and online sharing. Both deserve a place in the data scientist's toolbox.

A first look at Python objects

Many machine learning libraries — scikit-learn first and foremost — expose their estimators as Python objects. Understanding what an object is therefore matters even when we never write a class ourselves. A class is a blueprint that bundles together two kinds of ingredients: attributes, which store data, and methods, which act on that data. The following minimal example illustrates the mechanism on a toy Calculator class.

    class Calculator:
        def __init__(self):
            self.memory = 10

        def add(self, x):
            self.memory = self.memory + x

The special method __init__ is the constructor: it is called automatically whenever a new object is created from the class, and it is used to initialise the object's attributes. The first parameter of every instance method, conventionally named self, refers to the object on which the method is being called — without self, Python would not know which object the body of the method should read from or write to. Inside the constructor, the assignment self.memory = 10 creates an attribute named memory and attaches it to the current instance.

Instantiating the class — that is, creating a concrete object — is done by calling the class as if it were a function:

calc = Calculator()

Behind the scenes, Python allocates a fresh chunk of memory and then calls Calculator.__init__(calc) automatically. The variable calc is a reference to this object; the object exists as long as something references it. Two different instances each have their own independent memory attribute.

Once an object exists, attributes and methods are accessed with dot notation. Reading an attribute returns the stored value; calling a method may also modify the internal state of the object.

    calc.memory     # 10
    calc.add(20)    # modifies calc.memory in place
    calc.memory     # 30

The change made by calc.add(20) is persistent: as long as we keep working with the same calc object, subsequent calls will see memory == 30. This pattern — an object that holds state and exposes methods that read or update that state — is exactly how a scikit-learn estimator works. A linear regression object, for example, stores its learned coefficients in attributes (coef_, intercept_) and exposes methods (fit, predict) that read or modify them.
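The parallel with the Calculator class can be made explicit with a toy estimator. The class below is hypothetical and written only for illustration: it borrows the fit/predict vocabulary and the trailing-underscore attribute names mentioned above, handles a single feature, and uses np.polyfit for the fit. It is not scikit-learn's implementation.

```python
import numpy as np


class TinyLinearRegression:
    """Hypothetical one-feature estimator mimicking the fit/predict pattern."""

    def fit(self, x, y):
        # Learned state is stored on the object, like memory in Calculator.
        self.coef_, self.intercept_ = np.polyfit(x, y, deg=1)
        return self

    def predict(self, x):
        # Reads the stored state without modifying it.
        return self.coef_ * np.asarray(x) + self.intercept_


# Invented data lying exactly on y = 2x + 1.
model = TinyLinearRegression()
model.fit(np.array([1.0, 2.0, 3.0]), np.array([3.0, 5.0, 7.0]))

print(model.coef_, model.intercept_)
print(model.predict([4.0]))
```

Calling fit changes the state of the object, exactly as calc.add(20) does; predict then reads that state.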

A note on f-strings

A small but ubiquitous detail of modern Python is the f-string, a string literal prefixed with the letter f in which {expression} placeholders are replaced by the evaluated value of expression. This is the cleanest way to format text and numbers together.

    year = 3
    capital = 1234.567
    print(f"Year {year}: capital = {capital}")

The format specification after a colon controls how a number is rendered. The most useful one for monetary values is .2f, which prints a real number with two digits after the decimal point:

    print(f"{capital:.2f}")   # 1234.57
    print(f"Year {year}: capital = {capital:.2f}")

We will use f-strings throughout the course to print readable training logs and evaluation metrics.

Exercises

The following exercises consolidate the material of the chapter. Each one can be solved with the constructs introduced above; no library other than NumPy or pandas is required.

Exercise 1 — Python recap (without NumPy)

This exercise checks the mastery of the basic Python constructs needed for the rest of the notebook.

  1. Create a list containing the integers from 1 to 10.
  2. Using a for loop, print only the even numbers in this list.
  3. Using a for loop with range, compute the sum of the integers from 1 to 10.
  4. Write a function mean(values) which:
    • takes as input a list of numbers,
    • returns the arithmetic mean of these numbers.
  5. Using a while loop, print the integers from 5 down to 1 in decreasing order.

Exercise 2 — Operations on vectors and matrices (NumPy)

The goal of this exercise is to manipulate vectors and matrices with NumPy, without explicit loops, in order to prepare the calculations used later in linear regression.

Consider the vector

    x = np.array([1, 2, 3, 4, 5, 6])

  1. Reshape x into a matrix X of shape (3, 2) using reshape.
  2. Compute:
    • the sum of all elements of X,
    • the sum per column,
    • the sum per row.
  3. Given the coefficient vector

        w = np.array([0.5, 1.0])

     compute the matrix–vector product $Xw$.
  4. Compare the results of

        X * w
        X @ w

     and explain the observed difference in one or two sentences.

Exercise 3 — Object-oriented programming: a compound interest calculator

We want to model a capital invested at compound interest. The capital evolves from year $n$ to year $n+1$ according to

$$ C_{n+1} = C_n \times (1 + r), $$

where $C_n$ is the capital at year $n$ and $r$ is the annual interest rate (in decimal form).

  1. Write a class CompoundInterestCalculator containing:
    • a constructor __init__(self, initial_capital, rate),
    • an attribute capital representing the current capital,
    • an attribute rate representing the annual interest rate.
  2. Add a method next_year() which:
    • updates capital by applying one year of interest,
    • modifies the internal state of the object.
  3. Add a method simulate(n) which:
    • simulates the evolution of the capital over n years,
    • prints the capital after each year, formatted with two decimal places using an f-string of the form f"Year {year}: capital = {self.capital:.2f}".
  4. Instantiate an object with an initial capital of 1000 and a rate of 5 %.
  5. Simulate the evolution of the capital over 5 years.

Going further

The following references are the primary documentation for the libraries used in this chapter. They are the recommended first stop whenever you need a feature that was not covered in the course.