teach.pascalyim.com

Deep Learning 1 — The Linear Neuron

Deep learning is, at its core, the art of stacking small differentiable building blocks and teaching them to cooperate. Before we can stack anything, we have to understand the block itself. This first chapter focuses on the simplest such block — the linear neuron — and follows it from a single scalar input to a fully vectorised version with several inputs. We will write the model from scratch in NumPy, derive its gradients on paper, implement gradient descent in three flavours (batch, stochastic, minibatch), and finally see why the inputs almost always need to be normalised before training begins.

The linear neuron is also a useful conceptual bridge with what came before. From a statistical standpoint, it is exactly a linear regression. What changes is the training procedure: instead of solving a closed-form system as in LinearRegression, we minimise the cost function by iterative gradient descent. That shift is the entire point of this chapter, because gradient descent is the mechanism that scales to networks of thousands or millions of parameters where no closed-form solution exists.

The running examples come from two of the small datasets we have used since the start of the course: abalone_mini, which relates the physical measurements of abalones to the number of rings on their shells (a proxy for age), and house_mini, which collects sale prices of houses in King County. They are large enough to make gradient descent meaningful and small enough to be plotted in full on screen.

Linear neuron with one input

A linear neuron with one input computes an affine transformation of a scalar value $x$:

$$u = a x + b,$$

where $a$ is a weight, $b$ is a bias, and $u$ is the prediction. Geometrically this is the equation of a straight line in the $(x, u)$ plane. From a learning standpoint, $a$ and $b$ are the two trainable parameters of the model: starting from arbitrary initial values, training will adjust them until the prediction $u$ matches the target $y$ as closely as possible on the training data.

The criterion that measures how closely the predictions match the targets is the mean squared error:

$$E(a, b) \;=\; \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - (a x_i + b)\bigr)^2.$$

Key formula: the cost function. $E$ depends on the parameters $(a, b)$ through the predictions $u_i = a x_i + b$. The training data $(x_i, y_i)$ are fixed; only $a$ and $b$ are allowed to move. Minimising $E$ over $(a, b)$ is the entire training problem.
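As a small illustration of the cost function, the sketch below evaluates the MSE for two parameter choices on toy data. The function name `mse_cost` and the data values are assumptions made for this example; they do not come from the course datasets.

```python
import numpy as np

def mse_cost(a, b, x, y):
    """Mean squared error of the line u = a*x + b on the data (x, y)."""
    u = a * x + b
    return np.mean((y - u) ** 2)

# Toy data lying exactly on the line y = 2x + 1 (illustrative values).
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = 2.0 * x_toy + 1.0

cost_at_truth = mse_cost(2.0, 1.0, x_toy, y_toy)  # parameters that generated the data
cost_at_zero = mse_cost(0.0, 0.0, x_toy, y_toy)   # an arbitrary poor guess
print(cost_at_truth, cost_at_zero)
```

The data are held fixed in both calls; only the pair $(a, b)$ changes, which is exactly the sense in which $E$ is a function of the parameters.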

Gradient descent

The cost surface $E(a, b)$ is a quadratic bowl. The closed-form solution exists and is given by the normal equations of linear regression, but the same problem can also be attacked iteratively, by walking downhill on the surface. This is the strategy that will generalise to deep networks, where no closed form is available, so it deserves to be understood carefully on the simplest possible model.

The gradient $\nabla E$ points in the direction of steepest increase of the cost. To decrease $E$, we therefore step in the opposite direction, with an amplitude controlled by the learning rate $\eta$:

$$a \;\leftarrow\; a - \eta\,\frac{\partial E}{\partial a}, \qquad b \;\leftarrow\; b - \eta\,\frac{\partial E}{\partial b}.$$

Intuition. The gradient tells us which way the cost rises. We turn around and take a step. The size of that step is the learning rate. Repeat until the cost stops decreasing.

Computing the partial derivatives

Working out the gradient explicitly is the kind of exercise that should be done at least once by hand. Take a single example $i$ and define its individual loss as $E_i = \tfrac{1}{2}(y_i - u_i)^2$ with $u_i = a x_i + b$. Applying the chain rule,

$$\frac{\partial E_i}{\partial a} = \frac{\partial E_i}{\partial u_i} \cdot \frac{\partial u_i}{\partial a} = -(y_i - u_i) \cdot x_i = (u_i - y_i)\, x_i,$$

and similarly

$$\frac{\partial E_i}{\partial b} = \frac{\partial E_i}{\partial u_i} \cdot \frac{\partial u_i}{\partial b} = (u_i - y_i).$$

In batch mode, the gradient of the full cost is the average over all training examples:

$$\frac{\partial E}{\partial a} = \frac{1}{n} \sum_{i=1}^{n} (u_i - y_i)\, x_i, \qquad \frac{\partial E}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (u_i - y_i).$$

These two formulas are everything that gradient descent needs. The factor $\tfrac{1}{2}$ in $E_i$ is there only to cancel the 2 produced by differentiating the square; if we instead differentiate the MSE $\tfrac{1}{n}\sum_i (y_i - u_i)^2$ directly, a factor of $2$ appears in both gradients. Practitioners simply absorb such constants into the learning rate.
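One way to gain confidence in a hand derivation is to compare it with a finite-difference approximation. The sketch below does this on synthetic data; the data, the starting point and the helper name `half_mse` are assumptions for the example, and the cost uses the $\tfrac{1}{2}$ convention so that no factor of 2 appears.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
y = 3.0 * x - 0.5 + rng.normal(0.0, 0.1, size=50)  # synthetic targets
a, b = 0.7, 0.2                                    # arbitrary point in parameter space

def half_mse(a, b):
    """Average of the individual losses E_i = 0.5 * (y_i - u_i)**2."""
    u = a * x + b
    return 0.5 * np.mean((y - u) ** 2)

# Analytic gradients from the derivation.
u = a * x + b
grad_a = np.mean((u - y) * x)
grad_b = np.mean(u - y)

# Central finite differences as an independent check.
eps = 1e-6
num_a = (half_mse(a + eps, b) - half_mse(a - eps, b)) / (2 * eps)
num_b = (half_mse(a, b + eps) - half_mse(a, b - eps)) / (2 * eps)

print(grad_a, num_a)
print(grad_b, num_b)
```

The two columns should agree to several decimal places; a mismatch usually points to a sign error or a forgotten factor in the derivation.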

Learning rate and epochs

Two hyperparameters dominate the dynamics of training. The learning rate $\eta$ controls the size of each step. If it is too small, convergence is slow and many epochs are needed. If it is too large, updates overshoot the minimum and the cost oscillates or diverges outright. The learning rate does not change the direction of the update, only the distance travelled.

An epoch is a full pass over the training set. Within one epoch the parameters may be updated once (batch), $n$ times (pure SGD), or $n/B$ times (minibatches of size $B$). Increasing the number of epochs lets the model approach the minimum more closely, but past a certain point it brings no further improvement on test data.

A useful mental picture is that of a hiker descending a foggy mountain. The gradient is the sense of slope under their feet (the direction of steepest ascent), which they invert to walk downhill. The learning rate is the length of each stride. With strides too short, dusk falls before they reach the valley. With strides too long, they stride right across the valley floor and back up the opposite slope. With well-chosen strides, the descent is rapid and stable. The objective of practical experimentation with $\eta$ is precisely to find that sweet spot, and the loss history we record at every epoch is the only measurement instrument we have to do so.
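The three regimes of the hiker picture can be reproduced on a toy problem. The sketch below runs batch gradient descent on synthetic data lying on $y = 2x + 1$ with three learning rates; the data, the helper name `final_loss` and the specific rates are assumptions chosen for illustration and would need retuning on other data.

```python
import numpy as np

def final_loss(lr, epochs=100):
    """Run batch gradient descent from a = b = 0 and return the final MSE."""
    x = np.linspace(0.0, 1.0, 20)
    y = 2.0 * x + 1.0          # noiseless toy targets
    a, b = 0.0, 0.0
    for _ in range(epochs):
        u = a * x + b
        grad_a = np.mean((u - y) * x)
        grad_b = np.mean(u - y)
        a -= lr * grad_a
        b -= lr * grad_b
    return np.mean((y - (a * x + b)) ** 2)

loss_small = final_loss(0.001)  # strides too short: barely moves in 100 epochs
loss_good = final_loss(0.5)     # reasonable stride: close to the minimum
loss_large = final_loss(2.0)    # strides too long: diverges on this data
print(loss_small, loss_good, loss_large)
```

Only the learning rate differs between the three runs, so the spread in final loss is entirely due to the stride length.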

Implementing the linear neuron from scratch

Re-coding the model "by hand" is the cleanest way to internalise what training really does. We follow the conventional API: a constructor that initialises the parameters, a forward method that produces predictions, a fit method that performs the gradient descent loop. The history attribute stores the value of the loss at each epoch, so that convergence can be plotted afterwards.

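    import numpy as np


    class LinearNeuron1D:
        def __init__(self):
            # Random initial weight and bias, drawn uniformly from [0, 1).
            self.a = np.random.uniform()
            self.b = np.random.uniform()
            self.history = []  # loss recorded at every epoch

        def forward(self, x):
            # Vectorised prediction u = a * x + b for a NumPy array x.
            return self.a * x + self.b

        def fit(self, x, y, lr=0.1, epochs=10):
            for _ in range(epochs):
                u = self.forward(x)
                # Batch gradients (1/2 convention) with respect to a and b.
                grad_a = ((u - y) * x).mean()
                grad_b = (u - y).mean()
                self.a -= lr * grad_a
                self.b -= lr * grad_b
                # MSE of the predictions made before this epoch's update.
                self.history.append(((y - u) ** 2).mean())
            return self.history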

A few details deserve comment. forward is written so that it operates on a NumPy vector x containing all training points at once. The expression (u - y) * x is therefore a vector of element-wise products, and .mean() averages them: a vectorised translation of the formula $\tfrac{1}{n}\sum_i (u_i - y_i)\, x_i$. The same trick is used for $\partial E / \partial b$. Recording the loss at each epoch lets us diagnose training rather than treat it as a black box: a loss curve that decreases smoothly is healthy; one that oscillates or grows tells us the learning rate is too large.

A canonical run on the abalone dataset, predicting Rings (age proxy) from Length, looks like this:

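    import numpy as np
    import pandas as pd

    df = pd.read_csv("abalone_mini.csv")
    x = df["Length"].to_numpy()
    y = df["Rings"].to_numpy()

    model = LinearNeuron1D()
    model.fit(x, y, lr=0.1, epochs=500)
    y_hat = model.forward(x)

    # Fit quality on the training data.
    print("MAE :", np.abs(y - y_hat).mean())
    print("RMSE:", np.sqrt(((y - y_hat) ** 2).mean()))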

The MAE and RMSE printed afterwards quantify the fit; plotting model.history shows the typical exponential-like decrease of the loss over the 500 epochs.

Batch, SGD, and minibatch

So far the gradient at every step has been computed from all training points. This is the batch mode. It is deterministic, but each update costs $O(n)$ operations, which becomes prohibitive on large datasets.

Stochastic gradient descent (SGD) sits at the opposite extreme. At each iteration we pick a single example $(x_i, y_i)$ uniformly at random, compute the gradient on that one point, and update the parameters. The gradient is now a noisy, unbiased estimator of the true batch gradient. Updates are cheap but the trajectory of $E$ is irregular and the loss curve fluctuates from one iteration to the next.

The minibatch strategy is the practical compromise that virtually all modern deep learning relies on. We draw a subset of $B$ examples at random (typically $B = 16$, $32$ or $64$) and update with the average gradient over the minibatch. This reduces gradient noise compared with pure SGD while keeping each step inexpensive, and it maps beautifully onto the vectorised hardware of GPUs.

Batch vs. SGD vs. minibatch.

  • Batch: exact gradient, deterministic trajectory, expensive on large data.
  • SGD: one example per update, very cheap but very noisy.
  • Minibatch: average gradient over $B$ examples, the de-facto default in deep learning.

The three modes share the same update rule; only the index set used to compute the gradient changes.

A single class can support all three by choosing the index set at the start of each iteration:

    import numpy as np


    class LinearNeuron1D:
        def __init__(self):
            self.a = np.random.uniform()
            self.b = np.random.uniform()
            self.history = []

        def forward(self, x):
            return self.a * x + self.b

        def fit(self, x, y, lr=0.1, epochs=10, mode="batch", batch_size=32):
            n = len(x)
            # Note: each pass of this loop performs a single parameter update,
            # whatever the mode.
            for _ in range(epochs):
                if mode == "batch":
                    idx = np.arange(n)                         # all examples
                elif mode == "sgd":
                    idx = np.array([np.random.randint(n)])     # one random example
                elif mode == "minibatch":
                    # Distinct indices; the size is capped at n so that
                    # replace=False cannot fail on small datasets.
                    idx = np.random.choice(n, size=min(batch_size, n), replace=False)
                else:
                    raise ValueError("mode not recognized")
                u = self.forward(x[idx])
                grad_a = ((u - y[idx]) * x[idx]).mean()
                grad_b = (u - y[idx]).mean()
                self.a -= lr * grad_a
                self.b -= lr * grad_b
                # Loss measured on the selected examples only.
                self.history.append(((y[idx] - u) ** 2).mean())
            return self.history

On the same dataset and learning rate, comparing the three modes is instructive. The batch curve descends smoothly and monotonically; the SGD curve is jagged and the loss occasionally rises before resuming its descent; the minibatch curve sits between the two — clearly less noisy than SGD, but much cheaper per epoch than batch.

The noise injected by minibatch sampling is sometimes counted as a feature rather than a bug. On non-convex surfaces — which is the rule for deep networks, although not for our linear neuron — small random perturbations of the gradient help the iterates escape shallow local minima and traverse saddle points where the exact gradient would stall. The same noise also acts as an implicit regulariser: the parameters at convergence are not pinned to a single point but fluctuate in a small region around the minimum, which empirically improves generalisation. None of this is visible on the linear neuron because its cost surface is a global, convex bowl, but the mechanism is universal and worth keeping in mind for the next chapters.

A second practical point concerns the way minibatches are drawn. Above, each minibatch is sampled independently with np.random.choice: indices are distinct within a batch (replace=False), but the same example can reappear in several batches while others are never visited. This is the simplest implementation, and when $B \ll n$ it behaves much like sampling without replacement across batches. In production code, however, one usually shuffles the entire training set at the start of every epoch and partitions it into consecutive minibatches; this guarantees that every example is visited exactly once per epoch and reduces the variance of the per-epoch average loss. PyTorch's DataLoader handles this when constructed with shuffle=True.
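The shuffle-and-partition scheme can be sketched in a few lines of NumPy. The generator name `epoch_minibatches` is a hypothetical helper introduced for this example, not part of the chapter's classes or of any library.

```python
import numpy as np

def epoch_minibatches(n, batch_size, rng):
    """Yield index arrays that together cover 0..n-1 exactly once, in random order."""
    order = rng.permutation(n)                 # reshuffle at the start of the epoch
    for start in range(0, n, batch_size):
        yield order[start:start + batch_size]  # the last batch may be smaller

rng = np.random.default_rng(0)
batches = list(epoch_minibatches(n=10, batch_size=4, rng=rng))
for idx in batches:
    print(idx)
```

Calling the generator once per epoch gives a fresh permutation each time, so every example contributes exactly one gradient term per epoch.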

Linear neuron with multiple inputs

Few real problems are governed by a single explanatory variable. To handle several variables at once, we generalise the scalar weight $a$ to a weight vector $w \in \mathbb{R}^m$ and stack the $n$ observations as rows of a design matrix $X \in \mathbb{R}^{n \times m}$. Each row is one example, each column is one input variable.

The multiple-input linear neuron then computes its predictions in a single matrix–vector product:

$$u \;=\; X w + b, \qquad u \in \mathbb{R}^n.$$

The cost remains the batch MSE, $E(w, b) = \tfrac{1}{n}\sum_i (y_i - u_i)^2$. The gradients generalise just as cleanly (written, as before, with the factor of 2 absorbed into the learning rate). With respect to the weight vector,

$$\frac{\partial E}{\partial w} = \frac{1}{n}\, X^\top (u - y),$$

a vector of size $m$ that aggregates, for each weight $w_j$, the contributions of all observations along column $j$. The bias gradient is the scalar

$$\frac{\partial E}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (u_i - y_i).$$

Vectorised form. All the loops over examples and over input dimensions disappear into matrix products. X @ w computes the predictions for every observation in one call; X.T @ (u - y) / n computes the entire weight gradient. These are the same matrix products that a fully connected layer, such as PyTorch's nn.Linear or a Keras Dense layer in TensorFlow, evaluates in its forward and backward passes.

A from-scratch implementation, parametrised by the same mode parameter as before, is short:

    import numpy as np


    class LinearNeuron:
        def __init__(self):
            self.w = None                 # created in fit, once the number of inputs is known
            self.b = np.random.uniform()
            self.history = []

        def forward(self, X):
            return X @ self.w + self.b

        def fit(self, X, y, lr=0.1, epochs=10, mode="batch", batch_size=32):
            n, m = X.shape
            self.w = np.random.uniform(size=m)
            for _ in range(epochs):
                if mode == "batch":
                    idx = np.arange(n)
                elif mode == "sgd":
                    idx = np.array([np.random.randint(n)])
                elif mode == "minibatch":
                    idx = np.random.choice(n, size=min(batch_size, n), replace=False)
                else:
                    raise ValueError("mode not recognized")
                Xb, yb = X[idx], y[idx]
                u = self.forward(Xb)
                # Vectorised gradients: one matrix product for all weights.
                grad_w = Xb.T @ (u - yb) / len(idx)
                grad_b = (u - yb).mean()
                self.w -= lr * grad_w
                self.b -= lr * grad_b
                self.history.append(((yb - u) ** 2).mean())
            return self.history

Tested on abalone_mini with all explanatory variables and a target of Rings, the model trains without difficulty: a learning rate of $0.1$ and a few hundred minibatch epochs are enough to recover an MAE comparable to that of the closed-form linear regression. The numerical agreement between LinearNeuron and sklearn.linear_model.LinearRegression on the same data is the best sanity check that the gradient formulas have been implemented correctly.
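The same kind of sanity check can be sketched without the course files. The example below uses synthetic data (an assumption, since abalone_mini is not reproduced here) and compares batch gradient descent with the least-squares solution from np.linalg.lstsq; the true coefficients and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
X = rng.normal(size=(n, m))                        # inputs already on a unit scale
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.7   # illustrative ground truth
y = X @ true_w + true_b + rng.normal(0.0, 0.1, size=n)

# Batch gradient descent with the vectorised gradients.
w, b = np.zeros(m), 0.0
for _ in range(500):
    u = X @ w + b
    w -= 0.1 * (X.T @ (u - y)) / n
    b -= 0.1 * np.mean(u - y)

# Closed form: append a column of ones so that the bias is estimated too.
X1 = np.column_stack([X, np.ones(n)])
theta, *_ = np.linalg.lstsq(X1, y, rcond=None)

max_gap = np.max(np.abs(np.append(w, b) - theta))
print(max_gap)
```

On well-scaled inputs like these, the gap between the two parameter vectors should shrink towards machine precision as the number of epochs grows.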

Notice how compact the multi-input version is compared with what one might write in pure Python. There are no nested loops over examples and over weights; everything is expressed as matrix products that NumPy delegates to highly optimised BLAS routines. This vectorisation is not just an aesthetic preference but a quantitative one: a Python loop over $n = 10^4$ examples is typically tens to hundreds of times slower than a single X @ w call, depending on the machine. Once we move to PyTorch later in the course, the same product will run on a GPU and can be faster still; the equations on paper, however, will not change.

Two implementation traps deserve a warning. First, the weight gradient is X.T @ (u - y) / n, not X @ (u - y) / n: the transpose is what aligns dimensions correctly so that the result is a vector of length $m$. Second, when $u$ and $y$ have shapes (n,) and (n, 1) respectively (or vice versa), NumPy's broadcasting rules will compute their difference as a (n, n) matrix and the gradient becomes complete nonsense. Always check u.shape and y.shape against each other when debugging: this single sanity check catches a large fraction of gradient bugs.
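The broadcasting trap is easy to reproduce. In the sketch below the array contents are placeholders; only the shapes matter.

```python
import numpy as np

u = np.zeros(4)         # predictions, shape (4,)
y = np.ones((4, 1))     # targets stored as a column, shape (4, 1)

bad = u - y             # silently broadcasts to shape (4, 4)
good = u - y.ravel()    # flatten the targets first: shape (4,)

print(bad.shape, good.shape)
```

Flattening the target once, right after loading the data, is usually simpler than guarding every subtraction.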

Why we need to normalise the inputs

The picture changes dramatically when we move to house_mini, whose features include surfaces in square feet, numbers of bedrooms, and zip codes, quantities that span very different orders of magnitude. Running the same LinearNeuron with lr=0.1 immediately produces NaN values: the gradient explodes. Even with $\eta = 10^{-10}$, which is small enough to keep the iteration stable, training is so slow that thousands of epochs make no visible progress.

The cause is structural. To see it, consider the simplest possible case, a quadratic cost depending on a single parameter $a$:

$$E(a) = \tfrac{1}{2}\, k\, (a - a^*)^2,$$

where $a^*$ is the optimum and $k > 0$ is the curvature (the Hessian, here a scalar). The gradient is $E'(a) = k(a - a^*)$, so gradient descent reads

$$a_{t+1} = a_t - \eta\, k\, (a_t - a^*).$$

Setting $e_t = a_t - a^*$, the error obeys $e_{t+1} = (1 - \eta k)\, e_t$. Convergence requires $|1 - \eta k| < 1$, that is

$$0 < \eta < \frac{2}{k}.$$

The convergence speed is governed by $|1 - \eta k|$. If $k$ is large (steep, narrow bowl), $\eta$ must be very small or the iteration diverges. If $k$ is small (shallow bowl), the descent is stable but slow.
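The stability threshold can be checked numerically by iterating the recurrence. In the sketch below the curvature $k = 4$, the optimum $a^* = 1$, the starting point and the helper name `run` are arbitrary choices for illustration; with this $k$ the threshold $2/k$ is $0.5$.

```python
def run(eta, k=4.0, a_star=1.0, a0=0.0, steps=20):
    """Iterate a <- a - eta * k * (a - a_star) and return the final |a - a_star|."""
    a = a0
    for _ in range(steps):
        a -= eta * k * (a - a_star)
    return abs(a - a_star)

# Contraction factor |1 - eta * k| for each rate: 0.6, 0.0, 0.8, 1.2.
errors = {eta: run(eta) for eta in (0.1, 0.25, 0.45, 0.55)}
for eta, err in errors.items():
    print(f"eta={eta}: final error {err:.3g}")
```

The rate $\eta = 1/k$ reaches the optimum in a single step, rates just below $2/k$ converge while oscillating around it, and any rate above $2/k$ moves further away at every step.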

For a one-dimensional linear regression the curvature with respect to $a$ is proportional to $\tfrac{1}{n}\sum_i x_i^2$. A feature whose values are in the thousands therefore generates a curvature millions of times larger than a feature in the unit interval, and the same learning rate cannot possibly suit both directions.

In several dimensions the curvature is no longer a scalar but the Hessian matrix $H$. Each eigenvector of $H$ defines a direction of the parameter space with its own curvature, given by the corresponding eigenvalue. The contrast between the largest and smallest eigenvalues, the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$, measures how badly stretched the cost surface is. A poorly conditioned Hessian gives a long, narrow valley along which gradient descent zigzags painfully.
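To put numbers on the conditioning, the sketch below computes $\kappa$ for the weight Hessian $X^\top X / n$ before and after standardising two synthetic features. The feature ranges are invented to mimic a surface area and a room count, and the bias is left out of the Hessian for simplicity; both are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(500.0, 5000.0, size=n),         # large-scale feature (e.g. a surface)
    rng.integers(1, 6, size=n).astype(float),   # small-scale feature (e.g. a room count)
])

def condition_number(X):
    """Ratio of largest to smallest eigenvalue of the Hessian X^T X / n."""
    H = X.T @ X / len(X)
    eig = np.linalg.eigvalsh(H)   # eigenvalues in ascending order
    return eig[-1] / eig[0]

X_std = (X - X.mean(axis=0)) / X.std(axis=0)    # zero mean, unit variance per column

kappa_raw = condition_number(X)
kappa_std = condition_number(X_std)
print(kappa_raw, kappa_std)
```

For independent features like these, standardisation brings $\kappa$ close to 1; strongly correlated features would remain ill-conditioned even after scaling.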

Intuition — round bowls converge faster than long valleys. Normalising the inputs reshapes the cost surface to be closer to a sphere, equalises the curvatures along the parameter axes, and lets a single learning rate work well in every direction.

sklearn scalers

Three scalers cover the vast majority of cases.

The MinMaxScaler maps each feature linearly to $[0, 1]$ via $x_{\text{scaled}} = (x - x_{\min})/(x_{\max} - x_{\min})$. It preserves the shape of the distribution but is sensitive to outliers, since a single extreme value dictates $x_{\max}$.

The StandardScaler centres each feature on zero and scales it to unit variance: $x_{\text{scaled}} = (x - \mu)/\sigma$. This is the default choice for gradient descent: giving every feature the same variance equalises the curvatures along the weight directions, which is what the convergence analysis above calls for.

The RobustScaler subtracts the median and divides by the interquartile range, $x_{\text{scaled}} = (x - \text{med})/\text{IQR}$. It is much less affected by outliers and should be preferred when the data contains heavy tails or extreme values.
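The three formulas can be written out directly in NumPy, which makes the effect of an outlier easy to see. This is a hand-written sketch on a made-up feature, not a call to the sklearn classes; it assumes the population standard deviation and the 25th and 75th percentiles for the interquartile range.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one feature with a single outlier

# Min-max scaling: the outlier alone fixes the upper end of the range.
minmax = (x - x.min()) / (x.max() - x.min())

# Standardisation: zero mean, unit variance (population std, ddof=0).
standard = (x - x.mean()) / x.std()

# Robust scaling: median and interquartile range ignore the outlier.
q1, med, q3 = np.percentile(x, [25, 50, 75])
robust = (x - med) / (q3 - q1)

print(minmax)
print(standard)
print(robust)
```

With min-max scaling the four ordinary values are squeezed into the bottom few percent of the interval, whereas robust scaling keeps them spread around zero and pushes only the outlier far away.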

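    from sklearn.preprocessing import StandardScaler

    # X_train and X_test are the feature matrices of an existing train/test split.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)  # estimate mean and std on the training set, then scale it
    X_test = scaler.transform(X_test)        # reuse the training statistics on the test set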

The split between fit_transform on the training set and transform on the test set is not a stylistic detail. The statistics ($\mu$, $\sigma$, min, max, …) must be estimated only on the training data. If we computed them on the full dataset instead, information from the test set would leak into the model and our estimate of generalisation performance would be optimistically biased. The fit/transform split is precisely what sklearn Pipelines automate to make this discipline impossible to forget.

After standardising the features of house_mini, the same LinearNeuron that previously diverged converges happily with lr=0.01 and a few hundred minibatch epochs, producing a sensible MAE on the test set. The only thing that changed is the scale of the inputs.

A subtler corollary is that the target variable $y$ is sometimes worth scaling too. In a regression problem where prices are measured in hundreds of thousands, the unnormalised loss $\tfrac{1}{n}\sum_i (y_i - u_i)^2$ has values on the order of $10^{10}$, and the gradient is correspondingly large. Centring and scaling $y$, most often via StandardScaler again, applied separately to the target, keeps both the loss and its gradient in a numerically friendly range. After training, predictions are mapped back to the original scale by inverting the scaler. This rescaling makes essentially no difference when the closed-form solution is used, but it can make a real difference for iterative methods.
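Scaling the target and mapping predictions back can be sketched as follows. The prices are synthetic and the helper name `to_original_scale` is hypothetical; with sklearn, the scaler's inverse_transform method plays the same role.

```python
import numpy as np

rng = np.random.default_rng(0)
y_train = rng.uniform(2e5, 8e5, size=100)   # illustrative prices

# Statistics estimated on the training targets only.
mu_y, sigma_y = y_train.mean(), y_train.std()
y_scaled = (y_train - mu_y) / sigma_y       # what the neuron would be trained on

def to_original_scale(u_scaled):
    """Map predictions made in scaled units back to prices."""
    return u_scaled * sigma_y + mu_y

roundtrip = to_original_scale(y_scaled)
print(y_scaled.mean(), y_scaled.std())
```

Error metrics such as the MAE should be reported after the inverse mapping, so that they are expressed in the original units.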

The two laws of preprocessing.

  1. Always normalise the inputs before iterative training. A StandardScaler step costs nothing and prevents the worst convergence pathologies.
  2. Always fit the scaler on the training set only. Computing $\mu$ and $\sigma$ on the union of train and test data is data leakage, even if it looks innocent.

Exercises

  1. Partial derivative of the bias. Starting from $E_i = \tfrac{1}{2}(y_i - u_i)^2$ with $u_i = a x_i + b$, derive $\partial E_i / \partial b$ by hand. Then write the batch version $\partial E / \partial b$ as an average over $i = 1, \dots, n$.

  2. LinearNeuron1D from scratch. Implement the class described in the chapter. The constructor must initialise $a$ and $b$ randomly; forward(x) must return $a x + b$ vectorised over a NumPy array; fit(x, y, lr, epochs) must run batch gradient descent. Test on abalone_mini with Length as input and Rings as target.

  3. Loss history. Add a history attribute initialised to [] and append the value of the cost $E(a, b)$ at each epoch. Plot the resulting curve as a function of the epoch number. Vary lr and observe how the curve changes, including the lr for which the loss diverges.

  4. Three modes of gradient descent. Extend fit with a mode argument taking values "batch", "sgd", "minibatch", plus a batch_size parameter used only in minibatch mode. Useful NumPy calls: np.random.randint(0, n) draws a single index, np.random.choice(n, size=B, replace=False) draws $B$ distinct indices, x[idx] extracts a subset. Verify on the same data and the same number of epochs that the SGD trajectory is noisier than the batch trajectory and that minibatch sits between the two.

  5. LinearNeuron with multiple inputs. Implement the multi-input class. __init__ should leave self.w = None; fit should initialise the weight vector on its first call from the shape of $X$. The weight gradient must use the vectorised expression grad_w = X.T @ (u - y) / n. Test on abalone_mini with all explanatory variables and Rings as target, then on house_mini with price as target.

  6. Without normalisation. Run the multi-input neuron on the raw features of house_mini. Find the largest learning rate that does not cause the loss to blow up (it will be very small, of the order of $10^{-10}$). Then standardise the features with StandardScaler (fitting on the training set only, transforming both train and test) and increase the learning rate progressively: $0.01$, $0.1$, $0.5$, $1.0$. Conclude on the role of normalisation.

Recap

This first chapter of deep learning has covered a small number of ideas, but each of them is foundational and will reappear unchanged in every subsequent model. We start from a parametric model, here the linear neuron $u = X w + b$, and a cost function that measures how far the predictions are from the targets. Training is the iterative minimisation of that cost via gradient descent: we compute the partial derivatives of the cost with respect to every parameter, take a step in the opposite direction proportional to a learning rate, and repeat. The choice of how much data to use at each step distinguishes batch (everything), SGD (one example), and minibatch (a small random subset): three flavours of the same algorithm. Finally, the geometry of the cost surface depends directly on the scale of the inputs, which is why standardising the features is part of the pre-flight checklist of any deep learning experiment, and why this discipline must be carried out without leaking test-set information into the training procedure. Every later chapter (non-linear activations, deep networks, convolutions) adds expressivity to the model but reuses, verbatim, the optimisation machinery introduced here.

Going further

The linear neuron is the simplest possible deep learning component, but it is not a toy: the same model is what nn.Linear(in_features, out_features) implements in PyTorch. The forward pass of nn.Linear(m, 1) is exactly $u = X w + b$; what changes is that PyTorch tracks the computational graph automatically, computes the gradients with loss.backward(), and exposes a family of optimisers (torch.optim.SGD, torch.optim.Adam, …) that apply the update rule for us. We will meet these mechanisms in the next chapter, but the equations are unchanged; only the bookkeeping is hidden.

For pure regression on small or medium tabular data, the closed-form solution provided by numpy.linalg.lstsq or by sklearn.linear_model.LinearRegression is faster, more accurate, and free of hyperparameters. Iterative gradient descent only becomes the better choice when the closed form is unavailable or impractical, typically because the model is non-linear (deep network) or because the dataset is too large to hold in memory while forming and solving the normal equations in $X^\top X$. scipy.optimize.minimize offers a richer toolbox of generic optimisers (BFGS, L-BFGS, Nelder-Mead) that can be useful as a sanity check or for bespoke loss functions.

To go deeper into the theory of gradient descent, the reference is Goodfellow, Bengio and Courville's Deep Learning (MIT Press, 2016), particularly Chapter 4 on numerical computation and Chapter 8 on optimisation. The derivation of the convergence of gradient descent on a quadratic, sketched above in one dimension, is generalised there to the full eigenvalue analysis of the Hessian. The key takeaway carries over from the linear neuron to much deeper networks: the conditioning of the optimisation problem depends strongly on the conditioning of its inputs, and that is something the practitioner controls before training even starts.