Backpropagation
A single backward pass that recovers the gradient of every weight: the chain rule, industrialised.
In chapter 7, we learned to compute for a single weight by multiplying local derivatives along the graph of one neuron. That was the chain rule, laid down brick by brick. But a real network does not have one weight: it has thousands, chained across several layers. Recomputing that from scratch for every weight would be ruinous.
So the question of this chapter is precise: how do we get the derivative of the score with respect to every weight at once, in a single pass, without repeating the same multiplications a thousand times? The answer is an algorithm of rare elegance, and you already hold all its pieces. All that remains is to organise them.
Act 1: the cost of brute force
Recall the end of chapter 7. For a single neuron we had
Three factors, multiplied along the path . Simple. Now stack two layers. The score depends on the weights of the last layer, but also on those of the hidden layer, which act earlier and whose effect must cross the whole rest of the network before reaching the score.
The naive approach treats each weight separately: for each one, start again from the score and walk the chain of derivatives down to it. The problem is obvious as soon as you count. Two weights that end at the same output neuron share exactly the same start of the chain. Recomputing it for each is redoing the same multiplication over and over. On a network with a million weights, that is a million almost identical chains, recomputed for nothing.
The liberating idea fits in one sentence: what if we computed the shared pieces once, stored them, and reused them? That is precisely what backpropagation does. The word names the algorithm that computes the gradient of the score with respect to every weight by climbing the network from output to input, reusing intermediate results instead of recomputing them.
Act 2: the error signal and the two passes
The piece shared by every chain passing through a neuron is the sensitivity of the score to that neuron. We give it a name and a symbol. The error signal Error signal The sensitivity of the loss to a neuron's pre-activation, written delta = dL/dz. It measures how much the score would change if the neuron's net input shifted by a hair. Backpropagation computes this signal for every neuron, from output back to input, then derives each weight's gradient through the rule dL/dw = delta times the upstream activation. of a neuron, written , is the derivative of the score with respect to that neuron’s pre-activation:
It reads: how much the score would change if the net input of this neuron shifted by a hair. Once this number is known for a neuron, the gradient of each of its incoming weights follows from a single product. This is the master formula of the chapter:
the error signal of the arriving neuron , times the activation of the departing neuron . So all the work reduces to computing the . And for that, we proceed in two passes.
The forward pass stores. We first run a normal forward pass, from inputs to output, but this time we keep in memory every pre-activation and every activation . They will be reused.
The backward pass propagates the error. The backward pass Backward pass The phase of backpropagation where the error signal propagates from output to input, the reverse direction of the forward pass, to assemble each weight's gradient. It reuses the activations stored during the forward pass instead of recomputing them. starts from the output and climbs back. At the output neuron, the error signal combines the slope of the score and the slope of the activation:
Then we step back one layer. The error signal of a hidden neuron is the error of the neurons it feeds, brought back to it by weighting with the connection weights, then modulated by its own local slope:
The sum runs over every neuron of the next layer that feeds. That is the whole meaning of the prefix: the error propagates backwards, from output to input, each neuron receiving the share of error it helped create downstream.
Let us unroll this on a small toy network with two inputs, two hidden neurons and one output, with the following weights: the hidden layer has and zero biases, the output has and a bias . We feed the input with target .
The forward pass gives , hence , then and . The score is . The backward pass gives first , from which we read the output gradients, then the hidden signals , from which we read the hidden-layer gradients. You will already notice that the hidden error signals are nearly ten times smaller than the output one: remember it, act 3 will return to this.
Step through the component: the forward pass lights up the network from left to right, then the backward pass carries the error signal back from right to left.
Backpropagation, step by step
Three things to watch as you play:
- During the forward pass, each neuron stores its activation. During the backward pass, those activations come straight back out in the products: nothing is recomputed, everything is reused.
- The output error signal spreads to the hidden neurons by following the weights backwards. A hidden neuron linked by a strong weight receives a larger share of the error.
- Each weight gradient is the product of two already computed numbers: the error signal of the arriving neuron and the activation of the departing neuron. The master formula, in action.
Act 3: why “backward”, and the product that collapses
The direction of the backward pass is not an arbitrary choice, it is forced by the graph. Reading a network as a computation graph Computation graph A representation of a computation as a directed graph whose nodes are operations and whose edges are the values flowing between them. Reading a network as a computation graph makes backpropagation systematic: local derivatives are multiplied along the edges, from outputs back to inputs. means seeing each value as a node and each dependency as an arrow. Now the arrows of computation go from input to output: determines , which determines , which determines . The derivatives, on the other hand, travel the opposite way, because to know how depends on a deep weight, we must climb the whole chain separating that weight from the score. The computation descends, the error climbs.
This climb has a heavy consequence. The error signal of a deep neuron is a product: at each layer crossed backwards, we multiply by a new factor . But the derivative of the sigmoid never exceeds , its maximum value, reached at . Multiplying numbers all below , layer after layer, melts the result exponentially. After a few layers, the error signal of the first layers is so small that they barely learn any more. This is the vanishing gradient Vanishing gradient The disappearance of the gradient in deep network layers. When the maximum derivative of an activation function is below 1, the gradient multiplies at each layer crossed and collapses exponentially. Identified by Glorot and Bengio (2010), it is one of the reasons for the shift to ReLU. Source: Glorot and Bengio, 2010 , and you saw its seed in the toy network already, where the hidden gradients were noticeably smaller than the output ones.
Stack layers in the simulator and watch the gradient collapse. Toggle the activation to compare: the sigmoid crushes everything, whereas ReLU, whose derivative equals in its active region, lets the signal through.
Drag to stack layers and watch the product of derivatives collapse.
Three things to watch as you play:
- With the sigmoid, each added layer divides the gradient by at least four. Over six layers, almost nothing is left for the first ones.
- With ReLU, the derivative equals in the active region, so the product does not crush the same way. This is one of the main reasons for the massive shift to ReLU.
- The phenomenon is a pure consequence of backpropagation: the deep gradient is a product of local derivatives, and a product of small numbers is tiny.
Exercises
Grab something to write with. The solutions are deliberately detailed, line by line. We keep the toy network from act 2, and you are given the useful sigmoid values: and .
Exercise 1. Run the forward pass of the toy network. Compute , , then , using the sigmoid values provided.
Exercise 2. You are given and the target . Compute the output error signal , then the gradient of the first output weight.
Exercise 3. Backpropagate the signal to the first hidden neuron. Compute , then the gradient (the weight from the first input to the first hidden neuron). Recall , and .
Exercise 4. To understand the vanishing gradient, suppose a chain of sigmoid neurons all centred at , where . Give the accumulated multiplicative factor after 3 layers, then after 6 layers.
In one sentence
Backpropagation computes the gradient of every weight in one forward pass that stores the activations and one backward pass that propagates the error signal from output to input, each weight gradient being the product of the arriving neuron’s error signal by the departing neuron’s activation.
Quiz
1. What does the error signal δ of a neuron represent?
2. Why does the forward pass store the activations?
3. How do we obtain the error signal of a hidden neuron?
4. Why do we speak of "backward" propagation?
5. Where does the vanishing gradient with the sigmoid come from?
Towards chapter 9
We can now compute, in a single backward pass, the gradient of the score with respect to every weight of the network. So we hold the full compass: for each weight, in which direction and by how much moving it would lower the score. But knowing where to descend is not descending. How far to step each time? What happens if the step is too large, or too small? Gradient descent, in chapter 9, turns this gradient into a weight-update rule, and finally makes the network learn.
Sources
- Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986). “Learning representations by back-propagating errors.” Nature 323, 533-536. DOI 10.1038/323533a0
- Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. Section 6.5 (backpropagation and differentiation). https://www.deeplearningbook.org
- Nielsen, M. (2015). Neural Networks and Deep Learning. Chapter 2, on the backpropagation algorithm. http://neuralnetworksanddeeplearning.com/chap2.html