Derivatives and the chain rule
From the slope of a curve to the gradient: the tool that turns the cost landscape into a path of descent.
In chapter 6, we learned to compute a network’s score and read it as an altitude in a cost landscape. Good networks sit in the valleys, bad ones on the peaks, and learning means descending toward a low point. The decisive question remains: around the point where we currently stand, in which direction should each weight be adjusted to bring the score down?
To answer it, we need to measure which way the surface tilts. That local tilt has a precise mathematical name: the derivative. This chapter lays down the three tools on which all of learning depends. The derivative measures the slope of a curve at a point. The chain rule propagates that slope through a composition of computations. The gradient generalises the slope to many weights at once. None of these tools is hard, and each one has a clear geometric reading. By the end, the static landscape of chapter 6 will have become a path of descent.
Act 1: the derivative is a slope
Take a curve, say $f(x) = x^2$. Pick a point on it. If you zoom in very close around that point, the curve eventually looks like a straight line. The slope of that line is the derivative of $f$ at that point. It tells you two things at once: in which direction the output moves when the input increases, and by how much.
To define it properly, we start from the rate of change between two nearby points, $x$ and $x + h$:
$$\frac{f(x+h) - f(x)}{h}$$
This is the slope of the line cutting the curve at those two points, the secant. As we let the gap $h$ tend to $0$, the second point slides toward the first, and the secant tips over toward the tangent. The derivative is the limit of that rate:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
For $f(x) = x^2$, this computation gives $f'(x) = 2x$ (we work it out by hand in exercise 1). The slope is therefore $2$ at $x = 1$, it is $0$ at $x = 0$ (the bottom of the bowl, where the curve is flat), and $-2$ at $x = -1$ (the curve is descending).
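The limit can also be watched numerically. The sketch below is only an illustration: it assumes the example curve is the parabola $f(x) = x^2$, and the helper name `secant_slope` and the evaluation point `x0 = 1.0` are chosen here, not taken from the course.

```python
def f(x):
    # Assumed example curve: a parabola with its minimum at x = 0.
    return x * x


def secant_slope(func, x, h):
    """Rate of change of func between the two points x and x + h."""
    return (func(x + h) - func(x)) / h


x0 = 1.0  # hypothetical point at which we look at the slope
for h in (1.0, 0.1, 0.01, 0.001):
    # As h shrinks, the secant slope gets closer to the tangent slope.
    print(f"h = {h:<6} secant slope = {secant_slope(f, x0, h):.4f}")
```

For this parabola the secant slope at $x_0 = 1$ works out to $2 + h$, so the printed values approach $2$ as $h$ shrinks.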
A few derivatives come up constantly in this course. Here they are once and for all:
The last one, the sigmoid, is remarkable: its slope is expressed in terms of its own value. We will use it in Act 2 and again in chapter 8.
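The sigmoid property can be checked numerically. The following is a minimal sketch, assuming the usual definition $\sigma(z) = 1 / (1 + e^{-z})$ and the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. The function names and the test points are chosen here for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    # The slope written in terms of the sigmoid's own value.
    s = sigmoid(z)
    return s * (1.0 - s)


def numerical_derivative(func, x, h=1e-6):
    # Central difference: a symmetric version of the rate of change.
    return (func(x + h) - func(x - h)) / (2.0 * h)


for z in (-2.0, 0.0, 2.0):
    exact = sigmoid_prime(z)
    approx = numerical_derivative(sigmoid, z)
    print(f"z = {z:+.1f}  formula = {exact:.6f}  numerical = {approx:.6f}")
```

The two columns should agree to several decimal places. The slope is largest at $z = 0$, where it equals $0.25$.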
Interactive figure: The derivative is the slope of a tangent. Slide the point along the curve, and narrow the gap $h$ to watch the secant tip over toward the tangent.
Three things to notice while you play:
- The smaller you make $h$, the closer the secant’s slope gets to the exact slope $f'(x)$. The gap between the two shrinks toward zero: that is the limit, made visible.
- At the bottom of the bowl, at $x = 0$, the tangent is horizontal and the slope is $0$. That is the signature of a minimum, and it will be the goal of learning.
- To the left of the bottom the slope is negative, to the right it is positive. The sign of the derivative tells you which way the curve descends.
Act 2: the chain rule
A network is not a single function but a stack of nested functions. The output of one layer feeds the next. To differentiate such a stack, we need to know how to differentiate a composition of functions (one function applied to the result of another, written $f \circ g$), that is, a computation of the form $f(g(x))$.
The chain rule answers exactly that need. It says the derivative of a composition is the product of the local derivatives:
$$\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$$
The intuition is mechanical. If $g$ amplifies variations by a factor $g'(x)$, and $f$ amplifies them by a factor $f'(g(x))$ at the point where it receives its input, then the whole thing amplifies by a factor equal to the product of the two. We multiply the slopes along the path.
Let us work through a numerical example. Let and . The local derivatives are and . The chain rule gives
At , we have , so .
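As a sanity check on the rule, here is a small numerical sketch. The functions $g(x) = 3x + 1$ and $f(u) = u^2$ are assumptions chosen for illustration, not necessarily the pair used in the example above. The code multiplies the two local derivatives and compares the product with a direct numerical slope of the composition.

```python
def g(x):
    return 3.0 * x + 1.0   # hypothetical inner function


def g_prime(x):
    return 3.0


def f(u):
    return u * u           # hypothetical outer function


def f_prime(u):
    return 2.0 * u


def composed(x):
    return f(g(x))


def chain_rule(x):
    # Outer slope evaluated at g(x), times the inner slope at x.
    return f_prime(g(x)) * g_prime(x)


def numerical_derivative(func, x, h=1e-6):
    return (func(x + h) - func(x - h)) / (2.0 * h)


x0 = 2.0
print("chain rule :", chain_rule(x0))
print("numerical  :", numerical_derivative(composed, x0))
```

At $x_0 = 2$ the inner function gives $g(2) = 7$, the local slopes are $f'(7) = 14$ and $g'(2) = 3$, and their product is $42$. The numerical estimate should match it closely.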
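A neuron is this same mechanism. This is the key point of the chapter. Take a single neuron and its score, as in chapter 6. The input $x$ passes through three nested computations:
$$x \;\longrightarrow\; z = wx + b \;\longrightarrow\; a = \sigma(z) \;\longrightarrow\; L$$
where $L$ is the score computed from the output $a$. To know how to adjust the weight $w$, we need $\frac{dL}{dw}$. This is a composition of three functions, so we multiply three local derivatives along the path $w \to z \to a \to L$:
$$\frac{dL}{dw} = \frac{dL}{da} \cdot \frac{da}{dz} \cdot \frac{dz}{dw} = \frac{dL}{da} \cdot a(1-a) \cdot x$$
The first derivative comes from $L$, the second is the derivative of the sigmoid (written using its own value, $a(1-a)$), the third comes from $z = wx + b$. For the bias, only the last factor changes: $\frac{dz}{db} = 1$, so $\frac{dL}{db} = \frac{dL}{da} \cdot a(1-a)$.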
You have just computed how the score responds to a weight, by propagating local derivatives backward through the graph. That is exactly backpropagation. Chapter 8 will simply apply this mechanically to a full network.
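The same computation can be sketched in a few lines of code. Several things here are assumptions made for illustration: the neuron is taken as $z = wx + b$, $a = \sigma(z)$, the score is taken to be the squared error $(a - y)^2$ (chapter 6 may define it differently), and the numerical values of `w`, `b`, `x`, `y` are arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x, y):
    z = w * x + b
    a = sigmoid(z)
    loss = (a - y) ** 2      # assumed score: squared error
    return z, a, loss


def gradients(w, b, x, y):
    """Multiply the local derivatives backward along the path."""
    _, a, _ = forward(w, b, x, y)
    dL_da = 2.0 * (a - y)    # from the assumed squared-error score
    da_dz = a * (1.0 - a)    # sigmoid slope, written with its own value
    dz_dw = x                # from z = w * x + b
    dz_db = 1.0
    return dL_da * da_dz * dz_dw, dL_da * da_dz * dz_db


def loss_only(w, b, x, y):
    return forward(w, b, x, y)[2]


w, b, x, y = 0.5, -0.2, 1.5, 1.0   # hypothetical values
dL_dw, dL_db = gradients(w, b, x, y)

# Independent check with a small symmetric nudge of each parameter.
h = 1e-6
num_dw = (loss_only(w + h, b, x, y) - loss_only(w - h, b, x, y)) / (2 * h)
num_db = (loss_only(w, b + h, x, y) - loss_only(w, b - h, x, y)) / (2 * h)

print(f"dL/dw: chain rule = {dL_dw:.6f}, numerical = {num_dw:.6f}")
print(f"dL/db: chain rule = {dL_db:.6f}, numerical = {num_db:.6f}")
```

With these values the output $a$ is below the target $y$, so both derivatives come out negative: increasing $w$ or $b$ would bring the score down.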
Interactive figure: Multiplying slopes along a chain. Toggle between the abstract composition and the neuron graph, and watch the local derivatives multiply along the path.
Three things to notice while you play:
- In composition mode, each edge carries its local derivative, and the product along the path reconstitutes $\frac{d}{dx} f(g(x))$. Change $x$: the product follows.
- In neuron mode, the same mechanism computes $\frac{dL}{dw}$ and $\frac{dL}{db}$. The edges are the same building blocks; only the functions change.
- When one factor in the product is small (for instance when the neuron is saturated), the whole product collapses. That is the origin of the vanishing gradient, a problem we will encounter further on.
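The last observation can be made concrete with a short sketch. It is a simplification under stated assumptions: it multiplies only sigmoid slopes $a(1-a)$ along a hypothetical path of three neurons and ignores the weights that would also appear in a real product. The pre-activation values are invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z):
    a = sigmoid(z)
    return a * (1.0 - a)


def product_of_slopes(pre_activations):
    """Multiply the local sigmoid slopes along a path."""
    product = 1.0
    for z in pre_activations:
        product *= sigmoid_slope(z)
    return product


# Three neurons in their responsive zone (z = 0, slope 0.25 each).
active = product_of_slopes([0.0, 0.0, 0.0])

# Same path, but the middle neuron is saturated (z = 10).
saturated = product_of_slopes([0.0, 10.0, 0.0])

print(f"all neurons active    : {active:.6f}")
print(f"one neuron saturated  : {saturated:.2e}")
```

A single saturated neuron is enough to shrink the whole product by several orders of magnitude, even though the other two factors are unchanged.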
Act 3: partial derivatives and the gradient
A real network does not have one weight but millions. The score is then a function of all those weights at once. How do we talk about slope when there are many directions to move in?
The answer fits in a single gesture: we look at one direction at a time. The partial derivative with respect to one weight is the slope obtained by varying only that weight, all others held constant. To write it, we replace the straight $d$ with a round $\partial$: $\frac{\partial L}{\partial w_1}$ is the slope along $w_1$ alone.
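Take a toy cost with two weights, $L(w_1, w_2) = w_1^2 + 2 w_2^2$. Its two partial derivatives are
$$\frac{\partial L}{\partial w_1} = 2 w_1, \qquad \frac{\partial L}{\partial w_2} = 4 w_2$$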
Stack those slopes into a vector, and you get the gradient, written $\nabla L$:
$$\nabla L = \left( \frac{\partial L}{\partial w_1}, \; \frac{\partial L}{\partial w_2} \right)$$
The gradient is not just a list of numbers. Seen as an arrow at the current point, it points in the direction of steepest ascent of the score, and its length measures the steepness of that ascent. To bring the score down, it suffices to go in the opposite direction, $-\nabla L$. We finally hold the compass of descent.
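Here is a minimal sketch of these ideas in code. The toy cost $L(w_1, w_2) = w_1^2 + 2 w_2^2$ used below is an assumption made for illustration, as are the function names and the chosen point. Each partial slope is estimated by nudging one weight while holding the other fixed, then compared with the formula.

```python
def cost(w1, w2):
    # Assumed toy cost: a bowl that is steeper along w2 than along w1.
    return w1 ** 2 + 2.0 * w2 ** 2


def gradient(w1, w2):
    # Partial derivatives of the assumed cost, stacked into a pair.
    return (2.0 * w1, 4.0 * w2)


def numerical_gradient(func, w1, w2, h=1e-6):
    # Vary one weight at a time, the other held constant.
    d1 = (func(w1 + h, w2) - func(w1 - h, w2)) / (2.0 * h)
    d2 = (func(w1, w2 + h) - func(w1, w2 - h)) / (2.0 * h)
    return (d1, d2)


point = (1.0, 1.0)                    # hypothetical current weights
grad = gradient(*point)
descent = (-grad[0], -grad[1])        # opposite of the gradient

print("gradient          :", grad)
print("numerical estimate:", numerical_gradient(cost, *point))
print("descent direction :", descent)
```

At the point $(1, 1)$ the gradient of this cost is $(2, 4)$ and the descent direction is $(-2, -4)$. At the bottom of the bowl, $(0, 0)$, both components vanish.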
Interactive figure: The gradient, compass for the descent. Move the point on the landscape and read both partial slopes and the gradient arrow. The gradient is the compass for the descent in chapter 9.
Three things to notice while you play:
- The gradient arrow always points uphill, away from the valley. The descent arrow points toward the valley.
- At the bottom of the bowl, both partial slopes vanish and the gradient becomes zero. That is the signal of a minimum, the multi-weight version of the horizontal tangent from Act 1.
- The coefficient $2$ in front of $w_2^2$ makes the slope along $w_2$ twice as steep: the gradient does not point toward the centre in a straight line, it leans toward the steeper axis.
Exercises
Grab something to write with. The solutions are deliberately detailed, line by line.
Exercise 1. Recover the derivative of $f(x) = x^2$ from the definition, that is, compute the limit of $\frac{(x+h)^2 - x^2}{h}$ as $h \to 0$.
Exercise 2. Let . Compute using the chain rule, then give its value at .
Exercise 3. For a neuron, with and . Give , then .
Exercise 4. Let . Compute the gradient at the point , then give the direction of descent.
In one sentence
The derivative measures the slope of a function at a point, the chain rule propagates that slope through nested computations by multiplying the local derivatives, and the gradient stacks those slopes into a compass that tells, for each weight, which way to bring the score down.
Quiz
1. What does the derivative of a function at a point measure?
2. How does the chain rule differentiate a composition f(g(x))?
3. Why does computing dL/dw for a neuron prefigure backpropagation?
4. What is a partial derivative?
5. Which direction does the gradient point, and what do we do to bring the score down?
Toward chapter 8
We now know how to compute the slope of the score with respect to a single isolated weight, by multiplying local derivatives along a neuron’s computation graph. But a real network has thousands of weights chained across many layers. Redoing this computation from scratch for each one would be ruinously expensive. Backpropagation, in chapter 8, organises this computation into a single backward pass through the network that recovers all the partial derivatives at once. It is the chain rule, industrialised.
Sources
- Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. Chapter 6 (and section 6.5 on backpropagation and differentiation). https://www.deeplearningbook.org
- Spivak, M. (2008). Calculus (4th ed.). Publish or Perish. Chapters on the derivative and the chain rule.
- Nielsen, M. (2015). Neural Networks and Deep Learning. Chapter 2, on propagating derivatives through a network. http://neuralnetworksanddeeplearning.com/chap2.html