Forward pass and loss functions
From input to prediction, then how to give it a score.
In chapter 5, we saw that a multilayer network can express almost any function, provided its weights are set well. For XOR, we set them by hand. But setting weights by hand does not scale: a real network has millions of them. So we want to set them automatically.
The very first building block of that automation is to know two things: how to compute the network’s output for a given input, then how to give that output a score that says how wrong it is. Learning, for a network, will be the search to improve that score. This chapter builds those two bricks. It does not yet compute any weight correction: it sets the playing field on which, from chapter 7 on, the gradient will operate.
Act 1: computing the prediction
The forward pass (or forward propagation) is the computation that turns an input into a prediction, traversing the network layer by layer, from input to output. Each layer applies its formula, a weighted sum followed by an activation function, and its output becomes the input of the next.
Take a small 3-class classifier: two inputs, a hidden layer of two neurons, an output layer of three neurons. Feed it the input .
The hidden layer. Its weights and biases are
We first compute the weighted sum of each neuron, the pre-activation :
Then we apply the activation. With ReLU, which replaces every negative by zero:
The first hidden neuron switched off, the second stayed active.
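The two steps of the hidden layer, weighted sum then ReLU, can be sketched in plain Python. This is only an illustration: the helper names `dense` and `relu`, the input `x`, and the values of `W1` and `b1` are hypothetical, chosen so that the first neuron switches off and the second stays active. They are not the chapter's numbers.

```python
def dense(weights, biases, inputs):
    """Weighted sum of each neuron: one row of weights and one bias per neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Replace every negative value by zero."""
    return [max(0.0, v) for v in values]


# Hypothetical values, for illustration only (not the chapter's numbers).
x = [1.0, 2.0]
W1 = [[1.0, -1.0],   # weights of hidden neuron 1
      [0.5, 0.5]]    # weights of hidden neuron 2
b1 = [0.0, 0.5]

z1 = dense(W1, b1, x)   # pre-activations
h = relu(z1)            # activations after ReLU
print("pre-activations:", z1)
print("activations:    ", h)
```

With these made-up values the first pre-activation is negative, so ReLU sets it to zero, while the second passes through unchanged.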
The output layer. Its weights and biases are
We compute the three pre-activations, here called the logits :
From scores to probabilities. Three logits are three arbitrary numbers. For a classifier we want three probabilities, positive and summing to $1$. That is the role of the softmax function, which turns a vector of reals $z$ into a probability distribution:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
On our logits :
The network predicts class 2 with about confidence. That is a full forward pass: from to a probability distribution.
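As a rough sketch, softmax fits in a few lines of Python. The logits below are hypothetical, not the chapter's. The function also subtracts the largest logit before exponentiating, a common numerical-stability trick that the chapter does not discuss and that leaves the result unchanged.

```python
import math


def softmax(logits):
    """Turn a list of real numbers into positive values that sum to 1."""
    m = max(logits)                          # stability shift (assumption, see lead-in)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 3.0, 0.5]                     # hypothetical logits
probs = softmax(logits)
predicted = probs.index(max(probs))          # 0-based position of the largest probability

print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))
print("largest probability at position", predicted)
```

The largest logit always receives the largest probability, so the predicted class can be read from the logits or from the probabilities alike.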
Step through it yourself, and toggle the hidden activation between ReLU and sigmoid to see how each one transforms the same input.
Interactive figure: The forward pass, layer by layer
Three things to watch while playing:
- At each step, the active layer lights up in the diagram. Information always moves from input to output, never backward.
- Toggle from ReLU to sigmoid: the same input produces different hidden values, but the predicted class stays 2. The network’s structure matters as much as the exact values.
- The last step, softmax, turns three arbitrary logits into three probabilities that sum to $1$.
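The toggle between ReLU and sigmoid can be imitated offline. The sketch below chains the hidden layer, the output layer and softmax, with the hidden activation passed as a parameter. The names `dense`, `forward` and `params`, and every weight value, are assumptions made up for this example. They are not the chapter's network.

```python
import math


def dense(weights, biases, inputs):
    """Weighted sum of each neuron of a layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return [e / sum(exps) for e in exps]


def forward(x, params, activation):
    """Input -> hidden layer -> output layer -> probabilities."""
    W1, b1, W2, b2 = params
    hidden = [activation(z) for z in dense(W1, b1, x)]
    return softmax(dense(W2, b2, hidden))


# Hypothetical 2-2-3 network (two inputs, two hidden neurons, three outputs).
params = ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.5],
          [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]], [0.0, 0.0, 0.0])
x = [1.0, 2.0]

for name, act in (("relu", relu), ("sigmoid", sigmoid)):
    probs = forward(x, params, act)
    print(name, [round(p, 3) for p in probs], "argmax position:", probs.index(max(probs)))
```

With these made-up weights the two activations give different probabilities but select the same output position, which mirrors the observation in the list above.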
Act 2: giving a score
We can compute a prediction. It remains to say how good it is. A loss function (also called a cost function) takes the prediction and the target, and returns a single number: the score. The larger the number, the more wrong the network. Learning will consist of making that number smaller.
Not all tasks are scored the same way. Let us look at the two big families.
Regression: mean squared error
When we predict a continuous quantity (a price, a temperature), it is regression. The natural score is the mean squared error (MSE; Goodfellow, Bengio and Courville, 2016), the average of the squared gaps between prediction and target:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

where $\hat{y}_i$ is the prediction and $y_i$ the target for example $i$, out of $n$ examples.
Why the square, and not the absolute value? For two reasons. First, the square penalizes large gaps far more than small ones: a gap ten times larger costs a hundred times more, not ten times. Second, the square is smooth, differentiable everywhere, whereas the absolute value has a sharp corner at zero. That smoothness will be precious in chapter 7, when we look for the slope.
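A minimal sketch of the mean squared error, next to a mean absolute error for contrast. The function names `mse` and `mae` and all the numbers are hypothetical, picked only to make the arithmetic easy to follow.

```python
def mse(predictions, targets):
    """Average of the squared gaps between predictions and targets."""
    gaps = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(gaps) / len(gaps)


def mae(predictions, targets):
    """Average of the absolute gaps, shown only for comparison."""
    gaps = [abs(p - t) for p, t in zip(predictions, targets)]
    return sum(gaps) / len(gaps)


# Hypothetical predictions and targets: gaps of -1, 0 and 2.
print(mse([2.0, 4.0, 7.0], [3.0, 4.0, 5.0]))   # (1 + 0 + 4) / 3

# A gap ten times larger: a hundred times the cost with the square,
# ten times with the absolute value.
print(mse([10.0], [0.0]) / mse([1.0], [0.0]))  # 100.0
print(mae([10.0], [0.0]) / mae([1.0], [0.0]))  # 10.0
```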
Classification: cross-entropy
In classification, the target is not a number but a class. The prediction, on the other hand, is a probability. Could we reuse the MSE between the predicted probability and the target? We could, but it would be a poor choice, and the reason is instructive.
Imagine a binary task (target equal to $0$ or $1$) where the model predicts the probability $p$ that the target is $1$. If the true answer is $1$ and the model announces a probability close to $0$, it is wrong with total confidence. We would want a score that blows up in that case. Yet the MSE caps out: the squared gap $(1 - p)^2$ never exceeds $1$, only four times the $0.25$ it gives for a hesitation at $p = 0.5$.
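Cross-entropy behaves differently. For a target $y$ and a predicted probability $p$, the binary cross-entropy is

$$L = -\left[\, y \log p + (1 - y) \log (1 - p) \,\right]$$

When the target is $1$, it reduces to $-\log p$. If $p$ is close to $1$, $-\log p$ is close to $0$: right answer, small score. But if $p$ tends to $0$, $-\log p$ tends to infinity: the score explodes. That is exactly the punishment we wanted for confident error.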
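The contrast between the two scores is easy to check numerically. The sketch below assumes the standard binary cross-entropy, $-[y \log p + (1 - y) \log(1 - p)]$, with $p$ the predicted probability that the target is $1$. The function names and the list of probabilities are illustrative choices. Note that `p` must stay strictly between 0 and 1, otherwise `math.log` raises an error.

```python
import math


def binary_cross_entropy(p, y):
    """Binary cross-entropy for a predicted probability p (0 < p < 1) and a target y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def squared_error(p, y):
    """Squared gap between the predicted probability and the target."""
    return (p - y) ** 2


y = 1  # the true answer
for p in (0.9, 0.5, 0.1, 0.001):  # from confident and right to confident and wrong
    print(f"p={p:<6} cross-entropy={binary_cross_entropy(p, y):.3f} "
          f"squared error={squared_error(p, y):.3f}")
```

As `p` moves toward the wrong end, the squared error stays below 1 while the cross-entropy keeps growing.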
Several classes: softmax and one-hot target
For our 3-class classifier, the output is the softmax distribution from Act 1. The target is encoded one-hot: a vector that is $1$ on the correct class and $0$ elsewhere. For a true class 2, the target is .
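The multiclass cross-entropy generalizes the binary version:

$$L = -\sum_i y_i \log p_i = -\log p_c$$

where $p_i$ is the predicted probability of class $i$, $y_i$ is the corresponding entry of the one-hot target, and $c$ is the true class. Since the target is one-hot, all terms vanish except the one for the correct class. The score therefore depends only on the probability the model gave to the right answer. On our example the score is small: the network was right and confident. Had the true class been 1, the score would have been larger, since the network gave that class a lower probability.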
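A short sketch shows how the one-hot target selects a single term of the sum. The function name `cross_entropy` and the probabilities are hypothetical, not the chapter's softmax output. Terms whose target entry is zero are skipped, which also avoids taking the logarithm of a zero probability for a class that is not the true one.

```python
import math


def cross_entropy(probs, one_hot):
    """-sum(y_i * log(p_i)); with a one-hot target only the true class contributes."""
    return -sum(y * math.log(p) for p, y in zip(probs, one_hot) if y > 0)


probs = [0.1, 0.8, 0.1]                  # hypothetical softmax output

right = cross_entropy(probs, [0, 1, 0])  # true class is the one the model favours
wrong = cross_entropy(probs, [1, 0, 0])  # true class received only 0.1

print(round(right, 3), round(wrong, 3))
```

The first score equals `-math.log(0.8)` and the second `-math.log(0.1)`: the lower the probability given to the true class, the higher the score.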
Compare the two scores side by side. Drag the prediction and watch which one explodes.
Interactive figure: MSE versus cross-entropy
Three things to watch while playing:
- In binary classification, push the prediction toward the wrong end: cross-entropy heads to infinity, while MSE caps out around $1$. That is why we prefer cross-entropy for classifying.
- The “confident and wrong” banner lights up when the model asserts a wrong answer with confidence. The score must punish it severely, and it does.
- In 3-class mode, change the logits: softmax redistributes the probabilities, and cross-entropy looks only at the one for the true class.
Act 3: the cost landscape
We have a score. But what does that score depend on? The training data is fixed once and for all. What can change is the network’s weights. The score is therefore a function of the weights.
Imagine a network with only two weights, $w_1$ and $w_2$. To each pair $(w_1, w_2)$ corresponds a score. We can plot that score as an altitude above the plane of weights: we get a landscape, a cost surface (Goodfellow, Bengio and Courville, 2016). Good networks sit in the valleys, bad ones on the peaks.
Learning then takes on a clear geometric meaning: it is descending toward a valley of this landscape. The map below shows the cost surface of a small two-weight model. Move the point and read the score.
Interactive figure: The cost landscape
Three things to watch while playing:
- The landscape is a bowl: a single valley, marked with a cross. It is the weight pair that minimizes the score.
- Move the point: the farther from the valley, the higher the score. It depends only on the weights, the data staying fixed.
- This chapter shows the scenery. Knowing which direction to descend to reach the valley is the subject of chapters 7 to 9.
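The idea of a score that depends only on the weights can be made concrete with a tiny example. The sketch below assumes a hypothetical two-weight linear model, `w1 * x1 + w2 * x2`, scored with the MSE on a small made-up dataset. It scans a grid of weight pairs and reads off the lowest point. This is a brute-force look at the scenery, not a learning method.

```python
def cost(w1, w2, data):
    """MSE of the model w1*x1 + w2*x2 over a fixed dataset of (x1, x2, target) triples."""
    return sum((w1 * x1 + w2 * x2 - y) ** 2 for x1, x2, y in data) / len(data)


# Hypothetical data, generated with w1 = 2 and w2 = -1, so the valley floor sits there.
data = [(1.0, 0.0, 2.0), (0.0, 1.0, -1.0), (1.0, 1.0, 1.0)]

grid = [i * 0.5 for i in range(-8, 9)]   # candidate weights from -4.0 to 4.0

# Each grid point is one location in the landscape; its altitude is the cost.
best = min((cost(w1, w2, data), w1, w2) for w1 in grid for w2 in grid)

print("altitude at (0, 0):", cost(0.0, 0.0, data))
print("lowest point (cost, w1, w2):", best)
```

The data never changes inside `cost`; only `w1` and `w2` do, which is what makes the score a function of the weights.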
Exercises
Grab something to write with. The solutions are deliberately detailed, line by line.
Exercise 1. Reuse the network from Act 1 (same weights and biases, ReLU hidden activation), but with the input . Compute the hidden pre-activations, then the hidden activations, then the logits, and give the predicted class (the largest logit, no need to compute the softmax).
Exercise 2. A regression model predicts for three examples whose targets are . Compute the mean squared error.
Exercise 3. Target . Compute the binary cross-entropy for , then for , then for . Each time compare with the MSE . What does cross-entropy punish that MSE lets slide?
Exercise 4. A 3-class classifier produces the logits . Compute the softmax probabilities, then the cross-entropy if the true class is class 1 (one-hot ).
In one sentence
The forward pass computes the prediction by propagating the input layer by layer, the loss function gives it a score, and all of learning will be finding the weights that minimize that score in the cost landscape.
Quiz
1. In a forward pass, which way does information flow?
2. What is the softmax function for at the output of a classifier?
3. Why do we prefer cross-entropy over MSE for classification?
4. On the cost surface, what varies from one point to another?
5. What does the valley (the minimum) of the cost surface represent?
Toward chapter 7
We now know how to compute a network’s score and read it as an altitude in the cost landscape. The decisive question remains: in which direction should we change each weight to lower that score? We need to know which way the surface tilts around the current point. That is exactly what a derivative measures. Chapter 7 introduces derivatives and the chain rule, the tool that will turn this static landscape into a path of descent.
Sources
- Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. Chapters 5 and 6. https://www.deeplearningbook.org
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 4 and 5 (cross-entropy).
- Bridle, J. S. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, Springer. DOI 10.1007/978-3-642-76153-9_28 (origin of softmax).