Neural networks: foundations and mathematics · 06 / 09

Forward pass and loss functions

From input to prediction, then how to give it a score.

In chapter 5, we saw that a multilayer network can express almost any function, provided its weights are set well. For XOR, we set them by hand. But hand-tuning does not scale: a real network has millions of weights, so we want to set them automatically.

The very first building block of that automation is to know two things: how to compute the network’s output for a given input, then how to give that output a score that says how wrong it is. Learning, for a network, will be the search to improve that score. This chapter builds those two bricks. It does not yet compute any weight correction: it sets the playing field on which, from chapter 7 on, the gradient will operate.

Act 1: computing the prediction

The forward pass (forward propagation) is the computation that turns an input into a prediction, traversing the network layer by layer, from input to output. Each layer applies its formula $a = f(Wx + b)$, a weighted sum followed by an activation function, and its output becomes the input of the next.

Take a small 3-class classifier: two inputs, a hidden layer of two neurons, an output layer of three neurons. Feed it the input $x = (1, 2)$.

The hidden layer. Its weights and biases are

$$W^{(1)} = \begin{pmatrix} 0.5 & -0.5 \\ 1 & 0.5 \end{pmatrix}, \qquad b^{(1)} = \begin{pmatrix} 0 \\ -1 \end{pmatrix}.$$

We first compute the weighted sum of each neuron, the pre-activation $z^{(1)} = W^{(1)}x + b^{(1)}$:

$$z^{(1)}_1 = 0.5 \cdot 1 - 0.5 \cdot 2 + 0 = -0.5,$$

$$z^{(1)}_2 = 1 \cdot 1 + 0.5 \cdot 2 - 1 = 1.$$

Then we apply the activation. With ReLU, which replaces every negative by zero:

$$h = \mathrm{ReLU}(z^{(1)}) = \big(\max(0, -0.5),\; \max(0, 1)\big) = (0,\; 1).$$

The first hidden neuron switched off, the second stayed active.
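The hidden-layer arithmetic above can be checked with a few lines of Python. This is a minimal sketch using plain lists instead of a matrix library; the helper names `dense` and `relu` are illustrative choices, not something the chapter defines.

```python
# Hidden layer of the worked example: z = W x + b, then ReLU.
# Plain Python lists stand in for matrices; `dense` and `relu` are illustrative names.

def dense(W, x, b):
    """Return the pre-activation vector W x + b (one weighted sum per neuron)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def relu(z):
    """Replace every negative entry by zero."""
    return [max(0.0, z_i) for z_i in z]

W1 = [[0.5, -0.5],
      [1.0,  0.5]]
b1 = [0.0, -1.0]
x = [1.0, 2.0]

z1 = dense(W1, x, b1)
h = relu(z1)
print("z1 =", z1)  # z1 = [-0.5, 1.0]
print("h  =", h)   # h  = [0.0, 1.0]
```

Each row of `W1` holds the incoming weights of one hidden neuron, so one row produces one weighted sum.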

The output layer. Its weights and biases are

$$W^{(2)} = \begin{pmatrix} 1 & 0 \\ 0 & 2 \\ 1 & 1 \end{pmatrix}, \qquad b^{(2)} = \begin{pmatrix} 0 \\ 0 \\ -1 \end{pmatrix}.$$

We compute the three pre-activations, here called the logits $z^{(2)} = W^{(2)}h + b^{(2)}$:

$$z^{(2)} = (\,0,\; 2,\; 0\,).$$

From scores to probabilities. Three logits are three arbitrary numbers. For a classifier we want three probabilities, positive and summing to $1$. That is the role of the softmax function, which turns a vector of reals into a probability distribution:

$$\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_j e^{z_j}}.$$

On our logits $(0, 2, 0)$:

$$\hat{y} = \frac{(\,e^{0},\; e^{2},\; e^{0}\,)}{e^{0} + e^{2} + e^{0}} = \frac{(\,1,\; e^{2},\; 1\,)}{2 + e^{2}} \approx (0.11,\; 0.79,\; 0.11).$$
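As a sanity check on these numbers, here is a small softmax sketch in Python. The subtraction of the largest logit is an addition of ours for numerical safety: it does not change the result, because the common factor cancels between numerator and denominator, but it is not part of the formula as the chapter states it.

```python
import math

def softmax(z):
    """Turn a list of logits into probabilities that are positive and sum to 1."""
    # Shifting by max(z) leaves the ratio unchanged and avoids overflow
    # in math.exp for large logits (an assumption-free identity, added here
    # as a precaution rather than taken from the chapter).
    m = max(z)
    exps = [math.exp(z_k - m) for z_k in z]
    total = sum(exps)
    return [e / total for e in exps]

y_hat = softmax([0.0, 2.0, 0.0])
print([round(p, 2) for p in y_hat])  # [0.11, 0.79, 0.11]
```

Adding the same constant to every logit gives the same distribution, which is why only the gaps between logits matter.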

The network predicts class 2 with about $79\%$ confidence. That is a full forward pass: from $x$ to a probability distribution.

Step through it yourself, and toggle the hidden activation between ReLU and sigmoid to see how each one transforms the same input.

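[Interactive figure: "The forward pass, layer by layer". A network diagram with inputs x₁, x₂, hidden units h₁, h₂ and outputs ŷ₁, ŷ₂, ŷ₃, a hidden-activation toggle, and a step-by-step readout starting at Step 1, input x = (1.00, 2.00).]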

Three things to watch while playing:

  • At each step, the active layer lights up in the diagram. Information always moves from input to output, never backward.
  • Toggle from ReLU to sigmoid: the same input produces different hidden values $h$, but the predicted class stays 2. The network’s structure matters as much as the exact values.
  • The last step, softmax, turns three arbitrary logits into three probabilities that sum to $1$.
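The ReLU versus sigmoid comparison can also be reproduced offline. The sketch below runs the whole forward pass of the example network with either hidden activation; the function names (`forward`, `predicted_class`) are hypothetical, and the softmax is the plain formula from the text, which is adequate for logits this small.

```python
import math

W1, b1 = [[0.5, -0.5], [1.0, 0.5]], [0.0, -1.0]
W2, b2 = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]], [0.0, 0.0, -1.0]

def dense(W, x, b):
    """Pre-activation W x + b, one weighted sum per neuron."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]

def relu(z):
    return [max(0.0, v) for v in z]

def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def softmax(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, activation):
    """Input -> hidden values -> logits -> probabilities."""
    h = activation(dense(W1, x, b1))
    logits = dense(W2, h, b2)
    return h, logits, softmax(logits)

def predicted_class(y_hat):
    """Classes are numbered from 1, as in the text."""
    return max(range(len(y_hat)), key=lambda k: y_hat[k]) + 1

for name, act in (("ReLU", relu), ("sigmoid", sigmoid)):
    h, logits, y_hat = forward([1.0, 2.0], act)
    print(name,
          "h =", [round(v, 2) for v in h],
          "logits =", [round(v, 2) for v in logits],
          "class =", predicted_class(y_hat))
```

The hidden values and logits differ between the two runs, yet the largest logit stays on the second output in both.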

Act 2: giving a score

We can compute a prediction. It remains to say how good it is. A loss function (also called a cost function) takes the prediction and the target, and returns a single number: the score. The larger the number, the more wrong the network. Learning will be making that number smaller.

Not all tasks are scored the same way. Let us look at the two big families.

Regression: mean squared error

When we predict a continuous quantity (a price, a temperature), it is regression. The natural score is the mean squared error (MSE), the average of the squared gaps between prediction and target (Goodfellow, Bengio and Courville, 2016):

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2.$$

Why the square, and not the absolute value? For two reasons. First, the square penalizes large gaps far more than small ones: being off by $10$ costs a hundred times more than being off by $1$, not ten times. Second, the square is smooth, differentiable everywhere, whereas the absolute value has a sharp corner at zero. That smoothness will be precious in chapter 7, when we look for the slope.
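To make the "hundred times, not ten times" point concrete, here is a short sketch that compares the squared error with the absolute error. The targets and predictions are invented for illustration, and `mae` (mean absolute error) is included only as the point of comparison.

```python
def mse(y_hat, y):
    """Mean squared error: average of the squared gaps."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

def mae(y_hat, y):
    """Mean absolute error, shown only for comparison."""
    return sum(abs(p - t) for p, t in zip(y_hat, y)) / len(y)

# Hypothetical data: four targets, one prediction off by 1 or by 10.
targets = [20.0, 20.0, 20.0, 20.0]
off_by_1 = [21.0, 20.0, 20.0, 20.0]
off_by_10 = [30.0, 20.0, 20.0, 20.0]

print(mse(off_by_1, targets), mse(off_by_10, targets))  # 0.25 25.0
print(mae(off_by_1, targets), mae(off_by_10, targets))  # 0.25 2.5
```

The gap grew tenfold; the squared score grew a hundredfold while the absolute score grew only tenfold.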

Classification: cross-entropy

In classification, the target is not a number but a class. The prediction, on the other hand, is a probability. Could we reuse the MSE between the predicted probability and the target? We could, but it would be a poor choice, and the reason is instructive.

Imagine a binary task (target $t$ equal to $0$ or $1$) where the model predicts the probability $p$ that the target is $1$. If the true answer is $1$ and the model announces $p = 0.01$, it is wrong with total confidence. We would want a score that blows up in that case. Yet the MSE caps out: $(0.01 - 1)^2 \approx 0.98$, just under its ceiling of $1$ and only about four times the $0.25$ earned by a hesitation at $p = 0.5$.

The cross-entropy for the binary case fixes this (Bishop, 2006):

$$H = -\big[\,t \ln p + (1 - t) \ln(1 - p)\,\big].$$

When the target is $1$, it reduces to $H = -\ln p$. If $p$ is close to $1$, $H$ is close to $0$: right answer, small score. But if $p$ tends to $0$, $-\ln p$ tends to infinity: the score explodes. That is exactly the punishment we wanted for confident error.
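The contrast between the two scores shows up clearly in numbers. The sketch below evaluates both for a target of 1 at a few probabilities chosen for illustration; the function names are ours. Note that this naive version fails at exactly `p = 0` or `p = 1`, where `math.log(0)` raises an error.

```python
import math

def binary_cross_entropy(p, t):
    """H = -[t ln p + (1 - t) ln(1 - p)], valid for 0 < p < 1."""
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

def squared_error(p, t):
    return (p - t) ** 2

t = 1  # the true answer is class 1
for p in (0.99, 0.5, 0.05, 0.0001):
    print(f"p={p:<7} squared error={squared_error(p, t):.4f}  "
          f"cross-entropy={binary_cross_entropy(p, t):.4f}")
```

As `p` shrinks toward zero the squared error creeps up to 1 and stops, while the cross-entropy keeps growing without bound.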

Several classes: softmax and one-hot target

For our 3-class classifier, the output is the softmax distribution $\hat{y}$ from Act 1. The target is encoded one-hot: a vector that is $1$ on the correct class and $0$ elsewhere. For a true class 2, the target is $y = (0, 1, 0)$.

The multiclass cross-entropy generalizes the binary version:

$$H = -\sum_{k} y_k \ln \hat{y}_k = -\ln \hat{y}_c,$$

where $c$ is the true class. Since the target is one-hot, all terms vanish except the one for the correct class. The score therefore depends only on the probability the model gave to the right answer. On our example, $-\ln(0.79) \approx 0.24$: small score, the network was right and confident. Had the true class been 1, the score would have been $-\ln(0.11) \approx 2.2$.
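A minimal sketch of this computation on the example distribution follows. The helpers `one_hot` and `cross_entropy` are illustrative names. Because the code uses the unrounded probabilities rather than 0.79 and 0.11, the second score comes out as 2.24 instead of the 2.2 obtained from the rounded value.

```python
import math

def one_hot(c, num_classes):
    """Vector that is 1 on class c and 0 elsewhere (classes numbered from 1)."""
    return [1.0 if k == c else 0.0 for k in range(1, num_classes + 1)]

def cross_entropy(y_hat, y):
    """H = -sum_k y_k ln(y_hat_k); terms with y_k = 0 contribute nothing."""
    return -sum(y_k * math.log(p_k) for y_k, p_k in zip(y, y_hat) if y_k > 0)

# Softmax of the logits (0, 2, 0) from Act 1, written out directly.
denom = 2.0 + math.exp(2.0)
y_hat = [1.0 / denom, math.exp(2.0) / denom, 1.0 / denom]

print(round(cross_entropy(y_hat, one_hot(2, 3)), 2))  # 0.24
print(round(cross_entropy(y_hat, one_hot(1, 3)), 2))  # 2.24
```

With a one-hot target, the sum collapses to a single logarithm, so the loss never looks at how the remaining probability is split among the wrong classes.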

Compare the two scores side by side. Drag the prediction and watch which one explodes.

[Interactive figure: "MSE versus cross-entropy". A slider for the predicted probability p and a target selector; at p = 0.50 the squared error (MSE) reads 0.25 and the cross-entropy reads 0.69.]

Three things to watch while playing:

  • In binary classification, push $p$ toward the wrong end: cross-entropy heads to infinity, while MSE caps out at $1$. That is why we prefer cross-entropy for classifying.
  • The “confident and wrong” banner lights up when the model asserts a wrong answer with confidence. The score must punish it severely, and it does.
  • In 3-class mode, change the logits: softmax redistributes the probabilities, and cross-entropy looks only at the one for the true class.

Act 3: the cost landscape

We have a score. But what does that score depend on? The training data is fixed once and for all. What can change is the network’s weights. The score is therefore a function of the weights.

Imagine a network with only two weights, $w_1$ and $w_2$. To each pair $(w_1, w_2)$ corresponds a score. We can plot that score as an altitude above the plane of weights: we get a landscape, a cost surface (Goodfellow, Bengio and Courville, 2016). Good networks sit in the valleys, bad ones on the peaks.
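To see the score as a function of the weights alone, consider the following sketch. Everything in it is an assumption made for illustration: a linear model with two weights, a four-point dataset, and MSE as the score. It is not the model behind the chapter's interactive map.

```python
# Hypothetical two-weight model: y_hat = w1 * x1 + w2 * x2, scored with MSE.
# The dataset is invented; its targets follow y = 1*x1 + 2*x2, so the valley
# sits at (w1, w2) = (1, 2), where the cost is exactly 0.
INPUTS = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
TARGETS = [1.0, 2.0, 3.0, 4.0]

def cost(w1, w2):
    """Mean squared error over the fixed dataset: the altitude at (w1, w2)."""
    errors = [(w1 * x1 + w2 * x2 - y) ** 2
              for (x1, x2), y in zip(INPUTS, TARGETS)]
    return sum(errors) / len(errors)

# Read the altitude at a few points of the weight plane.
for w1, w2 in [(1.0, 2.0), (1.5, 2.0), (3.0, 0.0), (0.0, 0.0)]:
    print(f"w1={w1:4.1f}  w2={w2:4.1f}  cost={cost(w1, w2):.3f}")
```

The data never changes between calls; only the pair of weights does, and the cost rises as the point moves away from the valley.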

Learning then takes on a clear geometric meaning: it is descending toward a valley of this landscape. The map below shows the cost surface of a small two-weight model. Move the point and read the score.

[Interactive figure: "The cost landscape". A map over the axes w₁ and w₂ with a movable current point (initially w₁ = -0.5, w₂ = 3.5, cost 1.69) and a cross marking the valley of minimal cost. Click or drag to move the point and read the cost.]

Three things to watch while playing:

  • The landscape is a bowl: a single valley, marked with a cross. It is the weight pair that minimizes the score.
  • Move the point: the farther from the valley, the higher the score. It depends only on the weights, the data staying fixed.
  • This chapter shows the scenery. Knowing which direction to descend to reach the valley is the subject of chapters 7 to 9.

Exercises

Grab something to write with. The solutions are deliberately detailed, line by line.

Exercise 1. Reuse the network from Act 1 (same $W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}$, ReLU hidden activation), but with the input $x = (2, 0)$. Compute $z^{(1)}$, then $h$, then the logits $z^{(2)}$, and give the predicted class (the largest logit, no need to compute the softmax).

Exercise 2. A regression model predicts $\hat{y} = (3, 5, 2)$ for three examples whose targets are $y = (2, 5, 4)$. Compute the mean squared error.

Exercise 3. Target $t = 1$. Compute the binary cross-entropy $H = -\ln p$ for $p = 0.9$, then for $p = 0.1$, then for $p = 0.01$. Each time compare with the MSE $(p - 1)^2$. What does cross-entropy punish that MSE lets slide?

Exercise 4. A 3-class classifier produces the logits $z = (2, 1, 0)$. Compute the softmax probabilities, then the cross-entropy if the true class is class 1 (one-hot $y = (1, 0, 0)$).

In one sentence

The forward pass computes the prediction by propagating the input layer by layer, the loss function gives it a score, and all of learning will be finding the weights that minimize that score in the cost landscape.

Quiz

  1. In a forward pass, which way does information flow?

  2. What is the softmax function for at the output of a classifier?

  3. Why do we prefer cross-entropy over MSE for classification?

  4. On the cost surface, what varies from one point to another?

  5. What does the valley (the minimum) of the cost surface represent?

Toward chapter 7

We now know how to compute a network’s score and read it as an altitude in the cost landscape. The decisive question remains: in which direction should we change each weight to lower that score? We need to know which way the surface tilts around the current point. That is exactly what a derivative measures. Chapter 7 introduces derivatives and the chain rule, the tool that will turn this static landscape into a path of descent.

Sources

  • Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. Chapters 5 and 6. https://www.deeplearningbook.org
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 4 and 5 (cross-entropy).
  • Bridle, J. S. (1990). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, Springer. DOI 10.1007/978-3-642-76153-9_28 (origin of softmax).