05 / 09 From the neuron to the multilayer network
  1. ← Neural networks: foundations and mathematics
  2. 00 Foreword
  3. 01 The artificial neuron
  4. 02 Essential linear algebra
  5. 03 Activation functions
  6. 04 The perceptron
  7. 05 From the neuron to the multilayer network
  8. 06 Forward pass and loss functions
  9. 07 Derivatives and the chain rule
  10. 08 Backpropagation
Neural networks: foundations and mathematics · 05 / 09

From the neuron to the multilayer network

Stacking neurons to solve XOR, then approximating almost any function.

In chapter 4 we proved that a single perceptron cannot compute XOR: no straight line separates the points where XOR equals 11 from those where it equals 00. But we also noticed something. XOR can be written as a combination of functions that are themselves separable: XOR(x1,x2)=(x1x2)¬(x1x2)\text{XOR}(x_1, x_2) = (x_1 \vee x_2) \wedge \neg(x_1 \wedge x_2).

This chapter turns that remark into a general method. We will feed neurons with the outputs of other neurons, solve XOR by hand, understand which boundary shapes this wiring can draw, then discover how far the idea reaches: almost every function.

Solving XOR by hand

The two-assistant analogy

Imagine you are forbidden from answering the question “are the two inputs different?” directly. Instead, you may ask two simpler questions to two assistants, then combine their answers.

  • First assistant: “is at least one input equal to 11?” That is the OR function.
  • Second assistant: “is it false that both inputs equal 11?” That is the NAND function.

Think for a moment. When exactly one input is 11, the first assistant says yes (there is indeed at least one) and so does the second (the two are not both 11). When both are 00, the first says no. When both are 11, the second says no. The two assistants say yes at the same time only in the case “exactly one input is 11”, which is precisely XOR. So all you have to do is answer yes when both assistants say yes: a simple AND.

The hidden layer

Each assistant is a perceptron. We place them in a hidden layer Hidden layer An intermediate layer in a neural network, sitting between the input layer and the output layer. Its neurons neither receive raw data nor produce the final prediction, they compute intermediate representations. A "deep" network has several hidden layers. : an intermediate layer whose outputs are not the final answer but intermediate questions that the next layer will combine. The resulting network, which stacks one or more hidden layers between input and output, is called a multilayer perceptron Multilayer perceptron A neural network organised in successive layers (input, one or more hidden layers, output), where each neuron applies an affine combination followed by an activation function. By stacking neurons it overcomes the single perceptron limit and computes non-linearly-separable functions such as XOR. Source: Rumelhart, Hinton & Williams, 1986 .

We reuse the threshold convention from chapter 4: the Heaviside function HH equals 11 if its argument is positive or zero, and 00 otherwise. The hidden layer computes two values h1h_1 and h2h_2:

h1  =  H(x1+x20.5)(OR).h_1 \;=\; H(x_1 + x_2 - 0.5) \qquad (\text{OR}).

This equation reads: h1h_1 equals 11 as soon as x1+x20.5x_1 + x_2 \geq 0.5, that is, as soon as at least one of the two inputs equals 11.

h2  =  H(1.5x1x2)(NAND).h_2 \;=\; H(1.5 - x_1 - x_2) \qquad (\text{NAND}).

This one reads: h2h_2 equals 11 as long as x1+x21.5x_1 + x_2 \leq 1.5, so everywhere except when both inputs equal 11 at once.

Finally, the output neuron computes the logical AND of h1h_1 and h2h_2:

y  =  H(h1+h21.5)(AND).y \;=\; H(h_1 + h_2 - 1.5) \qquad (\text{AND}).

It equals 11 only if h1+h21.5h_1 + h_2 \geq 1.5, that is, only if h1h_1 and h2h_2 both equal 11. Three perceptrons, arranged in two layers, therefore compute XOR: something a single perceptron could not do.

Build it yourself

The two hidden neurons are wired as OR and NAND. Your job is to set the output neuron (its weights v1v_1, v2v_2 on h1h_1, h2h_2, and its bias cc) so that it becomes an AND, and watch the truth table turn green.

v₁ = 1.00v₂ = 1.00x₁x₂ORNANDANDOR = step(x₁ + x₂ - 0.5)NAND = step(1.5 - x₁ - x₂)

Building XOR with a hidden layer

weight on OR (v₁)1.00
weight on NAND (v₂)1.00
output bias (c)-1.00
Truth table
(x₁, x₂)ORNANDyXOR
(0, 0)0110
(1, 0)1111
(0, 1)1111
(1, 1)1010
Correct outputs2 / 4

Three things to watch while playing:

  • As long as the output is not a true AND, at least one of the four rows stays red. The “Reveal the solution (AND)” button sets v1=v2=1v_1 = v_2 = 1 and c=1.5c = -1.5.
  • The XOR column (the target) and the yy column (what your network computes) match on all four rows only when the output realizes exactly the AND of OR and NAND.
  • The two hidden neurons have reshaped the problem: in the new coordinates (h1,h2)(h_1, h_2), XOR has become linearly separable, even though it was not in the coordinates (x1,x2)(x_1, x_2).

What layers let you draw

The fence analogy

A single neuron is a straight fence in a field: animals on one side, nothing on the other. With a single fence you can only mark off a half-field. But with several straight fences, and the rule “an animal is enclosed only if it is on the right side of every fence at once”, you fence in a plot whose shape you choose.

From half-plane to curve

Let us formalize. The decision boundary Decision boundary The set of input-space points where the model switches from one class to another, that is, where its output flips. For a single neuron it is a hyperplane ; for a multilayer network it can become polygonal, then curved. Source: Bishop, 2006 of a neuron is the hyperplane where its output flips. A single neuron therefore splits the plane into two half-planes. A layer of kk neurons followed by an output neuron that computes their AND keeps only the points on the right side of all kk boundaries at once: the intersection of kk half-planes, that is, a convex region (a polygon). By adding one more layer, we can take the union of several such convex regions and obtain arbitrary shapes, including non-convex ones.

Composing a decision boundary

Hidden neurons (k)1
k = 1: a single half-plane, exactly like a perceptron.

Region = intersection of k half-planes. A hidden neuron draws a line; the output layer logical AND keeps only the points on the correct side of all k lines at once.

Increase k: the circumscribed polygon gets closer to the circle. With enough neurons, the boundary becomes an arbitrary curve.

Three things to watch while playing:

  • With k=1k = 1, the boundary is a single line, exactly like a perceptron. With k=2k = 2, it is a strip. From k=3k = 3 on, it is a closed polygon.
  • The more hidden neurons you add, the more sides the polygon has and the more closely it hugs the circle. The boundary goes from linear to polygonal, then tends toward a curve.
  • Each hidden neuron contributes only one line. It is their combination by the next layer that creates the richness of shapes.

Matrix notation

Writing the weighted sum of each neuron separately quickly becomes unreadable once a layer has hundreds of them. The matrix from chapter 2 solves this: it stacks all the weighted sums of a layer into a single operation.

For a layer that receives an input vector xRninx \in \mathbb{R}^{n_{\text{in}}} and produces noutn_{\text{out}} neurons, we gather the weights in a matrix WRnout×ninW \in \mathbb{R}^{n_{\text{out}} \times n_{\text{in}}} and the biases in a vector bRnoutb \in \mathbb{R}^{n_{\text{out}}}. The layer then computes:

a  =  f(Wx+b).a \;=\; f(Wx + b).

This equation reads: we multiply the input vector xx by the weight matrix WW, add the bias vector bb, then apply the activation function ff to each component of the result. Each row of WW is the weight vector of one neuron in the layer, and each component of bb its bias. A single line of algebra thus replaces noutn_{\text{out}} weighted sums written out by hand.

A network with one hidden layer: two inputs, one hidden layer, one output

The multilayer perceptron as a composition

A deep network stacks these layers: the output of one becomes the input of the next. With LL layers, the network computes

a(L)  =  f(W(L)f(W(1)x+b(1))+b(L)).a^{(L)} \;=\; f\big(W^{(L)} \cdots f(W^{(1)} x + b^{(1)}) \cdots + b^{(L)}\big).

This is a composition of functions: we apply one layer, then another to the result of the first, and so on. But this composition is only useful because a non-linear activation is inserted between each layer. Without it, composing several linear layers would still give a linear function (we saw this in chapter 3), and the whole stack would collapse into a single layer. It is exactly the alternation “linear layer, then non-linear activation” that gives the network its expressive power Expressive power The range of functions a model can represent as its parameters vary. A single perceptron expresses only linear separations ; adding hidden layers widens the expressive power until it can approximate any continuous function. Source: Goodfellow, Bengio & Courville, 2016 .

How far can we go? The universal approximation theorem

The bump intuition

Take an activation function shaped like a soft step, such as the sigmoid from chapter 3. The difference of two slightly shifted sigmoids draws a bump: a function that is almost zero everywhere, except on a small interval where it forms a hump. By adding up many bumps, placed and scaled just right, you can hug the profile of any continuous curve, much as you approximate the relief of a mountain by stacking thinner and thinner bricks.

A single hidden layer does exactly that: each hidden neuron builds a step, and the output neuron takes their weighted sum. With enough neurons, that sum approximates the target function as closely as you like.

The statement

This is the content of the universal approximation theorem Universal approximation theorem A result (Cybenko 1989, Hornik 1989) stating that a network with a single hidden layer, given enough neurons and a non-polynomial activation, can approximate any continuous function on a bounded domain to arbitrary accuracy. It guarantees that such a network exists, not that we can learn it. Source: Cybenko, 1989 ; Hornik, 1989 .

Pen-and-paper exercises

Exercise 1: verifying the XOR construction

With h1=H(x1+x20.5)h_1 = H(x_1 + x_2 - 0.5), h2=H(1.5x1x2)h_2 = H(1.5 - x_1 - x_2) and y=H(h1+h21.5)y = H(h_1 + h_2 - 1.5), compute h1h_1, h2h_2 and yy on the four inputs, and check that yy does reproduce XOR.

Exercise 2: counting the parameters

Consider a network with architecture 2312 \to 3 \to 1: two inputs, a hidden layer of three neurons, one output neuron. How many parameters (weights and biases) does this network have in total?

Exercise 3: the matrix form

Write, in matrix notation, the equations of a network with two inputs, a hidden layer of two neurons, and one output (2212 \to 2 \to 1). Specify the dimensions of each weight matrix and each bias vector. The activation function is denoted ff.

Exercise 4: depth or width?

The universal approximation theorem states that a single hidden layer is enough to approximate any continuous function. Why, in practice, do we still build deep networks (with several hidden layers) rather than putting everything into one very wide layer?

In one sentence

A single neuron draws a line; stacking neurons into layers, with a non-linear activation in between, is enough to draw any boundary, and even to approximate almost any function.

Quiz

Quiz
  1. 1. Why can a single perceptron not compute XOR?

  2. 2. In the hand-built XOR, what does the second hidden neuron h₂ compute?

  3. 3. What does the intersection of k half-planes represent geometrically (a layer whose output is an AND)?

  4. 4. What does the universal approximation theorem guarantee?

  5. 5. Does universal imply learnable?

Toward chapter 6: measuring the error to learn

We now know that a multilayer network can express anything. The real question remains, the one the universal approximation theorem leaves open: how do we set its weights without placing them by hand, as we just did for XOR?

The first building block of the answer is being able to measure how wrong the network is. Chapter 6 introduces the forward pass (computing the output, layer after layer) and the cost function, which quantifies the gap between the prediction and the target. It is this measure of the error that, in chapters 7 and 8, will guide the automatic adjustment of the weights.

Sources

  • Cybenko, G. (1989). “Approximation by superpositions of a sigmoidal function.” Mathematics of Control, Signals and Systems 2(4), 303-314. DOI 10.1007/BF02551274
  • Hornik, K., Stinchcombe, M. and White, H. (1989). “Multilayer feedforward networks are universal approximators.” Neural Networks 2(5), 359-366. DOI 10.1016/0893-6080(89)90020-8
  • Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986). “Learning representations by back-propagating errors.” Nature 323, 533-536. DOI 10.1038/323533a0
  • Leshno, M., Lin, V. Y., Pinkus, A. and Schocken, S. (1993). “Multilayer feedforward networks with a nonpolynomial activation function can approximate any function.” Neural Networks 6(6), 861-867. DOI 10.1016/S0893-6080(05)80131-5
  • Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning, chapter 6. MIT Press. deeplearningbook.org