From the neuron to the multilayer network
Stacking neurons to solve XOR, then approximating almost any function.
In chapter 4 we proved that a single perceptron cannot compute XOR: no straight line separates the points where XOR equals from those where it equals . But we also noticed something. XOR can be written as a combination of functions that are themselves separable: .
This chapter turns that remark into a general method. We will feed neurons with the outputs of other neurons, solve XOR by hand, understand which boundary shapes this wiring can draw, then discover how far the idea reaches: almost every function.
Solving XOR by hand
The two-assistant analogy
Imagine you are forbidden from answering the question “are the two inputs different?” directly. Instead, you may ask two simpler questions to two assistants, then combine their answers.
- First assistant: “is at least one input equal to ?” That is the OR function.
- Second assistant: “is it false that both inputs equal ?” That is the NAND function.
Think for a moment. When exactly one input is , the first assistant says yes (there is indeed at least one) and so does the second (the two are not both ). When both are , the first says no. When both are , the second says no. The two assistants say yes at the same time only in the case “exactly one input is ”, which is precisely XOR. So all you have to do is answer yes when both assistants say yes: a simple AND.
The hidden layer
Each assistant is a perceptron. We place them in a hidden layer Hidden layer An intermediate layer in a neural network, sitting between the input layer and the output layer. Its neurons neither receive raw data nor produce the final prediction, they compute intermediate representations. A "deep" network has several hidden layers. : an intermediate layer whose outputs are not the final answer but intermediate questions that the next layer will combine. The resulting network, which stacks one or more hidden layers between input and output, is called a multilayer perceptron Multilayer perceptron A neural network organised in successive layers (input, one or more hidden layers, output), where each neuron applies an affine combination followed by an activation function. By stacking neurons it overcomes the single perceptron limit and computes non-linearly-separable functions such as XOR. Source: Rumelhart, Hinton & Williams, 1986 .
We reuse the threshold convention from chapter 4: the Heaviside function equals if its argument is positive or zero, and otherwise. The hidden layer computes two values and :
This equation reads: equals as soon as , that is, as soon as at least one of the two inputs equals .
This one reads: equals as long as , so everywhere except when both inputs equal at once.
Finally, the output neuron computes the logical AND of and :
It equals only if , that is, only if and both equal . Three perceptrons, arranged in two layers, therefore compute XOR: something a single perceptron could not do.
Build it yourself
The two hidden neurons are wired as OR and NAND. Your job is to set the output neuron (its weights , on , , and its bias ) so that it becomes an AND, and watch the truth table turn green.
Building XOR with a hidden layer
Three things to watch while playing:
- As long as the output is not a true AND, at least one of the four rows stays red. The “Reveal the solution (AND)” button sets and .
- The XOR column (the target) and the column (what your network computes) match on all four rows only when the output realizes exactly the AND of OR and NAND.
- The two hidden neurons have reshaped the problem: in the new coordinates , XOR has become linearly separable, even though it was not in the coordinates .
What layers let you draw
The fence analogy
A single neuron is a straight fence in a field: animals on one side, nothing on the other. With a single fence you can only mark off a half-field. But with several straight fences, and the rule “an animal is enclosed only if it is on the right side of every fence at once”, you fence in a plot whose shape you choose.
From half-plane to curve
Let us formalize. The decision boundary Decision boundary The set of input-space points where the model switches from one class to another, that is, where its output flips. For a single neuron it is a hyperplane ; for a multilayer network it can become polygonal, then curved. Source: Bishop, 2006 of a neuron is the hyperplane where its output flips. A single neuron therefore splits the plane into two half-planes. A layer of neurons followed by an output neuron that computes their AND keeps only the points on the right side of all boundaries at once: the intersection of half-planes, that is, a convex region (a polygon). By adding one more layer, we can take the union of several such convex regions and obtain arbitrary shapes, including non-convex ones.
Composing a decision boundary
Region = intersection of k half-planes. A hidden neuron draws a line; the output layer logical AND keeps only the points on the correct side of all k lines at once.
Increase k: the circumscribed polygon gets closer to the circle. With enough neurons, the boundary becomes an arbitrary curve.
Three things to watch while playing:
- With , the boundary is a single line, exactly like a perceptron. With , it is a strip. From on, it is a closed polygon.
- The more hidden neurons you add, the more sides the polygon has and the more closely it hugs the circle. The boundary goes from linear to polygonal, then tends toward a curve.
- Each hidden neuron contributes only one line. It is their combination by the next layer that creates the richness of shapes.
Matrix notation
Writing the weighted sum of each neuron separately quickly becomes unreadable once a layer has hundreds of them. The matrix from chapter 2 solves this: it stacks all the weighted sums of a layer into a single operation.
For a layer that receives an input vector and produces neurons, we gather the weights in a matrix and the biases in a vector . The layer then computes:
This equation reads: we multiply the input vector by the weight matrix , add the bias vector , then apply the activation function to each component of the result. Each row of is the weight vector of one neuron in the layer, and each component of its bias. A single line of algebra thus replaces weighted sums written out by hand.
The multilayer perceptron as a composition
A deep network stacks these layers: the output of one becomes the input of the next. With layers, the network computes
This is a composition of functions: we apply one layer, then another to the result of the first, and so on. But this composition is only useful because a non-linear activation is inserted between each layer. Without it, composing several linear layers would still give a linear function (we saw this in chapter 3), and the whole stack would collapse into a single layer. It is exactly the alternation “linear layer, then non-linear activation” that gives the network its expressive power Expressive power The range of functions a model can represent as its parameters vary. A single perceptron expresses only linear separations ; adding hidden layers widens the expressive power until it can approximate any continuous function. Source: Goodfellow, Bengio & Courville, 2016 .
How far can we go? The universal approximation theorem
The bump intuition
Take an activation function shaped like a soft step, such as the sigmoid from chapter 3. The difference of two slightly shifted sigmoids draws a bump: a function that is almost zero everywhere, except on a small interval where it forms a hump. By adding up many bumps, placed and scaled just right, you can hug the profile of any continuous curve, much as you approximate the relief of a mountain by stacking thinner and thinner bricks.
A single hidden layer does exactly that: each hidden neuron builds a step, and the output neuron takes their weighted sum. With enough neurons, that sum approximates the target function as closely as you like.
The statement
Pen-and-paper exercises
Exercise 1: verifying the XOR construction
With , and , compute , and on the four inputs, and check that does reproduce XOR.
Exercise 2: counting the parameters
Consider a network with architecture : two inputs, a hidden layer of three neurons, one output neuron. How many parameters (weights and biases) does this network have in total?
Exercise 3: the matrix form
Write, in matrix notation, the equations of a network with two inputs, a hidden layer of two neurons, and one output (). Specify the dimensions of each weight matrix and each bias vector. The activation function is denoted .
Exercise 4: depth or width?
The universal approximation theorem states that a single hidden layer is enough to approximate any continuous function. Why, in practice, do we still build deep networks (with several hidden layers) rather than putting everything into one very wide layer?
In one sentence
A single neuron draws a line; stacking neurons into layers, with a non-linear activation in between, is enough to draw any boundary, and even to approximate almost any function.
Quiz
1. Why can a single perceptron not compute XOR?
2. In the hand-built XOR, what does the second hidden neuron h₂ compute?
3. What does the intersection of k half-planes represent geometrically (a layer whose output is an AND)?
4. What does the universal approximation theorem guarantee?
5. Does universal imply learnable?
Toward chapter 6: measuring the error to learn
We now know that a multilayer network can express anything. The real question remains, the one the universal approximation theorem leaves open: how do we set its weights without placing them by hand, as we just did for XOR?
The first building block of the answer is being able to measure how wrong the network is. Chapter 6 introduces the forward pass (computing the output, layer after layer) and the cost function, which quantifies the gap between the prediction and the target. It is this measure of the error that, in chapters 7 and 8, will guide the automatic adjustment of the weights.
Sources
- Cybenko, G. (1989). “Approximation by superpositions of a sigmoidal function.” Mathematics of Control, Signals and Systems 2(4), 303-314. DOI 10.1007/BF02551274
- Hornik, K., Stinchcombe, M. and White, H. (1989). “Multilayer feedforward networks are universal approximators.” Neural Networks 2(5), 359-366. DOI 10.1016/0893-6080(89)90020-8
- Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986). “Learning representations by back-propagating errors.” Nature 323, 533-536. DOI 10.1038/323533a0
- Leshno, M., Lin, V. Y., Pinkus, A. and Schocken, S. (1993). “Multilayer feedforward networks with a nonpolynomial activation function can approximate any function.” Neural Networks 6(6), 861-867. DOI 10.1016/S0893-6080(05)80131-5
- Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning, chapter 6. MIT Press. deeplearningbook.org