The perceptron
How Rosenblatt taught a machine to learn without a gradient (1958).
In chapter 3 you learned why an activation function has to be non-linear Non-linearity The property of a function that is not affine. A non-linear activation function is mandatory in a deep network, otherwise the composition of several layers reduces to a single equivalent affine layer and depth loses its point. and differentiable for a deep network to be mathematically interesting. Yet the first artificial neuron that could ever learn, the perceptron of Frank Rosenblatt (1958), uses the threshold function Threshold function Binary activation function H(z) equal to 1 if z >= 0 and 0 otherwise. Also called the Heaviside function. It is the original activation of McCulloch-Pitts (1943) and Rosenblatt's perceptron (1958), later dropped because it is not differentiable. , which is almost everywhere differentiable with derivative zero. How did Rosenblatt make such a machine learn anything at all?
This chapter answers that question by building the perceptron from a purely geometric point of view, without ever differentiating. We prove that the procedure converges on a separable dataset, then we discover the limit that ended the first age of neural networks.
The geometry of a hyperplane
The ruler-on-the-table analogy
Place a flat ruler on a table. The ruler divides the space above the table into two zones: one in front of you, one behind. The edge of the ruler is the boundary. The direction perpendicular to that edge is what we will call the normal vector Normal vector Vector $w$ that defines the direction perpendicular to a hyperplane with equation $w \cdot x + b = 0$. Its direction indicates which side of the hyperplane a point lies on; its norm sets the scale of the signed distance. Source: Bishop, PRML, ch. 4 , and how far you put the ruler from your body is the offset.
A hyperplane Hyperplane A subset of Rⁿ defined by a linear equation w · x + b = 0. In two dimensions it is a line, in three a plane. It is exactly the decision boundary drawn by a single neuron. in two dimensions is exactly this: a line that splits the plane into two half-spaces Half-space One of the two regions in which a hyperplane partitions Rⁿ. Algebraically, the set of points x with w · x + b > 0 (resp. < 0). A threshold neuron splits space into exactly two half-spaces: active and inactive. . In three dimensions, it is a plane. In dimensions, it is a flat surface of dimension .
Formal definition
Let be a non-zero vector and a scalar. The affine hyperplane with equation is the set:
Signed distance from a point to the hyperplane
For an arbitrary point , we define its signed distance Signed distance Perpendicular distance from a point to a hyperplane, carrying a sign that depends on which side of the hyperplane the point lies on. For the hyperplane $w \cdot x + b = 0$, it equals $d(x) = (w \cdot x + b) / \|w\|$: positive on one side, negative on the other, zero on the hyperplane itself. Source: Hastie, Tibshirani, Friedman, ESL, ch. 4 to by:
This quantity is positive on one side of , negative on the other, and zero on itself. Its absolute value is the usual perpendicular distance.
Play with the hyperplane
Hyperplane geometry
Three things to notice as you play:
- When , the hyperplane passes exactly through the origin.
- Doubling does not move the line: only the direction of matters for the position of the hyperplane, not its magnitude.
- The signed distance flips sign when you click on the other side of the line: that is the sign we will exploit for classification.
Linearly separable, with a margin
The buffer zone analogy
Picture two neighbouring countries with a buffer zone between them. The shared border is the line in the middle. The width of the buffer zone is the margin: the wider it is, the more robust the border is to small perturbations of the points. If the buffer zone shrinks to zero, citizens of both countries cross paths and the border becomes ambiguous.
Encoding the targets: why
We have so far often encoded binary classes by and . For the perceptron, we will pick and instead. The choice simplifies everything: a sample is correctly classified by if and only if and have the same sign, which is a single inequality:
Formal definitions
Let be a dataset with and . We say that is linearly separable Linearly separable A labelled dataset is linearly separable if there exists a hyperplane that correctly separates the label-1 points from the label-0 points. XOR is the historical example of a non linearly separable problem. if there exist such that for every :
For such a pair , we define two margins. The functional margin Functional margin For a sample $(x, y)$ with $y \in \{-1, +1\}$, the functional margin is the quantity $\hat\gamma = y (w \cdot x + b)$. It is strictly positive if and only if the sample is correctly classified. It depends on the scale of the weights and is not a geometric distance. Source: Bishop, PRML, ch. 7 of point is . The geometric margin Geometric margin Minimal perpendicular distance between a separating hyperplane and the points of the dataset. Defined by $\gamma = \min_i y_i (w \cdot x_i + b) / \|w\|$ with $y_i \in \{-1, +1\}$. Plays a central role in Novikoff's theorem and in the formulation of support vector machines. Source: Novikoff, 1962 of point is . The margin of the dataset is the minimum over all points:
The functional margin depends on the scale of the weights ( doubles if you double ). The geometric margin, on the other hand, is invariant: it measures a real distance in the plane.
The perceptron, and the tension with chapter 3
Definition
Let . The associated perceptron is the classifier
where is the sign function.
1958 and 1960: two separate dates
Frank Rosenblatt publishes the founding paper in Psychological Review (Rosenblatt, 1958). It introduces the theoretical model and the learning rule you are about to see. Two years later, at the Cornell Aeronautical Laboratory, he builds the Mark I Perceptron Mark I Perceptron Physical machine built by Frank Rosenblatt between 1958 and 1960 at Cornell Aeronautical Laboratory. Able to recognise simple shapes thanks to 400 photoreceptors connected to weights that were tunable via motorised potentiometers. The first hardware implementation of a machine learning algorithm, distinct from the theoretical model published in 1958. Source: Rosenblatt, 1958, 1960 : a physical machine with 400 photoreceptors and weights that are tunable via motorised potentiometers. The New York Times declares that the US Navy has just built “a machine that learns by itself”.
The important point: 1958 is the model; 1960 is the hardware. Many accounts conflate the two, but the theoretical paper precedes the machine by two years.
The tension with what chapter 3 taught us
In chapter 3 we proved that the depth of a network is pointless if the activation is linear, and that we need a differentiable activation to compute a gradient. Yet is almost everywhere differentiable with derivative zero. How did Rosenblatt manage to teach a machine equipped with such a function?
The answer, surprisingly, is that he did not need a derivative. His learning procedure is a local geometric correction: whenever the perceptron is wrong on a sample, we shift the weight vector in the direction that would fix the error, with no gradient ever computed.
It is a historical exception. From chapter 7 onwards, we switch back to differentiable activations and gradient descent Gradient descent An optimisation algorithm that iteratively adjusts the parameters of a model to minimise a loss function. At each step, it moves the parameters in the direction opposite the gradient, by a distance proportional to the learning rate. The dominant method for training neural networks. Source: Cauchy, 1847 takes over. But for the perceptron, learning happens by hand, via projection.
The perceptron learning rule
The misaligned road sign analogy
Picture a road sign that points slightly the wrong way. Each time a driver takes a wrong turn because of it, you rotate the sign by a notch in the direction that would have avoided the mistake. You do not compute any derivative, you do not optimise anything: you react locally, incident by incident. After enough incidents, the sign is correctly aligned.
That is exactly what Rosenblatt’s rule does to the weights and the bias .
Statement
Let be the learning rate Learning rate A positive scalar controlling the step size taken by gradient descent at each iteration. Too small, training is slow; too large, it oscillates or diverges. Often denoted η (eta) or α (alpha). The first hyperparameter to tune in any training run. . For a misclassified sample , the perceptron learning rule Learning rule Procedure that updates the parameters (weights, bias) of a model from observed samples. For the perceptron, the rule is $w \leftarrow w + \eta \cdot y \cdot x$ and $b \leftarrow b + \eta \cdot y$ applied only when a sample is misclassified. Source: Rosenblatt, 1958 applies:
For a correctly classified sample, we leave everything alone. The procedure walks through the dataset, applies the update on every mistake, and repeats until no sample is misclassified.
Three forms of the same rule
| Form | Expression |
|---|---|
| Coordinate-wise | for every |
| Vectorial | |
| Bias-absorbed | with and |
The three say exactly the same thing. The coordinate-wise form is the most explicit for pen-and-paper computation. The vectorial form is the most compact. The bias-absorbed form is convenient for proofs: it folds two updates ( and ) into one.
Proof: the update strictly improves the functional margin
Before the update, let be the functional margin of . Since the sample is misclassified, .
After the update, the new . Let us compute the new functional margin.
Step 1. Substitute .
Step 2. Expand.
Step 3. Since , we have . The new functional margin is therefore:
Result. The added quantity is strictly positive (because ). The update therefore strictly increased the functional margin of that sample. No derivative anywhere in the proof. □
Build the perceptron step by step
Build the perceptron step by step
Three things to watch while you play:
- On OR or AND, the violet boundary stabilises quickly. The error counter falls to zero and the perceptron has converged.
- On XOR, the error counter never reaches zero, even after a hundred epochs. The rule keeps oscillating forever.
- The rate does not change convergence on a separable dataset: it only changes the size of each step and thus the visual speed of the boundary, not the final outcome.
The convergence theorem (Novikoff, 1962)
The zig-zagging cursor analogy
Picture a cursor bouncing between the two walls of a narrowing channel. Every bounce loses some of its energy. If the channel has a strictly positive width, the cursor eventually settles in the middle after finitely many bounces.
That is exactly what the Novikoff theorem Novikoff's theorem If a dataset is linearly separable with geometric margin $\gamma > 0$ and radius $R = \max_i \|x_i\|$, then the perceptron algorithm initialised at zero converges in at most $T \leq (R / \gamma)^2$ corrections, regardless of the learning rate. Source: Novikoff, 1962 proves: as long as the margin is strictly positive, the number of perceptron corrections is bounded by a quantity that depends only on the geometry of the dataset.
Statement
Let be a linearly separable dataset. Assume there exists a unit vector with and a scalar such that for every :
Let be the radius of the dataset.
A point of rigour: once the bias is absorbed, all the quantities in the theorem are read in . The radius is then , and denotes the optimal separator renormalised to in that same augmented space. The bound then keeps exactly its form.
Theorem. The perceptron algorithm (in bias-absorbed form) initialised at with step performs at most
corrections before classifying every sample correctly.
Proof in two lemmas
Let denote the weight vector after the -th correction.
Lemma 1 (lower bound). For every , .
Step 1. At initialisation , hence .
Step 2. On the -st correction, for a misclassified sample .
Step 3. Compute :
Step 4. By the separation hypothesis with margin : . Therefore:
Step 5. By induction on , . □
Lemma 2 (upper bound). For every , .
Step 1. At initialisation, .
Step 2. On the -st correction:
Step 3. Since is misclassified by , we have (otherwise it would be correctly classified). The middle term is therefore non-positive:
Step 4. By induction, , so . □
Combining via Cauchy-Schwarz.
Step 1. The Cauchy-Schwarz inequality says, for two vectors :
Step 2. Applied to and with :
Step 3. Combining with the two lemmas after corrections:
Step 4. Square and divide by (which is positive):
Result. The number of perceptron corrections is bounded by . With a step , the factor would appear identically in both lemmas: the lower bound would become and the upper bound . After Cauchy-Schwarz and simplification, the same bound is recovered. The number of corrections therefore does not depend on the choice of , and the procedure converges in finitely many steps. □
Intuitive reading
- The narrower the margin (two classes very close together), the larger the bound , and the slower the convergence.
- The larger the radius (points far from the origin), the larger the bound, quadratically.
- But however hard the dataset, the bound stays finite as long as .
Explore the bound live
Novikoff bound explorer
Three things to watch as you play:
- Bring the two clusters closer: shrinks and the bound explodes, while the actual grows more modestly.
- Drag an outlier far from the centre: grows, the bound grows too, but the actual does not necessarily grow as fast.
- The ratio is usually well below : the bound is pessimistic, but it exists.
What if the dataset is not separable?
The Novikoff theorem makes a crucial assumption: there exists a linear separator with margin . What happens when that assumption fails?
The perceptron oscillates
The bound was proved under the assumption . When the dataset is not linearly separable, is not defined: no pair classifies every example correctly. As a consequence, Rosenblatt’s learning rule keeps correcting forever, never converges. The weight vector oscillates indefinitely, and even the iterations that pass through an “almost good” are lost in the next step when another misclassified example triggers an update that undoes the progress.
Gallant’s Pocket Algorithm
The classical fix is surprisingly simple: keep the best ever seen in your pocket. After every Rosenblatt update, evaluate the new on the whole dataset, count the number of correctly classified examples, and if that number beats the pocket’s count, replace it. At the end (after a fixed budget of iterations, which you can pick large), you return the pocket content, not the last .
This procedure, the Pocket Algorithm, was introduced by Gallant (1990). On a separable dataset it reduces to the standard perceptron (the pocket ends up holding a perfect separator). On a non-separable dataset it converges in probability to the separator that maximises the number of correctly classified examples. We lose the Novikoff guarantee, but we recover a procedure usable in practice.
The impossibility of XOR
The impossible checkerboard analogy
Picture four squares of a chessboard: the diagonals alternate (two whites at bottom-left and top-right, two blacks at top-left and bottom-right). No single straight line can separate the whites from the blacks. That is exactly the situation of the XOR function.
Statement
The function defined by , , , is not realisable by a single perceptron.
Proof by contradiction
For this proof, we go back to the standard encoding of XOR with targets in and the Heaviside convention: the output is if and otherwise. The result does not depend on the encoding: switching to and gives four equivalent inequalities by change of variable, and the same contradiction.
Suppose there exist such that the perceptron realises XOR. Then the four constraints below are simultaneously true:
| Point | XOR | Inequality |
|---|---|---|
| (1): | ||
| (2): | ||
| (3): | ||
| (4): |
Step 1. Add (2) and (3):
Step 2. From (1), , hence , hence . Rewriting step 1:
The chain combines a non-strict inequality () and a strict one (), so the result is strict: .
Step 3. Add on both sides:
Step 4. But (4) says .
Result. We have simultaneously and . Contradiction. Hence no triple realises XOR. □
Explore it yourself
Why XOR is impossible
Three things to watch as you play:
- Whatever you set the sliders to, you never reach 4 / 4 inequalities satisfied. The maximum is 3 / 4.
- The “OR-like” preset satisfies (1), (2), (3) but violates (4). The “AND-like” preset satisfies (1), (4) but violates (2) or (3). No linear boundary reconciles all four.
- Clicking “Why 4 / 4 is impossible” reveals the proof by contradiction in compressed form.
Historical context
Marvin Minsky and Seymour Papert Minsky & Papert Marvin Minsky and Seymour Papert, authors of *Perceptrons* (MIT Press, 1969), which formally proved the limits of a single perceptron, notably the impossibility of computing the XOR function. Their analysis contributed to the decline of public funding for neural network research until the mid-1980s. Source: Minsky & Papert, *Perceptrons*, MIT Press, 1969 publish Perceptrons in 1969, the central chapter of which proves this impossibility and generalises it to a whole family of “non-local” functions. Their rigorous analysis contributed to a withdrawal of public funding for neural network research, a period we now call the first AI winter AI winter Period of disinterest and funding cuts for artificial intelligence research. The first AI winter, in the 1970s and early 1980s, followed Minsky and Papert's critique of the perceptron (1969). The second, in the late 1980s and 1990s, followed disappointments with expert systems. Each winter preceded a comeback: backpropagation after the first, modern deep learning after the second. Source: Russell & Norvig, *AIMA*, ch. 1 . Recovery had to wait for Hopfield in 1982 and the rediscovery of backpropagation by Rumelhart, Hinton and Williams in 1986.
Pen-and-paper exercises
Exercise 1: one iteration in three dimensions
Let , , . Present the sample with target .
(a) Compute the prediction . Is it correct?
(b) Apply the learning rule and give after the update.
(c) Check that the new functional margin is strictly greater than the old one.
Exercise 2: test separability
Is the dataset linearly separable? Justify.
Exercise 3: an explicit perceptron for NAND
Find explicitly such that the perceptron computes the NAND function, that is except when . Verify your solution on the four points.
Exercise 4: without bias, Novikoff can fail
Suppose we force throughout the entire training (the perceptron updates only , never ). Give a separable dataset of two points in dimension for which the no-bias perceptron does not converge. Justify.
In one sentence
The perceptron proves that a machine can build the boundary that separates two classes geometrically, without computing any derivative, in finitely many steps: provided that boundary is a straight line, and XOR is not.
Towards chapter 5: stacking solves XOR
XOR is not linearly separable, but we can write it as a composition of functions that are:
OR is linearly separable, and is too (you just proved it in exercise 3). Letting and , we have , and AND is itself separable. Three perceptrons, two in a first layer (OR and NAND) then one in a second (AND), therefore suffice to solve XOR:
That is exactly what chapter 5 will formalise: stacking perceptrons into layers vastly enlarges the class of functions the network can represent. In chapter 5, a single neuron splits space into two half-spaces; several neurons organised in layers split it into arbitrarily complex regions.
Quiz
1. Why does the perceptron learning rule not need to compute a derivative?
2. On a dataset that is not linearly separable, what does the perceptron do?
3. In the update w ← w + η y x, why do we multiply by x rather than something else?
4. What is the geometric meaning of b in the equation w · x + b = 0?
5. Why is XOR not realisable by a single perceptron, and what does chapter 5 do about it?
Sources
- Rosenblatt, F. (1958). “The perceptron: A probabilistic model for information storage and organization in the brain.” Psychological Review 65(6), 386-408. DOI 10.1037/h0042519
- Novikoff, A. B. J. (1962). “On convergence proofs on perceptrons.” Symposium on the Mathematical Theory of Automata 12, 615-622.
- Minsky, M. & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press. ISBN 978-0-262-13043-1.
- Gallant, S. I. (1990). “Perceptron-Based Learning Algorithms.” IEEE Transactions on Neural Networks 1(2), 179-191. (Introduces the Pocket Algorithm.) DOI 10.1109/72.80230
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning, ch. 4. Springer. ISBN 978-0-387-31073-2.
- Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning, ch. 4. Springer. ISBN 978-0-387-84857-0.
- Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning, ch. 1. MIT Press. ISBN 978-0-262-03561-3.
Further reading
- Cortes, C. & Vapnik, V. (1995). “Support-Vector Networks.” Machine Learning 20(3), 273-297. SVMs are precisely the max-margin perceptron: where Rosenblatt’s rule accepts any valid separator, SVM optimises the separator whose geometric margin (the very same that shows up in Novikoff) is as large as possible. The direct heir of the perceptron, and the cornerstone of supervised learning before 2012. DOI 10.1007/BF00994018
- Freund, Y. & Schapire, R. E. (1999). “Large Margin Classification Using the Perceptron Algorithm.” Machine Learning 37(3), 277-296. Introduces the voted perceptron that aggregates the intermediate hypotheses: on noisy datasets, performance close to SVM for a cost comparable to the perceptron. DOI 10.1023/A:1007662407062