04 / 09 The perceptron
  1. ← Neural networks: foundations and mathematics
  2. 00 Foreword
  3. 01 The artificial neuron
  4. 02 Essential linear algebra
  5. 03 Activation functions
  6. 04 The perceptron
  7. 05 From the neuron to the multilayer network
  8. 06 Forward pass and loss functions
  9. 07 Derivatives and the chain rule
  10. 08 Backpropagation
Neural networks: foundations and mathematics · 04 / 09

The perceptron

How Rosenblatt taught a machine to learn without a gradient (1958).

In chapter 3 you learned why an activation function has to be non-linear Non-linearity The property of a function that is not affine. A non-linear activation function is mandatory in a deep network, otherwise the composition of several layers reduces to a single equivalent affine layer and depth loses its point. and differentiable for a deep network to be mathematically interesting. Yet the first artificial neuron that could ever learn, the perceptron of Frank Rosenblatt (1958), uses the threshold function Threshold function Binary activation function H(z) equal to 1 if z >= 0 and 0 otherwise. Also called the Heaviside function. It is the original activation of McCulloch-Pitts (1943) and Rosenblatt's perceptron (1958), later dropped because it is not differentiable. , which is almost everywhere differentiable with derivative zero. How did Rosenblatt make such a machine learn anything at all?

This chapter answers that question by building the perceptron from a purely geometric point of view, without ever differentiating. We prove that the procedure converges on a separable dataset, then we discover the limit that ended the first age of neural networks.

The geometry of a hyperplane

The ruler-on-the-table analogy

Place a flat ruler on a table. The ruler divides the space above the table into two zones: one in front of you, one behind. The edge of the ruler is the boundary. The direction perpendicular to that edge is what we will call the normal vector Normal vector Vector $w$ that defines the direction perpendicular to a hyperplane with equation $w \cdot x + b = 0$. Its direction indicates which side of the hyperplane a point lies on; its norm sets the scale of the signed distance. Source: Bishop, PRML, ch. 4 , and how far you put the ruler from your body is the offset.

A hyperplane Hyperplane A subset of Rⁿ defined by a linear equation w · x + b = 0. In two dimensions it is a line, in three a plane. It is exactly the decision boundary drawn by a single neuron. in two dimensions is exactly this: a line that splits the plane into two half-spaces Half-space One of the two regions in which a hyperplane partitions Rⁿ. Algebraically, the set of points x with w · x + b > 0 (resp. < 0). A threshold neuron splits space into exactly two half-spaces: active and inactive. . In three dimensions, it is a plane. In nn dimensions, it is a flat surface of dimension n1n - 1.

Formal definition

Let wRnw \in \mathbb{R}^n be a non-zero vector and bRb \in \mathbb{R} a scalar. The affine hyperplane with equation wx+b=0w \cdot x + b = 0 is the set:

H={xRn  :  wx+b=0}.\mathcal{H} = \{\, x \in \mathbb{R}^n \;:\; w \cdot x + b = 0 \,\}.

Signed distance from a point to the hyperplane

For an arbitrary point xRnx \in \mathbb{R}^n, we define its signed distance Signed distance Perpendicular distance from a point to a hyperplane, carrying a sign that depends on which side of the hyperplane the point lies on. For the hyperplane $w \cdot x + b = 0$, it equals $d(x) = (w \cdot x + b) / \|w\|$: positive on one side, negative on the other, zero on the hyperplane itself. Source: Hastie, Tibshirani, Friedman, ESL, ch. 4 to H\mathcal{H} by:

d(x,H)  =  wx+bw.d(x, \mathcal{H}) \;=\; \frac{w \cdot x + b}{\|w\|}.

This quantity is positive on one side of H\mathcal{H}, negative on the other, and zero on H\mathcal{H} itself. Its absolute value is the usual perpendicular distance.

Play with the hyperplane

xyw (normal)+
Hyperplane: w · x + b = 0

Hyperplane geometry

w₁1.00
w₂0.50
b-0.50
Click inside the grid to place a probe point.
No probe point yet.

Three things to notice as you play:

  • When b=0b = 0, the hyperplane passes exactly through the origin.
  • Doubling ww does not move the line: only the direction of ww matters for the position of the hyperplane, not its magnitude.
  • The signed distance flips sign when you click on the other side of the line: that is the sign we will exploit for classification.

Linearly separable, with a margin

The buffer zone analogy

Picture two neighbouring countries with a buffer zone between them. The shared border is the line in the middle. The width of the buffer zone is the margin: the wider it is, the more robust the border is to small perturbations of the points. If the buffer zone shrinks to zero, citizens of both countries cross paths and the border becomes ambiguous.

Encoding the targets: why y{1,+1}y \in \{-1, +1\}

We have so far often encoded binary classes by 00 and 11. For the perceptron, we will pick 1-1 and +1+1 instead. The choice simplifies everything: a sample (x,y)(x, y) is correctly classified by (w,b)(w, b) if and only if yy and wx+bw \cdot x + b have the same sign, which is a single inequality:

y(wx+b)  >  0.y \, (w \cdot x + b) \;>\; 0.

Formal definitions

Let D={(xi,yi)}i=1m\mathcal{D} = \{(x_i, y_i)\}_{i=1}^m be a dataset with xiRnx_i \in \mathbb{R}^n and yi{1,+1}y_i \in \{-1, +1\}. We say that D\mathcal{D} is linearly separable Linearly separable A labelled dataset is linearly separable if there exists a hyperplane that correctly separates the label-1 points from the label-0 points. XOR is the historical example of a non linearly separable problem. if there exist (w,b)Rn×R(w, b) \in \mathbb{R}^n \times \mathbb{R} such that for every ii:

yi(wxi+b)  >  0.y_i \, (w \cdot x_i + b) \;>\; 0.

For such a pair (w,b)(w, b), we define two margins. The functional margin Functional margin For a sample $(x, y)$ with $y \in \{-1, +1\}$, the functional margin is the quantity $\hat\gamma = y (w \cdot x + b)$. It is strictly positive if and only if the sample is correctly classified. It depends on the scale of the weights and is not a geometric distance. Source: Bishop, PRML, ch. 7 of point ii is γ^i=yi(wxi+b)\hat\gamma_i = y_i (w \cdot x_i + b). The geometric margin Geometric margin Minimal perpendicular distance between a separating hyperplane and the points of the dataset. Defined by $\gamma = \min_i y_i (w \cdot x_i + b) / \|w\|$ with $y_i \in \{-1, +1\}$. Plays a central role in Novikoff's theorem and in the formulation of support vector machines. Source: Novikoff, 1962 of point ii is γi=γ^i/w\gamma_i = \hat\gamma_i / \|w\|. The margin of the dataset is the minimum over all points:

γ  =  mini=1,,myi(wxi+b)w.\gamma \;=\; \min_{i=1,\dots,m} \, \frac{y_i \, (w \cdot x_i + b)}{\|w\|}.

The functional margin depends on the scale of the weights (γ^\hat\gamma doubles if you double ww). The geometric margin, on the other hand, is invariant: it measures a real distance in the plane.

The perceptron, and the tension with chapter 3

Definition

Let (w,b)Rn×R(w, b) \in \mathbb{R}^n \times \mathbb{R}. The associated perceptron is the classifier

y^(x)  =  sgn(wx+b),\hat y(x) \;=\; \operatorname{sgn}(w \cdot x + b),

where sgn\operatorname{sgn} is the sign function.

1958 and 1960: two separate dates

Frank Rosenblatt publishes the founding paper in Psychological Review (Rosenblatt, 1958). It introduces the theoretical model and the learning rule you are about to see. Two years later, at the Cornell Aeronautical Laboratory, he builds the Mark I Perceptron Mark I Perceptron Physical machine built by Frank Rosenblatt between 1958 and 1960 at Cornell Aeronautical Laboratory. Able to recognise simple shapes thanks to 400 photoreceptors connected to weights that were tunable via motorised potentiometers. The first hardware implementation of a machine learning algorithm, distinct from the theoretical model published in 1958. Source: Rosenblatt, 1958, 1960 : a physical machine with 400 photoreceptors and weights that are tunable via motorised potentiometers. The New York Times declares that the US Navy has just built “a machine that learns by itself”.

The important point: 1958 is the model; 1960 is the hardware. Many accounts conflate the two, but the theoretical paper precedes the machine by two years.

The tension with what chapter 3 taught us

In chapter 3 we proved that the depth of a network is pointless if the activation is linear, and that we need a differentiable activation to compute a gradient. Yet sgn\operatorname{sgn} is almost everywhere differentiable with derivative zero. How did Rosenblatt manage to teach a machine equipped with such a function?

The answer, surprisingly, is that he did not need a derivative. His learning procedure is a local geometric correction: whenever the perceptron is wrong on a sample, we shift the weight vector in the direction that would fix the error, with no gradient ever computed.

It is a historical exception. From chapter 7 onwards, we switch back to differentiable activations and gradient descent Gradient descent An optimisation algorithm that iteratively adjusts the parameters of a model to minimise a loss function. At each step, it moves the parameters in the direction opposite the gradient, by a distance proportional to the learning rate. The dominant method for training neural networks. Source: Cauchy, 1847 takes over. But for the perceptron, learning happens by hand, via projection.

The perceptron learning rule

The misaligned road sign analogy

Picture a road sign that points slightly the wrong way. Each time a driver takes a wrong turn because of it, you rotate the sign by a notch in the direction that would have avoided the mistake. You do not compute any derivative, you do not optimise anything: you react locally, incident by incident. After enough incidents, the sign is correctly aligned.

That is exactly what Rosenblatt’s rule does to the weights ww and the bias bb.

Statement

Let η>0\eta > 0 be the learning rate Learning rate A positive scalar controlling the step size taken by gradient descent at each iteration. Too small, training is slow; too large, it oscillates or diverges. Often denoted η (eta) or α (alpha). The first hyperparameter to tune in any training run. . For a misclassified sample (xi,yi)(x_i, y_i), the perceptron learning rule Learning rule Procedure that updates the parameters (weights, bias) of a model from observed samples. For the perceptron, the rule is $w \leftarrow w + \eta \cdot y \cdot x$ and $b \leftarrow b + \eta \cdot y$ applied only when a sample is misclassified. Source: Rosenblatt, 1958 applies:

w    w+ηyixi,b    b+ηyi.w \;\leftarrow\; w + \eta \, y_i \, x_i, \qquad b \;\leftarrow\; b + \eta \, y_i.

For a correctly classified sample, we leave everything alone. The procedure walks through the dataset, applies the update on every mistake, and repeats until no sample is misclassified.

Three forms of the same rule

FormExpression
Coordinate-wisewjwj+ηyixi,jw_j \leftarrow w_j + \eta y_i x_{i,j} for every jj
Vectorialww+ηyixiw \leftarrow w + \eta y_i x_i
Bias-absorbedw~w~+ηyix~i\tilde w \leftarrow \tilde w + \eta y_i \tilde x_i with x~=(x,1)\tilde x = (x, 1) and w~=(w,b)\tilde w = (w, b)

The three say exactly the same thing. The coordinate-wise form is the most explicit for pen-and-paper computation. The vectorial form is the most compact. The bias-absorbed form is convenient for proofs: it folds two updates (ww and bb) into one.

Proof: the update strictly improves the functional margin

Before the update, let γ^i=yi(wxi+b)\hat\gamma_i = y_i (w \cdot x_i + b) be the functional margin of (xi,yi)(x_i, y_i). Since the sample is misclassified, γ^i0\hat\gamma_i \leq 0.

After the update, the new (w,b)=(w+ηyixi,b+ηyi)(w', b') = (w + \eta y_i x_i, \, b + \eta y_i). Let us compute the new functional margin.

Step 1. Substitute (w,b)(w', b').

γ^i  =  yi(wxi+b)  =  yi[(w+ηyixi)xi+(b+ηyi)].\hat\gamma_i' \;=\; y_i \, (w' \cdot x_i + b') \;=\; y_i \, \big[ (w + \eta y_i x_i) \cdot x_i + (b + \eta y_i) \big].

Step 2. Expand.

γ^i  =  yi(wxi+b)+ηyi2(xixi)+ηyi2.\hat\gamma_i' \;=\; y_i \, (w \cdot x_i + b) + \eta \, y_i^2 \, (x_i \cdot x_i) + \eta \, y_i^2.

Step 3. Since yi{1,+1}y_i \in \{-1, +1\}, we have yi2=1y_i^2 = 1. The new functional margin is therefore:

γ^i  =  γ^i+η(xi2+1).\hat\gamma_i' \;=\; \hat\gamma_i + \eta \, \big( \|x_i\|^2 + 1 \big).

Result. The added quantity η(xi2+1)\eta (\|x_i\|^2 + 1) is strictly positive (because η>0\eta > 0). The update therefore strictly increased the functional margin of that sample. No derivative anywhere in the proof. □

Build the perceptron step by step

0011x₁x₂+++
target +1 target −1green outline: correctred outline: incorrect

Build the perceptron step by step

Dataset
Rate η0.50
w₁ = 0.00
w₂ = 0.00
b = 0.00
Corrections : 0 · Epochs : 0
Misclassified : 1/4

Three things to watch while you play:

  • On OR or AND, the violet boundary stabilises quickly. The error counter falls to zero and the perceptron has converged.
  • On XOR, the error counter never reaches zero, even after a hundred epochs. The rule keeps oscillating forever.
  • The rate η\eta does not change convergence on a separable dataset: it only changes the size of each step and thus the visual speed of the boundary, not the final outcome.

The convergence theorem (Novikoff, 1962)

The zig-zagging cursor analogy

Picture a cursor bouncing between the two walls of a narrowing channel. Every bounce loses some of its energy. If the channel has a strictly positive width, the cursor eventually settles in the middle after finitely many bounces.

That is exactly what the Novikoff theorem Novikoff's theorem If a dataset is linearly separable with geometric margin $\gamma > 0$ and radius $R = \max_i \|x_i\|$, then the perceptron algorithm initialised at zero converges in at most $T \leq (R / \gamma)^2$ corrections, regardless of the learning rate. Source: Novikoff, 1962 proves: as long as the margin γ\gamma is strictly positive, the number of perceptron corrections is bounded by a quantity that depends only on the geometry of the dataset.

Statement

Let D={(xi,yi)}i=1m\mathcal{D} = \{(x_i, y_i)\}_{i=1}^m be a linearly separable dataset. Assume there exists a unit vector wRnw^* \in \mathbb{R}^n with w=1\|w^*\| = 1 and a scalar bRb^* \in \mathbb{R} such that for every ii:

yi(wxi+b)    γ  >  0.y_i \, (w^* \cdot x_i + b^*) \;\geq\; \gamma \;>\; 0.

Let R=maxixiR = \max_{i} \|x_i\| be the radius of the dataset.

A point of rigour: once the bias is absorbed, all the quantities in the theorem are read in Rn+1\mathbb{R}^{n+1}. The radius is then R=maxix~i=maxixi2+1R = \max_i \|\tilde x_i\| = \max_i \sqrt{\|x_i\|^2 + 1}, and ww^* denotes the optimal separator renormalised to w=1\|w^*\| = 1 in that same augmented space. The bound R2/γ2R^2 / \gamma^2 then keeps exactly its form.

Theorem. The perceptron algorithm (in bias-absorbed form) initialised at w0=0w_0 = 0 with step η=1\eta = 1 performs at most

T    R2γ2T \;\leq\; \frac{R^2}{\gamma^2}

corrections before classifying every sample correctly.

Proof in two lemmas

Let wtw_t denote the weight vector after the tt-th correction.

Lemma 1 (lower bound). For every t0t \geq 0, wtwtγw_t \cdot w^* \geq t \gamma.

Step 1. At initialisation w0=0w_0 = 0, hence w0w=0w_0 \cdot w^* = 0.

Step 2. On the (t+1)(t+1)-st correction, wt+1=wt+yixiw_{t+1} = w_t + y_i x_i for a misclassified sample (xi,yi)(x_i, y_i).

Step 3. Compute wt+1ww_{t+1} \cdot w^*:

wt+1w  =  (wt+yixi)w  =  wtw+yi(wxi).w_{t+1} \cdot w^* \;=\; (w_t + y_i x_i) \cdot w^* \;=\; w_t \cdot w^* + y_i \, (w^* \cdot x_i).

Step 4. By the separation hypothesis with margin γ\gamma: yi(wxi)γy_i (w^* \cdot x_i) \geq \gamma. Therefore:

wt+1w    wtw+γ.w_{t+1} \cdot w^* \;\geq\; w_t \cdot w^* + \gamma.

Step 5. By induction on tt, wtwtγw_t \cdot w^* \geq t \gamma. □

Lemma 2 (upper bound). For every t0t \geq 0, wt2tR2\|w_t\|^2 \leq t R^2.

Step 1. At initialisation, w02=0\|w_0\|^2 = 0.

Step 2. On the (t+1)(t+1)-st correction:

wt+12  =  wt+yixi2  =  wt2+2yi(wtxi)+xi2.\|w_{t+1}\|^2 \;=\; \|w_t + y_i x_i\|^2 \;=\; \|w_t\|^2 + 2 y_i \, (w_t \cdot x_i) + \|x_i\|^2.

Step 3. Since (xi,yi)(x_i, y_i) is misclassified by wtw_t, we have yi(wtxi)0y_i (w_t \cdot x_i) \leq 0 (otherwise it would be correctly classified). The middle term is therefore non-positive:

wt+12    wt2+xi2    wt2+R2.\|w_{t+1}\|^2 \;\leq\; \|w_t\|^2 + \|x_i\|^2 \;\leq\; \|w_t\|^2 + R^2.

Step 4. By induction, wt2tR2\|w_t\|^2 \leq t R^2, so wttR\|w_t\| \leq \sqrt{t} \, R. □

Combining via Cauchy-Schwarz.

Step 1. The Cauchy-Schwarz inequality says, for two vectors u,vu, v:

uv    uv.u \cdot v \;\leq\; \|u\| \, \|v\|.

Step 2. Applied to wTw_T and ww^* with w=1\|w^*\| = 1:

wTw    wT.w_T \cdot w^* \;\leq\; \|w_T\|.

Step 3. Combining with the two lemmas after TT corrections:

Tγ    wTw    wT    TR.T \gamma \;\leq\; w_T \cdot w^* \;\leq\; \|w_T\| \;\leq\; \sqrt{T} \, R.

Step 4. Square and divide by TT (which is positive):

T2γ2    TR2        T    R2γ2.T^2 \gamma^2 \;\leq\; T R^2 \;\;\Longrightarrow\;\; T \;\leq\; \frac{R^2}{\gamma^2}.

Result. The number of perceptron corrections is bounded by R2/γ2R^2 / \gamma^2. With a step η1\eta \neq 1, the factor η\eta would appear identically in both lemmas: the lower bound would become TηγT \, \eta \, \gamma and the upper bound Tη2R2T \, \eta^2 \, R^2. After Cauchy-Schwarz and simplification, the same bound is recovered. The number of corrections therefore does not depend on the choice of η\eta, and the procedure converges in finitely many steps. □

Intuitive reading

  • The narrower the margin γ\gamma (two classes very close together), the larger the bound R2/γ2R^2 / \gamma^2, and the slower the convergence.
  • The larger the radius RR (points far from the origin), the larger the bound, quadratically.
  • But however hard the dataset, the bound stays finite as long as γ>0\gamma > 0.

Explore the bound live

Drag the points to change R and γ.

Novikoff bound explorer

R (radius) : 1.44
γ (achieved margin) : 0.707
Actual T : 1
Bound (R/γ)² : 4.2
Ratio : 24.0%
Errors per epoch

Three things to watch as you play:

  • Bring the two clusters closer: γ\gamma shrinks and the bound (R/γ)2(R / \gamma)^2 explodes, while the actual TT grows more modestly.
  • Drag an outlier far from the centre: RR grows, the bound grows too, but the actual TT does not necessarily grow as fast.
  • The ratio T/(R/γ)2T / (R/\gamma)^2 is usually well below 11: the bound is pessimistic, but it exists.

What if the dataset is not separable?

The Novikoff theorem makes a crucial assumption: there exists a linear separator with margin γ>0\gamma > 0. What happens when that assumption fails?

The perceptron oscillates

The bound TR2/γ2T \leq R^2 / \gamma^2 was proved under the assumption γ>0\gamma > 0. When the dataset is not linearly separable, γ\gamma is not defined: no pair (w,b)(w, b) classifies every example correctly. As a consequence, Rosenblatt’s learning rule keeps correcting forever, never converges. The weight vector ww oscillates indefinitely, and even the iterations that pass through an “almost good” (w,b)(w, b) are lost in the next step when another misclassified example triggers an update that undoes the progress.

Gallant’s Pocket Algorithm

The classical fix is surprisingly simple: keep the best (w,b)(w, b) ever seen in your pocket. After every Rosenblatt update, evaluate the new (w,b)(w, b) on the whole dataset, count the number of correctly classified examples, and if that number beats the pocket’s count, replace it. At the end (after a fixed budget of iterations, which you can pick large), you return the pocket content, not the last (w,b)(w, b).

This procedure, the Pocket Algorithm, was introduced by Gallant (1990). On a separable dataset it reduces to the standard perceptron (the pocket ends up holding a perfect separator). On a non-separable dataset it converges in probability to the separator that maximises the number of correctly classified examples. We lose the Novikoff guarantee, but we recover a procedure usable in practice.

The impossibility of XOR

The impossible checkerboard analogy

Picture four squares of a chessboard: the diagonals alternate (two whites at bottom-left and top-right, two blacks at top-left and bottom-right). No single straight line can separate the whites from the blacks. That is exactly the situation of the XOR function.

Statement

The function XOR:{0,1}2{0,1}\text{XOR}: \{0, 1\}^2 \to \{0, 1\} defined by XOR(0,0)=0\text{XOR}(0,0) = 0, XOR(0,1)=1\text{XOR}(0,1) = 1, XOR(1,0)=1\text{XOR}(1,0) = 1, XOR(1,1)=0\text{XOR}(1,1) = 0 is not realisable by a single perceptron.

Proof by contradiction

For this proof, we go back to the standard encoding of XOR with targets in {0,1}\{0, 1\} and the Heaviside convention: the output is 11 if wx+b0w \cdot x + b \geq 0 and 00 otherwise. The result does not depend on the encoding: switching to {1,+1}\{-1, +1\} and sgn\operatorname{sgn} gives four equivalent inequalities by change of variable, and the same contradiction.

Suppose there exist (w1,w2,b)R3(w_1, w_2, b) \in \mathbb{R}^3 such that the perceptron realises XOR. Then the four constraints below are simultaneously true:

Point (x1,x2)(x_1, x_2)XORInequality
(0,0)(0, 0)00(1): b<0b < 0
(1,0)(1, 0)11(2): w1+b0w_1 + b \geq 0
(0,1)(0, 1)11(3): w2+b0w_2 + b \geq 0
(1,1)(1, 1)00(4): w1+w2+b<0w_1 + w_2 + b < 0

Step 1. Add (2) and (3):

(w1+b)+(w2+b)    0        w1+w2+2b    0.(w_1 + b) + (w_2 + b) \;\geq\; 0 \;\;\Longrightarrow\;\; w_1 + w_2 + 2b \;\geq\; 0.

Step 2. From (1), b<0b < 0, hence b>0-b > 0, hence 2b>0-2b > 0. Rewriting step 1:

w1+w2    2b  >  0.w_1 + w_2 \;\geq\; -2b \;>\; 0.

The chain combines a non-strict inequality (2b\geq -2b) and a strict one (2b>0-2b > 0), so the result is strict: w1+w2>0w_1 + w_2 > 0.

Step 3. Add bb on both sides:

w1+w2+b    b  >  0.w_1 + w_2 + b \;\geq\; -b \;>\; 0.

Step 4. But (4) says w1+w2+b<0w_1 + w_2 + b < 0.

Result. We have simultaneously w1+w2+b>0w_1 + w_2 + b > 0 and w1+w2+b<0w_1 + w_2 + b < 0. Contradiction. Hence no triple (w1,w2,b)(w_1, w_2, b) realises XOR. □

Explore it yourself

00110110

Why XOR is impossible

w₁1.00
w₂1.00
b-0.50
The four XOR inequalities
(1) (0,0)0 : b < 0
(2) (1,0)1 : w₁ + b ≥ 0
(3) (0,1)1 : w₂ + b ≥ 0
(4) (1,1)0 : w₁ + w₂ + b < 0
Inequalities satisfied3 / 4

Three things to watch as you play:

  • Whatever you set the sliders to, you never reach 4 / 4 inequalities satisfied. The maximum is 3 / 4.
  • The “OR-like” preset satisfies (1), (2), (3) but violates (4). The “AND-like” preset satisfies (1), (4) but violates (2) or (3). No linear boundary reconciles all four.
  • Clicking “Why 4 / 4 is impossible” reveals the proof by contradiction in compressed form.

Historical context

Marvin Minsky and Seymour Papert Minsky & Papert Marvin Minsky and Seymour Papert, authors of *Perceptrons* (MIT Press, 1969), which formally proved the limits of a single perceptron, notably the impossibility of computing the XOR function. Their analysis contributed to the decline of public funding for neural network research until the mid-1980s. Source: Minsky & Papert, *Perceptrons*, MIT Press, 1969 publish Perceptrons in 1969, the central chapter of which proves this impossibility and generalises it to a whole family of “non-local” functions. Their rigorous analysis contributed to a withdrawal of public funding for neural network research, a period we now call the first AI winter AI winter Period of disinterest and funding cuts for artificial intelligence research. The first AI winter, in the 1970s and early 1980s, followed Minsky and Papert's critique of the perceptron (1969). The second, in the late 1980s and 1990s, followed disappointments with expert systems. Each winter preceded a comeback: backpropagation after the first, modern deep learning after the second. Source: Russell & Norvig, *AIMA*, ch. 1 . Recovery had to wait for Hopfield in 1982 and the rediscovery of backpropagation by Rumelhart, Hinton and Williams in 1986.

Pen-and-paper exercises

Exercise 1: one iteration in three dimensions

Let w=(0.2,0.5,0.1)w = (0.2, -0.5, 0.1), b=0b = 0, η=0.1\eta = 0.1. Present the sample x=(1,1,1)x = (1, 1, -1) with target y=+1y = +1.

(a) Compute the prediction sgn(wx+b)\operatorname{sgn}(w \cdot x + b). Is it correct?

(b) Apply the learning rule and give (w,b)(w', b') after the update.

(c) Check that the new functional margin γ^i=y(wx+b)\hat\gamma_i' = y (w' \cdot x + b') is strictly greater than the old one.

Exercise 2: test separability

Is the dataset {((0,0),+1),((1,1),+1),((1,0),1),((0,1),1)}\{((0, 0), +1), ((1, 1), +1), ((1, 0), -1), ((0, 1), -1)\} linearly separable? Justify.

Exercise 3: an explicit perceptron for NAND

Find explicitly (w1,w2,b)(w_1, w_2, b) such that the perceptron computes the NAND function, that is NAND(x1,x2)=1\text{NAND}(x_1, x_2) = 1 except when x1=x2=1x_1 = x_2 = 1. Verify your solution on the four points.

Exercise 4: without bias, Novikoff can fail

Suppose we force b=0b = 0 throughout the entire training (the perceptron updates only ww, never bb). Give a separable dataset of two points in dimension 11 for which the no-bias perceptron does not converge. Justify.

In one sentence

The perceptron proves that a machine can build the boundary that separates two classes geometrically, without computing any derivative, in finitely many steps: provided that boundary is a straight line, and XOR is not.

Towards chapter 5: stacking solves XOR

XOR is not linearly separable, but we can write it as a composition of functions that are:

XOR(x1,x2)  =  (x1    x2)    ¬(x1    x2).\text{XOR}(x_1, x_2) \;=\; \big( x_1 \;\vee\; x_2 \big) \;\wedge\; \neg\big( x_1 \;\wedge\; x_2 \big).

OR is linearly separable, and ¬(x1x2)=NAND(x1,x2)\neg(x_1 \wedge x_2) = \text{NAND}(x_1, x_2) is too (you just proved it in exercise 3). Letting u=x1x2u = x_1 \vee x_2 and v=NAND(x1,x2)v = \text{NAND}(x_1, x_2), we have XOR(x1,x2)=uv=AND(u,v)\text{XOR}(x_1, x_2) = u \wedge v = \text{AND}(u, v), and AND is itself separable. Three perceptrons, two in a first layer (OR and NAND) then one in a second (AND), therefore suffice to solve XOR:

Decomposing XOR with two layers of perceptrons

That is exactly what chapter 5 will formalise: stacking perceptrons into layers vastly enlarges the class of functions the network can represent. In chapter 5, a single neuron splits space into two half-spaces; several neurons organised in layers split it into arbitrarily complex regions.

Quiz

Quiz
  1. 1. Why does the perceptron learning rule not need to compute a derivative?

  2. 2. On a dataset that is not linearly separable, what does the perceptron do?

  3. 3. In the update w ← w + η y x, why do we multiply by x rather than something else?

  4. 4. What is the geometric meaning of b in the equation w · x + b = 0?

  5. 5. Why is XOR not realisable by a single perceptron, and what does chapter 5 do about it?

Sources

  • Rosenblatt, F. (1958). “The perceptron: A probabilistic model for information storage and organization in the brain.” Psychological Review 65(6), 386-408. DOI 10.1037/h0042519
  • Novikoff, A. B. J. (1962). “On convergence proofs on perceptrons.” Symposium on the Mathematical Theory of Automata 12, 615-622.
  • Minsky, M. & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press. ISBN 978-0-262-13043-1.
  • Gallant, S. I. (1990). “Perceptron-Based Learning Algorithms.” IEEE Transactions on Neural Networks 1(2), 179-191. (Introduces the Pocket Algorithm.) DOI 10.1109/72.80230
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning, ch. 4. Springer. ISBN 978-0-387-31073-2.
  • Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning, ch. 4. Springer. ISBN 978-0-387-84857-0.
  • Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning, ch. 1. MIT Press. ISBN 978-0-262-03561-3.

Further reading

  • Cortes, C. & Vapnik, V. (1995). “Support-Vector Networks.” Machine Learning 20(3), 273-297. SVMs are precisely the max-margin perceptron: where Rosenblatt’s rule accepts any valid separator, SVM optimises the separator whose geometric margin γ\gamma (the very same that shows up in Novikoff) is as large as possible. The direct heir of the perceptron, and the cornerstone of supervised learning before 2012. DOI 10.1007/BF00994018
  • Freund, Y. & Schapire, R. E. (1999). “Large Margin Classification Using the Perceptron Algorithm.” Machine Learning 37(3), 277-296. Introduces the voted perceptron that aggregates the intermediate hypotheses: on noisy datasets, performance close to SVM for a cost comparable to the perceptron. DOI 10.1023/A:1007662407062