Neural networks: foundations and mathematics · 04 / 09

The perceptron

How Rosenblatt taught a machine to learn without a gradient (1958).

In chapter 3 you learned why an activation function has to be non-linear and differentiable for a deep network to be mathematically interesting. Yet the first artificial neuron that could ever learn, the perceptron of Frank Rosenblatt (1958), uses the threshold function $\operatorname{sgn}$, which is almost everywhere differentiable with derivative zero. How did Rosenblatt make such a machine learn anything at all?

This chapter answers that question by building the perceptron from a purely geometric point of view, without ever differentiating. We prove that the procedure converges on a separable dataset, then we discover the limit that ended the first age of neural networks.

The geometry of a hyperplane

The ruler-on-the-table analogy

Place a flat ruler on a table. The ruler divides the space above the table into two zones: one in front of you, one behind. The edge of the ruler is the boundary. The direction perpendicular to that edge is what we will call the normal vector, and how far you put the ruler from your body is the offset.

A hyperplane in two dimensions is exactly this: a line that splits the plane into two half-spaces. In three dimensions, it is a plane. In $n$ dimensions, it is a flat surface of dimension $n - 1$.

Formal definition

Let $w \in \mathbb{R}^n$ be a non-zero vector and $b \in \mathbb{R}$ a scalar. The affine hyperplane with equation $w \cdot x + b = 0$ is the set:

\mathcal{H} = \{\, x \in \mathbb{R}^n \;:\; w \cdot x + b = 0 \,\}.

Signed distance from a point to the hyperplane

For an arbitrary point $x \in \mathbb{R}^n$ , we define its signed distance to $\mathcal{H}$ by:

d(x, \mathcal{H}) \;=\; \frac{w \cdot x + b}{\|w\|}.

This quantity is positive on one side of $\mathcal{H}$ , negative on the other, and zero on $\mathcal{H}$ itself. Its absolute value is the usual perpendicular distance.
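The formula can be checked numerically with a minimal sketch. The function name `signed_distance` and the sample hyperplane $w = (3, 4)$, $b = -5$ are illustrative choices, not taken from the chapter.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w . x + b = 0."""
    norm_w = math.sqrt(sum(wj * wj for wj in w))
    if norm_w == 0:
        raise ValueError("w must be a non-zero vector")
    return (sum(wj * xj for wj, xj in zip(w, x)) + b) / norm_w

# Illustrative hyperplane: 3*x1 + 4*x2 - 5 = 0, with ||w|| = 5.
w, b = (3.0, 4.0), -5.0
d_positive_side = signed_distance(w, b, (3.0, 4.0))   # (9 + 16 - 5) / 5
d_negative_side = signed_distance(w, b, (0.0, 0.0))   # -5 / 5
d_on_plane = signed_distance(w, b, (3.0, -1.0))       # (9 - 4 - 5) / 5

# Scaling w and b by the same positive factor leaves the distance unchanged.
d_rescaled = signed_distance((6.0, 8.0), -10.0, (3.0, 4.0))
```

The three probe points land on the positive side, the negative side, and the hyperplane itself, which is the sign pattern used later for classification.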

Play with the hyperplane

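[Interactive figure: hyperplane geometry. The line $w \cdot x + b = 0$ is drawn in the plane, with sliders for $w_1$, $w_2$ and $b$ (initially $w_1 = 1.00$, $w_2 = 0.50$, $b = -0.50$). Clicking inside the grid places a probe point and shows its signed distance.]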

Three things to notice as you play:

When $b = 0$ , the hyperplane passes exactly through the origin.
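Doubling both $w$ and $b$ does not move the line: only the direction of $(w, b)$ matters for the position of the hyperplane, not its magnitude. Doubling $w$ alone keeps the orientation of the line but halves its distance to the origin.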
The signed distance flips sign when you click on the other side of the line: that is the sign we will exploit for classification.

Linearly separable, with a margin

The buffer zone analogy

Picture two neighbouring countries with a buffer zone between them. The shared border is the line in the middle. The width of the buffer zone is the margin: the wider it is, the more robust the border is to small perturbations of the points. If the buffer zone shrinks to zero, citizens of both countries cross paths and the border becomes ambiguous.

Encoding the targets: why $y \in \{-1, +1\}$

We have so far often encoded binary classes by $0$ and $1$ . For the perceptron, we will pick $-1$ and $+1$ instead. The choice simplifies everything: a sample $(x, y)$ is correctly classified by $(w, b)$ if and only if $y$ and $w \cdot x + b$ have the same sign, which is a single inequality:

y \, (w \cdot x + b) \;>\; 0.

Formal definitions

Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^m$ be a dataset with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$ . We say that $\mathcal{D}$ is linearly separable if there exist $(w, b) \in \mathbb{R}^n \times \mathbb{R}$ such that for every $i$ :

y_i \, (w \cdot x_i + b) \;>\; 0.

For such a pair $(w, b)$ , we define two margins. The functional margin of point $i$ is $\hat\gamma_i = y_i (w \cdot x_i + b)$ . The geometric margin of point $i$ is $\gamma_i = \hat\gamma_i / \|w\|$ . The margin of the dataset is the minimum over all points:

\gamma \;=\; \min_{i=1,\dots,m} \, \frac{y_i \, (w \cdot x_i + b)}{\|w\|}.

The functional margin depends on the scale of the parameters ($\hat\gamma$ doubles if you double both $w$ and $b$). The geometric margin, on the other hand, is invariant under such a rescaling: it measures a real distance in the plane.
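The difference between the two margins is easy to see on a small example. The sketch below uses a hypothetical four-point dataset and a hand-picked separator $(w, b) = ((1, 1), -1)$; both are assumptions made for illustration.

```python
import math

def functional_margins(w, b, data):
    """List of y_i * (w . x_i + b) over the dataset."""
    return [y * (sum(wj * xj for wj, xj in zip(w, x)) + b) for x, y in data]

def geometric_margin(w, b, data):
    """Dataset margin: smallest functional margin divided by ||w||."""
    norm_w = math.sqrt(sum(wj * wj for wj in w))
    return min(functional_margins(w, b, data)) / norm_w

# Hypothetical separable dataset with targets in {-1, +1}.
data = [((2.0, 0.0), +1), ((0.0, 2.0), +1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]
w, b = (1.0, 1.0), -1.0
w2, b2 = (2.0, 2.0), -2.0  # same hyperplane, parameters doubled

func_min = min(functional_margins(w, b, data))
func_min_doubled = min(functional_margins(w2, b2, data))
geo = geometric_margin(w, b, data)
geo_doubled = geometric_margin(w2, b2, data)
```

Doubling the parameters doubles `func_min` but leaves `geo` unchanged, which is why the convergence theorem later in the chapter is stated with a unit-norm separator.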

The perceptron, and the tension with chapter 3

Definition

Let $(w, b) \in \mathbb{R}^n \times \mathbb{R}$ . The associated perceptron is the classifier

\hat y(x) \;=\; \operatorname{sgn}(w \cdot x + b),

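where $\operatorname{sgn}$ is the sign function. Throughout this chapter, a sample that lies exactly on the boundary ($w \cdot x + b = 0$) is counted as misclassified.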

1958 and 1960: two separate dates

Frank Rosenblatt publishes the founding paper in Psychological Review (Rosenblatt, 1958). It introduces the theoretical model and the learning rule you are about to see. Two years later, at the Cornell Aeronautical Laboratory, he builds the Mark I Perceptron: a physical machine with 400 photoreceptors and weights that are tunable via motorised potentiometers. The New York Times declares that the US Navy has just built “a machine that learns by itself”.

The important point: 1958 is the model; 1960 is the hardware. Many accounts conflate the two, but the theoretical paper precedes the machine by two years.

The tension with what chapter 3 taught us

In chapter 3 we proved that the depth of a network is pointless if the activation is linear, and that we need a differentiable activation to compute a gradient. Yet $\operatorname{sgn}$ is almost everywhere differentiable with derivative zero. How did Rosenblatt manage to teach a machine equipped with such a function?

The answer, surprisingly, is that he did not need a derivative. His learning procedure is a local geometric correction: whenever the perceptron is wrong on a sample, we shift the weight vector in the direction that would fix the error, with no gradient ever computed.

It is a historical exception. From chapter 7 onwards, we switch back to differentiable activations and gradient descent takes over. But for the perceptron, learning happens by hand, one local correction at a time.

The perceptron learning rule

The misaligned road sign analogy

Picture a road sign that points slightly the wrong way. Each time a driver takes a wrong turn because of it, you rotate the sign by a notch in the direction that would have avoided the mistake. You do not compute any derivative, you do not optimise anything: you react locally, incident by incident. After enough incidents, the sign is correctly aligned.

That is exactly what Rosenblatt’s rule does to the weights $w$ and the bias $b$ .

Statement

Let $\eta > 0$ be the learning rate. For a misclassified sample $(x_i, y_i)$, the perceptron learning rule applies:

w \;\leftarrow\; w + \eta \, y_i \, x_i, \qquad b \;\leftarrow\; b + \eta \, y_i.

For a correctly classified sample, we leave everything alone. The procedure walks through the dataset, applies the update on every mistake, and repeats until no sample is misclassified.
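The whole procedure fits in a few lines. The following is a minimal sketch, not a reference implementation: the function names, the `max_epochs` safety budget, and the choice to visit samples in a fixed order are assumptions. A sample with functional margin $\leq 0$ is treated as misclassified.

```python
def margin(w, b, x, y):
    """Functional margin y * (w . x + b) of one sample."""
    return y * (sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_perceptron(data, eta=1.0, max_epochs=100):
    """Cycle through data, correcting every mistake, until a clean pass.

    Returns (w, b, corrections, converged).
    """
    w, b = [0.0] * len(data[0][0]), 0.0
    corrections = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if margin(w, b, x, y) <= 0:  # wrong side, or exactly on the boundary
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
                mistakes += 1
        corrections += mistakes
        if mistakes == 0:
            return w, b, corrections, True
    return w, b, corrections, False  # budget exhausted without a clean pass

# OR and XOR on {0, 1}^2, with targets re-encoded in {-1, +1}.
OR_DATA = [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), +1)]
XOR_DATA = [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), -1)]

w_or, b_or, n_or, ok_or = train_perceptron(OR_DATA)
_, _, _, ok_xor = train_perceptron(XOR_DATA)
```

On OR the loop stops after a clean pass; on XOR it always exhausts the epoch budget, for the reason proved later in the chapter.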

Three forms of the same rule

Form	Expression
Coordinate-wise	$w_j \leftarrow w_j + \eta y_i x_{i,j}$ for every $j$
Vectorial	$w \leftarrow w + \eta y_i x_i$
Bias-absorbed	$\tilde w \leftarrow \tilde w + \eta y_i \tilde x_i$ with $\tilde x = (x, 1)$ and $\tilde w = (w, b)$

The three say exactly the same thing. The coordinate-wise form is the most explicit for pen-and-paper computation. The vectorial form is the most compact. The bias-absorbed form is convenient for proofs: it folds two updates ( $w$ and $b$ ) into one.
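To see that the bias-absorbed form is the same rule, one can apply both versions to the same sample and compare. The helper names and the numerical values below are hypothetical.

```python
def update_separate(w, b, x, y, eta):
    """Vectorial form: update w and b separately."""
    return [wj + eta * y * xj for wj, xj in zip(w, x)], b + eta * y

def update_absorbed(w_tilde, x, y, eta):
    """Bias-absorbed form: one update on w_tilde = (w, b) with x_tilde = (x, 1)."""
    x_tilde = list(x) + [1.0]
    return [wj + eta * y * xj for wj, xj in zip(w_tilde, x_tilde)]

w, b = [0.2, -0.5], 0.1
x, y, eta = (1.0, 2.0), -1, 0.5

w_new, b_new = update_separate(w, b, x, y, eta)
w_tilde_new = update_absorbed(w + [b], x, y, eta)

# The last coordinate of w_tilde plays the role of the bias.
same_result = (w_tilde_new == w_new + [b_new])
```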

Proof: the update strictly improves the functional margin

Before the update, let $\hat\gamma_i = y_i (w \cdot x_i + b)$ be the functional margin of $(x_i, y_i)$ . Since the sample is misclassified, $\hat\gamma_i \leq 0$ .

After the update, the new $(w', b') = (w + \eta y_i x_i, \, b + \eta y_i)$ . Let us compute the new functional margin.

Step 1. Substitute $(w', b')$ .

\hat\gamma_i' \;=\; y_i \, (w' \cdot x_i + b') \;=\; y_i \, \big[ (w + \eta y_i x_i) \cdot x_i + (b + \eta y_i) \big].

Step 2. Expand.

\hat\gamma_i' \;=\; y_i \, (w \cdot x_i + b) + \eta \, y_i^2 \, (x_i \cdot x_i) + \eta \, y_i^2.

Step 3. Since $y_i \in \{-1, +1\}$ , we have $y_i^2 = 1$ . The new functional margin is therefore:

\hat\gamma_i' \;=\; \hat\gamma_i + \eta \, \big( \|x_i\|^2 + 1 \big).

Result. The added quantity $\eta (\|x_i\|^2 + 1)$ is strictly positive (because $\eta > 0$ ). The update therefore strictly increased the functional margin of that sample. No derivative anywhere in the proof. □
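The identity $\hat\gamma_i' = \hat\gamma_i + \eta (\|x_i\|^2 + 1)$ can be confirmed on one update. The numbers below are arbitrary illustrative values, chosen to be exactly representable in floating point.

```python
def functional_margin(w, b, x, y):
    return y * (sum(wj * xj for wj, xj in zip(w, x)) + b)

w, b, eta = [1.0, -2.0], 0.5, 0.5
x, y = (2.0, 3.0), +1

before = functional_margin(w, b, x, y)            # 2 - 6 + 0.5 = -3.5, misclassified

# One application of the learning rule.
w_new = [wj + eta * y * xj for wj, xj in zip(w, x)]
b_new = b + eta * y
after = functional_margin(w_new, b_new, x, y)

predicted_gain = eta * (sum(xj * xj for xj in x) + 1)  # eta * (||x||^2 + 1)
```

Here a single update is enough to make the margin positive. That is not guaranteed in general: the proof only says the margin of that sample increases by `predicted_gain`.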

Build the perceptron step by step

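[Interactive figure: build the perceptron step by step. Points are coloured by target ($+1$ or $-1$), with a green outline when correctly classified and a red outline when misclassified. Controls: dataset selector and rate $\eta$. Displayed: $w_1$, $w_2$, $b$, the number of corrections, the number of epochs, and the number of misclassified samples.]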

Three things to watch while you play:

On OR or AND, the violet boundary stabilises quickly. The error counter falls to zero and the perceptron has converged.
On XOR, the error counter never reaches zero, even after a hundred epochs. The rule keeps oscillating forever.
The rate $\eta$ does not change convergence on a separable dataset: it only changes the size of each step and thus the visual speed of the boundary, not the final outcome.

The convergence theorem (Novikoff, 1962)

The zig-zagging cursor analogy

Picture a cursor bouncing between the two walls of a narrowing channel. Every bounce loses some of its energy. If the channel has a strictly positive width, the cursor eventually settles in the middle after finitely many bounces.

That is exactly what the Novikoff theorem proves: as long as the margin $\gamma$ is strictly positive, the number of perceptron corrections is bounded by a quantity that depends only on the geometry of the dataset.

Statement

Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^m$ be a linearly separable dataset. Assume there exists a unit vector $w^* \in \mathbb{R}^n$ with $\|w^*\| = 1$ and a scalar $b^* \in \mathbb{R}$ such that for every $i$ :

y_i \, (w^* \cdot x_i + b^*) \;\geq\; \gamma \;>\; 0.

Let $R = \max_{i} \|x_i\|$ be the radius of the dataset.

A point of rigour: once the bias is absorbed, all the quantities in the theorem are read in $\mathbb{R}^{n+1}$. The radius is then $R = \max_i \|\tilde x_i\| = \max_i \sqrt{\|x_i\|^2 + 1}$, $w^*$ denotes the separator $(w^*, b^*)$ renormalised to unit norm in that same augmented space, and $\gamma$ is the margin it achieves there. The bound $R^2 / \gamma^2$ then keeps exactly its form.

Theorem. The perceptron algorithm (in bias-absorbed form) initialised at $w_0 = 0$ with step $\eta = 1$ performs at most

T \;\leq\; \frac{R^2}{\gamma^2}

corrections before classifying every sample correctly.

Proof in two lemmas

Let $w_t$ denote the weight vector after the $t$ -th correction.

Lemma 1 (lower bound). For every $t \geq 0$ , $w_t \cdot w^* \geq t \gamma$ .

Step 1. At initialisation $w_0 = 0$ , hence $w_0 \cdot w^* = 0$ .

Step 2. On the $(t+1)$ -st correction, $w_{t+1} = w_t + y_i x_i$ for a misclassified sample $(x_i, y_i)$ .

Step 3. Compute $w_{t+1} \cdot w^*$ :

w_{t+1} \cdot w^* \;=\; (w_t + y_i x_i) \cdot w^* \;=\; w_t \cdot w^* + y_i \, (w^* \cdot x_i).

Step 4. By the separation hypothesis with margin $\gamma$ : $y_i (w^* \cdot x_i) \geq \gamma$ . Therefore:

w_{t+1} \cdot w^* \;\geq\; w_t \cdot w^* + \gamma.

Step 5. By induction on $t$ , $w_t \cdot w^* \geq t \gamma$ . □

Lemma 2 (upper bound). For every $t \geq 0$ , $\|w_t\|^2 \leq t R^2$ .

Step 1. At initialisation, $\|w_0\|^2 = 0$ .

Step 2. On the $(t+1)$ -st correction:

\|w_{t+1}\|^2 \;=\; \|w_t + y_i x_i\|^2 \;=\; \|w_t\|^2 + 2 y_i \, (w_t \cdot x_i) + \|x_i\|^2.

Step 3. Since $(x_i, y_i)$ is misclassified by $w_t$ , we have $y_i (w_t \cdot x_i) \leq 0$ (otherwise it would be correctly classified). The middle term is therefore non-positive:

\|w_{t+1}\|^2 \;\leq\; \|w_t\|^2 + \|x_i\|^2 \;\leq\; \|w_t\|^2 + R^2.

Step 4. By induction, $\|w_t\|^2 \leq t R^2$ , so $\|w_t\| \leq \sqrt{t} \, R$ . □

Combining via Cauchy-Schwarz.

Step 1. The Cauchy-Schwarz inequality says, for two vectors $u, v$ :

u \cdot v \;\leq\; \|u\| \, \|v\|.

Step 2. Applied to $w_T$ and $w^*$ with $\|w^*\| = 1$ :

w_T \cdot w^* \;\leq\; \|w_T\|.

Step 3. Combining with the two lemmas after $T$ corrections:

T \gamma \;\leq\; w_T \cdot w^* \;\leq\; \|w_T\| \;\leq\; \sqrt{T} \, R.

Step 4. Square and divide by $T$ (which is positive):

T^2 \gamma^2 \;\leq\; T R^2 \;\;\Longrightarrow\;\; T \;\leq\; \frac{R^2}{\gamma^2}.

Result. The number of perceptron corrections is bounded by $R^2 / \gamma^2$. With a step $\eta \neq 1$, the factor $\eta$ appears in both lemmas: the lower bound becomes $w_T \cdot w^* \geq T \, \eta \, \gamma$ and the upper bound becomes $\|w_T\|^2 \leq T \, \eta^2 \, R^2$. Cauchy-Schwarz then gives $T \, \eta \, \gamma \leq \|w_T\| \leq \eta \, \sqrt{T} \, R$; the factor $\eta$ cancels and the same bound is recovered. The number of corrections therefore does not depend on the choice of $\eta$ (still starting from $w_0 = 0$), and the procedure converges in finitely many steps. □

Intuitive reading

The narrower the margin $\gamma$ (two classes very close together), the larger the bound $R^2 / \gamma^2$ , and the slower the convergence.
The larger the radius $R$ (points far from the origin), the larger the bound, quadratically.
But however hard the dataset, the bound stays finite as long as $\gamma > 0$ .
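The bound can be compared with the actual number of corrections on a small dataset. This sketch works in the bias-absorbed space with $\eta = 1$ and $w_0 = 0$, as in the theorem. The reference separator $(1, 1, -0.5)$ for OR is a hand-picked assumption: any separator gives a valid bound, and a better one gives a tighter bound.

```python
import math

def augment(x):
    """x_tilde = (x, 1)."""
    return list(x) + [1.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def count_corrections(data, max_epochs=1000):
    """Bias-absorbed perceptron with eta = 1 and w_0 = 0; returns T."""
    w = [0.0] * (len(data[0][0]) + 1)
    T = 0
    for _ in range(max_epochs):
        clean_pass = True
        for x, y in data:
            xt = augment(x)
            if y * dot(w, xt) <= 0:
                w = [wj + y * xj for wj, xj in zip(w, xt)]
                T += 1
                clean_pass = False
        if clean_pass:
            return T
    raise RuntimeError("no clean pass within max_epochs")

def novikoff_bound(data, w_star_aug):
    """R^2 / gamma^2 for a given separator in the augmented space."""
    norm = math.sqrt(dot(w_star_aug, w_star_aug))
    gamma = min(y * dot(w_star_aug, augment(x)) for x, y in data) / norm
    if gamma <= 0:
        raise ValueError("w_star_aug does not separate the dataset")
    R2 = max(dot(augment(x), augment(x)) for x, _ in data)
    return R2 / gamma ** 2

OR_DATA = [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), +1)]
T = count_corrections(OR_DATA)
bound = novikoff_bound(OR_DATA, [1.0, 1.0, -0.5])
```

With this separator, $R^2 = 3$ and $\gamma = 1/3$, so the bound is about 27; the observed `T` stays below it, which illustrates that the bound is a worst case rather than an estimate.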

Explore the bound live

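[Interactive figure: Novikoff bound explorer. Dragging the points changes $R$ and $\gamma$. Displayed: the radius $R$, the achieved margin $\gamma$, the actual number of corrections $T$, the bound $(R/\gamma)^2$, the ratio of the two, and a plot of errors per epoch.]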

Three things to watch as you play:

Bring the two clusters closer: $\gamma$ shrinks and the bound $(R / \gamma)^2$ explodes, while the actual $T$ grows more modestly.
Drag an outlier far from the centre: $R$ grows, the bound grows too, but the actual $T$ does not necessarily grow as fast.
The ratio $T / (R/\gamma)^2$ is usually well below $1$ : the bound is pessimistic, but it exists.

What if the dataset is not separable?

The Novikoff theorem makes a crucial assumption: there exists a linear separator with margin $\gamma > 0$ . What happens when that assumption fails?

The perceptron oscillates

The bound $T \leq R^2 / \gamma^2$ was proved under the assumption $\gamma > 0$. When the dataset is not linearly separable, no such $\gamma$ exists: no pair $(w, b)$ classifies every example correctly. As a consequence, Rosenblatt’s learning rule keeps correcting forever and never converges. The weight vector $w$ oscillates indefinitely, and even when the iterations pass through an “almost good” $(w, b)$, it is lost at the next step, when another misclassified example triggers an update that undoes the progress.

Gallant’s Pocket Algorithm

The classical fix is surprisingly simple: keep the best $(w, b)$ ever seen in your pocket. After every Rosenblatt update, evaluate the new $(w, b)$ on the whole dataset, count the number of correctly classified examples, and if that number beats the pocket’s count, replace it. At the end (after a fixed budget of iterations, which you can pick large), you return the pocket content, not the last $(w, b)$ .

This procedure, the Pocket Algorithm, was introduced by Gallant (1990). On a separable dataset it reduces to the standard perceptron (the pocket ends up holding a perfect separator). On a non-separable dataset it converges in probability to the separator that maximises the number of correctly classified examples. We lose the Novikoff guarantee, but we recover a procedure usable in practice.
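The pocket idea described above can be sketched as follows. This is a simplified illustration, not Gallant's exact procedure: picking a misclassified sample uniformly at random, the `max_updates` budget, and the fixed `seed` are all assumptions.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def count_correct(w, b, data):
    """Number of samples with strictly positive functional margin."""
    return sum(1 for x, y in data if y * (dot(w, x) + b) > 0)

def pocket(data, eta=1.0, max_updates=200, seed=0):
    """Perceptron updates, keeping the best (w, b) seen so far in the pocket."""
    rng = random.Random(seed)
    w, b = [0.0] * len(data[0][0]), 0.0
    best_w, best_b, best_score = list(w), b, count_correct(w, b, data)
    for _ in range(max_updates):
        wrong = [(x, y) for x, y in data if y * (dot(w, x) + b) <= 0]
        if not wrong:
            break  # separable case: the current (w, b) is already perfect
        x, y = rng.choice(wrong)
        w = [wj + eta * y * xj for wj, xj in zip(w, x)]
        b += eta * y
        score = count_correct(w, b, data)  # evaluate on the whole dataset
        if score > best_score:
            best_w, best_b, best_score = list(w), b, score
    return best_w, best_b, best_score

XOR_DATA = [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), -1)]
AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)]

w_xor, b_xor, score_xor = pocket(XOR_DATA)
w_and, b_and, score_and = pocket(AND_DATA)
```

On AND the pocket ends up holding a perfect separator. On XOR the score can never reach 4, so what is returned is the best imperfect separator met within the budget.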

The impossibility of XOR

The impossible checkerboard analogy

Picture four squares of a chessboard: the diagonals alternate (two whites at bottom-left and top-right, two blacks at top-left and bottom-right). No single straight line can separate the whites from the blacks. That is exactly the situation of the XOR function.

Statement

The function $\text{XOR}: \{0, 1\}^2 \to \{0, 1\}$ defined by $\text{XOR}(0,0) = 0$ , $\text{XOR}(0,1) = 1$ , $\text{XOR}(1,0) = 1$ , $\text{XOR}(1,1) = 0$ is not realisable by a single perceptron.

Proof by contradiction

For this proof, we go back to the standard encoding of XOR with targets in $\{0, 1\}$ and the Heaviside convention: the output is $1$ if $w \cdot x + b \geq 0$ and $0$ otherwise. The result does not depend on the encoding: switching to $\{-1, +1\}$ and $\operatorname{sgn}$ gives four equivalent inequalities by change of variable, and the same contradiction.

Suppose there exist $(w_1, w_2, b) \in \mathbb{R}^3$ such that the perceptron realises XOR. Then the four constraints below are simultaneously true:

Point $(x_1, x_2)$	XOR	Inequality
$(0, 0)$	$0$	(1): $b < 0$
$(1, 0)$	$1$	(2): $w_1 + b \geq 0$
$(0, 1)$	$1$	(3): $w_2 + b \geq 0$
$(1, 1)$	$0$	(4): $w_1 + w_2 + b < 0$

Step 1. Add (2) and (3):

(w_1 + b) + (w_2 + b) \;\geq\; 0 \;\;\Longrightarrow\;\; w_1 + w_2 + 2b \;\geq\; 0.

Step 2. From (1), $b < 0$ , hence $-b > 0$ , hence $-2b > 0$ . Rewriting step 1:

w_1 + w_2 \;\geq\; -2b \;>\; 0.

The chain combines a non-strict inequality ( $\geq -2b$ ) and a strict one ( $-2b > 0$ ), so the result is strict: $w_1 + w_2 > 0$ .

Step 3. Add $b$ on both sides:

w_1 + w_2 + b \;\geq\; -b \;>\; 0.

Step 4. But (4) says $w_1 + w_2 + b < 0$ .

Result. We have simultaneously $w_1 + w_2 + b > 0$ and $w_1 + w_2 + b < 0$ . Contradiction. Hence no triple $(w_1, w_2, b)$ realises XOR. □
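A brute-force search over a grid of parameters illustrates the result. It is not a substitute for the proof, since a finite grid cannot cover all of $\mathbb{R}^3$. The grid range and step are arbitrary choices; the Heaviside convention is the one used in the proof.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def heaviside_output(w1, w2, b, x):
    """Output 1 if w . x + b >= 0, else 0."""
    return 1 if w1 * x[0] + w2 * x[1] + b >= 0 else 0

def satisfied(w1, w2, b):
    """Number of XOR points (out of 4) that (w1, w2, b) classifies correctly."""
    return sum(heaviside_output(w1, w2, b, x) == t for x, t in XOR.items())

# Grid from -3.0 to 3.0 in steps of 0.25 for each of w1, w2, b.
grid = [k / 4 for k in range(-12, 13)]
best = max(satisfied(w1, w2, b) for w1, w2, b in product(grid, repeat=3))
```

The search tops out at 3 correct points out of 4, reached for instance by $(w_1, w_2, b) = (1, 1, -0.5)$, which satisfies (1), (2), (3) and violates (4).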

Explore it yourself

[Interactive figure: why XOR is impossible. Sliders for $w_1$, $w_2$ and $b$, with a checklist of the four XOR inequalities. In the initial state ($w_1 = 1.00$, $w_2 = 1.00$, $b = -0.50$), inequalities (1), (2) and (3) hold and (4) fails: 3 / 4 satisfied.]

Three things to watch as you play:

Whatever you set the sliders to, you never reach 4 / 4 inequalities satisfied. The maximum is 3 / 4.
The “OR-like” preset satisfies (1), (2), (3) but violates (4). The “AND-like” preset satisfies (1), (4) but violates (2) or (3). No linear boundary reconciles all four.
Clicking “Why 4 / 4 is impossible” reveals the proof by contradiction in compressed form.

Historical context

Marvin Minsky and Seymour Papert publish Perceptrons in 1969, the central chapter of which proves this impossibility and generalises it to a whole family of “non-local” functions. Their rigorous analysis contributed to a withdrawal of public funding for neural network research, a period we now call the first AI winter. Recovery had to wait for Hopfield in 1982 and the rediscovery of backpropagation by Rumelhart, Hinton and Williams in 1986.

Pen-and-paper exercises

Exercise 1: one iteration in three dimensions

Let $w = (0.2, -0.5, 0.1)$ , $b = 0$ , $\eta = 0.1$ . Present the sample $x = (1, 1, -1)$ with target $y = +1$ .

(a) Compute the prediction $\operatorname{sgn}(w \cdot x + b)$ . Is it correct?

(b) Apply the learning rule and give $(w', b')$ after the update.

Solution to exercise 1: one iteration in three dimensions

Recall: $w = (0.2, -0.5, 0.1)$ , $b = 0$ , $\eta = 0.1$ , $x = (1, 1, -1)$ , $y = +1$ .

Step 1. Compute the dot product $w \cdot x$ coordinate by coordinate.

w \cdot x \;=\; 0.2 \times 1 + (-0.5) \times 1 + 0.1 \times (-1).

w \cdot x \;=\; 0.2 - 0.5 - 0.1 \;=\; -0.4.

Step 2. Add the bias.

w \cdot x + b \;=\; -0.4 + 0 \;=\; -0.4.

Step 3. The prediction is $\operatorname{sgn}(-0.4) = -1$ . The target is $+1$ . The sample is misclassified: apply the rule.

Step 4. Update each weight coordinate. The form is $w_j' = w_j + \eta y x_j$ .

w_1' \;=\; 0.2 + 0.1 \times 1 \times 1 \;=\; 0.3.

w_2' \;=\; -0.5 + 0.1 \times 1 \times 1 \;=\; -0.4.

w_3' \;=\; 0.1 + 0.1 \times 1 \times (-1) \;=\; 0.

Step 5. Update the bias: $b' = b + \eta y = 0 + 0.1 = 0.1$ .

Step 6. Check the new functional margin. Compute $w' \cdot x + b'$ .

w' \cdot x \;=\; 0.3 \times 1 + (-0.4) \times 1 + 0 \times (-1) \;=\; -0.1.

w' \cdot x + b' \;=\; -0.1 + 0.1 \;=\; 0.

Step 7. New functional margin: $\hat\gamma' = y (w' \cdot x + b') = 1 \times 0 = 0$ . Old margin: $\hat\gamma = 1 \times (-0.4) = -0.4$ .

Result. The new margin $0$ is strictly greater than the old margin $-0.4$ . The sample is not yet perfectly classified (the margin is not strictly positive), but it has increased, consistent with the proof. A second iteration would finish the job.
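The same computation can be replayed in code. One caveat when doing so: in floating-point arithmetic the new margin comes out as a tiny number close to $0$ rather than exactly $0$, so the sketch below compares with a tolerance. The variable names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

w, b, eta = [0.2, -0.5, 0.1], 0.0, 0.1
x, y = [1.0, 1.0, -1.0], +1

z = dot(w, x) + b                      # about -0.4
misclassified = y * z <= 0

# One application of the learning rule.
w_new = [wj + eta * y * xj for wj, xj in zip(w, x)]
b_new = b + eta * y
new_margin = y * (dot(w_new, x) + b_new)

# Gain predicted by the proof: eta * (||x||^2 + 1) = 0.1 * 4.
gain = eta * (dot(x, x) + 1)
margin_is_zero = math.isclose(new_margin, 0.0, abs_tol=1e-12)
```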

Exercise 2: test separability

Is the dataset $\{((0, 0), +1), ((1, 1), +1), ((1, 0), -1), ((0, 1), -1)\}$ linearly separable? Justify.

Solution to exercise 2: test separability

Step 1. Recognise the inverted XOR: the two samples on the main diagonal ( $(0,0)$ and $(1,1)$ ) share target $+1$ , the two on the anti-diagonal ( $(1,0)$ and $(0,1)$ ) share target $-1$ . It is XOR with classes flipped.

Step 2. Assume by contradiction the existence of $(w_1, w_2, b)$ realising this function. The four constraints become:

(1): b \;\geq\; 0, \qquad (2): w_1 + b \;<\; 0, \qquad (3): w_2 + b \;<\; 0, \qquad (4): w_1 + w_2 + b \;\geq\; 0.

Step 3. From (2), $w_1 + b < 0$ , so $w_1 < -b$ . From (1), $b \geq 0$ , so $-b \leq 0$ . Therefore $w_1 < -b \leq 0$ .

Step 4. By symmetry, from (3) and (1), $w_2 < -b \leq 0$ .

Step 5. Summing the strict inequalities from steps 3 and 4:

w_1 + w_2 \;<\; -2b \;\leq\; 0.

Step 6. Adding $b$ to the two outer members, and using $b \geq 0$ once more so $-b \leq 0$ :

w_1 + w_2 + b \;<\; -2b + b \;=\; -b \;\leq\; 0.

Therefore $w_1 + w_2 + b < 0$ . But (4) requires $w_1 + w_2 + b \geq 0$ .

Result. Contradiction between $w_1 + w_2 + b < 0$ and $w_1 + w_2 + b \geq 0$ . The dataset is not linearly separable.

Exercise 3: an explicit perceptron for NAND

Find explicitly $(w_1, w_2, b)$ such that the perceptron computes the NAND function, that is $\text{NAND}(x_1, x_2) = 1$ except when $x_1 = x_2 = 1$ . Verify your solution on the four points.

Solution to exercise 3: an explicit perceptron for NAND

Step 1. NAND is the negation of AND. Truth table:

$(x_1, x_2)$	NAND
$(0, 0)$	$1$
$(0, 1)$	$1$
$(1, 0)$	$1$
$(1, 1)$	$0$

Step 2. Intuition: find a line that separates the point $(1, 1)$ (target $0$ ) from the other three (target $1$ ). Try $w_1 = w_2 = -1$ and $b = 1.5$ .

Step 3. Check each point. Since the NAND target lies in $\{0, 1\}$ , we read the output with the Heaviside function $H$ (output $1$ if $z \geq 0$ , else $0$ ), as in the XOR proof. Compute $z = w_1 x_1 + w_2 x_2 + b$ and $H(z)$ .

(0, 0): z = 0 + 0 + 1.5 = 1.5 \;\geq\; 0 \;\Longrightarrow\; H = 1. \;\;\text{OK}.

(0, 1): z = 0 - 1 + 1.5 = 0.5 \;\geq\; 0 \;\Longrightarrow\; H = 1. \;\;\text{OK}.

(1, 0): z = -1 + 0 + 1.5 = 0.5 \;\geq\; 0 \;\Longrightarrow\; H = 1. \;\;\text{OK}.

(1, 1): z = -1 - 1 + 1.5 = -0.5 \;<\; 0 \;\Longrightarrow\; H = 0. \;\;\text{OK}.

Result. The perceptron $(w_1, w_2, b) = (-1, -1, 1.5)$ realises NAND. Geometrically, the decision boundary is the line $-x_1 - x_2 + 1.5 = 0$ , i.e. $x_1 + x_2 = 1.5$ , which indeed separates $(1, 1)$ from the other three points.

Exercise 4: without bias, Novikoff can fail

Suppose we force $b = 0$ throughout the entire training (the perceptron updates only $w$ , never $b$ ). Give a separable dataset of two points in dimension $1$ for which the no-bias perceptron does not converge. Justify.

Solution to exercise 4: without bias, Novikoff can fail

Step 1. In one dimension without bias, the perceptron classifies by the sign of $w \, x$ : the boundary is only $x = 0$ . All points with $x > 0$ are classified on the same side as $\operatorname{sgn}(w)$ , all points with $x < 0$ on the other.

Step 2. Consider the dataset $\{ (1, +1), \; (-1, +1) \}$ : both points share the same target $+1$ , but lie on opposite sides of the origin.

Step 3. Check that the dataset is separable with bias: pick $(w, b) = (0.1, \; 1)$ . Then $z = 0.1 \cdot 1 + 1 = 1.1 > 0$ for the first point and $z = 0.1 \cdot (-1) + 1 = 0.9 > 0$ for the second. Both are correctly classified. The dataset is therefore linearly separable.

Step 4. Without bias ( $b = 0$ frozen), predicting $+1$ on the point $x = 1$ requires $w > 0$ ; predicting $+1$ on the point $x = -1$ requires $w < 0$ . These two conditions are incompatible: no $w \in \mathbb{R}$ classifies both points correctly.

Step 5. The algorithm oscillates: on the misclassified point, it pushes $w$ in one direction, but then the other point becomes misclassified and it pushes $w$ in the opposite direction. No convergence is possible.

Result. On the dataset $\{ (1, +1), (-1, +1) \}$ , linearly separable with bias (for instance $(w, b) = (0.1, 1)$ ), the no-bias perceptron does not converge: Novikoff’s theorem needs the bias as an extra degree of freedom. That is why we update it in practice, either directly, or by absorbing $b$ into $\tilde w$ through a constant coordinate $\tilde x = (x, 1)$ .
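The oscillation can be traced explicitly. This sketch assumes $\eta = 1$, $w_0 = 0$, and samples visited in a fixed order; the function name is hypothetical.

```python
def no_bias_trace(data, eta=1.0, epochs=5):
    """One-dimensional perceptron with b frozen at 0; records w after each correction."""
    w = 0.0
    trace = [w]
    for _ in range(epochs):
        for x, y in data:
            if y * w * x <= 0:      # misclassified (or on the boundary)
                w += eta * y * x
                trace.append(w)
    return trace

trace = no_bias_trace([(1.0, +1), (-1.0, +1)])
```

The recorded values alternate between $0$ and $1$: every sample triggers a correction at every epoch, so no epoch is ever free of mistakes.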

In one sentence

The perceptron proves that a machine can build the boundary that separates two classes geometrically, without computing any derivative, in finitely many steps, provided that a straight line (a hyperplane) can separate them, which is not the case for XOR.

Towards chapter 5: stacking solves XOR

XOR is not linearly separable, but we can write it as a composition of functions that are:

\text{XOR}(x_1, x_2) \;=\; \big( x_1 \;\vee\; x_2 \big) \;\wedge\; \neg\big( x_1 \;\wedge\; x_2 \big).

OR is linearly separable, and $\neg(x_1 \wedge x_2) = \text{NAND}(x_1, x_2)$ is too (you just proved it in exercise 3). Letting $u = x_1 \vee x_2$ and $v = \text{NAND}(x_1, x_2)$ , we have $\text{XOR}(x_1, x_2) = u \wedge v = \text{AND}(u, v)$ , and AND is itself separable. Three perceptrons, two in a first layer (OR and NAND) then one in a second (AND), therefore suffice to solve XOR:

[Figure: decomposing XOR with two layers of perceptrons.]
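The decomposition can be written out with three fixed-weight units. The NAND weights are those of exercise 3; the OR weights $(1, 1, -0.5)$ and the AND weights $(1, 1, -1.5)$ are one possible choice among many, picked here for illustration. Outputs use the Heaviside convention.

```python
def unit(w1, w2, b, x1, x2):
    """Single perceptron with Heaviside output: 1 if w . x + b >= 0, else 0."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

def xor_net(x1, x2):
    # First layer: two perceptrons reading the inputs.
    u = unit(1, 1, -0.5, x1, x2)     # OR (assumed weights)
    v = unit(-1, -1, 1.5, x1, x2)    # NAND (weights from exercise 3)
    # Second layer: one perceptron reading the first layer's outputs.
    return unit(1, 1, -1.5, u, v)    # AND (assumed weights)

table = {(x1, x2): xor_net(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
```

None of the weights here are learned: the sketch only shows that the two-layer architecture can represent XOR, which is a separate question from how to train it.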

That is exactly what chapter 5 will formalise: stacking perceptrons into layers vastly enlarges the class of functions the network can represent. A single neuron splits space into two half-spaces; several neurons organised in layers split it into far more complex regions.

Quiz

1. Why does the perceptron learning rule not need to compute a derivative?
2. On a dataset that is not linearly separable, what does the perceptron do?
3. In the update w ← w + η y x, why do we multiply by x rather than something else?
4. What is the geometric meaning of b in the equation w · x + b = 0?
5. Why is XOR not realisable by a single perceptron, and what does chapter 5 do about it?

Sources

Rosenblatt, F. (1958). “The perceptron: A probabilistic model for information storage and organization in the brain.” Psychological Review 65(6), 386-408. DOI 10.1037/h0042519
Novikoff, A. B. J. (1962). “On convergence proofs on perceptrons.” Symposium on the Mathematical Theory of Automata 12, 615-622.
Minsky, M. & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press. ISBN 978-0-262-13043-1.
Gallant, S. I. (1990). “Perceptron-Based Learning Algorithms.” IEEE Transactions on Neural Networks 1(2), 179-191. (Introduces the Pocket Algorithm.) DOI 10.1109/72.80230
Bishop, C. M. (2006). Pattern Recognition and Machine Learning, ch. 4. Springer. ISBN 978-0-387-31073-2.
Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning, ch. 4. Springer. ISBN 978-0-387-84857-0.
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning, ch. 1. MIT Press. ISBN 978-0-262-03561-3.
