Neural networks: foundations and mathematics · 01 / 09

The artificial neuron

From biological to mathematical, what really happens inside the elementary brick of a network.

Every neural network, from the simplest to the deepest, is an assembly of one elementary brick repeated by the millions. That brick, the artificial neuron, has nothing magical about it. It is an equation with three ingredients, inspired by a biological cell hundreds of millions of years old.

By the end of this chapter you will be able to answer three questions: what exactly does an artificial neuron compute, where does the idea come from, and why does a lone neuron carry a limitation that modern networks had to overcome.

The biological inspiration

Your brain contains roughly 86 billion neurons (Azevedo et al., 2009). Each neuron receives electrical signals from other neurons through its dendrites, integrates them in its cell body, and decides, based on an internal threshold, whether to fire a signal along its axon.

In 1943, Warren McCulloch and Walter Pitts model that behaviour with an equation (McCulloch & Pitts, 1943), laying the foundations of what will later be called the artificial neuron. It is not a faithful copy of biology, but a mathematical simplification that turns out to be powerful.

The referee analogy

Picture a football referee deciding whether a foul deserves a penalty. He weighs several pieces of information, each more or less important depending on context.

Information	Input	Weight
The hand touched the ball	$x_1 = 1$	$w_1 = 0.8$
Inside the penalty area	$x_2 = 0$	$w_2 = 0.5$
Intentional gesture	$x_3 = 1$	$w_3 = 0.9$

The referee mentally computes a weighted sum: he multiplies each piece of information by its importance, then adds the results. If the total clears a threshold, he whistles. That is exactly what an artificial neuron does.
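Worked out with the values from the table, and writing $z$ for the total (a name used here for convenience), the referee's computation is:

$$
z = x_1 w_1 + x_2 w_2 + x_3 w_3 = 1 \cdot 0.8 + 0 \cdot 0.5 + 1 \cdot 0.9 = 1.7
$$

The middle term vanishes because the foul happened outside the penalty area ($x_2 = 0$). Whether $1.7$ is enough to whistle depends on the referee's threshold, which the table does not specify.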

Play with a neuron

Practice before theory. Drag the sliders below. The output recomputes live. Your mission: find a combination where the output $y$ is very close to $1$, then another one where it is very close to $0$. You will feel how each parameter pulls in its own direction.

f = σ

x₁ = 1.00, x₂ = 0.00, x₃ = 1.00, w₁ = 0.80, w₂ = 0.50, w₃ = 0.90, bias = -0.50

z = 1.00·0.80 + 0.00·0.50 + 1.00·0.90 + (-0.50) = 1.20

y = σ(z) ≈ 0.77

Figure: 3-input neuron (interactive sliders)

Parametric diagram of a neuron with three inputs x₁, x₂, x₃, three weights w₁=0.8, w₂=0.5, w₃=0.9, a bias b=-0.5, and a sigmoid activation. With the default values (x₁=1, x₂=0, x₃=1), the weighted sum is z = 1·0.8 + 0·0.5 + 1·0.9 + (-0.5) = 1.2 and the output y = σ(1.2) ≈ 0.77. Dragging the sliders shows how each input, weight, and bias moves the output.

Three things to notice while playing:

An input at zero cancels the contribution of its weight, no matter the weight’s value.
Increasing a weight amplifies the influence of its input. Flipping it negative reverses the effect.
The bias shifts the output independently of the inputs. A very negative bias makes the neuron very hard to activate.
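The three observations above can be checked outside the widget. The following is a minimal Python sketch, assuming the sigmoid activation and the default slider values; the function names `sigmoid` and `neuron` and the alternative values (`50.0`, `-0.9`, `-10.0`) are illustrative choices, not part of the interactive component.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Weighted sum of the inputs plus the bias, passed through a sigmoid."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return sigmoid(z)

# Default slider values from the diagram.
x = [1.0, 0.0, 1.0]
w = [0.8, 0.5, 0.9]
b = -0.5

default = neuron(x, w, b)                  # z = 1.2
big_w2 = neuron(x, [0.8, 50.0, 0.9], b)    # x2 = 0, so w2 has no effect
flipped = neuron(x, [0.8, 0.5, -0.9], b)   # negative w3 pulls the output down
low_bias = neuron(x, w, -10.0)             # very negative bias: hard to activate

print(round(default, 2), big_w2 == default, round(flipped, 2), low_bias < 0.001)
```

Changing `w[1]` leaves the output untouched as long as `x[1]` is zero, which is the first observation in code form.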

Historical note: from binary threshold to sigmoid

The neuron diagram above uses a sigmoid activation by default, producing a continuous output between 0 and 1. But the original neurons of McCulloch and Pitts (1943), and Rosenblatt’s perceptron (1958), used a binary threshold function $H(z) = \mathbb{1}[z \geq 0]$: output 1 if the sum is at least 0, else 0. No nuance.

Flip the f = selector above the diagram between σ (sigmoid) and H (Heaviside) to compare both activations live. With the starting configuration, the sigmoid gives $y \approx 0.77$ while the binary threshold snaps straight to $y = 1$. Move the sliders to find the zones where the threshold switches: you will see the all-or-nothing behaviour of the historical neuron.

The shift to sigmoid, and later to ReLU, is historically tied to backpropagation (1986), which requires a differentiable activation function to propagate the gradient. That is the subject of chapter 3. The sigmoid here gives a smoother intuition, but keep in mind that the simplest neuron, mathematically, is the threshold one.
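To see why differentiability matters, here is a small Python sketch comparing the two activations and a numerical estimate of their slope at the default weighted sum $z = 1.2$. The helper `slope` (a central finite difference) and the step size `h` are assumptions made for this illustration; this is not how backpropagation computes gradients.

```python
import math

def heaviside(z: float) -> float:
    """Binary threshold H(z): 1 if z >= 0, else 0."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z: float) -> float:
    """Logistic function, a smooth version of the step."""
    return 1.0 / (1.0 + math.exp(-z))

def slope(f, z: float, h: float = 1e-6) -> float:
    """Central finite-difference estimate of the derivative of f at z."""
    return (f(z + h) - f(z - h)) / (2 * h)

z = 1.2  # weighted sum of the default configuration

step_out, sig_out = heaviside(z), sigmoid(z)
step_slope = slope(heaviside, z)   # the step is flat on both sides of its jump
sig_slope = slope(sigmoid, z)      # close to sigmoid(z) * (1 - sigmoid(z))

print(step_out, round(sig_out, 2), step_slope, round(sig_slope, 3))
```

Away from the jump, the threshold function has slope zero, so a small change in a weight produces no change in the output. The sigmoid always reports a non-zero slope, which gives a gradient something to work with.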

The mathematical formula

The neuron’s operation collapses into one equation. The expanded, readable form:

y = f( x₁·w₁ + x₂·w₂ + x₃·w₃ + b )

Expanded form

The same equation in compact notation with the sum symbol:

$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

And the densest form, in vector notation:

$$y = f(\mathbf{w} \cdot \mathbf{x} + b)$$

All three forms say the same thing at different abstraction levels. Learn to recognise all three; you will encounter them in every scientific paper.
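The equivalence of the three forms can be made concrete in a few lines of Python. This sketch assumes the sigmoid for $f$ and reuses the default values of the interactive diagram; the helper `dot` is a hypothetical stand-in for the dot product that chapter 2 defines properly.

```python
import math

def f(z: float) -> float:
    """Activation used for this demo: the sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, 0.0, 1.0]
w = [0.8, 0.5, 0.9]
b = -0.5

# 1. Expanded form: every term written out.
y_expanded = f(x[0] * w[0] + x[1] * w[1] + x[2] * w[2] + b)

# 2. Sum notation: an explicit loop over the indices i.
total = 0.0
for i in range(len(x)):
    total += w[i] * x[i]
y_sum = f(total + b)

# 3. Vector notation: a dot-product helper hides the loop.
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

y_vector = f(dot(w, x) + b)

print(round(y_expanded, 2), round(y_sum, 2), round(y_vector, 2))
```

Only the amount of bookkeeping changes from one form to the next; the number that comes out is the same.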

What each symbol means

$x_i$: the inputs, the information the neuron receives
$w_i$: the weights, the importance assigned to each input
$b$: the bias, a base offset independent of the inputs
$f$: the activation function, which turns the raw result into something interpretable (covered in chapter 3)

Information flow as a graph

Signal propagation through a single neuron

Each input goes through a multiplication by its weight, all contributions sum with the bias, and the result passes through the activation function to produce the output.

From biological to artificial

Biological element	Role	Artificial counterpart
Dendrites	Receive incoming signals	Inputs $x_1, x_2, \dots, x_n$
Synapses	Regulate signal strength	Weights $w_1, w_2, \dots, w_n$
Cell body	Integrates all signals	Weighted sum $\sum w_i x_i$
Activation threshold	Triggers (or not) a signal	Bias $b$ + function $f$
Axon	Carries the output	Output $y$

An idea born in 1943

The artificial neuron has an 83-year history made of breakthroughs and long winters.

Year	Event
1943	McCulloch & Pitts model the neuron mathematically.
1949	Donald Hebb states “neurons that fire together wire together”.
1958	Frank Rosenblatt publishes the founding paper of the perceptron, the first learning rule for a single neuron.
1960	Mark I Perceptron: first hardware realisation (Cornell Aeronautical Laboratory), an electromechanical computer able to recognise simple shapes.
1969	Minsky & Papert publish Perceptrons and prove a single neuron cannot learn XOR. First major dent in the field’s enthusiasm.
1973	Lighthill report in the UK: AI is judged unable to deliver on its promises. Funding collapses in Europe and the US. The first AI winter begins, lasting roughly a decade.
1986	Rumelhart, Hinton & Williams publish backpropagation. Research restarts.
2012	AlexNet (Krizhevsky, Sutskever, Hinton) crushes ImageNet on GPU. Modern deep learning begins.

What a single neuron cannot do

Formal definition: hyperplane and linear separability

A clean treatment demands a precise definition. Let $\mathbf{x} \in \mathbb{R}^n$ be an input vector, $\mathbf{w} \in \mathbb{R}^n$ a weight vector, $b \in \mathbb{R}$ a bias. We use $\mathbb{R}^n$ here to denote the set of ordered lists of $n$ real numbers (chapter 2 formalises this notion). The set of points that zero the weighted sum

$$\{\mathbf{x} \in \mathbb{R}^n \mid \mathbf{w} \cdot \mathbf{x} + b = 0\}$$

is called a hyperplane. In $\mathbb{R}^2$ it is a line, in $\mathbb{R}^3$ a plane, and in higher dimensions you can no longer draw it but the equation stands.

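A hyperplane splits space into two half-spaces: one where $\mathbf{w} \cdot \mathbf{x} + b > 0$ (the neuron fires) and one where $\mathbf{w} \cdot \mathbf{x} + b < 0$ (it does not). With the convention $H(z) = \mathbb{1}[z \geq 0]$ used in this chapter, points lying exactly on the hyperplane count as firing. A threshold neuron is exactly this partition operation.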

A set of labelled points $\{(\mathbf{x}_i, y_i)\}$ with $y_i \in \{0, 1\}$ is said to be linearly separable if there exists $(\mathbf{w}, b)$ such that the associated hyperplane correctly separates the points of label 1 from those of label 0. Otherwise it is not linearly separable.
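The definition translates directly into a check: given a candidate $(\mathbf{w}, b)$, does the threshold neuron reproduce every label? The sketch below is one possible Python rendering; the function names `fires` and `separates` and the two candidate biases are choices made for this example.

```python
def fires(w, b, x):
    """Threshold neuron: 1 if w . x + b >= 0, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0

def separates(w, b, dataset):
    """True if the hyperplane (w, b) classifies every labelled point correctly."""
    return all(fires(w, b, x) == y for x, y in dataset)

# The AND truth table as a list of (point, label) pairs.
AND = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 1)]

good = separates((1, 1), -1.5, AND)   # the line x1 + x2 = 1.5 isolates (1, 1)
bad = separates((1, 1), -0.5, AND)    # the line x1 + x2 = 0.5 misclassifies (1, 0) and (0, 1)

print(good, bad)
```

A dataset is linearly separable as soon as one candidate passes the check; a failed candidate, like the second one here, proves nothing on its own.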

Why XOR is not linearly separable

This result is not only visual, it can be proved in a few lines by contradiction. We use here the threshold function, which is that of the historical perceptron. The same impossibility holds with a sigmoid; we lose only the simplicity of the strict inequality.

Before tackling the proof, let us fix a notation that will be useful later. For a logical condition $P$, we write $\mathbb{1}[P]$ for the indicator function of $P$: it equals $1$ if $P$ is true, $0$ otherwise. In particular $\mathbb{1}[z \geq 0]$ is the Heaviside function, sometimes written $H(z)$, which we already met at the start of this chapter.

Proof: XOR admits no linear separator

The four points of XOR are: $(0,0) \to 0$, $(1,0) \to 1$, $(0,1) \to 1$, $(1,1) \to 0$.

Suppose by contradiction that there exists $(w_1, w_2, b) \in \mathbb{R}^3$ such that the classifier $\mathbb{1}[w_1 x_1 + w_2 x_2 + b \geq 0]$ produces the expected output at each point. We obtain four inequalities:

$$
\begin{aligned}
(1) \quad (0,0) \to 0 &\quad : \quad b < 0 \\
(2) \quad (1,0) \to 1 &\quad : \quad w_1 + b \geq 0 \\
(3) \quad (0,1) \to 1 &\quad : \quad w_2 + b \geq 0 \\
(4) \quad (1,1) \to 0 &\quad : \quad w_1 + w_2 + b < 0
\end{aligned}
$$

Adding $(2)$ and $(3)$: $w_1 + w_2 + 2b \geq 0$, so $w_1 + w_2 \geq -2b$.

From $(4)$ we have $w_1 + w_2 + b < 0$, so $w_1 + w_2 < -b$.

Combining the two: $-2b \leq w_1 + w_2 < -b$, so $-2b < -b$, i.e. $b > 0$. This contradicts $(1)$, which requires $b < 0$. ∎

This proof fits in five lines but it closes the debate definitively: no choice of weights and bias can solve XOR with a single threshold neuron.
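A brute-force search is no substitute for the proof, but it makes a useful sanity check. The Python sketch below tries every $(w_1, w_2, b)$ on a coarse grid (an arbitrary choice: steps of 0.25 between -3 and 3) and counts how many triples reproduce each truth table with a threshold neuron. The names `count_solutions` and `grid` are illustrative.

```python
from itertools import product

def fires(w1, w2, b, x1, x2):
    """Threshold neuron with two inputs."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

def count_solutions(truth_table, grid):
    """Count grid triples (w1, w2, b) that reproduce the whole truth table."""
    n = 0
    for w1, w2, b in product(grid, repeat=3):
        if all(fires(w1, w2, b, x1, x2) == y
               for (x1, x2), y in truth_table.items()):
            n += 1
    return n

grid = [k / 4 for k in range(-12, 13)]   # -3.0, -2.75, ..., 3.0

AND = {(0, 0): 0, (1, 0): 0, (0, 1): 0, (1, 1): 1}
OR = {(0, 0): 0, (1, 0): 1, (0, 1): 1, (1, 1): 1}
XOR = {(0, 0): 0, (1, 0): 1, (0, 1): 1, (1, 1): 0}

n_and = count_solutions(AND, grid)
n_or = count_solutions(OR, grid)
n_xor = count_solutions(XOR, grid)

print(n_and > 0, n_or > 0, n_xor)
```

The search finds many triples for AND and OR and none for XOR. The grid only covers finitely many candidates; it is the proof above that rules out all the others.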

Verify it yourself

A single line is enough for linearly separable problems, like the AND gate, where one line isolates the $(1, 1)$ case from the other three. But it is not enough for XOR, whose positive cases sit diagonally: no line can separate them from the negatives.

Try to separate XOR yourself

Drag the sliders below to tilt and shift the line. On AND and OR, you can reach 4 out of 4 correctly classified. On XOR, you will never succeed: one point is always on the wrong side.

Figure: line-separation widget (interactive). Default state: slope = -1.00, intercept = 1.50, line B = -1.00·A + 1.50, 1 / 4 correctly classified.

On XOR, no straight line correctly separates the four points. Try as much as you want, you will never hit 4 out of 4.

That is exactly what Minsky and Papert proved in 1969. Geometrically, the decision boundary of a two-input neuron is a line; XOR requires a boundary that cannot be linear. The fix will come from multi-layer networks, which compose several lines to draw more complex boundaries.
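As a preview of that fix, here is a hedged Python sketch of a tiny two-layer arrangement of threshold neurons that computes XOR. The weights are picked by hand for illustration (one hidden neuron acting as OR, one as NAND, and an output neuron acting as AND); nothing here is learned, and the later chapters may build the idea differently.

```python
def H(z):
    """Binary threshold: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    # Hidden layer: two lines drawn in the input plane.
    h_or = H(x1 + x2 - 0.5)        # fires unless both inputs are 0
    h_nand = H(-x1 - x2 + 1.5)     # fires unless both inputs are 1
    # Output neuron: fires only when both hidden neurons fire (AND).
    return H(h_or + h_nand - 1.5)

outputs = {(x1, x2): xor_net(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
print(outputs)
```

Each hidden neuron still draws a single line; the output neuron keeps only the band between the two lines, which is exactly where the positive XOR cases sit.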

The role of the bias, visually

Without bias, the line drawn by the neuron must pass through the origin. That is a strong constraint: most real-world problems have a decision boundary that does not sit at the origin.

The bias fixes this by translating the line anywhere in the plane. Mental image: weights control the orientation of the line (its slope); the bias controls its position.
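To make that mental image concrete, write the coordinates of the plane as $(x_1, x_2)$ and assume $w_2 \neq 0$ (an assumption made here so that the line is not vertical). Solving the boundary equation for $x_2$ gives:

$$
w_1 x_1 + w_2 x_2 + b = 0 \iff x_2 = -\frac{w_1}{w_2}\, x_1 - \frac{b}{w_2}
$$

The slope $-w_1 / w_2$ involves only the weights, while the bias appears only in the intercept $-b / w_2$. Setting $b = 0$ forces the intercept to zero, which is why the line then has to pass through the origin.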

To see it live, switch the component below to the OR dataset. Keep the slope near $-1$ and move only the intercept (which here plays the bias role). The line slides parallel to itself. Without that translation, the line stays pinned to the origin and cannot separate the three orange points from $(0, 0)$.

Default state: slope = -1.00, intercept = 0.50, line B = -1.00·A + 0.50, 4 / 4 correctly classified.

Figure: role of the bias on the OR dataset (interactive)

Same 2D plot, but on the OR table: only (0,0) negative, the three other points positive. Slope initialised at -1, intercept at 0.5. Moving only the intercept (which plays the role of the bias) slides the line parallel to itself. Without a bias (no translation), the line would have to pass through the origin and could not isolate (0,0) from the three others.

Another way to feel the bias: go back to the neuron diagram above and set every input to zero. The output now depends only on the bias. That is the neuron’s “base level”, independent of any input data.
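The slope and intercept of the widget can be rewritten as the weights and bias of a threshold neuron. The Python sketch below does that conversion and scores a line on the OR table. It assumes that points on or above the line are classified as positive, which is a guess about the widget's convention, and the names `line_to_neuron` and `score` are illustrative.

```python
def line_to_neuron(slope, intercept):
    """Rewrite B = slope * A + intercept as w1*A + w2*B + b = 0.

    Assumption: points on or above the line count as positive.
    """
    return (-slope, 1.0), -intercept

def score(slope, intercept, dataset):
    """Number of points of the dataset that the line classifies correctly."""
    (w1, w2), b = line_to_neuron(slope, intercept)
    return sum(
        (1 if w1 * A + w2 * B + b >= 0 else 0) == label
        for (A, B), label in dataset.items()
    )

OR = {(0, 0): 0, (1, 0): 1, (0, 1): 1, (1, 1): 1}

with_bias = score(-1.0, 0.5, OR)   # line B = -A + 0.5
no_bias = score(-1.0, 0.0, OR)     # same slope, forced through the origin

print(with_bias, no_bias)
```

With the intercept at 0.5 the line scores 4 out of 4. Forced through the origin, it scores 3: the point $(0, 0)$ then sits exactly on the boundary, gives $z = 0$, and fires under the $z \geq 0$ convention.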

In one sentence

An artificial neuron computes a linear combination of its inputs, adds a bias, and passes everything through a non-linear function. That’s it. The power comes from what happens once we stack neurons and train them.

On to chapter 2

You saw earlier in this chapter that the neuron formula also reads in the compact form $y = f(\mathbf{w} \cdot \mathbf{x} + b)$. That vector notation is everywhere in deep learning, but we never really defined it: what exactly is a vector? What does the central dot between $\mathbf{w}$ and $\mathbf{x}$ mean? Chapter 2 lays down those linear-algebra foundations, staying strictly useful for the rest of the course.

Exercises

Take a sheet of paper and a pencil. Solutions are right below; look at them only after trying.

Exercise 1: direct computation

Consider a two-input neuron with $w_1 = 2$, $w_2 = -1$, $b = 0.5$, and ReLU activation $\text{ReLU}(z) = \max(0, z)$. Compute the output for $x_1 = 0.7$ and $x_2 = 0.3$.

Exercise 2: build an AND neuron

Find a triple $(w_1, w_2, b)$ such that a threshold neuron $H(z) = \mathbb{1}[z \geq 0]$ with two binary inputs $x_1, x_2 \in \{0, 1\}$ implements the logical AND function. Verify on the four cases.

Solution to exercise 1: direct computation

We have $w_1 = 2$, $w_2 = -1$, $b = 0.5$, $x_1 = 0.7$, $x_2 = 0.3$, and the ReLU activation.

Step 1. Write the weighted-sum formula:

$$z = w_1 x_1 + w_2 x_2 + b$$

Step 2. Substitute the numerical values:

$$z = 2 \cdot 0.7 + (-1) \cdot 0.3 + 0.5$$

Step 3. Compute each product separately:

$$2 \cdot 0.7 = 1.4$$

$$(-1) \cdot 0.3 = -0.3$$

Step 4. Add the three terms:

$$z = 1.4 + (-0.3) + 0.5 = 1.4 - 0.3 + 0.5 = 1.6$$

Step 5. Apply the activation. Since $z = 1.6 > 0$:

$$y = \text{ReLU}(1.6) = \max(0, 1.6) = 1.6$$

Result. The neuron’s output is $y = 1.6$.
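If you want to double-check the arithmetic by machine, a few lines of Python are enough. This is only a verification sketch; the comparison uses `math.isclose` because floating-point arithmetic may not land on exactly 1.6.

```python
import math

def relu(z: float) -> float:
    """ReLU activation: max(0, z)."""
    return max(0.0, z)

w1, w2, b = 2.0, -1.0, 0.5
x1, x2 = 0.7, 0.3

z = w1 * x1 + w2 * x2 + b   # weighted sum plus bias
y = relu(z)                 # z is positive, so ReLU leaves it unchanged

print(math.isclose(y, 1.6))
```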

Solution to exercise 2: build an AND neuron

We seek $(w_1, w_2, b)$ such that $H(w_1 x_1 + w_2 x_2 + b)$ implements the logical AND on the four cases $(0, 0), (1, 0), (0, 1), (1, 1)$.

A simple solution. Take $w_1 = 1$, $w_2 = 1$, $b = -1.5$.

Case-by-case verification.

Case $(x_1, x_2) = (0, 0)$:

$$z = 1 \cdot 0 + 1 \cdot 0 + (-1.5) = -1.5$$

Since $z < 0$, the output is $0$. Expected: $0$ ✓

Case $(x_1, x_2) = (1, 0)$:

$$z = 1 \cdot 1 + 1 \cdot 0 + (-1.5) = 1 - 1.5 = -0.5$$

Since $z < 0$, the output is $0$. Expected: $0$ ✓

Case $(x_1, x_2) = (0, 1)$:

$$z = 1 \cdot 0 + 1 \cdot 1 + (-1.5) = 1 - 1.5 = -0.5$$

Since $z < 0$, the output is $0$. Expected: $0$ ✓

Case $(x_1, x_2) = (1, 1)$:

$$z = 1 \cdot 1 + 1 \cdot 1 + (-1.5) = 2 - 1.5 = 0.5$$

Since $z \geq 0$, the output is $1$. Expected: $1$ ✓

Other valid solutions. The triple $(1, 1, -1.2)$ also works (check by repeating the four cases).

General characterisation. Any triple $(w_1, w_2, b)$ with $w_1, w_2 > 0$, $w_1 + b < 0$, $w_2 + b < 0$ and $w_1 + w_2 + b \geq 0$ is a solution. Geometrically, the line $w_1 x_1 + w_2 x_2 + b = 0$ must pass between the three negative points and the point $(1, 1)$.
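The characterisation can be spot-checked in Python by comparing it against a direct test on the four cases. The candidate list below is a hypothetical sample chosen for this sketch: the two solutions from the text, one extra solution, and two triples that should fail.

```python
def H(z):
    """Binary threshold: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def is_and(w1, w2, b):
    """True if the threshold neuron reproduces AND on all four cases."""
    return all(
        H(w1 * x1 + w2 * x2 + b) == (x1 & x2)
        for x1 in (0, 1) for x2 in (0, 1)
    )

def satisfies_conditions(w1, w2, b):
    """The four inequalities of the general characterisation."""
    return w1 > 0 and w2 > 0 and w1 + b < 0 and w2 + b < 0 and w1 + w2 + b >= 0

candidates = [(1, 1, -1.5), (1, 1, -1.2), (2, 3, -4), (1, 1, -0.5), (1, 1, -2.5)]

for c in candidates:
    print(c, is_and(*c), satisfies_conditions(*c))
```

On each line the two booleans agree: the first three triples implement AND and satisfy the inequalities, while the last two do neither.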

Sources

Azevedo, F. A. et al. (2009). “Equal numbers of neuronal and nonneuronal cells make the human brain an isotropically scaled-up primate brain.” Journal of Comparative Neurology 513(5), 532-541. DOI 10.1002/cne.21974
McCulloch, W. & Pitts, W. (1943). “A Logical Calculus of the Ideas Immanent in Nervous Activity.” Bulletin of Mathematical Biophysics 5(4), 115-133. DOI 10.1007/BF02478259
Hebb, D. O. (1949). The Organization of Behavior. Wiley.
Rosenblatt, F. (1958). “The Perceptron: a probabilistic model for information storage and organization in the brain.” Psychological Review 65(6), 386-408. DOI 10.1037/h0042519
Minsky, M. & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986). “Learning representations by back-propagating errors.” Nature 323(6088), 533-536. DOI 10.1038/323533a0
Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). “ImageNet Classification with Deep Convolutional Neural Networks.” NeurIPS 25.
Lighthill, J. (1973). Artificial Intelligence: A General Survey. Science Research Council, United Kingdom.
