Neural networks: foundations and mathematics · 01 / 09

The artificial neuron

From biological to mathematical, what really happens inside the elementary brick of a network.

Every neural network, from the simplest to the deepest, is an assembly of one elementary brick repeated by the millions. That brick, the artificial neuron, has nothing magical about it. It is an equation with three ingredients, inspired by a biological cell hundreds of millions of years old.

By the end of this chapter you will be able to answer three questions: what exactly does an artificial neuron compute, where does the idea come from, and why does a lone neuron carry a limitation that modern networks had to overcome.

The biological inspiration

Your brain contains roughly 86 billion neurons (Azevedo et al., 2009). Each neuron receives electrical signals from other neurons through its dendrites, integrates them in its cell body, and decides, based on an internal threshold, whether to fire a signal along its axon.

In 1943, Warren McCulloch and Walter Pitts model that behaviour with an equation (McCulloch & Pitts, 1943), laying the foundations of what will later be called the artificial neuron. It is not a faithful copy of biology, but a mathematical simplification that turns out to be powerful.

The referee analogy

Picture a football referee deciding whether a foul deserves a penalty. He weighs several pieces of information, each more or less important depending on context.

| Information | Input | Weight |
| --- | --- | --- |
| The hand touched the ball | $x_1 = 1$ | $w_1 = 0.8$ |
| Inside the penalty area | $x_2 = 0$ | $w_2 = 0.5$ |
| Intentional gesture | $x_3 = 1$ | $w_3 = 0.9$ |

The referee mentally computes a weighted sum: he adds the pieces of information after multiplying each by its importance. If the total clears a threshold, he whistles. That is exactly what an artificial neuron does.

Weighted sum: the addition of several values, each multiplied by a coefficient called a weight. General formula: Σ wᵢ xᵢ. It is the core of the artificial neuron's computation, before adding the bias and applying the activation function.
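The referee's computation can be sketched in a few lines of Python. The inputs and weights come from the table above; the threshold value of 1.0 is an assumption made for illustration, since the text does not give one.

```python
# Inputs and weights from the referee table.
inputs = [1, 0, 1]            # hand touched ball, inside area, intentional
weights = [0.8, 0.5, 0.9]

# Hypothetical threshold, chosen only for this illustration.
THRESHOLD = 1.0

def weighted_sum(xs, ws):
    """Multiply each input by its weight and add the products."""
    return sum(x * w for x, w in zip(xs, ws))

total = weighted_sum(inputs, weights)
whistle = total >= THRESHOLD
print(f"weighted sum = {total:.2f}, whistle = {whistle}")
```

With these values the second input contributes nothing, because it is zero, and the total comes from the first and third rows only.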

Play with a neuron

Practice before theory. Drag the sliders below. The output recomputes live. Your mission: find a combination where the output $y$ is very close to $1$, then another one where it is very close to $0$. You will feel how each parameter pulls in its own direction.

[Interactive neuron widget, default state]
x₁ = 1.00, w₁ = 0.80; x₂ = 0.00, w₂ = 0.50; x₃ = 1.00, w₃ = 0.90; b = -0.50
z = 1.00·0.80 + 0.00·0.50 + 1.00·0.90 + (-0.50) = 1.20
y = σ(z) = 0.77

Three things to notice while playing:

  • An input at zero cancels the contribution of its weight, no matter the weight’s value.
  • Increasing a weight amplifies the influence of its input. Flipping it negative reverses the effect.
  • The bias shifts the output independently of the inputs. A very negative bias makes the neuron very hard to activate.
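The three observations can be reproduced without the sliders. The sketch below assumes that σ is the logistic sigmoid, which is consistent with the values the widget displays; the function names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(xs, ws, b):
    """Weighted sum of the inputs plus the bias, passed through the sigmoid."""
    z = sum(x * w for x, w in zip(xs, ws)) + b
    return z, sigmoid(z)

# Default state of the widget.
z, y = neuron([1.0, 0.0, 1.0], [0.8, 0.5, 0.9], -0.5)
print(f"z = {z:.2f}, y = {y:.2f}")   # z = 1.20, y = 0.77

# 1. A zero input cancels its weight: changing w2 has no effect while x2 = 0.
_, y_big_w2 = neuron([1.0, 0.0, 1.0], [0.8, 50.0, 0.9], -0.5)

# 2. Flipping a weight negative reverses the pull of its input.
_, y_neg_w3 = neuron([1.0, 0.0, 1.0], [0.8, 0.5, -0.9], -0.5)

# 3. A very negative bias makes the neuron hard to activate.
_, y_low_bias = neuron([1.0, 0.0, 1.0], [0.8, 0.5, 0.9], -10.0)

print(y_big_w2, y_neg_w3, y_low_bias)
```

The first variant returns the same output as the default state, the second drops below 0.5, and the third is close to 0.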

The mathematical formula

The neuron’s operation collapses into one equation. The expanded, readable form:

y = f( x₁·w₁ + x₂·w₂ + x₃·w₃ + b )
Expanded form

The same equation in compact notation with the sum symbol:

$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

And the densest form, in vector notation:

$$y = f(\mathbf{w} \cdot \mathbf{x} + b)$$

All three forms say the same thing at different abstraction levels. Learn to recognise all three; you will encounter them in every scientific paper.
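To see that the three notations describe one computation, here is a small sketch that evaluates each of them on the same numbers. The identity activation and the helper `dot` are placeholders introduced for this illustration, not part of the chapter.

```python
import math

def f(z):
    """Placeholder activation (identity), so the three forms are easy to compare."""
    return z

x = [1.0, 0.0, 1.0]
w = [0.8, 0.5, 0.9]
b = -0.5

# Expanded form: every term written out.
y_expanded = f(x[0] * w[0] + x[1] * w[1] + x[2] * w[2] + b)

# Sum form: one term per index i = 1..n.
y_sum = f(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)

def dot(u, v):
    """Dot product of two equal-length sequences."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

# Vector form: the sum hidden inside a single dot product.
y_vector = f(dot(w, x) + b)

print(math.isclose(y_expanded, y_sum), math.isclose(y_sum, y_vector))
```

The vector form is the one that scales: the same line works whether the neuron has three inputs or three thousand.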

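What each symbol means

  • $x_1, \dots, x_n$: the inputs, the signals the neuron receives.
  • $w_1, \dots, w_n$: the weights, one per input, which set how much each input counts.
  • $b$: the bias, a constant added to the weighted sum.
  • $f$: the activation function, applied to the weighted sum plus the bias.
  • $y$: the output of the neuron.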

Information flow as a graph

[Figure: signal propagation through a single neuron]

Each input goes through a multiplication by its weight, all contributions sum with the bias, and the result passes through the activation function to produce the output.

From biological to artificial

| Biological element | Role | Artificial counterpart |
| --- | --- | --- |
| Dendrites | Receive incoming signals | Inputs $x_1, x_2, \dots, x_n$ |
| Synapses | Regulate signal strength | Weights $w_1, w_2, \dots, w_n$ |
| Cell body | Integrates all signals | Weighted sum $\sum w_i x_i$ |
| Activation threshold | Triggers (or not) a signal | Bias $b$ + function $f$ |
| Axon | Carries the output | Output $y$ |

An idea born in 1943

The artificial neuron has an 83-year history made of breakthroughs and long winters.

| Year | Event |
| --- | --- |
| 1943 | McCulloch & Pitts model the neuron mathematically. |
| 1949 | Donald Hebb states “neurons that fire together wire together”. |
| 1958 | Frank Rosenblatt publishes the founding paper of the perceptron, the first learning rule for a single neuron. |
| 1960 | Mark I Perceptron: first hardware realisation (Cornell Aeronautical Laboratory), an electromechanical computer able to recognise simple shapes. |
| 1969 | Minsky & Papert publish Perceptrons and prove a single neuron cannot learn XOR. First major dent in the field’s enthusiasm. |
| 1973 | Lighthill report in the UK: AI is judged unable to deliver on its promises. Funding collapses in Europe and the US. The first AI winter begins, lasting roughly a decade. |
| 1986 | Rumelhart, Hinton & Williams publish backpropagation. Research restarts. |
| 2012 | AlexNet (Krizhevsky, Sutskever, Hinton) crushes ImageNet on GPU. Modern deep learning begins. |

Perceptron: the first artificial neuron able to learn, invented by Frank Rosenblatt in 1958. It combines a weighted sum of the inputs with a threshold function to produce a binary 0 or 1 decision. Source: Rosenblatt, 1958.

XOR (exclusive or): logical operation that returns 1 when exactly one of its two inputs is 1, and 0 otherwise. Its positive cases lie on a diagonal in 2D space, making them non-separable by a single line. This makes XOR impossible to learn for a single perceptron. Source: Minsky and Papert, 1969.

Backpropagation: an algorithm that computes the gradient of the loss function with respect to every weight in a neural network. It propagates the error from the output back through earlier layers using the chain rule. It is the core of multi-layer network training. Source: Rumelhart, Hinton and Williams, 1986.

What a single neuron cannot do

Formal definition: hyperplane and linear separability

A clean treatment demands a precise definition. Let $\mathbf{x} \in \mathbb{R}^n$ be an input vector, $\mathbf{w} \in \mathbb{R}^n$ a non-zero weight vector, and $b \in \mathbb{R}$ a bias. We use $\mathbb{R}^n$ here to denote the set of ordered lists of $n$ real numbers (chapter 2 formalises this notion). The set of points where the weighted sum plus the bias vanishes,

$$\{\mathbf{x} \in \mathbb{R}^n \mid \mathbf{w} \cdot \mathbf{x} + b = 0\}$$

is called a hyperplane. In $\mathbb{R}^2$ it is a line, in $\mathbb{R}^3$ a plane, and in higher dimensions you can no longer draw it but the equation stands.

A hyperplane splits space into two half-spaces: one where $\mathbf{w} \cdot \mathbf{x} + b > 0$ (the neuron fires) and one where $\mathbf{w} \cdot \mathbf{x} + b < 0$ (it does not). A threshold neuron is exactly this partition operation.

A set of labelled points $\{(\mathbf{x}_i, y_i)\}$ with $y_i \in \{0, 1\}$ is said to be linearly separable if there exists $(\mathbf{w}, b)$ such that the associated hyperplane correctly separates the points of label 1 from those of label 0. Otherwise it is not linearly separable.
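The definition translates directly into a check: given a candidate $(\mathbf{w}, b)$, does every labelled point fall on the correct side? The sketch below is one way to write it. The four points are invented for the example, and points lying exactly on the hyperplane are counted as firing, which is a convention chosen here rather than something the definition imposes.

```python
def side(w, b, x):
    """Return 1 if x lies in the half-space w.x + b >= 0, else 0."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if z >= 0 else 0

def separates(w, b, points):
    """True if the hyperplane (w, b) puts every labelled point on its correct side."""
    return all(side(w, b, x) == label for x, label in points)

# Hypothetical labelled points in the plane, made up for this illustration.
POINTS = [((2, 1), 1), ((3, 3), 1), ((0, 0), 0), ((1, -1), 0)]

# Candidate hyperplane x1 + x2 - 2 = 0.
print(separates((1.0, 1.0), -2.0, POINTS))   # True

# Candidate hyperplane x1 - 0.5 = 0: the point (1, -1) ends up on the wrong side.
print(separates((1.0, 0.0), -0.5, POINTS))   # False
```

Finding one pair $(\mathbf{w}, b)$ for which `separates` returns `True` is enough to show that a dataset is linearly separable; showing that none exists requires an argument, not a search.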

Why XOR is not linearly separable

This result is not only visual, it can be proved in a few lines by contradiction. We use here the threshold function, which is that of the historical perceptron. The same impossibility holds with a sigmoid; we lose only the simplicity of the strict inequality.

Before tackling the proof, let us fix a notation that will be useful later. For a logical condition $P$, we write $\mathbb{1}[P]$ for the indicator function of $P$: it equals $1$ if $P$ is true, $0$ otherwise. In particular $\mathbb{1}[z \geq 0]$ is the Heaviside function, sometimes written $H(z)$, which we already met at the start of this chapter.
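The argument can be sketched as follows, assuming the neuron outputs $\mathbb{1}[w_1 x_1 + w_2 x_2 + b \geq 0]$. Suppose some triple $(w_1, w_2, b)$ reproduces XOR. Each of the four input pairs then imposes one inequality, and combining the two middle ones contradicts the last:

```latex
\begin{aligned}
(0,0) \mapsto 0 &: \quad b < 0 \\
(1,0) \mapsto 1 &: \quad w_1 + b \geq 0 \\
(0,1) \mapsto 1 &: \quad w_2 + b \geq 0 \\
(1,1) \mapsto 0 &: \quad w_1 + w_2 + b < 0 \\
\text{lines 2 + 3} &: \quad w_1 + w_2 + 2b \geq 0
  \;\Rightarrow\; w_1 + w_2 + b \geq -b > 0
\end{aligned}
```

The last line says $w_1 + w_2 + b > 0$, while the fourth requires $w_1 + w_2 + b < 0$. Both cannot hold, so no such triple exists.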

This proof fits in five lines but it closes the debate definitively: no choice of weights and bias can solve XOR with a single threshold neuron.

Verify it yourself

A single threshold neuron is enough for linearly separable problems, like the AND gate, where one line isolates the $(1, 1)$ case from the other three. But it is not enough for XOR, whose positive cases sit diagonally: no line can separate them from the negatives.
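One way to check this numerically is a brute-force search over a grid of weights and biases. The sketch below assumes the threshold convention $\mathbb{1}[z \geq 0]$; the grid itself is an arbitrary choice made for this example. A finite search does not replace the proof, but it shows the pattern: a perfect score exists for AND and OR, never for XOR.

```python
from itertools import product

POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
GATES = {
    "AND": [0, 0, 0, 1],
    "OR":  [0, 1, 1, 1],
    "XOR": [0, 1, 1, 0],
}

def threshold_neuron(w1, w2, b, x1, x2):
    """Threshold neuron: outputs 1 when w1*x1 + w2*x2 + b >= 0, else 0."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

def best_score(labels, grid):
    """Highest number of correctly classified points over all (w1, w2, b) in the grid."""
    best = 0
    for w1, w2, b in product(grid, repeat=3):
        score = sum(
            threshold_neuron(w1, w2, b, x1, x2) == y
            for (x1, x2), y in zip(POINTS, labels)
        )
        best = max(best, score)
    return best

# Arbitrary search grid: -2.0, -1.75, ..., 2.0 (an assumption, not from the text).
GRID = [k / 4 for k in range(-8, 9)]

for name, labels in GATES.items():
    print(name, best_score(labels, GRID), "/ 4")
```

On this grid the search reaches 4 out of 4 for AND and OR, and stops at 3 out of 4 for XOR.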

Try to separate XOR yourself

Drag the sliders below to tilt and shift the line. On AND and OR, you can reach 4 out of 4 correctly classified. On XOR, you will never succeed: one point is always on the wrong side.

[Interactive plot: the four XOR points in the (A, B) plane, with an adjustable line]
Line: B = -1.00·A + 1.50
1 / 4 correctly classified
On XOR, no straight line correctly separates the four points. Try as much as you want, you will never hit 4 out of 4.

That is exactly what Minsky and Papert proved in 1969. Geometrically, the decision boundary of a two-input neuron is a line; XOR requires a boundary that no single line can provide. The fix will come from multi-layer networks, which compose several lines to draw more complex boundaries.

The role of the bias, visually

Without bias, the line drawn by the neuron must pass through the origin. That is a strong constraint: most real-world problems have a decision boundary that does not sit at the origin.

The bias fixes this by translating the line anywhere in the plane. Mental image: weights control the orientation of the line (its slope); the bias controls its position.

To see it live, switch the component below to the OR dataset. Keep the slope near $-1$ and move only the intercept (which here plays the bias role). The line slides parallel to itself. Without that translation, you cannot correctly classify the three orange points.

[Interactive plot: the four OR points in the (A, B) plane, with an adjustable line]
Line: B = -1.00·A + 0.50
4 / 4 correctly classified

Another way to feel the bias: go back to the interactive neuron above and set every input to zero. The output now depends only on the bias. That is the neuron’s “base level”, independent of any input data.
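The effect of the bias on the OR dataset can be sketched numerically. The line B = -1.00·A + 0.50 is rewritten here as A + B - 0.5 = 0, which gives weights (1, 1) and bias -0.5. The convention that a point exactly on the line counts as firing is an assumption of this sketch.

```python
OR_POINTS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

def predict(w1, w2, b, x1, x2):
    """Threshold neuron: 1 when w1*x1 + w2*x2 + b >= 0, else 0."""
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

def score(w1, w2, b):
    """Number of OR points classified correctly."""
    return sum(predict(w1, w2, b, x1, x2) == y for (x1, x2), y in OR_POINTS)

# Line shifted away from the origin by the bias.
print(score(1.0, 1.0, -0.5))   # 4

# Same orientation, no bias: the line passes through the origin.
print(score(1.0, 1.0, 0.0))    # 3, because (0, 0) lands on the firing side

# All inputs at zero: the pre-activation is just the bias.
z_at_origin = 1.0 * 0 + 1.0 * 0 + (-0.5)
print(z_at_origin)             # -0.5
```

Changing only the third argument of `score` slides the line parallel to itself, which is the translation described above.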

In one sentence

An artificial neuron computes a linear combination of its inputs, adds a bias, and passes everything through a non-linear function. That’s it. The power comes from what we can do with neurons once we stack them and train them.

On to chapter 2

You saw earlier in this chapter that the neuron formula also reads in the compact form $y = f(\mathbf{w} \cdot \mathbf{x} + b)$. That vector notation is everywhere in deep learning, but we never really defined it: what is a vector exactly? What does the central dot between $\mathbf{w}$ and $\mathbf{x}$ mean? Chapter 2 lays down those linear-algebra foundations, staying strictly useful for the rest of the course.

Exercises

Take a sheet of paper and a pencil. Solutions are right below; look at them only after trying.

Exercise 1: direct computation

Consider a two-input neuron with $w_1 = 2$, $w_2 = -1$, $b = 0.5$, and ReLU activation $\text{ReLU}(z) = \max(0, z)$. Compute the output for $x_1 = 0.7$ and $x_2 = 0.3$.
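Once you have an answer on paper, a short script is one way to check it. This sketch only transcribes the values of the exercise; the variable names are illustrative.

```python
def relu(z):
    """ReLU activation: returns z when it is positive, 0 otherwise."""
    return max(0.0, z)

w1, w2, b = 2.0, -1.0, 0.5
x1, x2 = 0.7, 0.3

z = w1 * x1 + w2 * x2 + b     # weighted sum plus bias
y = relu(z)                   # output of the neuron

print(f"z = {z:.2f}, y = {y:.2f}")
```

Because the pre-activation is positive here, ReLU leaves it unchanged and the output equals $z$.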

Exercise 2: build an AND neuron

Find a triple $(w_1, w_2, b)$ such that a threshold neuron $H(z) = \mathbb{1}[z \geq 0]$ with two binary inputs $x_1, x_2 \in \{0, 1\}$ implements the logical AND function. Verify on the four cases.
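After trying by hand, you can verify a candidate with the sketch below. The triple $(1, 1, -1.5)$ used as default is only one possible answer among infinitely many; substitute your own values to test them.

```python
def H(z):
    """Heaviside threshold: 1 when z >= 0, else 0."""
    return 1 if z >= 0 else 0

def and_neuron(x1, x2, w1=1.0, w2=1.0, b=-1.5):
    """Threshold neuron with one candidate triple (w1, w2, b) for AND."""
    return H(w1 * x1 + w2 * x2 + b)

# Check the four input cases against Python's own logical AND.
for x1 in (0, 1):
    for x2 in (0, 1):
        output = and_neuron(x1, x2)
        expected = int(x1 and x2)
        print(x1, x2, output, "ok" if output == expected else "wrong")
```

The bias has to be negative enough that a single active input does not reach the threshold, yet small enough in magnitude that two active inputs do.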

Sources

  • Azevedo, F. A. et al. (2009). “Equal numbers of neuronal and nonneuronal cells make the human brain an isotropically scaled-up primate brain.” Journal of Comparative Neurology 513(5), 532-541. DOI 10.1002/cne.21974
  • McCulloch, W. & Pitts, W. (1943). “A Logical Calculus of the Ideas Immanent in Nervous Activity.” Bulletin of Mathematical Biophysics 5(4), 115-133. DOI 10.1007/BF02478259
  • Hebb, D. O. (1949). The Organization of Behavior. Wiley. Archive.org
  • Rosenblatt, F. (1958). “The Perceptron: a probabilistic model for information storage and organization in the brain.” Psychological Review 65(6), 386-408. DOI 10.1037/h0042519
  • Minsky, M. & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
  • Rumelhart, D. E., Hinton, G. E. & Williams, R. J. (1986). “Learning representations by back-propagating errors.” Nature 323(6088), 533-536. DOI 10.1038/323533a0
  • Krizhevsky, A., Sutskever, I. & Hinton, G. E. (2012). “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems 25.
  • Lighthill, J. (1973). Artificial Intelligence: A General Survey. Science Research Council, United Kingdom. Archive.org

Further reading

  • Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. Chapter 6 on feed-forward networks. deeplearningbook.org
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 5 on neural networks.
  • Hastie, T., Tibshirani, R. & Friedman, J. (2009). The Elements of Statistical Learning. Springer. Chapter 11. Free PDF at Stanford
  • LeCun, Y. (lecture series at Collège de France, 2016 onwards). “L’apprentissage profond.” college-de-france.fr
  • Ng, A. (online course on Coursera, “Deep Learning Specialization”). Excellent complement.
Quiz

  1. What does a neuron compute before the activation function?

  2. Why can a single neuron not learn XOR?

  3. What is the geometric role of the bias?

  4. What does the activation function f do in y = f(Σ wᵢ xᵢ + b)?

  5. True or false: the artificial neuron is a faithful copy of the biological neuron.