Neural networks: foundations and mathematics · 03 / 09

Activation functions

Identity, sigmoid, ReLU, tanh: why they exist, what they give, and how to choose.

In chapters 1 and 2 you met the activation function $f$ in the formula $y = f(\mathbf{w} \cdot \mathbf{x} + b)$ without really spending time on it. That’s the subject of this chapter. We will see why it is mathematically essential, compare the four classics (Identity, sigmoid, ReLU, tanh) and learn how to choose between them.

Why we need a non-linear function

The proof that justifies everything

Consider a two-layer network with no activation function, i.e. $f$ replaced by the identity. The first layer outputs:

\mathbf{h} = W_1 \mathbf{x} + \mathbf{b}_1

The second layer applied to $\mathbf{h}$ gives:

\mathbf{y} = W_2 \mathbf{h} + \mathbf{b}_2 = W_2 (W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (W_2 W_1) \mathbf{x} + (W_2 \mathbf{b}_1 + \mathbf{b}_2)

Setting $W' = W_2 W_1$ and $\mathbf{b}' = W_2 \mathbf{b}_1 + \mathbf{b}_2$ , we recover:

\mathbf{y} = W' \mathbf{x} + \mathbf{b}'

That is a single affine layer. Stacking layers without non-linearity is mathematically pointless: the whole thing reduces to one equivalent layer. This is the activation function's whole reason for existing.

The same argument repeats at every transition between layers. That is why we insert $f$ after every hidden layer, and not only at the output: if a single pair of adjacent layers lacks a non-linearity between them, those two layers collapse into one and depth loses its point again.
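The collapse can be checked numerically. The sketch below uses arbitrary illustrative parameters (the values of `W1`, `b1`, `W2`, `b2` and `x` are assumptions, not taken from the chapter) and compares two stacked identity-activation layers with the single layer $W' = W_2 W_1$, $\mathbf{b}' = W_2 \mathbf{b}_1 + \mathbf{b}_2$.

```python
def affine(W, b, x):
    """Return W x + b for a matrix W (list of rows) and vectors b, x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def matmul(A, B):
    """Return the matrix product A B (both given as lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * c for a, c in zip(row, col)) for col in cols] for row in A]

# Arbitrary illustrative parameters: 2 inputs -> 3 hidden units -> 2 outputs.
W1 = [[1.0, -2.0], [0.5, 3.0], [-1.5, 0.25]]
b1 = [0.1, -0.2, 0.3]
W2 = [[2.0, -1.0, 0.5], [0.0, 1.0, -3.0]]
b2 = [-0.5, 0.75]
x = [0.7, -1.3]

# Two stacked layers with the identity as activation.
y_stacked = affine(W2, b2, affine(W1, b1, x))

# Single equivalent layer: W' = W2 W1 and b' = W2 b1 + b2.
W_prime = matmul(W2, W1)
b_prime = affine(W2, b2, b1)
y_single = affine(W_prime, b_prime, x)

# The two outputs agree up to floating-point rounding.
max_gap = max(abs(a - c) for a, c in zip(y_stacked, y_single))
print(y_stacked, y_single, max_gap)
```

Inserting any non-linear $f$ between the two `affine` calls breaks this equality in general, which is exactly what gives depth its value.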

What we expect from an activation function

To be usable in practice, a function $f$ has to tick several boxes:

Non-linear (for the reason we just saw).
Differentiable almost everywhere, so backpropagation can compute a gradient (chapter 8).
Fast to compute, because we evaluate it millions of times per second during training.
Bounded (ideally), to keep outputs from blowing up.
Non-zero gradient over a wide range, otherwise training stalls.

The four classics we are about to study tick those boxes to varying degrees.

The four classical functions

Identity

The identity is the function that does nothing. We include it for completeness and because we use it on the output layer for regression problems.

Figure: plot of the identity, f(x) = x.

Equation: $f(x) = x$ . Derivative: $f'(x) = 1$ .

You can safely ignore this function as long as we are inside hidden layers, but it returns in chapter 6 when we discuss loss functions for regression.

Sigmoid

The sigmoid dominated neural networks between 1986 and 2010. Its S-shape squashes any real number into the open interval $(0, 1)$ , which lets us read its output as a probability.

Figure: plot of the sigmoid, σ(x) = 1 / (1 + e⁻ˣ).

Equation: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ . Derivative: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ , an elegant property that simplifies backpropagation.

Shape: S-curve, $\sigma(0) = 0.5$ , $\sigma(-\infty) \to 0$ , $\sigma(+\infty) \to 1$ .

Modern usage: output layer for binary classification (where the probability has a meaning). Rarely used in deep hidden layers because of the vanishing gradient (next section).
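As a minimal sketch (not any particular library's implementation), the sigmoid and its derivative can be written with the standard `math` module. The branch on the sign of `x` is an added precaution of this sketch: it avoids overflowing `math.exp` for very negative inputs.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, evaluated without overflowing for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For x < 0, exp(-x) could overflow, so use the equivalent form e^x / (1 + e^x).
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_prime(x):
    """Derivative via the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-1000.0, -2.0, 0.0, 2.0, 1000.0):
    print(x, sigmoid(x), sigmoid_prime(x))
```

Mathematically the output stays strictly inside $(0, 1)$ ; in floating point it rounds to exactly 0 or 1 for extreme inputs, which is worth remembering before taking a logarithm of it.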

ReLU

ReLU is today the default choice for hidden layers. Its brutal simplicity is precisely what makes it efficient.

Figure: plot of ReLU, ReLU(x) = max(0, x).

Equation: $\text{ReLU}(x) = \max(0, x)$ . Derivative: $\text{ReLU}'(x) = \mathbb{1}[x > 0]$ , i.e. $1$ for $x > 0$ and $0$ otherwise. Strictly undefined at $x = 0$ , but we conventionally set it to $0$ or $1$ with no practical consequence.

Shape: zero on negatives, identity on positives. Not bounded above.

Modern usage: since its introduction for RBMs by Nair and Hinton (2010), ReLU and its variants (Leaky ReLU, ELU, GELU) have become the dominant choice for hidden layers in deep networks. It computes in one operation, its gradient is exact (1 or 0), and it does not saturate upwards. Drawback: if a neuron’s input stays negative, its gradient is zero and backpropagation (chapter 8) can no longer update its weights. The neuron freezes. This is the dying ReLU problem, mitigated by the variants.

Watch a neuron die

The component below is a single-input ReLU neuron. Its output $\text{ReLU}(w \cdot x + b)$ is drawn in orange. Twelve dataset points are placed on the curve: green if the neuron fires on them, red if not. Push the bias toward very negative values: all points flip red. At that stage the neuron stops learning, because its gradient is zero everywhere.

Simulator: dying ReLU neuron (interactive). Initial state: weight w = 1.20, bias b = 0.40, with 7 of the 12 points active and 5 inactive (status label: Selective). Push the bias down to drive the neuron to death: when the ReLU output is zero on every sample, so is its gradient, and the neuron stops learning.

Three things to try:

Set the bias to $-2$ with a moderate positive weight. The neuron dies completely on the dataset.
Reset the bias to $0$ and flip the sign of the weight. The activation boundary pivots around the origin.
Find a configuration where exactly half of the points are active. That is typically a good starting point for training.
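The first experiment can be reproduced offline with a few lines. This is a sketch under stated assumptions: the dataset (12 evenly spaced inputs on $[-1, 1]$) is hypothetical and need not match the one used by the simulator, so the active counts may differ from the widget's.

```python
def relu_prime(z):
    # Convention used here: the derivative at z == 0 is set to 0.
    return 1.0 if z > 0 else 0.0

def neuron_stats(w, b, xs):
    """Return (active points, d(sum of outputs)/dw, d(sum of outputs)/db)."""
    zs = [w * x + b for x in xs]
    active = sum(1 for z in zs if z > 0)
    grad_w = sum(relu_prime(z) * x for z, x in zip(zs, xs))
    grad_b = sum(relu_prime(z) for z in zs)
    return active, grad_w, grad_b

# Hypothetical dataset: 12 evenly spaced inputs on [-1, 1].
xs = [-1.0 + 2.0 * i / 11 for i in range(12)]

alive = neuron_stats(1.2, 0.4, xs)   # initial setting of the simulator
dead = neuron_stats(1.2, -2.0, xs)   # bias pushed to -2

print("w=1.2, b=0.4 :", alive)
print("w=1.2, b=-2.0:", dead)
```

With the bias at $-2$ , every pre-activation is negative on this dataset, so both gradients are zero: no update rule based on them can move $w$ or $b$ again.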

Tanh

The hyperbolic tangent is the centred cousin of the sigmoid. It has the same S-shape but squashes its input into $(-1, 1)$ instead of $(0, 1)$ .

Figure: plot of tanh, tanh(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ).

Equation: $\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$ . Derivative: $\tanh'(x) = 1 - \tanh^2(x)$ .

Shape: S-curve, $\tanh(0) = 0$ , $\tanh(-\infty) \to -1$ , $\tanh(+\infty) \to +1$ .

Modern usage: favoured when you want a zero-centred output (statistically preferable for training). Often used in classical RNNs (LSTM, GRU) and in some attention layers.

Recap table

A single overview to memorise the four classical functions: shape, derivative, output range, and computational cost.

Function	Definition	Derivative	Range	Relative cost
Identity	$f(x) = x$	$f'(x) = 1$	$\mathbb{R}$	1× (baseline)
Sigmoid	$\sigma(x) = \dfrac{1}{1 + e^{-x}}$	$\sigma'(x) = \sigma(x)(1 - \sigma(x))$	$(0, 1)$	~10× (one exp)
Tanh	$\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$	$\tanh'(x) = 1 - \tanh^2(x)$	$(-1, 1)$	~10× (one exp in practice)
ReLU	$\text{ReLU}(x) = \max(0, x)$	$\text{ReLU}'(x) = \mathbb{1}[x > 0]$	$[0, +\infty)$	1× (one test)

Three observations to keep in mind: (1) only ReLU and Identity match the low cost of the dot product that feeds them; (2) only sigmoid and tanh saturate and therefore trigger vanishing gradient; (3) only ReLU can die, which the modern variants below fix.

Relative costs are orders of magnitude: the exact value depends on hardware (CPU vs GPU, FP32 vs FP16, presence of vector instructions or lookup tables for the exponential). On a CPU without hardware-accelerated exponential, the figures above hold; on a modern FP16 GPU, the gap between ReLU and sigmoid narrows significantly. Practical note: tanh is usually computed as $\tanh(x) = 1 - 2 / (e^{2x} + 1)$ , which requires only one exponential despite the defining formula seemingly asking for two.
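The single-exponential rewriting of tanh mentioned in the practical note can be checked against the standard library. The function names below are illustrative, and this naive version is only a sketch: `math.exp(2 * x)` overflows for very large `x`, a case that library implementations handle separately.

```python
import math

def tanh_two_exp(x):
    """Defining formula: two exponentials."""
    ep, em = math.exp(x), math.exp(-x)
    return (ep - em) / (ep + em)

def tanh_one_exp(x):
    """Rewritten form 1 - 2 / (e^(2x) + 1): a single exponential."""
    return 1.0 - 2.0 / (math.exp(2.0 * x) + 1.0)

xs = [-5.0, -1.0, -0.1, 0.0, 0.1, 1.0, 5.0]
max_gap = max(abs(tanh_one_exp(x) - math.tanh(x)) for x in xs)
max_gap_def = max(abs(tanh_two_exp(x) - math.tanh(x)) for x in xs)
print(max_gap, max_gap_def)
```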

The modern ReLU family

ReLU has been the default since 2012, but state-of-the-art architectures (transformers, foundation models, diffusion) have adopted smoother variants that solve the dying ReLU problem while preserving most of its advantages.

Leaky ReLU

Figure: plot of Leaky ReLU, LeakyReLU(x) = max(α x, x).

Equation: $\text{LeakyReLU}(x) = \max(\alpha x, x)$ , with a small coefficient $\alpha$ typically set to $0.01$ . Derivative: $\alpha$ on the negative side, $1$ on the positive side.

Instead of clamping the output to zero for negative inputs, we let a small slope $\alpha$ through. The neuron keeps a non-zero gradient on both sides and can no longer die. Introduced by Maas, Hannun and Ng (2013) for speech recognition.
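A minimal sketch of Leaky ReLU and its derivative follows. The default `alpha=0.01` is the typical value quoted above; assigning the slope `alpha` at exactly `x == 0` is a convention chosen for this sketch.

```python
def leaky_relu(x, alpha=0.01):
    """max(alpha * x, x); this form assumes 0 < alpha < 1."""
    return max(alpha * x, x)

def leaky_relu_prime(x, alpha=0.01):
    """Slope 1 on the positive side, alpha elsewhere (including x == 0 here)."""
    return 1.0 if x > 0 else alpha

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, leaky_relu(x), leaky_relu_prime(x))
```

The derivative is never zero, which is the whole point: a neuron stuck on the negative side still receives a small gradient and can recover.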

GELU

Figure: plot of GELU, GELU(x) = x · Φ(x).

Equation: $\text{GELU}(x) = x \cdot \Phi(x)$ , where $\Phi$ is the cumulative distribution function of the standard normal distribution. It is the activation of late-2010s transformers (BERT, GPT-2, GPT-3) and remains widely used in most contemporary transformers. More recent proprietary architectures (GPT-4, Claude) do not publish their activation choice. Derivative: $\Phi(x) + x \cdot \phi(x)$ where $\phi$ is the Gaussian density.

Geometrically, GELU behaves almost like ReLU far from zero, but it is smooth everywhere (and therefore differentiable everywhere, unlike ReLU). Introduced by Hendrycks and Gimpel (2016).
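The exact GELU can be sketched with the standard library's error function, using $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x / \sqrt{2})\right)$ . The helper names are illustrative; deep learning frameworks ship their own implementations, sometimes based on a tanh approximation instead.

```python
import math

def std_normal_cdf(x):
    """Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    """phi(x), the standard Gaussian density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x):
    return x * std_normal_cdf(x)

def gelu_prime(x):
    """Derivative Phi(x) + x * phi(x)."""
    return std_normal_cdf(x) + x * std_normal_pdf(x)

# Far from zero GELU tracks ReLU; near zero it bends smoothly.
for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, gelu(x), max(0.0, x), gelu_prime(x))
```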

SiLU (Swish)

Figure: plot of SiLU / Swish, SiLU(x) = x · σ(x).

Equation: $\text{SiLU}(x) = x \cdot \sigma(x)$ , where $\sigma$ is the sigmoid. It is the same function as the “Swish” of Ramachandran et al. (2017), simply renamed. It is the inner activation of SwiGLU, the block used in Llama (all generations), Mistral, Mixtral, PaLM and most recent open-weight LLMs. It also shows up directly in EfficientNet. Derivative: $\sigma(x) + x \cdot \sigma(x)(1 - \sigma(x))$ .

Shape very close to GELU, slightly cheaper to compute (no error function, just a sigmoid). Independently described by Elfwing, Uchibe and Doya (2018) under the name SiLU.
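To get a feel for how close the two shapes are, the sketch below defines SiLU and its derivative, then measures the largest gap to the exact GELU on a grid over $[-5, 5]$ . The grid and its step are arbitrary choices made for this illustration.

```python
import math

def sigmoid(x):
    # Plain form; adequate for the moderate inputs used below.
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    return x * sigmoid(x)

def silu_prime(x):
    """Derivative sigma(x) + x * sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

def gelu(x):
    """Exact GELU, x * Phi(x), via the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

grid = [i / 10.0 for i in range(-50, 51)]
max_gap = max(abs(silu(x) - gelu(x)) for x in grid)
print("largest |SiLU - GELU| on the grid:", max_gap)
```

Only a sigmoid is needed for SiLU, whereas GELU needs the error function, which is where the small cost difference comes from.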

ELU

Figure: plot of ELU, ELU(x) = x if x ≥ 0, else α(eˣ - 1).

Equation: $\text{ELU}(x) = x$ if $x \geq 0$ , else $\alpha (e^x - 1)$ , with $\alpha$ typically equal to $1$ . A smooth shape on the negative side that asymptotes to $-\alpha$ . Derivative: $1$ on the positive side, $\alpha e^x$ on the negative side.

More expensive than ReLU (an exponential on the negative branch only) but centres the output around zero, which helps convergence. Introduced by Clevert, Unterthiner and Hochreiter (2015).
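A short sketch of ELU with the typical `alpha=1.0`. Using `math.expm1` (which computes $e^x - 1$ accurately for small $x$ ) is a choice made for this sketch rather than something the definition requires.

```python
import math

def elu(x, alpha=1.0):
    """x on the positive side, alpha * (e^x - 1) on the negative side."""
    return x if x >= 0 else alpha * math.expm1(x)

def elu_prime(x, alpha=1.0):
    """1 on the positive side, alpha * e^x on the negative side."""
    return 1.0 if x >= 0 else alpha * math.exp(x)

for x in (-50.0, -1.0, 0.0, 1.0):
    print(x, elu(x), elu_prime(x))
```

For very negative inputs the output approaches $-\alpha$ and the derivative approaches zero, so ELU saturates on that side even though it never clamps exactly to zero.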

In 2026, the trio ReLU / GELU / SiLU covers the overwhelming majority of architectures deployed in production. Practical rule: ReLU for raw simplicity and speed, GELU for transformers, SiLU for recent dense architectures.

Play with the functions

The idea of an activation function is easier to see than to describe. The component below is a full workshop: you can enable or disable each function individually (click its label), display the derivatives as dashed lines, enable a cursor that places a vertical line and reveals the local tangent at the hovered point for each curve (the slope of that tangent is exactly $f'(x)$ ). A collapsible panel at the bottom also lets you tune the $x$ domain if you want to zoom in or out.

Workshop: plotting activation functions (interactive). The four curves (sigmoid, ReLU, tanh, identity) are drawn over $x \in [-5, 5]$ , with Derivative and Cursor toggles and an adjustable $x$ domain (min = -5.0, max = 5.0). Readout with the cursor at $x = 0.00$ :

Function	f(x)	f'(x)
Identity	0.000	1.000
Sigmoid	0.500	0.250
ReLU	0.000	0.000
Tanh	0.000	1.000

The dashed segments around each point are the local tangents. A flat tangent means a weak gradient (slow learning); a steep tangent means a strong gradient.

Five experiments to try:

At $x = 0$ , check that $\sigma(0) = 0.5$ and $\sigma'(0) = 0.25$ . This is where the sigmoid tangent is steepest.
At $x = 2$ , see that $\sigma'(2)$ has already fallen to about $0.1$ . The sigmoid saturates fast, and its tangent is already much flatter.
At $x = -3$ , ReLU and its derivative are strictly zero. On that branch no gradient flows back.
Toggle off Identity and Tanh to compare only sigmoid and ReLU: the contrast between saturation and piecewise linearity becomes obvious.
Compare at the centre the tangent slope of sigmoid (max $0.25$ ) with ReLU’s for $x > 0$ (always exactly $1$ ). That is the root of the vanishing gradient.

The vanishing gradient problem

When we differentiate the sigmoid, we get $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ . The maximum is reached at $x = 0$ and equals $0.25$ . So, setting the weights aside, every sigmoid layer crossed during backpropagation multiplies the gradient by a factor of at most $0.25$ .

For a 10-layer network using sigmoid, the gradient at the first layer is multiplied by at most $0.25^{10} \approx 9.5 \times 10^{-7}$ . That is extremely small: the first layer no longer learns. This is the vanishing gradient problem, first observed by Hochreiter in his thesis (1991), analysed for recurrent networks by Bengio, Simard and Frasconi (1994), then quantified and addressed for deep feedforward networks by Glorot and Bengio (2010).

ReLU largely solves this problem: on its active part, the gradient is exactly $1$ . Multiplying by $1$ does not shrink the gradient. This is one of the two reasons (with its computational speed) for its modern dominance.
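The bound can be compared with an actual chain-rule product. The sketch below is a toy model, not a real network: a chain of scalar neurons $h \leftarrow f(w h + b)$ with hypothetical values $w = 1$ , $b = 0$ and input $x = 0.5$ , all chosen only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def chain_gradient(n_layers, activation, activation_prime, x=0.5, w=1.0, b=0.0):
    """d(output)/d(input) for a chain of scalar neurons h <- f(w * h + b)."""
    h, grad = x, 1.0
    for _ in range(n_layers):
        z = w * h + b
        grad *= w * activation_prime(z)  # chain rule: one factor per layer
        h = activation(z)
    return grad

sig_grad = chain_gradient(10, sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z)))
relu_grad = chain_gradient(10, lambda z: max(0.0, z), lambda z: 1.0 if z > 0 else 0.0)
bound = 0.25 ** 10

print("sigmoid chain gradient:", sig_grad)
print("upper bound 0.25**10  :", bound)
print("ReLU chain gradient   :", relu_grad)
```

In this toy chain the sigmoid gradient falls below the $0.25^{10}$ bound, because the pre-activations do not sit exactly at zero, while the ReLU chain (active at every layer here) passes the gradient through unchanged.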

Watch the gradient collapse

The component below simulates a deep network. Change the number of layers and switch the activation. Each bar represents the effective gradient at a given layer, top to bottom from output to input. On sigmoid you see the bars shrink visibly as depth grows. On ReLU they keep their length.

Simulator: vanishing gradient in deep networks (interactive, up to 20 layers). Readout for 8 layers with sigmoid selected:

Layer	Effective gradient
08	1.000
07	0.250
06	0.063
05	0.016
04	3.91e-3
03	9.77e-4
02	2.44e-4
01	6.10e-5

Effective gradient at the first layer: 6.10e-5

The output layer receives a reference gradient of 1. Each layer crossed back toward the input multiplies it by the maximum derivative of the chosen function: sigmoid 0.25, tanh 1, ReLU 1.

The key experiment: push to $15$ layers with sigmoid and look at the first-layer gradient. It is on the order of $10^{-9}$ , totally insufficient to update a weight. Now switch to ReLU and watch the bars all match again. That is the exact technical reason why sigmoid was dropped from hidden layers in favour of ReLU starting in 2012.
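The simulator's rule (a reference gradient of 1 at the output layer, multiplied by the maximum derivative at each layer crossed) is simple enough to reproduce. The function and dictionary names below are illustrative, not the simulator's actual code.

```python
def layer_gradients(n_layers, max_derivative):
    """Gradient reaching each layer, listed from the output layer down to layer 1."""
    return [max_derivative ** k for k in range(n_layers)]

# Maximum derivative of each activation, as stated for the simulator.
MAX_DERIVATIVE = {"sigmoid": 0.25, "tanh": 1.0, "relu": 1.0}

bars_8 = layer_gradients(8, MAX_DERIVATIVE["sigmoid"])
first_layer_15 = layer_gradients(15, MAX_DERIVATIVE["sigmoid"])[-1]
relu_15 = layer_gradients(15, MAX_DERIVATIVE["relu"])

print("8 sigmoid layers :", ["%.2e" % g for g in bars_8])
print("15 sigmoid layers, first layer: %.2e" % first_layer_15)
print("15 ReLU layers, first layer   : %.2e" % relu_15[-1])
```

Under this rule the first of $n$ layers receives $0.25^{\,n-1}$ with sigmoid, since the output layer itself keeps the reference value of 1.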

How to choose in practice

A simple heuristic that works in the vast majority of cases:

Layer	Default choice	Variants
Hidden layers	ReLU	Leaky ReLU, ELU, GELU for dying-ReLU cases
Binary classification output	Sigmoid	(none)
Multi-class classification output	Softmax (chapter 6)	(none)
Regression output	Identity	(none)
Recurrent layers (RNN, LSTM, GRU)	Tanh + sigmoid on gates	(none)

This table is a heuristic, not dogma. Specific architectures (transformers, GANs, diffusion models) use other functions (GELU, Swish, Mish). But for a standard network, ReLU everywhere except output is an excellent starting point.

In one sentence

The activation function is what stops a deep network from collapsing into a linear regression. Sigmoid and tanh ruled until 2010, then ReLU took over the hidden layers from 2012 on. The choice depends on context but rarely turns out to be tricky in practice.

On to chapter 4

You now have every ingredient to understand the learning machine: vector inputs (chap. 2), weighted sum, bias and activation function (this chapter). Chapter 4 introduces the perceptron, the first neuron that adjusts its weights on its own from examples. It is the birth of machine learning as we know it.

A subtlety you will explore: this chapter has just established that an activation function must be differentiable almost everywhere, with a non-zero gradient over a wide range, for training to progress. Yet the perceptron uses the Heaviside step function, which is differentiable everywhere except at zero but whose derivative is zero wherever it exists. How does it learn at all, without a usable gradient? That is precisely the question opening the next chapter.

Exercises

Exercise 1: differentiate the sigmoid

Starting from the definition $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ , prove that $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ .

Exercise 2: compute a derivative at a specific value

Compute $\sigma'(0)$ , then $\sigma'(2)$ . Compare those two values. What does this tell you about the behaviour of the gradient for inputs far from zero?

Exercise 3: compare the speed

Consider a neuron taking 1000 inputs. How many elementary arithmetic operations (additions, multiplications, exponentials) are needed to compute the output of a neuron with ReLU activation? With sigmoid activation? Does the difference look marginal or significant when multiplied by millions of evaluations?

Solution to exercise 1: differentiate the sigmoid

We have $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ and we want $\sigma'(x)$ . We use the chain rule.

Step 1. Let $u(x) = 1 + e^{-x}$ . Then:

\sigma(x) = \dfrac{1}{u(x)}

Step 2. The derivative of a function of the form $1/u$ is $-u'/u^2$ :

\sigma'(x) = -\dfrac{u'(x)}{u(x)^2}

Step 3. Compute $u'(x)$ . The derivative of $1$ is zero, and the derivative of $e^{-x}$ is $-e^{-x}$ (chain rule on the exponential):

u'(x) = 0 + (-e^{-x}) = -e^{-x}

Step 4. Substitute $u'$ and $u$ in the expression of $\sigma'$ :

\sigma'(x) = -\dfrac{-e^{-x}}{(1 + e^{-x})^2} = \dfrac{e^{-x}}{(1 + e^{-x})^2}

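Step 5. Split the fraction into a product of two blocks:

\sigma'(x) = \dfrac{1}{1 + e^{-x}} \cdot \dfrac{e^{-x}}{1 + e^{-x}}

The first one is $\sigma(x)$ :

\dfrac{1}{1 + e^{-x}} = \sigma(x)

The second one is rewritten by adding and subtracting $1$ in the numerator: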

\dfrac{e^{-x}}{1 + e^{-x}} = \dfrac{(1 + e^{-x}) - 1}{1 + e^{-x}} = 1 - \dfrac{1}{1 + e^{-x}} = 1 - \sigma(x)

Step 6. Putting it back together:

\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x)) \qquad \square
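A quick numerical sanity check of the result, as a sketch: the closed form is compared with a central finite difference at a few arbitrary points. The step `h=1e-5` and the test points are assumptions of this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def closed_form(x):
    """sigma'(x) = sigma(x) * (1 - sigma(x)), the identity just proven."""
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x, h=1e-5):
    """Numerical derivative (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

xs = [-4.0, -1.0, 0.0, 0.5, 3.0]
max_gap = max(abs(closed_form(x) - central_difference(sigmoid, x)) for x in xs)
print("largest gap between closed form and finite difference:", max_gap)
```

Agreement at a handful of points is not a proof, but a mismatch would immediately reveal an algebra slip in the derivation.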

Solution to exercise 2: computation at two values

We use the formula proven in exercise 1: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ .

Computing $\sigma'(0)$ .

First $\sigma(0)$ :

\sigma(0) = \dfrac{1}{1 + e^{0}} = \dfrac{1}{1 + 1} = \dfrac{1}{2}

So:

\sigma'(0) = \dfrac{1}{2} \times \left(1 - \dfrac{1}{2}\right) = \dfrac{1}{2} \times \dfrac{1}{2} = \dfrac{1}{4} = 0.25

This is the maximum of the sigmoid derivative.

Computing $\sigma'(2)$ .

First $\sigma(2)$ :

\sigma(2) = \dfrac{1}{1 + e^{-2}} \approx \dfrac{1}{1 + 0.135} \approx \dfrac{1}{1.135} \approx 0.881

So:

\sigma'(2) \approx 0.881 \times (1 - 0.881) \approx 0.881 \times 0.119 \approx 0.105

Comparison and interpretation.

The gradient is divided by about $2.4$ between $x = 0$ and $x = 2$ . For larger inputs, it collapses very quickly: this is the saturation of the sigmoid and the source of the vanishing-gradient problem.
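The two values, and the collapse for larger inputs, can be reproduced in a few lines. The extra evaluation points $x = 5$ and $x = 10$ are added here for illustration and are not part of the exercise.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

d0 = sigmoid_prime(0.0)
d2 = sigmoid_prime(2.0)
ratio = d0 / d2

print("sigma'(0) =", d0)
print("sigma'(2) = %.3f" % d2)
print("ratio     = %.2f" % ratio)

# Further out, the derivative keeps shrinking roughly like e^(-x).
for x in (5.0, 10.0):
    print("sigma'(%g) = %.2e" % (x, sigmoid_prime(x)))
```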

Solution to exercise 3: comparing compute costs

We count operations for a single neuron with 1000 inputs.

Common trunk: the weighted sum.

For both activations we first compute $z = \sum_{i=1}^{1000} w_i x_i + b$ . This requires:

1000 multiplications $w_i \times x_i$
999 additions for the sum
1 addition for the bias

That is around 2000 elementary operations for the weighted sum.

With ReLU.

A single extra operation: comparing $z$ with 0.

Total: around 2000 operations (the added comparison is negligible).

With sigmoid.

We must compute $1/(1 + e^{-z})$ , which requires:

1 negation: $-z$
1 exponential: $e^{-z}$
1 addition: $1 + e^{-z}$
1 division: $1/(1 + e^{-z})$

That is 4 extra operations.

Total: around 2004 operations.

The catch: not all operations cost the same.

The exponential is very expensive on commodity hardware: about 20 to 50 CPU cycles, against 1 cycle for a comparison.

For this 1000-input neuron the overhead nonetheless stays marginal: the weighted sum (about 2000 operations) dominates by far, and the exponential adds only a few percent. The difference becomes truly significant only when the activation weighs as much as the computation before it: neurons with very few inputs, or above all activations applied element-wise over large tensors, where each output costs just a handful of operations before the exponential. That is precisely the regime of modern deep networks, and that is where ReLU’s thriftiness pays off.

This is the second reason (along with the preserved gradient) for ReLU’s dominance in hidden layers.
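The order of magnitude of the overhead can be estimated with a rough cycle model. This is a back-of-the-envelope sketch under strong simplifying assumptions: every addition, multiplication, negation, division and comparison is counted as 1 cycle (real divisions cost more), and the exponential costs 20 to 50 cycles as quoted above.

```python
def neuron_cost(n_inputs, activation_cycles, cycles_per_arith=1):
    """Estimated cycles for one neuron: weighted sum plus activation."""
    # n multiplications, n - 1 additions for the sum, 1 addition for the bias.
    weighted_sum_ops = n_inputs + (n_inputs - 1) + 1
    return weighted_sum_ops * cycles_per_arith + activation_cycles

# ReLU: a single comparison.
relu_cost = neuron_cost(1000, activation_cycles=1)

overheads = []
for exp_cycles in (20, 50):
    # Sigmoid: negation + addition + division (1 cycle each, assumed) + exponential.
    sigmoid_cost = neuron_cost(1000, activation_cycles=3 + exp_cycles)
    overhead = (sigmoid_cost - relu_cost) / relu_cost
    overheads.append(overhead)
    print("exp = %d cycles: sigmoid %d vs ReLU %d cycles (+%.1f%%)"
          % (exp_cycles, sigmoid_cost, relu_cost, 100 * overhead))
```

Calling `neuron_cost` with a small `n_inputs` shows the other regime described above: with only a handful of inputs, the exponential dominates the total.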

Sources

Nair, V. & Hinton, G. E. (2010). “Rectified Linear Units Improve Restricted Boltzmann Machines.” ICML 27.
Glorot, X. & Bengio, Y. (2010). “Understanding the difficulty of training deep feedforward neural networks.” AISTATS 9, 249-256.
Cox, D. R. (1958). “The Regression Analysis of Binary Sequences.” Journal of the Royal Statistical Society 20(2), 215-242. (Historical origin of the logistic function.) DOI 10.1111/j.2517-6161.1958.tb00292.x
