Neural networks: foundations and mathematics · 03 / 09

Activation functions

Identity, sigmoid, ReLU, tanh: why they exist, what they give, and how to choose.

In chapters 1 and 2 you met the activation function $f$ in the formula $y = f(\mathbf{w} \cdot \mathbf{x} + b)$ without really spending time on it. That is the subject of this chapter. We will see why it is mathematically essential, compare the four classics (identity, sigmoid, ReLU, tanh) and learn how to choose between them.

Why we need a non-linear function

The proof that justifies everything

Consider a two-layer network with no activation function, i.e. $f$ replaced by the identity. The first layer outputs:

$$\mathbf{h} = W_1 \mathbf{x} + \mathbf{b}_1$$

The second layer applied to $\mathbf{h}$ gives:

$$\mathbf{y} = W_2 \mathbf{h} + \mathbf{b}_2 = W_2 (W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (W_2 W_1) \mathbf{x} + (W_2 \mathbf{b}_1 + \mathbf{b}_2)$$

Setting $W' = W_2 W_1$ and $\mathbf{b}' = W_2 \mathbf{b}_1 + \mathbf{b}_2$, we recover:

$$\mathbf{y} = W' \mathbf{x} + \mathbf{b}'$$

That is a single affine layer. Stacking layers without non-linearity is mathematically pointless: the whole thing reduces to one equivalent layer. This is the raison d'être of the activation function.

The same argument repeats at every transition between layers. That is why we insert $f$ after every hidden layer, and not only at the output: if a single pair of adjacent layers lacks a non-linearity between them, those two layers collapse into one and depth loses its point again.
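The collapse can also be checked numerically. The sketch below uses small hand-picked matrices (purely illustrative values, not taken from the course) and compares two stacked identity layers with the single equivalent layer $W' = W_2 W_1$, $\mathbf{b}' = W_2 \mathbf{b}_1 + \mathbf{b}_2$. As a contrast, it also inserts a ReLU between the two layers, which makes the output differ from the affine map for this input.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B, both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def vadd(u, v):
    """Add two vectors component by component."""
    return [a + b for a, b in zip(u, v)]

# Illustrative weights: 2 inputs -> 3 hidden units -> 2 outputs.
W1 = [[1.0, -2.0], [0.5, 3.0], [-1.0, 1.0]]
b1 = [0.1, -0.2, 0.3]
W2 = [[2.0, 0.0, -1.0], [1.0, 1.0, 1.0]]
b2 = [0.5, -0.5]
x = [0.7, -1.3]

# Two layers applied in sequence, with f = identity.
h = vadd(matvec(W1, x), b1)
y_two_layers = vadd(matvec(W2, h), b2)

# The single equivalent layer: W' = W2 W1, b' = W2 b1 + b2.
W_prime = matmul(W2, W1)
b_prime = vadd(matvec(W2, b1), b2)
y_one_layer = vadd(matvec(W_prime, x), b_prime)

max_diff = max(abs(a - b) for a, b in zip(y_two_layers, y_one_layer))
print("identity:", y_two_layers, "equivalent layer:", y_one_layer)

# Same weights, but with a ReLU between the layers.
h_relu = [max(0.0, z) for z in h]
y_relu = vadd(matvec(W2, h_relu), b2)
print("with ReLU:", y_relu)
```

The two identity-based outputs agree up to floating-point rounding, while the ReLU version lands somewhere else: the non-linearity is what keeps the second layer from being absorbed into the first.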

What we expect from an activation function

To be usable in practice, a function $f$ has to tick several boxes:

  1. Non-linear (for the reason we just saw).
  2. Differentiable almost everywhere, so backpropagation can compute a gradient (chapter 8).
  3. Fast to compute, because we evaluate it millions of times per second during training.
  4. Bounded (ideally), to keep outputs from blowing up.
  5. Non-zero gradient over a wide range, otherwise training stalls.

The four classics we are about to study tick those boxes to varying degrees.

The four classical functions

Identity

The identity is the function that does nothing. We include it for completeness and because we use it on the output layer for regression problems.

[Figure: graph of the identity function.]

Equation: $f(x) = x$. Derivative: $f'(x) = 1$.

You can safely ignore this function as long as we are inside hidden layers, but it returns in chapter 6 when we discuss loss functions for regression.

Sigmoid

The sigmoid dominated neural networks between 1986 and 2010. Its S-shape squashes any real number into the open interval $(0, 1)$, which lets us read its output as a probability.

[Figure: graph of the sigmoid function.]

Equation: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$. Derivative: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, an elegant property that simplifies backpropagation.

Shape: S-curve, $\sigma(0) = 0.5$, $\sigma(x) \to 0$ as $x \to -\infty$, $\sigma(x) \to 1$ as $x \to +\infty$.

Modern usage: output layer for binary classification (where the probability has a meaning). Rarely used in deep hidden layers because of the vanishing gradient (next section).
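As a minimal sketch, here is the sigmoid and its derivative in plain Python. The two-branch form is an implementation choice of this example, not something the formula requires: it avoids calling `math.exp` on a large positive argument, which would raise an `OverflowError`.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written so that exp is only called on non-positive values."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0 here, so exp(x) lies in (0, 1) and cannot overflow
    return z / (1.0 + z)

def sigmoid_prime(x):
    """Derivative, using the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-5.0, 0.0, 2.0, 5.0):
    print(x, sigmoid(x), sigmoid_prime(x))
```

Note that the derivative reuses the value of $\sigma(x)$ already computed in the forward pass, which is the practical benefit of the identity above.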

ReLU

ReLU is today the default choice for hidden layers. Its brutal simplicity is precisely what makes it efficient.

[Figure: graph of the ReLU function.]

Equation: $\text{ReLU}(x) = \max(0, x)$. Derivative: $\text{ReLU}'(x) = \mathbb{1}[x > 0]$, i.e. $1$ for $x > 0$ and $0$ for $x < 0$. The derivative is strictly undefined at $x = 0$, but we conventionally set it to $0$ or $1$ with no practical consequence.

Shape: zero on negatives, identity on positives. Not bounded above.

Modern usage: since its introduction for RBMs by Nair and Hinton (2010), ReLU and its variants (Leaky ReLU, ELU, GELU) have become the dominant choice for hidden layers in deep networks. It computes in one operation, its gradient is exact (1 or 0), and it does not saturate upwards. Drawback: if a neuron’s input stays negative, its gradient is zero and backpropagation (chapter 8) can no longer update its weights. The neuron freezes. This is the dying ReLU problem, mitigated by the variants.

Watch a neuron die

The component below is a single-input ReLU neuron. Its output $\text{ReLU}(w \cdot x + b)$ is drawn in orange. Twelve dataset points are placed on the curve: green if the neuron fires on them, red if not. Push the bias toward very negative values: all points flip red. At that stage the neuron stops learning, because its gradient is zero everywhere.

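[Interactive component: single-input ReLU neuron plotted against $x$. Example state: 7 of 12 points active, 5 dead, status "Selective".]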

Push the bias down to drive the neuron to death. When the ReLU output is zero on every sample, so is its gradient, and the neuron stops learning.

Three things to try:

  • Set the bias to $-2$ with a moderate positive weight. The neuron dies completely on the dataset.
  • Reset the bias to $0$ and flip the sign of the weight. The activation boundary pivots around the origin.
  • Find a configuration where exactly half of the points are active. That is typically a good starting point for training.
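The same experiments can be sketched in code. The dataset below (twelve evenly spaced points) and the toy objective (the sum of the neuron's outputs) are assumptions of this example, chosen only to make the gradients easy to read; they are not the data behind the component.

```python
def relu_grad(z):
    """Derivative of ReLU, using the convention ReLU'(0) = 0."""
    return 1.0 if z > 0 else 0.0

def active_count(w, b, xs):
    """Number of samples on which the neuron ReLU(w * x + b) fires."""
    return sum(1 for x in xs if w * x + b > 0)

def neuron_gradients(w, b, xs):
    """Gradients of the toy objective sum_i ReLU(w * x_i + b) with respect to w and b."""
    grad_w = sum(relu_grad(w * x + b) * x for x in xs)
    grad_b = sum(relu_grad(w * x + b) for x in xs)
    return grad_w, grad_b

# Hypothetical dataset: twelve points from -1.1 to 1.1 in steps of 0.2.
xs = [(2 * i - 11) / 10 for i in range(12)]

for w, b in [(1.0, 0.0), (1.0, -2.0), (-1.0, 0.0)]:
    print(f"w={w}, b={b}: {active_count(w, b, xs)} / 12 active, "
          f"gradients={neuron_gradients(w, b, xs)}")
```

With $w = 1$ and $b = -2$ every pre-activation is negative on this dataset, so both gradients are exactly zero: no update rule based on them can move the weights, which is what "dead" means here.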

Tanh

The hyperbolic tangent is the centred cousin of the sigmoid. Same S-shape, but it squashes into $(-1, 1)$ instead of $(0, 1)$.

[Figure: graph of the tanh function.]

Equation: $\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$. Derivative: $\tanh'(x) = 1 - \tanh^2(x)$.

Shape: S-curve, $\tanh(0) = 0$, $\tanh(x) \to -1$ as $x \to -\infty$, $\tanh(x) \to +1$ as $x \to +\infty$.

Modern usage: favoured when you want a zero-centred output (statistically preferable for training). Often used in classical RNNs (LSTM, GRU) and in some attention layers.

Recap table

A single overview to memorise the four classical functions: shape, derivative, output range, and computational cost.

| Function | Definition | Derivative | Range | Relative cost |
|---|---|---|---|---|
| Identity | $f(x) = x$ | $f'(x) = 1$ | $\mathbb{R}$ | 1× (baseline) |
| Sigmoid | $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ | $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ | $(0, 1)$ | ~10× (one exp) |
| Tanh | $\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$ | $\tanh'(x) = 1 - \tanh^2(x)$ | $(-1, 1)$ | ~10× (one exp in practice) |
| ReLU | $\text{ReLU}(x) = \max(0, x)$ | $\text{ReLU}'(x) = \mathbb{1}[x > 0]$ | $[0, +\infty)$ | 1× (one test) |

Three observations to keep in mind: (1) only ReLU and Identity match the low cost of the dot product that feeds them; (2) only sigmoid and tanh saturate and therefore trigger vanishing gradient; (3) only ReLU can die, which the modern variants below fix.

Relative costs are orders of magnitude: the exact value depends on hardware (CPU vs GPU, FP32 vs FP16, presence of vector instructions or lookup tables for the exponential). On a CPU without hardware-accelerated exponential, the figures above hold; on a modern FP16 GPU, the gap between ReLU and sigmoid narrows significantly. Practical note: tanh is usually computed as $\tanh(x) = 1 - 2 / (e^{2x} + 1)$, which requires only one exponential despite the defining formula seemingly asking for two.
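The one-exponential form of tanh is easy to check against the defining formula and against the standard library. This is a sketch for moderate inputs only: for very large $x$, `math.exp(2.0 * x)` overflows, and handling that case is left out here.

```python
import math

def tanh_one_exp(x):
    """tanh(x) = 1 - 2 / (e^(2x) + 1): a single exponential."""
    return 1.0 - 2.0 / (math.exp(2.0 * x) + 1.0)

def tanh_definition(x):
    """tanh(x) from its defining formula: two exponentials."""
    ep, em = math.exp(x), math.exp(-x)
    return (ep - em) / (ep + em)

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, tanh_one_exp(x), tanh_definition(x), math.tanh(x))
```

The three columns agree up to floating-point rounding on this range.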

The modern ReLU family

ReLU has been the default since 2012, but state-of-the-art architectures (transformers, foundation models, diffusion) have adopted smoother variants that solve the dying ReLU problem while preserving most of its advantages.

Leaky ReLU

[Figure: graph of the Leaky ReLU function.]

Equation: $\text{LeakyReLU}(x) = \max(\alpha x, x)$, with a small coefficient $\alpha$ typically set to $0.01$. Derivative: $\alpha$ on the negative side, $1$ on the positive side.

Instead of clamping the output to zero for negative inputs, we let a small slope $\alpha$ through. The neuron keeps a non-zero gradient on both sides and can no longer die. Introduced by Maas, Hannun and Ng (2013) for speech recognition.
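A minimal sketch of Leaky ReLU and its derivative follows. The `max` form assumes $0 < \alpha < 1$, and the value returned at exactly $x = 0$ is a convention picked for this example.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU as max(alpha * x, x); valid for 0 < alpha < 1."""
    return max(alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    """Derivative: 1 on the positive side, alpha otherwise (convention at x = 0)."""
    return 1.0 if x > 0 else alpha

for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Unlike plain ReLU, the gradient never becomes exactly zero, so a neuron whose inputs are all negative still receives a small update signal.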

GELU

[Figure: graph of the GELU function.]

Equation: $\text{GELU}(x) = x \cdot \Phi(x)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution. Derivative: $\Phi(x) + x \cdot \phi(x)$, where $\phi$ is the standard Gaussian density. It is the activation of late-2010s transformers (BERT, GPT-2, GPT-3) and remains widely used in contemporary transformers. More recent proprietary architectures (GPT-4, Claude) do not publish their activation choice.

Geometrically, GELU behaves almost like ReLU far from zero, but it is smooth everywhere (and therefore differentiable everywhere, unlike ReLU). Introduced by Hendrycks and Gimpel (2016).
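The exact GELU can be sketched with the standard library alone, writing $\Phi$ through the error function: $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x / \sqrt{2})\right)$. Deep learning libraries often offer a faster approximation as well; only the exact form is shown here.

```python
import math

def std_normal_cdf(x):
    """Phi(x), the CDF of the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    """phi(x), the density of the standard normal."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x):
    """GELU(x) = x * Phi(x)."""
    return x * std_normal_cdf(x)

def gelu_prime(x):
    """GELU'(x) = Phi(x) + x * phi(x)."""
    return std_normal_cdf(x) + x * std_normal_pdf(x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, gelu(x), max(0.0, x), gelu_prime(x))
```

Printing ReLU next to GELU shows the "almost like ReLU far from zero" behaviour: at $x = \pm 5$ the two values are nearly identical, while around zero GELU bends smoothly.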

SiLU (Swish)

[Figure: graph of the SiLU / Swish function.]

Equation: $\text{SiLU}(x) = x \cdot \sigma(x)$, where $\sigma$ is the sigmoid. Derivative: $\sigma(x) + x \cdot \sigma(x)(1 - \sigma(x))$. It is the same function as the “Swish” of Ramachandran et al. (2017) with its parameter $\beta$ fixed to $1$. It is the inner activation of SwiGLU, the block used in Llama (all generations), Mistral, Mixtral, PaLM and most recent open-weight LLMs. It also shows up directly in EfficientNet.

Shape very close to GELU, slightly cheaper to compute (no error function, just a sigmoid). Independently described by Elfwing, Uchibe and Doya (2018) under the name SiLU.
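A short sketch of SiLU and its derivative, with a finite-difference check of the derivative formula. The plain sigmoid used here is fine for moderate inputs; a production version would guard against overflow for very negative $x$.

```python
import math

def sigmoid(x):
    """Plain logistic sigmoid (adequate for moderate |x|)."""
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    """SiLU(x) = x * sigma(x)."""
    return x * sigmoid(x)

def silu_prime(x):
    """SiLU'(x) = sigma(x) + x * sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

h = 1e-5  # step for the central finite difference
for x in (-2.0, 0.0, 1.0, 3.0):
    numeric = (silu(x + h) - silu(x - h)) / (2.0 * h)
    print(x, silu(x), silu_prime(x), numeric)
```

The last two columns should match to several decimal places, which confirms the closed-form derivative.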

ELU

[Figure: graph of the ELU function.]

Equation: $\text{ELU}(x) = x$ if $x \geq 0$, else $\alpha (e^x - 1)$, with $\alpha$ typically equal to $1$. A smooth shape on the negative side that asymptotes to $-\alpha$. Derivative: $1$ on the positive side, $\alpha e^x$ on the negative side.

More expensive than ReLU (an exponential on the negative branch only) but centres the output around zero, which helps convergence. Introduced by Clevert, Unterthiner and Hochreiter (2015).
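Here is a minimal ELU sketch. Using `math.expm1` (which computes $e^x - 1$ accurately for small $|x|$) instead of `math.exp(x) - 1` is a choice made for this example; both follow the same formula.

```python
import math

def elu(x, alpha=1.0):
    """ELU(x) = x if x >= 0, else alpha * (e^x - 1)."""
    return x if x >= 0 else alpha * math.expm1(x)

def elu_prime(x, alpha=1.0):
    """Derivative: 1 on the positive side, alpha * e^x on the negative side."""
    return 1.0 if x >= 0 else alpha * math.exp(x)

for x in (-50.0, -1.0, 0.0, 2.0):
    print(x, elu(x), elu_prime(x))
```

The first row illustrates the asymptote: for a very negative input the output settles at $-\alpha$ and the derivative shrinks toward zero, so ELU saturates on the negative side even though it never clamps to exactly zero.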

In 2026, the trio ReLU / GELU / SiLU covers the overwhelming majority of architectures deployed in production. Practical rule: ReLU for raw simplicity and speed, GELU for transformers, SiLU for recent dense architectures.

Play with the functions

The idea of an activation function is easier to see than to describe. The component below is a full workshop: you can enable or disable each function individually (click its label), display the derivatives as dashed lines, and enable a cursor that places a vertical line and reveals the local tangent at the hovered point for each curve (the slope of that tangent is exactly $f'(x)$). A collapsible panel at the bottom also lets you tune the $x$ domain if you want to zoom in or out.

[Interactive component: plot of identity, sigmoid, ReLU and tanh with their derivatives, an adjustable $x$ domain, and a readout of $f(x)$ and $f'(x)$ for each function at the cursor position.]

The dashed segments around each point are the local tangents. A flat tangent means a weak gradient (slow learning); a steep tangent means a strong gradient.

Five experiments to try:

  • At $x = 0$, check that $\sigma(0) = 0.5$ and $\sigma'(0) = 0.25$. This is where the sigmoid tangent is steepest.
  • At $x = 2$, see that $\sigma'(2)$ has already fallen to about $0.105$. The sigmoid saturates fast and its tangent is nearly flat.
  • At $x = -3$, ReLU and its derivative are strictly zero. On that branch no gradient flows back.
  • Toggle off Identity and Tanh to compare only sigmoid and ReLU: the contrast between saturation and piecewise linearity becomes obvious.
  • Compare the tangent slope of sigmoid at the centre (max $0.25$) with ReLU’s for $x > 0$ (always exactly $1$). That is the root of the vanishing gradient.
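Without the component, the tangent slopes can be approximated with a central finite difference. The helper name `slope` and the step size are choices made for this sketch; the point $x = 0$ is skipped for ReLU, where the function has a kink and the approximation is not meaningful.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def identity(x):
    return x

def slope(f, x, h=1e-5):
    """Central finite difference approximating the tangent slope f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

FUNCTIONS = {"identity": identity, "sigmoid": sigmoid, "relu": relu, "tanh": math.tanh}

for name, f in FUNCTIONS.items():
    slopes = [round(slope(f, x), 3) for x in (-3.0, 0.5, 2.0)]
    print(f"{name:8s} slopes at x = -3, 0.5, 2: {slopes}")
```

Reading the rows side by side reproduces the experiments above: the sigmoid slope is small everywhere and shrinks quickly away from zero, while ReLU is exactly $0$ on the left and $1$ on the right.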

The vanishing gradient problem

When we differentiate the sigmoid, we get $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. The maximum is reached at $x = 0$ and equals $0.25$. So at every layer crossed, the activation's derivative multiplies the gradient by a factor of at most $0.25$.

For a 10-layer network using sigmoid, the gradient at the first layer is multiplied by at most $0.25^{10} \approx 9.5 \times 10^{-7}$. That is extremely small: the first layer no longer learns. This is the vanishing gradient problem, first observed by Hochreiter in his thesis (1991), analysed for recurrent networks by Bengio, Simard and Frasconi (1994), then quantified and addressed for deep feedforward networks by Glorot and Bengio (2010).

ReLU largely solves this problem: on its active part, the gradient is exactly $1$. Multiplying by $1$ does not shrink the gradient. This is one of the two reasons (with its computational speed) for its modern dominance.

Watch the gradient collapse

The component below simulates a deep network. Change the number of layers and switch the activation. Each bar represents the effective gradient at a given layer, top to bottom from output to input. On sigmoid you see the bars shrink visibly as depth grows. On ReLU they keep their length.

[Interactive component: one gradient bar per layer. Example state with 8 sigmoid layers: the effective gradient is 1.000 at layer 8, then 0.250, 0.063, 0.016, 3.91e-3, 9.77e-4 and 2.44e-4, down to 6.10e-5 at layer 1. Effective gradient at the first layer: 6.10e-5.]

The output layer receives a reference gradient of 1. Each layer crossed back toward the input multiplies it by the maximum derivative of the chosen function: sigmoid 0.25, tanh 1, ReLU 1.

The key experiment: push to $15$ layers with sigmoid and look at the first-layer gradient. It is on the order of $10^{-9}$, totally insufficient to update a weight. Now switch to ReLU and watch the bars all match again. That is the exact technical reason why sigmoid was dropped from hidden layers in favour of ReLU starting in 2012.
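The rule behind the bars fits in a few lines. This sketch follows the simplified model stated above (a reference gradient of 1 at the output, multiplied by the maximum derivative at each layer crossed); it is a best-case bound that ignores the weights and the actual pre-activations. The function name `gradient_bars` is made up for this example.

```python
MAX_DERIVATIVE = {"sigmoid": 0.25, "tanh": 1.0, "relu": 1.0}

def gradient_bars(n_layers, activation):
    """Best-case gradient at each layer, listed from the output layer down to layer 1."""
    factor = MAX_DERIVATIVE[activation]
    return [factor ** k for k in range(n_layers)]

for layer, g in zip(range(8, 0, -1), gradient_bars(8, "sigmoid")):
    print(f"Layer {layer:02d}: {g:.3g}")

print("15 sigmoid layers, first layer:", gradient_bars(15, "sigmoid")[-1])
print("15 ReLU layers, first layer:   ", gradient_bars(15, "relu")[-1])
```

With 8 sigmoid layers the first layer ends at $0.25^7 \approx 6.10 \times 10^{-5}$; with 15 layers it drops to $0.25^{14} \approx 3.7 \times 10^{-9}$, while the ReLU bound stays at $1$.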

How to choose in practice

A simple heuristic that works in the vast majority of cases:

| Layer | Default choice | Variants |
|---|---|---|
| Hidden layers | ReLU | Leaky ReLU, ELU, GELU for dying-ReLU cases |
| Binary classification output | Sigmoid | (none) |
| Multi-class classification output | Softmax (chapter 6) | (none) |
| Regression output | Identity | (none) |
| Recurrent layers (RNN, LSTM, GRU) | Tanh + sigmoid on gates | (none) |

This table is a heuristic, not dogma. Specific architectures (transformers, GANs, diffusion models) use other functions (GELU, Swish, Mish). But for a standard network, ReLU everywhere except output is an excellent starting point.
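To make the "ReLU everywhere except output" rule concrete, here is a small forward-pass sketch in plain Python. Everything in it is illustrative: the function names, the `task` labels and the tiny weights are assumptions of this example, and softmax is left out because it is only introduced in chapter 6.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def identity(z):
    return z

# Output activation chosen by task, following the table above.
OUTPUT_ACTIVATION = {"binary_classification": sigmoid, "regression": identity}

def dense(W, b, x, f):
    """One layer: apply f to each component of W x + b."""
    return [f(sum(w * xi for w, xi in zip(row, x)) + bi) for row, bi in zip(W, b)]

def forward(x, hidden_layers, output_layer, task):
    """ReLU on every hidden layer, task-dependent activation on the output."""
    h = x
    for W, b in hidden_layers:
        h = dense(W, b, h, relu)
    W, b = output_layer
    return dense(W, b, h, OUTPUT_ACTIVATION[task])

# Made-up weights: 2 inputs -> 2 hidden units -> 1 output.
hidden = [([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0])]
output = ([[1.0, -2.0]], [0.5])
x = [2.0, 1.0]

print(forward(x, hidden, output, "regression"))
print(forward(x, hidden, output, "binary_classification"))
```

The same weights give a raw real number for regression and a value in $(0, 1)$ for binary classification; only the last activation changes.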

In one sentence

The activation function is what stops a deep network from collapsing into a linear regression. Sigmoid and tanh ruled until 2010, then ReLU took over the hidden layers from 2012 on. The choice depends on context but rarely turns out to be tricky in practice.

On to chapter 4

You now have every ingredient to understand the learning machine: vector inputs (chap. 2), weighted sum, bias and activation function (this chapter). Chapter 4 introduces the perceptron, the first neuron that adjusts its weights on its own from examples. It is the birth of machine learning as we know it.

A subtlety you will explore: this chapter has just established that an activation function needs a usable, non-zero gradient for training to progress. Yet the perceptron uses the Heaviside step function, which is differentiable almost everywhere but with a derivative of zero. How does it learn at all, without a gradient? That is precisely the question opening the next chapter.

Exercises

Exercise 1: differentiate the sigmoid

Starting from the definition $\sigma(x) = \dfrac{1}{1 + e^{-x}}$, prove that $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.

Exercise 2: compute a derivative at a specific value

Compute $\sigma'(0)$, then $\sigma'(2)$. Compare those two values. What does this tell you about the behaviour of the gradient for inputs far from zero?

Exercise 3: compare the speed

Consider a neuron taking 1000 inputs. How many elementary arithmetic operations (additions, multiplications, exponentials) are needed to compute the output of a neuron with ReLU activation? With sigmoid activation? Does the difference look marginal or significant when multiplied by millions of evaluations?

Sources

  • Nair, V. & Hinton, G. E. (2010). “Rectified Linear Units Improve Restricted Boltzmann Machines.” ICML 27.
  • Glorot, X. & Bengio, Y. (2010). “Understanding the difficulty of training deep feedforward neural networks.” AISTATS 9, 249-256.
  • Cox, D. R. (1958). “The Regression Analysis of Binary Sequences.” Journal of the Royal Statistical Society 20(2), 215-242. (Historical origin of the logistic function.) DOI 10.1111/j.2517-6161.1958.tb00292.x

Further reading

  • Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. Section 6.3 on activation functions and their variants. deeplearningbook.org
  • Maas, A. L., Hannun, A. Y. & Ng, A. Y. (2013). “Rectifier Nonlinearities Improve Neural Network Acoustic Models.” ICML 30, Workshop on Deep Learning for Audio, Speech and Language Processing. (Introduction of Leaky ReLU.)
  • Hendrycks, D. & Gimpel, K. (2016). “Gaussian Error Linear Units (GELUs).” arXiv. arXiv 1606.08415
  • Ramachandran, P., Zoph, B. & Le, Q. V. (2017). “Searching for Activation Functions.” arXiv. (Automated search for activations, introduces the Swish family, equivalent to SiLU.) arXiv 1710.05941
  • Elfwing, S., Uchibe, E. & Doya, K. (2018). “Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning.” Neural Networks 107, 3-11. (Original description of SiLU.) arXiv 1702.03118
  • Clevert, D.-A., Unterthiner, T. & Hochreiter, S. (2015). “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs).” arXiv. arXiv 1511.07289

Quiz

  1. Why is a non-linear activation function essential in a deep network?
  2. What is the derivative of the sigmoid $\sigma(x)$?
  3. Why has ReLU dominated hidden layers of deep networks since 2012?
  4. For a binary classification, which activation function is typically used at the output?
  5. For a 10-layer deep network using sigmoid everywhere, what can happen during training?