Activation functions
Identity, sigmoid, ReLU, tanh: why they exist, what they give, and how to choose.
In chapters 1 and 2 you met the activation function in the neuron's formula without really spending time on it. It is the subject of this chapter. We will see why it is mathematically essential, compare the four classics (identity, sigmoid, ReLU, tanh) and learn how to choose between them.
Why we need a non-linear function
The proof that justifies everything
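Consider a two-layer network with no activation function, i.e. with the activation replaced by the identity. The first layer outputs: $h = W_1 x + b_1$
The second layer applied to $h$ gives: $y = W_2 h + b_2 = W_2 W_1 x + W_2 b_1 + b_2$
Setting $W = W_2 W_1$ and $b = W_2 b_1 + b_2$, we recover: $y = W x + b$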
That is a single affine layer. Stacking layers without non-linearity is mathematically pointless: the whole thing reduces to one equivalent layer. This is the raison d'être of the activation function.
The same argument repeats at every transition between layers. That is why we insert an activation function after every hidden layer, and not only at the output: if a single pair of adjacent layers lacks a non-linearity between them, those two layers collapse into one and depth loses its point again.
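The collapse can be checked numerically. The sketch below is only an illustration: the names `W1`, `b1`, `W2`, `b2` and their values are arbitrary choices made here, not taken from the chapter. It compares a two-layer network without activation against the single equivalent affine layer.

```python
# Two affine layers with no activation versus one equivalent affine layer.
# All weights and biases below are arbitrary illustrative values.

def matvec(W, x):
    """Matrix-vector product, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product, with both matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # first layer, 3x2
b1 = [0.5, -0.5, 1.0]
W2 = [[2.0, 0.0, 1.0], [-1.0, 1.0, 0.0]]    # second layer, 2x3
b2 = [0.0, 2.0]

def two_layers(x):
    h = add(matvec(W1, x), b1)     # first layer, identity activation
    return add(matvec(W2, h), b2)  # second layer

# Equivalent single layer: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

x = [1.5, -2.0]
print(two_layers(x), one_layer(x))  # the two outputs coincide
```

Inserting any non-linear function between the two layers breaks this equivalence, which is the whole point of the activation.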
What we expect from an activation function
To be usable in practice, a function has to tick several boxes:
- Non-linear (for the reason we just saw).
- Differentiable almost everywhere, so backpropagation can compute a gradient (chapter 8).
- Fast to compute, because we evaluate it millions of times per second during training.
- Bounded (ideally), to keep outputs from blowing up.
- Non-zero gradient over a wide range, otherwise training stalls.
The four classics we are about to study tick those boxes to varying degrees.
The four classical functions
Identity
The identity is the function that does nothing. We include it for completeness and because we use it on the output layer for regression problems.
Equation: $f(x) = x$. Derivative: $f'(x) = 1$.
You can safely ignore this function as long as we are inside hidden layers, but it returns in chapter 6 when we discuss loss functions for regression.
Sigmoid
The sigmoid dominated neural networks between 1986 and 2010. Its S-shape squashes any real number into the open interval $(0, 1)$, which lets us read its output as a probability.
Equation: $\sigma(x) = \frac{1}{1 + e^{-x}}$. Derivative: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$, an elegant property that simplifies backpropagation.
Shape: S-curve, $\sigma(0) = 0.5$, $\lim_{x \to -\infty} \sigma(x) = 0$, $\lim_{x \to +\infty} \sigma(x) = 1$.
Modern usage: output layer for binary classification (where the probability has a meaning). Rarely used in deep hidden layers because of the vanishing gradient (next section).
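As a minimal sketch (plain Python, standard library only), the sigmoid and its derivative can be written as follows. The two-branch form is a common trick to avoid overflow in the exponential for large negative inputs; the finite-difference helper is added here only to check the derivative formula numerically.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written so that exp() never receives a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_prime(x):
    """Derivative expressed through the sigmoid itself: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def finite_difference(f, x, h=1e-6):
    """Central finite difference, used only as a numerical sanity check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-4.0, 0.0, 4.0):
    print(x, sigmoid(x), sigmoid_prime(x), finite_difference(sigmoid, x))
```

The analytic derivative and the finite difference should agree to many decimal places, and the derivative is largest at zero.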
ReLU
ReLU is today the default choice for hidden layers. Its brutal simplicity is precisely what makes it efficient.
Equation: $\mathrm{ReLU}(x) = \max(0, x)$. Derivative: $\mathrm{ReLU}'(x) = \mathbb{1}_{x > 0}$, i.e. $1$ for $x > 0$ and $0$ otherwise. Strictly undefined at $x = 0$, but we conventionally set it to $0$ or $1$ with no practical consequence.
Shape: zero on negatives, identity on positives. Not bounded above.
Modern usage: since its introduction for RBMs by Nair and Hinton (2010), ReLU and its variants (Leaky ReLU, ELU, GELU) have become the dominant choice for hidden layers in deep networks. It computes in one operation, its gradient is exact (1 or 0), and it does not saturate upwards. Drawback: if a neuron’s input stays negative, its gradient is zero and backpropagation (chapter 8) can no longer update its weights. The neuron freezes. This is the dying ReLU problem, mitigated by the variants.
Watch a neuron die
The component below is a single-input ReLU neuron. Its output is drawn in orange. Twelve dataset points are placed on the curve: green if the neuron fires on them, red if not. Push the bias toward very negative values: all points flip red. At that stage the neuron stops learning, because its gradient is zero everywhere.
Push the bias down to drive the neuron to death. When the ReLU output is zero on every sample, so is its gradient, and the neuron stops learning.
Three things to try:
- Set the bias to a strongly negative value with a moderate positive weight. The neuron dies completely on the dataset.
- Reset the bias to $0$ and flip the sign of the weight. The activation boundary pivots around the origin.
- Find a configuration where exactly half of the points are active. That is typically a good starting point for training.
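The same experiments can be reproduced offline. The sketch below assumes a hypothetical dataset of twelve evenly spaced inputs (the actual points of the component are not given in the text) and a single-input neuron with weight `w` and bias `b`; the helper names are chosen here for illustration.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU, using the convention relu'(0) = 0."""
    return 1.0 if z > 0 else 0.0

# Hypothetical dataset: twelve inputs from -3.0 to 2.5 in steps of 0.5.
xs = [-3.0 + 0.5 * i for i in range(12)]

def active_fraction(w, b):
    """Share of dataset points on which the neuron fires (output > 0)."""
    return sum(relu_grad(w * x + b) for x in xs) / len(xs)

def weight_gradient(w, b):
    """Gradient of the summed output with respect to w: sum of relu'(z) * x."""
    return sum(relu_grad(w * x + b) * x for x in xs)

print(active_fraction(1.0, -10.0), weight_gradient(1.0, -10.0))  # dead neuron
print(active_fraction(1.0, 0.25))   # half of the points active
print(active_fraction(-1.0, 0.0))   # flipped weight, zero bias
```

With a very negative bias every pre-activation is negative, so both the active fraction and the weight gradient are zero: no update can bring the neuron back.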
Tanh
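The hyperbolic tangent is the centred cousin of the sigmoid. Same S-shape, but it squashes into $(-1, 1)$ instead of $(0, 1)$.
Equation: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. Derivative: $\tanh'(x) = 1 - \tanh^2(x)$.
Shape: S-curve, $\tanh(0) = 0$, $\lim_{x \to -\infty} \tanh(x) = -1$, $\lim_{x \to +\infty} \tanh(x) = 1$.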
Modern usage: favoured when you want a zero-centred output (statistically preferable for training). Often used in classical RNNs (LSTM, GRU) and in some attention layers.
Recap table
A single overview to memorise the four classical functions: shape, derivative, output range, and computational cost.
| Function | Definition | Derivative | Range | Relative cost |
|---|---|---|---|---|
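| Identity | $f(x) = x$ | $1$ | $(-\infty, +\infty)$ | 1× (baseline) |
| Sigmoid | $\sigma(x) = \frac{1}{1 + e^{-x}}$ | $\sigma(x)\,(1 - \sigma(x))$ | $(0, 1)$ | ~10× (one exp) |
| Tanh | $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ | $1 - \tanh^2(x)$ | $(-1, 1)$ | ~10× (one exp in practice) |
| ReLU | $\max(0, x)$ | $1$ if $x > 0$, else $0$ | $[0, +\infty)$ | 1× (one test) |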
Three observations to keep in mind: (1) only ReLU and Identity match the low cost of the dot product that feeds them; (2) only sigmoid and tanh saturate and therefore trigger vanishing gradient; (3) only ReLU can die, which the modern variants below fix.
Relative costs are orders of magnitude: the exact value depends on hardware (CPU vs GPU, FP32 vs FP16, presence of vector instructions or lookup tables for the exponential). On a CPU without hardware-accelerated exponential, the figures above hold; on a modern FP16 GPU, the gap between ReLU and sigmoid narrows significantly. Practical note: tanh can be computed as $\tanh(x) = 2\sigma(2x) - 1$, which requires only one exponential even though the defining formula seems to ask for two.
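One way to evaluate tanh with a single exponential is the identity $\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}$. The sketch below is an illustration of that identity, not a claim about how any particular library implements tanh; the function name `tanh_one_exp` is made up here.

```python
import math

def tanh_one_exp(x):
    """tanh computed with one call to exp(), choosing the sign so exp() never overflows."""
    if x >= 0:
        e = math.exp(-2.0 * x)
        return (1.0 - e) / (1.0 + e)
    e = math.exp(2.0 * x)
    return (e - 1.0) / (e + 1.0)

# Compare against the standard library on a few inputs.
for x in (-20.0, -1.0, 0.0, 1.0, 20.0):
    print(x, tanh_one_exp(x), math.tanh(x))
```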
The modern ReLU family
ReLU has been the default since 2012, but state-of-the-art architectures (transformers, foundation models, diffusion) have adopted smoother variants that solve the dying ReLU problem while preserving most of its advantages.
Leaky ReLU
Equation: $f(x) = \max(\alpha x, x)$, with a small coefficient $\alpha$ typically set to $0.01$. Derivative: $\alpha$ on the negative side, $1$ on the positive side.
Instead of clamping the output to zero for negative inputs, we let a small slope through. The neuron keeps a non-zero gradient on both sides and can no longer die. Introduced by Maas, Hannun and Ng (2013) for speech recognition.
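A minimal sketch of Leaky ReLU, assuming the common default slope of 0.01 on the negative side (the parameter name `alpha` is a choice made here). The point to notice is that the gradient never becomes exactly zero, however negative the input.

```python
def leaky_relu(x, alpha=0.01):
    """Identity on the positive side, small slope alpha on the negative side."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Slope 1 for x > 0, alpha otherwise (alpha is also used at x = 0 by convention here)."""
    return 1.0 if x > 0 else alpha

for x in (-5.0, -0.5, 0.5, 5.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```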
GELU
Equation: $\mathrm{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution. Derivative: $\mathrm{GELU}'(x) = \Phi(x) + x\,\varphi(x)$, where $\varphi$ is the Gaussian density. It is the activation of the BERT and GPT generation of transformers (BERT, GPT-2, GPT-3) and remains widely used in contemporary transformers. More recent proprietary architectures (GPT-4, Claude) do not publish their activation choice.
Geometrically, GELU behaves almost like ReLU far from zero, but it is smooth everywhere (and therefore differentiable everywhere, unlike ReLU). Introduced by Hendrycks and Gimpel (2016).
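The exact form of GELU can be sketched with the error function from the standard library. This is an illustrative implementation under the definition "input times the standard normal CDF"; deep learning frameworks often use their own kernels or a tanh-based approximation instead. The helper names are chosen here.

```python
import math

def std_normal_cdf(x):
    """Cumulative distribution function of the standard normal, via erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x):
    return x * std_normal_cdf(x)

def gelu_prime(x):
    """Product rule: CDF(x) + x * PDF(x)."""
    return std_normal_cdf(x) + x * std_normal_pdf(x)

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(x, gelu(x), gelu_prime(x))
```

Far from zero the values approach those of ReLU (output close to 0 on the left, close to x on the right), while around zero the curve is smooth.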
SiLU (Swish)
Equation: $\mathrm{SiLU}(x) = x\,\sigma(x)$, where $\sigma$ is the sigmoid. Derivative: $\mathrm{SiLU}'(x) = \sigma(x)\,\bigl(1 + x\,(1 - \sigma(x))\bigr)$. It is the same function as the “Swish” of Ramachandran et al. (2017) with its parameter $\beta$ fixed to $1$. It is the inner activation of SwiGLU, the block used in Llama (all generations), Mistral, Mixtral, PaLM and most recent open-weight LLMs. It also shows up directly in EfficientNet.
Shape very close to GELU, slightly cheaper to compute (no error function, just a sigmoid). Independently described by Elfwing, Uchibe and Doya (2018) under the name SiLU.
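A short sketch of SiLU and its derivative, with a finite-difference check of the derivative. The sigmoid is redefined locally so the block stands on its own; all names are illustrative.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def silu(x):
    """SiLU / Swish with beta = 1: the input weighted by its own sigmoid."""
    return x * sigmoid(x)

def silu_prime(x):
    """Product rule on x * s(x), using s'(x) = s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

h = 1e-6
for x in (-3.0, 0.0, 3.0):
    numeric = (silu(x + h) - silu(x - h)) / (2.0 * h)
    print(x, silu(x), silu_prime(x), numeric)
```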
ELU
Equation: $\mathrm{ELU}(x) = x$ if $x > 0$, else $\alpha\,(e^{x} - 1)$, with $\alpha$ typically equal to $1$. A smooth shape on the negative side that asymptotes to $-\alpha$. Derivative: $1$ on the positive side, $\alpha e^{x}$ on the negative side.
More expensive than ReLU (an exponential on the negative branch only) but centres the output around zero, which helps convergence. Introduced by Clevert, Unterthiner and Hochreiter (2015).
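A minimal sketch of ELU, assuming the usual default of 1 for the scale of the negative branch (called `alpha` here). `math.expm1` computes e^x − 1 accurately for small x; the exponential is only evaluated on the negative branch.

```python
import math

def elu(x, alpha=1.0):
    """Identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)

def elu_prime(x, alpha=1.0):
    """Slope 1 for x > 0, alpha * exp(x) otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)

for x in (-50.0, -1.0, 0.0, 1.0):
    print(x, elu(x), elu_prime(x))
```

For very negative inputs the output approaches the negative of `alpha` and the derivative approaches zero, so ELU saturates on the left while staying linear on the right.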
In 2026, the trio ReLU / GELU / SiLU covers the overwhelming majority of architectures deployed in production. Practical rule: ReLU for raw simplicity and speed, GELU for transformers, SiLU for recent dense architectures.
Play with the functions
The idea of an activation function is easier to see than to describe. The component below is a full workshop: you can enable or disable each function individually (click its label), display the derivatives as dashed lines, and enable a cursor that places a vertical line and reveals the local tangent at the hovered point for each curve (the slope of that tangent is exactly $f'(x)$). A collapsible panel at the bottom also lets you tune the $x$ domain if you want to zoom in or out.
| Function | f(x) | f'(x) |
|---|---|---|
| Identity | 0.000 | 1.000 |
| Sigmoid | 0.500 | 0.250 |
| ReLU | 0.000 | 0.000 |
| Tanh | 0.000 | 1.000 |
The dashed segments around each point are the local tangents. A flat tangent means a weak gradient (slow learning); a steep tangent means a strong gradient.
Five experiments to try:
- At $x = 0$, check that $\sigma(0) = 0.5$ and $\sigma'(0) = 0.25$. The sigmoid tangent is at its steepest there.
- Far from zero (for example at $x = 5$), see that $\sigma'(x)$ has already fallen toward $0$ ($\sigma'(5) \approx 0.0066$). The sigmoid saturates fast, its tangent is nearly flat.
- At any negative $x$, ReLU and its derivative are strictly zero. On that branch no gradient flows back.
- Toggle off Identity and Tanh to compare only sigmoid and ReLU: the contrast between saturation and piecewise linearity becomes obvious.
- Compare at the centre the tangent slope of sigmoid (max $0.25$) with ReLU’s for $x > 0$ (always exactly $1$). That is the root of the vanishing gradient.
The vanishing gradient problem
When we differentiate the sigmoid, we get $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$. The maximum is reached at $x = 0$ and equals $0.25$. So at every layer crossed, the gradient is multiplied by a factor of at most $0.25$.
For a 10-layer network using sigmoid, the gradient at the first layer is multiplied by at most $0.25^{10} \approx 9.5 \times 10^{-7}$. That is extremely small: the first layer no longer learns. This is the vanishing gradient problem, first observed by Hochreiter in his thesis (1991), analysed for recurrent networks by Bengio, Simard and Frasconi (1994), then quantified and addressed for deep feedforward networks by Glorot and Bengio (2010).
ReLU largely solves this problem: on its active part, the gradient is exactly $1$. Multiplying by $1$ does not shrink the gradient. This is one of the two reasons (with its computational speed) for its modern dominance.
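To see where the per-layer factor comes from, here is a sketch of the chain rule for a simplified chain of scalar sigmoid neurons. The notation ($a_k$, $z_k$, $w_k$, $b_k$, depth $n$) is introduced here for illustration only; a real layer has weight matrices rather than scalars, but the structure of the product is the same.

```latex
a_k = \sigma(z_k), \qquad z_k = w_k\, a_{k-1} + b_k, \qquad k = 1, \dots, n

\frac{\partial a_n}{\partial a_0}
  = \prod_{k=1}^{n} \sigma'(z_k)\, w_k
\quad\Longrightarrow\quad
\left| \frac{\partial a_n}{\partial a_0} \right|
  \le \left( \tfrac{1}{4} \right)^{n} \prod_{k=1}^{n} |w_k|
```

The sigmoid alone contributes at most one quarter per layer. The weights can partly compensate or make things worse, so the bound quoted in the text is best read as the contribution of the activation itself.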
Watch the gradient collapse
The component below simulates a deep network. Change the number of layers and switch the activation. Each bar represents the effective gradient at a given layer, top to bottom from output to input. On sigmoid you see the bars shrink visibly as depth grows. On ReLU they keep their length.
The output layer receives a reference gradient of 1. Each layer crossed back toward the input multiplies it by the maximum derivative of the chosen function: sigmoid 0.25, tanh 1, ReLU 1.
The key experiment: push the number of layers to its maximum with sigmoid and look at the first-layer gradient. After $n$ layers it is at most $0.25^{n}$, far too small to update a weight in any meaningful way. Now switch to ReLU and watch the bars all match again. That is the exact technical reason why sigmoid was dropped from hidden layers in favour of ReLU starting in 2012.
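The rule used by the simulation (a reference gradient of 1 at the output, multiplied by the maximum derivative at every layer crossed) fits in a few lines. This sketch reproduces that simplified rule only; it ignores the weights and the actual pre-activations, and the function and variable names are chosen here.

```python
# Maximum derivative of each activation, as quoted in the text.
MAX_DERIVATIVE = {"sigmoid": 0.25, "tanh": 1.0, "relu": 1.0}

def gradient_bars(activation, n_layers):
    """Upper bound on the gradient after crossing 0, 1, ..., n_layers layers.

    Index 0 is the reference gradient at the output; the last entry is the
    bound reached at the first layer.
    """
    factor = MAX_DERIVATIVE[activation]
    bars = [1.0]
    for _ in range(n_layers):
        bars.append(bars[-1] * factor)
    return bars

for name in ("sigmoid", "relu"):
    bars = gradient_bars(name, 10)
    print(name, "first-layer bound:", bars[-1])
    for depth, g in enumerate(bars):
        # Text bar chart: 40 characters correspond to a gradient of 1.
        print(f"  {depth:2d} {'#' * round(40 * g)} {g:.2e}")
```

With sigmoid the bars disappear after a handful of layers; with ReLU (on its active branch) they all keep the same length.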
How to choose in practice
A simple heuristic that works 95 % of the time:
| Layer | Default choice | Variants |
|---|---|---|
| Hidden layers | ReLU | Leaky ReLU, ELU, GELU for dying-ReLU cases |
| Binary classification output | Sigmoid | (none) |
| Multi-class classification output | Softmax (chapter 6) | (none) |
| Regression output | Identity | (none) |
| Recurrent layers (RNN, LSTM, GRU) | Tanh + sigmoid on gates | (none) |
This table is a heuristic, not dogma. Specific architectures (transformers, GANs, diffusion models) use other functions (GELU, Swish, Mish). But for a standard network, ReLU everywhere except output is an excellent starting point.
In one sentence
The activation function is what stops a deep network from collapsing into a linear regression. Sigmoid and tanh ruled until 2010, then ReLU took over the hidden layers from 2012 on. The choice depends on context but rarely turns out to be tricky in practice.
On to chapter 4
You now have every ingredient to understand the learning machine: vector inputs (chap. 2), weighted sum, bias and activation function (this chapter). Chapter 4 introduces the perceptron, the first neuron that adjusts its weights on its own from examples. It is the birth of machine learning as we know it.
A subtlety you will explore: this chapter has just established that an activation function must be differentiable to compute a gradient. Yet the perceptron uses the Heaviside step function, whose derivative is zero everywhere it is defined (that is, everywhere except at the jump). How does it learn at all, without a usable gradient? That is precisely the question opening the next chapter.
Exercises
Exercise 1: differentiate the sigmoid
Starting from the definition $\sigma(x) = \frac{1}{1 + e^{-x}}$, prove that $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$.
Exercise 2: compute a derivative at a specific value
Compute $\sigma'(0)$, then $\sigma'(x)$ at an input far from zero, for example $x = 5$. Compare those two values. What does this tell you about the behaviour of the gradient for inputs far from zero?
Exercise 3: compare the speed
Consider a neuron taking 1000 inputs. How many elementary arithmetic operations (additions, multiplications, exponentials) are needed to compute the output of a neuron with ReLU activation? With sigmoid activation? Does the difference look marginal or significant when multiplied by millions of evaluations?
Sources
- Nair, V. & Hinton, G. E. (2010). “Rectified Linear Units Improve Restricted Boltzmann Machines.” ICML 27.
- Glorot, X. & Bengio, Y. (2010). “Understanding the difficulty of training deep feedforward neural networks.” AISTATS 9, 249-256.
- Cox, D. R. (1958). “The Regression Analysis of Binary Sequences.” Journal of the Royal Statistical Society 20(2), 215-242. (Historical origin of the logistic function.) DOI 10.1111/j.2517-6161.1958.tb00292.x
Further reading
- Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. Section 6.3 on activation functions and their variants. deeplearningbook.org
- Maas, A. L., Hannun, A. Y. & Ng, A. Y. (2013). “Rectifier Nonlinearities Improve Neural Network Acoustic Models.” ICML 30, Workshop on Deep Learning for Audio, Speech and Language Processing. (Introduction of Leaky ReLU.)
- Hendrycks, D. & Gimpel, K. (2016). “Gaussian Error Linear Units (GELUs).” arXiv. arXiv 1606.08415
- Ramachandran, P., Zoph, B. & Le, Q. V. (2017). “Searching for Activation Functions.” arXiv. (Automated search for activations, introduces the Swish family, equivalent to SiLU.) arXiv 1710.05941
- Elfwing, S., Uchibe, E. & Doya, K. (2018). “Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning.” Neural Networks 107, 3-11. (Original description of SiLU.) arXiv 1702.03118
- Clevert, D.-A., Unterthiner, T. & Hochreiter, S. (2015). “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs).” arXiv. arXiv 1511.07289
1. Why is a non-linear activation function essential in a deep network?
2. What is the derivative of the sigmoid σ(x)?
3. Why has ReLU dominated hidden layers of deep networks since 2012?
4. For a binary classification, which activation function is typically used at the output?
5. For a 10-layer deep network using sigmoid everywhere, what can happen during training?