← Back to notes
Concept AI Published Published 31 May 2026

Local learning: learning without a global gradient

What if a network could learn without ever propagating an error backward? Three families of purely local rules against backpropagation.

Backpropagation dominates AI, but it is global: it requires a backward sweep that reuses every weight and locks all updates. The brain, by contrast, learns locally. This piece pits backpropagation against three local alternatives (Forward-Forward, predictive coding, STDP) and explores what we gain when each synapse decides on its own.

  • #apprentissage-local
  • #retropropagation
  • #forward-forward
  • #predictive-coding
  • #stdp
  • #plasticite-hebbienne
  • #efficacite-energetique
  • #apprentissage-biologique

Training a large model costs on the order of a megawatt over weeks. The human brain, by contrast, reasons, learns continually and generalizes on about twenty watts. That gap of five to six orders of magnitude is not a hardware detail: it points to a different computational principle. At the heart of the difference, the artificial neuron is stateless (a weighted sum, effectively a logistic regression, Cox 1958), and the whole network is trained by one global mechanism, backpropagation. Depth creates non-linearity, not richness. The question of this piece fits in one sentence: can we learn without propagating error backward?

The three locks of backpropagation

Backpropagation is remarkably effective, but it cannot be a local rule. Three reasons forbid it by construction.

  • Weight transport: the backward pass reuses the forward weights, transposed. To compute its error, a neuron would need to know the weights of the synapses downstream of it, which makes no biological sense: a synapse does not read the synapses of its neighbors.
  • Update locking: no layer updates until the full backward sweep completes. The whole network advances in lockstep, in a global lock.
  • Two separate phases: activity must first flow forward, then error must flow backward. The two phases never overlap.

Run a learning cycle. In global mode, the error sweeps the network backward. Switch on local mode to see the three locks disappear.

Backpropagation: non-local, it requires a global error signal.

The lesson is sharp: backpropagation is powerful, but non-local by construction. That is precisely what a local rule must give up.

Why aim for local

If giving up backpropagation costs so much, why attempt it? Because locality promises four things the global mechanism cannot offer.

  • Energy proportional to real activity: only the units that fire spend, instead of a systematic global sweep.
  • Continual learning: no frozen phase, the network adapts on the move, without separating training and inference.
  • Biological plausibility: a local rule resembles what living systems actually do.
  • Co-location of memory and computation: the synapse is both where you store and where you compute, which removes the costly back-and-forth between memory and processor.

The cost is clear: today, no local method matches backpropagation at scale. So this is a research bet, not a solved problem.

Forward-Forward: two passes, no backward

Hinton (2022) proposes replacing the forward + backward pair with two forward passes. The first processes positive data (real), the second processes negative data (corrupted). Each layer locally adjusts a goodness (the sum of squared activities) so that it is high on positive data and low on negative data. There is no backward pass, no global error: each layer learns from its own local information only.

Two forward passes, never a backward pass. Each layer locally adjusts its goodness to be high on real data and low on corrupted data.

No backward pass: learning is purely local.
Layer 1Goodness 50%
Threshold 50%
Layer 2Goodness 50%
Threshold 50%
Layer 3Goodness 50%
Threshold 50%
Layer 4Goodness 50%
Threshold 50%

The limit is real: the architecture stays fixed, the synapses stay scalar, and the method remains below backpropagation on benchmarks. But it proves that a deep network can learn without ever sending error back.

Predictive coding: predicting to learn

Another path flips the perspective: each layer predicts the activity of the next one and locally minimizes its prediction error. Millidge, Tschantz and Buckley (2022) showed that predictive coding can approximate backpropagation along arbitrary computation graphs, using only local error units.

Write x^l\hat{x}_l for the prediction coming from layer l+1l+1. The local prediction error is:

εl=xlx^l\varepsilon_l = x_l - \hat{x}_l

and the local update of a weight depends only on quantities available at the synapse:

Δwlεlf(xl+1)\Delta w_l \propto \varepsilon_l \, f(x_{l+1})

It is currently seen as the most promising energy-based alternative to backpropagation.

STDP: causality through timing

STDP (spike-timing-dependent plasticity) is a biologically documented local rule. The weight change depends only on the relative timing of the two spikes:

Δt=tposttpre\Delta t = t_{post} - t_{pre}

If the pre spike fires before the post, the synapse strengthens (potentiation): it took part in the causality of the following spike. If the post spike fires before the pre, it weakens (depression). A third factor (a reward or a neuromodulator) can gate consolidation: this is three-factor learning. Diehl and Cook (2015) combine STDP with winner-take-all inhibition to recognize digits with no backpropagation at all; Halvagal and Zenke (2023) extend this line.

+10-1-500+50Δt (ms)dw

Tune the timing difference between the two spikes. The synapse strengthens if the pre fires before the post, and weakens otherwise. Everything is local: only the relative timing matters.

Pre-synaptic beforebefore Post-synaptic
Δt
20 ms
dw
0.37
Current weight
0.500

Potentiation: the synapse strengthens

STDP connects directly to my own work: it is exactly the kind of local rule sought by the SOAG program, and the criterion that could steer the mutations described in structural plasticity.

A minimal formalization

Let us cleanly contrast the global chain rule with a local rule. Backpropagation computes:

Lwij=δjxi,δj=f(aj)kwjkδk\frac{\partial \mathcal{L}}{\partial w_{ij}} = \delta_j \, x_i, \qquad \delta_j = f'(a_j) \sum_k w_{jk}\,\delta_k

where the sum over kk runs over all units downstream: the update of wijw_{ij} depends on the entire rest of the network. A local rule replaces that global δj\delta_j by a signal available at the synapse itself:

Δwij=ηg(xi,xj,m)\Delta w_{ij} = \eta \, g(x_i, x_j, m)

with mm an optional global modulator (a reward). The whole debate is right there: can a local gg match the global δ\delta?

A comparative table

PropertyBackpropForward-ForwardPredictive codingSTDP
Local learningnoyesyesyes
No backward passnoyespartialyes
Biological plausibilitynopartialpartialyes
Requires weight transportyesnonono
Online learningnopartialpartialyes

Each alternative drops at least one lock of backpropagation, yet none matches it at scale.

Limits and open questions

We should stay honest about the real state of the field.

  • Local methods still underperform backpropagation on large benchmarks.
  • We lack a theory saying which local rule yields which capability.
  • Comparing these methods is hard, because they optimize different objectives.
  • The open question remains: is locality a constraint to overcome, or the key to efficiency?

Test your understanding

Quiz
  1. 1. What does "weight transport" mean, and why is it biologically implausible?

  2. 2. What does Forward-Forward replace the backward pass with?

  3. 3. In STDP, which way does the weight change if the pre fires before the post?

  4. 4. What is the core claim of predictive coding?

Sources