1. Foreword

Current Large Language Models (LLMs) suffer from a severe form of anterograde amnesia.

The moment pre-training concludes, the model’s synaptic connections freeze. While In-context Learning grants a fleeting form of working memory, this is merely a transient fitting to the prompt. The model remains incapable of transmuting new information into long-term weights without re-initiating the computationally exorbitant pre-training cycle.


Figure 1: Architectural Schematic. A comparative visualisation of a standard Transformer block versus the proposed HOPE module, highlighting the integration of nested learning mechanisms [1].

Google Research's recent paper, Nested Learning (NL) [1], attempts to break this impasse. However, I argue that the paper's primary contribution is not a State-of-the-Art (SOTA) result, but a radical ontological reconstruction:

“Depth” is an illusion. Neural networks are fundamentally a set of Nested Optimisation Loops, distinguished only by their update frequency.

Without this insight, one sees merely another complex Transformer variant; with it, one perceives a unified mathematical universe.


Figure 2: The Nested Optimisation Spectrum. A topological reinterpretation of neural architecture. Instead of vertical depth, layers are viewed as nested loops distinguished by update frequency, ranging from instantaneous In-Context Attention to frozen Pre-trained Weights.

2. Everything is Associative Memory

In the traditional view, the Model and the Optimiser are distinct species.

  • Model: $y = f(x; \theta)$ — responsible for inference.
  • Optimiser: $\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}$ — responsible for learning.

Nested Learning presents an elegant mathematical proof: Backpropagation (BP) is itself a self-referential process of associative memory.

Once the notation is stripped away, the Attention mechanism and Gradient Descent (GD) are mathematically isomorphic: both perform the same function, information compression.

  1. Attention ($f=\infty$): Compresses the Context into hidden states via Query-Key matching at the moment of inference.
  2. SGD ($f \approx 0$): Compresses the Dataset into Weights via Gradient signals accumulated over many epochs.

If one accepts this premise, current architectures leave a massive vacuum between frequencies $0$ and $\infty$. Why, then, do we lack layers operating at $f=10$ or $f=100$?
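To make this concrete, consider a minimal, self-contained sketch (my own illustration, not code from the paper; the function names and the unit learning rate are assumptions). A linear-attention-style write into an associative memory M and a single gradient-descent step on a reconstruction loss both add an outer product involving the key, and they coincide exactly when the memory starts empty:

import torch

def attention_write(M, k, v):
    # Linear-attention-style write: accumulate the key-value association as an outer product.
    return M + torch.outer(v, k)

def sgd_write(M, k, v, lr=1.0):
    # One gradient step on the reconstruction loss L(M) = 0.5 * ||M k - v||^2.
    # grad_M L = (M k - v) k^T, so the update is again an outer product written into M.
    error = M @ k - v
    return M - lr * torch.outer(error, k)

d = 8
M = torch.zeros(d, d)
k, v = torch.randn(d), torch.randn(d)

# Starting from an empty memory, the two writes are identical (error = -v when M = 0).
print(torch.allclose(attention_write(M, k, v), sgd_write(M, k, v)))  # True

With a non-empty memory, the gradient step merely subtracts what is already stored before writing, which is one way of reading the paper's claim that both mechanisms compress information into a state.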

3. Reconstructing HOPE


Figure 3: Temporal Unrolling of the Continuum Memory System (CMS). This diagram illustrates the asynchronous update mechanism. Unlike standard synchronous transformers, CMS layers (depicted as horizontal tracks) update conditionally based on their assigned frequency. Solid blocks denote active plasticity, whilst ghosted blocks represent frozen inference states.

Based on this theory, the authors propose the HOPE architecture. Its core component, the Continuum Memory System (CMS), is essentially a differentiable time-divider.

This contradicts the “Synchronous Update” paradigm of the Transformer. CMS allows distinct modules to “breathe” at different rates.

To replicate this in code, one must dissolve the boundary between Training and Inference [3]. Consider the following logic:

The Logic of CMS (Continuum Memory System)

import torch
import torch.nn as nn

class CMSBlock(nn.Module):
    def __init__(self, input_dim, frequencies=[1, 16, 512]):
        """
        Args:
            input_dim: Dimension of the input features.
            frequencies: List of update periods.
                1   = Fast adaptation (Contextual/Short-term),
                512 = Slow consolidation (Long-term memory).
        """
        super().__init__()
        # Independent MLPs for each frequency level.
        # These act as parallel processors operating on distinct time scales.
        self.levels = nn.ModuleList([
            nn.Sequential(
                nn.Linear(input_dim, input_dim * 4),
                nn.SiLU(),
                nn.Linear(input_dim * 4, input_dim)
            ) for _ in frequencies
        ])
        self.freqs = frequencies

        # Each level maintains its own 'optimiser state' (simplified here as momentum),
        # one buffer per parameter of the corresponding MLP.
        # In a production implementation, this distributes the Adam state
        # directly into the forward pass logic.
        self.states = [
            [torch.zeros_like(p) for p in mlp.parameters()]
            for mlp in self.levels
        ]

    def forward_and_learn(self, x, global_step):
        output = x

        # 1. Forward Pass: the signal propagates through all frequency levels,
        #    following standard residual-connection behaviour.
        for mlp in self.levels:
            output = mlp(output) + output

        # 2. Nested Learning: conditional update logic.
        #    Crucial: this blurs the line between training and inference.
        #    The 'backward' pass is fragmented and executed locally.
        if self.training:
            for i, (mlp, freq) in enumerate(zip(self.levels, self.freqs)):
                # Only trigger plasticity if the current step aligns with the frequency.
                # This creates the 'nested' temporal structure.
                if global_step % freq == 0:
                    self.local_update(mlp, x, self.states[i])

        return output

    def local_update(self, mlp, x, states):
        """
        The soul of NL: asynchronous, local self-modification.
        Relies on a local reconstruction error rather than global backpropagation.
        """
        # Self-supervised prediction (reconstruction task).
        target = x.detach()
        pred = mlp(target)
        loss = (pred - target).norm(p=2)  # L2 regression loss acts as the objective.

        # Manual, purely local gradient computation.
        grads = torch.autograd.grad(loss, mlp.parameters())

        # Execute the update (simulating momentum SGD).
        # Note: in a full implementation, this is where the 'M3' optimiser logic lives.
        with torch.no_grad():
            for param, grad, state in zip(mlp.parameters(), grads, states):
                # Update the momentum state (memory of gradients).
                state.mul_(0.9).add_(grad)
                # Update the actual weights.
                param.sub_(0.01 * state)
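For completeness, a hypothetical driver loop (my own sketch, not from the paper) makes the frequency behaviour visible: global_step is threaded through forward_and_learn, so each level "breathes" at its own cadence.

# Assumes the imports and CMSBlock definition above.
block = CMSBlock(input_dim=64)
block.train()

stream = torch.randn(2048, 4, 64)  # (steps, batch, dim): a toy token stream.
for step, x in enumerate(stream):
    y = block.forward_and_learn(x, global_step=step)
    # Level 0 (freq=1) adapts every step, level 1 every 16 steps,
    # and level 2 only every 512 steps: fast context versus slow consolidation.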

The Logic of Self-Modifying Titans [2]

This section often appears esoteric to readers. In Control Theory, however, it is simply adaptive gain.

The model is no longer a passive recipient of a manually tuned Learning Rate; it generates its own parameters based on the “surprise” of the current data. This is effectively Meta-learning at token-level granularity.

class SelfModifyingTitan(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Projections that generate the memory-update 'program' from the input itself.
        # (Shapes are illustrative: keys, values, and queries all share dimension `dim`.)
        self.W_q = nn.Linear(dim, dim)
        self.adapter_k = nn.Linear(dim, dim)
        self.adapter_v = nn.Linear(dim, dim)
        self.adapter_eta = nn.Linear(dim, 1)    # per-token learning rate
        self.adapter_alpha = nn.Linear(dim, 1)  # per-token decay / retention

    def forward(self, x, memory_state):
        # x: (batch, dim); memory_state: (batch, dim, dim)

        # 1. Parameter Generation
        #    All control parameters are predicted directly from the current input x.
        #    The model is effectively 'programming' itself on the fly.
        q = self.W_q(x)
        k = self.adapter_k(x)
        v = self.adapter_v(x)

        # The model determines the learning rate (eta) and decay (alpha) dynamically.
        eta = torch.sigmoid(self.adapter_eta(x)).unsqueeze(-1)      # (batch, 1, 1)
        alpha = torch.sigmoid(self.adapter_alpha(x)).unsqueeze(-1)  # (batch, 1, 1)

        # 2. Memory Retrieval: y = M q
        y = torch.einsum('b d k, b k -> b d', memory_state, q)

        # 3. Memory Update (the differentiable optimiser step),
        #    a variant of the Delta Rule / Hebbian learning:
        #    M_new = alpha * M_old - eta * grad_L(M)
        predicted_v = torch.einsum('b d k, b k -> b d', memory_state, k)
        error = predicted_v - v  # The error signal (Surprise).

        # Outer product for the weight update.
        # This is where 'learning' occurs during the forward pass.
        update_term = torch.einsum('b d, b k -> b d k', error, k)
        new_memory_state = alpha * memory_state - eta * update_term

        return y, new_memory_state
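A brief usage sketch (again my own, under the shape assumptions noted in the code): the memory state is threaded across tokens and rewritten entirely inside the forward pass, with no optimiser and no backward call.

# Assumes the SelfModifyingTitan definition above.
dim, batch = 64, 4
titan = SelfModifyingTitan(dim)

memory = torch.zeros(batch, dim, dim)   # The memory starts empty.
sequence = torch.randn(16, batch, dim)  # (seq_len, batch, dim)

outputs = []
for x_t in sequence:
    y_t, memory = titan(x_t, memory)    # 'Learning' happens inside forward().
    outputs.append(y_t)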

4. Critical Thoughts

Drawing on OpenReview [4] discussions and experience in High-Performance Computing (HPC), we must examine this technology critically.

  1. Consciousness vs. Smart Cache
    Reviewers astutely noted that this mechanism resembles a differentiable smart cache. Fast Layers correspond to L1 Cache (CPU), while Slow Layers correspond to Disk. HOPE simply transforms the Cache into a trainable neural component. While this mitigates Catastrophic Forgetting, it does not fundamentally improve logical reasoning; it simply excels at "rote memorisation".

  2. The Hidden Cost of Sparse Updates
    The paper claims CMS is computationally efficient. However, anyone familiar with GPU architectures (e.g., NVIDIA H100) will recognise that logic such as if step % freq == 0 is an HPC nightmare.

    • Warp Divergence: Non-uniform Control Flow causes CUDA Core utilisation to plummet.
    • Pipeline Bubbles: In distributed training, asynchronous updates render gradient synchronisation (AllReduce) excessively complex.

In deployment, the wall-clock time gains may be significantly lower than the theoretical reduction in FLOPs.
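One standard mitigation, sketched here purely as an illustration (this is not something the paper proposes), is to replace the divergent branch with a dense 0/1 gate, so that every level executes the same kernels at every step and the frequency only masks the parameter delta:

import torch

def update_gates(global_step, freqs):
    # Dense gates: every level runs the same code every step,
    # and the gate simply zeroes out the parameter delta for inactive levels.
    steps = torch.full((len(freqs),), global_step)
    return (steps % torch.tensor(freqs) == 0).float()

# Example: at step 32, with freqs [1, 16, 512], only the first two levels are active.
print(update_gates(32, [1, 16, 512]))  # tensor([1., 1., 0.])
# A level then applies `param -= gate[i] * lr * state` unconditionally.

The price is wasted FLOPs on gated-off levels, which is exactly the trade-off between theoretical sparsity and wall-clock behaviour flagged above.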


Figure 4: Hardware Efficiency Analysis (Trace View). A comparative profile of GPU pipeline utilisation. Panel (A) shows the dense compute of Standard Transformers, whilst Panel (B) demonstrates how Nested Learning induces 'Pipeline Bubbles' and warp divergence due to conditional sparse updates, highlighting the trade-off between theoretical FLOPs reduction and wall-clock latency.

  3. Financial Implications: Handling Regime Change
    From a Quantitative Finance perspective, Catastrophic Forgetting is simply a regime switch. Traditional LLMs assume a stationary distribution for training data. NL allows the model to adjust local parameters dynamically during inference—essentially online risk adaptation. For high-frequency trading or real-time risk modelling, this capability (“learning while running”) may prove far more disruptive than its applications in NLP.

5. Final Thoughts

Nested Learning acts as a mirror. It reveals that Parameters are not sacred, static entities; they are simply memory variables with an extremely low frequency.

When one writes param.sub_(0.01 * state), one is not merely writing code; one is designing a digital entity with multiple temporal perceptions.

The architectural battles of the future will not be fought over Depth, but over frequency bandwidth.

Copyright Notice
This article, except for the referenced content below, is the original work of Junhao. The author retains the exclusive rights to its final interpretation. If there are any issues regarding copyright infringement, please contact me for removal. Reproduction or distribution of this content without my explicit permission is prohibited.

6. References

[1]. Behrouz, A., Razaviyayn, M., Zhong, P., & Mirrokni, V. (2025). Nested Learning: The Illusion of Deep Learning Architecture. Google Research. The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS).

[2]. Behrouz, A., Pezeshki, M., et al. (2025). Titans: Learning to Memorize at Test Time. Google Research. arXiv preprint arXiv:2501.00663.

[3]. Sun, Y., et al. (2020). Test-Time Training with Self-Supervision for Generalization under Distribution Shifts. International Conference on Machine Learning (ICML).

[4]. OpenReview Forum. (2025). Nested Learning: The Illusion of Deep Learning Architecture. Available at: https://openreview.net/forum?id=nbMeRvNb7A