note

The loss landscape of understanding

Learning is gradient descent over a landscape nobody drew. The minima we find shape the thoughts we can have.


  • learning
  • epistemology
  • mathematics
Apr 18, 2026

Every model of learning — whether in a neural network or a human mind — shares a single structural metaphor: the loss landscape. There is a surface. It has valleys and ridges. You are somewhere on it. You want to go down.

The central metaphor

The loss landscape is not a metaphor for understanding. It is the shape of understanding. Every idea you hold is a point on this surface. Every revision is a step.

This note is an attempt to map that landscape — not the mathematical one, which is well-explored, but the phenomenological one: what it feels like to move through it as a thinking creature.


I. The geometry before the gradient

Before you can descend, you need a surface. The surface is defined by your model of the world — the set of assumptions, priors, and representations that determine what counts as "error."

A newborn has a nearly flat landscape. Everything is equally plausible. There are no deep valleys because there are no strong commitments. As "The shape of a thought" argues, the shape of a thought is determined by the constraints that have already crystallized around it.

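Written out, the loss is $\mathcal{L}(\theta) = -\mathbb{E}_{x \sim p}[\log q_\theta(x)]$, where $p$ is the world and $q_\theta$ is your model of it. This is cross-entropy loss, the standard objective for language models. But it is also, in a sense, the objective of every learner. You are minimizing the divergence between your model of the world and the world itself. The gap between them is surprise, and surprise is the gradient signal.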
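A minimal sketch of that claim, assuming a toy world with three possible outcomes. The distributions and the helper names `surprisal` and `cross_entropy` are mine, chosen only for illustration. Surprise at an outcome is the negative log of the probability the model gave it, and cross-entropy is the average surprise when the world does the sampling.

```python
import math

def surprisal(p: float) -> float:
    """Surprise (in nats) at seeing an outcome the model gave probability p."""
    return -math.log(p)

def cross_entropy(world: list[float], model: list[float]) -> float:
    """Expected surprise when outcomes follow `world` but are predicted by `model`."""
    return sum(w * surprisal(m) for w, m in zip(world, model) if w > 0)

# Hypothetical toy distributions over three outcomes.
world = [0.7, 0.2, 0.1]
naive_model = [1 / 3, 1 / 3, 1 / 3]   # the flat landscape: everything equally plausible
better_model = [0.6, 0.3, 0.1]        # a model that has moved closer to the world

entropy = cross_entropy(world, world)            # surprise that no model can remove
loss_naive = cross_entropy(world, naive_model)
loss_better = cross_entropy(world, better_model)

# The excess over the entropy is the KL divergence: the avoidable part of surprise.
print(f"entropy      = {entropy:.4f}")
print(f"naive model  = {loss_naive:.4f} (excess {loss_naive - entropy:.4f})")
print(f"better model = {loss_better:.4f} (excess {loss_better - entropy:.4f})")
```

The loss never drops below the entropy of the world itself. Learning can only remove the excess, which is one way to read the idea that some surprise is irreducible.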

The mind is a prediction engine. Its fuel is surprise. Its path is the gradient of surprise. Its destination is a place where nothing is surprising anymore — which is also, unfortunately, a place where nothing is learned.


II. Local minima and the comfort of being wrong

The most discussed feature of loss landscapes is local minima — points where every direction leads uphill, but which are not the lowest point overall.

This is the geometry of being stuck in a worldview. Your beliefs form a self-consistent basin. Every piece of evidence you encounter is interpreted within the basin. The gradient is zero because you've stopped being surprised.

The trap of coherence

Coherence is not truth. A self-consistent belief system can be entirely wrong. The flattest region of your loss landscape may be the most dangerous place to be — not because it's wrong, but because you've stopped moving.

This connects directly to "Models are maps": a map can be internally consistent while bearing no relation to the territory. The model that predicts well within its basin may fail catastrophically outside it.

The phenomenon has a name in optimization: sharp minima vs. flat minima. Sharp minima generalize poorly. They are precise but brittle. Flat minima generalize well — they are less precise about any single data point, but more robust across the distribution.

Here $\mathcal{S}$ is the sharpness measure: the trace of the Hessian $H$ at the minimum $\theta^*$, that is, $\mathcal{S} = \operatorname{tr}\, H(\theta^*)$. The sharper the minimum, the more sensitive the solution is to perturbation. In cognitive terms: the sharper your conviction, the more devastated you are by contradiction.
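To make the sensitivity claim concrete, here is a small sketch assuming two one-dimensional quadratic bowls. In one dimension the trace of the Hessian is just the second derivative, so a single `curvature` number stands in for sharpness. The values 50.0, 0.5 and 0.1 are arbitrary illustrations.

```python
def loss(theta: float, curvature: float) -> float:
    """Quadratic bowl with its minimum at theta = 0; `curvature` is the second derivative."""
    return 0.5 * curvature * theta ** 2

sharp, flat = 50.0, 0.5   # hypothetical curvatures for a sharp and a flat minimum
nudge = 0.1               # the same small perturbation applied to both solutions

rise_sharp = loss(nudge, sharp) - loss(0.0, sharp)
rise_flat = loss(nudge, flat) - loss(0.0, flat)

print(f"sharp minimum: loss rises by {rise_sharp:.4f}")
print(f"flat minimum:  loss rises by {rise_flat:.4f}")
```

The same nudge costs a hundred times more in the sharp bowl, because the penalty scales linearly with curvature.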


III. Saddle points: the agony of almost-understanding

More common than local minima are saddle points — regions where the landscape curves up in some directions and down in others. You are not at a minimum. You are not at a maximum. You are at a point of ambiguity.

This is the phenomenology of almost-understanding. You can feel the idea. It's close. But every direction you move seems both promising and threatening. The gradient is nearly zero, but the landscape is not flat — it's conflicted.

I have spent weeks at saddle points. I can feel the idea on the other side of the ridge. But I can't find the path. Every sentence I write is a step in a direction that feels both right and wrong. "Writing is a search procedure" describes this perfectly: the blank page is a saddle point, and each sentence is a tentative gradient step.

Saddle points are why The stub problem exists. Stubs are ideas that got stuck at saddle points. They haven't descended into a basin. They haven't climbed out. They sit there, waiting for a perturbation — a new connection, a new perspective, a random kick of stochastic gradient noise — to push them off the ridge.
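A sketch of the waiting-for-a-kick picture, assuming a hand-picked surface $f(x, y) = x^2 + y^4/4 - y^2/2$ with a saddle at the origin and two basins at $y = \pm 1$. The surface, the step size and the size of the kick are all illustrative choices, not anything canonical.

```python
def grad(x: float, y: float) -> tuple[float, float]:
    """Gradient of f(x, y) = x**2 + y**4 / 4 - y**2 / 2, which has a saddle at the origin."""
    return 2.0 * x, y ** 3 - y

def descend(x: float, y: float, lr: float = 0.1, steps: int = 300) -> tuple[float, float]:
    """Plain gradient descent with a fixed step size."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

stuck = descend(1.0, 0.0)     # starts exactly on the ridge: slides into the saddle and stays
kicked = descend(1.0, 1e-6)   # same start plus a tiny kick off the ridge

print("without a kick:", stuck)
print("with a kick:   ", kicked)
```

Without the perturbation the gradient along $y$ is exactly zero at every step, so the iterate never learns that a downhill direction exists. A kick of one part in a million is enough to reach the basin at $y = 1$.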

[Figure 1.2: Latent manifold projection]

IV. Momentum and the courage to overshoot

In optimization, momentum is a technique that accumulates gradient information over time, allowing the optimizer to carry velocity through flat regions and over shallow hills.

$$v_t = \beta v_{t-1} + \eta \nabla_\theta \mathcal{L}(\theta_t)$$

$$\theta_{t+1} = \theta_t - v_t$$

In cognitive terms, momentum is conviction. It's the accumulated weight of evidence and experience that lets you push through uncertainty. Without momentum, you stop at every shallow dip. With too much momentum, you overshoot the real minimum and spiral into absurdity.
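A short sketch of the update rule, assuming the simplest possible landscape, $\mathcal{L}(\theta) = \theta^2/2$, whose gradient is $\theta$ itself. The function name `run_momentum` and the values of $\beta$ and $\eta$ are illustrative.

```python
def run_momentum(beta: float, eta: float = 0.1, steps: int = 50) -> list[float]:
    """Iterate v_t = beta * v_{t-1} + eta * grad and theta_{t+1} = theta_t - v_t on L = theta**2 / 2."""
    theta, v = 1.0, 0.0
    path = [theta]
    for _ in range(steps):
        g = theta                # gradient of theta**2 / 2
        v = beta * v + eta * g   # accumulate velocity
        theta = theta - v        # move by the accumulated velocity
        path.append(theta)
    return path

no_momentum = run_momentum(beta=0.0)
heavy = run_momentum(beta=0.9)

print(f"beta=0.0: lowest point visited = {min(no_momentum):.4f}")
print(f"beta=0.9: lowest point visited = {min(heavy):.4f}")
```

With no momentum the iterate creeps toward the minimum at zero and never crosses it. With $\beta = 0.9$ it arrives sooner but swings past zero before settling, which is the overshoot described above in miniature.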

The role of conviction

Conviction is not the same as correctness. Conviction is momentum. It lets you traverse flat regions of the loss landscape: periods where no new evidence arrives, but you keep moving in the direction your prior evidence pointed. This is why "Notes on attention" matters: sustained attention is the accumulation of cognitive momentum.

The balance between exploration and exploitation is, at its core, a question of scheduling. Do you anneal your learning rate, becoming more cautious as you converge? Or do you keep it high, risking instability but preserving the ability to escape local minima?

I think most people anneal too early. They find a comfortable basin and turn down their learning rate. They stop being surprised. They stop moving. "On clarity" warns against this: clarity without continued exploration is just a local minimum that looks like a global one.


V. Batch size and the depth of experience

In stochastic gradient descent, the batch size determines how many samples you evaluate before taking a step. Small batches give noisy but frequent updates. Large batches give precise but infrequent updates.

This maps onto a deep distinction in how people learn:

  1. Small-batch learners — they update their beliefs after every experience. Noisy, reactive, sometimes inconsistent, but highly adaptive. They feel every gradient step.

  2. Large-batch learners — they accumulate experience before revising. More stable, more conservative, but slower to adapt. They need many samples before the gradient becomes clear enough to act on.

Neither is superior

The research points in a consistent direction, though not without dispute: small batches tend to generalize better but converge more slowly, while large batches converge faster but tend to find sharper minima. The optimal strategy is usually somewhere in between, and it changes over the course of training.
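The noisy-versus-precise trade-off can be seen in a small simulation. This is a sketch under made-up assumptions: the data source is a unit-variance Gaussian centred at 2, the loss is $(\theta - x)^2/2$, and the batch sizes and trial count are arbitrary.

```python
import random
import statistics

random.seed(0)

def batch_gradient(theta: float, batch_size: int) -> float:
    """Average gradient of (theta - x)**2 / 2 over one random batch, with x drawn from N(2, 1)."""
    batch = [random.gauss(2.0, 1.0) for _ in range(batch_size)]
    return sum(theta - x for x in batch) / batch_size

spread = {}
for batch_size in (1, 16, 256):
    grads = [batch_gradient(0.0, batch_size) for _ in range(500)]
    spread[batch_size] = statistics.pstdev(grads)
    print(f"batch size {batch_size:>3}: gradient noise (std) = {spread[batch_size]:.3f}")
```

The noise in the gradient estimate shrinks roughly with the square root of the batch size. A sixteen-fold larger batch buys only about a four-fold cleaner signal, which is part of why neither extreme wins outright.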

I am a small-batch learner. I revise after every conversation. I change my mind after every book. This is not a virtue — it's a hyperparameter. Sometimes it serves me. Sometimes it makes me unstable.

"The cloud and the basin" captures the other extreme: the person who refuses to crystallize, who stays in the cloud, who never takes a gradient step at all. That's batch size infinity: you never update because you never finish processing the batch.


VI. The curvature of concepts

Not all dimensions of the loss landscape are equal. Some directions are steep — small changes in belief produce large changes in loss. Others are flat — you can move far without much consequence.

The Hessian matrix captures this structure:

$$H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial \theta_i \, \partial \theta_j}$$

The eigenvalues of $H$ tell you the curvature in each direction. Large eigenvalues correspond to steep, narrow valleys: beliefs that must be precise. Small eigenvalues correspond to wide, flat plains: beliefs where imprecision is tolerable.
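As a worked instance, consider a hypothetical two-parameter loss $\mathcal{L}(p, q) = 2.5p^2 + 2pq + q^2$, whose Hessian is the constant matrix with rows $(5, 2)$ and $(2, 2)$. The sketch below uses the closed-form eigenvalues of a symmetric 2x2 matrix; the helper name is mine.

```python
import math

def symmetric_2x2_eigenvalues(a: float, b: float, d: float) -> tuple[float, float]:
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean + radius, mean - radius

# Hessian of the hypothetical loss L(p, q) = 2.5*p**2 + 2*p*q + q**2.
steep, shallow = symmetric_2x2_eigenvalues(5.0, 2.0, 2.0)

print("steep direction curvature:  ", steep)
print("shallow direction curvature:", shallow)
print("trace of the Hessian:       ", steep + shallow)
```

The two eigenvalues come out near 6 and 1: one direction in which a small change matters six times as much as in the other. Their sum equals the trace of the matrix, 5 + 2.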

This is why some ideas feel sharp and others feel soft. Sharp ideas have high curvature: they demand precision, they resist approximation, they break if you perturb them. Soft ideas have low curvature: they tolerate vagueness, they survive approximation, they bend without breaking.

The Geometry of Intelligence explores this in the context of neural representations. The same principle applies to human concepts. The concept "justice" has low curvature — it can stretch across many contexts without breaking. The concept "2+2=4" has high curvature — it is precise and non-negotiable.

The most dangerous ideas are the ones with medium curvature. They feel precise enough to act on, but they're actually soft enough to rationalize. Political ideologies live in this zone. So do most philosophical -isms.


VII. Regularization and the discipline of simplicity

Regularization is the practice of adding a penalty for complexity to the loss function. In optimization, a common form is:

$$\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \lambda \lVert \theta \rVert_2^2$$

The $\lambda$ term penalizes large weights. In cognitive terms, it penalizes overfitting: the tendency to memorize specifics at the cost of generality.

This is the mathematical formalization of Occam's Razor. And it raises a question that "Models are maps" only partially answers: how much should you simplify?

Too much regularization and your model is a straight line through a curved world. Too little and your model is a spline that touches every point but understands none of them.
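A minimal sketch of that trade-off, assuming a one-weight model $y \approx wx$ fitted with a squared-weight penalty, for which the best weight has the closed form $w = \sum x_i y_i / (\sum x_i^2 + \lambda)$. The four data points and the three values of $\lambda$ are invented for illustration.

```python
def ridge_weight(xs: list[float], ys: list[float], lam: float) -> float:
    """Minimizer of sum((y - w*x)**2) + lam * w**2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical data lying close to the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

weights = {lam: ridge_weight(xs, ys, lam) for lam in (0.0, 10.0, 1000.0)}
for lam, w in weights.items():
    print(f"lambda = {lam:>6}: w = {w:.4f}")
```

With no penalty the weight sits near the slope the data suggests. As $\lambda$ grows the weight is pulled toward zero, until the model has been simplified past the point of saying anything about the data.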

The risk of under-regularization

An unregularized mind is a mind that has memorized its experiences without abstracting from them. It can recite but not reason. It can recognize but not generalize. This is the cognitive equivalent of overfitting — and it is, I think, the default state of most human minds.

The practice described in "Writing is a search procedure" is, in part, a regularization technique. Writing forces you to compress. You cannot fit every nuance into a sentence. The act of compression is the act of finding the structure that matters and discarding the detail that doesn't.

This is why "Why I write" is not just about expression. It's about finding the regularized version of my own mind: the version that generalizes, not just the version that remembers.


VIII. Initialization and the accident of priors

Before gradient descent begins, you need an initial point. In neural networks, this is typically random. In humans, it is anything but.

Your initial point on the loss landscape is determined by your genes, your culture, your language, your early experiences. It is not random — it is contingent. And it profoundly shapes which minima you will find.

Two networks initialized differently may converge to completely different minima. Two people raised differently may arrive at completely different worldviews. Neither is "more correct" in any absolute sense — they have simply traversed different regions of the same landscape.
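A sketch of the same point on the smallest landscape that has two valleys, assuming the double-well $f(x) = (x^2 - 1)^2$ with equally deep minima at $x = -1$ and $x = +1$. The starting points and step size are arbitrary.

```python
def descend(x: float, lr: float = 0.05, steps: int = 200) -> float:
    """Gradient descent on the double-well f(x) = (x**2 - 1)**2."""
    for _ in range(steps):
        x -= lr * 4.0 * x * (x * x - 1.0)   # f'(x) = 4x(x^2 - 1)
    return x

from_right = descend(0.5)    # initialized slightly right of the central ridge
from_left = descend(-0.5)    # initialized slightly left of it

print("started at  0.5, ended at", round(from_right, 6))
print("started at -0.5, ended at", round(from_left, 6))
```

Same landscape, same rule, same number of steps. The only thing that decides which valley each run ends in is the side of the ridge it started on.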

Why diversity matters

This is the deep reason why cognitive diversity is valuable. It's not about fairness or representation (though those matter too). It's about search. Different initializations explore different regions of the loss landscape. The global minimum is more likely to be found by many searchers starting from many points than by many searchers starting from the same point.

"Latent spaces as maps" makes a related point: the directions we name in latent space depend on where we started. The "cities" on the map are not objective features of the territory; they are artifacts of the path we took through it.


IX. Escaping minima: noise, heat, and crisis

How do you escape a local minimum? In optimization, there are several strategies:

  1. Simulated annealing — add noise that decreases over time. Early in training, you jump around freely. Later, you settle down. This is what youth feels like: the freedom to be wrong, followed by the gradual crystallization of conviction.

  2. Random restarts — if you're stuck, start over from a different point. This is what a paradigm shift feels like. Not a gradual improvement, but a complete reset of assumptions.

  3. Increasing the learning rate — temporarily take bigger steps. This is what crisis feels like. The events that shake your worldview are not gradient steps — they are learning rate spikes. They make you revise more aggressively, sometimes overshooting, sometimes finding new basins.
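The second strategy in the list above is the easiest to sketch. This example assumes a tilted double-well $f(x) = (x^2 - 1)^2 + 0.3x$, which has a shallow basin on the right and a deeper one on the left. The restart points are fixed here so the run is reproducible; in practice they would be drawn at random.

```python
def f(x: float) -> float:
    """Tilted double-well: local minimum near x = 0.96, deeper minimum near x = -1.04."""
    return (x * x - 1.0) ** 2 + 0.3 * x

def descend(x: float, lr: float = 0.05, steps: int = 500) -> float:
    """Gradient descent using f'(x) = 4x^3 - 4x + 0.3."""
    for _ in range(steps):
        x -= lr * (4.0 * x ** 3 - 4.0 * x + 0.3)
    return x

stuck_x = descend(1.5)                    # a single run settles in the shallow basin

starts = [-1.5, -0.5, 0.5, 1.5]           # hypothetical restart points
best_x = min((descend(s) for s in starts), key=f)

print(f"single run:    x = {stuck_x:.3f}, loss = {f(stuck_x):.3f}")
print(f"with restarts: x = {best_x:.3f}, loss = {f(best_x):.3f}")
```

No single run ever climbs out of its basin. The improvement comes entirely from being willing to begin again somewhere else and keep whichever ending is lowest.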

The most important moments in intellectual life are not the moments of understanding. They are the moments of ununderstanding — the moments when the gradient reverses and you realize you've been descending into the wrong valley.

"The cloud and the basin" describes the return to the cloud as revision. But revision is not just melting the rock. It's heating the system: adding thermal noise so that you can escape the basin and explore the landscape again.


X. The global minimum is a myth

Here is the uncomfortable truth: there may be no global minimum. Or rather, the global minimum may be trivial — a point where the loss is zero because the model has memorized the training data, not because it has understood the world.

Understanding is not minimization. It is generalization — the ability to perform well on data you haven't seen. And generalization is not a property of the minimum itself, but of the path you took to get there.

This is why "Attention is all you need (in life too)" is not just about architecture. It's about which path through the loss landscape the attention mechanism enables. The transformer doesn't find a better minimum than an RNN; it finds a minimum that generalizes better, because the path to it encodes structural assumptions about the nature of language.

And this is why "Writing is a search procedure" matters so much. The path matters more than the destination. The process of writing (the constraint propagation, the beam search, the revision) is not a means to an end. It is the understanding. The final essay is just the residue of the path.


XI. Ensembles and the wisdom of multiple minds

An ensemble is a collection of models whose predictions are averaged. In practice, ensembles usually outperform their individual members, even when each member is weak on its own, provided the members' errors are not strongly correlated.

This is the mathematical justification for intellectual community. Not because others are smarter than you, but because they are different from you. They have traversed different regions of the loss landscape. Their errors are uncorrelated with yours. Averaging your beliefs produces a result that is more robust than any individual belief.
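A small simulation of the averaging argument, under the idealized assumption that every model's error is independent, unbiased Gaussian noise around the truth. The constants `TRUTH`, `N_MODELS` and `N_TRIALS` are arbitrary illustrations.

```python
import random

random.seed(0)
TRUTH = 10.0
N_MODELS, N_TRIALS = 25, 1000

individual_sq_err = 0.0
ensemble_sq_err = 0.0
for _ in range(N_TRIALS):
    # Each hypothetical model predicts the truth plus its own independent error.
    predictions = [TRUTH + random.gauss(0.0, 1.0) for _ in range(N_MODELS)]
    individual_sq_err += sum((p - TRUTH) ** 2 for p in predictions) / N_MODELS
    ensemble_sq_err += (sum(predictions) / N_MODELS - TRUTH) ** 2

individual_mse = individual_sq_err / N_TRIALS
ensemble_mse = ensemble_sq_err / N_TRIALS

print(f"average individual squared error: {individual_mse:.4f}")
print(f"ensemble squared error:           {ensemble_mse:.4f}")
```

With independent errors, the variance of the average falls as one over the number of models. When errors are correlated the gain is smaller, which is why difference matters more than raw ability here.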

The practical implication

Talk to people who disagree with you. Not to change their minds. Not to defend yours. But to average your gradients. The ensemble is usually better than its typical member. "Notes on attention" reminds us that attention to other minds is not distraction; it is regularization.


XII. The landscape is alive

The final complication: the loss landscape is not static. Every time you learn something, the landscape shifts. The valleys move. The ridges reform. What was a minimum becomes a saddle point. What was a plateau becomes a cliff.

This is non-stationary optimization — the objective function changes while you're optimizing it. It is the fundamental condition of human life. The world is not a fixed dataset. It responds to your predictions. It evolves with your understanding.
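A sketch of what a moving minimum does to a learner, assuming the simplest case: a quadratic loss $(\theta - c_t)^2/2$ whose minimum $c_t$ drifts by a fixed amount after every step. The function name `track` and the drift and learning-rate values are illustrative.

```python
def track(lr: float, drift: float, steps: int = 200) -> float:
    """Gradient descent on (theta - target)**2 / 2 while the target drifts; returns the final lag."""
    theta, target = 0.0, 0.0
    for _ in range(steps):
        theta -= lr * (theta - target)   # one gradient step toward the current minimum
        target += drift                  # the minimum moves before the next step
    return target - theta

fast_lag = track(lr=0.5, drift=0.1)
slow_lag = track(lr=0.1, drift=0.1)

print(f"learning rate 0.5: lag behind the minimum = {fast_lag:.4f}")
print(f"learning rate 0.1: lag behind the minimum = {slow_lag:.4f}")
```

The learner never catches the minimum. It settles into a steady lag of about drift divided by learning rate, so in this toy setting a learner that anneals to a small step size falls further behind a world that keeps moving.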

This is why learning never ends. Not because you haven't found the minimum, but because the minimum keeps moving. The best you can do is keep descending — keep following the gradient, keep reducing the surprise, keep adapting to the shifting surface of what it means to understand.

The loss landscape of understanding is not a mountain to be climbed or a valley to be found. It is an ocean to be sailed. The water moves beneath you. The wind changes. You adjust. You never arrive. But you get better at sailing.