
LM·LAB
Learn how ChatGPT works — from the very beginning.
Not a course. Not a tutorial. A walk through the ideas.
In 1948, Claude Shannon asked a deceptively simple question: can we predict the next letter in a sentence, given only the ones that came before?
The answer took the next seventy-five years to unfold. It took counting, then learning, then attention, and finally scale. Each era solved what the last one couldn't, and each one left a fingerprint you can still find inside the models you use every day.
This is a quiet walk through those four ideas. Not a tutorial, not a pitch. Just the notebook of someone who wanted to understand, written in case you do too.
Just count.
Counting letters sounds too simple to work. It isn't.
Bigrams and N-grams can predict text, generate language, and reveal the hidden structure of any corpus — all without a single neural weight. Shannon's idea, sharpened over decades of statistical NLP, is still the baseline every modern system is quietly measured against.
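Here is the whole trick in a dozen lines of Python, a sketch rather than anything official: count which character follows which in a toy corpus, then generate text by sampling from those counts. The corpus and every name in it are made up for illustration.

```python
import random
from collections import Counter, defaultdict

# A toy corpus; any text works. (Made up for illustration.)
corpus = "the cat sat on the mat. the cat ran."

# Count which character follows which: that is the whole model.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def sample_next(prev):
    """Pick a next character in proportion to how often it followed `prev`."""
    chars, weights = zip(*counts[prev].items())
    return random.choices(chars, weights=weights)[0]

# Generate: predict the next character, append it, repeat.
text = "t"
for _ in range(40):
    text += sample_next(text[-1])
print(text)
```

No weights, no training loop. Just counting, and it already babbles something shaped like its corpus.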
Then it learns.
Counting has a ceiling. What if, instead of memorising patterns, the machine could figure them out on its own, straight from raw data?
Layers of simple operations, stacked on each other, begin to discover structure no human wrote down. It took thirty years and the patience of a few researchers for the idea to become practical. It changed what a model could be.
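To see what "layers of simple operations" means in practice, here is a sketch, assuming nothing beyond plain numpy: two stacked layers learning XOR, a pattern no single linear rule can express. The sizes, seed, and four-point dataset are all illustrative.

```python
import numpy as np

# XOR: a pattern no single linear layer can capture.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # layer 1: a simple linear map
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # layer 2: another one, stacked on top

for step in range(5000):
    # Forward pass: two simple operations with a nonlinearity between them.
    h = np.tanh(X @ W1 + b1)
    logits = h @ W2 + b2
    p = 1 / (1 + np.exp(-logits))                # sigmoid output

    # Backward pass: nudge every weight a little to reduce the error.
    grad_logits = p - y                          # sigmoid + cross-entropy gradient
    grad_W2 = h.T @ grad_logits
    grad_h = grad_logits @ W2.T * (1 - h**2)     # back through the tanh
    grad_W1 = X.T @ grad_h
    W2 -= 0.1 * grad_W2; b2 -= 0.1 * grad_logits.sum(0)
    W1 -= 0.1 * grad_W1; b1 -= 0.1 * grad_h.sum(0)

print(p.round(2))  # approaches [[0], [1], [1], [0]]: structure no one wrote down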
The model learns to look.
"Attention Is All You Need" changed everything.
Instead of reading word by word, the model learns which parts of the input matter for each prediction. Transformers were born — and with them, the GPT era. A single move that quietly replaced almost everything that came before.
Self-attention: every word looks at every other word — and decides how much each one matters.
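That one sentence is almost a specification. Below is a minimal sketch of scaled dot-product self-attention in numpy; the shapes, seed, and projection matrices are illustrative, not taken from any particular model.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Every row of X (one word) attends to every row of X (every word).

    X: (n_words, d_model); Wq, Wk, Wv: (d_model, d_head) projections.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # what each word asks, offers, carries
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # how much word i cares about word j
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                         # each word becomes a weighted mix of all words

# Four "words", each an 8-dimensional vector (random, for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8): one updated vector per word
```

The whole mechanism is three matrix multiplications and a softmax. Everything else in a Transformer is scaffolding around that move.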
GPT Grid (coming soon)
Take the Transformer. Make it enormous. Train it on everything.