Chapter 1 · The Counting Era

The Bigram Model

In 1948, Claude Shannon made a bet: you could predict the next letter in a sentence just by counting which letters tend to follow which. No grammar. No understanding. Just counting. In this chapter, you'll test that bet yourself.

~10 min read · 6 interactive demos

1 · The Challenge

Let's start with a game. You see a sentence with one letter missing — can you guess what comes next?

Before we dive in — try this. Type any letter below and see what happens.

You just saw a prediction. The machine looked at one letter and guessed what comes next. But how? Let's find out.

You just did something incredible: you predicted the next letter without thinking about it. Your brain used the letters before it — the context — to make an educated guess.

But here's the question that started it all:

How could we teach a machine to do the same thing?

A computer can't "understand" language. It can't read. It doesn't know what words, grammar, or meaning are. It only knows numbers. So we need a strategy so simple that even a calculator could do it. Let's invent one together.

2 · What If We Just Counted?

What if there's a hidden pattern in every piece of text ever written? Let's see if you can spot it.

Did you notice? Some pairs come up again and again — 'th', 'he', 'in', 'er'. These aren't random. Every language has favorite letter combinations. What if we counted all of them?

What you just discovered has a name. Linguists call a pair of two consecutive characters a bigram. The recipe: count pairs, then guess based on the counts.

It's almost embarrassingly simple. But it works better than you'd expect.
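The counting step really is that small. Here's a minimal Python sketch (the function name `count_pairs` is ours, not something from the demo):

```python
from collections import Counter

def count_pairs(text):
    """Count every pair of consecutive characters in the text."""
    return Counter(zip(text, text[1:]))

pairs = count_pairs("the theme then")
print(pairs[('t', 'h')])  # 3
print(pairs[('h', 'e')])  # 3
```

Even in a three-word sample, 'th' and 'he' already stand out, exactly the pairs English favors.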

3 · The Transition Table

You have hundreds of counted pairs. But where do you store them all?

Think about it: every character in the vocabulary could be followed by any other character. That means for each starting character, you need a slot for every possible next character. How many slots is that total?
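For a vocabulary of V characters, that's V × V slots. A quick check with the 96-character vocabulary this chapter uses later:

```python
vocab_size = 96                  # the demo's printable-character vocabulary
slots = vocab_size * vocab_size  # one slot per (current, next) pair
print(slots)  # 9216
```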

Every pair has exactly two parts — the current letter and the next letter. What if we organized them into a grid?

Rows = current letter. Columns = next letter. Each cell = how many times we saw that pair.

Let's start small — just 5 characters — and see what this table looks like:

Now it's your turn. Type any text below and watch how each character pair adds exactly +1 to its cell. By the end, you'll have built a complete transition table from scratch.
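In code, the same +1-per-pair update is just a nested dictionary. This Python sketch (the names are ours) mirrors what the demo does as you type:

```python
from collections import defaultdict

def build_table(text):
    """table[current][next] = how many times 'next' followed 'current'."""
    table = defaultdict(lambda: defaultdict(int))
    for cur, nxt in zip(text, text[1:]):
        table[cur][nxt] += 1  # each pair adds exactly +1 to its cell
    return table

table = build_table("hello hello")
print(table['l']['l'])  # 2
print(table['l']['o'])  # 2
```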

Now zoom out. The 5×5 table above covers just a handful of characters. The real table below covers all 96 printable ASCII characters — trained on thousands of sentences. Brighter cells mean the model saw that pair more often.

You built the table. It holds everything the model knows about which letters follow which. But raw counts aren't predictions — how do we turn them into actual probabilities?

4 · Turning Counts Into Chances

We have counts — but how do we turn "h→e appeared 3,481 times" into "there's a 32% chance 'e' comes after 'h'"?

Simple: we divide each count by the row total. If 'h' was followed by any character 10,800 times total, and 'h→e' appeared 3,481 times, then the chance is 3,481 ÷ 10,800 ≈ 32%. Now every row adds up to 100%.
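The division step in a sketch, using a toy 'h' row (the counts besides 'e' are made up, chosen so the row total matches the 10,800 above):

```python
def normalize(row):
    """Turn one row of counts into probabilities that sum to 1."""
    total = sum(row.values())
    return {ch: count / total for ch, count in row.items()}

# Toy 'h' row: the 'e' count and the 10,800 total match the example above.
h_row = {'e': 3481, 'a': 1620, 'i': 1500, 'o': 1300, ' ': 2899}
probs = normalize(h_row)
print(round(probs['e'], 2))  # 0.32
```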

Let's put it all together. Pick a character below and walk through the full prediction pipeline: look up the row, see the counts, normalize to probabilities, and roll the weighted dice.

The model can now make concrete predictions: "After 'h', there's a 32% chance the next letter is 'e', 15% chance it's 'a', and so on."

Key Takeaway
Turning raw counts into percentages (0% to 100%) is what lets the model make actual predictions. Each row adds up to 100% — a valid probability distribution.

5 · Let the Model Write

Our table is ready. Now let's do something fun: let the model write text on its own.

The process is simple, and we call it writing one letter at a time: pick a starting letter, look up its row in the table, roll a weighted die to pick the next letter, then use that letter as the new starting point. Repeat.
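That loop fits in a few lines of Python. The `generate` function and its nested-dict table format are our own illustrative choices, not the demo's internals:

```python
import random

def generate(table, start, length):
    """Write one letter at a time: look up the current letter's row,
    roll the weighted dice to pick the next letter, repeat."""
    out = start
    for _ in range(length):
        row = table.get(out[-1])
        if not row:  # this letter was never seen as a starting point
            break
        out += random.choices(list(row), weights=list(row.values()))[0]
    return out

# Train a tiny table, then let the model write:
text = "the theme then"
table = {}
for cur, nxt in zip(text, text[1:]):
    table.setdefault(cur, {})
    table[cur][nxt] = table[cur].get(nxt, 0) + 1
print(generate(table, 't', 20))  # different every run; every pair came from training
```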

Now let's give the model a starting letter and let it write. The playground below has a temperature slider — try low for safe and predictable, high for chaos and surprise.
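The playground's exact temperature formula isn't shown here; a common sketch is to raise each probability to the power 1/T and renormalize, which is what this illustrative function does:

```python
def apply_temperature(probs, temperature):
    """Reshape a probability row: T < 1 sharpens it (safe and predictable),
    T > 1 flattens it (chaos and surprise)."""
    scaled = {ch: p ** (1.0 / temperature) for ch, p in probs.items()}
    total = sum(scaled.values())
    return {ch: s / total for ch, s in scaled.items()}

row = {'e': 0.6, 'a': 0.3, 'o': 0.1}
print(apply_temperature(row, 0.5))  # 'e' dominates even more
print(apply_temperature(row, 2.0))  # the gap between letters shrinks
```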

Generate some text and notice something: a model with only one letter of memory produces gibberish that somehow feels letter-like. The pairs are right, but the words are wrong. Why?

6 · The One-Letter Amnesia

You built a working text predictor from scratch. It counts pairs, normalizes them into probabilities, rolls a weighted die, and writes text. That's real. But there's a devastating weakness hiding in plain sight.

Take a moment to appreciate what you've done: starting from nothing but raw text, you built a system that learns letter patterns, makes predictions, and generates new text. Every language model — including GPT — started from this same intuition. But now watch what happens when we push it.

Ask the model what comes after 'th'. It doesn't know about 't' — it only sees 'h'. So it gives the exact same prediction as 'sh' or 'wh'. The context before 'h' is invisible. Gone forever. Try it yourself:
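You can see the blindness directly in code: the lookup key is the last character alone, so any context ending in 'h' hits the same row. The tiny table here is hypothetical:

```python
table = {'h': {'e': 5, 'a': 2}}  # a made-up trained row for 'h'

def predict_after(table, context):
    """A bigram keys only on the LAST character of the context."""
    return table[context[-1]]

print(predict_after(table, 'th') == predict_after(table, 'sh'))  # True
print(predict_after(table, 'wh') == predict_after(table, 'th'))  # True
```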

The model is not just forgetful — it's structurally blind. No matter how much training data we feed it, the bigram will never distinguish 'th' from 'sh' from 'wh'. This isn't a bug we can fix with more data. It's a ceiling built into the architecture.

What if we let the model remember more than one letter? That changes everything.

Key Takeaway
A bigram only sees one letter of context. That's its fundamental limitation — and exactly why we need n-grams and neural networks.

What Comes Next?
