What If We Remember More?
The bigram model could only see one character behind. What happens when we give it two? Three? Five? The answer is both thrilling and devastating.
~15 min read · 8 interactive demos
Beyond a Single Character
Remember the fatal flaw? 'th', 'sh', 'wh' all gave identical predictions because the model saw only one character. What if we gave it two? Three? Five?
Click the buttons above. What happens to the prediction as N grows?
Predicting the next character after:
After just "a", the model knows only one character — too little to narrow down the options.
Notice how confidence jumps from 18% at N=1 to 94% at N=5. More memory transforms a blind guesser into a capable predictor.
An N-gram model looks at the previous N characters before it guesses the next one. Example: N=2 means it can see two characters of context.
More context makes guesses smarter. After "th", the model can strongly expect "e" — it has seen that pattern many times.
But more memory has a hidden cost. We are about to watch that cost grow faster than your intuition expects.
Counting with Context
The core idea is unchanged from bigrams — we still count. But now, instead of asking 'what follows this one character?', we ask 'what follows this sequence of N characters?' The table gets deeper, but the logic stays simple.
For every position in the training text, the model extracts the N-character context and records which character comes next. At prediction time it looks up the matching context row and reads off the stored probability distribution — pure table lookup, no math.
With N=1 (bigram) the table is a flat V×V grid. With N=2 it becomes a stack of grids — one per two-character prefix. Each additional character of context adds another dimension. The table doesn't just grow; it multiplies.
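The whole procedure fits in a few lines. Here's a minimal sketch in Python (an illustration of the idea, not the code behind the widgets — the function names are ours):

```python
from collections import Counter, defaultdict

def build_ngram_table(text, n):
    """For every n-character context, count which character follows it."""
    table = defaultdict(Counter)
    for i in range(len(text) - n):
        context = text[i:i + n]          # the N characters the model can see
        table[context][text[i + n]] += 1  # the character that followed
    return table

def predict(table, context):
    """Prediction is pure table lookup: normalize counts to probabilities."""
    counts = table[context]
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}

table = build_ngram_table("the cat sat on the mat", n=2)
print(predict(table, "th"))   # {'e': 1.0} -- 'th' was always followed by 'e'
```

Note that `table` is keyed by the full context string, which is exactly why the table multiplies with each extra character of context: every distinct prefix gets its own row.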
Remember the bigram's fatal flaw? 'th', 'sh', and 'wh' gave identical predictions because the model only saw the last 'h'. Now look at what N=2 sees instead:
Each row is now a two-character context. The distributions for 'th' and 'sh' are different — the model can finally tell them apart. This is the whole point.
That difference is real. The widget below puts bigram and trigram counting side by side on the same training text so you can measure it directly.
The Prediction Gets Better
More context means less ambiguity. When the model can see two characters instead of one, it rules out far more candidates — and the remaining predictions become dramatically more confident.
After 'h', dozens of characters are plausible. After 'th', the model strongly expects 'e'. After 'the', a space becomes almost certain. Each extra character of context narrows the field.
Notice the jump: 18% confidence at N=1, over 80% at N=3. Each extra character of context collapses ambiguity. The model isn’t guessing — it’s remembering.
That confidence gain compounds across a whole sentence. Below, the same seed feeds models with different memory sizes simultaneously — watch what one extra character of memory does to the output:
Look at the N=4 column: phrases that almost read like English. Look at N=1: random noise. Same logic, same training data — only the memory window differs. Three extra characters bought us a language model.
Now it's your turn. Pick a seed phrase, choose how much memory the model gets (N=2, 3, or 4), and watch it write.
You've built a much more powerful predictor. The 4-gram writes phrases that almost look like English. But every improvement has a price — and this one grows faster than you think.
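A generator in the spirit of the widget above can be sketched in a few lines (an illustrative sketch under our own naming, not the demo's actual code): build the count table, then repeatedly sample the next character given only the last N characters.

```python
import random
from collections import Counter, defaultdict

def build_table(text, n):
    table = defaultdict(Counter)
    for i in range(len(text) - n):
        table[text[i:i + n]][text[i + n]] += 1
    return table

def generate(table, seed, n, length=40, rng=random):
    """Repeatedly sample the next character given the last n characters."""
    out = seed
    for _ in range(length):
        counts = table.get(out[-n:])
        if not counts:                 # unseen context: nothing to look up
            break
        chars, weights = zip(*counts.items())
        out += rng.choices(chars, weights=weights)[0]
    return out

text = "the cat sat on the mat and the cat ran to the mat"
table = build_table(text, 3)
print(generate(table, "the", 3))
```

The `break` on an unseen context is the crack that widens into the sparsity problem below: the generator can only ever walk through contexts that appeared verbatim in training.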
The Price of Memory
Every improvement has a price. And this one grows faster than you think.
With 96 possible characters, every extra character of context multiplies the table by 96. N=1: 96 contexts. N=2: 9,216. N=3: 884,736. N=4: 85 million. N=5: over 8 billion.
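The arithmetic is easy to verify yourself:

```python
V = 96  # size of the character vocabulary
for n in range(1, 6):
    print(f"N={n}: {V**n:,} possible contexts")
# N=1: 96
# N=2: 9,216
# N=3: 884,736
# N=4: 84,934,656
# N=5: 8,153,726,976
```

And each context needs its own 96-entry row of counts, so the number of table cells is yet another factor of 96 larger.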
The numbers are abstract. Let's make this concrete — what do these tables actually look like?
But the explosion is only half the story. Building a bigger table is hard — but filling it is impossible. As N grows, most of the table stays empty:
Even with all the text ever written, could you fill the table? Use the slider below:
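The slider's lesson has a simple cause you can check in code: a text of length L contains at most L distinct contexts, so the number of filled rows grows at best linearly with the text while the table grows exponentially with N. A small sketch (our own sample text, purely illustrative):

```python
V = 96  # character vocabulary size
text = ("the table is too big to build and too empty to use . "
        "every extra character of context multiplies the table by the "
        "size of the vocabulary , so most rows are never observed .")
for n in range(1, 6):
    seen = len({text[i:i + n] for i in range(len(text) - n + 1)})
    possible = V ** n
    print(f"N={n}: {seen:,} distinct contexts of {possible:,} possible "
          f"({seen / possible:.4%})")
```

Scale the text up a million-fold and the distinct-context count still can't keep pace with `V ** n`.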
So far we've been counting characters — just 96 possible tokens. Word-level models count whole words instead. That changes everything:
Character-Level Tokens
Small, fixed vocabulary (~96 ASCII characters). Every input is representable. Simple to implement and visualize — ideal for understanding fundamentals. But each token carries almost no semantic meaning.
Vocab: ~96 | Example: ['t', 'h', 'e']
Word-Level Tokens
Semantically rich units that convey meaning per token. But vocabulary explodes to 50,000–500,000 entries, making the transition matrix enormous. Rare words cause sparsity; unseen words cause complete failure.
Vocab: ~50,000 | Example: ['the', 'cat', 'sat']
The combinatorial explosion at the word level makes even simple N-grams computationally infeasible:
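A quick back-of-envelope comparison makes the point, assuming a 50,000-word vocabulary (the figure is illustrative):

```python
char_vocab, word_vocab = 96, 50_000
for n in (1, 2, 3):
    print(f"N={n}: {char_vocab**n:,} character contexts "
          f"vs {word_vocab**n:,} word contexts")
# A word-level trigram table already has 125 trillion contexts --
# before storing a single count in any of them.
```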
Word-level models are also rigidly language-dependent. A model tokenized for English words breaks completely when given Spanish input, requiring an entirely new vocabulary and matrix. Character-level models, while less semantically rich per token, can handle multiple languages sharing the same alphabet.
The table is too big to build, too empty to use, and gets catastrophically worse with words. Three facets of one fundamental problem.
The Deeper Problem
The explosion is a practical problem — you can't build a big-enough table. But there's a conceptual problem that's even worse: even with infinite data, counting still fails.
Imagine the text starts with 'the cat sat on the'. If the model has seen that exact context, it can predict what comes next from memory.
Now change one word: 'the dog sat on the'. A human sees it's almost the same situation. The N-gram model treats it like a completely new, unrelated context.
N-grams have no concept of 'similar.' The contexts 'the cat' and 'the dog' are as different to the model as 'the cat' and 'xyzq'. Each is a separate row in the table, with zero connection between them.
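You can see this directly in code: the two contexts are unrelated dictionary keys, so evidence for one never touches the other. A minimal word-level sketch (our own toy training text):

```python
from collections import Counter, defaultdict

training = "the cat sat on the mat . the cat sat on the rug ."
words = training.split()

table = defaultdict(Counter)
for i in range(len(words) - 2):
    context = (words[i], words[i + 1])   # two-word context
    table[context][words[i + 2]] += 1

print(table[("the", "cat")])   # Counter({'sat': 2}) -- seen, confident
print(table[("the", "dog")])   # Counter() -- unseen: the model knows nothing,
                               # even though 'dog' behaves almost like 'cat'
```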
These aren't edge cases — they're everyday situations. First, watch what happens when a user makes a single typo:
Even worse: the model can't recognize that similar words should behave similarly. Try it below:
The table is too big and too empty. But even if we could fill it — even with infinite data — there's a deeper reason counting fails. It's time to step back and see the full picture.
The End of Counting
We've reached the end of what counting can do.
We started with bigrams, which remember one character. We pushed to N-grams, which remember more, and we watched predictions improve.
Then we hit two walls. The explosion wall: the table grows too fast to fill. More memory multiplies the table again and again.
The generalization wall: each context is an island. The model cannot share knowledge between similar contexts, so it fails on new phrases.
Take a step back and notice what you've built in your mind:
✓ You know how counting pairs becomes a prediction engine.
✓ You know why more context helps — and why it has a cost.
✓ You know that the table explodes and most of it stays empty.
✓ You know that counting cannot generalize: unseen = unknown.
Every one of these problems points to the same insight: we need models that don't just memorize — they need to learn patterns. What if, instead of storing each context as an isolated row in a table, we could compress contexts into dense vectors where similar meanings live close together? That's exactly what neural networks do.
The era of counting is over. The era of learning begins.
In the next chapter, we replace the table with a neural network. Instead of counting, it learns. Instead of memorizing, it generalizes. The jump is dramatic — and it starts with a single idea: embeddings.