LLMs Shouldn't Work
2026-03-10
The real credit of LLMs belongs to you, your ancestors, and everyone who ever wrote anything down.
I remember when GPT-3.5 was released (in the form of ChatGPT). A colleague showed it to me (literally the week it came out) and it felt like magic. At the time I was dealing heavily with timeseries data, and so the notion of "next word" prediction felt familiar. To me, it was the same as predicting the next value in a timeseries sequence. I had dabbled in some more modern deep learning approaches (like diffusion models for images and BERT for text), so I knew about the transformer architecture and 'attention'. This whole thing was not new to me. But the ability of ChatGPT to generalise its responses, and its breadth of capabilities, really floored me. This felt different to the AI I was used to.
It was pretty clear to me that a step change had been introduced. With image generation tooling I can kind of understand the ludicrous results it produces - after all, image compression algorithms are well known and embedded in the engineering zeitgeist. It's not too much of a leap to go from there to learning the compression with a deep network and then decoding it back into something understandable (a naive but useful simplification).
But text generation, especially at the level we see today, is a different beast. It is impressive. And so watching these chat interfaces stream out text while I was in the weeds of timeseries analysis got me thinking - could you apply the same power to timeseries problems?
I reasoned that what LLMs are doing is not much different to a timeseries prediction problem. Text is encoded into IDs at runtime (tokenised) then fed into the model to produce the next ID. Both timeseries models and LLMs are dealing with numbers as the prediction target. A naive view, but it seemed sensible.
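That framing can be sketched in a few lines. The vocabulary and text below are made up for illustration; real LLMs use learned subword tokenisers (such as BPE) with vocabularies of tens of thousands of entries, not a hand-written word map.

```python
# A minimal sketch of the tokenise-then-predict framing: text becomes a
# sequence of integer IDs, and the model's job is to emit the next ID.
# The vocabulary here is hypothetical and word-level for simplicity.
vocab = {"it": 0, "was": 1, "the": 2, "best": 3, "of": 4,
         "times": 5, "worst": 6}
inverse_vocab = {i: w for w, i in vocab.items()}

def tokenise(text: str) -> list[int]:
    """Map each whitespace-separated word to its integer ID."""
    return [vocab[word] for word in text.lower().split()]

ids = tokenise("it was the best of times")
print(ids)  # [0, 1, 2, 3, 4, 5]

# An LLM consumes `ids` and outputs a probability distribution over the
# vocabulary for the next ID - structurally the same shape of problem as
# predicting the next value in a timeseries.
```

Seen this way, the only obvious difference from a timeseries model is what the integers stand for.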
A Timeseries View of Text
And so only recently have I tried to exercise this curiosity, starting by visualising what LLMs are ingesting under the hood, just as we would with timeseries data. In the image below you can see the raw tokens from a segment of text, scaled between 0 and 1. The text in question is an excerpt from Great Expectations:

And here you can see the same sequence but with the natural log of the token value taken.

The log plot makes the underlying behaviour clearer: a noisy signal with a clear set of levels, corresponding to more frequently used tokens. If we look at a frequency histogram of the logged data we can see the multi-modality more clearly:

OK, so in this representation there seems to be nothing predictable about the text. I could draw random token IDs from this same distribution to produce data that conforms to it. I could then plot it, and I would not be able to tell whether the plot came from a meaningful segment of text.
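The experiment above can be reproduced with the standard library alone. The token-ID sequence below is made up; in the post it came from tokenising the Great Expectations excerpt.

```python
# Sketch: min-max scale a token-ID sequence, take its natural log, bin
# the logged values into a histogram, then draw random IDs from the same
# empirical distribution. The sampled sequence matches the histogram but
# carries none of the sequential structure of real text.
import math
import random
from collections import Counter

token_ids = [464, 1266, 286, 1661, 11, 340, 373, 262, 5290, 286, 1661]

lo, hi = min(token_ids), max(token_ids)
scaled = [(t - lo) / (hi - lo) for t in token_ids]   # values in [0, 1]
logged = [math.log(t) for t in token_ids]            # natural log

# Frequency histogram of the logged values (10 equal-width bins).
n_bins = 10
lo_l, hi_l = min(logged), max(logged)
width = (hi_l - lo_l) / n_bins
hist = Counter(min(int((v - lo_l) / width), n_bins - 1) for v in logged)

# Draw random IDs with the same empirical frequencies.
freqs = Counter(token_ids)
random.seed(0)
fake = random.choices(list(freqs), weights=list(freqs.values()),
                      k=len(token_ids))
```

Plotting `fake` next to `token_ids` gives two sequences that are statistically indistinguishable at this level, which is exactly the point: frequency alone tells you nothing about meaning.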
So the missing link is not just word frequency, but the relationship of each word to every other word - nothing new here. This is exactly where LLMs (and in particular, transformers) excel. They capture these relationships remarkably well, resulting in predictions that make sense.
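The mechanism doing that work is attention, which can be sketched in pure Python. The 2-d vectors below are illustrative numbers only; real transformers use learned projections and hundreds of dimensions per token.

```python
# Scaled dot-product attention over toy vectors: each position scores
# every other position, and mixes their values according to those scores.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """For each query, compute a weighted mix of all values."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Relationship of this token to every other token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        out.append(mixed)
    return out

# Three token positions with 2-d embeddings (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = attention(x, x, x)
```

Each output row is a convex combination of all the input rows, which is the sense in which every token "sees" every other token.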
The Gap between Language and Timeseries
On the surface, the comparison of text to timeseries data looks sensible. Token IDs are just a sequence of numbers. Timeseries data is just a sequence of numbers. Same problem, same solution.
Wrong, and the reason why took me a while to pin down.
When an LLM produces a prediction, you can validate it almost instantly - not because you've run a statistical test, but because you understand what the words mean. That understanding isn't something you bring to the text from outside. It's already in the text, in the language, and in us, encoded across billions of instances of human beings writing things down, arguing with each other, explaining ideas, telling stories. The language is the structure.
Timeseries data is different. A stock price, a temperature reading, a heart rate... These are symptoms of some underlying process that the numbers themselves don't describe. The meaning lives elsewhere: in market dynamics, in weather systems, in human physiology. The signal is just a projection of it.
This is why analytical methods work for validating timeseries predictions but fall apart when applied to text. And it's why the reverse is also true - the intuition you'd use to evaluate language doesn't transfer either. When you read a bad sentence, you feel it. When a timeseries prediction goes wrong, you need a test to tell you.
That said, LLMs do fool us - but even then, there's usually a gut feeling something is off. We rarely need a test.
So when I watched GPT-3.5 stream out coherent prose, my instinct was to see a sequence model doing sequence things. But what I was actually watching was something far more profound: a model that had absorbed the structure of language, and thereby human thought. The next token is almost beside the point. The power of LLMs simply demonstrates the incredible efficiency of language at encoding meaning that is deeply coupled with the human condition.
It is a miracle that such structure has been encoded in language through what feels like an evolutionary process: iterative, emergent, and deeply biological. And so I strongly believe the real credit of modern AI in the form of LLMs, as I said at the beginning of this post, belongs not to AI giants like OpenAI, Anthropic, etc., but to all of our ancestors in constructing such an efficient mechanism for encoding human thought - language.
And that's why LLMs, against all odds, do work.