- The scaling law says that the reason models keep getting better is that we’re throwing more computing resources at them
- These three equations show that it’s easy to improve AI performance at first, but much trickier to improve upon super large models trained on huge amounts of data
- Researchers are at odds on whether the best strategy for building better AI is designing new architectures that leverage human knowledge, or scaling up what’s already been done
- Ultimately, building more capable AI will require both smart architecture design and massive scale
The scaling law and the “bitter lesson” of AI
How bigger models, more data, and more compute keep beating clever tricks.
Published: September 23, 2025
A couple of weeks ago, I wrote a piece on here about model architectures, which are essentially the blueprints behind AI systems. I covered some of the big architecture types like Transformers and Convolutional Neural Networks, and the fancy tricks researchers use to help AI process enormous amounts of data.
But I left out a pretty major ingredient of today’s biggest and baddest AI models: scale. Frontier models are massive, and trained on legions and legions of high-powered servers – so much so that AI labs are raising tens of billions of dollars just to secure more of them.
So today, I’m back to tell you all about the AI scaling hypothesis. What is the scaling hypothesis and what does it tell us about building more intelligent systems? Is there a limit to how much we can achieve by throwing more compute at the problem? And how does this tie into what some researchers call the “bitter lesson” of AI?
What is the scaling hypothesis?
The scaling hypothesis is a law (technically three laws) that predicts AI model performance based on three key factors:
- How many parameters the model has – essentially how big the model is. You can train larger or smaller models; it’s up to you.
- How much data the model is trained on – remember, a big piece of why LLMs today are so good is that they’re pre-trained on huge swaths of the internet.
- The amount of computing resources available for training – the biggest models are trained on hundreds of thousands of advanced computing chips, or semiconductors.
The relationship between model performance (measured by the model’s loss – roughly, how far off its predictions are on held-out test data) and each of these three factors follows a power law.
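If you like seeing things in symbols, each of the three relationships takes roughly the following form (a sketch of how these laws are usually written; the constants N_c, D_c, C_c and the exponents are values fit from experiments, and I’m not quoting the published numbers here):

```latex
% Rough form of the three scaling laws: test loss L falls as a power of each factor.
% N = number of parameters, D = dataset size, C = training compute.
% The constants N_c, D_c, C_c and exponents \alpha_N, \alpha_D, \alpha_C are
% fit empirically -- treat the symbols here as placeholders, not published values.
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```

Since N, D, and C sit in the denominators, making any of them bigger pushes the loss down – but along a curve that flattens out the bigger you go.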
Quick refresher on power laws, for those who have been out of the classroom for a minute: a power law is a function relating two variables, x and y, where y is proportional to x raised to some fixed power. Multiply x by a given factor, and y always changes by the same corresponding factor – no matter where you started.
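To make that concrete, here’s a toy power law of the kind being described – reconstructed from the numbers in the example below (a constant of 20 and an inverse-square exponent), not pulled from any real model:

```latex
% A toy inverse-square power law: doubling x always cuts y to a quarter.
y = \frac{20}{x^{2}}
% Successive doublings of x:
%   x:  1 -> 2 -> 4    -> 8
%   y: 20 -> 5 -> 1.25 -> 0.3125
```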
For example, take the power law above. Doubling x always reduces y to a quarter of its previous value. When x is relatively small, this means you get a big bang for your buck: doubling x from 1 to 2 drops y from 20 all the way down to 5. But as x gets bigger and bigger, you need ever larger jumps in x to see a meaningful change in y.
Ok, now back to AI. A few years ago, a group of OpenAI researchers wanted to figure out what makes some AI models better than others. They trained a bunch of Large Language Models (LLMs), tweaking characteristics like model size and shape, dataset size, and batch size. What they found was pretty remarkable: the relationship between model loss and each of model size, quantity of training data, and computing resources follows a natural power law. Picture that same kind of curve, but with model size, quantity of training data, or compute on the x-axis, and model loss on the y-axis. By increasing the number of model parameters, the amount of text training data, or the number of GPUs used in training, you could actually improve your LLM’s accuracy by a specific, predictable amount.
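To give a feel for what “specific, predictable amount” means in practice, here’s a minimal sketch of the idea – not the researchers’ actual code, and with entirely made-up numbers – of fitting a power law to a handful of small training runs and then extrapolating to a bigger model:

```python
# A minimal sketch of a scaling-law fit (illustrative only, made-up numbers).
# Model: loss = a * N**(-alpha), where N is the parameter count.
# In log-log space that's a straight line, so an ordinary linear fit works.

import numpy as np

# Hypothetical (parameter count, test loss) pairs from small training runs
params = np.array([1e6, 1e7, 1e8, 1e9])   # model sizes N
losses = np.array([5.0, 4.2, 3.5, 2.9])   # measured test loss at each size

# log(loss) = log(a) - alpha * log(N)  ->  fit slope and intercept
slope, intercept = np.polyfit(np.log(params), np.log(losses), 1)
alpha, a = -slope, np.exp(intercept)

# Extrapolate the fitted curve to a 10x larger model (10 billion parameters)
predicted_loss = a * (1e10) ** (-alpha)
print(f"alpha ~ {alpha:.3f}, predicted loss at 10B params ~ {predicted_loss:.2f}")
```

The reason this kind of forecasting works at all is that a power law becomes a straight line once you take logs of both axes – which is exactly what makes these curves so handy for deciding, ahead of time, whether a much more expensive training run is worth it.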