
The scaling law and the “bitter lesson” of AI

How bigger models, more data, and more compute keep beating clever tricks.

ai

Published: September 23, 2025

  • The scaling law says that the reason models keep getting better is that we’re throwing more computing resources at them
  • The underlying power-law relationships show that it’s easy to improve AI performance at first, but much harder to squeeze further gains out of already-huge models trained on enormous amounts of data
  • Researchers are at odds on whether the best strategy for building better AI is designing new architectures that leverage human knowledge, or scaling up what’s already been done
  • Ultimately, building more capable AI will require both smart architecture design and massive scale

A couple of weeks ago, I wrote a piece on here about model architectures, which are essentially the blueprints behind AI systems. I covered some of the big architecture types like Transformers and Convolutional Neural Networks, and the fancy tricks researchers use to help AI process enormous amounts of data.

But I left out a pretty major ingredient of today’s biggest and baddest AI models: scale. Frontier models are massive, and trained on legions and legions of high-powered servers – so much so that AI labs are raising tens of billions of dollars just to secure more of them.

So today, I’m back to tell you all about the AI scaling hypothesis. What is the scaling hypothesis and what does it tell us about building more intelligent systems? Is there a limit to how much we can achieve by throwing more compute at the problem? And how does this tie into what some researchers call the “bitter lesson” of AI?

What is the scaling hypothesis?

The scaling hypothesis is a law (technically three laws) that predicts AI model performance based on three key factors:

  1. How many parameters the model has – essentially how big the model is. You can train larger or smaller models; it’s up to you.
  2. How much data the model is trained on – remember, a big piece of why LLMs today are so good is that they’re pre-trained on a huge chunk of the internet.
  3. The amount of computing resources available for training – the biggest models are trained on hundreds of thousands of advanced computing chips, or semiconductors.

The relationship between model performance (measured in terms of model loss – roughly, how wrong the model’s predictions are on held-out test data) and each of these three factors follows a power law.

Quick refresher on power laws, for those who have been out of the classroom for a minute: a power law is a function of the form y = a·x^k relating two variables, x and y. The key property is that relative changes map to relative changes: scaling x by some factor always scales y by a corresponding fixed factor, no matter how large x already is.

[Figure: a power-law curve – y drops steeply at first as x grows, then flattens out]

For example, take the power law above. Doubling x always reduces y to a quarter of its previous value. When x is relatively small, this means you get a big bang for your buck: doubling x from 1 to 2 drops y from 20 all the way down to 5. But as x gets bigger and bigger, you need an increasingly larger jump in x to see meaningful change in y.
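
Here’s that example as a quick code sketch. The specific curve y = 20/x² is an assumption inferred from the numbers above (x = 1 gives y = 20, and doubling x cuts y to a quarter); the point is just to watch the diminishing returns play out numerically.

```python
# The worked example above as a power law: y = 20 / x**2.
# (The constant 20 and the exponent -2 are inferred from the example,
# where x = 1 gives y = 20 and doubling x quarters y.)

def y(x: float) -> float:
    return 20 * x ** -2

for x in (1, 2, 4, 8, 16):
    print(f"x = {x:>2} -> y = {y(x):.4f}")

# Each doubling of x shrinks y by the same factor of 4, so the early
# doublings (20 -> 5 -> 1.25) buy far more absolute improvement than the
# later ones (0.31 -> 0.08) – diminishing returns in action.
```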

Ok, now back to AI. A few years ago, a group of OpenAI researchers wanted to figure out what makes some AI models better than others. They trained a bunch of Large Language Models (LLMs), tweaking characteristics like model size and shape, dataset size, and batch size. What they found was pretty remarkable: the relationship between model loss and each of those three factors – model size, quantity of training data, and computing resources – follows a power law. Picture the graph above, but with model size/quantity of training data/compute on the x-axis, and model loss on the y-axis. By increasing the number of model parameters, the amount of text training data, or the number of GPUs used in training, you could reduce your LLM’s loss – in other words, improve its accuracy – by a specific, predictable amount.
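
To make “specific, predictable amount” concrete, here’s a minimal sketch of the parameter-count version of that relationship, loss(N) = (n_c / N)^alpha_n. The constants n_c and alpha_n below are illustrative placeholders chosen for readability, not the fitted values from the paper.

```python
# Minimal sketch: test loss falling as a power of parameter count,
# loss(N) = (n_c / N) ** alpha_n. The same functional form applies to
# dataset size and compute, just with different constants.

def predicted_loss(num_params: float,
                   n_c: float = 1e14,      # assumed reference scale (placeholder)
                   alpha_n: float = 0.08,  # assumed small power-law exponent (placeholder)
                   ) -> float:
    """Predicted loss as a function of model size alone."""
    return (n_c / num_params) ** alpha_n

# Every 10x increase in parameters multiplies loss by the same fixed
# factor (10 ** -alpha_n), which is why the improvement is predictable –
# and why the absolute gains shrink as models get enormous.
for n in (1e8, 1e9, 1e10, 1e11, 1e12):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```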

Why scaling works
