Technically
The scaling law and the “bitter lesson” of AI

How bigger models, more data, and more compute keep beating clever tricks.

Published Sep 23, 2025
Justin Gage
Read within learning track: AI, it's not that complicated
  • The scaling law says that the reason models keep getting better is that we’re throwing more computing resources at them
  • These three equations show that it’s easy to improve AI performance at first, but much trickier to improve upon super large models trained on huge amounts of data
  • Researchers are at odds on whether the best strategy for building better AI is designing new architectures that leverage human knowledge, or scaling up what’s already been done
  • Ultimately, building more capable AI will require both smart architecture design and massive scale


A couple of weeks ago, I wrote a piece on here about model architectures, which are essentially the blueprints behind AI systems. I covered some of the big architecture types like Transformers and Convolutional Neural Networks, and the fancy tricks researchers use to help AI process enormous amounts of data.

But I left out a pretty major ingredient of today’s biggest and baddest AI models: scale. Frontier models are massive, and trained on legions of high-powered servers – so much so that AI labs are raising tens of billions of dollars just to secure more of them.

So today, I’m back to tell you all about the AI scaling hypothesis. What is the scaling hypothesis and what does it tell us about building more intelligent systems? Is there a limit to how much we can achieve by throwing more compute at the problem? And how does this tie into what some researchers call the “bitter lesson” of AI?

What is the scaling hypothesis?

The scaling hypothesis is a law (technically three laws) that predicts AI model performance based on three key factors:

  1. How many parameters the model has – essentially how big the model is. You can train larger or smaller models, it’s up to you.
  2. How much data the model is trained on – remember, a big piece of why LLMs today are so good is they’re pre-trained on the entire internet.
  3. The amount of computing resources available for training – the biggest models are trained on hundreds of thousands of advanced semiconductor chips, like GPUs.

The relationship between model performance (measured in terms of model loss, or the error rate at test time) and each of these three factors follows a power law.

Quick refresher on power laws, for those who have been out of the classroom for a minute: a power law is a function of the form y = a·xᵏ, relating two variables x and y. The key property is that a fixed proportional change in x always produces the same proportional change in y, no matter where on the curve you start.

[Figure: an example power law curve, y = 20/x²]

For example, take the power law above. Doubling x always reduces y to a quarter of its previous value. When x is relatively small, this means you get a big bang for your buck: doubling x from 1 to 2 drops y from 20 all the way down to 5. But as x gets bigger and bigger, you need an increasingly larger jump in x to see meaningful change in y.
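The diminishing returns in the example can be checked numerically. This is a minimal sketch assuming the curve described above, y = 20/x² (a = 20, k = -2):

```python
# A power law: y = a * x**k. With a = 20 and k = -2, doubling x always
# quarters y -- a big absolute drop at first, a tiny one later on.

def power_law(x, a=20, k=-2):
    """Evaluate y = a * x**k."""
    return a * x**k

# Doubling x from 1 to 2 drops y from 20 all the way to 5...
print(power_law(1), power_law(2))
# ...but doubling again, from 2 to 4, only drops y from 5 to 1.25.
print(power_law(4))
```

The proportional drop (a factor of 4) is identical at every doubling; only the absolute payoff shrinks.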

Ok, now back to AI. A few years ago, a group of OpenAI researchers wanted to figure out what makes some AI models better than others. They trained a bunch of Large Language Models (LLMs), tweaking characteristics like model size and shape, dataset size, and batch size. What they found was pretty remarkable: the relationship between model loss and each of model size, quantity of training data, and computing resources follows a natural power law. Picture the graph above, but with model size/quantity of training data/compute on the x-axis, and model loss on the y-axis. By increasing the number of model parameters, the amount of text training data, or the number of GPUs used in training, you could improve your LLM’s performance by a specific, predictable amount.
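One of those three power laws can be sketched in a few lines. This is an illustrative model of loss as a function of parameter count, L(N) = (N_c / N)^α; the constants `n_c` and `alpha` below are placeholder values chosen for illustration, not fitted results:

```python
# Illustrative scaling-law sketch: loss falls as a power law in model size.
# L(N) = (N_c / N) ** alpha, where N is the number of model parameters.
# The constants here are stand-ins, not the actual fitted values.

def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Predicted loss for a model with n_params parameters."""
    return (n_c / n_params) ** alpha

# Each 10x jump in model size cuts loss by the same *fraction* --
# predictable, but with diminishing absolute returns.
for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

Note the same shape as the toy power law earlier: going from 100M to 1B parameters buys a much bigger absolute drop in loss than going from 10B to 100B, even though the proportional improvement per 10x is constant.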

Why scaling works
