Pre-Training

intermediate

When training an LLM, pre-training is the "undergrad degree" phase where the model builds its foundational knowledge and world model.

  • The amazing GenAI models you know today are the product of several phases of training, of which pre-training is the first (and hardest).
  • Pre-training teaches the model foundational knowledge like grammar, facts, and (to some extent) reasoning.
  • The way it works is pretty simple: the model teaches itself by hiding words in sentences and trying to guess them.
  • A pre-trained model is smart but unfocused (it’s a walking encyclopedia, not a helpful assistant).
  • Pre-training LLMs is incredibly resource-intensive and expensive: this step can take thousands of GPUs and millions of dollars.

What is pre-training?

Think of a brand new, untrained AI model as a blank brain. It knows absolutely nothing—not what a "cat" is, not how to do math, and certainly not how to write Python code.

Pre-training is the process of locking that empty brain in the world's largest library and forcing it to read every book, website, and Reddit thread in existence. That's actually not far off from what the training code really does.

The goal isn't to teach it a specific job (like "be a customer support agent"). The goal is to teach it everything. By the end of pre-training, the model has a statistical understanding of language, logic, and facts. It knows that "Paris" is the capital of "France" and that "Run" is a verb.
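You can poke at this statistical knowledge directly. Here's a quick sketch using the Hugging Face transformers library and BERT, a small pre-trained masked language model (assumes you've run pip install transformers torch; the model weights download on first use):

```python
# Ask a pre-trained masked language model to fill in a blank.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

for guess in fill("The capital of France is [MASK]."):
    print(guess["token_str"], round(guess["score"], 3))
# The top guess should be "paris" -- knowledge absorbed purely
# from reading text, with no human labeling.
```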

But a pre-trained model is not a chatbot yet. It’s like a socially awkward genius who has read every book in the world but has never held a conversation.

But don’t just take my word for it. Some platforms will let you chat with a pre-trained model that hasn’t undergone any of the subsequent training steps that make it a helpful assistant. You’ll find it knowledgeable, but its output is essentially rambling: it completes text rather than answering you.

[Image: before-and-after illustration showing how pre-training fills a model with knowledge]

What is the "sentence re-arranging game"?

How do you teach a computer to read? You play the world's biggest game of Mad Libs.

In technical terms, this is called language modeling. Some models hide a word in the middle of the sentence ("Masked Language Modeling", the good MLM); GPT-style models always guess the next word. Either way, let's stick to the game analogy: you take a sentence from the internet, hide a word, and force the computer to guess what it is.

The Loop:

  1. Input: "The quick brown fox _____ over the lazy dog."
  2. Computer Guesses: "Sits?"
  3. Correction: "No. It's 'jumps'."
  4. Update: The computer adjusts its internal math to make "jumps" more likely next time.

This seems simple enough, but it was actually a fundamental breakthrough when it first became popular. For decades, machine learning models needed to be trained on carefully curated datasets with clear input-output pairs.

But language modeling doesn't require humans. You don't need to pay people to label data: the internet is the data. And so you can train models on datasets that are orders of magnitude larger…which is part of why we are where we are.
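To make the loop concrete, here's a toy sketch in plain Python. The "model" is just a table of word counts and the "update" step is incrementing a counter; real pre-training updates billions of neural-network weights with gradient descent, but the self-supervised shape of the loop is the same:

```python
# Toy next-word guesser: the text is its own training data.
from collections import Counter, defaultdict

corpus = (
    "the quick brown fox jumps over the lazy dog . "
    "the quick brown fox jumps again . "
    "the lazy dog sleeps ."
).split()

# "Label" each word with the word that follows it -- no humans required.
next_word = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word[prev][nxt] += 1  # the "update" step

def guess(prev: str) -> str:
    """Predict the most likely next word, given the previous word."""
    return next_word[prev].most_common(1)[0][0]

print(guess("fox"))   # -> "jumps"
print(guess("lazy"))  # -> "dog"
```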

[Image: diagram of the model learning by predicting missing words in sentences]

How does pre-training use the entire internet?

The secret sauce of modern AI is scale. Researchers realized that if you play the Mad Libs game on a small amount of text, you get a bad model. But if you play it on all the text, magic happens.

Pre-training datasets are colossal. They include:

  • Common Crawl: Basically a copy of the entire web.
  • Books: Millions of digitized volumes.
  • Code: Huge swaths of public GitHub (yes, that’s why it can write code).
  • Forums: Reddit, Stack Overflow, etc.

By processing this ocean of text, the model doesn't just learn grammar; it learns world knowledge. It learns that clouds are usually white, that water is wet, and that people on the internet really like arguing about politics. It’s not entirely clear if the model knows these things in the way that we do, but in practice it seems to.
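Labs don't just dump all of this in at once; they weight the sources. Here's a hypothetical sketch of what a data mixture might look like (the source names and weights here are made up for illustration, not any lab's actual recipe):

```python
# Hypothetical pre-training data mixture: which corpus each training
# batch gets sampled from. Real labs tune these weights carefully.
import random

DATA_MIXTURE = {
    "common_crawl": 0.60,  # web pages
    "books": 0.15,
    "code": 0.15,          # e.g. public GitHub repos
    "forums": 0.10,        # Reddit, Stack Overflow, etc.
}

def sample_source() -> str:
    """Pick the corpus the next training batch comes from."""
    sources, weights = zip(*DATA_MIXTURE.items())
    return random.choices(sources, weights=weights)[0]

print(sample_source())  # e.g. "common_crawl"
```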

Why is pre-training called "commoditized"?

In the early days (like, 2019), pre-training was the cutting edge. Now, many would say there’s not much innovation left there.

We call it "commoditized" because almost everyone does it roughly the same way, using roughly the same data. Whether it's Google, OpenAI, or Meta, they're all feeding their models the same internet. Roughly is the operative word: they're all trying to improve their pre-training datasets by curating them better, deduplicating them, and so on. But many researchers would argue that differentiated models don’t come from differentiated pre-training anymore.

Instead, the alpha today comes from what you teach the model after it leaves the library: the fine-tuning and reinforcement learning steps that come after pre-training. Pre-training is just the generic base layer.

What does a pre-trained model look like?

If you talk to a "raw" pre-trained model, it is a weird experience. It doesn't know it's supposed to be an assistant. It just wants to complete sentences.

You: "How do I make a cup of coffee?"

ChatGPT (Final Version): "Here is a step-by-step guide to making coffee..."

Pre-trained Model (Raw Version): "...is a question asked by many people. The history of coffee dates back to Ethiopia. The beans are roasted at 400 degrees. I like coffee. Coffee is good."

See the difference? The raw model is technically generating valid English related to coffee, but it's rambling. It’s just trying to predict the next word, not answer your question.
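You can reproduce this yourself with GPT-2, an older model that's still publicly available in its raw pre-trained form. A minimal sketch using the Hugging Face transformers library (assumes pip install transformers torch; outputs will vary from run to run):

```python
# Prompt a raw pre-trained model and watch it complete text
# instead of answering the question.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("How do I make a cup of coffee?", max_new_tokens=40)
print(out[0]["generated_text"])
# Expect a rambling continuation, not a step-by-step answer.
```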

[Image: comparison of a raw pre-trained model versus a helpful assistant response]

How long does pre-training take?

It totally depends on the size of your model. But typically for LLMs like GPT-5:

  • Time: Weeks or even months.
  • Hardware: Thousands of GPUs (chips originally designed for gaming) running 24/7 in a massive data center.
  • Cost: Millions of dollars (or more) in electricity and compute time; see the back-of-envelope sketch below.
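Here's a rough back-of-envelope calculation. Every number is an illustrative assumption; GPU count, duration, and hourly rates vary wildly by model and lab:

```python
# Back-of-envelope pre-training cost estimate. All inputs are
# made-up round numbers, not any lab's actual figures.
num_gpus = 10_000
days = 30
price_per_gpu_hour = 2.00  # USD, rough cloud rate

gpu_hours = num_gpus * days * 24
cost = gpu_hours * price_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours -> ${cost:,.0f}")
# 7,200,000 GPU-hours -> $14,400,000
```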

What comes after pre-training?

If pre-training is the undergraduate degree, the next steps are the specialized job training.

  1. Pre-training: The model learns to speak English and knows facts.
  2. Instruction fine-tuning: The model learns that when you ask a question, it should give an answer, not just generate more text (the data for this step looks like the sketch below).
  3. RLHF (Reinforcement Learning from Human Feedback): Humans rate the answers to teach the model to be polite, safe, and helpful.

Without these next steps, you don't have a helpful assistant; you just have a very expensive autocomplete engine.
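For a sense of how instruction fine-tuning differs from pre-training, here's a sketch of a single training example from that stage: an explicit prompt/response pair instead of raw internet text. The exact format varies by lab; this one is purely illustrative:

```python
# One hypothetical instruction fine-tuning example. Unlike
# pre-training's raw text, the "right answer" is labeled explicitly.
example = {
    "prompt": "How do I make a cup of coffee?",
    "response": "1. Boil water. 2. Put grounds in a filter. 3. Pour and brew.",
}
```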

Frequently Asked Questions About Pre-Training

Can you skip pre-training?

No. If you skip pre-training, your model is literally a random number generator. It needs that foundational understanding of language to function. It’s like trying to teach someone to write poetry before they know the alphabet.

Do all AI models need pre-training?

Pretty much any model dealing with complex stuff like language or images does. If you're just making a simple Excel formula to predict next month's sales, you don't need it. But for "AI" as we know it today? Yes.

Can you pre-train on private data?

You can, but it's rare. Pre-training requires massive amounts of data (trillions of words), and most companies don't have anywhere near enough internal documents to train a model from scratch. Usually, companies take a model that has already been pre-trained on the internet and then fine-tune it on their private data.

What happens if pre-training data is biased?

The model becomes biased. Period. Since the internet contains racism, sexism, and other -isms, the model absorbs those patterns during pre-training. Companies spend a lot of time in the later stages (Fine-tuning/RLHF) trying to "un-teach" these bad habits, but it's an ongoing battle.

How often do companies redo pre-training?

Almost never. It costs way too much money (we're talking tens of millions for the big models). Companies usually pre-train a base model once (like GPT-5) and then spend the next several months just tweaking and fine-tuning it.


