How do Large Language Models work?
Breaking down what ChatGPT and others are doing under the hood
Last updated: March 3, 2025
Large Language Models (LLMs) like ChatGPT, the new “Sydney” mode in Bing (which still exists apparently), and Google’s Bard have completely taken over the news cycles. I’ll leave the speculation on whose jobs these are going to steal for other publications; this post is going to dive into how these models actually work, from where they get their data to the math (well, the basics you need to know) that allows them to generate such weirdly “real” text.
Machine learning 101, a crash course
LLMs are a type of Machine Learning model like any other. So to understand how they work, we need to first understand how ML works in general. Disclaimer: there are some incredible visual resources on the web that explain how Machine Learning works in more depth, and probably better than me – I’d highly recommend checking them out! This section will give you the basics in Technically style.
The simplest way to understand basic ML models is through prediction: given what I already know, what’s going to happen in a new, unknown situation? This is pretty much how your brain works. Imagine you’ve got a friend who is constantly late. You’ve got a party coming up, so your expectation is that he’s going to, shocker, be late again. You don’t know that for sure, but given that he has always been late, you figure there’s a good chance he will be this time. And if he shows up on time, you’re surprised, and you keep that new information in the back of your head; maybe next time you’ll adjust your expectations on the chance of him being late.
Your brain has millions of these models working all the time, but their actual internal mechanics are beyond our scientific understanding for now. So in the real world, we need to settle for algorithms – some crude, and some highly complex – that learn from data and extrapolate what’s going to happen in unknown situations. Models are usually trained to work for specific domains (predicting stock prices, or generating an image) but increasingly they’re becoming more general purpose.
Logistically, a Machine Learning model is sort of like an API: it takes in some inputs, and you teach it to give you some outputs. Here’s how it works:
- Gather training data – you gather a bunch of data about whatever you’re trying to model
- Analyze training data – you analyze that data to find patterns and nuances
- Pick a model – you pick an algorithm (or a few) to learn that data and how it works
- Training – you run the algorithm, it learns, and stores what it has learned
- Inference – you show new data to the model, and it spits out what it thinks
You design the model’s interface – what kind of data it takes, and what kind of data it returns – to match whatever your task is.
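To make that concrete, here’s a minimal sketch of steps 1 through 5 in Python using scikit-learn. The dataset and model choice here are made up purely for illustration:

```python
# A minimal sketch of the train/inference loop with scikit-learn.
# The dataset is made up: Y is roughly 10,000 times X.
from sklearn.linear_model import LinearRegression

# 1. Gather training data: inputs (X) and observed outcomes (y)
X = [[1], [2], [5], [8], [10]]
y = [11_000, 19_500, 52_000, 79_000, 101_000]

# 2-3. Analyze the data and pick a model (here, plain linear regression)
model = LinearRegression()

# 4. Training: the algorithm learns the relationship and stores it
model.fit(X, y)

# 5. Inference: show the model new data, and it spits out what it thinks
print(model.predict([[15]]))  # somewhere around 150,000
```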
So what is the algorithm actually doing? Basically, it’s a really good analyst. It’s finding the relationships in the data you give it, which are often too subtle and complex for you to figure out manually. The data usually has some sort of X – characteristics, settings, details – and a Y – what ended up happening.
Imagine a simple dataset where Y grows by about 10,000 for every unit of X: you don’t need ML to tell you that when X is 15, Y will be somewhere around 150,000. But what happens when there are 30 different X parameters? Or when the data is shaped all strange? Or when you’re dealing with text? ML is fundamentally about modeling complex domains with complex data, where manual human analysis falls sadly short. That’s really all it is.
That’s why ML algorithms can be as simple as linear regression – which you may have learned about in Statistics 101 – or as complex as a neural network with millions of nodes. The kinds of models that have made headlines recently are mind-bogglingly complex, and took the work of hundreds of people (not to mention decades of collective research). But you’ll often find Data Scientists at companies using pretty simple algorithms and getting good results.
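To show that spectrum, here’s a hedged sketch: in scikit-learn, swapping the simple algorithm for a (tiny) neural network is basically a one-line change, since both expose the same fit/predict interface. The data is the same made-up pattern as above:

```python
# Same task, two algorithms. The interface (fit, then predict) stays
# the same whether the model is simple or complex.
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

X = [[i] for i in range(1, 11)]
y = [10_000 * i for i in range(1, 11)]  # the same X-to-Y pattern as above

simple = LinearRegression().fit(X, y)  # Statistics 101
neural = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5_000).fit(X, y)

print(simple.predict([[15]]))  # nails it: ~150,000
print(neural.predict([[15]]))  # a tiny neural net needs scaling and tuning
                               # to do as well on a problem this simple
```

Which is the point: more complexity isn’t automatically better, and simple models often win on simple data.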
🖇 Workplace Example
Creating powerful ML models from scratch is an incredibly specialized discipline. While many Data Scientists and ML Engineers do indeed do that with frameworks like PyTorch and TensorFlow, others build on top of existing open source models and extend their functionality. And you can even outsource the entire model development process, and use someone else’s right out of the box.
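For example – a sketch, assuming you have Hugging Face’s transformers library installed – using a pre-trained open source model takes just a few lines:

```python
# Using someone else's model right out of the box. The pipeline
# downloads a pre-trained open source model the first time it runs.
from transformers import pipeline  # pip install transformers

classifier = pipeline("sentiment-analysis")
print(classifier("This newsletter explains LLMs surprisingly well."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```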
Model development is iterative: unless your data is super simple, you’ll likely need to try different algorithms, and tweak them constantly before your model begins to make any sense. This is part science and math, part art, and part plain randomness.
Language models and generating text
When your data has a time component to it – say you want to predict stock prices in the future, or understand what’s going to happen in an upcoming election – it’s pretty easy to understand what a model is doing. It’s using the past to predict the future. But many ML models don’t work with time series data at all; language models are a great example of that.
Language models are just ML models that work with text data. You train them on what’s called a corpus (or just a body) of text, and then use them for any number of different things (there’s a toy example sketched after this list), like:
- Answering questions
- Searching
- Summarizing
- Transcription
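Here’s the toy example promised above – nothing like how modern models work, but it shows the core idea of training on a corpus: count which words follow which, then predict the most likely next word. The corpus is made up:

```python
# A toy "language model": count which word tends to follow which
# in a (tiny, made-up) corpus, then predict the most likely next word.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Training: tally next-word counts for every word in the corpus
following = defaultdict(Counter)
for word, next_word in zip(corpus, corpus[1:]):
    following[word][next_word] += 1

def predict_next(word):
    # Inference: the most common word seen after `word` during training
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' — it followed 'the' most often
```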
The concept of the language model has been around for ages, but deep learning with neural networks has been making waves recently; we’ll cover both.
Probabilistic language models
Statistically...