How do AI models think and reason?

All about "reasoning" language models like OpenAI's o3 and DeepSeek's R1.

The answers you already love from AI models like ChatGPT and Claude might seem like they’re deeply thought out and researched. But under the hood, these models have historically been pretty simple: they’re just playing the word guessing game. Enter a new generation of models – like DeepSeek’s R1 and OpenAI’s o3-mini – that are actually learning to think and reason like humans. This post will walk through how models actually do this and how you can use reasoning models to get better answers.

The word guessing game vs. actual reasoning

One of the most counterintuitive things about large language models is that, sure, they’re incredibly complicated and powerful, but also…very simple at their core. I explained how these models worked a couple of years back:

The way that ChatGPT and LLMs generate these entire paragraphs of text is by playing the word guessing game, over and over and over again. Here’s how it works:

  • You give the model a prompt (this is the “predict” phase)

  • It predicts a word based on the prompt

  • It predicts a 2nd word based on the prompt and the 1st word

  • It predicts a 3rd word based on the prompt and the first 2 words

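If you prefer code to bullet points, here’s a toy sketch of that loop. The `predict_next_word` function below is a made-up stand-in; a real model would score every token in its vocabulary based on everything generated so far.

```python
import random

# Made-up stand-in for a real LLM's next-word prediction. A real model
# scores its entire vocabulary given the text so far; this just picks
# from a tiny canned list so the loop runs.
def predict_next_word(text_so_far: str) -> str:
    return random.choice(["the", "model", "guesses", "another", "word"])

def generate(prompt: str, num_words: int = 20) -> str:
    text = prompt
    for _ in range(num_words):
        next_word = predict_next_word(text)  # predict one word from everything so far
        text = text + " " + next_word        # append it to the running text
    return text                              # then repeat, over and over

print(generate("Tell me about large language models:"))
```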
It’s really very primitive when you get down to it. But it turns out that the word guessing game can be very powerful when your model is trained on all of the text on the fucking internet! Data scientists have long said (about ML models) that “garbage in means garbage out” – in other words, your models are only as good as the data you’ve used to train them. Through its partnership with Microsoft, OpenAI has been able to dedicate tremendous amounts of computing resources towards gathering this data and training these models on powerful servers.

Things have changed since then: model architectures have gotten more complicated, and models are guessing more than one word at a time. But fundamentally, word guessing is how LLMs work. This is why model responses sometimes have obvious logical inconsistencies, or miss things that humans never would. They’re not really thinking in the human sense.

But what if they could? A seminal research paper from 2022 developed an idea called STaR (short for Self-Taught Reasoner). It’s a way to teach models to have real chains of thought, just like humans might, so models can “think” instead of just generating words right away.

[Image: language model Q&A loop]

The gist of the STaR method is teaching a model how to develop rationales: the reasons why the model gives the answer it did, instead of just the answer itself. The paper develops a method to fine-tune these models using the somewhat convoluted process in the diagram above, but the idea is pretty simple. If the model gets the right answer with a good rationale, it gets rewarded. If it gets the wrong answer, or the right answer with the wrong rationale, we start again.
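Here’s a rough sketch of one round of that process in code, under the simplifications above. The `ask` and `fine_tune` arguments are hypothetical stand-ins, not the paper’s actual code, and the “right answer, wrong rationale” check (and the paper’s rationalization trick) is skipped here.

```python
from typing import Callable, List, Tuple

# Simplified sketch of a single STaR iteration. `ask` stands in for
# sampling a (rationale, answer) pair from the model, and `fine_tune`
# stands in for training a new model on the examples we keep.
def star_iteration(
    model,
    dataset: List[Tuple[str, str]],  # (question, correct_answer) pairs
    ask: Callable,
    fine_tune: Callable,
):
    keep = []
    for question, correct_answer in dataset:
        rationale, answer = ask(model, question)  # reason first, then answer
        if answer == correct_answer:
            # Right answer: the rationale earns its keep as a training example.
            keep.append((question, rationale, answer))
        # Wrong answer: throw the rationale away and move on.
    # Fine-tune on the good rationales, then repeat with the improved model.
    return fine_tune(model, keep)
```

Run this over the whole dataset, fine-tune, and repeat; each round, the model generates slightly better rationales to learn from.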

The training data set is made up of questions like this:

[Image: Q&A pairs for language model training]

The answer is obviously (b). But why? The question here has three possible rationales, all true to varying degrees (readers who have taken the LSAT may take issue with this). The most correct rationale is the second one, and though it may seem basic, teaching this logic to the model helps it learn the basic structure of why an answer might be right or wrong. Repeat this over thousands of questions, and the model starts to get good.
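For a sense of what a single training item looks like in this setup, here’s a made-up example in roughly the same shape (not the actual question from the figure above):

```python
# Made-up example, just to show the shape of a rationale-annotated item.
example = {
    "question": "What can you use to carry a few groceries home?",
    "choices": {"a": "a swimming pool", "b": "a tote bag", "c": "a houseplant"},
    "answer": "b",
    # Candidate rationales the model might generate; only the one that
    # actually explains the answer should survive training.
    "rationales": [
        "Swimming pools hold water.",
        "A tote bag is designed to hold and carry small items.",
        "Houseplants live in pots.",
    ],
}
```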

This is just one paper and method among a growing body of literature on what researchers are calling “chain of thought” models. I’ll include more at the end for those curious readers among us.

Models starting to reason has implications beyond just better answers. It starts to shift a model’s compute usage **from training to inference**. So far, the story for most of GenAI has been that these huge models cost a metric ton to train, but are pretty cheap to use once they’re trained. With reasoning, models need to turn gears and think during inference, and that’s happening on expensive, supply-constrained GPUs. So how much you want a model to reason isn’t just a quality question, it’s an economic one.
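To make the economics concrete with some purely made-up numbers: if a reasoning model burns a couple thousand hidden “thinking” tokens before it answers, and you pay per output token, the same question can cost an order of magnitude more to answer.

```python
# Back-of-the-envelope sketch with made-up numbers (not real pricing):
# reasoning models spend extra tokens thinking at inference time, and
# every one of those tokens runs on a GPU you're paying for.
PRICE_PER_1K_OUTPUT_TOKENS = 0.01   # hypothetical price

standard_answer_tokens = 200        # made-up: a typical direct answer
reasoning_thinking_tokens = 2_000   # made-up: hidden chain of thought
reasoning_answer_tokens = 400       # made-up: a longer final answer

standard_cost = standard_answer_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
reasoning_cost = (reasoning_thinking_tokens + reasoning_answer_tokens) / 1000 * PRICE_PER_1K_OUTPUT_TOKENS

print(f"standard model:  ${standard_cost:.3f} per answer")
print(f"reasoning model: ${reasoning_cost:.3f} per answer")  # ~12x more here
```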

Reasoning models in practice: ChatGPT

Alright, fancy research, but are these models any good? Let’s take a look at ChatGPT, which can use OpenAI’s new o3-mini model under the hood. It’s a model specifically trained to reason, and then to share its reasoning with you so you know how it got to the answer it did.
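(If you’d rather skip the UI, the same model is reachable through OpenAI’s API. Here’s a rough sketch using the OpenAI Python SDK; the `reasoning_effort` knob controls how hard the model thinks, though exact parameter names may shift between SDK versions.)

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="o3-mini",            # the reasoning model discussed here
    reasoning_effort="medium",  # "low" / "medium" / "high": how much thinking to do
    messages=[
        {"role": "user", "content": "I'm feeling stuck in my career. What should I consider?"}
    ],
)
print(response.choices[0].message.content)
```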

I’ve been feeling a little bit of career uncertainty recently. Let’s see if ChatGPT can help me out:

[Image: a ChatGPT conversation]

This is the answer from ChatGPT’s “standard” model, which at the time of writing is GPT-4o. It’s concise, and by my best attempt at judgement, likely a correct answer to the question. There’s a little rationale and explanation here, but not a ton. How did ChatGPT arrive at this answer? Who knows.

Luckily for us, we can dig deeper. Clicking on this “Reason” option tells ChatGPT to use its reasoning model to respond.

[Image: turning on AI reasoning]

With o3-mini, I get a completely different answer to the same question:

[Image: performance enhancement to a reasoning model]

It’s significantly more verbose, and lists out several supporting points for each career option. Let’s take a look at this little thing though:

[Image: showing the reasoni...]