What’s inference?
AI didn’t take long to become totally ubiquitous. It answers your questions, summarizes your meeting notes, codes up your apps, and, if you're a bad friend, writes your birthday cards. All of it, every AI product you touch, boils down to two things: training and inference.
Training is how a model learns. You take a massive amount of text and use it to produce a mathematical equation that captures patterns in language: what words tend to follow what, how ideas relate to each other, what a reasonable response to a question looks like. Models like GPT and Claude were trained on enormous amounts of text from the internet, and what emerged are systems that can produce generally fluent, useful language.
Inference is when the model actually does its job. Every time you send a message to ChatGPT and it responds, that's inference. Every time Claude summarizes a document or an AI coding assistant autocompletes a function, that's inference too. It's the model taking what it learned during training and applying it to your input, in real time.
Training models has historically been the sexy part of AI, done by people in glasses, wearing turtlenecks, and holding PhDs. They are paid many dollars. Serving models, on the other hand, is plumbing. But as every company races to add AI to its app, it turns out the plumbers are doing pretty well for themselves. Dedicated inference providers have emerged as a major force in the AI application stack, offering a convenient managed inference service for developers building AI-powered features.
In their purest, most managed form, the OpenAI and Anthropic APIs (we’ll call these companies frontier labs) are AI inference providers. They have both trained the models and set them up for you to use easily. You make an API call (“write my wedding vows”) and get back an inference (“Webster’s dictionary defines love as…”). So what are we talking about here? Why would you use anything more specialized?
Why use an inference provider?
As someone using AI or building it into your app, you have two major choices. You can use a closed-source model like ChatGPT or Claude. These are proprietary models – their weights and training details aren’t public – and the only way to use them is through the companies that trained them, OpenAI and Anthropic. These labs (along with other closed-source providers like Runway, Midjourney, and Google) are very good at training models, and their models will generally be the best on the planet, but they also cost the most.
Then there are open-source models. These are built by companies or the community and released for all to use. The weights of these models (the key to the fancy prediction equation) and sometimes even the training code are open to the public. You can download and run them yourself, or even tweak them if you’d like. The most popular are Llama from Meta, Qwen from Alibaba, and DeepSeek.
With open-source models, you have a choice in how you actually use them. You can self-host by spinning up your own infrastructure, loading the model weights, and managing everything yourself. Or you can instead use an inference provider: a company that hosts open-source models for you and lets you access them through an API, just like you would with a closed-source lab.
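In practice, most inference providers expose an OpenAI-compatible HTTP API, so "just like you would with a closed-source lab" often means changing only the base URL and model name. A minimal sketch of that idea (the URLs and model names below are illustrative, not any specific provider's):

```python
import json


def build_chat_request(base_url: str, model: str, user_message: str) -> tuple[str, bytes]:
    """Build an OpenAI-style chat-completions request for any compatible host.

    The payload shape stays the same across providers; only the base URL
    and the model name change.
    """
    url = base_url.rstrip("/") + "/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return url, json.dumps(payload).encode("utf-8")


# Same request shape, two different hosts (both hypothetical here):
frontier = build_chat_request(
    "https://api.openai.com/v1", "gpt-4o", "write my wedding vows"
)
hosted_oss = build_chat_request(
    "https://api.example-inference.com/v1", "llama-3.1-70b", "write my wedding vows"
)
```

That interchangeability is a big part of the appeal: your application code doesn't have to care who is actually serving the model.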
There are a few reasons a developer might use an inference provider over just calling the inference endpoints of the frontier labs directly, or running OSS models themselves.
Cost. The newest frontier models are expensive. If your use case doesn't actually need a state-of-the-art model like GPT-5.4 or Claude Opus, you can run an open-weights model like Llama on an inference provider for a fraction of the price. For high-volume, straightforward tasks like summarization, classification, or extraction, you probably don't need the best model on the market, and the savings add up fast.
Speed. The big frontier labs generally optimize for model capability over latency. If you're building something where response time matters (autocomplete, real-time agents), inference providers running on optimized hardware can get you significantly faster performance.
Resilience at scale. If you're big enough that an OpenAI outage means your product has an outage, you have a problem. Inference providers let you route across multiple models and backends through a single API, so you're not in hell every time one provider has a bad day.
Native integration with your cloud. Public cloud inference offerings like AWS Bedrock or Google Vertex AI have an advantage in the enterprise because their models and endpoints are colocated with your other infrastructure, which matters to security- and data-residency-conscious buyers. They also tend to offer the most straightforward paths to fine-tuning and customization, which is when you take a base model and train it further on your own data so it performs better for your specific use case.
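The resilience point boils down to fallback routing: try your preferred backend, and if it errors out, fail over to the next one in line. A toy sketch of the idea (real routing layers add retries, health checks, and load balancing; the stub backends here just stand in for provider clients):

```python
def complete_with_fallback(prompt: str, backends: list) -> str:
    """Try each backend in order; return the first successful completion.

    `backends` is a list of (name, callable) pairs, where each callable
    takes a prompt and either returns a string or raises on failure.
    """
    errors = []
    for name, call in backends:
        try:
            return call(prompt)
        except Exception as exc:  # a real router would be pickier about what it catches
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


# Demo with stubs standing in for real provider clients:
def flaky_primary(prompt):
    raise TimeoutError("provider having a bad day")


def steady_backup(prompt):
    return f"completion for: {prompt}"


print(complete_with_fallback("summarize this doc", [
    ("primary", flaky_primary),
    ("backup", steady_backup),
]))  # prints "completion for: summarize this doc"
```

Because every backend speaks roughly the same API, swapping the order of that list, or adding a third provider, is a one-line change rather than a rewrite.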
All this, at least for me, actually reverses the question: why would you call the frontier labs’ APIs at all, when the inference providers are cheaper and faster? There are at least a few reasons:
- You need the latest and greatest models (remember, the inference providers are often hosting the open weights models, which are older)
- Cost and/or latency are not a concern for you (congrats, happy for you)
- You’re a noob and didn’t know about inference providers (me, until recently)