What’s inference?
AI didn’t take long to become totally ubiquitous. It answers your questions, summarizes your meeting notes, codes up your apps, and, if you're a bad friend, writes your birthday cards. All of it, every AI product you touch, boils down to two things: training and inference.
Training is how a model learns. You take a massive amount of text and use it to produce a mathematical equation that captures patterns in language: what words tend to follow what, how ideas relate to each other, what a reasonable response to a question looks like. Models like GPT and Claude were trained on enormous amounts of text from the internet, and what emerged are systems that can produce generally fluent, useful language.
Inference is when the model actually does its job. Every time you send a message to ChatGPT and it responds, that's inference. Every time Claude summarizes a document or an AI coding assistant autocompletes a function, that's inference too. It's the model taking what it learned during training and applying it to your input, in real time.
Training models has historically been the sexy part of AI, done by people in glasses, wearing turtlenecks, and holding PhDs. They are paid many dollars. Serving models, on the other hand, is plumbing. But as every company races to add AI to its app, it turns out the plumbers are doing pretty well for themselves. Dedicated inference providers have emerged as a major force in the AI application stack, offering a convenient managed inference service for developers building AI-powered features.
In their purest, most managed form, the OpenAI and Anthropic APIs (we’ll call these companies frontier labs) are AI inference providers. They have both trained the models and set them up for you to use easily. You make an API call (“write my wedding vows”) and get back an inference (“Webster’s dictionary defines love as…”). So what are we talking about here? Why would you use anything more specialized?
Why use an inference provider?
As someone using AI or building it into your app, you have two major choices. You can use a closed-source model like ChatGPT or Claude. These are proprietary models – their weights and training details aren’t public – and the only way to use them is through the companies that trained them, OpenAI and Anthropic. These labs (along with other closed-source providers like Runway, Midjourney, and Google) are very good at training models, and their models will generally be the best on the planet, but they also cost the most.
Then there are open-source models. These are built by companies or the community and released for all to use. The weights of these models (the key to the fancy prediction equation) and sometimes even the training code are open to the public. You can download and run them yourself, or even tweak them if you’d like. The most popular are Llama from Meta, Qwen from Alibaba, and DeepSeek.
With open-source models, you have a choice in how you actually use them. You can self-host by spinning up your own infrastructure, loading the model weights, and managing everything yourself. Or you can instead use an inference provider: a company that hosts open-source models for you and lets you access them through an API, just like you would with a closed-source lab.
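In practice, most inference providers expose an OpenAI-compatible HTTP API, so "just like you would with a closed-source lab" often means changing only the base URL and model name. A minimal sketch of that idea (the URLs and model names below are illustrative, not any specific provider's):

```python
import json


def build_chat_request(base_url: str, model: str, user_message: str) -> tuple[str, bytes]:
    """Build an OpenAI-style chat-completions request for any compatible host.

    The payload shape stays the same across providers; only the base URL
    and the model name change.
    """
    url = base_url.rstrip("/") + "/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return url, json.dumps(payload).encode("utf-8")


# Same request shape, two different hosts (both hypothetical here):
frontier = build_chat_request(
    "https://api.openai.com/v1", "gpt-4o", "write my wedding vows"
)
hosted_oss = build_chat_request(
    "https://api.example-inference.com/v1", "llama-3.1-70b", "write my wedding vows"
)
```

That interchangeability is a big part of the appeal: your application code doesn't have to care who is actually serving the model.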
There are a few reasons a developer might use an inference provider over just calling the inference endpoints of the frontier labs directly, or running OSS models themselves.
Cost. The newest frontier models are expensive. If your use case doesn't actually need a state-of-the-art model like GPT-5.4 or Claude Opus, you can run an open-weights model like Llama on an inference provider for a fraction of the price. For high-volume, straightforward tasks like summarization, classification, or extraction, you probably don't need the best model on the market, and the savings add up fast.
Speed. The big frontier labs generally optimize for model capability over latency. If you're building something where response time matters (autocomplete, real-time agents), inference providers running on optimized hardware can get you significantly faster performance.
Resilience at scale. If you're big enough that an OpenAI outage means your product has an outage, you have a problem. Inference providers let you route across multiple models and backends through a single API, so you're not in hell every time one provider has a bad day.
Native integration with your cloud. Public cloud inference offerings like AWS Bedrock or Google Vertex AI have an advantage in the enterprise because their models and endpoints are colocated with your other infrastructure, which matters to security- and data-residency-conscious buyers. They also tend to offer the most straightforward paths to fine-tuning and customization, which is when you take a base model and train it further on your own data so it performs better for your specific use case.
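The resilience point boils down to fallback routing: try your preferred backend, and if it errors out, fail over to the next one in line. A toy sketch of the idea (real routing layers add retries, health checks, and load balancing; the stub backends here just stand in for provider clients):

```python
def complete_with_fallback(prompt: str, backends: list) -> str:
    """Try each backend in order; return the first successful completion.

    `backends` is a list of (name, callable) pairs, where each callable
    takes a prompt and either returns a string or raises on failure.
    """
    errors = []
    for name, call in backends:
        try:
            return call(prompt)
        except Exception as exc:  # a real router would be pickier about what it catches
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


# Demo with stubs standing in for real provider clients:
def flaky_primary(prompt):
    raise TimeoutError("provider having a bad day")


def steady_backup(prompt):
    return f"completion for: {prompt}"


print(complete_with_fallback("summarize this doc", [
    ("primary", flaky_primary),
    ("backup", steady_backup),
]))  # prints "completion for: summarize this doc"
```

Because every backend speaks roughly the same API, swapping the order of that list, or adding a third provider, is a one-line change rather than a rewrite.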
All this, at least for me, actually reverses the question: why would you call the frontier labs’ APIs at all, when the inference providers are cheaper and faster? There are at least a few reasons:
- You need the latest and greatest models (remember, the inference providers are often hosting the open weights models, which are older)
- Cost and/or latency are not a concern for you (congrats, happy for you)
- You’re a noob and didn’t know about inference providers (me, until recently)