What are code sandboxes?

Why coding agents need a safe place to play, just like we all do.

Last updated Jun 17, 2026

Sandboxes.

Though for most of my readers (who for some reason I assume were born in the ’90s) this word may evoke a quaint yet hauntingly-long-ago vision of their suburban childhoods, it means something else entirely to a software developer.

The TL;DR

Sandboxes are isolated computing environments where you can run code that you don’t entirely trust. Think of them as solitary confinement, but for a program.

Much of software today requires you to run untrusted code – code written by a Large Language Model or even a human you don’t know.
Running untrusted code on your system is risky: it can breach your security, delete your data, and generally just mess with things.
A sandbox puts iron walls around this untrusted code so it can’t mess with things, while still allowing it to run and (hopefully) accomplish its task.
Sandboxes are absolutely exploding in popularity because of several different LLM-related use cases.

Sandboxes have been around for as long as I’ve been in tech. For years, products like Retool (my former employer) have needed ways to let their users run arbitrary code…without being subject to the risks of their users running arbitrary code.

But with LLMs taking center stage, they’ve gotten a second wind for the ages. There are now dozens (dozens!) of sandbox startups built for LLMs, all vying for the business of small companies and enterprises alike who need a place to run random code. And I will explain them to you.

Coding agents, RL training, and other forms of untrusted code

To understand why sandboxes are having such a moment, we must first detour to understand this concept of “untrusted code” a little more deeply.

Simply put, untrusted code is code that you, yourself, have not examined closely line by line. Because you haven’t examined it closely line by line, you don’t know what it does. And because you don’t know what it does, running the code in the same places that you have your other code running is very risky. Who knows what it’s going to do?

Imagine you’re a prison warden (if I have any readers who are prison wardens, let me know in the comments). You’ve got your hundreds of inmates who are, for the most part, playing pretty nicely together. Then you get a call that a new prisoner is on his way; you don’t know anything about him, just that he’s rumored to be a high-profile gangbanger. Is he well behaved? Could he incite a riot? Will he try to escape? Who knows.

With that uncertainty, you’re certainly not going to put him in Cell Block A with everyone else. Instead, you’ll cordon him off in solitary confinement where whatever he does won’t impact the orderly conduct of the rest of your subordinates.

In software engineering this is referred to as isolation. The prison is your infrastructure – the server your existing code runs on, the database it’s backed by, the APIs powering it. Everything for the most part is working well, and you don’t want to risk a new inmate fucking around with things.

So who are the gangbangers in our analogy? There are a few:

Internal coding agents – developers using LLMs to generate code for new features, fix bugs, etc. This code is untrusted, so it needs a sandbox to test it before merging into the rest of the codebase.
Reinforcement Learning (RL) – when teams train their own coding agents, they need environments where the agents can play around and learn how to get better. Since the code they’re generating and running is (very) untrusted, they need a sandbox.
(old school) Users – sometimes you want your users to be able to write code and run it on your systems.

Sandboxes give developers an isolated environment to run and test these outputs – these gangbangers if you will – where they physically cannot negatively impact other parts of your stack. If they mess up, if they turn malicious, if they start a riot, the scope of damage can never extend beyond the sandbox itself.
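
To make the idea of isolation a bit more concrete, here is a minimal sketch in Python. It runs a snippet in a separate process with a time limit, so a crash or an infinite loop in the snippet doesn't take the calling program down with it. The function name `run_untrusted` and the 5-second default are my own assumptions for illustration. To be loud about it: a child process is not a real sandbox. It can still read your files and reach the network, which is exactly the gap actual sandbox products exist to close.

```python
import subprocess
import sys


def run_untrusted(code: str, timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Run a Python snippet in its own process and capture what it prints.

    Illustration only: this limits run time and keeps crashes out of the
    caller, but it is NOT a security boundary.
    """
    return subprocess.run(
        # -I is Python's isolated mode: ignores environment variables
        # and the user's site-packages directory.
        [sys.executable, "-I", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired if exceeded
    )


# A well-behaved snippet: we get its output back.
ok = run_untrusted("print(2 + 2)")

# A misbehaving snippet: it exits with an error, the caller carries on.
bad = run_untrusted("raise SystemExit(3)")

# A runaway snippet: it gets killed when the time limit is hit.
try:
    run_untrusted("while True: pass", timeout_s=0.5)
    timed_out = False
except subprocess.TimeoutExpired:
    timed_out = True
```

The shape is the important part: the caller decides what runs, for how long, and only ever sees the output.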

By the way, this gangbanging is not theoretical. Code can absolutely get nasty and mess with your stuff. You’ve probably seen examples on X of coding agents “accidentally” deleting someone’s entire database. And for RL, logs of coding agents in sandboxes often show them attempting destructive behavior because they’ll basically try anything (that’s how RL works).

How a sandbox works: dynamically creating infrastructure at runtime

OK, so a sandbox is just isolated infrastructure. How is that different from a virtual machine (VM), the foundational building block of every cloud provider like AWS? And how do you actually use a sandbox?

There are many answers, but I will focus on 3 here because they’re the most important, and because I’ve grown tired of writing.

Sandboxes are at a different scale than VMs

For internal coding agents, the number of sandboxes you need is pretty manageable. An engineer might have a few, maybe even dozens, of agents working on different features, each of which needs a sandbox.

But for RL, we are talking about a completely different scale. To quote myself from another post:

“At scale, RL training becomes a problem of orchestrating massive numbers of sandboxes. One of Modal’s customers, a major AI lab, is already running on the order of 100,000 concurrent sandboxes for RL workloads, with a stated goal of reaching 1 million.”

In RL, your goal is for the young coding agent to explore different ways of writing code and find the best ones. The more sandboxes you can spin up, the more “shots on goal” your agent has to figure that out; ergo, the better your result will be and the faster you can get there.

Now. 1 million might be on the higher end of what demand looks like now. But even for more pedestrian numbers like 100K, you are already way outside of what you can realistically do with VMs. AWS will not let you create one hundred thousand virtual machines in one fell swoop.

This is why in practice, sandbox providers like Modal will put multiple sandboxes on one VM. And in many cases, the underlying cloud provider will have multiple VMs on one server. And then multiple servers in a rack…and multiple racks in a data center…it’s all coming together baby.
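
The arithmetic behind packing sandboxes onto VMs is simple enough to sketch. The 100,000 figure comes from the quote above; the per-VM densities below are made-up numbers for illustration, not anything a provider has published.

```python
def vms_needed(sandboxes: int, sandboxes_per_vm: int) -> int:
    """How many VMs you need if each VM hosts a fixed number of sandboxes."""
    if sandboxes_per_vm <= 0:
        raise ValueError("sandboxes_per_vm must be positive")
    # Ceiling division: a partially filled VM still counts as a whole VM.
    return -(-sandboxes // sandboxes_per_vm)


target = 100_000  # concurrent sandboxes, from the quote above

# Hypothetical packing densities, chosen only to show the trend.
for density in (1, 50, 500):
    print(f"{density} per VM -> {vms_needed(target, density)} VMs")
```

One sandbox per VM means asking your cloud provider for 100,000 machines. Pack a few hundred onto each and the request becomes something a provider can plausibly fulfill.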

Sandboxes have thick walls

Unlike VMs, which are built to communicate with other machinery like databases and APIs, sandboxes need to be mostly self-sufficient. By default all of their networking ports are closed. In many cases they are forbidden from communicating with any outside systems at all (e.g. no API calls). You can change all of these settings, but the defaults reflect the solitary confinement ethos.
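
Here is a rough sketch of what “closed by default, open on request” can look like as configuration. `NetworkPolicy`, its fields, and `api.example.com` are all hypothetical names I made up for this example; real providers each have their own settings and spelling.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class NetworkPolicy:
    """Hypothetical sandbox network settings. The defaults deny everything."""

    open_ports: FrozenSet[int] = frozenset()      # no inbound ports open
    allowed_hosts: FrozenSet[str] = frozenset()   # no outbound hosts allowed

    def allows_inbound(self, port: int) -> bool:
        return port in self.open_ports

    def allows_outbound(self, host: str) -> bool:
        return host in self.allowed_hosts


# Solitary confinement: nothing in, nothing out.
locked_down = NetworkPolicy()

# Opting in to exactly one outside system.
relaxed = NetworkPolicy(allowed_hosts=frozenset({"api.example.com"}))
```

The design choice worth noticing is the direction of the default: you list what is allowed, and everything you forgot to list stays blocked.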

Sandboxes are defined and spun up differently than VMs

For most use cases, VMs are things that you create with the intention of running for a while. Let’s say you’re building a new email client, or a competitor to X, or B2B software for dog groomers. For all of these, you will need ongoing backend infrastructure for your app that you will run indefinitely, basically until you go out of business (or switch providers).

Sandboxes, on the other hand, are ephemeral. They’re designed to be spun up, used for a very short period of time, and then demolished. Short here could mean only a few seconds, more typically a few minutes, and almost never longer than a few hours.

And because short is short, they’re built differently than VMs (which are here for a good time and a long time). They need to spin up much, much faster than a VM, which can take minutes. They don’t need as much attached storage, since they’re here for a short time. And other design decisions like these.
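
The spin up, use, demolish lifecycle maps nicely onto a Python context manager. This is only a sketch of the pattern using a throwaway directory as a stand-in for a sandbox; `ephemeral_workspace` is a name I invented, and a real sandbox tears down far more than a folder.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def ephemeral_workspace():
    """Create a scratch directory, lend it out, and always delete it afterwards."""
    root = Path(tempfile.mkdtemp(prefix="sandbox-"))
    try:
        yield root
    finally:
        # Demolish everything, whether the work succeeded or blew up.
        shutil.rmtree(root, ignore_errors=True)


with ephemeral_workspace() as root:
    (root / "output.txt").write_text("hello")
    existed_during = root.exists()

existed_after = root.exists()
```

Nothing written inside the workspace survives the `with` block, which is the whole point: no cleanup to remember, no leftovers to trip over.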

Sandbox providers: an exercise in translation

For my final act before releasing you, let’s stress test this post by taking a look at a couple of sandbox providers and how they pitch their product.

First, here is Daytona. I like them because I was at a talk that the CEO gave once, and he was wearing a Platinum Day Date, which let’s be honest is pretty sick.

Lightning-fast infrastructure for AI development: you can spin up these sandboxes really fast.
Separated & isolated runtime protection: what happens in the sandbox stays in the sandbox. Untrusted code can’t break out.
Massive parallelization for concurrent AI workflows: you can spin up a metric fuckton of sandboxes at the same time.

And here is Modal’s, same but different.

Built for concurrency: you can spin up a metric fuckton of sandboxes at the same time.
Fast on any image: you can spin up these sandboxes really fast.
Deep GPU or CPU capacity: this is more of a reflection of Modal’s impressive work on GPU infra, but unlike some other providers they can get you GPUs for your sandboxes.

Like I said, there are many, many sandbox providers now that the market for them has seemingly coalesced.

So next time you hear about sandboxes and think about the rectangular playdates of yore, and think to yourself:

“Wait a minute, I don’t have a child. I’m not even married. In fact, I’m not even a human being…I’m a horseshoe crab.”

Let this post be your guide.

Justin Gage

