Ye Olde Way: Become Ungovernable
Your garden-variety application these days runs on infrastructure hosted by cloud service providers like AWS and Azure. Most applications need a database to store and retrieve data, a web server to host the web app you click around in, another server to host the API, and a long tail of other things like file storage, search indexes, network load balancers, etc. This plethora of infrastructure needs is exactly why AWS is such a big business; their customers need a lot of things.
You need a database? Sure, I’ll make you a database. On AWS, this is what it looks like to spin up a database.
There are similar menus for the other server types we need, and yet more menus to ensure they can talk to each other over the network and are securely open to the internet so your users can… use them. These menus exist because infrastructure is essentially infinitely configurable. There are simply very many knobs and buttons because every customer needs something slightly different. So you end up with a screen like this.
If your application requirements never change — you never need to handle more traffic, you never add more features, you never need to fix bugs or patch security vulnerabilities — this system works quite well. For my app that predicts how long Chicago train delays will last, this is exactly what I did! I clicked through menus to get the infra I needed and never touched it again.
But real applications differ in (at least) two key ways from my train delay app.
They evolve. New features, growth, and software updates all require changes to infrastructure. A series C company I talked to recently said they make about five changes to their infrastructure a week.
There are many people involved. Different teams need to make infrastructure changes at different times. That same company I talked to has one infra team member for each of their eight engineering teams.
If the infra team in question were to log into the AWS console to create, change, and maintain infrastructure five times a week for eight different teams, a few things would break down badly. And as we know, when things at the infra layer break down, bad, bad stuff happens. Here’s what you can expect to go down if you stick with click-driven infra management:
Collisions and chaos: Without a centralized and auditable place to make infra decisions, people step on each other’s toes. What to one team may be a routine database version bump may break core functionality in another team’s features. There is no way to tell what some configurations depend on, and nothing to stop two engineers from going in and making the changes that are optimal for their team at the expense of another one.
Quality issues: In application code, the pull request serves as the official proposal of a change, which almost always requires a review and approval before merging in. Provisioning infrastructure via the console lacks any review process, and thus bad configs can sneak in and go straight to production. You are just a single click away from disaster. This is pretty perverse given how dangerous mistakes at this layer can be.
Cost drift: Humans tend to create the infrastructure they need eagerly and tear down the infrastructure they don’t need lazily (shout out my old unused Google Cloud API server, which I get billed 53 cents for every month, luv u). Without a system to track what’s actually needed, resources tend to accumulate, and that costs money.
Unrecoverable disasters: When things do go wrong, and they always will, click-driven infra lacks a coherent way to step back through the changes that were made to return to a stable state. If you’re lucky, someone wrote down the plan and took notes on what they did. But let me ask you. Do you feel lucky?
The point here is that in ye olde days (or today for some companies), infrastructure lacked governance. Which is to say there was no central system to review and approve proposed changes, implement those changes, observe them, and roll them back if needed in a cost-efficient way. To solve this problem, infrastructure engineers borrowed from the very masters they serve: coders.
The Way Nouveau: Infrastructure as Code
The problems above are not unique to infrastructure; they’re just uniquely painful for infrastructure teams. The same needs for governance, including centralized management, observability, deployment, and rollbacks, exist for regular old “code” and predate the management of cloud infrastructure by a few decades.
Things get complex when multiple engineers are working on the same, shared set of code. Ergo, people who write code need to review and approve changes to that code before it goes to production. They also need to understand change history and roll back when things go sideways. They did this by inventing and using Git, which allows users to collaborate on and govern code in text files. And it works! Git certainly has its issues, but it has become the de facto way engineers work on code together, and the world has remained intact.
Cool, easy. So let’s just do that for infrastructure, right?
The issue is that infrastructure isn't code you write; it's things out in the world, in data centers. It is real physical items. Git only works on text, on code. End of the line, right?
Wrong. The fundamental shift is to start thinking of infrastructure as code. By enshrining infrastructure requirements in text files, you make infrastructure tangible, unambiguous, and something on which multiple people can collaborate.
IaC has been around for much longer than it has been popular. In its purest form, IaC can be a script that spins up servers with command-line commands. Terraform, however, created by HashiCorp in 2014, really invented modern IaC and remains the most popular framework due to its strong community support and cloud-agnostic approach (it works with AWS, GCP, Azure, and plenty of others).
Its core idea is simple: what if you could describe all of the infrastructure your application needs in a text file, and then have a tool go out and make it real?
Here's what that looks like. Say you still need that Postgres database on AWS. Instead of clicking through the console, you write this:
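resource "aws_db_instance" "my_database" {
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.t3.micro"   # small, inexpensive instance size
  allocated_storage = 20              # disk size in GiB
  db_name           = "my_app_db"
  username          = "app_admin"     # RDS rejects "admin" as a reserved name for Postgres
  password          = var.db_password # passed in as an input variable, not hard-coded
}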
That's it. That's your database. You open your terminal and run terraform apply, Terraform talks to AWS, and a few minutes later you have a running Postgres instance. Need to bump the engine version? You change 15.4 to 16.2 in the file, run apply again, and Terraform figures out what needs to change and changes it. (For a major-version jump like this one, you'd also set allow_major_version_upgrade = true on the resource.)
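The database block above reads its password from var.db_password, which has to be declared somewhere. A minimal sketch of that declaration, assuming you want the value kept out of the file and out of the terminal output (the description text is illustrative):

```hcl
# Declares the input variable referenced as var.db_password.
variable "db_password" {
  description = "Master password for the Postgres instance"
  type        = string
  sensitive   = true # redacts the value in plan and apply output
}
```

One common way to supply the value is an environment variable named TF_VAR_db_password, which Terraform picks up automatically. Note that sensitive only hides the value from output; it is still written to Terraform's state, so the state itself needs to be stored somewhere safe.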
There are three layers to understand here.
Terraform itself is the framework. It's the engine that reads your .tf files, figures out what infrastructure currently exists, compares it to what you've described, and makes a plan to reconcile the two. It doesn't know anything about AWS or Google Cloud or Azure on its own; it just knows how to understand the state of the world and orchestrate changes.
Terraform Providers are plugins that teach Terraform how to talk to specific platforms. Each major cloud has an official one: the AWS provider is published by HashiCorp, and the Google provider is maintained together with Google. When you write resource "aws_db_instance", the AWS provider is the thing that actually knows how to call the AWS API and create that database.
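In practice, you tell Terraform which providers a project needs in a short block of configuration. A minimal sketch for the AWS provider follows; the version constraint and region are assumptions picked for illustration:

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws" # registry address of the AWS provider
      version = "~> 5.0"        # assumed version constraint
    }
  }
}

# Provider-level settings shared by every aws_* resource in this project.
provider "aws" {
  region = "us-east-1" # assumed region
}
```

Running terraform init in the same directory downloads the provider plugin so that later plan and apply runs can use it.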
Terraform code itself is written by infrastructure engineers (or DevOps engineers, or platform engineers, depending on what year it is and how your company thinks about titles). They write .tf files, commit them to Git, open pull requests, get reviews, and merge. Just like application code. Once the .tf file is merged, someone can run terraform apply by hand, but more often this is done by a GitHub Actions workflow in CI/CD.
This setup solves a number of the problems we talked about above:
Collisions go away because infrastructure changes now go through pull requests, and references between resources in the code make it visible what depends on what. When Team X wants to bump the database version, that's a diff in a file that Team Y can see, review, and flag before it goes live.
Quality goes up because changes get reviewed. A senior engineer can look at a proposed config and say "hey, you left this database open to the public internet" before it ever touches production.
Cost drift gets easier to manage because your .tf files are the source of truth. If a resource isn't defined in code, it shouldn't exist. Teams can audit what's declared, compare it to what's running, and kill the stuff nobody asked for. (My 53-cent Google Cloud server would not survive this process.) One caveat: Terraform only tracks the resources it manages. Delete a resource from the file and the next apply will literally destroy it for you, but catching things that were clicked into existence outside Terraform takes extra tooling on top.
Rollback becomes possible because you have a version history in Git. If something broke after the last apply, you can see exactly what changed, revert the commit, and apply again.
One thing people especially like about Terraform is that it's declarative. You describe the end state you want, not the steps to get there. You say "I want a database with these properties," and Terraform figures out whether that means creating one, modifying one, or destroying and recreating one. One file to rule them all. If it's not in the .tf file, it shouldn't exist.
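The "destroying and recreating" case is the one that bites, since some changes to a database can't be made in place. A hedged sketch of one guardrail, using Terraform's lifecycle block on the same database resource (the attribute values repeat the earlier example, with a non-reserved username assumed):

```hcl
resource "aws_db_instance" "my_database" {
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  db_name           = "my_app_db"
  username          = "app_admin" # assumed non-reserved master username
  password          = var.db_password

  lifecycle {
    # Terraform refuses any plan that would destroy this resource,
    # including a destroy-and-recreate triggered by a config change.
    prevent_destroy = true
  }
}
```

With this in place, a change that would otherwise replace the database makes the plan fail with an error instead, and a human has to decide what to do next. It does not help if the whole resource block is deleted from the file, since the guard is deleted along with it.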
Now, I'm being slightly disingenuous here. The massive menu of options and clicks isn't the only way to work with infrastructure in cloud providers. They also offer APIs, CLIs, and even their own cloud-specific IaC tools like AWS CloudFormation. But these lock you into a single provider's ecosystem and syntax. Terraform was built broad by design, giving you one declarative language that works across all of them and a plugin model that lets the community extend it to practically any API-driven service.
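To make "one language across all of them" concrete, here is a small sketch of a single configuration that creates storage on two clouds at once. The bucket names, project ID, and regions are hypothetical placeholders:

```hcl
terraform {
  required_providers {
    aws    = { source = "hashicorp/aws" }
    google = { source = "hashicorp/google" }
  }
}

provider "aws" {
  region = "us-east-1" # assumed region
}

provider "google" {
  project = "my-example-project" # hypothetical project ID
  region  = "us-central1"        # assumed region
}

# AWS: an S3 bucket
resource "aws_s3_bucket" "uploads" {
  bucket = "my-app-uploads-example" # hypothetical bucket name
}

# Google Cloud: a Cloud Storage bucket
resource "google_storage_bucket" "backups" {
  name     = "my-app-backups-example" # hypothetical bucket name
  location = "US"
}
```

The syntax and workflow are shared, but the resource types are still provider-specific: an aws_s3_bucket does not magically become a google_storage_bucket, so moving between clouds still means rewriting resources.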
Everything everywhere all in code
To this point, we’ve been talking about real “capital I” infrastructure: databases, servers, networks. But it turns out engineers love the declarative, centralized governance that Terraform affords them. More and more companies are shipping Terraform providers that allow engineers to manage anything in their app declaratively via Terraform, not just servers.
Database tables in the warehouse are one example. Typically, you might see an infra engineer use Terraform to create a database, but let data engineers create tables by connecting directly to the database and running “create table order…” in SQL. This works, but again suffers all the consequences of being ungovernable. By setting up the data team to use Terraform, the folks closest to the data can enshrine and evolve table definitions in code and keep the warehouse shipshape.
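As an illustration, here is roughly what a table definition might look like if the warehouse were BigQuery. That is an assumption; the same idea applies to other warehouses with their own providers. The dataset, table, and column names are hypothetical, and the Google provider is assumed to be configured already:

```hcl
resource "google_bigquery_dataset" "analytics" {
  dataset_id = "analytics" # hypothetical dataset name
  location   = "US"
}

resource "google_bigquery_table" "orders" {
  dataset_id = google_bigquery_dataset.analytics.dataset_id
  table_id   = "orders" # hypothetical table name

  # Column definitions, encoded as the JSON schema BigQuery expects.
  schema = jsonencode([
    { name = "order_id", type = "STRING", mode = "REQUIRED" },
    { name = "amount_usd", type = "NUMERIC", mode = "NULLABLE" },
    { name = "created_at", type = "TIMESTAMP", mode = "REQUIRED" },
  ])
}
```

Adding a column becomes a one-line diff in a pull request instead of an ad hoc SQL statement that nobody else sees.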
Another really interesting evolution in the utility of Terraform has been around permissions. Permissions – which engineers have which kind of access to which infrastructure resources – are notoriously ungoverned. People need permissions quickly, Slacking whoever is the path of least resistance for admin access to broad swaths of company data. They also rarely give up permissions when they don't need them anymore, and an admin looking to get a picture of who has access to what is in for a world of pain. You’ll never guess what I’m about to say, but managing permissions in Terraform makes for a more orderly, auditable, and safe operating model.
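A minimal sketch of what that can look like on AWS, assuming access is granted through an IAM group. The group name and user names are hypothetical, the users are assumed to exist already, and the AWS provider is assumed to be configured:

```hcl
# A group whose members get read-only access.
resource "aws_iam_group" "data_readers" {
  name = "data-readers" # hypothetical group name
}

# Attach the AWS-managed ReadOnlyAccess policy to the group.
resource "aws_iam_group_policy_attachment" "data_readers_read_only" {
  group      = aws_iam_group.data_readers.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

# Who is in the group lives in code, so the list is reviewable.
resource "aws_iam_group_membership" "data_readers" {
  name  = "data-readers-membership"
  group = aws_iam_group.data_readers.name
  users = ["alice", "bob"] # hypothetical existing IAM users
}
```

Granting access is a pull request that adds a name to the list, and removing a name takes the access away on the next apply, which is the "give up permissions" step that rarely happens by hand.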
There are, of course, some downsides to managing infrastructure and other application objects via Terraform.
Centralization can kill local velocity. If you used to be able to create database tables in SQL, and now someone tells you you have to ask the infra team to make a .tf PR, that’s going to slow you down and be annoying. Ultimately, the benefit of using IaC is a more global benefit that accrues slowly at the company level, and for this reason, it can feel heavy for smaller application objects.
The best system will be centralized where it matters (databases, k8s clusters) and delegated where it doesn’t (you should be able to create and manage your own GitHub repos, which have very little downside potential). The middle is where being thoughtful is important.
You have to learn Terraform. If your org is cool and allows you to manage your own domain resources via Terraform, you still need to learn a new language, HashiCorp Configuration Language (HCL). Again, globally and over time, this is great, but in the short term, it’s kind of annoying. For that matter, depending on what you were using before, you may have to learn Git. Pulumi, a series C startup, is making progress on this front, allowing engineers to write infrastructure specs in general-purpose languages like Python, TypeScript, and Go, but has some ground to make up if it wants to compete directly with Terraform.
Fear not the blade
When I PM’d our search functionality at Prefect, which was just before AI-coding tools really took off, we had a pretty typical eng team set up to make it happen: one backend engineer, one front-end engineer, and a platform team member who wasn’t exactly in the squad, but was our go-to for infrastructure-related requests. Early on, we identified the need to spin up some search infrastructure and let our platform guy know. Then, for a month or so, we built the feature while he worked on setting up Elasticsearch with about 20% of his time. After all, the application code was the slowest-moving part of this process; he had a few weeks before he really had to care.
AI is changing the role of infrastructure, platform teams, and Terraform in the workplace. If we built this same feature today, the front end and back end would be done in a matter of days, if not hours, and the bottleneck would move to the infra team. This is happening in every company that’s seriously adopting AI programming tools.
I see a few ways this could play out in the near future.
Hire more platform engineers. This is the default answer when certain capabilities become a bottleneck. It tracks to me that infra is theorized to be one of the last four jobs.
Let the AI write Terraform. This would close the productivity gap, and after all, it's just code, right? This will probably happen to some extent, but it's also dangerous (did you know that the blade is sharp??). If platform teams do this, it will need to come with appropriate guardrails and review steps.
Further job collapse (or, the democratization of DevOps). We’ve already seen AI collapse the product manager and engineer jobs into one. Product managers in title are increasingly expected to be able to vibe code MVPs and contribute to codebases directly, and engineers are expected to prompt LLMs into Product Requirement Documents. I think we’ll continue to see LLMs enabling sharp engineers to annex the territory of their former neighbors. With this evolution, engineers will not only be responsible for their application code, but also for the infrastructure and associated .tf code. Platform teams will shift to a higher-level strategic and enabling role, rather than authoring all the .tf code directly.
As AI floods production with more code than we ever thought possible, a strong foundation is more important than ever. DevOps is still scary. It was scary before Terraform, and it’ll be scary as long as infrastructure slips cut deep. But now, we have governance that helps us avoid the worst of the nightmares of yore.