The core Datadog product: observability
Every app that you use on the internet is running on a server somewhere. Developers need to understand what’s going on with their apps and servers so that things run smoothly: they’re checking how fast things run, what errors pop up, spikes in traffic, and stuff like that.
Datadog is a godsend for DevOps teams (and before you’re large enough to have a DevOps team, regular full stack developers). It hooks up to your infrastructure (Docker, Kubernetes, plain Linux, etc.) and automatically pulls metrics like CPU and disk usage. Here’s how the basic product works:
- Install Datadog on your servers (they call this the “agent”)
- Datadog collects your performance data and stores it
- You visualize and set alerts on that stored data as you please
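Once the agent is running, your own code can hand it custom metrics too, over the DogStatsD protocol — plain-text datagrams sent over UDP, port 8125 by default. Here’s a rough standard-library-only sketch of what one of those datagrams looks like on the wire (the metric name and tag here are made up for illustration; in practice you’d use Datadog’s official SDK rather than rolling your own):

```python
import socket

def format_dogstatsd(name, value, metric_type="g", tags=None):
    """Build a DogStatsD datagram: 'metric.name:value|type|#tag1,tag2'."""
    payload = f"{name}:{value}|{metric_type}"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload.encode("utf-8")

def send_metric(name, value, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to the local Datadog agent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_dogstatsd(name, value), (host, port))

# Hypothetical metric: report a checkout request taking 117ms.
# send_metric("checkout.latency_ms", 117.0)
```

UDP is a deliberate choice here: if the agent is down, your app doesn’t block or crash — the metric just gets dropped.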
Like most developer tools, the Datadog product is a combination of code libraries (SDKs) that you have to integrate with your application, as well as a web interface that you use for admin tasks, dashboarding, setting alerts, etc. Here’s what a basic Datadog dashboard might look like:
Servers, apps, and metrics
This is the first time we’re writing about monitoring and observability, so it’s worth taking a step back to understand (a) why developers need this visibility and all of these metrics, and (b) how we got here (why didn’t Datadog exist 10 years ago?).
Every app that you use on the internet is running on a server somewhere. Until recently, that literally meant one server - a giant computer - so you had whatever computing power you had, and one place to look if you wanted to know why things were slow, and why your users were sending you infinite “fuck you” loops in Java. On your server, there are a few metrics you want to keep an eye on to make sure everything is running smoothly:
- CPU – how much processing your server is doing as a function of its total processing power (e.g. 3 out of 4 CPUs)
- RAM - how much memory your server is using (e.g. 3GB/4GB memory)
- Disk - how much storage your server is using (e.g. 450GB/500GB stored)
- IO - how fast your server is reading and writing things from memory and disk
There were basic utilities in Linux for monitoring a lot of this stuff, like the htop command – which is still used a lot, mind you – but this was mostly a reactive process. Something would go wrong, and you’d check why.
Then a few things changed:
1) Infrastructure got easier, but more complicated
Once Docker and Kubernetes entered the scene, things changed a lot - the concept of “servers” got a lot more complicated, because there was a thick layer of abstraction between your code and what infrastructure it was running on. That meant more surface area for problems – you could be having an issue with Docker or your server – but it also meant nicher things to worry about, like your Kubernetes cluster having a hard time restarting pods and other things I don’t fully understand.
2) The internet got bigger
The other thing that happened is that the internet became widely available, so apps are now used by like, billions of people. So when 2 billion people are loading Facebook.com every hour as opposed to 200, a lot more things can go wrong, and critically, it’s more important to fix them, and fix them quickly.
Beyond servers, developers also needed to start monitoring their applications. How long are requests to our API taking to fulfill? Are users getting errors on their profile page? A problem with any of these could be the server’s fault or the application’s fault, which makes investigating all the more tedious.
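That application-level visibility usually boils down to timing things and counting errors yourself. A toy sketch of the general shape (this is not Datadog’s actual SDK — just an illustration of instrumenting a request handler, with a made-up `load_profile` handler):

```python
import time
from functools import wraps

def timed(metrics):
    """Decorator: record each call's duration (and any errors) into `metrics`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Count the failure, then let it propagate.
                metrics.setdefault(f"{fn.__name__}.errors", 0)
                metrics[f"{fn.__name__}.errors"] += 1
                raise
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                metrics.setdefault(f"{fn.__name__}.latency_ms", []).append(elapsed_ms)
        return wrapper
    return decorator

metrics = {}

@timed(metrics)
def load_profile(user_id):
    # Stand-in for a real handler that hits a database, renders a page, etc.
    return {"user": user_id}

load_profile(42)
```

In a real setup, instead of stuffing numbers into a dict, each timing would get shipped off to the agent — but the instrumentation pattern is the same.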
These are the two big ones, but a lot of other stuff was happening behind the scenes too that contributed to an environment where Datadog could grow fast. The move from monolithic apps to microservices created more individual entry points to monitor. And as companies moved from on-prem to the cloud, it became feasible for monitoring tools to integrate natively with that infrastructure instead of being wired up by hand.
So in summary, developers were faced with more complex infrastructure and more pressure to understand, monitor, and keep that infrastructure running smoothly. This is part of why DevOps (development operations) started to become its own discipline – companies were employing teams of developers just to deploy and monitor infrastructure. And from the ashes of 3AM pager buzzes and angry managers, a giant was born.