The basic Datadog product#
Datadog is a godsend for DevOps teams (and before you’re large enough to have a DevOps team, regular full stack developers). It hooks up to your infrastructure (Docker, Kubernetes, plain Linux, etc.) and automatically pulls metrics like CPU and disk usage.
Here’s how the basic product works:
- Install Datadog on your servers (they call this the “agent”)
- Datadog collects your performance data and stores it
- You visualize and set alerts on that stored data as you please
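To make the collect, store, alert loop above a little more concrete, here's a toy sketch in Python. It is not how the Datadog agent actually works: the in-memory `store` list, the `disk.percent_used` metric name, and the 90% threshold are all made up for illustration.

```python
import shutil
import time

# Toy in-memory "store": a list of (timestamp, metric_name, value) rows.
# A real monitoring backend would persist this in a time-series database.
store = []

def collect_disk_usage(path="/"):
    """Steps 1 and 2: sample one metric (percent of disk used) and store it."""
    usage = shutil.disk_usage(path)
    percent_used = 100.0 * usage.used / usage.total if usage.total else 0.0
    store.append((time.time(), "disk.percent_used", percent_used))
    return percent_used

def check_alert(metric_name, threshold):
    """Step 3: alert if the latest stored value is above the threshold."""
    values = [value for (_, name, value) in store if name == metric_name]
    return bool(values) and values[-1] > threshold

collect_disk_usage()
if check_alert("disk.percent_used", threshold=90.0):
    print("ALERT: disk is over 90% full")
```

The real thing runs the collection step on a schedule and across many machines, but the shape is the same: sample, store, compare against a rule.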
Like most developer tools, the Datadog product is a combination of code libraries (SDKs) that you have to integrate with your application, as well as a web interface that you use for admin tasks, dashboarding, setting alerts, etc. Here’s what a basic Datadog dashboard might look like:
You’ll notice that it’s tracking page views, frontend errors, server errors, and long running server tasks - you can use the Datadog SDK to instrument these specific metrics in your app.
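For a rough idea of what instrumenting a custom metric looks like, here's a minimal sketch that skips the SDK and sends StatsD-style datagrams by hand. It assumes a Datadog agent is listening for DogStatsD traffic on UDP port 8125 on the same machine (the usual default), and the metric names and tags are hypothetical. In a real app you'd use Datadog's official client library rather than a raw socket.

```python
import socket

def format_metric(name, value, metric_type, tags=None):
    """Build a StatsD-style datagram: name:value|type|#tag1,tag2"""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram

def send_metric(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; nothing is confirmed or retried."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))

# Hypothetical metric names, loosely matching the dashboard described above.
# "c" is a counter, "h" is a histogram.
send_metric(format_metric("app.page_views", 1, "c", tags=["page:/profile"]))
send_metric(format_metric("app.server_errors", 1, "c", tags=["env:prod"]))
send_metric(format_metric("app.task.duration_ms", 5320, "h"))
```

Because this is UDP, your app doesn't wait on (or care about) whether the agent received the metric, which is a big part of why this style of instrumentation is cheap to sprinkle around a codebase.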
Though you can use it to monitor on-prem servers, Datadog itself only works in the cloud, which is interesting, because with their market cap and customer list, you'd assume they're selling into organizations for whom cloud is a non-starter. It's possible these companies are more comfortable using the cloud for monitoring since there's little/no PII involved, but this thread is worth following (note: big competitor New Relic also doesn't deploy on prem).
The Datadog product suite#
No $34B enterprise software company has just one product, and Datadog is no exception. Generally, their product suite fits into 4 distinct buckets. Dashboards and alerting is sort of like a meta-layer that works on top of all of this stuff.
1) Infrastructure monitoring#
Infrastructure monitoring is keeping an eye on your core infrastructure, below the application level. Datadog has agents (integrations) for Docker, Kubernetes, Heroku, Ubuntu, etc. – pretty much everything you’d ever use if you’re not stuck on some legacy data center run by a guy named Karl who wears cargo pants and a Slayer tee to work. For example, here are the metrics that Datadog collects by default for their Docker agent:
Again, this data isn’t very valuable sitting around - the real oomph of Datadog is the ability to pull this into a dashboard, visualize it, and set alerts to let you know when something is wrong.
2) Application monitoring (APM)#
Application monitoring is one level above server monitoring - it’s how developers look at their frontend and backend performance. If you have an endpoint that returns user information that you use to populate a profile page, you might want to monitor how fast the request goes, and if it returns the kind of data you expect it to.
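Here's a small Python sketch of that profile-page example: a decorator that records how long a handler takes and whether the response has the fields you expect. Everything in it (the `monitored` decorator, the `measurements` list, the `/api/user` endpoint) is hypothetical; real APM tools do this automatically and in far more detail.

```python
import time
from functools import wraps

# Hypothetical in-memory record of request measurements.
measurements = []

def monitored(endpoint_name, required_fields):
    """Record how long a handler takes and whether its response looks right."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            response = handler(*args, **kwargs)
            duration_ms = (time.perf_counter() - start) * 1000
            missing = [f for f in required_fields if f not in response]
            measurements.append({
                "endpoint": endpoint_name,
                "duration_ms": duration_ms,
                "ok": not missing,
            })
            return response
        return wrapper
    return decorator

@monitored("/api/user", required_fields=["id", "name", "email"])
def get_user(user_id):
    # Stand-in for a real database lookup.
    return {"id": user_id, "name": "Ada", "email": "ada@example.com"}

get_user(42)
print(measurements[-1])
```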
APM is a category in and of itself, and there are entire other companies like AppDynamics, Dynatrace, and Sentry that make most of their money doing this.
3) Log management#
Recently, Datadog released a log management product - it gathers all of the logs that your server generates, stores them, and lets you analyze and search them. Using logs is generally the next step after monitoring - if Datadog shows you that something is going wack with your server, you’ll find the logs the server is emitting, which should contain useful info on what’s happening.
This is directly competitive with what Splunk and Elastic make their money on, so it will be interesting to see where it goes.
These first 3 categories - infrastructure metrics, APM, and logs - are often referred to as the 3 pillars of observability (as in some Greek pantheon or whatever).
4) Other Stuff™#
Sort of like Segment, and to an extent all maturing developer focused companies, Datadog has continued to release new products that aren’t groundbreaking on their own, and probably represent a tiny percentage of revenue, but help build their ecosystem and make the entire Datadog suite that much more attractive. A few examples:
Some of these are bound to sunset at some point; every enterprise software company eventually becomes an ecosystem, as the value of each individual product is a lot higher when they're integrated together into a single package (or at least that's what my strategy textbooks told me).
And Datadog's ecosystem strategy is clearly working: about 83% of customers use more than one product, 50% use more than four products, and 26% use more than six products.
Further reading#
- My friend Adam wrote a great piece on the history of DevOps and how Datadog came to be
- Datadog’s documentation is generally just ok, but their getting started guide is useful for breaking down the different product lines
---
Update: Datadog’s new security products#
Since the time of writing, Datadog has been investing heavily in new security products. Here’s a quick lowdown on what they are and what they do.
The way to think about these products is that they add a security view to the data that the product was already collecting. For example, Datadog knows the commands running on your servers and what they’re doing because you installed it for performance monitoring: it’s easy for them to organize that data in a different way (different charts, different alerts, etc.) that helps with security stuff instead of performance stuff.
1) Cloud security management#
Datadog CSM helps teams find any vulnerabilities in their infrastructure: unsecured endpoints that a hacker could get into, errant commands run on a machine that might be from a hacker, etc. The first thing you get is a dashboard that shows the status of each piece of your infrastructure, and how all of it relates to each other:
For each piece of infrastructure (a server, a database, etc.) you get a view of what commands are running in it, status, etc. If a suspicious command gets run (e.g. changing permissions on a server), Datadog will notify you. There’s also a facility for managing incidents – when a breach actually happens, or infrastructure goes down.
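As a rough illustration of the "suspicious command" idea, here's a toy rule matcher in Python. The two rules and the sample commands are invented for this example and say nothing about how Datadog's detection actually works, which is certainly more sophisticated than a couple of regexes.

```python
import re

# Hypothetical rules: regex patterns for commands worth a second look.
SUSPICIOUS_PATTERNS = {
    "permission change": re.compile(r"^\s*(sudo\s+)?(chmod|chown)\b"),
    "download piped to shell": re.compile(r"\b(curl|wget)\b.*\|\s*(ba)?sh\b"),
}

def flag_commands(commands):
    """Return (command, reason) pairs for commands matching any rule."""
    findings = []
    for command in commands:
        for reason, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(command):
                findings.append((command, reason))
    return findings

observed = [
    "ls -la /var/log",
    "chmod 777 /etc/passwd",
    "curl http://example.com/install.sh | sh",
]
for command, reason in flag_commands(observed):
    print(f"NOTIFY: {reason}: {command}")
```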
2) Application security management#
This product is similar to the above, but with respect to your application instead of your infrastructure. The basic idea is that Datadog will detect when someone is trying to attack your application, and notify you about what’s happening and where.
3) Cloud SIEM#
This one is interesting, since Datadog is getting into a space with some big players in it (like Splunk). SIEM (Security Information and Event Management) is the tried and true practice of ingesting massive amounts of mostly useless logs from your infrastructure and seeing what the fuck is going on with them.
Every time anything happens on a server, or in a more enterprise-focused SaaS product, a log is generated. That log is just a bunch of text – plus a timestamp – describing what happened. User logged in. Server restarted. Configuration changed. You get it. Anyway there are so many of these – and most are innocuous – that looking through them manually, especially for large companies, is all but impossible. SIEM products help teams ingest, transform, and analyze this big data to see what’s actually going on:
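To make the ingest, transform, analyze loop concrete, here's a toy sketch in Python. The log format (`timestamp event key=value`), the sample lines, and the "3 failed logins" rule are all assumptions made up for illustration; real logs are much messier and real SIEM rules much richer.

```python
from collections import Counter
from datetime import datetime

# Ingest: a handful of made-up log lines standing in for millions of real ones.
RAW_LOGS = [
    "2024-01-01T10:00:00 login_failed user=alice",
    "2024-01-01T10:00:05 login_failed user=alice",
    "2024-01-01T10:00:09 login_failed user=alice",
    "2024-01-01T10:01:00 server_restarted host=web-1",
    "2024-01-01T10:02:00 login_succeeded user=bob",
]

def parse(line):
    """Transform: split a raw line into timestamp, event, and key=value fields."""
    timestamp, event, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    return {"time": datetime.fromisoformat(timestamp), "event": event, **fields}

def repeated_failures(events, threshold=3):
    """Analyze: users with at least `threshold` failed logins."""
    counts = Counter(e["user"] for e in events if e["event"] == "login_failed")
    return [user for user, n in counts.items() if n >= threshold]

events = [parse(line) for line in RAW_LOGS]
print(repeated_failures(events))
```

Most of the lines are noise (a restart, a normal login), and the one pattern worth a human's attention only shows up once you structure and aggregate the text.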
It’s beyond me to say if this is a good SIEM product, but they seem to be positioning it as a cost-effective, simpler alternative to products like Splunk.