Infrastructure monitoring#
Infrastructure monitoring is keeping an eye on any metrics that relate to non-application data. The easiest place to start is the actual server hardware that your app is running on, but it extends beyond that to things like Docker or your network performance.
Starting with your server, there are a few metrics you want to keep an eye on to make sure everything is running smoothly:
- CPU - how much processing your server is doing as a fraction of its total processing power (e.g. 3 out of 4 CPUs)
- RAM - how much memory your server is using (e.g. 3GB/4GB memory)
- Disk - how much storage your server is using (e.g. 450GB/500GB stored)
- IO - how fast your server is reading from and writing to memory and disk
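To make the "used out of total" framing concrete, here is a minimal Python sketch. The helper names `utilization` and `disk_snapshot` are made up for this example; only `shutil.disk_usage` is a real standard library call, and a real monitoring agent collects far more than this.

```python
import shutil


def utilization(used: float, total: float) -> float:
    """Return used/total as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * used / total, 1)


def disk_snapshot(path: str = ".") -> dict:
    """Read disk usage for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return {
        "used_gb": round(usage.used / 1e9, 1),
        "total_gb": round(usage.total / 1e9, 1),
        "percent_used": utilization(usage.used, usage.total),
    }


# The examples from the list above, expressed as percentages.
print(utilization(3, 4))      # CPU: 3 of 4 CPUs busy -> 75.0
print(utilization(450, 500))  # Disk: 450GB of 500GB stored -> 90.0
print(disk_snapshot())        # real numbers for whatever machine runs this
```

A percentage like this is usually what gets charted and alerted on, since "90% full" means the same thing on a small server and a big one.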
Keep in mind that most popular modern applications run on distributed systems, so there’s likely way more than just one server involved. That means you need to keep an eye on metrics like these for all of those servers, and frameworks have evolved to do that for you automatically.
Observability isn’t just about identifying when things go wrong; it’s also about figuring out why they went wrong, and, of course, how to fix them. The answer usually lies in logs, which are little bits of text your app and infrastructure spit out when they do anything: like “hey we’re starting up” or “hey this specific thing went wrong.” And this is why monitoring tools are usually tightly coupled with log storage and management tools.
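If you have never seen where those bits of text come from, here is a small sketch using Python's built-in `logging` module. The service name `checkout` and the messages are invented for illustration, and the logs go to an in-memory buffer only so the example stands on its own.

```python
import io
import logging

# Write log lines to an in-memory buffer so the example is self-contained;
# a real app would write to stdout or a file that a log tool collects.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

logger = logging.getLogger("checkout")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep these lines out of the root logger

logger.info("starting up")
try:
    1 / 0  # stand-in for "this specific thing went wrong"
except ZeroDivisionError as exc:
    logger.error("could not compute order total: %s", exc)

print(buffer.getvalue())
```

Each line carries a level (`INFO`, `ERROR`), a source, and a message, which is what lets a log tool filter and search them later.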
Beyond servers, teams will monitor other parts of their infrastructure. How much data is your database storing, and how much space is left? How long does it take for your Docker containers to build? How long does it take for different servers to speak to each other? These are all questions that developers want to be able to answer quickly and consistently.
Application monitoring#
Even with the most performant, 100% perfect infrastructure, you’re not guaranteed to have a high-performing app: things can go wrong for any number of reasons. Recall that a modern app consists of a frontend – the visual, interactive part made up of HTML, CSS, and JavaScript – and a backend, which deals with the data and logic backing that frontend. Things can go wrong with either of them, and even in the interaction between them. Some common use cases:
- Request performance: how quickly are your API endpoints responding?
- Error monitoring: are you getting HTTP errors on your requests? Is your code erroring out? If so, what kinds of errors? When are they happening?
- Frontend load times: how quickly is your app loading? Are interactions from your users quick?
In some cases, issues with these metrics can relate to underlying issues with your app’s infrastructure. But many times they’re localized issues that need fixing, like your code not being able to handle weird edge cases.
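Here is a rough sketch of how request performance and error monitoring boil down to numbers. The `Request` record and `summarize` function are hypothetical, the sample data is invented, and treating any status of 400 or above as an error is an assumption made for this example.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    endpoint: str
    status: int         # HTTP status code
    duration_ms: float  # how long the endpoint took to respond


def summarize(requests: list[Request]) -> dict:
    """Compute error rate and median latency across a batch of requests."""
    if not requests:
        return {"count": 0, "error_rate": 0.0, "median_ms": None}
    # Assumption: count both 4xx and 5xx responses as errors.
    errors = sum(1 for r in requests if r.status >= 400)
    return {
        "count": len(requests),
        "error_rate": errors / len(requests),
        "median_ms": median(r.duration_ms for r in requests),
    }


sample = [
    Request("/api/users", 200, 120.0),
    Request("/api/users", 200, 80.0),
    Request("/api/orders", 500, 950.0),
    Request("/api/orders", 200, 100.0),
]
print(summarize(sample))
# {'count': 4, 'error_rate': 0.25, 'median_ms': 110.0}
```

The median is used instead of the average here because one slow request (950ms) would otherwise drag the number up and hide how most requests are behaving.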
The core New Relic product#
With that extensive background in mind, you’re set up to understand what New Relic does. In short, they offer a comprehensive set of tools for all of the monitoring that developers need, from application to infrastructure and beyond. Datadog is a useful comparison.
Here’s how it works:
Install New Relic’s agents on whatever you want to monitor#
To gather the data you want to look at, you need to install some New Relic code – referred to as an agent – on any property you want to monitor. For infrastructure-level items, this is pretty straightforward: New Relic provides easy-to-install agents for Kubernetes, Linux, AWS services, and even mobile platforms like iOS. Some configuration is required, but it mostly works out of the box.
For application-level stuff, you need to write some code. New Relic provides an SDK (a set of APIs) for picking which pieces of your application you want to monitor and how. For basic stuff like HTTP requests, these agents work out of the box, but you can also add custom instrumentation to monitor your app however you want.
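To give a feel for what custom instrumentation means, here is a generic sketch. This is not New Relic's actual SDK: the `instrument` decorator, the `timings` collector, and the metric name are all invented stand-ins that show the general idea of wrapping a function so every call gets measured.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-memory collector; a real agent would ship these timings
# to the monitoring backend instead of keeping them in a dict.
timings = defaultdict(list)


def instrument(name: str):
    """Decorator that records how long each call to the wrapped function takes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the function raised.
                timings[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@instrument("orders.calculate_total")  # made-up metric name
def calculate_total(prices):
    return sum(prices)


calculate_total([5, 10, 15])
print(dict(timings))
```

The appeal of this pattern is that the function being monitored does not change at all; you pick what to measure by adding one line above it.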
Build dashboards and visualizations#
Once your agents are installed and data is flowing, you can basically do whatever you want with it. Teams will usually build dashboards that pull together different data sources.
These dashboards are usually a combination of numerical data – how many errors have we seen on the frontend? How long are requests taking? – as well as aggregated data about logs and performance over time. Note that New Relic is actually storing the data they’re collecting on their servers, which is part of their value proposition; you’d need to store it yourself otherwise.
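A chart of "errors over time" is mostly just events grouped into time buckets. The sketch below shows that grouping in Python; `errors_per_minute` is a hypothetical helper and the timestamps are invented sample data, not anything pulled from New Relic.

```python
from collections import Counter
from datetime import datetime


def errors_per_minute(timestamps: list[datetime]) -> dict[str, int]:
    """Group error timestamps into one-minute buckets, sorted by time."""
    buckets = Counter(ts.strftime("%Y-%m-%d %H:%M") for ts in timestamps)
    return dict(sorted(buckets.items()))


# Invented sample data: four errors spread over a few minutes.
events = [
    datetime(2024, 1, 1, 12, 0, 5),
    datetime(2024, 1, 1, 12, 0, 40),
    datetime(2024, 1, 1, 12, 1, 10),
    datetime(2024, 1, 1, 12, 3, 59),
]
print(errors_per_minute(events))
# {'2024-01-01 12:00': 2, '2024-01-01 12:01': 1, '2024-01-01 12:03': 1}
```

Each bucket becomes one point on the chart, and a spike in a bucket is what tells you when to start digging.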
Dig deeper with traces and logs#
Monitoring isn’t just about knowing when things aren’t working; it’s also about making it easier to fix those things. So what exactly do you do when you see that your app is running into issues? Well, Sherlock, you’ll investigate, and the most common place to look is logs, bits of text that your application / infrastructure spit out whenever they do anything. If you’re seeing elevated error rates, you’ll check what those errors actually are, where they’re coming from, etc.
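As a sketch of what that investigation looks like in practice, the snippet below counts error lines by where they came from. The log lines and their "LEVEL source: message" format are made up for this example; real logs vary a lot, and log tools do this kind of grouping for you at much larger scale.

```python
from collections import Counter

# Made-up log lines in a "LEVEL source: message" format, for illustration only.
log_lines = [
    "INFO api: starting up",
    "ERROR api: timeout calling payments",
    "ERROR worker: could not parse job payload",
    "INFO api: request completed",
    "ERROR api: timeout calling payments",
]


def error_counts(lines: list[str]) -> Counter:
    """Count ERROR lines, keyed by (source, message)."""
    counts = Counter()
    for line in lines:
        level, _, rest = line.partition(" ")
        if level != "ERROR":
            continue
        source, _, message = rest.partition(": ")
        counts[(source, message)] += 1
    return counts


# Most frequent errors first.
for (source, message), n in error_counts(log_lines).most_common():
    print(f"{n}x {source}: {message}")
```

Sorting by frequency puts the most common error at the top, which is usually the first thing worth looking into.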
Note that you don’t need to use New Relic for storing logs just because you’re using them for APM and infrastructure monitoring (you could use Elastic, Splunk, etc.), but choosing the same platform for both of these does have its benefits.
Other New Relic goodies#
Beyond the standard monitoring workflow (above), New Relic has expanded the product suite to include a lot more DevOps-related stuff. A few examples:
Browser monitoring#
As any developer will tell you, creating web apps for the hundreds of different browser versions out there is a huge pain in the ass. What you developed locally using Chrome might render slightly differently on a version of Safari from 2 years ago, which a big customer happens to be using (there are entire companies like BrowserStack that exist to help with this). New Relic’s browser monitoring product helps track common metrics like page load time, time spent on page, visual stability, and errors.
AI for monitoring#
We live in an era of AI for everything, and monitoring is no exception. New Relic’s applied intelligence product will automatically analyze your performance data to find common errors and root causes. I think of this as a nice-to-have side feature, and it comes free with New Relic’s normal pricing.
Workflows in your IDE#
New Relic bought CodeStream in June of 2021 and integrated it into the product suite pretty quickly. The gist is basically moving a lot of the stuff you’d be doing in GitHub or the New Relic UI into your development environment of choice, which for most people is VSCode. There’s a lot you can do with it. One popular use case is bringing comments on your pull requests into your IDE directly:
Source: TechCrunch
There are some basic VSCode extensions that let you do this already, but CodeStream was pretty popular for more advanced stuff. A nice thing you can do with this integration is find an error with your app in the New Relic UI, click on it, and go directly to the offending code in your IDE instead of having to search for it manually.
The market is getting tougher, with large, established options like Dynatrace, Splunk, Grafana, Sumo Logic, Elastic, and especially Datadog, which New Relic specifically identifies as a key competitor in the observability category. We'll see where things go!