The Details: production databases (SQL and NoSQL)

A deep dive into the databases that power our apps.

This post is going to dive deeper into production databases . We’re going to look at what they’re used for, popular options (SQL vs. NoSQL – what’s the difference?), challenges with scaling applications, and where databases are going (serverless , basically). 

The first “Details” post covered ETL , and the last one covered data warehouses

Refresher: what’s a production database?

There are a lot of different types of databases out there. They come in all different shapes and sizes based on what engineering teams use them for. A few that you’ve probably come across:

  • Analytical databases: for storing analytical data and writing long, complex queries. E.g. Snowflake , BigQuery, Redshift

  • In-memory databases: for quick operations in memory. E.g. Redis, Memcached

  • Streaming databases: for streaming data from place to place. E.g. Kafka (sort of)

The largest category of databases though – both in terms of number of available options and total market size – is production databases. And those are the databases that teams use to store application data, the data that backs the apps we use every day like Gmail, Slack, Whatsapp, and even your Bloomberg Terminal.

For a quick refresher on what production databases do, now would be a good time to read the original post here. In short, production databases are optimized for transactions – small, highly frequent little operations that your app runs to keep things moving. For a live example, let’s take a look at my orders page on Amazon. 

amazon orders screenshot

Each one of these orders (and items too) is sitting in a database on Amazon’s servers. When I load this page, the site sends a request to Amazon’s backend saying “I need data about what Justin ordered recently.” Amazon’s backend sends that data back, and then the app formats it in a nice, reader friendly way with links, images, and all of that jazz. 

🔍 Deeper Look

When you load your orders page on Amazon, it isn’t issuing a query to the database directly. Amazon (and most other companies) set up API endpoints as an intermediary between their frontend and their database. Fundamentally though, learn to think about this page – and pages on any app you're using – as data.

This database “read” that you’re doing when you load this page is an example of a transaction. It might look something like this, if Amazon is using a SQL database behind the scenes:

SELECT  
  name,
  link,
 date,
 price
FROM orders
WHERE customer = ‘justin’

The same kind of thing happens when you buy an item on Amazon. Once I’ve loaded a product page (which, by the way, is also just data being read from a database), clicking “buy” will insert a new row into Amazon’s database and record that I just purchased this riveting read. 

how to win

That transaction might look something like this:

INSERT INTO orders (id, price, purchaser)
VALUES (ad73hfikd8d, $13.99, ‘justin’)

Transactions are the central tenet of how relational databases – the most popular type of production database – work. Most relational databases guarantee something called ACID, which is a rigid standard for transactions. It means that you can always count on your database to execute transactions in order, without them impacting each other. To understand why that’s important, let’s keep running with our Amazon example.

Imagine you (a sensible, rugged individual) want to purchase a copy of How to Win Friends & Influence People. You want to win friends and influence people. But it turns out Amazon only has one copy left. Now given that Amazon is a very popular website, there’s another person who also wants to buy it, right now (cheeky bastard). Unfortunately, they click “buy” before you’re able to; and that means you can’t buy it anymore. If Amazon isn’t able to update you quickly enough as to that unfortunate reality, you might also click buy, and pay them for something they can’t deliver to you. So as Amazon is executing the first transaction (inserting a row into the orders table), they need to lock the database and make sure you don’t see “in stock” for that particular book, or worse, pay them for it. 

This is called a dirty read, and it’s one example of where transactional integrity is important. Remember that Amazon has hundreds of millions of customers, so it’s pretty likely that stuff like this could happen all the time. Hence, the need for rigid database structure.

Scaling databases as you grow

As your application gets larger and more popular (hopefully), your database will come under an increasingly demanding load. Think about Amazon – they need to deal with thousands of inserts every second, and read petabytes (or more) of data all the time. This is called throughput – the number of operations per second that your database can handle. As you scale, you need to deal with increasing throughput, storing higher volumes of data, and making sure your database is resilient to failure (outages, etc.).

Infrastructure wise, most companies start out with a database on a single server, with some sort of attached storage (you could be using a managed service like RDS, or running it yourself). So how do you scale that? There are basically two ways to approach the problem:

  1. Scaling vertically – make your server bigger and faster

  2. Scaling horizontally – distribute your database across multiple, smaller servers

The difference is pretty intuitive. You can think about it like you want your horse-drawn carriage to go faster (a relatable, modern example). You can get a bigger, faster horse – or you can just get more horses. 

horizontal scaling

I could write an entire post on this, but the general rule is this: scaling vertically is easy, but scaling horizontally is efficient. As with our horse example: getting one, stronger and faster horse is intuitive and easy. But at some point, steroids aside, you’re going to hit a limit. Adding new horses to the fray is an efficient way to scale your speed, but it requires a lot more coordination and new equipment. You’ll also need to drive with more sensitivity and expertise.

The same thing is true with databases. If you want to make your single server bigger and faster, that’s pretty easy – you can do that with a few clicks in the AWS UI. But if you want to distribute your database across multiple servers, you have new problems to worry about:

  • Installing, patching and upgrading database software on each server

  • Distributing traffic across multiple servers based on load

  • Figuring out how to keep servers and their data in sync with each other

The last one is the hardest to do, especially when it comes to transactional rigidity. When you need to write a new row to the database, which server do you do it on? And then how do you make sure that change gets propagated to the other servers fast enough? 

These hairy problems are at the core of CAP Theorem, the first and last time you’ll see math referenced in this newsletter. The basic idea is that transactional integrity and high availability (i.e. distributed databases) are at odds with each other, and you can never really have both. This is why most NoSQL databases don’t promise integrity, but rather something called eventual consistency, which basically means that the database copies on each server will eventually make sense with each other.

SQL vs. NoSQL

Production databases – and even databases in general – generally fit into one of two categories:

  1. SQL – relational, structured, rigid

  2. NoSQL – flexible, unstructured 

Th...