What's a data warehouse?

A data warehouse is a special type of database designed for analytics instead of transactions.

The TL;DR

A data warehouse is a special type of database designed for analytics instead of transactions

  • There are two types of database use cases: transactional and analytical

  • Transactional databases like PostgreSQL are built for adding, updating, and removing data, but analytical DBs are for complex queries and joins

  • Data gets moved from transactional to analytical databases through **a process called ETL **

Data warehouses are one of the fastest growing segments in cloud , and power most of what’s happening in modern data science.

Transactional vs. analytical databases

🔮 Dependencies

To get the most out of this post, you’ll need to understand what a database is, how schemas and relations work, and how software runs in the cloud.

The core of whatever app you’re reading this in is a production database: it stores user information and other backend stuff that populates whatever you see on the screen (unless you printed this out, loser). That database is all about transactions:

  • Every time a new user signs up a new row gets added

  • If you change your password, your row gets updated

  • If you shut down your account, your row gets deleted (sometimes)

These kinds of processes and how they interact with a database are called OLTP, or online transactional processes.

Because a lot of these transactions can be happening at once (apps with many users or page loads), popular databases like MySQL are built with that in mind. They usually have special features that make sure these transactions don’t get messed up, and optimize for that kind of use case. It’s all about transaction after transaction, boom boom boom, making sure your app keeps working without getting messed up.

You can think of OLTP operations as text messages. It’s a great medium for short, clear directives. Need a quick favor. What’s Jason’s number? Are you going to the party tonight? Texts are about speed and efficiency.

There’s a whole other use case for databases though: analytics. If you have big questions you want to ask of your data, like how many users you have, how many orders get cancelled each month, or how much money you made last year, you use the database in a very different way. Analytical queries typically require looking at a lot more data, take a much longer time to run, and join a bunch of different sources together. The process is usually called OLAP, or online analytical processing.

transactional analytical

These kinds of analytical transactions are like email: it’s a much better format for longer form, more involved communication than texts are. Emails take longer to send, but they’re more organized and professional for when you need to send multiple paragraphs and embedded media.

Data warehouses and ETL

Because there’s such a big difference between transactional and analytical use cases, they require different tools; just like a bike and a car are both great, but for very different situations. A data warehouse is just a group of databases built fo...