This post is going to dive deeper into one of Technically’s most requested topics: data warehouses. We’re going to look at what they’re used for, popular options (Snowflake, BigQuery, Redshift: what’s the difference?), how data lakes fit into the picture, and how vendors are positioning themselves as platforms going forward.
Refresher: what’s a Data Warehouse?
A data warehouse is a specially designed database that holds analytical data. It’s built to handle long, complicated queries written by data scientists, analysts, and machine learning engineers.
If you’re iffy on what data warehouses do, now would be a good time to read the original post here. It covers:
- Why we use separate databases for our apps vs. analytics
- How data gets into a warehouse via ETL or ELT
- Typical types of analytical queries
And then come back here.
🚨 Confusion Alert
Part of where data warehouses get confusing is what they actually are. And the answer is that a data warehouse involves both:A different place to store your data (a separate database server)A different way of storing data (new data sources, structures, etc.)Most companies with data teams will have at least two separate places where they work with data: their production database, and a data warehouse. The production database only has data that’s relevant to the core operations of their app (users, business concepts, etc.). The data warehouse, on the other hand, might have copies of their production data, payment data, website traffic data, and lots of other stuff - whatever the data team wants to analyze.
All good? Cool – now let’s dive into the details.