The Details: ETL

A deep dive on how (tools, methods, use cases) companies move data back and forth.

Refresher: what’s ETL

ETL is the process of moving and transforming data to get it into a more useful format. 

Every company has this idealized vision of a data science and analytics team, with full visibility into how the business is doing, how the product gets used, how experiments are performing, super good looking and funny people, etc. The problem with getting there (and this is part of why data teams don’t get hired until later in the company lifecycle) is that the actual, cold hard data that you need to answer important questions typically lies all over the place. And it needs cleaning. 

I’ve decided to illustrate this principle with a sadly very real set of examples from my experience:

ETL use cases

All of this data sits in different systems - and what that means in practice is that you can’t just write a big SQL query that joins all of it together. You need to get it out of the source system and into a big fat warehouse so you can ask the questions you want. And what compounds this problem is that the value of this data is multiplied when it’s integrated, e.g. you can combine these sources together:

  • Website engagement + billing = which site pages drive signups who end up paying us the most money?

  • Paid marketing + Salesforce = which ad channels are driving the most revenue and large customers?

  • Product usage + Salesforce + billing = what features do our biggest customers use the most? What are early product signals of high engagement?

For those strategy minded citizens among us, you’ll notice that each of the vendors listed in the above table are actually incentivized to help you analyze the data they’re generating in their own tools and not out of them. So Stripe has Sigma, Salesforce has Dashboards, Google Analytics is a combined tracking tool and analytics tool, etc. They’ll let you get data out, but they want you to spend as much time in their software as possible, so they’ll make it easy to analyze it in place. 

So ETL - Extract, Transform, Load - is the process of getting siloed data out of source systems and into a central data warehouse , all together, where it can be analyzed easily and quickly. 

  • Extract: get data out of siloed source systems like Salesforce, Stripe, etc.

  • Transform: clean dirty data, reorganize it in more useful formats, join data together

  • Load: put the results of all of this into a data warehouse

This is a ubiquitous process - if you’re a company and you want to get value out of your data, you’re going to need to do some ETL sooner or later. And the canonical unit of ETL - one ETL job - can be as small as a few lines of SQL, or as big as a 20 step Python workflow.

Use cases / how it happens

Each job has (basically) 3 things that you need to worry about:

  1. A query (in SQL, Python, etc.) - the actual code to get the data, transform it, etc.

  2. A schedule -when the job should run (e.g. every day at 12PM)

  3. A dependency list or graph - what other ETL jobs need to run first

Unfortunately no catchy acronym here.

A basic example

Let’s walk through a very basic ETL job that aggregates the amount that we’ve charged each one of our customers in Stripe. Starting with the query, here’s the actual mechanics of what we want to do with source Stripe data: aggregate the total number of charges, and the total amount, per user. 

SELECT  
  user_id,
  SUM(amount),
  COUNT(*)
FROM users u
LEFT JOIN stripe.charges c ON u.user_id = c.user_id
GROUP BY 1

After we’ve got the query written, we need to figure out a schedule (let’s say we want this to run twice a day) and then identify if there are any dependencies for the query - since this just pulls directly from source Stripe data, there are none.

A more complex example

The query is really the hard part, in my experience - some of the queries I’ve built for ETL jobs can be hundreds of lines long. A good example is building state from events. A lot of product data exists in event formats, where every time a user do...