What's ETL?

The process of moving data around your internal systems to get it ready for analysis.

The TL;DR

ETL stands for Extract - Transform - Load, and it’s the process of moving data around your internal systems to get it ready for analysis. 

  • Data at most companies is siloed, dirty, and hard to work with: not ideal for analyzing it quickly and effectively

  • ETL is the process of moving, cleaning, and arranging that data in a way that makes your Data Science team happy (well, as happy as they’ll ever be)

  • Scheduling is one of the most frustrating parts of ETL, and an ecosystem of tools has popped up to make it easier

You’ve probably heard that 70% of Data Science is just moving and cleaning data, and nobody gets too excited about that. But ETL powers analysis at pretty much every company out there, so it’s worth understanding.

Background: the SDH problem

They say (well, I say) that all software is bad and complicated, but some is useful: the same thing is true about data. Most data is subject to what I call SDH: it’s siloed, dirty, and hard to work with.

Siloed

Because of how applications are architected, data tends to be stuck in the system that generates it. If you’re using Stripe for payments, your payment data is in the Stripe system. If you’re using Salesforce for your CRM, your sales data is in the Salesforce system. And if you’re using a PostgreSQL database to back your application, your user data is yet another system (yours). 

The problem is that for analysis – both for operational dashboards and longer term Data Science™ – the value comes from the integration of data, or being able to work with your payments, sales, and user data all in one place. You need to get it out of these siloed systems and into a central location.

Dirty

Data is dirty. Things are spelled incorrectly, some payment amounts are USD and others CAD, some engineer changed your storage format a year ago without telling anyone...the list goes on. You need to clean this data before analyzing it, and ideally do that on a regular basis. 

Hard to work with

Data has a format: it can be events data, user data, aggregated click counts, or any other form of captured reality. And the format that your data gets generated in is rarely the format you’re going to analyze it in. You need to transform it into something useful by aggregating, pivoting, filtering, and all of that fancy stuff you learn how to do in basic Excel classes. 

By now, you get it; we need to do a bunch of stuff to data before we can analyze it. That’s exactly what ETL is – it’s taking care of the SDH problem.

What’s ETL?

ETL stands for Extract - Transform - Load:

  • Extract data from where it lives

  • Transform it into a useful format through cleaning and aggregation

  • Load it into a warehouse for storage and analysis 

Let’s walk through each one of these so w...