Background: the SDH problem
They say (well, I say) that all software is bad and complicated, but some is useful: the same thing is true about data. Most data is subject to what I call SDH: it’s siloed, dirty, and hard to work with.
Siloed
Because of how applications are architected, data tends to be stuck in the system that generates it. If you’re using Stripe for payments, your payment data is in the Stripe system. If you’re using Salesforce for your CRM, your sales data is in the Salesforce system. And if you’re using a PostgreSQL database to back your application, your user data is in yet another system (yours).
The problem is that for analysis – both for operational dashboards and longer-term Data Science™ – the value comes from the integration of data: being able to work with your payments, sales, and user data all in one place. You need to get it out of these siloed systems and into a central location.
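To make “getting it out and into a central location” concrete, here’s a minimal sketch of an extract step. It assumes Stripe’s public REST endpoint for listing charges; the SQLite file and the raw_stripe_charges table are stand-ins for whatever central warehouse you actually use.

```python
# Minimal extract-and-centralize sketch. Assumes Stripe's REST /v1/charges
# endpoint; the SQLite "warehouse" and table name are stand-ins for illustration.
import sqlite3
import requests

STRIPE_API_KEY = "sk_test_..."  # placeholder secret key

def extract_stripe_charges(limit=100):
    """Pull recent charges out of the Stripe silo via its REST API."""
    resp = requests.get(
        "https://api.stripe.com/v1/charges",
        params={"limit": limit},
        auth=(STRIPE_API_KEY, ""),  # Stripe accepts the secret key as the basic-auth username
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]  # list of charge objects

def load_to_warehouse(charges, db_path="warehouse.db"):
    """Land the raw records in one central place, next to your other sources."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS raw_stripe_charges "
            "(id TEXT PRIMARY KEY, amount INTEGER, currency TEXT, created INTEGER)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO raw_stripe_charges VALUES (?, ?, ?, ?)",
            [(c["id"], c["amount"], c["currency"], c["created"]) for c in charges],
        )

if __name__ == "__main__":
    load_to_warehouse(extract_stripe_charges())
```

You’d write one of these per source (Stripe, Salesforce, your own database), all landing in the same place.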
Dirty
Data is dirty. Things are spelled inconsistently, some payment amounts are in USD and others in CAD, some engineer changed your storage format a year ago without telling anyone... the list goes on. You need to clean this data before analyzing it, and ideally do that on a regular basis.
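As a sketch of what cleaning can look like in practice, here’s a small pandas example that normalizes casing and converts everything to a single currency. The column names and the flat CAD-to-USD rate are assumptions for illustration only.

```python
# A small cleaning sketch using pandas. Column names and the CAD->USD rate
# are made up for illustration, not from any real schema.
import pandas as pd

CAD_TO_USD = 0.73  # assumed flat rate; in reality you'd look this up per date

raw = pd.DataFrame({
    "customer": ["Acme Corp ", "acme corp", "Globex"],
    "amount":   [100.0, 250.0, 80.0],
    "currency": ["USD", "CAD", "usd"],
})

clean = raw.copy()
# Normalize inconsistent spelling, casing, and stray whitespace.
clean["customer"] = clean["customer"].str.strip().str.title()
clean["currency"] = clean["currency"].str.upper()
# Convert everything to one currency so amounts are actually comparable.
is_cad = clean["currency"] == "CAD"
clean.loc[is_cad, "amount"] = clean.loc[is_cad, "amount"] * CAD_TO_USD
clean.loc[is_cad, "currency"] = "USD"

print(clean)
```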
Hard to work with
Data has a format: it can be events data, user data, aggregated click counts, or any other form of captured reality. And the format that your data gets generated in is rarely the format you’re going to analyze it in. You need to transform it into something useful by aggregating, pivoting, filtering, and all of that fancy stuff you learn how to do in basic Excel classes.
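To ground the aggregating/pivoting/filtering part, here’s a sketch of a transform step in pandas, run against a made-up events table.

```python
# Transform sketch: filter, aggregate, and pivot a made-up events table.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "event":   ["click", "purchase", "click", "click", "purchase"],
    "month":   ["2024-01", "2024-01", "2024-01", "2024-02", "2024-02"],
})

# Filter: keep only the events we care about.
purchases = events[events["event"] == "purchase"]

# Aggregate: count events per user per month.
counts = events.groupby(["user_id", "month"]).size().reset_index(name="n_events")

# Pivot: one row per user, one column per month -- the shape an analyst wants.
monthly = counts.pivot(index="user_id", columns="month", values="n_events").fillna(0)

print(monthly)
```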
By now, you get it: we need to do a bunch of stuff to data before we can analyze it. That’s exactly what ETL is – taking care of the SDH problem.
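Strung together, an ETL job is just those steps run in order. A skeleton of the shape – the function bodies are whatever your sources, cleaning rules, and warehouse require:

```python
# Skeleton of the full loop: Extract -> Transform -> Load.
# The bodies here are placeholders; the point is the shape of the job.
def extract():
    """Pull raw data out of each silo (APIs, database dumps, exports)."""
    return {"payments": [], "sales": [], "users": []}

def transform(raw):
    """Clean and reshape the raw data into analysis-ready tables."""
    return raw

def load(tables):
    """Write the result to the central warehouse your dashboards read from."""
    pass

def run_etl():
    load(transform(extract()))

if __name__ == "__main__":
    run_etl()  # typically run on a schedule, e.g. nightly
```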