The TL;DR
A data lake is an unstructured place to put data. It’s usually meant for long-term storage and infrequent querying.
- Companies are collecting more data than ever before (yada yada), but not all of it gets used immediately
- A data lake is a place to drop data, in an unstructured format, usually for long-term storage
- Data lakes are easiest to understand in contrast to data warehouses, where schemas are defined in advance
- A new trend is seeing data lakes act as quasi data warehouses, and it’s possible the categories might...merge?
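
If the lake vs. warehouse contrast feels abstract, here's a rough sketch of it in Python. Everything in it is made up for illustration: the `events` records, the `drop_in_lake` and `load_into_warehouse` helpers, and the idea of using a temp folder as the "lake" and an in-memory SQLite table as the "warehouse." Real systems are a lot bigger than this, but the core difference (no schema check vs. schema defined in advance) looks something like:

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical events. Note that the second one has an extra field.
events = [
    {"user_id": 1, "action": "click"},
    {"user_id": 2, "action": "purchase", "amount_usd": 19.99},
]


def drop_in_lake(lake_dir: Path, records: list[dict]) -> list[Path]:
    """Lake style: write every record as-is, with no schema check."""
    paths = []
    for i, record in enumerate(records):
        path = lake_dir / f"event_{i}.json"
        path.write_text(json.dumps(record))
        paths.append(path)
    return paths


def load_into_warehouse(conn: sqlite3.Connection, records: list[dict]) -> int:
    """Warehouse style: the schema exists first, and records must fit it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(user_id INTEGER NOT NULL, action TEXT NOT NULL)"
    )
    loaded = 0
    for record in records:
        # Skip anything that doesn't match the predefined columns exactly.
        if set(record) != {"user_id", "action"}:
            continue
        conn.execute(
            "INSERT INTO events VALUES (?, ?)",
            (record["user_id"], record["action"]),
        )
        loaded += 1
    return loaded


# The "lake" here is just a temporary folder of raw JSON files.
with tempfile.TemporaryDirectory() as tmp:
    lake_count = len(drop_in_lake(Path(tmp), events))

# The "warehouse" here is an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
warehouse_count = load_into_warehouse(conn, events)

print(lake_count, warehouse_count)  # prints: 2 1
```

The lake happily takes both records, odd shapes and all. The warehouse only takes the one that matches the schema it was told about ahead of time.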
You’ll often see data lakes as part of those infamous infrastructure diagrams that nobody (myself included) understands. So it’s worth understanding why companies use them, and where they fit into the modern data stack.