What does Databricks do?
Databricks sells a data science and analytics platform built on top of an open source package called Apache Spark.
Last updated: March 3, 2025
The TL;DR
Databricks sells a data science and analytics platform – i.e. a place to query and share data – built on top of an open source package called Apache Spark.
- Apache Spark is an open source engine for running analytics and machine learning across distributed, giant datasets
- Spark is notoriously hard to run on your own infrastructure and companies often don’t have the expertise to do that
- Databricks provides a managed service for running Spark clusters, as well as notebooks for visualization and exploration, plus the ability to schedule pipelines
- More recently, Databricks has been expanding the product portfolio to include ML and data warehousing
This is a pretty big company, all things considered: $28B was their most recent valuation, making it one of the most valuable private companies on the planet. And they’re planning on going public in 2021.
The Databricks core product: managed Spark
Let’s start with Spark. Apache Spark is a tool for running distributed data pipelines (think: query this, move this to that place). As teams started storing more data than ever before, it stopped making sense to put all of it on a single server – so Spark splits the data and the compute across multiple servers, which lets each server work on its own slice at the same time.
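To make the "split the work across servers" idea concrete, here is a minimal sketch in plain Python. It is not Spark and does not use Spark's API: the function names and the made-up `order_amounts` data are assumptions for illustration, and worker threads stand in for the separate servers a real cluster would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Deal rows out round-robin so each pretend server holds one slice."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def partial_total(partition):
    """The work one pretend server does: sum only the rows it holds."""
    return sum(partition)


def distributed_total(rows, num_partitions=4):
    partitions = split_into_partitions(rows, num_partitions)
    # Each worker thread stands in for a separate server in a cluster.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_total, partitions))
    # Combine the small per-partition answers into the final result.
    return sum(partials)


# Hypothetical data: order amounts 1, 2, ..., 100.
order_amounts = list(range(1, 101))
total = distributed_total(order_amounts)
print(total)  # 5050
```

The shape is the important part: cut the data into partitions, do the same work on each partition independently, then combine the small results. An engine like Spark follows that general pattern across real machines, and also has to deal with moving data over the network and recovering from failed servers, which is where the difficulty comes from.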
But distributed systems are very, very complicated (and not just in the data realm). This isn’t the kind of thing that your typical software engineer is going to be comfortable configuring and setting up from scratch. So setting up a Spark “cluster” (a group of servers) is pretty difficult.
And that’s where Databricks comes in. They provide a fully managed Spark environment so you can focus on writing queries and pipelines instead of managing infrastructure. You also get a notebook-like interface to write Spark jobs (in Python, for example) and make nice graphs.
Apache Spark, the OG
Since Databricks is built on top of this open source “Spark” thing, understanding Databricks means understanding Spark. So what’s Apache Spark exactly?
Spark is a tool for running distributed data pipelines (think:...