The Databricks core product: managed Spark
Let’s start with Spark. Apache Spark is a tool for running distributed data pipelines (think: query this, move this to that place). As teams started storing more data than ever before, it stopped making sense to put all of it on a single server, so Spark distributes that data and compute across multiple servers, making everything faster and more efficient.
But distributed systems are very, very complicated (and not just in the data realm). This isn’t the kind of thing that your typical software engineer is going to be comfortable configuring and setting up from scratch, so standing up a Spark “cluster” (a group of servers) is pretty difficult.
And that’s where Databricks comes in. They provide a fully managed Spark environment so you can focus on writing queries and pipelines instead of managing infrastructure. You also get a notebook-like interface to write Spark jobs (like the Python code below) and make nice graphs.
Apache Spark, the OG
Since Databricks is built on top of this open source “Spark” thing, understanding Databricks means understanding Spark. So what’s Apache Spark exactly?
Spark, as we said, is a tool for running distributed data pipelines. It exists because, as teams started storing more data than ever before, it stopped making sense to put all of it on a single server, for two reasons:
- Storage: if you’ve got 1 petabyte of data (one million gigabytes), you’d need a server with that much storage, which literally doesn’t exist. And plenty of teams are working with more data than that.
- Speed: writing queries, running pipelines, and building models would be very, very slow if all of your data lived in one place.
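To make the speed point concrete, here’s some back-of-envelope math, using assumed, illustrative numbers (the scan rate is a guess, not a figure from this article):

```python
# Rough sketch: how long does it take to scan 1 petabyte?
PETABYTE = 10**15          # bytes
SCAN_RATE = 500 * 10**6    # bytes/sec one server's disks might sustain (assumed)

one_server_secs = PETABYTE / SCAN_RATE   # scanning everything on one box
cluster_secs = one_server_secs / 1000    # the same scan spread over 1,000 servers

print(one_server_secs / 86400, "days on one server")   # roughly 23 days
print(cluster_secs / 60, "minutes on 1,000 servers")   # roughly 33 minutes
```

The exact numbers don’t matter; the point is that spreading the work across many servers turns a multi-day scan into a coffee-break one.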
In a distributed system, data gets stored on different servers (some pieces here, some there) that stay in sync with each other. When you query that data, your query engine figures out where the data you need lives and fetches it from there. One of the first popular storage and query stacks here was Hadoop, with its HDFS file system, which you’ve probably heard of.
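The idea can be sketched in a few lines of plain Python. This is a toy caricature (the server names, partitioning scheme, and data are all made up, and real systems like HDFS do vastly more), but it shows the core move: the engine knows where each piece of data lives and only fetches what the query needs.

```python
# Pretend each dict entry is a separate server holding one partition of a
# table, partitioned by year. (All names and data here are hypothetical.)
servers = {
    "server-a": [{"year": 2020, "user": "ana"}, {"year": 2020, "user": "bo"}],
    "server-b": [{"year": 2021, "user": "cy"}],
    "server-c": [{"year": 2022, "user": "di"}],
}

# The query engine keeps metadata about where each year's data lives.
partition_map = {2020: "server-a", 2021: "server-b", 2022: "server-c"}

def query(year):
    # Route the query to the one server holding the relevant partition,
    # instead of scanning every server.
    server = partition_map[year]
    return [row["user"] for row in servers[server] if row["year"] == year]

print(query(2020))  # ['ana', 'bo']
```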
Spark exists in this universe, but at a higher level of abstraction - it provides APIs for running distributed “jobs” like queries or pipelines. To get concrete, here’s something you might write in Spark (from their homepage):
df = spark.read.json("logs.json")
df.where("age > 21").select("name.first").show()
This bit of Python code reads some log files, filters them for people with an age over 21, and shows the “name.first” column. And while this might seem simple, Spark is taking care of a lot of complexity on the backend around distributed queries. It’s also very popular (almost 30K GitHub stars) and widely adopted among data science teams (we used it at DigitalOcean).
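What’s Spark doing under the hood? A plain-Python caricature of that job (with hypothetical data, and glossing over everything Spark actually handles, like scheduling, shuffles, and fault tolerance): run the same filter on every chunk of the data in parallel, then merge the results.

```python
from concurrent.futures import ThreadPoolExecutor

# Pretend these are chunks of logs.json living on different machines.
partitions = [
    [{"name": {"first": "Ana"}, "age": 34}, {"name": {"first": "Bo"}, "age": 19}],
    [{"name": {"first": "Cy"}, "age": 25}],
]

def run_on_partition(rows):
    # The equivalent of .where("age > 21").select("name.first") on one chunk.
    return [r["name"]["first"] for r in rows if r["age"] > 21]

# Run the same work on every partition "at once", then flatten the results.
with ThreadPoolExecutor() as pool:
    results = [name for chunk in pool.map(run_on_partition, partitions)
               for name in chunk]

print(results)  # ['Ana', 'Cy']
```

When you write those two lines of PySpark, this fan-out-and-merge pattern (across real machines, not threads) is what you’re getting for free.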
One thing to note is that while Spark itself is a distributed system, it can also be used to query data that’s not distributed. In that case, the value of Spark is as a distributed query engine: a place to query and analyze your already-stored data.