Technically
What does Databricks do?

Databricks sells a data science and analytics platform built on top of an open source package called Apache Spark.

Last updated Mar 23, 2026 · analytics
Justin Gage

The TL;DR

Databricks sells a data science and analytics platform – i.e. a place to query and share data – built on top of an open source package called Apache Spark. 

  • Apache Spark is an open source engine for running analytics and machine learning across distributed, giant datasets
  • Spark is notoriously hard to run on your own infrastructure and companies often don’t have the expertise to do that
  • Databricks provides a managed service for running Spark clusters, as well as notebooks for visualization and exploration, plus the ability to schedule pipelines
  • More recently, Databricks has been expanding the product portfolio to include ML and data warehousing

Databricks is one of the largest private companies on the planet - $62B was their most recent valuation.

Terms Mentioned

  • Open Source
  • Server
  • Cloud
  • Framework
  • Infrastructure
  • Production
  • Backend
  • API
  • Data lake
  • Analytics
  • Data warehouse
  • Deploy
  • Machine Learning
  • Query

Companies Mentioned

  • Databricks (PRIVATE)
  • AWS (AMZN)

The Databricks core product: managed Spark

Let’s start with Spark. Apache Spark is a tool for running distributed data pipelines (think: query this, move this to that place). As teams started storing more data than ever before, it stopped making sense to put all of it on a single server. So Spark spreads the data and the compute across multiple servers, which lets big queries and pipelines run in parallel instead of waiting on one machine.

But distributed systems are very, very complicated (and not just in the data realm). This isn’t the kind of thing that your typical software engineer is going to be comfortable configuring and setting up from scratch. So setting up a Spark “cluster” (a group of servers) is pretty difficult.

And that’s where Databricks comes in. They provide a fully managed Spark environment so you can focus on writing queries and pipelines instead of managing infrastructure. You also get a notebook-like interface to write Spark jobs (like the Python snippet later in this post) and make nice graphs.


Apache Spark, the OG

Since Databricks is built on top of this open source “Spark” thing, understanding Databricks means understanding Spark. So what’s Apache Spark exactly?

Spark is a tool for running distributed data pipelines (think: query this, move this to that place). As teams started storing more data than ever before, it stopped making sense to put all of it on a single server:

  • Storage: if you’ve got 1 petabyte (one million gigabytes), you’d need to get a server with that much storage, which literally doesn’t exist. Plenty of teams are working with more data than that.
  • Speed: writing queries, running pipelines, and building models would be very, very slow if all of your data sat on one machine and had to be processed there.
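To make the speed point a bit more concrete, here’s a minimal sketch of the split-the-work idea in plain Python. It is not Spark, and the names (`count_over_21`, `partitions`) are made up for illustration. Threads on one machine stand in for separate servers, so don’t expect a real speedup here; the point is the shape: each worker handles its own slice, and the partial answers get combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def count_over_21(partition):
    """Count the rows in one slice of the data where age > 21."""
    return sum(1 for row in partition if row["age"] > 21)

# Hypothetical dataset: 100 rows with ages 0 through 99.
rows = [{"age": age} for age in range(100)]

# Split the rows into 4 slices, as if each one lived on a different server.
partitions = [rows[i::4] for i in range(4)]

# Each worker counts its own slice independently...
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(count_over_21, partitions))

# ...and the partial results are combined into one answer.
total = sum(partial_counts)
print(total)  # 78 (ages 22 through 99)
```

A real engine also has to handle things this sketch ignores, like a worker dying halfway through or slices being wildly different sizes.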

In a distributed system, data gets stored on different servers (some pieces here, some there) that stay in sync with each other. When you query that data, your query engine figures out where the data you need is and fetches it from there. One of the first such systems was Hadoop, along with its file system HDFS, which you’ve probably heard of.
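Here’s a toy version of “some pieces here, some there” in plain Python, where each “server” is just a dictionary. Everything in it (`NUM_SERVERS`, `server_for`, `put`, `get`) is a hypothetical name for this sketch. Hashing the key to pick a server is only one simple way to decide where data lives; systems like HDFS instead split files into blocks and keep track of where each block is stored.

```python
import zlib

NUM_SERVERS = 3  # hypothetical cluster size
servers = [dict() for _ in range(NUM_SERVERS)]  # each dict plays one server

def server_for(key: str) -> int:
    """Pick a server by hashing the key, so reads and writes always agree."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SERVERS

def put(key: str, value: dict) -> None:
    """Store the value on whichever server the key maps to."""
    servers[server_for(key)][key] = value

def get(key: str) -> dict:
    """Work out which server holds the key, then fetch it from there."""
    return servers[server_for(key)][key]

put("user:1", {"name": "Ada", "age": 36})
put("user:2", {"name": "Linus", "age": 19})
put("user:3", {"name": "Grace", "age": 45})

print(get("user:2"))              # {'name': 'Linus', 'age': 19}
print([len(s) for s in servers])  # how many records landed on each server
```

The caller never has to know which server a record is on, which is the whole appeal.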

Spark exists in this universe, but at a higher level of abstraction - it provides APIs for running distributed “jobs” like queries or pipelines. To get concrete, here’s something you might write in Spark (from their homepage):

df = spark.read.json("logs.json") 
df.where("age > 21").select("name.first").show()

This bit of Python code reads a JSON log file, filters it for people with an age over 21, and shows the “name.first” column. And while this might seem simple, Spark is taking care of a lot of complexity on the backend around distributed queries. It’s also very popular (almost 30K GitHub stars) and widely adopted among data science teams (we used it at DigitalOcean).
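If you want to see what that query is doing without any Spark involved, here’s a rough single-machine equivalent in plain Python. The sample records are invented for illustration (the post doesn’t show what’s in `logs.json`), and they’re written one JSON object per line because that’s the layout Spark’s JSON reader expects by default.

```python
import json

# Hypothetical contents of logs.json: one JSON object per line.
raw_lines = [
    '{"name": {"first": "Ada", "last": "Lovelace"}, "age": 36}',
    '{"name": {"first": "Linus", "last": "Torvalds"}, "age": 19}',
    '{"name": {"first": "Grace", "last": "Hopper"}, "age": 45}',
]

# Roughly what spark.read.json(...) does: parse each line into a record.
records = [json.loads(line) for line in raw_lines]

# Roughly .where("age > 21"): keep only the matching records.
adults = [record for record in records if record["age"] > 21]

# Roughly .select("name.first"): pull out the nested field.
first_names = [record["name"]["first"] for record in adults]

# Roughly .show(): display the result.
print(first_names)  # ['Ada', 'Grace']
```

This works fine for three records in memory. The Spark version reads the same way, but it’s built to keep working when the data is spread across many servers.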


As covered above, distributed systems are complicated enough that setting up a Spark cluster from scratch is out of reach for most teams. And that’s where Databricks comes in.

🔍 Deeper Look

One thing to note is that while Spark itself is a distributed system, it can be used to query data that’s not distributed. In that case, the value of Spark is as a distributed query engine: a place to query and analyze your already-stored data.

The core Databricks product

Surprisingly, nestled deep within a clandestine FAQ section on their site, Databricks does a half-decent job of explaining what the core product does:

In this post

  • The core Databricks product
  • 1) A fully managed Spark cluster
  • 2) An interactive workspace for exploration and visualization
  • 3) A production pipeline scheduler
  • 4) A platform for powering your favorite Spark-based applications
© 2026 Technically.