Databricks Compute Comparison: Classic Jobs vs Serverless Jobs vs SQL Warehouses
Databricks is a rapidly evolving platform with several compute options available to users, leaving many with a difficult choice. In this blog post, we look at three popular options for scheduled jobs using Databricks’ own TPC-DI benchmark suite.
By the way, kudos to the Databricks team for creating such a fantastic test package. We highly encourage anybody to use it for their own internal testing. To reiterate, these workloads were not created by us, nor modified in any way. We’re big compute-efficiency nerds here at Sync, so we appreciate all contributions in this space.
The goal of this blog post is to help readers understand the pros, cons, and performance tradeoffs of the various Databricks compute options, so they can make the best choice for their workloads.
The experiment
At a high level, the TPC-DI benchmark is an industry-standard benchmark for data integration, and it mimics many real-world workloads that users run in their jobs and workflows. Below is a screenshot of the DAG that is created for the entire benchmark:
One of the main knobs to tune for the TPC-DI benchmark is the “Scale Factor” (SF), which changes the size of the data that is processed. The table below translates the SF to actual raw data size for comparison.
We focused on 3 scale factors:
| Scale Factor | Total Raw Data |
| --- | --- |
| 100 | 9.66 GB |
| 1000 | 97 GB |
| 5000 | 485 GB |
In this study, we’ll compare running TPC-DI on 3 different Databricks compute products:
- Classic Jobs Compute – User-definable clusters. This is by far the most flexible infrastructure for users, although not all users want to deal with the number of knobs available (a minimal cluster-spec sketch follows this list). The infrastructure still runs inside the customer’s cloud environment.
- Serverless Jobs Compute – Databricks-managed clusters (also their newest product), where no knobs are available. It’s extremely easy to use, but has a good number of restrictions at the moment. The infrastructure runs inside Databricks’ cloud environment.
- DBSQL Serverless – User-selectable warehouse sizes (e.g. Small, Medium, Large), but that’s about it. The infrastructure runs inside Databricks’ cloud environment.
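For illustration, here’s a minimal sketch of what a classic job cluster definition might look like in a Databricks Jobs API payload. The runtime version, instance type, and worker count are placeholders, not the exact settings used in our benchmark runs; the point is simply how many knobs classic compute exposes compared to the serverless options.

```python
# Illustrative only: a classic (jobs) cluster is fully user-defined.
# The runtime version, instance type, and worker count below are placeholders,
# not the exact configuration used in this benchmark.
classic_job_cluster = {
    "new_cluster": {
        "spark_version": "14.3.x-scala2.12",              # example Databricks Runtime
        "node_type_id": "i3.2xlarge",                     # example AWS instance type
        "num_workers": 8,                                 # the main knob we tuned
        "aws_attributes": {"availability": "ON_DEMAND"},  # on-demand, as in our tests
    }
}
# With Serverless Jobs Compute there is nothing to configure at all, and with
# DBSQL Serverless you only pick a warehouse t-shirt size (Small, Medium, Large, ...).
```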
For those who want to dive into the details, here’s the full table with all of the experimental settings and raw numbers:
At a high level, these were some of the choices we made regarding the settings:
1) Classic Jobs Compute – We tuned only the number of workers here and used the recommended instance types (a sketch of how such a worker-count sweep could be submitted follows this list). When tuning the cluster size we allowed a maximum runtime about 40% longer than on the other compute platforms, which is consistent with the tradeoffs we see users accept. In addition, we used only on-demand clusters.
2) Serverless Jobs Compute – We didn’t tune or change anything here, since there’s nothing to change!
3) DBSQL Serverless – We used the recommended warehouse sizes from the benchmark notebook.
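To make (1) concrete, here’s a rough sketch of how a worker-count sweep could be submitted as one-off runs through the Databricks Jobs API (the `/api/2.1/jobs/runs/submit` endpoint). The workspace URL, token, and notebook path are placeholders, and the TPC-DI package ships with its own setup and runner notebooks, so treat this as an illustration of the tuning loop rather than the exact harness we used.

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                        # placeholder credential

def submit_benchmark_run(num_workers: int) -> int:
    """Submit a one-off benchmark run on a classic cluster with the given worker count."""
    payload = {
        "run_name": f"tpcdi-sweep-{num_workers}-workers",
        "tasks": [{
            "task_key": "tpcdi",
            "notebook_task": {"notebook_path": "/Benchmarks/tpcdi_runner"},  # placeholder path
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",   # example runtime version
                "node_type_id": "i3.2xlarge",          # example instance type
                "num_workers": num_workers,
            },
        }],
    }
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/runs/submit",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["run_id"]

# Try a handful of cluster sizes, then compare cost and runtime once the runs finish.
for workers in (4, 8, 16):
    print(submit_benchmark_run(workers))
```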
The results
You can see the cost graph across all 9 experiments below. The cost represents the combined DBU and cloud (AWS, in this case) costs. Several observations can be made:
- Classic compute was the cheapest – Using an optimized classic cluster was by far the cheapest option, which makes sense since it’s the choice with the most knobs and allows users to fine-tune their compute to their needs. One note: since we used on-demand clusters, costs could likely be reduced further with Spot instances.
- DBSQL was much cheaper than Jobs Serverless – To our surprise, DBSQL was consistently almost 2x cheaper than Jobs Serverless!
- Jobs Serverless was the most expensive – The cost of Jobs Serverless was about 5x that of classic compute and about 2x that of DBSQL.
For completeness, here are the corresponding runtimes for all 9 experiments. As mentioned above, we allowed the classic cluster to run slightly longer if it meant a lower cost, and we tried to keep the increases in runtime “reasonable.”
Overall, DBSQL and Jobs Serverless were about the same in terms of runtime.
Discussion
How do I find the optimal Classic cluster?
The big tradeoff with classic clusters is that you have to know what to pick. This is the dark art of Spark tuning, usually left to experienced engineers. In practice, people run a bit of trial-and-error analysis with a few different configurations to get a rough idea of an optimal cluster size (a simple version of that comparison is sketched below). They then typically revisit the configuration once a quarter at best, while the data pipelines themselves change much more rapidly.
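As a rough illustration of how that trial-and-error loop usually ends, here’s a tiny sketch that picks the cheapest configuration that still meets a runtime requirement. The runtimes, costs, and SLA below are invented for illustration; they are not our benchmark measurements.

```python
# Hypothetical trial-run results per worker count: runtime in hours, total cost in USD.
# These numbers are made up for illustration only.
trials = {
    4:  {"runtime_hr": 3.1, "cost_usd": 41.0},
    8:  {"runtime_hr": 1.9, "cost_usd": 48.0},
    16: {"runtime_hr": 1.2, "cost_usd": 62.0},
}

SLA_HOURS = 4.0  # only consider configurations that meet the runtime requirement

feasible = {w: t for w, t in trials.items() if t["runtime_hr"] <= SLA_HOURS}
best = min(feasible, key=lambda w: feasible[w]["cost_usd"])
print(f"Cheapest cluster meeting the SLA: {best} workers "
      f"(${feasible[best]['cost_usd']:.2f}, {feasible[best]['runtime_hr']:.1f} h)")
```

The catch is that this comparison goes stale as soon as the data or the pipeline changes.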
At Sync, we built a tool that automatically finds the best cluster for each of your jobs at scale. We call it Gradient. If you’re running Databricks jobs at scale and just can’t manually tune all of those clusters to lower costs, feel free to check us out to see if Gradient can help improve your Databricks efficiency automatically.
Why were DBSQL warehouse costs so much lower than Serverless Jobs costs?
Their runtimes were about the same, so for the end costs to differ by nearly 2x, the DBU consumption rates of the two must be quite different.
In fact, the DBU rate of a Large SQL warehouse is 40 DBUs/hr. Since the runtimes were about the same but the end costs were almost 2x, we can roughly estimate that the Jobs Serverless DBU rate is about 80 DBUs/hr.
This is a pretty surprising finding (it is not the official DBU rate of Jobs Serverless, and it is based only on our loose back-of-the-envelope calculation, written out below). However, it does highlight the discrepancy between these products and how the wrong choice can lead to exorbitant costs.
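For the curious, here’s that back-of-the-envelope arithmetic written out. It assumes the two serverless products bill at roughly comparable dollars per DBU, which is a simplification, and it uses the rounded ratios from above rather than exact measurements.

```python
# Back-of-the-envelope estimate of the Jobs Serverless DBU consumption rate.
# Assumes roughly comparable $/DBU across the two serverless SKUs (a simplification)
# and uses the rounded ~2x cost ratio observed at roughly equal runtimes.
dbsql_dbu_rate = 40.0   # DBUs/hr for a Large SQL warehouse
cost_ratio = 2.0        # Jobs Serverless cost was about 2x the DBSQL cost at the same runtime

implied_jobs_serverless_rate = dbsql_dbu_rate * cost_ratio
print(f"Implied Jobs Serverless consumption: ~{implied_jobs_serverless_rate:.0f} DBUs/hr")
```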
How did the query level performance change?
It’s interesting to break open the timeline view of the queries to see how the different compute platforms behave. Individual queries ran for noticeably different lengths of time on the different platforms.
One big difference is that the “ingest_customermgmt” task actually ran on a classic cluster for all three platforms (probably due to some limitation of serverless). That requires spinning up a classic cluster, which delayed the two serverless options. For the classic-compute run, this step was very fast, since the entire workload was already running on classic compute.
The other observation concerns the distribution of the “long tasks” and whether or not they have dependencies. A long-running task that runs in parallel with all the other tasks is relatively harmless. However, if one task is the core dependency for all other tasks (such as the “ingest_customermgmt” task), then a single long-running task in your DAG can really hurt your end-to-end performance (see the critical-path sketch below).
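To see why that matters, here’s a small critical-path sketch on a toy DAG. Aside from “ingest_customermgmt”, the task names and durations are invented; the point is that the end-to-end runtime is set by the longest dependency chain, not by the longest individual task.

```python
from functools import lru_cache

# Toy DAG: task -> (duration in minutes, upstream dependencies).
# Durations are invented for illustration; only the shape of the DAG matters.
dag = {
    "ingest_customermgmt": (12.0, []),                  # everything downstream depends on this
    "ingest_trades":       (3.0,  ["ingest_customermgmt"]),
    "ingest_accounts":     (4.0,  ["ingest_customermgmt"]),
    "silver_joins":        (6.0,  ["ingest_trades", "ingest_accounts"]),
    "gold_reports":        (2.0,  ["silver_joins"]),
    "standalone_audit":    (15.0, []),                  # long, but off the critical path
}

@lru_cache(maxsize=None)
def finish_time(task: str) -> float:
    """Earliest finish time of a task: its duration plus the slowest upstream finish."""
    duration, deps = dag[task]
    return duration + max((finish_time(d) for d in deps), default=0.0)

total = max(finish_time(t) for t in dag)
print(f"End-to-end runtime (critical path): {total} minutes")
# The 15-minute standalone task does not set the runtime; the chain through
# ingest_customermgmt does, so slowing that single task slows the entire DAG.
```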
(Timeline views of the benchmark queries: Classic Jobs, Serverless Jobs, and DBSQL.)
Conclusion
Hopefully this detailed study of the TPC-DI benchmark on Classic Jobs vs Serverless Jobs vs SQL Warehouses helps communicate how much your cost and performance on Databricks depend on which compute platform you select. At a high level, the main factors are:
- The size of your compute resource (e.g. number of workers, warehouse size)
- The DBU rate of your compute resource (DBUs/hr)
- Workload-specific features (e.g. DAG dependencies)
- Engineering time (e.g. salary, opportunity cost)
- Runtime requirements (e.g. SLA goals)
If you’re at a company where your core backend is scaled on Databricks, then this level of optimization may be a critical step to help lower your total cost of goods sold (COGS).
However, if you’re at a company that is just exploring Databricks and costs aren’t a concern, then we recommend just sticking with serverless compute for the convenience.
If you’re interested in understanding what’s right for your company’s Databricks usage, feel free to book a time that works for you here! We’d love to chat.