Forma.ai is a B2B SaaS startup based in Toronto, Canada building an AI powered sales compensation system for enterprise. Specifically, they seamlessly unify the design, execution, and orchestration of sales compensation to better mobilize sales teams and optimize go-to-market performance.
Behind the scenes, Forma.ai deploys their pipelines on Databricks to process sales compensation pipelines for their customers. They process hundreds of terabytes of data per month across Databricks Jobs clusters and ad-hoc all-purpose compute clusters.
As their customer count grows, so will their data processing volumes. The cost and performance of their Databricks jobs directly impacts their cost of goods (COGs) and thus their bottom line. As a result, the efficiency of their jobs is of the utmost importance today and for their future sustainable growth.
What is their problem with Databricks?
Forma.ai came to Sync with one fundamental problem – how can they optimize their processing costs with minimal time investment? Thanks to their customer growth, their Databricks usage and costs were only increasing. They were looking for a scalable solution to help keep their clusters optimized without high overhead on the DevOps and Development teams.
Previously they had put some work into trying to optimize their jobs clusters, such as moving to different instance types for the most expensive pipelines. These pipelines and their clusters are updated frequently however, and manually reviewing configuration of every cluster regularly is simply not cost or time effective.
How Gradient Helps
Gradient provided the solution they were looking for – a way to achieve optimal clusters without the need to manually tune – freeing up their engineers to focus on building new features and accelerate development.
Furthermore, the configurations that Gradient does make are fully exposed to their engineers, so their team can actually learn and see what configurations actually matter and what the impact is. Enriching their engineers and leveling up their own Databricks experience.
Initial Results with Gradient
For a first test, Forma onboarded a real job they run in production with Gradient, enabled ‘auto-apply’ and then let Gradient control their cluster for each recurring run. After a couple cycles of learning and optimizing, the first results are shown below: an 18% cost savings and a 19% speedup without lifting a finger.
So what did Gradient do actually?
In this first initial result, the change that had the most impact was tuning the cluster’s EBS settings (AWS only). These settings are often overlooked in favor of CPU and Memory settings.
A table of the specific parameters before and after Gradient is shown below:
The initial settings reflect the typical settings Databricks provides, and is what most people use. The automatic EBS settings depend on the size of the instance chosen, with bigger instances getting more baseline storage according to AWS’s best practices. While these baseline settings are sufficient for running applications, they are often suboptimal.
We can see low level settings like IOPS and throughput are usually not set. In fact, they aren’t even available in the cluster creation Databricks console. You have to adjust these specific settings in the cluster JSON or with the Jobs API.
If you’d like to try out Gradient for your workloads, checkout the resources below:
In the self-written blog post below, a Sr. Data Engineer chronicles his experience with the Spark Gradient for EMR. In the blog post we helped accelerate a job from 90 to 24 minutes, which was amazing to see!
The first job I put into the gradient went from processing in around 90 minutes to 25 minutes after I changed the configurations, only using a slightly larger cluster. However, that time save makes up for using more nodes, so it definitely worked to our advantage.
Extrapolated over a full year, our anticipated savings to his company was over $100K on AWS! Obviously this doesn’t include the extra time savings of removing the current manual guesswork of provisioning clusters.
Databricks’ SQL warehouse products are a compelling offering for companies looking to streamline their production SQL queries. However, as usage scales up, the cost and performance of these systems become crucial to analyze.
In this blog we take a technical deep dive into the cost and performance of their serverless SQL warehouse product by utilizing the industry standard TPC-DI benchmark. We hope data engineers and data platform managers can use the data presented here to make better decisions when it comes to their data infrastructure choices.
What are Databricks’ SQL warehouse offerings?
Before we dive into a specific product, let’s take a step back and look at the different options available today. Databricks currently offers 3 different warehouse options:
SQL Pro – Improved performance and good for exploratory data science, runs inside customer’s cloud environment
SQL Serverless – “Best” performance, and the compute is fully managed by Databricks.
From a cost perspective, both classic and pro run inside the user’s cloud environment. What this means is you will get 2 bills for your databricks usage – one is your pure Databricks cost (DBU’s) and the other is from your cloud provider (e.g. AWS EC2 bill).
To really understand the cost comparison, let’s just look at an example cost breakdown of running on a Small warehouse based on their reported instance types:
In the table above, we look at the cost comparison of on-demand vs. spot costs as well. You can see from the table that the Serverless option has no cloud component, because it’s all managed by Databricks.
Serverless could be cost effective compared to pro, if you were using all on-demand instances. But if there are cheap spot nodes available, then Pro may be cheaper. Overall, the pricing for serverless is pretty reasonable in my opinion since it also includes the cloud costs.
We also included the equivalent jobs compute cluster, which is the cheapest option across the board. If cost is a concern to you, you can run SQL queries in jobs compute as well!
Pros and cons of Serverless
The Databricks serverless option is a fully managed compute platform. This is pretty much identical to how Snowflake runs, where all of the compute details are hidden from users. At a high level there are pros and cons to this:
You don’t have to think about instances or configurations
Spin up time is much less than starting up a cluster from scratch (5-10 seconds from our observations)
Enterprises may have security issues with all of the compute running inside of Databricks
Enterprises may not be able to leverage their cloud contracts which may have special discounts on specific instances
No ability to optimize the cluster, so you don’t know if the instances and configurations picked by Databricks are actually good for your job
The compute is a black box – users have no idea what is going on or what changes Databricks is implementing underneath the hood.
Because of the inherent black box nature of serverless, we were curious to explore the various tunable parameters people do still have and their impact on performance. So let’s drive into what we explored:
We tried to take a “practical” approach to this study, and simulate what a real company might do when they want to run a SQL warehouse. Since DBT is such a popular tool in the modern data stack, we decided to look at 2 parameters to sweep and evaluate:
The reason why we picked these two is they are both “universal” tuning parameters for any workload, and they both impact the compute side of the job. DBT threads in particular effectively tune the parallelism of your job as it runs through your DAG.
The workload we selected is the popular TPC-DI benchmark, with a scale factor of 1000. This workload in particular is interesting because it’s actually an entire pipeline which mimics more real-world data workloads. For example, a screenshot of our DBT DAG is below, as you can see it’s quite complicated and changing the number of DBT threads could have an impact here.
As a side note, Databricks has a fantastic open source repo that will help quickly set up the TPC-DI benchmark within Databricks entirely. (We did not use this since we are running with DBT).
To get into the weeds of how we ran the experiment, we used Databricks Workflows with a Task Type of dbt as the “runner” for the dbt CLI, and all the jobs were executed concurrently; there should be no variance due to unknown environmental conditions on the Databricks side.
Each job spun up a new SQL warehouse and tore it down afterwards, and ran in unique schemas in the same Unity Catalog. We used the Elementary dbt package to collect the execution results and ran a Python notebook at the end of each run to collect those metrics into a centralized schema.
Costs were extracted via Databricks System Tables, specifically those for Billable Usage.
Below are the cost and runtime vs. warehouse size graphs. We can see below that the runtime stops scaling when you get the medium sized warehouses. Anything larger than a medium pretty much had no impact on runtime (or perhaps were worse). This is a typical scaling trend which shows that scaling cluster size is not infinite, they always have some point at which adding more compute provides diminishing returns.
For the CS enthusiasts out there, this is just the fundamental CS principal – Amdahls Law.
One unusual observation is that the medium warehouse outperformed the next 3 sizes up (large to 2xlarge). We repeated this particular data point a few times, and obtained consistent results so it is not a strange fluke. Because of the black box nature of serverless, we unfortunately don’t know what’s going on under the hood and are unable to give an explanation.
Runtime in Minutes across Warehouse Sizes.
Because scaling stops at medium, we can see in the cost graph below that the costs start to skyrocket after the medium warehouse size, because well basically you’re throwing more expensive machines while the runtime remains constant. So, you’re paying for extra horsepower with zero benefit.
Costs across Warehouse Sizes.
The graph below shows the relative change in runtime as we change the number of threads and warehouse size. For values greater than the zero horizontal line, the runtime increased (a bad thing).
The data here is a bit noisy, but there are some interesting insights based on the size of the warehouse:
2x-small – Increasing the number of threads usually made the job run longer.
X-small to large – Increasing the number of threads usually helped make the job run about 10% faster, although the gains were pretty flat so continuing to increase thread count had no value.
2x-large – There was an actual optimal number of threads, which was 24, as seen in the clear parabolic line
3x-large – had a very unusual spike in runtime with a thread count of 8, why? No clue.
The Percent Change in Runtime as Threads Increase.
To put everything together into one comprehensive plot, we can see the plot below which plots the cost vs. duration of the total job. The different colors represent the different warehouse sizes, and the size of the bubbles are the number of DBT threads.
Cost vs duration of the jobs. Size of the bubbles represents the number of threads. Image by author
In the plot above we see the typical trend that larger warehouses typically lead to shorter durations but higher costs. However, we do spot a few unusual points:
Medium is the best – From a pure cost and runtime perspective, medium is the best warehouse to choose
Impact of DBT threads – For the smaller warehouses, changing the number of threads appeared to have changed the duration by about +/- 10%, but not the cost much. For larger warehouses, the number of threads impacted both cost and runtime quite significantly.
In summary, our top 5 lessons learned about Databricks SQL serverless + DBT products are:
Rules of thumbs are bad – We cannot simply rely on “rules of thumb” about warehouse size or the number of dbt threads. Some expected trends do exist, but they are not consistent or predictable and it is entirely dependent on your workload and data.
Huge variance – The costs ranged from $5 – $45, and runtimes from 2 minutes to 90 minutes, all due to different combinations of number of threads and warehouse size.
Serverless scaling has limits – Serverless warehouses do not scale infinitely and eventually larger warehouses will cease to provide any speedup and only end up causing increased costs with no benefit.
Medium is great – We found the Medium Serverless SQL Warehouse outperformed many of the larger warehouse sizes on both cost and job duration for the TPC-DI benchmark. We have no clue why.
Jobs clusters may be cheapest – If costs are a concern, switching to just standard jobs compute with notebooks may be substantially cheaper
The results reported here reveal that the performance of black box “serverless” systems can result in some unusual anomalies. Since it’s all behind Databrick’s walls, we have no idea what is happening. Perhaps it’s all running on giant Spark on Kubernetes clusters, maybe they have special deals with Amazon on certain instances? Either way, the unpredictable nature makes controlling cost and performance tricky.
Because each workload is unique across so many dimensions, we can’t rely on “rules of thumb”, or costly experiments that are only true for a workload in its current state. The more chaotic nature of serverless system does beg the question if these systems need a closed loop control system to keep them at bay?
As an introspective note – the business model of serverless is truly compelling. Assuming Databricks is a rational business and does not want to decrease their revenue, and they want to lower their costs, one must ask the question: “Is Databricks incentivized to improve the compute under the hood?:
The problem is this – if they make serverless 2x faster, then all of sudden their revenue from serverless drops by 50% – that’s a very bad day. If they could make it 2x faster, and then increase the DBU costs by 2x to counteract the speedup, then they would remain revenue neutral (this is what they did for Photon actually).
So Databricks is really incentivized to decrease their internal costs while keeping customer runtimes more or less the same. While this is great for Databricks, it’s difficult to pass on any acceleration technology to the user that results in a cost reduction.
Interested in learning more about how to improve your Databricks pipelines? Reach out to Jeff Chou and the rest of the Sync Team.
The Gradient Command Line Interface (CLI) is a powerful yet easy utility to automate the optimization of your Spark jobs from your terminal, command prompt, or automation scripts.
Whether you are a Data Engineer, SysDevOps administrator, or just an Apache Spark enthusiast, knowing how to use the Gradient CLI can be incredibly beneficial as it can dramatically reduce the cost of your Spark workloads and while helping you hit your pipeline SLAs.
If you are new to Gradient, you can learn more about it in the Sync Docs. In this tutorial, we’ll walk you through the Gradient CLI’s installation process and give you some examples of how to get started. This is meant to be a tour of the CLI’s overall capabilities. For an end to end recipe on how to integrate with Gradient take a look at our Quick Start and Integration Guides.
Let’s start by making sure our environment meets all the prerequisites. The Gradient CLI is actually part of the Sync Library, which requires Python v3.7 or above and which only runs on Linux/Unix based systems.
I am on a Mac and running python version 3.10, so I am good to go, but before we get started I am going to create a Python virtual environment with vEnv. This is a good practice for whenever you install any new Python tool, as it allows you to avoid conflicts between projects and makes environment management simpler. For this example, I am creating a virtual environment called gradient-cli that will reside under the ~/VirtualEnvironments path.
python -m venv ~/VirtualEnvironments/gradient-cli
Step 2: Install the Sync Library
Once you’ve confirmed that your system meets the prerequisites, it’s time to install the Sync Library. Start by activating your new virtual environment.
You can confirm that the installation was successful by viewing the CLI executable’s version by using the –version or –help options.
Step 3. Configure the Sync Library
Configuring the CLI with your credentials and preferences is the final step for the installation and setup for the Sync CLI. To do this, run the configure command:
You will be prompted for the following values:
Sync API key ID:
Sync API key secret:
Default prediction preference (performance, balanced, economy) [economy]:
Would you like to configure a Databricks workspace? [y/n]:
Databricks host (prefix with https://):
Databricks AWS region name:
If you remember from the Pre Work, your Sync API key & secret are found on the Accounts tab of the Gradient UI. For this tutorial we are running on Databricks, so you will need to provide a Databricks Workspace and an Access token.
Databricks recommends that you set up a service principal for automation tasks. As noted in their docs, service principals give automated tools and scripts API-only access to Databricks resources, providing greater security than using users or groups.
These values are stored in ~/.sync/config.
Congrats! You are now ready to interact with Gradient from your terminal, command prompt, or automation scripts.
Step 4. Example Uses
Below are some tasks you can complete using the CLI. This is useful when you want to automate Gradient processes and incorporate them into larger workflows.
All Gradient recommendations are stored in Projects. Projects are associated with a single Spark job or a group of jobs running on the same cluster. Here are some useful commands you can use to manage your projects with the CLI. For an exhaustive list of commands use the –help option.
If the run you provided was not already configured with the Gradient agent when it executed, you can still generate a recommendation but the basis metrics may be missing some time sensitive information that may no longer be available. To enable evaluation of prior logs executed without the Gradient agent, you can add the –allow-incomplete-cluster-report option. However, to avoid this issue altogether, you can implement the agent and re-run the job.
Alternatively, you can use the following command to run the job and request a recommendation with a single command:
Similarly, for Spark EMR, you can generate a recommendation for a previously completed job. EMR does not have the same issue with regard to ephemeral cost data not being available, so you can request a recommendation on a previous run without the Gradient agent.
As an independent company, here at Sync, we of course always want to verify AWS’s claims on the performance of Graviton instances. So in this blog post we run several experiments with the TPC-DS benchmark with various driver and worker count configurations on several different instance classes to see for ourselves how these instances stack up.
The goal of the experiment is to see how Graviton instances perform relative to other popular instances that people use. There are of course hundreds of instances types, so we only selected 10 popular instances to make this a feasible study.
As for the workload, we selected the fan favorite benchmark, TPC-DS 1TB, with all 98 queries run in series. This is different compared to what AWS used in their study, which was to look at individual queries within the benchmark. We decided to track the total job runtime of all queries since we’re just looking for the high level “average” performance to see if any interesting trends appear. Results of course may vary query by query, and of course your individual code is a complete wildcard. We make no claim that these results are generally true for all workloads or your specific workloads.
The details of the experimental sweeps are shown below:
Workload: TPC-DS 1TB (queries 1-98 run in series)
EMR Version: 6.2.0
Instances: [r6g, m5dn, c5, i3, m6g, r5, m5d, m5, c6g, r5d] (bold are the Graviton instances)
Cost data: True AWS costs extracted from the cost and usage reports, includes both EC2 and EMR fees
Below is a global view of all of the experiments run showing cost vs. runtime. Each dot represents a different configuration as described by the list above. Points that are in the bottom left hand corner edge are ideal as they are both cheaper and faster.
At a high level, we see that the c6g instances (light green dots) were the lowest cost with comparable runtimes, which was interesting to see. The other two graviton instance (r6g and m6g) skewed lower-left than most of the other instances as well.
One deviation is the c5 instances performed surprisingly well on both the cost and runtime curves. They were quite similar to the best graviton chip, the c6g.
To make the information a bit easier to digest, we take an average of the runtime and cost data to do a clear side by side comparison of the different instances. The salmon colored bars are the Graviton enabled instances.
In the graph below the runtime of Graviton instances were comparable with other instances. The r6g instances were the fastest instances, although not by much – only about 6.5% faster than m6g. The one negative standout was that the i3 instances took around 20% longer runtime than all of the other instances.
More variation is seen in the cost breakdown, where we see that the Graviton instances were typically lower cost than their non-Graviton counterparts, some by a wide margin. What really stole the show were the “c” class instances, where c5 actually was cheaper by about 10% than the m6g and r6g Graviton instances.
The global winner was the c6g instance, which was the absolute cheapest. It’s interesting to see the significant cost difference between the max (i3) and min (c6g), which shows a 70% cost difference!
Based on the data above, it’s interesting to see that the runtime of Graviton instances was comparable to other non-Graviton instances. So, what then was the cause of the huge cost differential? It seems at the end of the day the total job cost generally followed the trends of the list prices of the machines. Let’s look deeper.
The table below shows the list price of the instances and their on-demand list price, in order of lowest to highest cost. We can see the lowest instance cost was the Graviton instance c6g, which corresponds to the study above where the c6g was the lowest cost.
However, there were some exceptions where more expensive instances still had cheaper total job costs:
c5.xlarge – Was the 3rd lowest cost on-demand price, however had the 2nd cheapest overall job cost
R6g.xlarge – Was the 5th lowest cost on-demand price, however had the 3rd cheapest overall job cost
These two exceptions show that the actual list price of the instances doesn’t always guarantee overall total cost trends. Sometimes the hardware is such a great fit for your job that it overcomes the higher cost.
List Price On-Demand
So at the end of the day, do Graviton instances save you money? From this study, I’d say that on average their cost/performance numbers were in fact better than other popular instances. However, as we saw above, it is not always true and, like most things we post – it depends.
If you’re able to explore different instance types, I’d definitely recommend trying out Graviton instances, as they look like a pretty solid bet.
To revisit the claims that AWS had about Graviton instances being 30% cheaper and 15% more performant, based on the data above that is not always true and depends on a lot of cluster parameters.
For example, one thing we’ll note is that in the AWS study, they only used workers with *.2xlarge instances, whereas our study only looked at *.xlarge worker node instances. I also have no idea what Apache Spark configurations they used and if they matched what we did or not.
At the end of the day, everything depends on your workload and what your job is trying to do. There is no one-size-fits-all instance for your jobs. That’s why we built the Apache Spark Gradient to help users easily optimize their Apache Spark configurations and instance types to help hit their cost and runtime needs.