
The Easy and Comprehensive Guide To Understanding Databricks Pricing: How It Works and How To Reduce Your Cost

Databricks is a popular unified analytics platform and a go-to solution for many organizations looking to harness the power of big data. Its collaborative workspaces have become the industry standard for data engineering and data science teams and an ideal environment for building, training and deploying machine learning and AI models at scale.

However, as with any cloud-based service, Databricks' pricing structure is complex and highly product dependent, so understanding it is crucial for budgeting and cost management. In this article, we’ll explain in simple terms everything you need to know about Databricks pricing, including its pay-as-you-go model, the factors that affect your individual pricing, and examples of how to save on cost.

Databricks Pricing FAQ

  1. How is my Databricks cost calculated?
  2. What are the different Databricks pricing plans?
  3. Price for Databricks by workload
  4. Does Databricks offer free trials?
  5. How to save money on Databricks
  6. How do I find the cost of my Databricks?
  7. Additional costs for running Databricks
  8. How does Databricks price compare to Snowflake?

First off, what is the price of Databricks?

The short answer is that it depends. As we’ll explain below, Databricks' price depends on usage, so there is no single answer to what it costs. However, based on the average data use of a medium-sized company, it’s fairly normal to see a midsize company spend somewhere between $100k and $1 million per year.

How is my Databricks cost calculated?

In simple terms, Databricks cost is based on how much data you process, the type of workload you’re executing, and which product you’re using. Each type of compute has a different price per processing unit, known as a Databricks Unit (DBU). To calculate your Databricks cost, you simply multiply the number of DBUs used by the dollar rate per DBU for that workload.

For instance, certain workloads such as Jobs Light Compute or Serverless Real-Time Inference cost $0.07 per DBU. So a job that requires 100 DBUs would cost $7.00.

Keep in mind that complex tasks such as All-Purpose Interactive workloads (typically used for data science or business intelligence) have higher costs of around $0.55 per DBU. This means that it’s not just the amount of data, but also the workload type. Data velocity (how frequently your data pipeline runs) and data complexity (how much work it takes to process your data set) can all add to the number of DBUs needed, and thus raise the cost of your workload. It’s therefore crucial to evaluate your ETL workflow before and during your Databricks subscription to understand if there are areas for optimization.

Outside of the job function itself, prices for your Databricks subscription differ by cloud service provider, pricing plan, and even region (though within the contiguous U.S. these prices are largely the same). Databricks pricing can also differ by the size and type of instance, which refers to the type of virtual machine you are running on the Databricks lakehouse.

In addition to Databricks costs, there are also cloud compute costs. For example, if you run a job on a cluster, you pay for both the Databricks overhead and the cloud compute. The cloud compute costs can often be larger than your Databricks cost, so keep this in mind. As a result, the total cost of Databricks is the sum of two major components:

Total Cost of Ownership = Databricks Cost + Cloud Provider Cost
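
To make the arithmetic concrete, here is a minimal sketch in Python of the calculation described above. The DBU rate, node count, and on-demand instance price are illustrative assumptions, not quotes, so swap in the numbers from your own plan and cloud bill.

def databricks_cost(dbus_used, dbu_rate):
    # Databricks fee: DBUs consumed multiplied by the $/DBU rate for the workload
    return dbus_used * dbu_rate

def total_cost_of_ownership(dbus_used, dbu_rate, node_hours, node_price_per_hour):
    # Databricks fee plus the cloud provider's compute fee for the same cluster
    return databricks_cost(dbus_used, dbu_rate) + node_hours * node_price_per_hour

# Example: a Jobs Compute run that burns 100 DBUs at $0.10/DBU on a 10-node
# cluster for 2 hours, with an assumed $0.77/hour on-demand instance price.
print(databricks_cost(100, 0.10))                        # 10.0 in Databricks fees
print(total_cost_of_ownership(100, 0.10, 10 * 2, 0.77))  # 10.0 + 15.4 = 25.4 total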

What is interesting is that both the Databricks and cloud costs scale with the cluster size. While that makes sense from the cloud provider’s perspective, since they are providing the compute, one may ask:

Why do Databricks costs scale with cluster size when they don’t run my cluster?

In reality, Databricks is a software layer on top of your cloud provider. Whether you run a 1-node cluster or a 1,000-node cluster, the actual cost to Databricks is fixed. While this doesn’t make much sense, that’s the reality of Databricks pricing.

What are the different Databricks pricing plans?

At the moment there are three different types of pricing plans: Standard, Premium, and Enterprise. These plans differ in their features and the types of workloads available, with the Premium plan costing the same as or more than the Standard plan.

Much of the Premium plan’s benefit comes from role-based access control (think assigning admins more capabilities and permissions than users) and higher levels of automation and authentication. There is also access to features like Audit Logs, Credential Passthrough (for Azure Databricks), and IP access lists. Enterprise plans are customized per customer, so they vary based on company size, contract size, and duration of the plan.

For a full list of differences between Standard and Premium pricing, check here.

Price for Databricks by Workload

Below is a breakdown of Databricks cost by workload for the Standard plan, using AWS as the cloud service provider in the Central US region.

Jobs Compute

  • Jobs Light Compute: $0.07 per DBU/hour
  • Jobs Compute: $0.10 per DBU/hour
  • Jobs Compute Photon: $0.10 per DBU/hour

Delta Live Tables 

  • DLT Core Photon: $0.20 per DBU/hour 
  • DLT Pro Photon: $0.25 per DBU/hour
  • DLT Advanced Photon: $0.36 per DBU/hour

All Purpose Compute:

  • All Purpose Compute: $0.40 per DBU/hour
  • All Purpose Compute Photon: $0.40 per DBU/hour

The following workloads are only available with Premium subscriptions, and their prices reflect that.

Serverless and SQL Compute:

  • SQL Classic: $0.22 per DBU/hour
  • SQL Pro: $0.55 per DBU/hour
  • SQL Serverless: $0.70 per DBU/hour
  • Serverless Real-Time Inference: $0.07 per DBU/hour

Does Databricks offer free trials?

Yes, Databricks offers a 14-day free trial with fully usable, interactive notebooks. While the Databricks trial itself is free, you still need to pay for the underlying cloud infrastructure.

If you want to continue to use Databricks for free (but with limited features), you can use the free Databricks Community Edition. This is great for those wanting to learn Apache Spark.

It’s also important to note that because there are no upfront costs and Databricks is priced on a pay-as-you-go model, the cost to get set up is minimal.

How To Save Money On Databricks

The great news about the Databricks pricing model is that because it’s based on usage, there are a number of ways to reduce your cost basis by altering your usage. Some of these ways include: 

  1. Optimize your job clusters. By choosing the right size and type of job cluster, companies can often save huge amounts of money through runtime reductions, without having to make any changes to hardware. For instance, Sync saved Duolingo 55% on their machine learning costs simply through cluster optimization.
  2. Use Spot Instances. This one is for AWS customers specifically: “Spot Instances” are unused computing capacity on Amazon EC2, offered at deep discounts of up to 90%. However, one of the issues with Spot Instances is that machines (or worker nodes) can be removed at any time. This can cause unwanted delays in your job, which can end up increasing its cost. So while Spot Instances will save you money most of the time, if you need reliable runtime and performance more than cost savings, then on-demand instances may be better.
  3. Use Photon. Photon is the next-generation engine on the Databricks Lakehouse Platform that provides massively parallel, extremely fast query performance at lower total cost. This makes it very efficient for highly complex workloads, but it may be overkill for certain simple jobs that are not Photon compatible. If your job is not compatible, you could end up paying 2x the DBU costs for no benefit. So we recommend testing your job with Photon to see whether you get cost savings. Read our blog on this topic to learn more.
  4. Autoscaling. Autoscaling is a Databricks configuration that dynamically tunes the number of workers for your workloads. Activating autoscaling is a simple checkbox in the Databricks UI that many people overlook. However, there is a cost to spinning nodes up and down, where you’re paying for machines that are still warming up and not actually processing data. This makes autoscaling best for ad-hoc notebook usage. For static production Jobs, autoscaling may end up costing more. Read our blog on this topic to learn more.
  5. Optimize your code. Apache Spark is a very rich programming framework, and Databricks has built a lot of optimizations into their platform. For example: Optimize & Z-Order, OptimizeWrite, partitioning, file size tuning, shuffle reduction, the cost-based optimizer, the Adaptive Query Engine, salting, data skipping, Delta Lake optimizations, and data caching. A lot of these techniques are very advanced, but thankfully Databricks has a great resource outlining best practices.
  6. Don’t use Databricks. Databricks is great for many use cases, but it is very expensive. A lot of real-world use cases don’t have data sizes at a scale that really justifies using Databricks. There are alternative Apache Spark services, such as AWS EMR, or free open-source Apache Spark. These options are usually more time intensive to set up, which may mean you need more infrastructure engineers, but their per-minute costs are typically cheaper.

Additionally, there are some subscription parameters you can alter to maximize your savings when it comes to using Databricks. The major ones here include:

  1. Committed-use discounts. Databricks offers big discounts for those who pre-pay for their processing units, in what are known as Databricks Commit Units (DBCUs). Like many things, the more DBCUs you buy, the more you save. For instance, a customer buying $25,000 worth of DBCUs per year could save 6%, while one buying $1.25 million could save as much as 33%. See a full list of pre-purchase discounts for Azure here.
  2. Use a different cloud service provider. There are three cloud service providers for Databricks: AWS, Microsoft Azure, and Google Cloud. In our experience, Azure is the most expensive of these, roughly 1-2x higher per DBU than the other two (this is due to Databricks being a first-party service on Azure with included support from Microsoft).

How Do I Find The Cost Of My Databricks?

Finding the total cost of your Databricks usage can be tricky. Because pricing is based on both Databricks and cloud provider fees, it’s difficult to collect and attribute all of the costs. There are several methods you can use, depending on what you can access at your company:

1. First Find DBUs

You’ll always want to first assess the direct cost of your Databricks usage. To do this you can go to your admin page and look at your data usage to isolate just your DBU costs. You can also use the new “system tables” in Databricks, which break down the DBU costs for your jobs.
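
If system tables are enabled in your workspace, a notebook query along the following lines can surface DBU usage per job. This is a hedged sketch: the table and column names reflect our understanding of the system.billing.usage schema, so verify them against your own workspace before relying on the numbers.

# Sketch: DBUs consumed per job over the last 30 days, from the billing system table.
# Table and column names (system.billing.usage, usage_metadata.job_id, usage_quantity)
# are assumptions about the schema; confirm them in your workspace.
dbus_by_job = spark.sql("""
    SELECT usage_metadata.job_id AS job_id,
           sku_name,
           SUM(usage_quantity)   AS dbus_used
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY usage_metadata.job_id, sku_name
    ORDER BY dbus_used DESC
""")
dbus_by_job.show(truncate=False)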

2. Find Cloud Provider Costs

The good news about cloud provider costs is that they should remain fairly static relative to Databricks costs. To find your cloud provider cost, you should be able to use the tags applied to your Databricks clusters to find the associated costs within your cloud provider account. For example, in AWS you can use Cost Explorer to find the cluster and tags associated with your bill.
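
As a rough illustration, the snippet below pulls tagged EC2 spend with the AWS Cost Explorer API via boto3. The tag key, tag value, and date range are placeholders; match them to the tags actually applied to your clusters (Databricks adds defaults, and you can add custom tags in the cluster configuration).

# Hedged sketch: daily EC2 spend attributed to one Databricks cluster tag.
# The ClusterName tag value and date range below are placeholders to adapt.
import boto3

ce = boto3.client("ce")
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-01-31"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={"Tags": {"Key": "ClusterName", "Values": ["my-etl-cluster"]}},
)
for day in response["ResultsByTime"]:
    print(day["TimePeriod"]["Start"], day["Total"]["UnblendedCost"]["Amount"])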

One thing to note is that it can take from several hours to a day for billing information to land in both Databricks and your cloud provider’s endpoints. This means you have to match costs to workflows from given days, and you can’t get real-time results on costs.

Real-time Estimated Total Costs with Gradient

Due to the complexities of extracting the actual cost of each workload, we put together our Gradient product to estimate the total cost (DBUs + cloud costs) of each of your jobs.  These costs are estimated based on the Spark eventlog and the cluster metrics from your cloud provider.  

In the image below from Gradient, we can see the runtime and the estimated total cost of the job before and after a Sync recommendation. These cost values are provided instantly after each job run. Of course, they are only estimates based on list pricing, but they give you a good idea of cost trends.

Additional Costs For Running Databricks

Apart from workspace and compute costs, there are other factors to consider:

  • Data Migration and Storage: While Databricks itself doesn’t charge for data storage, you might incur costs based on the cloud provider’s storage and data transfer rates. Databricks also offers a data migration service from existing data warehouses.
  • Third-party Integrations: Databricks offers intelligent lakehouse monitoring provided by Unity Catalog, and Predictive Optimization powered by AI. Both operate under a DBU/hour model like standard pricing.
  • Support and Training: Databricks offers various support and training packages, which come at an extra cost. Databricks public instructor-led courses average $1,000 to $1,500 per participant (though you can save 20% for a limited time with the discount code ilt20).

How Does Databricks Price Compare To Snowflake?

While Databricks is a fairly unique product, the most common alternative companies consider is Snowflake. While both are cloud-based data solutions, Databricks is much more common for large-scale machine learning and data science jobs, whereas Snowflake is optimized for low-to-moderate-complexity SQL-based queries. Snowflake is typically easier to use; however, users have much less fine-grained control over their infrastructure.

While both have usage-based pricing, Snowflake charges clients directly for everything, from compute to storage. Databricks, on the other hand, has two cost drivers: Databricks fees plus cloud compute and storage fees.

At the end of the day, there’s no real way to predict whether either platform will be cheaper. It all depends on how you use it and what kind of workloads you’re running. One thing we can say is that you’ll only get as much efficiency as the effort you put in, as both platforms require optimization.

For more in-depth information, read Databricks’ guide on evaluating data pipelines for cost performance.

Databricks Pricing Page

Databricks Pricing Calculator

Pricing For Azure

How To Optimize Databricks Clusters

Databricks Instructor-Led Courses

Databricks Guided Access Support Subscription

Migrate Your Data Warehouse to Databricks

Databricks Support Policy 

New Gradient quick-start notebooks: Optimize your Databricks jobs in minutes

Since we launched Gradient to help control and optimize Databricks Jobs, one piece of feedback from users was crystal clear to us:

“It’s hard to set up”

And we totally agreed.  By nature, Gradient is a pretty deep infrastructure product that needs not only your Spark eventlogs but also cluster information from your cloud provider.

As a near-term solution, we built an easy-to-use notebook that you run in your Databricks environment to do all of the setup with a single click.

See the new Gradient quick-start guide here

Easy as 1-2-3-4

  1. Create a Sync Account
  2. Create a project in Gradient
  3. Run the “Gradient setup” notebook
  4. Review and test the recommendation with the “Apply Recommendation” notebook

For users who orchestrate their jobs with Databricks Workflows, this notebook helps automate the setup process, making it easier to get started quickly.

The “Gradient Setup” Notebook

This notebook is run locally in your Databricks environment, so no credential information is shared with Sync, minimizing any security issues. At a high level this notebook does the following:

  1. User inputs the job-id of the desired Job they would like to test Gradient on
  2. Downloads the necessary file from Sync’s open source Github
  3. Creates Databricks secrets with your credentials
  4. Clones your job to make testing safer
  5. Adds an init script & environment variables to the job cluster
  6. Appends a new task with a new single node cluster after the selected job to auto-submit logs

Below is a screenshot of the top cell of the notebook entry point, where you replace the 6 values that point the notebook toward the job you want to optimize as well as your Gradient project.

Once this notebook is run, users can go into the Gradient UI to see their shiny new recommendation.  Users can also configure their optimization goals to hit different SLA runtimes.

The “Apply Recommendation” notebook

To easily apply the recommendation, we provide another notebook which performs the following:

  1.  User inputs the job-id of the desired Job they would like to update
  2.  Retrieves the recommendation from Gradient
  3.  Updates the cluster configurations detailed in the recommendation

Users can simply copy and paste the same input configuration text block and click “run notebook” to finish.

After this, simply re-run your newly configured job and verify the before-and-after performance!

Prerequisites

One major prerequisite for the notebook, and Gradient in general, is permissions to retrieve cluster information from your cloud provider.  This permission is granted via instance profiles. 

For some small companies, creating an instance profile on AWS is pretty straightforward. For larger companies, it may involve working with your devops or security teams!

For more details check out our docs on permissions.

Try it now!

Follow our quick-start guide here to try Gradient for free

Questions or feedback?

If you have any questions or feedback, please don’t hesitate to reach out to us via Intercom or email us at: support@synccomputing.com

Why Your Data Pipelines Need Closed-Loop Feedback Control

As data teams scale up on the cloud, data platform teams need to ensure the workloads they are responsible for are meeting business objectives. With dozens of data engineers building hundreds of production jobs, controlling their performance at scale is untenable for a myriad of reasons, from technical to human.

The missing link today is the establishment of a closed loop feedback system that helps automatically drive pipeline infrastructure towards business goals.  That was a mouthful, so let’s dive in and get more concrete about this problem.

The problem for data platform teams today

Data platform teams have to manage fundamentally distinct stakeholders, from management to engineers. Oftentimes these two groups have opposing goals, and platform managers can be squeezed by both ends.

Many real conversations we’ve had with platform managers and data engineers typically go like this:


“Our CEO wants me to lower cloud costs and make sure our SLAs are hit to keep our customers happy.”

Okay, so what’s the problem?

“The problem is that I can’t actually change anything directly, I need other people to help and that is the bottleneck”

So basically, platform teams find themselves handcuffed and face enormous friction when trying to actually implement improvements.  Let’s zoom into the reasons why.

What’s holding back the platform team?

  • Data teams are out of technical scope – Tuning clusters or complex configurations (Databricks, Snowflake) is a time-consuming task, and data teams would rather focus on actual pipelines and SQL code. Many engineers don’t have the skill set, support structure, or even know what the costs are for their jobs. Identifying and solving root-cause problems is also a daunting task that gets in the way of just standing up a functional pipeline.

  • Too many layers of abstraction – Let’s just zoom in on one stack: Databricks runs its own version of Apache Spark, which runs on a cloud provider’s virtualized compute (AWS, Azure, GCP), with different network options and different storage options (DBFS, S3, Blob), and everything can be updated independently and randomly throughout the year. The number of options is overwhelming, and it’s impossible for platform folks to ensure everything is up to date and optimal.

  • Legacy code – One unfortunate reality is simply just legacy code.  Oftentimes teams in a company can change, people come and go, and over time, the knowledge of any one particular job can fade away.  This effect makes it even more difficult to tune or optimize a particular job.

  • Change is scary – There’s an innate fear to change.  If a production job is flowing, do we want to risk tweaking it?  The old adage comes to mind: “if it ain’t broke, don’t fix it.”  Oftentimes this fear is real, if a job is not idempotent or there are other downstream effects, a botched job can cause a real headache.  This creates a psychological barrier to even trying to improve job performance.

  • At scale there are too many jobs – Typically platform managers oversee hundreds if not thousands of production jobs.  Future company growth ensures this number will only increase.  Given all of the points above, even if you had a local expert, going in and tweaking jobs one at a time is simply not realistic.  While this can work for a select few high priority jobs, it leaves the bulk of a company’s workloads more or less uncared for.  

Clearly it’s an uphill battle for data platform teams to quickly make their systems more efficient at scale.  We believe the solution is a paradigm shift in how pipelines are built.  Pipelines need a closed loop control system that constantly drives a pipeline towards business goals without humans in the loop.  Let’s dig in.

What does a closed loop control for a pipeline mean?

Today’s pipelines are what is known as an “open loop” system, in which jobs just run without any feedback. To illustrate, the image below shows “Job 1” simply running every day at a cost of $50 per run. Let’s say the business goal is for that job to cost $30. Well, until somebody actually does something, that cost will remain at $50 for the foreseeable future, as seen in the cost vs. time plot.

What if instead, we had a system that actually fed back the output statistics of the job so that the next day’s deployment can be improved?  It would look something like this:

What you see here is a classic feedback loop, where in this case the desired “set point” is a cost of $30. Since this job is run every day, we can take the feedback of the real cost and send it to an “update config” block that takes in the cost differential (in this case $20) and, as a result, applies a change to Job 1’s configuration. For example, the “update config” block may reduce the number of nodes in the Databricks cluster.
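
To make the idea tangible, here is a toy sketch of that loop in Python. It is not Gradient’s actual model, just a proportional “update config” block that nudges the worker count until the observed cost approaches the $30 set point; the cost-per-worker figure is an assumption for illustration.

TARGET_COST = 30.0          # the business goal, i.e. the "set point"
COST_PER_WORKER_RUN = 5.0   # assumed cost contribution of one worker per run

def run_job(num_workers):
    # Stand-in for the real job: cost scales with cluster size.
    return num_workers * COST_PER_WORKER_RUN

def update_config(num_workers, observed_cost):
    # The "update config" block: turn the cost error into a new cluster size.
    error = observed_cost - TARGET_COST                 # +$20 on the first run
    delta_workers = round(error / COST_PER_WORKER_RUN)  # crude proportional step
    return max(1, num_workers - delta_workers)

workers = 10                                  # today's config: 10 workers -> $50/run
for day in range(5):
    cost = run_job(workers)                   # feedback: the observed cost
    print(f"day {day}: {workers} workers, ${cost:.0f}")
    workers = update_config(workers, cost)    # apply the new config for tomorrow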

What does this look like in production?

In reality this doesn’t happen in a single shot.  The “update config” model is now responsible for tweaking the infrastructure to try to get the cost down to $30.  As you can imagine, over time the system will improve and eventually hit the desired cost of $30, as shown in the image below.

This may all sound fine and dandy, but you may be scratching your head and asking “what is this magical ‘update config’ block?”  Well that’s where the rubber meets the road.  That block is a mathematical model that can input a numerical goal delta, and output an infrastructure configuration or maybe code change.

It’s not easy to make and will vary depending on the goal (e.g. costs vs. runtime vs. utilization).  This model must fundamentally predict the impact of an infrastructure change on business goals – not an easy thing to do.

Nobody can predict the future

One subtle thing is that no “update config” model is 100% accurate.  In the 4th blue dot, you can actually see that the cost goes UP at one point.  This is because the model is trying to predict a change in the configurations that will lower costs, but because nothing can predict with 100% accuracy, sometimes it will be wrong locally, and as a result the cost may go up for a single run, while the system is “training.”

But, over time, we can see that the total cost does in fact go down.  You can think of it as an intelligent trial and error process, since predicting the impact of configuration changes with 100% accuracy is straight up impossible.

The big “so what?” – Set any goal and go

The approach above is a general strategy and not one that is limited to just cost savings. The “set point” above is simply a goal that a data platform person puts in. It can be any kind of goal; runtime is a great example.

Let’s say we want a job to be under a 1-hour runtime (or SLA). We can let the system above tweak the configurations until the SLA is hit. Or what if it’s more complicated, say a cost and SLA goal simultaneously? No problem at all; the system can optimize to hit your goals over many parameters. In addition to cost and runtime, other business goals include:

  • Resource Utilization: Independent of cost and runtime, am I using the resources I have properly?
  • Energy Efficiency: Am I consuming the least amount of resources possible to minimize my carbon footprint?
  • Fault Tolerance: Is my job actually resilient to failure? Meaning do I want to over-spec it just in case I get preempted or just in case there are no SPOT instances available?
  • Scalability: Does my job scale? What if I have a spike in input data by 10x, will my job crash?
  • Latency: Are my jobs hitting my latency goals? Response time goals?

In theory, all a data platform person has to do is set goals, and then an automatic system can iteratively improve the infrastructure until the goals are hit.  There are no humans in the loop, no engineers to get on board.  The platform team just sets the goals they’ve received from their stakeholders.  Sounds like a dream.

So far we’ve been pretty abstract. Let’s dive into some concrete use cases that are hopefully compelling:

Example feature #1: Group jobs by business goals

Let’s say you’re a data platform manager and you oversee the operation of hundreds of production jobs.  Right now, they all have their own cost and runtime.  A simple graph below shows a cartoon example, where basically all of the jobs are randomly scattered across a cost and runtime graph.

What if you want to lower costs at scale?  What if you want to change the runtime (or SLA) of many jobs at once?  Right now you’d be stuck.

Now imagine if you had the closed loop control system above implemented for all of your jobs.  All you’d have to do is set the high level business goals of your jobs (in this case SLA runtime requirements), and the feedback control system would do its best to find the infrastructure that accomplishes your goals.  The end state will look like this:

Here we see each job’s color represents a different business goal, as defined by the SLA.  The closed loop feedback control system behind the scenes changed the cluster / warehouse size, various configurations, or even adjusted entire pipelines to try to hit the SLA runtime goals at the lowest cost.  Typically longer job runtimes lead to lower cost opportunities.

Example feature #2: Auto-healing jobs

As most data platform people can confirm, things are always changing in their data pipelines.  Two very popular scenarios are: data size growing over time, and code changes.  Both of which can cause erratic behavior when it comes to cost and runtime.

The illustration below shows the basic concept.  Let’s walk through the example from left to right:

  • Start:  Let’s say you have a job and over time the data size grows. Normally your cluster stays the same, and as a result both cost and runtime increase.
  • Start Feedback:  Over time the runtime approaches the SLA requirement and the feedback control system kicks in at the green arrow.  At this point, the control system changes the cluster to stay below the red line while minimizing costs.
  • Code Change:  At some point a developer pushes a new update to the code which causes a spike in the cost and runtime.  The feedback control system kicks in and adjusts the cluster to work better with the new code change.

Hopefully these two examples show how a closed loop control pipeline can be beneficial. Of course, in reality there are many details that have been left out, and some design principles companies will have to adhere to. One big one is a way for configurations to revert back to a previous state in case something goes wrong. An idempotent pipeline would also be ideal here in case many iterations are needed.

Conclusion

Data pipelines are complex systems, and like any other complex system, they need feedback and control to ensure a stable performance.  Not only does this help solve technical or business problems, it will dramatically help free up data platform and engineering teams to focus on actually building pipelines.  

Like we mentioned before, a lot of this hinges on the performance of the “update config” block. This is the critical piece of intelligence needed for the success of the feedback loop. It is not trivial to build, and it is the main technical barrier today. It can be an algorithm or a machine learning model, and it can utilize historical data. It is the main technical component we’ve been working on over the past several years.

In our next post we’ll show an actual implementation of this system applied to Databricks Jobs, so you can believe that what we’re talking about is real!

Interested in learning more about closed loop controls for your Databricks pipelines? Reach out to Jeff Chou and the rest of the Sync Team.

Are Databricks clusters with Photon and Graviton instances worth it?

Configuring Databricks clusters can seem more like art than science.  We’ve reported in the past about ways to optimize worker and driver nodes, and how the proper selection of instances impacts a job’s cost and performance.  We’ve also discussed how autoscaling performs, and how it’s not always the most efficient choice for static jobs.  

In this blog post, we look across a few other popular questions and options we see from folks:

  1. How do Graviton instances impact cost and performance?
  2. How does the price and performance of Photon compare to standard instances?

What are Graviton instances?

Graviton instances on AWS contain custom AWS-built processors, which promise to be a “major leap” in performance. Specifically for Spark, AWS published a report claiming Graviton can help reduce costs by up to 30% and speed up performance by up to 15% for Apache Spark on EMR. Although Databricks clusters can use Graviton, there haven’t been any performance metrics reported (that we know of). There’s no extra surcharge for Graviton instances, and they are typically moderately priced compared to other instances.

What is Photon in Databricks?

Photon is a vectorized query engine written in C++, developed by the creators of Apache Spark and available within the Databricks platform. Photon is an impressive technical feat with a multitude of features and considerations that extend well beyond the scope of this blog. For full details, we encourage readers to check out the original Photon academic paper here. Unfortunately, Photon is not free and typically carries a 2x DBU cost increase compared to non-Photon compute. So users have to decide if the cost increase is “worth it.”

At the highest level for most end users, as cited by the original academic paper:

  • Photon is great for CPU heavy operations such as joins, aggregations, and SQL expression evaluations.  
  • The academic paper claims about a 3x speedup on the TPC-H benchmark compared to standard Databricks runtime
  • Photon is not expected to provide a speedup to workloads that are I/O or network bound.

Yes, you can even run Photon on Graviton instances!  What happens with this powerful combo?  The data below shows the results.

How do I use Graviton and/or Photon?

Graviton instances typically have the “g” letter in the instance names, such as “m6g.xlarge” or “c7g.xlarge” and are selected during the cluster creation step within Databricks under “Worker type” and “Driver type”.

Photon is enabled by simply checking the box “Use Photon Acceleration” in the cluster creation step.  An image of the UI is shown below.
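
For reference, a cluster specification that combines the two might look roughly like the sketch below, expressed as the Python dict you would pass to the Clusters or Jobs API. The field names and the m6gd.xlarge instance type are our assumptions based on the public API, so check the API docs and the instance types offered in your region before using it.

# Hedged sketch of a Graviton + Photon cluster spec for the Databricks API.
graviton_photon_cluster = {
    "spark_version": "11.3.x-scala2.12",   # runtime used in this benchmark
    "node_type_id": "m6gd.xlarge",         # the "g" marks a Graviton instance
    "driver_node_type_id": "m6gd.xlarge",
    "num_workers": 10,
    "runtime_engine": "PHOTON",            # equivalent to the UI checkbox
    "aws_attributes": {"availability": "ON_DEMAND"},
}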

Experimental setup

In our analysis we utilize the TPC-DS 1TB benchmark, with all queries run sequentially. We then look at the total runtime of all queries summed together. To keep things simple and fair, every cluster has identical driver and worker instances. We sampled 28 different instances, spanning Photon-enabled, Graviton, memory-, compute-, I/O-, network-, and storage-optimized instances. A full list of the parameters of each cluster is below:

  1. Driver:  [instance].xlarge
  2. Worker:  [instance].xlarge
  3. Number of workers: 10
  4. EBS volume: 64
  5. Databricks runtime version:  11.3.x-scala2.12
  6. Market:  On-demand
  7. Cloud provider: AWS
  8. Instances:  28 different instances on AWS

For the cost, we utilize only the DBU cost of each cluster.  We did not include the AWS costs for various reasons:

  • Cloud cost attribution difficulty:  Databricks internally re-uses clusters for adjacent jobs. Meaning, AWS clusters for one job may be reused for a second job if they require the same machines. This makes it difficult to determine which job was using which cluster in AWS. This is a niche problem, and only matters for people who want to determine the true cost of a single job.
  • AWS costs depend on the market:  The AWS costs, or cloud costs in general, depend on the market.  Specifically, if users are using on-demand vs. spot nodes, it will drastically change the relative cost performance.  Furthermore, spot prices can fluctuate daily, so extracting fair comparisons would be difficult here.
  • AWS costs depend on contracts:  Large companies negotiate their own costs for their instances, thus again, making an overall apples to apples comparison difficult.

For the reasons above, the DBU costs are utilized because they are exact, easy to identify, and do not fluctuate depending on the market.  However, we will say that DBU costs can also depend on contracts.  But for the sake of this study, we’ll just use the list prices of DBUs.  As you can tell by these thoughts, doing actual cost comparisons is not a trivial task, and is highly dependent on each company’s use case.

Results

The graph below shows the cost vs. runtime plots of all 28 different clusters. They are grouped into four categories: “Graviton” instances, “Photon” enabled instances, “Standard” instances (no Photon, no Graviton), and “Graviton + Photon” instances. Points that are closer to the bottom left-hand corner of the graph are both “faster and cheaper.”

In the graph below, we can see two clear “clusters”, basically with and without Photon.  It’s clear from this data that Photon is legitimately faster.  Unfortunately, it doesn’t appear any cheaper, so if your goal is to save money these results are a bit of a downer.  If you’re trying to run faster, Photon may be exactly what you’re looking for.

The two bar graphs below contain the same data as the XY plot above, but they break out the data into runtime and DBU costs separately.  Also, we present the individual instances used, in case people would like a more granular view into the data.

After perusing through the data, our main observations are outlined below.  I’d like to heavily caution that these observations are purely from the experiment we ran above.  We urge people to exercise caution when trying to generalize these results, as individual jobs can have wildly different results than the ones we showed above.  With that said, these are the main takeaways:

  • Photon is generally 2x faster – Across the board, Photon was about 2x faster than its non-Photon counterparts (same instances). This was great to see. Although not as high as some of the claims reported by Databricks, we understand that it is highly dependent on the workload. In my opinion a 2x speedup is pretty impressive.
  • Graviton was neutral – The runtime for Graviton was perhaps a bit faster than standard instances, but it’s unclear if the difference is statistically significant. There doesn’t seem to be much risk in using Graviton, and they are newer chips, so maybe they will be faster for your jobs?
  • Photon’s total cost is cheaper (with this data) – In the data above, since the DBU costs were about the same across all three groups and Photon’s runtimes were about 2x faster, one can logically conclude that the cloud portion of the costs (the AWS fees) will be less with Photon. As a result, the total cost for an end user was cheapest with Photon enabled.
  • Photon pricing makes for a complex cost ROI – Because of the previous point, determining the ROI of Photon is difficult. It basically boils down to whether the speedup is large enough to offset the increased DBU cost. If it is not, users are essentially paying more money for a potentially faster job. If the Photon speedup is large enough, then it will be cheaper. What that threshold is will depend on the market and any discounts. For the sake of this study, the crossover point for on-demand instances was around 20%, meaning Photon needs to be at least 20% faster than Standard to observe any cost savings.

Formula for determining Photon ROI

For those that are mathematically inclined, here is a simple formula to help determine the “speedup threshold” which is the minimum speedup Photon needs to achieve for your job in order to break even.  If your speedup is greater than this threshold, then you are saving money.
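
The equation itself appears as an image in the original post, so the sketch below is our reconstruction of the break-even logic from the worked example that follows; treat the exact form as an illustration rather than the official formula. With identical driver and worker instances the worker count cancels out, which is why it does not appear explicitly.

def photon_speedup_threshold(c_vm, c_dbu, photon_multiplier=2.0):
    # Ratio of the Photon cluster's cost per hour to the standard cluster's,
    # for identical nodes: c_vm is the cloud price per node-hour, c_dbu the
    # standard DBU price per node-hour, photon_multiplier the Photon DBU markup.
    # Photon must cut runtime by at least this factor to break even.
    return (c_vm + photon_multiplier * c_dbu) / (c_vm + c_dbu)

# Worked example from the text: machine and DBU costs of 1, Photon multiplier of 2.
p_sth = photon_speedup_threshold(c_vm=1.0, c_dbu=1.0)
print(p_sth)          # 1.5
print(1 - 1 / p_sth)  # ~0.33 -> runtime must drop by about 33% to break even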

For a simple example, let’s say all of the machine and DBU costs are 1, the Photon cost increase is a factor of 2, and we have 10 workers. With these very simple numbers, we get a Psth value of 1.5. Plugging in 1.5 for Psth, setting R_orig = 1, and solving for R_photon, Photon needs to be about 33% faster to break even. Clearly this value depends heavily on a lot of factors, all of which appear in the equation above.

Conclusion

Overall, the answer to the original two questions really comes down to “it depends.” The data points we showed above are a vanishingly small slice of what workloads actually look like. Based simply on the data above, here are the answers:

1)  Photon will probably be faster than non-Photon, but whether or not it’s cheaper will depend on how much faster it is relative to the added cost. Whether the 2x DBU cost increase with Photon is worth it depends on the market and pricing of your cloud instances.

2)  On average Graviton was about the same for cost and runtime compared to standard instances.  We did not see any significant advantage of using Graviton here, but we didn’t see any downside either.  Maybe these new chips will be perfect for your workload, or maybe not.  It’s hard to tell.

However, with the data above, specifically around Photon, I can’t help but ask the question:

Is Databricks motivated to make Spark run faster? 

This is an interesting philosophical question where the tech enthusiast may clash with the business units.  The faster Databricks makes Spark, the less revenue they get, since they charge per minute.  Photon is an interesting case study in which, yes, they made Spark 2x faster – but then had to double their costs to not lose money.  This is at least one data point that shows you where Databricks basically sits: “Yes we can make Spark faster, but not cheaper.”

In my opinion, Databricks, and other cloud providers, are fundamentally motivated to increase revenue. So making Spark run faster and/or cheaper is not in alignment with what they need to do as a business. They will, however, make the product easier to use, or expand to other use cases, which fundamentally increases revenue.

We of course respect the fact that any business needs to make money, so I don’t think anything improper is happening here. But it does reveal an interesting conflict between technology and business and how that fundamentally impacts the end user.

How to Use the Gradient CLI Tool to Optimize Databricks / EMR Programmatically

Introduction:

The Gradient Command Line Interface (CLI) is a powerful yet easy-to-use utility for automating the optimization of your Spark jobs from your terminal, command prompt, or automation scripts.

Whether you are a data engineer, SysDevOps administrator, or just an Apache Spark enthusiast, knowing how to use the Gradient CLI can be incredibly beneficial, as it can dramatically reduce the cost of your Spark workloads while helping you hit your pipeline SLAs.

If you are new to Gradient, you can learn more about it in the Sync Docs. In this tutorial, we’ll walk you through the Gradient CLI’s installation process and give you some examples of how to get started. This is meant to be a tour of the CLI’s overall capabilities. For an end to end recipe on how to integrate with Gradient take a look at our Quick Start and Integration Guides.

Pre Work

This tutorial assumes that you have already created a Gradient account and generated your Sync API keys. If you haven’t generated your key yet, you can do so on the Accounts tab of the Gradient UI.

Step 1: Setting up your Environment

Let’s start by making sure our environment meets all the prerequisites. The Gradient CLI is actually part of the Sync Library, which requires Python v3.7 or above and which only runs on Linux/Unix based systems.

python --version

I am on a Mac and running Python 3.10, so I am good to go. But before we get started, I am going to create a Python virtual environment with venv. This is a good practice whenever you install any new Python tool, as it allows you to avoid conflicts between projects and makes environment management simpler. For this example, I am creating a virtual environment called gradient-cli that will reside under the ~/VirtualEnvironments path.

python -m venv ~/VirtualEnvironments/gradient-cli

Step 2: Install the Sync Library

Once you’ve confirmed that your system meets the prerequisites, it’s time to install the Sync Library. Start by activating your new virtual environment.

source ~/VirtualEnvironments/gradient-cli/bin/activate

Next use the pip package installer to install the latest version of the Sync Library.

pip install https://github.com/synccomputingcode/syncsparkpy/archive/latest.tar.gz

You can confirm that the installation was successful by viewing the CLI executable’s version using the --version or --help options.

sync-cli --help

Step 3. Configure the Sync Library

Configuring the CLI with your credentials and preferences is the final step in installing and setting up the Sync CLI. To do this, run the configure command:

sync-cli configure

You will be prompted for the following values:

Sync API key ID:

Sync API key secret:

Default prediction preference (performance, balanced, economy) [economy]:

Would you like to configure a Databricks workspace? [y/n]:

Databricks host (prefix with https://):

Databricks token:

Databricks AWS region name:

If you remember from the Pre Work, your Sync API key and secret are found on the Accounts tab of the Gradient UI. For this tutorial we are running on Databricks, so you will need to provide a Databricks workspace and an access token.


Databricks recommends that you set up a service principal for automation tasks. As noted in their docs, service principals give automated tools and scripts API-only access to Databricks resources, providing greater security than using users or groups.

These values are stored in ~/.sync/config.

Congrats! You are now ready to interact with Gradient from your terminal, command prompt, or automation scripts.

Step 4. Example Uses

Below are some tasks you can complete using the CLI. This is useful when you want to automate Gradient processes and incorporate them into larger workflows.

Projects

All Gradient recommendations are stored in Projects. Projects are associated with a single Spark job or a group of jobs running on the same cluster. Here are some useful commands you can use to manage your projects with the CLI. For an exhaustive list of commands, use the --help option.

Project Commands:

create – Create a project

sync-cli projects create --description [TEXT] --job-id [Databricks Job ID] PROJECT_NAME

delete – Delete a project

sync-cli projects delete PROJECT_ID

get – Get info on a project

sync-cli projects get PROJECT_ID

list – List all projects for account

sync-cli projects list

Predictions

You can also use the CLI to manage, generate and retrieve predictions. This is useful when you want to automate the implementation of recommendations within your Databricks or EMR environments.

Prediction commands:

get – Retrieve a specific prediction

sync-cli predictions get --preference [performance|balanced|economy] PREDICTION_ID

list – List all predictions for account or project

sync-cli predictions list --platform [aws-emr|aws-databricks] --project TEXT

status – Get the status of a previously initiated prediction

sync-cli predictions status PREDICTION_ID

The CLI also provides platform specific commands to generate and retrieve predictions.

Databricks

For Databricks you can generate a recommendation for a previously completed job run with the following command:

sync-cli aws-databricks create-prediction --plan [Standard|Premium|Enterprise] --compute ['Jobs Compute'|'All Purpose Compute'] --project [Your Project ID] RUN_ID

If the run you provided was not already configured with the Gradient agent when it executed, you can still generate a recommendation, but the underlying metrics may be missing some time-sensitive information that is no longer available. To enable evaluation of prior runs executed without the Gradient agent, you can add the --allow-incomplete-cluster-report option. However, to avoid this issue altogether, you can implement the agent and re-run the job.

Alternatively, you can use the following command to run the job and request a recommendation with a single command:

sync-cli aws-databricks run-job --plan [Standard|Premium|Enterprise] --compute ['Jobs Compute'|'All Purpose Compute'] --project [Your Project ID] JOB_ID

This method is useful in cases when you are able to manually run your job without interfering with scheduled runs.

Finally, to implement a recommendation and run the job with the new configuration, you can issue the following command:

sync-cli aws-databricks run-prediction --preference [performance|balanced|economy] JOB_ID PREDICTION_ID

EMR

Similarly, for Spark EMR, you can generate a recommendation for a previously completed job. EMR does not have the same issue with ephemeral cost data becoming unavailable, so you can request a recommendation on a previous run without the Gradient agent. Use the following command to do so:

sync-cli aws-emr create-prediction --region [Your AWS Region] CLUSTER_ID

If you want to manually rerun the EMR job and immediately request a Gradient recommendation, use the following command:

sync-cli aws-emr record-run --region [Your AWS Region] CLUSTER_ID PROJECT

To execute the EMR job using the recommended configuration, use the following command:

sync-cli aws-emr run-prediction --region [Your AWS Region] PREDICTION_ID

Products

Gradient is constantly working on adding support for new data engineering platforms. To see which platforms are supported by your version of the CLI, you can use the following command:

sync-cli products

Configuration

Should you ever need to update your CLI configurations, you can call configure again to change one or more of your values.

sync-cli configure --api-key-id TEXT --api-key-secret TEXT --prediction-preference TEXT --databricks-host TEXT --databricks-token TEXT --databricks-region TEXT

Token

The token command returns an access token that you can use against our REST API with clients like Postman.

sync-cli token

Conclusion

With these simple commands, you can automate the end-to-end optimization of all your Databricks or EMR workloads, dramatically reducing your costs and improving performance. For more information, refer to our developer docs or reach out to us at info@synccomputing.com.