Databricks is a popular unified analytics platform and a go-to solution for many organizations looking to harness the power of big data. Its collaborative workspaces have become the industry standard for data engineering and data science teams and an ideal environment for building, training and deploying machine learning and AI models at scale.
However, as with any cloud-based service, Databricks' pricing structure is complex and product-dependent, and understanding it is crucial for budgeting and cost management. In this article, we'll explain in simple terms everything you need to know about Databricks pricing, including its pay-as-you-go model, the factors that affect your individual pricing, and examples of how to save on cost.
Databricks Pricing FAQ
- How is my Databricks cost calculated?
- What are the different Databricks pricing plans?
- Price for Databricks by workload
- Does Databricks offer free trials?
- How to save money on Databricks
- How do I find the cost of my Databricks?
- Additional costs for running Databricks
- How does Databricks price compare to Snowflake?
First off, what is the price of Databricks?
The short answer is that it depends. As we'll explain below, Databricks' price depends on usage, so there is no single answer to what it costs. That said, based on average data use, it's fairly normal for a midsize company to spend somewhere between $100k and $1 million per year.
How is my Databricks cost calculated?
In simple terms, Databricks cost is based on how much data you process, the type of workload you're executing, and which product you're using. Each type of compute has a different price per processing unit, known as a Databricks Unit, or DBU. To calculate your Databricks cost, you simply multiply the number of DBUs used by the dollar rate per DBU for that workload.
For instance, certain workloads such as Jobs Light Compute or Serverless Real-Time Inference cost $0.07 per DBU. So a job that consumes 100 DBUs would cost $7.00.
Keep in mind that complex tasks such as All-Purpose Interactive workloads (typically used for data science or business intelligence) have higher rates of around $0.55 per DBU. This means it's not just the amount of data, but also the workload type, that matters. Data velocity (how frequently your data pipeline runs) and data complexity (how much work it takes to process your data set) can both add to the number of DBUs needed, and thus raise the cost of your workload. It's therefore crucial to evaluate your ETL workflow before and during your Databricks subscription to identify areas for optimization.
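The DBU arithmetic above can be sketched in a few lines. The rates here are the illustrative figures quoted in this article; check the Databricks pricing page for the current rates in your region and plan.

```python
# Illustrative DBU rates (USD per DBU) taken from the examples in this article.
# Real rates vary by plan, cloud provider, and region.
DBU_RATES = {
    "jobs_light": 0.07,
    "serverless_realtime_inference": 0.07,
    "all_purpose_interactive": 0.55,
}

def databricks_cost(workload: str, dbus_used: float) -> float:
    """Databricks fee = DBUs consumed x dollar rate for that workload type."""
    return dbus_used * DBU_RATES[workload]

# 100 DBUs on a Jobs Light workload -> $7.00, as in the example above.
print(round(databricks_cost("jobs_light", 100), 2))              # 7.0
print(round(databricks_cost("all_purpose_interactive", 100), 2))  # 55.0
```

The same job therefore costs almost 8x more when run on an all-purpose interactive cluster, which is why matching workload type to the task matters.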
Outside of the job function itself, prices for your Databricks subscription differ by cloud service provider, pricing plan, and even region (though within the contiguous U.S. these prices are largely the same). Databricks' price can also differ by the size and type of instance, which refers to the type of virtual machine you are running on the Databricks lakehouse.
In addition to Databricks costs, there are also cloud compute costs. For example, if you run a job on a cluster, you pay for both the Databricks overhead and the underlying cloud compute. The cloud compute costs can often be larger than your Databricks cost, so keep this in mind. As a result, the total cost of Databricks is the sum of two major components:
Total Cost of Ownership = Databricks Cost + Cloud Provider Cost
Interestingly, both the Databricks and cloud costs scale with cluster size. While that makes sense from the cloud provider's perspective, since they are providing the compute, one may ask:
Why do Databricks costs scale with cluster size when they don’t run my cluster?
In reality, Databricks is a software layer on top of your cloud provider. Whether you run a 1-node cluster or a 1,000-node cluster, the actual cost to Databricks is roughly fixed. While this may not make much sense, that's the reality of Databricks pricing.
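The Total Cost of Ownership formula above can be sketched as follows. The instance rate used here is a hypothetical figure for illustration, not a quoted price.

```python
# Sketch of: Total Cost of Ownership = Databricks Cost + Cloud Provider Cost.
# The node hourly rate below is a hypothetical on-demand instance price.
def total_cost(dbus: float, dbu_rate: float,
               nodes: int, node_hourly_rate: float, hours: float) -> float:
    databricks_fee = dbus * dbu_rate              # Databricks cost
    cloud_fee = nodes * node_hourly_rate * hours  # cloud provider cost
    return databricks_fee + cloud_fee

# e.g. a 10-node job running 2 hours, consuming 40 DBUs at the $0.10
# Jobs Compute rate, on a hypothetical $0.50/hour instance type:
print(round(total_cost(40, 0.10, 10, 0.50, 2), 2))  # 14.0
```

Note that in this example the cloud fee ($10) dominates the Databricks fee ($4), matching the point above that cloud compute is often the larger component.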
What are the different Databricks pricing plans?
At the moment there are three pricing tiers: Standard, Premium, and Enterprise. These plans differ in their features and the types of workloads available, with the Premium plan costing the same as or more than the Standard plan.
Much of the Premium plan's benefit comes from role-based access control (think assigning admins more capabilities and permissions than regular users) and higher levels of automation and authentication. There is also access to features like audit logs, credential passthrough (for Azure Databricks), and IP access lists. Enterprise plans are customized per customer, so they vary based on company size, contract size, and plan duration.
For a full list of differences between Standard and Premium pricing, see here.
Price for Databricks by Workload
Below is a breakdown of Databricks cost by workload for the Standard plan, using AWS as the cloud service provider in the Central US region.
Jobs Compute:
- Jobs Light Compute: $0.07 per DBU/hour
- Jobs Compute: $0.10 per DBU/hour
- Jobs Compute Photon: $0.10 per DBU/hour
Delta Live Tables
- DLT Core Photon: $0.20 per DBU/hour
- DLT Pro Photon: $0.25 per DBU/hour
- DLT Advanced Photon: $0.36 per DBU/hour
All Purpose Compute:
- All Purpose Compute: $0.40 per DBU/hour
- All Purpose Compute Photon: $0.40 per DBU/hour
The following workloads are only available on Premium subscriptions, so their prices reflect that.
Serverless and SQL Compute:
- SQL Classic: $0.22 per DBU/hour
- SQL Pro: $0.55 per DBU/hour
- SQL Serverless: $0.70 per DBU/hour
- Serverless Real-Time Inference: $0.07 per DBU/hour
Does Databricks offer free trials?
Yes, Databricks offers a free trial: a fully featured version with user-interactive notebooks, available for 14 days. While the Databricks trial itself is free, you still need to pay for the underlying cloud infrastructure.
If you want to continue using Databricks for free (but with limited features), you can use the free Databricks Community Edition. This is great for those wanting to learn Apache Spark.
It's also worth noting that because there are no upfront costs and Databricks is priced on a pay-as-you-go model, the cost of getting set up is minimal.
How To Save Money On Databricks
The great news about the Databricks pricing model is that because it’s based on usage, there are a number of ways to reduce your cost basis by altering your usage. Some of these ways include:
- Optimize Your Job Clusters. By choosing the right size and type of job cluster, companies can often save huge amounts of money through runtime reductions, without making any hardware changes. For instance, Sync saved Duolingo 55% on their machine learning costs simply through cluster optimization.
- Use Spot Instances. For AWS customers specifically: Spot Instances are unused computing capacity on Amazon EC2, offered at deep discounts of up to 90%. However, one issue with Spot Instances is that machines (or worker nodes) can be reclaimed at any time. This can cause unwanted delays in your job, which can end up increasing its cost. So while Spot Instances will save you money most of the time, if reliable runtime and performance matter more to you than cost, then on-demand instances may be better.
- Use Photon. Photon is the next-generation engine on the Databricks Lakehouse Platform, providing massively parallel, extremely fast query performance at lower total cost. This makes it very efficient for highly complex workloads, but possibly overkill for simple jobs that are not Photon-compatible. If your job is not compatible, you could end up paying 2x the DBU cost for no benefit. So we recommend testing your job with Photon to see whether you get cost savings. Read our blog on this topic to learn more.
- Autoscaling. Autoscaling is a Databricks configuration that dynamically tunes the number of workers for your workloads. Activating autoscaling is a simple checkbox in the Databricks UI that many people overlook. However, there is a cost to spinning nodes up and down: you pay for machines that are still warming up and not actually processing data. This makes autoscaling best suited to ad-hoc notebook usage; for static production jobs, autoscaling may end up costing more. Read our blog on this topic to learn more.
- Optimize your code. Apache Spark is a very rich programming framework, and Databricks has built many optimizations into their platform, for example: Optimize & Z-Order, OptimizeWrite, partitioning, file size tuning, shuffle reduction, the cost-based optimizer, Adaptive Query Execution, salting, data skipping, Delta Lake optimizations, and data caching. Many of these techniques are quite advanced, but thankfully Databricks has a great resource outlining best practices.
- Don't use Databricks. Databricks is great for many use cases, but it is expensive. Many real-world use cases don't have data at a scale that justifies Databricks. There are alternative Apache Spark services, such as AWS EMR, or free open-source Apache Spark itself. These options are usually more time-intensive to set up, which may mean you need more infrastructure engineers, but their per-minute costs are typically cheaper.
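The Spot Instance trade-off above can be reasoned about with a rough expected-cost comparison. All rates and the interruption-overhead factor here are hypothetical assumptions for illustration.

```python
# Rough break-even check for Spot vs On-Demand instances.
# All rates and the overhead factor are hypothetical assumptions.
def expected_job_cost(hourly_rate: float, runtime_hours: float,
                      interruption_overhead: float = 0.0) -> float:
    """interruption_overhead: extra fraction of runtime spent re-running
    work after Spot nodes are reclaimed (0.0 for On-Demand)."""
    return hourly_rate * runtime_hours * (1.0 + interruption_overhead)

# A 4-hour job on a hypothetical $1.00/hr instance, vs the same instance
# at a 70% Spot discount but with 50% of the runtime lost to reclaims:
on_demand = expected_job_cost(1.00, 4)
spot = expected_job_cost(0.30, 4, interruption_overhead=0.5)
print(round(on_demand, 2), round(spot, 2))  # 4.0 1.8
```

Even with heavy rework, Spot comes out cheaper in this sketch; the decision flips only when interruptions also blow deadlines, which is the reliability caveat noted above.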
Additionally, there are some subscription parameters you can alter to maximize your savings when it comes to using Databricks. The major ones here include:
- Committed-Use Discounts. Databricks offers big discounts for those who pre-pay for their processing units, in what are known as Databricks Commit Units (DBCUs). As with many things, the more DBCUs you buy, the more you save. For instance, a customer buying $25,000 worth of DBCUs per year could save 6%, while one buying $1.25 million could save as much as 33%. See a full list of pre-purchase discounts for Azure here.
- Use a Different Cloud Service Provider. There are three cloud service providers for Databricks: AWS, Microsoft Azure, and Google Cloud. In our experience, Azure is the most expensive of these, roughly 1-2x higher per DBU than the other two (due to Databricks being a first-party service with included support from Microsoft).
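The committed-use savings can be made concrete with the two discount figures cited above; any tiers in between are negotiated per contract and are not shown here.

```python
# Effective annual spend after a committed-use (DBCU) discount, using the
# two discount percentages cited in this article.
def committed_cost(annual_commit_usd: float, discount_pct: float) -> float:
    return annual_commit_usd * (1 - discount_pct / 100)

print(round(committed_cost(25_000, 6), 2))      # 23500.0
print(round(committed_cost(1_250_000, 33), 2))  # 837500.0
```

At the top tier, the $1.25 million commitment effectively buys processing that would cost about $1.87 million at list rates, which is why large customers rarely pay on demand.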
How Do I Find The Cost Of My Databricks?
Finding the total cost of your Databricks usage can be tricky. Because pricing is based on both Databricks and cloud provider fees, it's difficult to collect and attribute all of the costs. There are several methods you can use, depending on what you can access at your company:
1. First Find DBUs
You'll always want to first assess the direct cost of your Databricks usage. To do this, go to your admin page and look at your data usage to isolate just your DBU costs. You can also go to the new system tables in Databricks, which break down the DBU costs for your jobs.
2. Find Cloud Provider Costs
The good news about cloud provider costs is that they should remain fairly static relative to Databricks costs. To find your cloud provider cost, you should be able to use the tags applied to your Databricks clusters to find the associated costs within your cloud provider account. For example, in AWS you can use Cost Explorer to find the cluster and tags associated with your bill.
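The tag-based attribution described above amounts to grouping cloud billing line items by cluster tag. The record fields below are illustrative; real billing exports (e.g. AWS Cost and Usage Reports) use different field names, but the idea is the same.

```python
from collections import defaultdict

# Illustrative billing line items; field names and the "ClusterName" tag key
# are assumptions for this sketch, not a real export schema.
line_items = [
    {"tags": {"ClusterName": "etl-nightly"}, "cost_usd": 12.40},
    {"tags": {"ClusterName": "etl-nightly"}, "cost_usd": 3.10},
    {"tags": {"ClusterName": "ml-training"}, "cost_usd": 48.75},
    {"tags": {}, "cost_usd": 5.00},  # untagged spend is hard to attribute
]

def cost_by_cluster(items):
    """Sum cloud charges per Databricks cluster tag."""
    totals = defaultdict(float)
    for item in items:
        cluster = item["tags"].get("ClusterName", "untagged")
        totals[cluster] += item["cost_usd"]
    return dict(totals)

print(cost_by_cluster(line_items))
# {'etl-nightly': 15.5, 'ml-training': 48.75, 'untagged': 5.0}
```

The untagged bucket is worth watching: any cluster created without tags ends up there, which is why consistent cluster tagging is the prerequisite for this whole method.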
One thing to note is that it can take anywhere from several hours to a day for billing information to appear in both Databricks' and your cloud provider's endpoints. This means you have to match costs to workloads from previous days; you can't get real-time results on costs.
Real-time Estimated Total Costs with Gradient
Due to the complexity of extracting the actual cost of each workload, we built our Gradient product to estimate the total cost (DBUs + cloud costs) of each of your jobs. These costs are estimated from the Spark event log and the cluster metrics from your cloud provider.
In the image below from Gradient, we can see the runtime and the estimated total cost of a job before and after a Sync recommendation. These cost values are provided instantly after each job run. Of course, they are only estimates based on list pricing, but they will give you a good idea of cost trends.
Additional Costs For Running Databricks
Apart from workspace and compute costs, there are other factors to consider:
- Data Migration and Storage: While Databricks itself doesn't charge for data storage, you may incur costs from your cloud provider's storage and data transfer rates. Databricks also offers a data migration service from existing data warehouses.
- Third-party Integrations: Databricks offers intelligent lakehouse monitoring provided by Unity Catalog, and Predictive Optimization powered by AI. Both operate under a DBU/hour model like standard pricing.
- Support and Training: Databricks offers various support and training packages, which come at an extra cost. Databricks' public instructor-led courses average $1,000 to $1,500 per participant (though you can save 20% for a limited time with the discount code ilt20).
How Does Databricks Price Compare To Snowflake?
While Databricks is a fairly unique product, the most common alternative companies consider is Snowflake. While both are cloud-based data solutions, Databricks is much more common for large-scale machine learning and data science jobs, whereas Snowflake is optimized for low-to-moderate SQL-based queries. Snowflake is typically easier to use; however, users have much less fine-grained control over their infrastructure.
While both have usage-based pricing, Snowflake charges clients directly for everything, from compute to storage. Databricks, on the other hand, has two cost drivers: Databricks fees plus cloud compute and storage fees.
At the end of the day, there's no reliable way to predict which platform will be cheaper. It all depends on how you use it and what kind of workloads you're running. One thing we can say is that you'll only get as much efficiency as the effort you put in, as both platforms require optimization.
For more in-depth information, read our Databricks guide on evaluating data pipelines for cost performance.