Choosing the right Databricks cluster: Spot instances vs. on-demand clusters, All-Purpose Compute vs. Jobs Compute

According to Wavestone’s 2024 Data and AI Leadership Executive Survey, 82.2% of data and AI leaders report that their organizations are increasing investments in data and analytics. As companies rely more heavily on big data, efficient data processing and optimal cluster configuration become even more critical.

Choosing the correct cluster type for your Databricks Jobs requires considering factors like workload, budget, and performance needs. In this post, we explore the types of clusters in Databricks, ideal use cases for each, and strategies for maximizing their efficiency while lowering the costs of your data infrastructure.

What is a cluster?

A cluster is a group of virtual machines (VMs), computation resources, and configurations that work together as a joint computing environment. In Databricks, these compute resources are used to run notebooks and data pipelines. Databricks leverages the customer’s existing cloud infrastructure. By integrating Databricks with a cloud provider, such as Azure or AWS, users are able to utilize their current cloud resources while leveraging Databricks’ powerful data processing capabilities.

Clusters are designed to process large amounts of data in parallel, improving both speed and scalability. Each node in a cluster is a machine that handles a portion of the overall data processing task, allowing multiple tasks to run simultaneously so the workload completes faster. Examples of workloads these clusters execute include machine-learning jobs, ad-hoc analytics, streaming analytics, and ETL processes.

In Databricks, clusters can be configured to meet the specific needs of your data workloads. The right configuration, from the size of the cluster to the market it is purchased on (i.e., spot vs. on-demand), can significantly affect performance, scalability, and cost efficiency.

Let’s dive into the various types of clusters, pricing options, and best practices to help you choose the best cluster configuration for your Databricks Jobs.

Save up to 50% on compute!

Save on compute while meeting runtime SLAs with Gradient’s AI compute optimization solution.

Worker and driver nodes in Databricks

In Databricks, clusters consist of two primary types of nodes: workers and a driver.

  • Worker nodes: These nodes are responsible for executing tasks and running Spark executors. They handle the majority of the data processing workload, executing tasks in parallel to improve performance. Each worker node runs one executor, which is essential for executing Spark jobs.
  • Driver node: The driver coordinates the execution of tasks across the worker nodes. It hosts the Spark context, schedules tasks, and collects results from the workers. While it runs on a node just like the workers, its role in orchestrating the cluster’s operations sets it apart.
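
To make that split concrete, here is a minimal PySpark sketch (the dataset size is arbitrary and purely illustrative): the driver builds the query plan and collects the final result, while the executors on the worker nodes process partitions of the data in parallel.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession named `spark` already exists; this line just
# makes the sketch runnable elsewhere.
spark = SparkSession.builder.getOrCreate()

# The driver defines the computation: 100M rows split into partitions.
df = spark.range(0, 100_000_000)

# The executors on the worker nodes aggregate their partitions in parallel...
result = df.select(F.sum("id").alias("total"))

# ...and only the small final result is sent back to the driver.
print(result.collect()[0]["total"])
```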

Choosing the right configuration of worker and driver nodes can be complex given the vast number of options available. You also need to account for cluster size (more on that later), along with workload requirements, performance needs, and cost.

To ensure you only pay for the resources you need, Gradient offers AI-powered cluster optimization. Gradient assists users in configuring clusters by providing insights and recommendations for optimization based on specific workload characteristics. These recommendations can be reviewed and applied in a click, or fully automated for scale.

Cluster attributes

Each cluster has the following attributes:

  • Compute nodes: These are virtual machine instances that supply resources to the cluster (CPU, memory and storage). Nodes can be resized or added as needed between runs to support scaling.
  • Databricks Runtime: A set of software artifacts that act as an optimized version of Apache Spark, specifically designed for the Databricks platform. Databricks Runtime enhances job performance, scalability, security, and integration capabilities.
  • Libraries: Users can install additional JARs, Python/R packages, Spark packages, and other dependencies for workloads running on the cluster. These are automatically accessible when the cluster starts.
  • Init scripts: Custom scripts that run at the start of a cluster, used to set up dependencies, mount storage, download data, and perform other configuration tasks.
  • Notebooks and dashboards: Data artifacts and visualizations created in Databricks, which can be associated with a cluster for convenient access. The cluster has the necessary permissions to interact with these resources.
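
As a rough sketch of how these attributes map onto an actual cluster definition, here is a Python dictionary shaped like a Databricks Clusters API payload; the runtime version, instance type, script path, and library names are illustrative placeholders, not recommendations.

```python
# Hedged sketch of a cluster definition; all values below are placeholders.
cluster_spec = {
    "cluster_name": "example-etl-cluster",   # hypothetical name
    "spark_version": "14.3.x-scala2.12",     # Databricks Runtime version
    "node_type_id": "i3.xlarge",             # compute node (VM) type
    "num_workers": 4,                        # worker count drives parallelism
    "init_scripts": [                        # custom scripts run at cluster start
        {"workspace": {"destination": "/Shared/init/install-deps.sh"}}  # hypothetical path
    ],
}

# Libraries are typically attached separately (for example in a job definition)
# and become available once the cluster is up.
libraries = [
    {"pypi": {"package": "scikit-learn"}},              # example Python package
    {"jar": "dbfs:/FileStore/jars/custom-udfs.jar"},    # hypothetical JAR path
]
```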

Cluster market

Generally, clusters can be purchased under two primary pricing models: spot instances and on-demand clusters. The right choice depends largely on your workload requirements and your budget.

As a rule of thumb, we do not recommend running crucial production jobs on spot instances, as they might be interrupted or even terminated without notice.

Spot instances

Spot instances leverage unused capacity from cloud providers such as AWS and Azure. They are typically much cheaper than on-demand instances, with savings of up to 90% compared to on-demand pricing. The trade-off is that the cloud provider can terminate these instances without notice, or reclaim resources (workers) mid-run, causing jobs to stall or even fail.

When to choose Spot instances:

  • Fault-tolerant workloads: If your workload can tolerate interruptions, such as jobs that use checkpointing to restart from the last successful state, spot instances are an excellent, cost-effective option.
  • Development and testing environments: Spot instances are ideal for non-production use cases where job interruption isn’t a critical issue.
  • Non-time-sensitive data processing: If your tasks are not on strict deadlines and can handle delays, spot instances offer a more affordable solution.
  • Batch processing: Tasks that process data in large, scheduled batches (such as ETL or data transformation tasks) can benefit from the lower cost of spot instances, provided they have built-in fault tolerance and are not time sensitive.

Best practices for Spot instances:

  • Implement checkpointing: Ensure your workloads are designed to resume after an interruption by using checkpointing or other state-saving mechanisms. 
  • Use automatic retries: Databricks provides features to retry failed jobs, which is especially important for spot instances that may face termination.
  • Monitor cluster health: Keep track of termination notifications, so you can take action if an instance is about to be reclaimed by the cloud provider.
  • Set timeouts: Configure job timeout settings to prevent jobs from running indefinitely and using up unnecessary resources.
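
Putting the retry and timeout recommendations above together, here is a hedged sketch of a Databricks Jobs API payload that runs a notebook on a spot-backed job cluster; the job name, notebook path, instance type, and limits are illustrative placeholders.

```python
# Sketch of a fault-tolerant job on spot capacity; all values are placeholders.
job_settings = {
    "name": "nightly-etl-on-spot",                                   # hypothetical job name
    "tasks": [
        {
            "task_key": "transform",
            "notebook_task": {"notebook_path": "/Repos/etl/transform"},  # hypothetical path
            "max_retries": 2,                # retry automatically if a spot reclaim kills the run
            "timeout_seconds": 3 * 60 * 60,  # stop runaway runs after 3 hours
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 8,
                "aws_attributes": {
                    "availability": "SPOT_WITH_FALLBACK",  # fall back to on-demand if spot capacity disappears
                    "first_on_demand": 1,                  # keep the driver on an on-demand instance
                },
            },
        }
    ],
}
```

Azure Databricks exposes analogous settings under azure_attributes.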

On-demand clusters

On-demand clusters, on the other hand, provide guaranteed availability of resources and are ideal for mission-critical production workloads that cannot afford interruptions. With on-demand clusters, you pay for the resources you use, but you’re assured that the computing power will be available whenever you need it. These clusters are the most reliable option for production environments and time-sensitive tasks.

When to choose on-demand clusters:

  • Production workloads: On-demand clusters ensure your production workloads run smoothly with minimal downtime.
  • Workloads with strict SLAs: On-demand clusters are ideal for workloads tied to runtime Service Level Agreements (SLAs). You can easily add resources to meet your runtime SLAs; the tricky part is not overprovisioning your clusters with resources you do not need. Gradient solves this common issue.
  • Real-time data processing: If you require real-time analytics or data processing with low-latency requirements, on-demand clusters are the best choice.
  • Interactive workloads: For tasks such as data exploration, machine-learning model development, and collaborative work that require ongoing access to computational resources, on-demand clusters provide the stability and flexibility you need.
  • Business-critical applications: If the data being processed is crucial for decision-making, such as financial transactions, customer analytics, or operations monitoring, on-demand clusters guarantee that the infrastructure is always available.

Best practices for on-demand clusters:

  • Optimize resource utilization: Keep an eye on resource utilization to ensure you are using the appropriate instance types and cluster sizes for your workloads.
  • Establish proper shutdown policies: Automatically shut down clusters when not in use to avoid incurring unnecessary costs.
  • Leverage cluster pools: Cluster pools help reduce the time it takes to provision clusters by reusing existing resources, which is especially useful in environments with frequent cluster creation or single-minute start-up latency requirements for jobs.
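
As a sketch of the last two recommendations in code, an on-demand cluster might combine an idle auto-termination window with a cluster pool; the pool ID and timing values below are placeholders.

```python
# Sketch of an on-demand cluster with automatic shutdown and a pre-warmed pool.
on_demand_cluster = {
    "cluster_name": "prod-reporting-cluster",        # hypothetical name
    "spark_version": "14.3.x-scala2.12",
    "num_workers": 6,
    "autotermination_minutes": 30,                    # shut down after 30 idle minutes
    "instance_pool_id": "pool-1234567890abcdef",      # hypothetical pool of pre-warmed instances
    "aws_attributes": {"availability": "ON_DEMAND"},  # guaranteed capacity for SLA-bound work
}
# Note: with a pool, the node type comes from the pool, so node_type_id is omitted.
```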

Related case study:
Learn how an AdTech company saved 300 eng hours and $10K with Gradient

Cluster types

Databricks clusters are designed for specific use cases, with the two main categories being All-Purpose Compute and Jobs Compute.

All-Purpose Compute clusters

All-purpose compute (APC) clusters are designed to support interactive tasks, such as exploratory data analysis, development, and real-time model training. These clusters stay running until manually terminated, providing continuous access to computing resources for tasks that require flexibility and responsiveness. They can also be used as a shared resource, which makes them ideal for research and exploratory processes. 

Key features:

  • Interactive development support: Ideal for data scientists and analysts who need to run notebooks and interact with data on the fly.
  • Multi-user access: Multiple team members can collaborate on the same cluster, making it a great option for team-based work.
  • Manual start/stop: You have full control over when the cluster is running, allowing you to start it for specific tasks and stop it when it is no longer needed.
  • Notebook attachment: You can attach notebooks to an all-purpose cluster to run code interactively.

Ideal use cases:

  • Data exploration: When you need to query data and visualize results interactively, all-purpose compute clusters provide the flexibility you need.
  • Collaborative development: Teams of data scientists, analysts, and engineers can share a cluster for joint data analysis or model development.
  • Prototyping and experimentation: For testing out new ideas, running experiments, or developing machine learning models, an APC cluster offers continuous access to compute power.

Job clusters

Jobs compute clusters, unlike all-purpose compute clusters, are provisioned specifically to execute automated, batch workloads. Once the job is complete, the cluster is automatically terminated, making this type of cluster cost-effective for scheduled tasks and periodic data processing jobs. In fact, we found that APC clusters can cost up to 50% more than Job clusters, for the same batch workload. 

Key features:

  • Ephemeral clusters: These clusters are created on-demand for a specific job and automatically terminated once the job completes.
  • Optimized for batch processing: Ideal for running tasks that don’t require interactive analysis, such as ETL processes, large-scale data transformations, or machine learning model training.
  • Job-specific configurations: You can configure job clusters with the specific resources needed for each job, making them more efficient than APC clusters for repetitive tasks.
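
As a hedged sketch, a scheduled job with an ephemeral, job-scoped cluster might look like the following; the job name, schedule, notebook path, and sizing are illustrative placeholders.

```python
# Sketch of a scheduled job whose cluster exists only for the duration of each run.
scheduled_job = {
    "name": "daily-report",                       # hypothetical job name
    "schedule": {
        "quartz_cron_expression": "0 0 5 * * ?",  # every day at 05:00
        "timezone_id": "UTC",
    },
    "job_clusters": [
        {
            "job_cluster_key": "report_cluster",
            "new_cluster": {                      # created when the run starts,
                "spark_version": "14.3.x-scala2.12",  # terminated when it finishes
                "node_type_id": "i3.xlarge",
                "num_workers": 4,                 # sized for this job only
            },
        }
    ],
    "tasks": [
        {
            "task_key": "build_report",
            "job_cluster_key": "report_cluster",  # run on the ephemeral cluster above
            "notebook_task": {"notebook_path": "/Repos/reporting/build"},  # hypothetical path
        }
    ],
}
```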

Ideal use cases:

  • ETL pipelines: For processing and transforming large datasets in a scheduled manner, job compute clusters are an efficient and reliable choice.
  • Model training: Job clusters can run training jobs for machine learning models without requiring continuous monitoring.
  • Data processing jobs: Batch processing tasks that don’t require interactive analysis benefit from job clusters’ cost efficiency.
  • Scheduled report generation: Automate the generation of reports and dashboards based on predefined schedules.

Cluster sizing

Selecting the right cluster size is one of the most important aspects of configuring your Databricks cluster. Unfortunately, the data cloud doesn’t make it easy. There is a wide variety of cluster size options and combinations available. Some are optimized for storage, others for memory, compute, GPU acceleration, or general purpose. 

The size of the cluster should depend on the complexity of the workload and the amount of data being processed. However, we’ve found that more often than not an engineer will spin up a cluster that is “large enough” to support a workload, prioritizing reliability over cost efficiency. The issue is that once the cluster is configured and the job starts running, the configuration will not likely be reviewed soon, unless the job is one of the expensive ones, or it starts to fail. This situation at scale could cost an organization dearly. 

Getting a better understanding of the options available is the first step to combating the issue. 

Single-node clusters:

Best for: Testing, learning, and small-scale data processing

Advantages: Cost-effective, easy to manage, and all processing happens on one machine (which can be significantly faster than distributed computing when the overhead of moving data between nodes matters).

Limitations: Limited processing power, cannot handle large datasets or complex workloads, not designed for shared use.

Multi-node clusters:

Small clusters (2-8 Nodes)

Best for: Development, testing, and small-scale data processing.
Advantages: More affordable, easier to manage.
Limitations: Limited processing power, and possible memory constraints.

Medium clusters (8-32 Nodes)

Best for: Production workloads, and medium-sized datasets.
Advantages: A good balance of performance and cost.
Limitations: Higher operational costs compared to smaller clusters.

Large clusters (32+ Nodes)

Best for: Big data processing and complex analytics.
Advantages: High scalability and performance for large datasets and intensive workloads.
Limitations: Higher cost, requires careful resource management to avoid inefficiencies.

Advanced features

Databricks offers a couple of advanced features built to help users optimize their compute.

Photon: 

Photon is Databricks’ next-generation query engine that improves workload performance through native vectorized execution. With Photon, you can significantly reduce query times and improve overall performance. However, it is more expensive, typically consuming roughly 2x the DBUs. Photon will usually be faster than non-Photon, but whether it is also cheaper depends on how much faster it is relative to the cost increase.

Best practices for Photon:

  • A/B test your jobs with and without Photon enabled to compare performance and cost differences and make job-specific decisions.
  • Monitor and optimize your jobs’ configurations to ensure that you’re getting the most out of Photon.
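
For the A/B test suggested above, the only cluster-level change between the two runs is the runtime engine. A hedged sketch, with runtime version and sizing as illustrative placeholders:

```python
# Photon run vs. non-Photon control run of the same workload; values are placeholders.
photon_cluster = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 8,
    "runtime_engine": "PHOTON",  # switch to "STANDARD" for the control run
}
```

Comparing runtime and DBU cost between the two runs tells you whether Photon’s speed-up outweighs its higher DBU rate for that particular job.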

Autoscaling: 

Autoscaling automatically adjusts the size of your cluster based on workload demand, ensuring that you have the resources to complete your job. But are you overprovisioning? This is something autoscaling doesn’t help with. We compared Gradient optimized job clusters to job clusters running autoscaling and found that Gradient outperformed autoscaling for cost by roughly 40%.

Best practices for autoscaling:

  • A/B test your jobs with and without autoscaling enabled to identify jobs that are prime candidates for autoscaling.
  • Set min and max worker limits to prevent excessive scaling and ensure cost control (see the sketch after this list). This often-overlooked recommendation is crucial: think of these limits as guardrails against overprovisioning.
  • Regularly review cluster performance to fine-tune auto-scaling settings and ensure optimal efficiency.
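
A minimal sketch of autoscaling with guardrails, assuming the standard autoscale block in a cluster definition; the bounds below are illustrative and should be tuned per workload.

```python
# Instead of a fixed num_workers, the cluster scales between explicit bounds.
autoscaling_cluster = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {
        "min_workers": 2,    # floor: keeps small runs cheap
        "max_workers": 10,   # ceiling: caps overprovisioning
    },
}
```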

Comparison table: APC clusters vs Job clusters

| Feature | All-Purpose Compute | Jobs Compute |
| --- | --- | --- |
| Purpose | 💬 Interactive tasks | 🤖 Automated batch jobs |
| Lifecycle | 🖐️ Manual start/stop | 🌫️ Ephemeral |
| Access | 👥 Multi-user | 👤 Single-use |
| Use Cases | 🔍 Data exploration and prototyping | 📊 ETL and model training |
| Resource Configuration | 🔄 Flexible | 🎯 Job-specific |
| Cost Efficiency | 💸 Higher cost | 💵 Cost-effective |
| Interactivity | ⚡ High | 💤 Low |

All-purpose clusters are designed for flexibility and collaboration, making them ideal for interactive tasks such as data exploration and development. They remain active until manually terminated, allowing multiple users to work simultaneously on the same cluster. This setup is particularly beneficial for teams that require ongoing access to computational resources.

In contrast, Jobs Compute clusters are optimized for running automated, batch workloads. These clusters are ephemeral, created specifically for a job and terminated once the job is completed. This makes them a cost-effective choice for scheduled tasks like ETL processes and model training, where interactivity is not required.

By configuring resources according to job specifications, organizations can achieve greater efficiency and save costs. This is precisely what Gradient helps organizations with, automatically ensuring their data infrastructure is as efficient as possible.

Conclusion

Choosing the right cluster configuration for your data pipelines in Databricks is a critical decision that can impact the performance, cost, and reliability of your data infrastructure. By understanding the different cluster types, pricing models, and configuration options available, you can make informed decisions that line up with your organization’s needs. When configuring a cluster for Databricks, remember to consider your workload characteristics, business requirements, and technical constraints.

Whether you’re using spot instances for cost efficiency, on-demand clusters for reliability, or optimizing your setup with advanced features like Photon and autoscaling, Databricks provides you with granular options to manage your data processing workloads. They might even be too granular for most users.

If you find the sheer volume of options confusing, or simply don’t have the time for the trial and error process of tuning a cluster to your needs, we can help. Gradient is a purpose-built compute optimization system. It uses advanced, self-improving algorithms to offer 100% custom cluster optimizations based on the unique characteristics of your data pipelines and requirements.

Interested in seeing Gradient in action?

Book a time to learn more and chat about your data needs!