May 2024 Release Notes

release notes

April showers bring May product updates! Take a look at Sync’s latest product releases and features. 💐

The Sync team is heading to San Francisco for the Databricks Data+AI Summit 2024! We’ll be at Booth #44 talking all things Gradient with a few new surprise features in store.

Want to get ahead of the crowd? Book a meeting with our team before the event here.

Download our Databricks health check notebook

Have you taken advantage of our fully customizable health check notebook yet?

With the notebook, you’ll be able to answer questions such as:
⚙️ What is the distribution of job runs by compute type?
⚙️ What does Photon usage look like?
⚙️ What are the most frequently used instance types?
⚙️ Are APC clusters being auto-terminated or sitting idle?
⚙️ What are my most expensive jobs?

The best part? It’s a free tool that gives you actionable insights so you can work toward optimally managing your Databricks jobs clusters.

Head here to get started.

Apache Airflow Integration

Apache Airflow for Databricks now integrates directly with Gradient. Via the Sync Python library, users can integrate Databricks pipelines that are orchestrated with third-party tools like Airflow.

To get started, simply integrate your Databricks workspace with Gradient via the Databricks Workspace Integration. Then, configure your Airflow instance and ensure that the syncsparkpy library has been installed using the Sync CLI.

Take a look at an example Airflow DAG below:

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from airflow.utils.dates import days_ago

from sync.databricks.integrations.airflow import airflow_gradient_pre_execute_hook

default_args = {
    'owner': 'airflow'
}

with DAG(
    dag_id='gradient_databricks_multitask',
    default_args=default_args,
    start_date=days_ago(1),
    schedule_interval=None,
    # Gradient configuration passed through to the pre-execute hook
    params={
        'gradient_app_id': 'gradient_databricks_multitask',
        'gradient_auto_apply': True,
        'cluster_log_url': 'dbfs:/cluster-logs',
        'databricks_workspace_id': '10295812058',
    },
) as dag:

    def get_task_params():
        # Cluster and task definition for the Databricks run; the values
        # below are illustrative -- substitute your own notebook and cluster config
        task_params = {
            'new_cluster': {
                'spark_version': '13.3.x-scala2.12',
                'node_type_id': 'i3.xlarge',
                'num_workers': 2,
                'cluster_log_conf': {'dbfs': {'destination': 'dbfs:/cluster-logs'}},
            },
            'notebook_task': {'notebook_path': '/path/to/notebook'},
        }
        return task_params

    notebook_task = DatabricksSubmitRunOperator(
        task_id='notebook_task',
        json=get_task_params(),
        pre_execute=airflow_gradient_pre_execute_hook,
    )


And voila! After implementing your DAG, head to the Projects dashboard in Gradient to review recommendations and make any necessary changes to your cluster config.

Take a look at our documentation to get started.

What is Declarative Computing?

The problem today

In the world of cloud computing, an echo of the old server days still haunts us – manual compute resource selection.

At a very high level, there are always two pieces of information you need to provide to a compute cluster before your job can run:

  1.  Your code / data
  2.  Compute resources (e.g. warehouse size, instance types, memory, number of workers)

Some examples of popular platforms with recurring jobs and their basic infrastructure choices are shown in the table below. In reality, many of these systems have substantially more options and configurations, but our goal is to highlight the fundamental computing-specific knobs.

Example cloud platforms and their tuning knobs (links provided)

And the parameters that actually matter to any company – cost, runtime, latency, accuracy – are always outputs of the current system. The figure below shows the general high-level relationship.

While this is pretty much the gold standard of how computing is operated today, it has resulted in a few well-known pain points in the cloud:

  1. Costs are too high – Over-provisioning resources and forgetting about them leads to exorbitant waste for companies, perhaps even calling into question the ROI of the cloud.
  2. Unable to manually tune at scale – If an engineer wants to change the performance of a workload, it requires manual tuning of resources. This does not scale and is impossible when managing thousands of workloads.
  3. Performance goals can be missed – Data sizes grow, code changes, and the computing infrastructure is rarely constant. This can lead to fluctuating runtimes that disrupt performance goals such as runtime, accuracy, or latency.

Declarative Computing – Outputs as Inputs

What if we flipped the story around?  What if we submitted the goals of the cluster we want vs. the resources? 

If we did, it would look something like this:

Where a user would input the high level performance goals, and then somehow a magical system would figure out the perfect hardware requirements to meet those goals.

In the above example, the goal was to minimize costs, while hitting a runtime of 1 hour and a latency of 100ms.  The actual performance of the job was $50, with a runtime of 1 hour and a latency of 100 ms.

Much like the more well known concept of “declarative programming”, the goal of declarative computing is to describe what you want your compute infrastructure to do, not how – and to rely on some sort of “compiler” to figure out how.

Introducing declarative computing — a method to provision cloud computing infrastructure based on only the desired performance outcomes, such as cost, runtime, latency, accuracy etc…
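To make this concrete, a declarative job spec might look something like the sketch below. This is a hypothetical format – the field names are purely illustrative, not any real product's API – encoding the goals from the example above (minimize cost, 1-hour runtime, 100ms latency):

```json
{
  "job": "nightly_etl",
  "goals": {
    "objective": "minimize_cost",
    "max_runtime": "1h",
    "max_latency_ms": 100
  }
}
```

Note what's absent: no instance types, no worker counts, no memory settings. Those become the system's job to figure out.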

So why doesn’t this exist already?

The reason declarative computing doesn’t exist today is that it is extremely difficult to predict the output performance of a workload based on its code and data alone.

Doing so with no prior data, generally across millions of workloads, is arguably impossible (except perhaps in very constrained situations).

Furthermore, cloud computing infrastructure is a moving object, with updates being pushed daily to various parts of the stack.  The ability to accurately predict performance in a changing landscape is an incredibly hard problem to solve.

So how do we make this a reality?

The concept of declarative computing is quite straightforward, so how would it work?

At a high level, what is missing in today’s ecosystem is a closed feedback loop, where the output performance of the job is used to train a machine learning model that can improve the infrastructure through iterative predictions.

The cartoon image below shows the basic idea, where a production workload gets cheaper, towards a target cost over time.  

Although costs are shown here, you can swap out the target for any other quantitative metric, such as runtime, latency, accuracy, you name it, you got it.  

The big roadblock here is, what is that machine learning algorithm to make this a reality?
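As a toy sketch of such a closed feedback loop (the cost model and update rule below are invented purely for illustration – a real system would train a predictive model on observed production runs):

```python
def run_job(num_workers):
    """Toy cost model (illustrative numbers, not real cloud pricing):
    runtime shrinks with more workers, but the hourly charge rate grows."""
    runtime_hr = 2.0 / num_workers + 0.2
    charge_rate = 0.4 * num_workers + 0.5  # $/hr
    return runtime_hr * charge_rate

def feedback_loop(initial_workers, target_cost, max_iterations=30):
    """Closed loop: observe each run's cost, then nudge the configuration
    toward the cheaper neighboring config until the target is met."""
    workers = initial_workers
    history = []
    for _ in range(max_iterations):
        cost = run_job(workers)
        history.append(cost)
        if cost <= target_cost:
            break
        # The "model" here is trivial: probe neighboring configs and move
        # to the cheapest one. A real optimizer would predict, not probe.
        candidates = [max(1, workers - 1), workers, workers + 1]
        workers = min(candidates, key=run_job)
    return history

history = feedback_loop(initial_workers=16, target_cost=1.55)
```

Each iteration corresponds to one production run, and the cost of successive runs walks down toward the target – the same qualitative behavior as the cartoon above.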

How is this different from autoscaling?

Most autoscaling algorithms are governed by rules and policies as they react to some utilization metric.  For example, add more nodes if utilization crosses 80%, remove nodes if utilization falls below 20%.
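A typical policy of this kind can be sketched in a few lines (the thresholds are the illustrative ones from above):

```python
def autoscale(current_nodes, utilization, min_nodes=2, max_nodes=32):
    """Threshold-based autoscaling: react to a utilization metric only.
    Note that no cost or runtime goal appears anywhere in this logic."""
    if utilization > 0.80:
        return min(current_nodes + 1, max_nodes)  # scale out above 80%
    if utilization < 0.20:
        return max(current_nodes - 1, min_nodes)  # scale in below 20%
    return current_nodes
```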

While this does a good job of ensuring that utilization is high, it doesn’t actually solve the problem of driving your infrastructure to hit a certain performance goal.  It simply keeps utilization high, but how does utilization connect to cost or runtime goals?  

This disconnect between infrastructure and business goals  is the inherent difference between autoscaling and what “declarative computing” aims to do.  Declarative computing aims to optimize towards actual business metrics, not some intermediate compute metric.

Doesn’t serverless do this already?

Serverless is more of an abstraction layer, where the serverless provider makes choices for you without your goals in mind. One big business difference is that the service provider is trying to maximize their profits while keeping the cost to users the same.

For example, let’s say you run a job on a “serverless” product, and it costs you $100. Behind the scenes it costs the service provider $70 to run the job. Perhaps 6 months later, some hot new employee at the cloud provider comes up with a better way to run the job and drops the cost to $30.

In this scenario, the cost to the user stays at $100 while the service provider's cost drops to $30. Their margins increase significantly, and your costs stay the same. Basically, the service provider gets to reap all the benefits of optimization, and may not pass on the savings to you.

Serverless helps solve the problem of “it’s annoying to provision infrastructure” but doesn’t address the problem of pushing your infrastructure to achieve certain performance goals.

On a technical note – one key aspect to making declarative computing a reality is tracking the same job over time, which is not a trivial task. What counts as the same job? If many different data sources feed into the same job, is each data source a different “job”?

This type of tracking requires another level of coordination between the user and the compute stack to enable – which is not present in serverless solutions today.

Declarative Computing for Databricks

The concept of declarative computing isn’t just science fiction anymore. At its core, it’s what we’re building with our product Gradient. Today, Gradient focuses on Databricks clusters, letting users input their cost and runtime goals while we figure out the right cluster to meet them.

Below is a screenshot of a real user’s production Databricks cluster cost and runtime performance run with the Gradient solution.  

Each point on the graph represents a production run of this user’s job. With the feedback loop in place, the system was able to substantially reduce both cost and runtime by optimizing the compute infrastructure and cluster settings on AWS.

These changes occurred automatically as the Gradient solution trains and applies its prediction model to this particular workload.


The big vision of what we’re trying to achieve here at Sync goes well beyond Databricks clusters.  The concept behind “Declarative Computing” is general and can apply to any repeat workload.  For example, any ETL job, Open source Spark, EMR on AWS, Serverless functions, Fargate, ECS, Kubernetes, GPUs, and any other system that runs scheduled jobs.

If you’d like to learn more, or think this concept could be applied to your infrastructure, please feel free to reach out!  We’d love to talk shop about all things cloud infrastructure.

Book a demo with us to see for yourself!

April 2024 Release Notes


Our April releases are here! Take a look at Sync’s latest product updates and features.

Sync’s Databricks Workspace health check is now self-serve and available as a notebook that you simply download and run on your own.

With the notebook, you’ll be able to answer questions such as:

⚙️ What is the distribution of job runs by compute type?
⚙️ What does Photon usage look like?
⚙️ What are the most frequently used instance types?
⚙️ Are APC clusters being auto-terminated or sitting idle?
⚙️ What are my most expensive jobs?

The best part? It’s a free tool that gives you actionable insights so you can work toward optimally managing your Databricks jobs clusters. Head here to get started.

Hosted Log Collection for Microsoft Azure

You’re now able to easily onboard your Databricks jobs on Azure. With Sync-hosted collection within Gradient, users are able to minimize onboarding errors with a “low-touch” integration process.

Want to give new features a try and learn more about the latest Gradient updates? Get started for free here.

Job Metrics Timeline View

Track custom Spark and Gradient metrics for your projects directly from the Gradient dashboard. With this enhanced view, you’re able to visualize metrics like core hours, number of workers, input data, and more!

Login to Gradient now to get started.

Sync’s Health Check for Databricks Workspaces

Whether you’re a data engineer, a manager of a data team, or an executive overseeing a data platform, your focus might be on growth, and to continue to build and innovate. However, this may come at the expense of ballooning costs that are getting harder and harder to get under control. This ultimately leads to a point where you need to make some tough cost-cutting decisions — like migrating to a less expensive platform — or even tougher decisions — like laying off part of your team.

Our data platform costs are increasing 20% MoM. How do we reduce our costs and get our budget under control?

Senior Data Engineer at a martech company

Can you help us get a better understanding of how we’re using Databricks? We want to get our costs under control but we don’t know where to start.

Staff Project Manager at a large pharma company

What is the Health Check?

Sync Computing’s Health Check for Databricks Workspaces is a Databricks notebook that runs entirely within your Databricks environment. It provides you with a detailed report on findings and actions that help in reducing spend, as well as lead to a deeper understanding of your use cases, patterns, and practices in Databricks.

How stable are job runs?

What is the distribution of job runs by compute type?

What does Photon usage look like?

What are the most frequently used instance types?

Are clusters being auto-terminated or sitting idle?

What are my most expensive jobs?

Sync Health Check provides answers to all the above questions, and more!

We’ll cover a few of these questions in this blog post to demonstrate how Health Check can help get you the data you need to make informed decisions when it comes to your Databricks usage.

How do I get it?

The health check for Databricks workspaces is a free tool anyone can download by following the link below:

Request health check notebook download here

We do ask for your contact information so we can follow up to see if the notebook was useful and to receive any feedback.  We’d also love to hear ideas on any new analysis we can add!

Without further delay, let’s dive into what’s in the health check and how it can be a useful tool:

Job Run Stability

Jobs with low stability and failed runs cost money but don’t drive any business value. These jobs may also be preventing you from meeting your SLAs and causing thrash in your teams. Health Check shows you how many of your job runs result in success and how many result in failure. This insight helps you prioritize actions if the failures are costing you more than the business value they’re driving.

Actionable insight: Prioritize fixing or pausing the jobs with high failure rates to save costs, deliver on SLAs, and reduce team thrash.

Jobs by Compute Type

Databricks offers several compute options to run your workload. For example, Jobs Compute clusters are best suited for jobs that run on a schedule while All-Purpose clusters are best suited for ad hoc analysis. We’ve seen many cases where users run scheduled jobs on All-Purpose clusters primarily to circumvent cluster spin up and spin down times. However, All-Purpose clusters come at a higher cost (at least 2.5x the cost of Jobs Compute clusters!!).

Actionable insight: Migrate to cheaper Jobs Compute clusters and establish clear policies to grant exceptions to use All-Purpose Compute clusters.

Photon Usage

Photon may well deliver extremely fast query performance. However, whether or not it’s delivering ROI depends on the performance gain compared to the cost increase of using Photon. Note that Photon is not free and typically carries a 2x DBU cost increase compared to non-Photon. For more information and details, check out our blog on whether Databricks clusters with Photon and Graviton instances are worth it.

Actionable insight: Compute the ROI you’re getting out of Photon.
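One way to frame that calculation (a rough sketch – actual DBU rates vary by workload type and pricing tier; the 2x multiplier is the rule of thumb noted above):

```python
def photon_worth_it(runtime_hr, photon_runtime_hr, dbu_rate, ec2_rate,
                    photon_dbu_multiplier=2.0):
    """Rough ROI check: Photon pays off only if the runtime it saves
    outweighs the higher DBU charge rate. Rates are $/hr equivalents."""
    base_cost = runtime_hr * (dbu_rate + ec2_rate)
    photon_cost = photon_runtime_hr * (photon_dbu_multiplier * dbu_rate + ec2_rate)
    return photon_cost < base_cost
```

For example, a job that Photon speeds up from 2 hours to 0.8 hours comes out ahead, while a modest speedup to 1.8 hours does not.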

Most Frequently used Instance Types

As the subtitle suggests, this shows you the most commonly used instances. The types of instances used may change over time, as the needs of your business change. Being able to track the trends in instance types being used enables your business to remain agile and respond quickly to changing needs – such as efficiently managing your reserved instances.

Actionable insight: Drive better alignment between Databricks instance usage and your organization’s preferred instances.


Idle Clusters and Auto-Termination

Clusters with no auto-termination, or long auto-termination windows, continue to accrue costs when they’re idle. This is generally the case with All-Purpose compute clusters, and the wasted spend could be avoided with better policies on auto-termination. Additionally, as more of these clusters are spun up, the total idle time keeps increasing. Jobs Compute clusters, on the other hand, are terminated right after the job completes, so there’s generally no waste related to idle time.

Actionable insight: Set auto-termination to a minimum for your clusters and establish clear policies to grant exceptions while encouraging cluster re-use.

Most Expensive Jobs

Health Check shows you your most expensive jobs based on DBUs alone. This is only part of the picture, but when weighed against the business value these jobs drive, you can determine whether you’re getting ROI out of them. If you’ve determined that a job is high value, the next step is to increase ROI through rightsizing. A major cause of bloated costs is over-provisioning, where you keep paying for underutilized resources.

Actionable insight: Determine if these jobs are high value and whether there’s opportunity to rightsize the compute to move the needle on ROI.

Wrapping Up

Sync’s Health Check provides deep insights into how your organization uses Databricks, and shines light on areas where there is opportunity to improve.

Feel free to reach out to us! We’d love to hear your feedback on how the Sync Health Check worked for you, and where there’s room for improvement. You can reach us here or send an email to our support team.

March 2024 Release Notes


Our team has been hard at work to deliver industry-leading features to support users in achieving optimal performance within the Databricks ecosystem. Take a look at our most recent releases below.

Worker Instance Recommendations

Introducing Worker Instance Recommendations directly from the Sync UI. With this feature, you can tap into cluster configuration recos optimized for individual jobs.

The instance recos within Gradient optimize not only the number of workers, but also the worker size. For example, if you are using i3.2xl instances, Gradient will find the right instance size (such as i3.xl, i3.4xl, i3.8xl, etc.) within the i3 family.

Instance Fleet Support

If your company is using Instance Fleet Clusters, Gradient is now compatible!  There are no changes required on the user flow, as this feature is automatically supported in the backend.  Just onboard your jobs like normal into Gradient, and we’ll handle the rest.

Hosted Log Collection

Running Gradient is now more streamlined than ever! You can now opt into log collection hosted entirely in the Sync environment, choosing between Sync-hosted and user-hosted collection options. What does this mean? There are no extra steps or external clusters needed to run Gradient – Sync does all the heavy lifting while minimizing the impact on your Databricks workspace.

With hosted DBX log collection within Gradient, you’re able to minimize onboarding errors due to annoying permission settings while increasing visibility into any potential collection failures, ultimately giving you and your team more control over your cluster log data.

Getting Started with Collection Setup
The Databricks Workspace integration flow is triggered when a user clicks on Add → Databricks Workspace after they have configured their workspace and webhook. Users will also now have a toggle option to choose between Sync-hosted (recommended) or User-hosted collection.

  • Sync-hosted collection – The user will be optionally prompted to share their preference for cluster logs stored for their Databricks Jobs. This will initially be an immutable setting saved on the Workspace.
    • For AWS – Users will need to add a generated IAM policy and IAM role to their AWS account. The IAM policy allows us to call ec2:DescribeInstances and ec2:DescribeVolumes, and optionally s3:GetObject and s3:ListBucket on the specific bucket and prefix to which cluster logs are uploaded. The S3 permissions are optional because cluster logs may instead be recorded to DBFS. The user also needs to add a trust relationship to the IAM role, granting our Sync IAM role permission to sts:AssumeRole using an ExternalId we provide. Gradient generates this policy and trust relationship for the user in JSON format to be copied and pasted.
    • For Azure – Coming soon!
  • User-hosted collection – For both Azure and AWS, integration proceeds per the normal workspace integration requirements.
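For the AWS flow above, the generated trust relationship takes roughly the following shape (the account ID, role name, and ExternalId here are placeholders, not real values – Gradient generates the actual JSON for you):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111111111111:role/sync-gradient-role" },
      "Action": "sts:AssumeRole",
      "Condition": { "StringEquals": { "sts:ExternalId": "example-external-id" } }
    }
  ]
}
```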

Stay up to date with the latest feature releases and updates at Sync by visiting our Product Updates documentation.

Ready to start getting the most out of your Databricks job clusters? Request a demo or reach out to our team.

Why Your Databricks Cluster EBS Settings Matter

Sean Gorsky & Cayman Williams

Figure 1: Point comparison between the cost and runtime of a Databricks Job run using the default EBS settings and Sync’s optimized EBS settings. More details about the job used to create this data can be found in the lower-left plot in Figure 4.

Choosing the right hardware configuration for your Databricks jobs can be a daunting task. Between instance types, cluster size, runtime engine, and beyond, there are an enormous number of choices that can be made. Some of these choices will dramatically impact your cost and application duration; others may not. It’s a complicated space to work in, and Sync is hard at work making the right decisions for your business.

In this blog post, we’re going to dig into one of the more subtle configurations that Sync’s Gradient manages for you. Its subtlety comes from being squirreled away in the “Advanced” settings menu, but the setting can have an enormous impact on the runtime and cost of your Databricks Job. The stark example depicted in Figure 1 is the result of Sync tuning just this one setting. That setting – really a group of settings – is the EBS volume settings.

EBS on Databricks

Elastic Block Storage (EBS) is AWS’s scalable storage service designed to work with EC2 instances. An EBS volume can be attached to an instance and serves as disk storage for that instance. There are several types of EBS volumes, three of which are relevant to Databricks:

  1. st1 volumes are HDDs used in Databricks storage autoscaling
  2. gp2 volumes are SSDs; the user selects the volume count and volume size
  3. gp3 volumes are similar to gp2, but additional throughput and IOPS can be paid for separately

Apache Spark may utilize disk space, including for disk caching, disk spillage, or as intermediate storage between stages. Consequently, EBS volumes are required to run your Databricks cluster if there is no disk-attached (NVMe) storage. However, Databricks does not require a user to specify EBS settings. They exist, squirreled away in the Advanced menu of cluster creation, but if no selection is made then Databricks will automatically choose settings for you.

Figure 2: Screenshot of Databricks’ “Advanced” options on the Compute tab, showing the EBS gp2 volume options. If your workspace is on gp3 you can also tune the IOPS and Throughput separately, though this option is not enabled in the interface (it is possible through the API or by manipulating the cluster in the UI’s JSON mode)

The automatic EBS settings depend on the size of the instance chosen, with bigger instances getting more baseline storage according to AWS’s best practices. While these baseline settings are sufficient for running applications, they are often suboptimal. The difference comes down to how EBS settings impact the throughput of data transfer to and from the volumes.

Take for example the gp2 volume class, where the volume IOPS and throughput are direct functions of the size of the volume. The bigger the volume size, the faster you can transfer data (up to a limit). There’s additional complexity beyond this, including bandwidth bursting and instance bandwidth limits.
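For gp2 specifically, the baseline IOPS scaling can be sketched as follows (per AWS’s published gp2 behavior: 3 IOPS per GiB, floored at 100 and capped at 16,000; burst credits and instance-level limits add further complexity on top of this):

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a gp2 EBS volume: 3 IOPS/GiB,
    with a 100 IOPS floor and a 16,000 IOPS ceiling."""
    return max(100, min(3 * size_gib, 16_000))
```

So a 1,000 GiB gp2 volume gets 3,000 baseline IOPS, while anything from roughly 5.3 TiB up is pinned at the 16,000 cap.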

So how does Sync address this problem?

Laying the Groundwork

Sync has approached this problem the same way we’ve approached most problems we tackle — mathematically. If you get way down in the weeds, there’s a mathematical relationship between the EBS settings (affecting the EBS bandwidth), the job duration, and the job cost.

The following formula shows the straightforward relationship between the EBS settings (S), the application Duration [hr], and the various charge rates [$/hr]. For clarity we write the Duration only as a function of S, but in reality it depends on many other factors, such as the compute type or the number of workers.
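The formula (shown as an image in the original post) presumably takes a form along these lines, with the hourly charge rates split into DBU, EC2, and EBS components:

```latex
\mathrm{Cost}(S) \;=\; \mathrm{Duration}(S) \times \big( R_{\mathrm{DBU}} + R_{\mathrm{EC2}} + R_{\mathrm{EBS}}(S) \big)
```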

At first glance this equation is straightforward. The EBS settings impact both the job duration and the EBS charge rate. There must be some EBS setting where the decrease in duration outweighs the increase in charge rate to yield the lowest possible cost.

Figure 3 exemplifies this dynamic. In this scenario we ran the same Databricks job repeatedly on the same hardware, only tuning the EBS settings to change each instance’s effective EBS throughput. An instance’s EBS throughput is the sum of the throughputs of the attached EBS volumes (ThroughputPerVolume*VolumesPerInstance), up to the maximum throughput allowed by the instance (MaxInstanceThroughput). This leads to a convenient “Normalized EBS Throughput” defined as ThroughputPerVolume*VolumesPerInstance/MaxInstanceThroughput, which we use to represent the instance EBS bandwidth.
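These definitions translate directly into code (mirroring the formulas in the text; the numbers in the usage comments are illustrative):

```python
def normalized_ebs_throughput(throughput_per_volume, volumes_per_instance,
                              max_instance_throughput):
    """Normalized EBS throughput as defined above:
    ThroughputPerVolume * VolumesPerInstance / MaxInstanceThroughput.
    Values above 1.0 mean the volumes could deliver more than the
    instance-level EBS bandwidth cap allows."""
    return throughput_per_volume * volumes_per_instance / max_instance_throughput

def effective_instance_throughput(throughput_per_volume, volumes_per_instance,
                                  max_instance_throughput):
    """The instance realizes the sum of volume throughputs,
    capped at the instance maximum."""
    return min(throughput_per_volume * volumes_per_instance,
               max_instance_throughput)

# e.g. four 250 MB/s volumes on an instance capped at 1,000 MB/s -> normalized 1.0;
# eight such volumes -> normalized 2.0, but effective throughput stays at 1,000 MB/s
```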

Figure 3: (left) Application duration vs normalized EBS throughput, defined as ThroughputPerVolume*VolumesPerInstance/MaxInstanceThroughput. Increasing throughput reduces runtime with diminishing returns, and increasing throughput beyond 1.0 (the maximum throughput allowed by the instance) has no effect on the application duration. (right) Total cluster cost vs normalized EBS throughput. Since EBS contributes to the cost rate of the cluster, the optimal cost corresponds to a throughput value below the instance maximum.

The plot on the right shows the cost for each point in the left plot. Notably, there’s a cost-optimum at a normalized throughput of ~0.5, well below the instance maximum. This is a consequence of the delicate balance between the cost rate of the EBS storage and its impact on duration. The wide vertical spread at a given throughput is due to the intricate relationship between EBS settings and throughput. In short, there are multiple setting combinations that will yield the same throughput, but those settings do not have the same cost.
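The shape of Figure 3 can be reproduced with a toy model (the functional forms and constants below are invented purely to illustrate the tradeoff, not fit to real data):

```python
def duration_hr(t):
    """Toy model: duration falls with normalized throughput t,
    with diminishing returns."""
    return 1.0 + 0.1 / t

def cluster_cost(t, compute_rate=10.0, ebs_rate_per_unit=4.0):
    """Cost = duration * (fixed compute $/hr + an EBS $/hr term
    that grows with provisioned throughput)."""
    return duration_hr(t) * (compute_rate + ebs_rate_per_unit * t)

# Sweep normalized throughput from 0.05 to 1.0 and find the cost optimum
grid = [i / 100 for i in range(5, 101)]
best = min(grid, key=cluster_cost)
```

With these constants the optimum lands around a normalized throughput of 0.5 – below the instance maximum of 1.0, echoing the behavior seen in the real data.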

Sync’s Solution

The most notable feature in Figure 3 is the smooth and monotonically decreasing relationship between duration and throughput. This is not entirely unexpected, as small changes in throughput ought to yield small changes in duration, and it would be surprising if increasing the throughput also increased the runtime. Consequently, this space is ripe for the use of modeling — provided you have an accurate enough model for how EBS settings would realistically impact duration (wink).

The downside to modeling is that it requires some training data, which means a customer would have to take deliberate steps to collect the data for model training. For GradientML we landed on a happy medium.

Our research efforts yielded a simple fact: immediately jumping to EBS settings that efficiently maximize the worker instance EBS throughput will yield a relatively small increase in the overall charge rate but in most cases results in a worthwhile decrease in run duration. When we first start managing a job, we bump up the EBS settings to do exactly this.

We explore the consequences of this logic in Figure 4, which depicts six different jobs where we compare the impact of different EBS settings on cost and runtime. Every job uses the same cluster consisting of one c5.24xlarge worker. In addition to the “default” and “optimized” settings discussed thus far, we also tested with autoscaled storage (st1 volumes, relatively slow HDDs), and disk-attached storage (one c5d.24xlarge worker instead, this is lightning fast NVMe storage).

The top row consists of jobs that are insensitive to storage throughput, where maximizing the EBS settings did not meaningfully impact cost. In these cases data transfer to and from storage had a negligible impact on the duration of the overall application, so the duration was insensitive to the EBS bandwidth.

The bottom row consists of jobs where this data transfer does meaningfully impact the application duration, and are therefore more sensitive to the throughput. Coincidentally, the disk-attached runs did not show any meaningful cost reduction over the EBS-optimized runs, though this is most certainly not a universal trend.

Figure 4: Several tests to assess the impact of EBS setting on Databricks Job durations. The top row depicts jobs where the EBS choice has a negligible impact on duration and cost. The bottom row depicts jobs which are very sensitive to EBS throughput, indicated by the steep drop in cost of the ebs_optimized and disk_attached bars. Every run uses a single c5.24xlarge worker instance, except for the disk-attached (green) runs, which use one c5d.24xlarge worker.


With the abstraction that is cloud computing, even the simplest offerings can come with a great deal of complexity that impacts the duration and cost of your application. As we’ve explored in this blog post, choosing appropriate EBS settings for Databricks clusters is an excellent illustration of this fact. Fortunately, the smooth relationship between duration and an instance’s EBS throughput lends itself to the powerful tools of mathematical modeling – the kind of thing that Sync eats and breathes. We’ve employed this expertise not only in the analysis in this blog, but in our compute management product GradientML, which manages the compute decisions in Databricks clusters and automatically implements these optimizations on your behalf.