Blog

Why is your Databricks job performance changing over time?

Engineers who run and track production Databricks jobs often see “random” fluctuations in runtime, or performance that drifts slowly over days and weeks.

The immediate questions are usually:

  • “Why is my runtime increasing since last week?”
  • “Is the cost of this job also increasing?”
  • “Is the input data size changing?”
  • “Is my job spilling to disk more than before?”
  • “Is my job in danger of crashing?”

To give engineers and managers more visibility into how their production jobs are performing over time, we just launched a new visualization feature in Gradient that provides quick answers to these questions.

A snapshot of the new visualizations is shown below: on a single page you can see and correlate the various metrics that may be impacting your cost and runtime performance.  In the visualization below, we track the following parameters:

  • Job cost (DBU + Cloud fees)
  • Job Runtime
  • Number of core*hours
  • Number of workers
  • Input data size
  • Spill to disk
  • Shuffle read/write

Why does job performance even change?

The three main reasons we see job performance change over time are:

  1. Code changes – Obviously, with any significant code change, your entire job could behave differently.  Tracking and understanding how those changes impact the business output, however, is less clear.  With these new visualizations, engineers can quickly see the “before and after” impact of any code changes they implement.

  2. Data size changes – When your data grows or shrinks, it can impact your job runtime (and hence costs).  While this makes sense, tracking how it changes over time is much more subtle.  It may vary slowly, or it may arrive in sudden spikes.  Understanding how your data size impacts your runtime is a critical “first check.”

  3. Spot instance revocation – When Spot nodes are randomly pulled during job execution, the impact on runtime can be significant, since Spark essentially has to “recover” from losing a worker.  We’ve seen runtimes go up 2-3X simply from one worker being pulled.  It all depends on when in the job the Spot instance is revoked.  Because this is essentially random, Spot-based jobs can have wildly varying runtimes.


As a side note, we’ve observed that an optimized on-demand cluster can often beat Spot pricing for this very reason.  Over the long haul, a stable and optimized on-demand cluster is better than a wildly varying Spot cluster.

How does this feature differ from Overwatch?

For those in the know, Overwatch is a great open source tool built by Databricks to help plot all sorts of metrics so teams can monitor their jobs.  The main differences and advantages of our visualizations are:

1)  Total cost metrics – Gradient pulls both the DBU and estimated cloud costs for your clusters and shows you the total cost.  Cost data from Databricks only includes DBUs.  While it is possible to wrap in cloud costs with Overwatch, it’s a huge pain to set up and configure.  Gradient does this “out of the box.”

2)  “Out of the box” ready – While Overwatch technically contains the same data that Gradient shows, users still have to write queries to do all the aggregations properly, pulling the task-level or stage-level tables as required.  Overwatch is best considered a “data dump”: users have to wrangle it correctly and do all the dashboarding work in a way that meets their needs.  Our value add is that we do all this legwork for you and present the metrics from day 1.

3)  Trends over time – Gradient aggregates the data to show how all these metrics are changing over time, so users can quickly understand what has changed recently.  Looking at a single snapshot in time is often not very useful; users need to see “what happened before?” to really understand what has changed and what they can do about it.  While this is technically doable with Overwatch, it requires users to do the work of building and collecting the data.  Gradient does this “out of the box.”

How does this help with stability and reliability?

Beyond cost efficiency, stability is often a higher priority than anything else.  Nobody wants a crashed job.  These metrics give engineers “early signals” that their cluster is headed toward a dangerous cliff.  For example, data sizes may start growing beyond the memory of your cluster, which could cause the dreaded “out of memory” error.

Seeing how your performance is trending over time is a critical piece of information users need to help prevent dangerous crashes from happening.

Conclusion

We hope this new feature makes life a lot easier for all the data engineers and platform managers out there.  It comes included with Gradient and is live today!

It probably goes without saying that this feature is in addition to our active optimization solution, which can auto-tune your clusters to hit your cost or runtime goals.  The way we look at it, we’re expanding the value we deliver to users by providing critical metrics.

We’d love your feedback and requests for what other metrics you’d love to see.  Try it out today or reach out for a demo!

Databricks Delta Live Tables 101

Databricks’ DLT offering represents a substantial improvement in the data engineering lifecycle and workflow. By offering a pre-baked, opinionated pipeline construction ecosystem, Databricks has finally started offering a holistic, end-to-end data engineering experience from inside its own product, with superior solutions for raw data workflows, live batching, and a host of other benefits detailed below.

  1. What are Delta Live Tables?
  2. How are Delta Live Tables, Delta Tables, and Delta Lake related?
  3. Breaking Down The Components of Delta Live Tables
  4. When to Use Views or Materialized Views in Delta Live Tables
  5. What are the advantages of Delta Live Tables?
  6. What is the cost of Delta Live Tables?

Since its release in 2022, Databricks’ Delta Live Tables has quickly become a go-to end-to-end resource for data engineers looking to build opinionated ETL pipelines for streaming data and big data. The pipeline management framework is considered one of the most valuable offerings on the Databricks platform, and is used by over 1,000 companies including Shell and H&R Block.

As an offering, DLT begins to look similar to the DBT value proposition, and with a few changes (namely, Jinja templating), DLT may be poised to expand further into what has traditionally been considered DBT’s wheelhouse. DLT is also positioned to begin consuming workloads that were previously handled by multiple separate orchestration, observability, and quality vendors.

In our quest to help customers manage, understand, and optimize their Databricks workloads, we set out to understand the value proposition for both customers and for Databricks. In this post, we break down DLT as a product offering as well as its ROI for customers.

What Are Delta Live Tables?

Delta Live Tables, or DLT, is a declarative ETL framework that dramatically simplifies the development of both batch and streaming pipelines. Concretely, though, DLT is just another way of authoring and managing pipelines in Databricks. Tables are created using the @dlt.table() annotation on top of functions (which return queries defining the table) in notebooks.
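As a rough illustration of that authoring model, here is a minimal sketch of a DLT table definition in a Python notebook. The dataset name ("orders_raw") and column names are placeholders we made up for the example:

```python
import dlt
from pyspark.sql.functions import col

# Hypothetical example: "orders_raw" and the column names are placeholders.
@dlt.table(comment="Orders with invalid amounts filtered out")
def orders_clean():
    # dlt.read() references another dataset defined in the same pipeline.
    return (
        dlt.read("orders_raw")
        .where(col("amount") > 0)
        .select("order_id", "customer_id", "amount")
    )
```

The function name becomes the table name, and the query it returns defines the table's contents; DLT takes care of materializing and updating it.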

Delta Live Tables are built on foundational Databricks technology such as Delta Lake and the Delta file format, and they operate in conjunction with the two. However, whereas those focus on the more “stagnant” portions of the data process, DLT focuses on the transformation piece. Specifically, the DLT framework allows data engineers to describe how data should be transformed between tables in the DAG.

Delta Live Tables are largely used to allow data engineers to accelerate their construction, deployment, and monitoring of a data pipeline. 

The magic of DLT, though, is most apparent for datasets that involve both streaming data and batch processing. Whereas in the past users had to be keenly aware of, and design pipelines for, the “velocity” (batch vs. streaming) of the data being transformed, DLT allows users to push this problem to the system itself. Meaning, users can write declarative transformations and let the system figure out how to handle the streaming or batch components. Operationally, Delta Live Tables add an abstraction layer over Apache Spark (or at least Databricks’ flavor of Spark). This layer provides visibility into the table dependency DAG, allowing authors to visualize what can rapidly become complex inter-table dependencies.

The DAG may look something like this:

Table dependency visualization is just the beginning. DLT provides a comprehensive suite of tools on top of these pipelines that are set up by default. This can include tools such as data quality checks, orchestration solutions, governance solutions, and more.

When executed properly, DLT helps with total cost of ownership, data accuracy and consistency, speed, and pipeline visibility and management. Many say that DLT is Databricks’ foray into the world of DBT, hoping to cannibalize DBT’s offering. As to how that plays out, we’ll just wait and see.

How Are Delta Live Tables, Delta Tables, and Delta Lake Related?

The word “Delta” appears a lot in the Databricks ecosystem, and to understand why, it’s important to look back at the history. In 2019, Databricks publicly announced Delta Lake, a foundational element for storing data (tables) in the Databricks Lakehouse. Delta Lake popularized the idea of a table format on top of files, with the goal of bringing reliability to data lakes. As such, Delta Lake provided ACID transactions, scalable metadata handling, and unified streaming/batch processing to existing data lakes in a Spark API-compatible way.

Tables that live inside of this Delta Lake are written using the Delta Table format and, as such, are called Delta Tables. Delta Live Tables focus on the “live” part of data flow between Delta tables – usually called the “transformation” step in the ETL paradigm. Delta Live Tables (DLTs) offer declarative pipeline development and visualization.

In other words, Delta Table is a way to store data in tables, whereas Delta Live Tables allows you to describe how data flows between these tables declaratively. Delta Live Tables is a declarative framework that manages many delta tables, by creating them and keeping them up to date. In short, Delta Tables are a data format while Delta Live Tables is a data pipeline framework. All are built on the data lakehouse infrastructure of Delta Lake.

Breaking Down The Components Of Delta Live Tables

The core of DLT is the pipeline—the main unit of execution used to configure and run data processing workflows with Delta Live Tables. These pipelines link data sources to target datasets through what’s known as a Directed Acyclic Graph (DAG), and are declared in Python or SQL source files. Delta Live Tables infers the dependencies between these tables, ensuring updates occur in the correct order.

Each pipeline is defined by its settings, such as the notebook, running mode, and cluster configuration. Before processing data with Delta Live Tables, you must configure a pipeline.
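To give a sense of what those settings look like, here is a hedged sketch of creating a pipeline through the DLT Pipelines REST API. The workspace URL, token, notebook path, pipeline name, and cluster size are all placeholders, and the field names shown are illustrative rather than an exhaustive reference:

```python
import requests

# Placeholders: substitute your own workspace URL and personal access token.
workspace_url = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

pipeline_settings = {
    "name": "orders_dlt_pipeline",
    "libraries": [{"notebook": {"path": "/Repos/data-eng/dlt/orders_pipeline"}}],
    "clusters": [{"label": "default", "num_workers": 2}],
    "target": "analytics",   # schema where the pipeline's tables are published
    "continuous": False,     # triggered (batch) mode rather than continuous streaming
}

resp = requests.post(
    f"{workspace_url}/api/2.0/pipelines",
    headers={"Authorization": f"Bearer {token}"},
    json=pipeline_settings,
)
resp.raise_for_status()
print(resp.json())
```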

As an aside, for the developers reading this: for some reason the Databricks SDK defines a Pipeline as a list of clusters, which may either be a preview of features to come or an oversight. We’ll find out soon.

A Delta Live Tables pipeline supports three types of datasets: streaming tables, materialized views, and views. Streaming tables are ideal for ingestion workloads and pipelines that require data freshness and low latency. They are designed for data sources that are append-only.

Supported views can either be materialized views—where the results have been precomputed based on the update schedule of the pipeline in which they’re contained—or views, which compute results from source datasets as they are queried (leveraging caching optimizations when available). Delta Live Tables does not publish views to the catalog, so views can only be referenced within the pipeline in which they’re defined. Views are useful as intermediate queries that should not be exposed to end users or systems. Databricks describes how each is processed in the following table:

Dataset Type | How are records processed through defined queries?
Streaming Table | Each record is processed exactly once. This assumes an append-only source.
Materialized Views | Records are processed as required to return accurate results for the current data state. Materialized views should be used for data sources with updates, deletions, or aggregations, or for change data capture (CDC) processing.
Views | Records are processed each time the view is queried. Use views for intermediate transformations and data quality checks that should not be published to public datasets.

After defining your pipeline settings, you can declare your datasets in DLT using either SQL or Python. These declarations can then trigger an update to calculate results for each dataset in the pipeline.
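To make the three dataset types concrete, here is a hedged Python sketch of one pipeline containing a streaming table, a view, and a materialized view. The storage path, dataset names, and columns are invented for illustration:

```python
import dlt
from pyspark.sql.functions import col

# Streaming table: incremental, append-only ingestion (path is a placeholder).
@dlt.table(comment="Raw click events ingested incrementally")
def events_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://example-bucket/events/")
    )

# View: intermediate transformation, never published outside the pipeline.
@dlt.view(comment="Events with obvious junk filtered out")
def events_valid():
    return dlt.read("events_raw").where(col("user_id").isNotNull())

# Materialized view: precomputed result, refreshed on each pipeline update.
@dlt.table(comment="Event counts per user")
def events_per_user():
    return dlt.read("events_valid").groupBy("user_id").count()
```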

When to Use Views or Materialized Views in Delta Live Tables

Given the existence of two options to create views on top of data, there must be some situations where one should be preferred over the other. The choice of View or Materialized View primarily depends on your use case. The biggest difference between the two is that Views, as defined above, are computed at query time, whereas Materialized Views are precomputed. Views also have the added benefit that they don’t actually require any additional storage, as they are computed on the fly. 

The general rule of thumb when choosing between the two comes down to the performance requirements and downstream access patterns of the table in question. When performance is critical, having to compute a view on the fly may be an unnecessary slowdown, even if it saves some storage; in that case, a Materialized View may be preferred. The same is true when there are multiple downstream consumers of a particular view: having to compute the exact same view on the fly for each of them is inefficient and unnecessary, so persisting a Materialized View may be preferred.

However, there are plenty of situations where users just need a quick view, computed in memory, to reference a particular state of a transformed table. Rather than materializing this table, which again is only needed for an operation within the same transformation, creating a View is more straightforward and efficient.

Databricks also recommends using views to enforce data quality constraints or to transform and enrich datasets that drive multiple downstream queries.

What Are the Advantages of Delta Live Tables?

There are many benefits to using Delta Live Tables, including simpler pipeline development, better data quality standards, and support for unified real-time and batch analytics.

  • Unified streaming/batch experience. By removing the need for data engineers to build distinct streaming / batch data pipelines, DLT simplifies one of the most difficult pain points of working with data, thereby offering a truly unified experience.
  • Opinionated Pipeline Management. The modern data stack is filled with orchestration players, observability players, data quality players, and many others. That makes it difficult, as a platform manager, not only to select how to configure the standard/template data stack, but also to enforce those standards. DLT offers an opinionated way to orchestrate pipelines and assert data quality.
  • Performance Optimization. DLTs offer the full advantages of Delta Tables, which are designed to handle large volumes of data and support fast querying, as their vectorized query execution allows them to process data in batches rather than one row at a time. This makes them ideal not just for real-time data ingestion but also for cleaning large datasets.
  • Management. Delta Live Tables automate away otherwise manual tasks, such as compaction or selection of job execution order.  Tests by Databricks show that with automatic orchestration, DLT was 2x faster than the non-DLT Databricks baseline, as DLT is better at orchestrating tasks than humans (meaning, they claim DLT is better at determining and managing table dependencies).
  • Built-in Quality Assertions. Delta Live Tables also provide some data quality features, such as data cleansing and data deduplication, out of the box. Users can specify rules to remove duplicates or cleanse data as data is ingested into a Delta Live Table, ensuring data accuracy. DLT automatically provides real-time data quality metrics to accelerate debugging and improve the downstream consumer’s trust in the data.
  • ACID Transactions. Because DLTs use Delta format they support ACID transactions (Atomicity, Consistency, Isolation and Durability) which has become the standard for data quality and exactness.
  • Pipeline Visibility. Another benefit of Delta Live Tables is a Directed Acyclic Graph of your data pipeline workloads. In fact, this is one of the bigger reasons that DBT adoption has occurred at the speed it has: simply visualizing your data pipelines has been a common challenge. DLT gives you a clear, visually compelling way to both see and introspect your pipeline at various points.
  • Better CDC. Another large improvement in DLT is the ability to use Change Data Capture (CDC), including support for Slowly Changing Dimensions Type 2, just by setting the enableTrackHistory parameter in the configuration. This data history tracking feature is incredibly useful for audits and for maintaining consistency across datasets. We dive a bit further into this below.

What To Know About Change Data Capture (CDC) in Delta Live Tables

One of the large benefits of Delta Live Tables is the ability to use Change Data Capture while streaming data. Change Data Capture refers to tracking all changes in a data source so they can be captured across all destination systems. This allows for a level of data integrity and consistency across all systems and deployment environments, which is a massive improvement.

With Delta Live Tables, data engineers can easily implement CDC with the new APPLY CHANGES INTO API (in either Python or SQL). The capability lets ETL pipelines easily detect source data changes and apply them to datasets throughout the lakehouse.

Importantly, Delta Live Tables supports Slowly Changing Dimensions (SCD) types 1 and 2. This is important because SCD type 2 retains a full history of values, which means even in your data lakehouse, where compute and storage are separate, you can retain a history of records—either on all updates or on updates to a specified set of columns.

In SCD type 2, when the value of an attribute changes, the current record is closed, a new record is created with the changed data values, and this new record becomes the current record. This means that if a user entity in the database moves to a different address, we can store all previous addresses for that user.
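As a hedged sketch of what an SCD type 2 flow looks like in Python (table names, keys, and the sequencing column are placeholders, and the exact helper names are worth double-checking against the DLT docs for your runtime version):

```python
import dlt
from pyspark.sql.functions import col

# Target streaming table that APPLY CHANGES keeps up to date.
dlt.create_streaming_table("customers_scd2")

dlt.apply_changes(
    target="customers_scd2",
    source="customers_cdc_feed",   # a streaming dataset defined elsewhere in the pipeline
    keys=["customer_id"],          # primary key used to match records
    sequence_by=col("change_ts"),  # ordering column to resolve out-of-order events
    stored_as_scd_type=2,          # keep full history instead of overwriting (type 1)
)
```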

This implementation is of great importance to organizations that require maintaining an audit trail of changes.

What is the cost of Delta Live Tables?

As with all things Databricks, the cost of Delta Live Tables depends on the compute itself (as well as cost variance by region and cloud provider). On AWS, DLT compute can range from $0.20/DBU for DLT Core Compute Photon all the way up to $0.36/DBU for DLT Advanced Compute. However, keep in mind these prices can be up to twice as high when applying expectations and CDC, which are among the chief benefits of Delta Live Tables.

From an efficiency perspective, DLT can result in a reduction in total cost of ownership. Automatic orchestration tests by Databricks have shown total compute time reduced by as much as half with Delta Live Tables–ingesting up to 1 billion records for under $1. Additionally, Delta Live Tables integrates the orchestrator and Databricks into a single console, which reduces the cost of maintaining two separate systems.


However, users should also be cautioned that without proper optimization, Delta Live Tables can result in a large increase in virtual machine instances, which is why it’s crucial to keep your autoscaling resources in check.

Want to learn more about getting started with Delta Live Tables? Reach out to us at info@synccomputing.com.

Rethinking Serverless: The Price of Convenience

As is the case with many concepts in technology, the term Serverless is abusively vague. As such, discussing the idea of “serverless” usually evokes one of two feelings in developers. Either it’s thought of as the catalyst for an incredible future, finally freeing developers from having to worry about resources or scaling concerns, or it’s thought of as the harbinger of yet another “we don’t need DevOps anymore” trend.

The root cause of this confusion is that the catch-all term “Serverless” actually comprises two large operating models: functions and jobs. At Sync, we’re intimately familiar with optimizing jobs, so when our customers gave us feedback to focus a portion of our attention on serverless functions, we were more than intrigued.

The hypothesis was simple: could we extend our expertise in optimizing Databricks large-scale/batch compute workloads to optimizing many smaller batch compute workloads?

Serverless Functions

First, let’s see how we got here.

One of the most painful parts of the developer workflow is “real world deployment.” In the real world, deploying code that was written locally to the right environment, and getting it to work the same way, was extraordinarily painful. Library issues, scaling issues, infrastructure management issues, provisioning issues, resource selection issues, and a number of other issues plagued developers. The cloud just didn’t mimic the ease and simplicity of local developer environments.

Then Serverless functions emerged. All of a sudden, developers could write and deploy code in a function with the same level of simplicity as writing it locally. They never had to worry about spinning up an EC2 instance or figuring out the material differences between an AMI and Ubuntu. They didn’t have to play with Dockerfiles or even do scale testing. They wrote the exact same Python or NodeJS code that they wrote locally in a cloud IDE and it just worked. It seemed perfect.

Soon, mission-critical pieces of infrastructure were supported by Python functions a few dozen lines long, deployed in the cloud. Enter: Serverless frameworks. All of a sudden, it became even easier to adopt and deploy serverless functions. Enterprises adopted them like hotcakes, many deploying hundreds or even thousands of these functions.

Why We Care

At Sync, our focus since our inception has been optimizing large-scale compute jobs. Whether through Spark, EMR, or Databricks, the idea of introspecting a job and building a model through which we can understand and optimize that job is our bread and butter. As we continued our development, multiple customers began asking for support for serverless technologies. Naturally, we assumed they were talking about Serverless Job functionality (and many were), but a substantial portion were focused on Serverless Function functionality.

So we set out to answer a simple question: Are Serverless Functions in their current form working for the modern enterprise? 

The answer, as it happens, is a resounding no. 

Industry Focus

In 2022, an IBM blog post titled “The Future Is Serverless” was published, which cited the “energy-efficient and cost-efficient” nature of serverless applications as a primary reason that the future will be serverless. They make the – valid – case that reserving cloud capacity is challenging and that consumers of cloud serverless functions are better served by allowing technologies such as KNative to streamline the “serverless-ification” process. In short, their thesis is that complex workloads, such as those run in Kubernetes, are better served by Serverless offerings.

In 2023, Datadog released their annual “State of Serverless” post, where they show the continued adoption of Serverless technologies. This trend is present across all of the 3 major cloud vendors.

https://www.datadoghq.com/state-of-serverless/

The leader of the pack is AWS Lambda. Lambda has traditionally been the entry point for developers to deploy their Serverless workloads. 

But hang on, 40%+ of Lambda invocations happen in NodeJS? NodeJS is not traditionally thought of as a distributed computing framework, nor is it generally used for large-scale orchestration of compute tasks. But it seems to be dominating the Lambda serverless world.

So, yes, IBM argues that Serverless is great for scaling distributed computation tasks, but what if that’s not what you’re doing with Serverless?

https://www.datadoghq.com/state-of-serverless/

What Serverless Solved 

Before we get into the details of what’s missing, let’s talk about where things are currently working. 

Where Things Work 1: Uptime Guarantees 

One of the critical but most frustrating pieces of the developer lifecycle is uptime requirements. Many developers hear the term “five nines” and shudder. Building applications with specific uptime guarantees is not only challenging, it’s also time-intensive. When large-scale systems are made up of small, discrete pieces of computation, the problem becomes all the more complex.

Luckily, the Lambda SLAs guarantee a fairly reasonable amount of uptime right out of the box. This can save the otherwise substantial developer effort of scoping, building, and testing highly available systems.

Where Things Work 2: Concurrency + Auto Scaling

Introspecting a large scale system isn’t easy. Companies like DataDog and CloudFlare run multi-billion dollar businesses off of this exact challenge. In an environment where requests can burst unexpectedly, creating and designing systems that scale based on spot user demand is also difficult. 

One of the most powerful aspects of a serverless or hosted model (such as AWS Lambda) is the demand-based auto-scaling capability offered by the infrastructure. These effects are compounded, especially when the functions themselves are stateless. This effectively eliminates the need for developers to care about the operational concerns of autoscaling. There are unquestionably still cost concerns, server-load concerns, and others, but serverless function offerings give developers a good starting point.

Problem 1: Developer Bandwidth 

In a typical Serverless Function deployment, the initial choice of configuration tends to be the perpetual choice of configuration.

Wait, hang on, “initial choice of configuration”? Meaning, users still have to manually select their own configuration? It turns out, yes, users still need to manually pick a particular configuration for each serverless function they deploy. It’s actually a bit ironic: with the promise of truly zero-management jobs, users are still required to intelligently select a resource configuration.

If an engineer accidentally overspecs a serverless function when they first deploy it, it’s fairly unlikely that they will ever revisit the function to optimize it. This is generally the case for a few reasons:

  1. Time – Most engineers don’t have the time to go back and ensure that functions they have written weeks, months, or even years ago are operating under the ideal resources. This largely feeds into #2.
  2. Incentives – Engineers are not incentivized by picking the optimal resource configuration for their jobs. They’d rather have the job be guaranteed to work, while spending a bit more of their company’s compute budget.
  3. Employee Churn – Enterprises have inherent entropy and employees are oftentimes transient. People start jobs and people leave jobs, and the knowledge generally leaves with them. When other engineers inherit previous work, they are significantly more incentivized to just ensure it works, rather than ensure that it works optimally.

Problem 2: Serverless Still Requires Tuning

Lambda is predicated on a simple principle: the resource requirements for workloads that take less than 15 minutes to run can be pretty easily approximated. Lambda makes it easy for developers to set-and-forget, offering only one knob for them to worry about.

That knob is memory. Using Lambda, you can configure the memory allocated to a Lambda function as a value between 128 MB and 10,240 MB. Lambda will automatically decide how much vCPU to allocate to you based on that memory setting.
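Turning that single knob programmatically is a one-liner against the Lambda API. Here is a hedged sketch using boto3; the function name and memory value are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

# Placeholder function name; MemorySize is in MB (128–10,240).
# Lambda scales the allocated vCPU proportionally to this value.
lambda_client.update_function_configuration(
    FunctionName="my-example-function",
    MemorySize=512,
)
```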

This… sounds great. “I only have to pick one lever and, all of a sudden, I get everything else figured out for me? That’s perfect!” If that were the end of the story, I would get to finish this post right now.

Instead, life is all about tradeoffs – generally correlated tradeoffs. In this case, it’s cost and performance. As an engineer, it’s easy for me to pick the largest memory setting available to me just to ensure my Lambda function works, regardless of what its actual resource requirements are. Once it works, why would I ever touch it again?

Well, it turns out that picking large, frequently uncorrelated-to-necessary-resources values isn’t the most cost-effective thing to do. So much so, in fact, that an AWS Solutions Engineer built and open sourced a tool to help users actually find the correct memory levels for their Lambda functions. The tool uses AWS Step Functions to walk users down to the minimum necessary level. It’s been so popular that it has 5K stars on GitHub… and 18.8K deployments.

Clearly, the one-knob-rules-all solution isn’t working.  


Problem 3: Serverless Is Hard to Introspect

The scale and growth testing that plagued engineers for decades before the rise of Serverless was, unfortunately, not in vain. Understanding how users will be interacting with an application, in terms of number of requests or compute load, gives engineers a powerful understanding of what to expect when things go live.

In the Serverless Function architecture, engineers don’t think about these considerations and instead push the burden onto the infrastructure itself. As long as the infrastructure works, it’s unlikely that an already oversubscribed engineer will spend time digging into the performance or cost characteristics of the Serverless function.

Absent home-rolled solutions, there are few tools that allow for the detailed observability of a single serverless function. Furthermore, there are usually hundreds if not thousands of serverless functions deployed. Observability across a fleet of functions is nearly impossible.

Furthermore, the primary mechanism folks can use for per-function observability is AWS CloudWatch. CloudWatch logs events for each Lambda invocation and stores a few metrics. The major problem, though, is that just collecting this information in CloudWatch has been observed to be more expensive than Lambda itself. In fact, there are full articles, posts, and best practices around just managing the costs associated with Lambda CloudWatch logs.

Problem 4: No Auto-Optimization

The year 2023 brought on a material shift in the mentality of “compute” consumers. Enterprises that were previously focused on growth at all costs shifted their focus to efficiency. Vendors in the generic Cloud, Snowflake, and Databricks ecosystem popped up at increasing rates. Most had a simple goal – provide high level visibility into workloads. 

They provided interactive charts and diagrams to show ongoing cost changes… but they didn’t provide the fundamental “healing” mechanisms. It would be like going to the doctor and having them diagnose a problem but offer no treatment.

Consistent with their focus on efficiency, enterprises had a few options. Larger ones deployed full teams to focus on this effort. Smaller ones that didn’t have the budget or manpower turned to observability tools… nearly all of which fell short, as they missed the fundamental optimization component.  

Providing detailed visibility across a few large-scale jobs is considered table stakes for many observability providers, but for some reason providing that same level of visibility across many small-scale jobs, in an efficient and easy-to-optimize way, hasn’t become standard.

Conclusion

We’re in a fairly unique period as an industry. Job visibility, tuning, introspection, and optimization have reemerged as key pieces of the modern tech stack. But most focus on the whales, when they should be focusing on the barracudas. 

If these problems resonate with you – drop us a line at info@synccomputing.com. We’d love to chat. 

Databricks vs Snowflake: A Complete 2024 Comparison 

Databricks and Snowflake are two of the largest names in cloud data solutions – and for good reason. Both platforms have been instrumental in helping companies generate value from their internal (and external) data assets. Each platform has distinct advantages and features and they’ve increasingly overlapped in their offerings—leaving many confused about which solution is best suited for their business needs.

Unfortunately, as is the case in most things in life, there isn’t a simple answer for “which one is better.” But, at Sync, we’ve seen, debugged, and optimized thousands of workloads across numerous enterprises, which has given us a unique perspective. 

To begin a meaningful comparison, we have to understand the history and core competency of each offering.  To help in that process, we at Sync broke down the major differences between Snowflake and Databricks — including price, performance, integration, security and best use cases—in order to best address the individual needs of each user.

While this topic has been debated plenty of times in the past, here at Sync we aim to provide emphasis on the total cost of ownership and ROI for companies.  With all that said, let’s begin!

Below is our complete 2024 guide to all things Databricks vs. Snowflake.

Databricks vs. Snowflake: What Are The Key Differences?

The first thing to understand about the two platforms is what they are, and what solution they are hoping to provide.

Databricks is a cloud-based, unified data analytics platform for building, deploying, and sharing data analytics solutions at scale. Databricks aims to provide one unified interface where users can store data and execute jobs in interactive, shareable workspaces. These workspaces contain cloud-based notebooks—the backbone of Databricks—through which all compute is executed on cloud-based machines.

Snowflake, on the other hand, is a fully-managed, SaaS cloud-based data warehouse. Whereas Databricks was initially designed to unify data pipelines, Snowflake was designed to be the easiest to manage data warehouse solution on the cloud. While the target market for Databricks is data scientists and data engineers, the target market for Snowflake is typically data analysts—who are highly proficient in SQL queries and data analysis—but not as interested in complex computations or machine learning workflows.

Over time, Databricks and Snowflake have been increasingly in competition, as each hopes to expand their offerings to be an all-in-one cloud data platform solution. New products like Snowflake’s Snowpark (which offers Python functionality) and Databricks’s DBSQL (their serverless data warehouse) have made it increasingly difficult to differentiate the offering of each product. 

For the time being, most would agree that Snowflake tends to be the dominant name for easy-to-use cloud data warehouse solutions, and Databricks is the winner for cloud-based machine learning and data science workflows.

Databricks vs Snowflake: Data Storage

At the moment, Snowflake has the edge for querying structured data, and Databricks has the edge for the raw and unstructured data needed for ML. In the future, we think Databricks’ data lakehouse platform could be the catch-all solution for all data management.

One of the largest differences between Snowflake and Databricks is how they store and access data. Both lead the industry in speed and scale. The largest difference between the two is the architecture of data warehouse vs data lakehouse, and the storage of unstructured vs structured data.

Snowflake

Snowflake, at its core, is a cloud data warehouse. It stores structured data in a closed, proprietary format for quick, seamless data querying and transformation. This proprietary format allows for high speed and reliability, with tradeoffs on flexibility. More recently, Snowflake has been allowing the ingestion and storage of data in additional formats (such as Apache Iceberg), but the vast majority of its customers’ data still sits in its own format.

Snowflake utilizes a multi-cluster shared disk architecture, in which compute resources share the same storage device but retain their own CPU and memory. To achieve this, Snowflake ingests, optimizes, and compresses data to a cloud object storage layer, like Amazon S3 or Google Cloud Storage. Data here is organized into a columnar format and segmented into micro-partitions, anywhere from 50 to 500MB. These micro-partitions store metadata, which helps dramatically with speed. Interestingly enough, Snowflake’s own internal storage file format is not open source, keeping most customers locked in.

To function efficiently, Snowflake uses multiple layers to provide an enterprise-experience to the cloud processing workload. Snowflake maintains a cloud services layer that handles the enterprise authentication and access control. 

For execution, Snowflake uses virtual warehouses, which are abstractions on top of regular cloud instances (such as EC2). These warehouses query data from a separate data storage layer, effectively separating storage and compute. This separation of compute and storage makes Snowflake infinitely scalable and allows users to run concurrent queries off the same data, with reasonable isolation.

Both Snowflake and Databricks are cloud agnostic, meaning they run on all three major cloud service providers: Amazon AWS, Microsoft Azure, and Google Cloud Platform (GCP).

Sync’s Take: Snowflake’s architecture allows for fast and reliable querying of structured data, at scale. It appeals to those who want simple methods for managing the resource requirements of their jobs (through T-shirt-sized warehouse options). It is primarily geared towards those proficient in SQL, but lacks the flexibility to easily deal with raw, unstructured data.

Databricks

One of Databricks’ selling points is that it employs an open-source storage layer known as Delta Lake—which aims to combine the flexibility of cloud data lakes with the reliability and unified structure of a data warehouse—without the challenges associated with vendor lock-in. Databricks has pioneered this so-called ‘data lakehouse’ hybrid structure as a cost-effective solution for data scientists, data engineers, and analysts alike to work with the same data—regardless of structure or format.

The Databricks data lakehouse works by employing three layers to allow for the storage of raw and unstructured data—while also storing metadata (such as a structured schema) for warehouse-like capabilities on structured data. Notably, this data lakehouse provides ACID transaction support, automatic schema enforcement—which validates DataFrame and table compatibility before writes—and end-to-end streaming for real-time data ingestion—some of the most desirable advancements for data lake systems.

Sync’s Take:  Lakehouses bring the speed, reliability, and fast query performance of data warehouses to the flexibility of a data lake. The drawback is that, as a relatively new technology, new and less technical users have occasionally been unable to locate tables and have had to rebuild them.

Databricks vs Snowflake: Scalability

Snowflake and Databricks continue to battle for dominance of enterprise workloads. While both have been proven to be industry leaders in this capacity, the largest practical difference between the two lies in their resource management capabilities. 

Snowflake

Snowflake offers compute resources as a serverless offering, meaning users don’t have to select, install, configure, or manage any software and hardware. Instead, Snowflake uses a series of virtual warehouses—independent compute resources containing memory and CPU—to run queries. This separation of memory and compute resources allows Snowflake to scale infinitely without slowing down, and multiple users can query concurrently against the same single segment of data.

In terms of performance, Snowflake has been shown to process up to 60 million rows in under 10 seconds.

Snowflake employs a simple “t-shirt” sizing model for its virtual warehouses, with 10 sizes, each with double the computing power of the size before it. The largest is 6XL, which has 512 virtual nodes. Because warehouses don’t share compute resources or store data, if one goes down it can be replaced in minutes without affecting any of the others.

A diagram of the virtual nodes associated with each size data warehouse

Most notably, Snowflake’s multi-cluster warehouses provide both a “maximized” and “auto-scale” feature which gives you the ability to dynamically shut down unused clusters, saving you money.

Databricks

Databricks started out with much more “open” and traditional infrastructure, where basically all of the compute runs inside a user’s cloud VPC and all of the cluster configurations are exposed to end users.  This is the complete opposite of the “serverless” model, where the compute runs inside Databricks’ VPC.  This has its pros and cons: the main advantage is that users can hyper-optimize their clusters to improve performance, but the drawback is that it can be painful to use or require an expert to maintain.

More recently, Databricks has been evolving towards the “serverless” model with Databricks SQL Serverless, and will likely extend this model to other products, such as notebooks.  The pros and cons here flip: the pro is that users don’t have to worry about cluster configurations, while the con is that users have no access to, or visibility into, the underlying infrastructure and are unable to custom-tune clusters to meet their needs.

Since Databricks is currently in a “transition” period between classic and “serverless” offerings, their scalability really depends on which use case people select.  

One major note is that Databricks has a diverse set of compute use cases, from SQL warehouses, Jobs, and All Purpose Compute to Delta Live Tables and streaming – each of these has slightly different compute configurations and use cases.  For example, SQL warehouses can be used as a shared resource, where multiple queries can be submitted to the warehouse at any time from multiple users.  Jobs are more singular: one notebook runs on one cluster, which is then shut down (Jobs can also be shared now, but this is used less).

The different use cases need to fit the end user’s needs, which can also impact scalability.  This example symbolizes both the strength and weakness of Databricks: there are so many options at so many levels that it can be great if you know what you’re doing, or it can be a nuisance.

Sync’s Take: When it comes to scaling to large workflows, both Snowflake and Databricks can handle the workload. However, Databricks is better able to boost and fine tune the performance of large volumes of data which ultimately saves costs.

Databricks vs Snowflake: Cost

Both Databricks and Snowflake are marketed as pay-as-you-go models, meaning the more compute you reserve or request, the more you pay. Their models differ starkly from truly “usage-based” pricing schemes, where customers pay only for the usage they actually consume. In both Databricks and Snowflake, users can and will pay for requested resources whether or not those resources are actually necessary or optimal to run the job.

Another big difference between the two services is that Snowflake runs and charges for the entire compute stack (virtual warehouses and cloud instances), whereas Databricks only runs and charges for the management of compute, so users still have to pay a separate cloud provider bill. It is worth noting that Databricks’ new serverless product mimics the Snowflake operating model. Databricks bills in compute/time units called Databricks Units (DBUs) per second, while Snowflake uses a Snowflake credit system.

As a formula, it breaks down like this:

  • Databricks (Classic compute) = Data storage + Cost of Databricks Service (DBUs) + Cost of Cloud Compute (Virtual machine instances) 
  • Snowflake = Data storage (Daily average volume of bytes stored on Snowflake) + Compute (number of virtual warehouses used) 
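As a rough worked example of the classic-compute formula above (storage omitted for brevity; every rate below is an illustrative placeholder, not a quote — check your contract and your cloud provider’s price list for real numbers):

```python
# Hypothetical job: 4 workers + 1 driver running for 2 hours on classic (non-serverless) compute.
hours = 2
nodes = 5                      # 4 workers + 1 driver

# Assumed, illustrative rates only:
dbu_per_node_hour = 1.0        # DBU consumption varies by instance type
dbu_rate = 0.15                # $/DBU for jobs compute (varies by tier and workload type)
vm_rate = 0.30                 # $/hour per instance, billed by the cloud provider

databricks_fee = hours * nodes * dbu_per_node_hour * dbu_rate   # DBU charge
cloud_fee = hours * nodes * vm_rate                             # VM charge

print(f"DBU cost:   ${databricks_fee:.2f}")   # $1.50
print(f"Cloud cost: ${cloud_fee:.2f}")        # $3.00
print(f"Total:      ${databricks_fee + cloud_fee:.2f}")  # $4.50
```

The point of the arithmetic is simply that the Databricks line item is only part of the bill; the cloud VM charge often dominates it.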

Both Databricks and Snowflake offer tiers and discounts of pricing based on company size, and both allow you to save money by pre-purchasing units or credits.

Databricks has more variance in price, as it charges different rates depending on the type of workload, with certain types of compute costing 5x more per compute hour than simple jobs.

One major cost advantage for Databricks is that it allows users to utilize Spot instances from their cloud provider – which can translate to significant cost savings.  Snowflake obfuscates all of this, and the end user has no option to benefit from Spot instances.

Sync’s Take: There is no concrete answer to which service is “cheaper,” as it really depends on how much of the service or platform you’re using, and for what types of tasks. However, the control and introspection capabilities that Databricks provides are fairly unmatched in the Snowflake ecosystem. This gives Databricks a significant edge when optimizing for large compute workloads.

If you’d like a further guide on the breakdown of Databricks pricing, we recommend checking out our complete pricing guide.

Databricks vs Snowflake: Speed Benchmarks

Databricks claims they are 2.5x faster than Snowflake. Snowflake also claims they are faster than Databricks. While this is a contentious issue between the two giants, the reality is that benchmarks merely serve as a vanity metric.  Your workloads will likely look nothing like the TPC-DS benchmarks that either company ran, and hence their benchmarks would not apply to your jobs.  Our opinion here is that benchmarks don’t matter at this level.

While this may be an unsatisfying answer, if you’re looking for a solution that is all about the absolute fastest way to run your code – there are likely other solutions that are less well known but do focus on performance.  

Most companies we speak to value both platforms for their ease of use, having all of their data in one place, the ability to share code, and not having to worry about low-level infrastructure.  Pure raw speed is rarely a priority for companies. If this sounds like your company, the speed metrics likely don’t matter so much.

However, cost likely does matter in aggregate, and hence doing an actual comparison of runtime and cost on the different platforms with your actual workloads is the only real way to know.  

Databricks vs Snowflake: Ease of Use

All things equal, Snowflake is largely considered the “easier” cloud solution to learn between the two. It has an intuitive SQL interface and as a serverless experience, doesn’t require users to manage any virtual or local hardware resources. Plus as a managed service, using Snowflake doesn’t require any installing, maintaining, updating or fine-tuning of the platform. It’s all handled by Snowflake.

From a language perspective, Snowflake is all SQL-based (excluding their new foray into Snowpark), making it accessible for many business analysts. While Databricks SQL has data warehouse functionality in line with Snowflake, the main use case of Databricks is being able to write in Python, R, and Scala, and reviews on Gartner and TrustRadius have consistently rated it a more technical setup than Snowflake.

Snowflake also has automated features like auto-scaling and auto-suspend to help start and stop clusters without fine-tuning. While Databricks also has autoscaling and auto-suspend, it is designed for a more technical user and there is more involved in fine-tuning your clusters (watch more about how we help do this here).
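To give a sense of what “more involved” means on the Databricks side, here is a hedged sketch of the autoscaling portion of an all-purpose cluster spec. The runtime version, instance type, worker counts, and termination window are placeholders you would tune for your own workload:

```python
# Fragment of a Databricks cluster definition; all values are illustrative.
cluster_spec = {
    "cluster_name": "adhoc-analytics",       # placeholder name
    "spark_version": "14.3.x-scala2.12",     # placeholder runtime version
    "node_type_id": "i3.xlarge",             # placeholder instance type
    "autoscale": {
        "min_workers": 2,                    # floor the cluster never shrinks below
        "max_workers": 8,                    # ceiling during bursts
    },
    "autotermination_minutes": 30,           # Databricks' equivalent of auto-suspend
}
```

Every one of those knobs is a choice the user has to make, which is exactly the extra work Snowflake's T-shirt sizing hides.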

Sync’s Take: While the Databricks UI has a steeper learning curve than Snowflake’s, it ultimately offers more advanced control and customization, making this a tradeoff that largely depends on how complex you intend your operations to be.

Databricks vs Snowflake: Security

Both Databricks and Snowflake are GDPR-compliant, offer role-based access control (RBAC), and encrypt their data both at rest and in motion. Both have very good records with data security and offer support for a variety of compliance standards.

Databricks offers additional isolation at multiple levels including workspace-level permissions, cluster ACLs, JVM whitelisting, and single-use clusters. For organizations that employ ADS or AMS teams, Databricks provides workload security that includes code repository management, built-in secret management, hardening with security monitoring and vulnerability reports, and the ability to enforce security and validation requirements.

Snowflake allows users to set regions for data storage to comply with regulatory guidelines such as HIPAA and PCI DSS. Snowflake security levels can also be adjusted based on requirements, with built-in features to regulate access levels and control things like IP allowlists and blocklists. Snowflake also offers the advanced features of Time Travel and Fail-safe, which allow you to restore tables, schemas, and databases from a specific point in time or protect and recover historical data.

Historically, the only issue for Snowflake was the lack of on-premises storage on private-cloud infrastructure, which is needed for the highest levels of security, such as government data. In 2022, Snowflake started adding on-premises storage support; however, as of yet there is limited information on how this has been received.

Sync’s Take: Both Databricks and Snowflake have an excellent reputation with data security, as it is mission-critical to their businesses. There is really no wrong choice here and it largely comes down to making sure individual access levels match your intent.

Databricks vs. Snowflake: Ecosystem and Integration

Databricks and Snowflake are becoming the abstractions on top of Cloud Vendors for data computation workloads. As such, they both plug into a variety of vendors, tools, and products. 

From the vendor space, both Databricks and Snowflake provide marketplaces that allow other predominant tools and technology to be co-deployed. There are also community built and contributed features, such as the Databricks Airflow Operators / Snowflake Airflow Operators.

On the whole, though, the Databricks ecosystem is typically more “open” than Snowflake’s, since Databricks still runs in a user’s cloud VPC.  This means users can still install custom libraries, or even introspect low-level cluster data.  Such access is not possible in Snowflake, and hence integrating with your favorite tools may be harder. Databricks also tends to be generally more developer and integration friendly than Snowflake for this exact reason.

Other FAQ on Databricks vs Snowflake?

  • Is Databricks a data warehouse? Databricks bills itself as the world’s first “Data Lakehouse,” combining the best of data lakes and data warehouses. However, despite having the capability, Databricks is not typically thought of as a data warehouse solution, as its learning curve and fine-tuning are often unnecessary for someone seeking just a straightforward data warehouse.
  • Can Snowflake and Databricks integrate with each other? It is possible, and not entirely uncommon, to integrate Databricks and Snowflake with each other. Typically in this setup, Databricks acts as a data lake for all unstructured data, manipulating and processing it as part of an ETL pipeline, with the results then stored in Snowflake like a traditional data warehouse.
  • What data types does Snowflake accept? Snowflake is optimized for structured and semi-structured data, meaning it can only accept certain data formats, notably JSON, Avro, Parquet and XML.
  • Can Snowflake and Databricks create dashboards for business intelligence? Yes, both Snowflake and Databricks are able to create dashboards and visualizations for business intelligence.

Databricks vs Snowflake: Which Is Better?

Both Databricks and Snowflake have a stellar reputation within the business and data community. While both cloud-based platforms, Snowflake is most optimized for data warehousing, data manipulation and querying, while Databricks is optimized for machine learning and heavy data science. 

Broken down into components, here are a list of pros for each:

Platform/Feature | Databricks | Snowflake
Storage | Better for raw, unstructured data. | Better for reliability and ease of use with structured data.
Use Case | Better for ML, AI, data science, and data engineering. Collaborative notebooks in Python/Scala/R are a big plus. | Easier for analysts in business intelligence and companies looking to migrate an existing data warehouse system.
Price | Cheaper at high compute volumes. Not as predictable on cost. | Efficient at scaling down unused resources. More consistent, predictable costs.
Scalability | Infinitely scalable. Effective at high-volume workloads. | Separate storage and compute makes for seamless concurrent queries.
Security | GDPR-compliant, role-based access control, encrypted at rest and in motion. | GDPR-compliant, role-based access control, encrypted at rest and in motion.

If you want to integrate with an existing ETL pipeline using structured data and programs like Tableau, Looker, and Power BI, Snowflake could be the right option for you. If you instead are looking for a unified analytics workspace where you build compute pipelines, Databricks might be the right choice for you.

Interested in using Databricks further? Check out Sync’s Gradient solution – the only ML-powered Databricks cluster optimization and management tool.  At a high level, we help maintain the openness of Databricks but now with the “ease” of Snowflake.  On top of that, we also actively drive your costs lower and lower.

What is the Databricks Job API?

The Databricks Jobs API allows users to programmatically create, run, and delete Databricks Jobs via a REST API.  This is an alternative to running Databricks jobs through the console UI.  For access to other Databricks platforms such as SQL warehouses, Delta Live Tables, Unity Catalog, or others, users will have to use the other APIs provided by Databricks.

The official Databricks Jobs API reference can be found here.  

However, for newcomers to the Jobs API, I recommend starting with the Databricks Jobs documentation which has great examples and more detailed explanations.  

Why should I use the Jobs API?

Users may want to use the API, rather than the UI, when they need to dynamically create jobs in response to other events, or to integrate with non-Databricks workflows such as Airflow or Dagster.  Users can implement job tasks using notebooks, Delta Live Tables pipelines, JARs, Python, Scala, Spark submit, and Java applications.

Another reason to use the Jobs API is to retrieve and aggregate metrics about your jobs to monitor usage, performance, and costs.  The information in the Jobs API is far more granular than what is present in the currently available System Tables.

So if your organization is looking to monitor thousands of jobs at scale and build dashboards, you will have to use the Jobs API to collect all of the information.  

What can I do with the Jobs API?

A full list of the Jobs API PUT and GET requests can be found in the table below, based on the official API documentation.  

| Action | Request | Description |
| --- | --- | --- |
| Get job permissions | /api/2.0/permissions/jobs/{job_id} | Gets the permissions of a job, such as user name, group name, service principal, and permission level |
| Set job permissions | /api/2.0/permissions/jobs/{job_id} | Sets permissions on a job |
| Update job permissions | /api/2.0/permissions/jobs/{job_id} | Updates the permissions on a job |
| Get job permission levels | /api/2.0/permissions/jobs/{job_id}/permissionLevels | Gets the permission levels that a user can have on an object |
| Create a new job | /api/2.1/jobs/create | Creates a new Databricks job |
| List jobs | /api/2.1/jobs/list | Retrieves a list of jobs and their parameters, such as job id, creator, settings, and tasks |
| Get a single job | /api/2.1/jobs/get | Gets job details for a single job |
| Update all job settings (reset) | /api/2.1/jobs/reset | Overwrites all settings for the given job |
| Update job settings partially | /api/2.1/jobs/update | Adds, updates, or removes specific settings of an existing job |
| Delete a job | /api/2.1/jobs/delete | Deletes a job |
| Trigger a new job run | /api/2.1/jobs/run-now | Runs a job with an existing job id |
| Create and trigger a one-time run | /api/2.1/jobs/runs/submit | Submits a one-time run. This endpoint allows you to submit a workload directly without creating a job. Runs submitted this way don't display in the UI |
| List job runs | /api/2.1/jobs/runs/list | Lists runs in descending order by start time. A run is a single historical execution of a job |
| Get a single job run | /api/2.1/jobs/runs/get | Retrieves the metadata of a single run |
| Export and retrieve a job run | /api/2.1/jobs/runs/export | Exports and retrieves the job run task |
| Cancel a run | /api/2.1/jobs/runs/cancel | Cancels a job run |
| Cancel all runs of a job | /api/2.1/jobs/runs/cancel-all | Cancels all runs of a job |
| Get the output for a single run | /api/2.1/jobs/runs/get-output | Retrieves the output and metadata of a single task run |
| Delete a job run | /api/2.1/jobs/runs/delete | Deletes a job run |
| Repair a job run | /api/2.1/jobs/runs/repair | Repairs a job run by re-running failed or skipped tasks |
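As a quick example of one of the endpoints above, jobs/update modifies specific settings of an existing job without overwriting the rest.  The sketch below (the job id and values are placeholders) bumps the job's timeout and caps concurrent runs:

# Partially update job 123: only the fields listed under new_settings change.
curl --netrc --request POST \
  https://<databricks-instance>/api/2.1/jobs/update \
  --data '{
    "job_id": 123,
    "new_settings": {
      "timeout_seconds": 7200,
      "max_concurrent_runs": 1
    }
  }' \
  | jq .

Anything not listed under new_settings is left untouched, which is the key difference from jobs/reset.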

Can I get cost information through the Jobs API?

Unfortunately, users cannot obtain job costs directly through the Jobs API.  You'll need to use the Accounts API or the System Tables to access billing information.  One big note: the billing information retrieved through either the Accounts API or the System Tables covers only the Databricks DBU costs.

The majority of your Databricks costs could come from your actual cloud usage (e.g. on AWS it’s the EC2 costs).  To obtain these costs you’ll need to separately retrieve cost information from your cloud provider.
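How you pull those cloud costs depends on your provider and your tagging setup.  As one rough sketch on AWS, assuming the default Databricks-applied instance tags (such as Vendor=Databricks) have been activated as cost allocation tags in your billing console, you could query Cost Explorer for the tagged EC2 spend:

# Hypothetical example: daily EC2 spend for instances tagged Vendor=Databricks.
# Requires the tag to be activated as a cost allocation tag in AWS Billing.
aws ce get-cost-and-usage \
  --time-period Start=2024-06-01,End=2024-06-30 \
  --granularity DAILY \
  --metrics UnblendedCost \
  --filter '{"Tags": {"Key": "Vendor", "Values": ["Databricks"]}}'

You'd then still need to join that spend back to individual jobs and runs (for example via the JobId or ClusterId tags) to get per-job cloud costs.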

If this sounds painful – you're right, it's crazy annoying.  Fortunately, Gradient does all of this for you, retrieving both the DBU and cloud costs and presenting them in a simple diagram so you can monitor your costs.  

How does someone intelligently control their Jobs clusters with the API?

The Jobs API is only an input/output mechanism.  What you do with that information, and how you use it to control and manage your jobs, is entirely up to you and your needs.  

For users running Databricks Jobs at scale, one dream ability is to optimize and intelligently control jobs clusters to minimize costs and hit SLA goals.  Building such a system is not trivial and requires an entire team to develop a custom algorithm as well as infrastructure.

Here at Sync, we built Gradient to solve exactly this need.  Gradient is an all-in-one Databricks Jobs intelligence system that works with the Jobs API to help automatically control your jobs clusters.  Check out the documentation here to get started.

Updating From Jobs API 2.0 to 2.1

The largest update from API 2.0 to 2.1 is support for multiple tasks in a job, as described in the official documentation.  To explain a bit more, a single Databricks job can contain multiple tasks, where each task can be a different notebook, for example.  All API 2.1 requests must conform to the multi-task format, and responses are structured in the same format.
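For reference, here is a trimmed, illustrative sketch of what that multi-task structure looks like in a 2.1 job definition; the task keys, notebook paths, and cluster values are placeholders:

{
  "name": "Nightly pipeline",
  "tasks": [
    {
      "task_key": "ingest",
      "notebook_task": { "notebook_path": "/Repos/etl/ingest" },
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "r5.xlarge",
        "num_workers": 4
      }
    },
    {
      "task_key": "transform",
      "depends_on": [ { "task_key": "ingest" } ],
      "notebook_task": { "notebook_path": "/Repos/etl/transform" },
      "existing_cluster_id": "1234-567890-abcde123"
    }
  ]
}

Each task declares its own compute and its dependencies via depends_on, and the 2.0-style single-task fields (like the top-level spark_jar_task in the example below) move inside individual task entries.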

Databricks Jobs API example 

Here is an example, borrowed from the official documentation, of how to create a job:

To create a job with the Databricks REST API, run the curl command below, which creates a job based on the parameters located in create-job.json:

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/jobs/create \
--data @create-job.json \
| jq .

An example of what goes into create-job.json is shown below:

{
  "name": "Nightly model training",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "r3.xlarge",
    "aws_attributes": {
      "availability": "ON_DEMAND"
    },
    "num_workers": 10
  },
  "libraries": [
    {
      "jar": "dbfs:/my-jar.jar"
    },
    {
      "maven": {
        "coordinates": "org.jsoup:jsoup:1.7.2"
      }
    }
  ],
  "email_notifications": {
    "on_start": [],
    "on_success": [],
    "on_failure": []
  },
  "webhook_notifications": {
    "on_start": [
      {
        "id": "bf2fbd0a-4a05-4300-98a5-303fc8132233"
      }
    ],
    "on_success": [
      {
        "id": "bf2fbd0a-4a05-4300-98a5-303fc8132233"
      }
    ],
    "on_failure": []
  },
  "notification_settings": {
    "no_alert_for_skipped_runs": false,
    "no_alert_for_canceled_runs": false,
    "alert_on_last_attempt": false
  },
  "timeout_seconds": 3600,
  "max_retries": 1,
  "schedule": {
    "quartz_cron_expression": "0 15 22 * * ?",
    "timezone_id": "America/Los_Angeles"
  },
  "spark_jar_task": {
    "main_class_name": "com.databricks.ComputeModels"
  }
}

Azure Databricks Jobs API 

The REST APIs are identical across all 3 cloud providers (AWS, GCP, Azure).  Users can toggle between the different cloud versions via the selector in the top-left corner of the reference page.

Conclusion

The Databricks Jobs API is a powerful system that enables users to programmatically control and monitor their jobs.  It is likely most useful for “power users” who manage many jobs, or for teams that rely on an external orchestrator, like Airflow, to orchestrate their jobs.

To add automatic intelligence to your Databricks Jobs API solutions to help lower costs and hit SLAs, check out Gradient as a potential fit.


Introducing the Sync Databricks Workspace Health Check


Introducing the Sync Databricks Workspace health check, a program that we’ve spearheaded to help Databricks users identify common mistakes in their Workspaces.

Here at Sync, we’ve worked with a ton of companies and looked at their overall Databricks workspace usage. We’ve seen all sorts of usage, from jobs and all-purpose compute to SQL warehouses and Delta Live Tables, and have noticed many recurring patterns.

While many companies do operate Databricks well, there are some patterns we’ve observed that have led to wasted compute resources and inflated costs. As a result, we built a tool to help quickly identify these common pitfalls and give users a quick rundown of the health of their overall usage.

With our personalized health check, you’re able to gain insight into:

  • Your top 10 jobs most qualified for Gradient
  • Candidates for EBS, Photon, and autoscaling
  • Compute cluster utilization scoring
  • SQL Warehouse utilization efficiency
  • 12-month projected usage growth
  • Estimated overall cost savings
  • Incorrectly run jobs on all-purpose compute clusters

While we expect the health check to continue to evolve and grow, let’s dive into some of the popular metrics used today:

Nail down APC vs. jobs compute usage to significantly reduce costs by identifying and leveraging the most cost-effective job compute option, immediately allowing for savings of up to 50%. Companies small and large often incorrectly use all-purpose compute clusters for their production jobs when they should be using jobs clusters. While this is a subtle detail, it can instantly lead to a 2x cost reduction with just a few clicks.

Visualize APC and warehouse utilization to identify underused clusters and warehouses within your workspace. Both APC and SQL warehouses can fall into the same pitfall of being “always on” even though nobody is using them. With our health check, you’re able to quickly see where that is happening and how to prevent it.

Efficiently select instances across an organization to determine if your users are opting out of default settings in an effort to optimize. Platform teams thrive when they’re able to see the distribution of instances that are being used. This helps identify what kind of clusters are popular and effective. If the Databricks default cluster is used often (e.g. in AWS it’s “i3”), it’s likely that team members are opting for default settings and aren’t spending much time trying to find better instances for optimal performance.

Gain a better understanding of EBS, Photon, and autoscaling optimization insights by identifying how many clusters use these features to assess potential savings that could add major benefit to your jobs. Photon and autoscaling are options Databricks often recommends for job clusters. However, these features are only beneficial some of the time, ultimately depending on the characteristics of your job.

Rank your top Jobs candidates for Gradient based on schedule, duration, and consistency. One of the largest sources of cost is jobs clusters used in production. Sync’s core product offering helps to automatically optimize these clusters for cost and performance. When you’re working with hundreds, or even thousands, of jobs in your workload, it can be daunting to identify which jobs should take priority. To help with this, your Workspace health check includes a proprietary ranking system that identifies the jobs for which Gradient’s cluster optimizations are a good fit.

Our health check notebook is an easy-to-use solution that you can run on your own at zero cost to you. 

Want to get a head start and learn more about integrating Gradient into your stack? Head here to request your personalized Databricks health check.