News

Sync’s Health Check for Databricks Workspaces

Whether you’re a data engineer, a manager of a data team, or an executive overseeing a data platform, your focus is likely on growth: continuing to build and innovate. That growth, however, can come at the expense of ballooning costs that get harder and harder to bring under control, until you reach a point where you have to make tough cost-cutting decisions, like migrating to a less expensive platform, or even tougher ones, like laying off part of your team.

Our data platform costs are increasing 20% MoM. How do we reduce our costs and get our budget under control?

Senior Data Engineer at a martech company

Can you help us get a better understanding of how we’re using Databricks? We want to get our costs under control but we don’t know where to start.

Staff Project Manager at a large pharma company

What is the Health Check?

Sync Computing’s Health Check for Databricks Workspaces is a Databricks notebook that runs entirely within your Databricks environment. It provides a detailed report of findings and recommended actions that help you reduce spend and build a deeper understanding of your use cases, patterns, and practices in Databricks.

  • How stable are job runs?
  • What is the distribution of job runs by compute type?
  • What does Photon usage look like?
  • What are the most frequently used instance types?
  • Are clusters being auto-terminated or sitting idle?
  • What are my most expensive jobs?

Sync Health Check provides answers to all the above questions, and more!

We’ll cover a few of these questions in this blog post to demonstrate how Health Check can help get you the data you need to make informed decisions when it comes to your Databricks usage.

How do I get it?


The health check for Databricks workspaces is a free tool anyone can download by following the link below:

Request health check notebook download here

We do ask for your contact information so we can follow up to see whether the notebook was useful and to collect your feedback. We’d also love to hear ideas for any new analyses we could add!

Without further delay, let’s dive into what’s in the health check and how it can be a useful tool:

Job Run Stability

Jobs with low stability and failed runs cost money but don’t drive any business value. These jobs may also be preventing you from meeting your SLAs and causing thrash in your teams. Health Check shows you how many of your job runs succeed and how many fail, an insight that helps you prioritize action when failures are costing more than the value a job delivers.

Actionable insight: Prioritize fixing or pausing the jobs with high failure rates to save costs, deliver on SLAs, and reduce team thrash.
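
If you want to spot-check run stability yourself outside the notebook, the sketch below pulls recent run results from the Databricks Jobs API; the workspace URL, token, and job ID are placeholders, and the Health Check’s own queries may differ.

```python
import requests
from collections import Counter

# Placeholders -- substitute your workspace URL, a personal access token, and a real job ID.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

def run_result_counts(job_id: int, limit: int = 25) -> Counter:
    """Tally terminal result states (SUCCESS, FAILED, ...) for recent runs of one job."""
    resp = requests.get(
        f"{HOST}/api/2.1/jobs/runs/list",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"job_id": job_id, "limit": limit, "completed_only": "true"},
    )
    resp.raise_for_status()
    runs = resp.json().get("runs", [])
    return Counter(run["state"].get("result_state", "UNKNOWN") for run in runs)

counts = run_result_counts(job_id=123)  # hypothetical job ID
total = sum(counts.values()) or 1
print({state: f"{100 * n / total:.0f}%" for state, n in counts.items()})
```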

Jobs by Compute Type

Databricks offers several compute options to run your workload. For example, Jobs Compute clusters are best suited for jobs that run on a schedule, while All-Purpose clusters are best suited for ad hoc analysis. We’ve seen many cases where users run scheduled jobs on All-Purpose clusters primarily to circumvent cluster spin-up and spin-down times. However, All-Purpose clusters come at a higher cost (at least 2.5x the cost of Jobs Compute clusters!).

Actionable insight: Migrate to cheaper Jobs Compute clusters and establish clear policies to grant exceptions to use All-Purpose Compute clusters.
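
As a rough back-of-the-envelope comparison, the sketch below applies placeholder DBU rates to the same run; substitute the rates from your own Databricks plan, since list prices vary by cloud and tier.

```python
# Illustrative DBU rates only -- actual rates depend on your cloud, tier, and contract.
ALL_PURPOSE_RATE = 0.55   # $/DBU (placeholder)
JOBS_COMPUTE_RATE = 0.15  # $/DBU (placeholder)

def dbu_cost(dbus_consumed: float, rate_per_dbu: float) -> float:
    """DBU portion of a run's cost (cloud VM charges are the same either way)."""
    return dbus_consumed * rate_per_dbu

dbus = 400  # hypothetical DBUs consumed by one scheduled run
print(f"All-Purpose:  ${dbu_cost(dbus, ALL_PURPOSE_RATE):,.2f}")
print(f"Jobs Compute: ${dbu_cost(dbus, JOBS_COMPUTE_RATE):,.2f}")
# With these placeholder rates, the same run costs several times more on All-Purpose compute.
```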

Photon Usage

Photon can deliver extremely fast query performance. However, whether it’s delivering ROI depends on how the performance gain compares to the cost increase of using Photon. Note that Photon is not free: it typically doubles the DBU cost compared to non-Photon compute. For more information and details, check out our blog on whether Databricks clusters with Photon and Graviton instances are worth it.

Actionable insight: Compute the ROI you’re getting out of Photon.
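
One rough way to frame that calculation is sketched below; the durations, charge rates, and the 2x DBU multiplier are placeholder numbers rather than measurements, so plug in figures from your own runs.

```python
def job_cost(duration_hr: float, dbu_rate_per_hr: float, vm_rate_per_hr: float,
             dbu_multiplier: float = 1.0) -> float:
    """Total cost of one run: DBU charges (scaled by any Photon multiplier) plus VM charges."""
    return duration_hr * (dbu_rate_per_hr * dbu_multiplier + vm_rate_per_hr)

# Placeholder numbers for one job run.
baseline = job_cost(duration_hr=2.0, dbu_rate_per_hr=10.0, vm_rate_per_hr=8.0)
photon = job_cost(duration_hr=1.2, dbu_rate_per_hr=10.0, vm_rate_per_hr=8.0,
                  dbu_multiplier=2.0)  # ~2x the DBU rate, but a 40% faster run
print(f"non-Photon: ${baseline:.2f}   Photon: ${photon:.2f}")
# Photon pays off only when the runtime reduction outweighs the higher DBU rate.
```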

Most Frequently Used Instance Types

As the subtitle suggests, this shows you the most commonly used instances. The types of instances used may change over time, as the needs of your business change. Being able to track the trends in instance types being used enables your business to remain agile and respond quickly to changing needs – such as efficiently managing your reserved instances.

Actionable insight: Drive better alignment between Databricks instance usage and your organization’s preferred instances.

Auto-Termination

Clusters with no auto-termination, or with a long auto-termination window, continue to accrue costs while they sit idle. This is generally the case with All-Purpose compute clusters, and the wasted spend could have been avoided with better auto-termination policies. Additionally, as more of these clusters are spun up, the total idle time keeps growing. Jobs Compute clusters, on the other hand, are terminated right after the job completes, so there’s generally no waste related to idle time.

Actionable insight: Set auto-termination to a minimum for your clusters and establish clear policies to grant exceptions while encouraging cluster re-use.
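
For All-Purpose clusters, auto-termination is a single field in the cluster spec. Below is a minimal sketch of setting it at cluster creation time; the cluster name, runtime version, instance type, and timeout value are illustrative placeholders.

```python
# Sketch of an All-Purpose cluster spec with a short idle timeout.
# All values below are illustrative placeholders.
cluster_spec = {
    "cluster_name": "adhoc-analysis",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    "autotermination_minutes": 30,  # terminate after 30 idle minutes instead of running forever
}
# POST this spec to /api/2.0/clusters/create, or set the same value in the UI's
# "Terminate after ___ minutes of inactivity" box. A cluster policy can make a low
# timeout mandatory rather than optional.
```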

Most Expensive Jobs

Health Check shows you your most expensive jobs based on DBUs alone. That’s only part of the picture, but comparing these costs against the business value the jobs drive lets you determine whether you’re getting ROI out of them. If you’ve determined that a job is high value, the next step is to increase ROI through rightsizing. A major cause of bloated costs is over-provisioning, where you pay for resources that sit underutilized.

Actionable insight: Determine if these jobs are high value and whether there’s opportunity to rightsize the compute to move the needle on ROI.

Wrapping Up

Sync’s Health Check provides deep insights into how your organization uses Databricks, and shines light on areas where there is opportunity to improve.

Feel free to reach out to us! We’d love to hear your feedback on how the Sync Health Check worked for you, and where there’s room for improvement. You can reach us here or send an email to our support team.

March 2024 Release Notes


Our team has been hard at work to deliver industry-leading features to support users in achieving optimal performance within the Databricks ecosystem. Take a look at our most recent releases below.

Worker Instance Recommendations

Introducing Worker Instance Recommendations, available directly from the Sync UI. With this feature, you can tap into optimal cluster configuration recommendations so that your configs are tuned for each individual job.

The instance recos within Gradient optimize not only the number of workers, but also the worker size. For example, if you are using i3.2xl instances, Gradient will find the right instance size (such as i3.xl, i3.4xl, i3.8xl, etc.) within the i3 family.


Instance Fleet Support

If your company is using instance fleet clusters, Gradient is now compatible! There are no changes required to the user flow, as this feature is supported automatically in the backend. Just onboard your jobs into Gradient as usual, and we’ll handle the rest.

Hosted Log Collection


Running Gradient is now more streamlined than ever! You can now opt into hosted log collection, with Sync-hosted or user-hosted collection options, and Sync-hosted collection runs entirely in the Sync environment. What does this mean? It means there are no extra steps or external clusters needed to run Gradient; Sync does all the heavy lifting while minimizing the impact on your Databricks workspace.

With hosted DBX log collection within Gradient, you’re able to minimize onboarding errors due to annoying permission settings while increasing visibility into any potential collection failures, ultimately giving you and your team more control over your cluster log data.


Getting Started with Collection Setup
The Databricks Workspace integration flow is triggered when a user clicks on Add → Databricks Workspace after they have configured their workspace and webhook. Users will also now have a toggle option to choose between Sync-hosted (recommended) or User-hosted collection.

  • Sync-hosted collection – The user will optionally be prompted to share where cluster logs for their Databricks Jobs are stored. This will initially be an immutable setting saved on the Workspace.
    • For AWS – Users will need to add a generated IAM policy and IAM Role to their AWS account. The IAM policy grants us ec2:DescribeInstances and ec2:DescribeVolumes, and optionally s3:GetObject and s3:ListBucket on the specific bucket and prefix to which cluster logs are uploaded. The S3 permissions are optional because users may be recording cluster logs to DBFS instead. The user also needs to add a “Trusted Relationship” to the IAM role that gives our Sync IAM role permission to sts:AssumeRole using an ExternalId we provide. Gradient will then generate this policy and trust relationship for the user in JSON format to be copied and pasted (a sketch of what that JSON can look like follows this list).
    • For Azure – Coming soon!
  • User-hosted collection – For both AWS and Azure, this proceeds as the normal workspace integration requirements dictate.
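
To make the AWS step more concrete, here is a rough sketch of the shape of that policy and trust relationship. The bucket, prefix, role ARN, and ExternalId below are placeholders; use the JSON that Gradient generates for you rather than hand-writing it.

```python
# Illustrative only -- Gradient generates the exact JSON for you.
iam_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Instance and volume metadata needed for log collection.
            "Effect": "Allow",
            "Action": ["ec2:DescribeInstances", "ec2:DescribeVolumes"],
            "Resource": "*",
        },
        {   # Optional: only needed if cluster logs land in S3 rather than DBFS.
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-cluster-logs-bucket",                # placeholder bucket
                "arn:aws:s3:::my-cluster-logs-bucket/logs/prefix/*",  # placeholder prefix
            ],
        },
    ],
}

trust_relationship = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Lets Sync's role assume this role, scoped by the ExternalId Sync provides.
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::<sync-account-id>:role/<sync-role>"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": "<external-id-from-sync>"}},
        }
    ],
}
```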

Stay up to date with the latest feature releases and updates at Sync by visiting our Product Updates documentation.

Ready to start getting the most out of your Databricks job clusters? Request a demo or reach out to us at info@synccomputing.com.

Why Your Databricks Cluster EBS Settings Matter

Sean Gorsky & Cayman Williams

Figure 1: Point comparison between the cost and runtime of a Databricks Job run using the Default EBS settings and Sync’s Optimized EBS settings. More details about the job that was used to create this data can be found in the lower-left plot in Figure 4.

Choosing the right hardware configuration for your Databricks jobs can be a daunting task. Between instance types, cluster size, runtime engine, and beyond, there are an enormous number of choices to be made. Some of these choices will dramatically impact your cost and application duration; others may not. It’s a complicated space to work in, and Sync is hard at work making the right decisions for your business.

In this blog post, we’re going to dig into one of the more subtle configurations that Sync’s Gradient manages for you. Its subtlety comes from being squirreled away in the “Advanced” settings menu, but the setting can have an enormous impact on the runtime and cost of your Databricks Job. The stark example depicted in Figure 1 is the result of Sync tuning just this one setting. That setting (really a group of settings) is the EBS volume settings.

EBS on Databricks

Elastic Block Storage (EBS) is AWS’s scalable storage service designed to work with EC2 instances. An EBS volume can be attached to an instance and serves as disk storage for that instance. There are different types of EBS volumes, three of which are relevant to Databricks:

  1. st1 volumes are HDD drives used in Databricks Storage Autoscaling
  2. gp2 volumes are SSDs; the user selects the volume count and volume size
  3. gp3 volumes are similar to gp2, but you may pay for additional throughput and IOPS separately

Apache Spark may utilize disk space for disk caching, for disk spill, or as intermediate storage between stages. Consequently, EBS volumes are required to run your Databricks cluster if there is no disk-attached (NVMe) storage. However, Databricks does not require a user to specify EBS settings. They exist, squirreled away in the Advanced menu of cluster creation, but if no selection is made then Databricks automatically chooses settings for you.

Figure 2: Screenshot of Databricks’ “Advanced” options on the Compute tab, showing the EBS gp2 volume options. If your workspace is on gp3 you can also tune the IOPS and Throughput separately, though this option is not enabled in the interface (it is possible through the API or by manipulating the cluster in the UI’s JSON mode)

The automatic EBS settings depend on the size of the instance chosen, with bigger instances getting more baseline storage according to AWS’s best practices. While these baseline settings are sufficient for running applications, they are often suboptimal. The difference comes down to how EBS settings impact the throughput of data transfer to and from the volumes.

Take for example the gp2 volume class, where the volume IOPS and throughput are direct functions of the size of the volume. The bigger the volume size, the faster you can transfer data (up to a limit). There’s additional complexity beyond this, including bandwidth bursting and instance bandwidth limits.
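
As a rough illustration of that scaling, the sketch below encodes commonly cited gp2 rules of thumb (about 3 IOPS per GiB with a 100 IOPS floor and a 16,000 IOPS ceiling, throughput topping out around 250 MiB/s, burst behavior ignored); treat the exact numbers as assumptions and check current AWS documentation before relying on them.

```python
def gp2_baseline(volume_size_gib: int) -> tuple[int, float]:
    """Rough gp2 baseline IOPS and throughput (MiB/s) as a function of volume size.

    Burst credits and instance-level bandwidth caps are deliberately ignored here.
    """
    iops = min(16_000, max(100, 3 * volume_size_gib))
    throughput_mib_s = min(250.0, iops * 0.25)  # ~256 KiB per I/O
    return iops, throughput_mib_s

for size_gib in (100, 334, 1_000):
    iops, tput = gp2_baseline(size_gib)
    print(f"{size_gib:>5} GiB -> {iops:>6} IOPS, {tput:.0f} MiB/s")
```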

So how does Sync address this problem?

Laying the Groundwork

Sync has approached this problem the same way we’ve approached most problems we tackle — mathematically. If you get way down in the weeds, there’s a mathematical relationship between the EBS settings (affecting the EBS bandwidth), the job duration, and the job cost.

The following formula shows the straightforward relationship between the EBS settings (S), the application Duration [hr], and the various charge rates [$/hr]. For clarity we write the Duration only as a function of S, but in reality it depends on many other factors, such as the compute type or the number of workers.
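
A sketch of that relationship, with the cluster’s charge rate broken into DBU, EC2, and EBS components:

$$\mathrm{Cost}(S) \;=\; \mathrm{Duration}(S) \times \big(\mathrm{Rate}_{\mathrm{DBU}} + \mathrm{Rate}_{\mathrm{EC2}} + \mathrm{Rate}_{\mathrm{EBS}}(S)\big)$$

where Duration(S) is in hours and each Rate is in $/hr.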

At first glance this equation is straightforward. The EBS settings impact both the job duration and the EBS charge rate. There must be some EBS setting where the decrease in duration outweighs the increase in charge rate, yielding the lowest possible cost.

Figure 3 exemplifies this dynamic. In this scenario we ran the same Databricks job repeatedly on the same hardware, only tuning the EBS settings to change each instance’s effective EBS throughput. An instance’s EBS throughput is the sum of the throughputs of the attached EBS volumes (ThroughputPerVolume*VolumesPerInstance), up to the maximum throughput allowed by the instance (MaxInstanceThroughput). This leads to a convenient “Normalized EBS Throughput” defined as ThroughputPerVolume*VolumesPerInstance/MaxInstanceThroughput, which we use to represent the instance EBS bandwidth.
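
In code, that normalization is simply the ratio below; the example values are placeholders rather than measurements from Figure 3.

```python
def normalized_ebs_throughput(throughput_per_volume_mib_s: float,
                              volumes_per_instance: int,
                              max_instance_throughput_mib_s: float) -> float:
    """ThroughputPerVolume * VolumesPerInstance / MaxInstanceThroughput.

    Values above 1.0 are possible on paper, but the instance cap means anything
    beyond 1.0 buys no additional effective bandwidth (see Figure 3).
    """
    return (throughput_per_volume_mib_s * volumes_per_instance
            / max_instance_throughput_mib_s)

# e.g. two volumes at ~250 MiB/s each on an instance whose EBS cap is (say) 1,000 MiB/s
print(normalized_ebs_throughput(250.0, 2, 1000.0))  # 0.5
```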

Figure 3: (left) Application duration vs normalized EBS throughput, defined as ThroughputPerVolume*VolumesPerInstance/MaxInstanceThroughput. Increasing throughput reduces runtime with diminishing returns, and increasing throughput beyond 1.0 (the maximum throughput allowed by the instance) has no effect on the application duration. (right) Total cluster cost vs normalized EBS throughput. Since EBS contributes to the cost rate of the cluster, the optimal cost corresponds to a throughput value below the instance maximum.

The plot on the right shows the cost for each point in the left plot. Notably, there’s a cost-optimum at a normalized throughput of ~0.5, well below the instance maximum. This is a consequence of the delicate balance between the cost rate of the EBS storage and its impact on duration. The wide vertical spread at a given throughput is due to the intricate relationship between EBS settings and throughput. In short, there are multiple setting combinations that will yield the same throughput, but those settings do not have the same cost.

Sync’s Solution

The most notable feature in Figure 3 is the smooth and monotonically decreasing relationship between duration and throughput. This is not entirely unexpected, as small changes in throughput ought to yield small changes in duration, and it would be surprising if increasing the throughput also increased the runtime. Consequently, this space is ripe for the use of modeling — provided you have an accurate enough model for how EBS settings would realistically impact duration (wink).

The downside to modeling is that it requires some training data, which means a customer would have to take deliberate steps to collect the data for model training. For GradientML we landed on a happy medium.

Our research yielded a simple fact: immediately jumping to EBS settings that efficiently maximize the worker instance’s EBS throughput produces a relatively small increase in the overall charge rate but, in most cases, a worthwhile decrease in run duration. When we first start managing a job, we bump up the EBS settings to do exactly this.
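
As a toy illustration of that idea, and explicitly not Gradient’s actual logic, one could enumerate gp2 volume counts and sizes and pick the cheapest combination that reaches the instance’s EBS bandwidth cap; the price, cap, and candidate sizes below are placeholders.

```python
# Toy illustration only -- all numbers are placeholders.
GP2_PRICE_PER_GIB_MONTH = 0.10   # $/GiB-month (placeholder)
INSTANCE_EBS_CAP_MIB_S = 1000.0  # instance EBS bandwidth cap (placeholder)

def gp2_throughput(size_gib: int) -> float:
    """Rough per-volume gp2 throughput: ~3 IOPS/GiB (floor 100, cap 16k), ~256 KiB per I/O."""
    iops = min(16_000, max(100, 3 * size_gib))
    return min(250.0, iops * 0.25)

candidates = [(count, size) for count in range(1, 9) for size in (250, 500, 1000, 2000)]
saturating = [(count, size) for count, size in candidates
              if count * gp2_throughput(size) >= INSTANCE_EBS_CAP_MIB_S]

# Cheapest (fewest provisioned GiB) combination that reaches the instance cap.
count, size = min(saturating, key=lambda c: c[0] * c[1])
print(f"{count} x {size} GiB -> ${count * size * GP2_PRICE_PER_GIB_MONTH:.0f}/month of EBS")
```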

We explore the consequences of this logic in Figure 4, which depicts six different jobs where we compare the impact of different EBS settings on cost and runtime. Every job uses the same cluster consisting of one c5.24xlarge worker. In addition to the “default” and “optimized” settings discussed thus far, we also tested with autoscaled storage (st1 volumes, relatively slow HDDs) and disk-attached storage (one c5d.24xlarge worker instead, which has lightning-fast NVMe storage).

The top row consists of jobs that are insensitive to storage throughput: maximizing the EBS settings did not meaningfully impact cost. In these cases data transfer to and from storage had a negligible impact on the duration of the overall application, and so the duration was insensitive to the EBS bandwidth.

The bottom row consists of jobs where this data transfer does meaningfully impact the application duration, and which are therefore more sensitive to the throughput. Coincidentally, the disk-attached runs did not show any meaningful cost reduction over the EBS-optimized runs, though this is most certainly not a universal trend.

Figure 4: Several tests to assess the impact of EBS settings on Databricks Job durations. The top row depicts jobs where the EBS choice has a negligible impact on duration and cost. The bottom row depicts jobs which are very sensitive to EBS throughput, indicated by the steep drop in cost of the ebs_optimized and disk_attached bars. Every run uses a single c5.24xlarge worker instance, except for the disk-attached (green) runs, which use one c5d.24xlarge worker.

Conclusion

With the abstraction that is cloud computing, even the simplest offerings can come with a great deal of complexity that impacts the duration and cost of your application. As we’ve explored in this blog post, choosing appropriate EBS settings for Databricks clusters is an excellent illustration of this fact. Fortunately, the smooth relationship between duration and an instance’s EBS throughput lends itself to the powerful tools of mathematical modeling, the kind of thing that Sync eats and breathes. We’ve employed this expertise not only in the analysis in this blog, but also in our compute management product GradientML, which manages compute decisions for Databricks clusters and automatically implements these optimizations on your behalf.

Gradient New Product Update Q4 2023

Today we are excited to announce our next major product update for Gradient to help companies optimize their Databricks Jobs clusters.  This update isn’t just a simple UI upgrade…

We upgraded everything from the inside out! 

Without burying the lede: the screenshot above shows the new project page for Gradient.

Back in the last week of June of this year (2023), we debuted the first release of Gradient. In the months since, we’ve gathered user feedback on how we can make the experience even better.

So what were the high level major feature requests that we learned in the past few months?

  • Visualizations – Visual graphs that show the cost and runtime impact of our recommendations, so you can see the ROI Gradient delivers
  • Easier integration – Easier “one-click” installation and setup experience with Databricks
  • More gains – Larger cost savings gains custom tailored to the unique nature of each job
  • Azure support – A large percentage of Databricks users are on Azure, and obviously they wanted us to support them

Those feature requests weren’t small and required pretty substantial changes from the backend to the front, but at the end of the day we couldn’t agree more with the feedback. While a sane company would prioritize and tackle these one by one, we knew each of these was actually interrelated with the others behind the scenes, and it wasn’t just a simple matter of checking off a list of features.

Here’s our high level demo video to see the new features in action!

So we took the challenge head-on and said “let’s do all of it!” With all of that in mind, let’s walk through each awesome new feature!

Feature #1:  See Gradient’s ROI with cost and runtime Visualizations

With new timeline graphs, users can see the performance of their jobs in real time and the impact Gradient is having. As a general monitoring tool, users can now also see the impact of various cloud anomalies on their cost and runtime. A summary of benefits is below:

  • Monitor your jobs’ total costs across both DBUs and cloud fees in real time to stay informed
  • Ensure your job runtimes and SLAs are met
  • Learn which anomalies are impacting your jobs’ performance
  • Visualize Gradient’s value by watching your cost and runtime goals being met

Feature #2:  Cluster integrations with AWS and Azure

Gradient now interfaces with both AWS and Azure cloud infrastructure to obtain low level metrics. We know many Databricks enterprises utilize Azure and this was a highly requested feature. A summary of benefits is below:

  • Granular compute metrics are obtained by retrieving cluster logs beyond what Databricks exposes in their system tables
  • Integrate with Databricks Workflows or Airflow to plug Gradient into how your company runs your infrastructure
  • Easy metrics gathering as Gradient does the heavy lifting for you and automatically compiles and links information across both Databricks and cloud environments

Feature #3:  A new machine learning algorithm that custom learns each job

A huge upgrade from our previous solution is a new machine learning algorithm that learns the behavior of each job individually before optimizing. One lesson we learned is that each job is unique; from Python to SQL to ML to AI, the variety of codebases out there is massive. A blanket “heuristic” solution was not scalable, and it was clear we needed something far more intelligent. A summary of the benefits is below:

  • Historical log information is used to train custom models for each of your jobs.  Since no two jobs are alike, custom models are critical to optimizing at scale.
  • Accuracy is ensured by training on real job performance data
  • Stability is obtained with small incremental changes and monitoring to ensure reliable performance

Feature #4: Auto-import and setup all of your jobs with a single click

Integrating with the Databricks environment is not easy, as most practitioners can attest. We invested a lot of development into the question “how do we make it easy to onboard jobs?” After a bunch of work and conversations with early users, we’ve built the easiest system we could: just push a button.

Behind the scenes, we’re interacting with the Databricks API, tokens, secrets, init scripts, webhooks, logging files, cloud compute metrics, storage – just to name a few. A summary of the benefits is below:

  • Gradient connects to your Databricks workspace behind the scenes to make importing and setting up job clusters as easy as a single click
  • Non-invasive webhook integration is used to link your environment with Gradient without any modifications to your code or workflows

Feature #5:  View and approve recommendations with a click

With all of the integration setup done in the previous feature, applying recommendations is now a piece of cake. Just click a button and your Databricks jobs will be automatically updated. No need to go into the Databricks console or change anything in another system. We take care of all of that for you! A summary of the benefits is below:

  • View recommendations in the Gradient UI for your approval before any changes are actually made
  • Click to approve and apply a single recommendation so you are always in control

Feature #6:  Change your SLA goals at any time

We’ve always believed that business should drive infrastructure, not the other way around. Now you can change your SLA goals at any time and Gradient will adjust your cluster settings to meet them. With the new visualizations, you can see everything changing in real time as well. A summary of the benefits is below:

  • Runtime SLA goals ultimately dictate the cost and performance of your jobs.  Longer SLAs can usually lead to lower costs, while shorter SLAs could lead to higher costs.
  • Goals change constantly for your business; Gradient allows your infrastructure to keep up at scale
  • Business-led infrastructure allows you to start with your business goals and work backwards to your infrastructure, not the other way around

Feature #7:  Enable auto-apply for self-improving jobs

One big request was for users at scale, who have hundreds or thousands of jobs. There’s no way someone would want to click an “apply” button 1000x a day! So, for our ultimate experience, we can automatically apply our recommendations and all you have to do is sit back and watch the savings. A summary of the benefits is below:

  • Focus on business goals by allowing Gradient to constantly improve your job clusters to meet your ever changing business needs
  • Optimize at scale with auto apply, no need to manually analyze individual jobs – just watch Gradient get to work across all of your jobs
  • Free your engineers from manually tweaking cluster configurations, allowing them to focus on more important work

Try it yourself!

We’d love to get your feedback on what we’re building.  We hope these features resonate with you and your use case.  If you have other use cases in mind, please let us know! 

To get started – see our docs for the installation process!

Connect with us now via booking a demo, chatting with us, or emailing us at support@synccomputing.com.

Sync is now SOC2 Type I Compliant!

Introduction

In the world of data, ironclad security is table stakes when considering third-party vendors to work with. Here at Sync, we take our customers’ security seriously and want to ensure that their sensitive information is handled with the utmost care. That is why we are thrilled to announce that Sync has successfully achieved SOC 2 Type I compliance, a significant milestone in our commitment to data security and privacy.

To request our SOC 2 Type I report, please see our security portal in the documentation.

What is SOC 2 Type I Compliance?

SOC 2 (System and Organization Controls) is a widely recognized auditing standard developed by the American Institute of CPAs (AICPA). It focuses on the security, availability, processing integrity, confidentiality, and privacy of customer data. Achieving SOC 2 compliance demonstrates a company’s dedication to implementing and adhering to strict information security policies and procedures.

A SOC 2 Type I report is the initial step in the compliance process. It attests that the organization’s security controls are in place and have been suitably designed to meet the criteria specified in the Trust Services Criteria. A Type I report evaluates the design of these controls at a single point in time, whereas a Type II report evaluates their operating effectiveness over a period of time, typically six months or more.

Why SOC 2 Type I Compliance Matters

Enhanced Data Security: Achieving SOC 2 Type I compliance signifies a rigorous commitment to safeguarding sensitive information. It ensures that our systems and procedures have been thoroughly scrutinized and meet the highest standards of data security.

Customer Trust and Confidence: In an era where data breaches are commonplace, customers are becoming increasingly vigilant about the companies they choose to do business with. SOC 2 compliance provides assurance that we take data protection seriously and are willing to invest in the necessary safeguards.

Competitive Advantage: SOC 2 compliance is a differentiator in the market. It sets us apart from competitors who may not have undergone such stringent assessments. It becomes a clear signal to potential clients that we prioritize their data security.

Reduced Risk of Security Incidents: The rigorous auditing process required for SOC 2 compliance often reveals areas for improvement in an organization’s security posture. Addressing these issues reduces the risk of potential security incidents, which can have serious repercussions for both the company and its customers.

Streamlined Vendor Relationships: Many organizations now require their vendors to demonstrate SOC 2 compliance as a prerequisite for doing business. By achieving this certification, we eliminate a potential barrier to entry in the marketplace and can establish partnerships with companies that prioritize security.

The Road to Compliance

Achieving SOC 2 Type I compliance was no small feat for Sync. It required meticulous planning, dedication, and collaboration across various departments. Here’s a glimpse into the journey:

Risk Assessment and Gap Analysis: A primary step was to conduct a thorough risk assessment to identify potential vulnerabilities along with a gap analysis to determine where our existing security controls could be improved to reduce risk.

Policy and Procedure Development: We developed and implemented new policies and procedures to address control deficiencies. These covered a wide range of areas, including access controls, encryption, incident response, and more.

Employee Training and Awareness: Our employees are our first line of defense against security threats. Extensive training programs were implemented to ensure that every team member understands their role in maintaining the security of our systems and data.

Continuous Monitoring and Testing: Achieving compliance is not a one-time event. It requires ongoing vigilance and testing to ensure that security controls remain effective over time. Regular audits and assessments are now a permanent part of our security strategy.

Conclusion

Achieving SOC 2 Type I compliance is a significant milestone for Sync, one that underscores our unwavering commitment to data security and privacy. It is a testament to the hard work and dedication of our team members across the organization.

As we move forward, we will continue to invest in and prioritize data security to ensure that our customers can trust us with their most sensitive information. SOC 2 compliance is not the end of our journey; it is a foundation upon which we will build even stronger security practices to meet the evolving challenges of the digital landscape.

We are excited to embark on this new chapter and look forward to providing our customers with the highest level of confidence in the security of their data. Thank you for being a part of this journey with us.

To view our SOC 2 Type I report, see our security portal in the documentation.