databricks

Databricks 101: An Introductory Guide on Navigating and Optimizing this Data Powerhouse

Noa Shavit
07.16.2024

In an era where data reigns supreme, the tools and platforms that businesses utilize to harness and analyze their data can make or break their competitive edge. Among these, Databricks stands out as a powerhouse, yet navigating its complexities often feels like deciphering cryptic code. With businesses generating an average of 2.5 quintillion bytes of data daily, the need for a robust, efficient, and cost-effective data cloud has never been more critical.

In this post, we demystify Databricks, with a focus on Job Clusters. Readers will gain insight into the platform’s workspace, the pivotal role of workflows, and the nuanced world of compute resources including All-Purpose Compute (APC) clusters vs Jobs Compute clusters. We’ll also shed light on how to avoid costly errors and optimize resource allocation for data workloads.

Curious about how Databricks can transform your data strategy while keeping costs in check? Read on to learn how.

Introduction to Databricks: Components and configurations

*Image source: https://www.serveradminz.com/blog/databricks-an-advanced-analytics-solution/*

Databricks, at its core, is a comprehensive platform that integrates various facets of data science and engineering, offering a seamless experience for managing and analyzing vast datasets.

Central to its ecosystem is the Workspace, a collaborative environment where data teams can store files, collaborate on notebooks, and orchestrate data workflows, to name a few capabilities. Workspaces act as the nerve center along with the Unity Catalog, which provides a bridge between workspaces. Together they facilitate an organized approach to data projects and ensure that valuable insights are never siloed.

Workflows, available under each Workspace, represent recurring production workloads (aka jobs) which are vital for database production environments. These workflows automate routine data processing tasks, such as machine-learning pipelines, ETL workloads, and data streaming, ensuring reliability and efficiency in data operations. Understanding the significance of workflows is essential for anyone looking to streamline their data processes in Databricks.

Databricks Job Clusters

Workflows rely on Job Clusters for compute resources. Databricks offers various compute resource options to pick from during the cluster set up process. While the default cluster compute is set to Serverless, you can expose and tweak granular configuration options by picking one of the classic compute options.

Read on for more info about Databricks Job Cluster options.

Serverless Jobs vs. Classic Compute

Multiple options to pick from during the cluster set up process

Databricks recently announced the release of serverless compute across the platforms at the Data + AI 2024 conference. The goal is to provide users with a simplified cluster management experience and reduce compute costs. However, based on our research, Jobs Serverless isn’t globally cheaper than Classic Compute, and there is a price for the convenience that serverless offers.

We compared the cost and performance of serverless jobs vs jobs running on classic clusters and found that while short ad-hoc jobs are ideal candidates for serverless, optimized classic clusters outperformed their serverless counterparts by 60% in costs.

More about the lessons we learned from experimenting with Databricks Serverless Jobs.

As for the convenience aspect, it does come at a cost. Serverless compute unburdens you from setting up cluster configuration, but that also means you lose control over job performance and costs. There are no settings to tweak or prices to negotiate with cloud providers. So if you are like us, and want to be able to control the configuration of your data infrastructure this might not be the right option for you.

On-demand clusters vs. Spot instances

Compute resources form the backbone of Databricks cluster management. The basic classification of compute resources is between on-demand clusters vs Spot instances. Spot instances are considered cost effective, offering discounts of up to 90% on remnant cloud compute. However, they aren’t the most stable or reliable. This is because the number of workers running can change in the middle of a job, which is dangerous. The job runtime and cost could become highly variable, and sometimes even crash.

Overall, on-demand instances are better suited for workloads that cannot be interrupted, workloads with unknown compute requirements, and immediate launch and operation needs. Spot instances, on the other hand, are ideal for workloads that can be interrupted and stateful applications with surge usage.

Spot vs On-demand	On-demand Instances	Spot Instances
Access	Immediate	Only if there is unused compute capacity
Performance	Stable and reliable	Limited stability as workers can be pulled during a job run
Cost	Known	Varies. Up to 90% cheaper than on-demand rates
Best for	Jobs that cannot be interrupted Jobs with unknown compute requirements	Jobs that can be interrupted Stateful apps with surge usages

Spot vs On-demand Clusters

All-Purpose Compute vs Jobs Compute

Databricks offers a couple forms of compute resources, including All-Purpose Compute (APC) clusters, Jobs Compute clusters, and SQL warehouses. Each resource is designed with specific use cases in mind.

SQL warehouses, for example, allow users to query information in Databricks using the SQL programming language. APCs and Jobs Compute, on the other hand, are more general compute resources capable of running many different languages such as Python, SQL, and Scala to run your jobs. While APCs are ideal for interactive analysis and collaborative research, Jobs Compute is ideal for scheduled jobs and workflows. Understanding the distinctions and use cases for these compute resources is crucial for optimizing performance and managing costs.

Unfortunately, navigating the compute options for your clusters can be difficult. Common errors include the accidental deployment of more expensive compute resources, such as APC cluster when a Jobs Compute cluster would suffice. Such mistakes can have significant financial implications. In fact, we’ve found that APCs can cost up to 50% more than running the same job using Jobs Compute.

The difference between APC clusters and Jobs Compute clusters represents a crucial decision point for Databricks users, as they spin up a cluster. And each option is tailored to different stages of the data analysis lifecycle. Knowing these nuances can help you avoid common errors to ensure that your Databricks environment is both effective and cost-efficient. For example, APC clusters can be used for more exploratory research work while Job Clusters are used for mature production scheduled workloads.

Photon workload acceleration

Databricks Photon is a high-performance vectorized query engine that accelerates workloads.

Photon can substantially speed up job execution, particularly for SQL-based jobs and large-scale production Spark jobs. Unfortunately, Photon costs about 2x/DBU.

Learn more about the pros and cons of Databricks Photon in this post.

Many factors influence whether Photon is the right choice for you, which is why we recommend you A/B test the same job with and without Photon enabled. From our experience, 9 out of 10 organizations opt out of Photon once the results are in.

Spark autoscaling

Databricks offers an autoscaling feature to dynamically adjust the number of worker nodes in a cluster based on the workload demands.

By dynamically adjusting resources, autoscaling can reduce costs for some jobs. However, due to the spin up time cost of new workers, sometimes autoscaling can lead to increased costs. In fact, we found that simply applying autoscale across the board can lead to a 30% increase in compute costs.

Read the full analysis of Databricks autoscaling here.

Notebooks for ad-hoc research in Databricks

Databricks revolutionizes the data science and data engineering landscape with its notebook concept, which combines code with annotations. It embodies a dynamic canvas where you can breathe life into your projects.

Notebooks in Databricks excel in facilitating chunk-based code execution, a feature that significantly enhances debugging efforts and iterative development. By enabling users to execute code in segments, we eliminate the need for running entire scripts for minor adjustments. This accelerates the development cycle and promotes a more efficient debugging process.

Notebooks can run on APC clusters and Jobs Compute clusters depending on the task at hand. This choice is paramount when project requirements dictate the need for either exploratory analysis when APC is ideal, or scheduled jobs when Jobs Compute will be the most cost effective.

It’s important to note that All-Purpose Compute clusters are shared environments with a collaborative aspect. Multiple data scientists can simultaneously work on the same project with collaboration facilitated by shared compute resources. As a result, team members can contribute to and iterate on analyses in real-time. Such a shared environment streamlines research efforts, making APC clusters invaluable for projects requiring collective input and data exploration.

The journey from ad-hoc research in notebooks to the production-ready status of workflows marks a significant transition. While notebooks serve as a sandbox for exploration and development, workflows embody the structured, repeatable processes essential for production environments.

This contrast underscores the evolutionary path from exploratory analysis, where ideas and hypotheses are tested and refined, to the deployment of automated, scheduled jobs that operationalize the insights gained. It is this transition from the “playground” of notebooks to the “factory floor” of workflows that encapsulates the full spectrum of data processing within Databricks, bridging the gap between initial discovery and operational efficiency.

From development to production: Workflows and jobs

The journey from ad-hoc research in notebooks to the deployment of production-grade workflows in Databricks encapsulates a sophisticated transition, pivotal for turning exploratory analyses into automated, recurring processes that drive business operations. At the core of this transition are workflows, which serve as the backbone for scheduling and automating jobs in Databricks.

Workflows stand out by their ability to convert manual, repetitive tasks into automated sequences that run based on predefined triggers. These triggers vary widely, from the arrival of new files in a data lake to continuous data streams that require real-time processing. By leveraging workflows, data teams can ensure that their data pipelines are responsive, timely, and efficient, enabling seamless transitions from data ingestion to insight generation.

Directed Acyclic Graphs (DAGs) are central to workflows and further enhance the platform by providing a graphical representation of workflows. DAGs in Databricks allow users to visualize the sequence and dependencies of jobs, offering a clear overview of how data moves through various processing stages. This is an essential feature for optimizing and troubleshooting workflows, enabling data teams to identify bottlenecks or inefficiencies within their pipelines.

Transitioning data workloads from the exploratory realm of notebooks to the structured environment of production-grade workflows enables organizations to harness the full potential of their data, transforming raw information into actionable insights.

Orchestration and Competition: Databricks Workflows vs. Airflow and Azure Data Factory

By offering Workflows as a native orchestration tool, Databricks aims to simplify the data pipeline process, making it more accessible and manageable for its users. With that said, it also opens up a complex landscape of competition, especially when viewed against established orchestration tools like Apache Airflow and Azure Data Factory.

Comparing Databricks Workflows with Apache Airflow and Azure Data Factory reveals a compelling competitive landscape. Apache Airflow, with its open-source lineage, offers extensive customization and flexibility, drawing users who prefer a hands-on, code-centric approach to orchestration. Azure Data Factory, on the other hand, integrates deeply with other Azure services, providing a seamless experience for users already entrenched in the Microsoft ecosystem.

Databricks Workflows promise simplicity and integration, especially appealing to those who already leverage the platform. The key differentiation lies in Databricks’ ability to offer a cohesive experience, from data ingestion to analytics, within a single platform.

The strategic implications for Databricks are multifaceted. On one hand, introducing Workflows as an all-in-one solution locks users into its ecosystem. On the other hand, it challenges Databricks to continually add value beyond what users can achieve with other tools. The delicate balance between locking users in and providing unmatched value is critical; users will tolerate a certain degree of lock-in if they perceive they are receiving significant value in return.

As Databricks continues to expand its suite of tools with Workflows and beyond, the landscape of data engineering and analytics tools is set for a dynamic evolution. The balance between ecosystem lock-in and value provision will remain a critical factor in determining user adoption and loyalty. The competition between integrated solutions and specialized tools will shape the future of data orchestration, with opportunities for both consolidation and innovation.

Conclusion

In today’s data-driven world, mastering tools like Databricks is crucial. This guide aims to help simplify the management of Databricks Job Clusters. Our goal is to help you navigate the complexities and optimize resource allocation effectively.

Understanding the distinction between compute resources, such as APC clusters for interactive analysis and Jobs Compute clusters for scheduled jobs, is essential. Understanding when to apply Photon and when not to is also critical for cluster optimization. Simply choosing the right compute options for your jobs can reduce your Databricks bill by 30% or more.

The journey from exploration and research in notebooks to production-ready automated and scheduled workflows is a significant evolution. Workflows simplify data pipeline orchestration, and compete with other data orchestration tools like Apache Airflow and Azure Data Factory.

While Airflow offers extensive customization and Azure Data Factory integrates seamlessly with Microsoft services, Databricks Workflows provide a cohesive experience from data ingestion to analytics. The challenge for Databricks is to balance ecosystem lock-in with delivering unmatched value, ensuring it continues to enhance its tools and maintain user loyalty in an evolving data landscape.

If you are interested in learning more about Databricks cluster management and cost optimization follow us on Medium, or subscribe to your newsletter using the form in our footer.

If you’ve already ramped up your activity on Databricks and are looking for a way to cut costs, we can help you reduce spend by 50%. Use this free Databricks workspace health check for an estimation of your potential savings, and actionable insights into your cluster configurations (most expensive jobs, candidates for Photon and autoscaling, APC mistakes, and more).

Top 9 Lessons Learned about Databricks Jobs Serverless

Jeffrey Chou
07.02.2024

To much fanfare, Databricks announced their wide release of serverless compute across all of their platforms at the Data + AI 2024 conference. It’s quite clear that Databricks’ vision is to own the compute layer to make life easier for end users, so they don’t have to worry about annoying cluster or versions details.

We at Sync are all about cloud compute efficiency. So of course we had to take a closer look at Databricks Serverless Compute to provide an honest perspective of serverless computing compared classic computing, and evaluate the pros and cons.

Full disclosure – Sync is a Databricks partner. With that said, everything we state here is merely our opinion, which is based on the experimental measurements covered below. Our goal is to provide an unbiased guide, so that users can decide for themselves whether or not serverless is a good choice for them.

This post focuses on the jobs serverless feature, which is currently under public preview. For more information about the SQL warehouse serverless product, check out this post.

What is serverless? – A travel analogy

At a high level serverless means that end users don’t have to provision cloud infrastructure anymore, Databricks will do it all for you – so users can just focus on their code. For example, selecting which instances to use is now under the control of Databricks, not the end user.

While this sounds like a no-brainer, if you zoom in a bit closer there are some pros and cons. Let’s use an analogy here. Let’s say you want to travel from San Francisco to London.

The “classic” way of doing this is all of the planning is up to you, whether you travel by car, train, boat, bus, walking, running, or even swimming it’s all up to you to figure out. And then there’s lodging, scheduling, and budget to think about. This is a lot of work. However, you can custom tailor to your exact specifications, including timing and budget.

The “serverless” way of doing this is you close your eyes in San Francisco and you wake up in London. That sounds like the dream situation. However, you are soon handed a bill that costs $50,000 and you arrive a day after your big meeting. How did that work out for you?

If you’re a high volume traveler for business, you may prefer the high level of control the “classic” way brings, since you need that level of granularity. If you’re a wealthy aimless traveler just enjoying the world, serverless is probably the dream. It all depends, both methods are potentially great.

The lessons learned

As you review the results, we’d like you to bear in mind that jobs serverless will most likely improve with time, and that our current results are merely a snapshot in time. Will these numbers hold up in 1 year? Maybe, maybe not – we have no idea. We can only hope that this feedback helps shape the offering with some honest feedback from the field.

Now for the good stuff: here are the top 9 lessons learned from evaluating Databricks Jobs Serverless.

Serverless compute is not cost optimized
Ideal for short or ad-hoc jobs
Eliminating spin up time is the biggest value add
Serverless has zero knobs, which makes life easy but at the price of control
You have no control over the runtime of your jobs
Migrating to serverless is not easy
Costs are completely determined by Databricks
What happens if there’s an error?
You can’t leverage your cloud contracts

Read on for our in-depth analysis of Databricks server vs. serverless computing.

1. Serverless compute is not cost optimized

By far, the biggest hope users have for serverless is that “it will optimize my compute for me.” While this is true in some regard, the core issue users need to understand is “Does serverless provide the lowest cost option?” We can answer very clearly and unambiguously that, unfortunately, it does not. Serverless is not the cheapest option around.

While serverless is a pretty good option, it’s certainly not the most optimal when it comes to cost for all use cases. As evidence, we ran a test job and found that an optimized cluster by Gradient (our flagship product), outperformed Databricks serverless jobs by roughly 60% from a cost perspective!

Our test job runs basic queries on a randomly generated dataset. The runtime on a classic cluster is about 1 hour, which is pretty typical in many jobs we’ve encountered at companies. Serverless was able to run the job much faster, taking only 30 minutes which was great to see. Unfortunately, the runtime savings didn’t translate to the cost savings, as you can see in the chart below.

In the test we are utilizing on-demand instances with list pricing. Users will likely save even more if using Spot instances. However, you have no option to use spot instances with serverless. You have no access at all to what’s going on.

This test result might not translate to your internal jobs. You may have a job that demonstrates that serverless massively outperforms an optimized cluster in terms of costs – it all depends on your workload. With that said, this data point does prove that serverless is not GLOBALLY optimal. Serverless does not guarantee cost savings.

Some skeptics might say that the cost savings of serverless appears in the form of engineering hours saved. With serverless, engineers don’t have to spend time thinking about clusters – which can translate to real time and money saved. We completely agree with this point of view, that is very substantial.

Our one counter argument is that getting started on any cluster is pretty easy today, so most people don’t spend time tuning their clusters if they don’t want to. Engineers typically resort to cluster tuning to help lower cost or improve performance. So if the cost and performance of Serverless is not ideal, you’re just out of luck – serverless may not be solving the root issue that tuning is attempting to solve.

At the end of the day everything depends on the particularities of your workload and use case.

2. Ideal for short or ad-hoc jobs

A great use case for serverless, which we fully endorse, is using serverless for short (<5 min) jobs. The elimination of spin up time for your cluster is a massive win that we really love.

Here’s an experiment we ran using a trivial job that doesn’t really do anything. The job doesn’t even run Spark and is run on a single node cluster. The cost was slightly lower with serverless, but the big win was in the runtime where we saw roughly 80% reduction in runtime! This improvement is mostly due to the complete elimination of cluster spin up time which can take 5-10 minutes.

What’s interesting is even though serverless does not have spin up time, the cost premium for serverless still equated to roughly the same overall cost – which was a bit disappointing.

With that said, the big win is that users may not always know that they are running a short job and could be massively over provisioning their clusters on accident.

Serverless helps to avoid that mistake and that can result in substantial big cost savings – simply preventing human error.

3. Eliminating spin up time is the biggest value add

We couldn’t love this aspect enough. Cluster spin up time is such a pain to deal with when you’re just trying to run something in real-time. So many times users have to wait for a cluster to spin up and get sidetracked by another task so that they don’t come back to the cluster until an hour later.

Our one nit pick here is that for scheduled jobs in a pipeline, spin up time is less of a concern. These are jobs that can be running at all hours of the day at scales of 1000s of jobs running per day. At that level, spin up time is really a cost factor – and then the real question is if serverless provides the lowest cost.

However, if you’re doing a quick ad-hoc experiment, or just want to get a quick result – we highly recommend serverless as you’ll get your result much faster.

4. Serverless has zero knobs, which makes life easy but at the price of control

Something quite unique about Databrick jobs serverless is that there are zero knobs. Not even “T-shirt” sizing, like what we have today in SQL serverless platforms. This means that you can’t even select “small, medium, large, x-large” clusters – you don’t get to select anything.

For a company that is just trying to get jobs up and running asap, we think this can be pretty great. It can save engineers some time when it comes to provisioning infrastructure.

The big tradeoff is that you can’t change anything. If you care about cost and runtime and want the ability to tune performance, then this may not be a convenient feature.

5. You have no control over the runtime of your jobs

The big downside of jobs serverless is that there’s no way to tune the cluster to adjust cost or runtime. You basically have to live with whatever Databricks decides. This means that if you want faster runtime, you can’t just throw a bigger cluster at it and call it a day. You can’t do anything really, except change your code. You’re stuck.

We can only assume that eventually Databricks will throw in some high-level “performance” knob, as we think this is a pretty big limitation, but who knows.

6. Migrating to serverless is not easy

Serverless utilizes shared compute resources in the background, and as a result enforces a large number of general restrictions. We’ve heard rumors that it’s a giant Spark on Kubernetes clusters, but don’t quote us on that.

A couple impactful restrictions of serverless are:

You must have Unity Catalog enabled
Scala and R are not supported
Only ANSI SQL is supported when writing SQL
Spark RDD APIs are not supported
Caching API and SQL commands are not supported
Global temporary views are not supported
You cannot access DBFS

This goes on and on, exceeding over 100 limitations. We, in fact, had a hard time getting ANY job to run on serverless. We ran into issues even with simple test jobs. It wasn’t until we manually changed the code and moved data around that we finally got it to work.

Our opinion is unless this is improved dramatically, it will be a giant lift and shift amount of work for enterprises who built their jobs on classic compute. Serverless eliminates the general flexibility we had on classic clusters. This will likely slow the adoption of serverless for larger companies. Probably new workloads will get onboarded to serverless first, before any big migration effort takes place.

7. Costs are completely determined by Databricks

One thing we found troubling was the pricing. On the Databricks website they say that the cost is $0.35/DBU. But, where is the DBU/hr metric? Normally, one would take the runtime of the job and calculate a $/hr rate. Then, a user could tune the cluster size and tune the rate of cost. But with zero knobs, we have no control over the rate of cost.

It appears that we only get the number of DBUs a job costs AFTER it completes via the system tables. We find it odd that this number is calculated completely in the dark. Today my job costs 10 DBUs, next month it may cost 12 DBUs. Why? No clue. The amount of power Databricks has here is a bit overreaching in our opinion.

In fact, because there is no DBU/hr metric, the “list price” they have here of $0.35/DBU is incomplete. Since the amount of DBUs my jobs cost is solely under Databricks’ control, the end price is pretty much arbitrary. For example, the calculation of job cost is simply: $0.35/DBU * X DBU, where X is the amount of DBUs Databricks determines and reports in the system tables.

This is quite the advantage for Databricks, and in our opinion, it will result in large revenue and profit growth. Any compute optimization they do on the backend helps widen their margins, while the cost to the user stays mysterious. I don’t blame Databricks for doing this, and this is also not a new concept. In fact, many companies prefer to run serverless for this revenue generating reason. Users get “convenience” and Databricks makes more money, it’s a fair exchange.

What’s funny is that this is approaching the Snowflake’s serverless model. Not only is Snowflake the company’s mortal enemy, many people complain that Snowflake is too expensive for this very reason. It will be interesting to see how the market reacts in the long run and if sticker shock to their serverless bills will cause some CFOs to take action.

8. What happens if there’s an error?

In Spark-land, everyone knows of the dreaded out of memory (OOM) error. What happens if the cluster under the serverless hood hits this error for your job? Typically we would try to fix this with more memory, but we don’t have that option with serverless.

Are users now dependent on Databricks to fix this? That could be dangerous. In cluster-land, a million things can go wrong, and now that it’s all managed by Databricks you’re pretty much at the mercy of their support team if anything critical goes down.

9. You can’t leverage your cloud contracts

If you have established special discounts with AWS or Azure, or custom plans for certain instances – listen up. If you use serverless jobs those discounts can become irrelevant in regards to your Databricks usage. This is because on serverless the compute runs inside the Databricks environment.

This may or may not apply to your company, it really depends on the nature of your contracts and the volume of your Databricks usage. We thought we’d point it out as it is a major difference between classic and serverless compute.

Conclusion

Distributed computing is a very complex topic, and rarely is anything a guaranteed slam dunk. In this post, we covered the pros and cons of Databricks serverless jobs as it is today.. Some will find this capability beneficial and that’s fantastic! Based on our initial analysis, we found serverless jobs to be ideal for short and ad-hoc jobs, as the largest value add is the elimination of spin up times.

As most companies, Databricks’ marketing overshoots the benefits of their features from time to time. We’ve already reported on this in regards to Photon and Autoscaling, and we hope we have now sprinkled some truth in regards to Serverless as well. By the way, Databricks states that Photon and autoscaling are automatic for jobs serverless, which in our analysis, often leads to unnecessary cost increases.

As you can imagine, we get a lot of questions about serverless. Here are the top two questions we are asked about Databricks serverless jobs:

Databricks says serverless is cost efficient, so what’s the deal?

Yes, Databricks has presented materials touting that serverless is cost efficient relative to classic compute. This presentation from DAIS 2024 is a great example. The real question is – what do you compare serverless to? If you’re comparing serverless to just default settings, then serverless may likely do very well. But a workload optimized cluster can likely outperform serverless. Finding an optimized cluster, though, is by no means an easy feat. How do we determine how “optimal” it is?

Skeptics may say that we are biased as well, that examples above are based on jobs we knew could outperform serverless. This is a totally fair position to have. For the record – we did not cherry pick a workload in our tests, we just quickly found the first job we could that was even compatible with serverless.

At the end of the day, benchmarks presented by external parties can be totally irrelevant to your use case. Fancy benchmarks like TPC-DS, or even the one we shared in this post do not look like your jobs. There’s only one thing that really matters: YOUR WORKLOADS.

Finally, we say, don’t take our word for it, nor Databricks’. Compare serverless head to head with an optimized classic cluster and see for yourself. If you need help, Gradient is here to lead you to help that coveted optimized cluster. Or, if you have the knowledge and skills, you can manually optimize your cluster and compare it to serverless. u.

How does this impact Sync?

Does serverless impact the Sync roadmap? We have short term and a long term answers to this question:

Short term: Serverless and classic will co-exist for quite some time. We see serverless as just another option (like Photon, or autoscaling). Sync’s algorithms can test whether serverless or an optimized classic cluster is best for your needs, and share recommendations or auto-apply those changes for you. At the end of the day, all we care about is selecting the best options and configurations based on your goals. If serverless is that, then we’ll be happy to point your jobs in that direction.

Long term: We are a cloud compute management company, and Databricks is just the first stop in our evolution. Our plan is to expand to all facets of cloud computing, from Spark, to bare CPUs, GPUs, Kubernetes etc., it’s all up for grabs. Databricks has been a great partner and an important first step, but longer term, it will be one of dozens of platforms we support. Our guess is optimizing classic compute will play a role for Databricks users for many years to come, and we’re happy to help them with that

Like we said earlier, don’t take our word for it. Try serverless out for yourself, do your own homework. Conduct an A/B test and see if serverless is actually cost effective. If you need help automatically finding the optimized classic cluster, feel free to check us out.

Gradient Product Update— Discover, Monitor, and Automate Databricks Clusters

Jeffrey Chou
06.06.2024

With the Databricks Data+AI Summit 2024 just around the corner, we of course had to have a major product launch to go with it!

We’re super excited to announce an entirely new user flow and features to the product, making it faster to get started and providing a more comprehensive management solution. At a high level, the new expansion involves these new features:

Discover – Find new jobs to optimize
Monitor – Track all job costs and metrics over time
Automate – Auto-apply to save time, costs, and hit SLAs

With ballooning Databricks costs and constrained budgets, Databricks efficiency is crucial for sustainable growth for any company. However, optimizing Databricks clusters is a difficult and time consuming task riddled with low level complexity and red tape.

Our goal with Gradient is to make it as easy and painless as possible to identify jobs to optimize, track overall ROI, and automate the optimization process.

The last automation piece is what sets Gradient apart. Gradient is designed to optimize at scale, for companies that have 100+ production jobs. At that scale, automation is a must, and is where we shine. Gradient provides a new level of efficiency unobtainable with any other tool on this planet.

With automatic cluster management, engineering teams are free to pursue more important business goals while Gradient works around the clock.

Let’s drill in a bit deeper into what these new features are:

Discover

Find your top jobs to optimize as well as discover new opportunities to improve your efficiency even more. This page is refreshed daily so you always get up-to date insights and historical tracking.

How to get started – Simply enter your Databricks credentials and click go! You can get running from scratch in less than a minute

What is shown:

Top jobs to optimize with Gradient
Jobs with Photon enabled
Jobs with Autoscaling enabled
All purpose compute jobs
Jobs with no-job id (meaning they could come from an external orchestrator like Airflow)

To see how fast and easy it is to get the Discover page up and running, check out the video below:

Monitor

Track Spark metrics and costs of all of your jobs managed with Gradient in a single pane of glass view. Use this view to get a bird’s eye view on all of your jobs and track the overall ROI of Gradient with the “Total Savings” view.

How to get started – Onboard your Databricks workspace in the integration page. This may require involving your devops teams as various cloud permissions are required.

What is shown:

Total core hours
Total Spend
Total recommendations applied
Total cost savings
Total estimated developer time saved
Total number of projects
Number of SLAs met

Automate

Enable auto-apply to automatically optimize your Databricks jobs clusters to hit your cost and runtime goals. Save time and money with automation.

How to get started – Onboard your Databricks workspace in the integration page (no need to repeat if already done above)

What is shown:

Job costs over time
Job runtime over time
Job configuration parameters
Cluster configurations
Spark metrics
Input data size

Conclusion

Get started in a minute yourself with the Discover page and start finding new opportunities to optimize your Databricks environment. Login yourself to get started!

Or if you’d prefer a hands on demo, we’d be happy to chat. Schedule a demo here

May 2024 Release Notes

Jeffrey Chou
05.29.2024

April showers bring May product updates! Take a look at Sync’s latest product releases and features. 💐

The Sync team is heading to San Francisco for the Databricks Data+AI Summit 2024! We’ll be at Booth #44 talking all things Gradient with a few new surprise features in store.

Want to get ahead of the crowd? Book a meeting with our team before the event here.

Download our Databricks health check notebook

Have you taken advantage of our fully customizable health check notebook yet?

With the notebook, you’ll be able to answer questions such as:
⚙️ What is the distribution of job runs by compute type?
⚙️ What does Photon usage look like?
⚙️ What are the most frequently used instance types?
⚙️ Are APC clusters being auto-terminated or sitting idle?
⚙️ What are my most expensive jobs?

The best part? It’s a free tool that gives you actionable insights so you can work toward optimally managing your Databricks jobs clusters.

Head here to get started.

Apache Airflow Integration

Apache Airflow for Databricks now directly integrates with Gradient. Via the Sync Python Library, users are able to integrate Databricks pipelines when using 3rd party tools like Airflow.

To get started simply integrate your Databricks Workspace with Gradient via the Databricks Workspace Integration. Then, configure your Airflow instance and ensure that the syncsparkpy library has been installed using the Sync CLI.

Take a look at an example Airflow DAG below:

from airflow import DAG
from airflow.decorators import task
from airflow.operators.python import PythonVirtualenvOperator
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from airflow.utils.dates import days_ago
from airflow.models.variable import Variable
from airflow.models import TaskInstance

from sync.databricks.integrations.airflow import airflow_gradient_pre_execute_hook



default_args = {
    'owner': 'airflow'
}

with DAG(
    dag_id='gradient_databricks_multitask',
    default_args=default_args,
    schedule_interval = None,
    start_date=days_ago(2),
    tags=['demo'],
    params={
        'gradient_app_id': 'gradient_databricks_multitask',
        'gradient_auto_apply': True,
        'cluster_log_url': 'dbfs:/cluster-logs',
        'databricks_workspace_id': '10295812058'
    }
) as dag:

    def get_task_params():
        task_params = {
            "new_cluster":{
                  "node_type_id":"i3.xlarge",
                  "driver_node_type_id":"i3.xlarge",
                  "custom_tags":{},
                  "num_workers":4,
                  "spark_version":"14.0.x-scala2.12",
                  "runtime_engine":"STANDARD",
                  "aws_attributes":{
                     "first_on_demand":0,
                     "availability":"SPOT_WITH_FALLBACK",
                     "spot_bid_price_percent":100
                  }
               },
               "notebook_task":{
                  "notebook_path":"/Users/pete.tamisin@synccomputing.com/gradient_databricks_multitask",
                  "source":"WORKSPACE"
               }
        }

        return task_params

    notebook_task = DatabricksSubmitRunOperator(
        pre_execute=airflow_gradient_pre_execute_hook,
        task_id="notebook_task",
        dag=dag,
        json=get_task_params(),
    )

##################################################################


    notebook_task

And voila! After implementing your DAG, head to the Projects dashboard in Gradient to review recommendations and make any necessary changes to your cluster config.

Take a look at our documentation to get started.

What is Databricks Unity Catalog (and Should I Be Using It)?

Jeffrey Chou

Since launching in 2013, Databricks has continuously evolved its product offerings from machine learning pipeline to end-to-end data warehousing and data intelligence platform.

While we at Sync are big fans of all things Databricks (particularly how to optimize cost and speed) we often get questions about understanding Databricks new offerings—particularly as product development has accelerated in the last 2 years.

To help in your understanding, we wrote this blog post to address the question, “What is Databricks Unity Catalog?” and whether users should be using it (the answer is yes). We walk through a precise technical answer, and then dive into the details of the catalog itself, how to enable it and frequently asked questions.

What Is the Unity Catalog in Databricks?

The Databrick’s Unity Catalog is a centralized data governance layer that allows for granular security control and managing data and metadata assets in a unified system within Databricks. Additionally, the unity catalog provides tools for access control, audits, logs and lineage.

You can think of the unity catalog as an update designed to bridge gaps in the Databrick ecosystem—specifically to eliminate and improve upon third-party catalogs and governance tools. With many cloud-specific tools being used, Databricks brought in a unified solution for data discovery and governance that would seamlessly integrate with their Lakehouse architecture. Thus, while Unity Catalog was initially billed as a governance tool, in reality it streamlines processes across the board. While simplistic, it’s not wrong to say Unity Catalog simply makes everything Databricks run smoother.

Notably, the Unity Catalog is being offered by default on the Databricks Data Intelligence Platform. This is because Databricks believes the Unity Catalog is a huge benefit to their users (and we are inclined to agree!). If you have access to the Unity Catalog, we highly recommend enabling it in your workspace.

What benefits does the Databricks Unity Catalog have to offer?

The Unity Catalog benefits can be thought of in four buckets: data governance, data discovery, data lineage, and data sharing and access.

Data Discovery

The unity catalog provides a structured way to tag, document and manage data assets and metadata. This allows for a comprehensive search interface that utilizes lineage metadata (including full lineage) history and ensures security based on user permissions.

Users can either explore data objects through the Catalog Explorer, or parse through data using SQL or Python to query datasets and create dashboard from available data objects. In Catalog explorer, users can preview sample data, read comments and check field details (50 second preview from Databricks here).

A preview of the Catalog explorer for data discovery in Unity Catalog (via Databricks/Youtube)

Data Governance

Unity Catalog is a layer over all external compute platforms and acts as a central repository for all structured and unstructured data assets (such as files, dashboards, tables, views, volumes, etc). This unified architecture allows for a governance model that includes controls, lineage, discovery, monitoring, auditing, and sharing.

Unity Catalog thus offers a single place to administer data access policies that apply across all workspaces. This allows you to simplify access management with a unified interface to define access policies on data and AI assets and consistently apply and audit these policies on any cloud or data platform.

All of Databricks governance parameters can be accessed via their Unity Catalog Governance Portal. The Databricks Data Intelligence Platform leverages AI to best understand the context of tables and columns, the volume of which can be impossible for manual categorization. This also enables you to quickly assess how many of your tables are monitored via Lakehouse Monitoring — Databricks’s new “AI for Governance tool”.

A screenshot of the Unity Catalog Governance portal shows how their Lakehouse Monitoring uses AI to automatically monitor tables and alert users to uses like PII leakage or data drift (via Databricks/Youtube)

With Lakehouse monitoring you can also set up alerts that automatically detect and correct PII leakage, data quality, data drift and more. These auto alerts are contained within their own section of the Governance Portal, which shows when the issue was first detected, and where the issue first stemmed from.

A preview of the governance action items shows how issues are identified by cause and Catalog/Schema/Table. Digging further in will reveal the time and date of first incidence as well as it where it stems from.

It incorporates a data governance framework and maintains an extensive audit log of actions performed on data stored within a Databricks account.

Data Lineage

As the importance of Data Lineage has grown, Databricks has responded with end-to-end lineages for all workloads. Lineage data includes notebooks, workflows and dashboards and is captured down to the column level. Unity Catalog users can parse and extract lineage metadata from queries and external tools using SQL or any other language enabled in their workspace, such as Python. Lineage can be visualized in the Catalog Explorer in near-real-time and

Unity Catalog’s lineage feature provides a comprehensive view of both upstream and downstream dependencies, including the data type of each field. Users can easily follow the data flow through different stages, gaining insights into the relationships between field and tables.

An example of the metadata lineage within Unity Catalog

Like their governance model, Databricks restricts access to data lineage based on the logged-in users’ privileges.

Data Sharing and Access

One of the most welcomed features of Databricks Unity Catalog is its built-in sharing method which is built on Delta Sharing, Databricks’ popular cloud-platform-agnostic open protocol for sharing data and managing permissions launched in 2021.

Within Unity Catalog you can access control mechanisms use identity federation, allowing Databricks users to be service principals, individual users, or groups. In addition, SQL-based syntax or the Databricks UI can be used to manage and control access, based on tables, rows, and columns, with the attribute level controls coming soon.

How Does Databricks Unity Catalog Enhance Data Governance and Security

Databricks has a standards-complaint security model based on ANSI SQL and allows administrators to grant permissions in their existing data lake using familiar syntax, at the level of catalogs, databases (also called schemas), tables, and views.

Unity Catalog grants user-level permissions for Governance Portal, Catalog Explorer and for data lineages and sharing. Unity Catalog in effect has one model for safeguarding appropriating access across your full data estate with permissions, row level, and column level security.
It almost allows registering and governing access to external data sources, such as cloud object storage, databases, and data lakes, through external locations and Lakehouse Federation.

Does Unity Catalog Help With Databricks Cost?

Yes, because Unity Catalog reduces both storage costs and fees for external licensing, it reduces cost compared to previous solutions. It also indirectly saves time by greatly reducing bottlenecks for ingesting data, reducing time spent on repetitive tasks by an average of 80% (according to Databricks). This all comes free and automatically enabled for all new users of the Databricks Data Intelligence Platform.

How do I set up and configure Unity Catalog in Databricks?

The following is a step-by-step guide to setting up and configuring Databricks Unity Catalog.

Confirm Your Workspace Is Enabled For Unity Catalog.
Log into your account and click Workspaces. From there check the Metastore Column. If a metastore name is preset, it means your workspace is attached to a Unity Catalog.

If your workspace doesn’t return a metastore, you’ll want to either to enable and attach your workspace, or create a Unity Catalog metastore.

Add users and assign the workspace admin role.
The user who creates a workspace is automatically added as an admin role. That admin can then add and invite users, and can assign workplace admin roles and metastore admin roles.
Create Clusters or SQL Warehouses for users to run queries and create objects. To run Unity Catalog workloads, compute resources must comply with certain security requirements. As a workspace admin, you can opt to make compute creation restricted to admins or let users create their own SQL warehouses and clusters
Grant Privileges to Users. To create objects and access them in Unity Catalog catalogs and schemas, a user must have permission to do so. See how to grant privileges and manage admin privileges.
Create New Catalogs and Schemas. To start using Unity Catalog, you must have at least one catalog defined. Catalogs are the primary unit of data isolation and organization in Unity Catalog. All schemas and tables live in catalogs, as do volumes, views, and models. You’ll want to create managed storage for the new catalog, then bind the new catalog your workspace, and then grant privileges for that catalog. Full instructions here.

What Integrations Work With Data Unity Catalog?

Unity Catalog works existing data storage systems and governance solutions such as Atlan, Fivetran, dbt or Azure data factory. It also integrates with business intelligence solutions such as Tableau, PowerBi and Qlik. This makes it simple to leverage your existing infrastructure for updated governance model, without incurring expensive migration costs (for a full list of integrations check out Databricks page here).

What if my workspace wasn’t enabled for Unity Catalog automatically?

If your workspace was not enabled for Unity Catalog automatically, an account admin or metastore admin must manually attach the workspace to a Unity Catalog metastore in the same region. If no Unity Catalog metastore exists in the region, an account admin must create one. For instructions, see Create a Unity Catalog metastore.

Unity Catalog Limitations

The following limitations apply for all object names in Unity Catalog:

Object names cannot exceed 255 characters.
The following special characters are not allowed:
- Period (.)
- Space ( )
- Forward slash (/)
- All ASCII control characters (00-1F hex)
- The DELETE character (7F hex)
Unity Catalog stores all object names as lowercase.
When referencing UC names in SQL, you must use backticks to escape names that contain special characters such as hyphens (-).

For a full list of Unity Catalog Limitations, read the full documentation for the Unity Catalog.

Unity Catalog FAQs

How does Databricks Unity Catalog differ from Hive Metastore?
Databricks Unity Catalog offers a centralized data governance model, supports external data access, data isolation, and advanced features like column-level security, while Hive Metastore has limited governance capabilities.
How Long is Lineage Data Stored in Databricks Unity Catalog?
Lineage data on Databricks Unity Catalog is retained for 1 year.
What are the supported compute and cluster access modes for Databricks Unity Catalog?
Supported access modes are Shared Access Mode and Single User Access Mode. No-Isolation Shared Mode is not supported.
What data file formats are supported for managed and external tables in Databricks Unity Catalog?
Managed tables must use the Delta table format, while external tables can use Delta, CSV, JSON, Avro, Parquet, ORC, and Text formats.
How do you enable your workspace for Databricks Unity Catalog?
You can enable Unity Catalog during workspace creation or assign an existing metastore to your workspace through the Databricks account console.
How do you control access to data and objects in Databricks Unity Catalog?
You can use admin privileges, object ownership, privilege inheritance, basic object privileges (GRANT/REVOKE), dynamic views for row/column security, and manage external locations and credentials.
What is the Databricks Unity Catalog object model?
The object model follows a hierarchical structure: Metastore ► Catalog ► Schema ► Tables, Views, Volumes, and Models.
Can you transfer ownership of objects in Unity Catalog?
Yes, you can transfer ownership of catalogs, schemas, tables, and views to other users or groups using SQL commands or the Catalog Explorer UI.
How do you create a new catalog in Unity Catalog?
You can use the CREATE CATALOG SQL command, specifying a name and managed location if needed. You must have CREATE CATALOG privileges on the metastore.
How do you grant permissions on a catalog or schema?
Use the GRANT statement with the desired privileges (e.g., CREATE SCHEMA, CREATE TABLE) and the catalog or schema name, followed by the user or group to grant access to.
What is the syntax for referring to a table in Unity Catalog?
Use the three-part naming convention: <catalog>.<schema>.<table>
How do you create a managed table in Unity Catalog?
Use the CREATE TABLE statement, specifying the table name, columns, and partitioning if needed. Managed tables are created in the managed storage location.
Can you access data in the Hive Metastore through Unity Catalog?
Yes, data in the Hive Metastore becomes a catalog called hive_metastore, and you can access tables using the hive_metastore.<schema>.<table> syntax.
How do you drop a table in Databricks Unity Catalog?
You can use the DROP TABLE statement followed by the fully qualified table name (e.g., DROP TABLE <catalog>.<schema>.<table>).

Unity Catalog is the solution to a problem was created as Databricks grew beyond its initial usage. In order to streamline the various product offerings within their ecosystem, Databricks introduced the Unity Catalog to eliminate third-party integrations, particularly in the realm of data governance. We feel this has been tremendously well executed and as Unity Catalog comes free and installed by default for all new databricks data intelligence platform users, we feel it’s highly advantageous to maximize its utility, particularly for data governance, lineage and data discovery.

Useful Links

How Abnormal Reduced Databricks Costs by 38% with Gradient

Jeffrey Chou
05.18.2024

Who is Abnormal?

Abnormal is a hypergrowth company in the email security space that helps companies worldwide prevent email attacks while automating security operations. They rely on Databricks extensively to help process terabytes of data across thousands of jobs daily, translating to an enormous amount of daily Databricks usage. From ETL jobs, streaming, SQL, to machine learning – it’s safe to say that Abnormal has a diverse set of applications run on Databricks.

Abnormal’s Problem with Databricks

Abnormal’s data platform team came to Sync with a common problem — they simply had too many jobs for a platform team to control in terms of cost, efficiency, and runtime SLAs, all while operating at scale. This problem wasn’t getting easier with their business growing each month – the number of jobs to manage was only increasing.

To put some numbers behind the scale of the problem, let’s say each Databricks cluster can have 10 options to configure. That times a thousand is 10,000 parameters each day that need to be monitored and optimized. Keep in mind, many things can change as well from data size to new code being pushed – so a configuration that worked yesterday can be wrong today.

And to make the problem harder, a bad configuration can lead to an out of memory error and result in a crashed production job. Making a mistake with a configuration is no joke, and that’s why many teams are hesitant to try and optimize themselves.

The Abnormal team was looking for a solution that could manage and optimize many of their production jobs automatically and at scale without any crashes. In some sense, they wanted a tool that would allow their small platform team to seem like an army capable of successfully managing thousands of jobs at once with high fidelity.

How Gradient Helped

Abnormal’s use case is one we’ve heard many times, and one that really strikes at the true value of what Gradient can do – automatically manage and optimize Databricks jobs at scale. With Gradient, a single person can potentially manage thousands of jobs, without breaking a sweat.

How Gradient Works

Gradient builds a custom model for each job it manages, and trains its model on historical data, keeping track of any statistical variations that may occur. With this information, Gradient can confidently apply changes to a user’s cluster automatically to steer it towards the desired performance, learning with each iteration.

Gradient’s model provides the ultimate infrastructure management solution with the following value points relevant for Abnormal:

Automatically maintains optimal performance – Gradient continuously monitors your jobs and configurations to ensure mathematically optimal performance despite any variations that may occur with no code changes.
Lowers costs – Gradient can tune clusters at scale to hit minimize costs
Hit SLAs – Gradient can tune clusters to ensure runtime SLAs are hit
Offloads maintenance work – Monitoring and tuning clusters can now be automated away, saving precious time for data and platform engineers
Beyond serverless – Similar to the value of Serverless clusters, Gradient removes all infrastructure decision making away from end users. The one major step forward Gradient performs is it will actively optimize the configurations to hit end user goals – white still leaving everything exposed for users to observe and control if they want.

Automatic Results in Production – “Set and Forget”

An hourly production job with a runtime of about 60 minutes was selected to import into the Gradient system. Integrating Gradient into Abnormals production Airflow environment took under an hour for the one-time installation setup. After the installation, adding new jobs is as simple as clicking a few buttons in the Gradient UI.

After importing the job into Gradient, Abnormal simply enabled “auto-apply” and then walked away. A couple hours later they logged back into Gradient and saw the results below – a 38% cost savings and a 2x speedup of their jobs in production without lifting a finger.

Analyzing the Results

The image above shows an annotated screenshot from Abnormal’s Gradient dashboard for the job. The top graph shows the cost as a function of job run iteration. Each vertical line with a Gradient logo illustrates when a cluster recommendation was applied in production.

The gray area represents the “training” phase of the process while the green area represents the “optimizing” phase. The bottom graph shows the runtime of the application across iterations.

While Abnormal normally runs their job on Spot instances, for this case we switched it to an on-demand cluster to help mitigate any cluster performance noise that could be attributed to the randomness of spot clusters. Basically, it helps to keep the test clean to ensure any performance enhancements we achieve weren’t caused by other random fluctuations. Otherise, the clusters were identical. After optimization, the cluster can be switched back to Spot, where the savings will transfer directly.

In the above cost graph, once the cluster switches to on-demand, the training and optimizing begins. The starting cluster consisted of a r5.xlarge driver node and 10 c5.4xlarge worker nodes. Gradient, under the hood, optimized both cluster and EBS settings.

What Gradient found in this particular example was that this job was under-provisioned and actually needed a larger cluster that resulted in both a 38% cost savings and a 2X speedup. This result can be counter-intuitive as most people may think cost and runtime savings can only occur when a cluster is over-provisioned and simply shrinking a cluster is the obvious way to cut costs.

Towards the end of the image above, you’ll notice that Gradient will manage the cluster and keep it at optimal performance, with no human intervention needed. Even if things change, Gradient will be able to adjust and accommodate.

Was the reduction due to decreasing data sizes?

One question Abnormal had immediately was, maybe this cost reduction was due to a decreasing data size – a completely valid thought. We of course want to ensure that the performance enhancements observed were due to our optimizations and not some external factor.

Fortunately, when we analyzed the input data size during these job runs, they were all consistent across job runs. Proving that this performance boost was due to Gradient’s actions.

How did Gradient know what to do?

As mentioned earlier, Gradient uses a proprietary mathematical model of Databricks. With a few training points, the system can fit model coefficients to quickly provide an accurate prediction of which way to tune a cluster. The best part is, with each new iteration data point, the model only gets more accurate over time.

In this particular scenario, the data was informing the model that it was indeed an under-provisioned cluster and that increasing the size was the right path.

To ensure job run safety, Gradient will gradually walk clusters towards optimal configurations – monitoring the effects of the change with each iteration. This is an important step to prevent catastrophic failures. Changing a cluster too drastically in a single shot is a high risk maneuver due to the unpredictability of Spark jobs.

So what did Abnormal think?

“Gradient is indispensable in our data-driven landscape, where the ever-expanding data volumes require constant monitoring and optimizations. Without Gradient, these oversights can lead to expensive errors, system failures, or excessive costs. With its automated optimization of distributed Spark jobs and scalable solutions, Gradient guarantees seamless pipeline operation, enabling us to focus on delivering products and features to our customers.” – Blaine Elliot – Platform Engineer @Abnormal

Try it yourself today

Gradient is available to try today for users on Databricks AWS and Azure. With each passing month, we’re releasing new features and new optimization paths. If you want help managing and optimizing your Databricks clusters, request a demo or see our site for more information.