Databricks 101: An Introductory Guide on Navigating and Optimizing this Data Powerhouse

In an era where data reigns supreme, the tools and platforms that businesses use to harness and analyze their data can make or break their competitive edge. Among these, Databricks stands out as a powerhouse, yet navigating its complexities often feels like deciphering cryptic code. With the world generating an estimated 2.5 quintillion bytes of data daily, the need for a robust, efficient, and cost-effective data cloud has never been more critical.

In this post, we demystify Databricks, with a focus on Job Clusters. Readers will gain insight into the platform’s workspace, the pivotal role of workflows, and the nuanced world of compute resources including All-Purpose Compute (APC) clusters vs Jobs Compute clusters. We’ll also shed light on how to avoid costly errors and optimize resource allocation for data workloads. 

Curious about how Databricks can transform your data strategy while keeping costs in check? Read on to learn how.

Introduction to Databricks: Components and configurations

Image source: https://www.serveradminz.com/blog/databricks-an-advanced-analytics-solution/

Databricks, at its core, is a comprehensive platform that integrates various facets of data science and engineering, offering a seamless experience for managing and analyzing vast datasets.

Central to its ecosystem is the Workspace, a collaborative environment where data teams can store files, collaborate on notebooks, and orchestrate data workflows, to name a few capabilities. Workspaces act as the nerve center along with the Unity Catalog, which provides a bridge between workspaces. Together they facilitate an organized approach to data projects and ensure that valuable insights are never siloed.

Workflows, available under each Workspace, represent recurring production workloads (aka jobs), which are vital for production data environments. These workflows automate routine data processing tasks, such as machine-learning pipelines, ETL workloads, and data streaming, ensuring reliability and efficiency in data operations. Understanding the significance of workflows is essential for anyone looking to streamline their data processes in Databricks.

Databricks Job Clusters

Workflows rely on Job Clusters for compute resources. Databricks offers various compute resource options to pick from during the cluster setup process. While the default compute is set to Serverless, you can expose and tweak granular configuration options by picking one of the classic compute options.
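To make that concrete, here is a hedged sketch of how a workflow with a classic Jobs Compute cluster might be defined through the Jobs REST API. The workspace URL, token, notebook path, runtime version, and instance type are placeholders, not recommendations:

```python
import requests

# Placeholder workspace URL and personal access token.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# A classic Jobs Compute cluster is declared inline with the job ("new_cluster"),
# so it spins up for each run and terminates when the run finishes.
job_payload = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "transform",
            "notebook_task": {"notebook_path": "/Workspace/etl/transform"},  # placeholder path
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",  # example runtime
                "node_type_id": "i3.xlarge",          # example instance type
                "num_workers": 4,
            },
        }
    ],
    # Run every night at 2am UTC.
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_payload,
)
print(resp.json())  # returns the new job_id on success
```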

Read on for more info about Databricks Job Cluster options.

Serverless Jobs vs. Classic Compute

Multiple options to pick from during the cluster setup process

Databricks recently announced the release of serverless compute across the platform at the Data + AI Summit 2024. The goal is to provide users with a simplified cluster management experience and reduce compute costs. However, based on our research, Jobs Serverless isn't globally cheaper than Classic Compute, and there is a price for the convenience that serverless offers.

We compared the cost and performance of serverless jobs vs jobs running on classic clusters and found that while short ad-hoc jobs are ideal candidates for serverless, optimized classic clusters outperformed their serverless counterparts by 60% in costs. 

More about the lessons we learned from experimenting with Databricks Serverless Jobs

As for the convenience aspect, it does come at a cost. Serverless compute unburdens you from setting up cluster configurations, but that also means you lose control over job performance and costs. There are no settings to tweak or prices to negotiate with cloud providers. So if you are like us and want to be able to control the configuration of your data infrastructure, this might not be the right option for you.

On-demand clusters vs. Spot instances 

Compute resources form the backbone of Databricks cluster management. The basic classification of compute resources is on-demand clusters vs. Spot instances. Spot instances are considered cost effective, offering discounts of up to 90% on spare cloud compute. However, they aren't the most stable or reliable, because the number of workers running can change in the middle of a job. As a result, job runtime and cost can become highly variable, and the job can even crash.

Overall, on-demand instances are better suited for workloads that cannot be interrupted, workloads with unknown compute requirements, and immediate launch and operation needs. Spot instances, on the other hand, are ideal for workloads that can be interrupted and stateless applications with surge usage.

| Spot vs On-demand | On-demand Instances | Spot Instances |
| --- | --- | --- |
| Access | Immediate | Only if there is unused compute capacity |
| Performance | Stable and reliable | Limited stability, as workers can be pulled during a job run |
| Cost | Known | Varies; up to 90% cheaper than on-demand rates |
| Best for | Jobs that cannot be interrupted; jobs with unknown compute requirements | Jobs that can be interrupted; stateless apps with surge usage |

Spot vs On-demand Clusters
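For classic clusters on AWS, the on-demand/spot mix is expressed directly in the cluster definition through its aws_attributes. A rough sketch, with the instance type and bid settings purely illustrative:

```python
# Cluster spec fragment (AWS): keep the driver on on-demand capacity,
# source the workers from the spot market, and fall back to on-demand
# if spot capacity is unavailable.
cluster_spec = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",       # example instance type
    "num_workers": 8,
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",
        "first_on_demand": 1,           # the first node (the driver) stays on-demand
        "spot_bid_price_percent": 100,  # bid up to the on-demand price
    },
}
```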

All-Purpose Compute vs Jobs Compute

Databricks offers several types of compute resources, including All-Purpose Compute (APC) clusters, Jobs Compute clusters, and SQL warehouses. Each resource is designed with specific use cases in mind.

SQL warehouses, for example, allow users to query information in Databricks using the SQL programming language. APCs and Jobs Compute, on the other hand, are more general compute resources that can run jobs written in languages such as Python, SQL, and Scala. While APCs are ideal for interactive analysis and collaborative research, Jobs Compute is ideal for scheduled jobs and workflows. Understanding the distinctions and use cases for these compute resources is crucial for optimizing performance and managing costs.

Unfortunately, navigating the compute options for your clusters can be difficult. Common errors include the accidental deployment of more expensive compute resources, such as an APC cluster when a Jobs Compute cluster would suffice. Such mistakes can have significant financial implications. In fact, we've found that APCs can cost up to 50% more than running the same job using Jobs Compute.

The choice between APC clusters and Jobs Compute clusters is a crucial decision point for Databricks users when spinning up a cluster, as each option is tailored to a different stage of the data analysis lifecycle. Knowing these nuances helps you avoid common errors and ensures that your Databricks environment is both effective and cost-efficient. For example, APC clusters suit exploratory research work, while Job Clusters suit mature, scheduled production workloads.

Photon workload acceleration

Databricks Photon is a high-performance vectorized query engine that accelerates workloads. 

Photon can substantially speed up job execution, particularly for SQL-based jobs and large-scale production Spark jobs. Unfortunately, Photon roughly doubles the DBU rate.

Learn more about the pros and cons of Databricks Photon in this post.

Many factors influence whether Photon is the right choice for you, which is why we recommend you A/B test the same job with and without Photon enabled. From our experience, 9 out of 10 organizations opt out of Photon once the results are in. 
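One straightforward way to run that A/B test is to clone the job's cluster spec and flip only the runtime engine, then compare total cost (not just runtime) across a few representative runs. A minimal sketch, with field names as exposed by the Clusters API and everything else illustrative:

```python
# Two otherwise-identical cluster specs: one with Photon, one without.
base_cluster = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",   # example instance type
    "num_workers": 4,
}

photon_cluster = {**base_cluster, "runtime_engine": "PHOTON"}
standard_cluster = {**base_cluster, "runtime_engine": "STANDARD"}

# Since Photon roughly doubles the DBU rate, it needs to cut runtime by more
# than half (all else being equal) to come out cheaper on the same workload.
```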

Spark autoscaling

Databricks offers an autoscaling feature to dynamically adjust the number of worker nodes in a cluster based on the workload demands. 

By dynamically adjusting resources, autoscaling can reduce costs for some jobs. However, due to the spin up time cost of new workers, sometimes autoscaling can lead to increased costs. In fact, we found that simply applying autoscale across the board can lead to a 30% increase in compute costs. 
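Concretely, the difference is whether the cluster spec pins a fixed worker count or hands Databricks a range to scale within. A hedged sketch with illustrative values:

```python
# Fixed-size cluster: predictable cost and runtime, no scaling overhead.
fixed_cluster = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 8,
}

# Autoscaling cluster: Databricks adds and removes workers between the bounds.
# New workers take minutes to spin up, which is where extra cost can creep in
# for short or bursty jobs.
autoscaling_cluster = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 8},
}
```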

Read the full analysis of Databricks autoscaling here.

Notebooks for ad-hoc research in Databricks

Databricks reshapes the data science and data engineering landscape with its notebook concept, which combines code with annotations, giving teams a dynamic canvas for developing their projects.

Notebooks in Databricks excel at facilitating chunk-based code execution, a feature that significantly enhances debugging efforts and iterative development. By letting users execute code in segments, notebooks eliminate the need to rerun entire scripts for minor adjustments. This accelerates the development cycle and promotes a more efficient debugging process.

Notebooks can run on either APC clusters or Jobs Compute clusters depending on the task at hand: APC is ideal for exploratory analysis, while Jobs Compute is the most cost-effective choice for scheduled jobs.

It’s important to note that All-Purpose Compute clusters are shared environments with a collaborative aspect. Multiple data scientists can simultaneously work on the same project with collaboration facilitated by shared compute resources. As a result, team members can contribute to and iterate on analyses in real-time. Such a shared environment streamlines research efforts, making APC clusters invaluable for projects requiring collective input and data exploration.

The journey from ad-hoc research in notebooks to the production-ready status of workflows marks a significant transition. While notebooks serve as a sandbox for exploration and development, workflows embody the structured, repeatable processes essential for production environments. 

This contrast underscores the evolutionary path from exploratory analysis, where ideas and hypotheses are tested and refined, to the deployment of automated, scheduled jobs that operationalize the insights gained. It is this transition from the “playground” of notebooks to the “factory floor” of workflows that encapsulates the full spectrum of data processing within Databricks, bridging the gap between initial discovery and operational efficiency.

From development to production: Workflows and jobs

The journey from ad-hoc research in notebooks to the deployment of production-grade workflows in Databricks encapsulates a sophisticated transition, pivotal for turning exploratory analyses into automated, recurring processes that drive business operations. At the core of this transition are workflows, which serve as the backbone for scheduling and automating jobs in Databricks. 

Workflows stand out by their ability to convert manual, repetitive tasks into automated sequences that run based on predefined triggers. These triggers vary widely, from the arrival of new files in a data lake to continuous data streams that require real-time processing. By leveraging workflows, data teams can ensure that their data pipelines are responsive, timely, and efficient, enabling seamless transitions from data ingestion to insight generation.
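For example, a file-arrival trigger can be attached directly to a job definition so the workflow fires whenever new data lands in a monitored location. The fragment below sketches what that looks like in the Jobs API; the storage path is a placeholder and exact field names may vary across API versions:

```python
# Fragment of a job definition: run the job whenever new files arrive
# at the external location below (placeholder path).
trigger_settings = {
    "trigger": {
        "pause_status": "UNPAUSED",
        "file_arrival": {
            "url": "s3://my-bucket/landing/",          # hypothetical landing path
            "min_time_between_triggers_seconds": 300,  # debounce rapid arrivals
        },
    }
}
```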

Directed Acyclic Graphs (DAGs) are central to workflows and further enhance the platform by providing a graphical representation of each pipeline. DAGs in Databricks allow users to visualize the sequence and dependencies of jobs, offering a clear overview of how data moves through various processing stages. This is an essential feature for optimizing and troubleshooting workflows, enabling data teams to identify bottlenecks or inefficiencies within their pipelines.

Transitioning data workloads from the exploratory realm of notebooks to the structured environment of production-grade workflows enables organizations to harness the full potential of their data, transforming raw information into actionable insights.

Orchestration and Competition: Databricks Workflows vs. Airflow and Azure Data Factory

By offering Workflows as a native orchestration tool, Databricks aims to simplify the data pipeline process, making it more accessible and manageable for its users. With that said, it also opens up a complex landscape of competition, especially when viewed against established orchestration tools like Apache Airflow and Azure Data Factory.

Comparing Databricks Workflows with Apache Airflow and Azure Data Factory reveals a compelling competitive landscape. Apache Airflow, with its open-source lineage, offers extensive customization and flexibility, drawing users who prefer a hands-on, code-centric approach to orchestration. Azure Data Factory, on the other hand, integrates deeply with other Azure services, providing a seamless experience for users already entrenched in the Microsoft ecosystem. 

Databricks Workflows promise simplicity and integration, especially appealing to those who already leverage the platform. The key differentiation lies in Databricks’ ability to offer a cohesive experience, from data ingestion to analytics, within a single platform. 

The strategic implications for Databricks are multifaceted. On one hand, introducing Workflows as an all-in-one solution locks users into its ecosystem. On the other hand, it challenges Databricks to continually add value beyond what users can achieve with other tools. The delicate balance between locking users in and providing unmatched value is critical; users will tolerate a certain degree of lock-in if they perceive they are receiving significant value in return.

As Databricks continues to expand its suite of tools with Workflows and beyond, the landscape of data engineering and analytics tools is set for a dynamic evolution. The balance between ecosystem lock-in and value provision will remain a critical factor in determining user adoption and loyalty. The competition between integrated solutions and specialized tools will shape the future of data orchestration, with opportunities for both consolidation and innovation.

Conclusion

In today’s data-driven world, mastering tools like Databricks is crucial. This guide aims to help simplify the management of Databricks Job Clusters. Our goal is to help you navigate the complexities and optimize resource allocation effectively. 

Understanding the distinction between compute resources, such as APC clusters for interactive analysis and Jobs Compute clusters for scheduled jobs, is essential. Understanding when to apply Photon and when not to is also critical for cluster optimization. Simply choosing the right compute options for your jobs can reduce your Databricks bill by 30% or more. 

The journey from exploration and research in notebooks to production-ready automated and scheduled workflows is a significant evolution. Workflows simplify data pipeline orchestration, and compete with other data orchestration tools like Apache Airflow and Azure Data Factory. 

While Airflow offers extensive customization and Azure Data Factory integrates seamlessly with Microsoft services, Databricks Workflows provide a cohesive experience from data ingestion to analytics. The challenge for Databricks is to balance ecosystem lock-in with delivering unmatched value, ensuring it continues to enhance its tools and maintain user loyalty in an evolving data landscape.

If you are interested in learning more about Databricks cluster management and cost optimization, follow us on Medium or subscribe to our newsletter using the form in our footer.

If you’ve already ramped up your activity on Databricks and are looking for a way to cut costs, we can help you reduce spend by 50%. Use this free Databricks workspace health check for an estimation of your potential savings, and actionable insights into your cluster configurations (most expensive jobs, candidates for Photon and autoscaling, APC mistakes, and more). 

Top 9 Lessons Learned about Databricks Jobs Serverless

To much fanfare, Databricks announced their wide release of serverless compute across all of their platforms at the Data + AI Summit 2024. It's quite clear that Databricks' vision is to own the compute layer to make life easier for end users, so they don't have to worry about annoying cluster or version details.

We at Sync are all about cloud compute efficiency. So of course we had to take a closer look at Databricks Serverless Compute to provide an honest perspective of serverless computing compared to classic computing, and evaluate the pros and cons.

Full disclosure – Sync is a Databricks partner. With that said, everything we state here is merely our opinion, based on the experimental measurements covered below. Our goal is to provide an unbiased guide, so that users can decide for themselves whether or not serverless is a good choice for them.

This post focuses on the jobs serverless feature, which is currently in public preview. For more information about the SQL warehouse serverless product, check out this post.

What is serverless? – A travel analogy

At a high level, serverless means that end users don't have to provision cloud infrastructure anymore; Databricks does it all for you, so users can just focus on their code. For example, selecting which instances to use is now under Databricks' control, not the end user's.

While this sounds like a no-brainer, if you zoom in a bit closer there are some pros and cons.  Let’s use an analogy here.  Let’s say you want to travel from San Francisco to London. 

  • The “classic” way of doing this is all of the planning is up to you, whether you travel by car, train, boat, bus, walking, running, or even swimming it’s all up to you to figure out.  And then there’s lodging, scheduling, and budget to think about.  This is a lot of work.  However, you can custom tailor to your exact specifications, including timing and budget.
  • The “serverless” way of doing this is you close your eyes in San Francisco and you wake up in London. That sounds like the dream situation.  However, you are soon handed a bill that costs $50,000 and you arrive a day after your big meeting.  How did that work out for you?  

If you’re a high volume traveler for business, you may prefer the high level of control the “classic” way brings, since you need that level of granularity.  If you’re a wealthy aimless traveler just enjoying the world, serverless is probably the dream.  It all depends, both methods are potentially great.

The lessons learned

As you review the results, we’d like you to bear in mind that jobs serverless will most likely improve with time, and that our current results are merely a snapshot in time. Will these numbers hold up in 1 year? Maybe, maybe not – we have no idea. We can only hope that this feedback helps shape the offering with some honest feedback from the field.  

Now for  the good stuff: here are the top 9 lessons learned from evaluating Databricks Jobs Serverless.

  1. Serverless compute is not cost optimized
  2. Ideal for short or ad-hoc jobs
  3. Eliminating spin up time is the biggest value add
  4. Serverless has zero knobs, which makes life easy but at the price of control
  5. You have no control over the runtime of your jobs
  6. Migrating to serverless is not easy
  7. Costs are completely determined by Databricks
  8. What happens if there’s an error?
  9. You can’t leverage your cloud contracts

Read on for our in-depth analysis of Databricks classic vs. serverless computing.

1. Serverless compute is not cost optimized

By far, the biggest hope users have for serverless is that “it will optimize my compute for me.” While this is true in some regard, the core issue users need to understand is “Does serverless provide the lowest cost option?” We can answer very clearly and unambiguously that, unfortunately, it does not. Serverless is not the cheapest option around.

While serverless is a pretty good option, it’s certainly not the most optimal when it comes to cost for all use cases.  As evidence, we ran a test job and found that an optimized cluster by Gradient (our flagship product), outperformed Databricks serverless jobs by roughly 60% from a cost perspective!  

Our test job runs basic queries on a randomly generated dataset. The runtime on a classic cluster is about 1 hour, which is pretty typical of many jobs we've encountered at companies. Serverless was able to run the job much faster, taking only 30 minutes, which was great to see. Unfortunately, the runtime savings didn't translate into cost savings, as you can see in the chart below.

In the test we are utilizing on-demand instances with list pricing.  Users will likely save even more if using Spot instances.  However, you have no option to use spot instances with serverless.  You have no access at all to what’s going on.

This test result  might not translate to your internal jobs. You may have a job that demonstrates that serverless massively outperforms an optimized cluster in terms of costs – it all depends on your workload.  With that said, this data point does prove that serverless is not GLOBALLY optimal. Serverless does not guarantee cost savings.

Some skeptics might say that the cost savings of serverless appear in the form of engineering hours saved. With serverless, engineers don't have to spend time thinking about clusters – which can translate to real time and money saved. We completely agree with this point of view; the engineering time saved can be substantial.

Our one counter argument is that getting started on any cluster is pretty easy today, so most people don't spend time tuning their clusters if they don't want to. Engineers typically resort to cluster tuning to help lower cost or improve performance. So if the cost and performance of serverless are not ideal, you're just out of luck – serverless may not be solving the root issue that tuning is attempting to solve.

At the end of the day everything depends on the particularities of your workload and use case.

2. Ideal for short or ad-hoc jobs

A great use case for serverless, which we fully endorse, is using serverless for short (<5 min) jobs. The elimination of spin up time for your cluster is a massive win that we really love.

Here's an experiment we ran using a trivial job that doesn't really do anything. The job doesn't even run Spark and is run on a single node cluster. The cost was slightly lower with serverless, but the big win was in the runtime, where we saw a roughly 80% reduction! This improvement is mostly due to the complete elimination of cluster spin up time, which can take 5-10 minutes.

What’s interesting is even though serverless does not have spin up time, the cost premium for serverless still equated to roughly the same overall cost – which was a bit disappointing.

With that said, the big win is that users may not always know that they are running a short job and could be massively over-provisioning their clusters by accident.

Serverless helps to avoid that mistake, and that can result in substantial cost savings simply by preventing human error.

3. Eliminating spin up time is the biggest value add 

We couldn’t love this aspect enough. Cluster spin up time is such a pain to deal with when you’re just trying to run something in real-time. So many times users have to wait for a cluster to spin up and get sidetracked by another task so that they don’t come back to the cluster until an hour later.

Our one nitpick here is that for scheduled jobs in a pipeline, spin up time is less of a concern. These are jobs that can be running at all hours of the day, at scales of thousands of jobs per day. At that scale, spin up time is really just a cost factor – and then the real question is whether serverless provides the lowest cost.

However, if you’re doing a quick ad-hoc experiment, or just want to get a quick result – we highly recommend serverless as you’ll get your result much faster.

4. Serverless has zero knobs, which makes life easy but at the price of control

Something quite unique about Databricks Jobs Serverless is that there are zero knobs. Not even "T-shirt" sizing, like what we have today in SQL serverless platforms. This means that you can't even select "small, medium, large, x-large" clusters – you don't get to select anything.

For a company that is just trying to get jobs up and running asap, we think this can be pretty great. It can save engineers some time when it comes to provisioning infrastructure.  

The big tradeoff is that you can’t change anything. If you care about cost and runtime and want the ability to tune performance, then this may not be a convenient feature.

5. You have no control over the runtime of your jobs

The big downside of jobs serverless is that there’s no way to tune the cluster to adjust cost or runtime. You basically have to live with whatever Databricks decides. This means that if you want faster runtime, you can’t just throw a bigger cluster at it and call it a day. You can’t do anything really, except change your code. You’re stuck.

We can only assume that eventually Databricks will throw in some high-level “performance” knob, as we think this is a pretty big limitation, but who knows.

6. Migrating to serverless is not easy

Serverless utilizes shared compute resources in the background, and as a result enforces a large number of general restrictions. We've heard rumors that it's a giant Spark on Kubernetes cluster, but don't quote us on that.

Some of the more impactful restrictions of serverless are:

  • You must have Unity Catalog enabled
  • Scala and R are not supported
  • Only ANSI SQL is supported when writing SQL
  • Spark RDD APIs are not supported
  • Caching API and SQL commands are not supported
  • Global temporary views are not supported
  • You cannot access DBFS

The list goes on and on, with over 100 limitations in total. We, in fact, had a hard time getting ANY job to run on serverless. We ran into issues even with simple test jobs. It wasn't until we manually changed the code and moved data around that we finally got it to work.

Our opinion is that unless this improves dramatically, it will be a giant lift-and-shift effort for enterprises who built their jobs on classic compute. Serverless eliminates the general flexibility we had on classic clusters. This will likely slow the adoption of serverless for larger companies. Probably new workloads will get onboarded to serverless first, before any big migration effort takes place.

7. Costs are completely determined by Databricks

One thing we found troubling was the pricing. On the Databricks website they say that the cost is $0.35/DBU. But, where is the DBU/hr metric? Normally, one would take the runtime of the job and calculate a $/hr rate. Then, a user could tune the cluster size and tune the rate of cost. But with zero knobs, we have no control over the rate of cost.  

It appears that we only get the number of DBUs a job costs AFTER it completes via the system tables. We find it odd that this number is calculated completely in the dark. Today my job costs 10 DBUs, next month it may cost 12 DBUs. Why? No clue. The amount of power Databricks has here is a bit overreaching in our opinion.  

In fact, because there is no DBU/hr metric, the “list price” they have here of $0.35/DBU is incomplete.  Since the amount of DBUs my jobs cost is solely under Databricks’ control, the end price is pretty much arbitrary.  For example, the calculation of job cost is simply: $0.35/DBU * X DBU, where X is the amount of DBUs Databricks determines and reports in the system tables.
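If you want to reconstruct that number yourself, the DBUs a run consumed can be pulled from the billing system tables after the fact and multiplied out. A hedged sketch against system.billing.usage, run from a notebook; the job ID is hypothetical, and the rate is the list price quoted above, not necessarily what your contract specifies:

```python
# Sum the DBUs attributed to one job over the last 30 days, then apply the
# list rate. The rate is known up front; the DBU quantity is whatever
# Databricks reports in the system tables after each run.
usage = spark.sql("""
    SELECT usage_date,
           SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_metadata.job_id = '123456789'                 -- hypothetical job ID
      AND usage_date >= current_date() - INTERVAL 30 DAYS
    GROUP BY usage_date
    ORDER BY usage_date
""")

LIST_RATE_PER_DBU = 0.35  # $/DBU list price discussed in this post
usage.selectExpr(
    "usage_date",
    "dbus",
    f"round(dbus * {LIST_RATE_PER_DBU}, 2) AS est_cost_usd",
).show()
```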

This is quite the advantage for Databricks, and in our opinion, it will result in large revenue and profit growth. Any compute optimization they do on the backend helps widen their margins, while the cost to the user stays mysterious. I don’t blame Databricks for doing this, and this is also not a new concept. In fact, many companies prefer to run serverless for this revenue generating reason. Users get “convenience” and Databricks makes more money, it’s a fair exchange.

What's funny is that this approach resembles Snowflake's serverless model. Not only is Snowflake Databricks' mortal enemy, but many people complain that Snowflake is too expensive for this very reason. It will be interesting to see how the market reacts in the long run and whether sticker shock from serverless bills will cause some CFOs to take action.

8. What happens if there’s an error?

In Spark-land, everyone knows of the dreaded out of memory (OOM) error. What happens if the cluster under the serverless hood hits this error for your job? Typically we would try to fix this with more memory, but we don’t have that option with serverless. 

Are users now dependent on Databricks to fix this? That could be dangerous. In cluster-land, a million things can go wrong, and now that it’s all managed by Databricks you’re pretty much at the mercy of their support team if anything critical goes down.

9. You can’t leverage your cloud contracts

If you have established special discounts with AWS or Azure, or custom plans for certain instances – listen up. If you use serverless jobs those discounts can become irrelevant in regards to your Databricks usage. This is because on serverless the compute runs inside the Databricks environment.

This may or may not apply to your company, it really depends on the nature of your contracts and the volume of your Databricks usage. We thought we’d point it out as it is a major difference between classic and serverless compute.

Conclusion

Distributed computing is a very complex topic, and rarely is anything a guaranteed slam dunk. In this post, we covered the pros and cons of Databricks serverless jobs as it stands today. Some will find this capability beneficial, and that's fantastic! Based on our initial analysis, we found serverless jobs to be ideal for short and ad-hoc jobs, as the largest value add is the elimination of spin up times.

Like most companies, Databricks' marketing overshoots the benefits of their features from time to time. We've already reported on this in regards to Photon and Autoscaling, and we hope we have now sprinkled some truth in regards to Serverless as well. By the way, Databricks states that Photon and autoscaling are automatic for jobs serverless, which, in our analysis, often leads to unnecessary cost increases.

As you can imagine, we get a lot of questions about serverless. Here are the top two questions we are asked about Databricks serverless jobs:

Databricks says serverless is cost efficient, so what’s the deal?

Yes, Databricks has presented materials touting that serverless is cost efficient relative to classic compute. This presentation from DAIS 2024 is a great example. The real question is – what do you compare serverless to? If you’re comparing serverless to just default settings, then serverless may likely do very well. But a workload optimized cluster can likely outperform serverless. Finding an optimized cluster, though, is by no means an easy feat. How do we determine how “optimal” it is?  

Skeptics may say that we are biased as well, that examples above are based on jobs we knew could outperform serverless. This is a totally fair position to have. For the record – we did not cherry pick a workload in our tests, we just quickly found the first job we could that was even compatible with serverless. 

At the end of the day, benchmarks presented by external parties can be totally irrelevant to your use case. Fancy benchmarks like TPC-DS, or even the one we shared in this post do not look like your jobs. There’s only one thing that really matters: YOUR WORKLOADS.  

Finally, we say, don't take our word for it, nor Databricks'. Compare serverless head to head with an optimized classic cluster and see for yourself. If you need help, Gradient is here to lead you to that coveted optimized cluster. Or, if you have the knowledge and skills, you can manually optimize your cluster and compare it to serverless.

How does this impact Sync?

Does serverless impact the Sync roadmap? We have a short-term and a long-term answer to this question:

Short term: Serverless and classic will co-exist for quite some time. We see serverless as just another option (like Photon, or autoscaling). Sync’s algorithms can test whether serverless or an optimized classic cluster is best for your needs, and share recommendations or auto-apply those changes for you. At the end of the day, all we care about is selecting the best options and configurations based on your goals. If serverless is that, then we’ll be happy to point your jobs in that direction.

Long term: We are a cloud compute management company, and Databricks is just the first stop in our evolution. Our plan is to expand to all facets of cloud computing, from Spark to bare CPUs, GPUs, Kubernetes, and more – it's all up for grabs. Databricks has been a great partner and an important first step, but longer term, it will be one of dozens of platforms we support. Our guess is that optimizing classic compute will play a role for Databricks users for many years to come, and we're happy to help them with that.

Like we said earlier, don’t take our word for it.  Try serverless out for yourself, do your own homework.  Conduct an A/B test and see if serverless is actually cost effective.  If you need help automatically finding the optimized classic cluster, feel free to check us out.

Gradient Product Update— Discover, Monitor, and Automate Databricks Clusters

With the Databricks Data+AI Summit 2024 just around the corner, we of course had to have a major product launch to go with it!

We’re super excited to announce an entirely new user flow and features to the product, making it faster to get started and providing a more comprehensive management solution.  At a high level, the new expansion involves these new features:

  1. Discover – Find new jobs to optimize
  2. Monitor – Track all job costs and metrics over time
  3. Automate – Auto-apply to save time, costs, and hit SLAs

With ballooning Databricks costs and constrained budgets, Databricks efficiency is crucial for sustainable growth for any company.  However, optimizing Databricks clusters is a difficult and time consuming task riddled with low level complexity and red tape.  

Our goal with Gradient is to make it as easy and painless as possible to identify jobs to optimize, track overall ROI, and automate the optimization process.  

The last automation piece is what sets Gradient apart.  Gradient is designed to optimize at scale, for companies that have 100+ production jobs.  At that scale, automation is a must, and is where we shine.  Gradient provides a new level of efficiency unobtainable with any other tool on this planet.  

With automatic cluster management, engineering teams are free to pursue more important business goals while Gradient works around the clock.

Let’s drill in a bit deeper into what these new features are:

Discover 

Find your top jobs to optimize as well as discover new opportunities to improve your efficiency even more. This page is refreshed daily so you always get up-to-date insights and historical tracking.

How to get started – Simply enter your Databricks credentials and click go!  You can get running from scratch in less than a minute

What is shown:  

  • Top jobs to optimize with Gradient
  • Jobs with Photon enabled
  • Jobs with Autoscaling enabled
  • All purpose compute jobs
  • Jobs with no job ID (meaning they could come from an external orchestrator like Airflow)

To see how fast and easy it is to get the Discover page up and running, check out the video below:

Monitor 

Track Spark metrics and costs of all of your jobs managed with Gradient in a single pane of glass view.  Use this view to get a bird’s eye view on all of your jobs and track the overall ROI of Gradient with the “Total Savings” view.  

How to get started – Onboard your Databricks workspace in the integration page. This may require involving your devops teams as various cloud permissions are required.

What is shown: 

  • Total core hours
  • Total Spend
  • Total recommendations applied
  • Total cost savings
  • Total estimated developer time saved
  • Total number of projects
  • Number of SLAs met

Automate

Enable auto-apply to automatically optimize your Databricks jobs clusters to hit your cost and runtime goals.  Save time and money with automation.  

How to get started –  Onboard your Databricks workspace in the integration page (no need to repeat if already done above)

What is shown: 

  • Job costs over time
  • Job runtime over time
  • Job configuration parameters
  • Cluster configurations
  • Spark metrics
  • Input data size

Conclusion

Get started in a minute yourself with the Discover page and start finding new opportunities to optimize your Databricks environment. Log in to get started!

Or if you’d prefer a hands on demo, we’d be happy to chat.  Schedule a demo here

What is Databricks Unity Catalog (and Should I Be Using It)?

Since launching in 2013, Databricks has continuously evolved its product offerings, from machine learning pipelines to an end-to-end data warehousing and data intelligence platform.

While we at Sync are big fans of all things Databricks (particularly how to optimize cost and speed), we often get questions about understanding Databricks' new offerings—particularly as product development has accelerated in the last 2 years.

To help in your understanding, we wrote this blog post to address the question, “What is Databricks Unity Catalog?” and whether users should be using it (the answer is yes). We walk through a precise technical answer, and then dive into the details of the catalog itself, how to enable it and frequently asked questions.

What Is the Unity Catalog in Databricks?

The Databricks Unity Catalog is a centralized data governance layer that allows for granular security control and the management of data and metadata assets in a unified system within Databricks. Additionally, Unity Catalog provides tools for access control, auditing, logging, and lineage.

You can think of Unity Catalog as an update designed to bridge gaps in the Databricks ecosystem—specifically, to eliminate and improve upon third-party catalogs and governance tools. With many cloud-specific tools in use, Databricks brought in a unified solution for data discovery and governance that seamlessly integrates with its Lakehouse architecture. Thus, while Unity Catalog was initially billed as a governance tool, in reality it streamlines processes across the board. While simplistic, it's not wrong to say Unity Catalog simply makes everything in Databricks run smoother.

Notably, the Unity Catalog is being offered by default on the Databricks Data Intelligence Platform. This is because Databricks believes the Unity Catalog is a huge benefit to their users (and we are inclined to agree!). If you have access to the Unity Catalog, we highly recommend enabling it in your workspace. 

What benefits does the Databricks Unity Catalog have to offer?

The Unity Catalog benefits can be thought of in four buckets: data governance, data discovery, data lineage, and data sharing and access.  

Data Discovery

Unity Catalog provides a structured way to tag, document, and manage data assets and metadata. This enables a comprehensive search interface that draws on lineage metadata and history, and enforces security based on user permissions.

Users can either explore data objects through the Catalog Explorer, or parse through data using SQL or Python to query datasets and create dashboards from available data objects. In Catalog Explorer, users can preview sample data, read comments, and check field details (50 second preview from Databricks here).
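For example, once a dataset is registered in Unity Catalog, querying it from a notebook only requires the three-level namespace. The catalog, schema, and table names below are hypothetical:

```python
# Browse what's registered in Unity Catalog, then query a table directly.
spark.sql("SHOW CATALOGS").show()
spark.sql("SHOW SCHEMAS IN main").show()

# Three-level namespace: <catalog>.<schema>.<table>
orders = spark.table("main.sales.orders")  # hypothetical table
orders.limit(10).show()
```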

A preview of the Catalog explorer for data discovery in Unity Catalog (via Databricks/Youtube)

Data Governance

Unity Catalog is a layer over all external compute platforms and acts as a central repository for all structured and unstructured data assets (such as files, dashboards, tables, views, volumes, etc). This unified architecture allows for a governance model that includes controls, lineage, discovery, monitoring, auditing, and sharing.

Unity Catalog thus offers a single place to administer data access policies that apply across all workspaces. This allows you to simplify access management with a unified interface to define access policies on data and AI assets and consistently apply and audit these policies on any cloud or data platform.

All of Databricks' governance parameters can be accessed via the Unity Catalog Governance Portal. The Databricks Data Intelligence Platform leverages AI to understand the context of tables and columns, the volume of which can be impossible to categorize manually. This also enables you to quickly assess how many of your tables are monitored via Lakehouse Monitoring — Databricks' new AI-for-governance tool.

 A screenshot of the Unity Catalog Governance portal shows how their Lakehouse Monitoring uses AI to automatically monitor tables and alert users to uses like PII leakage or data drift (via Databricks/Youtube)

With Lakehouse Monitoring you can also set up alerts that automatically detect and correct issues such as PII leakage, data quality problems, and data drift. These auto alerts are contained within their own section of the Governance Portal, which shows when the issue was first detected, and where the issue first stemmed from.

A preview of the governance action items shows how issues are identified by cause and Catalog/Schema/Table. Digging further in will reveal the time and date of first incidence as well as where it stems from.

Unity Catalog also incorporates a data governance framework and maintains an extensive audit log of actions performed on data stored within a Databricks account.

Data Lineage

As the importance of data lineage has grown, Databricks has responded with end-to-end lineage for all workloads. Lineage data includes notebooks, workflows, and dashboards, and is captured down to the column level. Unity Catalog users can parse and extract lineage metadata from queries and external tools using SQL or any other language enabled in their workspace, such as Python. Lineage can be visualized in Catalog Explorer in near real time.

Unity Catalog's lineage feature provides a comprehensive view of both upstream and downstream dependencies, including the data type of each field. Users can easily follow the data flow through different stages, gaining insights into the relationships between fields and tables.

An example of the metadata lineage within Unity Catalog

Like their governance model, Databricks restricts access to data lineage based on the logged-in users’ privileges.
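If you prefer to pull lineage programmatically rather than through Catalog Explorer, lineage is also exposed through system tables. A hedged sketch, assuming the access system schema is enabled in your account; the table name being inspected is hypothetical:

```python
# Upstream and downstream reads/writes recorded for one table over the last week.
lineage = spark.sql("""
    SELECT source_table_full_name,
           target_table_full_name,
           entity_type,
           event_time
    FROM system.access.table_lineage
    WHERE (source_table_full_name = 'main.sales.orders'   -- hypothetical table
           OR target_table_full_name = 'main.sales.orders')
      AND event_time >= current_timestamp() - INTERVAL 7 DAYS
    ORDER BY event_time DESC
""")
lineage.show(truncate=False)
```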

Data Sharing and Access

One of the most welcomed features of Databricks Unity Catalog is its built-in sharing method which is built on Delta Sharing, Databricks’ popular cloud-platform-agnostic open protocol for sharing data and managing permissions launched in 2021.

Within Unity Catalog, access control mechanisms use identity federation, allowing Databricks principals to be service principals, individual users, or groups. In addition, SQL-based syntax or the Databricks UI can be used to manage and control access at the level of tables, rows, and columns, with attribute-level controls coming soon.

How Does Databricks Unity Catalog Enhance Data Governance and Security?

Databricks has a standards-compliant security model based on ANSI SQL and allows administrators to grant permissions in their existing data lake using familiar syntax, at the level of catalogs, databases (also called schemas), tables, and views.

Unity Catalog grants user-level permissions for the Governance Portal, Catalog Explorer, and for data lineage and sharing. In effect, Unity Catalog has one model for safeguarding appropriate access across your full data estate, with permissions plus row-level and column-level security.
It also allows registering and governing access to external data sources, such as cloud object storage, databases, and data lakes, through external locations and Lakehouse Federation.

Does Unity Catalog Help With Databricks Cost?

Yes, because Unity Catalog reduces both storage costs and fees for external licensing, it reduces cost compared to previous solutions. It also indirectly saves time by greatly reducing bottlenecks for ingesting data, reducing time spent on repetitive tasks by an average of 80% (according to Databricks). This all comes free and automatically enabled for all new users of the Databricks Data Intelligence Platform. 

How do I set up and configure Unity Catalog in Databricks?

The following is a step-by-step guide to setting up and configuring Databricks Unity Catalog. 

  1. Confirm Your Workspace Is Enabled For Unity Catalog.
    Log into your account and click Workspaces. From there, check the Metastore column. If a metastore name is present, your workspace is attached to a Unity Catalog metastore.

If your workspace doesn't return a metastore, you'll want to either enable and attach your workspace, or create a Unity Catalog metastore.

  2. Add users and assign the workspace admin role.
    The user who creates a workspace is automatically given the admin role. That admin can then add and invite users, and can assign workspace admin roles and metastore admin roles.
  3. Create Clusters or SQL Warehouses for users to run queries and create objects. To run Unity Catalog workloads, compute resources must comply with certain security requirements. As a workspace admin, you can restrict compute creation to admins or let users create their own SQL warehouses and clusters.
  4. Grant Privileges to Users. To create objects and access them in Unity Catalog catalogs and schemas, a user must have permission to do so. See how to grant privileges and manage admin privileges.
  5. Create New Catalogs and Schemas. To start using Unity Catalog, you must have at least one catalog defined. Catalogs are the primary unit of data isolation and organization in Unity Catalog. All schemas and tables live in catalogs, as do volumes, views, and models. You'll want to create managed storage for the new catalog, bind the new catalog to your workspace, and then grant privileges for that catalog. Full instructions here. A short sketch of steps 4 and 5 follows this list.
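As a rough sketch of steps 4 and 5, catalog creation and grants can be issued from any Unity Catalog-enabled notebook or SQL warehouse. The catalog, schema, storage path, and group names below are placeholders:

```python
# Step 5: create a catalog (with managed storage) and a schema inside it.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS analytics
    MANAGED LOCATION 's3://my-bucket/uc/analytics'   -- hypothetical storage path
""")
spark.sql("CREATE SCHEMA IF NOT EXISTS analytics.sales")

# Step 4: grant a group what it needs to use and build in that catalog.
spark.sql("GRANT USE CATALOG ON CATALOG analytics TO `data-engineers`")
spark.sql("GRANT USE SCHEMA, CREATE TABLE ON SCHEMA analytics.sales TO `data-engineers`")
```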

What Integrations Work With Databricks Unity Catalog?

Unity Catalog works with existing data storage systems and governance solutions such as Atlan, Fivetran, dbt, and Azure Data Factory. It also integrates with business intelligence solutions such as Tableau, Power BI, and Qlik. This makes it simple to leverage your existing infrastructure for an updated governance model without incurring expensive migration costs (for a full list of integrations, check out Databricks' page here).

What if my workspace wasn’t enabled for Unity Catalog automatically?

If your workspace was not enabled for Unity Catalog automatically, an account admin or metastore admin must manually attach the workspace to a Unity Catalog metastore in the same region. If no Unity Catalog metastore exists in the region, an account admin must create one. For instructions, see Create a Unity Catalog metastore.

Unity Catalog Limitations

The following limitations apply for all object names in Unity Catalog:

  • Object names cannot exceed 255 characters.
  • The following special characters are not allowed:
    • Period (.)
    • Space ( )
    • Forward slash (/)
    • All ASCII control characters (00-1F hex)
    • The DELETE character (7F hex)
  • Unity Catalog stores all object names as lowercase.
  • When referencing UC names in SQL, you must use backticks to escape names that contain special characters such as hyphens (-).

For a full list of Unity Catalog Limitations, read the full documentation for the Unity Catalog.

Unity Catalog FAQs

  • How does Databricks Unity Catalog differ from Hive Metastore?
    Databricks Unity Catalog offers a centralized data governance model, supports external data access, data isolation, and advanced features like column-level security, while Hive Metastore has limited governance capabilities.
  • How Long is Lineage Data Stored in Databricks Unity Catalog?
    Lineage data on Databricks Unity Catalog is retained for 1 year.
  • What are the supported compute and cluster access modes for Databricks Unity Catalog?
    Supported access modes are Shared Access Mode and Single User Access Mode. No-Isolation Shared Mode is not supported.
  • What data file formats are supported for managed and external tables in Databricks Unity Catalog?
    Managed tables must use the Delta table format, while external tables can use Delta, CSV, JSON, Avro, Parquet, ORC, and Text formats.
  • How do you enable your workspace for Databricks Unity Catalog?
    You can enable Unity Catalog during workspace creation or assign an existing metastore to your workspace through the Databricks account console.
  • How do you control access to data and objects in Databricks Unity Catalog?
    You can use admin privileges, object ownership, privilege inheritance, basic object privileges (GRANT/REVOKE), dynamic views for row/column security, and manage external locations and credentials.
  • What is the Databricks Unity Catalog object model?
    The object model follows a hierarchical structure: Metastore ► Catalog ► Schema ► Tables, Views, Volumes, and Models.
  • Can you transfer ownership of objects in Unity Catalog?
    Yes, you can transfer ownership of catalogs, schemas, tables, and views to other users or groups using SQL commands or the Catalog Explorer UI.
  • How do you create a new catalog in Unity Catalog?
    You can use the CREATE CATALOG SQL command, specifying a name and managed location if needed. You must have CREATE CATALOG privileges on the metastore.
  • How do you grant permissions on a catalog or schema?
    Use the GRANT statement with the desired privileges (e.g., CREATE SCHEMA, CREATE TABLE) and the catalog or schema name, followed by the user or group to grant access to.
  • What is the syntax for referring to a table in Unity Catalog?
    Use the three-part naming convention: <catalog>.<schema>.<table>
  • How do you create a managed table in Unity Catalog?
    Use the CREATE TABLE statement, specifying the table name, columns, and partitioning if needed. Managed tables are created in the managed storage location.
  • Can you access data in the Hive Metastore through Unity Catalog?
    Yes, data in the Hive Metastore becomes a catalog called hive_metastore, and you can access tables using the hive_metastore.<schema>.<table> syntax.
  • How do you drop a table in Databricks Unity Catalog?
    You can use the DROP TABLE statement followed by the fully qualified table name (e.g., DROP TABLE <catalog>.<schema>.<table>). A short sketch covering these table operations follows this list.
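To make a few of the FAQ answers above concrete, the core table operations look roughly like this in a notebook; the catalog, schema, and table names are placeholders:

```python
# Managed table in Unity Catalog: three-part name, Delta format by default.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.sales.orders (
        order_id BIGINT,
        amount   DECIMAL(10, 2),
        ts       TIMESTAMP
    )
""")

# Legacy Hive metastore data remains reachable under the hive_metastore catalog.
legacy = spark.table("hive_metastore.default.orders_raw")  # hypothetical table

# Dropping a table uses the same fully qualified name.
spark.sql("DROP TABLE IF EXISTS analytics.sales.orders")
```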

Unity Catalog is the solution to a problem that arose as Databricks grew beyond its initial usage. In order to streamline the various product offerings within its ecosystem, Databricks introduced Unity Catalog to eliminate third-party integrations, particularly in the realm of data governance. We feel this has been tremendously well executed, and since Unity Catalog comes free and installed by default for all new Databricks Data Intelligence Platform users, we feel it's highly advantageous to maximize its utility, particularly for data governance, lineage, and data discovery.

Why is your Databricks job performance changing over time?

For those running and tracking their production Databricks jobs, many may often see “random” fluctuations in runtime or slowly changing performance over days and weeks.

Immediately, people may often wonder:

  • “Why is my runtime increasing since last week?”
  • “Is the cost of this job also increasing?”
  • “Is the input data size changing?”
  • “Is my job spilling to disk more than before?”
  • “Is my job in danger of crashing?”

To help give engineers and managers more visibility into how their production jobs are performing over time, we just launched a new visualization feature in Gradient that will hopefully help provide quick answers to engineers.

A snapshot of the new visualizations is shown below, where on a single page you can see and correlate all the various metrics that may be impacting your cost and runtime performance.  In the visualization below, we track the following parameters:

  • Job cost (DBU + Cloud fees)
  • Job Runtime
  • Number of core*hours
  • Number of workers
  • Input data size
  • Spill to disk
  • Shuffle read/write

Why does job performance even change?

The main three reasons why we see job performance change over time are:

  1. Code changes – Obviously, with any significant code changes, your entire job could behave differently. Tracking and understanding how your new code changes impact the business output, however, is less clear. With these new visualizations, engineers can quickly see the "before and after" impact of any code changes they implement.

  2. Data size changes – When your data increases or decreases, this can impact your job runtime (and hence costs). While this makes sense, tracking and seeing how it changes over time is much more subtle. It may be a slowly varying amount, or it could be very spiky data events with sudden changes. Understanding how your data size impacts your runtime is a critical "first check."

  3. Spot instances revoking – When spot nodes are randomly pulled during your job execution, it can have a significant impact on the runtime of your job. Since Spark has to essentially "recover" from a worker being pulled, the impact on runtime can be significant. We've seen runtimes go up 2-3X simply from 1 worker being pulled. It all depends on at what point the Spot instance is pulled. Since this is basically random, your Spot jobs can have wildly varying runtimes.


As a side note, we’ve observed that an optimized on-demand cluster can often beat Spot pricing because of this very reason.  Over the long haul, a stable and optimized on demand cluster is better than a wildly varying Spot cluster.

How does this feature differ from Overwatch?

For those in the know, Overwatch is a great open source tool built by Databricks to help plot all sorts of metrics to help teams monitor their jobs.  The main differences and advantages of the visualizations we show are:

1)  Total cost metrics – Gradient pulls both the DBU and estimated cloud costs for your clusters and shows you the total cost. Cost data from Databricks only includes DBUs. While it is possible to wrap in cloud costs with Overwatch, it's a huge pain to set up and configure. Gradient does this "out of the box."

2)  “Out of the box” Ready – While Overwatch does technically contain the same data that Gradient shows, users would still have to write queries to do all the aggregations properly by pulling the task level or stage level tables as required. Overwatch is best considered a “data dump” and then users will have to wrangle it correctly and do all the dashboarding work in a way that meets their needs. Our value add is that we do all this leg work for you and just present the metrics from day 1.

3)  Trends over time – Gradient aggregates the data to show users how all these various metrics are changing over time, so users can quickly understand what has changed recently. Looking at a single snapshot in time is often not very useful, as users need to see "what happened before?" to really understand what has changed and what they can do about it. While technically this is doable with Overwatch, it requires users to do the work of building and collecting the data. Gradient does this "out of the box."

How does this help with stability and reliability?

Beyond cost efficiency, stability is often the highest priority of all. Nobody wants a crashed job. These metrics give engineers "early signals" if their cluster is headed toward a dangerous cliff. For example, data sizes may start growing beyond the memory of your cluster, which could cause the dreaded "out of memory" error.

Seeing how your performance is trending over time is a critical piece of information users need to help prevent dangerous crashes from happening.

Conclusion

We hope this new feature makes life a lot easier for all the data engineers and platform managers out there.  This feature comes included with Gradient and is live today!  We’d love your feedback.

It probably goes without saying that this feature is in addition to our active optimization solution that can auto-tune your clusters to hit your cost or runtime goals. The way we look at it is that we're expanding our value to our users by providing critical metrics.

We’d love your feedback and requests for what other metrics you’d love to see.  Try it out today or reach out for a demo!

Databricks Delta Live Tables 101

Databricks' DLT offering showcases a substantial improvement in the data engineering lifecycle and workflow. By offering a pre-baked, opinionated pipeline construction ecosystem, Databricks has finally started offering a holistic end-to-end data engineering experience from inside its own product, which provides superior solutions for raw data workflows, live batching, and a host of other benefits detailed below.

  1. What are Delta Live Tables?
  2. How are Delta Live Tables, Delta Tables, and Delta Lake related?
  3. Breaking Down The Components of Delta Live Tables
  4. When to Use Views or Materialized Views in Delta Live Tables
  5. What are the advantages of Delta Live Tables?
  6. What is the cost of Delta Live Tables?

Since its release in 2022, Databricks' Delta Live Tables has quickly become a go-to end-to-end resource for data engineers looking to build opinionated ETL pipelines for streaming data and big data. The pipeline management framework is considered one of the most valuable offerings on the Databricks platform and is used by over 1,000 companies, including Shell and H&R Block.

As an offering, DLT begins to look similar to the DBT value proposition, and with a few changes (namely, Jinja templating), DLT may be poised to expand further into what has traditionally been considered DBT's wheelhouse. DLT is also positioned to begin consuming workloads that were previously handled by separate orchestration, observability, and data quality vendors.

In our quest to help customers manage, understand, and optimize their Databricks workloads, we set out to understand the value proposition both for customers and for Databricks. In this post, we break down DLT as a product offering as well as its ROI for customers.

What Are Delta Live Tables?

Delta Live Tables, or DLT, is a declarative ETL framework that dramatically simplifies the development of both batch and streaming pipelines. Concretely, though, DLT is just another way of authoring and managing pipelines in Databricks. Tables are created using the @dlt.table() annotation on top of functions (which return queries defining the table) in notebooks.
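
To make that concrete, here is a minimal sketch of two dataset definitions as they might appear in a notebook attached to a DLT pipeline. The table names and the source path are hypothetical, and `spark` is the session the Databricks runtime provides inside a notebook.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders loaded from cloud storage (illustrative path).")
def raw_orders():
    # The function returns a Spark DataFrame; DLT materializes it as a managed table.
    return spark.read.format("json").load("/data/raw_orders")

@dlt.table(comment="Orders filtered and timestamped for downstream consumers.")
def clean_orders():
    # dlt.read() references another dataset in the same pipeline, which is how
    # DLT infers the dependency DAG between tables.
    return (
        dlt.read("raw_orders")
        .where(F.col("order_total") > 0)
        .withColumn("processed_at", F.current_timestamp())
    )
```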

Delta Live Tables are built on foundational Databricks technology such as Delta Lake and the Delta file format, and they operate in conjunction with both. However, whereas those two focus on the more "stagnant" storage portions of the data process, DLT focuses on the transformation piece. Specifically, the DLT framework allows data engineers to describe how data should be transformed between tables in the DAG.

Delta Live Tables are largely used to allow data engineers to accelerate their construction, deployment, and monitoring of a data pipeline. 

The magic of DLT, though, is most apparent for datasets that involve both streaming and batch processing. Whereas in the past users had to be keenly aware of the "velocity" of the data being transformed (batch vs. streaming) and design pipelines around it, DLT lets users push this problem to the system itself: users write declarative transformations and let the system figure out how to handle the streaming or batch components. Operationally, Delta Live Tables add an abstraction layer over Apache Spark (or at least Databricks' flavor of Spark). This layer provides visibility into the table dependency DAG, allowing authors to visualize what can rapidly become complex inter-table dependencies.

The DAG may look something like this:

Table dependency visualization is just the beginning. DLT provides a comprehensive suite of tools on top of these pipelines, set up by default, including data quality checks, orchestration, governance, and more.

When executed properly, DLT helps with total cost of ownership, data accuracy and consistency, speed, and pipeline visibility and management. Many say that DLT is Databricks' foray into the world of DBT, hoping to cannibalize DBT's offering. As for how that plays out, we'll just have to wait and see.

How Are Delta Live Tables, Delta Tables, and Delta Lake Related?

The word "Delta" appears a lot in the Databricks ecosystem, and to understand why, it helps to look back at the history. In 2019, Databricks publicly announced Delta Lake, a foundational element for storing data (tables) in the Databricks Lakehouse. Delta Lake popularized the idea of a table format on top of files, with the goal of bringing reliability to data lakes. To that end, Delta Lake provided ACID transactions, scalable metadata handling, and unified streaming/batch processing on top of existing data lakes in a Spark API-compatible way.

Tables that live inside this Delta Lake are written using the Delta Table format and, as such, are called Delta Tables. Delta Live Tables focus on the "live" part of the data flow between Delta tables, usually called the "transformation" step in the ETL paradigm. Delta Live Tables (DLTs) offer declarative pipeline development and visualization.

In other words, a Delta Table is a way to store data in tables, whereas Delta Live Tables lets you describe how data flows between those tables declaratively. Delta Live Tables is a declarative framework that manages many Delta tables by creating them and keeping them up to date. In short, Delta Tables are a data format, while Delta Live Tables is a data pipeline framework. All are built on the data lakehouse infrastructure of Delta Lake.

Breaking Down The Components Of Delta Live Tables

The core of DLT is the pipeline: the main unit of execution used to configure and run data processing workflows with Delta Live Tables. These pipelines link data sources to target datasets through what's known as a Directed Acyclic Graph (DAG), and are declared in Python or SQL source files. Delta Live Tables infers the dependencies between these tables, ensuring updates occur in the correct order.

Each pipeline is defined by its settings, such as its source notebook, running mode, and cluster configuration. Before processing data with Delta Live Tables, you must configure a pipeline.

As an aside for the developers reading this: for some reason, the Databricks SDK defines a Pipeline as a list of clusters, which may be either a preview of features to come or an oversight. We'll find out soon.

A Delta Live Tables pipeline supports three types of datasets: streaming tables, materialized views, and views. Streaming tables are ideal for ingestion workloads and pipelines that require data freshness and low latency; they are designed for append-only data sources.

Supported views can either be materialized views, where the results have been precomputed based on the update schedule of the pipeline in which they're contained, or views, which compute results from source datasets as they are queried (leveraging caching optimizations when available). Delta Live Tables does not publish views to the catalog, so views can only be referenced within the pipeline in which they're defined. Views are useful as intermediate queries that should not be exposed to end users or systems. Databricks describes how each is processed with the following table:

Dataset Type | How are records processed through defined queries?
Streaming Table | Each record is processed exactly once. This assumes an append-only source.
Materialized Views | Records are processed as required to return accurate results for the current data state. Materialized views should be used for data sources with updates, deletions, or aggregations, or for change data capture (CDC) processing.
Views | Records are processed each time the view is queried. Use views for intermediate transformations and data quality checks that should not be published to public datasets.

After defining your pipeline settings, you can declare your datasets in DLT using either SQL or Python. These declarations can then trigger an update to calculate results for each dataset in the pipeline.
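
As a hedged illustration, the three dataset types might be declared in Python roughly as follows. The names and the landing path are hypothetical, and in Python the dataset type is inferred from the decorator and from whether the returned query is streaming or batch.

```python
import dlt
from pyspark.sql import functions as F

# Streaming table: an append-only ingestion source, each record processed once.
@dlt.table(comment="Raw events ingested incrementally with Auto Loader.")
def events_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/data/events")  # hypothetical landing path
    )

# View: recomputed when referenced, visible only inside this pipeline.
@dlt.view(comment="Events with a non-null id; not published to the catalog.")
def events_validated():
    return dlt.read("events_raw").where(F.col("event_id").isNotNull())

# Materialized view: a batch query over upstream datasets, precomputed on each pipeline update.
@dlt.table(comment="Daily event counts, refreshed on every pipeline update.")
def events_by_day():
    return (
        dlt.read("events_validated")
        .groupBy(F.to_date("event_ts").alias("event_date"))
        .count()
    )
```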

When to Use Views or Materialized Views in Delta Live Tables

Given that there are two options for creating views on top of data, there must be situations where one should be preferred over the other. The choice between a View and a Materialized View primarily depends on your use case. The biggest difference is that Views, as defined above, are computed at query time, whereas Materialized Views are precomputed. Views also have the added benefit of not requiring any additional storage, since they are computed on the fly.

The general rule of thumb comes down to the performance requirements and downstream access patterns of the table in question. When performance is critical, computing a view on the fly may be an unnecessary slowdown, even if it saves some storage; in that case, a Materialized View may be preferred. The same is true when a particular view has multiple downstream consumers: recomputing the exact same view on the fly for multiple tables is inefficient and unnecessary, so persisting it as a Materialized View may be preferred.

However, there are plenty of situations where users just need a quick view, computed in memory, to reference a particular state of a transformed table. Rather than materializing this table, which again is only needed for an operation within the same transformation, creating a View is more straightforward and efficient.

Databricks also recommends using views to enforce data quality constraints or to transform and enrich datasets that drive multiple downstream queries.
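
As a rough sketch of that pattern, assuming hypothetical dataset and rule names: DLT expectations can be attached to a view so the quality rules are enforced once, and every downstream table reads already-validated data.

```python
import dlt

@dlt.view(comment="Quality gate applied before data reaches downstream tables.")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")  # drop rows that violate the rule
@dlt.expect("positive_total", "order_total > 0")               # keep rows but record violations in metrics
def validated_orders():
    return dlt.read("clean_orders")

@dlt.table(comment="Aggregation that only ever sees validated records.")
def revenue_by_customer():
    return dlt.read("validated_orders").groupBy("customer_id").sum("order_total")
```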

What Are the Advantages of Delta Live Tables?

There are many benefits to using Delta Live Tables, including simpler pipeline development, better data quality standards, and support for unified real-time and batch analytics.

  • Unified streaming/batch experience. By removing the need for data engineers to build distinct streaming and batch data pipelines, DLT simplifies one of the most difficult pain points of working with data, offering a truly unified experience.
  • Opinionated Pipeline Management. The modern data stack is filled with orchestration players, observability players, data quality players, and many others. That makes it difficult, as a platform manager, not only to select and configure a standard/template data stack, but also to enforce those standards. DLT offers an opinionated way to orchestrate pipelines and assert data quality.
  • Performance Optimization. DLTs offer the full advantages of Delta Tables, which are designed to handle large volumes of data and support fast querying; vectorized query execution lets them process data in batches rather than one row at a time. This makes them ideal not just for real-time data ingestion but also for cleaning large datasets.
  • Management. Delta Live Tables automate otherwise manual tasks such as compaction and the selection of job execution order. Tests by Databricks show that, with automatic orchestration, DLT was 2x faster than the non-DLT Databricks baseline, because DLT is better at orchestrating tasks than humans (meaning, they claim DLT is better at determining and managing table dependencies).
  • Built-in Quality Assertions. Delta Live Tables also provide data quality features, such as data cleansing and deduplication, out of the box. Users can specify rules to remove duplicates or cleanse data as it is ingested into a Delta Live Table, ensuring data accuracy. DLT automatically provides real-time data quality metrics to accelerate debugging and improve downstream consumers' trust in the data.
  • ACID Transactions. Because DLTs use the Delta format, they support ACID transactions (Atomicity, Consistency, Isolation, and Durability), which have become the standard for data quality and exactness.
  • Pipeline Visibility. Another benefit of Delta Live Tables is a Directed Acyclic Graph of your data pipeline workloads. In fact, this is one of the bigger reasons DBT adoption has occurred at the speed it has: simply visualizing data pipelines has been a common challenge. DLT gives you a clear, visually compelling way to both see and introspect your pipeline at various points.
  • Better CDC. Another large improvement in DLT is the ability to use Change Data Capture (CDC), including support for Slowly Changing Dimensions Type 2, just by setting the enableTrackHistory parameter in the configuration. This data history tracking is incredibly useful for audits and for maintaining consistency across datasets. We dive a bit further into this below.

What To Know About Change Data Capture (CDC) in Delta Live Tables

One of the large benefits of Delta Live Tables is the ability to use Change Data Capture on streaming data. Change Data Capture refers to tracking every change in a data source so those changes can be captured and propagated to all destination systems. This allows for a level of data integrity and consistency across systems and deployment environments that is a massive improvement.

With Delta Live Tables, data engineers can easily implement CDC with the new APPLY CHANGES INTO API (in either Python or SQL). The capability lets ETL pipelines easily detect source data changes and apply them to datasets throughout the lakehouse.

Importantly, Delta Live Tables supports Slowly Changing Dimensions (SCD) types 1 and 2. This matters because SCD type 2 retains a full history of values, which means even in your data lakehouse, where compute and storage are separate, you can retain a history of records, either on all updates or on updates to a specified set of columns.

In SCD type 2, when the value of an attribute changes, the current record is closed, a new record is created with the changed values, and this new record becomes the current record. For example, if a user entity in the database moves to a different address, we can store all of that user's previous addresses.
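
A hedged sketch of what this looks like with the Python apply_changes API (helper names can vary slightly across DLT runtime versions, and the change feed, key, and sequencing columns below are hypothetical):

```python
import dlt

# Target streaming table that apply_changes keeps up to date from the change feed.
dlt.create_streaming_table("users")

dlt.apply_changes(
    target="users",
    source="users_cdc_feed",     # a streaming table or view carrying the change events
    keys=["user_id"],            # key used to match incoming changes to existing rows
    sequence_by="updated_at",    # ordering column used to resolve out-of-order events
    stored_as_scd_type=2,        # SCD type 2: close the old record and open a new current one
)
```

With stored_as_scd_type=2, each address change closes the previous row and inserts a new current row, so the user's full address history remains queryable in the target table.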

This implementation is of great importance to organizations that need to maintain an audit trail of changes.

What is the cost of Delta Live Tables?

As with all things Databricks, the cost of Delta Live Tables depends on the compute itself (and varies by region and cloud provider). On AWS, DLT compute ranges from $0.20/DBU for DLT Core Compute Photon up to $0.36/DBU for DLT Advanced Compute. Keep in mind, however, that prices can be up to twice as high when applying expectations and CDC, which are among the chief benefits of Delta Live Tables.
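
As a hedged back-of-the-envelope illustration (the usage numbers below are made up, and only DBU charges are shown; the underlying cloud VM cost is billed separately):

```python
# Rough DLT DBU cost estimate for a daily pipeline run.
dbu_rate_usd = 0.36   # DLT Advanced Compute list price on AWS cited above, per DBU
dbus_per_hour = 4.0   # hypothetical: depends on instance types and cluster size
hours_per_day = 6.0   # hypothetical: total pipeline runtime per day

daily_dbu_cost = dbu_rate_usd * dbus_per_hour * hours_per_day
print(f"Estimated daily DBU cost: ${daily_dbu_cost:.2f}")  # about $8.64/day before VM costs
```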

From an efficiency perspective, DLT can reduce total cost of ownership. Databricks' automatic orchestration tests have shown total compute time cut by as much as half with Delta Live Tables, ingesting up to 1 billion records for under $1. Additionally, Delta Live Tables integrates the orchestrator and Databricks into a single console, which removes the cost of maintaining two separate systems.


However, users should also be cautioned that, without proper optimization, Delta Live Tables can result in a large increase in virtual machine instances, which is why it's crucial to manage your autoscaling resources carefully.

Want to learn more about getting started with Delta Live Tables? Reach out to us at info@synccomputing.com.