What is Databricks Unity Catalog (and Should I Be Using It)?

Since launching in 2013, Databricks has continuously evolved its product offerings from machine learning pipelines to an end-to-end data warehousing and data intelligence platform.

While we at Sync are big fans of all things Databricks (particularly how to optimize cost and speed), we often get questions about understanding Databricks’ new offerings, particularly as product development has accelerated in the last 2 years.

To help in your understanding, we wrote this blog post to address the question, “What is Databricks Unity Catalog?” and whether users should be using it (the answer is yes). We walk through a precise technical answer, and then dive into the details of the catalog itself, how to enable it and frequently asked questions.

What Is the Unity Catalog in Databricks?

Databricks Unity Catalog is a centralized data governance layer that allows for granular security control and for managing data and metadata assets in a unified system within Databricks. Additionally, Unity Catalog provides tools for access control, audits, logs and lineage.

You can think of Unity Catalog as an update designed to bridge gaps in the Databricks ecosystem, specifically to replace and improve upon third-party catalogs and governance tools. With many cloud-specific tools in use, Databricks brought in a unified solution for data discovery and governance that would seamlessly integrate with its Lakehouse architecture. Thus, while Unity Catalog was initially billed as a governance tool, in reality it streamlines processes across the board. While simplistic, it’s not wrong to say Unity Catalog simply makes everything in Databricks run smoother.

Notably, the Unity Catalog is being offered by default on the Databricks Data Intelligence Platform. This is because Databricks believes the Unity Catalog is a huge benefit to their users (and we are inclined to agree!). If you have access to the Unity Catalog, we highly recommend enabling it in your workspace. 

What benefits does the Databricks Unity Catalog have to offer?

The Unity Catalog benefits can be thought of in four buckets: data governance, data discovery, data lineage, and data sharing and access.  

Data Discovery

Unity Catalog provides a structured way to tag, document and manage data assets and metadata. This allows for a comprehensive search interface that utilizes lineage metadata (including full lineage history) and ensures security based on user permissions.

Users can either explore data objects through the Catalog Explorer, or use SQL or Python to query datasets and create dashboards from available data objects. In Catalog Explorer, users can preview sample data, read comments and check field details (50 second preview from Databricks here).

A preview of the Catalog Explorer for data discovery in Unity Catalog (via Databricks/Youtube)

Data Governance

Unity Catalog is a layer over all external compute platforms and acts as a central repository for all structured and unstructured data assets (such as files, dashboards, tables, views, volumes, etc). This unified architecture allows for a governance model that includes controls, lineage, discovery, monitoring, auditing, and sharing.

Unity Catalog thus offers a single place to administer data access policies that apply across all workspaces. This allows you to simplify access management with a unified interface to define access policies on data and AI assets and consistently apply and audit these policies on any cloud or data platform.

All of Databricks’ governance parameters can be accessed via the Unity Catalog Governance Portal. The Databricks Data Intelligence Platform leverages AI to understand the context of tables and columns, whose volume can make manual categorization impossible. The portal also enables you to quickly assess how many of your tables are monitored via Lakehouse Monitoring, Databricks’ new “AI for Governance” tool.

A screenshot of the Unity Catalog Governance Portal shows how Lakehouse Monitoring uses AI to automatically monitor tables and alert users to issues like PII leakage or data drift (via Databricks/Youtube)

With Lakehouse Monitoring you can also set up alerts that automatically detect and correct issues such as PII leakage, data quality problems, data drift and more. These auto alerts are contained within their own section of the Governance Portal, which shows when the issue was first detected and where it stemmed from.

A preview of the governance action items shows how issues are identified by cause and Catalog/Schema/Table. Digging further in will reveal the time and date of first incidence as well as where it stems from.

Unity Catalog also incorporates a data governance framework and maintains an extensive audit log of actions performed on data stored within a Databricks account.

Data Lineage

As the importance of data lineage has grown, Databricks has responded with end-to-end lineage for all workloads. Lineage data includes notebooks, workflows and dashboards and is captured down to the column level. Unity Catalog users can parse and extract lineage metadata from queries and external tools using SQL or any other language enabled in their workspace, such as Python. Lineage can be visualized in the Catalog Explorer in near real time.

Unity Catalog’s lineage feature provides a comprehensive view of both upstream and downstream dependencies, including the data type of each field. Users can easily follow the data flow through different stages, gaining insights into the relationships between fields and tables.

An example of the metadata lineage within Unity Catalog

As with its governance model, Databricks restricts access to data lineage based on the logged-in user’s privileges.

Data Sharing and Access

One of the most welcomed features of Databricks Unity Catalog is its built-in sharing method which is built on Delta Sharing, Databricks’ popular cloud-platform-agnostic open protocol for sharing data and managing permissions launched in 2021.

Within Unity Catalog, access control mechanisms use identity federation, allowing Databricks principals to be service principals, individual users, or groups. In addition, SQL-based syntax or the Databricks UI can be used to manage and control access at the level of tables, rows, and columns, with attribute-level controls coming soon.

How Does Databricks Unity Catalog Enhance Data Governance and Security?

Databricks has a standards-compliant security model based on ANSI SQL and allows administrators to grant permissions in their existing data lake using familiar syntax, at the level of catalogs, databases (also called schemas), tables, and views.

Unity Catalog grants user-level permissions for the Governance Portal, Catalog Explorer, data lineage and sharing. In effect, Unity Catalog has one model for safeguarding appropriate access across your full data estate with permissions, row-level security, and column-level security.
It also allows registering and governing access to external data sources, such as cloud object storage, databases, and data lakes, through external locations and Lakehouse Federation.

Does Unity Catalog Help With Databricks Cost?

Yes, because Unity Catalog reduces both storage costs and fees for external licensing, it reduces cost compared to previous solutions. It also indirectly saves time by greatly reducing bottlenecks for ingesting data, reducing time spent on repetitive tasks by an average of 80% (according to Databricks). This all comes free and automatically enabled for all new users of the Databricks Data Intelligence Platform. 

How do I set up and configure Unity Catalog in Databricks?

The following is a step-by-step guide to setting up and configuring Databricks Unity Catalog. 

  1. Confirm Your Workspace Is Enabled For Unity Catalog.
    Log into your account and click Workspaces. From there check the Metastore column. If a metastore name is present, it means your workspace is attached to a Unity Catalog metastore.
    If your workspace doesn’t show a metastore, you’ll want to either enable and attach your workspace, or create a Unity Catalog metastore.
  2. Add users and assign the workspace admin role.
    The user who creates a workspace is automatically added as a workspace admin. That admin can then add and invite users, and can assign workspace admin roles and metastore admin roles.
  3. Create Clusters or SQL Warehouses for users to run queries and create objects. To run Unity Catalog workloads, compute resources must comply with certain security requirements. As a workspace admin, you can opt to restrict compute creation to admins or let users create their own SQL warehouses and clusters.
  4. Grant Privileges to Users. To create objects and access them in Unity Catalog catalogs and schemas, a user must have permission to do so. See how to grant privileges and manage admin privileges.
  5. Create New Catalogs and Schemas. To start using Unity Catalog, you must have at least one catalog defined. Catalogs are the primary unit of data isolation and organization in Unity Catalog. All schemas and tables live in catalogs, as do volumes, views, and models. You’ll want to create managed storage for the new catalog, then bind the new catalog to your workspace, and then grant privileges for that catalog. Full instructions here

What Integrations Work With Databricks Unity Catalog?

Unity Catalog works with existing data storage systems and governance solutions such as Atlan, Fivetran, dbt or Azure Data Factory. It also integrates with business intelligence solutions such as Tableau, Power BI and Qlik. This makes it simple to leverage your existing infrastructure for an updated governance model, without incurring expensive migration costs (for a full list of integrations check out Databricks page here).

What if my workspace wasn’t enabled for Unity Catalog automatically?

If your workspace was not enabled for Unity Catalog automatically, an account admin or metastore admin must manually attach the workspace to a Unity Catalog metastore in the same region. If no Unity Catalog metastore exists in the region, an account admin must create one. For instructions, see Create a Unity Catalog metastore.

Unity Catalog Limitations

The following limitations apply for all object names in Unity Catalog:

  • Object names cannot exceed 255 characters.
  • The following special characters are not allowed:
    • Period (.)
    • Space ( )
    • Forward slash (/)
    • All ASCII control characters (00-1F hex)
    • The DELETE character (7F hex)
  • Unity Catalog stores all object names as lowercase.
  • When referencing UC names in SQL, you must use backticks to escape names that contain special characters such as hyphens (-).

For a full list of Unity Catalog Limitations, read the full documentation for the Unity Catalog.
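The naming rules listed above can be checked before an object is created. The following is a minimal Python sketch of those rules only; the function names are hypothetical and are not part of any Databricks API.

```python
import re
from typing import List

MAX_NAME_LENGTH = 255
# Period, space, forward slash, ASCII control characters (00-1F hex), DELETE (7F hex).
FORBIDDEN_CHARS = re.compile(r"[. /\x00-\x1f\x7f]")


def check_object_name(name: str) -> List[str]:
    """Return the rule violations for a proposed object name (empty list if none)."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append("longer than 255 characters")
    if FORBIDDEN_CHARS.search(name):
        problems.append("contains a period, space, forward slash, or control character")
    return problems


def stored_name(name: str) -> str:
    """Object names are stored as lowercase, so compare names in this form."""
    return name.lower()


if __name__ == "__main__":
    for candidate in ("sales_2024", "my.table", "Daily Orders"):
        print(candidate, "->", check_object_name(candidate) or "ok")
```

Because names are stored as lowercase, two names that differ only by case would refer to the same object, so it is worth comparing `stored_name` values rather than the raw strings.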

Unity Catalog FAQs

  • How does Databricks Unity Catalog differ from Hive Metastore?
    Databricks Unity Catalog offers a centralized data governance model, supports external data access, data isolation, and advanced features like column-level security, while Hive Metastore has limited governance capabilities.
  • How Long is Lineage Data Stored in Databricks Unity Catalog?
    Lineage data on Databricks Unity Catalog is retained for 1 year.
  • What are the supported compute and cluster access modes for Databricks Unity Catalog?
    Supported access modes are Shared Access Mode and Single User Access Mode. No-Isolation Shared Mode is not supported.
  • What data file formats are supported for managed and external tables in Databricks Unity Catalog?
    Managed tables must use the Delta table format, while external tables can use Delta, CSV, JSON, Avro, Parquet, ORC, and Text formats.
  • How do you enable your workspace for Databricks Unity Catalog?
    You can enable Unity Catalog during workspace creation or assign an existing metastore to your workspace through the Databricks account console.
  • How do you control access to data and objects in Databricks Unity Catalog?
    You can use admin privileges, object ownership, privilege inheritance, basic object privileges (GRANT/REVOKE), dynamic views for row/column security, and manage external locations and credentials.
  • What is the Databricks Unity Catalog object model?
    The object model follows a hierarchical structure: Metastore ► Catalog ► Schema ► Tables, Views, Volumes, and Models.
  • Can you transfer ownership of objects in Unity Catalog?
    Yes, you can transfer ownership of catalogs, schemas, tables, and views to other users or groups using SQL commands or the Catalog Explorer UI.
  • How do you create a new catalog in Unity Catalog?
    You can use the CREATE CATALOG SQL command, specifying a name and managed location if needed. You must have CREATE CATALOG privileges on the metastore.
  • How do you grant permissions on a catalog or schema?
    Use the GRANT statement with the desired privileges (e.g., CREATE SCHEMA, CREATE TABLE) and the catalog or schema name, followed by the user or group to grant access to.
  • What is the syntax for referring to a table in Unity Catalog?
    Use the three-part naming convention: <catalog>.<schema>.<table>
  • How do you create a managed table in Unity Catalog?
    Use the CREATE TABLE statement, specifying the table name, columns, and partitioning if needed. Managed tables are created in the managed storage location.
  • Can you access data in the Hive Metastore through Unity Catalog?
    Yes, data in the Hive Metastore becomes a catalog called hive_metastore, and you can access tables using the hive_metastore.<schema>.<table> syntax.
  • How do you drop a table in Databricks Unity Catalog?
    You can use the DROP TABLE statement followed by the fully qualified table name (e.g., DROP TABLE <catalog>.<schema>.<table>).
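Several of the answers above come down to composing SQL statements from a catalog, schema, and table name. As a rough illustration, the Python sketch below builds those statement strings; the helper names and the example catalog, schema, and group names are hypothetical, and the statements would still need to be run in a notebook or SQL warehouse.

```python
from typing import Iterable


def quote(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def table_ref(catalog: str, schema: str, table: str) -> str:
    """Build a three-part <catalog>.<schema>.<table> reference."""
    return ".".join(quote(part) for part in (catalog, schema, table))


def grant_on_catalog(privileges: Iterable[str], catalog: str, principal: str) -> str:
    """Build a GRANT statement for one or more privileges on a catalog."""
    return "GRANT {} ON CATALOG {} TO {}".format(
        ", ".join(privileges), quote(catalog), quote(principal)
    )


def drop_table(catalog: str, schema: str, table: str) -> str:
    """Build a DROP TABLE statement using the fully qualified table name."""
    return "DROP TABLE " + table_ref(catalog, schema, table)


if __name__ == "__main__":
    print(table_ref("hive_metastore", "default", "events"))
    print(grant_on_catalog(["CREATE SCHEMA"], "main", "data-engineers"))
    print(drop_table("main", "sales", "orders"))
```

Quoting every identifier is a deliberate simplification: it keeps names with hyphens valid without needing to decide which names require escaping.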

Unity Catalog is the solution to a problem that was created as Databricks grew beyond its initial usage. In order to streamline the various product offerings within their ecosystem, Databricks introduced the Unity Catalog to eliminate third-party integrations, particularly in the realm of data governance. We feel this has been tremendously well executed, and as Unity Catalog comes free and installed by default for all new Databricks Data Intelligence Platform users, we feel it’s highly advantageous to maximize its utility, particularly for data governance, lineage and data discovery.

March 2024 Release Notes

Our team has been hard at work to deliver industry-leading features to support users in achieving optimal performance within the Databricks ecosystem. Take a look at our most recent releases below.

Worker Instance Recommendations

Introducing Worker Instance Recommendations directly from the Sync UI. With this feature, you are able to tap into optimal cluster configuration recos so that your configs are optimized for individual jobs.

The instance recos within Gradient not only optimize the number of workers, but also the worker size. For example, if you are using i3.2xl instances, Gradient will find the right instance size (such as i3.xl, i3.4xl, i3.8xl, etc) in the i3 instance type.

Instance Fleet Support

If your company is using Instance Fleet Clusters, Gradient is now compatible!  There are no changes required on the user flow, as this feature is automatically supported in the backend.  Just onboard your jobs like normal into Gradient, and we’ll handle the rest.

Hosted Log Collection

Running Gradient is now more streamlined than ever! You’re now able to opt into hosted log collection entirely in the Sync environment with Sync-hosted or user-hosted collection options. What does this mean? It means that there are no extra steps or external clusters needed to run Gradient, allowing Sync to do all the heavy lifting while minimizing the impact on your Databricks workspace. 

With hosted DBX log collection within Gradient, you’re able to minimize onboarding errors due to annoying permission settings while increasing visibility into any potential collection failures, ultimately giving you and your team more control over your cluster log data.

Getting Started with Collection Setup
The Databricks Workspace integration flow is triggered when a user clicks on Add → Databricks Workspace after they have configured their workspace and webhook. Users will also now have a toggle option to choose between Sync-hosted (recommended) or User-hosted collection.

  • Sync-hosted collection – The user will be optionally prompted to share their preference for cluster logs stored for their Databricks Jobs. This will initially be an immutable setting saved on the Workspace.
    • For AWS – Users will need to add a generated IAM policy and IAM role to their AWS account. The IAM policy allows us to call ec2:DescribeInstances and ec2:DescribeVolumes, and optionally s3:GetObject and s3:ListBucket on the specific bucket and prefix to which users have configured uploading cluster logs. S3 permissions are optional because users may be using DBFS to record cluster logs. The user also needs to add a trust relationship to the IAM role that gives our Sync IAM role permission to sts:AssumeRole using an ExternalId we provide. Gradient generates this policy and trust relationship for the user in JSON format, ready to be copied and pasted.
    • For Azure – Coming soon!
  • User-hosted collection – For both Azure and AWS, setup proceeds as the normal workspace integration requirements dictate.
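To make the AWS permissions described above more concrete, here is a Python sketch that assembles an IAM policy and trust relationship of that general shape. This is not the JSON that Gradient generates; the bucket name, prefix, role ARN, and external ID in the usage lines are placeholders, and the exact statements Gradient produces may differ.

```python
import json
from typing import Optional


def collection_policy(bucket: Optional[str] = None, prefix: str = "") -> dict:
    """Build an IAM policy: EC2 describe calls, plus optional S3 read access to cluster logs."""
    statements = [{
        "Effect": "Allow",
        "Action": ["ec2:DescribeInstances", "ec2:DescribeVolumes"],
        "Resource": "*",
    }]
    if bucket:  # S3 access is optional, e.g. when cluster logs go to DBFS instead
        clean = prefix.strip("/")
        key_pattern = clean + "/*" if clean else "*"
        statements.append({
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::" + bucket,
            "Condition": {"StringLike": {"s3:prefix": [key_pattern]}},
        })
        statements.append({
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::" + bucket + "/" + key_pattern,
        })
    return {"Version": "2012-10-17", "Statement": statements}


def trust_relationship(trusted_role_arn: str, external_id: str) -> dict:
    """Build a trust policy letting one role assume this role with an ExternalId."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": trusted_role_arn},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(json.dumps(collection_policy("example-log-bucket", "cluster-logs/"), indent=2))
    print(json.dumps(trust_relationship(
        "arn:aws:iam::111122223333:role/example-sync-role", "example-external-id"), indent=2))
```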

Stay up to date with the latest feature releases and updates at Sync by visiting our Product Updates documentation.

Ready to start getting the most out of your Databricks job clusters? Request a demo or reach out to us at info@synccomputing.com.

Are Databricks clusters with Photon and Graviton instances worth it?

Configuring Databricks clusters can seem more like art than science.  We’ve reported in the past about ways to optimize worker and driver nodes, and how the proper selection of instances impacts a job’s cost and performance.  We’ve also discussed how autoscaling performs, and how it’s not always the most efficient choice for static jobs.  

In this blog post, we look across a few other popular questions and options we see from folks:

  1. How do Graviton instances impact cost and performance?
  2. How does the price and performance of Photon compare to standard instances?

What are Graviton instances?

Graviton instances on AWS contain custom AWS built processors, which promise to be a “major leap” in performance. Specifically for Spark, AWS published a report that claimed Graviton can help reduce costs up to 30% and speed up performance up to 15% for Apache Spark on EMR.   Although Databricks clusters can use Graviton, there haven’t been any performance metrics reported (that we know of).   There’s no extra surcharge for Graviton instances, and they are typically moderately priced compared to other instances.

What is Photon in Databricks?

Photon is a vectorized query engine written in C++, developed by the creators of Apache Spark and available within the Databricks platform.  Photon is an amazing technical feat with a multitude of features and considerations that extend well beyond the scope of this blog.  For full details, we encourage readers to check out the original Photon academic paper here.  Unfortunately, Photon is not free and is typically a 2x cost increase for DBUs compared to non-Photon.  So users have to decide if the cost increase is “worth it.”

At the highest level for most end users, as cited by the original academic paper:

  • Photon is great for CPU heavy operations such as joins, aggregations, and SQL expression evaluations.  
  • The academic paper claims about a 3x speedup on the TPC-H benchmark compared to standard Databricks runtime
  • Photon is not expected to provide a speedup to workloads that are I/O or network bound.

Yes, you can even run Photon on Graviton instances!  What happens with this powerful combo?  The data below shows the results.

How do I use Graviton and/or Photon?

Graviton instances typically have the “g” letter in the instance names, such as “m6g.xlarge” or “c7g.xlarge” and are selected during the cluster creation step within Databricks under “Worker type” and “Driver type”.

Photon is enabled by simply checking the box “Use Photon Acceleration” in the cluster creation step.  An image of the UI is shown below.

Experimental setup

In our analysis we utilize the TPC-DS 1TB benchmark, with all queries run sequentially.  We then look at the total runtime of all queries summed together.  To keep things simple and fair, every cluster has identical driver and worker instances.  We sampled 28 different instances spanning Photon-enabled, Graviton, memory, compute, I/O, network, and storage optimized instances.  A full list of the parameters of each cluster is below:

  1. Driver:  [instance].xlarge
  2. Worker:  [instance].xlarge
  3. Number of workers: 10
  4. EBS volume: 64
  5. Databricks runtime version:  11.3.x-scala2.12
  6. Market:  On-demand
  7. Cloud provider: AWS
  8. Instances:  28 different instances on AWS
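For readers who want to reproduce a similar setup, the parameters above map roughly onto a cluster specification. The Python sketch below builds such a dictionary; the field names follow our understanding of the Databricks Clusters API and should be checked against the current API reference, the function name is hypothetical, and the EBS volume count and the reading of "64" as a size in GB are assumptions, since the list above does not state them.

```python
def benchmark_cluster_spec(instance_family: str, photon: bool) -> dict:
    """Build a cluster spec with identical xlarge driver and worker nodes."""
    node_type = instance_family + ".xlarge"
    return {
        "spark_version": "11.3.x-scala2.12",
        "driver_node_type_id": node_type,
        "node_type_id": node_type,
        "num_workers": 10,
        "runtime_engine": "PHOTON" if photon else "STANDARD",
        "aws_attributes": {
            "availability": "ON_DEMAND",
            "ebs_volume_count": 1,   # assumption: not stated in the setup list
            "ebs_volume_size": 64,   # assumption: the listed 64 is taken as GB
        },
    }


if __name__ == "__main__":
    # "m6g" is a Graviton family mentioned earlier in this post.
    print(benchmark_cluster_spec("m6g", photon=True))
```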

For the cost, we utilize only the DBU cost of each cluster.  We did not include the AWS costs for various reasons:

  • Cloud cost attribution difficulty:  Databricks internally re-uses clusters of adjacent jobs.  Meaning, AWS clusters for one job may be reused for a second job, if they require the same machine.  This makes it difficult to identify which job was using which cluster in AWS.  This is a niche problem, and only for people who want to determine the true cost of a single job.
  • AWS costs depend on the market:  The AWS costs, or cloud costs in general, depend on the market.  Specifically, if users are using on-demand vs. spot nodes, it will drastically change the relative cost performance.  Furthermore, spot prices can fluctuate daily, so extracting fair comparisons would be difficult here.
  • AWS costs depend on contracts:  Large companies negotiate their own costs for their instances, thus again, making an overall apples to apples comparison difficult.

For the reasons above, the DBU costs are utilized because they are exact, easy to identify, and do not fluctuate depending on the market.  However, we will say that DBU costs can also depend on contracts.  But for the sake of this study, we’ll just use the list prices of DBUs.  As you can tell by these thoughts, doing actual cost comparisons is not a trivial task, and is highly dependent on each company’s use case.


The graph below shows the cost vs runtime plots of all 28 different clusters.  They are grouped into 4 sections: “Graviton” instances, “Photon” enabled instances, “Standard” instances (no Photon, no Graviton), and “Graviton + Photon” instances.  Points that are closer to the bottom left hand corner of the graph are both “faster and cheaper.”

In the graph below, we can see two clear “clusters”, basically with and without Photon.  It’s clear from this data that Photon is legitimately faster.  Unfortunately, it doesn’t appear any cheaper, so if your goal is to save money these results are a bit of a downer.  If you’re trying to run faster, Photon may be exactly what you’re looking for.

The two bar graphs below contain the same data as the XY plot above, but they break out the data into runtime and DBU costs separately.  Also, we present the individual instances used, in case people would like a more granular view into the data.

After perusing through the data, our main observations are outlined below.  I’d like to heavily caution that these observations are purely from the experiment we ran above.  We urge people to exercise caution when trying to generalize these results, as individual jobs can have wildly different results than the ones we showed above.  With that said, these are the main takeaways:

  • Photon is generally 2x faster – Across the board Photon was about 2x faster than their non-photon counterparts (same instances).  This was great to see.  Although not as high as some of the claims reported by Databricks, we understand that it is highly dependent on the workload.  In my opinion a 2x speedup is pretty impressive.
  • Graviton was neutral  – The runtime for graviton was perhaps a bit faster than standard instances, but it’s unclear if it’s statistically significant.  There doesn’t seem much risk to using Graviton, and they are newer chips so maybe they will be faster for your jobs? 
  • Photon’s total cost is cheaper (with this data) – In the data above, since the DBU costs were about the same across all 3 types, and Photon’s runtimes were about 2x faster, one can logically conclude that the cloud portion of the costs (the AWS fees) will be less with Photon.  As a result, the total cost for an end user was cheapest with Photon enabled.
  • Photon pricing makes for complex cost ROI –  Because of the previous point, determining the ROI of Photon is difficult.  It basically boils down to whether the speedup is large enough to offset the increased cost.  If it is not, then users are essentially paying more money for a potentially faster job.  If the Photon speedup is large enough, then it will be cheaper.  What that threshold is will depend on the market and any discounts.  For the sake of this study, the crossover point for on-demand instances was around 20%.  Meaning, Photon needs to be at least 20% faster than Standard to observe any cost savings.

Formula for determining Photon ROI

For those that are mathematically inclined, here is a simple formula to help determine the “speedup threshold” which is the minimum speedup Photon needs to achieve for your job in order to break even.  If your speedup is greater than this threshold, then you are saving money.

For a simple example, let’s say all of the machine and DBU costs are 1, and the Photon cost increase is a factor of 2, and we have 10 workers.  With these very simple numbers, we get a Psth value of 1.5.  Plugging in 1.5 for Psth and setting R_orig =1 and solving for R_photon, that means Photon needs to be 33% faster to break even.  Clearly this value is heavily dependent on a lot of factors, all of which are shown in the equation above.
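The arithmetic in this example can be reproduced with a few lines of Python. The sketch below is a reconstruction that is consistent with the numbers quoted here, not necessarily the exact formula from the original post: it assumes one driver plus identical workers, per-node machine and DBU costs charged per unit of time, and a Photon multiplier applied to the DBU cost only. The function and parameter names are our own.

```python
def photon_speedup_threshold(machine_cost: float, dbu_cost: float,
                             photon_dbu_factor: float, num_workers: int) -> float:
    """Minimum speedup (R_orig / R_photon) at which Photon breaks even on total cost."""
    nodes = num_workers + 1  # workers plus one driver of the same instance type
    cost_rate_standard = nodes * (machine_cost + dbu_cost)
    cost_rate_photon = nodes * (machine_cost + photon_dbu_factor * dbu_cost)
    return cost_rate_photon / cost_rate_standard


def breakeven_runtime_reduction(threshold: float) -> float:
    """Fraction by which runtime must shrink to reach the given speedup threshold."""
    return 1.0 - 1.0 / threshold


if __name__ == "__main__":
    threshold = photon_speedup_threshold(1.0, 1.0, 2.0, 10)
    print("speedup threshold:", threshold)
    print("runtime reduction needed: {:.0%}".format(breakeven_runtime_reduction(threshold)))
```

Under these assumptions the node count cancels out when the driver and workers are identical, so the threshold depends only on the ratio of machine cost to DBU cost and on the Photon multiplier. When the machine cost dominates, the threshold approaches 1; when the DBU cost dominates, it approaches the Photon multiplier.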


Overall the answers to the original two questions really come down to “it depends.”  The data points we showed above are an infinitesimally small slice of what workloads actually look like.  Based on simply the data above, here are the answers:

1)  Photon will probably be faster than non-photon, but whether or not it’s cheaper will depend on how much faster it is relative to the costs.  To understand if the 2x DBU cost increase with Photon is worth it, it all depends on the markets and pricing of your cloud instances.

2)  On average Graviton was about the same for cost and runtime compared to standard instances.  We did not see any significant advantage of using Graviton here, but we didn’t see any downside either.  Maybe these new chips will be perfect for your workload, or maybe not.  It’s hard to tell.

However, with the data above, specifically around Photon, I can’t help but ask the question:

Is Databricks motivated to make Spark run faster? 

This is an interesting philosophical question where the tech enthusiast may clash with the business units.  The faster Databricks makes Spark, the less revenue they get, since they charge per minute.  Photon is an interesting case study in which, yes, they made Spark 2x faster – but then had to double their costs to not lose money.  This is at least one data point that shows you where Databricks basically sits: “Yes we can make Spark faster, but not cheaper.”

In my opinion, Databricks, and other cloud providers, are fundamentally motivated to increase revenue.  So making Spark run faster and/or cheaper is not in alignment with what they need to do as a business.  They will however make the product easier to use, or expand to other use cases which, fundamentally, increases revenue.
We of course respect the fact that any business needs to make money, so I don’t think anything improper is happening here.  But it does reveal an interesting conflict between technology and business and how that fundamentally impacts the end user.

How to Use the Gradient CLI Tool to Optimize Databricks / EMR Programmatically


The Gradient Command Line Interface (CLI) is a powerful yet easy-to-use utility to automate the optimization of your Spark jobs from your terminal, command prompt, or automation scripts.

Whether you are a Data Engineer, SysDevOps administrator, or just an Apache Spark enthusiast, knowing how to use the Gradient CLI can be incredibly beneficial as it can dramatically reduce the cost of your Spark workloads while helping you hit your pipeline SLAs.

If you are new to Gradient, you can learn more about it in the Sync Docs. In this tutorial, we’ll walk you through the Gradient CLI’s installation process and give you some examples of how to get started. This is meant to be a tour of the CLI’s overall capabilities. For an end to end recipe on how to integrate with Gradient take a look at our Quick Start and Integration Guides.

Pre Work

This tutorial assumes that you have already created a Gradient account and generated your Sync API keys. If you haven’t generated your key yet, you can do so on the Accounts tab of the Gradient UI.

Step 1: Setting up your Environment

Let’s start by making sure our environment meets all the prerequisites. The Gradient CLI is actually part of the Sync Library, which requires Python v3.7 or above and which only runs on Linux/Unix based systems.

python --version
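If you would rather check both prerequisites from a script, a small Python sketch such as the following could do it. The function name is hypothetical, and treating every non-Windows system as "Linux/Unix based" is a simplifying assumption.

```python
import platform
import sys


def meets_prerequisites(version_info=None, system=None) -> bool:
    """Check for Python 3.7 or above on a non-Windows (Linux/Unix based) system."""
    version_info = sys.version_info if version_info is None else version_info
    system = platform.system() if system is None else system
    python_ok = tuple(version_info[:2]) >= (3, 7)
    os_ok = system != "Windows"  # assumption: anything else counts as Linux/Unix based
    return python_ok and os_ok


if __name__ == "__main__":
    print("Ready to install the Sync Library:", meets_prerequisites())
```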

I am on a Mac and running Python version 3.10, so I am good to go, but before we get started I am going to create a Python virtual environment with venv. This is a good practice whenever you install any new Python tool, as it allows you to avoid conflicts between projects and makes environment management simpler. For this example, I am creating a virtual environment called gradient-cli that will reside under the ~/VirtualEnvironments path.

python -m venv ~/VirtualEnvironments/gradient-cli

Step 2: Install the Sync Library

Once you’ve confirmed that your system meets the prerequisites, it’s time to install the Sync Library. Start by activating your new virtual environment.

source ~/VirtualEnvironments/gradient-cli/bin/activate

Next use the pip package installer to install the latest version of the Sync Library.

pip install https://github.com/synccomputingcode/syncsparkpy/archive/latest.tar.gz

You can confirm that the installation was successful by viewing the CLI executable’s version by using the --version or --help options.

sync-cli --help

Step 3: Configure the Sync Library

Configuring the CLI with your credentials and preferences is the final step for the installation and setup for the Sync CLI. To do this, run the configure command:

sync-cli configure

You will be prompted for the following values:

Sync API key ID:

Sync API key secret:

Default prediction preference (performance, balanced, economy) [economy]:

Would you like to configure a Databricks workspace? [y/n]:

Databricks host (prefix with https://):

Databricks token:

Databricks AWS region name:

If you remember from the Pre Work, your Sync API key & secret are found on the Accounts tab of the Gradient UI. For this tutorial we are running on Databricks, so you will need to provide a Databricks Workspace and an Access token.

Databricks recommends that you set up a service principal for automation tasks. As noted in their docs, service principals give automated tools and scripts API-only access to Databricks resources, providing greater security than using users or groups.

These values are stored in ~/.sync/config.

Congrats! You are now ready to interact with Gradient from your terminal, command prompt, or automation scripts.

Step 4: Example Uses

Below are some tasks you can complete using the CLI. This is useful when you want to automate Gradient processes and incorporate them into larger workflows.


All Gradient recommendations are stored in Projects. Projects are associated with a single Spark job or a group of jobs running on the same cluster. Here are some useful commands you can use to manage your projects with the CLI. For an exhaustive list of commands, use the --help option.

Project Commands:

create – Create a project

sync-cli projects create --description [TEXT] --job-id [Databricks Job ID] PROJECT_NAME

delete – Delete a project

sync-cli projects delete PROJECT_ID

get – Get info on a project

sync-cli projects get PROJECT_ID

list – List all projects for account

sync-cli projects list
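Since the point of the CLI is automation, here is a rough sketch of how you might wrap the create command in Python. The flags mirror the command shown above; the helper names (build_create_project_cmd, run_cli) and the example project values are made up for illustration, and we make no assumption about what the CLI prints.

```python
import subprocess
from typing import List, Optional


def build_create_project_cmd(
    name: str,
    description: Optional[str] = None,
    job_id: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `sync-cli projects create`."""
    cmd = ["sync-cli", "projects", "create"]
    if description is not None:
        cmd += ["--description", description]
    if job_id is not None:
        cmd += ["--job-id", job_id]
    cmd.append(name)  # PROJECT_NAME is the positional argument
    return cmd


def run_cli(cmd: List[str]) -> str:
    """Run a CLI command and return its stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return completed.stdout


if __name__ == "__main__":
    cmd = build_create_project_cmd(
        "nightly-etl", description="Nightly ETL job", job_id="12345"
    )
    print(" ".join(cmd))
    # print(run_cli(cmd))  # uncomment once sync-cli is installed and configured
```

Passing the command as a list (rather than a single shell string) avoids quoting problems when a description contains spaces.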


You can also use the CLI to manage, generate and retrieve predictions. This is useful when you want to automate the implementation of recommendations within your Databricks or EMR environments.

Prediction commands:

get – Retrieve a specific prediction

sync-cli predictions get --preference [performance|balanced|economy] PREDICTION_ID

list – List all predictions for account or project

sync-cli predictions list --platform [aws-emr|aws-databricks] --project TEXT

status – Get the status of a previously initiated prediction

sync-cli predictions status PREDICTION_ID

The CLI also provides platform specific commands to generate and retrieve predictions.


For Databricks you can generate a recommendation for a previously completed job run with the following command:

sync-cli aws-databricks create-prediction --plan [Standard|Premium|Enterprise] --compute ['Jobs Compute'|'All Purpose Compute'] --project [Your Project ID] RUN_ID

If the run you provided was not already configured with the Gradient agent when it executed, you can still generate a recommendation, but the basis metrics may be missing some time-sensitive information that is no longer available. To enable evaluation of prior logs from runs executed without the Gradient agent, add the --allow-incomplete-cluster-report option. To avoid this issue altogether, you can implement the agent and re-run the job.

Alternatively, you can use the following command to run the job and request a recommendation with a single command:

sync-cli aws-databricks run-job --plan [Standard|Premium|Enterprise] --compute ['Jobs Compute'|'All Purpose Compute'] --project [Your Project ID] JOB_ID

This method is useful in cases when you are able to manually run your job without interfering with scheduled runs.

Finally, to implement a recommendation and run the job with the new configuration, you can issue the following command:

sync-cli aws-databricks run-prediction --preference [performance|balanced|economy] JOB_ID PREDICTION_ID
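If you script the create-prediction step, it can help to validate the options before shelling out. The sketch below builds the command from the options listed above; the function name and the example IDs are hypothetical, and the allowed values are simply copied from the usage strings in this post.

```python
from typing import List, Optional

# Allowed values as shown in the command usage above.
PLANS = ("Standard", "Premium", "Enterprise")
COMPUTE_TYPES = ("Jobs Compute", "All Purpose Compute")


def build_create_prediction_cmd(
    run_id: str,
    plan: str,
    compute: str,
    project_id: Optional[str] = None,
    allow_incomplete: bool = False,
) -> List[str]:
    """Build the argument list for `sync-cli aws-databricks create-prediction`."""
    if plan not in PLANS:
        raise ValueError(f"plan must be one of {PLANS}, got {plan!r}")
    if compute not in COMPUTE_TYPES:
        raise ValueError(f"compute must be one of {COMPUTE_TYPES}, got {compute!r}")

    cmd = ["sync-cli", "aws-databricks", "create-prediction",
           "--plan", plan, "--compute", compute]
    if project_id is not None:
        cmd += ["--project", project_id]
    if allow_incomplete:
        # For runs that executed without the Gradient agent.
        cmd.append("--allow-incomplete-cluster-report")
    cmd.append(run_id)  # RUN_ID is the positional argument
    return cmd


if __name__ == "__main__":
    print(" ".join(build_create_prediction_cmd(
        "987654", "Premium", "Jobs Compute", project_id="my-project-id"
    )))
```

The resulting list can be handed to subprocess.run in the same way as any other command.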


Similarly, for Spark EMR, you can generate a recommendation for a previously completed job. EMR does not have the same issue with regard to ephemeral cost data not being available, so you can request a recommendation on a previous run without the Gradient agent. Use the following command to do so:

sync-cli aws-emr create-prediction --region [Your AWS Region] CLUSTER_ID

If you want to manually rerun the EMR job and immediately request a Gradient recommendation, use the following command:

sync-cli aws-emr record-run --region [Your AWS Region] CLUSTER_ID PROJECT

To execute the EMR job using the recommended configuration, use the following command:

sync-cli aws-emr run-prediction --region [Your AWS Region] PREDICTION_ID


Gradient is constantly working on adding support for new data engineering platforms. To see which platforms are supported by your version of the CLI, you can use the following command:

sync-cli products


Should you ever need to update your CLI configuration, you can call configure again to change one or more of your values.

sync-cli configure --api-key-id TEXT --api-key-secret TEXT --prediction-preference TEXT --databricks-host TEXT --databricks-token TEXT --databricks-region TEXT


The token command returns an access token that you can use against our REST API with clients like Postman.

sync-cli token


With these simple commands, you can automate the end-to-end optimization of all your Databricks or EMR workloads, dramatically reducing your costs and improving performance. For more information, refer to our developer docs or reach out to us at info@synccomputing.com.

Integrating Gradient into Apache Airflow


In this blog post, we’ll explore how you can integrate Sync’s Gradient with Airflow. We’ll walk through the steps to create a DAG that will submit a run to Databricks, and then make a call through Sync’s library to generate a recommendation for an optimized cluster for that task. This DAG example can be used to automate the process of requesting recommendations for tasks that are submitted as jobs to Databricks.

A Common Use Case And Its Challenges

Use Case:

A common implementation of Databricks within Airflow consists of using the DatabricksSubmitRunOperator to submit a pre-configured notebook to Databricks.


Challenges:

  • Due to orchestration outside of Databricks’ ecosystem, these jobs are reflected as one-time runs.
  • It’s difficult to track cluster performance across multiple runs.
  • This is exacerbated by the fact that a DAG can have multiple tasks that submit these one-off ‘jobs’ to Databricks.

How Can We Fix This?

We’ll set up a PythonOperator that uses Sync’s Library to generate recommendations we can view in Gradient’s UI. From there we can see which changes to make to our cluster definitions to reduce cost. Let’s dive in!

Preparing Your Airflow Environment


  • Airflow (this tutorial uses 2.0+)
  • Python 3.7+
  • Sync Library installed and environment variables configured on the Airflow instance (details below)
  • An S3 path you would like to use for cluster logs. Your Databricks ARN will need access to this path so it can save the cluster logs there.
  • An account with Sync and a Project created to track the task you would like to optimize.

Sync Account Setup And Library Installation

Quick start instructions on how to create an account, create a project, and install the Sync Library can be found here. Please configure the CLI on your Airflow instance. When going through the configuration steps, be sure to choose yes when prompted to configure the Databricks variables.

Note: In the quickstart above, there are instructions on using an init script. Copy the contents of the init script into a file on a shared or personal workspace accessible by the account the Databricks job will run as.


Certain variables are generated and stored during installation of the sync library. For transparency, they are:

Besides the variables generated by the library, you’ll need the following environment variables. These are necessary to use the AWS API to retrieve cluster logs when requesting a prediction. DBFS is supported; however, it is not recommended, as it goes against Databricks’ best practices. As mentioned in the quick start, it’s best to set these via the AWS CLI.
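Before triggering a DAG, a quick sanity check on the Airflow host can save a failed run. The sketch below looks for the three AWS variables that the cluster configuration later in this post reads from os.environ; the helper name missing_env_vars is ours, and the check only tests for presence, not validity.

```python
import os
from typing import List, Mapping, Optional, Sequence

# AWS variables read from os.environ by the cluster configuration in this post.
REQUIRED_AWS_VARS = (
    "AWS_DEFAULT_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def missing_env_vars(
    required: Sequence[str] = REQUIRED_AWS_VARS,
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required AWS variables are set.")
```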


Cluster Configuration

Referring back to our common use case, a static cluster configuration is often either defined within the DAG or built dynamically within a helper function that returns the cluster dictionary to be passed into the DatabricksSubmitRunOperator. In preparation for the first run, some specific cluster details need to be configured.

What are we adding?

  • cluster_log_conf: an S3 path to send our cluster logs to. These will be used to generate an optimized recommendation.
  • custom_tags: the sync:project-id tag is added so we can assign the run to a Sync project.
  • init_scripts: identifies the init script path that we copied into our Databricks workspace during the quick start setup.
  • spark_env_vars: environment variables passed to the cluster that the init script will use. Note: the retrieval of tokens/keys in this tutorial is simplified to use the information configured during the sync-cli setup process. Passing them in this manner will result in tokens being visible in plaintext when viewing the cluster in Databricks. Please use Databricks Secrets when productionizing this code.

The rest of the cluster configuration dictionary comprises the typical settings you normally pass into the DatabricksSubmitRunOperator.

import os

from sync.config import DatabricksConf as sync_databricks_conf
from sync.config import get_api_key

cluster_config = {
    "spark_version": "13.0.x-scala2.12",
    # ... your usual settings (node type, worker count, etc.) go here
    "cluster_log_conf": {
        "s3": {
            "destination": "",  # Add the s3 path for the cluster logs
            "enable_encryption": True,
            "region": "",  # Add your aws region ie: us-east-1
            "canned_acl": "bucket-owner-full-control",
        }
    },
    "custom_tags": {"sync:project-id": ""},  # Add the project id from Gradient
    "init_scripts": [
        # Path to the init script in the workspace ie: Shared/init_scripts/init.sh
        {"workspace": {"destination": ""}}
    ],
    "spark_env_vars": {
        "DATABRICKS_HOST": f"{sync_databricks_conf().host}",
        "DATABRICKS_TOKEN": f"{sync_databricks_conf().token}",
        "SYNC_API_KEY_ID": f"{get_api_key().id}",
        "SYNC_API_KEY_SECRET": f"{get_api_key().secret}",
        "AWS_DEFAULT_REGION": f"{os.environ['AWS_DEFAULT_REGION']}",
        "AWS_ACCESS_KEY_ID": f"{os.environ['AWS_ACCESS_KEY_ID']}",
        "AWS_SECRET_ACCESS_KEY": f"{os.environ['AWS_SECRET_ACCESS_KEY']}",
    },
}

Reminder: the Databricks ARN attached to the cluster will need access to the s3 path specified in the cluster_log_conf.
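For production, the plaintext tokens in spark_env_vars can be swapped for Databricks secret references of the form {{secrets/<scope>/<key>}}. Here is a rough sketch of a helper that does this. The scope name, the lowercase key names, and the helper names are all hypothetical; you would first need to create the scope and store each value in it. One thing to watch for: as far as we understand, the Databricks operator renders its payload through Jinja, which also uses double braces, so the sketch wraps each reference in a raw block by default. Please verify this against your Airflow version.

```python
from typing import Dict

# Variables from the cluster config above that should not appear in plaintext.
SENSITIVE_VARS = (
    "DATABRICKS_TOKEN",
    "SYNC_API_KEY_ID",
    "SYNC_API_KEY_SECRET",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def secret_ref(scope: str, key: str, escape_for_jinja: bool = True) -> str:
    """Return a Databricks secret reference for use in a cluster env var."""
    ref = "{{secrets/" + scope + "/" + key + "}}"
    if escape_for_jinja:
        # Keep Airflow's Jinja templating from interpreting the braces.
        ref = "{% raw %}" + ref + "{% endraw %}"
    return ref


def build_spark_env_vars(
    plain_vars: Dict[str, str],
    scope: str,
    escape_for_jinja: bool = True,
) -> Dict[str, str]:
    """Combine non-sensitive values with secret references for sensitive ones."""
    env = dict(plain_vars)
    for name in SENSITIVE_VARS:
        # Hypothetical convention: the secret key is the lowercase variable name.
        env[name] = secret_ref(scope, name.lower(), escape_for_jinja)
    return env


if __name__ == "__main__":
    env = build_spark_env_vars(
        {"DATABRICKS_HOST": "https://example.cloud.databricks.com",
         "AWS_DEFAULT_REGION": "us-east-1"},
        scope="gradient",  # hypothetical secret scope
    )
    for name, value in env.items():
        print(f"{name}={value}")
```

The returned dictionary can be used as the spark_env_vars value in the cluster configuration, so the actual tokens never appear in the DAG file or in the cluster details page.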

Databricks Submit Run Operator Changes

Next, we’ll ensure the Databricks operator passes the run_id of the created job back to XCom. This is needed in the subsequent task to request a prediction for the run. Just set the do_xcom_push parameter to True.

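# DAG code
    # Submit the Databricks run
    run_operator = DatabricksSubmitRunOperator(
        # ... your existing arguments (task_id, cluster, notebook task) go here
        do_xcom_push=True,  # push the run_id to XCom for the next task
    )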

Create A Recommendation You Can View In Gradient!

Upon successful completion of the DatabricksSubmitRunOperator task, we’ll have the run_id we need to create a recommendation for optimal cluster configuration. We’ll utilize the PythonOperator to call the create_prediction_for_run method from the Sync Library. Within the library, this method will connect to the Databricks instance to gather the cluster log location, fetch the logs, and generate the recommendation.

Below is an example of how to call the create_prediction_for_run method from the Sync Library. 

from sync.awsdatabricks import create_prediction_for_run

def submit_run_for_recommendation(task_to_submit: str, **kwargs):
    run_id = kwargs["ti"].xcom_pull(task_ids=task_to_submit, key="run_id")
    project_id = "Project_Id_Goes_Here"
    create_prediction_for_run(
        run_id=run_id,
        plan_type="Premium",  # Standard, Premium, or Enterprise
        project_id=project_id,
        compute_type="Jobs Compute",
    )

What this code block does:

  • wraps and implements create_prediction_for_run
  • pulls the run_id for the previous task from xcom. We supply the task_to_submit as the task_id that we named the DatabricksSubmitRunOperator.
  • We assign the project id for that task to the project_id variable.
  • We pass our project id, supplied on the project details page in Gradient, to the Sync library method.

Optionally, add a project_id parameter to submit_run_for_recommendation if you’d like to supply the project ID from the PythonOperator instead of hardcoding it. Edit plan_type and compute_type as needed; these reference your Databricks settings.
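One way to sketch that optional parameterization is a small factory that returns the callable, so the project ID arrives through op_kwargs and the plan and compute settings are set once. This is only an illustration: make_recommendation_callable is our own name, and the keyword names passed to the prediction function (run_id, plan_type, project_id, compute_type) are our assumption about the library call. In a real DAG you would pass create_prediction_for_run as the create_prediction argument.

```python
from typing import Any, Callable


def make_recommendation_callable(
    create_prediction: Callable[..., Any],
    plan_type: str = "Premium",
    compute_type: str = "Jobs Compute",
) -> Callable[..., Any]:
    """Return a python_callable that takes the project ID as a parameter."""

    def submit_run_for_recommendation(task_to_submit: str, project_id: str, **kwargs):
        # Pull the run_id that the Databricks operator pushed to XCom.
        run_id = kwargs["ti"].xcom_pull(task_ids=task_to_submit, key="run_id")
        return create_prediction(
            run_id=run_id,
            plan_type=plan_type,
            project_id=project_id,
            compute_type=compute_type,
        )

    return submit_run_for_recommendation


if __name__ == "__main__":
    # Stand-ins so the sketch runs without Airflow or the Sync Library.
    class FakeTaskInstance:
        def xcom_pull(self, task_ids, key):
            return "12345"

    def fake_create_prediction(**kwargs):
        return kwargs

    submit = make_recommendation_callable(fake_create_prediction)
    print(submit(task_to_submit="dbx_operator", project_id="my-project-id",
                 ti=FakeTaskInstance()))
```

With this shape, op_kwargs on the PythonOperator would carry both task_to_submit and project_id, which makes it easier to reuse the same function for several Databricks tasks in one DAG.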

To call the submit_run_for_recommendation method we defined, implement the python operator as follows:

    submit_for_recommendation = PythonOperator(
        task_id="submit_for_recommendation",
        python_callable=submit_run_for_recommendation,
        op_kwargs={"task_to_submit": "Task_id of the DatabricksSubmitRunOperator to generate a recommendation for"})

Putting It All Together

Let’s combine all of the above together in a DAG. The DAG will submit a run to Databricks, and then make a call through Sync’s library to generate a prediction for an optimized cluster for that task.

# DAG .py code
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from sync.awsdatabricks import create_prediction_for_run
from sync.config import DatabricksConf as sync_databricks_conf
from sync.config import get_api_key

with DAG(
    # placeholder DAG arguments; replace with your own
    dag_id="gradient_databricks_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:

    # define the cluster configuration
    cluster_config = {
        "spark_version": "13.0.x-scala2.12",
        # ... your usual settings (node type, worker count, etc.) go here
        "cluster_log_conf": {
            "s3": {
                "destination": "",  # Add the s3 path for the cluster logs
                "enable_encryption": True,
                "region": "",  # Add your aws region ie: us-east-1
                "canned_acl": "bucket-owner-full-control",
            }
        },
        "custom_tags": {"sync:project-id": ""},  # Add the project id from Gradient
        "init_scripts": [
            # Path to the init script in the workspace ie: Shared/init_scripts/init.sh
            {"workspace": {"destination": ""}}
        ],
        "spark_env_vars": {
            "DATABRICKS_HOST": "",  # f"{sync_databricks_conf().host}"
            "DATABRICKS_TOKEN": "",  # f"{sync_databricks_conf().token}"
            "SYNC_API_KEY_ID": "",  # f"{get_api_key().id}"
            "SYNC_API_KEY_SECRET": "",  # f"{get_api_key().secret}"
            "AWS_DEFAULT_REGION": "",  # f"{os.environ['AWS_DEFAULT_REGION']}"
            "AWS_ACCESS_KEY_ID": "",  # f"{os.environ['AWS_ACCESS_KEY_ID']}"
            "AWS_SECRET_ACCESS_KEY": "",  # f"{os.environ['AWS_SECRET_ACCESS_KEY']}"
        },
    }

    # define your databricks operator
    dbx_operator = DatabricksSubmitRunOperator(
        task_id="dbx_operator",
        new_cluster=cluster_config,
        notebook_task={"notebook_path": ""},  # Add the path to your notebook
        do_xcom_push=True,  # push the run_id to XCom for the next task
    )

    # define the submit function to pass to the PythonOperator
    def submit_run_for_recommendation(task_to_submit: str, **kwargs):
        run_id = kwargs["ti"].xcom_pull(task_ids=task_to_submit, key="run_id")
        project_id = "Project_Id_Goes_Here"
        create_prediction_for_run(
            run_id=run_id,
            plan_type="Premium",  # Standard, Premium, or Enterprise
            project_id=project_id,
            compute_type="Jobs Compute",
        )

    # define the python operator
    submit_for_recommendation = PythonOperator(
        task_id="submit_for_recommendation",
        python_callable=submit_run_for_recommendation,
        op_kwargs={
            "task_to_submit": "dbx_operator",
        },
    )

    # define dag dependency
    dbx_operator >> submit_for_recommendation

Viewing Your Recommendation

Once the code above is implemented into your DAG, head over to the Projects dashboard in Gradient. There you’ll be able to easily review recommendations and can make changes to the cluster configuration as needed.

Introducing: Gradient for Databricks

Wow, the day is finally here! It’s been a long journey, but we’re so excited to announce our newest product: Gradient for Databricks.

Check out our promo video here!

The quick pitch

Gradient is a new tool to help data engineers know when and how to optimize and lower their Databricks costs – without sacrificing performance.

For the math geeks out there, the name Gradient comes from the mathematical operator from vector calculus that is commonly used in optimization algorithms (e.g. gradient descent).

Over the past 18 months of development we’ve worked with data engineers around the world to understand their frustrations when trying to optimize their Databricks jobs. Some of the top pains we heard were:

  • “I have no idea how to tune Apache Spark”
  • “Tuning is annoying, I’d rather focus on development”
  • “There are too many jobs at my company. Manual tuning does not scale”
  • “But our Databricks costs are through the roof and I need help”

How did companies get here?

We’ve worked with companies around the world who absolutely love using Databricks. So how did so many companies (and their engineers) get to this efficiency pain point? At a high level, the story typically goes like this:

  • “The Honeymoon” phase: We are starting to use Databricks and the engineers love it
  • “The YOLO” phase: We need to build faster, let’s scale up ASAP. Don’t worry about efficiency.
  • “The Tantrum” phase: Uh oh. Everything on Databricks is exploding, especially our costs! Help!

So what did we do?

We wanted to attack the “Tantrum” problem head on. Internally, our data science, engineering, and product teams worked hand in hand with early design partners using our Spark Autotuner to figure out how to deliver a deeply technical solution that was also easy and intuitive. We used all the feedback on the biggest problems to build Gradient:

User Problem | What Gradient Does
--- | ---
I don’t know when, why, or how to optimize my jobs | Gradient continuously monitors your clusters to notify you of when a new optimization is detected, estimate the cost/performance impact, and output a JSON configuration file to easily make the change.
I use Airflow or Databricks Workflows to orchestrate our jobs, everything I use must easily integrate. | Our new python libraries and quick-start tutorials for Airflow and Databricks Workflows make it easy to integrate Gradient into your favorite orchestrators.
I just want to state my runtime requirements, and then still have my costs lowered | Simply set your ideal max runtime and we’ll configure the cluster to hit your goals at the lowest possible cost.
My company wants us to use Autoscaling for our jobs clusters | Whether you use auto-scaled or fixed clusters, Gradient supports optimizing both (or even switching from one to the other).
I have hundreds of Databricks jobs. I need batch importing for optimizing to work | Provide your Databricks token, and we’ll do all the heavy lifting of automatically fetching all of your qualified jobs and importing them into Gradient.

We want to hear from you!

Our early customers made Gradient what it is today, and we want to make sure it’s meeting companies’ needs. We made getting started super easy (you can just log in to the app here). Feel free to browse the docs here. Please let us know how we did via Intercom (in the docs and app).