Databricks Workspace Health SQL Toolkit 

As data engineers, understanding the intricacies of your Databricks environment is essential. You can’t optimize performance, manage your budget, or allocate resources efficiently without it. Thankfully, Databricks gives you a behind-the-scenes look at how your workspace is running through its system tables. Everything from query performance to job execution and cluster activity is captured in those tables.

But raw system data can be tricky to navigate, and sometimes you just need a quick answer to that burning question. That’s why we’ve created the Databricks Workspace Health SQL Toolkit—a free set of queries you can copy, paste, and run to get instant insights into your environment. Whether you’re debugging slow queries, tracking costs, or just curious about what’s happening under the hood, these queries should help you get there faster.

If you find these queries useful, you can download our 1-click dashboard for Databricks to visualize all of them in a single shareable view. Wow the team with insights into your Jobs, SQL warehouses, APC clusters, and DLT usage.

General Usage

Let’s start with the basics. Here are a few queries to help you get a quick overview of how your organization is using Databricks and determine your next steps.  

Who submits the most queries?

Knowing who your most active users are by SQL query count sheds valuable light on workspace utilization and helps you:

  • Identify power users who might benefit from additional training or resources
  • Detect unusual activity patterns that could indicate security issues or inefficient practices
  • Better understand how your organization uses Databricks

Understanding this is essential for optimizing your Databricks environment to meet the needs and usage patterns of your team. It is also useful for cost allocation and user management.
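
If you want a head start before grabbing the full query, a minimal sketch against system.query.history (the table that records statements run on SQL warehouses) might look like this, assuming a 30-day window:

    -- Top query submitters over the last 30 days (SQL warehouse statements only)
    SELECT
      executed_by,
      COUNT(*) AS query_count
    FROM system.query.history
    WHERE start_time >= CURRENT_TIMESTAMP() - INTERVAL 30 DAYS
    GROUP BY executed_by
    ORDER BY query_count DESC
    LIMIT 20;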

Go here for the query

Who owns the most expensive clusters?

Pinpointing the users who own and run your most expensive Jobs and All-Purpose Compute clusters is the next logical step on the road to cutting costs.

These insights enable you to:

  • Identify the owners of your most expensive clusters
  • Attribute costs to users and groups
  • Detect anomalies and potential misuse

Use this query to identify the users to follow up with to understand why costs are so high and devise a plan to reduce them.  
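
As a rough starting point (the linked query is more complete), this sketch sums DBUs per cluster and owner by joining system.billing.usage to system.compute.clusters; join system.billing.list_prices if you need a dollar figure instead of DBUs:

    -- DBUs consumed per cluster and owner over the last 30 days
    WITH latest_clusters AS (
      SELECT
        workspace_id, cluster_id, cluster_name, owned_by,
        ROW_NUMBER() OVER (PARTITION BY workspace_id, cluster_id ORDER BY change_time DESC) AS rn
      FROM system.compute.clusters
    )
    SELECT
      c.owned_by,
      c.cluster_name,
      SUM(u.usage_quantity) AS dbus
    FROM system.billing.usage u
    JOIN latest_clusters c
      ON u.workspace_id = c.workspace_id
     AND u.usage_metadata.cluster_id = c.cluster_id
     AND c.rn = 1
    WHERE u.usage_date >= DATE_SUB(CURRENT_DATE(), 30)
    GROUP BY c.owned_by, c.cluster_name
    ORDER BY dbus DESC
    LIMIT 20;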

Go here for the query

How many clusters are currently active, and how were they created?

Understanding your cluster creation activity is crucial for resource management and cost optimization. 

This enables you to:

  • Ensure cluster creation practices are in line with policies
  • Understand cluster creation trends
  • Identify rogue clusters created in alternative ways

Knowing where cluster creation originates can help optimize your workflows and ensure better governance.
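
Here is a minimal sketch of the idea using the cluster_source column in system.compute.clusters; values such as UI, API, and JOB indicate how a cluster was created, though the exact set of values can vary:

    -- Active (non-deleted) clusters grouped by how they were created
    WITH latest AS (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY workspace_id, cluster_id ORDER BY change_time DESC) AS rn
      FROM system.compute.clusters
    )
    SELECT
      cluster_source,
      COUNT(*) AS active_clusters
    FROM latest
    WHERE rn = 1
      AND delete_time IS NULL
    GROUP BY cluster_source
    ORDER BY active_clusters DESC;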

Go here for the query

Which markets (Spot, On-Demand) do my clusters use?  

Market usage data helps you balance the cost and reliability of your infrastructure. While Spot instances may offer up to 90% lower costs, this is at the expense of reliability. Cloud providers can terminate nodes at any moment, with little to no notice, causing interruptions and even job failures.

This insight will enable you to:

  • Understand Spot vs. on-demand usage across the organization
  • Identify opportunities to leverage Spot
  • Ensure Spot is used efficiently, with on-demand fallback where reliability matters
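
A rough sketch of one way to get at this on AWS, assuming the cluster availability setting (for example, SPOT_WITH_FALLBACK) is surfaced in the aws_attributes column of system.compute.clusters; check your schema version, and look at azure_attributes or gcp_attributes on other clouds:

    -- Assumption: aws_attributes.availability records the market setting for AWS clusters
    WITH latest AS (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY workspace_id, cluster_id ORDER BY change_time DESC) AS rn
      FROM system.compute.clusters
    )
    SELECT
      aws_attributes.availability AS market,
      COUNT(*) AS clusters
    FROM latest
    WHERE rn = 1
      AND delete_time IS NULL
    GROUP BY aws_attributes.availability
    ORDER BY clusters DESC;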

Go here for the query

How “sticky” is my workspace?

Databricks stickiness is a critical metric for understanding user engagement. It is calculated by dividing daily active users by monthly active users and helps you:

  • Gauge the extent of adoption across the organization
  • Identify trends and changes in usage over time
  • Alert on usage anomalies and potential misuse

A high stickiness ratio means that users are consistently using the workspace, which indicates that you have successfully adopted Databricks within your organization.
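
As a sketch of the calculation, you can approximate stickiness from the audit log (system.access.audit), counting a user as active on any day they generate an audit event; the linked query is the polished version:

    -- Monthly stickiness = average daily active users / monthly active users
    WITH activity AS (
      SELECT
        event_date,
        DATE_TRUNC('MONTH', event_date) AS month,
        user_identity.email AS user_email
      FROM system.access.audit
    ),
    dau AS (
      SELECT month, event_date, COUNT(DISTINCT user_email) AS dau
      FROM activity
      GROUP BY month, event_date
    ),
    mau AS (
      SELECT month, COUNT(DISTINCT user_email) AS mau
      FROM activity
      GROUP BY month
    )
    SELECT
      d.month,
      ROUND(AVG(d.dau), 1) AS avg_dau,
      MAX(m.mau) AS mau,
      ROUND(AVG(d.dau) / MAX(m.mau), 3) AS stickiness
    FROM dau d
    JOIN mau m ON d.month = m.month
    GROUP BY d.month
    ORDER BY d.month DESC;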

Go here for the query

Cost Optimization

Understanding usage is nice, but cutting costs is the ultimate goal. Here are a few useful queries to help you pinpoint where to focus your optimization efforts for the most impact.

What are the current compute costs associated with my Databricks workspace?

Keeping track of your Databricks compute costs is fundamental for budget management. With so many options (classic, serverless, DBSQL; APC vs. on-demand), it’s hard to know what to use when.

These insights help you:

  • Pinpoint high-spend targets
  • Identify over- and under-used options
  • Detect unexpected usage or overconsumption
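
For a ballpark view, a sketch like the following joins system.billing.usage to system.billing.list_prices and groups by SKU; exact SKU names vary by cloud and tier, so treat the breakdown as an approximation:

    -- Estimated list-price cost by SKU over the last 30 days
    SELECT
      u.sku_name,
      SUM(u.usage_quantity) AS dbus,
      ROUND(SUM(u.usage_quantity * lp.pricing.default), 2) AS est_list_cost
    FROM system.billing.usage u
    JOIN system.billing.list_prices lp
      ON u.sku_name = lp.sku_name
     AND u.cloud = lp.cloud
     AND u.usage_start_time >= lp.price_start_time
     AND (lp.price_end_time IS NULL OR u.usage_start_time < lp.price_end_time)
    WHERE u.usage_date >= DATE_SUB(CURRENT_DATE(), 30)
    GROUP BY u.sku_name
    ORDER BY est_list_cost DESC;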

Go here for the query

What are the product costs associated with my Databricks workspace?

Knowing which Databricks products you are spending the most on allows you to implement targeted cost-saving measures to ensure you’re getting the most out of your investment. 

These insights let you:

  • Identify high cost targets for optimization
  • Get a bird’s eye view of how the org spends on Databricks
  • Detect trends and changing patterns
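
A quick sketch using the billing_origin_product column in system.billing.usage, which tags usage as JOBS, SQL, DLT, and so on (the exact values may vary); reuse the list_prices join from the previous sketch if you want dollars instead of DBUs:

    -- DBUs by Databricks product over the last 30 days
    SELECT
      billing_origin_product AS product,
      SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= DATE_SUB(CURRENT_DATE(), 30)
    GROUP BY billing_origin_product
    ORDER BY dbus DESC;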

Go here for the query

Which jobs are most likely over-provisioned? 

Identifying the workloads with the lowest average CPU usage is a quick way to detect overprovisioned clusters.

This information helps you:

  • Reduce spend from overprovisioned resources
  • Detect misconfigurations 
  • Better allocate resources
  • Understand which jobs will benefit from Gradient’s autonomous compute optimization
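
A minimal sketch of the approach, using hardware metrics from system.compute.node_timeline and restricting to job clusters via cluster_source (the 'JOB' value is an assumption; confirm it in your workspace):

    -- Job clusters with the lowest average CPU utilization over the last 30 days
    WITH latest AS (
      SELECT
        workspace_id, cluster_id, cluster_name, owned_by, cluster_source,
        ROW_NUMBER() OVER (PARTITION BY workspace_id, cluster_id ORDER BY change_time DESC) AS rn
      FROM system.compute.clusters
    )
    SELECT
      t.cluster_id,
      c.owned_by,
      ROUND(AVG(t.cpu_user_percent + t.cpu_system_percent), 1) AS avg_cpu_pct,
      ROUND(AVG(t.mem_used_percent), 1) AS avg_mem_pct
    FROM system.compute.node_timeline t
    JOIN latest c
      ON t.cluster_id = c.cluster_id
     AND c.rn = 1
     AND c.cluster_source = 'JOB'      -- assumed value; check your schema
    WHERE t.start_time >= CURRENT_TIMESTAMP() - INTERVAL 30 DAYS
    GROUP BY t.cluster_id, c.owned_by
    HAVING COUNT(*) >= 12              -- skip clusters with too few samples
    ORDER BY avg_cpu_pct ASC
    LIMIT 20;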

Go here for the query

This query is a great place to start your waste reduction efforts.

Which jobs are most likely under-provisioned? 

Identifying the data pipelines with the highest average memory utilization is an easy way to detect potential bottlenecks and under-provisioned resources.

This insight enables you to:

  • Identify jobs with potential bottlenecks
  • Improve performance with increased memory allocation
  • Detect anomalies and trends over time
  • Understand which jobs will benefit from Gradient’s autonomous compute optimization
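
The sketch is the mirror image of the over-provisioning one: same utilization table, but ordered by memory pressure, where high mem_used_percent plus any swapping suggests a starved cluster:

    -- Clusters with the highest average memory utilization over the last 30 days
    SELECT
      cluster_id,
      ROUND(AVG(mem_used_percent), 1) AS avg_mem_pct,
      ROUND(AVG(mem_swap_percent), 2) AS avg_swap_pct
    FROM system.compute.node_timeline
    WHERE start_time >= CURRENT_TIMESTAMP() - INTERVAL 30 DAYS
    GROUP BY cluster_id
    HAVING COUNT(*) >= 12
    ORDER BY avg_mem_pct DESC
    LIMIT 20;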

Go here for the query

What are my most expensive jobs over the past 30 days?

Knowing which jobs are costing you the most is crucial for optimizing your data infrastructure.

This information helps you:

  • Identify the best candidates for optimization
  • Implement cost-saving measures for high-spend workloads (or use Gradient for automated Jobs cluster tuning)
  • Better understand your Databricks workload spend

Keeping tabs on your most expensive jobs every month is a great way to anticipate spend and detect costly workloads before they incur too many expenses. 
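
A hedged sketch of the shape of this query: sum usage per job_id from system.billing.usage, price it at list price, and attach job names from system.lakeflow.jobs (the linked version adds more context):

    -- Estimated list-price cost per job over the last 30 days
    WITH job_cost AS (
      SELECT
        u.workspace_id,
        u.usage_metadata.job_id AS job_id,
        SUM(u.usage_quantity * lp.pricing.default) AS est_cost
      FROM system.billing.usage u
      JOIN system.billing.list_prices lp
        ON u.sku_name = lp.sku_name
       AND u.cloud = lp.cloud
       AND u.usage_start_time >= lp.price_start_time
       AND (lp.price_end_time IS NULL OR u.usage_start_time < lp.price_end_time)
      WHERE u.usage_date >= DATE_SUB(CURRENT_DATE(), 30)
        AND u.usage_metadata.job_id IS NOT NULL
      GROUP BY u.workspace_id, u.usage_metadata.job_id
    ),
    job_names AS (
      SELECT
        workspace_id, job_id, name,
        ROW_NUMBER() OVER (PARTITION BY workspace_id, job_id ORDER BY change_time DESC) AS rn
      FROM system.lakeflow.jobs
    )
    SELECT
      c.job_id,
      n.name AS job_name,
      ROUND(c.est_cost, 2) AS est_list_cost
    FROM job_cost c
    LEFT JOIN job_names n
      ON c.workspace_id = n.workspace_id AND c.job_id = n.job_id AND n.rn = 1
    ORDER BY c.est_cost DESC
    LIMIT 20;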

Go here for the query

Which jobs are growing the most in cost week over week?

Workload costs can grow gradually over time. Understanding which of your workloads is growing the most in costs each week enables you to:

  • Be the first to identify jobs that are getting increasingly more expensive
  • Pinpoint workloads for optimization
  • Detect anomalies and potential problems

By tracking week-over-week increases in spend, you can catch jobs that are becoming more expensive, even when the growth is gradual.
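
A minimal sketch of the week-over-week comparison using a window function over weekly DBU totals; swap in the pricing join shown earlier if you want dollars:

    -- Week-over-week DBU change per job
    WITH weekly AS (
      SELECT
        usage_metadata.job_id AS job_id,
        DATE_TRUNC('WEEK', usage_date) AS week,
        SUM(usage_quantity) AS dbus
      FROM system.billing.usage
      WHERE usage_metadata.job_id IS NOT NULL
      GROUP BY usage_metadata.job_id, DATE_TRUNC('WEEK', usage_date)
    )
    SELECT
      job_id,
      week,
      dbus,
      dbus - LAG(dbus) OVER (PARTITION BY job_id ORDER BY week) AS wow_increase
    FROM weekly
    ORDER BY wow_increase DESC NULLS LAST
    LIMIT 20;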

Go here for the query

What are the most expensive SQL warehouses for the last 30 days?

Pinpointing which warehouses are driving up costs over the last 30 days is a great way to attribute costs to users and teams.

This information helps you to:

  • Tie expensive warehouses to owners 
  • Simplify chargebacks and cost attribution
  • Identify anomalies and trends over time

This query should be your first step when looking to optimize warehouse costs. It ensures you focus your efforts on high-impact targets and gives you everything you need to plan your next steps.
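
The core of this one is a simple sketch over system.billing.usage, grouping on the warehouse ID carried in usage_metadata (add the list_prices join for a dollar estimate):

    -- DBUs per SQL warehouse over the last 30 days
    SELECT
      usage_metadata.warehouse_id AS warehouse_id,
      SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= DATE_SUB(CURRENT_DATE(), 30)
      AND usage_metadata.warehouse_id IS NOT NULL
    GROUP BY usage_metadata.warehouse_id
    ORDER BY dbus DESC;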

Go here for the query

What are the most expensive SQL queries over the past 30 days?

SQL query costs can add up with every run. Use this query to cut through the noise and focus on your most expensive SQL queries by total cost over the last 30 days.

This information enables you to:

  • Pinpoint top candidates for optimization
  • Detect anomalies and potential issues
  • Investigate cost increases before they become substantial
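
system.query.history does not store a dollar cost per statement, so a common proxy is total task time; a sketch along those lines (treat it as a rough ranking, not a bill):

    -- SQL statements ranked by total task time (a proxy for compute consumed)
    SELECT
      executed_by,
      LEFT(statement_text, 80) AS statement_preview,
      COUNT(*) AS runs,
      ROUND(SUM(total_task_duration_ms) / 3600000.0, 2) AS total_task_hours
    FROM system.query.history
    WHERE start_time >= CURRENT_TIMESTAMP() - INTERVAL 30 DAYS
    GROUP BY executed_by, LEFT(statement_text, 80)
    ORDER BY total_task_hours DESC
    LIMIT 20;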

Go here for the query

Why are these SQL queries so expensive, and who owns them?

Pinpointing the SQL queries that are costing you the most is a great first step toward optimization. But it is just a first step. This query adds granular information about each expensive SQL query, allowing you to:

  • Find queries that require optimization 
  • Understand per run and overall costs
  • Better allocate resources 
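
To add that per-run detail, a sketch might pull the owner plus duration, scan, and spill columns from system.query.history for the heaviest statements; column availability can vary slightly, so verify these fields exist in your workspace:

    -- Per-statement detail: owner, runs, average duration, scan and spill volume
    SELECT
      executed_by,
      LEFT(statement_text, 80) AS statement_preview,
      COUNT(*) AS runs,
      ROUND(AVG(total_duration_ms) / 1000.0, 1) AS avg_duration_s,
      ROUND(SUM(read_bytes) / POW(1024, 3), 1) AS total_read_gb,
      ROUND(SUM(spilled_local_bytes) / POW(1024, 3), 1) AS total_spill_gb
    FROM system.query.history
    WHERE start_time >= CURRENT_TIMESTAMP() - INTERVAL 30 DAYS
    GROUP BY executed_by, LEFT(statement_text, 80)
    ORDER BY SUM(total_task_duration_ms) DESC
    LIMIT 20;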

Go here for the query

What are the most costly APC clusters over the past 30 days?

Understanding which APC clusters are costing you the most can help you cut costs by pinpointing candidates for optimization.

This information lets you:

  • Focus optimization effort on high-value targets
  • Identify potential candidates for on-demand clusters
  • Better allocate resources

Use this query to pinpoint the clusters that are prime for further investigation. 
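
A sketch filtered to All-Purpose Compute SKUs (the LIKE pattern is an assumption; check the sku_name values in your account), joined back to cluster names and owners:

    -- All-Purpose Compute DBUs per cluster over the last 30 days
    WITH latest AS (
      SELECT
        workspace_id, cluster_id, cluster_name, owned_by,
        ROW_NUMBER() OVER (PARTITION BY workspace_id, cluster_id ORDER BY change_time DESC) AS rn
      FROM system.compute.clusters
    )
    SELECT
      u.usage_metadata.cluster_id AS cluster_id,
      MAX(c.cluster_name) AS cluster_name,
      MAX(c.owned_by) AS owned_by,
      SUM(u.usage_quantity) AS dbus
    FROM system.billing.usage u
    LEFT JOIN latest c
      ON u.workspace_id = c.workspace_id
     AND u.usage_metadata.cluster_id = c.cluster_id
     AND c.rn = 1
    WHERE u.usage_date >= DATE_SUB(CURRENT_DATE(), 30)
      AND u.sku_name LIKE '%ALL_PURPOSE%'   -- assumed SKU naming
    GROUP BY u.usage_metadata.cluster_id
    ORDER BY dbus DESC
    LIMIT 20;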

Go here for the query

Which APC clusters are likely under-utilized?

Identifying your underutilized APC clusters in the past 30 days, along with their relative costs, is a great way to facilitate cost savings and resource optimization.

These insights enable you to:

  • Create a short list of clusters for downsizing 
  • Cross reference usage with cost to identify issues
  • Detect trends over time

Use this query to detect under-utilized APC clusters and connect those clusters to the users who created them.
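
One way to sketch it is to cross-reference hardware utilization from system.compute.node_timeline with APC spend and surface clusters that combine low CPU with meaningful cost; the 30% threshold below is a judgment call, not a standard:

    -- APC clusters combining low average CPU with meaningful DBU spend (last 30 days)
    WITH util AS (
      SELECT
        cluster_id,
        ROUND(AVG(cpu_user_percent + cpu_system_percent), 1) AS avg_cpu_pct
      FROM system.compute.node_timeline
      WHERE start_time >= CURRENT_TIMESTAMP() - INTERVAL 30 DAYS
      GROUP BY cluster_id
    ),
    apc_spend AS (
      SELECT
        usage_metadata.cluster_id AS cluster_id,
        SUM(usage_quantity) AS dbus
      FROM system.billing.usage
      WHERE usage_date >= DATE_SUB(CURRENT_DATE(), 30)
        AND sku_name LIKE '%ALL_PURPOSE%'   -- assumed SKU naming
      GROUP BY usage_metadata.cluster_id
    )
    SELECT s.cluster_id, s.dbus, u.avg_cpu_pct
    FROM apc_spend s
    JOIN util u ON s.cluster_id = u.cluster_id
    WHERE u.avg_cpu_pct < 30    -- utilization threshold is a judgment call
    ORDER BY s.dbus DESC
    LIMIT 20;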

Go here for the query

What are the most expensive DLT clusters over the past 30 days?

Knowing which Delta Live Tables (DLT) clusters are costing you the most is the first step to optimization.

This information lets you:

  • Pinpoint DLT clusters for optimization
  • Identify tables for further analysis
  • Detect anomalies and trends over time
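
A sketch that groups spend by the dlt_pipeline_id carried in usage_metadata, reported in DBUs for simplicity:

    -- DBUs per DLT pipeline over the last 30 days
    SELECT
      usage_metadata.dlt_pipeline_id AS pipeline_id,
      SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= DATE_SUB(CURRENT_DATE(), 30)
      AND usage_metadata.dlt_pipeline_id IS NOT NULL
    GROUP BY usage_metadata.dlt_pipeline_id
    ORDER BY dbus DESC
    LIMIT 20;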

Go here for the query

Performance Management

Two queries to help optimize the performance of your Databricks ecosystem. 

Which notebooks are consuming the most DBUs over time?   

Determining which notebooks consume the most DBUs will help target your optimization efforts for maximum effect.

This information helps you:

  • Identify notebooks for optimization
  • Find problems and misuse  
  • Attribute costs by notebook owner
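
A sketch using the notebook_id field in usage_metadata; note the assumption that your usage records populate it, which today is mainly the case for serverless notebook workloads:

    -- DBUs per notebook per week (where notebook_id is populated)
    SELECT
      usage_metadata.notebook_id AS notebook_id,
      DATE_TRUNC('WEEK', usage_date) AS week,
      SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_metadata.notebook_id IS NOT NULL
    GROUP BY usage_metadata.notebook_id, DATE_TRUNC('WEEK', usage_date)
    ORDER BY dbus DESC
    LIMIT 50;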

Go here for the query

What are my longest running queries?

Identifying performance bottlenecks is key to keeping your data infrastructure efficient and responsive.

This information enables you to:

  • Pinpoint queries that might be slowing down your system
  • Detect which queries need more resources vs code optimization
  • Identify patterns in slow queries that could indicate underlying issues 

By optimizing these queries, you can make a big difference in the overall responsiveness and efficiency of your Databricks workspace. 
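
A sketch of the idea from system.query.history; comparing execution time with queueing time helps separate queries that need a bigger warehouse from queries that need better SQL:

    -- Longest-running finished statements over the last 30 days
    SELECT
      statement_id,
      executed_by,
      LEFT(statement_text, 80) AS statement_preview,
      ROUND(total_duration_ms / 1000.0, 1) AS total_s,
      ROUND(execution_duration_ms / 1000.0, 1) AS execution_s,
      ROUND(waiting_at_capacity_duration_ms / 1000.0, 1) AS queued_s
    FROM system.query.history
    WHERE start_time >= CURRENT_TIMESTAMP() - INTERVAL 30 DAYS
      AND execution_status = 'FINISHED'
    ORDER BY total_duration_ms DESC
    LIMIT 20;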

Go here for the query

Data and Workflow Management

A couple of queries to help manage your data and workflows on Databricks.

What are my most commonly used datasets? Where are they stored?

Knowing which of your datasets is used the most is crucial for optimizing data access and storage.

This insight helps you:

  • Prioritize optimization efforts by dataset to help lower storage costs
  • Apply caching strategies for frequently accessed data
  • Make informed decisions about data placement and replication

Optimizing your most-used datasets may help improve overall performance and reduce data transfer cost across your Databricks environment.
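
A sketch built on Unity Catalog lineage (system.access.table_lineage), counting how often each table appears as a read source; join to the catalog information_schema if you also want storage details:

    -- Most frequently read tables over the last 30 days
    SELECT
      source_table_full_name,
      COUNT(*) AS read_events,
      COUNT(DISTINCT created_by) AS distinct_users
    FROM system.access.table_lineage
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), 30)
      AND source_table_full_name IS NOT NULL
    GROUP BY source_table_full_name
    ORDER BY read_events DESC
    LIMIT 20;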

Go here for the query

How do I track lineage for data transformations in the workspace?

Visualize all the data lineage in your workspace. By mapping the relationships between source and target tables, you gain clarity into how data flows through the workspace. 

This information helps you understand:

  • Dependencies between datasets and transformations
  • Where the data used in critical workflows originates from
  • Upstream or downstream effects of changes
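
A minimal sketch that lists source-to-target edges from system.access.table_lineage, which you can feed into a graph or dashboard visualization:

    -- Table-to-table lineage edges observed in the last 30 days
    SELECT
      source_table_full_name,
      target_table_full_name,
      entity_type,
      COUNT(*) AS events
    FROM system.access.table_lineage
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), 30)
      AND source_table_full_name IS NOT NULL
      AND target_table_full_name IS NOT NULL
    GROUP BY source_table_full_name, target_table_full_name, entity_type
    ORDER BY events DESC
    LIMIT 100;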

Go here for the query

Get the queries 

For the SQL queries to answer these questions and to access a comprehensive 1-click dashboard visualizing all these insights, go directly to the Databricks SQL Query Toolkit page.

This toolkit provides you with ready-to-use queries and a powerful dashboard to help you gain deep insights into your Databricks environment quickly and easily.  Paste these queries into your Databricks SQL Editor to see the results, or fill out the form to get a comprehensive dashboard with answers to all these questions.

Get the comprehensive dashboard

If you’d like all of the queries with corresponding plots all in 1 convenient dashboard, fill out the form here.

Conclusion

By regularly asking and answering questions like these, you’ll be well-equipped to optimize your Databricks workspace, reduce costs, and improve overall performance. Remember, the key to effective data engineering is not just in processing data, but in understanding and optimizing the environment in which that processing occurs.

Do you think these queries are useful? Are we missing any? We’d love to hear what you think! Feel free to drop us a line here.