20 SQL queries to assess Databricks health & free dashboard

Data engineers and platform leaders know that the health and efficiency of their Databricks environment have widespread implications: they directly impact development velocity, operational costs and the reliability of data products. At Capital One Software, we understand that managing a scalable, performant and cost-optimized Databricks ecosystem requires a lot more than provisioning clusters. It requires continuous monitoring, intelligent forecasting and proactive optimization.

Just as you wouldn't run a production application without robust observability, your Databricks environment demands a similar level of scrutiny. To this end, we have built a free Databricks health check dashboard that surfaces insights from 20+ SQL queries spanning Jobs, SQL warehouses, all-purpose compute and Delta Live Tables usage. Use this free 1-click dashboard to identify your most costly users and resources, detect under- or over-provisioned resources, understand market breakdowns and more. Read on for all the details.

A deep dive into Databricks health check dashboard queries

The Databricks health check dashboard uses a collection of SQL queries that pull performance and usage information from Databricks system tables across Jobs, all-purpose compute (APC), Databricks SQL (DBSQL) and Delta Live Tables (DLT). The result is a suite of insightful reports that give platform leaders the data points they need to make strategic decisions, and data engineers the specifics to guide optimization.

These 20+ queries can be categorized into the following buckets:

  1. General usage: Understand how Databricks is used across the organization

  2. Cost optimization: Eliminate waste and identify low-hanging opportunities for optimization

  3. Performance management: Surface opportunities to improve the performance of queries and notebooks

  4. Data and workflow management: Detect commonly used datasets and understand data lineage 

Next we'll dive into each query.

General usage queries

As organizations scale on the data cloud, it becomes increasingly difficult to assess overall usage and trends across projects and teams. These queries help you get an overview of how Databricks is being used across the organization.

These five queries provide you with a high-level overview of usage, so you can determine your next steps with insights into:  

1. Who are my most active users?

Identifying your most active users by SQL query count helps in several ways. It sheds light on usage trends, pinpoints power users who might benefit from additional training and hardens your security posture by detecting unusual activity patterns that could indicate potential issues.

Here’s the query: 

-- Count of queries executed by user
SELECT
  executed_by AS user,
  COUNT(*) AS query_count
FROM
  system.query.history
GROUP BY
  executed_by
ORDER BY
  query_count DESC
LIMIT 10;
  

2. Who owns your most expensive clusters?

Once you know who your most active users are, the next logical step is to identify the users who own and manage your most expensive clusters. This query enables you to identify costly cluster owners, attribute costs to those users and their associated groups, and detect anomalies or potential misuse of clusters.

Here’s the query:

-- Most expensive cluster owners
SELECT
  c.owned_by,
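  -- Note: usage_quantity is in DBUs; join system.billing.list_prices (as in query #6) to convert to dollar cost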
  SUM(u.usage_quantity) AS total_cost
FROM
  system.billing.usage u
  JOIN system.compute.clusters c
ON
  u.usage_metadata.cluster_id = c.cluster_id
WHERE
  u.usage_start_time >= current_date() - interval 30 days
GROUP BY
  c.owned_by
ORDER BY
  total_cost DESC
LIMIT 20
  

3. How many clusters are active and were they created according to policy?

Understanding where cluster creation originates and how many clusters are currently active helps you optimize workflows and ensure better governance. This query provides insights into cluster creation trends, enabling you to ensure cluster creation practices adhere to internal policies and quickly identify clusters that were created using alternative methods.

Get the query
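
The dashboard ships the full version of this query. If you want a head start, here's a minimal sketch of the idea: it takes the latest record for each cluster, keeps the ones that haven't been deleted and counts them by creation source and whether a policy is attached. It assumes system.compute.clusters exposes cluster_source, policy_id, delete_time and change_time, so verify those columns in your workspace:

-- Sketch: active clusters by creation source and policy attachment
WITH latest_clusters AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY workspace_id, cluster_id ORDER BY change_time DESC) AS rn
  FROM
    system.compute.clusters QUALIFY rn = 1
)
SELECT
  cluster_source,
  policy_id IS NOT NULL AS has_policy,
  COUNT(*) AS active_clusters
FROM
  latest_clusters
WHERE
  delete_time IS NULL
GROUP BY ALL
ORDER BY
  active_clusters DESC;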

4. What is my market usage breakdown (spot vs. on-demand)?

While spot instances may offer up to 90% lower costs, they come at the expense of reliability. Cloud providers can terminate spot nodes at any moment, with little to no notice, causing interruptions. Understanding your market usage is crucial because it is a lever for balancing the cost and reliability of your infrastructure.

Use this query to understand spot vs. on-demand usage across the organization, identify opportunities to leverage spot instances and set up spot usage with on-demand fallback where relevant.

Get the query
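
Until you grab the dashboard, here's a rough sketch of one way to approach this on AWS: group the latest record of each active cluster by its configured availability. It assumes the clusters system table surfaces the Clusters API's aws_attributes struct with an availability field (values like SPOT_WITH_FALLBACK or ON_DEMAND), which you should verify; Azure and GCP users would swap in the corresponding attributes:

-- Sketch: cluster market breakdown on AWS (verify the aws_attributes struct in your workspace)
WITH latest_clusters AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY workspace_id, cluster_id ORDER BY change_time DESC) AS rn
  FROM
    system.compute.clusters QUALIFY rn = 1
)
SELECT
  aws_attributes.availability AS market_type,
  COUNT(*) AS cluster_count
FROM
  latest_clusters
WHERE
  delete_time IS NULL
GROUP BY ALL
ORDER BY
  cluster_count DESC;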

5. How “sticky” is my workspace?

Product stickiness is a critical metric for assessing engagement. Databricks stickiness is calculated just like it is for any other product: daily active users divided by monthly active users (DAU/MAU). A high stickiness rate indicates that the organization has successfully adopted the data cloud.

This information is useful in gauging adoption across the org, changes over time and anomalies that require further investigation.

Get the query
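
As a rough starting point, here's a minimal sketch that treats SQL activity in system.query.history as the "active" signal, dividing yesterday's distinct users by the trailing 30-day distinct users. The dashboard's version may use broader activity signals, so treat this as an approximation:

-- Sketch: stickiness as DAU / MAU, using SQL query activity as the signal
WITH dau AS (
  SELECT COUNT(DISTINCT executed_by) AS daily_active_users
  FROM system.query.history
  WHERE start_time >= current_date() - INTERVAL 1 DAY
),
mau AS (
  SELECT COUNT(DISTINCT executed_by) AS monthly_active_users
  FROM system.query.history
  WHERE start_time >= current_date() - INTERVAL 30 DAYS
)
SELECT
  daily_active_users,
  monthly_active_users,
  try_divide(daily_active_users, monthly_active_users) AS stickiness
FROM
  dau CROSS JOIN mau;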

Cost optimization

Once we have a better grasp on how Databricks is being used across the organization, we can set our focus on the next step: optimizing costs. The 12 queries in this section are meant to help you pinpoint where to focus your optimization efforts in order to drive the most impact.

6. What are my current compute costs by SKU?

Compute costs are a significant portion of the average organization's spend on Databricks, and they are often the single largest line item. As a result, keeping track of your compute costs is fundamental to budget management.

This query pinpoints high-spend targets and identifies both over- and under-used compute SKUs. Use it to make sense of all the compute options you are currently spending on, such as classic clusters, serverless options, SQL warehouses, APC and on-demand clusters.

Here’s the query:

-- Total compute costs by SKU
SELECT
  t1.sku_name,
  SUM(t1.usage_quantity * lp.pricing.default) AS total_cost
FROM
  system.billing.usage t1
  INNER JOIN system.billing.list_prices lp ON t1.cloud = lp.cloud
  AND t1.sku_name = lp.sku_name
  AND t1.usage_start_time >= lp.price_start_time
  AND (
    t1.usage_end_time <= lp.price_end_time
    OR lp.price_end_time IS NULL
  )
GROUP BY
  t1.sku_name
ORDER BY
  total_cost DESC;
  

7. What are my current costs by product?

Databricks offers a multitude of products to address different needs and users. Understanding which Databricks products you are spending the most on is the first step to implementing cost-saving measures, with the ultimate goal of maximizing your investment in Databricks. 

This query sheds light on how the organization spends on Jobs vs. all-purpose compute, interactive workloads, SQL, vector search and other products. Use this query to pinpoint high-cost targets for further investigation and targeted optimization.

Get the query
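
While you wait for the dashboard, a minimal sketch is straightforward: it reuses the billing_origin_product column and the list-price join from queries #6 and #11 to total 30-day list cost per product:

-- Sketch: 30-day list cost by Databricks product
SELECT
  u.billing_origin_product AS product,
  SUM(u.usage_quantity * lp.pricing.default) AS total_cost
FROM
  system.billing.usage u
  INNER JOIN system.billing.list_prices lp ON u.cloud = lp.cloud
  AND u.sku_name = lp.sku_name
  AND u.usage_start_time >= lp.price_start_time
  AND (
    u.usage_end_time <= lp.price_end_time
    OR lp.price_end_time IS NULL
  )
WHERE
  u.usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY
  u.billing_origin_product
ORDER BY
  total_cost DESC;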

8. Which of my jobs are over-provisioned? 

When it comes to waste reduction, this query is a great place to start. Identifying the workloads with the lowest average CPU usage is a quick way to detect clusters that are most likely over-provisioned. If CPU usage is consistently low, you are not fully leveraging the hardware and are paying for more compute resources than you need.

Use this query to reduce spend from over-provisioned resources, identify potential misconfigurations or misuse, and better allocate resources.

Get the query
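
Here's a minimal sketch of the approach: it maps job clusters from billing usage to their hardware metrics and ranks them by average CPU utilization. It assumes system.compute.node_timeline exposes cpu_user_percent and cpu_system_percent, so check those column names in your workspace:

-- Sketch: job clusters with the lowest average CPU utilization over the last 7 days
WITH job_clusters AS (
  SELECT DISTINCT
    usage_metadata.cluster_id AS cluster_id,
    usage_metadata.job_id AS job_id
  FROM
    system.billing.usage
  WHERE
    billing_origin_product = 'JOBS'
    AND usage_date >= current_date() - INTERVAL 7 DAYS
)
SELECT
  jc.job_id,
  jc.cluster_id,
  AVG(nt.cpu_user_percent + nt.cpu_system_percent) AS avg_cpu_percent
FROM
  system.compute.node_timeline nt
  JOIN job_clusters jc ON nt.cluster_id = jc.cluster_id
WHERE
  nt.start_time >= current_date() - INTERVAL 7 DAYS
GROUP BY
  jc.job_id,
  jc.cluster_id
ORDER BY
  avg_cpu_percent ASC
LIMIT 20;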

9. Which of my jobs are under-provisioned? 

Pinpointing the data workloads with the highest average memory utilization is an easy way to detect resources that are most likely under-provisioned. This query helps you detect pipelines that have potential bottlenecks and might benefit from additional resources.

Use this query to increase memory allocation where needed to boost performance, detect anomalies and track trends over time.

Get the query

10. What are my most expensive jobs over the past 30 days?

You can’t optimize without knowing where to start. Identifying your most costly jobs over the past 30 days is key for continuous optimization of your data infrastructure. 

This query helps you identify prime candidates for optimization and implement cost-saving measures for high-spend workloads. It also helps you better anticipate compute spend and catch costly workloads before their expenses pile up.

Get the query
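
If you'd like to experiment before downloading the dashboard, here's a minimal sketch built from the same billing, list-price and lakeflow tables used in query #11:

-- Sketch: 30-day list cost by job
WITH job_costs AS (
  SELECT
    u.workspace_id,
    u.usage_metadata.job_id AS job_id,
    SUM(u.usage_quantity * lp.pricing.default) AS total_cost
  FROM
    system.billing.usage u
    INNER JOIN system.billing.list_prices lp ON u.cloud = lp.cloud
    AND u.sku_name = lp.sku_name
    AND u.usage_start_time >= lp.price_start_time
    AND (u.usage_end_time <= lp.price_end_time OR lp.price_end_time IS NULL)
  WHERE
    u.billing_origin_product = 'JOBS'
    AND u.usage_date >= current_date() - INTERVAL 30 DAYS
  GROUP BY ALL
),
most_recent_jobs AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY workspace_id, job_id ORDER BY change_time DESC) AS rn
  FROM
    system.lakeflow.jobs QUALIFY rn = 1
)
SELECT
  j.name,
  c.workspace_id,
  c.job_id,
  c.total_cost
FROM
  job_costs c
  LEFT JOIN most_recent_jobs j USING (workspace_id, job_id)
ORDER BY
  total_cost DESC
LIMIT 20;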

11. What jobs are growing the most in costs week-over-week?

Data pipeline costs can grow gradually over time. Understanding which of your workloads is increasing the most in spend week-over-week helps you detect changes that might have otherwise gone unnoticed. 

Use this query to be the first to know when jobs are getting more expensive, even when that spend grows slowly.

Here’s the query:

-- Top jobs growing in week-over-week costs
with job_run_timeline_with_cost as (
  SELECT
    t1.*,
    t1.usage_metadata.job_id as job_id,
    t1.identity_metadata.run_as as run_as,
    t1.usage_quantity * list_prices.pricing.default AS list_cost
  FROM system.billing.usage t1
    INNER JOIN system.billing.list_prices list_prices
      ON
        t1.cloud = list_prices.cloud AND
        t1.sku_name = list_prices.sku_name AND
        t1.usage_start_time >= list_prices.price_start_time AND
        (t1.usage_end_time <= list_prices.price_end_time or list_prices.price_end_time is NULL)
  WHERE
    t1.billing_origin_product = 'JOBS' AND
    t1.usage_date >= CURRENT_DATE() - INTERVAL 14 DAY
),
most_recent_jobs as (
  SELECT
    *,
    ROW_NUMBER() OVER(PARTITION BY workspace_id, job_id ORDER BY change_time DESC) as rn
  FROM
    system.lakeflow.jobs QUALIFY rn=1
)
SELECT
    t2.name
    ,t1.workspace_id
    ,t1.job_id
    ,t1.sku_name
    ,t1.run_as
    ,Last7DaySpend
    ,Last14DaySpend
    ,last7DaySpend - last14DaySpend as Last7DayGrowth
    ,try_divide( (last7DaySpend - last14DaySpend) , last14DaySpend) * 100 AS Last7DayGrowthPct
FROM
  (
    SELECT
      workspace_id,
      job_id,
      run_as,
      sku_name,
      SUM(list_cost) AS spend
      ,SUM(CASE WHEN usage_end_time BETWEEN date_add(current_date(), -8) AND date_add(current_date(), -1) THEN list_cost ELSE 0 END) AS Last7DaySpend
      ,SUM(CASE WHEN usage_end_time BETWEEN date_add(current_date(), -15) AND date_add(current_date(), -8) THEN list_cost ELSE 0 END) AS Last14DaySpend
    FROM job_run_timeline_with_cost
    GROUP BY ALL
  ) t1
  LEFT JOIN most_recent_jobs t2 USING (workspace_id, job_id)
ORDER BY
  Last7DayGrowth DESC
LIMIT 20
  

12. What are the most expensive SQL warehouses for the last 30 days?

Knowing which of your warehouses are driving up costs over the last 30 days helps with cost attribution and chargebacks. This information allows you to tie expensive SQL warehouses to owners (users and teams), as well as detect anomalies and trends over time. 

Use this query as a first step when looking to optimize warehouse costs. It helps guide optimization efforts to high-impact targets and gives you everything you need to plan your next steps.

Get the query
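
As a rough starting point, here's a minimal sketch that totals 30-day DBU consumption per warehouse, assuming usage_metadata.warehouse_id is populated for warehouse usage records. It reports DBUs rather than dollars; join system.billing.list_prices as in query #6 to convert:

-- Sketch: 30-day DBU consumption by SQL warehouse
SELECT
  workspace_id,
  usage_metadata.warehouse_id AS warehouse_id,
  SUM(usage_quantity) AS total_dbus
FROM
  system.billing.usage
WHERE
  usage_metadata.warehouse_id IS NOT NULL
  AND usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY
  workspace_id,
  warehouse_id
ORDER BY
  total_dbus DESC
LIMIT 20;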

13. What are the most expensive SQL queries over the past 30 days?

Query costs can grow gradually with every run. This query helps you focus your attention on your most expensive SQL queries by surfacing the total costs they incurred over the past 30 days.

Use this query to detect cost increases before they become substantial and pinpoint queries for further analysis and optimization. 

Get the query

14. Why are these SQL queries so expensive, and who owns them?

Identifying your most expensive queries is great, but you also need to understand why they are so expensive. This query adds granular information about each expensive SQL query, with per-run and overall costs.

Use this query to identify SQL queries that require attention and to better allocate resources.


15. What are the most costly APC clusters over the past 30 days?

Keeping track of your most costly all-purpose clusters is crucial, as costs can grow quickly. This query helps you focus your optimization efforts on high-value targets.

Use this query to identify potential candidates for on-demand clusters and better allocate resources for efficiency and cost-effectiveness.

Get the query
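
Here's a minimal sketch of the idea: it isolates all-purpose usage by filtering the SKU name, totals 30-day DBUs per cluster and attaches the owner from the latest cluster record. The ALL_PURPOSE SKU filter and the owned_by/change_time columns are assumptions to verify against your workspace, and you can join system.billing.list_prices as in query #6 if you want dollars instead of DBUs:

-- Sketch: 30-day DBU consumption by all-purpose cluster and owner
WITH cluster_owners AS (
  SELECT
    cluster_id,
    owned_by,
    ROW_NUMBER() OVER (PARTITION BY cluster_id ORDER BY change_time DESC) AS rn
  FROM
    system.compute.clusters QUALIFY rn = 1
)
SELECT
  co.owned_by,
  u.usage_metadata.cluster_id AS cluster_id,
  SUM(u.usage_quantity) AS total_dbus
FROM
  system.billing.usage u
  LEFT JOIN cluster_owners co ON u.usage_metadata.cluster_id = co.cluster_id
WHERE
  u.sku_name LIKE '%ALL_PURPOSE%'
  AND u.usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY ALL
ORDER BY
  total_dbus DESC
LIMIT 20;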

16. Which APC clusters are likely under-utilized?

Detecting under-utilized resources is key for waste management. This query pinpoints your APC clusters that have most likely been under-utilized in the past 30 days, their owners and the relative costs they have incurred.

Use this query to create a shortlist of APC clusters to downsize, connect the dots between costly resources and their owners and detect issues and overarching patterns. 

Get the query

17. What are the most expensive DLT clusters over the past 30 days?

If you use DLT you need to know which clusters are driving up your costs. This query helps you pinpoint DLT clusters that are prime for optimization.

Use this query to pinpoint tables for further analysis and detect issues or anomalies before their costs become too high. 

Get the query
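
For a quick approximation before you grab the dashboard, here's a sketch that totals 30-day DBUs per DLT pipeline, assuming usage_metadata.dlt_pipeline_id is populated for DLT billing records (join system.billing.list_prices as in query #6 to convert to dollars):

-- Sketch: 30-day DBU consumption by DLT pipeline
SELECT
  workspace_id,
  usage_metadata.dlt_pipeline_id AS pipeline_id,
  SUM(usage_quantity) AS total_dbus
FROM
  system.billing.usage
WHERE
  usage_metadata.dlt_pipeline_id IS NOT NULL
  AND usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY
  workspace_id,
  pipeline_id
ORDER BY
  total_dbus DESC
LIMIT 20;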

Performance management

Cost optimization is important, but so is performance. The following two queries can help you optimize the performance of notebooks and SQL queries. 

18. Which notebooks are consuming the most DBUs?   

Identifying the notebooks that consume the most DBUs over time is critical, especially if you have an active data science team.

Use this query to target your optimization efforts for maximum impact, detect issues or misuse and attribute costs to notebook owners.

Here’s the query:

-- Notebooks that have consumed the most DBUs
SELECT
  usage_metadata.notebook_id AS notebook_id,
  usage_metadata.notebook_path AS notebook_path,
  SUM(usage_quantity) AS total_usage
FROM
  system.billing.usage
WHERE
  usage_metadata.notebook_id IS NOT NULL
GROUP BY
  notebook_id,
  notebook_path
ORDER BY
  total_usage DESC
LIMIT 10;
  

19. What are my longest running queries?

Identifying performance bottlenecks is key to keeping your data infrastructure efficient and responsive. This query produces a shortlist of your SQL queries with the longest runtimes.

Use this query to pinpoint queries that require more resources vs those that might need code optimization, identify queries that might be slowing down your system and detect patterns that could indicate underlying issues. 

Get the query
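
A minimal sketch of this one only needs system.query.history; it assumes the total_duration_ms and statement_text columns are available in your workspace:

-- Sketch: longest-running SQL statements over the last 30 days
SELECT
  statement_id,
  executed_by,
  total_duration_ms / 1000 AS total_duration_seconds,
  LEFT(statement_text, 100) AS statement_preview
FROM
  system.query.history
WHERE
  start_time >= current_date() - INTERVAL 30 DAYS
ORDER BY
  total_duration_ms DESC
LIMIT 20;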

Data and workflow management

These two queries help you manage your datasets and workflows on Databricks by shedding light on your most commonly used datasets and visualizing data lineage.

20. What are my most commonly used datasets? Where are they stored?

Optimizing your most-used datasets improves overall performance and reduces data transfer costs across the entire Databricks environment.

Use this query to prioritize optimizations that lower storage costs, apply caching strategies for frequently accessed data and make informed decisions about data placement and replication.

Here’s the query:

-- Most Used Data Sources by Jobs
SELECT
  COALESCE(source_table_full_name, source_path) AS source,
  COUNT(DISTINCT entity_id) AS distinct_jobs
FROM
  system.access.table_lineage
WHERE
  entity_type = 'JOB'
  AND COALESCE(source_table_full_name, source_path) IS NOT NULL
GROUP BY
  1
ORDER BY
  2 DESC
LIMIT 10;
  

21. How do I track lineage for data transformations?

Easily map out the relationships between source and target tables in your workspace. By visualizing data lineage, you gain clarity into how data flows through your workspace. 

Use this query to understand the dependencies between datasets and transformations, where the data used in critical workflows originates, and the upstream or downstream effects of changes.

Get the query
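
As a starting point, here's a minimal sketch that lists the most frequently observed table-to-table edges from the same lineage table used in query #20, assuming target_table_full_name and event_time are available:

-- Sketch: most frequently observed table-to-table lineage edges, last 30 days
SELECT
  source_table_full_name,
  target_table_full_name,
  COUNT(*) AS event_count
FROM
  system.access.table_lineage
WHERE
  source_table_full_name IS NOT NULL
  AND target_table_full_name IS NOT NULL
  AND event_time >= current_date() - INTERVAL 30 DAYS
GROUP BY
  source_table_full_name,
  target_table_full_name
ORDER BY
  event_count DESC
LIMIT 50;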

All the queries in a comprehensive dashboard

We’ve aggregated these 20+ SQL queries into a comprehensive 1-click dashboard that visualizes all the insights. Go to the Databricks health check dashboard page to get it.

This toolkit provides you with ready-to-use queries and a powerful dashboard to help you gain deep insights into your Databricks environment quickly and easily. Paste these queries into your Databricks SQL Editor to see the results, or fill out the form to get a comprehensive dashboard with answers to all these questions.

By consolidating these crucial insights, the Databricks health check dashboard acts as your operational telemetry system, supporting data-driven decisions that translate directly into a more efficient, cost-effective and robust Databricks workspace.


Snowflake health check coming soon

Our commitment to empowering data engineers and platform managers extends across data clouds. Just as we've detailed the value of a Databricks health check, we recognize a similar need for deep observability into Snowflake. That is why we are actively working on releasing a Snowflake health check marketplace app. Stay tuned for further announcements and the official release, as we aim to equip your teams with the tools to master your Snowflake environment.

Conclusion

In today's data-driven landscape, proactive platform health management is no longer optional; it's a competitive advantage. This free Databricks health check dashboard provides data engineers and platform leaders with the actionable insights they need to optimize performance, control costs and maintain a resilient Databricks environment.

By leveraging the power of SQL, you can transform raw metadata from system tables into strategic intelligence, ensuring your data initiatives continue to drive innovation. We're excited to offer this level of detailed analysis, and we look forward to bringing similar capabilities to the Snowflake ecosystem soon.

This is another step in our journey to enable data engineers and platform leaders to maximize their investment in their data clouds with Capital One Slingshot.

Book some time with the team to learn more about Slingshot and how it can help your organization.


Noa Shavit, Senior Marketing Manager

Noa is a full-stack marketer specializing in infrastructure products and developer tools. She drives adoption and growth for technical products through strategic marketing. Her expertise lies in bridging the gap between innovative software and its users, ensuring that innovation translates into tangible value. Prior to Capital One, Noa led marketing and shaped GTM motions for Sync Computing, Builder.io, and Layer0.