
What’s the difference between Databricks’s Overwatch and System Tables tools?

Databricks recently released System Tables to help users understand and monitor their usage. It also has Overwatch, an older, more skunkworks-style project that provides usage and infrastructure information as well. So what’s the difference between the two, and when should you use each one? Let’s dive in!

Introduction

Databricks as a whole is a complex ecosystem, from its technical compute intricacies to its confusing pricing structures. As a user, you likely have questions such as: What are my Spark jobs doing? How high is my utilization, and is it efficient? Who are the other users on my account, and what are their compute costs? Which tables are they accessing?

Databricks currently has two tools that can help answer these questions and monitor metrics. The first, and older, one is Overwatch; the other is the more recently released System Tables. Both have pros and cons, and we’ll walk through the points we found relevant below:

High level overview


Overwatch: Developed internally within Databricks Labs and released several years ago. It’s an open source side project rather than a polished Databricks product.

  • Pros: Overwatch provides many metrics, from low level Spark event metrics to cost data. We recommend their docs for details on which metrics it surfaces. Since it’s open source, it’s highly customizable and can work in real time.
  • Cons: It’s very complicated to set up, especially the real-time monitoring service. This is not an easy tool to use, and it is likely suited only to advanced Databricks users. While it contains a lot of data, it is difficult to extract actionable insights from that data without Spark expertise. Users also accrue compute costs to host Overwatch (admittedly, this seems to be a small overhead relative to total Databricks costs).

System Tables: An official Databricks product launched in 2023, with a focus on high level usage, costs, and access.

  • Pros: It’s fully integrated with Unity Catalog, and users can get up and running pretty quickly within the Databricks platform. It is geared more toward analyzing usage and cost information.
  • Cons: It is less focused on low level Spark metrics, which means it will be less useful if you’re looking for a tool to help you optimize your code. One big caveat is that System Tables is not a real-time service: data is only updated several times a day (as of the date of this blog). Another is that System Tables doesn’t seem to feed in your cloud costs and only reports on DBU costs. This is a bit disappointing, since cloud costs can often be greater than your DBU costs.

What information is monitored?

  • Overwatch – The full and official list of the parameters monitored can be found on their docs page.  The broad categories are Databricks workflows, compute, Spark UI, Databricks SQL, and security. For a more opinionated view of what information they provide, I recommend checking out their pre-built dashboards page which describes popular metrics to track.  I also recommend seeing their scopes and modules page to see the granular Spark metrics that are reported.  Since there are more similarities than differences between the two systems, let’s identify what’s unique in Overwatch and not in System Tables:
    • Failed Jobs
    • Spark Eventlogs
    • Cloud costs (need to set up manually)
    • Cluster metrics (need to set up manually)
    • Real-time cost estimates 
  • System Tables – The list of information provided by System Tables can be found on their main page. At the highest level, several tables are provided today: audit logs, billable usage logs, pricing table, table and column lineage, and marketplace listing access. Some of the information unique to System Tables is:
    • Data lineage
    • System Access
    • Marketplace pricing
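
To make the data lineage item above more concrete, here is a minimal sketch of how one might ask which tables were recently written from a given source table. It only builds a SQL string that you could run in a notebook or SQL editor. The table name `system.access.table_lineage` and the columns used are our understanding of the lineage schema rather than something taken from this post, so please verify them in your own metastore. The helper name and `main.sales.orders` are hypothetical.

```python
import re

# Assumed table name; check the system.access schema in your own metastore.
LINEAGE_TABLE = "system.access.table_lineage"

# Accept only plain three-part names (catalog.schema.table) so the value
# can be placed inside the SQL string safely.
_FULL_NAME = re.compile(r"^[A-Za-z0-9_]+(\.[A-Za-z0-9_]+){2}$")


def downstream_lineage_query(source_table: str, days: int = 30) -> str:
    """Build SQL listing tables written from `source_table` in the last `days` days."""
    if not _FULL_NAME.match(source_table):
        raise ValueError("expected a name of the form catalog.schema.table")
    if days <= 0:
        raise ValueError("days must be positive")
    return (
        "SELECT DISTINCT target_table_full_name, entity_type, created_by\n"
        f"FROM {LINEAGE_TABLE}\n"
        f"WHERE source_table_full_name = '{source_table}'\n"
        "  AND target_table_full_name IS NOT NULL\n"
        f"  AND event_date >= date_sub(current_date(), {int(days)})"
    )


if __name__ == "__main__":
    # `main.sales.orders` is a hypothetical table name used for illustration.
    print(downstream_lineage_query("main.sales.orders"))
```

Since System Tables is not real time, very recent reads and writes may not show up in the result yet.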

What is the setup process like?

  • Overwatch – The main instructions for installing Overwatch can be found on the deploy page. To say the least, it’s complicated. It requires you to configure your cloud, security, secrets, audit logs, and a host of other options. We also found that if you want to pipe in your actual cloud costs, you have to do a lot of pipeline building to feed your cloud cost information to Overwatch. If you want to collect real-time cluster metrics, you will have to write a script to collect the metrics and set up a time-series database.
  • System Tables – Enabling System Tables is pretty easy, but you do need at least one Unity Catalog-enabled workspace. System tables must be enabled by an account admin, who can do so using the Unity Catalog REST API. Once enabled, the tables will appear in a catalog called system, which is included in every Unity Catalog metastore. In the system catalog you’ll see schemas such as access and billing that contain the system tables.
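
As a rough illustration of the REST API step above, the sketch below builds (but does not send) the request an account admin might use to enable one system schema. The endpoint path is how we recall it from the Databricks documentation, so treat it as an assumption and confirm it against the current Unity Catalog API reference. The function name, host, metastore ID, and token are all hypothetical placeholders.

```python
import urllib.request


def build_enable_request(
    host: str, metastore_id: str, schema: str, token: str
) -> urllib.request.Request:
    """Build a PUT request that enables one system schema (e.g. "billing").

    Assumption: the endpoint has the form
    /api/2.0/unity-catalog/metastores/<metastore-id>/systemschemas/<schema>.
    """
    url = (
        f"{host.rstrip('/')}/api/2.0/unity-catalog/metastores/"
        f"{metastore_id}/systemschemas/{schema}"
    )
    return urllib.request.Request(
        url,
        method="PUT",
        headers={"Authorization": f"Bearer {token}"},
    )


if __name__ == "__main__":
    # All values below are placeholders, not real identifiers.
    req = build_enable_request(
        "https://example.cloud.databricks.com", "my-metastore-id", "billing", "my-token"
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to review the URL first, and to loop over schemas such as access and billing once you are happy with it.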

What are some use cases?

  • Common Use Cases
    • Monitoring Databricks costs –  Break down your costs by user, tags, notebook, job, etc.  
    • Finding the “expensive” jobs – Locate cost hotspots and identify who is responsible.
  • Overwatch 
    • Low level Spark metrics to optimize code – An expert user can inspect the detailed Spark job information that Overwatch surfaces (e.g. task, stage, and executor metrics) to design an optimization plan specific to their jobs.
    • Real-time cluster information for monitoring – Users can check the utilization of their clusters and make case-by-case decisions on cluster and instance sizing, for example. 
  • System Tables
    • Monitoring who is accessing which table
    • Monitoring usage with specific tags
    • Viewing the lineage of my tables
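
For the tag-based usage monitoring case above, a query along the following lines could be a starting point. This is a sketch under assumptions: the table names (`system.billing.usage`, `system.billing.list_prices`) and column names (`custom_tags`, `usage_quantity`, `pricing.default`, and so on) reflect our understanding of the billing schema and should be checked in your workspace. The helper name and the `team` tag key are hypothetical. As noted earlier, the result covers list-price DBU costs only, not cloud costs.

```python
import re

# Allow only simple tag keys so the value can be embedded in SQL safely.
_TAG_KEY = re.compile(r"^[A-Za-z0-9_\-]+$")


def dbu_cost_by_tag_query(tag_key: str, days: int = 30) -> str:
    """Build SQL estimating list-price DBU cost per value of one custom tag."""
    if not _TAG_KEY.match(tag_key):
        raise ValueError("tag key may contain only letters, digits, '_' and '-'")
    if days <= 0:
        raise ValueError("days must be positive")
    tag_expr = f"u.custom_tags['{tag_key}']"
    return (
        f"SELECT {tag_expr} AS tag_value,\n"
        "       SUM(u.usage_quantity * p.pricing.default) AS est_list_cost\n"
        "FROM system.billing.usage AS u\n"
        "JOIN system.billing.list_prices AS p\n"
        "  ON u.sku_name = p.sku_name\n"
        # Match each usage record to the price that was in effect at the time.
        " AND u.usage_start_time >= p.price_start_time\n"
        " AND (p.price_end_time IS NULL OR u.usage_start_time < p.price_end_time)\n"
        f"WHERE u.usage_date >= date_sub(current_date(), {int(days)})\n"
        f"GROUP BY {tag_expr}\n"
        "ORDER BY est_list_cost DESC"
    )


if __name__ == "__main__":
    # `team` is a hypothetical tag key used for illustration.
    print(dbu_cost_by_tag_query("team", days=30))
```

Untagged usage will show up under a NULL tag value, which is often a useful number in its own right.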

What does neither of them do?

  • No actionable decisions – Both of these systems surface excellent and useful information for companies to review. However, there is always the question of “what do I do with this data?” That decision is still left to the user, and there are plenty of great options. So if users are looking for a solution that goes beyond monitoring and dashboards, they’ll need to look elsewhere.
  • No job optimizations – Neither platform actually optimizes your jobs; they simply report metrics. They can help flag problematic jobs, but a human engineer is still required to go in and optimize things. Whether companies have the bandwidth to send an engineer chasing problems here and there is another question. While these platforms are great for identifying large hotspots, they aren’t great at optimizing a large number of jobs simultaneously.

Conclusion

Here at Sync, we are huge fans of both platforms. My opinion is that System Tables will eventually take over as the “real” product for monitoring Databricks usage, since adding the Spark metrics to System Tables is probably fairly trivial. So if I were a company aiming to invest in one or the other, I’d go with System Tables. It is also massively easier to set up and get running.

However, the drawbacks of both platforms are real: neither of them does any active optimization. That’s why we built Gradient, a closed-loop feedback system that continuously optimizes your Databricks jobs to hit your business goals. We built the missing intelligence that mines much of the same data as Overwatch and provides automatic recommendations that data engineers can instantly apply.

Interested in learning more about Gradient to optimize your Databricks jobs? Reach out to Jeff Chou and the rest of the Sync Team.