Releases

Why is your Databricks job performance changing over time?

Teams that run and track production Databricks jobs often see “random” fluctuations in runtime, or performance that drifts slowly over days and weeks.

The questions that follow are usually:

  • “Why is my runtime increasing since last week?”
  • “Is the cost of this job also increasing?”
  • “Is the input data size changing?”
  • “Is my job spilling to disk more than before?”
  • “Is my job in danger of crashing?”

To give engineers and managers more visibility into how their production jobs perform over time, we just launched a new visualization feature in Gradient that is designed to provide quick answers to these questions.

A snapshot of the new visualizations is shown below. On a single page you can see and correlate the metrics that may be affecting your cost and runtime performance. The visualization tracks the following metrics:

  • Job cost (DBU + Cloud fees)
  • Job Runtime
  • Number of core*hours
  • Number of workers
  • Input data size
  • Spill to disk
  • Shuffle read/write
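As a rough illustration of how two of these metrics are derived, here is a minimal Python sketch. It is not Gradient's implementation: the `JobRun` record and all of its field names are hypothetical, the numbers are made up, and the core-hours formula assumes a fixed-size cluster and counts worker cores only.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    """One run of a job; every field name here is hypothetical."""
    dbu_cost_usd: float      # Databricks DBU charges for the run
    cloud_cost_usd: float    # cloud provider charges for the instances
    runtime_hours: float     # wall-clock runtime of the run
    num_workers: int         # worker nodes in the cluster
    cores_per_worker: int    # vCPUs on each worker


def total_cost(run: JobRun) -> float:
    """Total job cost = DBU charges + cloud fees."""
    return run.dbu_cost_usd + run.cloud_cost_usd


def core_hours(run: JobRun) -> float:
    """Core-hours, assuming a fixed-size cluster and counting workers only."""
    return run.num_workers * run.cores_per_worker * run.runtime_hours


# Made-up example values.
run = JobRun(dbu_cost_usd=12.0, cloud_cost_usd=18.0, runtime_hours=1.5,
             num_workers=4, cores_per_worker=8)
print(total_cost(run))   # 30.0
print(core_hours(run))   # 48.0
```

An autoscaling cluster would need the worker count integrated over time rather than a single number, which this sketch does not attempt.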

Why does job performance even change?

The three main reasons we see job performance change over time are:

  1. Code changes – Any significant code change can make your entire job behave differently. Tracking and understanding how new code changes affect the business output, however, is less straightforward. With these new visualizations, engineers can quickly see the “before and after” impact of any code changes they implement.

  2. Data size changes – When your data grows or shrinks, your job runtime (and hence cost) can change with it. While this makes sense, tracking how data size changes over time is more subtle: it may drift slowly, or it may spike with sudden changes. Understanding how your data size impacts your runtime is a critical “first check”.

  3. Spot instance revocation – When spot nodes are pulled at random during job execution, the runtime of your job can suffer significantly, because Spark essentially has to “recover” from losing a worker. We’ve seen runtimes go up 2-3X simply from 1 worker being pulled. The impact depends on the point in the job at which the Spot instance is pulled. Since this is basically random, runtimes on Spot clusters can vary wildly.


As a side note, we’ve observed that an optimized on-demand cluster can often beat Spot pricing for this very reason. Over the long haul, a stable and optimized on-demand cluster is better than a wildly varying Spot cluster.
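For the data size “first check” described above, one simple approach is to measure how strongly input size and runtime move together across recent runs. The sketch below computes a Pearson correlation in plain Python. The run values are made up for illustration, and this is not how Gradient computes anything.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values for seven consecutive runs of one job.
input_gb = [100, 105, 110, 120, 118, 130, 140]
runtime_min = [30, 31, 33, 36, 35, 39, 42]

r = pearson(input_gb, runtime_min)
print(f"correlation between input size and runtime: {r:.2f}")
```

A value close to 1 suggests runtime is mostly tracking data size. A low value points toward the other causes, such as code changes or Spot revocations.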

How does this feature differ from Overwatch?

For those in the know, Overwatch is a great open source tool built by Databricks that plots all sorts of metrics to help teams monitor their jobs. The main differences and advantages of the visualizations we show are:

1)  Total cost metrics – Gradient pulls both the DBU and estimated cloud costs for your clusters and shows you the total cost. Cost data from Databricks only includes DBUs. While it is possible to wrap in cloud costs with Overwatch, it’s a huge pain to set up and configure. Gradient does this “out of the box”.

2)  “Out of the box” ready – While Overwatch does technically contain the same data that Gradient shows, users still have to write queries that do all the aggregations properly, pulling from the task-level or stage-level tables as required. Overwatch is best considered a “data dump”: users have to wrangle it correctly and do all the dashboarding work to meet their needs. Our value add is that we do all this legwork for you and present the metrics from day 1.

3)  Trends over time – Gradient aggregates the data to show users how all these metrics are changing over time, so users can quickly understand what has changed recently. A single snapshot in time is often not very useful, as users need to see “what happened before?” to really understand what has changed and what they can do about it. While this is technically doable with Overwatch, it requires users to do the work of building and collecting the data. Gradient does this “out of the box”.

How does this help with stability and reliability?

Beyond cost efficiency, stability is often the highest priority. Nobody wants a crashed job. These metrics give engineers “early signals” that their cluster is headed toward a dangerous cliff. For example, data sizes may start growing beyond the memory of your cluster, which could cause the dreaded “out of memory” error.

Seeing how your performance is trending over time is critical information for preventing dangerous crashes.
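To make the “early signal” idea concrete, the sketch below fits a straight line to recent input sizes and estimates how many more runs it would take for the trend to reach a limit. Everything here is an assumption for illustration: the data is made up, the 256 GB limit is a hypothetical cluster memory figure, and comparing raw input size to memory is a crude proxy, since Spark can process data larger than memory by spilling.

```python
def linear_fit(ys):
    """Least-squares slope and intercept for ys observed at x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def runs_until_limit(ys, limit):
    """Estimated number of future runs before the fitted trend reaches `limit`.

    Returns None when the trend is flat or decreasing, and 0.0 when the
    fitted value for the latest run is already at or above the limit.
    """
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None
    last_x = len(ys) - 1
    crossing_x = (limit - intercept) / slope
    return max(0.0, crossing_x - last_x)


# Made-up daily input sizes (GB) and a hypothetical 256 GB memory limit.
input_gb = [200, 204, 208, 212, 216]
print(runs_until_limit(input_gb, limit=256))  # 10.0
```

A straight-line fit is the simplest possible choice. Spiky or seasonal data would need something more robust before the estimate could be trusted.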

Conclusion

We hope this new feature makes life a lot easier for all the data engineers and platform managers out there. This feature comes included with Gradient and is live today!

It probably goes without saying that this feature is in addition to our active optimization solution, which can auto-tune your clusters to hit your cost or runtime goals. The way we look at it, we’re expanding our value to our users by providing critical metrics.

We’d love your feedback and requests for what other metrics you’d love to see.  Try it out today or reach out for a demo!

February 2024 Release Notes

We’re excited to share all the new and improved features that our team has recently released to help our customers gain full governance over their Databricks infrastructure.

Databricks Workspace Integration
Introducing the Databricks Workspace Integration for Gradient. This new feature further simplifies connecting your Databricks Workspace to the Sync platform: you can complete the otherwise tedious setup in the Gradient UI, without using the Sync CLI.

To get started, head to the Integrations tab in your Sync dashboard. Here you’ll see a list that includes Databricks Workspace. Open the Add dropdown menu and select the Databricks Workspace option to start the integration flow.


Log in to Gradient to get started.

Project Reset Data
As users integrate their projects into Sync, they are often faced with sudden config changes. Project Reset is a capability built directly into the Sync platform that lets users perform a hard “reset” on the data for a project, which triggers the build of a new custom model for the related job.

Now available via the Sync API, coming soon to the Sync UI


With this new capability, a project reset does the following:

  • Resets historical logs
  • Returns the selected project to “learning” mode
  • Clears project graphs
  • Clears the project’s history table

{
  "result": [
    {
      "created_at": "2024-02-21T02:35:46.806Z",
      "updated_at": "2024-02-21T02:35:46.806Z",
      "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "name": "string",
      "app_id": "string",
      "cluster_path": "string",
      "job_id": "string",
      "workspace_id": "string",
      "workflow_id": "string",
      "creator_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "product_code": "aws-emr",
      "description": "string",
      "status": "Pending Setup",
      "cluster_log_url": "string",
      "prediction_preference": "performance",
      "auto_apply_recs": true,
      "prediction_params": {
        "sla_minutes": 0,
        "force_ondemand_workers": true,
        "fix_worker_family": true,
        "fix_driver_type": true,
        "fix_scaling_type": true
      },
      "tuned_cost": 0,
      "tuned_runtime": 0,
      "project_model_id": "UNASSIGNED",
      "metrics": {
        "job_success_rate_percent": 0,
        "sla_met_percent": 0
      },
      "latest_prediction_id": "string",
      "latest_prediction_created_at": "string",
      "creator": {
        "created_at": "2024-02-21T02:35:46.806Z",
        "updated_at": "2024-02-21T02:35:46.806Z",
        "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
        "sync_tenant_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
        "email": "string",
        "name": "string",
        "last_login": "string"
      },
      "phase": "LEARNING",
      "optimize_instance_size": true,
      "project_periodicity_type": "DAILY_SINE",
      "product_name": "string"
    }
  ]
}
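
If you retrieve a payload shaped like the example above, a few lines of Python are enough to check each project's phase and status. This is only a sketch of parsing the response body: the source does not say which endpoint returns it, so no request code is shown, the `summarize_projects` helper is hypothetical, and reading `"phase": "LEARNING"` as the “learning” mode mentioned above is an assumption.

```python
import json

# A trimmed copy of the example response above; only a few fields are kept.
response_text = """
{
  "result": [
    {
      "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "name": "string",
      "status": "Pending Setup",
      "phase": "LEARNING",
      "auto_apply_recs": true,
      "prediction_params": {"sla_minutes": 0}
    }
  ]
}
"""


def summarize_projects(text):
    """Return (id, phase, status) for each project in a response body."""
    payload = json.loads(text)
    return [(p["id"], p.get("phase"), p.get("status"))
            for p in payload.get("result", [])]


for project_id, phase, status in summarize_projects(response_text):
    print(project_id, phase, status)
```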


User Management
With User Management, you’re able to take a hands-on approach to managing your users in Gradient. With this feature, account owners can:

  • Add a user
  • Deactivate a user
  • Assign a specific role to a user

Stay up to date with the latest feature releases and updates at Sync by visiting our Product Updates documentation.

Ready to start getting the most out of your Databricks job clusters? Reach out to us at info@synccomputing.com.

January 2024 Release Notes

Exciting things are happening at Sync as we move further into the new year!

Ensuring that our users are equipped with the tools to fully manage the automation of their infrastructure is always top of mind. With the most recent iteration of Gradient, Sync users can take advantage of a toolkit that makes optimizing Databricks clusters even easier.

Here’s what’s new in the latest version of Gradient:

Org Settings

Org Settings is now available in the main navigation bar of the Sync Dashboard. Navigate to the Org Settings tab to find a consolidated view of your personal user information, a comprehensive list of API keys, and a list of organization users with their user details, directly in the Sync UI.
