Releases

Gradient Product Update: Discover, Monitor, and Automate Databricks Clusters

Jeffrey Chou
06.06.2024

With the Databricks Data+AI Summit 2024 just around the corner, we of course had to have a major product launch to go with it!

We’re super excited to announce an entirely new user flow and new product features that make it faster to get started and provide a more comprehensive management solution. At a high level, the expansion introduces three features:

Discover – Find new jobs to optimize
Monitor – Track all job costs and metrics over time
Automate – Auto-apply to save time, costs, and hit SLAs

With ballooning Databricks costs and constrained budgets, Databricks efficiency is crucial for any company’s sustainable growth. However, optimizing Databricks clusters is a difficult and time-consuming task riddled with low-level complexity and red tape.

Our goal with Gradient is to make it as easy and painless as possible to identify jobs to optimize, track overall ROI, and automate the optimization process.

The last automation piece is what sets Gradient apart. Gradient is designed to optimize at scale, for companies that have 100+ production jobs. At that scale, automation is a must, and is where we shine. Gradient provides a new level of efficiency unobtainable with any other tool on this planet.

With automatic cluster management, engineering teams are free to pursue more important business goals while Gradient works around the clock.

Let’s drill a bit deeper into what these new features are:

Discover

Find your top jobs to optimize and discover new opportunities to improve your efficiency even more. This page is refreshed daily, so you always get up-to-date insights and historical tracking.

How to get started – Simply enter your Databricks credentials and click go! You can be up and running from scratch in less than a minute.

What is shown:

Top jobs to optimize with Gradient
Jobs with Photon enabled
Jobs with Autoscaling enabled
All-purpose compute jobs
Jobs with no job ID (meaning they could come from an external orchestrator like Airflow)
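To make these categories concrete, here is a minimal Python sketch of how run records could be bucketed. It is not how Gradient builds the Discover page. The flat record layout is a hypothetical simplification, and the fields "runtime_engine", "autoscale", and "existing_cluster_id" are assumed to follow the Databricks cluster and task specs.

```python
from typing import Any


def classify_run(run: dict[str, Any]) -> dict[str, bool]:
    """Flag one run record with the categories listed above.

    Assumption: each record is a plain dict shaped loosely like a
    Databricks run submission; this layout is illustrative only.
    """
    cluster = run.get("new_cluster") or {}
    return {
        "photon": cluster.get("runtime_engine") == "PHOTON",
        "autoscaling": "autoscale" in cluster,
        # Assumption: a run that targets an existing cluster is on all-purpose compute.
        "all_purpose_compute": "existing_cluster_id" in run,
        # Runs without a job ID may come from an external orchestrator such as Airflow.
        "no_job_id": run.get("job_id") is None,
    }


# Hypothetical sample data
runs = [
    {"job_id": 101, "new_cluster": {"runtime_engine": "PHOTON", "num_workers": 4}},
    {"job_id": 102, "new_cluster": {"autoscale": {"min_workers": 2, "max_workers": 8}}},
    {"job_id": None, "existing_cluster_id": "0606-example"},
]

flags = ("photon", "autoscaling", "all_purpose_compute", "no_job_id")
summary = {flag: sum(classify_run(r)[flag] for r in runs) for flag in flags}
print(summary)
```

Counting flags per category, as in the summary above, is enough to see roughly how a workspace splits across these buckets.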

To see how fast and easy it is to get the Discover page up and running, check out the video below:

Monitor

Track Spark metrics and costs for all of your Gradient-managed jobs in a single pane of glass. Use this bird’s-eye view to follow every job and track the overall ROI of Gradient with the “Total Savings” view.

How to get started – Onboard your Databricks workspace on the integration page. This may require help from your DevOps team, as various cloud permissions are required.

What is shown:

Total core hours
Total Spend
Total recommendations applied
Total cost savings
Total estimated developer time saved
Total number of projects
Number of SLAs met
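As a rough illustration of what these totals mean, the sketch below rolls a few per-run records up into the same kind of summary. The RunRecord fields, including the pre-optimization baseline cost, are hypothetical and chosen only for this example; Gradient’s own calculations may differ.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    cores: int
    runtime_hours: float
    cost_usd: float
    baseline_cost_usd: float  # hypothetical: cost of the same run before optimization
    sla_hours: float


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate per-run records into dashboard-style totals."""
    return {
        "total_core_hours": sum(r.cores * r.runtime_hours for r in runs),
        "total_spend": sum(r.cost_usd for r in runs),
        "total_savings": sum(r.baseline_cost_usd - r.cost_usd for r in runs),
        "slas_met": sum(r.runtime_hours <= r.sla_hours for r in runs),
    }


# Hypothetical sample data
runs = [
    RunRecord(cores=16, runtime_hours=2.0, cost_usd=10.0, baseline_cost_usd=14.0, sla_hours=2.5),
    RunRecord(cores=32, runtime_hours=1.5, cost_usd=18.0, baseline_cost_usd=20.0, sla_hours=1.0),
]

totals = summarize(runs)
print(totals)
```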

Automate

Enable auto-apply to automatically optimize your Databricks jobs clusters to hit your cost and runtime goals. Save time and money with automation.

How to get started – Onboard your Databricks workspace on the integration page (no need to repeat this if you already did it above).

What is shown:

Job costs over time
Job runtime over time
Job configuration parameters
Cluster configurations
Spark metrics
Input data size

Conclusion

Get started in a minute with the Discover page and start finding new opportunities to optimize your Databricks environment. Log in to get started!

Or, if you’d prefer a hands-on demo, we’d be happy to chat. Schedule a demo here.

May 2024 Release Notes

Jeffrey Chou
05.29.2024

April showers bring May product updates! Take a look at Sync’s latest product releases and features. 💐

The Sync team is heading to San Francisco for the Databricks Data+AI Summit 2024! We’ll be at Booth #44 talking all things Gradient with a few new surprise features in store.

Want to get ahead of the crowd? Book a meeting with our team before the event here.

Download our Databricks health check notebook

Have you taken advantage of our fully customizable health check notebook yet?

With the notebook, you’ll be able to answer questions such as:
⚙️ What is the distribution of job runs by compute type?
⚙️ What does Photon usage look like?
⚙️ What are the most frequently used instance types?
⚙️ Are APC clusters being auto-terminated or sitting idle?
⚙️ What are my most expensive jobs?

The best part? It’s a free tool that gives you actionable insights so you can work toward optimally managing your Databricks jobs clusters.
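For a feel of the kind of analysis behind these questions, here is a small Python sketch. It is not the contents of the Sync notebook. The record layout and the "compute_type" field are hypothetical, while "node_type_id" and "autotermination_minutes" are assumed to follow the Databricks cluster spec, where an auto-termination value of 0 means the cluster never shuts itself down.

```python
from collections import Counter

# Hypothetical sample data: one record per job run
runs = [
    {"compute_type": "jobs", "node_type_id": "i3.xlarge"},
    {"compute_type": "jobs", "node_type_id": "i3.xlarge"},
    {"compute_type": "all_purpose", "node_type_id": "m5.2xlarge", "autotermination_minutes": 0},
    {"compute_type": "all_purpose", "node_type_id": "i3.xlarge", "autotermination_minutes": 30},
]

# Distribution of job runs by compute type
by_compute_type = Counter(r["compute_type"] for r in runs)

# Most frequently used instance type
top_instance_types = Counter(r["node_type_id"] for r in runs).most_common(1)

# All-purpose compute (APC) clusters that never auto-terminate and may sit idle
never_terminating = [
    r for r in runs
    if r["compute_type"] == "all_purpose" and r.get("autotermination_minutes") == 0
]

print(by_compute_type, top_instance_types, len(never_terminating))
```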

Head here to get started.

Apache Airflow Integration

Gradient now integrates directly with Apache Airflow for Databricks. Via the Sync Python library, users can integrate their Databricks pipelines with Gradient even when they run them through third-party tools like Airflow.

To get started, simply integrate your Databricks Workspace with Gradient via the Databricks Workspace Integration. Then configure your Airflow instance and ensure that the syncsparkpy library has been installed using the Sync CLI.

Take a look at an example Airflow DAG below:

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from airflow.utils.dates import days_ago

# Gradient hook, passed to the operator's pre_execute argument below
from sync.databricks.integrations.airflow import airflow_gradient_pre_execute_hook

default_args = {
    'owner': 'airflow'
}

with DAG(
    dag_id='gradient_databricks_multitask',
    default_args=default_args,
    schedule_interval=None,  # no schedule; the DAG is triggered manually
    start_date=days_ago(2),
    tags=['demo'],
    # Gradient-related settings for this DAG
    params={
        'gradient_app_id': 'gradient_databricks_multitask',
        'gradient_auto_apply': True,
        'cluster_log_url': 'dbfs:/cluster-logs',
        'databricks_workspace_id': '10295812058'
    }
) as dag:

    def get_task_params():
        # Cluster and notebook settings submitted to Databricks for this task
        task_params = {
            "new_cluster": {
                "node_type_id": "i3.xlarge",
                "driver_node_type_id": "i3.xlarge",
                "custom_tags": {},
                "num_workers": 4,
                "spark_version": "14.0.x-scala2.12",
                "runtime_engine": "STANDARD",
                "aws_attributes": {
                    "first_on_demand": 0,
                    "availability": "SPOT_WITH_FALLBACK",
                    "spot_bid_price_percent": 100
                }
            },
            "notebook_task": {
                "notebook_path": "/Users/pete.tamisin@synccomputing.com/gradient_databricks_multitask",
                "source": "WORKSPACE"
            }
        }

        return task_params

    # The Gradient hook runs before the task executes
    notebook_task = DatabricksSubmitRunOperator(
        pre_execute=airflow_gradient_pre_execute_hook,
        task_id="notebook_task",
        dag=dag,
        json=get_task_params(),
    )

    # Single task, so there are no dependencies to declare
    notebook_task

And voilà! After implementing your DAG, head to the Projects dashboard in Gradient to review recommendations and make any necessary changes to your cluster config.

Take a look at our documentation to get started.

April 2024 Release Notes

Jeffrey Chou
04.29.2024

Our April releases are here! Take a look at Sync’s latest product updates and features.

Sync’s Databricks Workspace health check is now self-serve and available as a notebook that you simply download and run on your own.

With the notebook, you’ll be able to answer questions such as:

⚙️ What is the distribution of job runs by compute type?
⚙️ What does Photon usage look like?
⚙️ What are the most frequently used instance types?
⚙️ Are APC clusters being auto-terminated or sitting idle?
⚙️ What are my most expensive jobs?

The best part? It’s a free tool that gives you actionable insights so you can work toward optimally managing your Databricks jobs clusters. Head here to get started.

Hosted Log Collection for Microsoft Azure

You’re now able to easily onboard your Databricks jobs on Azure. With Sync-hosted collection within Gradient, users are able to minimize onboarding errors with a “low-touch” integration process.

Want to give new features a try and learn more about the latest Gradient updates? Get started for free here.

Job Metrics Timeline View

Track custom Spark and Gradient metrics for your projects directly from the Gradient dashboard. With this enhanced view, you’re able to visualize metrics like core hours, number of workers, input data, and more!

Why is your Databricks job performance changing over time?

Jeffrey Chou
04.11.2024

Anyone running and tracking production Databricks jobs has likely seen “random” fluctuations in runtime, or performance that slowly drifts over days and weeks.

The immediate questions are usually:

“Why is my runtime increasing since last week?”
“Is the cost of this job also increasing?”
“Is the input data size changing?”
“Is my job spilling to disk more than before?”
“Is my job in danger of crashing?”

To give engineers and managers more visibility into how their production jobs are performing over time, we just launched a new visualization feature in Gradient designed to provide quick answers.

A snapshot of the new visualizations is shown below, where on a single page you can see and correlate all the various metrics that may be impacting your cost and runtime performance. In the visualization below, we track the following parameters:

Job cost (DBU + Cloud fees)
Job Runtime
Number of core hours
Number of workers
Input data size
Spill to disk
Shuffle read/write

Why does job performance even change?

The three main reasons we see job performance change over time are:

Code changes – Obviously, any significant code change can make your entire job behave differently. What is less clear is how to track and understand the impact of those changes on business output. With these new visualizations, engineers can quickly see the “before and after” impact of any code changes they implement.

Data size changes – When your data increases or decreases, it can impact your job runtime (and hence costs). While this makes sense, seeing how it changes over time is much more subtle. The data size may vary slowly, or it may be very spiky, with sudden changes. Understanding how your data size impacts your runtime is a critical “first check.”

Spot instance revocation – When Spot nodes are randomly pulled during job execution, the impact on runtime can be significant, since Spark essentially has to “recover” from losing a worker. We’ve seen runtimes go up 2-3X simply from 1 worker being pulled. How large the impact is depends on the point in the job at which the Spot instance is pulled. Since this is basically random, the runtimes of jobs on Spot instances can vary wildly.

As a side note, we’ve observed that an optimized on-demand cluster can often beat Spot pricing for this very reason. Over the long haul, a stable and optimized on-demand cluster is better than a wildly varying Spot cluster.
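The data size “first check” described above can be sketched in a few lines of Python. This is a minimal illustration with made-up numbers, not Gradient’s implementation: it computes the Pearson correlation between input size and runtime across recent runs, where a value close to 1 suggests that runtime changes are mostly tracking data volume.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical history: one value per run, oldest first
input_gb = [100, 105, 110, 180, 115, 120]
runtime_min = [30, 31, 33, 55, 34, 36]

r = pearson(input_gb, runtime_min)
print(f"correlation between input size and runtime: {r:.2f}")
```

If the correlation is weak, the runtime change more likely comes from one of the other causes, such as a code change or a revoked Spot worker.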

How does this feature differ from Overwatch?

For those in the know, Overwatch is a great open source tool built by Databricks to help plot all sorts of metrics to help teams monitor their jobs. The main differences and advantages of the visualizations we show are:

1) Total cost metrics – Gradient pulls both the DBU and estimated cloud costs for your clusters and shows you the total cost. Cost data from Databricks only includes DBUs. While it is possible to wrap in cloud costs with Overwatch, it’s a huge pain to set up and configure. Gradient does this “out of the box.”

2) Ready “out of the box” – While Overwatch does technically contain the same data that Gradient shows, users still have to write queries to do all the aggregations properly, pulling the task-level or stage-level tables as required. Overwatch is best considered a “data dump” that users then have to wrangle correctly, doing all the dashboarding work in a way that meets their needs. Our value add is that we do all this legwork for you and present the metrics from day 1.

3) Trends over time – Gradient aggregates the data to show users how all these metrics are changing over time, so they can quickly understand what has changed recently. A single snapshot in time is often not very useful, as users need to see “what happened before?” to really understand what has changed and what they can do about it. While this is technically doable with Overwatch, it requires users to build and collect the data themselves. Gradient does this “out of the box.”
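To illustrate the total cost idea in point 1, here is a minimal Python sketch that adds the DBU charge to the cloud instance charge for a single run. The function name, its parameters, and the rates are all hypothetical placeholders; real DBU and instance prices depend on your plan, cloud, and region.

```python
def total_job_cost(
    dbus_consumed: float,
    dbu_rate_usd: float,
    instance_hours: float,
    instance_rate_usd: float,
) -> float:
    """Total cost of a run: Databricks DBU charge plus cloud instance charge."""
    dbu_cost = dbus_consumed * dbu_rate_usd
    cloud_cost = instance_hours * instance_rate_usd
    return dbu_cost + cloud_cost


# Hypothetical run: 40 DBUs at $0.25 per DBU, plus 10 instance hours at $0.50 per hour
cost = total_job_cost(
    dbus_consumed=40,
    dbu_rate_usd=0.25,
    instance_hours=10,
    instance_rate_usd=0.50,
)
print(f"total cost: ${cost:.2f}")
```

In this made-up example the cloud charge is a third of the total, which is the portion a DBU-only view would leave out.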

How does this help with stability and reliability?

Beyond cost efficiency, stability is often a higher priority than anything else. Nobody wants a crashed job. These metrics give engineers “early signals” if their cluster is headed towards a dangerous cliff. For example, data sizes may start growing beyond the memory of your cluster, which could cause the dreaded “out of memory” error.

Seeing how your performance is trending over time is a critical piece of information users need to help prevent dangerous crashes from happening.
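One way to turn such a trend into an early signal is to fit a straight line to recent input sizes and estimate how many runs remain before the trend crosses a limit. The sketch below is a deliberately crude heuristic with hypothetical numbers, not a Gradient feature: comparing input size directly against cluster memory ignores compression, partitioning, and spill behavior, and data growth is rarely perfectly linear.

```python
from typing import Optional


def linear_fit(ys: list[float]) -> tuple[float, float]:
    """Least-squares slope and intercept for ys against run index 0..n-1."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    slope = num / den
    return slope, mean_y - slope * mean_x


def runs_until_limit(ys: list[float], limit: float) -> Optional[float]:
    """Estimated runs left before the fitted trend reaches the limit.

    Returns None when the trend is flat or decreasing.
    """
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None
    crossing_index = (limit - intercept) / slope
    return max(0.0, crossing_index - (len(ys) - 1))


# Hypothetical history: input size per run in GB, oldest first
input_gb = [100, 110, 120, 130, 140]
cluster_memory_gb = 200  # hypothetical total cluster memory

remaining = runs_until_limit(input_gb, cluster_memory_gb)
print(f"estimated runs before input exceeds memory: {remaining}")
```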

Conclusion

We hope this new feature makes life a lot easier for all the data engineers and platform managers out there. This feature comes included with Gradient and is live today! We’d love your feedback.

It probably goes without saying that this feature is in addition to our active optimization solution, which can auto-tune your clusters to hit your cost or runtime goals. The way we look at it, we’re expanding our value to our users by providing critical metrics.

We’d love your feedback and requests for what other metrics you’d love to see. Try it out today or reach out for a demo!

February 2024 Release Notes

Jeffrey Chou
02.21.2024

We’re excited to share all the new and improved features that our team has recently released to help our customers gain full governance over their Databricks infrastructure.

Databricks Workspace Integration
Introducing the Databricks Workspace Integration for Gradient. With this new feature, you’re able to further simplify the process of connecting your Databricks Workspace to the Sync platform. The otherwise tedious setup can now be completed directly in the Gradient UI, without the use of the Sync CLI.

To get started, head to the integrations tab in your Sync dashboard. Here you’ll see a list that includes Databricks Workspace. Navigate to the Add dropdown menu and click on the Databricks Workspace dropdown option to trigger the integration flow.

Project Reset Data
As users integrate their projects into Sync, they are often faced with sudden config changes. Project Reset is a capability built directly into the Sync platform in which users will be able to perform a hard “reset” on the data for a project, ultimately triggering the build of a new custom model for the related job.

*Now available via the Sync API, coming soon to the Sync UI*

With this new capability, a project reset does the following:

Clears the project’s historical logs
Resets the selected project back to “learning” mode
Clears project graphs
Clears the project’s history table

Successful response

{
  "result": [
    {
      "created_at": "2024-02-21T02:35:46.806Z",
      "updated_at": "2024-02-21T02:35:46.806Z",
      "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "name": "string",
      "app_id": "string",
      "cluster_path": "string",
      "job_id": "string",
      "workspace_id": "string",
      "workflow_id": "string",
      "creator_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "product_code": "aws-emr",
      "description": "string",
      "status": "Pending Setup",
      "cluster_log_url": "string",
      "prediction_preference": "performance",
      "auto_apply_recs": true,
      "prediction_params": {
        "sla_minutes": 0,
        "force_ondemand_workers": true,
        "fix_worker_family": true,
        "fix_driver_type": true,
        "fix_scaling_type": true
      },
      "tuned_cost": 0,
      "tuned_runtime": 0,
      "project_model_id": "UNASSIGNED",
      "metrics": {
        "job_success_rate_percent": 0,
        "sla_met_percent": 0
      },
      "latest_prediction_id": "string",
      "latest_prediction_created_at": "string",
      "creator": {
        "created_at": "2024-02-21T02:35:46.806Z",
        "updated_at": "2024-02-21T02:35:46.806Z",
        "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
        "sync_tenant_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
        "email": "string",
        "name": "string",
        "last_login": "string"
      },
      "phase": "LEARNING",
      "optimize_instance_size": true,
      "project_periodicity_type": "DAILY_SINE",
      "product_name": "string"
    }
  ]
}
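As a small usage sketch, the Python below parses a response shaped like the one above and checks which projects are back in the learning phase. It uses only fields that appear in the sample response, trimmed for brevity; how the request itself is made is not shown here, and the helper name is hypothetical.

```python
import json

# Trimmed copy of the sample response above
response_body = """
{
  "result": [
    {
      "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "name": "string",
      "status": "Pending Setup",
      "phase": "LEARNING",
      "auto_apply_recs": true
    }
  ]
}
"""


def projects_in_learning(body: str) -> list[str]:
    """Return the IDs of projects whose phase is LEARNING."""
    payload = json.loads(body)
    return [
        project["id"]
        for project in payload.get("result", [])
        if project.get("phase") == "LEARNING"
    ]


learning_ids = projects_in_learning(response_body)
print(learning_ids)
```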

User Management
With User Management, you’re able to take a hands-on approach to managing your users in Gradient. With this feature, account owners can:

Add a user
Deactivate a user
Assign a specific role to a user

Stay up to date with the latest feature releases and updates at Sync by visiting our Product Updates documentation.

Ready to start getting the most out of your Databricks job clusters? Reach out to us at info@synccomputing.com.

January 2024 Release Notes

Jeffrey Chou
01.31.2024

Exciting things are happening at Sync as we move further into the new year!

Ensuring that our users are equipped with the tools to fully manage the automation of their infrastructure is always top of mind. With the most recent iteration of Gradient, Sync users can take advantage of a toolkit that makes optimizing Databricks clusters even easier.

Here’s what’s new in the latest version of Gradient:

Org Settings

Org Settings is now available in the main navigation bar in the Sync Dashboard. Users are able to navigate to the Org Settings tab to find personal user information, a comprehensive list of API keys, and a list of organization users with their user details.

With Org Settings, users will see a consolidated list of personal information, API keys, and account users directly in the Sync UI.

Stay up to date with the latest feature releases and updates at Sync by visiting our Product Updates documentation. Ready to start getting the most out of your Databricks job clusters? Reach out to us at info@synccomputing.com.