
How to Use the Gradient CLI Tool to Optimize Databricks / EMR Programmatically

Introduction:

The Gradient Command Line Interface (CLI) is a powerful yet easy-to-use utility for automating the optimization of your Spark jobs from your terminal, command prompt, or automation scripts.

Whether you are a Data Engineer, SysDevOps administrator, or just an Apache Spark enthusiast, knowing how to use the Gradient CLI can be incredibly beneficial: it can dramatically reduce the cost of your Spark workloads while helping you hit your pipeline SLAs.

If you are new to Gradient, you can learn more about it in the Sync Docs. In this tutorial, we’ll walk you through the Gradient CLI’s installation process and give you some examples of how to get started. This is meant to be a tour of the CLI’s overall capabilities. For an end-to-end recipe on how to integrate with Gradient, take a look at our Quick Start and Integration Guides.

Pre Work

This tutorial assumes that you have already created a Gradient account and generated your Sync API keys. If you haven’t generated your key yet, you can do so on the Accounts tab of the Gradient UI.

Step 1: Setting up your Environment

Let’s start by making sure our environment meets all the prerequisites. The Gradient CLI is actually part of the Sync Library, which requires Python v3.7 or above and which only runs on Linux/Unix based systems.

python --version
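If you would rather run this prerequisite check from a script, a minimal Python sketch could look like the following. The function name meets_prerequisites is illustrative and not part of the Sync Library, and treating only Linux and macOS as supported is an assumption based on the Linux/Unix requirement stated above.

```python
import platform
import sys

MIN_PYTHON = (3, 7)  # minimum Python version stated in this tutorial


def meets_prerequisites(version_info=None, system=None):
    """Return True when the interpreter and OS match the tutorial's stated requirements.

    Both arguments are optional overrides, which makes the check easy to test.
    """
    version = tuple(version_info or sys.version_info)[:2]
    system = system or platform.system()
    # platform.system() reports "Linux" on Linux and "Darwin" on macOS.
    # Assumption: other Unix-like systems are not covered by this sketch.
    return version >= MIN_PYTHON and system in ("Linux", "Darwin")


if __name__ == "__main__":
    print("OK" if meets_prerequisites() else "Unsupported environment")
```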

I am on a Mac running Python 3.10, so I am good to go. Before we get started, though, I am going to create a Python virtual environment with venv. This is good practice whenever you install a new Python tool, as it avoids conflicts between projects and makes environment management simpler. For this example, I am creating a virtual environment called gradient-cli that will reside under the ~/VirtualEnvironments path.

python -m venv ~/VirtualEnvironments/gradient-cli

Step 2: Install the Sync Library

Once you’ve confirmed that your system meets the prerequisites, it’s time to install the Sync Library. Start by activating your new virtual environment.

source ~/VirtualEnvironments/gradient-cli/bin/activate

Next use the pip package installer to install the latest version of the Sync Library.

pip install https://github.com/synccomputingcode/syncsparkpy/archive/latest.tar.gz

You can confirm that the installation was successful by running the CLI executable with the --version or --help option.

sync-cli --help

Step 3: Configure the Sync Library

Configuring the CLI with your credentials and preferences is the final step for the installation and setup for the Sync CLI. To do this, run the configure command:

sync-cli configure

You will be prompted for the following values:

Sync API key ID:

Sync API key secret:

Default prediction preference (performance, balanced, economy) [economy]:

Would you like to configure a Databricks workspace? [y/n]:

Databricks host (prefix with https://):

Databricks token:

Databricks AWS region name:

As noted in the Pre Work section, your Sync API key and secret are found on the Accounts tab of the Gradient UI. This tutorial runs on Databricks, so you will also need to provide your Databricks workspace host and an access token.


Databricks recommends that you set up a service principal for automation tasks. As noted in their docs, service principals give automated tools and scripts API-only access to Databricks resources, providing greater security than using users or groups.

These values are stored in ~/.sync/config.

Congrats! You are now ready to interact with Gradient from your terminal, command prompt, or automation scripts.

Step 4: Example Uses

Below are some tasks you can complete using the CLI. This is useful when you want to automate Gradient processes and incorporate them into larger workflows.

Projects

All Gradient recommendations are stored in Projects. Projects are associated with a single Spark job or a group of jobs running on the same cluster. Here are some useful commands you can use to manage your projects with the CLI. For an exhaustive list of commands, use the --help option.

Project Commands:

create – Create a project

sync-cli projects create --description [TEXT] --job-id [Databricks Job ID] PROJECT_NAME

delete – Delete a project

sync-cli projects delete PROJECT_ID

get – Get info on a project

sync-cli projects get PROJECT_ID

list – List all projects for account

sync-cli projects list
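Since the CLI is meant to be called from automation scripts, one way to drive the project commands from Python is to build the argument list and hand it to subprocess. The sketch below uses only the flags shown above; the helper names projects_create_cmd and run_cli are hypothetical, and the format of the CLI's output is not assumed.

```python
import subprocess


def projects_create_cmd(name, description=None, job_id=None):
    """Build the argument list for `sync-cli projects create`."""
    cmd = ["sync-cli", "projects", "create"]
    if description:
        cmd += ["--description", description]
    if job_id is not None:
        cmd += ["--job-id", str(job_id)]
    cmd.append(name)  # PROJECT_NAME is the positional argument
    return cmd


def run_cli(cmd):
    """Run a CLI command and return its raw stdout.

    check=True raises CalledProcessError when the command exits non-zero.
    """
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Calling run_cli(projects_create_cmd("nightly-etl", job_id=123)) would then create the project, assuming sync-cli is installed and configured in the active environment.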

Predictions

You can also use the CLI to manage, generate and retrieve predictions. This is useful when you want to automate the implementation of recommendations within your Databricks or EMR environments.

Prediction commands:

get – Retrieve a specific prediction

sync-cli predictions get --preference [performance|balanced|economy] PREDICTION_ID

list – List all predictions for account or project

sync-cli predictions list --platform [aws-emr|aws-databricks] --project TEXT

status – Get the status of a previously initiated prediction

sync-cli predictions status PREDICTION_ID

The CLI also provides platform specific commands to generate and retrieve predictions.

Databricks

For Databricks you can generate a recommendation for a previously completed job run with the following command:

sync-cli aws-databricks create-prediction --plan [Standard|Premium|Enterprise] --compute ['Jobs Compute'|'All Purpose Compute'] --project [Your Project ID] RUN_ID

If the run you provided was not configured with the Gradient agent when it executed, you can still generate a recommendation, but the basis metrics may be missing some time-sensitive information that is no longer available. To enable evaluation of logs from prior runs executed without the Gradient agent, add the --allow-incomplete-cluster-report option. To avoid this issue altogether, implement the agent and re-run the job.

Alternatively, you can use the following command to run the job and request a recommendation with a single command:

sync-cli aws-databricks run-job --plan [Standard|Premium|Enterprise] --compute ['Jobs Compute'|'All Purpose Compute'] --project [Your Project ID] JOB_ID

This method is useful in cases when you are able to manually run your job without interfering with scheduled runs.

Finally, to implement a recommendation and run the job with the new configuration, you can issue the following command:

sync-cli aws-databricks run-prediction --preference [performance|balanced|economy] JOB_ID PREDICTION_ID
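When these commands are generated by a script, it can help to validate the option values before calling the CLI. The following is a rough sketch, assuming the allowed values are exactly those listed in the commands above; the function name create_prediction_cmd is hypothetical.

```python
PLANS = ("Standard", "Premium", "Enterprise")
COMPUTE_TYPES = ("Jobs Compute", "All Purpose Compute")


def create_prediction_cmd(run_id, project_id, plan, compute, allow_incomplete=False):
    """Build the argument list for `sync-cli aws-databricks create-prediction`."""
    if plan not in PLANS:
        raise ValueError(f"plan must be one of {PLANS}, got {plan!r}")
    if compute not in COMPUTE_TYPES:
        raise ValueError(f"compute must be one of {COMPUTE_TYPES}, got {compute!r}")
    cmd = [
        "sync-cli", "aws-databricks", "create-prediction",
        "--plan", plan,
        "--compute", compute,
        "--project", project_id,
    ]
    if allow_incomplete:
        # For runs that executed without the Gradient agent
        cmd.append("--allow-incomplete-cluster-report")
    cmd.append(str(run_id))  # RUN_ID is the positional argument
    return cmd
```

The resulting list can be passed to subprocess.run in the same way as any other command.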

EMR

Similarly, for Spark on EMR, you can generate a recommendation for a previously completed job. EMR does not have the same issue with ephemeral cost data becoming unavailable, so you can request a recommendation on a previous run without the Gradient agent. Use the following command to do so:

sync-cli aws-emr create-prediction --region [Your AWS Region] CLUSTER_ID

If you want to manually rerun the EMR job and immediately request a Gradient recommendation, use the following command:

sync-cli aws-emr record-run --region [Your AWS Region] CLUSTER_ID PROJECT

To execute the EMR job using the recommended configuration, use the following command:

sync-cli aws-emr run-prediction --region [Your AWS Region] PREDICTION_ID

Products

Gradient is constantly working on adding support for new data engineering platforms. To see which platforms are supported by your version of the CLI, you can use the following command:

sync-cli products

Configuration

Should you ever need to update your CLI configuration, you can run configure again to change one or more of your values.

sync-cli configure --api-key-id TEXT --api-key-secret TEXT --prediction-preference TEXT --databricks-host TEXT --databricks-token TEXT --databricks-region TEXT

Token

The token command returns an access token that you can use against our REST API with clients like Postman.

sync-cli token

Conclusion

With these simple commands, you can automate the end-to-end optimization of all your Databricks or EMR workloads, dramatically reducing your costs and improving performance. For more information, refer to our developer docs or reach out to us at info@synccomputing.com.

Integrating Gradient into Apache Airflow

Summary

In this blog post, we’ll explore how you can integrate Sync’s Gradient with Airflow. We’ll walk through the steps to create a DAG that will submit a run to Databricks, and then make a call through Sync’s library to generate a recommendation for an optimized cluster for that task. This DAG example can be used to automate the process of requesting recommendations for tasks that are submitted as jobs to Databricks.

A Common Use Case And Its Challenges

Use Case:

A common implementation of Databricks within Airflow consists of using the DatabricksSubmitRunOperator to submit a pre-configured notebook to Databricks.

Challenges:

  • Due to orchestration outside of Databricks’ ecosystem, these jobs are reflected as one-time runs
  • It’s difficult to track cluster performance across multiple runs
  • This is exacerbated by the fact that a DAG can have multiple tasks that submit these one-off ‘jobs’ to Databricks.

How Can We Fix This?

We’ll set up a PythonOperator that uses Sync’s Library to generate recommendations we can view in Gradient’s UI. From there we can see which changes to make to our cluster definitions to reduce cost. Let’s dive in!

Preparing Your Airflow Environment

Prerequisites

  • Airflow (This tutorial uses 2.0+)
  • Python 3.7+
  • Sync Library installed and environment variables configured on the Airflow instance (details below)
  • An S3 path you would like to use for cluster logs – your Databricks ARN will need access to this path so it can save the cluster logs there.
  • An account with Sync and a Project created to track the task you would like to optimize.

Sync Account Setup And Library Installation

Quick start instructions on how to create an account, create a project, and install the Sync Library can be found here. Please configure the CLI on your Airflow instance. When going through the configuration steps, be sure to choose yes when prompted to configure the Databricks variables.

Note: In the quickstart above, there are instructions on using an init script. Copy the contents of the init script into a file on a shared or personal workspace accessible by the account the Databricks job will run as.

Variables

Certain variables are generated and stored during installation of the Sync Library: the Sync API key ID and secret, and the Databricks host, token, and region that you enter when running sync-cli configure.

Besides the variables generated by the library, you’ll need the following environment variables. These are necessary to use the AWS API to retrieve cluster logs when requesting a prediction. DBFS is supported; however, it is not recommended as it goes against Databricks’ best practices. As mentioned in the quick start, it’s best to set these via the AWS CLI.

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_DEFAULT_REGION

Cluster Configuration

Referring back to our common use case, a static cluster configuration is often either defined within the DAG or built dynamically within a helper function that returns the cluster dictionary to be passed into the DatabricksSubmitRunOperator. In preparation for the first run, some specific cluster details need to be configured.

What are we adding?

  • cluster_log_conf: an S3 path to send our cluster logs to. These will be used to generate an optimized recommendation
  • custom_tags: the sync:project-id tag is added so we can assign the run to a Sync project
  • init_scripts: identifies the init script path that we copied into our Databricks workspace during the quick start setup
  • spark_env_vars: environment variables passed to the cluster that the init script will use. Note: the retrieval of tokens/keys in this tutorial is simplified to use the information configured during the sync-cli setup process. Passing them in this manner will result in tokens being visible in plaintext when viewing the cluster in Databricks. Please use Databricks Secrets when productionalizing this code.

The rest of the cluster configuration dictionary comprises the typical settings you normally pass into the DatabricksSubmitRunOperator.

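import os  # needed for the os.environ lookups below

from sync.config import DatabricksConf as sync_databricks_conf
from sync.config import get_api_key

# "..." marks settings omitted from this example
cluster_config = {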
    "spark_version": "13.0.x-scala2.12",
    ...
    "cluster_log_conf": {
        "s3": {
            "destination": "", # Add the s3 path for the cluster logs
            "enable_encryption": True,
            "region": "", # Add your aws region ie: us-east-1
            "canned_acl": "bucket-owner-full-control",
        }
    },
    "custom_tags": {"sync:project-id": "",}, # Add the project id from Gradient
    ...
    "init_scripts": [
        {"workspace": {
            "destination": "" # Path to the init script in the workspace ie: Shared/init_scripts/init.sh
            }
        }
    ],
    "spark_env_vars": {
        "DATABRICKS_HOST": f"{sync_databricks_conf().host}",
        "DATABRICKS_TOKEN": f"{sync_databricks_conf().token}",
        "SYNC_API_KEY_ID": f"{get_api_key().id}",
        "SYNC_API_KEY_SECRET": f"{get_api_key().secret}",
        "AWS_DEFAULT_REGION": f"{os.environ['AWS_DEFAULT_REGION']}",
        "AWS_ACCESS_KEY_ID": f"{os.environ['AWS_ACCESS_KEY_ID']}",
        "AWS_SECRET_ACCESS_KEY": f"{os.environ['AWS_SECRET_ACCESS_KEY']}",
    }
}

Reminder: the Databricks ARN attached to the cluster will need access to the s3 path specified in the cluster_log_conf.
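To avoid plaintext tokens in spark_env_vars, Databricks lets an environment variable reference a secret with the {{secrets/<scope>/<key>}} syntax. The sketch below builds such references; the scope name "gradient" and the secret key names are hypothetical placeholders, and you would need to create the corresponding secrets in your workspace first.

```python
def secret_ref(scope, key):
    """Return a Databricks secret reference for use as an env var value."""
    return "{{secrets/" + scope + "/" + key + "}}"


def secret_env_vars(scope, key_by_var):
    """Map each environment variable name to a secret reference in `scope`."""
    return {var: secret_ref(scope, key) for var, key in key_by_var.items()}


# Hypothetical scope and key names; replace with the ones in your workspace
spark_env_vars = secret_env_vars(
    "gradient",
    {
        "DATABRICKS_TOKEN": "databricks-token",
        "SYNC_API_KEY_SECRET": "sync-api-key-secret",
        "AWS_SECRET_ACCESS_KEY": "aws-secret-access-key",
    },
)
```

The resulting dictionary can be merged into the spark_env_vars entry of the cluster configuration in place of the plaintext values.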

Databricks Submit Run Operator Changes

Next, we’ll ensure the Databricks Operator passes the run_id of the created job back to xcom. This is needed in the subsequent task to request a prediction for the run. Just enable the do_xcom_push parameter.

# DAG code
    ...
    # Submit the Databricks run
    run_operator = DatabricksSubmitRunOperator(
        task_id=...,
        do_xcom_push=True,
    )
    ...

Create A Recommendation You Can View In Gradient!

Upon successful completion of the DatabricksSubmitRunOperator task, we’ll have the run_id we need to create a recommendation for optimal cluster configuration. We’ll utilize the PythonOperator to call the create_prediction_for_run method from the Sync Library. Within the library, this method will connect to the Databricks instance to gather the cluster log location, fetch the logs, and generate the recommendation.

Below is an example of how to call the create_prediction_for_run method from the Sync Library. 

from sync.awsdatabricks import create_prediction_for_run


def submit_run_for_recommendation(task_to_submit: str, **kwargs):
    run_id = kwargs["ti"].xcom_pull(task_ids=task_to_submit, key="run_id")
    project_id = "Project_Id_Goes_Here"
    create_prediction_for_run(
        run_id=run_id,
        plan_type="Premium",
        project_id=project_id,
        compute_type="Jobs Compute",
    )

What this code block does:

  • Wraps create_prediction_for_run in a callable for the PythonOperator.
  • Pulls the run_id of the previous task from xcom. We supply task_to_submit as the task_id that we gave the DatabricksSubmitRunOperator.
  • Assigns the project ID, found on the project details page in Gradient, to the project_id variable.
  • Passes the run_id and project_id to the Sync Library method.

Optionally, add a project_id parameter to submit_run_for_recommendation if you’d like to supply this value from the PythonOperator instead. Edit plan_type and compute_type as needed; these reference your Databricks settings.
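One possible shape for that parameterized variant is sketched below. It assumes create_prediction_for_run accepts the same keyword arguments used in this post; the create_prediction argument is a hypothetical hook added here only so the function can be exercised without the Sync Library installed.

```python
def submit_run_for_recommendation(
    task_to_submit,
    project_id,
    plan_type="Premium",
    compute_type="Jobs Compute",
    create_prediction=None,
    **kwargs,
):
    """Pull the Databricks run_id from xcom and request a recommendation."""
    if create_prediction is None:
        # Imported lazily so the module loads even where sync is not installed
        from sync.awsdatabricks import create_prediction_for_run as create_prediction

    run_id = kwargs["ti"].xcom_pull(task_ids=task_to_submit, key="run_id")
    if run_id is None:
        raise ValueError(f"No run_id found in xcom for task {task_to_submit!r}")

    return create_prediction(
        run_id=run_id,
        plan_type=plan_type,
        project_id=project_id,
        compute_type=compute_type,
    )
```

With this variant, project_id (and optionally plan_type and compute_type) would be passed through the operator's op_kwargs alongside task_to_submit.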

To call the submit_run_for_recommendation method we defined, implement the python operator as follows:

    submit_for_recommendation = PythonOperator(
        task_id="submit_for_recommendation",
        python_callable=submit_run_for_recommendation,
        op_kwargs={
            "task_to_submit": "Task_id of the DatabricksSubmitRunOperator of which to generate a recommendation for",
        },
        retries=0,
    )

Putting It All Together

Let’s combine all of the above together in a DAG. The DAG will submit a run to Databricks, and then make a call through Sync’s library to generate a prediction for an optimized cluster for that task.

# DAG .py code
import os

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator
from sync.awsdatabricks import create_prediction_for_run
from sync.config import DatabricksConf as sync_databricks_conf
from sync.config import get_api_key


with DAG(
    dag_id="example_dag",
    default_args=default_args,
    tags=["example"],
    ...
) as dag:

    # define the cluster configuration
    cluster_config = {
        "spark_version": "13.0.x-scala2.12",
        ...
        "cluster_log_conf": {
            "s3": {
                "destination": "", # Add the s3 path for the cluster logs
                "enable_encryption": True,
                "region": "", # Add your aws region ie: us-east-1
                "canned_acl": "bucket-owner-full-control",
            }
        },
        "custom_tags": {"sync:project-id": "",}, # Add the project id from Gradient
        ...
        "init_scripts": [
            {"workspace": {
                "destination": "" # Path to the init script in the workspace ie: Shared/init_scripts/init.sh
                }
            }
        ],
        "spark_env_vars": {
            "DATABRICKS_HOST": "", # f"{sync_databricks_conf().host}"
            "DATABRICKS_TOKEN": "", # f"{sync_databricks_conf().token}"
            "SYNC_API_KEY_ID": "", # f"{get_api_key().id}"
            "SYNC_API_KEY_SECRET": "", # f"{get_api_key().secret}"
            "AWS_DEFAULT_REGION": "", # f"{os.environ['AWS_DEFAULT_REGION']}"
            "AWS_ACCESS_KEY_ID": "", # f"{os.environ['AWS_ACCESS_KEY_ID']}"
            "AWS_SECRET_ACCESS_KEY": "", # f"{os.environ['AWS_SECRET_ACCESS_KEY']}",
        }
    }

    # define your databricks operator
    dbx_operator = DatabricksSubmitRunOperator(
        task_id="dbx_operator",
        do_xcom_push=True,
        ...
        new_cluster=cluster_config,
        ...
    )

    # define the submit function to pass to the PythonOperator
    def submit_run_for_recommendation(task_to_submit: str, **kwargs):
        # pull the run_id pushed to xcom by the Databricks operator
        run_id = kwargs["ti"].xcom_pull(task_ids=task_to_submit, key="run_id")
        project_id = "Project_Id_Goes_Here"
        create_prediction_for_run(
            run_id=run_id,
            plan_type="Premium",
            project_id=project_id,
            compute_type="Jobs Compute",
        )

    # define the python operator
    submit_for_recommendation = PythonOperator(
        task_id="submit_for_recommendation",
        python_callable=submit_run_for_recommendation,
        op_kwargs={
            "task_to_submit": "dbx_operator",
        },
        retries=0,
    )

    # define dag dependency
    dbx_operator >> submit_for_recommendation

Viewing Your Recommendation

Once the code above is implemented into your DAG, head over to the Projects dashboard in Gradient. There you’ll be able to easily review recommendations and can make changes to the cluster configuration as needed.

Developing Gradient Part II

Introduction: Using Gradient in a Workflow

Gradient, the latest product release from Sync Computing, helps customers manage the infrastructure behind their recurring Apache Spark applications. Gradient gives infrastructure recommendations for each job to lower the cost of production jobs while hitting their target SLAs. We’ve been hard at work on this project for a long time and we’re excited for people to use it and realize real cost savings in their Apache Spark jobs running on EMR or Databricks!

The key feature of Gradient is a Project, which in Databricks manages the lifecycle of a single Job. With the integration of Sync’s Python library, each Job run produces: 

  1. An Apache Spark eventlog
  2. A “Cluster Report” that includes Databricks and EC2 API response data about that run and its associated cluster

These outputs then get fed into Gradient’s recommendation engine which performs runtime and cost prediction for that same job if it were to run on different hardware. Within Gradient’s recommendation engine there are two key steps: (1) runtime prediction modeling, and (2) cost estimation modeling. These two steps are repeated for a variety of potential hardware configurations, and then one that yields the lowest cost given some time constraint is recommended for the job run.
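The final selection step described above can be illustrated with a short sketch: given predicted runtime and cost for each candidate configuration, keep those that meet the time constraint and pick the cheapest. This is only an illustration of the idea and not Gradient's actual implementation; the Candidate class, field names, and example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str                   # hypothetical label for a hardware configuration
    predicted_runtime_s: float  # output of a runtime prediction model
    predicted_cost_usd: float   # output of a cost estimation model


def recommend(candidates: Iterable[Candidate], max_runtime_s: float) -> Optional[Candidate]:
    """Return the lowest-cost candidate that meets the runtime constraint."""
    feasible = [c for c in candidates if c.predicted_runtime_s <= max_runtime_s]
    # default=None covers the case where no candidate is fast enough
    return min(feasible, key=lambda c: c.predicted_cost_usd, default=None)


# Made-up values for illustration only
options = [
    Candidate("config-a", predicted_runtime_s=1000, predicted_cost_usd=5.0),
    Candidate("config-b", predicted_runtime_s=2000, predicted_cost_usd=3.0),
    Candidate("config-c", predicted_runtime_s=4000, predicted_cost_usd=2.0),
]
best = recommend(options, max_runtime_s=2500)  # config-b: cheapest within the limit
```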

Figure: Diagram of a Gradient Project. After each Job Run a cluster report and Eventlog are produced which get fed into Gradient’s recommendation engine. The output of this process is a new cluster recommendation, one that should reduce cost while maintaining a runtime requirement, that informs the subsequent Job run.

Internal Testing

In Part I of this series we discuss in some detail the internal testing process we used to develop Gradient. A huge benefit to internal testing is that we have access to the cost-actual data of each Job we run via Databricks and AWS cost and usage reporting. This data is critical in assessing whether or not Gradient behaves as it should from a customer’s perspective. In other words, do customers still save money in spite of imperfect runtime prediction and cost estimation? The answer, we found, is yes!

In the following figure we show a snapshot of data we generated that gave us great confidence in the solution we developed for Gradient. The figure displays a histogram of the percent change in cost between an initial “parent run” (with random hardware configuration), and a subsequent “child run” informed by Gradient’s recommendation engine. In total, 80% of runs showed a cost reduction, and the median cost reduction was 30%.
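For readers who want to compute the same kind of summary on their own runs, the arithmetic is straightforward. The sketch below uses made-up cost pairs, not Sync's data, and takes the median over all runs, which is an assumption about how the figure above was calculated.

```python
from statistics import median


def percent_change(parent_cost, child_cost):
    """Percent change in cost from parent run to child run (negative = cheaper)."""
    return 100.0 * (child_cost - parent_cost) / parent_cost


def summarize(pairs):
    """Return (share of runs with reduced cost, median percent change)."""
    changes = [percent_change(parent, child) for parent, child in pairs]
    share_reduced = sum(change < 0 for change in changes) / len(changes)
    return share_reduced, median(changes)


# Made-up (parent_cost, child_cost) pairs for illustration only
example_pairs = [(10, 8), (20, 15), (8, 10), (12, 6)]
share, median_change = summarize(example_pairs)
```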

It’s worth emphasizing that these results are the single shot improvements going from an initial cold-start run to the first recommendation produced by Gradient. Given how complex predicting runtime is, these are excellent results, and it’s not surprising that 20% of the runs increased in cost (a less than ideal outcome). That said, it’s the feedback loop that really unlocks the capability of Gradient. Every time your Job runs more data is added to your project, and that history of runs will be used to improve the recommendation quality of future recommendations. It’s a sure bet that when you spin up your first Job there’ll be a Sync engineer assessing the quality of your recommendations!

Figure: Internal testing results showing the change in job cost after one iteration through Gradient’s recommendation engine. Cost was reduced in 80% of runs, with a median reduction of 30% compared to the parent run cost.

External Savings

When it comes to demonstrating value, nothing beats a positive user experience. Early customer interactions have yielded huge savings and positive experiences from a number of different companies. Wanna see their story using Gradient? Check out some of the blogs they’ve written!

Conclusion

Since we first got started, Sync has felt confident that we’re tackling a real issue that’s pervasive in the cloud infrastructure world. Making infrastructure decisions and being uncertain about how to reduce costs is a ubiquitous story when we talk to folks. With Gradient we feel we finally have a real solution to this problem: one that will make the lives of developers around the world easier, enabling them to focus their efforts on more important tasks, all while saving companies money on their Apache Spark workloads.

Developing Gradient Part I

Introduction

Sync recently introduced Gradient, a tool that helps data engineers manage and optimize their compute infrastructure. The primary facet of Gradient is a Project, which groups a sequence of runs of a Databricks job. After each run, the Spark eventlog and cluster information are sent to Sync. That accumulated project data is then fed into our recommendation engine, which gives back an optimized cluster configuration to make the next run of your job run at lower cost while keeping you on target for your SLA.

If that sounds like a challenging product to develop, then you’d be right! In fact, much of the challenge comes from getting the data needed to create the predictive engine, and we thought that this story and our strategy for tackling it was worth sharing.

In this blog post we give some insight into the development process of Gradient, with focus on the internal testing infrastructure that enabled the early development and validation of the product.

The Challenge

In the early days of development, the Gradient team was in a really tight spot. You see, at its heart, Gradient is a data insights product that requires a feedback loop of job runs and recommendations that get implemented in subsequent runs. You need this feedback loop to both assess the performance of the product and have a hope at improving the quality of recommendations. 

However, injecting yourself into a workflow is a tall order for customers who are rightly protective of their data pipelines, especially when the request comes from a young company with an unproved track record (even if our team – being real here – is totally awesome). Consultant-like interactions were mildly successful, but the update cycle was often a week or more, much too slow to have a hope of making meaningful progress. It was obvious then that at the start, we needed a solution that could get us a lot of data for many different jobs and many different cluster configurations … and didn’t source from customers.

The Solution

So how did we design, test, and validate our recommendation engine? We built infrastructure that would operate a Gradient Project using our own applications as a source. This system would select a Databricks Job (e.g. a TPC-DS query), create a Project, choose an initial “cold start” cluster configuration, and then cyclically run the job using Gradient’s recommendations to inform subsequent runs. Effectively we became a consumer of our own product. 

We also built an orchestrator that could run these Projects at scale using a variety of hardware configurations and applications. After selecting a few configurations, a team member could be spinning up hundreds of Projects with data available for analysis the next day. 

At the foundation of the whole system is a bank of jobs that ensures we capture a variety of workloads in our testing, mitigating as much bias as possible from our data. And in a stroke of genius we named this testing project the Job Bank.

Figure: Job Bank diagram. A Spark application is selected from the Job Bank vault and an initial hardware configuration is selected. After an initial run of the job, the results are passed into Gradient which then informs the subsequent run. Many instances of this process are orchestrated together so testing can occur at scale.

The Job Bank enabled two key vectors for performance assessment and improvement. 

  1. Having a parent run, its recommendation, and the informed child run allow us to assess the runtime prediction accuracy. Our data science team can then dig into these results, looking for scenarios where we perform better or worse. Naturally, this informs modeling and constraints to improve the quality of our recommendations.
  2. Since all of these jobs are run internally at Sync, we have full access to the cost and usage reports from both AWS and Databricks. Not only does this allow us to validate our cost modeling, but it lets us assess Gradient’s performance using cost-actual data which can be considered independently from our prediction accuracy. Look out for more details on cost performance in an upcoming blog!

The benefits of having a system like this were apparent almost immediately. It helped us uncover bugs and new edge cases, improve the engineering, and gave us clear direction in improving the runtime prediction accuracy. Most importantly, it enabled us to develop the recommendation engine into a state that we feel confident will provide customers reliable job management and real cost savings.

Some example data generated with the Job Bank is shown below. The general process to generate this data was:

  1. Cold-start by selecting a Spark application and running it with an initial hardware configuration.
  2. Submit the resulting run data to Gradient and get a recommendation.
  3. Generate a child run using the recommended hardware configuration. 

The plot shows the runtime prediction error (predicted runtime minus child runtime) for about 20 unique applications with 3-5 cold-starts each. The error is plotted against the change in instance size from the parent to the child. In this particular visualization, we can see that the prediction error is correlated with how much the instance size changes. These kinds of insights help us determine where to focus our efforts in improving the algorithm.
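As a rough illustration of how the points in such a plot could be derived, the sketch below computes the prediction error exactly as defined above and pairs it with the instance size change. The record field names, and the idea of expressing instance size as a single number, are assumptions made for this example.

```python
def error_points(runs):
    """Return (instance size change, runtime prediction error) for each run.

    Prediction error is predicted runtime minus child runtime, as defined above.
    """
    return [
        (
            run["child_size"] - run["parent_size"],
            run["predicted_runtime_s"] - run["child_runtime_s"],
        )
        for run in runs
    ]


# Made-up records for illustration only
example_runs = [
    {"parent_size": 4, "child_size": 8, "predicted_runtime_s": 600, "child_runtime_s": 660},
    {"parent_size": 8, "child_size": 8, "predicted_runtime_s": 900, "child_runtime_s": 890},
]
points = error_points(example_runs)
```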

Figure: Example runtime prediction accuracy data generated using Job Bank. Each point is the runtime prediction error calculated by differencing a predicted runtime with the actual runtime using the predicted configuration.

Looking Forward

In the world of data it’s customer data that rules. As our user base grows and we have more feedback from those users, our internal system will become less relevant and it’s the customer recommendation cycles that will drive future development. After all, the variety of jobs and configurations that we might assess in the Job Bank is negligible compared to the domain of jobs that exist out in the wild. We look forward with eagerness and excitement to seeing how best we can improve Gradient in the future with customers at the center of mind.

Continue reading part II: How Gradient saves you money

Introducing: Gradient for Databricks

Wow, the day is finally here! It’s been a long journey, but we’re so excited to announce our newest product: Gradient for Databricks.

Check out our promo video here!

The quick pitch

Gradient is a new tool to help data engineers know when and how to optimize and lower their Databricks costs – without sacrificing performance.

For the math geeks out there, the name Gradient comes from the mathematical operator from vector calculus that is commonly used in optimization algorithms (e.g. gradient descent).

Over the past 18 months of development we’ve worked with data engineers around the world to understand their frustrations when trying to optimize their Databricks jobs. Some of the top pains we heard were:

  • “I have no idea how to tune Apache Spark”
  • “Tuning is annoying, I’d rather focus on development”
  • “There are too many jobs at my company. Manual tuning does not scale”
  • “But our Databricks costs are through the roof and I need help”

How did companies get here?

We’ve worked with companies around the world who absolutely love using Databricks. So how did so many companies (and their engineers) get to this efficiency pain point? At a high level, the story typically goes like this:

  • “The Honeymoon” phase: We are starting to use Databricks and the engineers love it
  • “The YOLO” phase: We need to build faster, let’s scale up ASAP. Don’t worry about efficiency.
  • “The Tantrum” phase: Uh oh. Everything on Databricks is exploding, especially our costs! Help!

So what did we do?

We wanted to attack the “Tantrum” problem head on. Internally three teams of data science, engineering, and product worked hand in hand with early design partners using our Spark Autotuner to figure out how to deliver a deeply technical solution that was also easy and intuitive. We used all the feedback on the biggest problems to build Gradient:

| User Problem | What Gradient Does |
| --- | --- |
| I don’t know when, why, or how to optimize my jobs | Gradient continuously monitors your clusters to notify you of when a new optimization is detected, estimate the cost/performance impact, and output a JSON configuration file to easily make the change. |
| I use Airflow or Databricks Workflows to orchestrate our jobs, everything I use must easily integrate. | Our new python libraries and quick-start tutorials for Airflow and Databricks Workflows make it easy to integrate Gradient into your favorite orchestrators. |
| I just want to state my runtime requirements, and then still have my costs lowered | Simply set your ideal max runtime and we’ll configure the cluster to hit your goals at the lowest possible cost. |
| My company wants us to use Autoscaling for our jobs clusters | Whether you use auto-scaled or fixed clusters, Gradient supports optimizing both (or even switching from one to the other). |
| I have hundreds of Databricks jobs. I need batch importing for optimizing to work | Provide your Databricks token, and we’ll do all the heavy lifting of automatically fetching all of your qualified jobs and importing them into Gradient. |

We want to hear from you!

Our early customers made Gradient what it is today, and we want to make sure it’s meeting companies’ needs. We made getting started super easy (you can just log in to the app here). Feel free to browse the docs here. Please let us know how we did via Intercom (in the docs and app).