Case Study

How Abnormal Reduced Databricks Costs by 38% with Gradient

Who is Abnormal?

Abnormal is a hypergrowth company in the email security space that helps companies worldwide prevent email attacks while automating security operations. They rely on Databricks extensively to process terabytes of data across thousands of jobs daily, translating to an enormous amount of Databricks usage. From ETL and streaming jobs to SQL and machine learning – it’s safe to say that Abnormal runs a diverse set of applications on Databricks.

Abnormal’s Problem with Databricks

Abnormal’s data platform team came to Sync with a common problem — they simply had too many jobs for a platform team to control in terms of cost, efficiency, and runtime SLAs, all while operating at scale. This problem wasn’t getting easier with their business growing each month – the number of jobs to manage was only increasing.

To put some numbers behind the scale of the problem, let’s say each Databricks cluster has 10 options to configure. Multiply that by a thousand jobs and you have 10,000 parameters that need to be monitored and optimized every day. Keep in mind that many things can change as well, from data size to new code being pushed – so a configuration that worked yesterday can be wrong today.

And to make the problem harder, a bad configuration can lead to an out-of-memory error and a crashed production job. Making a mistake with a configuration is no joke, and that’s why many teams are hesitant to try to optimize clusters themselves.

The Abnormal team was looking for a solution that could manage and optimize many of their production jobs automatically and at scale without any crashes. In some sense, they wanted a tool that would allow their small platform team to seem like an army capable of successfully managing thousands of jobs at once with high fidelity.

How Gradient Helped

Abnormal’s use case is one we’ve heard many times, and one that really strikes at the true value of what Gradient can do – automatically manage and optimize Databricks jobs at scale. With Gradient, a single person can potentially manage thousands of jobs, without breaking a sweat.

How Gradient Works

Gradient builds a custom model for each job it manages, and trains its model on historical data, keeping track of any statistical variations that may occur. With this information, Gradient can confidently apply changes to a user’s cluster automatically to steer it towards the desired performance, learning with each iteration. 

Gradient’s model provides the ultimate infrastructure management solution with the following value points relevant for Abnormal:

  • Automatically maintains optimal performance – Gradient continuously monitors your jobs and configurations to maintain mathematically optimal performance despite any variations that may occur, with no code changes required.
  • Lowers costs – Gradient can tune clusters at scale to minimize costs.
  • Hits SLAs – Gradient can tune clusters to ensure runtime SLAs are hit.
  • Offloads maintenance work – Monitoring and tuning clusters can now be automated away, saving precious time for data and platform engineers.
  • Beyond serverless – Similar to the value of serverless clusters, Gradient removes infrastructure decision-making from end users. The one major step forward is that Gradient actively optimizes configurations to hit end-user goals – while still leaving everything exposed for users to observe and control if they want.

Automatic Results in Production – “Set and Forget”

An hourly production job with a runtime of about 60 minutes was selected to import into the Gradient system. Integrating Gradient into Abnormal’s production Airflow environment took under an hour for the one-time installation setup. After the installation, adding new jobs is as simple as clicking a few buttons in the Gradient UI.

After importing the job into Gradient, Abnormal simply enabled “auto-apply” and then walked away. A couple of hours later they logged back into Gradient and saw the results below – a 38% cost savings and a 2x speedup of their job in production without lifting a finger.

Analyzing the Results

The image above shows an annotated screenshot from Abnormal’s Gradient dashboard for the job. The top graph shows the cost as a function of job run iteration. Each vertical line with a Gradient logo illustrates when a cluster recommendation was applied in production.

The gray area represents the “training” phase of the process while the green area represents the “optimizing” phase. The bottom graph shows the runtime of the application across iterations.

While Abnormal normally runs this job on Spot instances, for this case we switched it to an on-demand cluster to help mitigate any performance noise that could be attributed to the randomness of spot clusters. Basically, this keeps the test clean and ensures that any performance enhancements we achieved weren’t caused by other random fluctuations. Otherwise, the clusters were identical. After optimization, the cluster can be switched back to Spot, where the savings will transfer directly.

In the cost graph above, once the cluster switches to on-demand, the training and optimizing begin. The starting cluster consisted of an r5.xlarge driver node and 10 c5.4xlarge worker nodes. Under the hood, Gradient optimized both cluster and EBS settings.

What Gradient found in this particular example was that the job was under-provisioned and actually needed a larger cluster, which resulted in both a 38% cost savings and a 2x speedup. This result can be counter-intuitive, as most people assume that cost and runtime savings only occur when a cluster is over-provisioned, and that simply shrinking a cluster is the obvious way to cut costs.

Towards the end of the image above, you’ll notice that Gradient will manage the cluster and keep it at optimal performance, with no human intervention needed. Even if things change, Gradient will be able to adjust and accommodate.

Was the reduction due to decreasing data sizes?

One question Abnormal had immediately was whether the cost reduction was simply due to a decreasing data size – a completely valid thought. We of course want to ensure that the performance enhancements observed were due to our optimizations and not some external factor.

Fortunately, when we analyzed the input data size during these job runs, it was consistent across all of them, confirming that the performance boost was due to Gradient’s actions.

How did Gradient know what to do?

As mentioned earlier, Gradient uses a proprietary mathematical model of Databricks. With a few training points, the system can fit model coefficients to quickly provide an accurate prediction of which way to tune a cluster. The best part is that with each new data point, the model only gets more accurate over time.

In this particular scenario, the data was informing the model that it was indeed an under-provisioned cluster and that increasing the size was the right path.

To ensure job run safety, Gradient will gradually walk clusters towards optimal configurations – monitoring the effects of the change with each iteration. This is an important step to prevent catastrophic failures. Changing a cluster too drastically in a single shot is a high risk maneuver due to the unpredictability of Spark jobs.

So what did Abnormal think?

“Gradient is indispensable in our data-driven landscape, where the ever-expanding data volumes require constant monitoring and optimizations. Without Gradient, these oversights can lead to expensive errors, system failures, or excessive costs. With its automated optimization of distributed Spark jobs and scalable solutions, Gradient guarantees seamless pipeline operation, enabling us to focus on delivering products and features to our customers.” – Blaine Elliot – Platform Engineer @Abnormal

Try it yourself today

Gradient is available to try today for users on Databricks AWS and Azure. With each passing month, we’re releasing new features and new optimization paths. If you want help managing and optimizing your Databricks clusters, request a demo or see our site for more information.

How Forma.ai improved their Databricks costs quickly and easily with Gradient

Forma.ai is a B2B SaaS startup based in Toronto, Canada building an AI-powered sales compensation system for enterprises. Specifically, they seamlessly unify the design, execution, and orchestration of sales compensation to better mobilize sales teams and optimize go-to-market performance.

Behind the scenes, Forma.ai runs their pipelines on Databricks to process sales compensation data for their customers. They process hundreds of terabytes of data per month across Databricks Jobs clusters and ad-hoc all-purpose compute clusters.

As their customer count grows, so will their data processing volumes. The cost and performance of their Databricks jobs directly impact their cost of goods sold (COGS) and thus their bottom line. As a result, the efficiency of their jobs is of the utmost importance, both today and for their future sustainable growth.

What is their problem with Databricks?

Forma.ai came to Sync with one fundamental problem – how can they optimize their processing costs with minimal time investment? Thanks to their customer growth, their Databricks usage and costs were only increasing. They were looking for a scalable solution to help keep their clusters optimized without high overhead on the DevOps and Development teams.

Previously, they had put some work into optimizing their jobs clusters, such as moving the most expensive pipelines to different instance types. However, these pipelines and their clusters are updated frequently, and manually reviewing the configuration of every cluster on a regular basis is simply not cost- or time-effective.

How Gradient Helps

Gradient provided the solution they were looking for – a way to achieve optimal clusters without the need to tune manually – freeing up their engineers to focus on building new features and accelerating development.

Furthermore, the configuration changes that Gradient makes are fully exposed to their engineers, so the team can see which settings actually matter and what their impact is, enriching the engineers and leveling up their own Databricks expertise.

Initial Results with Gradient

For a first test, Forma onboarded a real production job into Gradient, enabled ‘auto-apply’, and then let Gradient control the cluster for each recurring run. After a couple of cycles of learning and optimizing, the first results are shown below: an 18% cost savings and a 19% speedup without lifting a finger.

“Cost and cost control of data pipelines is always a factor, and Databricks and cloud providers generally make it really easy to spend money and pretty labor intensive to save money, which can end up meaning you spend more on the time spent optimizing than you end up saving. Gradient solves this dilemma by removing the bulk of the time spent on analysis and inspection. I’d be surprised if there was any data team on the planet that wouldn’t save money and time by using Gradient.”

– Jesse Lancaster, VP, Data Platform

So what did Gradient actually do?

In this initial result, the change that had the most impact was tuning the cluster’s EBS settings (AWS only). These settings are often overlooked in favor of CPU and memory settings.

A table of the specific parameters before and after Gradient is shown below:

Parameter               Initial Settings       Optimized Settings
ebs_volume_type         GENERAL_PURPOSE_SSD    GENERAL_PURPOSE_SSD
ebs_volume_count        1                      4
ebs_volume_size         100                    32
ebs_volume_iops         <not set>              3000
ebs_volume_throughput   <not set>              312

The initial settings reflect the typical settings Databricks provides, and are what most people use. The automatic EBS settings depend on the size of the instance chosen, with bigger instances getting more baseline storage according to AWS’s best practices. While these baseline settings are sufficient for running applications, they are often suboptimal.

We can see that low-level settings like IOPS and throughput are usually not set. In fact, they aren’t even available in the Databricks cluster creation console; you have to adjust these settings in the cluster JSON or with the Jobs API.
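
As a rough sketch of where these settings live (the workspace URL, job ID, and job_cluster_key below are placeholders, and a real update would need the complete new_cluster spec rather than just the EBS fields), the values from the table map onto aws_attributes in the job cluster JSON sent to the Jobs API:

# Sketch only – placeholders throughout. Note that jobs/update replaces the whole
# job_clusters field, so in practice include the full new_cluster spec
# (spark_version, node_type_id, num_workers, etc.), not only aws_attributes.
curl -s -X POST "https://<workspace-url>/api/2.1/jobs/update" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "job_id": <job-id>,
    "new_settings": {
      "job_clusters": [{
        "job_cluster_key": "main",
        "new_cluster": {
          "aws_attributes": {
            "ebs_volume_type": "GENERAL_PURPOSE_SSD",
            "ebs_volume_count": 4,
            "ebs_volume_size": 32,
            "ebs_volume_iops": 3000,
            "ebs_volume_throughput": 312
          }
        }
      }]
    }
  }'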

If you’d like to try out Gradient for your workloads, check out the resources below:

How a Disney Senior Data Engineer Obtained 80% Cost Savings using Gradient

Sr. Data Engineer at Disney Streaming

In the blog post below, written in his own words, a Sr. Data Engineer at Disney Streaming chronicles his experience with Gradient for EMR. In it, we helped accelerate a job from 90 to 24 minutes, which was amazing to see!

The first job I put into the gradient went from processing in around 90 minutes to 25 minutes after I changed the configurations, only using a slightly larger cluster. However, that time save makes up for using more nodes, so it definitely worked to our advantage.

Matthew Weingarten

Extrapolated over a full year, the anticipated savings for his company were over $100K on AWS! Obviously, this doesn’t include the extra time saved by removing the manual guesswork of provisioning clusters.

User’s blogpost here

5 Lessons learned from testing Databricks SQL Serverless + DBT

Databricks’ SQL warehouse products are a compelling offering for companies looking to streamline their production SQL queries.  However, as usage scales up, the cost and performance of these systems become crucial to analyze.  

In this blog we take a technical deep dive into the cost and performance of their serverless SQL warehouse product by utilizing the industry standard TPC-DI benchmark. We hope data engineers and data platform managers can use the data presented here to make better decisions when it comes to their data infrastructure choices.

What are Databricks’ SQL warehouse offerings?

Before we dive into a specific product, let’s take a step back and look at the different options available today. Databricks currently offers three different warehouse options:

  • SQL Classic – Most basic warehouse, runs inside customer’s cloud environment
  • SQL Pro – Improved performance and good for exploratory data science, runs inside customer’s cloud environment
  • SQL Serverless – “Best” performance, and the compute is fully managed by Databricks.

From a cost perspective, both Classic and Pro run inside the user’s cloud environment. What this means is that you will get two bills for your Databricks usage – one is your pure Databricks cost (DBUs) and the other is from your cloud provider (e.g. your AWS EC2 bill).

To really understand the cost comparison, let’s just look at an example cost breakdown of running on a Small warehouse based on their reported instance types:

In the table above, we look at the cost comparison of on-demand vs. spot costs as well.  You can see from the table that the Serverless option has no cloud component, because it’s all managed by Databricks.  

Serverless could be cost-effective compared to Pro if you are using all on-demand instances. But if there are cheap spot nodes available, then Pro may be cheaper. Overall, the pricing for Serverless is pretty reasonable in my opinion, since it also includes the cloud costs.

We also included the equivalent jobs compute cluster, which is the cheapest option across the board.  If cost is a concern to you, you can run SQL queries in jobs compute as well!

Pros and cons of Serverless

The Databricks serverless option is a fully managed compute platform.  This is pretty much identical to how Snowflake runs, where all of the compute details are hidden from users.  At a high level there are pros and cons to this:

Pros:  

  • You don’t have to think about instances or configurations
  • Spin up time is much less than starting up a cluster from scratch (5-10 seconds from our observations)

Cons:

  • Enterprises may have security issues with all of the compute running inside of Databricks
  • Enterprises may not be able to leverage their cloud contracts which may have special discounts on specific instances
  • No ability to optimize the cluster, so you don’t know if the instances and configurations picked by Databricks are actually good for your job
  • The compute is a black box – users have no idea what is going on or what changes Databricks is implementing underneath the hood.

Because of the inherent black box nature of serverless, we were curious to explore the various tunable parameters people do still have and their impact on performance. So let’s dive into what we explored:

Experiment Setup

We tried to take a “practical” approach to this study, and simulate what a real company might do when they want to run a SQL warehouse.  Since DBT is such a popular tool in the modern data stack, we decided to look at 2 parameters to sweep and evaluate:

  • Warehouse size – ['2X-Small', 'X-Small', 'Small', 'Medium', 'Large', 'X-Large', '2X-Large', '3X-Large', '4X-Large']
  • DBT threads – ['4', '8', '16', '24', '32', '40', '48']

The reason we picked these two is that they are both “universal” tuning parameters for any workload, and they both impact the compute side of the job. DBT threads in particular effectively tune the parallelism of your job as it runs through your DAG.
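
For reference, here is a minimal sketch of how the thread count is swept per run – dbt lets you override the profile’s thread setting on the command line (the profiles directory and target name below are placeholders, not our exact setup):

# Override the dbt thread count for a single invocation (profiles dir and target are placeholders).
dbt run --profiles-dir ./profiles --target serverless_sql --threads 24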

The workload we selected is the popular TPC-DI benchmark, with a scale factor of 1000. This workload in particular is interesting because it’s actually an entire pipeline, which mimics more real-world data workloads. For example, a screenshot of our DBT DAG is below; as you can see, it’s quite complicated, and changing the number of DBT threads could have an impact here.

As a side note, Databricks has a fantastic open-source repo that helps you quickly set up the TPC-DI benchmark entirely within Databricks. (We did not use this since we are running with DBT.)

To get into the weeds of how we ran the experiment, we used Databricks Workflows with a Task Type of dbt as the “runner” for the dbt CLI, and all the jobs were executed concurrently; there should be no variance due to unknown environmental conditions on the Databricks side. 

Each job spun up a new SQL warehouse and tore it down afterwards, and ran in unique schemas in the same Unity Catalog. We used the Elementary dbt package to collect the execution results and ran a Python notebook at the end of each run to collect those metrics into a centralized schema.

Costs were extracted via Databricks System Tables, specifically those for Billable Usage.

Try this experiment yourself by cloning the GitHub repo here.

Results

Below are the cost and runtime vs. warehouse size graphs. We can see that the runtime stops scaling once you reach the medium-sized warehouse. Anything larger than a medium had essentially no impact on runtime (or was perhaps worse). This is a typical scaling trend, which shows that scaling cluster size is not infinite; there is always a point at which adding more compute provides diminishing returns.

For the CS enthusiasts out there, this is just the fundamental CS principle of Amdahl’s Law.
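
As a quick refresher, Amdahl’s Law says that if a fraction p of the work is parallelizable, the maximum speedup from N workers is

S(N) = 1 / ((1 - p) + p / N)

so once the serial fraction (1 - p) dominates, adding more warehouse capacity buys essentially nothing.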

One unusual observation is that the medium warehouse outperformed the next three sizes up (large to 2X-large). We repeated this particular data point a few times and obtained consistent results, so it is not a strange fluke. Because of the black box nature of serverless, we unfortunately don’t know what’s going on under the hood and are unable to give an explanation.

Runtime in Minutes across Warehouse Sizes.

Because scaling stops at medium, we can see in the cost graph below that costs start to skyrocket after the medium warehouse size – you’re throwing more expensive machines at the problem while the runtime remains constant. So you’re paying for extra horsepower with zero benefit.

Costs across Warehouse Sizes.

The graph below shows the relative change in runtime as we change the number of threads and warehouse size.  For values greater than the zero horizontal line, the runtime increased (a bad thing).

The data here is a bit noisy, but there are some interesting insights based on the size of the warehouse:

  • 2x-small – Increasing the number of threads usually made the job run longer.  
  • X-small to large – Increasing the number of threads usually helped make the job run about 10% faster, although the gains were pretty flat so continuing to increase thread count had no value.
  • 2x-large – There was an actual optimal number of threads, which was 24, as seen in the clear parabolic line.
  • 3x-large – Had a very unusual spike in runtime with a thread count of 8. Why? No clue.

The Percent Change in Runtime as Threads Increase.

To put everything together, the comprehensive plot below shows cost vs. duration for the total job. The different colors represent the different warehouse sizes, and the size of the bubbles represents the number of DBT threads.

Cost vs duration of the jobs. Size of the bubbles represents the number of threads. Image by author

In the plot above we see the typical trend that larger warehouses typically lead to shorter durations but higher costs.  However, we do spot a few unusual points:

  • Medium is the best – From a pure cost and runtime perspective, medium is the best warehouse to choose
  • Impact of DBT threads –  For the smaller warehouses, changing the number of threads appeared to have changed the duration by about +/- 10%, but not the cost much.  For larger warehouses, the number of threads impacted both cost and runtime quite significantly.

Conclusion

In summary, our top 5 lessons learned about Databricks SQL serverless + DBT products are:

  1. Rules of thumb are bad – We cannot simply rely on “rules of thumb” about warehouse size or the number of dbt threads. Some expected trends do exist, but they are not consistent or predictable, and the outcome is entirely dependent on your workload and data.
  2. Huge variance – The costs ranged from $5 – $45, and runtimes from 2 minutes to 90 minutes, all due to different combinations of number of threads and warehouse size.
  3. Serverless scaling has limits – Serverless warehouses do not scale infinitely and eventually larger warehouses will cease to provide any speedup and only end up causing increased costs with no benefit.
  4. Medium is great – We found the Medium Serverless SQL Warehouse outperformed many of the larger warehouse sizes on both cost and job duration for the TPC-DI benchmark.  We have no clue why.
  5. Jobs clusters may be cheapest – If costs are a concern, switching to just standard jobs compute with notebooks may be substantially cheaper

The results reported here reveal that the performance of black box “serverless” systems can result in some unusual anomalies. Since it’s all behind Databricks’ walls, we have no idea what is happening. Perhaps it’s all running on giant Spark on Kubernetes clusters, or maybe they have special deals with Amazon on certain instances? Either way, the unpredictable nature makes controlling cost and performance tricky.

Because each workload is unique across so many dimensions, we can’t rely on “rules of thumb” or on costly experiments that are only true for a workload in its current state. The more chaotic nature of serverless systems does beg the question of whether they need a closed-loop control system to keep them in check.

As an introspective note – the business model of serverless is truly compelling. Assuming Databricks is a rational business that does not want to decrease its revenue but does want to lower its costs, one must ask the question: “Is Databricks incentivized to improve the compute under the hood?”

The problem is this – if they make serverless 2x faster, then all of a sudden their revenue from serverless drops by 50%, and that’s a very bad day. If they could make it 2x faster and then increase the DBU costs by 2x to counteract the speedup, they would remain revenue neutral (this is in fact what they did for Photon).

So Databricks is really incentivized to decrease their internal costs while keeping customer runtimes more or less the same.  While this is great for Databricks, it’s difficult to pass on any acceleration technology to the user that results in a cost reduction.

Interested in learning more about how to improve your Databricks pipelines? Reach out to Jeff Chou and the rest of the Sync Team.

Resources

How to Use the Gradient CLI Tool to Optimize Databricks / EMR Programmatically

Introduction:

The Gradient Command Line Interface (CLI) is a powerful yet easy-to-use utility for automating the optimization of your Spark jobs from your terminal, command prompt, or automation scripts.

Whether you are a Data Engineer, SysDevOps administrator, or just an Apache Spark enthusiast, knowing how to use the Gradient CLI can be incredibly beneficial, as it can dramatically reduce the cost of your Spark workloads while helping you hit your pipeline SLAs.

If you are new to Gradient, you can learn more about it in the Sync Docs. In this tutorial, we’ll walk you through the Gradient CLI’s installation process and give you some examples of how to get started. This is meant to be a tour of the CLI’s overall capabilities. For an end-to-end recipe on how to integrate with Gradient, take a look at our Quick Start and Integration Guides.

Pre Work

This tutorial assumes that you have already created a Gradient account and generated your Sync API keys. If you haven’t generated your keys yet, you can do so on the Accounts tab of the Gradient UI.

Step 1: Setting up your Environment

Let’s start by making sure our environment meets all the prerequisites. The Gradient CLI is actually part of the Sync Library, which requires Python v3.7 or above and which only runs on Linux/Unix based systems.

python --version

I am on a Mac and running Python version 3.10, so I am good to go, but before we get started I am going to create a Python virtual environment with venv. This is a good practice whenever you install a new Python tool, as it allows you to avoid conflicts between projects and makes environment management simpler. For this example, I am creating a virtual environment called gradient-cli that will reside under the ~/VirtualEnvironments path.

python -m venv ~/VirtualEnvironments/gradient-cli

Step 2: Install the Sync Library

Once you’ve confirmed that your system meets the prerequisites, it’s time to install the Sync Library. Start by activating your new virtual environment.

source ~/VirtualEnvironments/gradient-cli/bin/activate

Next use the pip package installer to install the latest version of the Sync Library.

pip install https://github.com/synccomputingcode/syncsparkpy/archive/latest.tar.gz

You can confirm that the installation was successful by checking the CLI executable’s version with the --version or --help options.

sync-cli --help

Step 3. Configure the Sync Library

Configuring the CLI with your credentials and preferences is the final step of the installation and setup. To do this, run the configure command:

sync-cli configure

You will be prompted for the following values:

Sync API key ID:

Sync API key secret:

Default prediction preference (performance, balanced, economy) [economy]:

Would you like to configure a Databricks workspace? [y/n]:

Databricks host (prefix with https://):

Databricks token:

Databricks AWS region name:

If you remember from the Pre Work, your Sync API key and secret are found on the Accounts tab of the Gradient UI. For this tutorial we are running on Databricks, so you will also need to provide a Databricks workspace host and an access token.


Databricks recommends that you set up a service principal for automation tasks. As noted in their docs, service principals give automated tools and scripts API-only access to Databricks resources, providing greater security than using users or groups.

These values are stored in ~/.sync/config.

Congrats! You are now ready to interact with Gradient from your terminal, command prompt, or automation scripts.

Step 4. Example Uses

Below are some tasks you can complete using the CLI. This is useful when you want to automate Gradient processes and incorporate them into larger workflows.

Projects

All Gradient recommendations are stored in Projects. Projects are associated with a single Spark job or a group of jobs running on the same cluster. Here are some useful commands you can use to manage your projects with the CLI. For an exhaustive list of commands, use the --help option.

Project Commands:

create – Create a project

sync-cli projects create --description [TEXT] --job-id [Databricks Job ID] PROJECT_NAME

delete – Delete a project

sync-cli projects delete PROJECT_ID

get – Get info on a project

sync-cli projects get PROJECT_ID

list – List all projects for account

sync-cli projects list

Predictions

You can also use the CLI to manage, generate and retrieve predictions. This is useful when you want to automate the implementation of recommendations within your Databricks or EMR environments.

Prediction commands:

get – Retrieve a specific prediction

sync-cli predictions get --preference [performance|balanced|economy] PREDICTION_ID

list – List all predictions for account or project

sync-cli predictions list --platform [aws-emr|aws-databricks] --project TEXT

status – Get the status of a previously initiated prediction

sync-cli predictions status PREDICTION_ID

The CLI also provides platform specific commands to generate and retrieve predictions.

Databricks

For Databricks you can generate a recommendation for a previously completed job run with the following command:

sync-cli aws-databricks create-prediction --plan [Standard|Premium|Enterprise] --compute ['Jobs Compute'|'All Purpose Compute'] --project [Your Project ID] RUN_ID

If the run you provided was not already configured with the Gradient agent when it executed, you can still generate a recommendation, but the basis metrics may be missing some time-sensitive information that may no longer be available. To enable evaluation of prior runs executed without the Gradient agent, you can add the --allow-incomplete-cluster-report option. However, to avoid this issue altogether, you can implement the agent and re-run the job.
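
For example (the option placement shown here is illustrative; the other values follow the command above):

sync-cli aws-databricks create-prediction --plan [Standard|Premium|Enterprise] --compute ['Jobs Compute'|'All Purpose Compute'] --project [Your Project ID] --allow-incomplete-cluster-report RUN_ID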

Alternatively, you can use the following command to run the job and request a recommendation with a single command:

sync-cli aws-databricks run-job --plan [Standard|Premium|Enterprise] --compute ['Jobs Compute'|'All Purpose Compute'] --project [Your Project ID] JOB_ID

This method is useful in cases when you are able to manually run your job without interfering with scheduled runs.

Finally, to implement a recommendation and run the job with the new configuration, you can issue the following command:

sync-cli aws-databricks run-prediction --preference [performance|balanced|economy] JOB_ID PREDICTION_ID
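
Putting these pieces together, here is a hedged sketch of how the commands might be chained in an automation script. It assumes create-prediction prints the new prediction ID to stdout and that the status command reports a success state – check the CLI output in your own environment before relying on this.

# Request a recommendation for a completed run, wait for it, then re-run the job with it.
PREDICTION_ID=$(sync-cli aws-databricks create-prediction \
  --plan Premium --compute 'Jobs Compute' --project "$PROJECT_ID" "$RUN_ID")

# Poll until the prediction is ready (exact status text may differ).
until sync-cli predictions status "$PREDICTION_ID" | grep -qi "success"; do
  sleep 30
done

# Apply the recommendation and launch the job with the new cluster configuration.
sync-cli aws-databricks run-prediction --preference economy "$JOB_ID" "$PREDICTION_ID"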

EMR

Similarly, for Spark EMR, you can generate a recommendation for a previously completed job. EMR does not have the same issue with ephemeral cost data becoming unavailable, so you can request a recommendation on a previous run even without the Gradient agent. Use the following command to do so:

sync-cli aws-emr create-prediction --region [Your AWS Region] CLUSTER_ID

If you want to manually rerun the EMR job and immediately request a Gradient recommendation, use the following command:

sync-cli aws-emr record-run --region [Your AWS Region] CLUSTER_ID PROJECT

To execute the EMR job using the recommended configuration, use the following command:

sync-cli aws-emr run-prediction --region [Your AWS Region] PREDICTION_ID

Products

Gradient is constantly working on adding support for new data engineering platforms. To see which platforms are supported by your version of the CLI, you can use the following command:

sync-cli products

Configuration

Should you ever need to update your CLI configuration, you can call configure again to change one or more of your values.

sync-cli configure --api-key-id TEXT --api-key-secret TEXT --prediction-preference TEXT --databricks-host TEXT --databricks-token TEXT --databricks-region TEXT

Token

The token command returns an access token that you can use against our REST API with clients like Postman.

sync-cli token
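
As a minimal sketch of how the token might be used (the API host and route below are placeholders – see the Sync API docs for the actual endpoints):

# Placeholder endpoint; substitute the real Sync API host and route from the docs.
TOKEN=$(sync-cli token)
curl -H "Authorization: Bearer $TOKEN" "https://<sync-api-host>/v1/projects"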

Conclusion

With these simple commands, you can automate the end-to-end optimization of all your Databricks or EMR workloads, dramatically reducing your costs and improving performance. For more information, refer to our developer docs or reach out to us at info@synccomputing.com.