How to Use the Gradient CLI Tool to Optimize Databricks / EMR Programmatically
Learn how to use the Gradient CLI to optimize your Databricks and EMR jobs programmatically
Introduction
The Gradient Command Line Interface (CLI) is a powerful yet easy-to-use utility for automating the optimization of your Spark jobs from your terminal, command prompt, or automation scripts.
Whether you are a Data Engineer, DevOps administrator, or just an Apache Spark enthusiast, knowing how to use the Gradient CLI can be incredibly beneficial: it can dramatically reduce the cost of your Spark workloads while helping you hit your pipeline SLAs.
If you are new to Gradient, you can learn more about it in the Sync Docs. In this tutorial, we’ll walk you through the Gradient CLI’s installation process and give you some examples of how to get started. This is meant to be a tour of the CLI’s overall capabilities. For an end-to-end recipe on how to integrate with Gradient, take a look at our Quick Start and Integration Guides.
Pre-Work
This tutorial assumes that you have already created a Gradient account and generated your Sync API keys. If you haven’t generated your keys yet, you can do so on the Accounts tab of the Gradient UI.
Step 1: Setting up your Environment
Let’s start by making sure our environment meets all the prerequisites. The Gradient CLI is actually part of the Sync Library, which requires Python v3.7 or above and runs only on Linux/Unix-based systems.
python --version
I am on a Mac and running Python 3.10, so I am good to go, but before we get started I am going to create a Python virtual environment with venv. This is good practice whenever you install a new Python tool, as it allows you to avoid conflicts between projects and makes environment management simpler. For this example, I am creating a virtual environment called gradient-cli that will reside under the ~/VirtualEnvironments path.
python -m venv ~/VirtualEnvironments/gradient-cli
Step 2: Install the Sync Library
Once you’ve confirmed that your system meets the prerequisites, it’s time to install the Sync Library. Start by activating your new virtual environment.
source ~/VirtualEnvironments/gradient-cli/bin/activate
Next use the pip package installer to install the latest version of the Sync Library.
pip install https://github.com/synccomputingcode/syncsparkpy/archive/latest.tar.gz
You can confirm that the installation was successful by checking the CLI executable’s version with the --version or --help option.
sync-cli --help
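For example, to print just the installed version:
sync-cli --version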
Step 3: Configure the Sync Library
Configuring the CLI with your credentials and preferences is the final setup step. To do this, run the configure command:
sync-cli configure
You will be prompted for the following values:
Sync API key ID:
Sync API key secret:
Default prediction preference (performance, balanced, economy) [economy]:
Would you like to configure a Databricks workspace? [y/n]:
Databricks host (prefix with https://):
Databricks token:
Databricks AWS region name:
If you remember from the Pre-Work, your Sync API key and secret are found on the Accounts tab of the Gradient UI. For this tutorial we are running on Databricks, so you will also need to provide a Databricks workspace host and an access token.
Databricks recommends that you set up a service principal for automation tasks. As noted in their docs, service principals give automated tools and scripts API-only access to Databricks resources, providing greater security than using users or groups.
These values are stored in ~/.sync/config.
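If you ever want to verify what was saved, you can print the file directly:
cat ~/.sync/config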
Congrats! You are now ready to interact with Gradient from your terminal, command prompt, or automation scripts.
Step 4: Example Uses
Below are some tasks you can complete using the CLI. This is useful when you want to automate Gradient processes and incorporate them into larger workflows.
Projects
All Gradient recommendations are stored in Projects. A project is associated with a single Spark job or a group of jobs running on the same cluster. Here are some useful commands you can use to manage your projects with the CLI. For an exhaustive list of commands, use the --help option.
Project Commands:
create – Create a project
sync-cli projects create --description [TEXT] --job-id [Databricks Job ID] PROJECT_NAME
delete – Delete a project
sync-cli projects delete PROJECT_ID
get – Get info on a project
sync-cli projects get PROJECT_ID
list – List all projects for account
sync-cli projects list
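For example, you could create a project tied to a specific Databricks job and then confirm it was created (the description, job ID, and project name below are illustrative):
sync-cli projects create --description "Nightly ETL job" --job-id 123456789 nightly-etl
sync-cli projects list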
Predictions
You can also use the CLI to manage, generate and retrieve predictions. This is useful when you want to automate the implementation of recommendations within your Databricks or EMR environments.
Prediction commands:
get – Retrieve a specific prediction
sync-cli predictions get --preference [performance|balanced|economy] PREDICTION_ID
list – List all predictions for account or project
sync-cli predictions list --platform [aws-emr|aws-databricks] --project TEXT
status – Get the status of a previously initiated prediction
sync-cli predictions status PREDICTION_ID
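For example, to list the predictions in a Databricks project and then retrieve one of them with the cost-saving economy preference (the project name and prediction ID below are placeholders):
sync-cli predictions list --platform aws-databricks --project PROJECT
sync-cli predictions get --preference economy PREDICTION_ID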
The CLI also provides platform-specific commands to generate and retrieve predictions.
Databricks
For Databricks you can generate a recommendation for a previously completed job run with the following command:
sync-cli aws-databricks create-prediction --plan [Standard|Premium|Enterprise] --compute ['Jobs Compute'|'All Purpose Compute'] --project [Your Project ID] RUN_ID
If the run you provided was not already configured with the Gradient agent when it executed, you can still generate a recommendation, but the metrics it is based on may be missing time-sensitive information that is no longer available. To enable evaluation of prior runs executed without the Gradient agent, add the --allow-incomplete-cluster-report option. To avoid this issue altogether, implement the agent and re-run the job.
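For example, to request a recommendation for a run that executed without the agent (the plan, compute type, and IDs below are placeholders):
sync-cli aws-databricks create-prediction --plan Premium --compute 'Jobs Compute' --project PROJECT_ID --allow-incomplete-cluster-report RUN_ID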
Alternatively, you can use the following command to run the job and request a recommendation with a single command:
sync-cli aws-databricks run-job --plan [Standard|Premium|Enterprise] --compute ['Jobs Compute'|'All Purpose Compute'] --project [Your Project ID] JOB_ID
This method is useful when you can manually run your job without interfering with scheduled runs.
Finally, to implement a recommendation and run the job with the new configuration, you can issue the following command:
sync-cli aws-databricks run-prediction --preference [performance|balanced|economy] JOB_ID PREDICTION_ID
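Putting these commands together, a minimal automation sketch might look like the following. Treat this as an outline rather than a drop-in script: the IDs are placeholders, and a real script would parse the prediction ID out of the CLI output before applying it.
# 1. Run the job and request a recommendation (IDs are placeholders)
sync-cli aws-databricks run-job --plan Premium --compute 'Jobs Compute' --project PROJECT_ID JOB_ID
# 2. Check that the prediction has completed
sync-cli predictions status PREDICTION_ID
# 3. Apply the recommendation and re-run the job with the optimized configuration
sync-cli aws-databricks run-prediction --preference economy JOB_ID PREDICTION_ID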
EMR
Similarly, for Spark on EMR you can generate a recommendation for a previously completed job. EMR does not have the same issue of ephemeral cost data becoming unavailable, so you can request a recommendation on a previous run even without the Gradient agent. Use the following command to do so:
sync-cli aws-emr create-prediction --region [Your AWS Region] CLUSTER_ID
If you want to manually rerun the EMR job and immediately request a Gradient recommendation, use the following command:
sync-cli aws-emr record-run --region [Your AWS Region] CLUSTER_ID PROJECT
To execute the EMR job using the recommended configuration, use the following command:
sync-cli aws-emr run-prediction --region [Your AWS Region] PREDICTION_ID
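For example, for a cluster in us-east-1 you could request a recommendation and then apply it (the cluster and prediction IDs are placeholders):
sync-cli aws-emr create-prediction --region us-east-1 CLUSTER_ID
sync-cli aws-emr run-prediction --region us-east-1 PREDICTION_ID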
Products
The Gradient team is constantly adding support for new data engineering platforms. To see which platforms are supported by your version of the CLI, use the following command:
sync-cli products
Configuration
Should you ever need to update your CLI configuration, you can call configure again to change one or more of your values.
sync-cli configure --api-key-id TEXT --api-key-secret TEXT --prediction-preference TEXT --databricks-host TEXT --databricks-token TEXT --databricks-region TEXT
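Assuming the options can be supplied individually, you could, for example, update just your default prediction preference without re-entering your keys:
sync-cli configure --prediction-preference balanced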
Token
The token command returns an access token that you can use against our REST API with clients like Postman.
sync-cli token
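For example, you can capture the token in a shell variable and send it as a bearer token in subsequent requests (the endpoint below is a placeholder and the Authorization header scheme is an assumption; see our developer docs for the actual routes):
TOKEN=$(sync-cli token)
curl -H "Authorization: Bearer $TOKEN" https://<sync-api-host>/<endpoint>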
Conclusion
With these simple commands, you can automate the end-to-end optimization of all your Databricks or EMR workloads, dramatically reducing costs and improving performance. For more information, refer to our developer docs or reach out to us at info@synccomputing.com.
More from Sync:
Choosing the right Databricks cluster: Spot instances vs. on-demand clusters, All-Purpose Compute vs. Jobs Compute
Databricks Compute Comparison: Classic Jobs vs Serverless Jobs vs SQL Warehouses