What is the Databricks Jobs API?

Jeffrey Chou
01.29.2024

The Databricks Jobs API allows users to programmatically create, run, and delete Databricks Jobs through a REST API. This is an alternative to managing jobs through the Databricks console UI. To work with other parts of the Databricks platform, such as SQL warehouses, Delta Live Tables, or Unity Catalog, users will have to use the other APIs provided by Databricks.

The official Databricks Jobs API reference can be found here.

However, for newcomers to the Jobs API, I recommend starting with the Databricks Jobs documentation which has great examples and more detailed explanations.

Why should I use the Jobs API?

Users may want to use the API instead of the UI when they need to create jobs dynamically in response to other events, or to integrate with non-Databricks workflow tools such as Airflow or Dagster. Job tasks can be implemented using notebooks, Delta Live Tables pipelines, JARs, spark-submit, or Python, Scala, and Java applications.

Another reason to use the Jobs API is to retrieve and aggregate metrics about your jobs to monitor usage, performance, and costs. The information in the Jobs API is far more granular than what is present in the currently available System Tables.

So if your organization is looking to monitor thousands of jobs at scale and build dashboards, you will have to use the Jobs API to collect all of the information.
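As a rough illustration of the monitoring use case, here is a minimal Python sketch that aggregates run counts, failures, and wall-clock minutes per job. It assumes you have already fetched the "runs" array from /api/2.1/jobs/runs/list, and that each run carries job_id, start_time and end_time (epoch milliseconds), and state.result_state, which is how I understand the response; check the official reference before relying on those field names. The sample data and the summarize_runs helper are hypothetical.

```python
from collections import defaultdict


def summarize_runs(runs):
    """Aggregate run count, failure count, and total wall-clock minutes per job_id."""
    summary = defaultdict(lambda: {"runs": 0, "failed": 0, "minutes": 0.0})
    for run in runs:
        stats = summary[run["job_id"]]
        stats["runs"] += 1
        # Anything other than SUCCESS is counted as a failure in this sketch.
        if run.get("state", {}).get("result_state") != "SUCCESS":
            stats["failed"] += 1
        # start_time and end_time are assumed to be epoch milliseconds.
        stats["minutes"] += (run["end_time"] - run["start_time"]) / 60_000
    return dict(summary)


# Hypothetical sample shaped like the "runs" array of a runs/list response.
sample_runs = [
    {"job_id": 1, "run_id": 101, "start_time": 0, "end_time": 600_000,
     "state": {"result_state": "SUCCESS"}},
    {"job_id": 1, "run_id": 102, "start_time": 0, "end_time": 300_000,
     "state": {"result_state": "FAILED"}},
    {"job_id": 2, "run_id": 201, "start_time": 0, "end_time": 120_000,
     "state": {"result_state": "SUCCESS"}},
]

report = summarize_runs(sample_runs)
for job_id, stats in sorted(report.items()):
    print(job_id, stats)
```

A summary like this is the kind of thing you would feed into a dashboard once you page through all runs for all jobs.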

What can I do with the Jobs API?

A list of the Jobs API requests can be found in the table below, based on the official API documentation. Depending on the endpoint, these are GET, POST, PUT, or PATCH requests.

Action	Request	Description
Get job permissions	/api/2.0/permissions/jobs/{job_id}	Gets the permissions of a job such as ‘user name’, ‘group name’, ‘service principal’, ‘permission level’
Set job permissions	/api/2.0/permissions/jobs/{job_id}	Sets permissions on a job.
Update job permissions	/api/2.0/permissions/jobs/{job_id}	Updates the permissions on a job.
Get job permission levels	/api/2.0/permissions/jobs/{job_id}/permissionLevels	Gets the permission levels that a user can have on an object
Create a new job	/api/2.1/jobs/create	Create a new Databricks Job
List jobs	/api/2.1/jobs/list	Retrieves a list of jobs and their parameters such as ‘job id’, ‘creator’, ‘settings’, ‘tasks’
Get a single job	/api/2.1/jobs/get	Gets job details for a single job
Update all job settings (reset)	/api/2.1/jobs/reset	Overwrite all settings for the given job.
Update job settings partially	/api/2.1/jobs/update	Add, update, or remove specific settings of an existing job
Delete a job	/api/2.1/jobs/delete	Deletes a job
Trigger a new job run	/api/2.1/jobs/run-now	Runs a job with an existing job-id
Create and trigger a one-time run	/api/2.1/jobs/runs/submit	Submit a one-time run. This endpoint allows you to submit a workload directly without creating a job. Runs submitted using this endpoint don’t display in the UI.
List job runs	/api/2.1/jobs/runs/list	List runs in descending order by start time. A run is a single execution of a job.
Get a single job run	/api/2.1/jobs/runs/get	Retrieve the metadata of a single run.
Export and retrieve a job run	/api/2.1/jobs/runs/export	Export and retrieve the job run task.
Cancel a run	/api/2.1/jobs/runs/cancel	Cancels a job run
Cancel all runs of a job	/api/2.1/jobs/runs/cancel-all	Cancels all active runs of a job
Get the output for a single run	/api/2.1/jobs/runs/get-output	Retrieve the output and metadata of a single task run.
Delete a job run	/api/2.1/jobs/runs/delete	Deletes a job run
Repair a job run	/api/2.1/jobs/runs/repair	Repairs a job run by re-running it
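
To make the table a bit more concrete, here is a minimal Python sketch that builds (but does not send) two of these requests using only the standard library: a GET for a single job and a POST that triggers a run. The host name, token, and job id are placeholders, the jobs_request helper is hypothetical, and bearer-token authentication is an assumption about your setup.

```python
import json
import urllib.parse
import urllib.request


def jobs_request(host, token, endpoint, method="GET", params=None, body=None):
    """Build (but do not send) a request to a Jobs API 2.1 endpoint."""
    url = f"https://{host}/api/2.1/jobs/{endpoint}"
    if params:
        # GET endpoints take their arguments as query parameters.
        url += "?" + urllib.parse.urlencode(params)
    # POST endpoints take a JSON body.
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {token}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Placeholder host, token, and job id.
HOST = "example.cloud.databricks.com"
get_job = jobs_request(HOST, "TOKEN", "get", params={"job_id": 123})
run_now = jobs_request(HOST, "TOKEN", "run-now", method="POST", body={"job_id": 123})

print(get_job.get_method(), get_job.full_url)
print(run_now.get_method(), run_now.full_url)
# To actually send a request: urllib.request.urlopen(run_now)
```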

Can I get cost information through the Jobs API?

Unfortunately, users cannot obtain job costs directly through the Jobs API. You’ll need to use the accounts API or System Tables to access billing information. One important caveat: the billing information retrieved through either the accounts API or System Tables covers only the Databricks DBU costs.

The majority of your Databricks costs could come from your actual cloud usage (e.g. on AWS it’s the EC2 costs). To obtain these costs you’ll need to separately retrieve cost information from your cloud provider.
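
To show why both pieces matter, here is a tiny Python sketch of the arithmetic. Every number in it is made up for illustration (DBUs consumed, DBU rate, instance hours, and hourly instance price), and total_run_cost is a hypothetical helper, not something provided by Databricks.

```python
def total_run_cost(dbus, dbu_rate, instance_hours, instance_hourly_price):
    """Return (dbu_cost, cloud_cost, total) for one job run, in dollars."""
    dbu_cost = dbus * dbu_rate                            # billed by Databricks
    cloud_cost = instance_hours * instance_hourly_price   # billed by the cloud provider
    return dbu_cost, cloud_cost, dbu_cost + cloud_cost


# Made-up example: 20 DBUs at $0.25/DBU, plus 11 instance-hours at $0.50/hour.
dbu_cost, cloud_cost, total = total_run_cost(20, 0.25, 11, 0.5)
print(f"DBU: ${dbu_cost:.2f}  cloud: ${cloud_cost:.2f}  total: ${total:.2f}")
```

In this made-up case the cloud bill is slightly larger than the DBU bill, so looking at DBU costs alone would hide more than half of the spend.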

If this sounds painful – you’re right, it’s crazy annoying. Fortunately, Gradient does all of this for you and can retrieve both the DBU and cloud costs for you in a simple diagram to monitor your costs.

How does someone intelligently control their Jobs clusters with the API?

The Jobs API is an input/output system only. What you do with the information and abilities to control and manage Jobs is entirely up to you and your needs.

For users running Databricks Jobs at scale, one dream ability is to optimize and intelligently control jobs clusters to minimize costs and hit SLA goals. Building such a system is not trivial and requires an entire team to develop a custom algorithm as well as infrastructure.

Here at Sync, we built Gradient to solve exactly this need. Gradient is an all-in-one Databricks Jobs intelligence system that works with the Jobs API to help automatically control your jobs clusters. Check out the documentation here to get started.

Updating From Jobs API 2.0 to 2.1

The largest update from API 2.0 to 2.1 is the inclusion of multiple tasks in a job, as described in the official documentation. To explain a bit more, Databricks jobs can contain multiple tasks in a single job, where each task can be a different notebook, for example. All API 2.1 requests must conform to the multi-task format and responses are structured in the multi-task format.
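
As a rough sketch of what the format change means, the Python function below wraps a 2.0-style single-task job definition into a 2.1-style definition with a one-element tasks array. Which keys move down to the task level is my assumption for illustration (the TASK_LEVEL_KEYS set and the to_multi_task helper are hypothetical), so treat the official migration guide as the source of truth.

```python
# Keys assumed to belong to an individual task in the multi-task format.
TASK_LEVEL_KEYS = {
    "new_cluster", "existing_cluster_id", "libraries",
    "notebook_task", "spark_jar_task", "spark_python_task", "spark_submit_task",
    "timeout_seconds", "max_retries",
}


def to_multi_task(settings, task_key="main"):
    """Wrap a single-task job definition into a one-task multi-task definition."""
    job = {k: v for k, v in settings.items() if k not in TASK_LEVEL_KEYS}
    task = {k: v for k, v in settings.items() if k in TASK_LEVEL_KEYS}
    task["task_key"] = task_key  # every task in the multi-task format needs a key
    job["tasks"] = [task]
    return job


single_task = {
    "name": "Nightly model training",
    "new_cluster": {"num_workers": 10},
    "max_retries": 1,
    "schedule": {"quartz_cron_expression": "0 15 22 * * ?",
                 "timezone_id": "America/Los_Angeles"},
    "spark_jar_task": {"main_class_name": "com.databricks.ComputeModels"},
}

multi_task = to_multi_task(single_task)
print(sorted(multi_task))             # job-level keys
print(sorted(multi_task["tasks"][0])) # task-level keys
```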

Databricks Jobs API example

Here is an example, borrowed from the official documentation, of how to create a job:

To create a job with the Databricks REST API, run the curl command below, which creates a job based on the parameters in create-job.json. Note that this example calls the 2.0 endpoint and uses the single-task format:

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/jobs/create \
--data @create-job.json \
| jq .

An example of what goes into create-job.json is shown below:

{
  "name": "Nightly model training",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "r3.xlarge",
    "aws_attributes": {
      "availability": "ON_DEMAND"
    },
    "num_workers": 10
  },
  "libraries": [
    {
      "jar": "dbfs:/my-jar.jar"
    },
    {
      "maven": {
        "coordinates": "org.jsoup:jsoup:1.7.2"
      }
    }
  ],
  "email_notifications": {
    "on_start": [],
    "on_success": [],
    "on_failure": []
  },
  "webhook_notifications": {
    "on_start": [
      {
        "id": "bf2fbd0a-4a05-4300-98a5-303fc8132233"
      }
    ],
    "on_success": [
      {
        "id": "bf2fbd0a-4a05-4300-98a5-303fc8132233"
      }
    ],
    "on_failure": []
  },
  "notification_settings": {
    "no_alert_for_skipped_runs": false,
    "no_alert_for_canceled_runs": false,
    "alert_on_last_attempt": false
  },
  "timeout_seconds": 3600,
  "max_retries": 1,
  "schedule": {
    "quartz_cron_expression": "0 15 22 * * ?",
    "timezone_id": "America/Los_Angeles"
  },
  "spark_jar_task": {
    "main_class_name": "com.databricks.ComputeModels"
  }
}

Azure Databricks Jobs API

The REST APIs are identical across all 3 cloud providers (AWS, GCP, Azure). Users can toggle between the different cloud versions in the top left corner of the reference page.

Conclusion

The Databricks Jobs API is a powerful system that enables users to programmatically control and monitor their jobs. It is most useful for “power users” who want to control many jobs, or for users who need an external orchestrator, like Airflow, to orchestrate their jobs.

To add automatic intelligence to your Databricks Jobs API solutions to help lower costs and hit SLAs, check out Gradient as a potential fit.

Why Your Data Pipelines Need Closed-Loop Feedback Control

Jeffrey Chou
09.10.2023

As data teams scale up on the cloud, data platform teams need to ensure the workloads they are responsible for are meeting business objectives. With dozens of data engineers building hundreds of production jobs, controlling performance at that scale is untenable for a myriad of reasons, from technical to human.

The missing link today is the establishment of a closed loop feedback system that helps automatically drive pipeline infrastructure towards business goals. That was a mouthful, so let’s dive in and get more concrete about this problem.

The problem for data platform teams today

Data platform teams have to manage fundamentally distinct stakeholders, from management to engineers. Oftentimes these two groups have opposing goals, and platform managers can be squeezed from both ends.

Many real conversations we’ve had with platform managers and data engineers typically go like this:

“Our CEO wants me to lower cloud costs and make sure our SLAs are hit to keep our customers happy.”

Okay, so what’s the problem?

“The problem is that I can’t actually change anything directly. I need other people to help, and that is the bottleneck.”

So basically, platform teams find themselves handcuffed and face enormous friction when trying to actually implement improvements. Let’s zoom into the reasons why.

What’s holding back the platform team?

Data teams are out of technical scope – Tuning clusters or complex configurations (Databricks, Snowflake) is a time-consuming task, and data teams would rather focus on actual pipelines and SQL code. Many engineers don’t have the skillset or support structure, and may not even know what their jobs cost. Identifying and solving root cause problems is also a daunting task that gets in the way of just standing up a functional pipeline.

Too many layers of abstraction – Let’s just zoom in on one stack: Databricks runs their own version of Apache Spark, which runs on a cloud provider’s virtualized compute (AWS, Azure, GCP), with different network options and different storage options (DBFS, S3, Blob), and by the way, everything can be updated independently and unpredictably throughout the year. The number of options is overwhelming, and it’s impossible for platform folks to ensure everything is up to date and optimal.

Legacy code – One unfortunate reality is simply just legacy code. Oftentimes teams in a company can change, people come and go, and over time, the knowledge of any one particular job can fade away. This effect makes it even more difficult to tune or optimize a particular job.

Change is scary – There’s an innate fear of change. If a production job is flowing, do we want to risk tweaking it? The old adage comes to mind: “if it ain’t broke, don’t fix it.” Oftentimes this fear is justified: if a job is not idempotent or there are other downstream effects, a botched job can cause a real headache. This creates a psychological barrier to even trying to improve job performance.

At scale there are too many jobs – Typically platform managers oversee hundreds if not thousands of production jobs. Future company growth ensures this number will only increase. Given all of the points above, even if you had a local expert, going in and tweaking jobs one at a time is simply not realistic. While this can work for a select few high priority jobs, it leaves the bulk of a company’s workloads more or less uncared for.

Clearly it’s an uphill battle for data platform teams to quickly make their systems more efficient at scale. We believe the solution is a paradigm shift in how pipelines are built. Pipelines need a closed loop control system that constantly drives a pipeline towards business goals without humans in the loop. Let’s dig in.

What does a closed loop control for a pipeline mean?

Today’s pipelines are what is known as “open loop” systems, in which jobs just run without any feedback. To illustrate, the picture below shows “Job 1” running every day at a cost of $50 per run. Let’s say the business goal is for that job to cost $30. Well, until somebody actually does something, that cost will remain at $50 for the foreseeable future, as seen in the cost vs. time plot.

What if instead, we had a system that actually fed back the output statistics of the job so that the next day’s deployment can be improved? It would look something like this:

What you see here is a classic feedback loop, where in this case the desired “set point” is a cost of $30. Since this job is run every day, we can take the feedback of the real cost and send it to an “update config” block that takes in the cost differential (in this case $20) and, as a result, applies a change to “Job 1’s” configurations. For example, the “update config” block may reduce the number of nodes in the Databricks cluster.
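
To make the “update config” idea concrete, here is a deliberately simple Python sketch of one feedback step. It is not how any production system works: it assumes cost scales linearly with the number of workers, it uses a plain proportional rule, and update_config, est_cost_per_worker, and gain are hypothetical names chosen for this illustration.

```python
def update_config(num_workers, observed_cost, target_cost, est_cost_per_worker, gain=0.5):
    """One feedback step: shrink or grow the cluster based on the cost differential."""
    cost_delta = observed_cost - target_cost  # e.g. $50 - $30 = $20
    # Convert the dollar gap into a worker count; gain < 1 makes the step cautious.
    step = round(gain * cost_delta / est_cost_per_worker)
    return max(1, num_workers - step)  # never go below one worker


# "Job 1" costs $50 with 10 workers, and the set point is $30.
next_workers = update_config(num_workers=10, observed_cost=50.0,
                             target_cost=30.0, est_cost_per_worker=5.0)
print(next_workers)  # 8
```

With a gain of 0.5 the block only closes half of the $20 gap in one step, which is one way to keep a single bad estimate from causing a drastic change.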

What does this look like in production?

In reality this doesn’t happen in a single shot. The “update config” model is now responsible for tweaking the infrastructure to try to get the cost down to $30. As you can imagine, over time the system will improve and eventually hit the desired cost of $30, as shown in the image below.

This may all sound fine and dandy, but you may be scratching your head and asking “what is this magical ‘update config’ block?” Well, that’s where the rubber meets the road. That block is a mathematical model that takes a numerical goal delta as input and outputs an infrastructure configuration, or maybe a code change.

It’s not easy to make and will vary depending on the goal (e.g. costs vs. runtime vs. utilization). This model must fundamentally predict the impact of an infrastructure change on business goals – not an easy thing to do.

Nobody can predict the future

One subtle thing is that no “update config” model is 100% accurate. In the 4th blue dot, you can actually see that the cost goes UP at one point. This is because the model is trying to predict a change in the configurations that will lower costs, but because nothing can predict with 100% accuracy, sometimes it will be wrong locally, and as a result the cost may go up for a single run, while the system is “training.”

But, over time, we can see that the total cost does in fact go down. You can think of it as an intelligent trial and error process, since predicting the impact of configuration changes with 100% accuracy is straight up impossible.
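
Here is a small Python simulation of that trial and error process, under loudly stated assumptions: the “true” cost is a made-up $5 per worker, the model’s estimate of that number is wrong by up to 30% on every run, and the update rule is a simple proportional step. All names (run_job, noisy_update, simulate) are hypothetical. Because the estimate is imperfect, an individual step can overshoot the target, but the loop still settles.

```python
import random

COST_PER_WORKER = 5.0  # hypothetical "true" cost per worker, unknown to the model
TARGET_COST = 30.0     # the set point


def run_job(num_workers):
    """Pretend to run the job and return what it actually cost."""
    return COST_PER_WORKER * num_workers


def noisy_update(num_workers, observed_cost, rng):
    """Update step driven by an estimate that is off by up to 30%."""
    estimate = COST_PER_WORKER * (1 + rng.uniform(-0.3, 0.3))
    step = round((observed_cost - TARGET_COST) / estimate)
    return max(1, num_workers - step)


def simulate(runs=10, seed=0):
    """Run the closed loop for a number of daily runs and record each cost."""
    rng = random.Random(seed)
    workers, costs = 10, []
    for _ in range(runs):
        cost = run_job(workers)
        costs.append(cost)
        workers = noisy_update(workers, cost, rng)
    return costs


costs = simulate()
print(costs)
```

Depending on the seed, the second or third run may land below or above $30 before the loop corrects itself, which mirrors the occasional wrong move described above.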

The big “so what?” – Set any goal and go

The approach above is a general strategy and not one that is limited to just cost savings. The “set point” above is simply a goal that a data platform person puts in. It can be any kind of goal; runtime is a great example.

Let’s say we want a job to be under a 1 hour runtime (or SLA). We can let the system above tweak the configurations until the SLA is hit. Or what if it’s more complicated, with a cost and SLA goal simultaneously? No problem at all, the system can optimize to hit your goals over many parameters. In addition to cost and runtime, other business goals include:

Resource Utilization: Independent of cost and runtime, am I using the resources I have properly?
Energy Efficiency: Am I consuming the least amount of resources possible to minimize my carbon footprint?
Fault Tolerance: Is my job actually resilient to failure? Meaning do I want to over-spec it just in case I get preempted or just in case there are no SPOT instances available?
Scalability: Does my job scale? What if I have a spike in input data by 10x, will my job crash?
Latency: Are my jobs hitting my latency goals? Response time goals?

In theory, all a data platform person has to do is set goals, and then an automatic system can iteratively improve the infrastructure until the goals are hit. There are no humans in the loop, no engineers to get on board. The platform team just sets the goals they’ve received from their stakeholders. Sounds like a dream.

So far we’ve been pretty abstract. Let’s dive into some concrete use cases that are hopefully compelling to people:

Example feature #1: Group jobs by business goals

Let’s say you’re a data platform manager and you oversee the operation of hundreds of production jobs. Right now, they all have their own cost and runtime. A simple graph below shows a cartoon example, where basically all of the jobs are randomly scattered across a cost and runtime graph.

What if you want to lower costs at scale? What if you want to change the runtime (or SLA) of many jobs at once? Right now you’d be stuck.

Now imagine if you had the closed loop control system above implemented for all of your jobs. All you’d have to do is set the high level business goals of your jobs (in this case SLA runtime requirements), and the feedback control system would do its best to find the infrastructure that accomplishes your goals. The end state will look like this:

Here we see each job’s color represents a different business goal, as defined by the SLA. The closed loop feedback control system behind the scenes changed the cluster / warehouse size, various configurations, or even adjusted entire pipelines to try to hit the SLA runtime goals at the lowest cost. Typically longer job runtimes lead to lower cost opportunities.
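
As a toy version of “hit the SLA at the lowest cost,” the Python sketch below picks the cheapest configuration whose runtime fits within a given SLA. The candidate configurations, their runtimes, and their costs are invented for illustration, and cheapest_within_sla is a hypothetical helper; a real system would have to predict or measure these numbers.

```python
def cheapest_within_sla(candidates, sla_minutes):
    """Return the lowest-cost candidate whose runtime meets the SLA, or None."""
    feasible = [c for c in candidates if c["runtime_min"] <= sla_minutes]
    if not feasible:
        return None  # no configuration can hit this SLA
    return min(feasible, key=lambda c: c["cost"])


# Invented candidates for a single job: bigger clusters run faster but cost more.
candidates = [
    {"workers": 4, "runtime_min": 95, "cost": 22.0},
    {"workers": 8, "runtime_min": 55, "cost": 27.0},
    {"workers": 16, "runtime_min": 32, "cost": 35.0},
]

choice_1h = cheapest_within_sla(candidates, sla_minutes=60)
choice_2h = cheapest_within_sla(candidates, sla_minutes=120)
print(choice_1h["workers"], choice_2h["workers"])  # 8 4
```

Relaxing the SLA from one hour to two lets the cheaper 4-worker cluster qualify, which is the “longer runtimes lead to lower cost opportunities” effect in miniature.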

Example feature #2: Auto-healing jobs

As most data platform people can confirm, things are always changing in their data pipelines. Two very common scenarios are data size growing over time and code changes, both of which can cause erratic behavior in cost and runtime.

The illustration below shows the basic concept. Let’s walk through the example from left to right:

Start: Let’s say you have a job and over time the data size grows. Normally your cluster stays the same, and as a result both cost and runtime increase.
Start Feedback: Over time the runtime approaches the SLA requirement and the feedback control system kicks in at the green arrow. At this point, the control system changes the cluster to stay below the red line while minimizing costs.
Code Change: At some point a developer pushes a new update to the code which causes a spike in the cost and runtime. The feedback control system kicks in and adjusts the cluster to work better with the new code change.
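
The walkthrough above can be sketched as a simple rule in Python: leave the cluster alone while the runtime is comfortably under the SLA, and scale up once it gets close. This is only an illustration. It assumes runtime shrinks roughly in proportion to added workers, which is rarely exactly true, and heal, headroom, and max_workers are hypothetical names.

```python
import math


def heal(num_workers, runtime_min, sla_min, headroom=0.9, max_workers=64):
    """Scale up when the last runtime enters the final 10% before the SLA."""
    threshold = headroom * sla_min
    if runtime_min < threshold:
        return num_workers  # still safely below the red line, change nothing
    # Assumption: runtime is roughly inversely proportional to worker count.
    scale = runtime_min / threshold
    return min(max_workers, math.ceil(num_workers * scale))


# 60 minute SLA, 8 workers: a 40 minute run is fine, a 57 minute run is not.
print(heal(8, runtime_min=40, sla_min=60))
print(heal(8, runtime_min=57, sla_min=60))
```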

Hopefully these two examples show the potential benefit of a closed loop control pipeline. Of course, in reality there are many details that have been left out, and there are some design principles companies will have to adhere to. One big one is a way for configurations to revert to a previous state in case something goes wrong. An idempotent pipeline would also be ideal here in case many iterations are needed.
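
One minimal way to picture the “revert to a previous state” principle is to keep every applied configuration in a history and pop the latest one when a change goes badly. The ConfigHistory class below is a hypothetical Python sketch, not part of any Databricks or Gradient API.

```python
class ConfigHistory:
    """Keep every applied cluster configuration so a bad change can be undone."""

    def __init__(self, initial):
        self._versions = [dict(initial)]

    @property
    def current(self):
        return self._versions[-1]

    def apply(self, changes):
        """Record a new configuration made of the current one plus the changes."""
        self._versions.append({**self.current, **changes})
        return self.current

    def rollback(self):
        """Return to the previous configuration; the initial one is never dropped."""
        if len(self._versions) > 1:
            self._versions.pop()
        return self.current


history = ConfigHistory({"num_workers": 10, "node_type_id": "r3.xlarge"})
history.apply({"num_workers": 8})   # the feedback loop tries a smaller cluster
restored = history.rollback()       # the run went badly, so revert
print(restored)
```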

Conclusion

Data pipelines are complex systems, and like any other complex system, they need feedback and control to ensure a stable performance. Not only does this help solve technical or business problems, it will dramatically help free up data platform and engineering teams to focus on actually building pipelines.

Like we mentioned before, a lot of this hinges on the performance of the “update config” block. This is the critical piece of intelligence needed for the success of the feedback loop. It is not trivial to build this block, and it is the main technical barrier today. It can be an algorithm or a machine learning model, and it can utilize historical data. It is the main technical component we’ve been working on over the past several years.

In our next post we’ll show an actual implementation of this system applied to Databricks Jobs, so you can believe that what we’re talking about is real!

Interested in learning more about closed loop controls for your Databricks pipelines? Reach out to Jeff Chou and the rest of the Sync Team.