
Developing Gradient Part I

Introduction

Sync recently introduced Gradient, a tool that helps data engineers manage and optimize their compute infrastructure. The primary facet of Gradient is a Project, which groups a sequence of runs of a Databricks job. After each run, the Spark event log and cluster information are sent to Sync. That accumulated project data is then fed into our recommendation engine, which returns an optimized cluster configuration so the next run of your job costs less while keeping you on target for your SLA.

If that sounds like a challenging product to develop, you’d be right! In fact, much of the challenge comes from getting the data needed to create the predictive engine, and we thought this story and our strategy for tackling it were worth sharing.

In this blog post we give some insight into the development process of Gradient, with a focus on the internal testing infrastructure that enabled the early development and validation of the product.

The Challenge

In the early days of development, the Gradient team was in a really tight spot. You see, at its heart, Gradient is a data insights product that requires a feedback loop of job runs and recommendations that get implemented in subsequent runs. You need this feedback loop both to assess the performance of the product and to have any hope of improving the quality of recommendations.

However, injecting yourself into a workflow is a tall order for customers who are rightly protective of their data pipelines, especially when the request comes from a young company with an unproven track record (even if our team – being real here – is totally awesome). Consultant-like interactions were mildly successful, but the update cycle was often a week or more, much too slow to make meaningful progress. It was obvious, then, that at the start we needed a solution that could get us a lot of data for many different jobs and many different cluster configurations … and didn’t source from customers.

The Solution

So how did we design, test, and validate our recommendation engine? We built infrastructure that would operate a Gradient Project using our own applications as a source. This system would select a Databricks Job (e.g. a TPC-DS query), create a Project, choose an initial “cold start” cluster configuration, and then cyclically run the job using Gradient’s recommendations to inform subsequent runs. Effectively we became a consumer of our own product. 

We also built an orchestrator that could run these Projects at scale using a variety of hardware configurations and applications. After selecting a few configurations, a team member could spin up hundreds of Projects with data available for analysis the next day.

At the foundation of the whole system is a bank of jobs that ensures we capture a variety of workloads in our testing, mitigating as much bias as possible from our data. And in a stroke of genius we named this testing project the Job Bank.

Figure: Job Bank diagram. A Spark application is selected from the Job Bank vault and an initial hardware configuration is selected. After an initial run of the job, the results are passed into Gradient which then informs the subsequent run. Many instances of this process are orchestrated together so testing can occur at scale.

The Job Bank enabled two key vectors for performance assessment and improvement. 

  1. Having a parent run, its recommendation, and the informed child run allows us to assess the runtime prediction accuracy. Our data science team can then dig into these results, looking for scenarios where we perform better or worse. Naturally, this informs modeling and constraints to improve the quality of our recommendations.
  2. Since all of these jobs are run internally at Sync, we have full access to the cost and usage reports from both AWS and Databricks. Not only does this allow us to validate our cost modeling, but it lets us assess Gradient’s performance using cost-actual data, which can be considered independently from our prediction accuracy. Look out for more details on cost performance in an upcoming blog!

The benefits of having a system like this were apparent almost immediately. It helped us uncover bugs and new edge cases, improved our engineering, and gave us clear direction for improving runtime prediction accuracy. Most importantly, it enabled us to develop the recommendation engine to a state where we’re confident it will provide customers reliable job management and real cost savings.

Some example data generated with the Job Bank is shown below. The general process to generate this data was as follows (a minimal sketch of the loop appears after the list):

  1. Cold-start by selecting a Spark application and running it with an initial hardware configuration.
  2. Submit the resulting run data to Gradient and get a recommendation.
  3. Generate a child run using the recommended hardware configuration. 
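
For the curious, here’s a minimal sketch of that loop in Python. The run_job and get_recommendation callables are hypothetical stand-ins for our internal Databricks and Gradient clients, not an actual public API:

```python
# Minimal sketch of one Job Bank project cycle. `run_job` and
# `get_recommendation` are hypothetical stand-ins for the internal
# Databricks and Gradient clients.
def run_project(run_job, get_recommendation, cold_start_config, n_cycles=3):
    """Cold-start with an initial config, then let each recommendation
    inform the next (child) run."""
    history = []
    config = cold_start_config
    for _ in range(n_cycles):
        run = run_job(config)               # parent run; returns runtime + eventlog
        rec = get_recommendation(run)       # submit eventlog + cluster info to Gradient
        history.append({
            "config": config,
            "actual_runtime_s": run["runtime_s"],
            "predicted_runtime_s": rec["predicted_runtime_s"],
        })
        config = rec["recommended_config"]  # recommendation informs the child run
    return history
```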

The plot shows the runtime prediction error (predicted runtime minus child runtime) for about 20 unique applications with 3-5 cold-starts each. The error is plotted against the change in instance size from the parent to the child.  In this particular visualization, we can see that the prediction error is correlated with how much the instance size changes.  These kinds of insights help us determine where to focus our efforts in improving the algorithm.

Figure: Example runtime prediction accuracy data generated using Job Bank. Each point is the runtime prediction error calculated by differencing a predicted runtime with the actual runtime using the predicted configuration.
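
As a rough illustration of how a metric like this can be computed from the run records (the column names and values below are hypothetical, not our actual data):

```python
import pandas as pd

# Hypothetical parent/child run records; each row is one recommendation cycle.
runs = pd.DataFrame({
    "predicted_runtime_s": [1200, 950, 2100],   # Gradient's predicted child runtime
    "child_runtime_s":     [1100, 1000, 1800],  # actual child runtime
    "instance_size_delta": [1, -1, 2],          # change in instance size, parent -> child
})

# runtime prediction error = predicted runtime - actual child runtime
runs["prediction_error_s"] = runs["predicted_runtime_s"] - runs["child_runtime_s"]

# check whether error grows with the size of the instance change
print(runs.groupby("instance_size_delta")["prediction_error_s"].mean())
```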

Looking Forward

In the world of data, it’s customer data that rules. As our user base grows and we get more feedback from those users, our internal system will become less relevant, and it’s the customer recommendation cycles that will drive future development. After all, the variety of jobs and configurations we might assess in the Job Bank is negligible compared to the domain of jobs that exist out in the wild. We’re eager and excited to see how best we can improve Gradient in the future, with customers at the center.

Continue reading part II: How Gradient saves you money

Introducing: Gradient for Databricks

Wow the day is finally here! It’s been a long journey, but we’re so excited to announce our newest product: Gradient for Databricks.

Check out our promo video here!

The quick pitch

Gradient is a new tool to help data engineers know when and how to optimize and lower their Databricks costs – without sacrificing performance.

For the math geeks out there, the name Gradient comes from the vector calculus operator commonly used in optimization algorithms (e.g. gradient descent).

Over the past 18 months of development we’ve worked with data engineers around the world to understand their frustrations when trying to optimize their Databricks jobs. Some of the top pains we heard were:

  • “I have no idea how to tune Apache Spark”
  • “Tuning is annoying, I’d rather focus on development”
  • “There are too many jobs at my company. Manual tuning does not scale”
  • “But our Databricks costs are through the roof and I need help”

How did companies get here?

We’ve worked with companies around the world who absolutely love using Databricks. So how did so many companies (and their engineers) get to this efficiency pain point? At a high level, the story typically goes like this:

  • “The Honeymoon” phase: We are starting to use Databricks and the engineers love it
  • “The YOLO” phase: We need to build faster, let’s scale up ASAP. Don’t worry about efficiency.
  • “The Tantrum” phase: Uh oh. Everything on Databricks is exploding, especially our costs! Help!

So what did we do?

We wanted to attack the “Tantrum” problem head on. Internally, our data science, engineering, and product teams worked hand in hand with early design partners using our Spark Autotuner to figure out how to deliver a deeply technical solution that was also easy and intuitive. We used all of that feedback on the biggest problems to build Gradient:

| User Problem | What Gradient Does |
| --- | --- |
| I don’t know when, why, or how to optimize my jobs | Gradient continuously monitors your clusters to notify you when a new optimization is detected, estimate the cost/performance impact, and output a JSON configuration file to easily make the change. |
| I use Airflow or Databricks Workflows to orchestrate our jobs; everything I use must integrate easily. | Our new Python libraries and quick-start tutorials for Airflow and Databricks Workflows make it easy to integrate Gradient into your favorite orchestrators. |
| I just want to state my runtime requirements, and then still have my costs lowered | Simply set your ideal max runtime and we’ll configure the cluster to hit your goals at the lowest possible cost. |
| My company wants us to use autoscaling for our jobs clusters | Whether you use autoscaled or fixed clusters, Gradient supports optimizing both (or even switching from one to the other). |
| I have hundreds of Databricks jobs; I need batch importing for optimization to work | Provide your Databricks token, and we’ll do all the heavy lifting of automatically fetching all of your qualified jobs and importing them into Gradient. |
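
To give a feel for the first row of the table, a recommended cluster change might look something like the snippet below. This is only an illustrative shape (a Databricks-style cluster spec expressed as a Python dict, with hypothetical values); the exact fields in Gradient’s JSON output may differ:

```python
# Illustrative recommended cluster configuration; values are hypothetical.
recommended_cluster = {
    "spark_version": "11.3.x-scala2.12",
    "driver_node_type_id": "m5.xlarge",
    "node_type_id": "m5.2xlarge",   # recommended worker instance type
    "num_workers": 12,              # fixed-size cluster ...
    # "autoscale": {"min_workers": 4, "max_workers": 16},  # ... or autoscaled
}
```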

We want to hear from you!

Our early customers made Gradient what it is today, and we want to make sure it’s meeting companies’ needs. We made getting started super easy (you can just log in to the app here). Feel free to browse the docs here. Please let us know how we did via Intercom (in the docs and app).

Do Graviton instances lower costs for Spark on EMR on AWS?

Here at Sync we are passionate about optimizing cloud infrastructure for Apache Spark workloads.  One question we receive a lot is

“Do Graviton instances help lower costs?”  


For a little background, AWS built its own processors, which promise to be a “major leap” in performance.  Specifically for Spark on EMR, AWS published a report claiming that Graviton can help reduce costs by up to 30% and speed up performance by up to 15%.  These are fantastic results, and who doesn’t love a company that builds its own hardware?

As an independent company, here at Sync we of course always want to verify AWS’s claims about the performance of Graviton instances.  So in this blog post we run several experiments with the TPC-DS benchmark, across various driver and worker count configurations on several different instance classes, to see for ourselves how these instances stack up.

The Experiment

The goal of the experiment is to see how Graviton instances perform relative to other popular instances.  There are of course hundreds of instance types, so we selected 10 popular ones to keep this a feasible study.

As for the workload, we selected the fan favorite benchmark, TPC-DS 1TB, with all 98 queries run in series.  This differs from AWS’s study, which looked at individual queries within the benchmark.  We decided to track the total job runtime of all queries since we’re just looking for the high-level “average” performance to see if any interesting trends appear.  Results may of course vary query by query, and your individual code is a complete wildcard.  We make no claim that these results hold generally for all workloads or for your specific workloads.

The details of the experimental sweeps are shown below (a sketch of how one sweep point can be launched follows the list):

  • Workload:  TPC-DS 1TB (queries 1-98 run in series)
  • EMR Version:  6.2.0
  • Instances: [r6g, m5dn, c5, i3, m6g, r5, m5d, m5, c6g, r5d] (r6g, m6g, and c6g are the Graviton instances)
  • Driver Node sizes: *.xlarge, *.2xlarge, *.4xlarge  (* = instance family)
  • Worker Nodes: *.xlarge
  • Number of workers: [5,12,20,32,50]
  • spark.executor.cores: 4
  • Market: on-demand, list pricing
  • Cost data:  True AWS costs extracted from the cost and usage reports, includes both EC2 and EMR fees
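
For a sense of how one point in this sweep could be launched, here’s a sketch using boto3. The roles and naming are EMR defaults, and the TPC-DS benchmark step itself is omitted; treat this as illustrative rather than our exact harness:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

def launch_run(family: str, driver_size: str, num_workers: int):
    """Launch one sweep point: driver of family.driver_size, workers of family.xlarge."""
    return emr.run_job_flow(
        Name=f"tpcds-1tb-{family}-{driver_size}-{num_workers}w",
        ReleaseLabel="emr-6.2.0",                  # EMR version from the sweep
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER",
                 "InstanceType": f"{family}.{driver_size}", "InstanceCount": 1},
                {"InstanceRole": "CORE",
                 "InstanceType": f"{family}.xlarge", "InstanceCount": num_workers},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )

launch_run("c6g", "2xlarge", 32)   # one of the 10 x 3 x 5 sweep points
```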

The Result

Below is a global view of all of the experiments run showing cost vs. runtime.  Each dot represents a different configuration as described by the list above.  Points that are in the bottom left hand corner edge are ideal as they are both cheaper and faster.

At a high level, we see that the c6g instances (light green dots) were the lowest cost with comparable runtimes, which was interesting to see.  The other two Graviton instances (r6g and m6g) also skewed toward the lower left relative to most of the other instances.

One deviation was the c5 instances, which performed surprisingly well on both the cost and runtime curves.  They were quite similar to the best Graviton instance, the c6g.

To make the information a bit easier to digest, we averaged the runtime and cost data for a clear side-by-side comparison of the different instances.  The salmon-colored bars are the Graviton-enabled instances.

In the graph below, the runtime of the Graviton instances was comparable with the other instances.  The r6g instances were the fastest, although not by much – only about 6.5% faster than m6g.  The one negative standout was the i3 instances, which took around 20% longer to run than all of the other instances.

More variation is seen in the cost breakdown, where the Graviton instances were typically lower cost than their non-Graviton counterparts, some by a wide margin.  What really stole the show were the “c” class instances: the c5 was actually about 10% cheaper than the m6g and r6g Graviton instances.

The global winner was the c6g instance, which was the absolute cheapest.  The gap between the most expensive (i3) and the cheapest (c6g) is significant – a 70% cost difference!

Based on the data above, it’s interesting that the runtime of the Graviton instances was comparable to the non-Graviton instances.  So what, then, caused the huge cost differential?  It seems that at the end of the day, total job cost generally followed the trends of the machines’ list prices.  Let’s look deeper.

The table below shows the instances and their on-demand list prices, in order of lowest to highest cost.  The lowest-priced instance was the Graviton c6g, which corresponds to the study above, where the c6g had the lowest total job cost.

However, there were some exceptions where more expensive instances still had cheaper total job costs:

  1. c5.xlarge – had the 3rd lowest on-demand price, but the 2nd cheapest overall job cost
  2. r6g.xlarge – had the 5th lowest on-demand price, but the 3rd cheapest overall job cost

These two exceptions show that an instance’s list price doesn’t always dictate total job cost trends.  Sometimes the hardware is such a great fit for your job that it overcomes the higher hourly price.

| Instance | On-Demand List Price ($/hr) |
| --- | --- |
| c6g.xlarge | 0.136 |
| m6g.xlarge | 0.154 |
| c5.xlarge | 0.17 |
| m5.xlarge | 0.192 |
| r6g.xlarge | 0.2016 |
| m5d.xlarge | 0.226 |
| r5.xlarge | 0.252 |
| m5dn.xlarge | 0.272 |
| r5d.xlarge | 0.288 |
| i3.xlarge | 0.312 |
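
As a rough back-of-the-envelope model of how list price flows into total job cost (the EMR fee ratio and runtimes below are assumptions for illustration; the real EMR fee is a separate per-instance-hour charge that varies by type):

```python
def job_cost(price_per_hour: float, num_workers: int,
             runtime_hours: float, emr_fee_ratio: float = 0.25) -> float:
    """EC2 cost for driver + workers, plus an approximate EMR management fee."""
    nodes = num_workers + 1                       # workers plus one driver
    ec2 = price_per_hour * nodes * runtime_hours
    return round(ec2 * (1 + emr_fee_ratio), 2)

# Hypothetical runtimes: a cheaper-per-hour instance that runs longer can
# still win (or lose) on total job cost.
print(job_cost(0.136, 32, 1.5))   # c6g.xlarge cluster
print(job_cost(0.312, 32, 1.8))   # i3.xlarge cluster
```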

Conclusion

So at the end of the day, do Graviton instances save you money?  From this study, I’d say that on average their cost/performance numbers were in fact better than those of other popular instances.  However, as we saw above, it is not always true and, like most things we post – it depends.

If you’re able to explore different instance types, I’d definitely recommend trying out Graviton instances, as they look like a pretty solid bet.  

To revisit AWS’s claims of Graviton instances being 30% cheaper and 15% more performant: based on the data above, that is not always true and depends on a lot of cluster parameters.

For example, one thing we’ll note is that the AWS study only used *.2xlarge workers, whereas our study only looked at *.xlarge worker instances.  We also don’t know what Apache Spark configurations they used or whether they matched ours.

At the end of the day, everything depends on your workload and what your job is trying to do.  There is no one-size-fits-all instance for your jobs.  That’s why we built the Apache Spark Gradient to help users easily optimize their Apache Spark configurations and instance types to help hit their cost and runtime needs.

Feel free to try out the Spark Gradient yourself!

How poor provisioning of cloud resources can lead to 10X slower Apache Spark jobs

The Situation

Let’s say you’re a data engineer who wants to run a data/ML Spark job on AWS as fast as possible, avoiding slow Apache Spark performance. After you’ve written your code to be as efficient as possible, it’s time to deploy to the cloud.

Here’s the problem: there are over 600 instance types in AWS (today), and if you add in the various Spark parameters, the number of possible deployment options becomes impossibly large. So inevitably you take a rough guess, or experiment with a few options, pick one that works, and forget about it.

The Impact

It turns out this guessing game could undo all the work put into streamlining your code. The graph below shows the performance of a standard Bayes ML Spark job from the HiBench test suite. Each point on the graph is the result of changing just 2 parameters: (1) which compute instance is used, and (2) the number of nodes.

Clearly we can see the issue here, even with this very simple example. If a user selects just 2 parameters poorly, the runtime could be up to 10X slower or cost twice as much as it should (with little to no performance gain).

Keep in mind that this is a simplified picture, where we have ignored Spark parameters (e.g. memory, executor count) and cloud infrastructure options (e.g. storage volume, network bandwidth) which add even more uncertainty to the problem.

Daily Changes Make it Worse

To add yet another complication, data being processed today could look very different tomorrow. Fluctuations in data size, skew, and even minor modifications to the codebase can lead to crashed or slow jobs if your production infrastructure isn’t adapting to these changing needs.

How Sync Solved the Problem

At Sync, we think this problem should go away. We also think developers shouldn’t waste time running and testing their job on various combinations of configurations. We want developers to get up and running as fast as possible, completely eliminating the guesswork of cloud infrastructure. At its heart, our solution profiles your job, solves a huge optimization problem, and then tells you exactly how to launch to the cloud.

Try the Sync Gradient for Apache Spark yourself.

How does the worker size impact costs for Apache Spark on EMR AWS?

Here at Sync, we are passionate about optimizing data infrastructure on the cloud, and one common point of confusion we hear from users is: what worker instance size is best for their job?

Many companies run production data pipelines with Apache Spark on the Elastic MapReduce (EMR) platform on AWS.  As we’ve discussed in previous blog posts, wherever you run Apache Spark, whether on Databricks or EMR, the infrastructure you run it on can have a huge impact on overall cost and performance.

To make matters even more complex, the infrastructure settings can change depending on your business goals.  Is there a service level agreement (SLA) time requirement?  Do you have a cost target?  What about both?  

One of the key tuning parameters is the instance size your workers run on.  Should you use a few large nodes?  Or perhaps a lot of small nodes?  In this blog post, we take a deep dive into some of these questions utilizing the TPC-DS benchmark.

Before starting, we want to be clear that these results are very specific to the TPC-DS workload.  While it may be nice to generalize, we cannot predict that these trends will hold true for other workloads.  We highly recommend people run their own tests to confirm.  Alternatively, we built Gradient for Apache Spark to help accelerate this process (feel free to check it out yourself!).

With that said, let’s go!

The Experiment

The main question we seek to answer is – “How does the worker size impact cost and performance for Spark EMR jobs?”  Below are the fixed parameters we used when conducting this experiment:

  • EMR Version: 6.2
  • Driver Node: m5.xlarge
  • Driver EBS storage: 32 GB
  • Worker EBS storage: 128 GB 
  • Worker instance family: m5
  • Worker type: Core nodes only
  • Workload: TPC-DS 1TB (Queries 1-98 in series)
  • Cost structure: On-demand, list price (to avoid Spot node variability)
  • Cost data: Extracted from the AWS cost and usage reports, includes both the EC2 fees and the EMR management fees

Fixed Spark settings:

  • spark.executor.cores: 4
  • Number of executors: set to utilize 100% of the cluster, based on the cluster size
  • spark.executor.memory: automatically set based on the number of cores

The fixed Spark settings we selected were meant to mimic safe “default” settings that an average Spark user might select at first.  To explain those parameters a bit more: since we are changing the worker instance size in this study, we kept the number of cores per executor constant at 4.  The other parameters, such as the number of executors and executor memory, are automatically calculated to utilize the machines at 100%.

For example, if a worker has 16 cores, we create 4 executors per worker; if a worker has 32 cores, we create 8 executors.  A quick sketch of this calculation is below.
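
Here’s that sizing arithmetic as a small Python sketch (the 10% memory reserve for OS/overhead is an assumption, not the exact value used in the study):

```python
def executor_settings(worker_cores: int, worker_memory_gb: float,
                      num_workers: int, cores_per_executor: int = 4) -> dict:
    """Mimic the 'default' settings above: fixed cores per executor,
    executors and memory sized to fill each worker."""
    executors_per_worker = worker_cores // cores_per_executor   # e.g. 16 // 4 = 4
    # reserve ~10% of worker memory for OS/overhead (assumed), split the rest evenly
    executor_memory_gb = int(worker_memory_gb * 0.9 / executors_per_worker)
    return {
        "spark.executor.cores": cores_per_executor,
        "num_executors": executors_per_worker * num_workers,
        "spark.executor.memory": f"{executor_memory_gb}g",
    }

# m5.4xlarge workers: 16 vCPUs, 64 GB each -> 4 executors per worker
print(executor_settings(worker_cores=16, worker_memory_gb=64, num_workers=20))
```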

The variables we are sweeping are outlined below:

  • Worker instance type: m5.xlarge, m5.2xlarge, m5.4xlarge
  • Number of workers: 1-50 nodes

Results

The figure below shows the Spark runtime versus the number and type of workers.  The trend here is pretty clear: larger clusters are in fact faster, and the 4xlarge worker size outperformed all the others.  If speed is your goal, selecting larger workers could help.  If one were to pick a best instance based on the graph below, one might draw the conclusion that:

It looks like the 4xlarge is the fastest choice

The figure below shows the true total cost versus the number and type of workers.  On the cost metric, the story almost flips compared to the runtime graph above.  The smallest instance usually outperformed larger instances when it came to lowering costs.  For 20 or more workers, the xlarge instances were cheaper than the other two choices.

If one were to quickly look at the plot below for the “lowest points,” which correspond to the lowest cost, one could draw the conclusion that:

It looks like the 2xlarge and xlarge instance are the lowest cost, depending on the number of workers

However, the real story comes when we merge those two plots and look at cost vs. runtime simultaneously.  In this plot, it is more desirable to be toward the bottom left, where a run is both lower cost and faster.  As the plot below shows, if one were to look at the lowest points, the conclusion to be drawn is:

It looks like 4xlarge instances are the lowest cost choice… what?

What’s going on here is that for a given runtime, there is always a lower cost configuration with the 4xlarge instances.  Put in that perspective, there is little reason to use the xlarge size, as going to larger machines can get you something both faster and cheaper.

The only caveat is that there is a floor to how cheap (and slow) the 4xlarge cluster can get, reached at a worker count of 1.  Meaning, you could get a cheaper cluster with smaller 2xlarge workers, but the runtime becomes quite long and may be unacceptable for real-world applications.

Here’s a general summary of how the “best worker” choice can change depending on your cost and runtime goals:

| Runtime Goal | Cost Goal | Best Worker |
| --- | --- | --- |
| <20,000 seconds | Minimize | 4xlarge |
| <30,000 seconds | Minimize | 2xlarge |
| <A very long time | Minimize | xlarge |

A note on extracting EMR costs

Extracting the actual true costs for individual EMR jobs from AWS billing information is not straightforward.  We had to write custom scripts to scan the low-level cost and usage reports, looking for specific EMR cluster tags.  The exact mechanism for retrieving these costs will probably vary from company to company, as different security permissions may alter how these costs can be extracted.
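
As a rough sketch of what such a script can look like (assuming a cost and usage report exported to Parquet with Athena-style column names; your tag key and schema may differ):

```python
import pandas as pd

cur = pd.read_parquet("cur_export/")   # hypothetical local copy of the CUR

# Filter line items down to one EMR cluster via its resource tag.
job = cur[cur["resource_tags_user_cluster_id"] == "j-ABC123EXAMPLE"]

ec2_cost = job.loc[job["product_product_name"] == "Amazon Elastic Compute Cloud",
                   "line_item_unblended_cost"].sum()
emr_fee = job.loc[job["product_product_name"] == "Amazon Elastic MapReduce",
                  "line_item_unblended_cost"].sum()

print(f"total job cost: {ec2_cost + emr_fee:.2f} (EC2 + EMR management fee)")
```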

If EMR costs are a high priority at your company and you’d like help extracting your true EMR job-level costs, feel free to reach out to us here at Sync – we’d be happy to work together.

Conclusion

The main takeaways here are the following points:

  • It Depends:  Selecting the “best” worker is highly dependent on both your cost and runtime goals.  It’s not straightforward what the best choice is.
  • It really depends:  Even with cost and runtime goals set, the “best” worker will also depend on the code, the data size, the data skew, Spot instance pricing, and availability, to name just a few.
  • Where even are the costs?  Extracting the actual cost per workload is not easy in AWS; it is actually quite painful to capture both the EC2 and EMR management fees.

Of course here at Sync, we’re working on making this problem go away.  This is why we built the Spark Gradient product to help users quickly understand their infrastructure choices given business needs.  

Feel free to check out the Gradient yourself here!

You can also read our other blog posts here which go into other fundamental Spark infrastructure optimization questions.

Databricks driver sizing impact on cost and performance

As many previous blog posts have reported, tuning and optimizing the cluster configurations of Apache Spark is a notoriously difficult problem.  Especially when a data engineer needs to lower costs or accelerate runtimes on platforms such as EMR or Databricks on AWS, tuning these parameters becomes a high priority.  

Here at Sync, we will experimentally explore the impact of driver sizing in the Databricks platform on the TPC-DS 1TB benchmark, to see if we can obtain an understanding of the relationship between the driver instance size and cost/runtime of the job.

Driver node review

For those who may be less familiar with the details of the driver node in Apache Spark, there are many excellent previous blog posts, as well as the official documentation, and we recommend reading those if you’re unfamiliar.  As a quick summary, the driver is an important part of the Apache Spark system and effectively acts as the “brain” of the entire operation.

The driver program runs the main() function, creates the SparkContext, and schedules tasks onto the worker nodes.  Aside from these high-level functions, we’d like to note that the driver node is also used in the execution of some operations, most famously the collect operation and broadcast joins.  During those operations, data is moved to the driver node, and if the driver is not appropriately sized, this can cause a driver-side out-of-memory error that can shut down the entire cluster.
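
A quick PySpark illustration of those two driver-side pressure points:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
df = spark.range(100_000_000)      # a large distributed DataFrame

# collect() ships every row to the driver; on an undersized driver this is
# a classic way to trigger a driver-side out-of-memory error.
# rows = df.collect()              # risky on large data
preview = df.limit(10).collect()   # bounded alternative

# A broadcast join materializes the small table on the driver before shipping
# it to the executors, so the driver must hold it in memory too.
small = spark.range(1_000).withColumnRenamed("id", "key")
joined = df.join(broadcast(small), df["id"] == small["key"], "left")
```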

As a quick side note, it looks like a ticket has been filed to change this behavior (at least for broadcast joins) in the open source Spark core, so people should be aware that this may change in the future.

How Does Driver Sizing Impact Performance As a Function of the Number Of Workers?

The main experimental question we want to ask is “how does driver sizing impact performance as a function of the number of workers?”  The reason why we want to correlate driver size with the number of workers is that the number of workers is a very important parameter when tuning systems for either cost or runtime goals.  Observing how the driver impacts the worker scaling of the job is a key part of understanding and optimizing a cluster.

Fundamentally, the maximum number of tasks that can be executed in parallel is determined by the number of workers and executors.  Since the driver node is responsible for scheduling these tasks, we wanted to see if the number of workers changes the hardware requirements of the driver.  For example, does scheduling 1 million tasks require a different driver instance type than scheduling 10 tasks?  

Experimental Setup

The technical parameters of the experiment are below:

  • Data Platform:  Databricks
  • Compute type: Jobs (ephemeral cluster, 1 job per cluster)
  • Photon Enabled: No
  • Fixed parameters:  All worker nodes are i3.xlarge, all configs default
  • Sweep parameters:  Driver instance size (r5a.large, r5a.xlarge, r5a.4xlarge), number of workers
  • AWS market:  On-demand (to eliminate Spot fluctuations)
  • Workload: Databricks’ own benchmark on TPC-DS 1TB (all queries run sequentially)

For reference, the 3 drivers span a wide range of hardware: r5a.large (2 vCPUs, 16 GiB memory), r5a.xlarge (4 vCPUs, 32 GiB), and r5a.4xlarge (16 vCPUs, 128 GiB).

The Result

We will break down the results into 3 main plots.  The first, below, looks at runtime vs. number of workers for the 3 different driver types.  In the plot we see that as the number of workers increases, the runtime decreases.  We note that the scaling trend is not linear, and there is a typical “elbow” (we previously published on the general concept of scaling jobs).  We observe that the largest driver, r5a.4xlarge, yielded the fastest performance across all worker counts.

In the plot below we see the cost (DBUs in $) vs. the number of workers.  For the most part, the medium-sized driver, r5a.xlarge, is the most economical, except at the smallest number of workers, where the smallest driver, r5a.large, was the cheapest.

Putting both plots together, we can see the general summary when we plot cost vs. runtime.  The small numbers next to each point show the number of workers.  In general, the ideal points should be toward the bottom left, as that indicates a configuration that is both faster and cheaper.  Points that are higher up or to the right are more expensive and slower.  

Some companies are only concerned about service level agreement (SLA) timelines, and do not actually need the “fastest” possible runtime.  A more useful way to think about the plot below is to ask the question “what is the maximum time you want to spend running this job?”  Once that number is known, you can then select the configuration with the cheapest cost that matches your SLA.  

For example, consider the SLA scenarios below:

1)  SLA of 2500s – If you need your job to be completed in 2,500s or less, then you should select the r5a.4xlarge driver with a worker size of 50.

2)  SLA of 4000s – If you need your job to be completed in 4,000s or less, then you should select the r5a.xlarge driver with a worker size of 20.

3)  SLA of 10,000s – If you need your job to be completed in 10,000s or less, then you should select the r5a.large driver with a worker size of 5.
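
In code, the selection rule is just “cheapest configuration that meets the SLA.” A sketch with illustrative numbers (read off the scenarios above, not the exact measured values):

```python
# Illustrative (driver, workers, runtime, cost) points; not exact measurements.
configs = [
    {"driver": "r5a.4xlarge", "workers": 50, "runtime_s": 2400, "cost_usd": 95.0},
    {"driver": "r5a.xlarge",  "workers": 20, "runtime_s": 3900, "cost_usd": 60.0},
    {"driver": "r5a.large",   "workers": 5,  "runtime_s": 9500, "cost_usd": 30.0},
]

def cheapest_within_sla(configs, sla_seconds):
    """Among configurations that meet the SLA, pick the lowest cost one."""
    feasible = [c for c in configs if c["runtime_s"] <= sla_seconds]
    return min(feasible, key=lambda c: c["cost_usd"]) if feasible else None

print(cheapest_within_sla(configs, 4000))   # -> the r5a.xlarge / 20-worker point
```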

Key Insights

It’s very convenient to see the scaling trend of all 3 drivers plotted in this manner, as there are several key insights gained here:

  1. There is a general “good” optimal driver for TPC-DS 1TB – across the spectrum, it’s clear that r5a.xlarge is a good choice generally as it is usually cheaper and faster than the other driver sizes.  This shows the danger that if your driver is too big or too small, you could be wasting money and time.  
  2. At the extremes, driver size matters for TPC-DS 1TB  – At the wings of either large clusters (50 workers) or small clusters (5 workers) we can see that the best driver selection can swing between all 3 drivers.  
  3. Drivers can be too big – At 12 workers, the r5a.4xlarge performance is slightly faster but significantly more expensive than the other two driver types.  Unless that slight speedup is important, it’s clear to see that if a driver is too large, then the extra cost of the larger driver is not worth the slight speedup.  It’s like buying a Ferrari to just sit in traffic – definitely not worth it (although you will look cool).
  4. Small driver bottleneck – For the small driver curve (r5a.large), we see that the blue line’s elbow occurs at a higher runtime than the middle driver (r5a.xlarge).  This implies that the smaller driver is creating a runtime bottleneck for the entire workload as the cluster becomes larger.  The next section will dive into why.

Root cause analysis for the “small driver bottleneck”

To investigate the cause of the small driver bottleneck, we looked into the Spark event logs to see which values changed as we scaled the number of workers.  In the Spark UI in Databricks, the typical high-level metrics for each task are shown and plotted graphically.  The image below shows an example of a single task broken down into the 7 main metrics:

When we aggregated these values across all tasks, across all the different drivers and worker counts, the numbers were all pretty consistent except for one: “Scheduler Delay.”  For those who may not be familiar, the formal definition from the Databricks Spark UI is shown below:

“Scheduler delay includes time to ship the task from the scheduler to the executor, and time to send the task result from the executor to the scheduler. If scheduler delay is large, consider decreasing the size of tasks or decreasing the size of task results.”

In the graph below, we plot the total aggregated scheduler delay of all tasks for each job configuration vs. the number of workers.  It is expected that the aggregated scheduler delay should increase with a larger number of workers, since there are more tasks.  For example, if there are 100 tasks, each with 1s of scheduler delay, the total aggregated scheduler delay is 100s (even if all 100 tasks executed in parallel and the “wall clock” scheduler delay is only 1s).  Therefore, with 1000 tasks, the total aggregated scheduler delay should increase as well.

Theoretically this should scale roughly linearly with the number of workers for a “healthy” system.  For the “middle” and “large” sized drivers (r5a.xlarge and r5a.4xlarge respectively), we see the expected growth of the scheduler delay.  However, for the “small” r5a.large driver, we see a very non-linear growth of the total aggregated scheduler delay, which contributes to the overall longer job runtime.  This appears to be a large contributor to the “small driver bottleneck” issue.

To understand the formal definition of scheduler delay a bit more deeply, let’s look at the Spark source code in AppStatusUtils.scala.  At a high level, scheduler delay is a simple calculation, as shown in the code below:

schedulerDelay = duration - runTime - deserializeTime - serializeTime - gettingResultTime

To put it in plain terms, scheduler delay is basically a catch-all term: the time a task spends doing something other than executing, serializing data, or getting results.  A further question is which of these terms increases or decreases with a smaller driver.  Maybe duration is increasing, or maybe gettingResultTime is decreasing?
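
To make the bookkeeping concrete, here’s the same calculation transcribed to Python and aggregated across tasks. The flat field names are a simplification of what actually appears in a Spark event log:

```python
# Python transcription of the AppStatusUtils calculation above (times in ms).
def scheduler_delay(task: dict) -> int:
    return (task["duration"]
            - task["executor_run_time"]
            - task["executor_deserialize_time"]
            - task["result_serialization_time"]
            - task["getting_result_time"])

# One illustrative task record, simplified from the event log schema.
tasks = [
    {"duration": 1200, "executor_run_time": 1000, "executor_deserialize_time": 50,
     "result_serialization_time": 20, "getting_result_time": 30},
]

total_delay = sum(scheduler_delay(t) for t in tasks)   # aggregate across all tasks
print(total_delay)   # -> 100 ms of scheduler delay for this single task
```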

If we look at the apples to apples case of 32 workers for the “medium” r5a.xlarge driver and the “small” r5a.large driver, the runtime of the “small” driver was significantly longer.  One could hypothesize that the average duration per task is longer (vs. one of the other terms becoming smaller).  

In summary, our hypothesis here is that by reducing the driver size (number of VCPUs and memory), we are incurring an additional time “tax” on each task by taking, on average, slightly longer to ship a task from the scheduler on the driver to each executor.  

A simple analogy: imagine you’re sitting in bumper-to-bumper traffic on a highway, and all of a sudden every car (a task in Spark) grows 20% longer.  With enough cars, you could be set back miles.

Conclusion

Based on the data described above, the answer to the question above is that inappropriately sized drivers can lead to excess cost and degraded performance as workers scale up and down.  We present a hypothesis that a driver that is “too small,” with too few vCPUs and too little memory, could cause, on average, an increase in task duration via additional overhead in the scheduler delay.

This final conclusion is not terribly new to those familiar with Spark, but we hope that seeing actual data helps create a quantitative understanding of the impact of driver sizing.  There are of course many other ways a poorly sized driver can elongate or even crash a job (as described earlier via the OOM errors).  This analysis was just a deep dive into one observation.

I’d like to put a large caveat here that this analysis was specific to the TPC-DS workload, and it would be difficult to generalize these findings across all workloads.  Although the TPC-DS benchmark is a collection of very common SQL queries, in reality individual code, or things like user defined functions, could throw these conclusions out the window.  The only way to know for sure about your workloads is to run some driver sizing experiments.

As we’ve mentioned many times before, distributed computing is complicated, and optimizing your cluster for your job needs to be done on an individual basis.  That’s why we built the Apache Spark Autotuner for EMR and Databricks on AWS, to help data engineers quickly find the answers they are looking for.