Case Study

Is Databricks’s autoscaling cost efficient?

Here at Sync we are always trying to learn and optimize complex cloud infrastructure, with the goal to help more knowledge to the community.  In our previous blog post we outlined a few high level strategies companies employ to squeeze out more efficiency in their cloud data platforms.  One very popular response from mid-sized to large enterprise companies we hear a lot of is:

“We use Autoscaling to minimize costs”

We wanted to zoom into this statement to really understand how true it is, and to get a better understanding of the fundamental question 

“Is autoscaling Apache Spark cost efficient?”  

To explain in more detail, we wanted to investigate the technical side of Autoscaling and really dive deep into a specific example.  Because of this we chose to begin with a gold standard workload to analyze, the TPC-DS benchmark, just to minimize any argument being made that we cherry picked a weird workload to skew the final answer.  Our goal here is to be as technical and informative as possible about a few workloads – we are not trying to perform a broad comprehensive study (that would take a long time).  So let’s begin:

What is Autoscaling?

For those who may not know, Autoscaling is the general concept that a cluster should automatically tune the number of workers (or instances on AWS) based on the needs of your job.  The basic message told to companies is, autoscaling will optimize the cluster for your workload and minimize costs.  

Technically, Autoscaling is usually a reactive algorithm that measures some utilization metric inside your cluster to determine if more or less resources are needed.  While this makes logical sense, in reality the complexity of Apache Spark and constantly changing cloud infrastructure make this problem highly unpredictable.  

In the Databricks UI, autoscaling is just a simple checkbox that many people may overlook.  The choice people make by selecting that box could impact their overall performance significantly.

Since many people use Databricks or EMR, the exact algorithm they employ is behind closed doors, so we don’t know the exact details of their logic.  The only thing we can do is measure their performance.

Experiment Setup

Our goal is to provide a technical study of Autoscaling from a novice’s point of view.  Meaning, our base case to compare against will be whatever “default” settings Databricks suggests.  We are not comparing against the global best or against an expert who has spent many days optimizing a particular cluster (who we think would probably do an awesome job).

  • Data Platform:  Databricks
  • Compute type: Jobs (ephemeral cluster, 1 job per cluster)
  • Photon Enabled: No
  • Baseline configuration:  Default params given to users at spin up
  • AWS market:  Driver on-demand, workers on spot with 100% on-demand fall back
  • Workload: Databrick’s own benchmark on  TPC-DS 100GB (all 99 queries run sequentially)

To keep things simple, we ran 3 comparison job runs:

  1. Fixed 8 Nodes – a fixed 8 node cluster using the default machine types suggested to us in the Databricks UI.
  2. Fixed 2 Nodes w/ Autotuner – We use our Apache Spark Autotuner product to recommend an optimized fixed custer to give us the lowest cost option (runtime not optimized).  The recommendation was to use 2 nodes (with different instance types than default)
  3. Autoscaler 2-8 Nodes – We used the default UI settings in Databricks here.
Fixed ClusterFixed Cluster (Autotuner)Autoscaler 2-8 Nodes
No. of Workers822-8
Driver Nodei3.xlarger5a.largei3.xlarge
Worker Nodesi3.xlargei3.2xlargei3.xlarge
Runtime [s]159324412834
DBU Cost [$]0.60.390.73
AWS Cost [$]0.920.921.35
Total Cost [$]1.521.312.08

The results

To our surprise, of the 3 jobs run, the default autoscaler performed the worst in both runtime and cost.  Both a fixed cluster of 8 nodes and 2 nodes, outperformed autoscaling in both time and cost.  The Sync optimized cluster outperformed autoscaling by 37% in terms of cost and 14% in runtime.  

To examine why the autoscaled cluster performed poorly, let’s look at the number of workers created and shut-down over time, in comparison to the fixed 2 node cluster.  The figure below tells the basic story, that the autoscaled cluster spent a lot of time scaling up and down, tuning itself to the workload itself.  At first glance, that is exactly what autoscaling is supposed to do, so why did the autoscaled cost and runtime perform so poorly?

The main reason, from what we can tell, is that there is a time penalty for changing the cluster size – specifically in upsizing the cluster.  We can see from the cluster event log below, that the time between “RESIZING” and “UPSIZE_COMPLETED” can span several minutes.  Based on the Spark UI, the executors don’t get launched until “UPSIZE_COMPLETED” occurs, so no new computing occurs until this step is achieved.  

Another observation here is that in order for us to run the TPC-DS benchmark, we had to run an init_script to install some code at the start of the job.  Based on the cluster event log below, it looks like every time it upsizes new machines, they have to reinstall all the init_scripts each time which costs time and money.  This is something to consider, where if your job requires you to load specific init_scripts, this would certainly negatively impact the autoscaling performance.

So to summarize, you are paying for the “ramp up time” of new workers during autoscaling, where no computing is occurring.  The more often your cluster upsizes, the more you will be waiting and paying.  

Databricks mentions that using pools can help speed up autoscaling, by creating a pool of “warm” instances ready to be kicked off.  Although you are not charged DBU’s, you do still have to pay AWS’s fees for those machines.  So in the end, it still depends on your workload, size of cluster, and use case if the pools solution makes sense.

Another issue is the question of optimizing for throughput.  If 3 nodes processes the data at the same rate as 8 nodes, then ideally autoscaling should stop at 3 nodes.  But it doesn’t seem like that’s the case here, as auto-scaling just went up to the max workers set by the user. 

The optimized fixed cluster looks at cost and throughput to find the best cluster, which is another reason why it is able to outperform the autoscaling solution.

Some follow up questions:

  • Is this just a TPC-DS specific artifact? 

We ran the same tests with two other internal Spark jobs, which we call Airline Delay and Gen Data, and observed the same trend – that the Autoscale cluster was more expensive than fixed clusters.  The amount of Autoscaling fluctuation was much less for Airline delay, so we noticed the advantage of a fixed cluster was reduced.   Gen Data is a very I/O intense job, and the autoscaler actually did not scale up the cluster beyond 2 nodes.  For the sake of brevity, we won’t show those details here (feel free to reach out if there are more questions).  

We just wanted to confirm that these results weren’t specific to TPC-DS, and if we had more time we could do a large scale test with a diverse set of workloads.  Here we observed the optimized fixed cluster (using the Sync Apache Spark Autotuner) achieved a 28% and 65% cost savings over default autoscaling for Airline Delay and Gen Data respectively.

  • What if we just set Autoscaling to 1-2 nodes (instead of 2-8)?

We thought that if we just changed the autoscaling min and max to be near what the “Fixed 2 node autotuner” cluster was, then it should get about the same runtime and cost.  To our surprise, what happened was the autoscaler bounced back and forth between 1 and 2 nodes, which caused a longer job run than the fixed cluster.  You can see in the plot below, we added the autoscaling job from 1-2 nodes on the worker plot.  Overall the cost of the fixed 2 nodes cluster was still 12% cheaper than the autoscaled version of the same cluster with 1-2 nodes.  

What this results indicates is that the parameters of min/max workers in the autoscaler are also parameters to optimize for cost and require experimentation.

  • How does the cost and runtime of the job change vs. varying the autoscaling max worker count? 

If the cost and runtime of your job changes based on the input into max and min worker count, then autoscaling actually becomes a new tuning parameter.  

The data below shows what happens if we keep the min_worker = 2, but sweep the max_worker from 3 to 8 workers.  Clearly both cost and runtime vary quite a bit compared to the Max Worker count.  And the profiles of these slopes depends on the workload.  The bumpiness of the total cost can be attributed to the fluctuating spot prices.

The black dashed line shows the runtime and cost performance of the optimize fixed 2 node cluster.  We note that a fixed cluster was able to outperform the best optimal autoscaling configuration for cost and runtime for the TPC-DS workload.

  • How did we get the cost of the jobs?

It turns out obtaining the actual cost charged for your jobs is pretty tedious and time consuming.  As a quick summary, below are the steps we took to obtain the actual observed costs of each job:

  1. Obtain the Databricks ClusterId of each completed job.  (this can be found in the cluster details of the completed job under “Automatically added tags”)
  2. In the Databricks console, go to the “manage account>usage tab”, filter results by tags, and search for the specific charge for each ClusterId.  (one note: the cost data is only updated every couple of hours, so you can’t retrieve this information right after your run completes)
  3. In AWS, go to your cost explorer, filter by tags, and type in the same cluster-id to obtain the AWS costs for that job (this tag is automatically transferred to your AWS account).  (Another note, AWS updates this cost data once a day, so you’ll have to wait)
  4. Add together your DBU and AWS EC2 costs to obtain your total job cost.

So to obtain the actual observed total cost (DBU and AWS), you have to wait around 24 hours for all of the cost data to hit their final end points.  We were disappointed to see we couldn’t see the actual cost in real time.  


In our analysis, we saw that a fixed cluster could outperform an autoscaled cluster in both runtime and costs for the 3 workloads that we looked at by 37%, 28%, and 65%.  Our experiments showed that by just sticking to a fixed cluster, we eliminated all of the overhead that came with autoscaling which resulted in faster runtimes and lower costs.  So ultimately, the net cost efficiency all depends on if the scaling benefits outweigh the negative overhead costs.

To be fair to the autoscaling algorithm, it’s very difficult to build a universal algorithm that reactively works for all workloads.  One has to analyze the specifics of each job in order to truly optimize the cluster underneath and then still experiment to really know what’s best.  This point is also not specific to Databricks, as many data platforms (EMR, Snowflake, etc) also have autoscaling policies that may work similarly.

To summarize our findings, here are a few high level takeaways:

  • Autoscaling is not one size fits all –  Cluster configurations is an extremely complicated topic that is highly dependent on the details of your workload.  A reactive autoscaling algorithm and the overheads associated with changing the cluster is a good attempt, but does not solve the problem of cluster optimization.
  • Autoscaling still requires tuning – Since Autoscaling is not a “set and forget” solution, it still requires tuning and experimentation to see what min and max worker settings are optimal for your application.  Unfortunately, since the autoscaling algorithm is opaque to users, the fastest way to determine the best settings is to manually experiment.
  • So when is autoscaling good to use for batch jobs?  It’s difficult to provide a general answer because, like mentioned above, it’s all dependent on your workload.  But perhaps two scenarios I could see are (1) if your job has long periods of idle time, then autoscaling should shut down the nodes correctly, or (2) you are running ad-hoc data science experiments and you are prioritizing productivity over costs.  Scenarios (1) and (2) could be the same thing!
  • So what should people do?  If cost efficiency of your production level Databricks jobs is a priority, I would heavily consider performing an experiment where you select a few jobs, switch them to fixed clusters, and then extract the costs to do a before and after analysis – just like we did here.

The challenge of the last bullet is, what is the optimal fixed cluster?  This is an age-old question that required a lot of manual experimentation to determine in the past, which is why we built the Apache Spark Autotuner to figure that out quickly.  In this study, that is how I found the optimal fixed clusters with a single file upload, without having to run numerous experiments.  

Maybe autoscaling is great for your workloads, maybe it isn’t, unfortunately the answer is really “it depends.”  There’s only one way to really find out – you need to experiment.  

Top 3 trends we’ve learned about the scaling of Apache Spark (EMR and Databricks)

We launched the Autotuner for Apache Spark several months ago, and have worked with many companies on analyzing and optimizing their Apache Spark workloads for EMR and Databricks. In this article, we summarize cluster scaling trends we’ve seen with customers, as well as the theory behind it. The truth is, cluster sizing and configuring is a very complex topic and is different for each workload. Some cloud providers ignore all of the complexities and offer simple “T-Shirt” sizes (e.g. small, large, xlarge), while although great for quick testing of jobs, will lead to massive cost inefficiencies in production environments.

The Sync Autotuner for Apache Spark makes it easy to understand the complex tradeoffs of clusters, and enables data engineers to make the best cloud infrastructure decisions for their production environments.

Try for free: Autotuner for Apache Spark

The Theory

In any distributed computing system (even beyond Apache Spark), there exist well known scaling trends (runtime vs. number of nodes), as illustrated in the images below. These trends are universal and fundamental to computer science, so even if you’re running Tensorflow, OpenFOAM (computational fluid dynamics solver), or MonteCarlo simulations on many nodes, they will all follow one of the three scaling trends below:

Standard Scaling: As more and more nodes are added, the runtime of the job decreases, but the cost also increases. The reason is because adding more nodes is not free computationally, there are usually additional overheads to runtime such as being network bound (e.g. shuffles in Spark), compute bound, I/O bound, or memory bound. As an example, doubling the number of nodes to run your job results in a runtime of more than half of the original runtime if they exhibit standard scaling.

At some point, adding more nodes has diminishing returns and the job stops running faster, but obviously cloud costs start rising (since more nodes are being added). We can see point B here is running on let’s say, 5 nodes, but point A is running on 25 nodes. Running your job at point A is significantly less cost efficient and you may be wasting your money.

Embarrassingly Parallel: This is the case when adding more nodes actually does linearly decrease your runtime, and as a result we see a “flat” cost curve. This is traditionally known in the industry as “embarrassingly parallel” because there are no penalties for adding more nodes. This is usually because there is very little communication between nodes (e.g. no shuffles in Spark), and each node just acts independently.

For example at point B we are running at 5 nodes, but point A we’re running at 25 nodes. Turns out, although your number of nodes from A to B went up by 5x, your runtime also went down by 5x. So they both cancel out and you basically have a flat cost curve. In this case, you are free to increase your cluster size, and decrease your runtime for no extra cost! Due to the computational overheads mentioned above though, this case is quite rare and will eventually stop at large enough nodes (when exactly depends on your code).

Negative Scaling: This is the interesting case when running with more nodes is both cheaper and faster (the complete opposite of “Standard Scaling”). The reason here is that some overheads could actually decrease with larger cluster sizes. For example, there could be a network or disk I/O bound issue (e.g. fetch time waiting for data), where having more nodes increases the effective network or I/O bandwidth and makes your jobs run a lot faster. If you have too few nodes, then network or I/O will be your bottleneck as your Spark application gets hung up on fetching data. Memory bound jobs could also exhibit this behavior if the cluster is too small and doesn’t have enough memory, and there exists significant memory overhead.

For example at point B, we are running at 5 nodes, but now we only have 5 machines performing data read/write. But at point A we have 25 nodes, we have 5x more bandwidth on read/write, and thus the job runs much faster.

Real Customer Plots

The 3 scaling trends are universal behaviors of any distributed compute system, Apache Spark applications included. These scaling curves exist whether you’re running open source Spark, EMR, or Databricks — this is fundamental computer science stuff here.

When we actually started processing customer logs, we noticed that the jobs weren’t even on the proper scaling curve, due to the improper configurations of Spark. As a result, we saw that customers were actually located in the “Land of Inefficiency” (as shown by the striped region below), in which they were observing both larger costs and runtime, for no good reason.

For example if you set your workers and memory settings improperly, the result you’d see in the autotuner is a black “current” dot in the “Land of Inefficiency.” The entire goal of the autotuner is to provide an easy and automatic way for customers to achieve an efficient Spark cluster.

Standard Scaling — In the 3 screen shots below, we see the classic standard scaling for customer jobs. We see the classic “elbow” curve as described above. We can see that here in all 3 cases, all of the users were in the “Land of Inefficiency.” Some of the runtime and cost savings went up to 90%, which was amazing to see. Users can also tune the cost/runtime, based on their company’s goals.

Embarrassingly Parallel: In the screen shots below, we see almost flat curves for these jobs. In these cases the jobs were almost entirely CPU bound, meaning there was little communication between nodes. As a result, adding more nodes linearly increased the runtime. In this case, the jobs were still in the “Land of Inefficiency”, so substantial cost/runtime savings could still be achieved.

Negative Scaling — In the screen shots below, we see the negative scaling behavior. The issue here is a large amount of fetch wait time (e.g. network I/O) that causes larger clusters to be substantially more efficient than smaller clusters. As a result, going to larger clusters will be more advantageous for both cost and runtime.


We hope this was a useful blog for data engineers. As readers hopefully see, the scaling of your big data jobs is not straightforward, and is highly dependent on the particularities of your job. The big question is always, what is the bottleneck of your job? Is it CPU, network, disk I/O, or memory bound? Or perhaps it is a combination of a few things. The truth is, “it depends” and requires workload specific optimization. The Autotuner for Apache Spark is an easy way to understand your workload, bring you out of the “Land of Inefficiency”, and optimize your job depending on the type of scaling behavior it exhibits.

One question we get a lot is — what about multi-tenant situations when one cluster is running hundreds or thousands of jobs? How does the Autotuner take into account other simultaneous jobs? This solutions requires another level of optimization, and one we recently published a paper on entitled “Global Optimization of Data Pipelines on the Cloud”


  1. Autotuner post for EMR on AWS
  2. Autotuner post for Databricks on AWS
  3. Global Optimization of Data Pipelines on the Cloud

Disney Sr. Data Engineer User Case Study

Sr. Data Engineer at Disney Streaming

In the self-written blog post below, a Sr. Data Engineer chronicles his experience with the Spark Autotuner for EMR. In the blog post we helped accelerate a job from 90 to 24 minutes, which was amazing to see!

The first job I put into the autotuner went from processing in around 90 minutes to 25 minutes after I changed the configurations, only using a slightly larger cluster. However, that time save makes up for using more nodes, so it definitely worked to our advantage.

Matthew Weingarten

Extrapolated over a full year, our anticipated savings to his company was over $100K on AWS! Obviously this doesn’t include the extra time savings of removing the current manual guesswork of provisioning clusters.

User’s blogpost here

Optimize Databricks clusters based on cost and performance

We’re excited to announce that the Sync Apache Spark Cluster Autotuner Solution now supports Databricks! (our previous blog post was about Spark in EMR) In this blog post we discuss a real use-case with a customer, a cloud-native data based company, and how we lowered their Databricks cluster costs by 34% and accelerated their jobs by 47%. We discuss the lessons learned and how exactly the Autotuner works.

Try the Sync Autotuner for Databricks yourself

Databricks is one of the most popular platforms to run Apache Spark. It provides a relatively friendly interface that allows data scientists to focus on development of the analytical workloads and build extract load transform (ELT) type operations in a performant manner. The multiple options it provides, by virtue of being built on top of Apache Spark, like supported languages (Java, Python, Scala, R, and SQL) and rich libraries (MLlib, graphX, sparknlp ) makes it an attractive choice for a data compute platform. That said, without careful consideration of cluster set-ups to run big data workloads, costs and runtime can easily expand beyond expectations or initial assumptions. he non-trivial effort required to adjust proper compute parameters cannot be underestimated.

Configuring compute infrastructure for Databricks to hit cost/performance goals can be daunting. (Image by author)

To that extent, a mid-sized B2B customer company that provides data services to businesses, approached Sync Computing with the desire to improve their Databricks usage. Since the customer’s product is data, their Databricks bill and engineering time directly impacts their profit margin. After careful investigation, we were able to provide guidance on several workloads. This particular job, after our collaboration, resulted in 34% cost reduction and 17% runtime reduction. (Though we have achieved greater gains with other Databricks jobs.) This means, we were able to not only reduce their cost but enable better capabilities around meeting their data SLAs (service level agreements). The chart below depicts their initial cost and runtime and the associated improvements we achieved together. The gray dots represent different predictions with varying numbers of workers of the same instance type.

Results of the autotuner prediction and real run by the customer. (Image by author)

To achieve the result above, a full list of the parameters changed by the Autotuner is shown in table 1. By switching instance types, number of workers, memory and storage parameters all simultaneously, the Autotuner performs a global optimization to achieve the desired cost and performance goals selected by the user.

Table 1. Comparison of the original vs optimized Databricks configurations. (Image by author)

Out of the collaboration, several highlights may provide value to other Databricks users.  We have detailed them below (in no particular order).

Reduce Costs by Right-sizing Worker Instance Types, number of workers, and Storage

Without much insight about which instance types are worth selecting for specific workloads, it’s tempting to assemble clusters that are composed of large worker nodes with significant attached storage. This strategy is sensible for focusing not on infrastructure during development of data pipelines but once these pipelines are in production, it is worth the effort to look at cluster composition to save costs and runtime. In our collaboration with the customer, we discovered that they had the opportunity to downsize their worker node instance type, number of workers and assign appropriate storage. See figure below that illustrates this point. Our recommendation actually includes smaller worker instances and increased EBS (storage). The initial cluster included 11 r5dn.16xlarge instances and the recommended cluster included a larger number of workers (21) but smaller instance types, r5.12xlarge. The resulting cost decrease is mostly due to smaller instances despite an increase in the number. Balancing instance types and the number of workers is a delicate calculation, if done incorrectly could erase all potential gains. The Autotuner predicts both values simultaneously for users, to eliminate this tricky step. The costs associated with the worker EC2 instances (shown in salmon) represents where the major cost reduction occurred.  

We note that switching instances is not trivial, as it may impact other parameters. For example in this case, the r5dn instance type has attached storage. Moving to the r5 instance type requires adding the appropriate amount of EBS storage hence the small increase in the worker EBS costs.The Sync Autotuner takes this into account and auto-populates the parameters such as these when suggesting to switch instance types.

The cost impact of right sizing the driver node. (Image by author)

Reduce Costs by Right-sizing Driver Instance Type

Databricks provides a wide range of instance types to choose from when setting the driver and worker nodes. The plethora of options stems from the variety of compute the underlying cloud provider (AWS, Azure and GCP) put forth. The number of options can be overwhelming. A common pattern among Spark users involves choosing the same instance type for both worker and driver nodes. We have been able to help folks, like the customer, tune their cluster settings to choose more tailored driver instance types. By avoiding over-provisioning of driver nodes, the cost contribution of the driver can be reduced. This approach tends to yield strong benefits in scenarios where the driver node uses ON-DEMAND instances and the worker nodes are using SPOT instances. The benefits are more pronounced when the cluster has fewer workers.

The chart below shows the cost breakdown of a four worker node cluster. Initially, the cluster consisted of a ON-DEMAND driver m5.12xlarge instance with four SPOT m5.12xlarge workers instances. By right-sizing the driver node to a m5.xlarge instance, a 21% cost reduction is achieved. The salmon-colored portion represents the cost contribution of the change in driver instance type. The right-sizing must avoid under-provisioning so the appropriate spark configuration parameters need to be adopted. That is what the Sync AutoTuner enables. A small but noticeable increase in the worker costs is related to a slight increase in the runtime associated with the driver instance change.

The cost impact of rightsizing the worker node. (Image by author)

The impact of Spot Availability on Runtime

Using spot instances for worker nodes is a great way to save on compute costs. the customer adopted this practice prior to our initial conversations. However, the cost-saving strategy may actually result in longer runtimes due to availability issues. For large clusters, Databricks may start the job even with only a fraction of the desired total number of target worker nodes. As a result, the data processing throughput is slowly ramped up over time resulting in longer runtimes compared to the ideal case of having all the desired workers from the start. In addition, it is also possible for worker nodes to drop off, due to low Spot availability, during a run of a job and come back via Databricks Autorecovery feature. As a consequence, the total runtime for the cluster will be longer than for a cluster that has the full target number of workers for the entire job.

The chart below presents how Sync helped the customer accelerate their Databricks jobs by 47% by switching to a higher availability instance. The black line represents the run with up to 18 r5dn.16xlarge workers. The salmon line represents the run with 32 c5.12xlarge workers. The AWS Spot Advisor (see link) indicated that the r5dn.16xlarge instance type typically had a higher frequency of interruption, at times 10%-15% greater, than the c5.12xlarge instance type. As we see below, this small difference in interruption can lead to almost a 2x change in runtime.

The run with the c5.12xlarge workers (“low interruptibility”) had no difficulty in assembling the targeted 32 worker count from the beginning. In contrast, the run with the r5dn.16xlarge workers (“high interruptibility”) took a few minutes to start the job but with only 5 of the targeted 18 workers count. It took over 200 minutes to increase the node count to only 15 nodes, never reaching the fully requested amount of 18. Switching worker instance types also requires updates to the spark parameters (e.g. spark.executor.cores, executor memory, number of executors), fortunately the Spark Autotuner adjusts these parameters as well for each instance type, making it easy for users. The Sync AutoTuner makes cluster configuration recommendations that take into account availability. Databricks users, like the customer, can take advantage of the cost benefits of spot instances with confidence that unforeseen availability will not negatively impact their job runtimes.

By selecting Spot instances with higher availability, Customer’s runtime accelerated by 47%. (Image by author)


The compute infrastructure on which Databricks runs can have a large impact on the cost and performance of any production job. Because the cloud offers an almost endless array of compute options, understanding how to select which cloud configurations to use can lead to an intractable search space. At Sync our mission is to make this problem go away for data engineers everywhere.

Auto Optimize Apache Spark with the Spark Autotuner

Launch to the cloud based on cost and time: This article explains Sync Computing’s Spark Cluster Autotuner Solution and how it was used to reduce Duolingo’s AWS EMR Spark costs by up to 55%. This solution eliminates the inefficient manual tuning and guesswork currently used when configuring Spark clusters and settings to provide the best cost, performance, and reliability — without any code changes.

The Problem with Spark Infrastructure

Determining cloud infrastructure settings to optimize job cost and speed for modern Spark jobs is neither practical nor possible for cloud developers.  Even with optimal settings for one job, Spark jobs vary daily in code base, data sizes, and cloud spot pricing resulting in wide variations of cost/performance to developers. Most cloud users rely on simple rules of thumb or recommendations from co-workers or past jobs on what settings should be selected. Optimizing these choices requires sweeping through instance types, cloud settings, and spark configurations, a task no busy data engineer has time for.

What if it were possible to explore the effects of hardware changes without having to actually rerun jobs? Buried within the output of every Spark run is a trove of information connecting its performance to the  underlying hardware. When combined with deep mathematical modeling, this data can be used to predict application performance on different cloud hardware configurations. This is the core idea behind Sync’s first solution, the Spark Cluster Autotuner.

How Sync Spark Cluster Autotuner can Help

Sync’s Spark Cluster Autotuner removes the burden of choosing the right AWS cluster hardware and spark configurations for your recurring production Spark applications. Using only your most recent Spark eventlog and its associated cluster information, the Cluster Autotuner returns the optimal cluster and spark configurations for your next run.

Whether “optimal” to you means the fastest, cheapest, or somewhere in between, the Cluster Autotuner will give you the appropriate settings for your needs.

Figure 1: Example configuration selections from the Spark Cluster Autotuner. These options are presented to a user within minutes of uploading a Spark eventlog.

How Sync Spark Cluster Autotuner works

The Cluster Autotuner works by mathematically modeling the task-level details of a spark eventlog and calculating how those details will change on different hardware configurations, resulting in an estimate of the runtime on each set of hardware.

Runtime estimates are combined with the latest AWS pricing and reliability information to yield performance estimates (runtime and cost) for each configuration. An optimization calculation is performed to search through all the configurations to pick the best options for the user.

Figure 2: Basic workflow of the Spark Predictor

Duolingo’s Situation

Duolingo builds the world’s #1 language learning application serving over 40 million monthly active users. As a cloud native company, Duolingo processes terabytes of data daily on the cloud, leading to exorbitant costs. Utilizing the cloud efficiently directly impacts the company’s bottom line.

In the following section, we demonstrate a case study of this solution’s use with Duolingo. The experiment follows the workflow depicted in Figure 2.

Duolingo has a number of recurring production Spark jobs run on AWS EMR, which when run daily or even multiple times per day, incur substantial costs over the course of a year. Their #1 priority was to reduce costs, even at the expense of runtime. Figuring out the best configuration on their own would require extensive manual testing and parameter sweeps, a many-hour task no engineer has the bandwidth for.

Sync presented them with the Spark Cluster Autotuner, which would get rid of manual experimenting, to instantly reduce their cloud costs on two of their ETL jobs. Basic information of these jobs is outlined in Table 1.

JobInput Runtime (min)Input Data Size (TB)Input Cost ($)
Table 1

The most recent eventlog from each job was run through the Cluster Autotuner and the configurations which yielded the lower cluster costs were used in the subsequent runs. The results of this experiment are depicted in Figure 3. For both jobs, the Sync Optimized configuration resulted in a substantial reduction in cost, without touching any code for an easy and fully reversible demonstration.

Figure 3: Cost efficiency of three Spark jobs before and after using the Sync Spark Cluster Autotuner

The Prediction

Figure 4 shows a subset of the predictions using Duolingo’s ETL-D log. Three instance types are shown, where each point on the respective curve represents a different number of workers. Performance estimates of the input, predicted, and measured jobs are indicated by the red, green, and blue triangles, respectively.

Figure 4: Performance predictions on varying hardware configurations for the ETL-D job. The performance points of the input, prediction, and measurement are indicated by the triangles.

In this example, a small number of workers in the prediction results in long-running but inexpensive jobs. As the number of workers increases, the application is expected to be faster but costlier. The initial input configuration was deep in the plateau of diminishing returns of cluster scaling. The recommendation was therefore to reduce the number of workers in order to move away from the runtime plateau.

The key insight enabled by the Cluster Autotuner is given by the knowledge of where your current job lies on the performance curve, and what you need to change to get to another point on that curve. In Duolingo’s case, cost was the only relevant parameter. On the other hand, if runtime was a critical parameter, then it would be easy to pick another point on this curve that runs nearly as fast as the original job but still with significant cost savings.

This flexibility of choice is a major utility of the Spark Cluster Autotuner. The word “optimal” can mean cheapest to one group, or fastest to another, and the Cluster Autotuner gives the right configuration according to each user’s desires. Table 2 shows the input and predicted configurations for this job.

Table 2: Hardware and spark configurations before and after using the Cluster Autotuner for the ETL-D job.

The Measurement

When Duolingo actually ran this predicted configuration in their production runs, they instantly saw dramatic cost savings — which was precisely their goal.

The greatest cost savings come from the reduction in cluster size (from 1,664 to 384 vcpu’s). Although the cluster size was reduced by 4x, the runtime only increased slightly from 17 min to 22 min, and cost was reduced by 55%.

These results can be understood by looking at the activity charts in Figure 5. In the input log, the average number of active cores was only about 1/6th of the cores available to Spark. This indicates that the majority of the job is not well distributed, and most of the cluster time was spent doing nothing. The optimized result reduced the cluster size, bringing the mean activity closer to the available capacity, making the job more efficient and therefore less expensive. Of course, those stages which were well distributed now take longer, resulting in a slightly longer runtime.

Figure 5: Cluster activity for the ETL-D job before and after running Sync’s Spark Cluster Autotuner

At first glance it appears that reducing the cluster size more would improve the utilization even further, resulting in even lower costs. However, this is untrue in this case, because increasing the runtime also increases the driver costs and the EBS costs. The Cluster Autotuner takes all of these factors into account to estimate the total cost of running a job at various cluster sizes.

The next most impactful change to the job is the driver size, as an on-demand instance for the driver can cost as much as several equivalent spot instances. After analyzing the input log, the Cluster Autotuner determined that an m5.xlarge had sufficient memory to handle this job, reducing driver cost by nearly 10x. Lastly, the changes to the Spark configurations are largely to conform to the new hardware configuration, though these settings are necessary for the application to run efficiently on the hardware.

Conclusion – Demo it yourself

This demonstration is just a small example of the complexity built in the Spark Cluster Autotuner. Changes to the hardware and Spark settings can impact the runtime of a job in many and often subtle ways. Appropriately accounting for these effects to accurately predict runtime requires the deep mathematical modeling and optimization of the Cluster Autotuner, which goes far beyond the capability of simple rule-of-thumb decisions or local optimization techniques. But don’t take our word for it, try out our first solution in the link below on your real production Spark jobs today – we’d love your feedback.

Sync Computing – Configure complex cloud infrastructure for your data/ML workloads based on cost and time, before you submit your jobs to obtain the best performance and value.