Let’s say you’re a data engineer and you want to run your data/ML Spark job on AWS as fast as possible, without suffering slow Apache Spark performance. After you’ve written your code to be as efficient as possible, it’s time to deploy to the cloud.
Here’s the problem: there are over 600 instance types in AWS today, and once you add in the various Spark parameters, the number of possible deployment options becomes impossibly large. So inevitably you take a rough guess, or experiment with a few options, pick one that works, and forget about it.
It turns out this guessing game can undo all the work you put in to streamline your code. The graph below shows the performance of a standard Bayes ML Spark job from the HiBench test suite. Each point on the graph is the result of changing just 2 parameters: (1) which compute instance is used, and (2) the number of nodes.
The issue is clear even in this very simple example: if a user selects just these 2 parameters poorly, the runtime can be up to 10X slower, or the job can cost twice as much as it should (with little to no performance gain).
Keep in mind that this is a simplified picture, where we have ignored Spark parameters (e.g. memory, executor count) and cloud infrastructure options (e.g. storage volume, network bandwidth) which add even more uncertainty to the problem.
Daily Changes Make it Worse
To add yet another complication, data being processed today could look very different tomorrow. Fluctuations in data size, skew, and even minor modifications to the codebase can lead to crashed or slow jobs if your production infrastructure isn’t adapting to these changing needs.
How Sync Solved the Problem
At Sync, we think this problem should go away. We also think developers shouldn’t waste time running and testing their job on various combinations of configurations. We want developers to get up and running as fast as possible, completely eliminating the guesswork of cloud infrastructure. At its heart, our solution profiles your job, solves a huge optimization problem, and then tells you exactly how to launch to the cloud.
Here at Sync, we are passionate about optimizing data infrastructure on the cloud, and one common point of confusion we hear from users is which worker instance size is best for their job.
Many companies run production data pipelines on Apache Spark on AWS’s Elastic MapReduce (EMR) platform. As we’ve discussed in previous blog posts, wherever you run Apache Spark, whether on Databricks or EMR, the infrastructure underneath can have a huge impact on overall cost and performance.
To make matters even more complex, the infrastructure settings can change depending on your business goals. Is there a service level agreement (SLA) time requirement? Do you have a cost target? What about both?
One of the key tuning parameters is the instance size your workers run on. Should you use a few large nodes? Or perhaps a lot of small nodes? In this blog post, we take a deep dive into these questions using the TPC-DS benchmark.
Before starting, we want to be clear that these results are very specific to the TPC-DS workload. While it would be nice to generalize, we cannot promise that these trends will hold true for other workloads, and we highly recommend running your own tests to confirm. Alternatively, we built the Autotuner for Apache Spark to help accelerate this process (feel free to check it out yourself!).
With that said, let’s go!
The main question we seek to answer is – “How does the worker size impact cost and performance for Spark EMR jobs?” Below are the fixed parameters we used when conducting this experiment:
EMR Version: 6.2
Driver Node: m5.xlarge
Driver EBS storage: 32 GB
Worker EBS storage: 128 GB
Worker instance family: m5
Worker type: Core nodes only
Workload: TPC-DS 1TB (Queries 1-98 in series)
Cost structure: On-demand, list price (to avoid Spot node variability)
Cost data: Extracted from the AWS cost and usage reports, includes both the EC2 fees and the EMR management fees
Fixed Spark settings:
Number of executors: set to 100% cluster utilization based on the cluster size
spark.executor.memory: automatically set based on number of cores
The fixed Spark settings we selected were meant to mimic safe “default” settings that an average Spark user might pick at first. To explain those parameters a bit more: since we are changing the worker instance size in this study, we decided to keep the number of cores per executor constant at 4. The other parameters, such as the number of executors and executor memory, are automatically calculated to utilize the machines to 100%.
For example, if a machine (worker) has 16 cores, we would create 4 executors per machine (worker). If the worker has 32 cores, we would create 8 executors.
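The sizing rule above can be sketched as a small helper. This is a hypothetical illustration of the “safe default” logic described in this post, not our actual implementation; the 10% memory reserve for overhead is an assumption:

```python
def plan_cluster(num_workers, cores_per_worker, memory_per_worker_gb,
                 cores_per_executor=4, overhead_frac=0.10):
    """Sketch of the 'safe default' sizing above: fix 4 cores per executor,
    then derive executor count and memory to use each worker fully."""
    executors_per_worker = cores_per_worker // cores_per_executor
    total_executors = num_workers * executors_per_worker
    # Split each worker's memory (minus a reserve for OS/overhead, an
    # assumed 10% here) evenly across its executors.
    executor_memory_gb = int(memory_per_worker_gb * (1 - overhead_frac)
                             / executors_per_worker)
    return {"spark.executor.instances": total_executors,
            "spark.executor.cores": cores_per_executor,
            "spark.executor.memory": f"{executor_memory_gb}g"}
```

For example, 10 workers with 16 cores and 64 GB each yields 4 executors per worker (40 total), matching the rule of thumb above.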
The figure below shows the Spark runtime versus the number and type of workers. The trend here is pretty clear: larger workers are in fact faster, and the 4xlarge size outperformed all other worker sizes. If speed is your goal, selecting larger workers could help. If one were to pick a best instance based on the graph below, one might conclude that:
It looks like the 4xlarge is the fastest choice
The figure below shows the true total cost versus the number and type of workers. On the cost metric, the story almost flips compared to the runtime graph above. The smallest instance usually outperformed larger instances when it came to lowering costs. For 20 or more workers, the xlarge instances were cheaper than the other two choices.
If one were to quickly look at the plot below, and look for the “lowest points” which correspond to lowest cost, one could draw a conclusion that:
It looks like the 2xlarge and xlarge instance are the lowest cost, depending on the number of workers
However, the real story emerges when we merge those two plots and look at cost vs. runtime simultaneously. In this plot, it is more desirable to be toward the bottom left, which means the run is both cheaper and faster. As the plot below shows, if one were to look at the lowest points, the conclusion to be drawn is:
It looks like 4xlarge instances are the lowest cost choice… what?
What’s going on here is that for a given runtime, there is always a lower cost configuration with the 4xlarge instances. Put in that perspective, there is little reason to use xlarge sizes, since going to larger machines gets you something both faster and cheaper.
The only caveat is that there is a floor to how cheap (and slow) the 4xlarge cluster can get, reached at a worker count of 1. You could get a cheaper cluster with a smaller 2xlarge cluster, but the runtime becomes quite long and may be unacceptable for real-world applications.
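The “lower left is better” reasoning is just a Pareto-frontier filter over (runtime, cost) points. A minimal sketch, with made-up run data since the real numbers live in the plots:

```python
def pareto_frontier(runs):
    """Keep configurations not dominated on both cost and runtime.
    runs: list of (name, runtime_s, cost_usd) tuples."""
    frontier = []
    for name, rt, cost in runs:
        # A run is dominated if some other run is at least as good on both
        # axes and strictly better on one.
        dominated = any(
            rt2 <= rt and c2 <= cost and (rt2 < rt or c2 < cost)
            for _, rt2, c2 in runs)
        if not dominated:
            frontier.append(name)
    return frontier
```

Applied to the plots above, every xlarge point is dominated by some 4xlarge point, which is why it drops off the frontier.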
Here’s a general summary of how the “best worker” choice can change depending on your cost and runtime goals:
A note on extracting EMR costs
Extracting the actual true costs of individual EMR jobs from AWS billing information is not straightforward. We had to write custom scripts to scan the low-level cost and usage reports, looking for specific EMR cluster tags. The exact mechanism for retrieving these costs will probably vary from company to company, as different security permissions may alter how these costs can be extracted.
If you work at a company and EMR costs are a high priority and you’d like help extracting your true EMR job level costs, feel free to reach out to us here at Sync, we’d be happy to work together.
The main takeaways here are the following points:
It Depends: Selecting the “best” worker is highly dependent on both your cost and runtime goals. It’s not straightforward what the best choice is.
It really depends: Even with cost and runtime goals set, the “best” worker will also depend on the code, the data size, the data skew, Spot instance pricing, and availability, to name just a few.
Where even are the costs? Extracting the actual cost per workload is not easy in AWS, and capturing both the EC2 and EMR management fees is actually quite painful.
Of course here at Sync, we’re working on making this problem go away. This is why we built the Spark Autotuner product to help users quickly understand their infrastructure choices given business needs.
As many previous blog posts have reported, tuning and optimizing the cluster configurations of Apache Spark is a notoriously difficult problem. Especially when a data engineer needs to lower costs or accelerate runtimes on platforms such as EMR or Databricks on AWS, tuning these parameters becomes a high priority.
Here at Sync, we will experimentally explore the impact of driver sizing in the Databricks platform on the TPC-DS 1TB benchmark, to see if we can obtain an understanding of the relationship between the driver instance size and cost/runtime of the job.
Driver node review
For those who may be less familiar with the driver node in Apache Spark, there are many excellent previous blog posts, as well as the official documentation on this topic, and we recommend reading those first. As a quick summary, the driver is an important part of the Apache Spark system and effectively acts as the “brain” of the entire operation.
The driver program runs the main() function, creates the SparkContext, and schedules tasks onto the worker nodes. Aside from these high-level functions, we’d like to note that the driver node also takes part in executing some operations, most famously the collect operation and broadcast joins. During those operations, data is moved to the driver node, and if the driver isn’t appropriately sized, this can cause a driver-side out-of-memory error that shuts down the entire cluster.
As a quick side note, it looks like a ticket has been filed to change this behavior (at least for broadcast joins) in open source Spark core, so be aware that this may change in the future.
The main question we want to ask is “how does driver sizing impact performance as a function of the number of workers?” The reason why we want to correlate driver size with the number of workers is that the number of workers is a very important parameter when tuning systems for either cost or runtime goals. Observing how the driver impacts the worker scaling of the job is a key part of understanding and optimizing a cluster.
Fundamentally, the maximum number of tasks that can be executed in parallel is determined by the number of workers and executors. Since the driver node is responsible for scheduling these tasks, we wanted to see if the number of workers changes the hardware requirements of the driver. For example, does scheduling 1 million tasks require a different driver instance type than scheduling 10 tasks?
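That parallelism ceiling is simple to state in code. The sketch below assumes Spark’s default of one CPU per task (spark.task.cpus = 1):

```python
def max_parallel_tasks(num_workers, executors_per_worker, cores_per_executor,
                       cpus_per_task=1):
    """Upper bound on simultaneously running tasks: one task per core slot.
    The driver must schedule a task onto every one of these slots."""
    total_cores = num_workers * executors_per_worker * cores_per_executor
    return total_cores // cpus_per_task
```

With 50 i3.xlarge workers (4 cores each, one executor per worker), the driver has to keep 200 task slots fed, which is what makes its sizing matter as clusters grow.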
The technical parameters of the experiment are below:
Data Platform: Databricks
Compute type: Jobs (ephemeral cluster, 1 job per cluster)
Photon Enabled: No
Fixed parameters: All worker nodes are i3.xlarge, all configs default
Sweep parameters: Driver instance size (r5a.large, r5a.xlarge, r5a.4xlarge), number of workers
Workload: Databricks’ own benchmark on TPC-DS 1TB (all queries run sequentially)
For reference, here are the hardware specifications of the 3 different drivers used on AWS:
We will break down the results into 3 main plots. The first is below, where we look at runtime vs. number of workers for the 3 different driver types. We see that as the number of workers increases, the runtime decreases. We note that the scaling trend is not linear and shows the typical “elbow” shape; we’ve previously published on the general concept of job scaling. We observe here that the largest driver, r5a.4xlarge, yielded the fastest performance across all worker counts.
In the plot below we see the cost (DBUs in $) vs. number of workers. For the most part, the medium-sized driver, r5a.xlarge, is the most economical, except at the smallest number of workers, where the smallest driver, r5a.large, was the cheapest.
Putting both plots together, we can see the general summary when we plot cost vs. runtime. The small numbers next to each point show the number of workers. In general, the ideal points should be toward the bottom left, as that indicates a configuration that is both faster and cheaper. Points that are higher up or to the right are more expensive and slower.
Some companies are only concerned about service level agreement (SLA) timelines, and do not actually need the “fastest” possible runtime. A more useful way to think about the plot below is to ask the question “what is the maximum time you want to spend running this job?” Once that number is known, you can then select the configuration with the cheapest cost that matches your SLA.
For example, consider the SLA scenarios below:
1) SLA of 2500s – If you need your job to be completed in 2,500s or less, then you should select the r5a.4xlarge driver with a worker size of 50.
2) SLA of 4000s – If you need your job to be completed in 4,000s or less, then you should select the r5a.xlarge driver with a worker size of 20.
3) SLA of 10,000s – If you need your job to be completed in 10,000s or less, then you should select the r5a.large driver with a worker size of 5.
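The selection logic behind those scenarios is simple: keep the configurations that meet the deadline, then take the cheapest. A sketch with made-up runtime/cost numbers (the real values live in the plot above):

```python
def cheapest_under_sla(runs, sla_seconds):
    """Pick the cheapest configuration meeting the SLA.
    runs: list of (config_name, runtime_s, cost_usd) tuples."""
    feasible = [r for r in runs if r[1] <= sla_seconds]
    if not feasible:
        return None  # no configuration meets the SLA
    return min(feasible, key=lambda r: r[2])[0]
```

As the SLA loosens, cheaper and slower configurations become feasible, which is exactly why the “best” driver/worker combination swings across the three scenarios above.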
It’s very convenient to see the scaling trend of all 3 drivers plotted in this manner, as there are several key insights gained here:
There is a generally “good” driver for TPC-DS 1TB – across the spectrum, it’s clear that r5a.xlarge is a good choice, as it is usually cheaper and faster than the other driver sizes. This shows the danger: if your driver is too big or too small, you could be wasting money and time.
At the extremes, driver size matters for TPC-DS 1TB – At the wings of either large clusters (50 workers) or small clusters (5 workers) we can see that the best driver selection can swing between all 3 drivers.
Drivers can be too big – At 12 workers, the r5a.4xlarge performance is slightly faster but significantly more expensive than the other two driver types. Unless that slight speedup is important, it’s clear to see that if a driver is too large, then the extra cost of the larger driver is not worth the slight speedup. It’s like buying a Ferrari to just sit in traffic – definitely not worth it (although you will look cool).
Small driver bottleneck – For the small driver curve (r5a.large), we see that the blue line’s elbow occurs at a higher runtime than the middle driver (r5a.xlarge). This implies that the smaller driver is creating a runtime bottleneck for the entire workload as the cluster becomes larger. The next section will dive into why.
Root cause analysis for the “small driver bottleneck”
To investigate the cause of the small driver bottleneck, we looked into the Spark eventlogs to see what values changed as we scaled the number of workers. In the Spark UI in Databricks, the typical high level metrics for each task are shown below and plotted graphically. The image below shows an example of a single task broken down into the 7 main metrics:
When we aggregated all of these values across all tasks, across all the different drivers and workers, the numbers were all pretty consistent, except for one number: “Scheduler Delay”. For those who may not be familiar, the formal definition from the Databricks Spark UI, is shown in the image below:
“Scheduler delay includes time to ship the task from the scheduler to the executor, and time to send the task result from the executor to the scheduler. If scheduler delay is large, consider decreasing the size of tasks or decreasing the size of task results.”
In the graph below, we plot the total aggregated scheduler delay of all tasks for each job configuration vs. the number of workers. It is expected that the aggregated scheduler delay should increase with a larger number of workers, since there are more tasks. For example, if there are 100 tasks, each with 1s of scheduler delay, the total aggregated scheduler delay is 100s (even if all 100 tasks executed in parallel and the “wall clock” scheduler delay is only 1s). With 1,000 tasks, the total aggregated scheduler delay should therefore be larger still.
Theoretically this should scale roughly linearly with the number of workers for a “healthy” system. For the “middle” and “large” sized drivers (r5a.xlarge and r5a.4xlarge respectively), we see the expected growth of the scheduler delay. However, for the “small” r5a.large driver, we see a very non-linear growth of the total aggregated scheduler delay, which appears to be a large contributor both to the overall longer job runtime and to the “small driver bottleneck” issue.
To understand more formally what Scheduler Delay is, let’s look at the Spark source code, specifically the schedulerDelay function in AppStatusUtils.scala. At a high level, scheduler delay is a simple calculation, as shown in the code below:
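The actual implementation in AppStatusUtils.scala is Scala; paraphrased in Python (an approximation of the source, with all times in milliseconds), the calculation is roughly:

```python
def scheduler_delay(duration, executor_run_time, executor_deserialize_time,
                    result_serialization_time, getting_result_time):
    """Rough paraphrase of schedulerDelay in Spark's AppStatusUtils.scala:
    whatever part of the task's duration is not accounted for by executing,
    (de)serializing, or fetching results. Clamped at zero."""
    return max(0, duration
                  - executor_run_time
                  - executor_deserialize_time
                  - result_serialization_time
                  - getting_result_time)
```

So a task that took 1,000 ms end to end but only ran, (de)serialized, and fetched results for 900 ms gets charged 100 ms of scheduler delay.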
To put it in plain text, scheduler delay is basically a catch-all term: the time a task spends doing something other than executing, serializing data, or getting results. A further question is which of these terms is increasing or decreasing due to the smaller driver. Maybe duration is increasing, or maybe gettingResultTime is decreasing?
If we look at the apples to apples case of 32 workers for the “medium” r5a.xlarge driver and the “small” r5a.large driver, the runtime of the “small” driver was significantly longer. One could hypothesize that the average duration per task is longer (vs. one of the other terms becoming smaller).
In summary, our hypothesis here is that by reducing the driver size (number of VCPUs and memory), we are incurring an additional time “tax” on each task by taking, on average, slightly longer to ship a task from the scheduler on the driver to each executor.
A simple analogy: imagine you’re sitting in bumper-to-bumper traffic on a highway, and all of a sudden every car (a task in Spark) grows 20% longer. If there are enough cars, you could be set back miles.
Based on the data described above, the answer to our question is that an inappropriately sized driver can lead to excess cost and degraded performance as workers scale up and down. We present a hypothesis that a driver that is “too small,” with too few vCPUs and too little memory, can cause, on average, an increase in task duration via additional overhead in the scheduler delay.
This final conclusion is not terribly new to those familiar with Spark, but we hope seeing actual data can help create a quantitative understanding on the impact of driver sizing. There are of course many other things that could cause a poor driver to elongate or even crash a job, (as described earlier via the OOM errors). This analysis was just a deep dive into one observation.
I’d like to put a large caveat here that this analysis was specific to the TPC-DS workload, and it would be difficult to generalize these findings across all workloads. Although the TPC-DS benchmark is a collection of very common SQL queries, in reality individual code, or things like user defined functions, could throw these conclusions out the window. The only way to know for sure about your workloads is to run some driver sizing experiments.
As we’ve mentioned many times before, distributed computing is complicated, and optimizing your cluster for your job needs to be done on an individual basis. Which is why we built the Apache Spark Autotuner for EMR and Databricks on AWS to help data engineers quickly find the answers they are looking for.
Here at Sync we are always trying to learn about and optimize complex cloud infrastructure, with the goal of bringing more knowledge to the community. In our previous blog post we outlined a few high-level strategies companies employ to squeeze out more efficiency in their cloud data platforms. One very popular response we hear from mid-sized and large enterprise companies is:
“We use Autoscaling to minimize costs”
We wanted to zoom into this statement to really understand how true it is, and to get a better understanding of the fundamental question:
“Is autoscaling Apache Spark cost efficient?”
To explain in more detail, we wanted to investigate the technical side of Autoscaling and dive deep into a specific example. We chose to begin with a gold-standard workload, the TPC-DS benchmark, to minimize any argument that we cherry-picked a weird workload to skew the final answer. Our goal here is to be as technical and informative as possible about a few workloads – we are not trying to perform a broad comprehensive study (that would take a long time). So let’s begin:
What is Autoscaling?
For those who may not know, Autoscaling is the general concept that a cluster should automatically tune the number of workers (or instances on AWS) based on the needs of your job. The basic message told to companies is that autoscaling will optimize the cluster for your workload and minimize costs.
Technically, Autoscaling is usually a reactive algorithm that measures some utilization metric inside your cluster to determine whether more or fewer resources are needed. While this makes logical sense, in reality the complexity of Apache Spark and constantly changing cloud infrastructure make the problem highly unpredictable.
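As an illustration only (the vendors’ actual algorithms are proprietary, and this toy policy is our own invention), a reactive autoscaler might look something like:

```python
import math

def reactive_target_workers(pending_tasks, cores_per_worker,
                            min_workers, max_workers):
    """Toy reactive autoscaling policy: size the cluster to the current
    backlog of tasks, clamped to the user's min/max worker settings.
    Real implementations also smooth over time, batch resizes, wait for
    new nodes to boot, etc., which is where the overheads come from."""
    desired = math.ceil(pending_tasks / cores_per_worker)
    return max(min_workers, min(max_workers, desired))
```

Even this toy version shows the catch: the target chases a metric that changes faster than nodes can be added or removed, so the cluster is perpetually reacting to a backlog that has already moved on.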
In the Databricks UI, autoscaling is just a simple checkbox that many people may overlook. The choice people make by selecting that box could impact their overall performance significantly.
Since Databricks and EMR are managed platforms, the exact autoscaling algorithms they employ are behind closed doors, so we don’t know the details of their logic. The only thing we can do is measure their performance.
Our goal is to provide a technical study of Autoscaling from a novice’s point of view. Meaning, our base case to compare against will be whatever “default” settings Databricks suggests. We are not comparing against the global best or against an expert who has spent many days optimizing a particular cluster (who we think would probably do an awesome job).
Data Platform: Databricks
Compute type: Jobs (ephemeral cluster, 1 job per cluster)
Photon Enabled: No
Baseline configuration: Default params given to users at spin up
AWS market: Driver on-demand, workers on spot with 100% on-demand fall back
Workload: Databricks’ own benchmark on TPC-DS 100GB (all 99 queries run sequentially)
To keep things simple, we ran 3 comparison job runs:
Fixed 8 Nodes – a fixed 8 node cluster using the default machine types suggested to us in the Databricks UI.
Fixed 2 Nodes w/ Autotuner – We used our Apache Spark Autotuner product to recommend an optimized fixed cluster giving us the lowest-cost option (runtime not optimized). The recommendation was to use 2 nodes (with different instance types than the default).
Autoscaler 2-8 Nodes – We used the default UI settings in Databricks here.
(Table: No. of Workers, DBU Cost [$], AWS Cost [$], and Total Cost [$] for the Fixed 8 Nodes, Fixed Cluster (Autotuner), and Autoscaler 2-8 Nodes runs.)
To our surprise, of the 3 jobs run, the default autoscaler performed the worst in both runtime and cost. Both the fixed 8-node and the fixed 2-node cluster outperformed autoscaling in both time and cost. The Sync-optimized cluster outperformed autoscaling by 37% in cost and 14% in runtime.
To examine why the autoscaled cluster performed poorly, let’s look at the number of workers created and shut down over time, in comparison to the fixed 2-node cluster. The figure below tells the basic story: the autoscaled cluster spent a lot of time scaling up and down, tuning itself to the workload. At first glance, that is exactly what autoscaling is supposed to do, so why did the autoscaled cost and runtime perform so poorly?
The main reason, from what we can tell, is that there is a time penalty for changing the cluster size – specifically for upsizing it. We can see from the cluster event log below that the time between “RESIZING” and “UPSIZE_COMPLETED” can span several minutes. Based on the Spark UI, executors don’t get launched until “UPSIZE_COMPLETED” occurs, so no new computing happens until that step is reached.
Another observation: in order to run the TPC-DS benchmark, we had to run an init_script to install some code at the start of the job. Based on the cluster event log below, it looks like every time the cluster upsizes, the new machines have to reinstall all the init_scripts, which costs time and money. This is something to consider: if your job requires specific init_scripts, this will certainly hurt autoscaling performance.
So to summarize, you are paying for the “ramp up time” of new workers during autoscaling, where no computing is occurring. The more often your cluster upsizes, the more you will be waiting and paying.
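A back-of-the-envelope model of that waste, with entirely hypothetical inputs, makes the point concrete:

```python
def ramp_up_waste(upsize_events, ramp_minutes_per_event,
                  workers_added_per_event, worker_price_per_hour):
    """Rough cost of paying for new workers while they boot and run
    init scripts but do no computing, per the observation above."""
    idle_worker_hours = (upsize_events * workers_added_per_event
                         * ramp_minutes_per_event / 60.0)
    return idle_worker_hours * worker_price_per_hour

# e.g. 4 upsize events, 2 workers each, 3 minutes of ramp-up apiece,
# at a made-up $1.00/hr per worker -> $0.40 of pure waste for this job
```

Forty cents sounds small, but it recurs on every run of every autoscaled job, and it grows with larger instance prices and longer init scripts.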
Databricks mentions that using pools can help speed up autoscaling by keeping a pool of “warm” instances ready to go. Although you are not charged DBUs for idle pooled instances, you still pay AWS’s fees for those machines. So in the end, whether pools make sense still depends on your workload, cluster size, and use case.
Another issue is the question of optimizing for throughput. If 3 nodes process the data at the same rate as 8 nodes, then ideally autoscaling should stop at 3 nodes. But that doesn’t seem to be the case here, as autoscaling simply went up to the max workers set by the user.
The optimized fixed cluster looks at cost and throughput to find the best cluster, which is another reason why it is able to outperform the autoscaling solution.
Some follow up questions:
Is this just a TPC-DS specific artifact?
We ran the same tests with two other internal Spark jobs, which we call Airline Delay and Gen Data, and observed the same trend – the autoscaled cluster was more expensive than fixed clusters. The amount of autoscaling fluctuation was much smaller for Airline Delay, so the advantage of a fixed cluster was reduced. Gen Data is a very I/O-intense job, and the autoscaler actually never scaled the cluster beyond 2 nodes. For the sake of brevity, we won’t show those details here (feel free to reach out if there are more questions).
We just wanted to confirm that these results weren’t specific to TPC-DS; with more time we could do a large-scale test with a diverse set of workloads. Here we observed that the optimized fixed cluster (using the Sync Apache Spark Autotuner) achieved 28% and 65% cost savings over default autoscaling for Airline Delay and Gen Data respectively.
What if we just set Autoscaling to 1-2 nodes (instead of 2-8)?
We thought that if we changed the autoscaling min and max to match the “Fixed 2 node Autotuner” cluster, it should achieve about the same runtime and cost. To our surprise, the autoscaler bounced back and forth between 1 and 2 nodes, which caused a longer job run than the fixed cluster. In the plot below, we added the 1-2 node autoscaling job to the worker plot. Overall, the fixed 2-node cluster was still 12% cheaper than the autoscaled version of the same cluster with 1-2 nodes.
What this result indicates is that the autoscaler’s min/max worker settings are themselves parameters to optimize for cost, and they require experimentation.
How does the cost and runtime of the job change vs. varying the autoscaling max worker count?
If the cost and runtime of your job changes based on the input into max and min worker count, then autoscaling actually becomes a new tuning parameter.
The data below shows what happens if we keep min_worker = 2 but sweep max_worker from 3 to 8. Clearly, both cost and runtime vary quite a bit with the max worker count, and the shapes of these curves depend on the workload. The bumpiness of the total cost can be attributed to fluctuating Spot prices.
The black dashed line shows the runtime and cost of the optimized fixed 2-node cluster. We note that a fixed cluster was able to outperform even the best autoscaling configuration on both cost and runtime for the TPC-DS workload.
How did we get the cost of the jobs?
It turns out obtaining the actual cost charged for your jobs is pretty tedious and time consuming. As a quick summary, below are the steps we took to obtain the actual observed costs of each job:
Obtain the Databricks ClusterId of each completed job. (this can be found in the cluster details of the completed job under “Automatically added tags”)
In the Databricks console, go to the “manage account>usage tab”, filter results by tags, and search for the specific charge for each ClusterId. (one note: the cost data is only updated every couple of hours, so you can’t retrieve this information right after your run completes)
In AWS, go to your cost explorer, filter by tags, and type in the same cluster-id to obtain the AWS costs for that job (this tag is automatically transferred to your AWS account). (Another note, AWS updates this cost data once a day, so you’ll have to wait)
Add together your DBU and AWS EC2 costs to obtain your total job cost.
So to obtain the actual observed total cost (DBU and AWS), you have to wait around 24 hours for all of the cost data to hit its final endpoints. We were disappointed that we couldn’t see the actual cost in real time.
In our analysis, we saw that a fixed cluster could outperform an autoscaled cluster in both runtime and cost for the 3 workloads we looked at, by 37%, 28%, and 65% respectively. Our experiments showed that by simply sticking to a fixed cluster, we eliminated all of the overhead that came with autoscaling, which resulted in faster runtimes and lower costs. Ultimately, the net cost efficiency depends on whether the scaling benefits outweigh the overhead costs.
To be fair to the autoscaling algorithm, it’s very difficult to build a universal algorithm that reactively works for all workloads. One has to analyze the specifics of each job in order to truly optimize the cluster underneath and then still experiment to really know what’s best. This point is also not specific to Databricks, as many data platforms (EMR, Snowflake, etc) also have autoscaling policies that may work similarly.
To summarize our findings, here are a few high level takeaways:
Autoscaling is not one-size-fits-all – Cluster configuration is an extremely complicated topic that is highly dependent on the details of your workload. A reactive autoscaling algorithm, with the overheads associated with changing the cluster, is a good attempt but does not solve the problem of cluster optimization.
Autoscaling still requires tuning – Autoscaling is not a “set and forget” solution; it still requires tuning and experimentation to find the min and max worker settings that are optimal for your application. Unfortunately, since the autoscaling algorithm is opaque to users, the fastest way to determine the best settings is manual experimentation.
So when is autoscaling good to use for batch jobs? It’s difficult to give a general answer because, as mentioned above, it all depends on your workload. But two scenarios I could see are: (1) your job has long periods of idle time, so autoscaling can correctly shut down nodes, or (2) you are running ad-hoc data science experiments and prioritizing productivity over cost. Scenarios (1) and (2) could be the same thing!
So what should people do? If cost efficiency of your production level Databricks jobs is a priority, I would heavily consider performing an experiment where you select a few jobs, switch them to fixed clusters, and then extract the costs to do a before and after analysis – just like we did here.
The challenge with that last suggestion is: what is the optimal fixed cluster? This is an age-old question that used to require a lot of manual experimentation, which is why we built the Apache Spark Autotuner to figure it out quickly. In this study, that is how I found the optimal fixed clusters with a single file upload, without running numerous experiments.
Maybe autoscaling is great for your workloads, maybe it isn’t – unfortunately, the answer is really “it depends.” There’s only one way to find out: you need to experiment.
Here at Sync, we’ve spoken with companies of all sizes, from some of the largest companies in the world to 50 person startups who desperately need to improve their cloud costs and efficiencies for their data pipelines. Especially in today’s uncertain economy, companies worldwide are implementing best practices and utilizing SaaS tools in an effort to save their bottom lines on the cloud.
This article is the first in a series of posts dedicated to the topic of improving cloud efficiency for data pipelines, why it’s so hard, and what can be done.
In our discussions with companies, we have identified several recurring challenges that hinder their ability to improve their cloud infrastructure for big data platforms. While our current focus is on systems that use Apache Spark (such as EMR or Databricks on AWS), many of these problems are common across different technologies.
In this article, we will summarize the top six reasons we’ve heard from companies for these challenges and provide an overview of the underlying issues. In our next blog post, we will discuss the strategies that companies use to address these problems.
1. Developer Bandwidth
Optimizing complex cloud infrastructure is not easy; it requires significant time and experimentation. Simply understanding (let alone implementing) the potential business trade-offs of various infrastructure choices requires hours of research. It’s a full-time job, and developers already have one.
For example, should a developer run their Apache Spark job on memory-, compute-, or network-optimized instance families? What about AWS’s new Graviton chips – those sound pretty promising? And what about all of the Apache Spark parameters, like partition sizes, memory, and number of executors? The only way to truly know is to manually experiment and sweep parameters – and that is just not a sustainable approach.
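To see why sweeping parameters is unsustainable, it helps to count the combinations. The sketch below uses a deliberately tiny, hypothetical grid of instance families and Spark settings (the specific values are illustrative, not recommendations) and shows how quickly the search space grows.

```python
# Illustrative only: even a tiny tuning grid explodes combinatorially.
# The instance families and Spark settings below are hypothetical examples.
from itertools import product

instance_families = ["r5 (memory)", "c5 (compute)", "m6g (Graviton)"]
node_counts = [2, 4, 8, 16]
executor_memory = ["4g", "8g", "16g"]
shuffle_partitions = [200, 400, 800]

grid = list(product(instance_families, node_counts,
                    executor_memory, shuffle_partitions))
print(f"{len(grid)} configurations to test")  # 3 * 4 * 3 * 3 = 108

# Assuming roughly 30 minutes per trial run, one full sweep costs:
sweep_hours = len(grid) * 30 / 60
print(f"~{sweep_hours:.0f} hours of cluster time for a single sweep")
```

And this grid ignores hundreds of other instance types and Spark parameters; a realistic sweep would be orders of magnitude larger, and it would need to be re-run whenever the code or data changes.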
Even if an optimal configuration is found, it may quickly go stale as the code, input data profile, business KPIs, and cloud pricing all fluctuate.
Developers are mostly just trying to get their code up and running in the first place; they rarely have the luxury of optimizing it.
2. Switching cost of existing infrastructure
Improving existing infrastructure is typically a lot more work than just changing a few lines of code. Optimizing infrastructure for substantial cost or runtime gains could come at the cost of changing fundamental components of your architecture.
Many developers wonder whether “the juice is worth the squeeze.”
For example, changing platforms (let’s say Databricks vs EMR, Apache Hive vs Apache Iceberg, or Non-Spark vs. Spark) is a daunting task that could slow down product velocity (and hence revenue) while the switch is being implemented.
We’ve seen companies knowingly live with inefficient systems simply because switching to something better is “too much work.”
3. Too many changing variables
Many production data systems are subject to many fluctuating parameters, such as varying data size, skew, volume of jobs, spot node availability and pricing, changing codebase, and engineer turnover to name a few. The last one is particularly challenging, as many pipelines are set up by former employees, and big changes come with big risk.
With such complexity, many teams deem it too complicated and risky to try to optimize and would rather focus on lower-hanging fruit. For new workloads, companies often take the easy way out and just copy and paste prior configurations from a completely different workload – which can lead to poor resource utilization.
4. Lack of expertise
Cloud native companies typically prioritize fast product development time, and cloud providers are happy to oblige by quickly making a massive amount of resources available with a single click. While this democratization of cloud provisioning has been a game changer for speed to market, cloud optimization isn’t typically recognized as a part of the developer job description.
The conflict between spending time learning about low-level compute trade-offs and building the next feature prevents this knowledge from spreading in the general workforce. Furthermore, it’s no secret that there is a massive cloud skills shortage in the market today. Outside of the FAANG mafia, it’s incredibly difficult for other companies to compete for talent with deep knowledge of cloud infrastructure.
One large global company complained to us that they keep losing architects to AWS, GCP, and Azure themselves! Simply put, many companies couldn’t optimize cloud infrastructure even if they desperately wanted to.
5. Misaligned incentives
At large companies, different teams are responsible for different parts of a company’s cloud footprint, from developers up to the CTO – and each team can have different incentives. With these differing motivations, it can be difficult to quickly enact changes that improve efficiency, since decisions from one group may negatively impact the goals of another.
For example, companies often have vendor contracts they need to stay within the boundaries of, but when it’s projected that they will exceed the forecasted consumption, cloud optimization can rise from low to high priority. This sudden change can cause a frenzy of discussions on what to do, the tradeoffs, and impact on various groups. Companies have to choose between keeping developers focused on new product development, or refocusing them on cloud optimization projects.
For developers to play an effective role in this complex web, they must be able to quickly understand and articulate the business impact of their infrastructure choices – but this is much easier said than done. Ensuring that the thousands of low-level infrastructure choices a developer makes align with the high-level business goals of a VP is incredibly difficult.
6. Scale and Risk
Many companies we’ve spoken to run tens of thousands of business critical production Apache Spark jobs a day. Optimizing jobs at this scale is a daunting task, and many companies only consider superficial optimizations such as reserved instances or autoscaling, but lack the ability to individually tune and optimize resources for each of the thousands of jobs launched daily.
So what do companies do now?
Unfortunately, the answer is “it depends.” The solution to addressing the challenges of cloud optimization varies depending on a company’s current structure and approach. Some companies may choose to form internal optimization teams or implement autoscaling, while others may explore serverless options or engage with cloud solution architects. In our next post, we will discuss these common strategies and their respective advantages and disadvantages.
Here at Sync, our whole mission is to build a product that solves these issues and empowers developers at companies of all sizes to reach a new level of cloud infrastructure efficiency, aligned with their business goals.
Our first product is an Autotuner for Apache Spark which profiles existing workloads from EMR and Databricks, and provides data engineers an easy way to obtain optimized configurations to achieve cost or runtime business goals. We recently released an API for the Autotuner which allows developers to programmatically scale the optimization intelligence across all their workloads.
We’re passionate about solving this deep and extremely complicated problem. In our next post we’ll discuss possible solutions and their tradeoffs. Please feel free to follow us on Medium to keep in the loop for when we release the next article in this series.
At Sync we’re building something that is really hard. We’re trying to disrupt a $100B industry where some of the world’s biggest companies live. On top of that, we’re attacking a layer in the tech stack that is mired in complexity, history, and evolution.
So why do we think we’re going to win? Because we are approaching the problem in an entirely new way. Nobody has ever tried to build what we’re building before. We are making deep mathematical optimization available to end users via a SaaS product that will be fully integrated into their existing workflows. We have the team, the investors, and the vision to push forward down a path, and we’re looking for mission oriented people who revel in difficulty and love a challenge.
Why work for Sync?
At Sync, we offer market competitive salaries, company equity, flexible time off, 100% health insurance coverage, parental leave, standard benefits, 401K, and company retreats. Internally we have employee lecture talks, zoom coffee breaks, live zoom games (with prizes), and dedicated slack channels for puzzles and food.
All employees get a Mac laptop and an office stipend that makes your work life better. Since we’re a fully remote company, each employee’s needs are different and you should be free to customize.
Although work is important, nothing is more important than your personal life, family, and health. We don’t ask employees to pull all-nighters or work on weekends. Personally, I have 3 kids and understand that my time isn’t always my own. Sometimes we have to drop things and go to a recital. In general, we have a “grad school” mentality – all we ask is for you to try to hit your high level goals on time – how, where, and when is up to you.
What will you be working on?
You will be working on a new SaaS product aimed at changing the way developers control cloud infrastructure. To clarify, we’re not building another framework for developers to build new systems on (we think there are enough options). Instead – we are building the core intelligence and automation that developers will utilize in their existing workflows. Our product must be accurate, performant, and simple to use.
The product is technically deep, rich, and solves a very challenging customer pain. You will find no shortage of technical challenges that will push you to think in new ways and force you to learn new skills.
Who should apply?
We’re looking for mission oriented people who revel in a challenge and value building something new. As a startup, we value people who understand the flexibility, grittiness, and determination that is required when tackling such a daunting mission with an early stage company. We are looking for people who understand how to move fast and efficiently, how to communicate with the team, align on goals, receive feedback, and push forward. Time is our most scarce resource.
What is the interview process like?
The process starts with a chat with our head of talent who is looking for high level goal alignment and culture fit. The next phase is a meeting with the hiring manager who will dive deeper into the role and assess technical fit. The final phase is a panel interview with 3-4 current employees who will assess technical, culture, and goal fits. The full panel then meets to discuss and a final decision is made. Assuming all schedules align, the process can be as fast as 2 weeks from first call to offer.
Who will I be working with?
Depending on the role you’re applying for, you’ll be placed on the appropriate team with a manager. The team’s background is quite diverse, spanning PhDs from top universities to cloud infrastructure and product veterans, to early career folks.
Where will I work?
The team is fully remote. You can work from wherever you like within the United States. We’re a remote-first company and have been from the start. To help foster a human connection, we sponsor company-wide retreats focused on goal alignment and just hanging out.
How do I apply?
If all of this sounds appealing, please check out our job postings to see if anything is a match.