Blog

Is Databricks autoscaling cost efficient?

Here at Sync we are always trying to learn about and optimize complex cloud infrastructure, with the goal of sharing that knowledge with the community.  In our previous blog post we outlined a few high level strategies companies employ to squeeze more efficiency out of their cloud data platforms.  One very popular response we hear from mid-sized to large enterprise companies is:

“We use Autoscaling to minimize costs”

We wanted to zoom into this statement to see how true it really is, and to get a better understanding of the fundamental question:

“Is autoscaling Apache Spark cost efficient?”  

To explain in more detail, we wanted to investigate the technical side of Autoscaling and dive deep into a specific example.  We chose to begin with a gold standard workload, the TPC-DS benchmark, to minimize any argument that we cherry picked a weird workload to skew the final answer.  Our goal here is to be as technical and informative as possible about a few workloads – we are not trying to perform a broad comprehensive study (that would take a long time).  So let’s begin:

What is Autoscaling?

For those who may not know, Autoscaling is the general concept that a cluster should automatically tune the number of workers (or instances on AWS) based on the needs of your job.  The basic message companies hear is that autoscaling will optimize the cluster for your workload and minimize costs.

Technically, Autoscaling is usually a reactive algorithm that measures some utilization metric inside your cluster to determine if more or less resources are needed.  While this makes logical sense, in reality the complexity of Apache Spark and constantly changing cloud infrastructure make this problem highly unpredictable.  
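To make that concrete, here is a toy sketch of the kind of reactive, utilization-driven loop an autoscaler typically runs.  The thresholds, polling interval, and the cluster object are our own illustrative assumptions, not Databricks’ actual (closed-source) logic.

```python
import time

# Illustrative thresholds -- real autoscalers tune these internally.
SCALE_UP_UTILIZATION = 0.80    # add workers above this busy fraction
SCALE_DOWN_UTILIZATION = 0.30  # remove workers below this busy fraction
CHECK_INTERVAL_SECONDS = 30

def reactive_autoscaler(cluster, min_workers, max_workers):
    """Toy reactive scaler: poll utilization, resize, repeat.

    `cluster` is a hypothetical object exposing is_running(), utilization(),
    num_workers(), and resize(); it stands in for whatever internal APIs a
    real platform uses.
    """
    while cluster.is_running():
        busy_fraction = cluster.utilization()  # e.g. busy task slots / total slots
        current = cluster.num_workers()

        if busy_fraction > SCALE_UP_UTILIZATION and current < max_workers:
            # Upsizing is not free: new nodes must boot, join the cluster,
            # and run init scripts before their executors accept tasks.
            cluster.resize(min(current * 2, max_workers))
        elif busy_fraction < SCALE_DOWN_UTILIZATION and current > min_workers:
            cluster.resize(max(current // 2, min_workers))

        time.sleep(CHECK_INTERVAL_SECONDS)
```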

In the Databricks UI, autoscaling is just a simple checkbox that many people may overlook.  The choice people make by selecting that box could impact their overall performance significantly.

Many people use Databricks or EMR, but the exact autoscaling algorithms those platforms employ are behind closed doors, so we don’t know the details of their logic.  The only thing we can do is measure their performance.

Experiment Setup

Our goal is to provide a technical study of Autoscaling from a novice’s point of view.  Meaning, our base case to compare against will be whatever “default” settings Databricks suggests.  We are not comparing against the global best or against an expert who has spent many days optimizing a particular cluster (who we think would probably do an awesome job).

  • Data Platform:  Databricks
  • Compute type: Jobs (ephemeral cluster, 1 job per cluster)
  • Photon Enabled: No
  • Baseline configuration:  Default params given to users at spin up
  • AWS market:  Driver on-demand, workers on spot with 100% on-demand fallback
  • Workload: Databricks’ own benchmark on TPC-DS 100GB (all 99 queries run sequentially)

To keep things simple, we ran 3 comparison job runs:

  1. Fixed 8 Nodes – a fixed 8 node cluster using the default machine types suggested to us in the Databricks UI.
  2. Fixed 2 Nodes w/ Gradient – We used our Apache Spark Gradient product to recommend an optimized fixed cluster that gives us the lowest cost option (runtime not optimized).  The recommendation was to use 2 nodes (with different instance types than the default).
  3. Autoscaler 2-8 Nodes – We used the default UI settings in Databricks here.
|                | Fixed Cluster | Fixed Cluster (Gradient) | Autoscaler 2-8 Nodes |
|----------------|---------------|--------------------------|----------------------|
| No. of Workers | 8             | 2                        | 2-8                  |
| Driver Node    | i3.xlarge     | r5a.large                | i3.xlarge            |
| Worker Nodes   | i3.xlarge     | i3.2xlarge               | i3.xlarge            |
| Runtime [s]    | 1593          | 2441                     | 2834                 |
| DBU Cost [$]   | 0.60          | 0.39                     | 0.73                 |
| AWS Cost [$]   | 0.92          | 0.92                     | 1.35                 |
| Total Cost [$] | 1.52          | 1.31                     | 2.08                 |
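For concreteness, the fixed and autoscaled configurations in the table roughly correspond to Databricks Jobs “new_cluster” specs like the sketch below (expressed as Python dicts).  This is a sketch, not our exact submission: the Spark runtime version is a placeholder and your workspace defaults may differ.

```python
# Sketch of the job cluster specs behind the table above.
# The Spark runtime version is a placeholder; other fields mirror the table.

fixed_8_nodes = {
    "spark_version": "10.4.x-scala2.12",       # placeholder runtime
    "node_type_id": "i3.xlarge",
    "driver_node_type_id": "i3.xlarge",
    "num_workers": 8,                           # fixed-size cluster
    "aws_attributes": {
        "first_on_demand": 1,                   # driver on-demand
        "availability": "SPOT_WITH_FALLBACK",   # workers on spot, fall back to on-demand
    },
}

autoscaled_2_to_8 = {
    "spark_version": "10.4.x-scala2.12",       # placeholder runtime
    "node_type_id": "i3.xlarge",
    "driver_node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 8},  # the UI autoscaling checkbox maps to this
    "aws_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
    },
}
```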

The results

To our surprise, of the 3 jobs run, the default autoscaler performed the worst in both runtime and cost.  Both the fixed 8 node cluster and the fixed 2 node cluster outperformed autoscaling on both measures.  The Sync optimized cluster outperformed autoscaling by 37% in cost and 14% in runtime.

To examine why the autoscaled cluster performed poorly, let’s look at the number of workers created and shut down over time, in comparison to the fixed 2 node cluster.  The figure below tells the basic story: the autoscaled cluster spent a lot of time scaling up and down, tuning itself to the workload.  At first glance, that is exactly what autoscaling is supposed to do, so why did the autoscaled cluster’s cost and runtime turn out so poorly?

The main reason, from what we can tell, is that there is a time penalty for changing the cluster size – specifically for upsizing the cluster.  We can see from the cluster event log below that the time between “RESIZING” and “UPSIZE_COMPLETED” can span several minutes.  Based on the Spark UI, the executors don’t get launched until “UPSIZE_COMPLETED” occurs, so no new computing happens until this step completes.

Another observation is that in order to run the TPC-DS benchmark, we had to run an init_script to install some code at the start of the job.  Based on the cluster event log below, it looks like every time the cluster upsizes, the new machines have to rerun all of the init_scripts, which costs time and money.  This is something to consider: if your job requires specific init_scripts, they will certainly drag down autoscaling performance.

So to summarize, you are paying for the “ramp up time” of new workers during autoscaling, where no computing is occurring.  The more often your cluster upsizes, the more you will be waiting and paying.  

Databricks mentions that using pools can help speed up autoscaling, by keeping a pool of “warm” instances ready to be kicked off.  Although you are not charged DBUs for idle instances in a pool, you do still have to pay AWS’s fees for those machines.  So in the end, whether the pools solution makes sense still depends on your workload, cluster size, and use case.
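For reference, attaching a job cluster to a pool is a small change to the cluster spec.  A minimal sketch is below; the pool ID and runtime version are placeholders.

```python
# Sketch: pointing an autoscaled job cluster at a pre-warmed instance pool.
# The pool ID and Spark runtime version are placeholders.
autoscaled_with_pool = {
    "spark_version": "10.4.x-scala2.12",
    "instance_pool_id": "pool-0123456789abcdef",   # placeholder pool ID
    "autoscale": {"min_workers": 2, "max_workers": 8},
}
```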

Another issue is the question of optimizing for throughput.  If 3 nodes process the data at the same rate as 8 nodes, then ideally autoscaling should stop at 3 nodes.  But that doesn’t seem to be the case here, as autoscaling just went up to the max workers set by the user.

The optimized fixed cluster looks at cost and throughput to find the best cluster, which is another reason why it is able to outperform the autoscaling solution.

Some follow up questions:

  • Is this just a TPC-DS specific artifact? 

We ran the same tests with two other internal Spark jobs, which we call Airline Delay and Gen Data, and observed the same trend – the autoscaled cluster was more expensive than the fixed clusters.  The amount of autoscaling fluctuation was much less for Airline Delay, so the advantage of a fixed cluster was smaller.  Gen Data is a very I/O intensive job, and the autoscaler actually did not scale the cluster beyond 2 nodes.  For the sake of brevity, we won’t show those details here (feel free to reach out if there are more questions).

We just wanted to confirm that these results weren’t specific to TPC-DS, and if we had more time we could do a large scale test with a diverse set of workloads.  Here we observed the optimized fixed cluster (using the Sync Apache Spark Gradient) achieved a 28% and 65% cost savings over default autoscaling for Airline Delay and Gen Data respectively.

  • What if we just set Autoscaling to 1-2 nodes (instead of 2-8)?

We thought that if we changed the autoscaling min and max to match the “Fixed 2 node Gradient” cluster, then it should achieve about the same runtime and cost.  To our surprise, the autoscaler bounced back and forth between 1 and 2 nodes, which caused a longer job run than the fixed cluster.  In the plot below, we added the 1-2 node autoscaling job to the worker plot.  Overall, the fixed 2 node cluster was still 12% cheaper than the autoscaled version of the same cluster with 1-2 nodes.

What this result indicates is that the autoscaler’s min/max worker settings are themselves parameters to optimize for cost, and they require experimentation.

  • How does the cost and runtime of the job change vs. varying the autoscaling max worker count? 

If the cost and runtime of your job change based on the max and min worker counts, then autoscaling itself becomes a new tuning parameter.

The data below shows what happens if we keep min_workers = 2 but sweep max_workers from 3 to 8.  Clearly both cost and runtime vary quite a bit with the max worker count, and the profiles of these curves depend on the workload.  The bumpiness of the total cost can be attributed to fluctuating spot prices.
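The sweep itself is easy to script against the Databricks Jobs API.  Below is a hedged sketch: the workspace URL, token, notebook path, and runtime version are placeholders, and the cluster spec is simplified.

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                        # placeholder token

def submit_tpcds_run(max_workers: int) -> int:
    """Submit one benchmark run with min_workers fixed at 2 and a varying max."""
    cluster = {
        "spark_version": "10.4.x-scala2.12",   # placeholder Spark runtime
        "node_type_id": "i3.xlarge",
        "driver_node_type_id": "i3.xlarge",
        "autoscale": {"min_workers": 2, "max_workers": max_workers},
        "aws_attributes": {
            "first_on_demand": 1,                  # keep the driver on-demand
            "availability": "SPOT_WITH_FALLBACK",  # workers on spot with fallback
        },
    }
    payload = {
        "run_name": f"tpcds-autoscale-max-{max_workers}",
        "tasks": [{
            "task_key": "tpcds",
            "new_cluster": cluster,
            "notebook_task": {"notebook_path": "/Benchmarks/tpcds_100gb"},  # placeholder
        }],
    }
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/runs/submit",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=payload,
    )
    resp.raise_for_status()
    return resp.json()["run_id"]

# Sweep max_workers from 3 to 8, one run per setting.
run_ids = [submit_tpcds_run(m) for m in range(3, 9)]
```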

The black dashed line shows the runtime and cost performance of the optimized fixed 2 node cluster.  We note that a fixed cluster was able to outperform even the best autoscaling configuration in both cost and runtime for the TPC-DS workload.

  • How did we get the cost of the jobs?

It turns out obtaining the actual cost charged for your jobs is pretty tedious and time consuming.  As a quick summary, below are the steps we took to obtain the actual observed costs of each job:

  1. Obtain the Databricks ClusterId of each completed job.  (this can be found in the cluster details of the completed job under “Automatically added tags”)
  2. In the Databricks console, go to the “manage account>usage tab”, filter results by tags, and search for the specific charge for each ClusterId.  (one note: the cost data is only updated every couple of hours, so you can’t retrieve this information right after your run completes)
  3. In AWS, go to your cost explorer, filter by tags, and type in the same cluster-id to obtain the AWS costs for that job (this tag is automatically transferred to your AWS account).  (Another note, AWS updates this cost data once a day, so you’ll have to wait)
  4. Add together your DBU and AWS EC2 costs to obtain your total job cost.

So to obtain the actual observed total cost (DBU and AWS), you have to wait around 24 hours for all of the cost data to reach its final endpoints.  We were disappointed that we couldn’t see the actual cost in real time.
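The AWS half of the lookup (step 3) can at least be scripted.  Below is a minimal sketch using the AWS Cost Explorer API; it assumes cost allocation tags are activated in your billing console and that the Databricks cluster-id tag appears under the key “ClusterId” in your account, so verify the tag key in your own cost data before relying on it.

```python
import boto3

def ec2_cost_for_cluster(cluster_id: str, start_date: str, end_date: str,
                         tag_key: str = "ClusterId") -> float:
    """Sum the AWS charges for resources carrying a given Databricks cluster-id tag.

    Assumes the cost allocation tag key is "ClusterId" (verify in your account)
    and that dates are "YYYY-MM-DD" strings; Cost Explorer data lags by up to a day.
    """
    ce = boto3.client("ce")
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start_date, "End": end_date},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        Filter={"Tags": {"Key": tag_key, "Values": [cluster_id]}},
    )
    return sum(float(day["Total"]["UnblendedCost"]["Amount"])
               for day in resp["ResultsByTime"])

# Example (placeholder values):
# total_aws = ec2_cost_for_cluster("0123-456789-abcde123", "2022-10-01", "2022-10-03")
# total_job_cost = total_aws + dbu_cost_from_databricks_usage_page  # placeholder variable
```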

Conclusion

In our analysis, we saw that a fixed cluster could outperform an autoscaled cluster in both runtime and cost for the 3 workloads we looked at, with cost savings of 37%, 28%, and 65%.  Our experiments showed that by simply sticking to a fixed cluster, we eliminated all of the overhead that came with autoscaling, which resulted in faster runtimes and lower costs.  So ultimately, the net cost efficiency depends on whether the scaling benefits outweigh the overhead costs.

To be fair to the autoscaling algorithm, it’s very difficult to build a universal algorithm that reactively works for all workloads.  One has to analyze the specifics of each job in order to truly optimize the cluster underneath and then still experiment to really know what’s best.  This point is also not specific to Databricks, as many data platforms (EMR, Snowflake, etc) also have autoscaling policies that may work similarly.

To summarize our findings, here are a few high level takeaways:

  • Autoscaling is not one size fits all – Cluster configuration is an extremely complicated topic that is highly dependent on the details of your workload.  A reactive autoscaling algorithm is a good attempt, but the overhead of changing the cluster means it does not solve the problem of cluster optimization.
  • Autoscaling still requires tuning – Autoscaling is not a “set and forget” solution; it still requires tuning and experimentation to find the min and max worker settings that are optimal for your application.  Unfortunately, since the autoscaling algorithm is opaque to users, the fastest way to determine the best settings is to manually experiment.
  • So when is autoscaling good to use for batch jobs?  It’s difficult to provide a general answer because, as mentioned above, it all depends on your workload.  But two scenarios where it could make sense are (1) your job has long periods of idle time, so autoscaling can shut down the idle nodes, or (2) you are running ad-hoc data science experiments and are prioritizing productivity over costs.  Scenarios (1) and (2) could even be the same thing!
  • So what should people do?  If cost efficiency of your production level Databricks jobs is a priority, I would heavily consider performing an experiment where you select a few jobs, switch them to fixed clusters, and then extract the costs to do a before and after analysis – just like we did here.

The challenge with the last bullet is: what is the optimal fixed cluster?  This is an age-old question that in the past required a lot of manual experimentation to answer, which is why we built the Apache Spark Gradient to figure it out quickly.  In this study, that is how I found the optimal fixed clusters with a single file upload, without having to run numerous experiments.

Maybe autoscaling is great for your workloads, maybe it isn’t; unfortunately, the answer is really “it depends.”  There’s only one way to really find out – you need to experiment.

The top 6 lessons we’ve learned about why companies struggle with cloud data efficiency

Here at Sync, we’ve spoken with companies of all sizes, from some of the largest companies in the world to 50 person startups who desperately need to improve their cloud costs and efficiencies for their data pipelines. Especially in today’s uncertain economy, companies worldwide are implementing best practices and utilizing SaaS tools in an effort to save their bottom lines on the cloud. 

This article is the first in a series of posts dedicated to the topic of improving cloud efficiency for data pipelines, why it’s so hard, and what can be done.

In our discussions with companies, we have identified several recurring challenges that hinder their ability to improve their cloud infrastructure for big data platforms. While our current focus is on systems that use Apache Spark (such as EMR or Databricks on AWS), many of these problems are common across different technologies. 

In this article, we will summarize the top six reasons we’ve heard from companies for these challenges and provide an overview of the underlying issues. In our next blog post, we will discuss the strategies that companies use to address these problems.

1. Developer Bandwidth

Optimizing complex cloud infrastructure is not easy and in fact requires significant time and experimentation. Simply understanding (let alone implementing) the potential business trade offs of various infrastructure choices requires hours of research. It’s a full time job and developers already have one.

For example, should a developer run their Apache Spark job on memory, compute, or network optimized instance families? What about AWS’s new Graviton chips, which sound pretty promising? What about all of the Apache Spark parameters, like partition sizes, memory, and the number of executors? The only way to truly know is to manually experiment and sweep parameters — this is just not a sustainable approach.
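To give a flavor of the knobs involved, here is a small, purely illustrative SparkSession configuration touching a few of the parameters mentioned above; the values are arbitrary starting points, not recommendations.

```python
from pyspark.sql import SparkSession

# Illustrative only: a handful of the many knobs a developer would otherwise
# have to sweep by hand. The values are arbitrary, not recommendations.
spark = (
    SparkSession.builder
    .appName("example-job")
    .config("spark.executor.instances", "8")         # number of executors
    .config("spark.executor.memory", "8g")           # memory per executor
    .config("spark.executor.cores", "4")             # cores per executor
    .config("spark.sql.shuffle.partitions", "200")   # partitions after shuffles
    .getOrCreate()
)
```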

Even if an optimal configuration is found, that configuration may quickly go stale as code, input data profiles, business KPIs, and cloud pricing all fluctuate.

Developers are mostly just trying to get their code up and running in the first place; they rarely have the luxury of optimizing it.

2. Switching cost of existing infrastructure

Improving existing infrastructure is typically a lot more work than just changing a few lines of code. Optimizing infrastructure for substantial cost or runtime gains could come at the cost of changing fundamental components of your architecture.

Many developers wonder whether “the juice is worth the squeeze.”

For example, changing platforms (let’s say Databricks vs EMR, Apache Hive vs Apache Iceberg, or Non-Spark vs. Spark) is a daunting task that could slow down product velocity (and hence revenue) while the switch is being implemented.

We’ve seen companies knowingly operate inefficiently simply because switching to something better is “too much work”; they would rather live with a less efficient system.

3. Too many changing variables

Many production data systems are subject to many fluctuating parameters, such as varying data size, skew, volume of jobs, spot node availability and pricing, changing codebase, and engineer turnover to name a few. The last one is particularly challenging, as many pipelines are set up by former employees, and big changes come with big risk.

With such complexity, many teams deem it too complicated and risky to try to optimize and would rather focus on “lower hanging fruit.” For new workloads, companies often take the easy way out and just copy and paste prior configurations from a completely different workload — which can lead to poor resource utilization.

4. Lack of expertise

Cloud native companies typically prioritize fast product development time, and cloud providers are happy to oblige by quickly making a massive amount of resources available with a single click. While this democratization of cloud provisioning has been a game changer for speed to market, cloud optimization isn’t typically recognized as a part of the developer job description.

The tension between spending time learning about low level compute tradeoffs and building the next feature prevents this knowledge from growing in the general workforce. Furthermore, it’s no secret that, as a whole, there is a massive cloud skill shortage in the market today. Outside of the FAANG mafia, it’s incredibly difficult for companies to compete for talent with deep knowledge of cloud infrastructure.

One large global company complained to us that they keep losing architects to AWS, GCP, and Azure themselves! Simply put, many companies couldn’t optimize cloud infrastructure even if they desperately wanted to.

5. Misaligned incentives

At large companies, different teams are responsible for different parts of a company’s cloud footprint, from developers up to the CTO — and each team can have different incentives. With these differing motivations, it can be difficult to quickly enact changes that improve efficiency, since decisions from one group may negatively impact the goals of another.

For example, companies often have vendor contracts they need to stay within the boundaries of, but when it’s projected that they will exceed the forecasted consumption, cloud optimization can rise from low to high priority. This sudden change can cause a frenzy of discussions on what to do, the tradeoffs, and impact on various groups. Companies have to choose between keeping developers focused on new product development, or refocusing them on cloud optimization projects.

For developers to play an efficient role in this complex web, they must be able to quickly understand and articulate the business impact of their infrastructure choices — but this is much easier said than done. Ensuring that the thousands of low-level infrastructure choices a developer makes align with the high-level business goals of a VP is incredibly difficult.

6. Scale and Risk

Many companies we’ve spoken to run tens of thousands of business critical production Apache Spark jobs a day. Optimizing jobs at this scale is a daunting task, and many companies only consider superficial optimizations such as reserved instances or autoscaling, but lack the ability to individually tune and optimize resources for each of the thousands of jobs launched daily.

So what do companies do now?

Unfortunately, the answer is “it depends.” The solution to addressing the challenges of cloud optimization varies depending on a company’s current structure and approach. Some companies may choose to form internal optimization teams or implement autoscaling, while others may explore serverless options or engage with cloud solution architects. In our next post, we will discuss these common strategies and their respective advantages and disadvantages.

Conclusion

Here at Sync, our whole mission is to build a product that solves these issues and empowers developers from all company sizes to reach a new level of cloud infrastructure efficiency and align them to business goals. 

Our first product is the Gradient for Apache Spark, which profiles existing workloads from EMR and Databricks and provides data engineers an easy way to obtain optimized configurations to achieve cost or runtime business goals. We recently released an API for the Gradient, which allows developers to programmatically scale the optimization intelligence across all of their workloads.

Try our Gradient for Apache Spark on EMR and Databricks yourself

We’re passionate about solving this deep and extremely complicated problem. In our next post we’ll discuss possible solutions and their tradeoffs. Please feel free to follow us on Medium to keep in the loop for when we release the next article in this series.

We’re hiring, let’s build.


What are we building?

At Sync we’re building something that is really hard.  We’re trying to disrupt a $100B industry where some of the world’s biggest companies live. On top of that, we’re attacking a layer in the tech stack that is mired in complexity, history, and evolution.

So why do we think we’re going to win?  Because we are approaching the problem in an entirely new way.  Nobody has ever tried to build what we’re building before.  We are making deep mathematical optimization available to end users via a SaaS product that will be fully integrated into their existing workflows.  We have the team, the investors, and the vision to push forward down a path, and we’re looking for mission oriented people who revel in difficulty and love a challenge.

Why work for Sync?

At Sync, we offer market competitive salaries, company equity, flexible time off, 100% health insurance coverage, parental leave, standard benefits, 401K, and company retreats.  Internally we have employee lecture talks, zoom coffee breaks, live zoom games (with prizes), and dedicated slack channels for puzzles and food.

All employees get a Mac laptop and an office stipend that makes your work life better.  Since we’re a fully remote company, each employee’s needs are different and you should be free to customize.  

Although work is important, nothing is more important than your personal life, family, and health.  We don’t ask employees to pull all-nighters or work on weekends.  Personally, I have 3 kids and understand that my time isn’t always my own.  Sometimes we have to drop things and go to a recital.  In general, we have a “grad school” mentality – all we ask is for you to try to hit your high level goals on time – how, where, and when is up to you.

What will you be working on?

You will be working on a new SaaS product aimed at changing the way developers control cloud infrastructure.  To clarify, we’re not building another framework for developers to build new systems on (we think there are enough options).  Instead – we are building the core intelligence and automation that developers will utilize in their existing workflows.  Our product must be accurate, performant, and simple to use. 

The product is technically deep, rich, and solves a very challenging customer pain.  You will find no shortage of technical challenges that will push you to think in new ways and force you to learn new skills.

Who should apply?


We’re looking for mission oriented people who revel in a challenge and value building something new.  As a startup, we value people who understand the flexibility, grittiness, and determination that is required when tackling such a daunting mission with an early stage company.  We are looking for people who understand how to move fast and efficiently, how to communicate with the team, align on goals, receive feedback, and push forward.  Time is our most scarce resource.

What is the interview process like?

The process starts with a chat with our head of talent who is looking for high level goal alignment and culture fit.  The next phase is a meeting with the hiring manager who will dive deeper into the role and assess technical fit.  The final phase is a panel interview with 3-4 current employees who will assess technical, culture, and goal fits.  The full panel then meets to discuss and a final decision is made.  Assuming all schedules align, the process can be as fast as 2 weeks from first call to offer.

Who will I be working with?

Depending on the role you’re applying for, you’ll be placed on the appropriate team with a manager.  The team’s background is quite diverse, spanning PhDs from top universities to cloud infrastructure and product veterans, to early career folks. 

Where will I work?


The team is fully remote.  You can work from wherever you like within the United States.  We’re a remote first company and have been from the start.  To help foster a human connection, we sponsor company wide retreats focused on goal alignment and just hanging out.

How do I apply?

If all of this sounds appealing, please check out our job postings to see if anything is a match.

Globally Optimized Data Pipelines On The Cloud — Airflow + Apache Spark

Sync Computing presents a new kind of scheduler capable of automatically optimizing cloud resources for data pipelines to achieve runtime, cost, and reliability goals

Here at Sync, we recently launched our Gradient for Apache Spark product, which helps people optimize their EMR and Databricks clusters on AWS. It turns out there’s more on the roadmap for us: we recently published a technical paper on our automatic globally optimized resource allocation (AGORA) scheduler, which extends beyond cluster autotuning toward the more general concept of resource allocation + scheduling optimization.

In this blog post we explain the high level concepts and how they can help data engineers run their production systems more reliably, hit deadlines, and lower costs — all with the click of a button. We show that a simulation of our system on Alibaba’s cluster trace resulted in a 65% reduction in total runtime.

Introduction

Let’s say you’re a data engineer and you manage several production data pipelines via Airflow. Your goal is to achieve job reliability, hit your service level agreements (SLA), and minimize costs for your company. But due to changing data sizes, skew, source code, cloud infrastructure, resource contention, spot pricing, and spot availability, achieving your goals in real-time is basically impossible. So how can Sync help solve this problem? Let’s start with the basics.

The Problem

For simplicity, let’s say your data pipelines are run via Airflow, and each one of your Airflow directed acyclic graphs (DAGs) is composed of several task nodes. Let’s also assume each node in the DAG is an Apache Spark job. In the DAG image below in Fig. 1, we see four Apache Spark jobs, named “Index Analysis,” “Sentiment Analysis,” “Airline Delay,” and “Movie Rec.”

Fig. 1. Simple Airflow DAG with 4 Apache Spark nodes

You have a deadline and you need this DAG to finish within 800 seconds. How can you achieve that?
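In Airflow terms, the pipeline in Fig. 1 would look something like the sketch below. This is purely illustrative: the operator choice, application paths, connection ID, and dependency edges are placeholder assumptions standing in for whatever Fig. 1 actually shows.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Illustrative sketch of the four-job pipeline in Fig. 1.
# Application paths, the connection ID, and the dependency edges are placeholders.
with DAG(
    dag_id="four_spark_jobs",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    index_analysis = SparkSubmitOperator(
        task_id="index_analysis",
        application="/jobs/index_analysis.py",   # placeholder path
        conn_id="spark_default",
    )
    sentiment_analysis = SparkSubmitOperator(
        task_id="sentiment_analysis",
        application="/jobs/sentiment_analysis.py",
        conn_id="spark_default",
    )
    airline_delay = SparkSubmitOperator(
        task_id="airline_delay",
        application="/jobs/airline_delay.py",
        conn_id="spark_default",
    )
    movie_rec = SparkSubmitOperator(
        task_id="movie_rec",
        application="/jobs/movie_rec.py",
        conn_id="spark_default",
    )

    # Placeholder dependency structure -- see Fig. 1 for the real edges.
    index_analysis >> [sentiment_analysis, airline_delay] >> movie_rec
```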

One clever way you might think of to try to achieve your goals is to optimize each job separately for runtime. You know 800s is a tight deadline, so you decide to choose configurations for each job that would give you the fastest runtime. There’s no way of figuring out what that might be, so you decide to run a set of experiments: run each job with different configurations and choose the one that gives you the fastest runtime.

Being a data engineer, you could even set up this whole process to be automated. Once you’ve figured out the configurations separately, you can use them to run the DAG with the standard Airflow scheduler. However, what happens in this example is that the jobs are launched serially, each one maximizing the cluster, and the DAG does not meet your deadline. In the figure below we see the number of vCPUs available in the cluster on the y-axis and time on the x-axis.

Fig. 2. VCPUs vs. Runtime plot of the Apache Spark jobs run with Airflow, using default schedulers

You tried to optimize each job separately for the fastest runtime, but it’s still not meeting the requirement. What could you do now to accomplish this goal? Just try increasing the cluster size to hopefully speed up the jobs? That could work but how much larger? You could run more experiments, but now the problem bounds are increasing and the growing list of experiments would take forever to finish. Also, changing the resources for an Apache Spark job is notoriously difficult, as you may cause a crash or a memory error if you don’t also change the corresponding Apache Spark configurations.

The Solution

To really see how best you can reallocate resources to hit the 800s SLA, we’ll have to look at the predicted runtime vs. resources curve for each Apache Spark job independently. For the four jobs, the predicted performance plots are shown in Fig. 3 below across 4 different instances on AWS.

We can see that the runtime vs. number of nodes profile is very different depending on the job and the hardware.

Fig. 3. Predicted runtime vs. resources plots for each of the 4 Apache Spark jobs

With this information, we can use AGORA to solve the scheduling problem, properly allocate resources, and re-run the DAG. Now we see that the 800s SLA is achieved without changing the overall cluster size, by reallocating resources (and the corresponding Spark configurations) for each job while obeying the DAG dependencies. What AGORA does is interesting: the “purple” job gets massively compressed in terms of resources, with only a small impact on runtime, whereas the “green” job doesn’t change much, because compressing it would blow up its runtime. Understanding which jobs can be “squished” and which cannot is critical for this optimization. In some sense, it’s a game of “squishy Tetris”!

Fig. 4. Globally optimized schedule, capable of achieving the 800s SLA
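For readers who want a feel for the underlying math, here is a highly simplified sketch of this kind of resource-constrained scheduling problem, written in our own notation (the technical paper linked in the references has the full formulation):

```latex
\begin{aligned}
\min_{\{t_j,\, r_j\}}\ & T_{\mathrm{end}} \\
\text{s.t. }\ & t_j + d_j(r_j) \le T_{\mathrm{end}} \quad \forall j && \text{(makespan / SLA)} \\
& t_k \ge t_j + d_j(r_j) \quad \forall (j \to k) \in \mathrm{DAG} && \text{(precedence)} \\
& \textstyle\sum_{j \text{ active at } \tau} r_j \le R \quad \forall \tau && \text{(cluster capacity)}
\end{aligned}
```

Here each job j gets a start time t_j and a resource allocation r_j, d_j(r_j) is the predicted runtime of job j given that allocation (the curves in Fig. 3), and R is the total vCPUs in the cluster. Swapping the objective for total instance cost, or turning the makespan into a hard SLA constraint, gives the goal based variants discussed below.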

The Catch

Well, that looks great – what is so hard about that? It turns out that this scheduling problem is what is known in the math world as an NP-hard optimization problem, and solving it exactly becomes very difficult very quickly. With just those 4 jobs, we can see from the graphs below that finding the optimal schedule via brute force methods can take over 1000 seconds. Obviously, nobody wants to wait for that.

Fig. 5. Search space and solve time for the scheduling optimization problem of a simple DAG

The other issue is that we need to predict those runtime vs. resources graphs with just a few prior runs. We don’t want to actually re-run jobs hundreds of times just to run them well once. This is where our Gradient product comes into play. With just 1 log, we can predict the cost-runtime performance for various hardware and optimized Spark configurations.

At Sync Computing, we’ve solved both issues:

  1. Scheduling modeling & solve time: We mathematically modeled Apache Spark, cloud resources, and cloud economics as an optimization problem we can solve extremely quickly.  More details on the math can be found in the technical paper.
  2. Predicting performance on alternative hardware: Our Gradient for Apache Spark takes in one log and can predict the performance across various machines — simultaneously accounting for hardware options, costs, spot availability, and Apache Spark configurations. See our case study here.

Goal based optimization

In the simple example above, the priority was to hit an SLA deadline by reallocating resources. It turns out we can also set other priorities. One obvious one is cloud cost savings, in which the cost of the instances is prioritized over the total runtime. In this example, we utilized more realistic DAGs, as shown in the image below:

Fig. 6. Comprehensive DAGs more akin to realistic jobs.

For cost based optimization, what typically happens is that fewer resources (fewer nodes) are used for each job, which usually results in longer runtimes, albeit lower costs. Alternatively, we can be runtime optimized, in which case more resources are used, albeit at higher costs. The simple table below highlights this general relationship. Of course, the exact number of nodes and the runtime are highly dependent on the exact job.

When we run our goal based optimization, we show that for DAG1 we can achieve a runtime savings of 37%, or a cost savings of 78%. For DAG2, we can achieve a runtime savings of 45%, or a cost savings of 72% — it all depends on what you’re trying to achieve.

Fig. 7. Optimizing the schedule for DAGs 1 and 2 to minimize runtime.

Fig. 8. Optimizing the schedule for DAGs 1 and 2 to minimize cloud costs.

Of course, other prioritizations can be implemented as well, or even a mix. For example, some subset of the DAGs may need to hit SLA deadlines, whereas another subset needs to be cost minimized. From a user’s perspective, all the user has to do is set the high level goals, and AGORA automatically reconfigures and reschedules the cluster to hit all goals simultaneously. The engineer can just sit back and relax.

The Solution at Scale — Alibaba Cluster Trace with 14 million tasks

So what happens if we apply our solution in a real-world large system? Fortunately, Alibaba publishes their cluster traces for academic purposes. The 2018 Alibaba cluster trace includes batch jobs run on 4034 machines over a period of 8 days. There are over 4 million jobs (represented as DAGs) and over 14 million tasks. Each machine has 96 cores and an undisclosed amount of memory.

When we simulated our solution at this scale, we demonstrated a total runtime/cost reduction of 65% across the entire cluster. At the scale of Alibaba, a 65% reduction over 14 million tasks is a massive amount of savings.

Fig. 9. Simulated performance of AGORA on the Alibaba cluster trace

Conclusion

We hope this article illuminates the possibilities in terms of the impact of looking globally across entire data pipelines. Here are the main takeaways:

  1. There are massive gains if you look globally: The gains we show here only get larger as we look at larger systems. Optimizing across low level hardware to high level DAGs reveals a massive opportunity.
  2. The goals are flexible: Although we only show cost and runtime optimizations, the model is incredibly general and can account for reliability, “green” computing, or any other priority you’d like to encode.
  3. The problem is everywhere: In this write up we focused on Airflow and Apache Spark. In reality, this general resource allocation and scheduling problem is fundamental to computer science itself — Extending to other large scale jobs (machine learning, simulations, high-performance computing), containers, microservices, etc.

At Sync, we built the Gradient for Apache Spark, but that’s just the tip of the iceberg for us. AGORA is currently being built and tested internally here at Sync with early users. If you’d like a demo, please feel free to reach out to see if we can help you achieve your goals.

We’ll follow up this article with more concrete customer based use-cases to really demonstrate the applicability and benefits of AGORA. Stay tuned!

References

  1. Gradient post for EMR on AWS
  2. Gradient post for Databricks on AWS
  3. Technical paper on AGORA
  4. Alibaba cluster trace data

Rising above the clouds

The future of computing will be determined not by individual chips, but by the orchestration of many

The Technological Imperative

Cloud computing today consists of an endless array of computers sitting in a far off warehouse and rented out by the second. This modality has altered the way we harness computing power, and created a ripe new opportunity to keep up with our exploding demand for more. Gains that were previously defined by cramming more and more power into a single chip can now be achieved by stitching many of them together.

At the same time, our unfettered access to these resources and the exponential growth of data used to feed them have created an unsustainable situation: demand for compute is growing much faster than our ability to meet it.

The explosive growth of ML models far exceeds the capabilities of even the latest hardware chips

Products like Google’s TPU pods, Nvidia’s DGX, or the Cerebras wafer-scale engine, are noble attempts to address this growing divide. These systems, however, are only half the battle. As they grow larger and more complex, the need to intelligently orchestrate the work running on them grows as well. What happens when only fractions of a workload need the extreme (and extremely expensive!) compute? How do I know which task goes on which chip and when?

The future of computing is undoubtedly distributed, but is saddled with the prospect of diminishing returns. So much effort is spent on boosting our compute resources with comparatively little spent on understanding how best to allocate them. Continuing down this path will leave our computing grid dominated by this ever-growing inefficiency.

The Business Imperative

As cloud adoption rises, these inefficiencies have landed squarely on corporate balance sheets as a recurring operating expense. Not a small one, either. According to a report by Andreessen Horowitz, over $100 billion is spent each year on cloud services, and that figure is still growing; over 50% of it (we believe) is wasted. The top 30 data infrastructure startups have raised over $8 billion of venture capital in the last 5 years at an aggregate value of $35 billion, per Pitchbook. In addition, the engineering cost of delayed development time and the lack of skilled talent further contribute to wasted resources not spent on driving revenue. While the cloud unleashed a torrent of software innovation, businesses large and small are struggling mightily to find this waste and control their ballooning bills.

Enterprise cloud spending is skyrocketing, with no slowdown in sight

This collision between development speed and the complexity of infrastructure has created a fundamental conundrum in the industry:

How does an engineer launch jobs on the cloud

both easily and efficiently?

Sync Computing is tackling this problem head on. How? Here are a few key pillars:

Step 1) Scale to the cloud based ONLY on the things you care about: cost and time

One of the biggest gaps in today’s cloud computing environment is a simple mechanism to choose computers for a given workload based on metrics which matter to end users. AWS alone offers a continuously evolving menu of around 400 different machines with hundreds of settings and fluctuating prices, and few customers have the time and resources to identify which ones are best for their workload. With unlimited resources available at our fingertips, the most common question asked by customers remains: “Which do I choose?” The seemingly innocuous decision can have a huge impact. The exact same workload run on the wrong computer can run up to 5x slower, cost 5x more, or in some cases crash entirely.

Example of a customer’s production job cluster (black dot) performing slower and more expensive than optimal

Sync Computing is building a solution to this dilemma, providing users choices based only on the metrics which matter to them most: cost and time.

Interested in seeing how this is helping drive real value right now? Click here

Step 2) Orchestrate the work to meet your goals

With the cost/runtime goals set, matching these goals with the multitude of infrastructure options becomes impossibly hard.

Current Cloud systems are scheduled poorly and are often over provisioned (left). Sync takes the same jobs, reallocates resources, and reschedules to give the best performance possible (right).

Modern workloads consist of thousands of tasks distributed across hundreds of machines. Sifting through the innumerable ways of assigning tasks to machines while meeting strict cost and runtime constraints is an intractable mathematical optimization problem. Today’s computing systems are designed to avoid the problem entirely, opting to accept the inefficiency and unpredictability resulting from poor choices.

Sync Computing is developing the world’s first distributed scheduling engine (DSE) designed specifically for this challenge. At its heart is a unique processor capable of finding the best way to distribute work in milliseconds rather than hours or days. The ability to make these calculations in real time and at relevant scale serves as the key to unlocking an untapped well of additional computing power, and as workloads continue to grow, will become a cornerstone for future computing environments.

A New Paradigm

It is high time for a new working model for how we interact with computers, one that is centered around users, and Sync Computing is building the technology required for this shift.

Our mission is to build a dramatically more transparent, more efficient cloud, and propel the next generation of computing innovations.