How poor provisioning of cloud resources can lead to 10X slower Apache Spark jobs
Optimizing your Apache Spark code on EMR and Databricks is only half the battle. In this post, we discuss the impact cloud infrastructure has on performance, and how the Sync Gradient can help.
The Situation
Let’s say you’re a data engineer and you want to run your data/ML Spark job on AWS as fast as possible. After you’ve written your code to be as efficient as possible, it’s time to deploy to the cloud.
Here’s the problem: there are over 600 instance types in AWS (today), and once you add in the various Spark parameters, the number of possible deployment options becomes impossibly large. So inevitably you take a rough guess, or experiment with a few options, pick one that works, and forget about it.
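To get a feel for how quickly the options multiply, here is a rough back-of-the-envelope sketch. Only the figure of 600 machines comes from this post. The grid sizes for node count and for the Spark settings (spark.executor.memory, spark.executor.cores, spark.sql.shuffle.partitions) are assumptions picked purely for illustration.

```python
from math import prod

# From the post: over 600 machines to choose from on AWS.
INSTANCE_TYPES = 600

# Hypothetical number of candidate values per knob (illustrative only).
GRID = {
    "node_count": 32,          # e.g. clusters of 1 to 32 worker nodes
    "executor_memory": 8,      # candidate values for spark.executor.memory
    "executor_cores": 4,       # candidate values for spark.executor.cores
    "shuffle_partitions": 10,  # candidate values for spark.sql.shuffle.partitions
}


def count_configurations(instance_types: int, grid: dict) -> int:
    """Size of the full search space: every knob multiplies the total."""
    return instance_types * prod(grid.values())


total = count_configurations(INSTANCE_TYPES, GRID)
print(f"{total:,} candidate configurations")
```

Even with these modest assumed grids, the total lands in the millions, which is why trying every option by hand is not realistic.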
The Impact
It turns out this guessing game could undo all the work you put into streamlining your code. The graph below shows the performance of a standard Bayes ML Spark job from the HiBench test suite. Each point is the result of changing just 2 parameters: (1) which compute instance is used, and (2) the number of nodes.
The issue is clear even in this very simple example. If a user selects just these 2 parameters poorly, the job could run up to 10X slower, or cost twice as much as it should (with little to no performance gain).
Keep in mind that this is a simplified picture, where we have ignored Spark parameters (e.g. memory, executor count) and cloud infrastructure options (e.g. storage volume, network bandwidth) which add even more uncertainty to the problem.
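The two-parameter sweep described above can be sketched in a few lines. This is a toy model, not the benchmark behind the graph: the instance names, core counts, prices, and the runtime formula (a fixed serial part, a part that shrinks with total cores, and a per-node coordination overhead) are all hypothetical. The one thing it is meant to show is that cost is hourly price × node count × runtime, so the fastest configuration and the cheapest one are often different.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Instance:
    name: str              # hypothetical label, not a real AWS instance type
    cores: int
    price_per_hour: float  # USD per node per hour, hypothetical


INSTANCES = [
    Instance("small", 4, 0.20),
    Instance("medium", 8, 0.40),
    Instance("large", 16, 0.80),
]
NODE_COUNTS = [1, 2, 4, 8, 16, 32]


def runtime_hours(inst, nodes, serial=0.1, work=16.0, overhead=0.005):
    """Toy runtime model: serial part + parallel part + per-node overhead."""
    return serial + work / (inst.cores * nodes) + overhead * nodes


def cost_usd(inst, nodes):
    """Cost of one run: hourly price x number of nodes x runtime."""
    return inst.price_per_hour * nodes * runtime_hours(inst, nodes)


# Evaluate every (instance, node count) pair, like the points on the graph.
results = [
    (inst.name, nodes, runtime_hours(inst, nodes), cost_usd(inst, nodes))
    for inst, nodes in product(INSTANCES, NODE_COUNTS)
]

fastest = min(results, key=lambda r: r[2])
cheapest = min(results, key=lambda r: r[3])
slowest = max(results, key=lambda r: r[2])

for label, (name, nodes, hours, usd) in [
    ("fastest", fastest), ("cheapest", cheapest), ("slowest", slowest)
]:
    print(f"{label:>8}: {nodes:>2} x {name:<6} {hours:6.2f} h  ${usd:6.2f}")
```

Swapping in measured runtimes for your own job in place of `runtime_hours` turns the same loop into a small, if brute-force, comparison of real options.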
Daily Changes Make it Worse
To add yet another complication, data being processed today could look very different tomorrow. Fluctuations in data size, skew, and even minor modifications to the codebase can lead to crashed or slow jobs if your production infrastructure isn’t adapting to these changing needs.
How Sync Solved the Problem
At Sync, we think this problem should go away. We also think developers shouldn’t waste time running and testing their job on various combinations of configurations. We want developers to get up and running as fast as possible, completely eliminating the guesswork of cloud infrastructure. At its heart, our solution profiles your job, solves a huge optimization problem, and then tells you exactly how to launch to the cloud.
More from Sync:
Choosing the right Databricks cluster: Spot instances vs. on-demand clusters, All-Purpose Compute vs. Jobs Compute
Databricks Compute Comparison: Classic Jobs vs Serverless Jobs vs SQL Warehouses