How poor provisioning of cloud resources can lead to 10X slower Apache Spark jobs

Optimizing your Apache Spark code in EMR and Databricks is only half the battle. In this post we discuss the impact cloud infrastructure has on performance — and how the Sync Autotuner can help.

The Situation

Let’s say you’re a data engineer who wants to run a data/ML Spark job on AWS as fast as possible, avoiding slow Apache Spark performance. After you’ve written your code to be as efficient as possible, it’s time to deploy to the cloud.

Here’s the problem: AWS offers over 600 EC2 instance types (today), and once you add in the various Spark parameters, the number of possible deployment configurations becomes impossibly large. So inevitably you take a rough guess, or experiment with a few options, pick one that works, and forget about it.
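To get a rough sense of the scale, here is a back-of-the-envelope sketch. Only the ~600 instance-type figure comes from the discussion above; the other parameter ranges are illustrative assumptions, not AWS limits:

```python
# Back-of-the-envelope size of the deployment search space.
# Only the ~600 instance-type figure is from the post; the other
# ranges are illustrative assumptions.
instance_types = 600        # EC2 instance types available today
node_counts = 50            # e.g. clusters of 1..50 nodes
executor_memory_opts = 10   # a handful of spark.executor.memory settings
executor_count_opts = 20    # a handful of spark.executor.instances settings

search_space = (instance_types * node_counts
                * executor_memory_opts * executor_count_opts)
print(f"{search_space:,} candidate configurations")  # 6,000,000
```

Even with these conservative ranges, exhaustively testing every combination is clearly off the table.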

The Impact

It turns out this guessing game can undo all the work you put into streamlining your code. The graph below shows the performance of a standard Bayes ML Spark job from the HiBench test suite. Each point on the graph is the result of changing just 2 parameters: (1) which compute instance type is used, and (2) the number of nodes.
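For concreteness, these two knobs map directly onto the cluster definition you pass to EMR. A minimal sketch of the core instance group in a boto3 `run_job_flow` request (the instance type and count shown are hypothetical placeholders, not recommendations):

```python
# Sketch of the core instance group in a boto3 EMR run_job_flow request.
# The InstanceType and InstanceCount values are hypothetical placeholders;
# they are exactly the two knobs varied in the experiment above.
core_instance_group = {
    "Name": "Core nodes",
    "InstanceRole": "CORE",
    "InstanceType": "m5.xlarge",   # knob 1: which compute instance
    "InstanceCount": 8,            # knob 2: how many nodes
    "Market": "ON_DEMAND",
}
```

Two innocuous-looking fields, yet as the graph shows, they alone can swing both runtime and cost dramatically.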

Even with this very simple example, the issue is clear: if a user selects just these 2 parameters poorly, the runtime could be up to 10X slower, or the job could cost twice as much as it should (with little to no performance gain).

Keep in mind that this is a simplified picture, where we have ignored Spark parameters (e.g. memory, executor count) and cloud infrastructure options (e.g. storage volume, network bandwidth), which add even more uncertainty to the problem.
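The Spark parameters mentioned above are typically set per job at submit time. A minimal sketch of what such a configuration might look like (the property names are standard Spark settings; the values are illustrative, not tuned recommendations):

```python
# Illustrative Spark submit-time settings for the parameters mentioned
# above; the values are placeholders, not tuned recommendations.
spark_conf = {
    "spark.executor.memory": "8g",     # memory per executor
    "spark.executor.instances": "16",  # executor count
    "spark.executor.cores": "4",       # cores per executor
}

# Rendered as spark-submit flags:
flags = [f"--conf {k}={v}" for k, v in sorted(spark_conf.items())]
print(" ".join(flags))
```

Each of these settings interacts with the instance type and node count chosen earlier, which is what makes the overall tuning problem so much larger than the two-parameter graph suggests.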

Daily Changes Make it Worse

To add yet another complication, data being processed today could look very different tomorrow. Fluctuations in data size, skew, and even minor modifications to the codebase can lead to crashed or slow jobs if your production infrastructure isn’t adapting to these changing needs.

How Sync Solved the Problem

At Sync, we think this problem should go away. We also think developers shouldn’t waste time running and testing their job on various combinations of configurations. We want developers to get up and running as fast as possible, completely eliminating the guesswork of cloud infrastructure. At its heart, our solution profiles your job, solves a huge optimization problem, and then tells you exactly how to launch to the cloud.

Try the Sync Autotuner for Apache Spark yourself.
