The top 6 lessons learned why companies struggle with cloud data efficiency

We summarize the top lessons we’ve learned from talking to companies big and small

Here at Sync, we’ve spoken with companies of all sizes, from some of the largest companies in the world to 50 person startups who desperately need to improve their cloud costs and efficiencies for their data pipelines. Especially in today’s uncertain economy, companies worldwide are implementing best practices and utilizing SaaS tools in an effort to save their bottom lines on the cloud.

This article is the first in a series of posts dedicated to the topic of improving cloud efficiency for data pipelines, why it’s so hard, and what can be done.

In our discussions with companies, we have identified several recurring challenges that hinder their ability to improve their cloud infrastructure for big data platforms. While our current focus is on systems that use Apache Spark (such as EMR or Databricks on AWS), many of these problems are common across different technologies.

In this article, we will summarize the top six reasons we’ve heard from companies for these challenges and provide an overview of the underlying issues. In our next blog post, we will discuss the strategies that companies use to address these problems.

1. Developer Bandwidth

Optimizing complex cloud infrastructure is not easy and in fact requires significant time and experimentation. Simply understanding (let alone implementing) the potential business trade offs of various infrastructure choices requires hours of research. It’s a full time job and developers already have one.

For example, should a developer run their Apache Spark job on memory, compute, or network optimized instance families? What about AWS’s new Graviton chips, those sound pretty promising? What about all of the Apache Spark parameters like partition sizes, memory, number of executors etc? The only way to truly know is to manually experiment and sweep parameters — this is just not a sustainable approach.

Even if an optimal configuration is found, that configuration may quickly go stale as code, input data profile, business kpis, and cloud pricing all fluctuate.

Developers are mostly just trying to get their code up and running in the first place, let alone rarely do they have the luxury of optimization.

2. Switching cost of existing infrastructure

Improving existing infrastructure is typically a lot more work than just changing a few lines of code. Optimizing infrastructure for substantial cost or runtime gains could come at the cost of changing fundamental components of your architecture.

Many developers wonder if “the juice is worth the squeeze?”

For example, changing platforms (let’s say Databricks vs EMR, Apache Hive vs Apache Iceberg, or Non-Spark vs. Spark) is a daunting task that could slow down product velocity (and hence revenue) while the switch is being implemented.

We’ve seen companies knowingly operate inefficiently simply because switching to something better is “too much work,” and would rather live with a less efficient system.

3. Too many changing variables

Many production data systems are subject to many fluctuating parameters, such as varying data size, skew, volume of jobs, spot node availability and pricing, changing codebase, and engineer turnover to name a few. The last one is particularly challenging, as many pipelines are set up by former employees, and big changes come with big risk.

With such complexity, many teams deem it too complicated and risky to try to optimize and would rather focus on “lower hanging fruit.” For new workloads, companies often go the easy way out and just copy and paste prior configurations from a completely different workload — which can lead to poor resource utilization.

4. Lack of expertise

Cloud native companies typically prioritize fast product development time, and cloud providers are happy to oblige by quickly making a massive amount of resources available with a single click. While this democratization of cloud provisioning has been a game changer for speed to market, cloud optimization isn’t typically recognized as a part of the developer job description.

The conflict of spending time learning about low level compute tradeoffs instead of building the next feature, is preventing this knowledge from increasing in the general workforce. Furthermore, it’s not a secret that as a whole, there is a massive cloud skill shortage in the market today. Outside of the FAANG mafia, it’s incredibly difficult for other companies to compete for talent with deep knowledge of cloud infrastructure.

One large global company complained to us that they keep losing architects to AWS, GCP, and Azure themselves! Simply put, many companies couldn’t optimize cloud infrastructure even if they desperately wanted to.

5. Misaligned incentives

At large companies different teams are responsible fordifferent parts of a company’s cloud footprint, from developers up to the CTO — each team can have different incentives. With these differing motivations, it can be difficult to quickly enact changes that help improve efficiency, since decisions from one group may negatively impact the goals of another.

For example, companies often have vendor contracts they need to stay within the boundaries of, but when it’s projected that they will exceed the forecasted consumption, cloud optimization can rise from low to high priority. This sudden change can cause a frenzy of discussions on what to do, the tradeoffs, and impact on various groups. Companies have to choose between keeping developers focused on new product development, or refocusing them on cloud optimization projects.

For developers to play an efficient role in this complex web, they must be able to quickly understand and articulate the business impact of their infrastructure choices — but this is much easier said than done. Ensuring that the thousands of low-level infrastructure choices a developer makes align with the high business goals of a VP is incredibly difficult.

6. Scale and Risk

Many companies we’ve spoken to run tens of thousands of business critical production Apache Spark jobs a day. Optimizing jobs at this scale is a daunting task, and many companies only consider superficial optimizations such as reserved instances or autoscaling, but lack the ability to individually tune and optimize resources for each of the thousands of jobs launched daily.

So what do companies do now?

Unfortunately, the answer is “it depends.” The solution to addressing the challenges of cloud optimization varies depending on a company’s current structure and approach. Some companies may choose to form internal optimization teams or implement autoscaling, while others may explore serverless options or engage with cloud solution architects. In our next post, we will discuss these common strategies and their respective advantages and disadvantages.

Conclusion

Here at Sync, our whole mission is to build a product that solves these issues and empowers developers from all company sizes to reach a new level of cloud infrastructure efficiency and align them to business goals.

Our first product is an Gradient for Apache Spark which profiles existing workloads from EMR and Databricks, and provides data engineers an easy way to obtain optimized configurations to achieve cost or runtime business goals. We recently released an API for the Gradient which allows developers to programmatically scale the optimization intelligence across all their workloads.

Try our Gradient for Apache Spark on EMR and Databricks yourself

We’re passionate about solving this deep and extremely complicated problem. In our next post we’ll discuss possible solutions and their tradeoffs. Please feel free to follow us on Medium to keep in the loop for when we release the next article in this series.

Jeffrey Chou

14 Dec 2022

Jeff is the co-founder and CEO of Sync Computing. He holds a PhD from UC Berkeley and was a post-doc at MIT. His interest is in large compute infrastructure and entrepreneurship.