
Top 9 Lessons Learned about Databricks Jobs Serverless

Jeffrey Chou
07.02.2024

To much fanfare, Databricks announced the wide release of serverless compute across all of its platforms at the Data + AI 2024 conference. It’s quite clear that Databricks’ vision is to own the compute layer to make life easier for end users, so they don’t have to worry about annoying cluster or version details.

We at Sync are all about cloud compute efficiency. So of course we had to take a closer look at Databricks Serverless Compute to provide an honest perspective of serverless computing compared to classic computing, and evaluate the pros and cons.

Full disclosure – Sync is a Databricks partner. With that said, everything we state here is merely our opinion, which is based on the experimental measurements covered below. Our goal is to provide an unbiased guide, so that users can decide for themselves whether or not serverless is a good choice for them.

This post focuses on the jobs serverless feature, which is currently under public preview. For more information about the SQL warehouse serverless product, check out this post.

What is serverless? – A travel analogy

At a high level, serverless means that end users don’t have to provision cloud infrastructure anymore. Databricks will do it all for you – so users can just focus on their code. For example, selecting which instances to use is now under the control of Databricks, not the end user.

While this sounds like a no-brainer, if you zoom in a bit closer there are some pros and cons. Let’s use an analogy here. Let’s say you want to travel from San Francisco to London.

The “classic” way of doing this is that all of the planning is up to you. Whether you travel by car, train, boat, bus, walking, running, or even swimming, it’s all up to you to figure out. And then there’s lodging, scheduling, and budget to think about. This is a lot of work. However, you can custom tailor the trip to your exact specifications, including timing and budget.

The “serverless” way of doing this is you close your eyes in San Francisco and you wake up in London. That sounds like the dream situation. However, you are soon handed a bill that costs $50,000 and you arrive a day after your big meeting. How did that work out for you?

If you’re a high volume traveler for business, you may prefer the high level of control the “classic” way brings, since you need that level of granularity. If you’re a wealthy aimless traveler just enjoying the world, serverless is probably the dream. It all depends; both methods are potentially great.

The lessons learned

As you review the results, we’d like you to bear in mind that jobs serverless will most likely improve with time, and that our current results are merely a snapshot in time. Will these numbers hold up in 1 year? Maybe, maybe not – we have no idea. We can only hope that this honest feedback from the field helps shape the offering.

Now for the good stuff: here are the top 9 lessons learned from evaluating Databricks Jobs Serverless.

Serverless compute is not cost optimized
Ideal for short or ad-hoc jobs
Eliminating spin up time is the biggest value add
Serverless has zero knobs, which makes life easy but at the price of control
You have no control over the runtime of your jobs
Migrating to serverless is not easy
Costs are completely determined by Databricks
What happens if there’s an error?
You can’t leverage your cloud contracts

Read on for our in-depth analysis of Databricks server vs. serverless computing.

1. Serverless compute is not cost optimized

By far, the biggest hope users have for serverless is that “it will optimize my compute for me.” While this is true in some regard, the core issue users need to understand is “Does serverless provide the lowest cost option?” We can answer very clearly and unambiguously that, unfortunately, it does not. Serverless is not the cheapest option around.

While serverless is a pretty good option, it’s certainly not optimal when it comes to cost for all use cases. As evidence, we ran a test job and found that a cluster optimized by Gradient (our flagship product) outperformed Databricks serverless jobs by roughly 60% from a cost perspective!

Our test job runs basic queries on a randomly generated dataset. The runtime on a classic cluster is about 1 hour, which is pretty typical of many jobs we’ve encountered at companies. Serverless was able to run the job much faster, taking only 30 minutes, which was great to see. Unfortunately, the runtime savings didn’t translate into cost savings, as you can see in the chart below.

In the test we are utilizing on-demand instances with list pricing. Users will likely save even more if using spot instances. However, you have no option to use spot instances with serverless. You have no access at all to what’s going on.

This test result might not translate to your internal jobs. You may have a job that demonstrates that serverless massively outperforms an optimized cluster in terms of costs – it all depends on your workload. With that said, this data point does prove that serverless is not GLOBALLY optimal. Serverless does not guarantee cost savings.

Some skeptics might say that the cost savings of serverless appear in the form of engineering hours saved. With serverless, engineers don’t have to spend time thinking about clusters – which can translate to real time and money saved. We completely agree with this point of view; those savings can be very substantial.

Our one counter argument is that getting started on any cluster is pretty easy today, so most people don’t spend time tuning their clusters if they don’t want to. Engineers typically resort to cluster tuning to help lower cost or improve performance. So if the cost and performance of Serverless is not ideal, you’re just out of luck – serverless may not be solving the root issue that tuning is attempting to solve.

At the end of the day everything depends on the particularities of your workload and use case.

2. Ideal for short or ad-hoc jobs

A great use case for serverless, which we fully endorse, is using serverless for short (<5 min) jobs. The elimination of spin up time for your cluster is a massive win that we really love.

Here’s an experiment we ran using a trivial job that doesn’t really do anything. The job doesn’t even run Spark and is run on a single node cluster. The cost was slightly lower with serverless, but the big win was in the runtime, where we saw a roughly 80% reduction! This improvement is mostly due to the complete elimination of cluster spin up time, which can take 5-10 minutes.

What’s interesting is even though serverless does not have spin up time, the cost premium for serverless still equated to roughly the same overall cost – which was a bit disappointing.

With that said, the big win is that users may not always know that they are running a short job and could be massively over provisioning their clusters by accident.

Serverless helps to avoid that mistake, and that can result in substantial cost savings – simply by preventing human error.
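To see why removing spin up time matters so much for short jobs, it helps to work out what share of wall-clock time the spin up takes. The sketch below uses made-up numbers (1 minute of actual work behind a 5 minute spin up, the low end of the range quoted above), and `spinup_share` is a hypothetical helper we wrote for illustration, not a Databricks API.

```python
def spinup_share(spinup_minutes: float, work_minutes: float) -> float:
    """Return the fraction of wall-clock time spent waiting for the cluster."""
    total = spinup_minutes + work_minutes
    if total <= 0:
        raise ValueError("total runtime must be positive")
    return spinup_minutes / total


# Hypothetical short job: 1 minute of work behind a 5 minute spin up.
share = spinup_share(spinup_minutes=5, work_minutes=1)
print(f"{share:.0%} of wall-clock time is spin up")
```

With these assumed numbers, most of the wall-clock time is spent waiting, which is the same order of magnitude as the runtime reduction we measured. For an hour-long job the same 5 minutes is a much smaller share.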

3. Eliminating spin up time is the biggest value add

We couldn’t love this aspect enough. Cluster spin up time is such a pain to deal with when you’re just trying to run something in real-time. So many times users have to wait for a cluster to spin up and get sidetracked by another task so that they don’t come back to the cluster until an hour later.

Our one nitpick here is that for scheduled jobs in a pipeline, spin up time is less of a concern. These are jobs that can be running at all hours of the day at scales of 1000s of jobs running per day. At that level, spin up time is really a cost factor – and then the real question is if serverless provides the lowest cost.

However, if you’re doing a quick ad-hoc experiment, or just want to get a quick result – we highly recommend serverless as you’ll get your result much faster.

4. Serverless has zero knobs, which makes life easy but at the price of control

Something quite unique about Databricks jobs serverless is that there are zero knobs. Not even “T-shirt” sizing, like what we have today in SQL serverless platforms. This means that you can’t even select “small, medium, large, x-large” clusters – you don’t get to select anything.

For a company that is just trying to get jobs up and running asap, we think this can be pretty great. It can save engineers some time when it comes to provisioning infrastructure.

The big tradeoff is that you can’t change anything. If you care about cost and runtime and want the ability to tune performance, then this may not be a convenient feature.

5. You have no control over the runtime of your jobs

The big downside of jobs serverless is that there’s no way to tune the cluster to adjust cost or runtime. You basically have to live with whatever Databricks decides. This means that if you want faster runtime, you can’t just throw a bigger cluster at it and call it a day. You can’t do anything really, except change your code. You’re stuck.

We can only assume that eventually Databricks will throw in some high-level “performance” knob, as we think this is a pretty big limitation, but who knows.

6. Migrating to serverless is not easy

Serverless utilizes shared compute resources in the background, and as a result enforces a large number of general restrictions. We’ve heard rumors that it’s a giant Spark on Kubernetes cluster, but don’t quote us on that.

A few impactful restrictions of serverless are:

You must have Unity Catalog enabled
Scala and R are not supported
Only ANSI SQL is supported when writing SQL
Spark RDD APIs are not supported
Caching API and SQL commands are not supported
Global temporary views are not supported
You cannot access DBFS

This goes on and on, with over 100 limitations in total. We, in fact, had a hard time getting ANY job to run on serverless. We ran into issues even with simple test jobs. It wasn’t until we manually changed the code and moved data around that we finally got it to work.
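Before attempting a migration, a quick text scan of a job’s source can give a rough idea of how much rework is ahead. The sketch below is a heuristic of our own, not a Databricks tool: the regular expressions cover only four of the restrictions listed above, `find_serverless_blockers` and `sample_job` are hypothetical names, and a pattern match is only a hint that a line deserves a closer look.

```python
import re

# Heuristic patterns for a few of the restrictions listed above.
# Illustrative only: this is not an official or complete list.
RESTRICTION_PATTERNS = {
    "Spark RDD API": re.compile(r"\.rdd\b|\bsc\.parallelize\("),
    "Caching API or SQL command": re.compile(
        r"\.cache\(\)|\.persist\(|\bCACHE\s+TABLE\b", re.IGNORECASE
    ),
    "Global temporary view": re.compile(r"[Gg]lobalTempView|\bglobal_temp\."),
    "DBFS access": re.compile(r"dbfs:/|/dbfs/"),
}


def find_serverless_blockers(source: str) -> list[tuple[int, str]]:
    """Return (line_number, restriction) pairs for lines matching a pattern."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for restriction, pattern in RESTRICTION_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, restriction))
    return hits


# Hypothetical classic-compute job used only to exercise the scan.
sample_job = '''
df = spark.read.parquet("dbfs:/mnt/raw/events")
df.cache()
counts = df.rdd.map(lambda row: (row.user_id, 1)).reduceByKey(lambda a, b: a + b)
df.createOrReplaceGlobalTempView("events")
'''

for line_number, restriction in find_serverless_blockers(sample_job):
    print(f"line {line_number}: {restriction}")
```

A scan like this will miss restrictions that only show up at runtime, so it complements a trial run on serverless rather than replacing it.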

Our opinion is that unless this is improved dramatically, migrating will be a giant amount of work for enterprises who built their jobs on classic compute, not a simple lift and shift. Serverless eliminates the general flexibility we had on classic clusters. This will likely slow the adoption of serverless for larger companies. New workloads will probably get onboarded to serverless first, before any big migration effort takes place.

7. Costs are completely determined by Databricks

One thing we found troubling was the pricing. On the Databricks website they say that the cost is $0.35/DBU. But, where is the DBU/hr metric? Normally, one would take the runtime of the job and calculate a $/hr rate. Then, a user could tune the cluster size and tune the rate of cost. But with zero knobs, we have no control over the rate of cost.

It appears that we only get the number of DBUs a job costs AFTER it completes via the system tables. We find it odd that this number is calculated completely in the dark. Today my job costs 10 DBUs, next month it may cost 12 DBUs. Why? No clue. The amount of power Databricks has here is a bit overreaching in our opinion.

In fact, because there is no DBU/hr metric, the “list price” they have here of $0.35/DBU is incomplete. Since the amount of DBUs my jobs cost is solely under Databricks’ control, the end price is pretty much arbitrary. For example, the calculation of job cost is simply: $0.35/DBU * X DBU, where X is the amount of DBUs Databricks determines and reports in the system tables.
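To make the arithmetic concrete, here is a minimal sketch of that calculation using the $0.35/DBU list price and the 10 DBU and 12 DBU figures from our example above. The function name is our own, and the DBU counts are illustrative rather than measured.

```python
PRICE_PER_DBU = 0.35  # USD per DBU, the list price quoted above


def serverless_job_cost(dbus: float, price_per_dbu: float = PRICE_PER_DBU) -> float:
    """Cost of one job run: price per DBU times the DBUs reported for the run."""
    return price_per_dbu * dbus


# Same job, same code, but the reported DBU count drifts from 10 to 12.
today = serverless_job_cost(10)
next_month = serverless_job_cost(12)
increase = (next_month - today) / today
print(f"today: ${today:.2f}, next month: ${next_month:.2f}, change: {increase:+.0%}")
```

The only input the user controls here is which job to run. The DBU count, and therefore the bill, comes entirely from what is reported after the fact.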

This is quite the advantage for Databricks, and in our opinion, it will result in large revenue and profit growth. Any compute optimization they do on the backend helps widen their margins, while the cost to the user stays mysterious. I don’t blame Databricks for doing this, and this is also not a new concept. In fact, many companies prefer to run serverless for this revenue generating reason. Users get “convenience” and Databricks makes more money, it’s a fair exchange.

What’s funny is that this is approaching Snowflake’s serverless model. Not only is Snowflake the company’s mortal enemy, but many people also complain that Snowflake is too expensive for this very reason. It will be interesting to see how the market reacts in the long run and if sticker shock from their serverless bills will cause some CFOs to take action.

8. What happens if there’s an error?

In Spark-land, everyone knows of the dreaded out of memory (OOM) error. What happens if the cluster under the serverless hood hits this error for your job? Typically we would try to fix this with more memory, but we don’t have that option with serverless.

Are users now dependent on Databricks to fix this? That could be dangerous. In cluster-land, a million things can go wrong, and now that it’s all managed by Databricks you’re pretty much at the mercy of their support team if anything critical goes down.

9. You can’t leverage your cloud contracts

If you have established special discounts with AWS or Azure, or custom plans for certain instances – listen up. If you use serverless jobs, those discounts can become irrelevant with regard to your Databricks usage. This is because on serverless the compute runs inside the Databricks environment.

This may or may not apply to your company, it really depends on the nature of your contracts and the volume of your Databricks usage. We thought we’d point it out as it is a major difference between classic and serverless compute.

Conclusion

Distributed computing is a very complex topic, and rarely is anything a guaranteed slam dunk. In this post, we covered the pros and cons of Databricks serverless jobs as it is today. Some will find this capability beneficial and that’s fantastic! Based on our initial analysis, we found serverless jobs to be ideal for short and ad-hoc jobs, as the largest value add is the elimination of spin up times.

Like most companies, Databricks’ marketing overshoots the benefits of their features from time to time. We’ve already reported on this with regard to Photon and Autoscaling, and we hope we have now sprinkled some truth on Serverless as well. By the way, Databricks states that Photon and autoscaling are automatic for jobs serverless, which, in our analysis, often leads to unnecessary cost increases.

As you can imagine, we get a lot of questions about serverless. Here are the top two questions we are asked about Databricks serverless jobs:

Databricks says serverless is cost efficient, so what’s the deal?

Yes, Databricks has presented materials touting that serverless is cost efficient relative to classic compute. This presentation from DAIS 2024 is a great example. The real question is – what do you compare serverless to? If you’re comparing serverless to just default settings, then serverless may likely do very well. But a workload optimized cluster can likely outperform serverless. Finding an optimized cluster, though, is by no means an easy feat. How do we determine how “optimal” it is?

Skeptics may say that we are biased as well, and that the examples above are based on jobs we knew could outperform serverless. This is a totally fair position to have. For the record – we did not cherry pick a workload in our tests, we just quickly found the first job we could that was even compatible with serverless.

At the end of the day, benchmarks presented by external parties can be totally irrelevant to your use case. Fancy benchmarks like TPC-DS, or even the one we shared in this post do not look like your jobs. There’s only one thing that really matters: YOUR WORKLOADS.

Finally, we say, don’t take our word for it, nor Databricks’. Compare serverless head to head with an optimized classic cluster and see for yourself. If you need help, Gradient is here to help lead you to that coveted optimized cluster. Or, if you have the knowledge and skills, you can manually optimize your cluster and compare it to serverless.

How does this impact Sync?

Does serverless impact the Sync roadmap? We have a short term and a long term answer to this question:

Short term: Serverless and classic will co-exist for quite some time. We see serverless as just another option (like Photon, or autoscaling). Sync’s algorithms can test whether serverless or an optimized classic cluster is best for your needs, and share recommendations or auto-apply those changes for you. At the end of the day, all we care about is selecting the best options and configurations based on your goals. If serverless is that, then we’ll be happy to point your jobs in that direction.

Long term: We are a cloud compute management company, and Databricks is just the first stop in our evolution. Our plan is to expand to all facets of cloud computing, from Spark, to bare CPUs, GPUs, Kubernetes etc., it’s all up for grabs. Databricks has been a great partner and an important first step, but longer term, it will be one of dozens of platforms we support. Our guess is optimizing classic compute will play a role for Databricks users for many years to come, and we’re happy to help them with that.

Like we said earlier, don’t take our word for it. Try serverless out for yourself, do your own homework. Conduct an A/B test and see if serverless is actually cost effective. If you need help automatically finding the optimized classic cluster, feel free to check us out.
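As a rough sketch of what such an A/B test could look like, the snippet below averages the cost and runtime of a few runs of the same job on each option and reports the relative difference. Every number in it is a placeholder, not one of our measurements, and `Run`, `summarize`, and `compare` are hypothetical names. In practice you would fill in the per-run cost (for classic, DBU charges plus cloud instance charges) and runtime from your own billing and job history.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    cost_usd: float
    runtime_minutes: float


def summarize(runs: list[Run]) -> tuple[float, float]:
    """Return (mean cost, mean runtime) for a set of runs."""
    return mean(r.cost_usd for r in runs), mean(r.runtime_minutes for r in runs)


def compare(classic: list[Run], serverless: list[Run]) -> dict[str, float]:
    """Relative change when moving from classic to serverless (negative is better)."""
    classic_cost, classic_runtime = summarize(classic)
    serverless_cost, serverless_runtime = summarize(serverless)
    return {
        "cost_change": (serverless_cost - classic_cost) / classic_cost,
        "runtime_change": (serverless_runtime - classic_runtime) / classic_runtime,
    }


# Placeholder data: three runs of the same job on each option.
classic_runs = [Run(4.0, 60), Run(4.5, 62), Run(3.5, 58)]
serverless_runs = [Run(6.0, 30), Run(6.5, 31), Run(5.5, 29)]

result = compare(classic_runs, serverless_runs)
print(f"cost: {result['cost_change']:+.0%}, runtime: {result['runtime_change']:+.0%}")
```

Running several repetitions per option matters because a single run can be skewed by noise, and serverless DBU counts in particular can vary from run to run.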

Rethinking Serverless: The Price of Convenience

Vinoo Ganesh
02.09.2024

As is the case with many concepts in technology, the term Serverless is abusively vague. As such, discussing the idea of “serverless” usually invokes one of two feelings in developers. Either it’s thought of as the catalyst for an incredible potential future, finally freeing developers from having to worry about resources or scaling concerns, or it’s thought of as the harbinger of yet another “we don’t need DevOps anymore” trend.

The root cause of this confusion has to do with the fact that the catch-all term “Serverless” actually comprises two large operating models: functions and jobs. At Sync – we’re intimately familiar with optimizing jobs, so when our customers gave us feedback to focus a portion of our attention on serverless functions, we were more than intrigued.

The hypothesis was simple. Could we extend our expertise and background in optimizing Databricks large scale/batch compute workloads to optimizing many smaller batch compute workloads?

Serverless Functions

First, let’s see how we got here.

One of the most painful parts of the developer workflow is “real world deployment.” In the real world, deploying code that was written locally to the right environment, and getting it to work in the same way, was extraordinarily painful. Library issues, scaling issues, infrastructure management issues, provisioning issues, resource selection issues, and a number of other issues plagued developers. The cloud just didn’t mimic the ease and simplicity of local developer environments.

Then Serverless functions emerged. All of a sudden, developers could write and deploy code in a function with the same level of simplicity as writing it locally. They never had to worry about spinning up an EC2 instance or figuring out what the material differences between AMI and Ubuntu are. They didn’t have to play with docker files or even have to do scale testing. They wrote the exact same Python or NodeJS code that they wrote locally in a Cloud IDE and it just worked. It seemed perfect.

Soon, mission critical pieces of infrastructure were supported by double-digit-line Python functions deployed in the cloud. Enter: Serverless frameworks. All of a sudden, it became even easier to adopt and deploy serverless functions. Enterprises adopted these functions like hotcakes. Many deployed hundreds or even thousands of these functions.

Why We Care

At Sync, our focus since our inception has been optimizing large scale compute jobs. Whether through Spark, EMR, or Databricks, introspecting a job and building a model through which we can understand and optimize that job is our bread and butter. As we continued our development, multiple customers began asking for support of serverless technologies. Naturally, we assumed they were talking about Serverless Job functionality (which many were), but there was a substantial portion focused on Serverless Function functionality.

So we set out to answer a simple question: Are Serverless Functions in their current form working for the modern enterprise?

The answer, as it happens, is a resounding no.

Industry Focus

In 2022, an IBM blog post titled “The Future Is Serverless” was published, which cited the “energy-efficient and cost-efficient” nature of serverless applications as a primary reason that the future will be serverless. They make the – valid – case that reserving cloud capacity is challenging and consumers of cloud serverless functions are better served by allowing technologies such as KNative to streamline the “serverless-ification” processes. In short, their thesis is that complex workloads, such as those run in Kubernetes, are better served by Serverless offerings.

In 2023, Datadog released their annual “State of Serverless” post, where they show the continued adoption of Serverless technologies. This trend is present across all of the 3 major cloud vendors.

https://www.datadoghq.com/state-of-serverless/

The leader of the pack is AWS Lambda. Lambda has traditionally been the entry point for developers to deploy their Serverless workloads.

But hang on, 40%+ of Lambda invocations happen in NodeJS? NodeJS is not traditionally thought of as a distributed computing framework, nor is it generally used for large scale orchestration of compute tasks. But it seems to be dominating the Lambda serverless world.

So, yes, IBM argues that Serverless is great for scaling distributed computation tasks, but what if that’s not what you’re doing with Serverless?

https://www.datadoghq.com/state-of-serverless/

What Serverless Solved

Before we get into the details of what’s missing, let’s talk about where things are currently working.

Where Things Work 1: Uptime Guarantees

One of the critical, but most frustrating pieces of the developer lifecycle is uptime requirements. Many developers hear the term five-nines, and shudder. Building applications that have specific uptime guarantees is not only challenging, it’s also time-intensive. When large scale systems are made up of small, discrete pieces of computation, the problem can become all the more complex.

Luckily, the Lambda SLAs guarantee a fairly reasonable amount of uptime, right out of the box. This can save otherwise substantial developer efforts of scoping, building, and testing highly available systems.

Where Things Work 2: Concurrency + Auto Scaling

Introspecting a large scale system isn’t easy. Companies like Datadog and Cloudflare run multi-billion dollar businesses off of this exact challenge. In an environment where requests can burst unexpectedly, creating and designing systems that scale based on spot user demand is also difficult.

One of the most powerful aspects of a serverless or hosted model (such as AWS Lambda) is the demand-based auto-scaling capability offered by the infrastructure. These effects are compounded, especially when the functions themselves are stateless. This effectively eliminates developers having to care about the operational concerns of autoscaling. There are unquestionably still cost concerns, server load concerns, and others, but serverless function offerings give developers a good starting point.

Problem 1: Developer Bandwidth

In a typical Serverless Function deployment, the initial choice of configuration tends to be the perpetual choice of configuration.

Wait, hang on, “initial choice of configuration”? Meaning, users still have to manually select their own configuration? It turns out, yes, users still need to manually pick a particular configuration for each serverless function they deploy. It’s actually a bit ironic – with the promise of true 0-management jobs, users are still required to intelligently select resource configuration.

If an engineer deploys and accidentally overspecs a serverless function initially, it’s fairly unlikely that they will ever revisit the function to optimize it. This is generally the case for a few reasons:

1. Time – Most engineers don’t have the time to go back and ensure that functions they have written weeks, months, or even years ago are operating under the ideal resources. This largely feeds into #2.
2. Incentives – Engineers are not incentivized to pick the optimal resource configuration for their jobs. They’d rather have the job be guaranteed to work, while spending a bit more of their company’s compute budget.
3. Employee Churn – Enterprises have inherent entropy and employees are oftentimes transient. People start jobs and people leave jobs. The knowledge generally leaves with them. When other engineers inherit previous work, they are significantly more incentivized to just ensure it works, rather than ensure that it works optimally.

Problem 2: Serverless Still Requires Tuning

Lambda is predicated on a simple principle – the resource requirements for workloads that take less than 15 minutes to run, can be pretty easily approximated. Lambda makes it easy for developers to set-and-forget, offering only one knob for them to worry about.

That knob is memory. Using Lambda, you can configure the memory allocated to a Lambda function as a value between 128 MB and 10,240 MB. Lambda will automatically decide how much vCPU to allocate to you based on the memory setting.

This… sounds great. “I only have to pick one lever and, all of a sudden, I get everything else figured out for me? That’s perfect!” If that were the end of the story, I would get to finish this post right now.

Instead, life is all about tradeoffs – generally correlated tradeoffs. In this case, it’s cost and performance. As an engineer, it’s easy for me to pick the largest memory setting available to me just to ensure my Lambda function works, regardless of what its actual resource requirements are. Once it works, why would I ever touch it again?

Well, it turns out that picking large, frequently uncorrelated-to-necessary-resources values isn’t the most cost effective thing to do. So much so, in fact, that an AWS Solutions Engineer built and open sourced a tool to help users actually find the correct memory levels for their Lambda functions. The tool uses AWS Step Functions to walk users down to the minimum necessary level. It’s been so popular that it has 5K stars on GitHub… and 18.8K deployments.
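To illustrate the tradeoff, here is a simplified sketch of the idea of measuring a function at several memory settings and picking the cheapest one that is still fast enough. It is not how the open source tool is implemented. The per-GB-second price is an assumed rate that should be checked against current AWS pricing, the measurements are invented, and `invocation_cost` and `cheapest_setting` are hypothetical helpers.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed USD rate; check current AWS pricing
MIN_MEMORY_MB, MAX_MEMORY_MB = 128, 10_240  # configurable range quoted above


def invocation_cost(memory_mb: int, duration_ms: float) -> float:
    """Compute cost of one invocation, billed as memory (GB) times duration (s)."""
    if not MIN_MEMORY_MB <= memory_mb <= MAX_MEMORY_MB:
        raise ValueError(f"memory must be {MIN_MEMORY_MB}-{MAX_MEMORY_MB} MB")
    return (memory_mb / 1024) * (duration_ms / 1000) * PRICE_PER_GB_SECOND


def cheapest_setting(measurements: dict[int, float], max_duration_ms: float) -> int:
    """Pick the lowest-cost memory setting whose duration meets the target."""
    candidates = {
        memory_mb: invocation_cost(memory_mb, duration_ms)
        for memory_mb, duration_ms in measurements.items()
        if duration_ms <= max_duration_ms
    }
    if not candidates:
        raise ValueError("no memory setting meets the duration target")
    return min(candidates, key=candidates.get)


# Invented measurements: memory setting (MB) -> average duration (ms).
measured = {128: 4000, 512: 900, 1024: 480, 10_240: 400}

best = cheapest_setting(measured, max_duration_ms=1000)
overspend = invocation_cost(10_240, measured[10_240]) / invocation_cost(best, measured[best])
print(f"best setting: {best} MB; max memory costs {overspend:.1f}x as much per call")
```

In this made-up example the largest setting is only slightly faster than the mid-sized ones but costs several times more per invocation, which is the pattern that makes “pick the biggest and forget it” expensive at scale.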

Clearly, the one-knob-rules-all solution isn’t working.

Problem 3: Serverless Is Hard to Introspect

The scale and growth testing that plagued engineers for decades before the rise of Serverless, was unfortunately not in vain. Understanding how users will be interacting with an application, in terms of number of requests or compute load gives engineers a powerful understanding of what to expect when things go live.

In the Serverless Function architecture, engineers don’t think about these considerations and push the burden onto the infrastructure itself. As long as the infrastructure works – it’s unlikely that an already oversubscribed engineer would spend time digging into the performance or cost characteristics of the Serverless function.

Absent home-rolled solutions, there are few tools that allow for the detailed observability of a single serverless function. Furthermore, there are usually hundreds if not thousands of serverless functions deployed. Observability across a fleet of functions is nearly impossible.

Furthermore, the primary mechanism folks can use for per-function observability is AWS CloudWatch. CloudWatch logs events for each Lambda invocation and stores a few metrics. The major problem, though, is that just collecting this information in CloudWatch has been observed to be more expensive than Lambda itself. In fact, there are full articles, posts, and best practices around just managing the costs associated with Lambda CloudWatch logs.

Problem 4: No Auto-Optimization

The year 2023 brought on a material shift in the mentality of “compute” consumers. Enterprises that were previously focused on growth at all costs shifted their focus to efficiency. Vendors in the generic Cloud, Snowflake, and Databricks ecosystem popped up at increasing rates. Most had a simple goal – provide high level visibility into workloads.

They provided interactive charts and diagrams to show ongoing cost changes… But they didn’t provide the fundamental “healing” mechanisms. It would be like going to the doctor and having them diagnose a problem but offer no treatment.

Consistent with their focus on efficiency, enterprises had a few options. Larger ones deployed full teams to focus on this effort. Smaller ones that didn’t have the budget or manpower turned to observability tools… nearly all of which fell short, as they missed the fundamental optimization component.

Providing detailed visibility across a few, large scale jobs is considered table stakes for many observability providers, but for some reason providing that same level of visibility across many, small scale jobs, in an efficient and easy to optimize way hasn’t become standard.

Conclusion

We’re in a fairly unique period as an industry. Job visibility, tuning, introspection, and optimization have reemerged as key pieces of the modern tech stack. But most focus on the whales, when they should be focusing on the barracudas.

If these problems resonate with you – drop us a line at info@synccomputing.com. We’d love to chat.