Rethinking Serverless: The Price of Convenience

As is the case with many concepts in technology, the term Serverless is abusively vague. As such, discussing the idea of “serverless” usually evokes one of two feelings in developers. Either it’s thought of as the catalyst for a potentially incredible future, finally freeing developers from having to worry about resources or scaling concerns, or it’s thought of as the harbinger of yet another “we don’t need DevOps anymore” trend.

The root cause of this confusion has to do with the fact that the catch-all term “Serverless” actually comprises two large operating models: functions and jobs. At Sync, we’re intimately familiar with optimizing jobs, so when our customers gave us feedback to focus a portion of our attention on serverless functions, we were more than intrigued.

The hypothesis was simple. Could we extend our expertise and background in optimizing large scale Databricks batch compute workloads to optimizing many smaller batch compute workloads?

Serverless Functions

First, let’s see how we got here.

One of the most painful parts of the developer workflow is “real world deployment.” In the real world, deploying code that was written locally to the right environment, and getting it to work the same way, was extraordinarily painful. Library issues, scaling issues, infrastructure management issues, provisioning issues, resource selection issues, and a number of other issues plagued developers. The cloud just didn’t mimic the ease and simplicity of local developer environments.

Then Serverless functions emerged. All of a sudden, developers could write and deploy code in a function with the same level of simplicity as writing it locally. They never had to worry about spinning up an EC2 instance or figuring out what the material differences between AMI and Ubuntu are. They didn’t have to play with Dockerfiles or even do scale testing. They wrote the exact same Python or NodeJS code that they wrote locally in a Cloud IDE and it just worked. It seemed perfect.

Soon, mission critical pieces of infrastructure were supported by double-digit-line Python functions deployed in the cloud. Enter: Serverless frameworks. All of a sudden, it became even easier to adopt and deploy serverless functions. Enterprises adopted these functions like hotcakes. Many deployed hundreds or even thousands of them.

Why We Care

At Sync, our focus since our inception has been optimizing large scale compute jobs. Whether through Spark, EMR, or Databricks, the idea of introspecting a job and building a model through which we can understand and optimize that job is our bread and butter. As we continued our development, multiple customers began asking for support of serverless technologies. Naturally, we assumed they were talking about Serverless Job functionality (which many were), but there was a substantial portion focused on Serverless Function functionality.

So we set out to answer a simple question: Are Serverless Functions in their current form working for the modern enterprise? 

The answer, as it happens, is a resounding no. 

Industry Focus

In 2022, an IBM blog post titled “The Future Is Serverless” was published, which cited the “energy-efficient and cost-efficient” nature of serverless applications as a primary reason that the future will be serverless. They make the – valid – case that reserving cloud capacity is challenging and consumers of cloud serverless functions are better served by allowing technologies such as Knative to streamline the “serverless-ification” process. In short, their thesis is that complex workloads, such as those run in Kubernetes, are better served by Serverless offerings.

In 2023, Datadog released their annual “State of Serverless” post, where they show the continued adoption of Serverless technologies. This trend is present across all of the 3 major cloud vendors.

https://www.datadoghq.com/state-of-serverless/

The leader of the pack is AWS Lambda. Lambda has traditionally been the entry point for developers to deploy their Serverless workloads. 

But hang on, 40%+ of Lambda invocations happen in NodeJS? NodeJS is not traditionally thought of as a distributed computing framework, nor is it generally used for large scale orchestration of compute tasks. But it seems to be dominating the Lambda serverless world.

So, yes, IBM argues that Serverless is great for scaling distributed computation tasks, but what if that’s not what you’re doing with Serverless?

What Serverless Solved 

Before we get into the details of what’s missing, let’s talk about where things are currently working. 

Where Things Work 1: Uptime Guarantees 

One of the critical, but most frustrating pieces of the developer lifecycle is uptime requirements. Many developers hear the term five-nines, and shudder. Building applications that have specific uptime guarantees is not only challenging, it’s also time-intensive. When large scale systems are made up of small, discrete pieces of computation, the problem can become all the more complex. 

Luckily, the Lambda SLAs guarantee a fairly reasonable amount of uptime, right out of the box. This can save otherwise substantial developer efforts of scoping, building, and testing highly available systems. 

Where Things Work 2: Concurrency + Auto Scaling

Introspecting a large scale system isn’t easy. Companies like Datadog and Cloudflare run multi-billion dollar businesses off of this exact challenge. In an environment where requests can burst unexpectedly, designing systems that scale based on spot user demand is also difficult.

One of the most powerful aspects of a serverless or hosted model (such as AWS Lambda) is the demand-based auto-scaling offered by the infrastructure. These effects are compounded when the functions themselves are stateless. This effectively frees developers from the operational concerns of autoscaling. There are unquestionably still cost concerns, server load concerns, and others, but serverless function offerings give developers a good starting point.

Problem 1: Developer Bandwidth 

In a typical Serverless Function deployment, the initial choice of configuration tends to be the perpetual choice of configuration.

Wait, hang on, “initial choice of configuration”? Meaning, users still have to manually select their own configuration? It turns out, yes, users still need to manually pick a particular configuration for each serverless function they deploy. It’s actually a bit ironic – with the promise of true 0-management jobs, users are still required to intelligently select resource configuration. 

If an engineer deploys and accidentally overspecs a serverless function initially, it’s fairly unlikely that they will ever revisit the function to optimize it. This is generally the case for a few reasons:

  1. Time – Most engineers don’t have the time to go back and ensure that functions they have written weeks, months, or even years ago are operating under the ideal resources. This largely feeds into #2.
  2. Incentives – Engineers are not incentivized to pick the optimal resource configuration for their jobs. They’d rather have the job be guaranteed to work, while spending a bit more of their company’s compute budget.
  3. Employee Churn – Enterprises have inherent entropy and employees are oftentimes transient. People start jobs and people leave jobs. The knowledge generally leaves with them. When other engineers inherit previous work, they are significantly more incentivized to just ensure it works, rather than ensure that it works optimally.

Problem 2: Serverless Still Requires Tuning

Lambda is predicated on a simple principle – the resource requirements for workloads that take less than 15 minutes to run can be pretty easily approximated. Lambda makes it easy for developers to set-and-forget, offering only one knob for them to worry about.

That knob is memory. Using Lambda, you can configure the memory allocated to a lambda function as a value between 128 MB and 10,240 MB. Lambda will automatically decide how much vCPU to allocate to you based on the memory setting. 

This… sounds great. “I only have to pick one lever and, all of a sudden, I get everything else figured out for me? That’s perfect!” If that were the end of the story, I would get to finish this post right now.

Instead, life is all about tradeoffs – generally correlated tradeoffs. In this case, it’s cost and performance. As an engineer, it’s easy for me to pick the largest memory setting available to me just to ensure my Lambda function works, regardless of what its actual resource requirements are. Once it works, why would I ever touch it again?

Well, it turns out that picking large, frequently uncorrelated-to-necessary-resources values isn’t the most cost effective thing to do. So much so, in fact, that an AWS Solutions Engineer built and open sourced a tool to help users actually find the correct memory levels for their Lambda functions. The tool uses AWS Step Functions to walk users down to the minimum necessary level. It’s been so popular that it has 5K stars on GitHub… and 18.8K deployments.

Clearly, the one-knob-rules-all solution isn’t working.  
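
To make the cost side of that tradeoff concrete, here is a minimal sketch of how the memory knob feeds into the bill. It assumes Lambda compute is billed as allocated memory (in GB) multiplied by billed duration (in seconds), and it ignores the separate per-request charge. The price constant and every measurement below are illustrative assumptions, not quotes from AWS, and the function names are hypothetical.

```python
# Illustrative only: the price and the measurements are assumptions, not AWS quotes.
PRICE_PER_GB_SECOND = 0.0000166667  # assumed on-demand rate; check current pricing
MIN_MEMORY_MB, MAX_MEMORY_MB = 128, 10_240  # the configurable memory range


def invocation_cost(memory_mb: int, duration_ms: float) -> float:
    """Compute cost of one invocation: GB allocated x seconds billed x price."""
    if not MIN_MEMORY_MB <= memory_mb <= MAX_MEMORY_MB:
        raise ValueError(f"memory must be {MIN_MEMORY_MB}-{MAX_MEMORY_MB} MB")
    gb = memory_mb / 1024
    seconds = duration_ms / 1000
    return gb * seconds * PRICE_PER_GB_SECOND


def cheapest_setting(measurements: dict[int, float]) -> int:
    """Return the memory setting (MB) with the lowest cost per invocation."""
    return min(measurements, key=lambda mb: invocation_cost(mb, measurements[mb]))


# Hypothetical average durations (ms) observed at each memory setting.
measured = {128: 2400.0, 512: 520.0, 1024: 300.0, 10_240: 280.0}

for mb, ms in measured.items():
    cost_per_million = invocation_cost(mb, ms) * 1_000_000
    print(f"{mb:>6} MB  {ms:>7.0f} ms  ${cost_per_million:.2f} per 1M calls")
print("cheapest:", cheapest_setting(measured), "MB")
```

With this made-up data, the largest setting is barely faster than 1,024 MB but costs roughly ten times as much as the cheapest option. That gap is exactly what "pick the biggest and forget it" leaves on the table.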


Problem 3: Serverless Is Hard to Introspect

The scale and growth testing that plagued engineers for decades before the rise of Serverless, was unfortunately not in vain. Understanding how users will be interacting with an application, in terms of number of requests or compute load gives engineers a powerful understanding of what to expect when things go live. 

In the Serverless Function architecture, engineers don’t think about these considerations and push the burden onto the infrastructure itself. As long as the infrastructure works – it’s unlikely that an already oversubscribed engineer would spend time digging into the performance or cost characteristics of the Serverless function. 

Absent home-rolled solutions, there are few tools that allow for the detailed observability of a single serverless function. Furthermore, there are usually hundreds if not thousands of serverless functions deployed. Observability across a fleet of functions is nearly impossible.

Furthermore, the primary mechanism folks can use for per-function observability is AWS CloudWatch. CloudWatch logs events for each Lambda invocation and stores a few metrics. The major problem, though, is that just collecting this information in CloudWatch has been observed to be more expensive than Lambda itself. In fact, there are full articles, posts, and best practices around just managing the costs associated with Lambda CloudWatch logs.

Problem 4: No Auto-Optimization

The year 2023 brought on a material shift in the mentality of “compute” consumers. Enterprises that were previously focused on growth at all costs shifted their focus to efficiency. Vendors in the generic Cloud, Snowflake, and Databricks ecosystem popped up at increasing rates. Most had a simple goal – provide high level visibility into workloads. 

They provided interactive charts and diagrams to show ongoing cost changes… But they didn’t provide the fundamental “healing” mechanisms. It would be like going to the doctor and having them diagnose a problem but provide no recourse.

Consistent with their focus on efficiency, enterprises had a few options. Larger ones deployed full teams to focus on this effort. Smaller ones that didn’t have the budget or manpower turned to observability tools… nearly all of which fell short, as they missed the fundamental optimization component.  

Providing detailed visibility across a few, large scale jobs is considered table stakes for many observability providers, but for some reason providing that same level of visibility across many, small scale jobs, in an efficient and easy to optimize way hasn’t become standard. 

Conclusion

We’re in a fairly unique period as an industry. Job visibility, tuning, introspection, and optimization have reemerged as key pieces of the modern tech stack. But most focus on the whales, when they should be focusing on the barracudas. 

If these problems resonate with you – drop us a line at info@synccomputing.com. We’d love to chat. 

Databricks vs Snowflake: A Complete 2024 Comparison 

Databricks and Snowflake are two of the largest names in cloud data solutions – and for good reason. Both platforms have been instrumental in helping companies generate value from their internal (and external) data assets. Each platform has distinct advantages and features and they’ve increasingly overlapped in their offerings—leaving many confused about which solution is best suited for their business needs.

Unfortunately, as is the case in most things in life, there isn’t a simple answer for “which one is better.” But, at Sync, we’ve seen, debugged, and optimized thousands of workloads across numerous enterprises, which has given us a unique perspective. 

To begin a meaningful comparison, we have to understand the history and core competency of each offering.  To help in that process, we at Sync broke down the major differences between Snowflake and Databricks — including price, performance, integration, security and best use cases—in order to best address the individual needs of each user.

While this topic has been debated plenty of times in the past, here at Sync we aim to provide emphasis on the total cost of ownership and ROI for companies.  With all that said, let’s begin!

Below is our complete 2024 guide to all things Databricks vs. Snowflake.

Databricks vs. Snowflake: What Are The Key Differences?

The first thing to understand about the two platforms is what they are, and what solution they are hoping to provide.

Databricks is a cloud-based, unified data analytics platform for building, deploying, and sharing data analytics solutions at scale. Databricks aims to provide one unified interface where users can store data and execute jobs in interactive, shareable workspaces. These workspaces contain cloud-based notebooks, the backbone of Databricks, through which compute jobs are defined and then executed on cloud-based machines.

Snowflake, on the other hand, is a fully-managed, SaaS cloud-based data warehouse. Whereas Databricks was initially designed to unify data pipelines, Snowflake was designed to be the easiest to manage data warehouse solution on the cloud. While the target market for Databricks is data scientists and data engineers, the target market for Snowflake is typically data analysts—who are highly proficient in SQL queries and data analysis—but not as interested in complex computations or machine learning workflows.

Over time, Databricks and Snowflake have been increasingly in competition, as each hopes to expand their offerings to be an all-in-one cloud data platform solution. New products like Snowflake’s Snowpark (which offers Python functionality) and Databricks’s DBSQL (their serverless data warehouse) have made it increasingly difficult to differentiate the offering of each product. 

For the time being, most would agree that Snowflake tends to be the dominant name for easy-to-use cloud data warehouse solutions, and Databricks is the winner for cloud-based machine learning and data science workflows.

Databricks vs Snowflake: Data Storage

At the moment, Snowflake has the edge for querying structured data, and Databricks has the edge for raw and unstructured data needed for ML. In the future, we think Databricks data lakehouse platform could be the catch-all solution for all data management.

One of the largest differences between Snowflake and Databricks is how they store and access data. Both lead the industry in speed and scale. The largest difference between the two is the architecture of data warehouse vs data lakehouse, and the storage of unstructured vs structured data.

Snowflake

Snowflake, at its core, is a cloud data warehouse. It stores structured data in a closed, proprietary format for quick, seamless data querying and transformation. Their proprietary format allows for high speed and reliability with tradeoffs on flexibility. More recently, Snowflake has begun allowing the ingestion and storage of data in additional formats (such as Apache Iceberg), but the vast majority of its customer data still sits in their own format.

Snowflake utilizes a multi-cluster shared disk architecture, in which compute resources share the same storage device, but retain their own CPU and memory. To achieve this, Snowflake ingests, optimizes, and compresses data to a cloud object storage layer, like Amazon S3 or Google Cloud Storage. Data here is organized into a columnar format and segmented into micro-partitions, anywhere from 50 to 500 MB each. These micro-partitions store metadata, which helps dramatically with speed. Interestingly enough, Snowflake’s own internal storage file format is not open source, keeping most customers locked in.
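
One way to picture why per-partition metadata helps with speed is partition pruning. The sketch below is a toy illustration, not Snowflake's implementation: it assumes each partition's metadata includes the minimum and maximum value of a column, and the `MicroPartition` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MicroPartition:
    """Hypothetical stand-in for a partition plus its min/max metadata."""
    name: str
    min_value: int
    max_value: int
    rows: list[int]


def prune(partitions: list[MicroPartition], low: int, high: int) -> list[MicroPartition]:
    """Keep only partitions whose [min, max] range can overlap [low, high]."""
    return [p for p in partitions if p.max_value >= low and p.min_value <= high]


def query_between(partitions: list[MicroPartition], low: int, high: int):
    """Scan only the surviving partitions and return the matching rows."""
    survivors = prune(partitions, low, high)
    rows = [r for p in survivors for r in p.rows if low <= r <= high]
    return rows, [p.name for p in survivors]


partitions = [
    MicroPartition("p0", 1, 100, [1, 40, 100]),
    MicroPartition("p1", 101, 200, [101, 150, 200]),
    MicroPartition("p2", 201, 300, [201, 250, 300]),
]

rows, scanned = query_between(partitions, 120, 160)
print(rows, scanned)
```

Here the filter only ever opens `p1`. The other two partitions are ruled out from metadata alone, without reading their rows.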

To function efficiently, Snowflake uses multiple layers to provide an enterprise-experience to the cloud processing workload. Snowflake maintains a cloud services layer that handles the enterprise authentication and access control. 

For execution, Snowflake uses virtual warehouses, which are abstractions on top of regular cloud instances (such as EC2). These warehouses query data from a separate data storage layer, effectively separating storage and compute. This separation of compute and storage makes Snowflake infinitely scalable and allows users to run concurrent queries off the same data, with reasonable isolation.

Snowflake and Databricks are cloud agnostic, meaning they run on all three major cloud service providers: Amazon AWS, Microsoft Azure and Google Cloud Platform (GCP).

Sync’s Take: Snowflake’s architecture allows for fast and reliable querying of structured data, at scale. It has appeal to those who want simple methods for managing their resource requirements of their jobs (through T-Shirt size warehouse options). It is primarily geared towards those proficient in SQL but lacks the flexibility to easily deal with raw, unstructured data.

Databricks

One of Databricks’ selling points is it employs an open-source storage layer known as Delta Lake— which aims to combine the flexibility of cloud data lakes, with the reliability and unified structure of a data warehouse—and without the challenges associated with vendor lock-in. Databricks has pioneered this so-called ‘data lakehouse’ hybrid structure as a cost-effective solution for data scientists, data engineers, and analysts alike to work with the same data—regardless of structure or format.

Databricks data lakehouse works by employing three layers to allow for the storage of raw and unstructured data—but also stores metadata  (such as a structured schema) for warehouse-like capabilities on structured data. Notably, this data lakehouse provides ACID transaction support, automatic schema enforcement—which validates DataFrame and table compatibility before writes—and end-to-end streaming for real-time data ingestion—some of the most desirable advancements for data lake systems.

Sync’s Take:  Lakehouses bring the speed, reliability and fast query performance of data warehouses to the flexibility of a Data Lake. The drawback is that as a relatively new technology, new and less technical users have been occasionally unable to locate tables and have to rebuild them.

Databricks vs Snowflake Scalability

Snowflake and Databricks continue to battle for dominance of enterprise workloads. While both have been proven to be industry leaders in this capacity, the largest practical difference between the two lies in their resource management capabilities. 

Snowflake

Snowflake offers compute resources as a serverless offering, meaning users don’t have to select, install, configure, or manage any software or hardware. Instead, Snowflake uses a series of virtual warehouses (independent compute resources containing memory and CPU) to run queries. This separation of storage and compute resources allows Snowflake to scale infinitely without slowing down, and multiple users can query concurrently against the same single segment of data.

In terms of performance, Snowflake has been shown to process up to 60 million rows in under 10 seconds.

Snowflake applies a simple “t-shirt” sizing model to its virtual warehouses, with 10 sizes, each with double the computing power of the size before it. The largest is 6XL, which has 512 virtual nodes. Because warehouses don’t share compute resources or store data, if one goes down it can be replaced in minutes without affecting any of the others.

A diagram of the virtual nodes associated with each size data warehouse
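
The doubling rule is easy to sketch. The snippet below assumes the smallest size maps to a single node, which is consistent with the 512-node figure for 6XL above. The abbreviated size labels are shorthand chosen for this example.

```python
# Ten t-shirt sizes, smallest to largest. Labels are shorthand for this sketch.
SIZES = ["XS", "S", "M", "L", "XL", "2XL", "3XL", "4XL", "5XL", "6XL"]


def nodes_for(size: str) -> int:
    """Each size doubles the one before it, so the size at index i has 2**i nodes."""
    return 2 ** SIZES.index(size)


for size in SIZES:
    print(f"{size:>3}: {nodes_for(size)} nodes")
```

Stepping up one size therefore doubles the compute you are paying for, so moving up two sizes "just to be safe" quadruples it.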

Most notably, Snowflake’s multi-cluster warehouses provide both a “maximized” and “auto-scale” feature which gives you the ability to dynamically shut down unused clusters, saving you money.

Databricks

Databricks started out with a much more “open” and traditional infrastructure, where basically all of the compute runs inside a user’s cloud VPC. This is the complete opposite of the “serverless” model, where the compute runs inside Databricks’ VPC, since all of the cluster configurations are exposed to end users. This has its pros and cons: the main advantage is that users can hyper-optimize their clusters to improve performance, but the drawback is that it can be painful to use or require an expert to maintain.

More recently, Databricks is evolving towards the “serverless” model with Databricks SQL Serverless, and likely extending this model to the other products, such as notebooks.  The pros and cons here flip, in that the pro is users don’t have to worry about cluster configurations, however the con is that users have no access nor visibility into the underlying infrastructure and are unable to custom tune clusters to meet their needs.

Since Databricks is currently in a “transition” period between classic and “serverless” offerings, their scalability really depends on which use case people select.  

One major note is Databricks has a diverse set of compute use cases, from SQL warehouses, Jobs, All Purpose Compute, Delta Live Tables, to streaming – each one of these has slightly different compute configurations and use cases.  For example SQL warehouses can be used as a shared resource, where multiple queries can be submitted to the warehouse at any time from multiple users.  Jobs are more singular, in which one notebook is run on one cluster, and is shut down (Jobs can also be shared now, but this is used less).  

The different use cases need to fit the end user’s needs, which can also impact scalability. This one example symbolizes both the strength and weakness of Databricks: there are so many options at so many levels that it can be great if you know what you’re doing, or it can be a nuisance.

Sync’s Take: When it comes to scaling to large workflows, both Snowflake and Databricks can handle the workload. However, Databricks is better able to boost and fine tune the performance of large volumes of data which ultimately saves costs.

Databricks vs Snowflake: Cost

Both Databricks and Snowflake are marketed as pay-as-you-go models, meaning the more compute you reserve or request, the more you pay. Their models starkly differ from more traditional “usage based pricing schemes” where customers pay only for the usage they actually consume. In both Databricks and Snowflake, users can and will pay for requested resources whether or not those resources are actually necessary or optimal to run the job.

Another big difference between the two services is that Snowflake runs and charges for the entire compute stack (virtual warehouses and cloud instances), whereas Databricks only runs and charges for the management of compute, so users still have to pay a separate cloud provider bill. It is worth noting that Databricks’ new serverless product mimics the Snowflake operating model. Databricks bills in compute/time units called Databricks Units (DBUs), metered per second, and Snowflake uses a Snowflake credit system.

As a formula, it breaks down like this:

  • Databricks (Classic compute) = Data storage + Cost of Databricks Service (DBUs) + Cost of Cloud Compute (Virtual machine instances) 
  • Snowflake = Data storage (Daily average volume of bytes stored on Snowflake) + Compute (number of virtual warehouses used) 
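
The two formulas can be sketched in a few lines. Every rate and quantity below is a made-up placeholder, the class and field names are hypothetical, and Snowflake compute is modeled here as credits consumed times a price per credit, which is an assumption about how the "Compute" term is metered.

```python
from dataclasses import dataclass


@dataclass
class DatabricksClassicBill:
    """Data storage + Databricks service (DBUs) + cloud VM instances."""
    storage_cost: float
    dbus_consumed: float
    price_per_dbu: float
    vm_hours: float
    vm_price_per_hour: float

    def total(self) -> float:
        dbu_cost = self.dbus_consumed * self.price_per_dbu
        cloud_cost = self.vm_hours * self.vm_price_per_hour  # billed by the cloud provider
        return self.storage_cost + dbu_cost + cloud_cost


@dataclass
class SnowflakeBill:
    """Data storage + compute, with compute metered in credits (an assumption)."""
    storage_cost: float
    credits_consumed: float
    price_per_credit: float

    def total(self) -> float:
        return self.storage_cost + self.credits_consumed * self.price_per_credit


# Placeholder numbers purely for illustration; they are not real prices.
databricks = DatabricksClassicBill(20.0, 100.0, 0.25, 50.0, 0.5)
snowflake = SnowflakeBill(20.0, 16.0, 2.5)

print(f"Databricks (classic): ${databricks.total():.2f}")
print(f"Snowflake:            ${snowflake.total():.2f}")
```

The structural point is the extra term on the Databricks side: the cloud VM cost arrives on a separate bill, so it is easy to leave out when comparing the two platforms.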

Both Databricks and Snowflake offer tiers and discounts of pricing based on company size, and both allow you to save money by pre-purchasing units or credits.

Databricks has more variance in price, as its rates depend on the type of workload, with certain types of compute costing 5x more per compute hour than simple jobs.

One major advantage Databricks has in terms of costs, is it allows users to utilize Spot instances in their cloud provider – which can translate to significant cost savings.  Snowflake obfuscates all of this, and the end user has no option to benefit from utilizing Spot instances.

Sync’s Take: There is no concrete answer to which service is “cheaper” as it really depends on how much of the service or platform you’re using, and for what types of tasks. However, the control and introspection capabilities that Databricks provides is fairly unmatched in the Snowflake ecosystem. This gives Databricks a significant edge when optimizing for large compute workloads.

If you’d like a further guide on the breakdown of Databricks pricing, we recommend checking out our complete pricing guide.

Databricks vs Snowflake Speed Benchmarks

Databricks claims they are 2.5x faster than Snowflake. Snowflake also claims they are faster than Databricks. While this is a contentious issue between the two giants, the reality is that benchmarks serve merely as a vanity metric. Your workloads will likely look nothing like the TPC-DS benchmarks that either company ran, and hence their benchmarks would not apply to your jobs. Our opinion here is that benchmarks don’t matter at this level.

While this may be an unsatisfying answer, if you’re looking for a solution that is all about the absolute fastest way to run your code – there are likely other solutions that are less well known but do focus on performance.  

Most companies we speak to value both platforms due to their ease of use, having all of their data in one place, ability to share code, and not having to worry about low level infrastructure.  Pure raw speed is rarely a priority for companies. If this sounds like your company, likely the speed metrics don’t really matter so much.  

However, cost likely does matter in aggregate, and hence doing an actual comparison of runtime and cost on the different platforms with your actual workloads is the only real way to know.  

Databricks vs Snowflake: Ease of Use

All things equal, Snowflake is largely considered the “easier” cloud solution to learn between the two. It has an intuitive SQL interface and as a serverless experience, doesn’t require users to manage any virtual or local hardware resources. Plus as a managed service, using Snowflake doesn’t require any installing, maintaining, updating or fine-tuning of the platform. It’s all handled by Snowflake.

From a language perspective, Snowflake is all SQL-based (excluding their new foray into Snowpark), making it accessible for many business analysts. While Databricks SQL has data warehouse functionality in line with Snowflake, the large use case of Databricks is being able to write in Python, R and Scala, and reviews on Gartner and TrustRadius have consistently rated it a more technical setup than Snowflake.

Snowflake also has automated features like auto-scaling and auto-suspend to help start and stop clusters without fine-tuning. While Databricks also has autoscaling and autosuspend, it is designed for a more technical user and there is more involved with fine-tuning your clusters (watch more about how we help do this here).

Sync’s Take: While Databricks UI has a steeper learning curve than Snowflake, it ultimately has more advanced control and customization than Snowflake, making this a tradeoff that is largely dependent on how complex you intend your operations to be.  

Databricks vs Snowflake: Security

Both Databricks and Snowflake are GDPR-compliant, offer role-based access control (RBAC), and both organizations encrypt their data both at rest and in motion. Both have very good records with data security and offer a variety of role-based access controls and support for compliance standards.

Databricks offers additional isolation at multiple levels including workspace-level permissions, cluster ACLs, JVM whitelisting, and single-use clusters. For organizations that employ ADS or AMS teams, Databricks provides workload security that includes code repository management, built-in secret management, hardening with security monitoring and vulnerability reports, and the ability to enforce security and validation requirements.

Snowflake allows users to set regions for data storage to comply with regulatory guidelines such as HIPAA and PCI DSS. Snowflake security levels can also be adjusted based on requirements and has built-in features to regulate access levels, and control things like IP allows and blocklists. Snowflake also allows advanced features of Time Travel and Fail-safe which allow you to restore tables, schemas, and databases from a specific time point in the past or protect and recover historical data.

Historically, the only issue for Snowflake was the lack of on-premise storage on private-cloud infrastructure, which is needed for the highest levels of security, such as government data. In 2022, Snowflake started adding on-premise storage; however, there is as yet limited information on how this has been received.

Sync’s Take: Both Databricks and Snowflake have an excellent reputation with data security, as it is mission-critical to their businesses. There is really no wrong choice here and it largely comes down to making sure individual access levels match your intent.

Databricks vs. Snowflake: Ecosystem and Integration

Databricks and Snowflake are becoming the abstractions on top of Cloud Vendors for data computation workloads. As such, they both plug into a variety of vendors, tools, and products. 

From the vendor space, both Databricks and Snowflake provide marketplaces that allow other predominant tools and technology to be co-deployed. There are also community built and contributed features, such as the Databricks Airflow Operators / Snowflake Airflow Operators.

On the whole though, the Databricks ecosystem is typically more “open” than Snowflake’s, since Databricks still runs in a user’s cloud VPC. This means users can still install custom libraries, or even introspect low-level cluster data. Such access is not possible in Snowflake, and hence integrating with your favorite tools may be harder. Databricks also tends to be generally more developer / integration friendly than Snowflake for this exact reason.

Other FAQ on Databricks vs Snowflake?

  • Is Databricks a data warehouse? Databricks bills itself as the world’s first “Data Lakehouse”, combining the best of data lakes and data warehouses. However, despite having the capability, Databricks is not typically thought of as a data warehouse solution, as its learning curve and fine-tuning are often unnecessary for someone seeking just a straightforward data warehouse.
  • Can Snowflake and Databricks integrate with each other? It is possible and not entirely uncommon to integrate Databricks and Snowflake with each other. Typically in this manner, Databricks acts as a Data Lake for all unstructured data, manipulating it and processing it as part of an ETL pipeline where it is then stored on Snowflake like a traditional data warehouse.
  • What data types does Snowflake accept? Snowflake is optimized for structured and semi-structured data, meaning it can only accept certain data formats, notably JSON, Avro, Parquet and XML.
  • Can Snowflake and Databricks create dashboards for business intelligence? Yes, both Snowflake and Databricks are able to create dashboards and visualizations for business intelligence.

Databricks vs Snowflake: Which Is Better?

Both Databricks and Snowflake have a stellar reputation within the business and data community. While both are cloud-based platforms, Snowflake is most optimized for data warehousing, data manipulation and querying, while Databricks is optimized for machine learning and heavy data science. 

Broken down into components, here is a list of pros for each:

Platform/Feature | Databricks | Snowflake
Storage | Better for raw, unstructured data. | Better for reliability and ease of use for structured data.
Use Case | Better for ML, AI, Data Science and Data Engineering. Collaborative notebooks in Python/Scala/R a big plus. | Easier for analysts in business intelligence and companies looking to migrate an existing data warehouse system.
Price | Cheaper at high compute volumes. Not as predictable on cost. | Efficient at scaling down unused resources. More consistent, predictable costs.
Scalability | Infinitely scalable. Effective at high volume workloads. | Separate storage and compute makes for seamless concurrent queries.
Security | GDPR-compliant, role-based access control, encrypted at rest and in motion. | GDPR-compliant, role-based access control, encrypted at rest and in motion.

If you want to integrate an existing ETL pipeline that uses structured data with programs like Tableau, Looker and Power BI, Snowflake could be the right option for you. If you instead are looking for a unified analytics workspace where you build compute pipelines, Databricks might be the right choice for you. 

Interested in using Databricks further? Check out Sync’s Gradient solution – the only ML-powered Databricks cluster optimization and management tool.  At a high level, we help maintain the openness of Databricks but now with the “ease” of Snowflake.  On top of that, we also actively drive your costs lower and lower.

What is the Databricks Jobs API?

The Databricks Jobs API allows users to programmatically create, run, and delete Databricks Jobs via their REST API solution.  This is an alternative to running Databricks jobs through their console UI system.  For access to other Databricks platforms such as SQL warehouses, delta live tables, unity catalog, or others, users will have to implement other API solutions provided by Databricks.

The official Databricks Jobs API reference can be found here.  

However, for newcomers to the Jobs API, I recommend starting with the Databricks Jobs documentation which has great examples and more detailed explanations.  

Why should I use the Jobs API?

Users may want to use the API rather than the UI when they need to dynamically create jobs in response to other events, or to integrate with non-Databricks workflows such as Airflow or Dagster.   Users can implement job tasks using notebooks, Delta Live Tables pipelines, JARs, or Python, Scala, Spark submit, and Java applications.

Another reason to use the Jobs API is to retrieve and aggregate metrics about your jobs to monitor usage, performance, and costs.  The information in the Jobs API is far more granular than what is present in the currently available System Tables. 

So if your organization is looking to monitor thousands of jobs at scale and build dashboards, you will have to use the Jobs API to collect all of the information.  
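As a rough illustration of what collecting this information can look like, the sketch below pages through the `jobs/list` endpoint using only the Python standard library. It assumes a workspace host and a personal access token that you supply, and it assumes the response carries `jobs`, `has_more`, and `next_page_token` fields; `fetch_page` and `list_all_jobs` are hypothetical helper names, not part of any Databricks SDK.

```python
import json
import urllib.parse
import urllib.request


def fetch_page(host, token, page_token=None, limit=25):
    """Fetch one page of GET /api/2.1/jobs/list (performs a real HTTP call)."""
    params = {"limit": limit}
    if page_token:
        params["page_token"] = page_token
    url = f"https://{host}/api/2.1/jobs/list?{urllib.parse.urlencode(params)}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def list_all_jobs(get_page):
    """Collect every job by following next_page_token until has_more is false.

    get_page is any callable that takes a page token (or None) and returns a
    decoded response dict, which keeps the paging logic testable offline.
    """
    jobs, token = [], None
    while True:
        page = get_page(token)
        jobs.extend(page.get("jobs", []))
        token = page.get("next_page_token")
        if not page.get("has_more") or not token:
            return jobs
```

A call such as `list_all_jobs(lambda t: fetch_page("<databricks-instance>", "<token>", t))` would then return one flat list of job definitions that can be loaded into a dashboard or data frame.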

What can I do with the Jobs API?

A full list of the Jobs API requests can be found in the table below, based on the official API documentation.  

Action | Request | Description
Get job permissions | /api/2.0/permissions/jobs/{job_id} | Gets the permissions of a job such as ‘user name’, ‘group name’, ‘service principal’, ‘permission level’
Set job permissions | /api/2.0/permissions/jobs/{job_id} | Sets permissions on a job.
Update job permissions | /api/2.0/permissions/jobs/{job_id} | Updates the permissions on a job.
Get job permission levels | /api/2.0/permissions/jobs/{job_id}/permissionLevels | Gets the permission levels that a user can have on an object
Create a new job | /api/2.1/jobs/create | Create a new Databricks Job
List jobs | /api/2.1/jobs/list | Retrieves a list of jobs and their parameters such as ‘job id’, ‘creator’, ‘settings’, ‘tasks’
Get a single job | /api/2.1/jobs/get | Gets job details for a single job
Update all job settings (reset) | /api/2.1/jobs/reset | Overwrite all settings for the given job.
Update job settings partially | /api/2.1/jobs/update | Add, update, or remove specific settings of an existing job
Delete a job | /api/2.1/jobs/delete | Deletes a job
Trigger a new job run | /api/2.1/jobs/run-now | Runs a job with an existing job-id
Create and trigger a one-time run | /api/2.1/jobs/runs/submit | Submit a one-time run. This endpoint allows you to submit a workload directly without creating a job. Runs submitted using this endpoint don’t display in the UI.
List job runs | /api/2.1/jobs/runs/list | List runs in descending order by start time.  A run is a single execution of a job.
Get a single job run | /api/2.1/jobs/runs/get | Retrieve the metadata of a single run.
Export and retrieve a job run | /api/2.1/jobs/runs/export | Export and retrieve the job run task.
Cancel a run | /api/2.1/jobs/runs/cancel | Cancels a job run
Cancel all runs of a job | /api/2.1/jobs/runs/cancel-all | Cancels all job runs
Get the output for a single run | /api/2.1/jobs/runs/get-output | Retrieve the output and metadata of a single task run.
Delete a job run | /api/2.1/jobs/runs/delete | Deletes a job run
Repair a job run | /api/2.1/jobs/runs/repair | Repairs a job run by re-running it
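One common pattern built from the run endpoints is to trigger a run and then poll `runs/get` until it finishes. The sketch below shows only the polling half. It assumes the response body exposes `state.life_cycle_state` and `state.result_state`, and it takes a caller-supplied `get_run` function (a hypothetical name) so that the HTTP details stay out of the loop.

```python
import time

# Life-cycle states after which a run will not change any further
# (assumed from the run state values described in the API reference).
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def wait_for_run(get_run, run_id, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll runs/get until the run reaches a terminal life-cycle state.

    get_run is a caller-supplied callable that takes a run_id and returns the
    decoded JSON body of GET /api/2.1/jobs/runs/get.
    """
    for _ in range(max_polls):
        state = get_run(run_id).get("state", {})
        if state.get("life_cycle_state") in TERMINAL_STATES:
            # result_state (for example SUCCESS or FAILED) is set once finished.
            return state.get("result_state")
        sleep(poll_seconds)
    raise TimeoutError(f"run {run_id} still active after {max_polls} polls")
```

Passing `sleep` as a parameter is a small design choice that lets the loop be exercised in tests without actually waiting.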

Can I get cost information through the Jobs API?

Unfortunately, users cannot obtain job costs directly through the Jobs API.  You’ll need to use the accounts API to access billing information, or use System tables.  One big note: the billing information retrieved through either the accounts API or the system tables covers only the Databricks DBU costs.

The majority of your Databricks costs could come from your actual cloud usage (e.g. on AWS it’s the EC2 costs).  To obtain these costs you’ll need to separately retrieve cost information from your cloud provider.

If this sounds painful – you’re right, it’s crazy annoying.  Fortunately, Gradient does all of this for you and can retrieve both the DBU and cloud costs for you in a simple diagram to monitor your costs.  

How does someone intelligently control their Jobs clusters with the API?

The Jobs API is an input/output system only.  What you do with the information and abilities to control and manage Jobs is entirely up to you and your needs.  

For users running Databricks Jobs at scale, one dream ability is to optimize and intelligently control jobs clusters to minimize costs and hit SLA goals.  Building such a system is not trivial and requires an entire team to develop a custom algorithm as well as infrastructure.

Here at Sync, we built Gradient to solve exactly this need.  Gradient is an all-in-one Databricks Jobs intelligence system that works with the Jobs API to help automatically control your jobs clusters.  Check out the documentation here to get started.

Updating From Jobs API 2.0 to 2.1

The largest update from API 2.0 to 2.1 is the inclusion of multiple tasks in a job, as described in the official documentation.  To explain a bit more, Databricks jobs can contain multiple tasks in a single job, where each task can be a different notebook, for example.  All API 2.1 requests must conform to the multi-task format and responses are structured in the multi-task format.
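To make the format change concrete, here is a minimal sketch of wrapping a single-task (2.0-style) job definition into the multi-task shape. The list of task-level keys is an assumption based on commonly used settings and is not exhaustive, `to_multi_task` is a hypothetical helper, and the notebook path and cluster ID are placeholders.

```python
# Settings that belong to an individual task in the multi-task format
# (assumed list; check the API reference for your workspace version).
TASK_LEVEL_KEYS = (
    "new_cluster", "existing_cluster_id", "libraries",
    "notebook_task", "spark_jar_task", "spark_python_task",
    "spark_submit_task", "max_retries",
)


def to_multi_task(settings, task_key="main"):
    """Wrap a single-task job definition in a one-element tasks list."""
    # Job-level settings (name, schedule, notifications, ...) stay at the top.
    job = {k: v for k, v in settings.items() if k not in TASK_LEVEL_KEYS}
    # Task-level settings move into the task entry.
    task = {k: settings[k] for k in TASK_LEVEL_KEYS if k in settings}
    task["task_key"] = task_key  # each task in the 2.1 format needs a key
    job["tasks"] = [task]
    return job


legacy = {
    "name": "Nightly model training",
    "existing_cluster_id": "<cluster-id>",              # placeholder
    "notebook_task": {"notebook_path": "/Jobs/train"},  # placeholder path
    "max_retries": 1,
    "timeout_seconds": 3600,
}
print(to_multi_task(legacy))
```

A job with several tasks would simply carry more entries in `tasks`, each with its own `task_key`.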

Databricks Jobs API example

Here is an example, borrowed from the official documentation, of how to create a job:

To create a job with the Databricks REST API, run the curl command below, which creates a job based on the parameters located in create-job.json:

curl --netrc --request POST \
https://<databricks-instance>/api/2.0/jobs/create \
--data @create-job.json \
| jq .

An example of what goes into create-job.json is found below:

{
  "name": "Nightly model training",
  "new_cluster": {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "r3.xlarge",
    "aws_attributes": {
      "availability": "ON_DEMAND"
    },
    "num_workers": 10
  },
  "libraries": [
    {
      "jar": "dbfs:/my-jar.jar"
    },
    {
      "maven": {
        "coordinates": "org.jsoup:jsoup:1.7.2"
      }
    }
  ],
  "email_notifications": {
    "on_start": [],
    "on_success": [],
    "on_failure": []
  },
  "webhook_notifications": {
    "on_start": [
      {
        "id": "bf2fbd0a-4a05-4300-98a5-303fc8132233"
      }
    ],
    "on_success": [
      {
        "id": "bf2fbd0a-4a05-4300-98a5-303fc8132233"
      }
    ],
    "on_failure": []
  },
  "notification_settings": {
    "no_alert_for_skipped_runs": false,
    "no_alert_for_canceled_runs": false,
    "alert_on_last_attempt": false
  },
  "timeout_seconds": 3600,
  "max_retries": 1,
  "schedule": {
    "quartz_cron_expression": "0 15 22 * * ?",
    "timezone_id": "America/Los_Angeles"
  },
  "spark_jar_task": {
    "main_class_name": "com.databricks.ComputeModels"
  }
}
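For readers who would rather call the endpoint from Python than from curl, the following is a minimal standard-library sketch of the same request. Note one deliberate difference: it authenticates with a bearer token header instead of the `--netrc` file used by the curl command. The host and token values are placeholders you would supply, and `build_create_job_request` and `create_job` are hypothetical helper names.

```python
import json
import urllib.request


def build_create_job_request(host, token, settings):
    """Build (but do not send) the POST request that creates a job."""
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/jobs/create",
        data=json.dumps(settings).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_job(host, token, settings):
    """Send the request and return the decoded response body."""
    request = build_create_job_request(host, token, settings)
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Loading create-job.json with `json.load` and passing the result as `settings` would submit the same definition shown above; on success the response body contains the new `job_id`.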

Azure Databricks Jobs API

The REST APIs are identical across all 3 cloud providers (AWS, GCP, Azure).  Users can toggle between the different cloud versions in the top left corner of the reference page.

Conclusion

The Databricks Jobs API is a powerful system that enables users to programmatically control and monitor their jobs.  It is likely most useful for “power users” who want to control many jobs, or for users who need an external orchestrator, like Airflow, to orchestrate their jobs.

To add automatic intelligence to your Databricks Jobs API solutions to help lower costs and hit SLAs, check out Gradient as a potential fit.

Databricks Pricing Page

Databricks Pricing Calculator

Pricing For Azure

How To Optimize Databricks Clusters

Databricks Instructor-Led Courses

Databricks Guided Access Support Subscription

Migrate Your Data Warehouse to Databricks

Databricks Support Policy 

Introducing the Sync Databricks Workspace Health Check

Introducing the Sync Databricks Workspace health check, a program that we’ve spearheaded to help Databricks users identify common mistakes in their Workspaces.

Here at Sync, we’ve worked with a ton of companies and looked at their overall Databricks workspace usage. We’ve seen all sorts of usage from jobs, all-purpose compute, SQL warehouses, to Delta Live tables and have seen many recurring patterns.

While many companies do operate Databricks well, there are some patterns we’ve observed that have led to wasted compute resources and inflated costs. As a result, we built a tool to help quickly identify these common pitfalls and give users a quick rundown of the health of their overall usage.

With our personalized health check, you’re able to gain insight into:

  • Your top 10 jobs most qualified for Gradient
  • Candidates for EBS, Photon, and autoscaling
  • Compute cluster utilization scoring
  • SQL Warehouse utilization efficiency
  • 12-month projected usage growth
  • Estimated overall cost savings
  • Incorrectly run jobs on all-purpose compute clusters

While we expect the health check to continue to evolve and grow, let’s dive into some of the popular metrics used today:

Nail down APC vs. jobs compute usage to significantly reduce costs by identifying and leveraging the most cost effective job compute option, immediately allowing for savings up to 50%. Companies small and large often incorrectly use all-purpose compute clusters for their production jobs, when they should be using jobs clusters. While this is a subtle detail, it can instantly lead to a 2x cost reduction with just a few clicks.

Visualize APC and warehouse utilization to identify underused clusters and warehouses within your workspace. Both APC and SQL warehouses can fall into the same pitfall of being “always on” even though nobody is using them. With our health check, you’re able to quickly see where that is happening and how to prevent it.

Efficiently select instances across an organization to determine if your users are opting out of default settings in an effort to optimize. Platform teams thrive when they’re able to see the distribution of instances that are being used. This helps identify what kind of clusters are popular and effective. If the Databricks default cluster is used often (e.g. in AWS it’s “i3”), it’s likely that team members are opting for default settings and aren’t spending much time trying to find better instances for optimal performance.

Gain a better understanding of EBS, Photon, and autoscaling optimization insights by identifying how many clusters use these features to assess potential savings that could add major benefit to your jobs. Photon and Autoscaling are options Databricks often recommends for job clusters. However, these features are only beneficial some of the time, ultimately depending on the characteristic details of your job.

Rank your top Jobs candidates for Gradient based on schedule, duration, and consistency. One of the largest sources of cost is jobs clusters used in production. Sync’s core product offering helps to automatically optimize these clusters for cost and performance. When you’re working with hundreds, or even thousands, of jobs in your workload, it can be daunting to identify which jobs should take priority. To help with this, your Workspace health check includes a proprietary ranking system that identifies the jobs for which Gradient’s cluster optimizations are a good fit.

Our health check notebook is an easy-to-use solution that you can run on your own at zero cost to you. 

Want to get a head start and learn more about integrating Gradient into your stack? Head here to request your personalized Databricks health check.

Everything You Need To Know About Azure Databricks Pricing 2024

When trying to determine Databricks pricing, one of the most important aspects to consider is the cost of your cloud provider. This means one of three companies: Microsoft Azure, Amazon AWS, or Google Cloud. 

All three cloud service providers are extremely popular, and there’s no definitive answer to which is best. Most companies choose one based on their existing software stack or suite of products. This is a perfectly fine way to make your decision on a cloud service provider, and all three work very well with Databricks. 

However, there are some small differences in price, user experience, and integration between the three. That’s why we decided to put together a quick guide explaining the exact costs of Azure Databricks. 

If you’re looking for a full guide on everything related to Databricks pricing, including how to calculate your compute cost, check out our full guide here. For everyone else, continue on here. 

First off, what is a “Cloud Service Provider”?

According to Google, a Cloud Service Provider is “a third-party company that provides scalable computing resources that businesses can access on demand over a network”. 

In practical terms, this means storage, computing power or database access that enterprise companies use over the internet. Amazon Web Services, Google Cloud Platform, Oracle, Alibaba Cloud, IBM Cloud and Microsoft Azure are among the most popular cloud service providers. 

In the context of Databricks specifically, your cloud service provider is the processing layer on which the Databricks analytics platform runs. Your virtual machines (VMs), data storage, security and compute costs are all tied to your cloud provider. For this reason, there are slightly different costs based on instance type, compute type, virtual machine and add-ons like security and storage. 

Cost of Databricks Azure vs AWS vs Google Cloud

Among the three cloud service providers, AWS generally seems to be the cheapest, while Azure is the most expensive. The difference in cost depends on the type of compute, but an apples-to-apples comparison of Jobs Compute for the Standard tier is $0.07 per DBU-hour for AWS, $0.10 for Google Cloud and $0.15 for Azure Databricks. However, this difference disappears for All-Purpose Compute, with all three providers coming in at $0.40 per DBU-hour. 

Part of the reason for the higher cost of Azure is that Azure Databricks is considered a Microsoft first-party service, meaning it’s natively integrated with Microsoft and optimized for a host of their products including Power BI, Azure Synapse Analytics and Azure Data Lake Storage. This is a very unusual move and puts Databricks in rarefied air among the third-party companies with which Microsoft has partnered.

As the Microsoft suite of products is so popular among analysts, Azure often benefits from a network effect, as companies will often already be using its cloud service when they start using Databricks. 

This has been a largely successful partnership and users have reported positively on the ease of access and sharing with Azure Databricks (the portal can be set up and accessed with a single click), and on the included mission-critical tech support, which can itself turn into thousands of dollars spent annually when purchased through Databricks. 

How Instances and VMs Affect Azure Databricks Pricing 

When it comes to calculating your total cost for Databricks with Azure, your instance number and type will play a large role in total cost. 

Instances refer to virtual machines (VMs), which, as the name suggests, are processing hardware you are allocated by your cloud service provider (in this case Azure). The important factors to consider with an instance are the type and size of your virtual machine. 

For Azure there are several different types of virtual machines that will be referred to under Instance type: 

  • General Purpose 
  • Compute Optimized 
  • Memory Optimized 
  • Accelerated Computing 
  • Storage Optimized 

These instance types are straightforward, with the name explaining what they are best used for. You will notice families of instances denoted by the first letter in their naming convention, which classifies them as running large workloads (R series) or sustained high performance (G series). Instances will often list their generation, denoted as a “v” with the generation or version number next to it (v1, v2, v3, etc). Newer generation instances will generally cost more. 

Apart from instance type, the instance size determines how much you pay for processing power. The two factors here are the number of CPU cores and total RAM. The core count is often listed after the first letter of an instance name (for example, E16d has 16 cores). The larger the instance, the more it will cost: instances of the same type and generation range from $0.825 per DBU hour for a 4-core instance up to $18.15 per DBU hour for a 96-core instance. 

To get your total cost of Databricks, add your DBU compute price to your monthly instance price. 
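The addition above can be sketched as a small calculation. The function below is an illustration only: it assumes cost scales linearly with hours and node count, the DBU consumption per node and the VM hourly price are made-up placeholder values, and `databricks_cost` is a hypothetical name. Only the $0.15 Jobs Compute rate comes from this article.

```python
def databricks_cost(hours, nodes, dbu_per_node_hour, dbu_rate, vm_price_per_hour):
    """Return (dbu_cost, vm_cost, total) in dollars for one cluster."""
    # Databricks charge: DBUs consumed multiplied by the $/DBU-hour rate.
    dbu_cost = hours * nodes * dbu_per_node_hour * dbu_rate
    # Cloud provider charge: what Azure bills for the VMs themselves.
    vm_cost = hours * nodes * vm_price_per_hour
    return dbu_cost, vm_cost, dbu_cost + vm_cost


# Illustrative inputs: 0.15 is the Jobs Compute rate quoted in this article;
# the DBU count per node and the VM price are placeholders, not real prices.
dbu, vm, total = databricks_cost(
    hours=100, nodes=4, dbu_per_node_hour=1.0, dbu_rate=0.15, vm_price_per_hour=0.50
)
print(f"DBU ${dbu:.2f} + VM ${vm:.2f} = ${total:.2f}")
```

Swapping in the DBU rating and hourly price of your actual instance type gives a first-order estimate; it ignores storage, networking, and any pre-purchase discount.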

Azure Databricks Compute Pricing 

Here is a quick breakdown of compute type for a standard plan in the U.S. Central Zone:

  • Jobs Light Compute: $0.07/DBU-hour
  • Jobs Compute: $0.15/DBU-hour
  • All-Purpose Compute: $0.40/DBU-hour

Here is a breakdown of services only available in the premium pricing plan in the U.S. Central Zone:

  • SQL Compute: $0.22/DBU-hour
  • SQL Pro Compute: $0.44/DBU-hour
  • Serverless SQL: $0.44/DBU-hour
  • Serverless Real-Time Inference: $0.079/DBU-hour

For a full breakdown, check out the dedicated Microsoft Azure page on Databricks pricing.

Saving with Pre-purchase Plans

One of the big ways you can save on Azure Databricks pricing is through the use of pre-purchase plans, also known as Databricks Commit Units. In essence, you are predicting a certain amount of Databricks usage and paying for that amount up front. The incentive for doing this is large savings, up to 37%. 

Pre-purchase plans come in 1-year or 3-year terms. The more you buy, both in terms of DBUs and the duration of your deal, the more you will save. Here’s a breakdown of how much you can save for each level of Databricks pricing. 

Databricks commit unit (DBCU) | Price (with discount) | Discount | Contract term
25,000 | $23,500 | 6% | 1 year
50,000 | $46,000 | 8% | 1 year
100,000 | $89,000 | 11% | 1 year
200,000 | $172,000 | 14% | 1 year
350,000 | $287,000 | 18% | 1 year
500,000 | $400,000 | 20% | 1 year
750,000 | $577,500 | 23% | 1 year
1,000,000 | $730,000 | 27% | 1 year
1,500,000 | $1,050,000 | 30% | 1 year
2,000,000 | $1,340,000 | 33% | 1 year
75,000 | $69,000 | 8% | 3 year
150,000 | $135,000 | 10% | 3 year
300,000 | $261,000 | 13% | 3 year
600,000 | $504,000 | 16% | 3 year
1,050,000 | $819,000 | 22% | 3 year
1,500,000 | $1,140,000 | 24% | 3 year
2,250,000 | $1,642,500 | 27% | 3 year
3,000,000 | $2,070,000 | 31% | 3 year
4,500,000 | $2,970,000 | 34% | 3 year
6,000,000 | $3,780,000 | 37% | 3 year
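The discount column can be reproduced from the first two columns, which is a handy sanity check when comparing tiers. The sketch below assumes, as the table's numbers imply, that one DBCU corresponds to $1 of usage at list price; the function names are hypothetical and only a few 1-year tiers are included.

```python
# (DBCUs purchased, discounted price in USD) for a few 1-year tiers from the table.
TIERS = [(25_000, 23_500), (100_000, 89_000), (500_000, 400_000), (2_000_000, 1_340_000)]


def discount_percent(dbcu, price):
    """Discount relative to paying $1 per DBCU, as a whole-number percentage."""
    return round((1 - price / dbcu) * 100)


def effective_rate(dbcu, price):
    """Dollars actually paid per DBCU of usage."""
    return price / dbcu


for dbcu, price in TIERS:
    print(
        f"{dbcu:>9,} DBCU: {discount_percent(dbcu, price)}% off, "
        f"${effective_rate(dbcu, price):.3f} per DBCU"
    )
```

The effective rate is the more useful number when budgeting, since it can be multiplied directly by forecast usage.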

Enhanced Security & Compliance Add-on

For premium tier Azure customers processing regulated data, Azure Databricks offers enhanced security and controls for their compliance needs. This is offered at 10% of list price added to the Azure Databricks product spend in a selected workspace. Read more about the security and compliance add-on here.

Conclusion: Is Azure Databricks Worth It? 

If you’re a company that really values the Microsoft suite and enjoys working with Azure, then Azure Databricks is 100% worth it. It’s highly integrated, and has a tremendous support team and UI experience. 

If you are not yet invested with Azure or the Microsoft Suite of products, it may not be worth the additional cost premium to run Databricks on Azure. If you’re looking for the cheapest cloud service provider on which to run Databricks, your best bet is likely Amazon AWS. 

Again, all three major cloud service providers are popular, and it’s really hard to go wrong with one. Make sure to evaluate cost and integration with existing software when evaluating the choice that’s best for you. 

Gradient New Product Update Q4 2023

Today we are excited to announce our next major product update for Gradient to help companies optimize their Databricks Jobs clusters.  This update isn’t just a simple UI upgrade…

We upgraded everything from the inside out! 

Without burying the lede, here’s a screenshot of the new project page for Gradient above.

Back in the last week of June of this year (2023), we debuted our first release of Gradient.  In the past few months we gathered all of the user feedback on how we can make the experience even better.

So what were the high level major feature requests that we learned in the past few months?

  • Visualizations – Visual graphs which show the cost and runtime impact of our recommendations to see the impact and ROI of Gradient
  • Easier integration – Easier “one-click” installation and setup experience with Databricks
  • More gains – Larger cost savings gains custom tailored to the unique nature of each job
  • Azure support – A large percentage of Databricks users are on Azure, and obviously they wanted us to support them

Those feature requests weren’t small and required pretty substantial changes from the backend to the front, but at the end of the day we couldn’t agree more with the feedback.  While a sane company would prioritize and tackle these one by one, we knew each one of these was actually interrelated behind the scenes, and it wasn’t just a simple matter of checking off a list of features.

Here’s our high level demo video to see the new features in action!

So we took the challenge head on and said “let’s do all of it!”   With all of that in mind, let’s walk through each awesome new feature!

Feature #1:  See Gradient’s ROI with cost and runtime Visualizations

With new timeline graphs users can see in real-time the performance of their jobs and what impact Gradient is having. As a general monitoring tool, users can now see the impact of various cloud anomalies on their cost and runtime. A summary of benefits is below:

  • Monitor your jobs total costs across both DBUs and cloud fees in real-time to stay informed
  • Ensure your job runtimes and SLAs are met
  • Learn what anomalies are impacting your jobs’ performances
  • Visualize Gradient’s value by watching your cost and runtime goals being met

Feature #2:  Cluster integrations with AWS and Azure

Gradient now interfaces with both AWS and Azure cloud infrastructure to obtain low level metrics. We know many Databricks enterprises utilize Azure and this was a highly requested feature. A summary of benefits is below:

  • Granular compute metrics are obtained by retrieving cluster logs beyond what Databricks exposes in their system tables
  • Integrate with Databricks Workflows or Airflow to plug Gradient into how your company runs your infrastructure
  • Easy metrics gathering as Gradient does the heavy lifting for you and automatically compiles and links information across both Databricks and cloud environments

Feature #3:  A new machine learning algorithm that custom learns each job

A huge upgrade from our previous solution is a new machine learning algorithm that learns the behavior of each job individually before optimizing. One lesson we learned is each job is unique, from python, to SQL, to ML, to AI, the variety of codebases out there is massive. A blanket “heuristic” solution was not scalable, and it was clear we needed something far more intelligent. A summary of the benefits is below:

  • Historical log information is used to train custom models for each of your jobs.  Since no two jobs are alike, custom models are critical to optimizing at scale.
  • Accuracy is ensured by training on real job performance data
  • Stability is obtained with small incremental changes and monitoring to ensure reliable performance

Feature #4: Auto-import and setup all of your jobs with a single click

Integrating with the Databricks environment is not easy, as most practitioners can attest to. We invested a lot of development into “how do we make it easy to on-board jobs?” After a bunch of work and talking to early users – we’ve built the easiest system we could find – just push a button.

Behind the scenes, we’re interacting with the Databricks API, tokens, secrets, init scripts, webhooks, logging files, cloud compute metrics, storage – just to name a few. A summary of the benefits is below:

  • Gradient connects to your Databricks workspace behind the scenes to make importing and setting up job clusters as easy as a single click
  • Non-invasive webhook integration is used to link your environment with Gradient without any modifications to your code or workflows

Feature #5:  View and approve recommendations with a click

With all of the integration setup done in the previous feature, applying recommendations is now a piece of cake. Just click a button and your Databricks jobs will be automatically updated. No need to go into the DB console or change anything in another system. We take care of all of that for you! A summary of the benefits is below:

  • View recommendations in the Gradient UI for your approval before any changes are actually made
  • Click to approve and apply a single recommendation so you are always in control

Feature #6:  Change your SLA goals at any time

We always believed that business should drive infrastructure, not the other way around. Now you can change your SLA goals at any time and Gradient will change your cluster settings to meet your goals. With the new visualizations, you can see everything changing in real time as well. A summary of the benefits is below:

  • Runtime SLA goals ultimately dictate the cost and performance of your jobs.  Longer SLAs can usually lead to lower costs, while shorter SLAs could lead to higher costs.
  • Goals change constantly for your business; Gradient allows your infrastructure to keep up at scale
  • Business-led infrastructure allows you to start with your business goals and work backwards to your infrastructure, not the other way around

Feature #7:  Enable auto-apply for self-improving jobs

One big request was for users at scale, who have hundreds or thousands of jobs. There’s no way someone would want to click an “apply” button 1000x a day! So, for our ultimate experience, we can automatically apply our recommendations and all you have to do is sit back and watch the savings. A summary of the benefits is below:

  • Focus on business goals by allowing Gradient to constantly improve your job clusters to meet your ever changing business needs
  • Optimize at scale with auto apply, no need to manually analyze individual jobs – just watch Gradient get to work across all of your jobs
  • Free your engineers from manually tweaking cluster configurations and allow them to focus on more important work

Try it yourself!

We’d love to get your feedback on what we’re building.  We hope these features resonate with you and your use case.  If you have other use cases in mind, please let us know! 

To get started – see our docs for the installation process!

Connect with us now via booking a demo, chatting with us, or emailing us at support@synccomputing.com.