
March 2024 Release Notes


Our team has been hard at work to deliver industry-leading features to support users in achieving optimal performance within the Databricks ecosystem. Take a look at our most recent releases below.

Worker Instance Recommendations

Introducing Worker Instance Recommendations directly from the Sync UI. With this feature, you can tap into optimal cluster configuration recommendations so that your configs are optimized for individual jobs.

The instance recommendations within Gradient optimize not only the number of workers but also the worker size. For example, if you are using i3.2xlarge instances, Gradient will find the right instance size (such as i3.xlarge, i3.4xlarge, or i3.8xlarge) within the i3 instance family.


Instance Fleet Support

If your company is using Instance Fleet Clusters, Gradient is now compatible! No changes are required to the user flow, as this feature is automatically supported in the backend. Just onboard your jobs into Gradient as usual, and we’ll handle the rest.

Hosted Log Collection


Running Gradient is now more streamlined than ever! You’re now able to opt into hosted log collection entirely in the Sync environment with Sync-hosted or user-hosted collection options. What does this mean? It means that there are no extra steps or external clusters needed to run Gradient, allowing Sync to do all the heavy lifting while minimizing the impact on your Databricks workspace. 

With hosted DBX log collection within Gradient, you’re able to minimize onboarding errors due to annoying permission settings while increasing visibility into any potential collection failures, ultimately giving you and your team more control over your cluster log data.


Getting Started with Collection Setup
The Databricks Workspace integration flow is triggered when a user clicks on Add → Databricks Workspace after they have configured their workspace and webhook. Users will also now have a toggle option to choose between Sync-hosted (recommended) or User-hosted collection.

  • Sync-hosted collection – The user will be optionally prompted to share their preference for cluster logs stored for their Databricks Jobs. This will initially be an immutable setting saved on the Workspace.
    • For AWS – Users will need to add a generated IAM policy and IAM role to their AWS account. The IAM policy grants Sync ec2:DescribeInstances and ec2:DescribeVolumes, and optionally s3:GetObject and s3:ListBucket on the specific bucket and prefix where the user has configured cluster log uploads. The S3 permissions are optional because the user may be using DBFS to store cluster logs. The user also needs to add a trust relationship to the IAM role that allows the Sync IAM role to call sts:AssumeRole using an ExternalId that Sync provides. Gradient generates both the policy and the trust relationship as JSON for the user to copy and paste.
    • For Azure – Coming soon!
  • User-hosted collection – For both Azure and AWS, setup proceeds as the normal workspace integration requirements dictate.
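
As a rough illustration of the AWS setup described above, the sketch below builds a permissions policy and a trust relationship of the kind Gradient generates. The actions are the ones listed above; the function names, bucket, prefix, role ARN, and ExternalId are placeholders, and the JSON that Gradient actually produces may be structured differently.

```python
import json


def build_iam_policy(bucket=None, prefix=""):
    """Permissions policy: EC2 describe calls, plus optional S3 read access."""
    statements = [{
        "Effect": "Allow",
        "Action": ["ec2:DescribeInstances", "ec2:DescribeVolumes"],
        "Resource": "*",
    }]
    if bucket:  # S3 access is optional: cluster logs may live in DBFS instead
        statements.append({
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{bucket}",
        })
        statements.append({
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
        })
    return {"Version": "2012-10-17", "Statement": statements}


def build_trust_relationship(sync_role_arn, external_id):
    """Trust policy letting one external role assume this role with an ExternalId."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": sync_role_arn},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }


if __name__ == "__main__":
    # All values below are placeholders, not real Sync identifiers.
    print(json.dumps(build_iam_policy("example-log-bucket", "cluster-logs/"), indent=2))
    print(json.dumps(build_trust_relationship(
        "arn:aws:iam::123456789012:role/example-sync-role", "example-external-id"), indent=2))
```

Leaving out the bucket argument yields the EC2-only policy, which matches the case where cluster logs are kept in DBFS.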

Stay up to date with the latest feature releases and updates at Sync by visiting our Product Updates documentation.

Ready to start getting the most out of your Databricks job clusters? Request a demo or reach out to us at info@synccomputing.com.

February 2024 Release Notes


We’re excited to share all the new and improved features that our team has recently released to help our customers gain full governance over their Databricks infrastructure.

Databricks Workspace Integration
Introducing the Databricks Workspace Integration for Gradient. With this new feature, you’re able to further simplify the process of connecting your Databricks Workspace to the Sync platform. This capability eases the otherwise tedious integration process by letting you complete it in the Gradient UI, without the use of the Sync CLI.

To get started, head to the integrations tab in your Sync dashboard. Here you’ll see a list that includes Databricks Workspace. Navigate to the Add dropdown menu and click on the Databricks Workspace dropdown option to trigger the integration flow.


Log in to Gradient to get started.

Project Reset Data
As users integrate their projects into Sync, they are often faced with sudden config changes. Project Reset is a capability built directly into the Sync platform that lets users perform a hard “reset” on a project’s data, triggering the build of a new custom model for the related job.

Now available via the Sync API, coming soon to the Sync UI


With this new capability, you’re able to do the following directly from the Sync UI:

  • Clear historical logs
  • Reset the selected project back to “learning” mode
  • Clear project graphs
  • Clear the project’s history table
{
  "result": [
    {
      "created_at": "2024-02-21T02:35:46.806Z",
      "updated_at": "2024-02-21T02:35:46.806Z",
      "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "name": "string",
      "app_id": "string",
      "cluster_path": "string",
      "job_id": "string",
      "workspace_id": "string",
      "workflow_id": "string",
      "creator_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "product_code": "aws-emr",
      "description": "string",
      "status": "Pending Setup",
      "cluster_log_url": "string",
      "prediction_preference": "performance",
      "auto_apply_recs": true,
      "prediction_params": {
        "sla_minutes": 0,
        "force_ondemand_workers": true,
        "fix_worker_family": true,
        "fix_driver_type": true,
        "fix_scaling_type": true
      },
      "tuned_cost": 0,
      "tuned_runtime": 0,
      "project_model_id": "UNASSIGNED",
      "metrics": {
        "job_success_rate_percent": 0,
        "sla_met_percent": 0
      },
      "latest_prediction_id": "string",
      "latest_prediction_created_at": "string",
      "creator": {
        "created_at": "2024-02-21T02:35:46.806Z",
        "updated_at": "2024-02-21T02:35:46.806Z",
        "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
        "sync_tenant_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
        "email": "string",
        "name": "string",
        "last_login": "string"
      },
      "phase": "LEARNING",
      "optimize_instance_size": true,
      "project_periodicity_type": "DAILY_SINE",
      "product_name": "string"
    }
  ]
}
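
For readers scripting against the API, the following sketch shows one way to check a payload shaped like the example above for projects that are back in the learning phase. The function name is illustrative and the sample is a trimmed copy of the example; nothing here assumes which endpoint returned it.

```python
import json


def projects_in_learning(payload):
    """Return (id, name) pairs for projects whose phase is LEARNING."""
    return [
        (project["id"], project.get("name"))
        for project in payload.get("result", [])
        if project.get("phase") == "LEARNING"
    ]


# Trimmed copy of the example payload; field values are the placeholders shown above.
sample = json.loads("""
{
  "result": [
    {
      "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "name": "string",
      "status": "Pending Setup",
      "phase": "LEARNING"
    }
  ]
}
""")

print(projects_in_learning(sample))
```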


User Management
With User Management, you’re able to take a hands-on approach to managing your users in Gradient. With this feature, account owners can:

  • Add a user
  • Deactivate a user
  • Assign a specific role to a user

Stay up to date with the latest feature releases and updates at Sync by visiting our Product Updates documentation.

Ready to start getting the most out of your Databricks job clusters? Reach out to us at info@synccomputing.com.

Rethinking Serverless: The Price of Convenience

As is the case with many concepts in technology, the term Serverless is abusively vague. As such, discussing the idea of “serverless” usually evokes one of two feelings in developers. Either it’s thought of as the catalyst for an incredible potential future, finally freeing developers from having to worry about resources or scaling concerns, or it’s thought of as the harbinger of yet another “we don’t need DevOps anymore” trend.

The root cause of this confusion is that the catch-all term “Serverless” actually comprises two large operating models: functions and jobs. At Sync, we’re intimately familiar with optimizing jobs, so when our customers gave us feedback to focus a portion of our attention on serverless functions, we were more than intrigued.

The hypothesis was simple: could we extend our expertise and background in optimizing large-scale Databricks batch compute workloads to optimizing many smaller batch compute workloads?

Serverless Functions

First, let’s see how we got here.

One of the most painful parts of the developer workflow is “real world deployment.” In the real world, deploying locally written code to the right environment, and getting it to work the same way there, was extraordinarily painful. Library issues, scaling issues, infrastructure management issues, provisioning issues, resource selection issues, and a number of other issues plagued developers. The cloud just didn’t mimic the ease and simplicity of local developer environments.

Then Serverless functions emerged. All of a sudden, developers could write and deploy code in a function with the same level of simplicity as writing it locally. They never had to worry about spinning up an EC2 instance or figuring out what the material differences between AMI and Ubuntu are. They didn’t have to play with Dockerfiles or even do scale testing. They wrote the exact same Python or NodeJS code that they wrote locally in a Cloud IDE and it just worked. It seemed perfect.

Soon, mission critical pieces of infrastructure were supported by Python functions only tens of lines long, deployed in the cloud. Enter: Serverless frameworks. All of a sudden, it became even easier to adopt and deploy serverless functions. Enterprises adopted these functions in droves. Many deployed hundreds or even thousands of them.

Why We Care

At Sync, our focus since our inception has been optimizing large scale compute jobs. Whether through Spark, EMR, or Databricks, the idea of introspecting a job and building a model through which we can understand and optimize that job, is our bread and butter. As we continued our development, multiple customers began asking for support of serverless technologies. Naturally, we assumed they were talking about Serverless Job functionality (which many were), but there was a substantial portion focused on Serverless Function functionality. 

So we set out to answer a simple question: Are Serverless Functions in their current form working for the modern enterprise? 

The answer, as it happens, is a resounding no. 

Industry Focus

In 2022, IBM published a blog post titled “The Future Is Serverless,” which cited the “energy-efficient and cost-efficient” nature of serverless applications as a primary reason that the future will be serverless. The authors make the (valid) case that reserving cloud capacity is challenging and that consumers of cloud serverless functions are better served by allowing technologies such as Knative to streamline the “serverless-ification” process. In short, their thesis is that complex workloads, such as those run in Kubernetes, are better served by Serverless offerings.

In 2023, Datadog released their annual “State of Serverless” post, which shows the continued adoption of Serverless technologies. This trend is present across all three major cloud vendors.

https://www.datadoghq.com/state-of-serverless/

The leader of the pack is AWS Lambda. Lambda has traditionally been the entry point for developers to deploy their Serverless workloads. 

But hang on, 40%+ of Lambda invocations happen in NodeJS? NodeJS is not traditionally thought of as a distributed computing framework, nor is it generally used for large scale orchestration of compute tasks. But it seems to be dominating the Lambda serverless world.

So, yes, IBM argues that Serverless is great for scaling distributed computation tasks, but what if that’s not what you’re doing with Serverless?

https://www.datadoghq.com/state-of-serverless/

What Serverless Solved 

Before we get into the details of what’s missing, let’s talk about where things are currently working. 

Where Things Work 1: Uptime Guarantees 

One of the critical, but most frustrating pieces of the developer lifecycle is uptime requirements. Many developers hear the term five-nines, and shudder. Building applications that have specific uptime guarantees is not only challenging, it’s also time-intensive. When large scale systems are made up of small, discrete pieces of computation, the problem can become all the more complex. 

Luckily, the Lambda SLAs guarantee a fairly reasonable amount of uptime, right out of the box. This can save otherwise substantial developer efforts of scoping, building, and testing highly available systems. 

Where Things Work 2: Concurrency + Auto Scaling

Introspecting a large scale system isn’t easy. Companies like Datadog and Cloudflare run multi-billion dollar businesses off of this exact challenge. In an environment where requests can burst unexpectedly, creating and designing systems that scale based on spot user demand is also difficult.

One of the most powerful aspects of a serverless or hosted model (such as AWS Lambda) is the demand-based auto-scaling offered by the infrastructure. These effects are compounded when the functions themselves are stateless. This effectively frees developers from the operational concerns of autoscaling. There are unquestionably still cost concerns, server load concerns, and others, but serverless function offerings give developers a good starting point.

Problem 1: Developer Bandwidth 

In a typical Serverless Function deployment, the initial choice of configuration tends to be the perpetual choice of configuration.

Wait, hang on, “initial choice of configuration”? Meaning, users still have to manually select their own configuration? It turns out, yes, users still need to manually pick a particular configuration for each serverless function they deploy. It’s actually a bit ironic – with the promise of true 0-management jobs, users are still required to intelligently select resource configuration. 

If an engineer deploys and accidentally overspecs a serverless function initially, it’s fairly unlikely that they will ever revisit the function to optimize it. This is generally the case for a few reasons:

  1. Time – Most engineers don’t have the time to go back and ensure that functions they have written weeks, months, or even years ago are operating under the ideal resources. This largely feeds into #2.
  2. Incentives – Engineers are not incentivized to pick the optimal resource configuration for their jobs. They’d rather have the job be guaranteed to work, while spending a bit more of their company’s compute budget.
  3. Employee Churn – Enterprises have inherent entropy and employees are oftentimes transient. People start jobs and people leave jobs. The knowledge generally leaves with them. When other engineers inherit previous work, they are significantly more incentivized to just ensure it works, rather than ensure that it works optimally.

Problem 2: Serverless Still Requires Tuning

Lambda is predicated on a simple principle – the resource requirements for workloads that take less than 15 minutes to run, can be pretty easily approximated. Lambda makes it easy for developers to set-and-forget, offering only one knob for them to worry about.

That knob is memory. Using Lambda, you can configure the memory allocated to a Lambda function as a value between 128 MB and 10,240 MB. Lambda will automatically decide how much vCPU to allocate to you based on the memory setting.

This… sounds great. “I only have to pick one lever and, all of a sudden, I get everything else figured out for me? That’s perfect!” If that were the end of the story, I would get to finish this post right now.

Instead, life is all about tradeoffs – generally correlated tradeoffs. In this case, it’s cost and performance. As an engineer, it’s easy for me to pick the largest memory setting available to me just to ensure my Lambda function works, regardless of what its actual resource requirements are. Once it works, why would I ever touch it again?

Well, it turns out that picking large values, frequently uncorrelated to the resources actually needed, isn’t the most cost effective thing to do. So much so, in fact, that an AWS Solutions Engineer built and open sourced a tool to help users actually find the correct memory levels for their Lambda functions. The tool uses AWS Step Functions to walk users down to the minimum necessary level. It’s been so popular that it has 5K stars on GitHub… and 18.8K deployments.

Clearly, the one-knob-rules-all solution isn’t working.  
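
To make the cost and performance tradeoff concrete, here is a minimal sketch that compares memory settings by per-invocation cost. It is not how the open-source tuning tool works internally; the price per GB-second is an assumed on-demand figure that should be checked against current AWS pricing, per-request charges are ignored, and the measured durations are invented for illustration.

```python
# Assumed on-demand price per GB-second; verify against current AWS pricing
# for your region and architecture. Per-request charges are ignored.
PRICE_PER_GB_SECOND = 0.0000166667
MIN_MEMORY_MB, MAX_MEMORY_MB = 128, 10_240  # configurable range for a Lambda function


def invocation_cost(memory_mb, duration_ms, price=PRICE_PER_GB_SECOND):
    """Compute cost of one invocation: GB allocated x seconds run x price."""
    if not MIN_MEMORY_MB <= memory_mb <= MAX_MEMORY_MB:
        raise ValueError(f"memory must be between {MIN_MEMORY_MB} and {MAX_MEMORY_MB} MB")
    return (memory_mb / 1024) * (duration_ms / 1000) * price


def cheapest_setting(measurements, max_duration_ms=None):
    """Pick the cheapest memory setting, optionally under a latency ceiling.

    measurements maps memory (MB) to average observed duration (ms).
    """
    candidates = {
        memory: duration
        for memory, duration in measurements.items()
        if max_duration_ms is None or duration <= max_duration_ms
    }
    if not candidates:
        raise ValueError("no memory setting meets the latency ceiling")
    return min(candidates, key=lambda memory: invocation_cost(memory, candidates[memory]))


# Hypothetical measurements: more memory helps at first, then stops paying off.
measured = {128: 4000, 512: 900, 1024: 480, 2048: 450, 10_240: 440}

print(cheapest_setting(measured))
print(cheapest_setting(measured, max_duration_ms=500))
```

With these made-up numbers, the largest setting is barely faster than 1,024 MB yet costs several times more per invocation, which is exactly the situation a set-and-forget choice hides.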


Problem 3: Serverless Is Hard to Introspect

The scale and growth testing that plagued engineers for decades before the rise of Serverless, was unfortunately not in vain. Understanding how users will be interacting with an application, in terms of number of requests or compute load gives engineers a powerful understanding of what to expect when things go live. 

In the Serverless Function architecture, engineers don’t think about these considerations and push the burden onto the infrastructure itself. As long as the infrastructure works – it’s unlikely that an already oversubscribed engineer would spend time digging into the performance or cost characteristics of the Serverless function. 

Absent home-rolled solutions, there are few tools that allow for the detailed observability of a single serverless function. Furthermore, there are usually hundreds if not thousands of serverless functions deployed. Observability across a fleet of functions is nearly impossible.

Furthermore, the primary mechanism folks can use for per-function observability is AWS CloudWatch. CloudWatch logs events for each Lambda invocation and stores a few metrics. The major problem, though, is that just collecting this information in CloudWatch has been observed to be more expensive than Lambda itself. In fact, there are full articles, posts, and best practices around just managing the costs associated with Lambda CloudWatch logs.

Problem 4: No Auto-Optimization

The year 2023 brought on a material shift in the mentality of “compute” consumers. Enterprises that were previously focused on growth at all costs shifted their focus to efficiency. Vendors in the generic Cloud, Snowflake, and Databricks ecosystem popped up at increasing rates. Most had a simple goal – provide high level visibility into workloads. 

They provided interactive charts and diagrams to show ongoing cost changes… But they didn’t provide the fundamental “healing” mechanisms. It would be like going to the doctor and having them diagnose a problem but offer no treatment.

Consistent with their focus on efficiency, enterprises had a few options. Larger ones deployed full teams to focus on this effort. Smaller ones that didn’t have the budget or manpower turned to observability tools… nearly all of which fell short, as they missed the fundamental optimization component.  

Providing detailed visibility across a few, large scale jobs is considered table stakes for many observability providers, but for some reason providing that same level of visibility across many, small scale jobs, in an efficient and easy to optimize way hasn’t become standard. 

Conclusion

We’re in a fairly unique period as an industry. Job visibility, tuning, introspection, and optimization have reemerged as key pieces of the modern tech stack. But most focus on the whales, when they should be focusing on the barracudas. 

If these problems resonate with you – drop us a line at info@synccomputing.com. We’d love to chat. 

Databricks vs Snowflake: A Complete 2024 Comparison 

Databricks and Snowflake are two of the largest names in cloud data solutions – and for good reason. Both platforms have been instrumental in helping companies generate value from their internal (and external) data assets. Each platform has distinct advantages and features and they’ve increasingly overlapped in their offerings—leaving many confused about which solution is best suited for their business needs.

Unfortunately, as is the case in most things in life, there isn’t a simple answer for “which one is better.” But, at Sync, we’ve seen, debugged, and optimized thousands of workloads across numerous enterprises, which has given us a unique perspective. 

To begin a meaningful comparison, we have to understand the history and core competency of each offering.  To help in that process, we at Sync broke down the major differences between Snowflake and Databricks — including price, performance, integration, security and best use cases—in order to best address the individual needs of each user.

While this topic has been debated plenty of times in the past, here at Sync we aim to provide emphasis on the total cost of ownership and ROI for companies.  With all that said, let’s begin!

Below is our complete 2024 guide to all things Databricks vs. Snowflake.

Databricks vs. Snowflake: What Are The Key Differences?

The first thing to understand about the two platforms is what they are, and what solution they are hoping to provide.

Databricks is a cloud-based, unified data analytics platform for building, deploying, and sharing data analytics solutions at scale. Databricks aims to provide one unified interface where users can store data and execute jobs in interactive, shareable workspaces. These workspaces contain cloud-based notebooks, the backbone of Databricks, in which compute tasks are written and then executed on cloud-based machines.

Snowflake, on the other hand, is a fully-managed, SaaS cloud-based data warehouse. Whereas Databricks was initially designed to unify data pipelines, Snowflake was designed to be the easiest-to-manage data warehouse solution on the cloud. While the target market for Databricks is data scientists and data engineers, the target market for Snowflake is typically data analysts, who are highly proficient in SQL queries and data analysis but not as interested in complex computations or machine learning workflows.

Over time, Databricks and Snowflake have been increasingly in competition, as each hopes to expand their offerings to be an all-in-one cloud data platform solution. New products like Snowflake’s Snowpark (which offers Python functionality) and Databricks’s DBSQL (their serverless data warehouse) have made it increasingly difficult to differentiate the offering of each product. 

For the time being, most would agree that Snowflake tends to be the dominant name for easy-to-use cloud data warehouse solutions, and Databricks is the winner for cloud-based machine learning and data science workflows.

Databricks vs Snowflake: Data Storage

At the moment, Snowflake has the edge for querying structured data, and Databricks has the edge for the raw and unstructured data needed for ML. In the future, we think the Databricks data lakehouse platform could be the catch-all solution for all data management.

One of the largest differences between Snowflake and Databricks is how they store and access data. Both lead the industry in speed and scale. The largest difference between the two is the architecture of data warehouse vs data lakehouse, and the storage of unstructured vs structured data.

Snowflake

Snowflake, at its core, is a cloud data warehouse. It stores structured data in a closed, proprietary format for quick, seamless data querying and transformation. This proprietary format allows for high speed and reliability, with tradeoffs on flexibility. More recently, Snowflake has begun allowing data to be ingested and stored in additional formats (such as Apache Iceberg), but the vast majority of its customer data still sits in its own format.

Snowflake utilizes a multi-cluster, shared data architecture, in which compute resources share the same storage layer but retain their own CPU and memory. To achieve this, Snowflake ingests, optimizes, and compresses data to a cloud object storage layer, like Amazon S3 or Google Cloud Storage. Data here is organized into a columnar format and segmented into micro-partitions, each holding anywhere from 50 to 500 MB of uncompressed data. Snowflake stores metadata about each micro-partition, which helps dramatically with speed. Interestingly enough, Snowflake’s own internal storage file format is not open source, which keeps most customers locked in.

To function efficiently, Snowflake uses multiple layers to provide an enterprise-experience to the cloud processing workload. Snowflake maintains a cloud services layer that handles the enterprise authentication and access control. 

For execution, Snowflake uses virtual warehouses, which are abstractions on top of regular cloud instances (such as EC2). These warehouses query data from a separate data storage layer, effectively separating storage and compute. This separation makes Snowflake highly scalable and allows users to run concurrent queries off the same data with reasonable isolation.

Both Snowflake and Databricks are cloud agnostic, meaning they run on all three major cloud service providers: Amazon AWS, Microsoft Azure, and Google Cloud Platform (GCP).

Sync’s Take: Snowflake’s architecture allows for fast and reliable querying of structured data, at scale. It has appeal to those who want simple methods for managing their resource requirements of their jobs (through T-Shirt size warehouse options). It is primarily geared towards those proficient in SQL but lacks the flexibility to easily deal with raw, unstructured data.

Databricks

One of Databricks’ selling points is that it employs an open-source storage layer known as Delta Lake, which aims to combine the flexibility of cloud data lakes with the reliability and unified structure of a data warehouse, without the challenges associated with vendor lock-in. Databricks has pioneered this so-called ‘data lakehouse’ hybrid structure as a cost-effective solution for data scientists, data engineers, and analysts alike to work with the same data, regardless of structure or format.

The Databricks data lakehouse works by employing three layers to allow for the storage of raw and unstructured data while also storing metadata (such as a structured schema) for warehouse-like capabilities on structured data. Notably, this data lakehouse provides ACID transaction support, automatic schema enforcement (which validates DataFrame and table compatibility before writes), and end-to-end streaming for real-time data ingestion, some of the most desirable advancements for data lake systems.
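
The idea behind schema enforcement can be shown with a toy example. The sketch below is plain Python, not the Delta Lake API: the table schema, column names, and functions are all hypothetical, and it only illustrates the principle of validating an incoming batch against the table’s schema before anything is written.

```python
# Hypothetical table schema: column name -> expected Python type.
TABLE_SCHEMA = {"job_id": str, "runtime_s": float}


class SchemaMismatch(Exception):
    """Raised when an incoming batch does not match the table schema."""


def validate_batch(rows, schema=TABLE_SCHEMA):
    """Check every row's columns and types against the schema."""
    for index, row in enumerate(rows):
        if set(row) != set(schema):
            raise SchemaMismatch(
                f"row {index}: columns {sorted(row)} do not match {sorted(schema)}")
        for column, expected in schema.items():
            if not isinstance(row[column], expected):
                raise SchemaMismatch(
                    f"row {index}: {column!r} should be {expected.__name__}")


def append(table, rows, schema=TABLE_SCHEMA):
    """Validate the whole batch first, so a bad batch writes nothing."""
    validate_batch(rows, schema)
    table.extend(rows)


table = []
append(table, [{"job_id": "nightly-etl", "runtime_s": 412.5}])

try:
    # The extra "owner" column is not in the schema, so the write is rejected.
    append(table, [{"job_id": "adhoc", "runtime_s": 3.0, "owner": "sam"}])
except SchemaMismatch as error:
    print("rejected:", error)

print(len(table))
```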

Sync’s Take: Lakehouses bring the speed, reliability, and query performance of data warehouses to the flexibility of a data lake. The drawback is that, because the technology is relatively new, new and less technical users have occasionally been unable to locate tables and have had to rebuild them.

Databricks vs Snowflake Scalability

Snowflake and Databricks continue to battle for dominance of enterprise workloads. While both have been proven to be industry leaders in this capacity, the largest practical difference between the two lies in their resource management capabilities. 

Snowflake

Snowflake offers compute resources as a serverless offering, meaning users don’t have to select, install, configure, or manage any software or hardware. Instead, Snowflake uses a series of virtual warehouses (independent compute resources containing memory and CPU) to run queries. This separation of storage and compute resources allows Snowflake to scale without slowing down, and multiple users can query concurrently against the same single segment of data.

In terms of performance, Snowflake has been shown to process up to 60 million rows in under 10 seconds.

Snowflake applies a simple “t-shirt” sizing model to its virtual warehouses: there are 10 sizes, each with double the computing power of the size before it. The largest is 6XL, which has 512 virtual nodes. Because warehouses don’t share compute resources or store data, if one goes down it can be replaced in minutes without affecting any of the others.

A diagram of the virtual nodes associated with each size data warehouse
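
The doubling rule can be written out directly. The sketch below assumes the smallest size corresponds to a single node (an inference from 10 sizes ending at 512, not something stated above) and uses abbreviated size labels chosen for illustration.

```python
# Ten warehouse sizes, smallest to largest; labels are abbreviations for illustration.
SIZES = ["XS", "S", "M", "L", "XL", "2XL", "3XL", "4XL", "5XL", "6XL"]

# Each size doubles the previous one, starting from an assumed single node.
NODES = {size: 2 ** step for step, size in enumerate(SIZES)}

for size, nodes in NODES.items():
    print(f"{size:>3}: {nodes} virtual nodes")
```

Because each step doubles capacity, moving up one size is never a small adjustment: it doubles the compute being requested.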

Most notably, Snowflake’s multi-cluster warehouses offer both a “maximized” and an “auto-scale” mode. The auto-scale mode gives you the ability to dynamically shut down unused clusters, saving you money.

Databricks

Databricks started out with a much more “open” and traditional infrastructure, where basically all of the compute runs inside a user’s cloud VPC and all of the cluster configurations are exposed to end users. This is the complete opposite of the “serverless” model, where the compute runs inside Databricks’ VPC. This has its pros and cons: the main advantage is that users can hyper-optimize their clusters to improve performance, but the drawback is that it can be painful to use or require an expert to maintain.

More recently, Databricks has been evolving towards the “serverless” model with Databricks SQL Serverless, and will likely extend this model to its other products, such as notebooks. The pros and cons here flip: the pro is that users don’t have to worry about cluster configurations, but the con is that users have neither access to nor visibility into the underlying infrastructure and are unable to custom tune clusters to meet their needs.

Since Databricks is currently in a “transition” period between classic and “serverless” offerings, their scalability really depends on which use case people select.  

One major note is that Databricks has a diverse set of compute use cases, from SQL warehouses, Jobs, All Purpose Compute, and Delta Live Tables to streaming, and each of these has slightly different compute configurations and use cases. For example, SQL warehouses can be used as a shared resource, where multiple users can submit queries to the warehouse at any time. Jobs are more singular: one notebook runs on one cluster, which is then shut down (Jobs can also be shared now, but this is used less).

The different use cases need to fit the end user’s needs, which can also impact scalability. This one example symbolizes both the strength and weakness of Databricks: there are so many options at so many levels that it can be great if you know what you’re doing, or a nuisance if you don’t.

Sync’s Take: When it comes to scaling to large workflows, both Snowflake and Databricks can handle the workload. However, Databricks gives users more room to fine-tune performance on large volumes of data, which ultimately saves costs.

Databricks vs Snowflake: Cost

Both Databricks and Snowflake are marketed as pay-as-you-go models, meaning the more compute you reserve or request, the more you pay. Their models differ starkly from more traditional “usage based pricing schemes,” where customers pay only for the usage they actually consume. In both Databricks and Snowflake, users can and will pay for requested resources whether or not those resources are actually necessary or optimal to run the job.

Another big difference between the two services is that Snowflake runs and charges for the entire compute stack (virtual warehouses and cloud instances), whereas Databricks only runs and charges for the management of compute, so users still have to pay a separate cloud provider bill. It is worth noting that Databricks’ new serverless product mimics the Snowflake operating model. Databricks bills in compute/time units called Databricks Units (DBUs), metered per second, and Snowflake uses a Snowflake credit system.

As a formula, it breaks down like this:

  • Databricks (Classic compute) = Data storage + Cost of Databricks Service (DBUs) + Cost of Cloud Compute (Virtual machine instances) 
  • Snowflake = Data storage (Daily average volume of bytes stored on Snowflake) + Compute (number of virtual warehouses used) 
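
The two formulas above translate into a short sketch. All rates and quantities in the example are placeholders rather than published prices, and the function and parameter names are illustrative.

```python
def databricks_classic_cost(storage_usd, dbus, usd_per_dbu, vm_hours, usd_per_vm_hour):
    """Data storage + Databricks service (DBUs) + cloud compute (VM instances)."""
    return storage_usd + dbus * usd_per_dbu + vm_hours * usd_per_vm_hour


def snowflake_cost(storage_usd, credits, usd_per_credit):
    """Data storage + compute billed in Snowflake credits."""
    return storage_usd + credits * usd_per_credit


# Placeholder inputs for one billing period; substitute your own contract rates.
databricks_total = databricks_classic_cost(
    storage_usd=10.0, dbus=100, usd_per_dbu=0.25, vm_hours=50, usd_per_vm_hour=0.5)
snowflake_total = snowflake_cost(storage_usd=10.0, credits=20, usd_per_credit=3.0)

print(f"Databricks (classic): ${databricks_total:.2f}")
print(f"Snowflake:            ${snowflake_total:.2f}")
```

Note that the Databricks total has a term that never appears on the Databricks invoice: the virtual machine cost is billed separately by the cloud provider.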

Both Databricks and Snowflake offer tiers and discounts of pricing based on company size, and both allow you to save money by pre-purchasing units or credits.

Databricks has more variance in price, as its prices depend on the type of workload, with certain types of compute costing 5x more per compute hour than simple jobs.

One major advantage Databricks has in terms of cost is that it allows users to utilize Spot instances in their cloud provider, which can translate to significant cost savings. Snowflake abstracts all of this away, and the end user has no option to benefit from Spot instances.

Sync’s Take: There is no concrete answer to which service is “cheaper” as it really depends on how much of the service or platform you’re using, and for what types of tasks. However, the control and introspection capabilities that Databricks provides are fairly unmatched in the Snowflake ecosystem. This gives Databricks a significant edge when optimizing for large compute workloads.

If you’d like a further guide on the breakdown of Databricks pricing, we recommend checking out our complete pricing guide.

Databricks vs Snowflake Speed Benchmarks

Databricks claims it is 2.5x faster than Snowflake. Snowflake also claims it is faster than Databricks. While this is a contentious issue between the two giants, the reality is that benchmarks serve only as a vanity metric. Your workloads will likely look nothing like the TPC-DS benchmarks that either company ran, and hence their benchmarks would not apply to your jobs. Our opinion here is that benchmarks don’t matter at this level.

While this may be an unsatisfying answer, if you’re looking for a solution that is all about the absolute fastest way to run your code – there are likely other solutions that are less well known but do focus on performance.  

Most companies we speak to value both platforms for their ease of use, having all of their data in one place, the ability to share code, and not having to worry about low-level infrastructure. Pure raw speed is rarely a priority for companies. If this sounds like your company, the speed metrics likely don’t matter much.

However, cost likely does matter in aggregate, and hence doing an actual comparison of runtime and cost on the different platforms with your actual workloads is the only real way to know.  

Databricks vs Snowflake: Ease of Use

All things equal, Snowflake is largely considered the “easier” cloud solution to learn between the two. It has an intuitive SQL interface and as a serverless experience, doesn’t require users to manage any virtual or local hardware resources. Plus as a managed service, using Snowflake doesn’t require any installing, maintaining, updating or fine-tuning of the platform. It’s all handled by Snowflake.

From a language perspective, Snowflake is all SQL-based (excluding their new foray into Snowpark), making it accessible for many business analysts. While Databricks SQL has data warehouse functionality in line with Snowflake, the main use case of Databricks is being able to write in Python, R and Scala, and reviews on Gartner and TrustRadius have consistently rated it a more technical setup than Snowflake.

Snowflake also has automated features like auto-scaling and auto-suspend to help start and stop clusters without fine-tuning. While Databricks also has autoscaling and autosuspend, it is designed for a more technical user and there is more involved with fine-tuning your clusters (watch more about how we help do this here).

Sync’s Take: While the Databricks UI has a steeper learning curve than Snowflake, it ultimately offers more advanced control and customization, making this a tradeoff that is largely dependent on how complex you intend your operations to be.

Databricks vs Snowflake: Security

Both Databricks and Snowflake are GDPR-compliant, offer role-based access control (RBAC), and encrypt data both at rest and in motion. Both have very good records with data security and support a variety of compliance standards.

Databricks offers additional isolation at multiple levels including workspace-level permissions, cluster ACLs, JVM whitelisting, and single-use clusters. For organizations that employ ADS or AMS teams, Databricks provides workload security that includes code repository management, built-in secret management, hardening with security monitoring and vulnerability reports, and the ability to enforce security and validation requirements.

Snowflake allows users to set regions for data storage to comply with regulatory guidelines such as HIPAA and PCI DSS. Snowflake security levels can also be adjusted based on requirements, and the platform has built-in features to regulate access levels and control things like IP allowlists and blocklists. Snowflake also offers the advanced features Time Travel and Fail-safe, which allow you to restore tables, schemas, and databases from a specific point in the past, or to protect and recover historical data.

Historically, the only issue for Snowflake was its inability to support on-premise storage on private-cloud infrastructure, which is needed for the highest levels of security, such as government data. In 2022, Snowflake started adding on-premise storage; however, there is limited information so far on how this has been received.

Sync’s Take: Both Databricks and Snowflake have an excellent reputation with data security, as it is mission-critical to their businesses. There is really no wrong choice here and it largely comes down to making sure individual access levels match your intent.

Databricks vs. Snowflake: Ecosystem and Integration

Databricks and Snowflake are becoming the abstractions on top of Cloud Vendors for data computation workloads. As such, they both plug into a variety of vendors, tools, and products. 

From the vendor space, both Databricks and Snowflake provide marketplaces that allow other predominant tools and technology to be co-deployed. There are also community built and contributed features, such as the Databricks Airflow Operators / Snowflake Airflow Operators.

On the whole though, the Databricks ecosystem is typically more “open” than Snowflake’s, since Databricks still runs in a user’s cloud VPC. This means users can still install custom libraries, or even introspect low-level cluster data. Such access is not possible in Snowflake, and hence integrating with your favorite tools may be harder. Databricks also tends to be more developer- and integration-friendly than Snowflake for this exact reason.

Other FAQs on Databricks vs Snowflake

  • Is Databricks a data warehouse? Databricks bills itself as the world’s first “Data Lakehouse”, combining the best of data lakes and data warehouses. However, despite having the capability, Databricks is not typically thought of as a data warehouse solution, as its learning curve and fine-tuning are often unnecessary for someone seeking just a straightforward data warehouse.
  • Can Snowflake and Databricks integrate with each other? It is possible and not entirely uncommon to integrate Databricks and Snowflake with each other. Typically in this manner, Databricks acts as a Data Lake for all unstructured data, manipulating it and processing it as part of an ETL pipeline where it is then stored on Snowflake like a traditional data warehouse.
  • What data types does Snowflake accept? Snowflake is optimized for structured and semi-structured data, meaning it can only accept certain data formats, notably JSON, Avro, Parquet and XML.
  • Can Snowflake and Databricks create dashboards for business intelligence? Yes, both Snowflake and Databricks are able to create dashboards and visualizations for business intelligence.

Databricks vs Snowflake: Which Is Better?

Both Databricks and Snowflake have a stellar reputation within the business and data community. While both are cloud-based platforms, Snowflake is most optimized for data warehousing, data manipulation and querying, while Databricks is optimized for machine learning and heavy data science.

Broken down into components, here is a list of pros for each:

| Platform/Feature | Databricks | Snowflake |
| --- | --- | --- |
| Storage | Better for raw, unstructured data. | Better for reliability and ease of use for structured data. |
| Use Case | Better for ML, AI, Data Science and Data Engineering. Collaborative notebooks in Python/Scala/R a big plus. | Easier for analysts in business intelligence and companies looking to migrate existing data warehouse system. |
| Price | Cheaper at high compute volumes. Not as predictable on cost. | Efficient at scaling down unused resources. More consistent, predictable costs. |
| Scalability | Infinitely scalable. Effective at high volume workloads. | Separate storage and compute makes for seamless concurrent queries. |
| Security | GDPR-compliant, role-based access control, encrypted at rest and in motion. | GDPR-compliant, role-based access control, encrypted at rest and in motion. |

If you want to integrate an existing ETL pipeline that uses structured data with programs like Tableau, Looker and Power BI, Snowflake could be the right option for you. If you are instead looking for a unified analytics workspace where you build compute pipelines, Databricks might be the right choice for you.

Interested in using Databricks further? Check out Sync’s Gradient solution – the only ML-powered Databricks cluster optimization and management tool.  At a high level, we help maintain the openness of Databricks but now with the “ease” of Snowflake.  On top of that, we also actively drive your costs lower and lower.

Sync Computing Partners with Databricks for Lakehouse Job Cluster and Usage Optimization

Self-improving machine learning algorithms provide job cluster optimization and insights for Databricks users

CAMBRIDGE, Mass. – Sync Computing, the industry-leading data infrastructure management platform built to leverage machine learning (ML) algorithms that allow users to automatically maximize data compute performance, today announced that it has joined forces with Databricks’ go-to-market (GTM) teams and its Technology Partner Program. The end goal is to help Databricks customers achieve lower costs, improved reliability, and automatic management of compute clusters at scale. Through this collaboration, Databricks customers will be able to take advantage of Sync Computing’s Gradient solution for SLA optimization, real-time insights, and significant cost savings, so that teams can focus on greater business objectives and ROI.

Platform and data engineering teams are constantly faced with changing pressures as the data infrastructure landscape becomes increasingly complex. They are met with ongoing needs to iterate quickly, gain real-time insights, and maximize performance all while managing cost. The Gradient platform by Sync Computing provides a single source of truth for cost tracking, data governance, and unified metrics monitoring.

“The management and cost of data pipelines is top of mind for engineering teams, especially in the current economic climate. However, tuning clusters to hit cost and runtime goals is a task nobody has time for,” said Jeffrey Chou, CEO and co-founder of Sync Computing. “Databricks customers who use Sync’s Gradient toolkit are now open to a whole new world of opportunities as they can offload these tasks to Gradient while they focus on more urgent business goals. Organizations absolutely love the ROI they see almost immediately.”

Sync Computing’s machine learning-powered optimization delivers recommendations for Databricks clusters, without making any changes at the code level. Using a closed-loop feedback system, Gradient automatically builds custom-tuned machine learning models for each Databricks job it is managing using historical run logs — continuously driving Databricks jobs cluster configurations to hit user-defined business goals.

Sync for Databricks allows companies to:

  • Give platform teams full governance over config changes to meet business demands
  • Slash Databricks compute and operating costs by up to 50%
  • Gain coveted insights into DBU, cloud costs, and cluster anomalies
  • Hit SLAs even as data pipelines change

Sync integrates with leading cloud platforms like Amazon Web Services (AWS) and Microsoft Azure to programmatically optimize for tools like Apache Airflow and Databricks workflows, without changing a single line of code.

Learn how Sync helps organizations large and small optimize Databricks clusters at scale here.

About Sync Computing
Having been recognized as a Gartner Cool New Vendor, Sync Computing was originally spun out of MIT with the goal to make data and AI cloud infrastructure easier to control. With Sync’s one-of-a-kind solution, Gradient, users are given full ability to enable self-improving job clusters to hit SLA goals, gain infrastructure insights, and leverage tailored recommendations to achieve optimal performance. Recognized names such as Insider, Handelsblatt, Abnormal Security, Duolingo, and Adobe have relied on Sync to get the most out of the data-driven landscape with automated data optimization. To learn more, visit https://www.synccomputing.com.

Contact
McKinley Culbert
Marketing at Sync Computing
mckinley.culbert@synccomputing.com

January 2024 Release Notes

release notes

Exciting things are happening at Sync as we move further into the new year!

Ensuring that our users are equipped with the tools to fully manage the automation of their infrastructure is always top of mind. With the most recent iteration of Gradient, Sync users are able to take advantage of a toolkit that makes optimizing Databricks clusters even better.

Here’s what’s new in the latest version of Gradient:

Org Settings

Org Settings is now available in the main navigation bar in the Sync Dashboard. Users are able to navigate to the Org Settings tab to find personal user information, a comprehensive list of API keys, and a list of organization users with their user details.

With Org Settings, users will see a consolidated list of personal information, API keys, and account users directly in the Sync UI.

Stay up to date with the latest feature releases and updates at Sync by visiting our Product Updates documentation. Ready to start getting the most out of your Databricks job clusters? Reach out to us at info@synccomputing.com.