As a quick intro: choosing a data warehouse provider can be overwhelming. You can either choose blindly by trusting your data engineering friend who "knows his sh*t", or spend weeks benchmarking the features of those services yourself. I am here to offer a third option: a series of articles comparing these services on specific topics.
In this series, you'll find a fair feature comparison between Redshift, BigQuery, and Snowflake. I will try as much as possible to stay unbiased. You won't find any recommendation to pick one service over another, nor raw performance comparisons (the differences are marginal anyway).
You'll find all the articles of the series and the topics tackled below:
Below, I've listed the factors you should typically consider when evaluating pricing for each warehouse. Most of the products have a fixed cost component, a variable cost component, or both, depending on which option or plan you choose.
Since Google BigQuery pricing is entirely usage-based, there are only three core aspects of your BigQuery usage to consider when estimating costs:
Active storage (~2 cents per GB/month for warm data)
Long-term storage (~1 cent per GB/month for colder data, i.e. tables untouched for 90 days)
Query usage (bytes scanned)
The on-demand pricing model is based on bytes scanned, while purchasing fixed slots (reservations) has a defined cost per block of 500 / 1,000 slots. Costs vary by region. You are only charged for the queries you run (or for the auto-refreshing Tableau dashboard your analyst set up).
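As a back-of-the-envelope illustration, here is a minimal cost estimator. The rates below are indicative US list prices and vary by region and over time, so treat them as assumptions:

```python
# Back-of-the-envelope BigQuery cost estimator.
# Rates are illustrative assumptions -- check your region's current pricing.
ACTIVE_STORAGE_PER_GB = 0.02      # $/GB-month, "warm" data
LONG_TERM_STORAGE_PER_GB = 0.01   # $/GB-month, tables untouched for 90 days
ON_DEMAND_PER_TB_SCANNED = 5.00   # $/TB of bytes scanned by queries

def monthly_cost(active_gb, long_term_gb, tb_scanned):
    """Estimate a monthly BigQuery bill from storage footprint and query volume."""
    storage = (active_gb * ACTIVE_STORAGE_PER_GB
               + long_term_gb * LONG_TERM_STORAGE_PER_GB)
    queries = tb_scanned * ON_DEMAND_PER_TB_SCANNED
    return storage + queries

# 500 GB warm, 2 TB cold, 10 TB scanned per month:
print(monthly_cost(500, 2000, 10))  # → 80.0
```

Note how the query component dominates even a modest workload: this is why bytes scanned, not storage, is usually the number to watch on BigQuery.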
Redshift pricing is on-demand and a la carte: nodes are simply priced by usage per hour. Other components, such as Redshift Managed Storage or Redshift Spectrum, are priced separately.
Pricing depends on the node type (DS2 / DC2 / RA3) and the number of nodes, and you can save money by reserving capacity and paying in advance.
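To make the on-demand vs. reserved trade-off concrete, here is a rough sketch. The hourly rates and the 35% reserved discount are illustrative assumptions, not actual AWS prices:

```python
# Sketch of Redshift cluster cost: rate × nodes × hours.
# Hourly rates and the reserved discount are made-up illustrative figures.
HOURLY_RATE = {"dc2.large": 0.25, "ds2.xlarge": 0.85, "ra3.xlplus": 1.09}
RESERVED_DISCOUNT = 0.35  # assumed 1-year upfront commitment discount

def monthly_cluster_cost(node_type, num_nodes, hours=730, reserved=False):
    """Approximate monthly cost for a homogeneous Redshift cluster."""
    rate = HOURLY_RATE[node_type]
    if reserved:
        rate *= 1 - RESERVED_DISCOUNT
    return rate * num_nodes * hours

# 4 dc2.large nodes, running the whole month:
print(monthly_cluster_cost("dc2.large", 4))  # → 730.0
print(round(monthly_cluster_cost("dc2.large", 4, reserved=True), 2))
```

The point of the sketch is the shape of the bill, not the numbers: with coupled nodes, cost scales linearly with node count whether you need more compute or more storage.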
Snowflake's pricing differs in two ways. First, Snowflake has created pre-defined editions: Standard, Enterprise, Business Critical, and Virtual Private Snowflake (VPS). The editions vary in the cloud services provided, including encryption levels, materialized views, compliance features, and more.
Second, Snowflake uses a credit system instead of straightforward usage-per-hour pricing, and this is where its pricing gets tricky: depending on the size of your warehouse, a cluster can consume anywhere from 1 to 128 credits per hour.
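Credit consumption roughly doubles with each warehouse size. A small sketch, with an assumed price per credit (the actual $/credit depends on your edition, cloud, and region):

```python
# Snowflake warehouse sizes consume credits at a rate that doubles per size.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8,
                    "XL": 16, "2XL": 32, "3XL": 64, "4XL": 128}
CREDIT_PRICE = 3.00  # $/credit -- an illustrative assumption

def warehouse_cost(size, hours_running):
    """Estimate the bill for running a warehouse of `size` for some hours."""
    return CREDITS_PER_HOUR[size] * hours_running * CREDIT_PRICE

# A medium (M) warehouse running 100 hours in a month:
print(warehouse_cost("M", 100))  # → 1200.0
```

Because you pay per hour the warehouse is *running*, not per query, the lever for controlling Snowflake spend is keeping warehouses suspended when idle.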
As for Enterprise pricing, discounts below the publicly advertised list prices are possible. Generally, it's the vendor who decides whether to offer discounts and credits, based on whether it considers your account worth the investment. It mostly depends on 1) your commitment and 2) your budget size.
You can expect most of the queries you run to return within 30 seconds, if not faster. It's possible to set cost controls and quotas in BigQuery, but they lack granularity: for instance, you can only set quotas globally at the user, service account, or project level.
Traditional Redshift nodes — the storage-heavy DS2 and compute-intensive DC2 nodes — have coupled storage and compute. That makes scaling your clusters more time- and resource-intensive. Every time you need to scale your compute, you also need to add storage capacity and vice versa.
In response to Snowflake, Redshift introduced de-coupled nodes — the RA3 — in December 2019. Like Snowflake, RA3 nodes provide virtually instant and infinite scalability. Users can now pay for computing by the hour, which enables limited-time spikes in usage. RA3 nodes also allow for instant storage upgrades without adding compute resources.
Redshift introduced pause/resume semantics in March 2020: you can 'shut down' a cluster for a period of time to reduce cost. This can be done manually (via the API or console) or scheduled for certain times. There is currently no smart sleep/resume functionality based on workloads. Pause/resume typically completes in the time it takes to make a cup of tea (4-10 minutes, in our testing), depending on your node type and snapshot size.
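To see why pause/resume matters for the bill, a quick hypothetical: pausing stops compute charges while the cluster sleeps (storage is still billed, which this sketch ignores for simplicity):

```python
# Rough savings from pausing a Redshift cluster on a fixed schedule.
# Only compute charges stop while paused; storage costs are ignored here.
def paused_savings(hourly_compute_cost, paused_hours_per_week, weeks=4):
    """Compute charges avoided by keeping the cluster paused."""
    return hourly_compute_cost * paused_hours_per_week * weeks

# Pausing a $5/hour cluster 12 hours a night plus weekends
# (12 * 5 weeknights + 48 weekend hours = 108 hours/week):
print(paused_savings(5.0, 108))  # → 2160.0
```

For a dev or BI cluster that is only busy during office hours, a simple pause schedule can cut the compute bill by more than half.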
The added benefit of Redshift, in terms of scalability, is that users can seamlessly pair de-coupled RA3 nodes with the less-expensive DS2 and DC2 nodes. You can mix-and-match resources to create the optimal data warehouse cluster.
Most node types scale up to 100 nodes – anything beyond that is asking for trouble. Queries can take anywhere from milliseconds to hours to complete, but this is getting better with features like workload management, concurrency scaling, short query acceleration (SQA), and Advanced Query Accelerator (AQUA, RA3 nodes only).
Snowflake was built with a fundamentally different architecture than Redshift. It de-coupled storage and compute functions to provide near-instantaneous scalability. This provided a major advantage for Snowflake and made it a viable Redshift competitor.
Snowflake offers pause/resume semantics, both manual and automated based on workload. Pause/resume typically completes in less than 60 seconds and can be tuned to wake or sleep based on workload activity. Queries typically complete within seconds to under 5 minutes. It's fast to change between warehouse sizes (e.g., small (S) => medium (M)), which roughly correlate to the number of vCPUs and amount of memory you get, and you can resize a cluster manually if required. Scaling policies can adjust the number of clusters automagically according to running workloads, in either standard or economy mode. It's a particularly elegant implementation of a scaling mechanism.
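As a toy illustration of why automated suspend matters, compare an always-on warehouse with one that sleeps when idle (the $3/credit price is again an assumption, as in the earlier pricing sketch):

```python
# Toy comparison of an always-on warehouse vs. one with auto-suspend,
# assuming the warehouse is only actually needed `active_hours` per day.
CREDIT_PRICE = 3.00  # $/credit -- illustrative assumption

def daily_cost(credits_per_hour, active_hours, auto_suspend=True):
    """Daily cost: with auto-suspend you bill only while active."""
    hours_billed = active_hours if auto_suspend else 24
    return credits_per_hour * hours_billed * CREDIT_PRICE

# A medium (~4 credits/hour) warehouse busy 6 hours a day:
print(daily_cost(4, 6))         # suspended when idle → 72.0
print(daily_cost(4, 6, False))  # always-on → 288.0
```

The gap widens with warehouse size, which is exactly why Snowflake's automated wake/sleep behavior is the centerpiece of its cost story.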
I hope you enjoyed reading these comparison/benchmarking points regarding:
If you want more benchmarking points, you can find them at the following links:
Castor is building the documentation layer on top of your data warehouse. We help data people across your company get all the context around their data assets.