As a quick intro: choosing your data warehouse provider can be overwhelming. You can either choose blindly by trusting your data engineering friend who "knows his sh*t", or spend weeks benchmarking the features of those services. I am here to provide a third alternative: a series of articles comparing these services on specific topics.
In this series, you'll find a fair feature comparison between Redshift, BigQuery and Snowflake. I will try as much as possible to provide unbiased comparisons. You won't find any recommendation to choose one service over another, nor performance comparisons (which are marginal anyway).
You'll find all the articles in the series, and the topics they tackle, below:
Let’s get started!
Query Processing (or Compute layer)
The compute layer of your workload manages requests from external systems, controls access, and ensures that requests are appropriately authorized. It contains the runtime environment in which your business logic is deployed and executed.
BigQuery: Runs on Dremel, a large multi-tenant cluster that executes Standard SQL queries. BigQuery is orchestrated via Borg, Google’s cluster manager and the precursor to Kubernetes. Compute and storage talk to each other through the petabit-scale Jupiter network.
Redshift: A proprietary fork of ParAccel (which was partly forked from Postgres) running on AWS virtual machines. Its Postgres ancestry shouldn't fool you: the two don't have much in common anymore.
Snowflake: A proprietary compute engine with intelligent predicate pushdown and smart caching, running on commodity virtual machines (AWS, GCP or Azure) depending on your cloud choice. A hybrid columnar system inspired by C-Store and MonetDB, among others.
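To give an intuition for what "predicate pushdown" buys you, here is a minimal, hypothetical sketch in Python (the columns and data are made up, and a real engine works on compressed blocks, not lists): instead of materializing full rows and filtering afterwards, the predicate is evaluated against the filter column first, so the engine only reads the matching positions from the other columns.

```python
# Toy columnar table: each column is stored as its own list,
# mimicking how a columnar engine lays out data on disk.
table = {
    "region": ["EU", "US", "EU", "APAC", "US"],
    "revenue": [120, 300, 80, 50, 410],
}

def scan_then_filter(table, predicate):
    """Naive plan: materialize full rows, then filter (no pushdown)."""
    rows = list(zip(table["region"], table["revenue"]))
    return [rev for region, rev in rows if predicate(region)]

def scan_with_pushdown(table, predicate):
    """Pushed-down plan: evaluate the predicate on the filter column
    first, then read only the matching positions from other columns."""
    matching = [i for i, region in enumerate(table["region"]) if predicate(region)]
    return [table["revenue"][i] for i in matching]

# Both plans produce the same result; the pushed-down one touches
# fewer values in the non-filter columns.
print(scan_with_pushdown(table, lambda r: r == "EU"))  # [120, 80]
```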
Data Storage (or Storage layer)
All 3 databases have implementations of hot / warm / cold storage. This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types of entities can be found here: data warehouse, data mart, and operational data store (ODS). In any given system, you may have just one of the three, two of the three, or all three types.
BigQuery: Proprietary, stored on the Colossus filesystem using the Capacitor columnar format (which replaced ColumnIO). Complete separation of distributed compute and storage.
The concept of “platforms” is a key principle of technical infrastructure at Google, and Colossus is the unified global platform for much of storage, from Spanner to Bigtable, from internal object stores to Cloud Storage, and even Compute Engine Persistent Disks. Colossus helps address encryption, low-level security primitives, global scale, durability, performance, hot-spotting, cross-data center replication, maintenance, upgrades, and so on.
Redshift: Proprietary storage, generally SSD (dc1, dc2), HDD (ds1, ds2), or mixed, including S3-backed managed storage (for RA3), using a proprietary columnar format. RA3 separates compute and storage, whereas all other node types colocate your compute and storage together.
Snowflake: The database storage layer resides in a scalable cloud storage service, such as Amazon S3, which handles data replication, scaling, and availability without any management by customers. Snowflake optimizes and stores data in a columnar format within this layer, organized into databases as specified by the user.
Data Compression
Data compression is a reduction in the number of bits needed to represent data. Compressing data can save storage capacity, speed up file transfer, and decrease costs for storage hardware and network bandwidth.
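As a quick, hedged illustration of why this matters, here is a small Python sketch using the standard-library zlib module (the payload is made up): a repetitive, low-cardinality payload, the kind of data warehouse columns are full of, shrinks dramatically, which is exactly what the columnar formats described below exploit.

```python
import zlib

# Low-cardinality, repetitive data -- typical of a warehouse column.
payload = ("EU," * 1000 + "US," * 1000).encode("utf-8")

compressed = zlib.compress(payload, level=6)

print(f"raw: {len(payload)} bytes, compressed: {len(compressed)} bytes")

# Compression is lossless: decompressing recovers the original bytes.
assert zlib.decompress(compressed) == payload
```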
BigQuery: Proprietary compression managed by the columnar format. BigQuery compresses data under the hood for you on an ongoing basis. Yet, your queries are still billed as if you were scanning uncompressed bytes. In April 2016, Mosha Pasumansky described Capacitor, BigQuery’s next-generation columnar storage format, which replaced ColumnIO (the format on which Parquet and ORC are based).
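Because BigQuery bills on uncompressed (logical) bytes scanned, the compression ratio never shows up in your bill. A back-of-the-envelope sketch in Python, assuming the long-standing on-demand rate of $5 per TB scanned (check current pricing for your region; the figure here is only an illustration):

```python
TB = 1024 ** 4  # one tebibyte in bytes

def on_demand_cost_usd(logical_bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """BigQuery on-demand pricing is based on logical (uncompressed)
    bytes scanned, regardless of how small the data is on disk."""
    return logical_bytes_scanned / TB * price_per_tb

# A query scanning 2 TB of logical data costs the same whether the
# underlying Capacitor files are 10x or 100x smaller on disk.
print(on_demand_cost_usd(2 * TB))  # 10.0
```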
Redshift: Achieves transparent compression by implementing open algorithms, e.g., LZO and ZStandard, and has also released its own proprietary compression algorithm (AZ64). You can select which columns you would like compressed and how, and you can generate recommendations for the best compression encoding with the ANALYZE COMPRESSION command.
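A minimal sketch of what that looks like in practice. The table name is hypothetical, and actually running the statement requires a live Redshift connection through your client of choice; Python is used here only to build and display the statement.

```python
def analyze_compression_sql(table: str, sample_rows: int = 100_000) -> str:
    """Build a Redshift ANALYZE COMPRESSION statement.

    COMPROWS caps how many rows Redshift samples per slice when
    estimating the best encoding for each column. Execute the result
    with your Redshift client (psycopg2, the AWS console, etc.).
    """
    return f"ANALYZE COMPRESSION {table} COMPROWS {sample_rows};"

# "public.events" is a made-up table name for illustration.
print(analyze_compression_sql("public.events"))
# ANALYZE COMPRESSION public.events COMPROWS 100000;
```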
Snowflake: Snowflake also provides its own compression layer, which is opaque to the user. Unlike BigQuery, you aren’t billed for bytes scanned, and the query planner can take advantage of the compression and table statistics to scan less data (and reduce compute cost). There’s little you can do to tune this externally, but that’s a pro rather than a con.
Deployment Options
All of these databases are cloud-only deployments. If you have to run on-premise, you’re going to be a little stuck for the moment unless you plan to migrate that data into the cloud.
BigQuery: Cloud only, within Google Cloud Platform. Each dataset can be placed in a multi-region (US or EU), or you pick a specific region from a list.
Redshift: Cloud only, within Amazon Web Services. You can pick from pretty much every AWS region (25 at the time of writing), set at the cluster level.
Snowflake: Cloud only, within Amazon Web Services, Google Cloud Platform, or Azure depending on your preferred flavor of cloud, set at the account level.
I hope you enjoyed these comparison/benchmarking points on query processing, storage, compression, and deployment options.
If you want more benchmarking points, you can find them in the following links:
Castor is building the documentation layer on top of your data warehouse. We help the data people in your company get all the context they need around data assets.