In recent years, we have seen the technology and tools for building a data stack change dramatically. Best practices are evolving at an ever-increasing pace, but fortunately, it's now easier than ever to build a world-class stack without massive investment. We've analyzed what we believe is the most efficient and scalable approach to building a stack in 2019.
We conducted more than 150 interviews across 100 companies of various sizes, industries, and maturities.
In each interview, we asked several types of questions.
To get to know each company and look for patterns, we started with qualification questions.
To gauge their process maturity, we dug into their daily routines and workflows.
Then we asked whether they felt pain points around data management.
Finally, we asked what their ideal solution would look like.
Of all the elements in a modern data stack, data warehouses have seen the most significant improvements in recent years. Highly scalable, managed cloud data warehouses allow terabytes of data to be transformed with a few lines of SQL and no infrastructure to maintain. And with on-demand pricing and only minutes of setup time, the technology is affordable for everyone. This has an impact on the entire data stack: most companies no longer need to worry about the cost of storing data.
Google BigQuery, Amazon Redshift, and Snowflake are currently the best offerings, with low storage costs, high processing speeds, and on-demand pricing. Tools like Intermix can be used to optimize your Redshift warehouse for performance.
Before the cloud data warehouse, data pipelines followed the ETL process: extract data from the underlying systems, transform it into a usable format, then load it into your analytical database. The cloud data warehouse allows teams to defer the transformation step until after loading, so most teams now follow an ELT process instead.
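To make the "T" of ELT concrete, here is a minimal sketch (the schema, table, and column names are hypothetical): raw data is loaded into the warehouse untouched, then reshaped into an analytics-ready table with plain SQL.

```sql
-- Minimal ELT sketch: data is loaded as-is into a "raw" schema, then cleaned
-- with SQL inside the warehouse. Schema, table, and column names are hypothetical.
CREATE TABLE analytics.orders AS
SELECT
    id                       AS order_id,
    customer_id,
    CAST(created_at AS DATE) AS order_date,
    amount_cents / 100.0     AS amount_usd
FROM raw.shop_orders         -- loaded verbatim from the source system
WHERE status = 'completed';
```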
In fact, I like to break down the data pipeline into four stages: collection, loading, transformation, analysis.
The benefits of moving the transformation step into the warehouse are significant.
A new open-source tool called DBT allows anyone comfortable with SQL to own analytics engineering (i.e., ELT).
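For illustration, here is a minimal sketch of a DBT model (the model and column names are hypothetical): a plain SQL SELECT statement that DBT materializes in the warehouse, with dependencies declared through ref().

```sql
-- models/orders_enriched.sql  (hypothetical model and column names)
-- A DBT model is just a SELECT saved as a .sql file; DBT materializes it in
-- the warehouse and resolves {{ ref(...) }} to the upstream models it depends
-- on, building the dependency graph for you.
SELECT
    o.order_id,
    o.customer_id,
    o.order_date,
    c.signup_channel
FROM {{ ref('stg_orders') }} AS o
LEFT JOIN {{ ref('stg_customers') }} AS c
    ON o.customer_id = c.customer_id
```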
At a minimum, there are two types of data that need to be tracked: event data and transactional data.
Segment and Snowplow are popular choices that make it easy for engineering teams to track important user events.
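The exact schema depends on the tracking tool, but once tracked events reach the warehouse they are simply rows you can query; the layout below is a simplified, hypothetical example.

```sql
-- Hypothetical layout of tracked events after they land in the warehouse;
-- real table and column names depend on the tracking tool's schema.
SELECT
    user_id,
    event_name,       -- e.g. 'signed_up', 'checkout_completed'
    event_timestamp,
    properties        -- usually a JSON payload of event-specific fields
FROM raw.tracked_events
WHERE event_name = 'checkout_completed';
```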
Beyond event data and transactional data, your company likely has access to many more sources of data: Google Analytics, Shopify, ad spend data, Intercom, and so on. The more of these sources you can integrate into the data stack, the better: loading each of them into your cloud warehouse will enable your team to perform comprehensive analytics.
Stitch and Fivetran are two popular services that make it easy to load hundreds of data sources. Setting up scheduled ingestion jobs can take as little as an hour, and you shouldn't need to write any code.
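As a sketch of why this matters (all table names below are hypothetical), once several sources sit side by side in the warehouse they can be combined with ordinary SQL, for example to line up daily ad spend against daily revenue:

```sql
-- Hypothetical example: combine a loaded ad platform source with shop data.
WITH daily_spend AS (
    SELECT spend_date, SUM(cost_usd) AS ad_spend
    FROM raw.ad_spend
    GROUP BY spend_date
),
daily_revenue AS (
    SELECT order_date, SUM(amount_usd) AS revenue
    FROM analytics.orders
    GROUP BY order_date
)
SELECT
    s.spend_date,
    s.ad_spend,
    r.revenue
FROM daily_spend AS s
LEFT JOIN daily_revenue AS r
    ON r.order_date = s.spend_date
ORDER BY s.spend_date;
```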
For data to be usable, it must first be transformed into clean, descriptive, reliable, and easily searchable datasets.
The transformation layer is where the uniqueness and complexity of your business are introduced. Transforming raw data into datasets that make sense for your business requires skill, and your transformation rules will change over time. More and more data is being generated, so datasets must be continually updated and maintained. For these reasons and many others, the best data teams now integrate software engineering best practices in their transformations: testing, error reporting, version control, code review, and alerts.
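One concrete example of this mindset (table and column names are hypothetical): data tests are often written as SQL queries that return zero rows when the data is healthy, so any returned row is a failure that can trigger an alert.

```sql
-- Hypothetical data-quality test: every completed order must have a positive
-- amount and reference an existing customer. Any row returned is a failure.
SELECT o.order_id
FROM analytics.orders AS o
LEFT JOIN analytics.customers AS c
    ON o.customer_id = c.customer_id
WHERE o.amount_usd <= 0
   OR c.customer_id IS NULL;
```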
Until recently, Airflow, DBT, and Luigi were popular choices to manage transformations. However, these services require teams to maintain the supporting infrastructure, which turns out to be time-consuming, challenging, and ultimately frustrating.
Data is at the heart of decision making in modern business. To be truly data-driven, a business can't rely on analysts manually generating reports, because that doesn't scale. Transforming raw data into meaningful, understandable, easy-to-use datasets is the first step. The best companies back that up with BI solutions that allow everyone to self-serve their data requests. Employees should know where to find the data they need, be able to access it quickly, and trust the data and insights they find.
Looker is an example of a BI tool built with this contemporary approach to data in mind, and we see it being used by a large share of companies with an advanced approach to data.
Startups are seeing explosive growth in both the amount of data and the number of internal data resources: tables, dashboards, reports, metric definitions, etc. At the same time, their growth demands that they recruit new people every week or month. On one hand, the growth in data resources is healthy and reflects heavy investment in data tooling to promote data-informed decision making. On the other hand, it creates a new challenge: effectively navigating a sea of data resources of varying quality, complexity, relevance, and trustworthiness. To manage these data assets, tech companies have turned to fully automated data discovery and catalog tools.
Lyft's Amundsen is a well-known open-source option, but it can be quite technical to set up. Castor is a new plug-and-play, automated, collaborative data catalog that you can get up and running in five minutes.
The best and most popular combo we've seen in modern scale-ups is the following: Fivetran for loading, Snowflake as the warehouse, DBT for transformation, Looker for BI, and Castor for data documentation.
Castor is a data documentation tool for the Notion-Figma-Slack generation. Or data-wise, for the Fivetran-Looker-Snowflake-DBT aficionados. We designed our catalog to be easy to use, automated, delightful, and friendly.
Want to check it out? Reach out to us and we will show you a demo.