In recent years, we have seen the technology and tools for building a data stack change dramatically. Best practices are evolving at an ever-increasing pace, but fortunately, it's now easier than ever to build a world-class stack without massive investment. We've analyzed what we believe is the most efficient and scalable approach to building a stack in 2019.
We conducted more than 150 interviews in 100 companies of various sizes, industries, and maturities.
In each interview, we asked several types of questions.
To get to know the companies and identify patterns, we asked qualification questions:
We wanted to understand how they organized their data infrastructure. What types of data did they gather, and for what purpose? Roughly what volumes of data did they process?
How big was their team? How fast did they grow as companies? What was their data growth rate?
To gauge their process maturity, we dug into their daily routines and workflows:
How do they access/share data? What software are they using? Is it part of a process?
How do they establish trust in their data? Do they write tests?
How do they prioritize data assets? Is there effort in cost/resources management?
Then, we wondered whether they felt pain points in the data management area.
What types of problems did they face: storage, discovery, quality, problem-solving?
How important were those problems? What was their financial impact?
Finally, we asked whether they had any ideas about what the perfect solution would look like:
We started with basic questions about their tools.
Then we moved on to potential solutions that aren't yet on the market.
The cloud-first data warehouse
Of all the elements in a modern data stack, data warehouse systems have seen the most significant improvements in recent years. Highly scalable, managed cloud data warehouses allow terabytes of data to be transformed with a few lines of SQL, with no infrastructure to manage. And with on-demand pricing, the technology is affordable for everyone, with only minutes of setup time. This has an impact on the entire data stack:
Most companies don’t need to worry about the costs of storing data
Most data preparation and transformation can be run in the warehouse, using SQL
BI and reporting tools can serve from the warehouse, in real-time
Google BigQuery, Amazon Redshift, and Snowflake are currently the best offerings, with low storage costs, high processing speeds, and on-demand pricing. Tools like Intermix can be used to optimize your Redshift warehouse for performance.
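As an illustration, here is what an in-warehouse transformation might look like: a few lines of SQL that aggregate raw events into a daily-active-users table. The schema, table, and column names below are hypothetical.

```sql
-- Hypothetical example: build a daily-active-users table directly
-- in the warehouse, with no external processing cluster required
CREATE OR REPLACE TABLE analytics.daily_active_users AS
SELECT
    DATE(event_timestamp)   AS activity_date,
    COUNT(DISTINCT user_id) AS active_users
FROM raw.events
GROUP BY 1;
```

A job like this can run on a schedule inside the warehouse itself, which is what makes the "transformation in SQL" approach practical at terabyte scale.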
The rise of ELT
Before the cloud data warehouse, data pipelines followed the ETL process: extract data from underlying systems, transform it into a usable format, load it into your analytical database. The cloud data warehouse allowed teams to defer the transformation step to later, so teams now follow an ELT process.
In fact, I like to break down the data pipeline into four stages: collection, loading, transformation, analysis.
The benefits of moving the transformation step into the warehouse are significant:
Business logic can be removed from the extract and load stage. This allows teams to use off-the-shelf third-party vendors to collect and load their data with minimal investment of resources.
Business logic can be defined in SQL. Rather than having teams of engineers defining transformations in Java, Python, Scala, etc., the analyst working closely with the business teams can own and manage this logic.
A new open-source tool called DBT allows anyone comfortable with SQL to own analytics engineering (i.e., ELT).
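A minimal sketch of what a DBT model looks like: the analyst writes a plain SELECT statement, and DBT materializes it as a table or view in the warehouse. The model and column names here are assumptions for illustration.

```sql
-- models/orders_enriched.sql (hypothetical DBT model)
-- {{ ref() }} is compiled by DBT into the fully qualified table name,
-- which also lets DBT infer the dependency graph between models
SELECT
    o.order_id,
    o.ordered_at,
    o.amount_usd,
    c.customer_segment
FROM {{ ref('stg_orders') }} AS o
JOIN {{ ref('stg_customers') }} AS c
    ON o.customer_id = c.customer_id
```

Because the business logic lives in version-controlled SQL files, it can be reviewed, tested, and owned by analysts rather than a separate engineering team.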
At a minimum, there are two types of data that need to be tracked:
Event data: What are your users doing when they’re interacting with your product?
Transactional data history: To enable advanced analytics, you need to see not only the current state of the system but also how it’s changed over time. Ideally, you should be creating a new event for every change (insert, edit, delete) to your transactional database. If this isn’t possible, regular snapshots of the DB are a minimum requirement.
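If change events can't be emitted from the application, the fallback is a scheduled snapshot job. Here is a sketch, with hypothetical table names, of what a daily snapshot might look like in SQL:

```sql
-- Hypothetical daily snapshot: append today's state of the orders
-- table to a history table, tagged with the snapshot date
INSERT INTO snapshots.orders_history
SELECT
    *,
    CURRENT_DATE AS snapshot_date
FROM production.orders;
```

Run daily, this preserves how each row changed over time, not just its current state.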
There are many more sources of data (Google Analytics, Shopify, Adwords) available to companies today. The more sources that can be integrated into the data stack, the better.
As well as event data and transactional data, your company may have access to many more sources of data: Google Analytics, Shopify, Ad spend data, Intercom data etc. Loading each of these sources into your cloud warehouse will enable your team to perform comprehensive analytics.
Stitch and Fivetran are two popular services that make loading hundreds of data types easy. It can take as little as an hour to set up scheduled ingestion jobs, and you should be able to avoid needing to write any code.
For data to be usable, it must first be transformed into clean, descriptive, reliable, and easily searchable datasets.
The transformation layer is where the uniqueness and complexity of your business are introduced. Transforming raw data into datasets that make sense for your business requires skill, and your transformation rules will change over time. More and more data is being generated, so datasets must be continually updated and maintained. For these reasons and many others, the best data teams now integrate software engineering best practices in their transformations: testing, error reporting, version control, code review, and alerts.
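As a concrete example of testing in the transformation layer, DBT supports data tests written as plain SQL: a query that should return zero rows, failing the build otherwise. The table and column names below are hypothetical:

```sql
-- tests/assert_no_negative_amounts.sql (hypothetical DBT data test)
-- DBT fails the run if this query returns any rows
SELECT order_id, amount_usd
FROM {{ ref('orders_enriched') }}
WHERE amount_usd < 0
```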
Airflow, DBT, and Luigi have been popular choices for managing and orchestrating transformations. However, self-hosting these tools requires teams to maintain the supporting infrastructure, which turns out to be time-consuming, challenging, and ultimately frustrating.
Data is at the heart of decision making in modern business. To enable the business to be truly data-driven, the data analysis solutions can’t rely on analysts manually generating reports, because this doesn’t scale. Transforming raw data into meaningful, easy to use and understandable datasets is the first step. The best companies are backing that up with BI solutions that allow everyone to self-serve their data requests. Employees should know where to find the data they need, be able to quickly access it, and then trust the data and insights they find.
Looker is an example of a BI tool built with this contemporary approach to data in mind, and we see it being used by a large share of companies with an advanced approach to data.
Data documentation and GDPR compliance
Startups are seeing explosive growth in both the amount of data and the number of internal data resources: data tables, dashboards, reports, metric definitions, etc. Additionally, their growth demands recruiting new people every week or month. On one hand, the growth in data resources is healthy and reflects heavy investment in data tooling to promote data-informed decision making. On the other, it creates a new challenge: effectively navigating a sea of data resources of varying quality, complexity, relevance, and trustworthiness. To better manage these data assets, tech companies have developed fully automated data discovery/catalog tools.
Lyft's Amundsen is a well-known open-source option, but it can be quite technical to set up. Castor is a new plug-and-play, automated, collaborative data catalog that you can get up and running in five minutes.
The best and most popular combo we've seen in modern scale-ups is the following:
Snowflake as a data warehouse
DBT as an ETL/ELT tool
Looker or Tableau as data visualization tool
Castor or Amundsen to manage data discovery and GDPR issues