This article has been co-written with Mikkel Dengsøe, co-founder of Synq.
While hiring has slowed in most industries, many data teams are still growing. With tools such as dbt and Looker, creating new data models and dashboards for everyone to use has never been easier.
But this has come with a set of scaling challenges and complexities that Tristan Handy from dbt has articulated in the article The next big step forwards for analytics engineering.
The outline from dbt on the number of projects by model count confirms that many teams run large deployments.
As a rule of thumb, in our experience the complexity of the data stack starts to become difficult to manage at around 300 data models. Based on dbt's numbers, more than 50% of dbt projects, and the companies behind them, are potentially dealing with scaling challenges every day.
If we put the spotlight on 50 well-known scaleups, most have surpassed the 10-person data team mark. The size of the data team and the number of data models and dashboards are loosely correlated, and together they can be a good indicator of when complexity kicks in.
In this post we'll look into what happens across the lifecycle of larger data teams, from onboarding to deployment, monitoring and self-serve, and some steps you can take to overcome these challenges.
“Collaboration becomes harder as no one is familiar with the entire code base. Time spent in meetings goes up relative to time spent getting things done.”
As data teams grow, so does the challenge of onboarding new members.
A data person is considered operational when they have a clear understanding of the data projects and their evolution, the data the company owns, and where to find it.
When teams were dealing with fewer data assets, it was possible to walk a new hire through the relevant projects and dashboards in a matter of days. The sheer volume of data models and dashboards has now made this impossible.
In the past, a single person or team could document all the necessary tables and dashboards, or veterans within the team could offer guidance. However, as data teams scale, this siloed knowledge approach is no longer feasible.
Traditional methods of documentation also fall short as no one person knows all the data, making it increasingly difficult to onboard new hires and get them up to speed quickly.
“Velocity and agility have slowed, creating frustration both inside and outside the data team.”
One of the most painful realizations for data teams is the gradual decline of development speed. This slows the pace at which new ideas and hypotheses are brought to life and makes data practitioners less productive.
“We want to build a world where any data scientist can have an idea on the way to work, and explore it end-to-end by midday.” - Monzo, UK scale-up
A few things start to happen as the data team grows.
“Quality becomes harder to enforce over a growing surface area and user-reported errors increase.”
It gets exponentially more difficult to ensure the quality of your data as you scale, and monitoring issues is especially hard. Monitoring is in many ways akin to the broken windows theory from criminology: if a window in a building is left broken, residents start to see that things are falling apart, stop caring about the rest, and soon everything else degrades too.
In the early days, data tests are carefully implemented where they make sense, and each alert is addressed and resolved, often within the same day.
As the data team scales it's not uncommon for the Slack channel to look like a jammed traffic intersection with dozens of new alerts each morning. Ownership of issues becomes unclear, and in some cases many of them go unaddressed. This has several negative consequences.
“As pools of confidential data grow, more users will want access. Data owners don't want users to take more from the pool than authorized, especially in tightly regulated industries.” Forbes.
Scaling data teams can make self-service analytics practically impossible due to two primary barriers: a psychological barrier and a data accessibility one.
The psychological barrier arises from the overwhelming number of dashboards and tools that come with an increasing number of data projects. Business users are expected to leverage data on their own, but the sheer volume of information makes it challenging for stakeholders to determine which data assets are relevant, which dashboard to use, or what calculation to apply for specific metrics.
The accessibility barrier relates to the need to control data access in accordance with tightening data regulations, such as GDPR or CCPA. Currently, many organizations designate a single person or team responsible for managing data access and keeping data under control. With today's data volumes, it is impossible for a single team to manage data access without hindering self-service.
Additionally, the pipelines run a bit slower every day meaning that stakeholders can only consume up-to-date data around midday. You have so many dependencies that you no longer know what depends on what. Before you know it you find yourself in a mess that’s hard to get out of. That upstream data model with hundreds of downstream dependencies is made 30 minutes slower by one quirky join that someone made without knowing the consequences. Your data pipeline gradually degrades until stakeholders start complaining that data is never ready before noon. At that point you have to drop everything to fix it and spend months on something that could have been avoided.
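To catch this kind of gradual degradation before stakeholders notice, you can track per-model runtimes and flag regressions automatically. A minimal sketch, assuming you can export a hypothetical history of runtimes per model from your orchestrator:

```python
from statistics import mean

def flag_slowing_models(run_history, threshold=1.5, window=7):
    """Flag models whose recent average runtime exceeds their
    baseline average by `threshold`x.

    run_history: dict mapping model name -> list of runtimes in
    seconds, ordered oldest to newest (hypothetical data shape).
    Returns (model, slowdown_ratio) pairs.
    """
    flagged = []
    for model, runtimes in run_history.items():
        if len(runtimes) <= window:
            continue  # not enough history to form a baseline
        baseline = mean(runtimes[:-window])  # everything before the recent window
        recent = mean(runtimes[-window:])    # the last `window` runs
        if recent > threshold * baseline:
            flagged.append((model, round(recent / baseline, 2)))
    return flagged
```

Running a check like this daily turns "the pipeline feels slower" into a concrete list of models to investigate, before the slowdown compounds.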
The most impactful initiative to declutter your data stack is getting rid of data assets that are not needed. While there's no set definition of an unneeded data asset, signs you can look out for are:
Decluttering a bloated data stack is no easy task, and if possible you should have an ongoing investment in getting rid of unused tables, data models, columns and dashboards. However, many scaling companies find themselves on the back foot and only address this once they have thousands of data models and dashboards.
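A practical starting point is to mine your warehouse query logs for assets nobody has touched in months. A minimal sketch, assuming a hypothetical `last_accessed` mapping pulled from those logs:

```python
from datetime import date, timedelta

def find_unused_assets(last_accessed, today, max_age_days=90):
    """Return assets not queried within `max_age_days`.

    last_accessed: dict mapping asset name -> date of its last
    query, e.g. extracted from warehouse access logs
    (hypothetical shape). The 90-day cutoff is an illustrative
    default, not a universal rule.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        asset for asset, seen in last_accessed.items() if seen < cutoff
    )
```

The output becomes a candidate list for deprecation, which you would still review with asset owners before deleting anything.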
Here's a real-life story of how a scaling company dramatically reduced its number of Looker dashboards, which had made self-serve near impossible.
You can apply many of the same steps if you're working on decluttering your data models in dbt. We've seen teams particularly benefit from having a well-mapped column-level lineage that extends from dbt to their BI tool. This helps you quickly and confidently assess the full downstream impact of deleting a data model or column.
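If you can export that lineage as a graph, the downstream impact check reduces to a graph traversal. A sketch assuming a hypothetical adjacency-list representation of the lineage:

```python
def downstream_impact(lineage, node):
    """Return every asset downstream of `node` in a column- or
    model-level lineage graph.

    lineage: dict mapping an asset to the list of assets that
    directly depend on it (hypothetical adjacency-list shape,
    e.g. built from dbt manifest + BI tool metadata).
    """
    seen, stack = set(), [node]
    while stack:  # iterative depth-first traversal
        for child in lineage.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return sorted(seen)
```

An empty result means the asset has no downstream consumers and is a strong deletion candidate; a long result tells you exactly who to warn first.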
Another step to keep self-service on your warehouse viable is to make sure that your data assets are documented with the right context.
There's no one-size-fits-all solution for keeping documentation of data assets up to date, but the most common options are leveraging yml files in dbt or using a data catalog such as Castor.
We often see teams benefit from explicitly defining the following metadata:
The biggest challenge is for teams to be consistent and rigorous when documenting the data. It’s a common pitfall that teams start out with good intentions when documenting data assets but gradually stop doing it.
First, data documentation should be crowdsourced so that it comes from the people closest to the data. You can use tools such as dbt pre-commit hooks to enforce that a specific set of metadata is documented before a model is committed to the code base. Letting documentation slip "just this once" is a slippery slope that often leads to people thinking that documentation is not important.
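The enforcement itself can be a small script run as a pre-commit hook. A sketch assuming the `models:` section of a dbt schema.yml has already been parsed into Python dicts; the required keys below are an illustrative choice, not a standard:

```python
REQUIRED_KEYS = ("description", "owner")  # assumed required metadata

def missing_metadata(models):
    """Return (model_name, missing_key) pairs for models that lack
    required metadata.

    models: the parsed `models:` section of a dbt schema.yml,
    a list of dicts (e.g. loaded with a YAML parser). Keys may
    live at the top level or under the model's `meta` block.
    """
    problems = []
    for model in models:
        # merge top-level fields with the meta block for lookup
        merged = {**model, **model.get("meta", {})}
        for key in REQUIRED_KEYS:
            if not merged.get(key):
                problems.append((model["name"], key))
    return problems
```

A hook that exits non-zero when this list is non-empty blocks undocumented models from ever reaching the main branch, which is far easier than backfilling documentation later.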
Second, the data assets should be discoverable so everyone has one place to go to see the latest definitions. Good places for this can be dbt docs or your data catalog tool. Traditionally, documentation has been siloed in various tools, making it challenging for users to find the information they need. By gathering all data documentation in a centralized repository such as a data catalog and pushing it back to all the relevant tools, users can access the information they need without leaving their native work environment. This approach ensures that users have access to the most up-to-date information and can work more efficiently.
If you treat all issues as equally important, you're likely doing something wrong. Some are minor issues without any material business impact; others mean you have to drop everything to fix a critical problem.
Dealing with these two types of issues requires a very different approach, but too often, this is not explicitly defined. This leads to negative side effects such as important issues not being acted on fast enough or non-important data issues derailing the data team.
We recommend using three parameters to assess the severity of an issue:
You should aim to be able to assess all three within 5 minutes of being made aware of an issue.
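In code, such a triage can be a simple decision function. The parameters and thresholds below (downstream usage, asset criticality, whether a delivery SLA is blocked) are illustrative assumptions for the sketch, not the exact parameters from the guide:

```python
def severity(n_downstream_users, asset_is_critical, blocks_sla):
    """Map three illustrative triage parameters to a severity level.
    Parameters and thresholds are assumptions, not a standard.
    """
    if asset_is_critical and blocks_sla:
        return "sev-1"  # drop everything and fix now
    if asset_is_critical or n_downstream_users > 10:
        return "sev-2"  # fix within the day
    return "sev-3"  # schedule alongside normal work
```

The point is not the exact thresholds but that the decision is written down: anyone on call can triage an alert the same way within the five-minute target.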
Read the full Designing severity levels for data issues guide for concrete steps to get started.
Your first resort should be explicit controls such as dbt tests, as these can help cover gaps and tightly couple your business knowledge to expectations of the data. Adding checks to automatically detect anomalies in your data can be helpful for learning about issues that your explicit controls may not capture.
Anomaly detection controls can help you detect issues across quality, freshness, volume and schema issues.
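A minimal volume monitor can be as simple as a z-score check on daily row counts. A sketch under that assumption; production-grade monitors also account for trend and seasonality:

```python
from statistics import mean, stdev

def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` (e.g. today's row count for a table) if it
    deviates more than `z_threshold` standard deviations from the
    historical mean. A deliberately minimal volume check.

    history: list of past daily values (hypothetical data shape).
    """
    if len(history) < 2:
        return False  # not enough history for a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu  # constant history: any change is anomalous
    return abs(latest - mu) / sigma > z_threshold
```

Because these checks learn from history rather than from explicit assertions, they catch the "unknown unknowns" that a hand-written dbt test never anticipated.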
If you’ve already got a lot of issues that are being flagged from your dbt tests, the last thing you want is another in-flow of alerts to deal with.
We recommend that you take the following steps if you want to get started with anomaly detection tests.
Too many teams haven’t adjusted their alerting workflow to fit their scale. If you’re using an orchestration tool like Airflow you can build bespoke alerts that fit your use cases.
Here are some steps we've seen the teams who are best in control of their alerting workflow take:
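One of the simplest adjustments is routing alerts by ownership and severity instead of dumping everything into a single channel. A sketch with a hypothetical alert shape and channel names:

```python
def route_alert(alert, owners, default_channel="#data-alerts"):
    """Pick a Slack channel for an alert based on model ownership
    and severity.

    alert: dict with at least a "model" key and optionally a
    "severity" key (hypothetical shape).
    owners: dict mapping a model to its owning team's channel
    (hypothetical mapping, e.g. derived from dbt meta tags).
    """
    channel = owners.get(alert["model"], default_channel)
    if alert.get("severity") == "critical":
        # critical issues go to one shared, high-attention channel
        channel = "#data-incidents"
    return channel
```

With routing like this, each team only sees the alerts it can act on, and the jammed-intersection channel disappears.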
There's no shortage of ways to measure data quality (or of opinions on how it should be done). In reality, it often comes down to the business context of the company you work in, as well as the expectations the business has of the data. To stay ahead of data gradually running slower and issues going unaddressed, these three metrics can be a good starting point.
In this post we've looked at the challenges data teams face at scale around onboarding, development, monitoring and self-serve. If you find yourself struggling with challenges in these areas as you scale, we suggest exploring the following actions:
We write about all the processes involved when leveraging data assets: from the modern data stack to data teams composition, to data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data.
At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation.
Or, data-wise, for the Fivetran, Looker, Snowflake, dbt aficionados. We designed our catalog software to be easy to use, delightful and friendly.
Want to check it out? Reach out to us and we will show you a demo.