This article is the continuation of a series about The Data Experience, which we introduced in previous blog posts. In those, we outlined what The Data Experience is and explained why we think it's broken.
This piece outlines a plan for improving the Data Experience in a fragmented data ecosystem. But first, let’s remind ourselves what the Data Experience is.
When you're already ingesting, storing, and modeling data, what's next?
You have to build the experience on top of these existing tools. The Data Experience is how people feel when they interact with the company’s data. It’s the extent to which they can find, understand, use, and trust the data they manipulate.
A good Data Experience helps you leverage the tools you already have in place. It’s a secondary layer that cuts across your data stack and makes the first layer more efficient: it makes it easier and smoother for everyone involved to get what they need out of the data.
In a previous article, we defined three pillars for building a good Data Experience: Discovery, Community, and Health.
In this article, I describe how to build the right Data Experience in your organization by creating a strong foundation that relies on all three pillars. The purpose of this piece is to show how you can create this foundation in the fragmented data ecosystem we operate in. The key: think in terms of capabilities, not tools.
Many people make the mistake of approaching the Data Experience with a tools-first mindset. But this thought process doesn't work anymore, at least not if you want to solve complex problems with your data.
The tools-first approach fails because you can end up overwhelmed by all the tools on the market and over-invest in tools that don’t actually improve your Data Experience.
The only way to solve complex problems is by first identifying what capabilities are needed to support the three pillars of the Data Experience, and then figuring out which tools best deliver these capabilities.
This approach is better because you select tools based on the outcomes they support. It also lets you consolidate investment in one or two tools that cover the majority of the capabilities associated with a good Data Experience.
The right combination of capabilities will help you meet your goal of building a strong Data Experience. The wrong combination of tools can lead you to buy ten tools that only end up covering 20% of the capabilities you need. Going for one or two tools that cover 80% of your capabilities is the way to go. Thinking in terms of capabilities enables you to analyze and evaluate tools more accurately.
So what capabilities do you need in order to build Discovery, Health, and Community?
Unlocking Data Discovery means that people in your organization can find, understand, and use the data smoothly and efficiently. You should look for the following capabilities in terms of Discovery:
Data lineage: Trace data to its origin, what happens to it, and where it moves over time. Data Lineage is the process of understanding, recording, and visualizing data as it flows from data sources to consumption. Data Lineage is a component of both Health and Discovery. Tick this capability and you’re progressing on both pillars.
Search: Find data faster through a powerful search engine-like feature. Make it easy to find data assets using the metadata, glossary terms, classifications, and more.
Context: Enrich your data assets with the right context. Allow everyone in the company to understand assets right away. Find information on the table name, owner, purpose, last updates, frequent users, and tags.
Popularity: Automatically assign a popularity score to your data assets. Identify immediately the most popular tables.
Query: Make it easy for everyone to query the data, with or without code. Re-use queries from more experienced data people on your team.
Access controls: Control access to sensitive data and enforce compliance with data regulations.
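To make the lineage capability more concrete, here is a minimal sketch of what "tracing data to its origin" can look like once lineage is stored as a graph. The asset names and the `DOWNSTREAM` mapping are hypothetical, made up for illustration; a real catalog would derive these edges from query logs or pipeline metadata rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage edges: each key is an asset, each value lists the assets it feeds.
DOWNSTREAM = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["marts.revenue"],
    "staging.customers": ["marts.revenue"],
    "marts.revenue": ["dashboard.weekly_revenue"],
}


def invert(edges):
    """Build the upstream view (asset -> its direct sources) from the downstream view."""
    upstream = {}
    for source, targets in edges.items():
        for target in targets:
            upstream.setdefault(target, []).append(source)
    return upstream


def reachable(start, edges):
    """Breadth-first walk returning every asset reachable from start, excluding start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # only matters if the graph contains a cycle through start
    return seen


# Where does marts.revenue come from?
origins = reachable("marts.revenue", invert(DOWNSTREAM))
# What is affected if staging.orders breaks?
impacted = reachable("staging.orders", DOWNSTREAM)

print(sorted(origins))
print(sorted(impacted))
```

The same walk serves both pillars: run it upstream to answer a Discovery question ("where does this table come from?"), and downstream to answer a Health question ("what breaks if this table breaks?").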
Community is a fundamental pillar of the Data Experience. Community is defined as “a feeling of fellowship with others, as a result of sharing common interests and goals”. Here are the capabilities that can help you build a good Community experience:
Data Assets usage: Report and read which data products are used, and for which use cases.
Popular queries: Find the most popular queries run by people within the company, surfaced by a query parser, and avoid duplicate work.
Collaboration: Allow users to build on the work of their peers. Users can tag each other to flag an issue, make comments, or ask questions. Table owners can approve recommendations made by others.
Sharing: Share the right information across the platforms your team uses every day.
Alignment around metrics: Access the knowledge base where KPIs (key performance indicators) and analytics metrics are defined. Get everyone on the same page when it comes to metrics.
Healthy data means that everyone in the organization can access the information they need, when they need it, without having to worry about whether it is accurate. The capabilities associated with Health are the following:
Anomaly detection: Get an immediate warning when your data displays any kind of anomaly.
Root cause and impact analysis: Identify the root cause of a data issue and its possible downstream consequences.
Access controls: Identify and manage access to private data.
Policy management: Monitor who accesses the data, when, and for which purposes. Grant or restrict access to different data assets according to company roles.
Data quality: Automatically assign a quality score to your data assets, allowing people to know which data assets can be trusted and used safely.
Real-time data monitoring: Collect and store performance metrics for data as it traverses your network.
Reduce storage & compute spend: Identify outdated, unused, or inefficient data assets or queries so you can delete them or change their storage level accordingly.
Data incidents resolution: Resolve incidents quickly once you have located them.
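As an illustration of the anomaly detection capability, the sketch below flags a daily row count that deviates sharply from recent history. It is a deliberately simple z-score check, not how any particular observability tool works; the function name, the threshold of 3 standard deviations, and the sample numbers are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag latest if it lies more than threshold standard deviations from the mean of history."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # history is constant, so any change is a deviation
    return abs(latest - mu) / sigma > threshold


# Hypothetical row counts loaded into a table over the last seven days.
daily_row_counts = [10_120, 9_980, 10_050, 10_200, 9_900, 10_075, 10_010]

print(is_anomalous(daily_row_counts, 10_090))  # a normal day
print(is_anomalous(daily_row_counts, 4_300))   # a load that dropped by more than half
```

Production tools typically account for seasonality and trend as well, but the principle is the same: learn what "normal" looks like from history and warn as soon as a new value falls outside it.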
In choosing the tools to build your desired Data Experience, you shouldn’t compromise on the number of capabilities you cover. Unless you address all three pillars (Discovery, Community, and Health), you will not be able to deliver a good, unified Data Experience.
By thinking in terms of capabilities, you open up a myriad of possibilities for achieving your desired outcome. You can choose one tool that combines multiple features, or you can use separate tools: a data catalog, a data observability tool, a SQL notebook, a metrics store, and an access management tool.
Typically, we would recommend an in-between solution where you go neither for the all-in-one tool nor for five or six different tools.
We advocate for building your Data Experience with just two tools: A standalone Data Catalog and a standalone Data Observability tool. This mix covers most of the desired capabilities. Modern Data Catalogs cover Discovery capabilities, while modern Observability tools cover Health capabilities. Both tools cover enough collaboration features to ensure the third pillar: Community.
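There are two main reasons why we would choose two standalone tools that integrate rather than one tool covering all capabilities: an all-in-one tool tends to compromise on the quality of each capability, and different tools are built for the workflows of different users.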
I’ll address these two points in turn.
Do you know the old saying, "Jack of all trades, master of none"? That's exactly what happens when you try to cover all your capabilities with one tool.
Choosing an all-in-one tool to cover many capabilities might help you save on integration costs, but you risk not having all these capabilities covered well. With the tools that exist on the market today, adding more capabilities to a tool usually means compromising on quality.
You will have a hard time finding a tool that excels both in terms of search & context and in terms of data quality & anomaly detection. Getting the most out of your data means you need to use multiple tools.
Again, we think building a strong Data Experience only requires a Data Catalog and a Data Observability tool.
A modern, third-generation data catalog covers most of the Discovery and Community capabilities: Search, Context, Data Lineage, Data Asset Popularity & Usage, Query, Metrics Alignment, Popular Queries, Sharing, and Collaboration. Some data catalogs also cover some of the Health capabilities, such as Access Controls and Policy Management. If you can’t find a data catalog covering access control, policy management, and other pure data governance features, we recommend adding a tool to your mix that focuses solely on data governance and data privacy.
Modern Data Observability tools cover most of the capabilities associated with Health: Anomaly Detection, Data Quality, Real-time Data Monitoring, and Root Cause and Impact Analysis.
Choosing a Data Catalog and Data Observability tool that jointly cover these capabilities should help you provide an excellent Data Experience to the people interacting with your company’s data.
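If you want to check this kind of claim for your own shortlist, a capability matrix is enough. The sketch below encodes the capabilities listed in this article and the coverage described in the two preceding paragraphs, then computes what the two-tool mix covers and what it leaves out. The lowercase labels and the `coverage` helper are our own shorthand for the example, and Access Controls is counted once even though it appears under both Discovery and Health. Swap in the capabilities of the actual products you are evaluating.

```python
# Capabilities named in this article, grouped by pillar.
PILLARS = {
    "Discovery": {
        "data lineage", "search", "context", "popularity", "query", "access controls",
    },
    "Community": {
        "data assets usage", "popular queries", "collaboration", "sharing",
        "alignment around metrics",
    },
    "Health": {
        "anomaly detection", "root cause and impact analysis", "access controls",
        "policy management", "data quality", "real-time data monitoring",
        "reduce storage and compute spend", "data incidents resolution",
    },
}

# What each tool category typically covers, following the description above.
TOOLS = {
    "data catalog": {
        "search", "context", "data lineage", "popularity", "data assets usage",
        "query", "alignment around metrics", "popular queries", "sharing",
        "collaboration",
    },
    "data observability": {
        "anomaly detection", "data quality", "real-time data monitoring",
        "root cause and impact analysis",
    },
}


def coverage(tool_names, tools=TOOLS, pillars=PILLARS):
    """Return (covered, missing) capability sets for the chosen tools."""
    required = set().union(*pillars.values())
    covered = set().union(*(tools[name] for name in tool_names)) & required
    return covered, required - covered


covered, missing = coverage(["data catalog", "data observability"])
total = len(covered) + len(missing)
print(f"{len(covered)}/{total} capabilities covered")
print("still missing:", sorted(missing))
```

With this mapping, the two tools cover 14 of the 18 distinct capabilities. The remaining gaps are mostly governance-related (access controls and policy management), which is why a dedicated governance tool is worth considering when your catalog does not handle them.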
Another argument in favor of decoupling different parts of your stack is that different tools are purposely built to cater to the workflow of different personas.
Primary users of capabilities associated with Discovery are different from the primary users of capabilities associated with Health.
Data analysts, product scientists, and other data consumers care about data discovery, as they need to find, understand, and use the data to do their work.
Data engineers, on the other hand, care more about data pipeline Health because they’re responsible for fixing and preventing issues in the data pipelines to ensure that data is reliable.
These different primary users engage in different workflows. Data engineers want to capture data quality information in their own internal systems, while data analysts want to see this information displayed in a data catalog.
Even in terms of UI, data engineers and data analysts have different expectations. A standalone Data Observability platform can cater to a data engineer’s workflows, while a standalone Data Catalog can cater to the workflows of analysts and business users without compromising either.
In summary, the Data Experience is how the data stakeholders in your company feel when they interact with the data.
The Data Experience is built on three pillars: Discovery, Community, and Health. To create these pillars in your organization, we recommend focusing on capabilities instead of tools.
Thinking in terms of capabilities is the most efficient way to build a strong Data Experience. It will keep you level-headed when evaluating tools and ensure that the tools you pick actually cover the capabilities you need.
We find that the tools that tick most of the boxes associated with the three pillars are Data Catalogs and Data Observability tools. We recommend going for two standalone tools that integrate well together.
We write about all the processes involved when leveraging data assets: from the modern data stack to data teams composition, to data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data.
At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation.
Or, data-wise, for the Fivetran, Looker, Snowflake, and dbt aficionados. We designed our catalog software to be easy to use, delightful, and friendly.
Want to check it out? Reach out to us and we will show you a demo.