This article has been co-written with Hugo Lu, co-founder of Orchestra.
Data Catalogs have been around for years, and have until now been tools of large enterprises, used for airy topics such as “Discovery”, “Governance” or “Master Data Management”. Although relatively unknown to technical teams of data engineers, Catalogs play an important role in bridging the gap between technical and business users - after all, without business users, data is simply data: existing, being processed, and representing a huge cost center in the CFO's budget.
According to Oracle, a Data Catalog can be defined as follows:
"A Data Catalog is an organized collection or inventory of your data assets and processes. It uses metadata to help organizations manage their data. It also helps data professionals collect, organize, access, and enrich metadata to support data discovery and governance." - Oracle
In plain terms, a Data Catalog is an application that gathers metadata from other Data Applications and arranges it in such a way that business and technical users can gain value from it. There are obvious use cases for Data Catalogs, well illustrated by a few examples:
- By showing which assets are used by which systems, technical users can easily pinpoint data assets that frequently cause problems for end users; they can also use Catalogs to proactively identify and secure bottlenecks in data pipelines
- Non-technical users can easily use a catalog for discovery; working out what assets, tables, or dashboards can be used to solve a given problem
- Budget owners can use catalogs to identify data assets that are potentially costly or under-utilized
However, from an architectural perspective, Data Catalogs stand alone. They tend to be applications with well-designed UIs that poll services for Metadata. This creates a problem since that metadata lacks context. In the rest of this article, we dive into this problem statement and understand how Orchestration tools can enrich catalogs to help data teams truly understand business value.
The problem: all your metadata, all at once
Let’s consider a simple data product or a data pipeline, that’s designed to populate a table of orders over time. This pipeline uses data applications to ingest data from various sources including databases and third-party software applications. The data is transformed in a data warehouse, there are a few tables that are created, and the final table is simply an `orders_aggregated` table. There are a few dashboards that utilize this table.
A Data Catalog will present a consolidated view of the lineage for this pipeline. As a business user focused on driving business value, you’ll be able to see the ingestion jobs, tables, and dashboards with all the relevant metadata. CastorDoc has some visualizations of these, as you can see below:
However, that requires you to know which data product they’re part of - the `orders` data product. On their own, Data Catalogs cannot help you understand:
- Which assets relate to which products
- When assets are refreshed or operations are run, whether this corresponds to an expected refresh of the data product
- Metrics at a product level (such as time to refresh, cost, success rates and so on) over time
How can business users overcome these challenges to unlock value?
The solution: end-to-end orchestration and observability capabilities
An end-to-end orchestration and observability solution can gather metadata and populate fields in Data Catalogs at an extremely fine level of granularity.
Orchestration tools introduce the concept of runs. These are also known as refreshes or materializations, depending on the tool you are using. A run is simply the action of running all the jobs in an order required to update data in an asset, such as a table or dashboard.
The run is either successful or unsuccessful. It’s a helpful concept because it allows data engineers and analysts to monitor their jobs over time, and it is critical for debugging. Every run executes a directed acyclic graph (“DAG”), and as such, each run creates a lineage chart.
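As a rough sketch, a run can be modeled as executing a DAG of tasks in dependency order and recording an outcome for each. The task names and structure below are hypothetical, mirroring the `orders_aggregated` example from earlier; this is illustrative, not any specific orchestrator's API:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline DAG: task -> set of upstream dependencies,
# loosely mirroring the orders example (ingest, transform, aggregate, serve).
dag = {
    "ingest_db": set(),
    "ingest_saas": set(),
    "stg_orders": {"ingest_db", "ingest_saas"},
    "orders_aggregated": {"stg_orders"},
    "orders_dashboard": {"orders_aggregated"},
}

def execute(task):
    # Placeholder for real work (a query, an API call, a dbt model...).
    return True

def run_pipeline(dag):
    """Execute every task in dependency order; record per-task outcomes."""
    results = {}
    for task in TopologicalSorter(dag).static_order():
        # If any upstream failed, skip this task so the failure propagates.
        if any(not results.get(dep, False) for dep in dag[task]):
            results[task] = False
            continue
        results[task] = execute(task)
    # The run as a whole is successful only if every task succeeded.
    return results, all(results.values())

results, run_ok = run_pipeline(dag)
```

Each invocation of `run_pipeline` is one "run": it produces both a per-task outcome map (the raw material for a lineage chart) and an overall success flag.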
Having this information elevates the context a Data Catalog has access to:
In the example above, before populating the Catalog with an Orchestration tool, the Catalog has the context of the Snowflake Queries on the left hand side. These are essentially a long list of query materializations without context.
When the catalog is populated using an Orchestration tool, we see that metrics can be correlated across pipeline runs, task runs, and importantly, Data Products. Having the `product_id` is incredibly powerful. Now, any metric can be aggregated over time at a data product level. For the “Insurance data” product, a user could see which tables relate to it, which dashboards relate to it, the number of times it gets refreshed per day (and how successful those schedules are), the number of queries it powers…the possibilities are almost endless.
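To illustrate the point, here is a minimal sketch of that aggregation. The record shape (`product_id`, `success`, `duration_s`) is assumed for the example, not a specific catalog or orchestrator schema; once every run record carries a `product_id`, product-level metrics fall out of a simple group-by:

```python
from collections import defaultdict

# Hypothetical run metadata, as an orchestrator might emit it to a catalog.
runs = [
    {"product_id": "insurance_data", "success": True,  "duration_s": 120},
    {"product_id": "insurance_data", "success": False, "duration_s": 40},
    {"product_id": "orders",         "success": True,  "duration_s": 300},
]

def metrics_by_product(runs):
    """Aggregate refresh count, success rate, and mean duration per product."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run["product_id"]].append(run)
    return {
        product: {
            "refreshes": len(rs),
            "success_rate": sum(r["success"] for r in rs) / len(rs),
            "avg_duration_s": sum(r["duration_s"] for r in rs) / len(rs),
        }
        for product, rs in grouped.items()
    }

stats = metrics_by_product(runs)
```

The same grouping works for any metric the orchestrator records (cost, rows processed, queries served), which is what makes the `product_id` join key so useful.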
Conclusion: build data products from orchestration upwards
Treating data pipelines and tasks as powering specific data products is an incredibly helpful mindset to adopt, as it encourages data teams to separate different data models and ingestion paths according to the value use cases they power. At a time when cost is key and business value is at the center of discussions, being able to accurately identify which data operations actually power business-critical analytical workloads and applications is essential.
Using a Data Catalog like CastorDoc represents a step in the right direction for many teams, as it provides a convenient way to create maps of data assets and bridge the gap between technical and non-technical teams, giving business users an easy way to discover, monitor and understand data products.
Enriching a Data Catalog using an Orchestration and Observability platform like Orchestra can take the levels of discoverability to the next level. Data Products are emerging as the de-facto way to think about building data pipelines. By populating data catalogs with invocation-specific information, business users can correlate and aggregate important metrics such as usage, failure rates and dependencies at a data product-level over time.
Learn more about Orchestra
Orchestra is a tool focused on interoperability: it stitches together the Modern Data Stack and populates Catalogs in a way that gives Data Teams access to enterprise-grade orchestration and observability capabilities from day one. If you’d like to try the platform or simply discuss, reach out.
We write about all the processes involved when leveraging data assets: the modern data stack, data teams composition, and data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data. At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation. We designed our catalog software to be easy to use, delightful, and friendly.
Want to check it out? Reach out to us and we will show you a demo.