Data Catalog Benchmark for Mid-Market Companies

Top 10 data catalogs for the modern data stack



In recent decades, organizations have come to realize the importance of leveraging data efficiently. We are witnessing a "data race" in which businesses compete to hire the best data talent. The result? Businesses are now equipped with data engineers, data scientists, and data analysts who master cutting-edge tools to produce meaningful data analysis.

These talented data people are expected to deliver high-quality, valuable data analysis, but the story often unfolds differently. They encounter a great deal of frustration when they realize they spend most of their time on tedious questions:

  • Where is the best data to answer my question?
  • What does the column name "XXXX" mean?
  • Can I trust it?
  • When was it last updated? What is the process to create it?
  • Who can I contact if I see something wrong?
  • Has someone already worked on this question?

That is, data people spend more time on metadata management than on meaningful, value-generating analytics work. Thankfully, the enterprise data catalog is a tool that can help with all these questions, allowing data people to focus on the core of their work. This is why data catalog tools have flourished in the past 10 years, and there are now so many to choose from that businesses have a hard time making up their minds. Today, we take on the difficult task of untangling the vibrant data catalog ecosystem.

What is a data catalog?

Gartner, a research and advisory firm, defines the data catalog as follows:

“A data catalog creates and maintains an inventory of data assets through the discovery, description, and organization of distributed datasets. The data catalog provides context to enable data stewards, data/business analysts, data engineers, data scientists and other data consumers to find and understand relevant datasets for the purpose of extracting business value.”

Gartner, Augmented Data Catalogs 2019.

Which data catalog should I choose?


There are three generations of data catalog tools:

  • 1st generation: basic software, similar to an Excel spreadsheet, that syncs with your data warehouse.
  • 2nd generation: software designed to help the data steward maintain data documentation (metadata), lineage, and data processing.
  • 3rd generation: software designed to deliver business value to end users automatically, within hours of deployment. It then guides users to document in a collaborative, painless way.

After outlining the characteristics of each category, we propose a benchmark of the current players in the market.

Data Catalog 0.0: no dedicated tool

Companies that deal with very small amounts of data often don't use dedicated data cataloging tools. If this is your case, you can use any tool to describe the columns and tables in your data infrastructure. Excel and Word can be used to write definitions of your data assets and columns. The good news is that it takes 1 minute to get started. The bad news is that it takes 1 minute to become outdated. This approach is hard to maintain and does not scale.

Data Catalog 1.0: synced metadata inventory

The first data catalogs came into existence in the 1990s and early 2000s. They are basic software, similar to an Excel spreadsheet, that syncs with your data warehouse. The concept is dead simple: with this tool, the days of manually writing the names of the different tables and columns in an Excel document were over. These tools automatically synced the content of the data warehouse, sparing you the painful and time-consuming task of tracking what is created or deleted in your data infrastructure.
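To make the idea concrete, here is a minimal sketch of such a sync, written in Python. It is an illustration under assumptions, not how any particular product works: the `CatalogEntry` class and the `sync_catalog` function are hypothetical names, and the set of warehouse columns is hard-coded in the usage example, whereas in practice it would typically be read from the warehouse's own metadata (many warehouses expose an `information_schema.columns` view for this).

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """One cataloged column; the description is written by a human."""
    table: str
    column: str
    description: str = ""


def sync_catalog(catalog, warehouse_columns):
    """Align the catalog with the warehouse, keeping existing descriptions.

    catalog: dict mapping (table, column) -> CatalogEntry (hypothetical structure)
    warehouse_columns: set of (table, column) pairs currently in the warehouse
    Returns two sorted lists: the keys that were added and the keys removed.
    """
    known = set(catalog)
    added = sorted(warehouse_columns - known)
    removed = sorted(known - warehouse_columns)
    for table, column in added:
        # New columns enter the catalog with an empty description.
        catalog[(table, column)] = CatalogEntry(table, column)
    for key in removed:
        # Columns dropped from the warehouse leave the catalog.
        del catalog[key]
    return added, removed


# Usage: "amount" was created and "old_flag" was dropped in the warehouse.
catalog = {
    ("orders", "id"): CatalogEntry("orders", "id", "Primary key"),
    ("orders", "old_flag"): CatalogEntry("orders", "old_flag"),
}
added, removed = sync_catalog(catalog, {("orders", "id"), ("orders", "amount")})
print(added, removed)
```

The point of the design is that human-written descriptions survive each sync: only the inventory of tables and columns is refreshed automatically.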

Data cataloging tools in this category offer basic documentation features: plain text documentation, manual tagging, ownership, metadata curation, and maintenance of governance practices. The search for data assets, if there is one, is not very powerful. Data catalogs 1.0 demand high setup and maintenance efforts, not to mention high costs.

Data Catalog 2.0: data steward centered catalogs

As data assets grew exponentially and more people used the data catalog, companies realized that all this data had to be managed in terms of meaning, quality, and access rights. This was the birth of the data steward role.

Data catalogs 2.0 were designed for this new role. They help data stewards maintain data documentation, data processing, lineage, personal information mapping, ownership, etc.

In this context, the second-generation data catalog displays more advanced features:

Search and discovery

Data catalogs 2.0 allow business and data analysts to find and understand the data assets they need. They also let you contextualize information and build a Wikipedia-like page for each data asset in the company.

Strong process embedded in the catalog tool

A good documentation strategy revolves around three things: tools, people, and processes. People need to know what the documentation process is, and the process needs to be actionable. For instance, before a table is released in the production database, it needs an identified owner, well-documented columns, and several data quality tests.
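Such a release rule can be expressed as a simple check. The Python sketch below is only an illustration: the `TableDoc` structure and the `release_blockers` function are hypothetical, and since the text says "several" tests without giving a number, the threshold of two is an assumption you would adjust.

```python
from dataclasses import dataclass, field


@dataclass
class TableDoc:
    """Documentation state of a table (hypothetical structure)."""
    name: str
    owner: str = ""
    column_descriptions: dict = field(default_factory=dict)  # column -> description
    quality_tests: list = field(default_factory=list)


def release_blockers(table, min_tests=2):
    """Return the reasons a table is not ready for production (empty if ready).

    min_tests=2 is an assumed reading of "several data quality tests".
    """
    blockers = []
    if not table.owner.strip():
        blockers.append("no identified owner")
    undocumented = sorted(
        col for col, desc in table.column_descriptions.items() if not desc.strip()
    )
    if undocumented:
        blockers.append("undocumented columns: " + ", ".join(undocumented))
    if len(table.quality_tests) < min_tests:
        blockers.append(f"fewer than {min_tests} data quality tests")
    return blockers


# Usage: a table with no owner, one undocumented column, and a single test.
draft = TableDoc(
    "orders",
    column_descriptions={"id": "Primary key", "amount": ""},
    quality_tests=["id_is_not_null"],
)
for reason in release_blockers(draft):
    print(reason)
```

Returning the list of reasons, rather than a plain yes or no, keeps the process actionable: the person releasing the table sees exactly what is missing.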

Advanced rights management features

This allows you to restrict access to data assets. It works by granting data people specific roles. In practice, a user can only access a data asset if they have permission to do so.
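A role-based check of this kind can be sketched in a few lines of Python. Everything here is illustrative: the role names, user names, asset name, and the `can_access` function are assumptions made for the example, and a real catalog would store these grants in its own permission system.

```python
# Hypothetical grants: each role maps to a set of (asset, action) pairs.
ROLE_PERMISSIONS = {
    "analyst": {("sales.orders", "read")},
    "steward": {("sales.orders", "read"), ("sales.orders", "edit_docs")},
}

# Hypothetical assignment of roles to users.
USER_ROLES = {"lea": {"analyst"}, "omar": {"steward"}}


def can_access(user, asset, action):
    """Return True if any role granted to the user allows the action on the asset."""
    return any(
        (asset, action) in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


print(can_access("lea", "sales.orders", "read"))       # analyst may read
print(can_access("lea", "sales.orders", "edit_docs"))  # analyst may not edit docs
```

Unknown users and unknown roles fall through to an empty set, so the default answer is "no access".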

Project management features

Data governance teams need an overview of how data documentation is progressing, so that they can organize the workload efficiently.

New data features emerge: data lineage, data quality, SQL editor

There are two levels of data documentation. Level 1 is concerned with writing column and table definitions. Level 2 adds business context around the data: which tables are used to create a data asset? What is the code behind it? How often is it refreshed? Data catalogs 2.0 made level 2 accessible, but mostly in a manual way. You still have to declare upstream and downstream dependencies for lineage, choose a refresh frequency from a dropdown menu, etc.
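To illustrate what manually declared lineage looks like, here is a small Python sketch. The asset names and the `all_upstream` function are hypothetical: the dictionary plays the role of the dependencies a data steward would declare by hand, and the function walks them to list every direct or indirect source of an asset.

```python
from collections import deque

# Hand-declared lineage: each asset maps to its direct upstream sources.
# These names are made up for the example.
UPSTREAM = {
    "revenue_dashboard": ["monthly_revenue"],
    "monthly_revenue": ["orders", "refunds"],
    "orders": [],
    "refunds": [],
}


def all_upstream(asset, upstream=UPSTREAM):
    """Return every direct or indirect source of an asset (breadth-first walk)."""
    seen = set()
    queue = deque(upstream.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited, which also guards against cycles
        seen.add(current)
        queue.extend(upstream.get(current, []))
    return seen


print(sorted(all_upstream("revenue_dashboard")))
```

The weakness of the manual approach is visible in the dictionary itself: every new table or changed dependency has to be typed in by someone, or the lineage silently goes stale.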

These catalogs are still process-based: without the processes, the data catalog doesn't bring any business value. That is, they rely on a data steward, in charge of guiding the documentation and labeling of databases. This is changing with the 3rd generation.

Data Catalog 3.0: decentralized and intelligent catalogs

The third-generation enterprise data catalog reflects the important shift that has occurred around metadata management. Modern data catalogs are software designed to deliver business value to end users automatically, within hours of deployment. They then guide users to document in a collaborative, painless way.

People are not coming for the manual documentation anymore. They are coming for the value delivered from day 1. The data catalog 3.0 gathers up to 80% of the business context automatically (lineage, popularity, versioning, quality, etc.), then adds a collaborative layer to encourage users to document. The rest is a bonus.

And guess what? Because lots of people are using the tool on a daily basis, the documentation grows organically with people's comments, discussions, interactions, and feedback. You no longer need a data documentation program that is expensive in both time and money. Plug in the tool and people will create value while they get value.

The catalog 3.0 is based on three principles:

1- Value is delivered from day 1, thanks to automated context gathering

When you start an analysis, what information do you need that you could get automatically? You want to know the business context in which the data asset was created: where it comes from, what code and process created it, who its creator and frequent users are, when it was last refreshed, what the popular joins are, whether it is tested or not, the quality level, the presence of duplicates, who has access to this asset, etc.

Well, you can get all this information right after plugging the data catalog 3.0 into your data warehouse, and get value straight away.

2- The solution features an integrated intelligence that can replace or superpower the data steward.

We observe a clear departure from the process-based data catalog: value is created as soon as you plug in the tool. This new generation of data catalogs features an integrated intelligence that can replace or support the data steward. The self-sustaining tool is proactive: it guides, prioritizes, and optimizes metadata management. For example, it detects when a database has been created and sends a notification to its owner, reminding them to describe and label it. Thanks to popularity and user query features, it also identifies which data assets are the most used. This encourages users to document the most relevant databases first, ensuring they don't spend time labeling databases that are rarely used by employees or are deprecated.
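The prioritization described above can be sketched as follows in Python. This is a simplified illustration, not a description of a specific product: the `documentation_backlog` function is a hypothetical name, and the query log is assumed to be a plain list of table names, one entry per query that touched the table.

```python
from collections import Counter


def documentation_backlog(query_log, documented, deprecated=frozenset()):
    """Rank tables that still need documentation by how often they are queried.

    query_log: iterable of table names, one per query (assumed input format)
    documented: set of tables that already have documentation
    deprecated: set of tables to skip entirely
    Returns (table, query_count) pairs, most queried first, ties sorted by name.
    """
    counts = Counter(query_log)
    candidates = [
        (table, count)
        for table, count in counts.items()
        if table not in documented and table not in deprecated
    ]
    return sorted(candidates, key=lambda item: (-item[1], item[0]))


# Usage: "users" is already documented and "legacy" is deprecated.
log = ["orders", "orders", "users", "orders", "legacy", "users", "events"]
for table, count in documentation_backlog(log, {"users"}, {"legacy"}):
    print(table, count)
```

Tables that nobody queries never appear in the log, so they never reach the backlog, which matches the goal of not spending time on rarely used assets.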

3- Collaboration becomes the core of metadata management.

Following the model of GitHub or Notion, a collaborative data catalog allows users to benefit from each other's insights regarding a data asset. For example, employees can upvote definitions, or flag them to notify the owner that they need rework. People can define KPIs, debate definitions in the attached chat section, and build a data knowledge center that is linked to the data catalog. Through new features such as query history, employees have access to the manipulations and queries that have been performed on a data asset. Instead of starting to work on a dataset from scratch, they can build on previous work done on it and create business value right away. Beyond improving productivity in the organization, this allows for deeper, collaborative analysis.

The modern enterprise data catalog marks the entry into a new era, in which your data management is automated and collaborative, ensuring tremendous productivity gains.

Data catalog landscape*

Below, you will find a data catalog landscape, which can hopefully help you choose a metadata management tool adapted to your needs.


A cloud data catalog connects to the cloud data warehouse and to cloud business intelligence sources. It indexes all the metadata from these sources into a search engine. This enables users to view, write, and read documentation from the data source, and to learn what exists in the cloud data warehouse and BI tools. The technical capabilities of a data catalog are:

  • help non-technical people understand how to use technical assets, thanks to the query history
  • view the technical dependencies of a data asset through lineage reports
  • access the knowledge base where KPIs (key performance indicators) and analytics metrics are defined
  • provide support to data users across the company on cloud data infrastructure
  • report to the head of data and data managers on data-driven decision making and insights
  • report on which data products are used, and for which use cases
  • improve cloud data discovery in enterprise organizations, so that users learn which technical analyses and reports they can find

Are you looking for a modern data catalog?

At Castor, we are building a new generation of data catalog/governance software. Our product is plug-and-play, scales with your team, and everything is done to improve collaboration among users.

If you want to try Castor, out of pure interest or for a real business case, we’d be more than happy to help. 30-minute setup guaranteed.

Louise de Leyritz

Growth Analyst Intern


© 2021 Castor. All registered.