Data catalogs were introduced to help data people find and understand data. Before data catalogs existed, data engineers, data analysts, and data scientists worked blind, deprived of visibility into data sets, their content, their quality or their usefulness. Consequently, they spent most of their time trying to locate and understand data, often recreating data sets that already existed. These are the kinds of issues that data catalogs seek to address.
Data catalogs began with the modest aim of managing data inventory and improving data discovery. Soon enough, they grew in functionality, popularity and importance. Modern data catalogs have considerably expanded their reach, and are now central to data stewardship and data governance. Data team leaders view data catalogs as strategically important and key drivers of analytic quality and data teams' productivity.
The thing is, the number of data cataloging tools has grown rapidly in recent years, and there are now many to choose from. Which one is right for you? That's what we'll help you figure out today.
Gartner, a research and advisory firm, defines a data catalog as follows:
“A data catalog creates and maintains an inventory of data assets through the discovery, description, and organization of distributed datasets. The data catalog provides context to enable data stewards, data/business analysts, data engineers, data scientists and other data consumers to find and understand relevant datasets for the purpose of extracting business value”
Gartner, Augmented Data Catalogs 2019
The first step in choosing a data catalog is to understand exactly why you need one. As mentioned already, data catalog vendors have multiplied in recent years, and they cater to different needs. Are you looking for a data governance tool? A pure data discovery tool? You need to define exactly what you're looking for before going on a data catalog quest. To this end, start by identifying your pain points, and then find which data catalog addresses them. The first exercise is thus to identify the top challenges that affect your productivity and to map them to data catalog features. To facilitate the task, we've done the mapping. Tell us what bothers you, and we'll tell you which data catalog features you're interested in.
In this exercise, it's important that you get your team to speak. If you're leading a data team, make sure you understand what bothers team members. They might have different pain points affecting their productivity. You want to pick a catalog that alleviates their frustrations and allows them to fulfill their mission.
Now that you have a clearer idea of the features you're interested in, rank them in order of preference.
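One way to make this ranking actionable is to turn it into weights and score each candidate catalog against them. The sketch below is only an illustration: the feature names, catalog names and 0-5 ratings are hypothetical placeholders, and the linear rank-to-weight rule is one assumption among many possible weighting schemes.

```python
def rank_weights(ranked_features):
    """Turn an ordered list (most important first) into weights that sum to 1."""
    n = len(ranked_features)
    total = n * (n + 1) / 2
    return {feature: (n - i) / total for i, feature in enumerate(ranked_features)}


def score_catalog(weights, ratings):
    """Weighted sum of a catalog's 0-5 ratings; unrated features count as 0."""
    return sum(w * ratings.get(feature, 0) for feature, w in weights.items())


# Hypothetical ranking agreed on by the team, most important first.
ranked = ["search", "lineage", "documentation", "access_control"]
weights = rank_weights(ranked)

# Hypothetical 0-5 ratings your team gives each candidate during evaluation.
candidates = {
    "catalog_a": {"search": 5, "lineage": 2, "documentation": 4},
    "catalog_b": {"search": 3, "lineage": 5, "documentation": 3, "access_control": 4},
}

# Sort candidates from best to worst weighted score.
shortlist = sorted(
    candidates,
    key=lambda name: score_catalog(weights, candidates[name]),
    reverse=True,
)
for name in shortlist:
    print(name, round(score_catalog(weights, candidates[name]), 2))
```

A score like this won't pick the catalog for you, but it makes disagreements within the team visible: if two people rank the features differently, they will see different shortlists.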
You've now established which features you need in a data catalog, and you're ready to scan the market to find your ideal catalog. Wait a second, we're not done yet. There are other considerations you should take into account. Namely, think about what would make your team use the data catalog. In fact, the whole value of a data catalog resides in its usage. When people use the data catalog, documentation levels increase, the quality of data assets improves, and more people use the data catalog. Conversely, this can easily turn into a vicious circle where no one uses the catalog. In this case, not only do you have poor-quality data assets, but you've also wasted your money on a data catalog. So when you contract with a data catalog vendor, you want to make sure your team actually likes the tool and plans to use it. We thus propose looking at the following four variables when evaluating a data catalog.
Once you have clearly defined what you're looking for in a data catalog, it's time to find your perfect match. This is no easy task, as there is a plethora of options to choose from. We've attempted to untangle the data catalog ecosystem to help you find the perfect fit. We found that data catalogs can be divided into three generations:
Here is a brief listing of the pros and cons of each option.
Data catalog landscape
Below, you will find a data catalog landscape, which can hopefully help you choose a metadata management tool adapted to your needs.
*This is a brief attempt at classifying the tools on the market. If anything seems wrong, or if you don't see your data catalog and want to have it placed, feel free to reach out.
If you want to know more about vendors, their offerings, and the data catalog ecosystem, you will find our data catalog benchmark here.
You have now selected a few catalogs that seem to match your pre-defined criteria and answer your business needs. It's time for the next step: take a demo.
If you sit as a passive viewer during the demo, you're unlikely to get much value out of it. You should participate actively and leave with a clear idea of how the data catalog software will help address your specific needs.
We encourage you to plan the key topics you want to cover and to tell vendors in advance which features matter most to you. This will ensure a much more tailored experience.
We thus propose agreeing beforehand on an agenda that covers the following topics:
Price is obviously a concern when choosing catalog software. However, the real cost often exceeds the price quoted by the vendor. Total cost of ownership covers how much the software costs to purchase, implement and maintain.
Purchasing: Make sure you understand what's included in every pricing tier. Ask about potential additional charges, such as fees for extra users.
Implementation: Ask about implementation costs, as they can make a significant difference. For example, choosing an open source data cataloging solution saves you the purchase cost, but leads to significant implementation costs.
Maintenance: Make sure you clearly understand what the vendor charges after purchase, for updates for example. Even without updates, the software might be expensive to maintain. For example, legacy data catalogs (1st generation) often require a full-time engineering team to maintain the tool. Be sure to factor these additional costs into the total cost of ownership.
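To compare options on equal footing, it can help to add the three cost components over the period you plan to keep the tool. Here is a minimal sketch of that arithmetic. The class, the function and every figure in it are made-up placeholders for illustration, not actual market prices, and the formula assumes a flat yearly license and maintenance cost.

```python
from dataclasses import dataclass


@dataclass
class CatalogCosts:
    annual_license: float      # vendor price per year, including extra users
    implementation: float      # one-off setup and integration cost
    annual_maintenance: float  # updates, hosting, engineering time per year


def total_cost_of_ownership(costs: CatalogCosts, years: int) -> float:
    """One-off implementation cost plus recurring costs over the period."""
    recurring = costs.annual_license + costs.annual_maintenance
    return costs.implementation + years * recurring


# Placeholder figures: a paid tool versus a self-hosted open source one.
paid_tool = CatalogCosts(annual_license=30_000, implementation=5_000, annual_maintenance=2_000)
open_source = CatalogCosts(annual_license=0, implementation=40_000, annual_maintenance=25_000)

for label, costs in [("paid tool", paid_tool), ("open source", open_source)]:
    print(label, total_cost_of_ownership(costs, years=3))
```

With these placeholder numbers, the option with no license fee ends up costing more over three years, which is exactly the kind of effect the declared price alone would hide. Plug in the quotes and estimates you actually receive before drawing any conclusion.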
What relationship will you have with the vendor after completing the purchase? Will you be on your own? If so, does that work for you? This is not a negligible question. Some Tesla owners, for instance, love their car but have been so frustrated by poor customer service that they regret their purchase. For this reason, make sure you understand the following:
Companies can lose serious amounts of money and customer trust following data security breaches. Be sure to understand exactly what data the vendor has access to, what kind of security the vendor uses for its databases, and what processes it has in place to keep your information safe.
We also advise you to attend the demo with stakeholders from different teams. This will allow you to gather the most comprehensive feedback, and thus choose a tool that suits all kinds of users. Finally, make sure the data catalog is compatible with your current data infrastructure as well as with your vision and roadmap for the next one to five years.
We have also pulled together a more detailed version of "what to check before/during a data catalog demo", should you be interested.
A cloud data catalog connects to the cloud data warehouse and the cloud business intelligence sources. It helps an organization index all the metadata from various sources into a search engine. This enables users to view, write and read documentation from the data source to learn what exists in the cloud data warehouse and BI tools. The technical capabilities of a data catalog are: