What is "Data Discovery"?

What it is, and how tools can help

3 min read

May 9, 2023

By Xavier de Boisredon

Data assets grow exponentially

Scale-ups are companies that grow much faster than average thanks to a scalable business model and large amounts of invested capital ($20m minimum).

Inevitably, those companies see explosive growth in the amount of data they collect and create. Consequently, the number of internal data resources (tables, dashboards, reports, KPIs…) explodes. New people and tools are required to manage this data and fully benefit from its potential. It is not rare for scale-ups to recruit and onboard new data people every week.

On the one hand, this exponential growth in data assets is a good thing, as it reflects a sound investment in a data-driven ecosystem. On the other hand, it brings an ocean of new problems: How do you find and trust the relevant data to conduct meaningful analysis? How do you gather the knowledge of data people and easily share it with new employees? How do you make sure everyone is on the same page when analyzing a specific KPI?

Data discovery tools are not an option anymore

Tech giants, with hundreds of thousands of data assets, built their own internal tools to solve this problem. Airbnb built DataPortal, Uber DataBook, LinkedIn DataHub, Spotify Lexikon, WeWork Marquez, etc. Not one of these companies could do without a data discovery tool.

Yet those tools are expensive and time-consuming to build and maintain. Smaller scale-ups can't afford to build them, but that doesn't mean they don't need them. As data scientists, we have experienced this issue ourselves, and after interviewing more than 200 data people in 100+ companies, we started building Castor. Castor is a collaborative, automated data discovery tool. It is designed to be used by anyone in the company, and you can get it up and running in 6 minutes. The following sections describe what we built.

What does a Data Discovery tool do?

Search

When it comes to finding something fast, there are two methods: either you spend time organizing things well (the way libraries work), or you index every resource you have (the way Google works). The library model works great with a limited number of assets, but it quickly becomes impossible to maintain when lots of people interact with large numbers of assets. The Google model works well at scale but is expensive to develop.

The best tools chose to build a Google-like search that helps anyone in a company find and understand data assets, even without any knowledge of how databases or SQL work.
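To make the "index every resource" idea concrete, here is a minimal Python sketch of an inverted index over asset names and descriptions. It is an illustration only, not how Castor or any of the tools above implement search; the catalog contents and the names ASSETS, build_index and search are assumptions made up for this example.

```python
import re
from collections import defaultdict

# Hypothetical catalog: asset name -> free-text description.
ASSETS = {
    "analytics.orders": "One row per customer order with amount and timestamp",
    "analytics.customers": "Customer profile table with signup date",
    "dashboards.weekly_revenue": "Weekly revenue KPI built from orders",
}


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_index(assets):
    """Map each token to the set of assets whose name or description contains it."""
    index = defaultdict(set)
    for name, description in assets.items():
        for token in tokenize(name + " " + description):
            index[token].add(name)
    return index


def search(index, query):
    """Rank assets by how many distinct query tokens they match (ties broken by name)."""
    scores = defaultdict(int)
    for token in set(tokenize(query)):
        for name in index.get(token, ()):
            scores[name] += 1
    return sorted(scores, key=lambda name: (-scores[name], name))


INDEX = build_index(ASSETS)
print(search(INDEX, "orders revenue"))
```

The point of the design is that nobody has to file assets into folders by hand: indexing is automatic, and ranking does the work that a librarian would otherwise do. A production search would likely add stemming, typo tolerance and popularity-based ranking.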

Context and Metadata

Once you have found what you are looking for, you need to 1) understand and 2) trust the data in front of you. For that reason, data discovery tools have a built-in Wikipedia-like page for each data asset in the company. You will find information on:

Programmatically curated information

Table and column names
Column type
Last updates
Owners
Frequent users
Source code of the table

Manually curated information

Description
Tags
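As a rough sketch, the page described above can be modeled as a record that combines the programmatically harvested fields with the manually written ones. The class and field names below (AssetPage, Column, is_documented) are hypothetical and chosen for this example; they do not reflect any particular tool's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Column:
    name: str
    type: str


@dataclass
class AssetPage:
    # Programmatically curated fields (typically harvested from the warehouse).
    table_name: str
    columns: list[Column]
    last_updated: datetime
    owners: list[str]
    frequent_users: list[str]
    source_code: str
    # Manually curated fields (written by people).
    description: str = ""
    tags: list[str] = field(default_factory=list)

    def is_documented(self):
        """A page counts as documented once someone has written a description."""
        return bool(self.description.strip())


# Example page with made-up values; the manual fields start out empty.
page = AssetPage(
    table_name="analytics.orders",
    columns=[Column("order_id", "INTEGER"), Column("created_at", "TIMESTAMP")],
    last_updated=datetime(2023, 5, 1),
    owners=["data-eng"],
    frequent_users=["alice", "bob"],
    source_code="SELECT * FROM raw.orders",
)
print(page.is_documented())
```

Keeping the two groups of fields apart makes it easy to refresh the harvested part on a schedule without overwriting what people have written.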

Dashboard indexing (newer tools)

Recent cloud data discovery tools are working hard to build connectors to your favorite BI tools.

They reference dashboards and views so that one can document their usage and appoint an owner. Linking dashboards to the tables used to build them makes it even easier to understand the data and get the full context around it. Data engineers will also be able to spot which dashboards might break after a column change or ETL modification.
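The impact analysis described above can be sketched as a lookup over a dashboard-to-column mapping. This is a simplified illustration under the assumption that such a mapping has already been extracted from the BI tool; DASHBOARD_LINEAGE and impacted_dashboards are hypothetical names, and the data is made up.

```python
# Hypothetical lineage: dashboard name -> set of (table, column) pairs it reads.
DASHBOARD_LINEAGE = {
    "weekly_revenue": {("orders", "amount"), ("orders", "created_at")},
    "customer_growth": {("customers", "signup_date")},
    "order_funnel": {("orders", "status"), ("customers", "id")},
}


def impacted_dashboards(lineage, table, column=None):
    """Return dashboards reading the given table, or one specific column of it."""
    hits = []
    for dashboard, refs in lineage.items():
        for ref_table, ref_column in refs:
            # column=None means "any column of this table".
            if ref_table == table and column in (None, ref_column):
                hits.append(dashboard)
                break
    return sorted(hits)


# Which dashboards might break if orders.amount is renamed?
print(impacted_dashboards(DASHBOARD_LINEAGE, "orders", "amount"))
# Which dashboards depend on the orders table at all?
print(impacted_dashboards(DASHBOARD_LINEAGE, "orders"))
```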

Usage (newer tools)

Have you ever looked at a table and wondered: how are people usually querying this table? Which column should I use as a timestamp: _TZ or TIMESTAMP? What are the frequently joined tables?

Well, if the answer is yes, you'll like this feature. We plan to parse and reference all the queries made by data people within the company to:

Highlight the most popular tables
Highlight the most popular queries and table joins
Notify the most frequent users after a change in the table schema
Map knowledge within the company to programmatically assign data asset experts
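As an illustration of the first two items in this list, the sketch below counts table popularity and co-occurring tables from a query log. It is a deliberately naive example, not the approach Castor takes: the log contents are made up, the names QUERY_LOG, tables_in and usage_stats are hypothetical, and the regular expression only catches identifiers that directly follow FROM or JOIN. A real implementation would more likely rely on a proper SQL parser.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical query log; a real tool would read it from the warehouse's query history.
QUERY_LOG = [
    "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT count(*) FROM orders",
    "SELECT * FROM orders o JOIN payments p ON p.order_id = o.id",
    "SELECT * FROM customers",
]

# Naive pattern: the identifier right after FROM or JOIN.
TABLE_PATTERN = re.compile(r"\b(?:from|join)\s+([a-z_][\w.]*)", re.IGNORECASE)


def tables_in(query):
    """Return the sorted, de-duplicated table names referenced by one query."""
    return sorted({name.lower() for name in TABLE_PATTERN.findall(query)})


def usage_stats(queries):
    """Count how often each table is queried and how often each pair appears together."""
    table_counts, join_counts = Counter(), Counter()
    for query in queries:
        tables = tables_in(query)
        table_counts.update(tables)
        join_counts.update(combinations(tables, 2))
    return table_counts, join_counts


table_counts, join_counts = usage_stats(QUERY_LOG)
print(table_counts.most_common())
print(join_counts.most_common())
```

The same counts, kept per user, could also feed the last two items: the people who query a table most often are the natural candidates to notify after a schema change or to suggest as experts.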

Are you looking for a data discovery tool?

At Castor, we are building the new generation of data catalog, discovery and governance tools. Our product is plug-and-play, scales with your team, and is designed to improve collaboration among users.

If you want to try Castor, whether out of pure interest or for a real business case, we'd be more than happy to help. 6-minute setup guaranteed. Please contact us at xavier.de-boisredon@castordoc.com
