What is "Data Discovery"?

What it is, and how tools can help

Data assets grow exponentially

Scale-ups are companies that grow much faster than average thanks to a scalable business model and large amounts of invested capital ($20m minimum).


Inevitably, these companies see explosive growth in the amount of data they collect and create. Consequently, the number of internal data resources (tables, dashboards, reports, KPIs…) explodes. New people and tools are required to manage this data and realize its full potential. It is not rare for scale-ups to recruit and onboard new data people every week.


On the one hand, this exponential growth in data assets is a good thing, as it reflects a sound investment in a data-driven ecosystem. On the other hand, it raises a host of new problems: how do you find and trust the relevant data to conduct meaningful analysis? How do you gather the knowledge of data people and easily share it with new employees? How do you make sure everyone is on the same page when analyzing a specific KPI?


Data discovery tools are no longer optional

Tech giants, with hundreds of thousands of data assets, built their own internal tools to solve this problem. Airbnb built DataPortal, Uber DataBook, LinkedIn DataHub, Spotify Lexikon, WeWork Marquez, etc. Not one of these companies could do without a data discovery tool.


Yet these tools are expensive and time-consuming to build and maintain. Smaller scale-ups can't afford to build them, but that doesn't mean they don't need them. As data scientists, we have experienced this issue ourselves, and after interviewing more than 200 data people in 100+ companies, we started building Castor. Castor is a collaborative, automated data discovery tool. It is designed to be used by anyone in the company, and you can get it up and running in 6 minutes. The following sections describe what we built.


What does a Data Discovery tool do?

Search

When it comes to finding something fast, there are two methods: either you spend time organizing things well (the way libraries work), or you index every resource you have (the way Google works). The library model works great with a limited number of assets, but it quickly becomes impossible to maintain when lots of people interact with large numbers of assets. The Google model scales well but is expensive to develop.


The best tools chose to build a Google-like search that helps anyone within a company find and understand data assets, even without any knowledge of how databases or SQL work.
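To make the "index every resource" idea concrete, here is a minimal sketch of an inverted index over asset names and descriptions. The catalog contents, the function names and the scoring rule (count of matching query words) are all assumptions made for illustration. This is not how Castor or any of the tools above implement search, and a production search engine would add stemming, typo tolerance and popularity-based ranking.

```python
import re
from collections import defaultdict

# Hypothetical catalog: asset name -> free-text description.
ASSETS = {
    "analytics.orders": "One row per customer order, with revenue and order date",
    "analytics.customers": "Customer profile with signup date and country",
    "dashboards.weekly_revenue": "Weekly revenue KPI dashboard for the finance team",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(assets):
    """Map every token to the set of assets whose name or description contains it."""
    index = defaultdict(set)
    for name, description in assets.items():
        for token in tokenize(name + " " + description):
            index[token].add(name)
    return index


def search(index, query):
    """Rank assets by the number of distinct query tokens they match."""
    scores = defaultdict(int)
    for token in set(tokenize(query)):
        for name in index.get(token, ()):
            scores[name] += 1
    # Best score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda name: (-scores[name], name))


index = build_index(ASSETS)
print(search(index, "customer revenue"))
```

The index is built once and each lookup only touches the tokens in the query, which is why this model keeps working as the number of assets grows, at the cost of building and maintaining the indexing pipeline.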


Context and Metadata

Once you have found what you are looking for, you need to 1) understand and 2) trust the data in front of you. For that reason, data discovery tools have a built-in Wikipedia-like page for each data asset in the company. You will find information on:


Programmatically curated information

  • Table and column names
  • Column type
  • Last updates
  • Owners
  • Frequent users
  • Source code of the table

Manually curated information

  • Description
  • Tags
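The two lists above can be pictured as a single record per data asset. The sketch below is one possible shape for such a page, assuming hypothetical class and field names (`AssetPage`, `Column`, `is_documented`) that simply mirror the bullets. It is an illustration, not the schema of any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Column:
    name: str
    type: str


@dataclass
class AssetPage:
    # Programmatically curated (harvested from the warehouse and query logs)
    table_name: str
    columns: list[Column]
    last_updated: datetime
    owners: list[str]
    frequent_users: list[str]
    source_code: str
    # Manually curated (written by data people)
    description: str = ""
    tags: list[str] = field(default_factory=list)

    def is_documented(self) -> bool:
        """A page counts as documented once someone has written a description."""
        return bool(self.description.strip())


page = AssetPage(
    table_name="analytics.orders",
    columns=[Column("id", "INTEGER"), Column("created_at", "TIMESTAMP")],
    last_updated=datetime(2021, 1, 15, 8, 30),
    owners=["alice"],
    frequent_users=["bob", "carol"],
    source_code="SELECT id, created_at FROM raw.orders",
)
print(page.is_documented())  # False until a description is added
page.description = "One row per customer order."
page.tags.append("finance")
print(page.is_documented())
```

Keeping the manual fields optional reflects the split in the lists: the automated part can be filled in on day one, while descriptions and tags accumulate as people contribute.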


Dashboard indexing (newer tools)

Recent cloud data discovery tools are working hard to build connectors to your favorite BI tools.


They reference dashboards and views so that one can document their usage and appoint an owner. By linking dashboards to the tables used to build them, it will become even easier to understand the data and get the full context around it. Data engineers will be able to spot which dashboards might break after a column change or ETL modification.
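As a rough illustration of how linking dashboards to tables helps spot breakage, the sketch below stores, for each dashboard, the table columns it reads and looks up which dashboards depend on a column that is about to change. The dashboard names, table names and the `impacted_dashboards` helper are hypothetical. In a real tool the dependency map would be extracted by the BI connectors rather than written by hand.

```python
# Hypothetical lineage: dashboard -> set of (table, column) pairs it reads.
DASHBOARD_DEPENDENCIES = {
    "Weekly revenue": {
        ("analytics.orders", "amount"),
        ("analytics.orders", "created_at"),
    },
    "Customer churn": {("analytics.customers", "churned_at")},
    "Orders by country": {
        ("analytics.orders", "id"),
        ("analytics.customers", "country"),
    },
}


def impacted_dashboards(dependencies, table, column=None):
    """Return dashboards that read the given table, or one specific column of it."""
    impacted = []
    for dashboard, reads in dependencies.items():
        for read_table, read_column in reads:
            if read_table == table and (column is None or read_column == column):
                impacted.append(dashboard)
                break  # one matching read is enough to flag the dashboard
    return sorted(impacted)


# Which dashboards might break if analytics.orders.amount is renamed?
print(impacted_dashboards(DASHBOARD_DEPENDENCIES, "analytics.orders", "amount"))
# Which dashboards might break if the whole table's ETL changes?
print(impacted_dashboards(DASHBOARD_DEPENDENCIES, "analytics.orders"))
```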


Usage (newer tools)

Have you ever looked at a table and wondered: how are people usually querying this table? Which column should I use as a timestamp:  _TZ or TIMESTAMP? What are the frequently joined tables?


Well, if the answer is yes, you'll like this feature. We plan to parse and reference all the queries made by data people within the company to:

  • Highlight the most popular tables
  • Highlight the most popular queries and table joins
  • Notify the most frequent users after a change in the table schema
  • Map knowledge within the company to programmatically assign data asset experts
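A simplified sketch of the first two bullets: scan a query log, count how often each table is referenced, and count which tables appear together in the same query. The query log and table names are invented for illustration, and the regular expression is a deliberate shortcut. It only catches names that directly follow `FROM` or `JOIN`, so a real implementation would rely on a proper SQL parser. Co-occurrence in one query is also only an approximation of an actual join.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical query log, e.g. exported from the warehouse's query history.
QUERY_LOG = [
    "SELECT * FROM analytics.orders o JOIN analytics.customers c ON o.customer_id = c.id",
    "SELECT count(*) FROM analytics.orders",
    "SELECT c.country, sum(o.amount) FROM analytics.orders o "
    "JOIN analytics.customers c ON o.customer_id = c.id GROUP BY 1",
    "SELECT * FROM analytics.events",
]

# Assumption: table names directly follow FROM or JOIN (no subqueries or CTEs).
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def tables_in(query):
    """Return the distinct table names referenced in a query, sorted."""
    return sorted({name.lower() for name in TABLE_PATTERN.findall(query)})


table_counts = Counter()
join_counts = Counter()
for query in QUERY_LOG:
    tables = tables_in(query)
    table_counts.update(tables)
    # Every pair of tables used in the same query counts as one co-occurrence.
    join_counts.update(combinations(tables, 2))

print(table_counts.most_common(2))
print(join_counts.most_common(1))
```

The same counts, grouped by the user who ran each query, would be a natural starting point for the last two bullets: notifying frequent users and suggesting experts for a table.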


Are you looking for a data discovery tool?

At Castor, we are building a new generation of data catalog, discovery and governance tools. Our product is plug-and-play, scales with your team, and is designed throughout to improve collaboration among users.


If you want to try Castor, whether out of pure interest or for a real business case, we'd be more than happy to help. A 6-minute setup is guaranteed. Please contact us at xavier.de-boisredon@castordoc.com

