Data assets grow exponentially
Scale-ups are companies that grow much faster than average thanks to a scalable business model and large amounts of invested money ($20m minimum).
Inevitably, those companies see explosive growth in the amount of data they collect and create. Consequently, the number of internal data resources (tables, dashboards, reports, KPIs…) explodes. New people and tools are required to manage that data and fully benefit from its potential. It is not rare for scale-ups to recruit and onboard new data people every week.
On the one hand, this exponential growth in data assets is a good thing, as it reflects a sound investment in a data-driven ecosystem. On the other hand, it brings up an ocean of new problems: how do you find and trust the relevant data to conduct meaningful analysis? How do you gather the knowledge of data people and easily share it with new employees? How do you make sure everyone is on the same page when analyzing a specific KPI?
Data discovery tools are no longer optional
Tech giants, with hundreds of thousands of data assets, built their own internal tools to solve this problem. Airbnb built DataPortal, Uber DataBook, LinkedIn DataHub, Spotify Lexicon, WeWork Marquez, etc. Not one of these companies could do without a data discovery tool.
Yet those tools are expensive and time-consuming to build and maintain. Smaller scale-ups can't afford to build them, but that doesn't mean they don't need them. As data scientists, we have experienced this issue ourselves, and after interviewing more than 200 data people in 100+ companies, we started building Castor. Castor is a collaborative, automated data discovery tool. It is designed to be used by anyone in the company, and you can get it up and running in 6 minutes. The following sections describe what we built.
Castor, a plug-and-play collaborative data discovery tool
Castor is a data discovery solution inspired by the products tech giants built to solve their own data problems. We worked on building a platform to document data that is:
As collaborative as a Google Doc (comments, history, edit/view, access rights)
As integrated as Slack (automation bots, handle, link/doc sharing)
As easy as a Google Search (powerful search on definitions and names)
As sexy as Airbnb (neat UX and simple features)
What does Castor do?
When it comes to finding something fast, there are two methods: either you spend time organizing things well (the way libraries work), or you index every resource you have (the way Google works). The library model works great with a limited number of assets, but it quickly becomes impossible to maintain when lots of people interact with large numbers of assets. The Google model works very well but is expensive to develop.
We chose to build a Google-like search to help anyone within a company to find and understand data assets, even without any knowledge of how databases or SQL work.
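To make the "index every resource" idea concrete, here is a minimal sketch of an inverted index over asset names and definitions. It is not Castor's implementation: the `ASSETS` catalog, the tokenizer, and the match-count ranking are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Hypothetical catalog: asset name -> free-text definition.
ASSETS = {
    "orders": "One row per customer order with amount and timestamp",
    "customers": "One row per customer with signup date",
    "weekly_revenue_dashboard": "Dashboard tracking revenue per week",
}

def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_index(assets):
    """Map each token to the set of asset names that mention it."""
    index = defaultdict(set)
    for name, definition in assets.items():
        for token in tokenize(name + " " + definition):
            index[token].add(name)
    return index

def search(index, query):
    """Rank assets by how many distinct query tokens they match."""
    scores = defaultdict(int)
    for token in set(tokenize(query)):
        for name in index.get(token, ()):
            scores[name] += 1
    # Best score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda n: (-scores[n], n))

INDEX = build_index(ASSETS)
print(search(INDEX, "customer order"))
```

Because the index covers definitions as well as names, someone who types a business term can land on the right table without knowing how it is named in the warehouse.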
Context and Metadata
Once you have found what you are looking for, you need to 1) understand and 2) trust the data in front of you. For that reason, we built a Wikipedia-like page for each data asset in the company. You will find information on:
Programmatically curated information
Table and column names
Manually curated information
Source code of the table (soon programmatic)
The chat feature is our personal touch. Data engineers and experienced data scientists receive dozens of DMs on Slack every day asking for the meaning of a column, the purpose of a table, or the query to join table A with table B. It's annoying, time-consuming, and definitely not the best use of their valuable time.
We designed a chat interface that solves this problem by building a dynamic FAQ as questions are asked. Every data asset has its own chat interface where people can talk and ask questions as they face problems. The questions and answers are saved and public, so each question only needs to be asked once.
Lineage (to be continued)
Lineage is one of the trickiest features of our product. During our 200+ interviews, we noticed that companies process their ETL (the scripts that create tables in the data warehouse) in many different ways. They also use many different tools and frameworks, which makes it really hard to develop a fully automated lineage solution.
As a result, and because we are targeting smaller companies, we came up with a manual lineage feature for now (see above). We are developing integrations with DBT and other lineage providers like Datakin to extract the lineage programmatically in the near future.
We are currently working on a Neo4j Bloom integration to visualize the lineage. Here are the first screenshots of the user interface (image above). This interface would let data consumers browse the relationships between tables through a graph visualization and select only the resources relevant to their analysis.
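Whether lineage is declared by hand or extracted by a tool, it boils down to a directed graph of tables. As a minimal sketch (the `LINEAGE` mapping and the table names are hypothetical, and this is not how Castor or Neo4j Bloom stores the graph), a breadth-first traversal is enough to list everything downstream of a table:

```python
from collections import deque

# Hypothetical, manually declared lineage:
# each table maps to the tables that are built directly from it.
LINEAGE = {
    "raw_orders": ["clean_orders"],
    "raw_customers": ["clean_customers"],
    "clean_orders": ["revenue_daily", "orders_by_customer"],
    "clean_customers": ["orders_by_customer"],
    "revenue_daily": [],
    "orders_by_customer": [],
}

def downstream(lineage, table):
    """Return every table that depends on `table`, directly or not, in BFS order."""
    seen = set()
    order = []
    queue = deque(lineage.get(table, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already reached through another path
        seen.add(current)
        order.append(current)
        queue.extend(lineage.get(current, []))
    return order

print(downstream(LINEAGE, "raw_orders"))
```

Reversing the edges of the same mapping would answer the opposite question: which sources a given table was built from.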
As we believe that collaboration is essential to a successful management strategy, we needed to implement a version history. It currently tracks all the modifications happening on the platform.
Example use cases:
If a data user thinks the definition of a column is not right, they can look into the version history and ask the person who modified the documentation why they wrote it.
If the owner of a table realizes that someone has been editing definitions incorrectly, they can discuss it with that person and prevent a possibly expensive misuse of data.
It also records the changes that happened to the table schemas in the past (if a column was added/deleted …).
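Recording schema changes amounts to comparing two snapshots of the same table. The sketch below assumes each snapshot is a plain `{column: type}` dictionary; the function name `diff_schema` and the sample columns are hypothetical, not part of Castor.

```python
def diff_schema(old, new):
    """Compare two {column: type} snapshots of the same table."""
    old_cols, new_cols = set(old), set(new)
    return {
        "added": sorted(new_cols - old_cols),
        "removed": sorted(old_cols - new_cols),
        # Columns present in both snapshots whose type changed.
        "retyped": sorted(c for c in old_cols & new_cols if old[c] != new[c]),
    }

# Hypothetical snapshots taken before and after an ETL change.
before = {"id": "INTEGER", "amount": "FLOAT", "created": "TIMESTAMP"}
after = {"id": "INTEGER", "amount": "NUMERIC", "currency": "VARCHAR"}

print(diff_schema(before, after))
```

Storing one such diff per change, along with who made it and when, is enough to rebuild the history of a table.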
Data Quality (manually curated)
For now, we haven't added automated data quality features. For this part, we are looking into integrating with data quality specialists like Toro Data, or with a company's own custom-made data quality scoring.
We added a slider so people can select the level of confidence they place in a data resource. For example, as a data analyst, I need to create a temporary dataset for ongoing research. I don't want other people to use what I am doing until I am done, so I assign a data quality score of 0% to my table. Conversely, say I just finished a major data table that replaces a former one, which will no longer be updated. I then assign 100% to the new table and lower the score of the old one.
Dashboard indexing (coming soon)
We are working hard to build connectors to your favorite BI tools. As most of the data people we interviewed use Tableau and Looker, we are prioritizing those tools.
We plan to reference dashboards and views so that one can document their usage and appoint an owner. By linking dashboards to the tables used to build them, it will become even easier to understand the data and get the full context around it. Data engineers will be able to spot which dashboards might break after a column change or ETL modification.
Usage (coming soon)
Have you ever looked at a table and wondered: how are people usually querying this table? Which column should I use as a timestamp: _TZ or TIMESTAMP? What are the frequently joined tables?
Well, if the answer is yes, you'll like this feature. We plan to parse and reference all the queries made by data people within the company to:
Highlight the most popular tables
Highlight the most popular queries and table joins
Notify the most frequent users after a change in the table schema
Map knowledge within the company to programmatically assign data asset experts
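As an illustration of the first two goals in the list above, the sketch below counts table references and co-occurring tables in a small query log. Everything here is an assumption made for the example: the `QUERY_LOG` contents are invented, and the regex only looks at the identifier after `FROM` or `JOIN`, so a real implementation would need a proper SQL parser to handle subqueries, CTEs, and expressions such as `EXTRACT(day FROM ts)`.

```python
import re
from collections import Counter
from itertools import combinations

# Naive pattern: the identifier that follows FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([a-z_][\w.]*)", re.IGNORECASE)

# Hypothetical query log.
QUERY_LOG = [
    "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT count(*) FROM orders",
    "SELECT c.id FROM customers c JOIN orders o ON o.customer_id = c.id",
]

def tables_in(query):
    """Return the sorted, de-duplicated table names referenced by one query."""
    return sorted({name.lower() for name in TABLE_REF.findall(query)})

def usage_stats(queries):
    """Count how often each table, and each pair of tables, appears in a query."""
    table_counts = Counter()
    join_counts = Counter()
    for query in queries:
        tables = tables_in(query)
        table_counts.update(tables)
        # Tables are sorted, so each pair is always counted in the same order.
        join_counts.update(combinations(tables, 2))
    return table_counts, join_counts

table_counts, join_counts = usage_stats(QUERY_LOG)
print(table_counts.most_common(1))
print(join_counts.most_common(1))
```

Keeping the author of each query next to these counts would also give a starting point for the last two goals: knowing whom to notify and who the likely experts are.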
If you want to try Castor, out of pure interest or for a real business case, we'd be more than happy to help. 6-minute setup guaranteed. Please contact us at firstname.lastname@example.org