Scale-ups are companies that grow much faster than average thanks to a scalable business model and large amounts of invested capital ($20m minimum).
Inevitably, those companies see explosive growth in the amount of data they collect and create. Consequently, the number of internal data resources (tables, dashboards, reports, KPIs…) explodes. New people and tools are required to manage their data and fully benefit from its potential. It is not rare for scale-ups to recruit and onboard new data people every week.
On the one hand, this exponential growth in data assets is a good thing, as it reflects a sound investment in a data-driven ecosystem. On the other hand, it brings an ocean of new problems: how do you find and trust the relevant data to conduct meaningful analysis? How do you gather the knowledge of data people and easily share it with new employees? How do you make sure everyone is on the same page when analyzing a specific KPI?
Tech giants, with hundreds of thousands of data assets, built their own internal tools to solve this problem. Airbnb built DataPortal, Uber DataBook, LinkedIn DataHub, Spotify Lexicon, WeWork Marquez, etc. Not one of these companies could do without a data discovery tool.
Yet those tools are expensive and time-consuming to build and maintain. Smaller scale-ups can't afford to build them, but that doesn't mean they don't need them. As data scientists, we have experienced this issue ourselves, and after interviewing more than 200 data people in 100+ companies, we started building Castor. Castor is a collaborative, automated data discovery tool. It is designed to be used by anyone in the company, and you can get it up and running in 6 minutes. The following sections describe what we built.
Castor is a data discovery solution inspired by the products tech giants built to solve their own data problems. We worked on building a platform to document data that is:
When it comes to finding something fast, there are two methods: either you spend time organizing things well (the way libraries work), or you index every resource you have (the way Google works). The library model works great with a limited number of assets, but it quickly becomes impossible to maintain when lots of people interact with large numbers of assets. The Google model works very well but is expensive to develop.
We chose to build a Google-like search to help anyone within a company to find and understand data assets, even without any knowledge of how databases or SQL work.
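To make the "index every resource" idea concrete, here is a minimal sketch of an inverted index over data asset names and descriptions. It is only an illustration, not Castor's actual search implementation: the class AssetIndex, the asset names, and the scoring rule (count of matching query words) are all assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters,
    so 'order_items' becomes ['order', 'items']."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


class AssetIndex:
    """Hypothetical inverted index: token -> set of asset names."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, name, description=""):
        # Index both the asset name and its free-text description.
        for token in tokenize(name + " " + description):
            self._postings[token].add(name)

    def search(self, query):
        # Score each asset by the number of query tokens it contains.
        scores = defaultdict(int)
        for token in tokenize(query):
            for name in self._postings.get(token, ()):
                scores[name] += 1
        # Best score first; ties broken alphabetically.
        return sorted(scores, key=lambda n: (-scores[n], n))


# Made-up assets, for illustration only.
index = AssetIndex()
index.add("analytics.orders", "One row per customer order")
index.add("analytics.order_items", "Line items belonging to an order")
index.add("finance.revenue_dashboard", "Monthly revenue KPI dashboard")

results = index.search("monthly revenue")
```

The point of the design is that the user types plain words and never needs to know a schema or write SQL; a production search would add ranking signals such as popularity or freshness on top of this.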
Once you have found what you are looking for, you need to 1) understand and 2) trust the data in front of you. For that reason, we built a Wikipedia-like page for each data asset in the company. You will find information on:
Programmatically curated information
Manually curated information
The chat feature is our personal touch. All data engineers and experienced data scientists receive dozens of DMs on Slack every day asking for the meaning of a column, the purpose of a table, or the query to join table A with table B. It's annoying, time-consuming, and definitely not the best use of their valuable time.
We designed a chat interface that solves this problem by building a dynamic FAQ as questions are asked. Every data asset has its own chat interface where people can talk and ask questions as they run into problems. The questions and answers are kept and made public so that each question only needs to be asked once.
Lineage is one of the trickiest features of our product. During our 200+ interviews, we noticed that companies process their ETL (the scripts that create tables in the data warehouse) in many different ways. They also use a wide variety of tools and frameworks, which makes it really hard to develop a fully automated lineage solution.
As a result, and because we are targeting smaller companies, we came up with a manual lineage feature for now (see above). We are developing integrations with DBT and other lineage providers like Datakin so that, in the near future, the lineage can be extracted programmatically.
We are currently working on a Neo4j Bloom integration for visualizing the lineage. Here are the first screenshots of the user interface (image above). This interface would let data consumers browse the relationships between tables and select only the resources relevant to their analysis, using the power of graph visualization.
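As a rough illustration of what manually declared lineage makes possible, the sketch below stores "table A feeds table B" edges in a directed graph and walks it breadth-first to list everything downstream of a table. The Lineage class and the table names are hypothetical; this is neither Castor's data model nor a Neo4j API.

```python
from collections import defaultdict, deque


class Lineage:
    """Hypothetical store for manually declared lineage edges."""

    def __init__(self):
        # parent table -> set of tables built from it
        self._children = defaultdict(set)

    def add_edge(self, source, target):
        """Declare that `target` is built from `source`."""
        self._children[source].add(target)

    def downstream(self, table):
        """Return every table that depends, directly or not, on `table`."""
        seen, queue = set(), deque([table])
        while queue:
            current = queue.popleft()
            for child in self._children.get(current, ()):
                if child not in seen:  # guards against cycles
                    seen.add(child)
                    queue.append(child)
        return sorted(seen)


# Made-up tables, for illustration only.
lineage = Lineage()
lineage.add_edge("raw.events", "staging.sessions")
lineage.add_edge("raw.events", "staging.clicks")
lineage.add_edge("staging.sessions", "analytics.daily_active_users")

impacted = lineage.downstream("raw.events")
```

The same traversal run on reversed edges would give the upstream sources of a table, which is what a consumer typically wants when deciding whether to trust it.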
As we believe that collaboration is essential to a successful management strategy, we needed to implement a version history. It currently tracks all the modifications made on the platform.
Example use cases:
It also records past changes to the table schemas (whether a column was added, deleted, …).
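A schema history of this kind can be derived by comparing successive snapshots of a table's columns. The following is a minimal sketch under that assumption: diff_schema is a hypothetical helper, and the column names and types are invented for the example.

```python
def diff_schema(old, new):
    """Compare two {column_name: column_type} snapshots of one table."""
    old_cols, new_cols = set(old), set(new)
    return {
        "added": sorted(new_cols - old_cols),
        "removed": sorted(old_cols - new_cols),
        # Columns present in both snapshots whose type changed.
        "retyped": sorted(c for c in old_cols & new_cols if old[c] != new[c]),
    }


# Made-up snapshots, for illustration only.
before = {"id": "INTEGER", "created_at": "TIMESTAMP", "amount": "FLOAT"}
after = {
    "id": "INTEGER",
    "created_at": "TIMESTAMP",
    "amount": "NUMERIC",
    "currency": "VARCHAR",
}

changes = diff_schema(before, after)
```

Storing one such diff per snapshot, together with a timestamp, is enough to answer "when did this column appear?" without keeping every full schema copy.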
For now, we haven’t added data quality features. We are looking into integrating either with data quality specialists like Bigeye or with the custom data quality scoring a company has built itself.
We added a slider so people can indicate the level of confidence others can place in a data resource. As an example, say that as a data analyst I need to create a temporary dataset for ongoing research. I don’t want other people to use it until I am done, so I assign my table a data quality score of 0%. Conversely, say I have just finished a major data table that will replace a former one that will no longer be updated. I then assign 100% to the new table and lower the score of the old one.
We are working hard to build connectors to your favorite BI tools. As most of the data people we interviewed use Tableau and Looker, we are prioritizing those tools.
We plan to reference dashboards and views so that one can document their usage and appoint an owner. By linking dashboards to the tables used to build them, it will become even easier to understand the data and get the full context around it. Data engineers will be able to spot which dashboards might break after a column change or ETL modification.
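Once dashboards are linked to the tables and columns they read, spotting what might break becomes a simple lookup. The sketch below assumes such a mapping already exists; affected_dashboards, the dashboard names, and the column references are hypothetical, and nothing here calls a Tableau or Looker API.

```python
def affected_dashboards(dashboard_columns, table, column=None):
    """List dashboards reading `table` (or one `column` of it).

    `dashboard_columns` maps a dashboard name to the set of
    (table, column) pairs it depends on.
    """
    hits = []
    for dashboard, refs in dashboard_columns.items():
        for ref_table, ref_column in refs:
            if ref_table == table and (column is None or ref_column == column):
                hits.append(dashboard)
                break  # one matching reference is enough
    return sorted(hits)


# Made-up dependencies, for illustration only.
usage = {
    "Revenue overview": {
        ("analytics.orders", "amount"),
        ("analytics.orders", "created_at"),
    },
    "Signup funnel": {("analytics.users", "signup_date")},
    "Order volume": {("analytics.orders", "id")},
}

at_risk = affected_dashboards(usage, "analytics.orders", "amount")
```

Passing only the table name returns every dashboard built on it, which is the check a data engineer would run before changing the ETL that produces the whole table.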
Have you ever looked at a table and wondered: how are people usually querying this table? Which column should I use as a timestamp: _TZ or TIMESTAMP? What are the frequently joined tables?
Well, if the answer is yes, you'll like this feature. We plan to parse and reference all the queries made by data people within the company to:
If you want to try Castor, whether out of pure interest or for a real business case, we'd be more than happy to help. A 6-minute setup is guaranteed. Please contact us at email@example.com