Scale-ups are companies growing much faster than average companies thanks to their scalable business model and high amounts of money invested ($20m minimum).
Inevitably, those companies see explosive growth in the amount of data they collect and create. Consequently, the number of internal data resources (tables, dashboards, reports, KPIs…) explodes. New people and tools are required to manage this data and fully benefit from its potential. It is not rare for scale-ups to recruit and onboard new data people every month.
On the one hand, this exponential growth in data assets is a good thing, as it reflects a sound investment in a data-driven ecosystem. On the other hand, it brings up an ocean of new problems: how do you find and trust the relevant data to conduct meaningful analysis? How do you gather the knowledge of data people and easily share it with new employees? How do you make sure everyone is on the same page when analyzing a specific KPI?
Tech giants, with hundreds of thousands of data assets, built their own internal tools to solve this problem: Airbnb built Dataportal, Uber Databook, LinkedIn DataHub, Spotify Lexikon, WeWork Marquez, Lyft Amundsen, etc. Not one of these companies could do without a data discovery tool.
Yet, those tools are expensive and time-consuming to build and maintain. Smaller scale-ups can’t afford to build them. However, it doesn’t mean that they don’t need them.
This resulted in a race to build the best tools for mid-market companies. Several startups raised money to build the best tool. Here's a comprehensive benchmark of the solutions out there.
In the meantime, let me take you through the progress made in the last year for Castor.
Castor is a data catalog solution inspired by tech giants’ products to solve their data problems. We worked on building a platform to document data that is:
I chose to show you some features along with a screenshot from a year ago and one of now. We are proud of the product we have and are looking forward to what's next.
When it comes to finding something fast, there are two methods: either you spend time organizing things well (the way libraries work), or you index every resource you have (the way Google works). The library model works great with a limited number of assets, but it quickly becomes impossible to maintain when lots of people interact with large numbers of assets. The Google model scales well but is expensive to develop.
We chose to build a dead simple, yet powerful, search to help anyone within a company to find and understand data assets, even without any knowledge of how databases or SQL work.
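To make the "index everything" approach concrete, here is a minimal sketch of an inverted index over asset names and descriptions. It is an illustration only, not Castor's actual search engine: the asset names, descriptions, and the functions `tokenize`, `build_index`, and `search` are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(assets: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the assets whose name or description contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, description in assets.items():
        for token in tokenize(name + " " + description):
            index[token].add(name)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the assets that match every token of the query (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(token, set()) for token in tokens))


# Hypothetical catalog: asset name -> free-text description.
ASSETS = {
    "customer_orders": "One row per order placed by a customer, with payment details",
    "web_sessions": "One row per visit to the website",
}

INDEX = build_index(ASSETS)
payment_matches = search(INDEX, "payment orders")
row_matches = search(INDEX, "row")
```

Because every asset is indexed the moment it is ingested, nobody has to file it in the right folder first, which is what makes this model hold up as the number of assets grows.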
Imagine you are looking at a data table (a data asset, similar to an Excel spreadsheet) referencing all the "customer orders", with a column "payment":
Because you need to answer those questions, once you have found what you are looking for, you need to 1) understand and 2) trust the data in front of you. For that reason, we built a Wikipedia-like page for each data asset in the company.
In Castor, you will find information on:
Automatically curated information
Manually curated information
The chat feature is our personal touch. Data engineers and experienced data scientists receive dozens of DMs on Slack every day asking for the meaning of a column, the purpose of a table, or the query to join table A with table B.
It’s annoying, time-consuming, and definitely not the best way to use their valuable time.
We designed a chat interface that would solve this problem by creating a dynamic FAQ as questions are asked. Every data asset has its own chat interface where people can talk and ask questions as they face problems.
The questions and answers are kept and made public, so that each question is asked only once and everyone benefits from the answer.
Lineage is a powerful feature. It tells you where your data comes from and which data resources within the company depend on the data you have in front of you.
It is a kind of family tree, but for data. You can see the parent tables (the tables used to create the data asset) and the child tables (the tables created from this data asset).
Data lineage is particularly powerful for:
Castor provides lineage from data tables inside the data warehouse to BI dashboards in Looker, Tableau, Metabase, etc.
We compute the lineage in minutes thanks to a powerful parsing algorithm. We also extract the lineage from other tools like dbt.
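The family-tree idea can be sketched as a directed graph walked in both directions. This is a simplified illustration that assumes the (parent, child) edges have already been extracted, for example by parsing SQL; it does not show how Castor computes them. The table names and the helpers `build_graph` and `reachable` are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical (parent, child) edges: the child is built from the parent.
EDGES = [
    ("raw_orders", "customer_orders"),
    ("raw_payments", "customer_orders"),
    ("customer_orders", "revenue_daily"),
    ("revenue_daily", "revenue_dashboard"),
]


def build_graph(edges):
    """Return two adjacency maps: asset -> parents and asset -> children."""
    parents = defaultdict(set)
    children = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
        children[parent].add(child)
    return parents, children


def reachable(start, neighbors):
    """Breadth-first walk returning every asset reachable from `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


PARENTS, CHILDREN = build_graph(EDGES)
upstream = reachable("customer_orders", PARENTS)      # where the data comes from
downstream = reachable("customer_orders", CHILDREN)   # what depends on it
```

Walking `CHILDREN` answers "what breaks if I change this table?", while walking `PARENTS` answers "where does this number come from?".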
As we believe that collaboration is essential to a successful data management strategy, by design everyone can edit descriptions (rights management will be available for enterprise companies). As a result, we needed to implement a version history. It currently tracks all the modifications happening on the platform.
Example use cases:
It also records past changes to table schemas (a column added, a column deleted, …).
Get to know everything that happened when you were out.
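Schema-change tracking of this kind boils down to comparing two snapshots of a table. Here is a minimal sketch, assuming each snapshot is a simple column-name-to-type mapping; the function `diff_schema` and the sample schemas are hypothetical, and reporting type changes is an extra added here for illustration.

```python
def diff_schema(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {column: type} snapshots of the same table."""
    old_cols, new_cols = set(old), set(new)
    return {
        "added": sorted(new_cols - old_cols),
        "removed": sorted(old_cols - new_cols),
        # Columns present in both snapshots whose declared type changed.
        "retyped": sorted(c for c in old_cols & new_cols if old[c] != new[c]),
    }


# Hypothetical snapshots of the same table taken a week apart.
LAST_WEEK = {"id": "INT", "payment": "FLOAT", "legacy_flag": "BOOL"}
TODAY = {"id": "INT", "payment": "NUMERIC", "currency": "STRING"}

changes = diff_schema(LAST_WEEK, TODAY)
```

Storing one such diff per snapshot gives a readable changelog for each table.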
We worked hard to build connectors to BI tools. As most of the data people we interviewed were using Tableau, Looker, and Metabase, we prioritized those tools. We now have integrations available for most BI tools. Check them out here.
The search experience in BI tools is not optimal. For example, if you search for a specific field in Looker, you have to open each Explore and search for that field. End users end up lost among all the Explores, Looks, and Dashboards when they are looking for specific content.
To improve the search experience, we show the content used to create a dashboard as well as the content that depends on it, and we rank search results by popularity. This helps increase trust and visibility in BI tools.
Have you ever looked at a table and wondered: how are people usually querying this table? Which column should I use as a timestamp: _TZ or TIMESTAMP? What are the frequently joined tables?
Well, if the answer is yes, you’ll like this feature. We are parsing and referencing all the queries made by data people within the company to:
This popularity index adds a lot of value across various features. With it, we can show the most relevant data assets in the search, order lineage results to show the most popular first, and encourage data people across your company to document the data assets that are queried by the largest number of people.
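As a rough illustration of how a popularity index can be derived from query logs, the sketch below scores each table by its number of distinct users, then by its total number of queries. The scoring rule, the log format, and the names `popularity` and `rank` are assumptions made for this example, not Castor's actual formula.

```python
from collections import Counter, defaultdict

# Hypothetical parsed query log: one (user, table) pair per table reference.
QUERY_LOG = [
    ("alice", "customer_orders"),
    ("bob", "customer_orders"),
    ("alice", "customer_orders"),
    ("carol", "web_sessions"),
]


def popularity(query_log):
    """Score each table as (distinct users, total queries)."""
    counts = Counter(table for _, table in query_log)
    users = defaultdict(set)
    for user, table in query_log:
        users[table].add(user)
    return {table: (len(users[table]), counts[table]) for table in counts}


def rank(candidates, scores):
    """Order candidate assets from most to least popular."""
    return sorted(candidates, key=lambda t: scores.get(t, (0, 0)), reverse=True)


SCORES = popularity(QUERY_LOG)
ranked = rank(["tmp_orders_backup", "web_sessions", "customer_orders"], SCORES)
```

Tables that never appear in the log fall back to a score of `(0, 0)` and sink to the bottom of the results.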
One thing that we hear all the time in sales calls is this:
"At my last board meeting, I felt stupid. As CFO, I came up with a number for Active Users but the CMO had another, the CPO another one. This can't happen anymore."
As a result, we added a KPI/Definition module in Castor.
It helps align everyone on a single source of truth that is linked to the data and can be validated by a set of owners. Basically, it is a Notion-like interface connected to your data warehouse, BI, and data quality tools. In this module, everyone can define a concept, assign owners and tags, add code snippets to explain how to technically extract this KPI, assign related tables and dashboards, and add an "approved by owners" badge.
We are working on a feature to make sure that every department has their say in the approval process.
Castor is designed to grow organically in companies with a high level of product automation and collaboration. Yet, some companies might want to push the data documentation efforts further. This is why we developed the admin panel. It helps the data governance lead to prioritize and leverage efforts.
As an official data governance lead, or just someone who cares about data documentation, you want to track progress and prioritize documentation work. The key to a good documentation strategy is putting effort where it matters the most. We built this feature to help data governance leads identify the most relevant assets to document first.
It basically answers: which 20% of tables account for 80% of the data consumption? With minimal effort, your team can completely change the trust in data across the whole company.
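The 80/20 question above can be sketched as a greedy selection over per-table query counts. This is a minimal example with made-up numbers; the function `top_tables_for_coverage` and the use of query counts as a proxy for "data consumption" are assumptions.

```python
def top_tables_for_coverage(query_counts: dict[str, int], coverage: float = 0.8) -> list[str]:
    """Return the most-queried tables that together reach the target share of queries."""
    total = sum(query_counts.values())
    if total == 0:
        return []
    selected, covered = [], 0
    # Walk tables from most to least queried until the target share is reached.
    for table, count in sorted(query_counts.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= coverage * total:
            break
        selected.append(table)
        covered += count
    return selected


# Hypothetical number of queries per table over the last month.
QUERY_COUNTS = {
    "customer_orders": 600,
    "web_sessions": 250,
    "users": 100,
    "tmp_a": 30,
    "tmp_b": 20,
}

to_document_first = top_tables_for_coverage(QUERY_COUNTS)
```

In this toy example, two tables out of five already cover 85% of the queries, so they are the ones worth documenting first.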
Once the most popular tables are documented, Castor suggests to administrators where column definitions can be propagated. For example, say you have a column called "daily_time_spent_per_active_user" in a table called "customer_success_metrics". This column also exists in 20 other tables and reports. You could go into each one of them and copy/paste the definition or ... propagate it in a few clicks with Castor.
Castor enables admins to choose the tables for which they want to propagate the definition.
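A simplified version of this propagation flow is sketched below. It assumes that columns are matched on their exact name, which is a deliberate simplification (the matching logic Castor really uses is not described here); the catalog contents and the functions `suggest_propagation` and `apply_suggestions` are hypothetical.

```python
def suggest_propagation(definitions, catalog):
    """List (table, column, description) for undocumented columns with a known definition."""
    suggestions = []
    for table, columns in catalog.items():
        for column, description in columns.items():
            if not description and column in definitions:
                suggestions.append((table, column, definitions[column]))
    return suggestions


def apply_suggestions(catalog, suggestions, approved_tables):
    """Write the suggested descriptions, but only for tables the admin approved."""
    for table, column, description in suggestions:
        if table in approved_tables:
            catalog[table][column] = description


# Hypothetical definition written once on a documented table.
DEFINITIONS = {
    "daily_time_spent_per_active_user": "Average minutes spent per active user per day",
}

# Hypothetical catalog: table -> {column: description or None}.
CATALOG = {
    "weekly_report": {"daily_time_spent_per_active_user": None, "week": "ISO week"},
    "sandbox_copy": {"daily_time_spent_per_active_user": None},
}

SUGGESTIONS = suggest_propagation(DEFINITIONS, CATALOG)
apply_suggestions(CATALOG, SUGGESTIONS, approved_tables={"weekly_report"})
```

Keeping the suggestion step separate from the apply step is what lets an admin review the list and exclude tables before anything is written.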
We noticed that our clients had source tables imported from several tools (Salesforce, Google Analytics, Zendesk, etc.). The definitions for these tables are always the same, so we decided to build a repository of definitions for all those tools, and admins can choose to propagate them automatically to these tables. For some clients, this represents thousands of columns documented in seconds.
Data quality tools are mostly used by data engineers (and increasingly by analytics engineers) to ensure the quality of their pipelines. There's tremendous value in having data quality information displayed next to your documentation.
Data quality results need to appear next to the documentation for two main reasons:
For this part, we are looking into integrations with data quality specialists like Bigeye, Monte Carlo, and Great Expectations, as well as into providing a simple API for the custom-made data quality tests built by our customers.
We have a lot of ambition. An ambition to build the most automated and collaborative metadata platform on the market. An ambition to make data documentation sexy. This is just the beginning.
Coming next is more automation, more integrations, more collaboration, a new module (surprise), and a fair share of gamification. Stay tuned.