Data assets are STILL growing exponentially
Scale-ups are companies growing much faster than average thanks to a scalable business model and large amounts of invested capital ($20m minimum).
Inevitably, those companies see explosive growth in the amount of data they collect and create. Consequently, the number of internal data resources (tables, dashboards, reports, KPIs…) explodes. New people and tools are required to manage that data and benefit fully from its potential. It is not rare for scale-ups to recruit and onboard new data people every month.
On the one hand, this exponential growth in data assets is a good thing, as it reflects a sound investment in a data-driven ecosystem. On the other hand, it raises an ocean of new problems: how do you find and trust the relevant data to conduct meaningful analysis? How do you gather the knowledge of data people and easily share it with new employees? How do you make sure everyone is on the same page when analyzing a specific KPI?
Data discovery tools are hot
Tech giants, with hundreds of thousands of data assets, built their own internal tools to solve this problem. Airbnb built Dataportal, Uber Databook, LinkedIn DataHub, Spotify Lexikon, WeWork Marquez, Lyft Amundsen, etc. Not one of these companies could do without a data discovery tool.
Yet those tools are expensive and time-consuming to build and maintain. Smaller scale-ups can’t afford to build them. That doesn’t mean they don’t need them.
This resulted in a race to build the best tools for mid-market companies. Several startups raised money to build the best tool. Here's a comprehensive benchmark of the solutions out there.
In the meantime, let me take you through the progress made in the last year for Castor.
Castor is an automated and collaborative data discovery tool
Castor is a data catalog solution inspired by the products tech giants built to solve their own data problems. We worked on building a platform to document data that is:
- As collaborative as a Google Doc (comments, history, edit/view, access rights)
- As integrated as Slack (automation bots, handle, link/doc sharing)
- As easy as a Google Search (powerful search on definitions and names)
- As sexy as Airbnb (neat UX and simple features)
What does Castor do?
I chose to show you some features along with a screenshot from a year ago and one from today. We are proud of the product we have and are looking forward to what's next.
When it comes to finding something fast, there are two methods: either you spend time organizing things well (the way libraries work), or you index every resource you have (the way Google works). The library model works great with a limited number of assets, but it quickly becomes impossible to maintain when lots of people interact with large numbers of assets. The Google model works very well but is expensive to develop.
We chose to build a dead simple, yet powerful, search to help anyone within a company to find and understand data assets, even without any knowledge of how databases or SQL work.
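To make the "index every resource" model concrete, here is a minimal Python sketch of an inverted index over asset names and descriptions. It is only an illustration under our own assumptions, not how Castor's search is implemented: the asset names, descriptions, and function names are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical catalog: asset name -> human-written description.
ASSETS = {
    "analytics.customer_orders": "One row per customer order, with the payment amount",
    "analytics.revenue_daily": "Daily revenue aggregated from customer orders",
    "raw.zendesk_tickets": "Support tickets imported from Zendesk",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(assets):
    """Map every word found in a name or description to the assets containing it."""
    index = defaultdict(set)
    for name, description in assets.items():
        for token in tokenize(name) + tokenize(description):
            index[token].add(name)
    return index


def search(index, query):
    """Return the assets that contain every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(matches)


index = build_index(ASSETS)
results = search(index, "customer orders")
```

The index is built once and every lookup is a set intersection, so nobody has to maintain a folder hierarchy by hand. A real search engine would add ranking, typo tolerance, and stemming on top of this.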
Context and Metadata
Imagine you are looking at a data table (a data asset similar to an Excel spreadsheet) referencing all the "customer orders", with a column "payment":
- is the column payment in $ or €?
- has it been updated today? last week? last year?
- does it contain orders from Europe only or Worldwide?
- who created it? how and why?
- can I trust the data inside to build the board meeting's slides?
Finding what you are looking for is only the first step: to answer those questions, you also need to 1) understand and 2) trust the data in front of you. For that reason, we built a Wikipedia-like page for each data asset in the company.
In Castor, you will find information on:
Programmatically curated information (automated)
- Table and column names
- Column type
- Last update (Freshness)
- Frequent users
- Source Query
- Descriptions already existing in other tools (data warehouse, dbt, data visualization tools)
- Personal Information
- Most frequent queries
- Tables frequently joined together
Manually curated information
- Tags and owners
- Comments and discussions
The chat feature is our personal touch. Data engineers and experienced data scientists receive dozens of DMs on Slack every day asking for the meaning of a column, the purpose of a table, or the query to join table A with table B.
It’s annoying, time-consuming, and definitely not the best way to use their valuable time.
We designed a chat interface that would solve this problem by creating a dynamic FAQ as questions are asked. Every data asset has its own chat interface where people can talk and ask questions as they face problems.
The questions and answers are kept and made public, so that each question is asked only once and everyone gets value from the answer.
Lineage is a powerful feature. It tells you where your data comes from and which data resources within the company depend on the data you have in front of you.
It is a kind of family tree, but for data. You can see the parent tables (the tables used to create the data asset) and the child tables (the tables created from this data asset).
Data lineage is particularly powerful for:
- Data engineers: it helps them debug their pipelines, identify problems faster, and trace how an error propagates through the whole data pipeline to perform impact analysis.
- Data analysts: they can quickly understand where the data comes from, see the SQL query used to create the table, and trust the data before performing an analysis.
- Data stewards: they have to track the data transformation across the data infrastructure for conformity purposes (GDPR, HIPAA).
Castor provides lineage from data tables inside the data warehouse to BI dashboards in Looker, Tableau, Metabase, etc.
We compute the lineage in minutes thanks to a powerful parsing algorithm. We also extract the lineage from other tools like dbt.
As we believe that collaboration is essential to a successful management strategy, by design everyone can edit descriptions (rights management will be available for enterprise companies). As a result, we needed to implement a version history. It currently tracks all the modifications happening on the platform.
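The family-tree idea can be sketched in a few lines of Python. This is a simplified illustration, not Castor's parsing algorithm: it assumes the parent/child edges have already been extracted, and every table and dashboard name below is hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges, written as (parent, child).
EDGES = [
    ("raw.orders", "analytics.customer_orders"),
    ("raw.customers", "analytics.customer_orders"),
    ("analytics.customer_orders", "analytics.revenue_daily"),
    ("analytics.revenue_daily", "dashboard.board_meeting"),
]


def build_graph(edges):
    """Index the edges both ways so parents and children are quick to look up."""
    parents = defaultdict(set)
    children = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
        children[parent].add(child)
    return parents, children


def downstream(asset, children):
    """Breadth-first walk listing every asset that depends on `asset`."""
    visited = {asset}
    impacted = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in sorted(children.get(current, ())):
            if child not in visited:
                visited.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


parents, children = build_graph(EDGES)
direct_parents = sorted(parents["analytics.customer_orders"])
impacted = downstream("raw.orders", children)
```

Here `direct_parents` answers "where does this table come from?", while `impacted` is the impact analysis: everything, down to the dashboard, that would be affected by a problem in `raw.orders`.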
Example use cases:
- If a data user thinks the definition of a column is not right, they can look into the version history and ask the person who modified the documentation why they wrote it.
- If the owner of a table realizes that someone has been editing definitions incorrectly, they can discuss it with that person and prevent a possibly expensive misuse of data.
It also records the changes made to table schemas in the past (if a column was added, deleted, …).
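As an illustration of what recording a schema change involves, here is a minimal Python sketch that compares two snapshots of a table schema. It is an assumption about how such tracking could work, not a description of Castor's internals; the column names and types are hypothetical.

```python
def schema_diff(before, after):
    """Compare two {column_name: column_type} snapshots of the same table."""
    before_cols, after_cols = set(before), set(after)
    return {
        "added": sorted(after_cols - before_cols),
        "removed": sorted(before_cols - after_cols),
        # Columns present in both snapshots whose type changed.
        "retyped": sorted(
            col for col in before_cols & after_cols if before[col] != after[col]
        ),
    }


# Hypothetical snapshots taken on two different days.
yesterday = {"order_id": "INTEGER", "payment": "FLOAT", "country": "STRING"}
today = {"order_id": "INTEGER", "payment": "NUMERIC", "currency": "STRING"}

changes = schema_diff(yesterday, today)
```

Storing one such diff per table per day is enough to rebuild a timeline of added, deleted, and retyped columns.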
Get to know everything that happened when you were out.
We worked hard to build connectors to BI tools. As most of the data people we interviewed were using Tableau, Looker, and Metabase, we prioritized those tools. We now have integrations available for most BI tools. Check them out here.
The search experience in BI tools is not optimal. For example, if you search for a specific field in Looker, you have to open each Explore and search for this field. End users end up lost among all the Explores, Looks, and Dashboards when they are looking for specific content.
To improve the search experience, we show the content used to create a dashboard as well as the content that depends on it, and we prioritize search results based on popularity. This helps increase trust and visibility in BI tools.
Usage and Popularity
Have you ever looked at a table and wondered: how are people usually querying this table? Which column should I use as a timestamp: _TZ or TIMESTAMP? What are the frequently joined tables?
Well, if the answer is yes, you’ll like this feature. We are parsing and referencing all the queries made by data people within the company to:
- Highlight the most popular tables
- Highlight the most popular queries and most frequently joined tables
- Notify most frequent users after a change in the table schema
This popularity index adds a lot of value across various features. With it, we can show the most relevant data assets in the search, order lineage results to show the most popular first, and encourage data people across your company to document the data assets that are queried by the largest number of people.
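A rough idea of how query logs turn into popularity figures can be given in Python. This is a deliberately naive sketch: it finds table names with a regular expression, whereas production-grade lineage and usage analysis needs a real SQL parser. The query log and every name in it are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical query log pulled from the warehouse history.
QUERY_LOG = [
    "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT count(*) FROM orders",
    "SELECT * FROM orders o JOIN payments p ON o.id = p.order_id",
    "SELECT * FROM customers c JOIN orders o ON o.customer_id = c.id",
]

# Naive assumption: a table name is the identifier right after FROM or JOIN.
TABLE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def tables_in(query):
    """Return the distinct tables referenced by one query, sorted by name."""
    return sorted({name.lower() for name in TABLE_PATTERN.findall(query)})


def popularity(queries):
    """Count how often each table is queried and how often two tables appear together."""
    table_counts = Counter()
    join_counts = Counter()
    for query in queries:
        tables = tables_in(query)
        table_counts.update(tables)
        join_counts.update(combinations(tables, 2))
    return table_counts, join_counts


table_counts, join_counts = popularity(QUERY_LOG)
most_popular_table = table_counts.most_common(1)[0]
most_common_join = join_counts.most_common(1)[0]
```

The same two counters can feed search ranking, the ordering of lineage results, and the "frequently joined tables" list on each asset page.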
One thing that we hear all the time in sales calls is this:
"At my last board meeting, I felt stupid. As CFO, I came up with a number for Active Users but the CMO had another, the CPO another one. This can't happen anymore."
As a result, we added a KPI/Definition module in Castor.
It helps align everyone on a single source of truth that is linked to the data and can be validated by a set of owners. Think of it as a Notion-like interface connected to your data warehouse, BI, and data quality tools. In this module, everyone can define a concept, assign owners and tags, add code snippets to explain how to technically extract this KPI, assign related tables and dashboards, and add an "approved by owners" badge.
We are working on a feature to make sure that every department has its say in the approval process.
Castor is designed to grow organically in companies, with a high level of product automation and collaboration. Yet some companies might want to push their data documentation efforts further. This is why we developed the admin panel. It helps the data governance lead prioritize documentation efforts and get the most out of them.
As an official data governance lead, or just someone who cares about data documentation, you want to track progress and prioritize documentation work. The key to a good documentation strategy is putting effort where it matters the most. We built this feature to help data governance leads identify the most relevant assets to document first.
It basically answers: what are the 20% of tables that account for 80% of the data consumption? With minimal effort, your team can completely change the trust in data for the whole company.
Once the most popular tables are documented, Castor suggests column definitions that administrators can propagate to other tables. For example, say you have a column called "daily_time_spent_per_active_user" in a table called "customer_success_metrics". This column also exists in 20 other tables and reports. You could go into each one of them and copy/paste the definition or ... propagate it in a few clicks with Castor.
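The 20/80 question can be answered from query counts alone. The Python sketch below is one simple way to do it, under the assumption that "consumption" is measured by the number of queries per table; the table names and counts are made up for the example.

```python
def top_tables_for_coverage(query_counts, coverage=0.8):
    """Return the most-queried tables that together reach `coverage` of all queries."""
    total = sum(query_counts.values())
    if total == 0:
        return []
    selected = []
    covered = 0
    # Most queried first; ties are broken alphabetically for a stable result.
    for table, count in sorted(query_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered >= coverage * total:
            break
        selected.append(table)
        covered += count
    return selected


# Hypothetical number of queries per table over the last month.
QUERY_COUNTS = {
    "orders": 520, "customers": 310, "payments": 60, "sessions": 40,
    "refunds": 25, "tmp_export": 15, "legacy_users": 10, "events_old": 10,
    "test_a": 5, "test_b": 5,
}

priority_tables = top_tables_for_coverage(QUERY_COUNTS)
share_of_tables = len(priority_tables) / len(QUERY_COUNTS)
```

In this made-up example, two tables out of ten cover 83% of all queries, so documenting those two first gives the biggest return.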
Castor enables admins to choose the tables for which they want to propagate the definition.
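Conceptually, propagation copies one trusted definition to the same column in the tables an admin selects. The Python sketch below shows that idea only; it is not Castor's implementation. It assumes columns are matched by exact name and that existing definitions are never overwritten, and the catalog content, including the definition text, is hypothetical.

```python
def propagate_definition(catalog, column, source_table, target_tables):
    """Copy a column definition from the source table to the selected target tables.

    `catalog` maps table name -> {column name: definition}, where an empty
    string means "not documented yet". Returns the tables that were updated.
    """
    definition = catalog[source_table][column]
    updated = []
    for table in target_tables:
        columns = catalog.get(table, {})
        # Assumption: only fill in missing definitions, never overwrite one.
        if column in columns and not columns[column]:
            columns[column] = definition
            updated.append(table)
    return updated


# Hypothetical catalog state before propagation.
catalog = {
    "customer_success_metrics": {
        "daily_time_spent_per_active_user": "Example definition written by the table owner",
    },
    "weekly_engagement": {"daily_time_spent_per_active_user": ""},
    "churn_report": {"daily_time_spent_per_active_user": "Definition already written"},
    "orders": {"payment": ""},
}

updated = propagate_definition(
    catalog,
    column="daily_time_spent_per_active_user",
    source_table="customer_success_metrics",
    target_tables=["weekly_engagement", "churn_report", "orders"],
)
```

Only `weekly_engagement` is updated here: `churn_report` already has a definition and `orders` does not contain the column.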
We noticed our clients had source tables that were imported from several tools (Salesforce, Google Analytics, Zendesk, etc.). The definitions for these tables are always the same, so we built a repository of definitions for all those tools, and admins can choose to propagate them automatically to these tables. For some clients, it represents thousands of columns documented in seconds.
Data Quality Integrations (coming soon)
Data quality tools are mostly used by data engineers (and increasingly by analytics engineers) to ensure the quality of their pipelines. There's tremendous value in having data quality information displayed next to your documentation.
Data quality results need to appear next to the documentation for two main reasons:
- As a consumer of a data asset, you will want to trust the data you are looking at as fast as possible. Knowing what tests are run and if they succeeded is essential.
- Domain experts won't have access to the data quality tool, nor will they go looking in it. As a result, they won't know about issues before it's too late. Data quality results need to be in the tools they use on a daily basis.
We are looking into integrations with data quality specialists like Bigeye, Monte Carlo, and Great Expectations, and into providing a simple API to cover the custom-made data quality tests built by our customers.
We have a lot of ambition. An ambition to build the most automated and collaborative metadata platform on the market. An ambition to make data documentation sexy. This is just the beginning.
Coming next is more automation, more integrations, more collaboration, a new module (surprise), and a fair share of gamification. Stay tuned.
We write about all the processes involved when leveraging data assets: from the modern data stack to data teams composition, to data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data.
At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation. We designed our catalog software to be easy to use, delightful and friendly.
Want to check it out? Reach out to us and we will show you a demo.