Every time you ask a data engineer or data scientist "Would you like to have all your data documented?", the answer is "YES!". A big and loud "YES!".
The next natural question would be "Do you want to document it yourself?" and this time, the answers are less enthusiastic: "hmm, no", "not really", or even "I don't even want to come close to this!"
DocMaster is a feature of Castor that automatically documents your columns and tables coming from common data sources (Salesforce, Google Analytics, Zendesk, etc.). We created a documentation repository with high-quality definitions for the fields of popular data sources. Now, when we get access to your metadata, we can identify the tables and fields coming from those sources with 99% accuracy and document them for you.
DocMaster gathers documentation from many sources (like Salesforce, Google Ads, Zendesk, Marketo, Jira, etc.). It locates the data tables linked to these sources in your data warehouse (BigQuery, Redshift, Snowflake, etc.) and automatically adds metadata to them.
For example, in the screenshot below, the table "account" from the "Salesforce" schema is synced through ETL tools like Fivetran or Airbyte. Documentation for this "salesforce.account" table is available online. We auto-populate it with DocMaster.
And we do the same with 61 sources. You'll find below some of the sources that we are currently able to document for you.
DocMaster's vision is to bring as much documentation as possible in an automated yet 100% accurate way. We will be adding new sources every month and expanding our repository.
DocMaster is ready to document up to 27% of your data warehouse in a few minutes. At Castor, we have a 5:1 ratio of automatically documented data to manually documented data.
DocMaster is designed to be always right. The system is conservative by design, which means that false positives (adding a wrong description to a column or table) are very unlikely. There are multiple levels and mechanisms of validation before a description is approved.
The main idea for the design was to allow descriptions to be added if and only if the 4 following steps were completed:
We ensure that after the pre-processing step, the name of the column corresponds exactly to the term we have defined in DocMaster. If it does, then we add the description to the column.
Here are a few rules we implemented:
- The column isn't private, meaning created by the customer (its name ends with _sdc, __c, etc.).
- The proposed description isn't already used in the same table.
- One column can't appear twice in the same table. If there is already a perfect match, the second match is a false positive. We added pre-processing steps (removing special characters such as '_' and '-', and making everything lowercase) in order to standardize perfect matching.
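The rules above can be sketched roughly as follows. This is a minimal illustration, not Castor's actual implementation: the function names, the `definitions` mapping, and the sample descriptions are all hypothetical. It also assumes the private-column check runs on the raw name, before pre-processing, since pre-processing strips the underscores that the suffixes rely on.

```python
import re

# Suffixes marking customer-created ("private") columns, taken from the rule above.
PRIVATE_SUFFIXES = ("_sdc", "__c")


def normalize(name: str) -> str:
    """Pre-processing: lowercase the name and drop '_' and '-'."""
    return re.sub(r"[_\-]", "", name.lower())


def is_private(raw_name: str) -> bool:
    """A column is private if its raw name ends with a known suffix."""
    return raw_name.lower().endswith(PRIVATE_SUFFIXES)


def document_table(columns, definitions):
    """Return {raw column name: description} for columns passing every rule.

    `definitions` (hypothetical) maps a normalized term to its description.
    """
    documented = {}
    used_terms = set()
    used_descriptions = set()
    for raw in columns:
        if is_private(raw):
            continue  # rule 1: skip customer-created columns
        term = normalize(raw)
        description = definitions.get(term)
        if description is None:
            continue  # no exact match after pre-processing
        if description in used_descriptions:
            continue  # rule 2: description already used in this table
        if term in used_terms:
            continue  # rule 3: a second match is treated as a false positive
        documented[raw] = description
        used_terms.add(term)
        used_descriptions.add(description)
    return documented


# Made-up definitions and column names, for illustration only.
definitions = {
    "accountid": "Unique identifier of the account.",
    "billingcity": "City of the billing address.",
}
columns = ["Account_Id", "account-id", "billing_city", "billing_city__c", "unknown_col"]
result = document_table(columns, definitions)
```

Here only `Account_Id` and `billing_city` receive a description: `account-id` is a second match for the same term, `billing_city__c` is private, and `unknown_col` has no exact match. Skipping a column whenever a rule fails keeps the sketch conservative, in line with the goal of avoiding false positives.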
Table names can sometimes differ depending on the ETL tool provider. To handle this, we compute a similarity score based on the table name and all the columns in the table.
Here is an example:
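One way such a score could be computed is sketched below. The exact formula is not described here, so everything in this block is an assumption: the sketch blends a string similarity on the table name (using `difflib.SequenceMatcher` from the standard library) with the Jaccard overlap of the normalized column names, and the `name_weight` parameter is a hypothetical tuning knob.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase the name and drop '_' and '-'."""
    return name.lower().replace("_", "").replace("-", "")


def table_similarity(name, columns, ref_name, ref_columns, name_weight=0.3):
    """Score in [0, 1] comparing a warehouse table to a reference table.

    Assumed formula: weighted mix of table-name similarity and column overlap.
    """
    name_score = SequenceMatcher(None, normalize(name), normalize(ref_name)).ratio()
    cols = {normalize(c) for c in columns}
    ref_cols = {normalize(c) for c in ref_columns}
    union = cols | ref_cols
    # Jaccard index: shared columns divided by all distinct columns.
    column_score = len(cols & ref_cols) / len(union) if union else 0.0
    return name_weight * name_score + (1 - name_weight) * column_score


# Made-up reference table and candidates, for illustration only.
ref_name, ref_columns = "account", ["Id", "Name", "BillingCity"]
renamed = table_similarity("sf_account", ["id", "name", "billing_city"], ref_name, ref_columns)
unrelated = table_similarity("ticket", ["ticket_id", "status"], ref_name, ref_columns)
```

With these inputs, the renamed table scores much higher than the unrelated one because all of its columns match the reference, even though its name carries an extra prefix. Giving the columns more weight than the name reflects the idea that a provider may rename a table but rarely changes its whole set of fields.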
At Castor, we dream of a world where data is accessible throughout an entire organization, to everyone, regardless of their data literacy level. To achieve this goal, we always focus on bringing solutions that do not require effort from our users. Tools like the open-source Amundsen are great. They can have a huge impact (as they did at Lyft), but to reach their full potential, you need to invest a lot more than just deploying them. You need motivation, people, and processes to make a tool like Amundsen stick.
The biggest challenge with Amundsen is the cultural change needed inside your organization. Most of the features need engineering effort to set up and maintain. Then, you need active leadership to push everyone in the company to write documentation. This means that if not everybody gets on board with the idea, the quality of documentation is going to degrade very quickly. You might have noticed already that bad documentation is sometimes worse than no documentation at all.
At Castor, our focus is to maximize the "impact/time spent" ratio. 10 minutes on Castor is worth two hours on an Excel spreadsheet. We built Castor with 4 core focuses:
We do a lot of the work for you: finding popular data, declaring lineages, documenting your columns and tables, etc. In short, we let you focus where you really need to.
A lot of our features are focused on this aspect, such as finding the most popular tables and columns, the most used SQL queries, etc.
You don't need to repeat the work done by your peers.
When documenting your data, we let you propagate the information throughout your entire data warehouse, effectively turning 1 hour of work into 10.
Want to try what I built? More information here.
I'm Victor, really happy to be part of the Beaver Gang 🦫, and like some of you, also a coffee addict ☕️.
Originally from Brazil 🇧🇷, I'm now pursuing an engineering double degree in France 🇫🇷. This project was developed in partnership with CentraleSupélec (Paris) through an internship program.
Here's a gif that I designed and wanted to share: