Why I hate the "Data Catalog" term

The messier your data warehouse, the sooner you should implement a data catalog

For the last twelve months we've been building Castor, and yes, according to industry standards, it's a data catalog.

Here is the problem: too many people believe that the end goal of a data catalog is to nicely document your data assets. In that view, you plug a data catalog only after you've cleaned your warehouse and defined your data KPIs. Like the icing on the cake.

But no, we have a different vision. To bring visibility to the internet, we didn't organize it in clean folders. We plugged Google on top. If your data warehouse is messy, if it takes time to find the relevant data, if you have trouble trusting your data, don't spend weeks cleaning it: plug a search engine. Castor is a powerful search engine meant to help you find and trust data assets.

At Castor, our end goal is to make data users more efficient in answering questions.

Thanks to Castor, they find, understand, and use their assets faster, no matter how messy your warehouse is and even if you do not have any documentation yet.

The messier your data... The more value

The best metaphor I've found so far deals with hiking in a forest. Without Castor, any new analyst gets dumped in the forest. If staffing is ok at that time, an experienced analyst comes along for a few hours to give them a quick tour, show where water can be found, where the grizzly sleeps... Then the buddy leaves and the new analyst is left alone, tasked to do the job on their own. Yes, from time to time they can ask a question around, but that depends on the team's availability.

And with Castor? Your new analyst gets a fancy map of the forest, showing points of interest and paths frequently used. Oh, and this map is automatically updated. To add two metaphors on top of the first one, it's kind of like the Marauder's Map in Harry Potter, or the Age of Empires map when using cheat codes (I swear these are the last metaphors of this article).

So this map is amazing if you're in the darkest forest ever (like Fangorn's forest, oopsy, another super geeky metaphor) but it's also useful in our beautiful city of Paris in France.

How does that translate in the data world? Key features such as:

  • Never getting lost using an unknown/unpopular/deprecated table thanks to table and dashboard popularity
  • Finding all tables containing a specific column
  • Not reinventing the wheel thanks to a query history organized by tables
  • Grasping dependencies thanks to lineage
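
To make the first two features concrete, here is a minimal sketch of how popularity and column search could fit together. It is not Castor's implementation: the table names, the `TABLE_COLUMNS` metadata, the `QUERY_LOG`, and both helper functions are hypothetical, and popularity is simply assumed to be the number of queries that read a table.

```python
from collections import Counter

# Hypothetical warehouse metadata: table name -> column names.
TABLE_COLUMNS = {
    "analytics.orders": ["order_id", "customer_id", "amount"],
    "analytics.customers": ["customer_id", "country"],
    "legacy.orders_v1": ["order_id", "customer_id"],
}

# Hypothetical query history: each entry lists the tables one query read from.
QUERY_LOG = [
    ["analytics.orders", "analytics.customers"],
    ["analytics.orders"],
    ["legacy.orders_v1"],
    ["analytics.orders", "analytics.customers"],
]


def table_popularity(query_log):
    """Count, for each table, how many queries read from it."""
    counts = Counter()
    for tables in query_log:
        # A table referenced twice in the same query still counts once.
        counts.update(set(tables))
    return counts


def tables_with_column(column, table_columns, popularity):
    """Return all tables containing `column`, most popular first."""
    matches = [t for t, cols in table_columns.items() if column in cols]
    # Sort by descending popularity, then by name for a stable order.
    return sorted(matches, key=lambda t: (-popularity[t], t))


popularity = table_popularity(QUERY_LOG)
print(tables_with_column("customer_id", TABLE_COLUMNS, popularity))
```

Even on a messy warehouse, a newcomer searching for `customer_id` lands on the table everyone actually queries first, and the rarely used legacy table last.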

In short, the messier your data is, the more value Castor can bring. Castor is a data exploration/discovery tool (leveraging data cataloging features, yes). If it is a mess, use cheat codes now, clean later.

Clean, Migrate, Build with Castor

Clean with Castor

We have some useful features to help Castor admins make their warehouse cleaner. The main one is that content is always prioritised by popularity. This puts the focus on popular content, so that documentation effort can be aligned with how much each asset is used.

A reflex we've seen a lot is to document source tables, even if these are never used directly. We advise our clients to start with the top 10 most popular tables, listed in Castor. Of course, these tables, thanks to our very own SQL parser, are already enriched with lineage information and query history.

Also, at first, some of our clients only wanted to show their neatest schemas in Castor, well structured and approved, and hide the ugly ones, not considering how much these were used.

After a few weeks working together, another strategy emerged: we added all schemas back in Castor. Our clients tagged their officially approved tables and dashboards. Finally, they added redirects from popular, soon-to-be-deprecated ones to their new counterparts. Here again, Castor played its "map" role. This pattern is even stronger when clients are doing a data migration to a new warehouse.
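
As an illustration of the redirect idea, here is a minimal sketch, assuming redirects are stored as a plain mapping from a deprecated table to its replacement. The table names, the `REDIRECTS` and `APPROVED` structures, and the `resolve` helper are all hypothetical and do not describe how Castor stores this information.

```python
# Hypothetical redirects: deprecated table -> its new counterpart.
REDIRECTS = {
    "legacy.orders_v0": "legacy.orders_v1",
    "legacy.orders_v1": "analytics.orders",
}

# Hypothetical set of officially approved tables.
APPROVED = {"analytics.orders"}


def resolve(table, redirects):
    """Follow redirects until reaching a table that is not deprecated."""
    seen = set()
    while table in redirects:
        if table in seen:
            # Guard against a redirect loop such as a -> b -> a.
            raise ValueError(f"redirect loop detected at {table!r}")
        seen.add(table)
        table = redirects[table]
    return table


target = resolve("legacy.orders_v0", REDIRECTS)
print(target, "(approved)" if target in APPROVED else "(not approved)")
```

The point of the sketch is that the old tables stay visible and searchable: a user who lands on a deprecated one is pointed to its approved successor instead of hitting a dead end.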

To put things simply: make your users more efficient, on your brand new, well-documented dbt models but also on that old production database that you never want to hear about.

Migrate with Castor

I love hearing that sentence: "oh, we're in the middle of a migration from Redshift to Snowflake, we'll plug Castor when we're done". Why do I love that sentence? Simply because I know by heart the arguments to plug Castor as soon as possible.

Remember the paragraph above? About "The messier your data... The more value". Could a warehouse be messier than during data migration? We typically see clients use Castor to map new & old tables; their users can see all old and new content in the same place, with links between them. (note: a reference to the best TV show ever is hidden in this paragraph)

Build with Castor

Are you laying down the foundation of a modern data stack in a company starting its data journey? Lucky you!! These are amazing times indeed. Plug Castor, now. Why? Because the sooner you enable exploration and documentation, the less work and hassle it will be. It's super hard to climb that mountain when it's 8000 high...

The question isn't if Castor, but when Castor? 

  • You have more than 500 tables and the only way to get some knowledge about these is by asking questions on Slack? Plug Castor
  • You're building a data team and stack from scratch (see this article)? Plug Castor to avoid legacy, and enable your users from month 1
  • You're migrating from Redshift to Snowflake or from Snowflake to BigQuery (yes, we've seen that too)? Plug Castor to help your users find their way between old datasets and new ones
  • You don't want a fancy tool and Excel makes sense for you? We've built a delightful data catalog template

Enough of this self-promoting talk, I think you got it now 😉
