After talking to more than 150 data people (managers, engineers, stewards, DPOs, analysts, scientists, product managers, BI, etc.), one thing is clear: data dictionaries suck. Yet 95% of those people explained that they are a must for their day-to-day work. They need them to find and understand the relevant data available within the company.
What is a data dictionary?
It is a document describing the meaning of a dataset. Typically, this includes field names and types (e.g. string, float, integer) and maybe some annotations that describe the lineage of the data (where did it come from, and what does it feed) as well as the business definition.
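As a rough illustration, the pieces listed above (field names, types, lineage, business definition) can be written down as a small data structure. This is only a sketch: the class names, the `orders` dataset, and its columns are made up for the example and do not come from any particular tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldEntry:
    """One documented column of a dataset."""
    name: str
    dtype: str                      # e.g. "string", "float", "integer"
    definition: str                 # business definition
    upstream: List[str] = field(default_factory=list)    # where it comes from
    downstream: List[str] = field(default_factory=list)  # what it feeds


@dataclass
class DatasetEntry:
    """One documented dataset and its columns."""
    name: str
    description: str
    fields: List[FieldEntry] = field(default_factory=list)


# Hypothetical example dataset, invented for illustration only.
orders = DatasetEntry(
    name="orders",
    description="One row per customer order.",
    fields=[
        FieldEntry("order_id", "integer", "Unique identifier of the order."),
        FieldEntry(
            "amount",
            "float",
            "Order total in euros, taxes included.",
            upstream=["payments.charge_amount"],
            downstream=["finance.monthly_revenue"],
        ),
    ],
)
```

A spreadsheet-based dictionary usually flattens the same information into one row per column.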
However, as with all workflows captured in spreadsheets, things can go to hell very quickly.
Now, I tried to investigate why people hate data dictionaries so much. It basically boils down to a few points:
- they are difficult to maintain
- they are ugly
- they are not easy to share and plug into other software
- they cause confusion
1) They are difficult to maintain
Building a data dictionary always starts with a good intention: helping others understand the datasets that exist in the company. Generally, the owner of this data dictionary project is knowledgeable, meticulous, and organized. He/she builds a Google Sheet that contains all the information he/she has on the different datasets and makes sure it is accurate.
Things start to go off the rails when more people are involved. BI guys think the definitions are too technical and not accurate for their work, so they copy/paste the dictionary and build their own branch. Engineers add columns to the tables without notifying the data scientists or updating the data dictionary. Some tables are deleted, merged, or modified in the data infrastructure and no one changes anything. Basically, the owner of the data dictionary built a snapshot of the data infrastructure based on his/her knowledge, but as things evolve fast and in so many directions, the document is generally useless after a month.
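The kind of drift described above (columns added, removed, or changed without anyone touching the dictionary) is easy to detect once both sides are available as data. Here is a minimal sketch, assuming the documented columns and the actual table columns have each been loaded into a plain mapping of column name to type. The function name `find_drift` and the sample columns are hypothetical.

```python
from typing import Dict, List


def find_drift(documented: Dict[str, str], actual: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare documented columns with the columns that really exist."""
    documented_cols = set(documented)
    actual_cols = set(actual)
    return {
        # In the table, but nobody documented them.
        "undocumented": sorted(actual_cols - documented_cols),
        # Documented, but no longer in the table.
        "stale": sorted(documented_cols - actual_cols),
        # Present on both sides with different types.
        "type_changed": sorted(
            col for col in documented_cols & actual_cols
            if documented[col] != actual[col]
        ),
    }


# Hypothetical snapshot of the dictionary versus the live table.
documented = {"order_id": "integer", "amount": "float", "coupon": "string"}
actual = {"order_id": "integer", "amount": "string", "channel": "string"}

drift = find_drift(documented, actual)
print(drift)
```

Running a check like this on a schedule would at least flag when the snapshot has gone out of date, even if someone still has to write the missing definitions.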
Now many versions of the data dictionary are passing from hand to hand within the company without any control. This is a great risk for a company: analysts working from the wrong version are prone to make costly mistakes.
2) They are ugly
The standard in data dictionaries is the good old Excel spreadsheet, closely followed by a Word document that has been saved as a PDF. I still wonder every day why companies spending tens of millions on cloud data storage and processing services spend so little on building an efficient solution.
Big corporates hire expensive data consulting companies like BCG Gamma or McKinsey Analytics. When those consulting groups ask them about their data, they send confusing, old-school, outdated Excel files. Expensive time is then lost to confusion and useless email exchanges to clarify simple data points.
The same goes for tech startups, which often pride themselves on design and on making their application as user-friendly and intuitive as possible. Yet, when they receive an inquiry about their data, they send over a spreadsheet.
Surely there is a better way.
3) They are not easy to share and plug into other software
As said earlier, the favorite format today is still the good old Excel spreadsheet, most of the time not even the collaborative version. This means that each time you edit the dictionary with some additional information (add a column, change a description, correct a mistake) you need to send a new version to everyone. This is a real pain.
If you try to be innovative, you may want to use Google Sheets to benefit from the collaborative functions. Now you have a new problem on your hands. Everyone has their own way of designing the data dictionary, their own definitions, and the right to edit. You end up struggling to keep a single data dictionary format that you can use programmatically for other purposes (GDPR reports, auto-populating other software that uses your data like Tableau or Dataiku, etc.).
To solve this you need:
- a clear, unique template that fits all the possible use cases
- access control to make sure only the relevant people can modify the dictionary
- an open API so that other software can pull the data automatically and be populated in a click
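To make the three requirements above a bit more concrete, here is a minimal sketch of what they could look like in code: a fixed template, a check on who may edit, and a machine-readable export that other software could consume. Everything in it is an assumption made for illustration: the required keys, the `EDITORS` list, the function names, and the sample row. A real setup would sit behind proper authentication and a real API.

```python
import json

# Hypothetical template: every row must fill in these keys.
REQUIRED_KEYS = ("table", "column", "type", "definition", "owner")

# Hypothetical access list: only these users may change definitions.
EDITORS = {"alice"}


def missing_keys(row):
    """Return the template keys that are absent or blank in a row."""
    return [key for key in REQUIRED_KEYS if not str(row.get(key) or "").strip()]


def update_definition(rows, user, table, column, definition):
    """Change one definition in place, but only for authorized editors."""
    if user not in EDITORS:
        raise PermissionError(f"{user} is not allowed to edit the dictionary")
    for row in rows:
        if row["table"] == table and row["column"] == column:
            row["definition"] = definition
            return
    raise KeyError(f"{table}.{column} is not in the dictionary")


def export_json(rows):
    """Serialize the dictionary for other tools, refusing incomplete rows."""
    for row in rows:
        missing = missing_keys(row)
        if missing:
            raise ValueError(f"row {row} is missing {missing}")
    return json.dumps(rows, indent=2)


# Hypothetical dictionary content, invented for the example.
rows = [
    {"table": "orders", "column": "amount", "type": "float",
     "definition": "Order total in euros.", "owner": "alice"},
]
update_definition(rows, "alice", "orders", "amount",
                  "Order total in euros, taxes included.")
payload = export_json(rows)
```

The point of the single template is that `payload` always has the same shape, so a GDPR report or a BI tool can read it without guessing which column holds what.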
4) They cause confusion
Because today's most popular data dictionaries are not collaborative, not up to date, not exhaustive, badly designed, and circulating in many versions within the company, no one really knows whether they can trust the information written inside. When troubleshooting data for customers, data engineers or any kind of support staff look at the data itself and not the data dictionary. They can't know whether they are using an outdated version or misunderstanding it. All this confusion creates inefficiencies, frustration, and wasted time.
Do you hate your data dictionary as well?
Visit Castor to find out more about a collaborative, automated, integrated, plug-and-play data dictionary solution