If you've belonged to a data team for any length of time, you've probably felt submerged by the amount of tables you're meant to keep track of. You've also probably noticed that other data analysts and data scientists spend a lot of their time dealing with questions of the style:
That is, data people are spending more time on metadata management than on meaningful value-generating data analytics work. The good news is that the data dictionary, or metadata repository is a tool that can help with all these questions, allowing data people to focus on the core of their work. If you're looking for such a tool, take a look at our data catalog benchmark. But let's say you don't want to install a data dictionary just yet, for any reason, what can you do?
An "on the go" solution is for your data team to take care of data documentation using a shared excel spreadsheet and maintaining the data dictionary manually.
This spreadsheet; central repository of metadata, will contains a description of data objects. This includes table location, table and database description, quality of the data, details about the table columns, etc.. This spreadsheet will be a point of reference for anyone in your company looking for specific a table, allowing them to quickly find, understand and use the data objects.
This excel data dictionary is easy to use, reliable, and will quickly deliver value to your data team. The only drawback is that the spreadsheet becomes harder to maintain as your tables multiply. However, nothing stops you from starting manually before migrating to an automated data dictionary later.
For all the data objects you have, you can document it using this template. For example, take a table containing a list of your platform's users in 2020. When documenting your table, you should pay particular attention to three aspects: Location, quality, and governance. We go through each aspect in turn.
In the documentation process, your should aim at locating your table in terms where it can be found, and who is responsible for it. This, so that employees can find the table easily. You thus start by indicating the location of the table (data storage - database - schema). That is, whether the table is located in your data warehouse or your data lake. You should also specify in which database and which schema the table can be found. If possible, indicate the URL of the table (Data source endpoint), so it can be found easily by employees. You then outline the table name, a short description of what's contained in the table, and a few tags. It's essential to precise the name of the table owner, usually the person who created the table. This, so that employees know exactly who to contact for questions about the table instead of roaming desperately in quest of an answer. You can feature a list of contributors: the users who voluntarily enrich the table documentation. This a way to give credit to these data fairies, and to set them as reference points for fellows who might have questions about the table.
The second aspect of data documentation relates to data quality. You want to ensure that your data is coherent and qualitative, and to organise that the best data assets get the most users. In this light, start by filing the refresh frequency column. Is this table refreshed every 30 min, weekly? Monthly? This will allow user to spot an outdated table and wave a red flag. The quality score bit is meant to be filled by the data steward, or the person in charge of data quality. Is the data complete, accurate, consistent? We simply use emojis for this. But feel free to use whatever system you feel most comfortable with. The quality score is important, as analysts will be able to locate quickly which tables are readily usable for conducting great analysis.
Once your table is clearly documented, it's time to clarify what's happening on the column side. I hear your sighing already. Don't. You have probably observed analysts reach high levels of frustration when finding themselves unable to start work because they (understandably) can't figure out the name of some columns. And that's the start of huge productivity losses for your data team. Hence the importance of outlining the column name and a short column description for each column. You should also precise the type of each column, describing the value it contains. Column types can be Text, Number, Date/Time or Location. This adds a layer of context for your analysts, who won't have to pull out the table in question to get this information.
For the documentation level part, you don't have to do anything. The template calculates the quality and level of your documentation. The score is between 0 and 1, with 1 indicating the highest documentation level. Again, this guides analysts towards well documented tables, and encourages table owners to document their table (a poorly documented table reflects badly on table owners.)
Finally, even a manual data dictionary can help you deal with governance concerns. In this dictionary, you can indicate whether a given column qualifies as PII (Personally Identifiable Information), meaning that it contains any data that could potentially be used to identify a particular person. This allows you to deal with issues of privacy, for example by restricting access to columns marked as PII.
We encourage you to customise this data dictionary to your particular business, so don't hesitate to add additional fields of documentation! The template proposes a few, although they aren't part of our automated data catalog offer.
The "Dashboard" tab allows you to document your dashboards in the exact same manner as with tables.
The template also contains a "dictionary" tab. This is the space of the business glossary, containing common vocabulary and definitions for shared data. where words which are often used in the company are clearly defined. This ensures that business keywords are always used in the right context, and that everyone in the company agrees on one specific meaning. Take the term "paying user" for example. This word is used recurrently in the product department of companies. Although this word seems straightforward, it can be interpreted differently by different employees. For example, an analyst might categorize users on a free trial as "paying users", while another might exclude them. Hence the importance to have a glossary for each department. The term owner is the person who wrote the definitions, and is generally ready to answer questions about a specific word.
Data dictionaries communicate metadata regarding data in a system or a database. They are at the root of any database management system (DBMS). The primary purpose of a data dictionary is good metadata management. For each table in the database, a data dictionary includes the following:
If you have any question about how to use the template, please reach out at email@example.com