If you've belonged to a data team for any length of time, you've probably felt submerged by the number of tables you're meant to keep track of. You've also probably noticed that other data analysts and data scientists spend a lot of their time dealing with questions of the style:
That is, data people are spending more time looking for and trying to understand data than on meaningful value-generating data analytics work. The good news is that a data dictionary is a tool that can help with all these questions, allowing data people to focus on the core of their work and gain time. The term "data dictionary" has gained a lot of traction in the past few years, but there is still confusion around what it means, how it differs from other tools such as the data glossary or data catalog, and what its primary usage is.
In this article, we provide clarity on the matter, as well as a template and instructions on how you and your team can use it. For more information about how to use the template, the creation of data dictionaries, defining terms, logging an event/record, etc, keep on reading!
A data dictionary is a repository of the information related to each database schema, table, event, file, or column stored in a given system. This information is called "metadata".
Metadata is basically data generated to provide information about the data you own (number of features, number of rows, etc..). For instance, the metadata related to a specific book file includes the title, the author name, the date of publishing, the table of content, etc. Metadata has tremendous value, as it allows you to unlock the potential of your data. If you don't exactly why this is the case, have a look at this article. It explains the concept of metadata in more depth and will help you understand the relevance of metadata collection.
Going back to the main point, a data dictionary is a place where all the metadata related to a file system is stored. Metadata about a database is usually stored within a different table, separated from the original database. This resource allows data users to understand the content of a high number of large databases without having to scroll through each column and record. For instance, if you want to check whether there are null values in your database, you just have to check the metadata in the data dictionary instead of scrolling through the database to see if "null" is a possibility for a certain field. This resource saves considerable amounts of time for a data user.
Take, for instance, a database containing records about your website users. You will be collecting different types of information about them: date of birth, e-mail address, phone number, etc.. Say, you create a column storing the names of your customers. In the data dictionary for the "name" column, you will find a definition of the column, the data type, possible values, whether values are unique, etc. The database with each record about your customers will look like this:
The data dictionary for the specific 'Name' column is the following:
This is a specific example. In general, for each table in the database, a data dictionary includes the following:
This information is usually stored as tables. Using a data dictionary, it is possible to uncover the above information about a database object without even having to open the database.
A data dictionary is a list of the information related to each database schema, table, event, file, or column stored in a given system. This information is called "metadata".
Although this seems like a simple exercise, it’s very difficult to bring in the same document information from different places that change all the time. It is even harder to align business departments with the same definitions. A B2B SaaS company we work with has difficulties getting everyone aligned on the same definition for “number of customers” on a specific date or time range. Why?
The date here isn't really the problem but what is the event date that should be taken into account when speaking about a specific metric.
In a nutshell, a data dictionary aligns everyone on these definitions. So that everyone "speaks" the same language. A data dictionary is used for:
If you're looking for good data dictionary software, take a look at our benchmark of the tools available on the market. But let's say you don't want to invest in new data tools just yet, for any reason, what can you do?
You're convinced of the relevance of data dictionaries, but you're not ready to install new software just yet? An "on the go" solution is for your data team to take care of data documentation using a shared excel spreadsheet and maintaining the data dictionary manually.
This spreadsheet; the central repository of metadata, will contain a description of the data objects in your file system. This includes table location, table, and database description, quality of the data, details about the table columns, etc.. This spreadsheet will be a point of reference for anyone in your company looking for specific a table, allowing them to quickly find, understand and use the data objects in no time.
This excel data dictionary is easy to use, reliable, and will quickly deliver value to your data team. The only drawback is that the spreadsheet becomes harder to maintain as the number of tables multiplies in your file system. However, nothing stops you from starting manually before migrating to an automated data dictionary later.
At Castor, we have put together a spreadsheet-based data catalog template for your personal use. Feel free to download it so you can get started with your data documentation right away.
Directly from your desktop, you can search through data assets like you would on Google. This data dictionary is like a portal to your data resources.
For all the data objects you have, you can document them from your desktop using this template. For instance, take a table containing a list of your platform's users in 2020. When documenting your table, you should pay particular attention to three aspects: Location, quality, and governance. We go through each aspect in turn.
In the documentation process, your should aim at locating your table in terms of where it can be found in the system, and who is responsible for it. This, so that employees can find the table quickly and easily. You thus start by indicating the location of the table (data storage - database - schema). That is, whether the table is located in your data warehouse or your data lake. You should also specify in which database and which schema the table can be found. If possible, indicate the URL of the table (Data source endpoint), so it can be found easily by employees. You then outline the table name, a short description/definition of what's contained in the table, and a few tags. It is also key to indicate the dataset type - an indication of how many transformations this table has gone through. The different types are the following: 'application' for transactional data, 'T0' for raw analytics data that's never been modified, 'T1' for data that's been transformed or modified once, etc. It's essential to precise the name of the table owner, usually the person responsible for the creation of the table. This, so that employees know exactly who to contact for questions about the table instead of wasting time roaming desperately in quest of an answer. You can feature a list of contributors: a contributor being a user who voluntarily enriches the table documentation. This is a way to give credit to these data fairies and to set them as reference points for fellows who might have questions about data objects.
The second aspect of data documentation relates to data quality. You want to ensure that your data objects are coherent and qualitative, and to ensure the best data assets are the most used ones in the company. With this idea in mind, start by filing the refresh frequency column. Is this table refreshed every 30 min, weekly? Monthly? This will allow the user to spot an outdated table and wave a red flag. The quality score bit is meant to be filled by the data steward, or the person in charge of data quality. Is the data complete, accurate, consistent? We simply use emojis for this. But feel free to use whatever system you feel most comfortable with. The quality score is important, as analysts will be able to locate quickly which tables are readily usable for conducting data analysis.
Once your table is clearly documented, it's time to clarify what's happening on the column side. I hear your sighing already. Don't. You have probably observed analysts reach high levels of frustration when finding themselves unable to start work because they (understandably) can't figure out the name of some columns. And that's the start of huge productivity losses for your data team. Hence the importance of outlining the column name and a short column description for each column, even if there's a high number of them.
For the documentation level part, you don't have to do anything. The template calculates the quality and level of your documentation. The score number is between 0 and 1, with 1 indicating the highest documentation level. if the dictionary contains a lot of null columns, the score will be closer to 0. Again, this guides analysts towards well-documented tables and encourages table owners to document their tables (a poorly documented table reflects badly on table owners).
A key field in data dictionaries is the "data type" field. Precising the type adds a layer of context for your analysts, who won't have to pull out the table in question to get this information. As we mentioned above, the data type depends on the database management system used by the organization. The most commonly used data types are SQL data types, which doesn't come as a surprise as SQL is the most widespread database language. We'll do a quick stop here to go over the different SQL data types, which should allow you to fill your data dictionary accurately.
There are 6 categories of data types:
Numeric data types
Date and time data types
Characters & string data types
Unicode characters and string types
Binary data types
Other data types
A data dictionary is like a portal. Directly from your desktop, you can access information on data resources regardless of your technical expertise. Even a manual data dictionary can help you deal with governance concerns. A data dictionary can be the data portal for your legal team. In this dictionary, you can indicate whether a given column qualifies as PII (Personally Identifiable Information), meaning that it contains any data that could potentially be used to identify a particular person. This allows you to deal with issues of privacy, for instance by restricting access to columns marked as PII straight from your desktop.
We encourage you to customize this data dictionary to your particular business, so don't hesitate to add additional fields of definition/documentation for your data objects! The template proposes a few, although they aren't part of our automated data catalog offer.
The "Dashboard" tab allows you to document your dashboards in the exact same manner as with tables.
The template also contains a "glossary" tab. This is the space of the business glossary, containing common vocabulary and definitions for shared data. A data glossary is a repository where each business term is attached to its definition. This ensures that business keywords are always used in the right context and that everyone in the company agrees on one specific definition. Take the term "paying user" for instance. This word is used recurrently in the product department of companies. Although it seems straightforward, it can be interpreted differently by employees if they don't agree around a clear definition. For instance, an analyst might categorize users on a free trial as "paying users", while another might exclude them. Hence the importance to have a glossary for each department. The term owner is the person who wrote the definition and is generally ready to answer questions about a specific word.
Linking the data glossary with the data dictionary also generates time and productivity gains. This enriches data with business context, allowing data analysts and data scientists to align data with business strategy. In fact, when both the data glossary and the data dictionary are in the same repository, a data user can directly have access to a term's definition, as well as to any table or database related to this term.
If you have any questions about how to use the template, please reach out at louise@castordoc.com
We write about all the processes involved when leveraging data assets: from the modern data stack to data teams composition, to data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data.
At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation. We designed our catalog software to be easy to use, delightful and friendly.
Want to check it out? Reach out to us and we will show you a demo.