Are you struggling with Data Discovery?
If you've belonged to a data team for any length of time, you've probably felt submerged by the number of tables you're meant to keep track of. You've also probably noticed that other data analysts and data scientists spend a lot of their time dealing with questions of the style:
- Where is the best data to answer my question?
- What does the column name "XXXX" mean?
- Can I trust this resource?
- Who has created this resource? How was the data generated?
- When was the database last updated? By which process was this resource created?
- Who can I contact if I see something wrong (wrong record, wrong row, etc..)?
- Has someone already worked on this question?
That is, data people are spending more time looking for and trying to understand data than on meaningful value-generating data analytics work. The good news is that a data dictionary is a tool that can help with all these questions, allowing data people to focus on the core of their work and gain time. The term "data dictionary" has gained a lot of traction in the past few years, but there is still confusion around what it means, how it differs from other tools such as the data glossary or data catalog, and what its primary usage is.
In this article, we provide clarity on the matter, as well as a template and instructions on how you and your team can use it. For more information about how to use the template, the creation of data dictionaries, defining terms, logging an event/record, etc, keep on reading!
Get the data dictionary template
What is a Data Dictionary?
A data dictionary is a repository of the information related to each database schema, table, event, file, or column stored in a given system. This information is called "metadata".
Metadata is basically data generated to provide information about the data you own (number of features, number of rows, etc..). For instance, the metadata related to a specific book file includes the title, the author name, the date of publishing, the table of content, etc. Metadata has tremendous value, as it allows you to unlock the potential of your data. If you don't exactly why this is the case, have a look at this article. It explains the concept of metadata in more depth and will help you understand the relevance of metadata collection.
Going back to the main point, a data dictionary is a place where all the metadata related to a file system is stored. Metadata about a database is usually stored within a different table, separated from the original database. This resource allows data users to understand the content of a high number of large databases without having to scroll through each column and record. For instance, if you want to check whether there are null values in your database, you just have to check the metadata in the data dictionary instead of scrolling through the database to see if "null" is a possibility for a certain field. This resource saves considerable amounts of time for a data user.
Take, for instance, a database containing records about your website users. You will be collecting different types of information about them: date of birth, e-mail address, phone number, etc.. Say, you create a column storing the names of your customers. In the data dictionary for the "name" column, you will find a definition of the column, the data type, possible values, whether values are unique, etc. The database with each record about your customers will look like this:
The data dictionary for the specific 'Name' column is the following:
This is a specific example. In general, for each table in the database, a data dictionary includes the following:
- Table name - the predefined name for the table
- Table location - Precise database, file, book...
- Table definition - A detailed description of the contents and purpose of the table in the database schema. Example: "Each row represents a user of our platform"
- Column Name - The predefined name for the column
- Column definition - A detailed description of the contents and purpose for the column including Primary Key and Foreign Key designations.
- Data Type - The predefined characteristics for the column. Data types vary according to the database management system (DBMS). For instance, a column marked as BLOB data type in Oracle would be marked as TEXT in Microsoft SQL Server. The data type can be integer, date, string, XML, etc. We go into more details regarding data types later.
- Null Value - Indicates whether "null" is a possibility, this means whether or not null values can be stored for the column. It is useful to indicate whether fields can be null in the data dictionary. This means data users won't have to open the database to check whether an instance can be null or not.
- Unique value - Indicates whether each value/row is unique or not.
- Business rules - Includes schema validation or data quality.
- Entity-relationship and other system-level diagrams.
This information is usually stored as tables. Using a data dictionary, it is possible to uncover the above information about a database object without even having to open the database.
What is a data dictionary used for?
A data dictionary is a list of the information related to each database schema, table, event, file, or column stored in a given system. This information is called "metadata".
Although this seems like a simple exercise, it’s very difficult to bring in the same document information from different places that change all the time. It is even harder to align business departments with the same definitions. A B2B SaaS company we work with has difficulties getting everyone aligned on the same definition for “number of customers” on a specific date or time range. Why?
- The finance team defines the “number of customers” as the total number of customers that paid their bills between Jan. 1, 2021, 12:00 AM → Jan. 31, 2021, 11:59 PM.
- The sales team defines the “number of customers” as the total number of customers that signed the contract between Jan. 1, 2021, 12:00 AM → Jan. 31, 2021, 11:59 PM.
- The marketing team defines the “number of customers” as the total number of customers that are either paying or in the 14-trial period between Jan. 1, 2021, 12:00 AM → Jan. 31, 2021, 11:59 PM.
The date here isn't really the problem but what is the event date that should be taken into account when speaking about a specific metric.
In a nutshell, a data dictionary aligns everyone on these definitions. So that everyone "speaks" the same language. A data dictionary is used for:
- Documentation - data dictionaries are tools allowing for clear and organized documentation of a file system. This provides access to data structure details for data analysts, business analysts, and other users. Well-documented databases allow employees to find and understand the data they need in no time. With a well-maintained data dictionary, every analysis referencing certain business objects can use the definition and description in the data dictionary as a guide. Every analysis that references certain business objects should use the definitions in the dictionary as a guide.
- Communication - Equips each user with unique definitions for shared data, and helps developers measure the impact of schema changes.
- System Analysis - A key application of data dictionaries is to enable data and business analysts to understand system design and data flow.
- Data Integration - A clear definition of elements in the database management system (DBMS) provides each user with contextual understanding.
- Decision Making - Good metadata management software facilitates the planning of data collection, project development, and other collaborative efforts.
If you're looking for good data dictionary software, take a look at our benchmark of the tools available on the market. But let's say you don't want to invest in new data tools just yet, for any reason, what can you do?
The Excel Solution
You're convinced of the relevance of data dictionaries, but you're not ready to install new software just yet? An "on the go" solution is for your data team to take care of data documentation using a shared excel spreadsheet and maintaining the data dictionary manually.
This spreadsheet; the central repository of metadata, will contain a description of the data objects in your file system. This includes table location, table, and database description, quality of the data, details about the table columns, etc.. This spreadsheet will be a point of reference for anyone in your company looking for specific a table, allowing them to quickly find, understand and use the data objects in no time.
This excel data dictionary is easy to use, reliable, and will quickly deliver value to your data team. The only drawback is that the spreadsheet becomes harder to maintain as the number of tables multiplies in your file system. However, nothing stops you from starting manually before migrating to an automated data dictionary later.
At Castor, we have put together a spreadsheet-based data catalog template for your personal use. Feel free to download it so you can get started with your data documentation right away.
Get the data dictionary template
How to use this template?
Directly from your desktop, you can search through data assets like you would on Google. This data dictionary is like a portal to your data resources.
For all the data objects you have, you can document them from your desktop using this template. For instance, take a table containing a list of your platform's users in 2020. When documenting your table, you should pay particular attention to three aspects: Location, quality, and governance. We go through each aspect in turn.
In the documentation process, your should aim at locating your table in terms of where it can be found in the system, and who is responsible for it. This, so that employees can find the table quickly and easily. You thus start by indicating the location of the table (data storage - database - schema). That is, whether the table is located in your data warehouse or your data lake. You should also specify in which database and which schema the table can be found. If possible, indicate the URL of the table (Data source endpoint), so it can be found easily by employees. You then outline the table name, a short description/definition of what's contained in the table, and a few tags. It is also key to indicate the dataset type - an indication of how many transformations this table has gone through. The different types are the following: 'application' for transactional data, 'T0' for raw analytics data that's never been modified, 'T1' for data that's been transformed or modified once, etc. It's essential to precise the name of the table owner, usually the person responsible for the creation of the table. This, so that employees know exactly who to contact for questions about the table instead of wasting time roaming desperately in quest of an answer. You can feature a list of contributors: a contributor being a user who voluntarily enriches the table documentation. This is a way to give credit to these data fairies and to set them as reference points for fellows who might have questions about data objects.
The second aspect of data documentation relates to data quality. You want to ensure that your data objects are coherent and qualitative, and to ensure the best data assets are the most used ones in the company. With this idea in mind, start by filing the refresh frequency column. Is this table refreshed every 30 min, weekly? Monthly? This will allow the user to spot an outdated table and wave a red flag. The quality score bit is meant to be filled by the data steward, or the person in charge of data quality. Is the data complete, accurate, consistent? We simply use emojis for this. But feel free to use whatever system you feel most comfortable with. The quality score is important, as analysts will be able to locate quickly which tables are readily usable for conducting data analysis.
Once your table is clearly documented, it's time to clarify what's happening on the column side. I hear your sighing already. Don't. You have probably observed analysts reach high levels of frustration when finding themselves unable to start work because they (understandably) can't figure out the name of some columns. And that's the start of huge productivity losses for your data team. Hence the importance of outlining the column name and a short column description for each column, even if there's a high number of them.
For the documentation level part, you don't have to do anything. The template calculates the quality and level of your documentation. The score number is between 0 and 1, with 1 indicating the highest documentation level. if the dictionary contains a lot of null columns, the score will be closer to 0. Again, this guides analysts towards well-documented tables and encourages table owners to document their tables (a poorly documented table reflects badly on table owners).
A key field in data dictionaries is the "data type" field. Precising the type adds a layer of context for your analysts, who won't have to pull out the table in question to get this information. As we mentioned above, the data type depends on the database management system used by the organization. The most commonly used data types are SQL data types, which doesn't come as a surprise as SQL is the most widespread database language. We'll do a quick stop here to go over the different SQL data types, which should allow you to fill your data dictionary accurately.
There are 6 categories of data types:
Numeric data types
- Integer - flags when a column in the database only contains integer values (Aka numbers without decimal points)
- Tinyint - when a column in a table contains an integer but only involving numbers in the range (0,255)
- Bigint - the column in the table stores an integer bigger than one trillion
- Float - This column stores approximate values
Date and time data types
- Date - A date of a given year and month
- Time - the time of the day
- Timestamp - temporal data type that holds the combination of date and time. The format of a TIMESTAMP is YYYY-MM-DD HH:MM:SS
Characters & string data types
- Char - Refers to a fixed length of characters, with a maximum of 8,000 character
- Varchar - Same upper limit as char (8,000), but varchar refers to the fact that each entry can have a different length. Varchar is a popular data type.
- Text - Also referred to as "String" or "STR", which means that a column deals with letters rather than, numbers or symbols. Has a similar meaning as varchar, but has a maximum of 2GB.
Unicode characters and string types
- nchar - Refers to a fixed length of characters, with a maximum of 8,000 character
- nvarchar - Refers to a variable length of characters, with a maximum of 8,000 character. nvarchar takes twice as much space as varchar.
- ntext- similar to nvarchar, with a maximum storage of 2GB.
Binary data types
- binary - fixed-length binary data and can store maximum 8000 bytes. For instance, Boolean (true/false) is binary data.
- varbinary - variable-length binary data and can store maximum 8000 bytes
Other data types
- BLOB - Stands for Binary Large Objects. It is a reference or pointer to an object.
- CLOB - Stands for Character Large Object. It stores large amounts of character data, up to 4 GB in size. The CLOB data type is similar to a BLOB but includes character encoding, which defines a character set and the way each character is represented.
- XML - a data type that stores XML data. An XML file is an extensible markup language file, and it is used to structure data for storage and transport. In an XML file, you will find both tags and text. The tags in the XML file provide the structure of the data. The text is surrounded by these tags, which adhere to specific syntax guidelines
A data dictionary is like a portal. Directly from your desktop, you can access information on data resources regardless of your technical expertise. Even a manual data dictionary can help you deal with governance concerns. A data dictionary can be the data portal for your legal team. In this dictionary, you can indicate whether a given column qualifies as PII (Personally Identifiable Information), meaning that it contains any data that could potentially be used to identify a particular person. This allows you to deal with issues of privacy, for instance by restricting access to columns marked as PII straight from your desktop.
We encourage you to customize this data dictionary to your particular business, so don't hesitate to add additional fields of definition/documentation for your data objects! The template proposes a few, although they aren't part of our automated data catalog offer.
Dashboards & Data Glossary
The "Dashboard" tab allows you to document your dashboards in the exact same manner as with tables.
The template also contains a "glossary" tab. This is the space of the business glossary, containing common vocabulary and definitions for shared data. A data glossary is a repository where each business term is attached to its definition. This ensures that business keywords are always used in the right context and that everyone in the company agrees on one specific definition. Take the term "paying user" for instance. This word is used recurrently in the product department of companies. Although it seems straightforward, it can be interpreted differently by employees if they don't agree around a clear definition. For instance, an analyst might categorize users on a free trial as "paying users", while another might exclude them. Hence the importance to have a glossary for each department. The term owner is the person who wrote the definition and is generally ready to answer questions about a specific word.
Linking the data glossary with the data dictionary also generates time and productivity gains. This enriches data with business context, allowing data analysts and data scientists to align data with business strategy. In fact, when both the data glossary and the data dictionary are in the same repository, a data user can directly have access to a term's definition, as well as to any table or database related to this term.
If you have any questions about how to use the template, please reach out at email@example.com
We write about all the processes involved when leveraging data assets: from the modern data stack to data teams composition, to data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data.
At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation. We designed our catalog software to be easy to use, delightful and friendly.
Want to check it out? Reach out to us and get a free 14 day demo.
Subscribe to the Castor Blog
You might also like
CastorDoc evaluates data catalog solutions for mid-market & enterprise companies, assisting you in selecting the right tool for your data management needs.
Understand the ROI of data catalogs and how investing in CastorDoc can enhance your data management and analytics capabilities.
Fantastic tool for data discovery and documentation
“[I like] The easy to use interface and the speed of finding the relevant assets that you're looking for in your database. I also really enjoy the score given to each table, [which] lets you prioritize the results of your queries by how often certain data is used.”
Michal, Head of Data, Printify