Writing data documentation belongs to the famous category of “important but not urgent” tasks.
There's always something more urgent to do, or more exciting. You might also not know where to start or how to proceed.
Last week, we dove into the importance of data documentation in the first article of our “Data Documentation Demystified” series. Now that we understand why it's critical, let's look at the different forms documentation can take.
When thinking about data documentation, you might think of a data dictionary or business glossary. However, there are additional capabilities that are just as important.
In this article, we look at the various forms that data documentation can take. Specifically, there are three main use cases for documentation: data knowledge, business knowledge, and team onboarding.
While the primary focus of this article is on data documentation and business knowledge, we have also compiled a set of suggestions for documenting your knowledge base beyond the scope of the data. Want to know more? Subscribe to receive the rest of the series directly to your inbox.
Let’s dive in!
Data should always be tied to knowledge and context. Data without context is dangerous, because it leads to different interpretations and confusion.
It's essential to describe the different elements of your data, such as the various tables, columns, dashboards, and fields. Documenting your data provides insight into the context of the data, its origins, and its uses.
This information helps to establish trust in the data and ensures that everyone who works with it has a clear understanding of what it represents.
When writing data descriptions, consider what the data is, where it comes from, and how to use it. This information helps users quickly and accurately identify the data they need and use it effectively.
Data knowledge can be categorized into three distinct parts: warehouse assets, reporting assets, and lineage assets. We’ll examine each of these in detail.
Documenting data tables is a critical aspect of working with data. When documenting tables, it's important to keep in mind the perspective of the data consumer and what they need to know.
This can include information on the source of the data, the structure of the table, the meaning of the column headers, and the data types and formats used.
Comprehensive and accurate documentation for your data tables provides clarity and consistency for anyone who works with the data. It also helps avoid misunderstandings, errors, and time wasted searching for information.
The table below illustrates the level of documentation you should aim for when documenting a table.
TIP: Add links to related assets: KPIs, dashboards, tables, and queries.
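As a rough sketch of what a table entry could hold, the Python snippet below models a table's documentation as a small record and reports which parts are still missing. The field names (`source`, `owner`, `related_assets`) and the `orders` example are illustrative assumptions, not the schema of any particular catalog.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnDoc:
    name: str
    data_type: str
    description: str = ""


@dataclass
class TableDoc:
    # All field names are hypothetical; adapt them to your own catalog.
    name: str
    description: str = ""
    source: str = ""
    owner: str = ""
    columns: list[ColumnDoc] = field(default_factory=list)
    related_assets: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """List the documentation gaps a data consumer would run into."""
        gaps = [f for f in ("description", "source", "owner") if not getattr(self, f)]
        gaps += [
            f"columns.{c.name}.description" for c in self.columns if not c.description
        ]
        return gaps


orders = TableDoc(
    name="orders",
    description="One row per customer order.",
    source="Checkout service database",
    columns=[
        ColumnDoc("order_id", "INTEGER", "Unique identifier of the order."),
        ColumnDoc("status", "TEXT"),  # description intentionally left empty
    ],
    related_assets=["dashboard: Weekly Sales", "kpi: Average Order Value"],
)

print(orders.missing_fields())
```

A check like this can run in a review step so that undocumented tables and columns are caught before consumers stumble on them.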
When it comes to documenting column descriptions, it's essential to ensure that the information is clear and concise. Here are some key considerations for documenting columns:
If you feel like you are copying/pasting too much, create an entry in your knowledge base and refer to it, using links. You can also use a tool like Castor that can propagate descriptions automatically with data lineage.
Reporting data assets are tools and resources that are used to communicate and display important metrics and key performance indicators (KPIs) to business stakeholders.
These assets can take many forms, including dashboards, reports, and scorecards. It is important to document these assets in order to ensure that they are easily understood and accessible to their intended audience.
Keep in mind that the consumers of reporting are usually less technical than data-warehouse consumers. Be more business-oriented: talk about metrics, business rules, workflows, etc.
Here are some guidelines for documenting reporting assets:
Data lineage is the process of tracking data as it moves through various systems and processes, from its origin to its ultimate destination. This information is crucial in understanding the impact of any changes made to data structures and ensuring compliance with data privacy regulations.
Here are the rules for documenting data lineage assets:
Note that data lineage is a complex process and it is highly recommended to use specialized tools to maintain it effectively.
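To make the idea of impact analysis concrete, here is a minimal sketch that stores lineage as a mapping from each asset to its direct downstream assets and walks it to find everything affected by a change. The asset names are made up, and a dedicated lineage tool would typically build this graph from metadata rather than from a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage: asset -> assets that read directly from it.
LINEAGE = {
    "raw.payments": ["staging.payments"],
    "staging.payments": ["marts.revenue", "marts.refunds"],
    "marts.revenue": ["dashboard.monthly_revenue"],
    "marts.refunds": [],
    "dashboard.monthly_revenue": [],
}


def downstream_assets(asset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every asset affected by a change to `asset`."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(lineage.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited through another path
        seen.add(current)
        order.append(current)
        queue.extend(lineage.get(current, []))
    return order


print(downstream_assets("staging.payments", LINEAGE))
```

Before changing a column in `staging.payments`, the returned list tells you which tables and dashboards to review and whose owners to notify.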
Data is inextricably linked to the essential business concepts that form the basis of any organization, such as client, product, invoice, payment, and so on. Having a clear idea of these concepts and how they interact with each other is essential before delving into the data warehouse.
Business knowledge can be represented using Key Performance Indicators (KPIs), Entity-Relationship Diagrams (ERDs), and State Diagrams. We will examine each of these separately.
It is essential to provide clear, detailed explanations of your key metrics. Agree on specific definitions for each metric and how it is calculated, so everyone is on the same page. Here are four rules you can follow when defining KPIs:
When setting up your key performance indicators (KPIs), it's crucial to clearly specify what is included in the calculation and what is not. The following provides an example of how to precisely determine what is included and excluded in each KPI:
Include data source citations when creating KPIs to clarify which tables are utilized for calculating specific KPIs. For example:
Linking each KPI to the dashboards that display it gives users a visual representation of the KPI and how it is used in real-world situations.
For example, a dashboard for an airline may include the KPIs "On-time Departure Rate" and "Booking Load Factor" to provide an overview of the airline's performance. These KPIs can be displayed in a line chart or table format, showing trends over time or comparisons with previous periods.
Including the query used to compute the KPI helps users understand the logic behind the KPI and how it is calculated.
For example, to calculate the KPI "Passengers per Flight Segment" for an airline, a query may be used to join data from the passengers table and the flights table, and then group the data by flight segments. In your data catalog, this may look like:
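A query along these lines can be sketched with Python's built-in `sqlite3` module. The table and column names (`flights.segment`, `passengers.flight_id`), the sample rows, and the reading of the KPI as "number of passengers carried on each segment" are all assumptions made for illustration; your warehouse schema and KPI definition will differ.

```python
import sqlite3

# In-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE flights (flight_id INTEGER PRIMARY KEY, segment TEXT);
    CREATE TABLE passengers (passenger_id INTEGER PRIMARY KEY, flight_id INTEGER);
    INSERT INTO flights VALUES (1, 'CDG-JFK'), (2, 'CDG-JFK'), (3, 'JFK-SFO');
    INSERT INTO passengers (flight_id) VALUES (1), (1), (1), (2), (3), (3);
    """
)

# Join passengers to flights, then group by flight segment.
# LEFT JOIN keeps segments that carried no passengers (count 0).
KPI_QUERY = """
    SELECT f.segment,
           COUNT(p.passenger_id) AS passengers
    FROM flights AS f
    LEFT JOIN passengers AS p ON p.flight_id = f.flight_id
    GROUP BY f.segment
    ORDER BY f.segment
"""

rows = conn.execute(KPI_QUERY).fetchall()
for segment, passengers in rows:
    print(segment, passengers)
```

Storing the query next to the KPI definition lets readers check exactly which tables and joins feed the number they see on a dashboard.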
Here's an illustration of the level of detail needed for computing a KPI:
TIP: When a KPI involves other KPIs, use links instead of copying and pasting the whole computation rules. 🔗
An Entity-Relationship Diagram (ERD) is like a blueprint for a database. It shows how different business entities are connected to each other.
Mermaid is a good tool for drawing these diagrams. Because diagrams are written as plain text, in a Markdown-like syntax, changes are easy to track. It also has a dedicated syntax for ERDs.
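As an illustration, the short Python sketch below turns a list of relationships into Mermaid `erDiagram` source text that you can paste into any Mermaid renderer. The entities come from the business concepts mentioned earlier (client, invoice, product, payment); the cardinalities and labels are assumptions chosen for the example, not a reference model.

```python
# Hypothetical relationships: (left entity, Mermaid cardinality, right entity, label).
RELATIONSHIPS = [
    ("CLIENT", "||--o{", "INVOICE", "receives"),
    ("INVOICE", "}o--|{", "PRODUCT", "lists"),
    ("INVOICE", "||--o{", "PAYMENT", "settled by"),
]


def to_mermaid_erd(relationships) -> str:
    """Render relationships as Mermaid `erDiagram` source text."""
    lines = ["erDiagram"]
    for left, cardinality, right, label in relationships:
        # Labels are quoted so that multi-word labels stay valid.
        lines.append(f'    {left} {cardinality} {right} : "{label}"')
    return "\n".join(lines)


erd_source = to_mermaid_erd(RELATIONSHIPS)
print(erd_source)
```

Keeping the relationships in a plain list means the diagram can be reviewed and versioned like any other text file.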
Your Entity Relationship Diagram (ERD) should always be accompanied by a business glossary to ensure that all of the entities within the diagram are clearly defined.
This glossary should also provide further details about the entities, such as their purpose and any relevant relationships.
"A state machine diagram models the behavior of a single object, specifying the sequence of events that an object goes through during its lifetime in response to events." (Source: https://sparxsystems.com)
A state diagram is a visual representation of the possible states of an object and the transitions between those states.
Entities with status are generally good candidates for these diagrams. State diagrams describe the different events and state modifications. Once again, Mermaid is a great tool for this:
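For instance, the sketch below describes a status lifecycle as a list of transitions and renders it as Mermaid `stateDiagram-v2` source text. The invoice states and events (`Draft`, `Sent`, `Paid`, `Cancelled`) are hypothetical and only serve to show the shape of the output.

```python
# Hypothetical invoice lifecycle: (from_state, event, to_state).
TRANSITIONS = [
    ("Draft", "send", "Sent"),
    ("Sent", "pay", "Paid"),
    ("Sent", "cancel", "Cancelled"),
]


def to_mermaid_state_diagram(initial, transitions, final_states=()) -> str:
    """Render transitions as Mermaid `stateDiagram-v2` source text."""
    lines = ["stateDiagram-v2", f"    [*] --> {initial}"]
    for source, event, target in transitions:
        lines.append(f"    {source} --> {target} : {event}")
    for state in final_states:
        # `[*]` on the right-hand side marks the end of the lifecycle.
        lines.append(f"    {state} --> [*]")
    return "\n".join(lines)


diagram_source = to_mermaid_state_diagram(
    "Draft", TRANSITIONS, final_states=("Paid", "Cancelled")
)
print(diagram_source)
```

The same transition list can double as documentation of which status changes are allowed, which is often the question analysts ask about a `status` column.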
Data onboarding is about equipping new employees with the information they need when they join the team. This includes your data stack, organizational structure, naming conventions, frequently asked questions, and SQL best practices. We will go over each of these.
A company’s data stack involves a variety of tools that are essential for efficient data management. Building a company-wide understanding of your data stack helps avoid confusion and repeated questions.
The documentation should be straightforward, easy to understand, and focused on the most critical information. Keep it simple!
Data teams are composed of many different functions. Depending on the task, you may need to speak to a data scientist, business analyst, data engineer, or another specialized role. This article provides a good overview of the different functions and key players within data teams.
Make sure you have an organizational chart, and make sure it is shared with everyone.
If you have naming conventions for your data warehouse, share them with the team, and especially with new members. Shared conventions keep names consistent, reduce confusion and errors, make collaboration easier, support data governance, and help projects scale and stay maintainable over time.
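One way to make a convention easy to follow is to express it as a check that anyone can run. The sketch below assumes a made-up convention, a layer prefix (`raw_`, `stg_`, `mart_`) followed by lowercase snake_case, purely for illustration; substitute your own rules.

```python
import re

# Hypothetical convention: <layer>_<name> in lowercase snake_case.
TABLE_NAME_PATTERN = re.compile(r"(raw|stg|mart)_[a-z][a-z0-9]*(_[a-z0-9]+)*")


def invalid_table_names(names):
    """Return the names that do not follow the convention."""
    return [name for name in names if not TABLE_NAME_PATTERN.fullmatch(name)]


candidates = ["stg_payments", "mart_monthly_revenue", "Payments_Final", "tmp_test"]
print(invalid_table_names(candidates))
```

Documenting the rule and the check side by side gives new team members both the reasoning and a quick way to validate their own table names.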
"Where's the glory in repeating what others have done?"
Rick Riordan, New York Times bestselling author
Consider the questions that you receive on a regular basis. Then, take the time to write comprehensive answers for these questions, to ensure you won't have to keep repeating the same information.
Some examples of questions to answer include:
Don’t blame users for asking questions that could have been answered in your documentation. Instead, provide a concise response and include a relevant link.
This will lead your users to increasingly turn to your FAQ for answers in the future, provided that it's well-constructed.
SQL is a must-have in every data consumer's toolkit, and mastering it benefits the whole team. To make it easier, consider providing cheat sheets in two categories.
First, general SQL knowledge. There are a lot of good cheat sheets out there if you need inspiration.
Second, specific advice for your stack. This includes sample queries to compute your main KPIs and typical JOINs to navigate among your main entities.
In conclusion, data documentation is an important task that is often overlooked due to its "important but not urgent" nature.
However, as we discussed in the first article of this series, it is critical for effective data management.
In this article, we explored the different forms that data documentation can take, including data knowledge, business knowledge and team onboarding. With this context, organizations can develop more comprehensive and effective data documentation strategies to drive better business outcomes.