Unveiling the Three Faces of Documentation

Three essential use cases for documentation

Writing data documentation belongs to the famous category of “important but not urgent” tasks.

There's always something more urgent to do … Or more exciting. You might also not know where to start or how to proceed.

Last week, we dove into the importance of data documentation in the first article of our “Data Documentation Demystified” series. Now that we understand why it's critical, let's delve into the different forms of documentation.

When thinking about data documentation, you might think of a data dictionary or business glossary. However, there are additional capabilities that are just as important.

In this article, we look at the various forms that data documentation can take. Specifically, there are three main use cases for documentation:

  1. Data documentation - the knowledge around your data assets to facilitate data consumption.
  2. Business knowledge - the business concepts used by the company to make decisions.
  3. Team onboarding - the necessary information to ramp up new hires.

While the primary focus of this article is on data documentation and business knowledge, we have also compiled a set of suggestions for documenting your knowledge base beyond the scope of the data. Want to know more? Subscribe to receive the rest of the series directly to your inbox.

Let’s dive in!

I. Data knowledge

Data should always be tied to knowledge and context. Data without context is dangerous, because it leads to different interpretations and confusion.

It's essential to describe the different elements of your data, such as the various tables, columns, dashboards, and fields. Documenting your data provides insight into the context of the data, its origins, and its uses.

This information helps to establish trust in the data and ensures that everyone who works with it has a clear understanding of what it represents.

When writing data descriptions, it is essential to consider what it is, where it comes from, and how to use it. This information can help users quickly and accurately identify the data they need and use it effectively.

Data knowledge can be categorized into three distinct parts: warehouse assets, reporting assets, and lineage assets. We’ll examine each of these in detail.

1. Warehouse assets 🗄️

a) Tables

Documenting data tables is a critical aspect of working with data. When documenting tables, it's important to keep in mind the perspective of the data consumer and what they need to know.

This can include information on the source of the data, the structure of the table, the meaning of the column headers, and the data types and formats used.

Having comprehensive and accurate documentation for your data tables can provide clarity, consistency, and accuracy for anyone who works with the data. It also helps to avoid misunderstandings, errors, and wasted time due to searching for information.

The table below illustrates the level of documentation you should aim for when documenting a table.

Table documentation - Image courtesy of Castor

TIP: Add links to the related assets: KPIs, Dashboards, Tables, Queries

b) Columns

When it comes to documenting column descriptions, it's essential to ensure that the information is clear and concise. Here are some key considerations for documenting columns:

  • Tag the columns that participate in the primary key and add links to foreign keys.
  • Tag columns that contain personally identifiable information (PII).
  • When the column is computed from other fields, explain the calculation.
  • When the column is an enumeration, describe each value (if relevant).
  • When the column represents a state, add a link to the related state diagram.
  • When the column is a timestamp, specify whether it is in UTC or local time.
  • When the column represents an amount, specify the currency, whether it includes VAT or other taxes, and the formula if it is computed from several other fields.
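
To make the checklist above more concrete, here is a minimal Python sketch of what a column documentation record could look like. The `ColumnDoc` schema, its field names, and the `_at` / `_amount` suffix rules are assumptions made for illustration; they do not come from any particular catalog tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ColumnDoc:
    """Documentation record for one column (hypothetical schema)."""
    name: str
    description: str
    is_primary_key: bool = False
    foreign_key: Optional[str] = None     # e.g. "flights.id"
    contains_pii: bool = False
    computed_from: Optional[str] = None   # formula, if the column is derived
    enum_values: Dict[str, str] = field(default_factory=dict)  # value -> meaning
    timezone: Optional[str] = None        # "UTC" or a local zone, for timestamps
    currency: Optional[str] = None        # currency code, for amounts
    includes_tax: Optional[bool] = None   # for amounts


def missing_details(col: ColumnDoc) -> List[str]:
    """Return the checklist items this column does not satisfy yet."""
    problems = []
    if not col.description.strip():
        problems.append("description is empty")
    # Assumed naming rule: timestamp columns end with "_at".
    if col.name.endswith("_at") and col.timezone is None:
        problems.append("timestamp has no timezone")
    # Assumed naming rule: monetary columns end with "_amount".
    if col.name.endswith("_amount") and (col.currency is None or col.includes_tax is None):
        problems.append("amount has no currency or tax rule")
    return problems


booking_at = ColumnDoc(name="booking_at", description="When the booking was made")
total_amount = ColumnDoc(
    name="total_amount",
    description="Total price paid by the customer",
    computed_from="base_fare + fees + vat",
    currency="EUR",
    includes_tax=True,
)

print(missing_details(booking_at))    # ['timestamp has no timezone']
print(missing_details(total_amount))  # []
```

A check like this could run in CI or in a scheduled job, so that gaps in column documentation surface before a data consumer stumbles on them.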

If you feel like you are copying/pasting too much, create an entry in your knowledge base and refer to it, using links. You can also use a tool like Castor that can propagate descriptions automatically with data lineage.

Description propagation using lineage - Image courtesy of Castor
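
The idea behind propagation can be sketched in a few lines of Python. This is a toy illustration under simple assumptions, not how Castor or any other tool actually implements it: a column inherits a description only when it has exactly one upstream parent and no hand-written description of its own. The table and column names are made up.

```python
def propagate_descriptions(descriptions, lineage):
    """Fill in missing column descriptions from a single upstream parent.

    descriptions: dict mapping "schema.table.column" -> description or None
    lineage: dict mapping a downstream column -> list of its upstream columns
    """
    result = dict(descriptions)
    changed = True
    while changed:  # repeat so descriptions can flow through several hops
        changed = False
        for column, parents in lineage.items():
            if result.get(column):
                continue  # never overwrite a description written by hand
            if len(parents) != 1:
                continue  # several parents: the column is computed, so skip it
            inherited = result.get(parents[0])
            if inherited:
                result[column] = inherited
                changed = True
    return result


descriptions = {
    "raw.bookings.booking_at": "UTC timestamp at which the booking was created",
    "staging.bookings.booking_at": None,
    "mart.bookings.booking_at": None,
    "mart.bookings.revenue": None,
}
lineage = {
    "staging.bookings.booking_at": ["raw.bookings.booking_at"],
    "mart.bookings.booking_at": ["staging.bookings.booking_at"],
    "mart.bookings.revenue": ["staging.bookings.price", "staging.bookings.fees"],
}

documented = propagate_descriptions(descriptions, lineage)
print(documented["mart.bookings.booking_at"])  # UTC timestamp at which the booking was created
print(documented["mart.bookings.revenue"])     # None
```

Computed columns such as `revenue` are deliberately left empty here: they need their own description and formula rather than an inherited one.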

2. Reporting assets 📊

Reporting data assets are tools and resources that are used to communicate and display important metrics and key performance indicators (KPIs) to business stakeholders.

These assets can take many forms, including dashboards, reports, and scorecards. It is important to document these assets in order to ensure that they are easily understood and accessible to their intended audience.

Keep in mind that the consumers of reporting are less technical than data-warehouse consumers. Be more business oriented: talk about metrics, business rules, workflows, etc.

Here are some guidelines for documenting reporting assets:

  • Pin related metrics and KPI definitions
  • Add visual warnings on deprecated dashboards, or drop them when possible
  • Assign owners who can be contacted for questions
  • Add labels for categories
  • If possible, indicate the refresh schedule of the report: is it refreshed daily, weekly, or on demand?

3. Data lineage: the data flow 🔀

Data lineage is the process of tracking data as it moves through various systems and processes, from its origin to its ultimate destination. This information is crucial in understanding the impact of any changes made to data structures and ensuring compliance with data privacy regulations.

Here is what documenting data lineage should let you do:

  • Visualize where data comes from and where it goes
  • Understand impact when changing something (for example: removing a table)
  • Keep track of PII columns 🔐

Note that data lineage is a complex process and it is highly recommended to use specialized tools to maintain it effectively.
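
To give a feel for what impact analysis means in practice, here is a small Python sketch that walks a lineage graph breadth-first. The graph and asset names are hypothetical, and a real lineage tool would build this graph for you from query logs or transformation code.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets that read from it.
DOWNSTREAM = {
    "raw.bookings": ["staging.bookings"],
    "raw.passengers": ["staging.passengers"],
    "staging.bookings": ["mart.revenue", "mart.pax"],
    "staging.passengers": ["mart.pax"],
    "mart.revenue": ["dashboard.finance_overview"],
    "mart.pax": ["dashboard.operations"],
}


def impacted_assets(graph, start):
    """Return every asset reachable downstream of `start`, sorted by name."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child not in seen:  # guard against visiting an asset twice
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# What breaks if we remove raw.passengers?
print(impacted_assets(DOWNSTREAM, "raw.passengers"))
# ['dashboard.operations', 'mart.pax', 'staging.passengers']
```

The same traversal answers the PII question: start from a table that holds personal data, and the result lists every downstream asset that may expose it.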

II. Business knowledge

Data is inextricably linked to the essential business concepts that form the basis of any organization, such as client, product, invoice, payment, and so on. Having a clear idea of these concepts and how they interact with each other is essential before delving into the data warehouse.

Business knowledge can be represented using Key Performance Indicators (KPIs), Entity-Relationship Diagrams (ERDs), and State Diagrams. We will examine each of these separately.

1. KPIs

It is essential to provide clear, detailed explanations of your key metrics. Agree on a specific definition for each metric and on how it is calculated, so everyone can be on the same page. Here are four rules you can follow when defining KPIs:

1- Be specific

When setting up your key performance indicators (KPIs), it's crucial to clearly specify what is included in the calculation and what is not. Here is an example of how to state precisely what each KPI includes and excludes:

  • Canceled reservations are excluded from revenue
  • No-show passengers are included in booking_count but excluded from pax
  • A flight counts as on-time if the departure delay is not greater than 15 minutes
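
Rules like these are easiest to agree on when they can be read as code. The sketch below encodes the three rules above in Python. The record layout and status values are hypothetical, and one point is an explicit assumption because the rules do not cover it: canceled reservations are left out of booking_count.

```python
ON_TIME_THRESHOLD_MINUTES = 15

# Hypothetical reservation records; field names and statuses are illustrative.
reservations = [
    {"price": 200.0, "status": "flown"},
    {"price": 150.0, "status": "no_show"},
    {"price": 300.0, "status": "canceled"},
]


def revenue(rows):
    """Canceled reservations are excluded from revenue."""
    return sum(r["price"] for r in rows if r["status"] != "canceled")


def booking_count(rows):
    """No-shows are included in booking_count.

    Assumption: canceled reservations are not counted (not stated in the rules).
    """
    return sum(1 for r in rows if r["status"] != "canceled")


def pax(rows):
    """Only passengers who actually flew count as pax (no-shows are excluded)."""
    return sum(1 for r in rows if r["status"] == "flown")


def is_on_time(departure_delay_minutes):
    """On time means the departure delay is not greater than 15 minutes."""
    return departure_delay_minutes <= ON_TIME_THRESHOLD_MINUTES


print(revenue(reservations))           # 350.0
print(booking_count(reservations))     # 2
print(pax(reservations))               # 1
print(is_on_time(15), is_on_time(16))  # True False
```

Writing the edge cases down this way forces the boundary questions (is a delay of exactly 15 minutes on time?) to be settled once, in the definition.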

2- Cite your data sources.

Cite the tables and columns each KPI is computed from, so readers know exactly which data it relies on. For example:

  • booking_count is computed using booking_at.
  • to compute pax, we use departure_at.
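
Why does the date column matter so much? The Python sketch below, built on made-up bookings, shows that the same three rows produce different monthly figures depending on whether they are attributed by booking_at or by departure_at.

```python
from collections import Counter
from datetime import date

# Hypothetical bookings; none is canceled or a no-show, to isolate the date logic.
bookings = [
    {"booking_at": date(2023, 1, 30), "departure_at": date(2023, 2, 2)},
    {"booking_at": date(2023, 1, 31), "departure_at": date(2023, 1, 31)},
    {"booking_at": date(2023, 2, 1), "departure_at": date(2023, 2, 10)},
]


def count_by_month(rows, date_field):
    """Count rows per calendar month of the chosen date column."""
    return Counter(r[date_field].strftime("%Y-%m") for r in rows)


booking_count = count_by_month(bookings, "booking_at")
pax = count_by_month(bookings, "departure_at")

print(dict(booking_count))  # {'2023-01': 2, '2023-02': 1}
print(dict(pax))            # {'2023-01': 1, '2023-02': 2}
```

Without the citation, two analysts could report different January numbers from the same table and both be "right".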

3- Provide examples of dashboards where this KPI can be shown.

This helps to give users a visual representation of the KPI and how it is used in real-world situations.

For example, a dashboard for an airline may include the KPIs "On-time Departure Rate" and "Booking Load Factor" to provide an overview of the airline's performance. These KPIs can be displayed in a line chart or table format, showing trends over time or comparisons with previous periods.

4- Provide an example of queries computing this KPI.

This helps users understand the logic behind the KPI and how it is calculated.

For example, to calculate the KPI "Passengers per Flight Segment" for an airline, a query may be used to join data from the passengers table and the flights table, and then group the data by flight segments. In your data catalog, this may look like:

Query example - Image courtesy of Castor
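
If you want something runnable to attach to the KPI, here is a minimal sketch using Python's built-in sqlite3 module. The flights and passengers schemas, the sample rows, and the choice to exclude no-shows are all assumptions for illustration; your warehouse tables will look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flights (id INTEGER PRIMARY KEY, origin TEXT, destination TEXT);
    CREATE TABLE passengers (id INTEGER PRIMARY KEY, flight_id INTEGER, no_show INTEGER);
    INSERT INTO flights VALUES (1, 'CDG', 'JFK'), (2, 'CDG', 'JFK'), (3, 'JFK', 'LAX');
    INSERT INTO passengers VALUES (1, 1, 0), (2, 1, 0), (3, 2, 0), (4, 2, 1), (5, 3, 0);
""")

# Join passengers to flights, then group by flight segment (origin-destination).
# Assumption: no-show passengers are excluded from the count.
QUERY = """
    SELECT f.origin || '-' || f.destination AS segment,
           COUNT(p.id) AS pax
    FROM flights AS f
    JOIN passengers AS p ON p.flight_id = f.id
    WHERE p.no_show = 0
    GROUP BY segment
    ORDER BY segment
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [('CDG-JFK', 3), ('JFK-LAX', 1)]
conn.close()
```

Keeping a query like this next to the KPI definition lets readers check the logic themselves instead of reverse-engineering a dashboard.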

Here's an illustration of the level of detail needed for computing a KPI:

Level of detail needed to compute a KPI - Image courtesy of Castor

TIP: When a KPI involves other KPIs, use links instead of copying and pasting the whole computation rules. 🔗

2. Entity-Relationship Diagram (ERD)

An ERD is like a blueprint for a database. It shows how different business entities are connected to each other.

Mermaid is a good tool to draw these diagrams. Diagrams are written in a Markdown-like text syntax, which makes changes easy to track in version control. It also has a dedicated diagram type for ERDs.

Mermaid ER diagram
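
Because Mermaid diagrams are plain text, you can also generate them from a list of relationships. The Python sketch below does this for the business concepts mentioned earlier (client, product, invoice, payment). The relationships and cardinalities are hypothetical; adapt them to your own business rules.

```python
# Hypothetical relationships: (left entity, Mermaid cardinality, right entity, label).
# "||--o{" reads "exactly one to zero or more"; "}o--|{" reads "zero or more to one or more".
RELATIONSHIPS = [
    ("CLIENT", "||--o{", "INVOICE", "receives"),
    ("INVOICE", "}o--|{", "PRODUCT", "lists"),
    ("INVOICE", "||--o{", "PAYMENT", "has"),
]


def to_mermaid_erd(relationships):
    """Render the relationships as Mermaid erDiagram source text."""
    lines = ["erDiagram"]
    for left, cardinality, right, label in relationships:
        lines.append(f"    {left} {cardinality} {right} : {label}")
    return "\n".join(lines)


erd_source = to_mermaid_erd(RELATIONSHIPS)
print(erd_source)
```

Paste the printed text into any Mermaid renderer to get the diagram, and keep the relationship list in version control so that changes to the model show up in code review.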

Your Entity Relationship Diagram (ERD) should always be accompanied by a business glossary to ensure that all of the entities within the diagram are clearly defined.

This glossary should also provide further details about the entities, such as their purpose and any relevant relationships.

Business glossary - Image courtesy of Castor

3. State diagrams

"A state machine diagram models the behavior of a single object, specifying the sequence of events that an object goes through during its lifetime in response to events."
Source: https://sparxsystems.com

A state diagram is a visual representation of the possible states of an object and the transitions between those states.

Entities with status are generally good candidates for these diagrams. State diagrams describe the different events and state modifications. Once again, Mermaid is a great tool for this:

State diagram - Image courtesy of Lloyd Atkinson
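
A state diagram and the rules it describes can live in the same place. In the Python sketch below, a single list of transitions is used both to validate status changes and to generate Mermaid source. The booking lifecycle (states and events) is a made-up example.

```python
# Hypothetical lifecycle of a booking: (from state, event, to state).
TRANSITIONS = [
    ("pending", "pay", "confirmed"),
    ("pending", "cancel", "canceled"),
    ("confirmed", "cancel", "canceled"),
    ("confirmed", "board", "flown"),
    ("confirmed", "miss_flight", "no_show"),
]
INITIAL_STATE = "pending"


def next_state(state, event):
    """Return the state reached from `state` on `event`, or raise if not allowed."""
    for source, name, target in TRANSITIONS:
        if source == state and name == event:
            return target
    raise ValueError(f"event {event!r} is not allowed in state {state!r}")


def to_mermaid_state_diagram(transitions, initial):
    """Render the transitions as Mermaid stateDiagram-v2 source text."""
    lines = ["stateDiagram-v2", f"    [*] --> {initial}"]
    for source, event, target in transitions:
        lines.append(f"    {source} --> {target}: {event}")
    return "\n".join(lines)


print(next_state("pending", "pay"))  # confirmed
diagram_source = to_mermaid_state_diagram(TRANSITIONS, INITIAL_STATE)
print(diagram_source)
```

Generating the diagram from the transition list means the documentation cannot drift away from the rules it describes.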

III. Data onboarding

Data onboarding is about equipping new employees with the information they need when they join the team. This includes your data stack, organizational structure, naming conventions, frequently asked questions, and SQL best practices. We will go over each of these.

1. Data stack

A company’s data stack involves a variety of tools that are essential for efficient data management. Building a company-wide understanding of your data stack helps avoid confusion and repeated questions.

The documentation should be straightforward, easy to understand, and directly address the most critical information. Keep it simple!

Most common tools of the modern data stack - Image courtesy of Dataiku

2. Organizational chart

Data teams are composed of many different functions. Depending on the task, you may need to speak to a data scientist, business analyst, data engineer, or another specialized role. This article provides a good overview of the different functions and key players within data teams.

Make sure you have an organizational chart, and make sure it is shared with everyone.

If you’re looking for tools to draw charts, Whimsical comes with very handy capabilities.

3. Naming conventions

If you have naming conventions for your data warehouse, share them with new members of the data team. Shared conventions keep names consistent, reduce confusion and errors, and make projects easier to scale and maintain over time.
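
A convention is easier to follow when it can be checked automatically. Here is a small Python sketch that validates table names against a made-up convention: a layer prefix (raw, stg, or mart) followed by lowercase snake_case. The convention itself is only an example; replace the pattern with your own rules.

```python
import re

# Hypothetical convention: <layer>_<name>, lowercase snake_case,
# where layer is one of raw, stg, or mart.
TABLE_NAME_PATTERN = re.compile(r"(raw|stg|mart)_[a-z][a-z0-9]*(_[a-z0-9]+)*")


def invalid_table_names(names):
    """Return the table names that do not follow the convention."""
    return [name for name in names if not TABLE_NAME_PATTERN.fullmatch(name)]


tables = ["stg_bookings", "mart_revenue_daily", "Bookings", "tmp_test"]
print(invalid_table_names(tables))  # ['Bookings', 'tmp_test']
```

Including a check like this in the onboarding documentation gives new hires both the rule and a way to test their work against it.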

4. FAQ

"Where's the glory in repeating what others have done?"
Rick Riordan, New York Times bestselling author

Consider the questions that you receive on a regular basis. Then, take the time to write comprehensive answers for these questions, to ensure you won't have to keep repeating the same information.

Some examples of questions to answer include:

  • When will my data be updated?
  • Where can I locate this information?
  • How can I gain access to data?
  • Who can I contact for questions about a dashboard?
  • How do I report a data issue?
  • What is the procedure for obtaining new data?
  • What is the cost of running a query?

Don’t blame users for asking questions that could have been answered in your documentation. Instead, provide a concise response and include a relevant link.

This will lead your users to increasingly turn to your FAQ for answers in the future, provided that it's well-constructed.

5. SQL cheat sheet

SQL is a must-have in every data consumer's toolkit. Mastering this language can have numerous benefits for your team. To make it easier, consider breaking down the cheat sheets into two categories.

First, cover general SQL knowledge. There are a lot of good cheat sheets out there if you need inspiration.

Second, think about specific advice for your stack. This includes providing sample queries to compute your main KPIs and typical JOINs to navigate among your main entities.

Final words

In conclusion, data documentation is an important task that is often overlooked due to its "important but not urgent" nature.

However, as we discussed in the first article of this series, it is critical for effective data management.

In this article, we explored the different forms that data documentation can take, including data knowledge, business knowledge and team onboarding. With this context, organizations can develop more comprehensive and effective data documentation strategies to drive better business outcomes.

About us

We write about all the processes involved when leveraging data assets: from the modern data stack to data teams composition, to data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data.

At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation.

Or data-wise for the Fivetran, Looker, Snowflake, DBT aficionados. We designed our catalog software to be easy to use, delightful and friendly.

Want to check it out? Reach out to us and we will show you a demo.
