We increasingly hear about metadata and its importance for data analysis. Although metadata sounds like a fancy word, it refers to a dead-simple concept. We owe the prefix "meta" (amongst other things) to the ancient Greeks. In modern usage, meta means "about the thing itself". A meta-joke is a joke about jokes; meta-thinking is thinking about thinking. By logical extension, metadata is data about data. It is data whose only purpose is to define and describe the data object it is linked to. For example, a web page might include metadata specifying what language the page is written in, what tools were used to create it, what subjects the page is about, etc. Before I dive deeper into the importance of metadata and metadata management, let's take the time to learn more about this topic by going over a few examples of metadata, both in the real and in the digital world.
The term metadata gained currency in the 1990s in the context of describing resources on the Internet and became widespread thereafter.
Metadata has long been used to catalog items in libraries, in both digital and analog formats. For example, it helps classify, aggregate, identify, and locate a particular book, DVD, magazine, or other object a library may hold in its collection.
More recent and specialized examples of library metadata include the creation of digital libraries, including electronic print repositories and digital image libraries. Although primarily based on library principles, the emphasis on use by non-librarians, particularly in the metadata provision, means that they do not follow traditional or mainstream cataloging approaches. Given the personalized nature of the documents included, metadata fields are often custom-built, e.g. taxonomic classification fields, location fields, keywords or copyright statements. Size and format, which are considered to be standard file information, are automatically included.
In the real world, metadata is everywhere. Each time you open an e-mail, read a book or order something off Amazon, you encounter metadata.
Every book is enriched with metadata. Thanks to metadata, books can be classified in a neat manner, enabling potential readers to find them in no time. Metadata about books includes:
When you take a photo with your iPhone, metadata is generated and saved just as the photo is created. This metadata includes:
You also encounter metadata every time you receive or send an e-mail. This metadata allows for the effective classification of e-mails in your mailbox and helps you find specific e-mails quickly using keywords. Metadata for e-mails usually includes:
Journal editors and citation databases usually create metadata for scientific publications. Data contained in manuscripts, or accompanying them as supplementary material, is less often the subject of metadata creation. The original authors and the curators of the database are then responsible for creating the metadata with the help of automated processes. Comprehensive metadata for all experimental data forms the basis for standards that ensure findable, accessible, interoperable and reusable research data.
For research studies, transparent metadata about each author's contribution to the study has also been proposed, e.g. their role, level of assistance and responsibilities in preparing the document.
For clarity purposes, different types of metadata have been put in specific categories. The different types of metadata are the following:
Descriptive metadata: data that describes information about a resource or a file. It is used to help with discovery and identification. Descriptive metadata includes elements such as title, abstract, author, keywords.
Structural metadata: data that informs about the structure of the data object. It tells users how a resource or file is organized. An example of structural metadata is a table of contents. A table of contents indicates how pages form chapters, and how the chapters are related to each other.
Administrative metadata: Technical information that helps manage a resource. This can be the date on which the file was created, file type, permissions, etc. Administrative metadata is also related to usage rights and intellectual property, providing information such as the owner of a given asset, how it can be used, by whom, and for how long.
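To make the three categories more concrete, here is a minimal Python sketch of how they could be modeled for a single resource. The class and field names are my own illustrative assumptions based on the elements listed above; they do not follow any particular metadata standard.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DescriptiveMetadata:
    # Helps with discovery and identification.
    title: str
    author: str
    abstract: str = ""
    keywords: list[str] = field(default_factory=list)


@dataclass
class StructuralMetadata:
    # Describes how the resource is organized, like a table of contents:
    # chapter title -> first page of that chapter.
    chapters: dict[str, int] = field(default_factory=dict)


@dataclass
class AdministrativeMetadata:
    # Technical and rights information used to manage the resource.
    created_on: date
    file_type: str
    owner: str
    permissions: list[str] = field(default_factory=list)


@dataclass
class ResourceMetadata:
    # Groups the three categories for one resource.
    descriptive: DescriptiveMetadata
    structural: StructuralMetadata
    administrative: AdministrativeMetadata


# Hypothetical example values, for illustration only.
report = ResourceMetadata(
    descriptive=DescriptiveMetadata(
        title="Annual Sales Report",
        author="Jane Doe",
        keywords=["sales", "annual"],
    ),
    structural=StructuralMetadata(chapters={"Summary": 1, "Results": 12}),
    administrative=AdministrativeMetadata(
        created_on=date(2021, 1, 15),
        file_type="pdf",
        owner="Finance team",
        permissions=["read"],
    ),
)

print(report.descriptive.title, report.administrative.file_type)
```

Keeping the categories in separate classes mirrors the distinction made above: one part is used to find the resource, one to navigate it, and one to manage it.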
There are two ways to deal with digital metadata storage.
All organizations that have to manage information, such as libraries, archives or media libraries, already have a long practice of describing or encoding the contents of the documents they process. Before the advent of computers, cardboard index cards were used, whose structure was standardized in 1954.
These descriptions were later computerized in the form of bibliographic and standardized records. They facilitate the internal management of document resources and, on the user side, make it possible to optimize the search and location of documents.
Digital libraries have used the same devices to manage and locate electronic documents. The exchange of data items extracted from these records was quickly standardized within distributed applications.
In people's minds, digital data is more important than metadata. What you might not realize is that metadata is the key to unlocking the value residing in your data.
I was reminded of the importance of metadata by an unfortunate event that happened to me this week. I forgot my bag, containing my car keys and other valuable stuff, in a park after lunch. When I returned two hours later, the bag had of course disappeared. Luckily, I had invested in a small device called a Tile, which locates my car keys at all times. I just need to open the Tile app to know the location of my keys. Basically, Tile gives me access to my car keys' metadata. This allowed me to locate the keys, and I was back to work in no time.
Metadata is crucial. My car keys are precious, but if I don't know where they're located, they are of no use to me. If your organization collects any kind of data, you're in the same situation. You can have great datasets, but if you can't locate them in your cloud data warehouse or other locations, they are utterly useless as well.
This also shows you the importance of investing in a metadata management tool. I'm happy I have the Tile, automatically generating and updating metadata about my car keys. I'm busy and don't want to waste time thinking about the location of my various devices and objects, regardless of how important they are. In an ideal world, I wouldn't spend any time thinking about my car keys, but would always find them right when I need them. Your organization might face the same issue with data. Digital assets multiply, and people have other priorities than keeping a neat record of the metadata. Of course, it's nice to be able to locate a table right when you need it, but it's surely not worth dedicating 100% of your energy to keeping track of every digital asset in your cloud data warehouse. That's when it becomes interesting to invest in metadata management tools that automatically collect metadata about your datasets. If you're looking for such a tool, we've made a benchmark of all the data cataloging solutions on the market. If you're not ready for a tool yet, but still want to maintain a neatly organized metadata repository, feel free to use our handmade solution.
You've probably got the message at this point: metadata is key. Still, cultivating metadata about your data objects will impact your organization in more ways than you might expect. Metadata tells you what data you have, where it comes from, what it means and how it relates to the other data you have. This helps your organization in four areas: data discovery and trust, data governance, data quality, and cost management/data maintenance.
Data discovery is obviously the first beneficiary of a good metadata management strategy. Having an organized and centralized repository of metadata allows you to know exactly where your data lives and who has access to it. Each table is enriched with context about what it contains, who imported it into the company, which dashboard and KPI it is related to, and any other information that can help data scientists locate it. In short, metadata makes your data discoverable. A metadata repository answers the following questions:
These are fairly basic, almost ridiculous questions. The sad reality is that most data users waste huge amounts of time trying to answer these questions. A lot of organizations neglect their metadata, leading data analysts to spend hours looking through different locations to find the digital assets they need.
This problem becomes more important as companies collect data from an increasing number of cloud source applications. Without clear documentation standards, enterprise data resources end up being poorly organized. As data is moved from cloud applications to the cloud data warehouse, data resources remain linked to the metadata generated automatically in those applications. You can thus end up with two files bearing the same name, "users_2020" for example, but with the word "users" referring to completely different concepts in each file, because the definition of the term "user" is different in Salesforce and in Marketo. This shows you the importance of establishing a clear metadata management standard in your system.
Once your business collects metadata in a standardized process, it becomes easier to find the data you need at the time you need it by leveraging the search features of data discovery tools. Remember how easy it is to find a photo on your iPhone when the only thing you remember is the location where the photo was taken?
I'm enjoying this feature way too much, as I have more than 40,000 photos stored in the cloud. When you've collected metadata, you can find digital assets in the blink of an eye by using keywords, just like a Google search.
This saves data users considerable amounts of time compared to having to scan each data source in the system in the quest for the right data asset.
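As a rough illustration of keyword search over a metadata repository, here is a small Python sketch. The catalog entries, field names and the `search_catalog` helper are hypothetical; real data discovery tools typically use far richer indexing and ranking than this plain substring match.

```python
# A toy metadata repository: one entry per table (hypothetical names).
catalog = [
    {
        "name": "users_2020_salesforce",
        "description": "Customer accounts exported from the CRM",
        "tags": ["crm", "sales"],
    },
    {
        "name": "users_2020_marketo",
        "description": "Leads who opened a marketing email",
        "tags": ["marketing"],
    },
    {
        "name": "orders",
        "description": "One row per order placed on the website",
        "tags": ["sales", "revenue"],
    },
]


def search_catalog(entries, keyword):
    """Return the names of tables whose metadata mentions the keyword."""
    keyword = keyword.lower()
    matches = []
    for entry in entries:
        # Search the name, the description and the tags, case-insensitively.
        searchable = " ".join(
            [entry["name"], entry["description"], *entry["tags"]]
        ).lower()
        if keyword in searchable:
            matches.append(entry["name"])
    return matches


print(search_catalog(catalog, "marketing"))
```

The point of the sketch is that the search never touches the tables themselves: it only reads the metadata, which is why it stays fast however large the underlying data is.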
Metadata also lets you group similar or related objects together, which helps optimize the use of digital assets. For example, metadata can help you pair a database with the dashboard that has been created from it.
Conversely, it lets you tell dissimilar objects apart, so that each one is grouped only with the assets it actually relates to.
If you're dealing with data, you need to be prepared to deal with security and compliance issues. These matters often feel like a mountain to deal with. Sensitive data and private information should not end up in the wrong hands, yet it feels almost impossible to control things when you're dealing with thousands, even millions of datasets. The key to ensuring security and compliance with laws such as GDPR is to have a solid data governance strategy.
Data governance is a set of policies regarding data usage and data security. These policies are created to determine the appropriate actions to be applied to a given dataset.
Again, here, metadata saves your life. It provides the means for identifying, defining, and classifying data within categories to ensure strong data governance. More particularly, it allows you to:
High-quality data is highly desirable, as it makes your organization's resources more reliable, increasing the business benefits gained by using them. Data quality is measured according to the following basic set of dimensions:
Before investing in an expensive data quality solution, look at what your metadata has to say about data quality. If you've invested in a metadata management strategy, data users should have the following elements at their disposal for each digital asset:
That already takes you quite far in your data quality assessment. First, you know a bit about the accuracy of your data. The table definition provides you with clear information about what the digital asset contains, how the information was collected, and by whom. This allows you to quickly check whether the information reflects real-world facts and has been accurately measured. Having a centralized repository of metadata also supports data consistency. Finally, metadata tells you when your tables were last refreshed and thus whether your data is up to date.
Finally, metadata can be a great help when it comes to optimizing database management, and especially data storage. Different storage solutions have different costs. For example, it's more expensive to store data in a cloud data warehouse than in a database. In general, the easier it is to access digital assets in a storage solution, the more expensive that solution will be. Based on metadata, you can identify the tables that are used the most as well as the rarely used tables in your business. This is extremely practical, as it allows you to move the unused tables to less costly, harder-to-query storage spaces. On the basis of your metadata, you can create rules according to which data that hasn't been used in the past 30 days is automatically moved to a less costly storage bucket. Metadata also allows you to pinpoint exactly how much each dataset costs you according to storage cost and usage.
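The 30-day rule mentioned above could be sketched as follows in Python. This is only an illustration under stated assumptions: `last_queried` stands for usage metadata you would pull from your own warehouse or catalog, and `tables_to_archive` is a hypothetical helper that only selects candidates. Actually moving the data depends on your storage provider and is left out.

```python
from datetime import date, timedelta


def tables_to_archive(last_queried, today, max_idle_days=30):
    """Return the tables that have not been queried for more than max_idle_days.

    last_queried maps a table name to the date of its most recent query.
    """
    cutoff = today - timedelta(days=max_idle_days)
    # A table queried exactly on the cutoff date is still kept in place.
    return sorted(name for name, last in last_queried.items() if last < cutoff)


# Hypothetical usage metadata, for illustration only.
usage = {
    "orders": date(2024, 3, 30),
    "users": date(2024, 3, 1),
    "legacy_events": date(2024, 1, 5),
}

print(tables_to_archive(usage, today=date(2024, 3, 31)))
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps the rule easy to test and to replay on past dates.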
A good metadata management solution also helps you maintain your databases better. Metadata about digital assets includes data quality scores, the number of issues with the data asset over a given period, etc. Based on this metadata, you can know exactly which datasets you should focus your maintenance efforts on. If a digital asset has been down 10 times in the past few days, you can make sure someone fixes it as soon as possible. More generally, this helps you prioritize the actions of your data team, ensuring it has the greatest possible impact and generates business value.
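Here is one way this prioritization could look in Python. The `incidents` and `quality_score` fields and the ranking rule (most incidents first, lowest quality score as a tie-breaker) are assumptions chosen for the example, not a standard scoring method.

```python
def maintenance_priority(assets, top_n=3):
    """Rank data assets so the most problematic ones come first."""
    ranked = sorted(
        assets,
        # More incidents first; among equals, the lower quality score first.
        key=lambda asset: (-asset["incidents"], asset["quality_score"]),
    )
    return [asset["name"] for asset in ranked[:top_n]]


# Hypothetical operational metadata, for illustration only.
assets = [
    {"name": "orders", "incidents": 10, "quality_score": 0.6},
    {"name": "users", "incidents": 0, "quality_score": 0.95},
    {"name": "sessions", "incidents": 3, "quality_score": 0.4},
    {"name": "events", "incidents": 3, "quality_score": 0.8},
]

print(maintenance_priority(assets))
```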
Metadata is important. You've probably understood it by now. One question remains, though: can you document all the files in your systems manually, or do you need to invest in a metadata management tool to support your documentation efforts? The most important dimension to look at when deciding whether to invest in a metadata management solution is whether it makes a key difference in how you document resources and how you collect metadata. In other words, you first need to understand whether or not you need assistance to document the content of your cloud data warehouse.
Now you might be wondering: what kind of assistance does a data catalog provide? Fundamentally, the value of a metadata management tool resides in the fact that it automates the data documentation process. What does that mean? Say you document a specific file in your system, enriching the columns with descriptive context and definitions. An intelligent data catalog will propagate the original definition you gave to a specific column to all the other columns that bear the same name in your cloud system. This means that each minute you spend on documentation has a much greater impact than when you document your data resources using a manual process. If you have thousands of datasets containing one column with the same name, writing a definition for one column is equivalent to writing definitions for thousands of columns. This saves an incredible amount of time and money.
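To give an idea of what this propagation amounts to, here is a deliberately simplified Python sketch. It is not how any specific catalog implements the feature: `propagate_definitions` and the data layout are hypothetical, and the matching is done on the exact column name only.

```python
def propagate_definitions(tables, definitions):
    """Fill in missing column descriptions using definitions keyed by column name.

    tables maps a table name to {column name: description or None}.
    Existing descriptions are never overwritten.
    Returns the number of columns that were documented.
    """
    filled = 0
    for columns in tables.values():
        for column, description in columns.items():
            if not description and column in definitions:
                columns[column] = definitions[column]
                filled += 1
    return filled


# Hypothetical warehouse content, for illustration only.
tables = {
    "orders": {"user_id": None, "amount": "Order total in EUR"},
    "sessions": {"user_id": "", "started_at": None},
}
definitions = {"user_id": "Unique identifier of a registered user"}

print(propagate_definitions(tables, definitions))
print(tables["sessions"]["user_id"])
```

One caveat follows from the "users_2020" example earlier: two columns with the same name do not always mean the same thing, so in practice propagated definitions are best treated as suggestions that someone can review.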
Now, whether you find automating this process interesting or not depends on whether your organization is an enterprise or a small business.
If you're a small business, you won't be dealing with too many data resources. In this case, putting someone (or a few people) in charge of documenting each file in your company is possible. Manually maintaining a data catalog to keep your system organized is a feasible option. If that's what you need at the moment, we've got a template in store here, and we explain how to use it effectively.
In the enterprise case, your business might just be too large to document data content and files manually. It would take a disproportionate amount of time and human resources to document thousands of files, especially given that data is not static: your company keeps collecting data over time. This means you would need to hire a full-time data documentation team that continuously updates the metadata around your files. In terms of cost and time, it is generally more efficient to invest in a solution that automates the documentation process, bringing visibility to your system. If this is the best option for your business model, make sure you choose a tool that suits your company's needs. We've listed the various options here.
Big data is incredibly valuable, but metadata is the key that allows organizations to access this value. A good metadata management strategy will help your organization along four dimensions: data discovery, data governance, data quality and data maintenance/cost management. Maintaining a centralized repository of metadata manually is tiresome, and can quickly become unsustainable when the number of datasets you own starts growing exponentially. Thankfully, there are plenty of tools out there for you to choose from.
We write about all the processes involved when leveraging data assets: from the modern data stack to data teams composition, to data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data. If you're a data leader and would like to discuss these topics in more depth, join the community we've created for that!
At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation. Or data-wise for the Fivetran, Looker, Snowflake, DBT aficionados. We designed our catalog to be easy to use, delightful and friendly.
Want to check it out? Reach out to us and we will show you a demo.