Generative AI has already started shaking the world of Data Governance, and it is set to keep doing so.
It has been only six months since ChatGPT’s release, but it already feels like we need a retrospective. In this piece, I’ll explore how generative AI is impacting data governance, and where it’s likely to take us in the near future. Let me emphasize "near", because things evolve quickly and can go in a lot of different directions. This article isn’t about forecasting the next 100 years of data governance, but rather a practical look at the changes happening now and those just on the horizon.
Before diving in, let’s remind ourselves of what data governance deals with.
Keeping things simple, data governance is the set of rules or processes that an organization follows throughout the data life cycle to ensure the data is trustworthy. It involves 5 key areas:
- Metadata and Documentation
- Search and Discovery
- Policies and Standards
- Data Privacy and Security
- Data Quality
In this piece, we’ll look at how each of these areas is set to evolve once we bring generative AI into the mix.
Let’s do this.
1. Metadata and Documentation
Metadata and documentation are probably the most important part of data governance, and the other parts depend heavily on this one being done correctly. AI has already started to change the way we create data context, and it will continue to do so. However, I don’t want to get your hopes too high: we still need humans in the loop when it comes to documentation.
Producing context around data, that is, documenting the data, has two parts. The first part, which makes up about 70% of the job, involves documenting information that's common to many companies. A very basic example is the definition of “e-mail”, which is the same for all companies. The second part, the remaining 30%, is about writing down the specific know-how that's unique to your company.
Here's the exciting part: AI can do a lot of the heavy lifting for the first 70%. This is because it's general knowledge, and generative AI is excellent at handling that. We have already implemented such a feature in CastorDoc, allowing you to document 70% of your tables this way.
Now, what about knowledge that's peculiar to your company? Every organization is unique, and this uniqueness gives birth to a specific company language. This language is your metrics, KPIs, and business definitions. And it isn't something that can be imported from outside. It's born from the people who know the business best - its employees. In my conversations with data leaders, I often discuss how to create a shared understanding of these business concepts among all team members. Many leaders share that to achieve this alignment, they gather domain teams to talk, debate, and finally agree upon the definitions that best fit their business model.
Let's take, for example, the definition of a 'customer.' For a subscription-based software company, a customer could be someone who's currently subscribed to their service. But for a retail business, a customer might be anyone who's made a purchase in the last 12 months. Each company defines 'customer' in a way that makes the most sense for them, and this understanding usually emerges from within the organization.
When it comes to such company-specific knowledge, AI, as smart as it is, can't do this part just yet. It can't sit in on your meetings, join in the discussion, or help new concepts bloom. According to Andreessen Horowitz, this might become possible when the second wave of AI hits. For now, we are still at wave 1.
Finally, I'd like to touch on a question posed by Benn Stancil. Benn asks: If a bot can write data documentation on demand for us, what’s the point of writing it down at all?
There is some truth to this: if generative AI can generate content on demand, why not just generate it when you need it, instead of bothering with documenting everything? Unfortunately, it does not work like this, for two reasons.
First, as I've previously explained, a chunk of documentation covers the unique aspects of a company that AI can't currently capture accurately. This calls for human expertise. It cannot be generated on the fly by AI.
Second, while AI is advanced, it's not infallible. The content it generates isn't always accurate. Hence, at CastorDoc, we ensure a human checks and confirms AI-produced content.
2. Search and Discovery
Generative AI is not just changing the way we create documentation; it is also changing the way we consume it. In fact, we’re witnessing a paradigm shift in search and discovery methods. The traditional approach, where analysts sift through the data catalog looking for relevant information, is quickly becoming outdated.
The true game changer lies in AI’s ability to become a personal data assistant to users. In some data catalogs, you can already approach the AI with your specific data inquiries. You can ask questions such as, "Is it possible to perform action X with the data?", "Why am I unable to use the data to achieve Y?", or "Do we possess data that illustrates Z?". This streamlined process, powered by AI, is more efficient and also enhances users’ understanding and manipulation of data.
Another development in AI that we’re expecting is that it will transform the data catalog from a passive entity to an active helper. Think about it this way: if you're using a formula incorrectly, the AI assistant could give you a heads-up. Likewise, if you're about to write a query that already exists, the AI could let you know.
In the past, data catalogs just sat there, waiting for you to sift through them for answers. But with AI, these catalogs could start actively helping you, offering insights and solutions before you even realize you need them. This would be a complete shift in how we engage with data, and we are not too far from it.
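To make the "query that already exists" idea concrete, here is a minimal sketch of how a catalog could flag near-duplicate queries. It uses plain string similarity from Python's standard library as a stand-in for whatever an AI assistant would actually do. The function names, the saved queries, and the 0.9 threshold are all hypothetical choices for illustration, not a description of any existing product.

```python
import difflib
import re


def normalize_sql(query: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing semicolon."""
    collapsed = re.sub(r"\s+", " ", query.strip().lower())
    return collapsed.rstrip(";").strip()


def find_similar_queries(new_query, saved_queries, threshold=0.9):
    """Return (name, score) pairs for saved queries that closely match new_query."""
    target = normalize_sql(new_query)
    matches = []
    for name, sql in saved_queries.items():
        score = difflib.SequenceMatcher(None, target, normalize_sql(sql)).ratio()
        if score >= threshold:
            matches.append((name, round(score, 2)))
    # Best match first.
    return sorted(matches, key=lambda match: match[1], reverse=True)


# Hypothetical queries already stored in the catalog.
SAVED_QUERIES = {
    "monthly_active_customers": (
        "SELECT COUNT(DISTINCT customer_id) FROM orders "
        "WHERE order_date >= '2023-01-01';"
    ),
    "revenue_by_region": "SELECT region, SUM(amount) FROM orders GROUP BY region;",
}

# A draft an analyst is about to run: same query, different casing and layout.
draft = (
    "select count(distinct customer_id)\n"
    "from orders where order_date >= '2023-01-01'"
)
suggestions = find_similar_queries(draft, SAVED_QUERIES)
for name, score in suggestions:
    print(f"A similar query already exists: {name} (similarity {score})")
```

A real assistant would likely compare meaning rather than characters, but the interaction is the same: the catalog speaks up before the duplicate work is done.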
But there is a condition for the AI assistant to work effectively: your data catalog must be meticulously maintained. To ensure that the AI assistant provides reliable guidance to stakeholders, the underlying documentation must be accurate and trustworthy. If the catalog is not properly maintained, or if the policies are not clearly defined, then the AI assistant will spread incorrect information throughout the company. This would be more detrimental than having no information at all, as it could lead to poor decision-making based on inaccurate data.
As you have probably gathered, the relationship between AI and data governance is cyclical and interdependent. AI can significantly enhance data governance, but in turn, robust data governance is required to fuel the capabilities of AI. This results in a virtuous cycle where each component boosts the other, leading to a turbocharged approach to data governance.
3. Data Policies and Standards
Another key component of data governance is the formulation and implementation of governance rules, policies, and standards unique to each company.
This usually involves establishing clear definitions of data ownership and domains within the organization. At present, AI isn't quite up to the task when it comes to defining these policies and standards. AI shines when it comes to executing these rules or flagging infractions, but it is lacking when tasked with creating the rules themselves.
This gap exists for a simple reason. Defining ownership and domains is closely tied to human politics. For example, ownership means deciding who within the organization has the authority and responsibility over specific datasets. This could include the power to make decisions about how and when the data is used, who has access to it, and how it's maintained and secured. Making these decisions often involves negotiating between different individuals, teams, or departments, each with their own interests and perspectives.
We thus expect that humans will continue to play a significant role in this aspect of governance in the near future. Generative AI can play a role in drafting an ownership framework or suggesting data domains. However, keeping humans in the loop remains a must.
4. Data Privacy and Security
However, generative AI is set to shake things up in the privacy department of data governance. Managing privacy rights is a traditionally feared aspect of governance. Nobody enjoys it. It involves manually creating a complex architecture of permissions to ensure that sensitive data is only accessed by authorized individuals.
The good news is: AI can automate much of this process. Given parameters such as the number of users and their respective roles within an organization, AI can create rules for access rights. Essentially, the architectural aspect of access rights, being fundamentally code-based, aligns well with AI's capabilities. The AI system can process these parameters, generate relevant code, and apply it to manage data access efficiently.
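As a rough illustration of why access rights lend themselves to automation, here is a minimal sketch that turns a structured description of roles into access statements. The role names, schema names, and the `RolePolicy` structure are hypothetical, and the generated statements follow a Snowflake-style `GRANT` syntax as an assumption; other warehouses phrase this differently. In practice a person would review the output before applying it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolePolicy:
    """Hypothetical description of what one role may do, and where."""
    role: str
    schemas: tuple
    privileges: tuple


def build_grant_statements(policies):
    """Expand each role policy into one GRANT statement per schema."""
    statements = []
    for policy in policies:
        privileges = ", ".join(policy.privileges)
        for schema in policy.schemas:
            statements.append(
                f"GRANT {privileges} ON ALL TABLES IN SCHEMA {schema} "
                f"TO ROLE {policy.role};"
            )
    return statements


# Hypothetical parameters: two roles with different scopes.
POLICIES = [
    RolePolicy(role="analyst", schemas=("marts",), privileges=("SELECT",)),
    RolePolicy(
        role="data_engineer",
        schemas=("raw", "marts"),
        privileges=("SELECT", "INSERT"),
    ),
]

grants = build_grant_statements(POLICIES)
for statement in grants:
    print(statement)
```

Once roles and scopes are written down as parameters like these, producing the permission code is mechanical, which is exactly the kind of work AI handles well.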
Another area where AI can make a big impact is in the management of Personally Identifiable Information (PII). Today, PII tagging is usually done manually, making it a burden for the person in charge of it. This is something AI can automate completely. By leveraging AI's pattern recognition capabilities, PII tagging can be conducted more accurately than human-led processes. In this sense, using AI could actually improve the way we manage privacy protection.
This does not imply that AI will completely replace human involvement. Despite AI's capabilities, human oversight will remain essential to manage unexpected situations and make judgment calls when needed.
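To give a feel for pattern-based PII tagging, here is a minimal sketch that suggests tags for a column from its name and a sample of its values. It uses simple regular expressions as a stand-in for the pattern recognition an AI model would perform. The patterns, the tag names, and the 80% threshold are assumptions made for this example, and the output is a suggestion for a human to confirm, not a final verdict.

```python
import re

# Hypothetical hints based on the column name.
NAME_HINTS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
}

# Hypothetical patterns matched against sampled values.
VALUE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{6,}\d$"),
}


def suggest_pii_tags(column_name, sample_values, min_share=0.8):
    """Suggest PII tags when the name or most sampled values look like PII."""
    tags = set()
    for tag, pattern in NAME_HINTS.items():
        if pattern.search(column_name):
            tags.add(tag)
    values = [str(value) for value in sample_values if value is not None]
    if values:
        for tag, pattern in VALUE_PATTERNS.items():
            share = sum(1 for value in values if pattern.match(value)) / len(values)
            if share >= min_share:
                tags.add(tag)
    return sorted(tags)


print(suggest_pii_tags("contact", ["a@example.com", "b@example.org"]))
print(suggest_pii_tags("user_phone", []))
print(suggest_pii_tags("order_total", ["12.50", "8.00"]))
```

Simple patterns like these will produce false positives, for example on long numeric values, which is one more reason to keep a person in the loop.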
5. Data Quality
Let’s not forget about data quality, which is an important pillar of governance. Data quality ensures that the information used by a company is accurate, consistent, and reliable. Maintaining data quality has always been a complex task, but things are already changing with generative AI.
As I mentioned above, AI is great at applying rules and flagging cases where they are not respected. This makes it easy for algorithms to identify common errors and anomalies in the data. You can find a detailed account of how AI affects different aspects of data quality in this article.
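Here is a minimal sketch of what "applying rules and flagging violations" can look like in code. The column names, the sample rows, and the three checks are hypothetical, chosen only to illustrate the idea; real data quality tools offer far richer checks.

```python
def check_not_null(rows, column):
    """Return indexes of rows where the column is missing."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def check_in_range(rows, column, low, high):
    """Return indexes of rows whose non-null value falls outside [low, high]."""
    return [
        i
        for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]


def check_unique(rows, column):
    """Return indexes of rows that repeat a value seen in an earlier row."""
    seen, duplicates = set(), []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in seen:
            duplicates.append(i)
        seen.add(value)
    return duplicates


def run_checks(rows, checks):
    """Run every named check and keep only the ones that found problems."""
    report = {}
    for name, check in checks.items():
        failing_rows = check(rows)
        if failing_rows:
            report[name] = failing_rows
    return report


# Hypothetical sample data with one null, one negative amount, one duplicate id.
ROWS = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": -5.0},
]

CHECKS = {
    "amount_not_null": lambda rows: check_not_null(rows, "amount"),
    "amount_in_range": lambda rows: check_in_range(rows, "amount", 0, 10_000),
    "order_id_unique": lambda rows: check_unique(rows, "order_id"),
}

report = run_checks(ROWS, CHECKS)
print(report)
```

The rules themselves still come from people who know the data; the automation lies in running them consistently and surfacing the rows that break them.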
AI can also lower the technical barrier of data quality. This is something SODA is already putting in place. Their new tool, SodaGPT, offers a no-code approach to express data quality checks, enabling users to perform quality checks using natural language alone. This allows data quality maintenance to become much more intuitive and accessible.
We’ve seen that AI can supercharge Data Governance in a way that is triggering the beginning of a paradigm shift. A lot of changes are already happening, and they are here to stay. However, it’s important to note that AI can only build on an already solid foundation. For AI to change the search and discovery experience in your company, you must already be maintaining your documentation. AI, with all its transformative potential, can't miraculously mend a system that is flawed or poorly maintained. The second point to keep in mind is that even if AI can be used to generate most of the context around data, it cannot replace the human element entirely. We still need humans in the loop for validation and for documenting the knowledge unique to each company. So our one-sentence prediction for the future of governance: turbocharged by AI, anchored in human discernment and cognition.