"Data governance is a measure of a company's control over its data"
Data governance is a data management concept: a measure of the control an organization has over its data. This control is achieved through high-quality data, visibility into data pipelines, actionable rights management, and clear accountability. Data governance encompasses the people, processes, and tools required to handle a company's data consistently and properly. By consistent and proper handling of data, I mean ensuring availability, usability, consistency, understandability, data integrity, and data security.
The most comprehensive governance model, say for a global bank, will have a robust data governance council (often with C-suite leaders involved) to drive it; a high degree of automation with metadata recorded in an enterprise dictionary or data catalog; data lineage traced back to the source for many data elements; and a broader domain scope with ongoing prioritization as enterprise needs shift.
A good data governance and privacy model is a mix of people, processes, and software.
Data governance isn't just that rusty process companies have to deploy in order to comply with regulation. Of course, part of it is a legal obligation (and thankfully so), but a clean governance strategy can also drive key business outcomes.
A data governance program serves several goals and delivers several benefits.
In most organizations, data stewards are in charge of implementing a framework that ensures key governance standards are met. This framework supports a set of rules and responsibilities, such as assigning owners to data assets, enforcing the security of the analytics systems, and granting access rights and security roles to data analysts and engineers. The framework and policies vary from one company to another.
Heads of data or CDOs, who manage data analytics teams, oversee the efforts of the data stewards. They set up a clear program or strategy to prioritize the work, set standards, and define clear roles and responsibilities during monthly or yearly committees. Data stewards support the strategy and implement the processes established by the head of data or Chief Data Officer. Good practices often center on a specialized governance tool, such as Castor.
The data stewards' work benefits both data governance and the efficiency of the analytics teams: it improves the quality of decision making and brings visibility. Stewards support analytics teams by maintaining high data quality standards, clear ownership, and well-defined roles, which enables smooth decisions and increases security.
Timeline and key milestones in the space.
For the past twenty years, the challenge around data was to build an infrastructure to store and consume data efficiently and at scale. Producing data has become cheaper and easier over the years with the emergence of cloud data warehouses and transformation tools like dbt. Access to data has been democratized thanks to BI tools like Looker, Tableau, or Metabase. Now, building nice dashboards is the new normal in Ops and Marketing teams. This gave rise to a new problem: decentralized, untrustworthy, and irrelevant data and dashboards.
Even the most data-driven companies still struggle to get value from data - up to 73% of all enterprise data goes unused.
In the 1970s, the first data protection regulation in the world was enacted in Hessen, Germany. Since then, data regulation has kept increasing. The 1990s marked the first regulations on data privacy with the EU directive on data protection.
Yet compliance with regulation only became a worldwide challenge in the second half of the 2010s with the emergence of GDPR, HIPAA, and other regional regulations on personal data privacy. These regulations drove data governance for large enterprises and created an urgency to build tools that could handle the new requirements.
With the increasing complexity of data resources and processes on one hand and the first fines for GDPR infringement on the other, companies started to build regulatory compliance processes. The first pieces of software to organize governance and privacy were born, with companies like Alation and Collibra.
The challenge is simple: enforce traceability across the various data infrastructures in the organization. Data governance was then a privilege of enterprise-level companies, the only ones able to afford those tools. On-premise data storage made this software expensive to deploy: companies like Alation and Collibra had to send technology specialists into the field to connect the data to their software. This first generation of data governance tools aimed at collecting and referencing data resources across the organization's departments.
There were several forces at play in this period. It became easier to collect data, cheaper to store it, and simpler to analyze it. This led to a Cambrian explosion in the number of data resources. As a result, large companies struggled to keep visibility over the work done with data. Data was decentralized, untrustworthy, and irrelevant. This chaos brought a new strategic dimension to data governance: more than a compliance obligation, it became a key lever for business value. The best organizations brought in governance to improve the efficiency of their analytics teams. Governance was no longer just a compliance tool; it had key business impact and brought visibility across the various data systems.
With the standardization of the cloud data stack, the paradigm changed. It is easier to connect to the data infrastructure and gather metadata. Where it took six months to deploy a data governance tool on a multitude of siloed on-premise data centers in 2012, it can take as little as 10 minutes in 2021 on the modern data stack (for example: Snowflake, Looker, and dbt).
This gave rise to new challenges: automation and collaboration. Data governance in Excel means manually maintaining 100+ fields across thousands of tables and dashboards: this is impossible. Data governance with a non-automated tool means maintaining 10+ fields on thousands of tables: this is time-consuming. Data governance with a fully automated tool means maintaining only one or two fields on thousands of tables (literally the table and column/field descriptions). For that last part of manual work, you want to leverage the community: prioritize work based on data consumption (a high documentation SLA for popular resources) and democratize usage among end users through a friendly UX.
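Consumption-based prioritization could be sketched as follows. This is a hypothetical illustration, not any tool's actual API; the table names, query counts, and the `documentation_priority` helper are all made up.

```python
# Hypothetical sketch: rank tables so the most-queried, undocumented
# ones get documented first. All names and numbers are illustrative.

def documentation_priority(tables):
    """Undocumented tables come first, then by descending query volume."""
    return sorted(
        tables,
        key=lambda t: (t["documented"], -t["queries_last_30d"]),
    )

tables = [
    {"name": "raw_events", "queries_last_30d": 40, "documented": True},
    {"name": "active_users", "queries_last_30d": 950, "documented": False},
    {"name": "tmp_backfill", "queries_last_30d": 2, "documented": False},
]

for t in documentation_priority(tables):
    print(t["name"])  # active_users first: popular and undocumented
```

The design choice is simply that a popular, undocumented table is the most urgent gap, so the sort key puts documentation status before popularity.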
Additionally, you want that data governance tool to be integrated into the rest of the data stack. Define something once and find it everywhere: whether this is a table definition, a tag, a KPI, a dashboard, access rights, owners, rules, or data quality results.
Diverse governance use cases based on industry needs and organization size
There are two main drivers for data governance programs:
Data regulation pushes the minimum bar of data governance processes higher. It requires businesses to add controls, security, reporting, and documentation. Organizations set up a governance program to ensure transparency over sometimes unclear processes.
Strong governance becomes increasingly important with the exponential growth of data resources, tools, and people in a company.
The level of complexity increases with the scope of business operations (number of lines of business, programs, and geographies covered), the velocity of data creation, and the level of automation (decision-making, processes) based on data.
Several building blocks are needed to enforce data management
Before even talking about a data governance framework, a company needs the basics: a good infrastructure. Depending on business needs and the company's data maturity, the data architecture can vary a lot. Regarding storage: on-premise or cloud? Data warehouse or data lake? Regarding modeling: Spark or dbt? In the data warehouse or in the BI tool? Real-time or batch? Regarding visualization: do you allow anyone to build dashboards, or data teams only? And so on.
The first level of any data governance strategy is making sure the relevant people can find the relevant datasets to do their analysis or build their AI models. Without this step, companies end up with a flood of questions on Slack, useless meetings with the engineering teams, and a lot of duplicate tables, analyses, and dashboards. It drains valuable engineering resources that are needed for the next steps.
Once you can efficiently find the data, you need to understand it quickly in order to assess whether it is going to be useful. For example, say you are looking at a dataset called "active_users_revenue_2021" with a column "payment". Is this column in € or $? Was it refreshed this morning, last week, or last year? Does it contain all the data on active users, or just the ones in Europe? If I remove a column, will this break important dashboards for the marketing or finance team?
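The questions above are exactly the metadata a data catalog records. As a minimal sketch (the field names and the example values are assumptions, not any real catalog's schema), such a record could look like:

```python
# Hypothetical metadata record answering the questions above:
# unit, freshness, scope, and downstream dependencies. Illustrative only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ColumnDoc:
    name: str
    description: str
    unit: Optional[str] = None  # e.g. "EUR" vs "USD" for a payment column

@dataclass
class DatasetDoc:
    name: str
    description: str
    refreshed_at: str                 # last successful refresh
    scope: str                        # e.g. "all active users" vs "EU only"
    downstream: List[str] = field(default_factory=list)  # dashboards at risk
    columns: List[ColumnDoc] = field(default_factory=list)

doc = DatasetDoc(
    name="active_users_revenue_2021",
    description="Monthly revenue per active user",
    refreshed_at="2021-06-01",
    scope="EU active users only",
    downstream=["finance_monthly", "marketing_funnel"],
    columns=[ColumnDoc("payment", "Amount paid by the user", unit="EUR")],
)
print(doc.columns[0].unit)
```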
Now that you have data, stored in a scalable infrastructure, that everyone can find and understand, you need to trust that what is inside is of high quality. This is why so many data observability and reliability tools were born in the last five years. Data observability is the general concept of using automated monitoring, alerting, and triaging to eliminate data downtime. The two main approaches to data quality are declarative (manually define thresholds and expected behavior) and ML-driven (detect sudden changes in distribution).
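The two approaches can be contrasted in a few lines. This is a deliberately simplified sketch: the thresholds, metric names, and row counts are made up, and real ML-driven tools use far richer models than a z-score.

```python
# Illustrative sketch of the two data quality approaches. Made-up data.
import statistics

# Declarative: a manually defined rule on a metric.
def check_null_rate(null_rate, max_allowed=0.01):
    return null_rate <= max_allowed

# "ML-driven" in its simplest form: flag a value that deviates from
# recent history by more than a few standard deviations.
def looks_anomalous(history, today, z_threshold=3.0):
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(today - mean) > z_threshold * stdev

daily_row_counts = [10_020, 9_980, 10_050, 9_940, 10_010]
print(check_null_rate(0.002))                    # rule passes
print(looks_anomalous(daily_row_counts, 4_200))  # sudden drop gets flagged
```

The declarative check requires someone to know the right threshold in advance; the statistical check learns "normal" from history, which is why the ML-driven approach scales better across thousands of tables.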
Some data might be more private or strategic than the rest: you need to secure it accordingly. Say you are a bank: you don't want to give everyone in the company access to the transaction logs. You need to define access rights, and managing them efficiently can quickly become a struggle as the number and variety of people working with data grow. Sometimes you want to give someone access for a specific mission and nothing else. What happens when an employee moves from the finance department to marketing? You need a program to manage these rights thoroughly and efficiently to ensure key security standards.
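One common answer is role-based access, where rights attach to roles rather than to individuals, so a department change is a single reassignment. A minimal sketch, with hypothetical role and dataset names:

```python
# Hypothetical role-based access sketch: rights follow the role, so moving
# an employee between departments updates all their access in one step.
ROLE_GRANTS = {
    "finance": {"transaction_logs", "revenue_reports"},
    "marketing": {"campaign_metrics", "web_analytics"},
}

user_roles = {"alice": "finance"}

def can_read(user, dataset):
    role = user_roles.get(user)
    return dataset in ROLE_GRANTS.get(role, set())

assert can_read("alice", "transaction_logs")

# Alice moves from finance to marketing: one reassignment, and the
# old finance grants are revoked implicitly.
user_roles["alice"] = "marketing"
assert not can_read("alice", "transaction_logs")
assert can_read("alice", "campaign_metrics")
```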
This one is self-explanatory. To comply with the various policies and regulations, you need to list all assets and report on personal information and usage. For now, only enterprise companies are targeted by regulators, but it is just a question of time before smaller companies start receiving fines. In most organizations, a yearly committee helps drive the governance program.
Data governance brings trust from the raw data sources to domain experts' dashboards
The typical data flow is the following:
These steps happen in different tools, with a high level of abstraction, so it is hard to keep a bird's-eye view of what happens under the hood. This is what data governance brings to the table: you can see how the data flows, where the pipeline breaks, where risks lie, and where to put your energy as a data manager.
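Under the hood, "seeing how the data flows" amounts to walking a lineage graph. A minimal sketch, assuming made-up resource names (a real governance tool would extract these edges from the warehouse and BI layer automatically):

```python
# Illustrative lineage graph: edges point from a resource to what it feeds.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_revenue"],
    "fct_revenue": ["finance_dashboard", "marketing_dashboard"],
}

def downstream(node, graph):
    """Everything impacted if `node` breaks (iterative depth-first walk)."""
    impacted, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

# If the raw source breaks, both dashboards are at risk.
print(sorted(downstream("raw_orders", LINEAGE)))
```

The same traversal run in reverse (upstream) answers "where does this dashboard's number come from?", which is the trust question domain experts actually ask.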
We write about all the processes involved when leveraging data assets: from the modern data stack to data teams composition, to data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data.
At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation. We designed our catalog software to be easy to use, delightful and friendly.
Want to check it out? Reach out to us and we will show you a demo.