What is Data Governance and Privacy?

and why governance is the key to cutting through the Gordian knot of your data problems

"Data governance is a measure of a company's control over its data"


Data governance is a data management concept. It is a measure of the control an organization has over its data. This control can be achieved through high-quality data, visibility on data pipelines, actionable rights management, and clear accountability. Data governance encompasses the people, processes, and tools required to create consistent and proper handling of a company's data. By consistent and proper handling of data, I mean ensuring availability, usability, consistency, understandability, integrity, and security.

The most comprehensive governance model— say, for a global bank—will have a robust data-governance council (often with C-suite leaders involved) to drive it; a high degree of automation with metadata recorded in an enterprise dictionary or data catalog; data lineage traced back to the source for many data elements; and a broader domain scope with ongoing prioritization as enterprise needs shift.

A good data governance and privacy model is a mix of people, processes, and software.

Data governance has a direct business impact.

Data governance isn't just that rusty process that companies have to deploy in order to comply with regulation. Of course, part of it is a legal obligation, and thankfully so, but a clean governance strategy can also drive key business outcomes.

Here are the main goals and benefits of a data governance program:

Data Governance Business Impact - by Xavier de Boisredon

Head of Data, CDO, and Data Stewards are in charge of data governance

In most organizations, data stewards are in charge of implementing a framework that ensures key governance standards are met. This framework supports a set of rules and responsibilities, such as assigning owners to data assets, enforcing the security of the analytics systems, and granting access rights and security roles to data analysts and engineers. The framework and policies can change from one company to another.

Heads of data or CDOs, who manage data analytics teams, oversee the efforts of the data stewards. They set up a clear program or strategy to prioritize the work, set standards, and define clear roles and responsibilities, typically through a monthly or yearly governance committee. Data stewards support the strategy and implement the processes established by the head of data or Chief Data Officer. Good practices often include using a specialized governance tool, such as Castor.

The data stewards' work benefits both data governance and the efficiency of the analytics teams: it improves the quality of decision-making and brings visibility. Stewards support analytics teams by maintaining high data quality standards, clear ownership, and well-defined roles to ensure smooth decisions and increased security.

When did data governance become a thing?

Timeline and key milestones in the space.

For the past twenty years, the challenge around data was to build an infrastructure to store and consume data efficiently and at scale. Producing data has become cheaper and easier over the years with the emergence of cloud data warehouses and transformation tools like dbt. Access to data has been democratized thanks to BI tools like Looker, Tableau, or Metabase. Now, building nice dashboards is the new normal in Ops and Marketing teams. This gave rise to a new problem: decentralized, untrustworthy, and irrelevant data and dashboards.

Even the most data-driven companies still struggle to get value from data - up to 73% of all enterprise data goes unused.

→ 1990-2010: emergence of the first regulations on data privacy

In the 1970s, the first data protection regulation in the world was enacted in the German state of Hesse. Since then, data regulation has kept increasing. The 1990s marked the first regulations on data privacy with the EU Data Protection Directive.

Yet compliance really became a worldwide challenge in the second half of the 2010s with the emergence of GDPR and other regional regulations on personal data privacy, alongside longer-standing sectoral rules such as HIPAA. These regulations drove data governance for large enterprises and created an urgency to build tools to handle the new requirements.

→ 2010-2020: first tools to comply with regulation; C-level realizes data governance is a strategic advantage to drive business value

With the increasing complexity of data resources and processes on one hand, and the first fines for GDPR infringement on the other, companies started to build regulatory compliance processes. The first pieces of software to organize governance and privacy were born, with companies like Alation and Collibra.

The challenge was simple: enforce traceability across the various data infrastructures in the organization. Data governance was then a privilege of enterprise-level companies, the only ones able to afford those tools. On-premise data storage made it expensive to deploy this software: companies like Alation and Collibra had to send technology specialists into the field to connect the data to their software. This first generation of data governance tools aimed at collecting and referencing data resources across an organization's departments.

There were several forces at play in this period. It became easier to collect data, cheaper to store it, and simpler to analyze it. This led to a Cambrian explosion in the number of data resources. As a result, large companies struggled to keep visibility over the work done with data. Data was decentralized, untrustworthy, and irrelevant. This chaos brought a new strategic dimension to data governance: more than a compliance obligation, it became a key lever for business value. The best organizations brought in governance to improve the efficiency of their analytics teams. Governance wasn't just a compliance tool anymore; it had a key business impact and brought visibility across the various data systems.

→ 2020+: Towards an automated and actionable data governance

With the standardization of the cloud data stack, the paradigm changed. It became easier to connect to the data infrastructure and gather metadata. Where it took 6 months to deploy a data governance tool on a multitude of siloed on-premise data centers in 2012, it can take as little as 10 minutes in 2021 on the modern data stack (for example: Snowflake, Looker, and dbt).

This gave rise to new challenges: automation and collaboration. Data governance in Excel means manually maintaining 100+ fields across thousands of tables and dashboards: this is impossible. Data governance with a non-automated tool means maintaining 10+ fields on thousands of tables: this is time-consuming. Data governance with a fully automated tool means maintaining only one or two fields on thousands of tables (literally the table and column/field descriptions). For that last slice of manual work, you want to leverage the community: prioritize work based on data consumption (a tighter documentation SLA for popular resources) and democratize usage among end users through a friendly UX.
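As a sketch of that consumption-based prioritization (all table names and query counts below are hypothetical, not from any particular tool), a governance workflow might rank the documentation backlog like this:

```python
# Hypothetical usage metadata: which tables exist, how often they are
# queried, and whether they are already documented.
tables = [
    {"name": "active_users", "queries_last_30d": 1200, "documented": False},
    {"name": "marketing_spend", "queries_last_30d": 45, "documented": False},
    {"name": "legacy_exports", "queries_last_30d": 2, "documented": True},
]

def documentation_backlog(tables):
    """Undocumented tables first, ordered by how often they are queried,
    so the most popular assets get the tightest documentation SLA."""
    todo = [t for t in tables if not t["documented"]]
    return sorted(todo, key=lambda t: t["queries_last_30d"], reverse=True)

for t in documentation_backlog(tables):
    print(t["name"], t["queries_last_30d"])
```

The same ranking signal can also drive who gets asked to document an asset: the heaviest consumers of a table are usually the people best placed to describe it.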

Additionally, you want that data governance tool to be integrated into the rest of the data stack. Define something once and find it everywhere: whether it is a table definition, a tag, a KPI, a dashboard, access rights, owners, rules, or data quality results.

Data governance challenges are not the same for everyone

Diverse governance's use-cases based on industry needs and organizations size

There are two main drivers for data governance programs:

  • Level of regulation needed in the industry

Data regulation pushes the minimum bar for data governance processes higher. It requires businesses to add controls, security, reporting, and documentation. Organizations set up a governance program to ensure transparency over sometimes unclear processes.

  • Level of complexity of the data assets

Strong governance becomes increasingly important with the exponential growth of data resources, tools, and people in a company.

The level of complexity increases with the scope of business operations (number of lines of business, programs and geographies covered), the velocity of data creation or the level of automation (decision-making, processes) based on data.

How do you set up a good data governance and privacy strategy?

Several bricks are needed to enforce data management - Image by Xavier de Boisredon

  • Data Architecture (Storage, Modeling, Visualization)

Before even talking about a data governance framework, a company needs the basics: good infrastructure to begin with. Based on business needs and the company's data maturity, the nature of the data architecture can change a lot. Regarding storage: do you go on-premise or cloud? Data warehouse or data lake? Regarding modeling: Spark or dbt? In the data warehouse or in the BI tool? Real-time or batch? Regarding visualization: do you allow anyone to build dashboards, or data teams only? Etc.

  • Search and Discovery

The first level of any data governance strategy is making sure the relevant people can find the relevant datasets to do their analyses or build their AI models. Without this step, teams end up with a lot of questions on Slack, useless meetings with the engineering teams, and plenty of duplicate tables, analyses, and dashboards. It takes valuable time away from engineering resources that are needed for the next steps.

  • Metadata and Documentation

Once you can efficiently find the data, you need to understand it quickly in order to assess whether it is going to be useful. For example, you are looking at a dataset called "active_users_revenue_2021" with a column "payment". Is this column in € or $? Has it been refreshed this morning, last week, or last year? Does it contain all the data on active users, or just the ones in Europe? If I remove a column, will it break important dashboards for the marketing or finance team? Etc.
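To make this concrete, here is a minimal sketch of the metadata you would want attached to such a dataset to answer those questions. The schema and all values are illustrative assumptions, not any catalog's actual data model:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class ColumnMeta:
    name: str
    description: str
    unit: Optional[str] = None      # answers "is payment in EUR or USD?"

@dataclass
class TableMeta:
    name: str
    description: str
    last_refreshed: date            # answers "has it been refreshed lately?"
    scope: str                      # answers "all users or Europe only?"
    downstream_dashboards: List[str] = field(default_factory=list)
    columns: List[ColumnMeta] = field(default_factory=list)

# Hypothetical documentation for the dataset discussed above.
meta = TableMeta(
    name="active_users_revenue_2021",
    description="Revenue per active user, aggregated monthly",
    last_refreshed=date(2021, 11, 2),
    scope="all regions",
    downstream_dashboards=["finance_monthly", "marketing_funnel"],
    columns=[ColumnMeta("payment", "Gross payment amount", unit="EUR")],
)
```

The `downstream_dashboards` field is what lets you answer the "will removing a column break something?" question before making the change.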

  • Data Quality

Now that you have data stored in a scalable infrastructure that everyone can find and understand, you need to trust that what is inside is of high quality. This is why so many data observability and reliability tools were born in the last five years. Data observability is the general concept of using automated monitoring, alerting, and triaging to eliminate data downtime. The two main approaches to data quality are declarative (manually defining thresholds and expected behavior) and ML-driven (detecting sudden changes in distribution).
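A minimal sketch of the declarative approach, with hypothetical rule and column names: each rule states an expected threshold, and a check fails when the data drifts outside it.

```python
# Declarative quality rules: thresholds are stated up front by a human,
# rather than learned from the data (the ML-driven approach).
rules = [
    {"column": "payment", "check": "not_null_ratio", "min": 0.99},
    {"column": "payment", "check": "max_value", "max": 1_000_000},
]

def not_null_ratio(values):
    """Fraction of values that are not null."""
    return sum(v is not None for v in values) / len(values)

def run_checks(column_values, rules):
    """Return the list of rules the data currently violates."""
    failures = []
    for rule in rules:
        values = column_values[rule["column"]]
        if rule["check"] == "not_null_ratio":
            if not_null_ratio(values) < rule["min"]:
                failures.append(rule)
        elif rule["check"] == "max_value":
            if max(v for v in values if v is not None) > rule["max"]:
                failures.append(rule)
    return failures

# One null out of four values: the 99% not-null rule fails, max_value passes.
data = {"payment": [10.0, 25.5, None, 42.0]}
failures = run_checks(data, rules)
```

Real observability tools run checks like these on a schedule and route the failures to alerting, but the rule-plus-threshold shape is the same.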

  • Security and Access Rights

Some data is more private or strategic than the rest: you need to secure it as well as possible. Say you are a bank: you don't want to give just anyone in the company access to the transaction logs. You need to define access rights, and managing them efficiently can quickly become a struggle as the number and variety of people working with data grow. Sometimes you want to give someone access for a specific mission and nothing else. What happens when an employee moves from the finance department to marketing? You need a program to manage these rights thoroughly and efficiently to ensure key security standards.
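One common way to keep this manageable is role-based access control: rights attach to roles rather than individuals, so a department change is a single role swap instead of an audit of every grant. A minimal sketch follows; all role, user, and dataset names are illustrative.

```python
# Rights are granted to roles, never directly to people.
ROLE_GRANTS = {
    "finance_analyst": {"transactions", "revenue_reports"},
    "marketing_analyst": {"campaign_metrics", "web_traffic"},
}

# Each user holds a set of roles.
user_roles = {"alice": {"finance_analyst"}}

def can_read(user, dataset):
    """True if any of the user's roles grants access to the dataset."""
    return any(dataset in ROLE_GRANTS[role] for role in user_roles.get(user, ()))

assert can_read("alice", "transactions")

# Alice moves from finance to marketing: one role change revokes the old
# grants and applies the new ones, with no per-dataset cleanup.
user_roles["alice"] = {"marketing_analyst"}
assert not can_read("alice", "transactions")
assert can_read("alice", "campaign_metrics")
```

Cloud warehouses implement the same idea natively (for example via roles and GRANT statements), so the governance layer mostly needs to mirror and audit those grants.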

  • Compliance and Regulation

This one is self-explanatory. To comply with various policies and regulations, you need to list all assets and report on personal information and its usage. For now, mostly enterprise companies are targeted by regulators, but it is only a question of time before smaller companies start receiving fines too. In most organizations, a yearly committee helps drive the governance program.

Where does data governance fit in the modern data stack?

Data governance brings trust from the raw data sources to domain experts' dashboards

Modern data stack governance - Image by Xavier de Boisredon

The typical data flow is the following:

  • You collect data from various sources across your business: product logs, marketing and website data, payment and sales logs, etc. You extract that information with tools like Fivetran, Stitch, or Airbyte.
  • You then store this data in a data warehouse (Snowflake, Redshift, BigQuery, or Firebolt, to name the most popular). The data warehouse is both a place to store your data and a place to transform and refine it.
  • The trending transformation layer of the past 3 years is dbt. It enables you to perform data transformations in SQL within the data warehouse while implementing software engineering best practices.
  • Finally, the transformations help you build your "data mart", the gold standard of refined data. The visualization brick helps domain experts visualize this gold-level data to share insights throughout the whole organization.

These steps happen across different tools, with a high level of abstraction. It is hard to keep a bird's-eye view of what happens under the hood. This is what data governance brings to the table: you can see how the data flows, where the pipeline breaks, where risks lie, where to put your energy as a data manager, etc.
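That bird's-eye view is, in essence, a lineage graph: a directed graph from sources to dashboards. A minimal sketch with hypothetical asset names shows how it answers "what breaks downstream if this table changes?":

```python
from collections import defaultdict, deque

# Hypothetical lineage edges following the flow described above:
# source -> warehouse -> dbt model -> data mart -> BI dashboard.
edges = [
    ("stripe_payments", "raw.payments"),    # extraction (e.g. Fivetran)
    ("raw.payments", "staging.payments"),   # dbt staging model
    ("staging.payments", "mart.revenue"),   # data mart
    ("mart.revenue", "finance_dashboard"),  # BI layer
]

downstream = defaultdict(set)
for src, dst in edges:
    downstream[src].add(dst)

def impact(node):
    """All assets reachable downstream of `node` (breadth-first search)."""
    seen, queue = set(), deque([node])
    while queue:
        for nxt in downstream[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Calling `impact("raw.payments")` walks the graph and surfaces every model and dashboard that would be affected by a change to that table, which is exactly the question a data manager needs answered before touching a pipeline.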


About us

We write about all the processes involved when leveraging data assets: from the modern data stack to data teams composition, to data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data.

At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation. We designed our catalog software to be easy to use, delightful and friendly.

Want to check it out? See how CastorDoc can help you manage, curate, and secure your data with a free demo.

