Measuring data ROI is one of the top challenges of data teams. There are many ways to do it, and no established best practices. In a previous article inspired by discussions with data leaders, we sought to identify the skeleton of a framework for measuring data teams’ performance. This article refines that approach a step further by proposing 10 identifiable metrics to measure data teams’ performance. In short, this article represents the bones of our data ROI skeleton. Our favourite metrics for data teams are the following:
Reduction in the number of requests
Infrastructure cost saving
Analytics turnaround time
These metrics fall into three big categories: data quality metrics, operationalization metrics, and team productivity metrics. We briefly explain how these categories come about in this article, but don’t hesitate to check our previous piece on the matter to understand how we formed this framework. The big idea here is that data teams go through different stages of development, and that each stage corresponds to specific metrics. The idea is intuitive: you will look at different metrics depending on whether you have 1 or 20 people on your data team.
The first priority of a data team should be to provide clean, reliable, and understandable data to stakeholders. Before you can focus on anything else, you need to tick the data quality box. Poor quality data will poison all your operations. Our five favorite metrics are thus related to data quality.
💡Data accuracy refers to whether each available data field represents reality.
Accurate data is a prerequisite for ensuring the smooth running of your project. Accurate bank details mean you can charge your customers, accurate phone numbers mean you can always reach them, and so on. On the contrary, inaccurate data often leads to frustration and high operational costs. This metric sounds very basic, yet every single organization deals with accuracy problems. Getting data accuracy right will only get you to the starting line of the data analytics race: it won’t get you far on its own, but you can’t start the race without having it ticked.
Measuring accuracy is rather intuitive. You need to calculate the ratio between accurate data and the total amount of available data.
The tricky part, as you can imagine, is the numerator: it requires distinguishing accurate from inaccurate data. Ideally, this should be performed in an automated manner to prevent anyone from scrolling endlessly through datasets, calling each customer phone number to ensure it’s accurate. The key to measuring data accuracy is to put rules in place, and then conduct tests and assessments to measure the quality of datasets. Verifications can be performed against reference datasets, or against the real entity.
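As a minimal sketch of this rule-based approach, here is how an automated accuracy check could look. The field names and the validation rules are purely hypothetical; a real setup would verify values against reference datasets rather than simple patterns.

```python
import re

# Hypothetical validation rules: each maps a field name to a check.
# In practice these would be replaced by lookups against reference data.
RULES = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v or "") is not None,
    "phone": lambda v: re.fullmatch(r"\+?\d{7,15}", v or "") is not None,
}

def accuracy(records):
    """Ratio of field values passing their rule to all checked values."""
    passed = total = 0
    for record in records:
        for field, check in RULES.items():
            if field in record:
                total += 1
                passed += check(record[field])
    return passed / total if total else 1.0

customers = [
    {"email": "jane@example.com", "phone": "+33612345678"},
    {"email": "not-an-email", "phone": "0612345678"},
]
print(f"{accuracy(customers):.0%}")  # 3 of 4 checks pass -> 75%
```

The output is the percentage you would display next to each dataset, as described below.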
Data accuracy should be calculated for each dataset and be easily accessible to dataset users. It should be displayed as a percentage, with a company-wide agreement that under a certain accuracy threshold, the dataset should not be used for analysis.
💡Consistency measures whether the entities that are supposed to match actually do.
Data consistency ensures that analytics correctly captures and leverages the value of data. In most organizations, the same information can be stored in more than one place. If this information agrees with itself across those places, we say the data is consistent. Data consistency ensures that elements bearing the same name also bear the same value. For example, if your human resources information system says an employee doesn’t work there anymore, yet your payroll says they’re still receiving a check, that’s inconsistent.
Data consistency is usually displayed as the percent of matched values across various records.
Consistency is hard to define and calculate. This metric should be calculated at different levels of granularity: two tables with the same name should contain the same information, and two columns with the same name should have the same structure. At the smallest level of granularity, calculating two metrics from the same, consistent data should yield the exact same number.
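At the record level, the matched-values percentage can be sketched as a join between two systems. The HR/payroll data below is hypothetical, echoing the example above; real sources would be queried from their respective systems.

```python
def consistency(source_a, source_b, key, field):
    """Percent of records whose `field` matches across two systems,
    joined on `key`. Records missing from either side are ignored here."""
    b_index = {row[key]: row[field] for row in source_b}
    matched = compared = 0
    for row in source_a:
        if row[key] in b_index:
            compared += 1
            matched += row[field] == b_index[row[key]]
    return matched / compared if compared else 1.0

# Hypothetical HR vs. payroll records: employee 1 disagrees between systems.
hr = [{"id": 1, "status": "left"}, {"id": 2, "status": "active"}]
payroll = [{"id": 1, "status": "active"}, {"id": 2, "status": "active"}]
print(f"{consistency(hr, payroll, 'id', 'status'):.0%}")  # 50%
```

A design note: whether records missing from one side count against consistency (or against completeness instead) is a choice each team has to make explicitly.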
💡Reliable data is data which remains accurate and consistent over time.
Reliability refers to whether data quality and accuracy can be maintained over a long period of time. It’s great if your dataset is consistent and accurate at time T, but you want it to remain consistent and accurate at time T+1, T+2, T+3, etc. If data is not reliable, there is no way you can trust it over time.
To measure data reliability, one should look at the processes used to create and refresh the data. These processes are called data pipelines. If data pipelines do their job, the data will be updated, refreshed and delivered on time. To measure data reliability you should thus look at the health of your data pipelines, and how often they break. Data reliability can be captured by the percentage of time during which you have no data incidents. Looking at data lineage should also give you good insights into how reliable your data is.
It is most appropriate to represent data reliability at the dataset level. This will give you the level of insight needed to inform your decision of whether or not to use a dataset. For example, you might want to be cautious before using a dataset that broke ten times within the past two months.
💡Completeness refers to the comprehensiveness of information you have at hand.
You might have a lot of data, but if it’s not complete, it’s not useful. Let’s say you’re looking at a customer dataset and you only have your clients’ names but not their e-mail addresses, making it impossible for you to reach anyone. Your data is incomplete and therefore useless.
Data completeness can be measured from the percentage of missing entries. For instance, a column of 500 rows with 100 missing fields has a completeness degree of 80%.
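The calculation is simple enough to sketch directly; the column below reproduces the 500-row example from the text, with what counts as "missing" being a choice you would tune to your own data.

```python
def completeness(values):
    """Share of non-missing entries in a column.
    Here, None and the empty string count as missing -- an assumption."""
    filled = sum(v is not None and v != "" for v in values)
    return filled / len(values) if values else 0.0

# The 500-row column from the text, with 100 missing fields.
column = ["jane@example.com"] * 400 + [None] * 100
print(f"{completeness(column):.0%}")  # 80%
```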
Completeness is also a metric that is ideally displayed at the dataset level. A look at the completeness of a dataset will guide your decision to use it or to keep looking for more relevant information. Say you stumble upon a dataset with a completeness of 15%: you might prefer to go on your way and find another, more complete one.
💡Data usability refers to whether data can be used and understood smoothly.
Usability is a fundamental characteristic. Your data is usable when it’s easy to understand and interpret correctly, in an unambiguous manner. For example, you have a usability problem when a Looker dashboard is hard to interpret. Even if your data fares well on all the other data quality metrics, it might still be unusable: users may not trust, understand, or even find the data.
Usability can be measured by looking at the level of documentation of your data assets, or at the number of data users. In general, enriching your data with metadata (i.e. documenting your data) makes it usable and easy to interpret on the go. The key metric to look at here is the percentage of documented columns. The best way to display and propagate your data documentation is to use a data catalog.
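A sketch of the documented-columns metric, assuming a hypothetical schema export mapping column names to their descriptions (the `orders_schema` below is made up for illustration):

```python
def documentation_coverage(schema):
    """Percent of columns carrying a non-empty description.

    `schema` is a hypothetical mapping of column name -> description,
    as might be exported from a data catalog."""
    if not schema:
        return 0.0
    documented = sum(bool(desc and desc.strip()) for desc in schema.values())
    return documented / len(schema)

orders_schema = {
    "order_id": "Unique identifier of the order",
    "amount": "Order total in EUR, VAT included",
    "channel": "",    # undocumented
    "tmp_flag": None, # undocumented
}
print(f"{documentation_coverage(orders_schema):.0%}")  # 50%
```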
It’s a good number to display on each dataset. It also boosts productivity within your data team, because it encourages stakeholders to use the best-documented datasets.
Once you’ve got data quality above 80%, meaning you can get clear answers from your data, it’s time for phase 2: data operationalization. Operationalization is a complicated word for a simple concept: putting your high-quality data in the hands of domain experts so they can move faster in their own job, without having to rely on the data team to complete their requests. Instead of using data only to influence long-term strategy, operational analytics informs the day-to-day operations of the business.
💡This metric refers to the reduction in the number of requests in a specific business category.
A nice way to measure data operationalization is to look at the number of problems you allow other teams to solve independently thanks to the infrastructure provided. For example, when the data team is relatively young, it might get a lot of requests from other teams about attribution. As the data infrastructure improves, and as operational teams access the data easily, the need to rely on the data team to solve attribution issues decreases. The marketing team becomes more independent in solving this kind of problem, ultimately driving the number of attribution-related requests to zero. A good measure of how well you’re operationalizing your data is thus the reduction in the number of requests across categories: the more problems you can tick off the list, the more your data is operationalized.
A good way to estimate this metric is to measure the percentage of self-supported queries. That is, queries that the business could do entirely by themselves. It reflects the idea that data teams should be focusing on offloading queries and providing infrastructure for the business to run queries themselves. (Jessee, 2022)
You should keep track of the decrease in each category over the months. This will give you a clear idea of which problems teams can now solve independently, and which ones they still rely on the data team for. You can deem the data in a specific category fully operationalized when you receive zero requests about it.
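Tracking the decrease per category can be sketched from a simple request log. The categories and monthly counts below are hypothetical, mirroring the attribution example above:

```python
# Monthly count of requests received by the data team per category
# (hypothetical numbers). A steady decline in a category suggests
# the business is becoming self-sufficient there.
requests = {
    "attribution": {"2023-01": 14, "2023-02": 9, "2023-03": 3, "2023-04": 0},
    "churn":       {"2023-01": 6,  "2023-02": 7, "2023-03": 5, "2023-04": 6},
}

def reduction(series):
    """Relative drop between the first and last month of a series."""
    months = sorted(series)
    first, last = series[months[0]], series[months[-1]]
    return (first - last) / first if first else 0.0

for category, series in requests.items():
    print(f"{category}: {reduction(series):.0%} fewer requests")
```

Here attribution shows a 100% reduction (fully operationalized), while churn questions still land on the data team's plate.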
You now have clean and operationalized data. It’s a miracle. The last thing left on your plate is to measure the productivity of the people on your data team. Although this overlaps with the data quality and data operationalization measures, it is important: you want to ensure the people on your team do their job well. And it’s far from straightforward, because the people on your data team are on different missions. Data engineers are on a mission to make everyone’s life easier by providing a nice data infrastructure, while data analysts focus on helping people make better data-driven decisions. The difference between engineering and analytics is brilliantly explained by Mikkel Dengsoe. The point is, they have different missions which should be measured differently. This is the purpose of this part.
💡Cost savings due to good data management and clever tooling choices.
Apart from providing clean and reliable data to other teams, data engineers are responsible for good data management. This includes cleaning tables, archiving unnecessary ones, and taking full advantage of what the cloud has to offer. It turns out that good data management saves tremendous amounts of money in storage costs. Similarly, data engineers generally seek to automate processes or make them more efficient. This saves time, and thus money.
Infrastructure cost savings should naturally follow good data management practices, and come as the natural reward of performing data engineering teams.
Again, this is a reasonably basic measure. You could simply look at the percentage decrease in your infrastructure cost. Naturally, you should be taking this number with a pinch of salt and understand that falling costs are not always a good sign. Are costs falling because your team has become more efficient (good sign) or are they simply processing much less data than before (less of a good sign)? Regardless of the answer, this number will tell you something, on top of being super easy to measure.
Look at the evolution of these numbers over the months/years and adapt your strategy accordingly.
💡This metric measures how easy it is for data scientists to access the data.
This metric can seem somewhat trivial at first sight, yet you would be surprised by the amount of frustration caused by poor data accessibility. Due to poor documentation or friction tied to data governance programs, it can sometimes take two to three days for a data scientist to access a dataset of interest. The issue is that data doesn’t mean anything to people who can’t access it.
This metric is not straightforward to measure. You’re basically looking for the average amount of time it takes a data person to access a given dataset. To measure this, you can look at the average time it takes between the moment when access to the dataset is requested, and the moment it is granted. This would already give you a good idea of the accessibility level in your company. Ideally, you should also identify the average time it takes a data person to find the dataset she needs in the warehouse. The most straightforward way to do this is to do a survey.
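The request-to-grant part of the measure can be sketched from access-ticket timestamps. The ticket data below is hypothetical; in practice it would be exported from your access-management tool.

```python
from datetime import datetime

def mean_access_hours(tickets):
    """Average hours between an access request and the grant.

    `tickets` is a hypothetical list of (requested_at, granted_at) pairs
    exported from an access-management tool."""
    deltas = [
        (granted - requested).total_seconds() / 3600
        for requested, granted in tickets
    ]
    return sum(deltas) / len(deltas) if deltas else 0.0

tickets = [
    (datetime(2023, 5, 2, 9), datetime(2023, 5, 4, 9)),    # 48 h
    (datetime(2023, 5, 3, 10), datetime(2023, 5, 3, 14)),  # 4 h
]
print(f"{mean_access_hours(tickets):.0f} h on average")  # 26 h
```

The time-to-find half of the metric, as the text notes, is best captured by a survey rather than by instrumentation.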
You can look at this number at the global level, and at its evolution over time. If it is very high or keeps getting worse, maybe it’s time to invest in a data catalog or in a more efficient data governance program, one which allows for both data protection and data discovery simultaneously.
💡Data uptime refers to the percentage of time a dataset is delivered on time.
Data uptime is a key metric allowing teams to put a number on the “multiplier effect” of engineering teams. In fact, engineers rarely impact top-level KPIs in the organization. Their mission is to create a reliable data infrastructure that then acts as a “multiplier”, allowing analytics teams to move faster and more efficiently. Data uptime is a good way to measure the quality of this infrastructure, and a strong indicator of how much engineering facilitates analytics’ life. It is basically a measure of how often data is available and fresh.
Data uptime is the percentage of time a dataset is delivered on time. It is always calculated relative to the expected frequency and SLA requirements.
Expected frequency refers to how often a dataset is expected to be updated. This is most commonly daily, hourly or real-time.
An SLA (Service Level Agreement) requirement is the frequency clause stipulated in an SLA between a data producer and a consumer, specifying when data must be updated by. For example, an SLA might state an absolute time like 7:00 am, or a relative time like every 3 hours.
The easiest way to measure data uptime is to measure data downtime (when data is NOT delivered on time) and to subtract this number from 100%. The formula for data downtime was derived in this great article by Barr Moses.
For example, if you had 15 data incidents the past month, each taking 5 hours to detect and 3 hours to resolve, your data downtime can be estimated at 120 hours.
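The downtime arithmetic from the example above can be sketched as follows. The 30-day observation window is an assumption; use your own reporting period.

```python
def downtime_hours(n_incidents, hours_to_detect, hours_to_resolve):
    """Estimated downtime: number of incidents x (time to detect + time to resolve),
    following the formula referenced in the text."""
    return n_incidents * (hours_to_detect + hours_to_resolve)

def uptime(n_incidents, hours_to_detect, hours_to_resolve, period_hours=30 * 24):
    """Uptime as the complement of downtime over the period (assumed 30 days)."""
    return 1 - downtime_hours(n_incidents, hours_to_detect, hours_to_resolve) / period_hours

# The example from the text: 15 incidents, 5 h to detect, 3 h to resolve.
print(downtime_hours(15, 5, 3))   # 120 hours of downtime
print(f"{uptime(15, 5, 3):.1%}")  # 83.3% uptime over a 30-day month
```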
Data uptime should be measured frequently for the totality of your datasets. Looking at its evolution over the months will tell you whether your engineering team is doing its job better and more efficiently, or not.
💡Turnaround time refers to the time elapsed between when a data driven question is asked and when analytics can provide an answer.
Turnaround time is a great way to measure the efficiency of analytics teams. This metric was initially proposed by Benn Stancil. Contrary to engineering teams, analytics teams directly impact decision-making and top-level KPIs. Their ultimate mission is to provide fast answers to key questions so as to enlighten decision-making in the organization. This should be taken into account when measuring their performance. And the best way to do so, as proposed by Benn Stancil, is to measure the time between when a question is asked and when a decision is made based on the answer given by the analyst.
The great thing about measuring analytics performance this way is that it encourages analysts to focus their work on real-life decision-making, preventing them from getting lost in exploratory data analysis.
“The moment an analyst is asked a question, a timer starts. When a decision gets made based on that question, the timer stops. Analysts’ singular goal should be to minimize the hours and days on that timer.” (B. Stancil, 2022). Accordingly, the best way to measure analytics turnaround time is to compute the elapsed time between when the question is asked and when a decision is made based on it.
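The timer described above can be sketched from a simple log of questions. The entries below are hypothetical; a real log might live in a ticketing tool or a shared sheet.

```python
from datetime import datetime

# Hypothetical log of analytics questions: when each was asked and
# when a decision was made from the answer (Stancil's "timer").
questions = [
    {"asked": datetime(2023, 6, 1, 9),  "decided": datetime(2023, 6, 1, 17)},  # 8 h
    {"asked": datetime(2023, 6, 2, 10), "decided": datetime(2023, 6, 5, 10)},  # 72 h
]

def turnaround_hours(log):
    """Average hours between question asked and decision made."""
    spans = [(q["decided"] - q["asked"]).total_seconds() / 3600 for q in log]
    return sum(spans) / len(spans) if spans else 0.0

print(f"{turnaround_hours(questions):.0f} h average turnaround")  # 40 h
```

Filtering the log by analyst gives the per-person number the next paragraph recommends.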
This number should be calculated for each analyst on the team, as it is most appropriate to measure individual performance. Feel free to aggregate it to get the overall turnaround time for the team, but measuring it individually will allow you to make more strategic decisions.
We write about all the processes involved in leveraging data assets: from the modern data stack to data team composition to data governance. Our blog covers the technical and the less technical aspects of creating tangible value from data.
At Castor, we are building a data documentation tool for the Notion, Figma, Slack generation. We designed our catalog software to be easy to use, delightful and friendly.
Want to check it out? Reach out to us and we will show you a demo.