Top 10 Metrics for Engineering Teams

And how you can measure them

When it comes to measuring your engineering team's performance, choosing the right KPIs can be challenging. We therefore propose 10 identifiable metrics that will help you assess your team on different aspects such as productivity, code quality, or development efficiency. Keep in mind that some metrics will be more relevant than others depending on the nature of your project and the size of your team. However, you should always measure your team’s performance with at least some metrics to be able to improve it. Here are the 10 metrics that could help you:

1. Cycle Time
2. Mean Time To Repair (MTTR)
3. Mean Time Between Failures (MTBF)
4. Change Failure Rate
5. Change Lead Time
6. Test Coverage Ratio
7. Release Burndown
8. Deployment Frequency
9. Pull Request Flow Ratio
10. Code Churn Rate

1. Cycle Time

What is it? 

Cycle time is the time it takes to complete a given task, such as fixing a bug or adding new code.

Why is it important?

Cycle Time is a key engineering metric to acquire a better understanding of your work processes. You can use Cycle Time to identify the type of work that takes especially long to complete, which allows you to locate the bottlenecks slowing down your projects. This metric is also valuable as it helps you estimate the length of your future projects. 

How to calculate it? 

Cycle Time is the time that elapses between the moment a task enters the “in progress” stage and the moment it is considered finished. The most direct way to measure the Cycle Time of a task is to count the number of days spent working on this task.

Cycle Time - Image courtesy of Castor

However, there is still a debate in a lot of organizations around whether Cycle Time should include non-working hours or not. The best way to settle the debate is to look at things from a customer experience point of view. Let’s say you measure Cycle Time excluding non-working hours and weekends. Based on historical data, you predict that your cycle time is 8 days and announce this to your customers. Your customers (rightfully) expect a deliverable in exactly eight days. However, you will only be able to deliver the product in 15 days because you have excluded non-working hours and weekends from your calculations. From your customers’ perspective, you will be late. Excluding non-working hours might make your Cycle Time look better internally, but it will certainly lead to customer dissatisfaction. I’ll let you guess which is the most important.
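To make the two ways of counting concrete, here is a minimal Python sketch that measures the Cycle Time of one task both in calendar days and in weekdays only. The function names and the dates are hypothetical, chosen for illustration. The weekday version ignores public holidays and working hours, so the gap would be even larger if non-working hours were excluded too.

```python
from datetime import date, timedelta


def cycle_time_calendar_days(started: date, finished: date) -> int:
    """Days elapsed between 'in progress' and 'finished', weekends included."""
    return (finished - started).days


def cycle_time_weekdays(started: date, finished: date) -> int:
    """Same interval, counting only Monday-to-Friday days (start included, end excluded)."""
    total = 0
    day = started
    while day < finished:
        if day.weekday() < 5:  # 0 to 4 are Monday to Friday
            total += 1
        day += timedelta(days=1)
    return total


# Hypothetical task: started on Monday 2024-01-01, finished on Thursday 2024-01-11.
started, finished = date(2024, 1, 1), date(2024, 1, 11)
print(cycle_time_calendar_days(started, finished))  # 10
print(cycle_time_weekdays(started, finished))       # 8
```

The same task reads as 8 days internally and 10 days to a customer, which is the mismatch described above.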

What does it look like? 

It’s useful to display Cycle Time as a chart showing its evolution, broken down by type of task. This lets you see whether your Cycle Time changes over time or varies with the type of task.

2. Mean Time To Repair (MTTR)

What is it? 

As a production metric, MTTR refers to the amount of time it takes for engineers to repair a software problem. As a security metric, it refers to the amount of time it takes engineers to deploy a solution from the time they discover a security breach.

Why is it important? 

MTTR begins the moment a failure is detected and encompasses diagnostic time, repair time, testing, and all other activities until service is returned to end users. This metric shows how the software performs in production. Software failures are unavoidable, which makes it important to measure how quickly it recovers. This metric is valuable as it shows how long users have to wait until they can re-use the software. This is what a lot of users perceive as technical support. This metric impacts customer experience, which is why it’s key to measure it. Keeping this number low helps you maintain a high standard for your product.

How to calculate it? 

MTTR is calculated by dividing the total downtime caused by failures by the total number of failures.

Mean Time To Repair - Image courtesy of Castor

For example, if a system fails three times in a month and the failures result in a total of six hours of downtime, the MTTR is two hours.

MTTR = 6 hours / 3 failures = 2 hours
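As a rough sketch, the calculation above can be written in a few lines of Python. The function name and the way the six hours are split across the three failures are assumptions made for illustration.

```python
def mean_time_to_repair(downtimes_hours: list[float]) -> float:
    """Total downtime caused by failures divided by the number of failures."""
    if not downtimes_hours:
        raise ValueError("at least one failure is needed to compute MTTR")
    return sum(downtimes_hours) / len(downtimes_hours)


# Three failures adding up to six hours of downtime (the split is hypothetical).
print(mean_time_to_repair([1.0, 2.0, 3.0]))  # 2.0
```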

When this value becomes smaller over time, this means developers are becoming more efficient at understanding and fixing bugs.

What does it look like? 

It’s important to display MTTR over time so you can see how it evolves. This shows whether your team is solving issues more quickly and more efficiently. If this metric doesn't go down, something is wrong. Issues are often recurrent, so if your engineering team is not solving them faster, you may need to address the root cause of the problems instead of relying on quick fixes.

3. Mean Time Between Failures (MTBF)

What is it? 

The Mean Time Between Failures or MTBF refers to the average time between system breakdowns.

Why is it important? 

The MTBF is a maintenance metric used to assess the reliability and availability of a system or machinery. In the case of software, for example, it indicates how long it can operate without any breakdowns. It is important to monitor the MTBF because a very low MTBF means that the system requires a lot of maintenance and improvements. Moreover, having too many failures can lead to a loss of users or clients. Measuring the MTBF is therefore a way to better understand and improve the quality of your product. The MTBF can also be used to optimize the predictive maintenance schedule, thus avoiding system failures.

How to calculate it? 

Mean Time Between Failures - Image courtesy of Castor

For example, if your website was operational during one entire day with two failures that each took one hour to repair, your MTBF would be eleven hours.

MTBF = 22 hours / 2 failures = 11 hours

The higher the MTBF is, the more reliable and available your system is. When looking at the MTBF, you should not factor in downtime due to planned maintenance or updates. Instead, you want to focus on unexpected issues.
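The website example can be sketched in Python as follows, assuming MTBF is computed as operational time divided by the number of unexpected failures. The function and parameter names are hypothetical.

```python
def mean_time_between_failures(period_hours: float, repair_hours: list[float]) -> float:
    """Operational time (period minus unplanned downtime) divided by the number of failures."""
    if not repair_hours:
        raise ValueError("at least one failure is needed to compute MTBF")
    uptime = period_hours - sum(repair_hours)
    return uptime / len(repair_hours)


# One day of operation with two failures that each took one hour to repair.
print(mean_time_between_failures(24, [1, 1]))  # 11.0
```

Planned maintenance windows should be left out of `repair_hours`, since only unexpected failures count.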

What does it look like? 

This metric should be displayed on a line chart. The Y-axis should represent the MTBF and the X-axis should represent the time with different granularities (hours, days, months) depending on the product. The objective is to keep the MTBF as high as possible, as this translates into very few unexpected failures.

4. Change Failure Rate

What is it? 

The Change Failure Rate or CFR represents the proportion of failed deployments over the total number of deployments. In this metric, we do not take into account the errors that were made during development or testing or the issues that occur after a long period of time. We only focus on the errors that were made because of a change to the system such as new features or quick fixes.

Why is it important? 

It is interesting to look at the Change Failure Rate because it is a good indicator of the quality of the changes that are deployed. In fact, if the CFR is too high, it means that you may have issues with testing or code reviewing before deployment. On the other hand, a good Change Failure Rate means that your team is able to identify issues before the deployment and you do not have to lose time fixing them afterward.

How to calculate it? 

Change Failure Rate - Image courtesy of Castor

If your team made 20 deployments over the week and four of them led to issues that had to be fixed right after the deployment, then your team’s CFR is 20%.
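A minimal Python sketch of this calculation, following the definition of CFR as failed deployments over total deployments (the function name is an assumption):

```python
def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Percentage of deployments that led to a failure in production."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    return 100 * failed_deployments / total_deployments


# 20 deployments over the week, four of which needed a fix right after deployment.
print(change_failure_rate(4, 20))  # 20.0
```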

What does it look like? 

It is interesting to look at the evolution of this metric over time. Displaying the CFR on the Y-axis of a line chart with the time (days/weeks/months) on the X-axis will let you follow your team's performance easily. The goal for your team is to lower this metric as much as possible. Even though having a 0% CFR would be impossible, high-performing engineering teams should have a CFR of under 15%.

5. Change Lead Time

What is it? 

The Change Lead Time is an essential metric that measures a team’s speed of deployment. It represents the average duration to implement, test, and deliver a piece of code.

Why is it important? 

Change Lead Time is one of the most important KPIs to follow when assessing the efficiency of your team in the development process. If your Change Lead Time is very long, it might be interesting to look at the development chain to see where there is an issue. For example, there could be a bottleneck in the process if some individual along the chain is overloaded with tasks. On the other hand, a short Change Lead Time is a sign of flexibility and reactivity to problems.

How to calculate it?

Change Lead Time - Image courtesy of Castor

This week, your team made two changes to your product which took respectively two and four days between their commit and their deployment. The Change Lead Time is 3 days.
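The example above can be sketched in Python by averaging the time between commit and deployment for each change. The function name and timestamps below are hypothetical, and the sketch assumes you can export one (commit time, deployment time) pair per change.

```python
from datetime import datetime
from statistics import mean


def change_lead_time_days(changes: list[tuple[datetime, datetime]]) -> float:
    """Average number of days between a change's commit and its deployment."""
    return mean(
        (deployed - committed).total_seconds() / 86400  # seconds in a day
        for committed, deployed in changes
    )


# Two hypothetical changes that took two and four days to reach production.
changes = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 3, 9, 0)),
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 6, 9, 0)),
]
print(change_lead_time_days(changes))  # 3.0
```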

What does it look like? 

The evolution of your Change Lead Time can be monitored using a line chart. The idea is that by keeping a low Change Lead Time, you are making sure that the changes that you bring to your product are made efficiently and that you are more flexible.

6. Test Coverage Ratio

What is it? 

Test Coverage refers to the proportion of your code that is being executed when performing your test suite.

Why is it important? 

By taking into account a Test Coverage metric, you are able to better identify the gaps between requirements and testing. As a result, you can easily discover areas of your code that are not being tested. Moreover, Test Coverage can indicate that you should create more test cases for better coverage, but it can also help identify and eliminate test cases that are not necessary. Overall, having a Test Coverage metric will help you have more control over your testing, thus making the testing cycle smoother and of better quality.

How to calculate it? 

Test Coverage Ratio - Image courtesy of Castor

It is considered adequate to have your test suite covering at least 80% of your code. In fact, the better the coverage, the more likely you will be to discover areas that need to be fixed. Even though we would like the Test Coverage to be as high as possible, you should not focus on achieving 100% coverage by writing vague test scripts but more on the requirements of your software.

What does it look like?

The Test Coverage Ratio should be displayed on a chart so you can see how it evolves over time. For example, if you see the Test Coverage Ratio go down after you made some updates to your code, it might be because you forgot to write tests that execute some parts of the new code.

7. Release Burndown

What is it? 

Release Burndown is a metric used particularly in scrum projects to track the progress of a project by looking at the quantity of work remaining.

Why is it important? 

One of the main uses of the Release Burndown is for teams to monitor their progress. However, it can also be used to set goals and motivate the team to achieve them. Moreover, looking at the Release Burndown chart of your team will also help you see whether your project is on time or whether some issues have led to a slowdown.

How to calculate it? 

Release Burndown - Image courtesy of Castor

The Release Burndown can be expressed in terms of different units such as the number of hours of work remaining. In software development, teams also like to use the number of requirements that still need to be addressed for the application. You should adapt the unit of this metric to your project in order for it to be as easily understandable as possible by your team members.

What does it look like? 

The Release Burndown chart can either be a bar chart or a line chart with the units of work remaining on the Y-axis and the time periods on the X-axis. This chart should be updated every period (e.g. every week) with the quantity of work that has been performed and the projection of work to be done for the next periods until the end of the project.
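The data behind such a chart can be sketched in Python as follows. The numbers are hypothetical, and the projection simply assumes the team keeps its average pace so far, which is only one of several ways to forecast.

```python
from itertools import accumulate
from math import ceil


def burndown(total_work: float, completed_per_period: list[float]) -> list[float]:
    """Work remaining at the end of each period."""
    return [total_work - done for done in accumulate(completed_per_period)]


def periods_left(remaining: float, completed_per_period: list[float]) -> int:
    """Periods still needed, assuming the average pace so far continues."""
    velocity = sum(completed_per_period) / len(completed_per_period)
    return ceil(remaining / velocity)


# Hypothetical release: 120 hours of work, with three weeks already completed.
completed = [30, 25, 20]
remaining = burndown(120, completed)
print(remaining)                               # [90, 65, 45]
print(periods_left(remaining[-1], completed))  # 2
```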

8. Deployment Frequency

What is it? 

The Deployment Frequency refers to the number of times deployments are made to your project in a fixed period of time. If it is better suited for your project, you can also count the number of features added instead of the deployments.

Why is it important? 

Looking at the Deployment Frequency of a project enables you to monitor the speed of your project. On the one hand, if your Deployment Frequency is too low, it can indicate that you have some issues with one of the deployments or that a bottleneck is slowing down your development process. On the other hand, an excessively high Deployment Frequency can mean that some features are deployed too quickly without enough testing or reviewing, which can lead to system failures.

How to calculate it? 

Deployment Frequency - Image courtesy of Castor

Your team deployed 21 new features in the last two weeks. Therefore, the Deployment Frequency of your team is 1.5 deployments per day.
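A minimal Python sketch of this calculation, counting calendar days (the function name is an assumption):

```python
def deployment_frequency(deployments: int, period_days: int) -> float:
    """Average number of deployments per day over the period."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    return deployments / period_days


# 21 deployments over the last two weeks (14 calendar days).
print(deployment_frequency(21, 14))  # 1.5
```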

What does it look like? 

It can be interesting to look at the evolution of the Deployment Frequency over time. By displaying the Deployment Frequency on the Y-axis of a line chart with the time (days/weeks/months) on the X-axis, you will be able to assess your team’s speed of development easily.

9. Pull Request Flow Ratio

What is it? 

The Pull Request Flow Ratio refers to the sum of opened pull requests over the sum of closed pull requests over the same period of time.

Why is it important? 

Looking at this ratio can give you important insights into the balance between the features that are being developed and the ones that are deployed. You should have as many opened pull requests as closed pull requests. An imbalanced ratio can mean that you have too many opened pull requests and that they are queuing to be deployed. It can also be a sign that the lead time for the opened pull requests is too long and that you need to address this issue. Overall, it is best to have a 1-1 ratio to make sure that your process is smooth and predictable.

How to calculate it? 

Pull Request Flow Ratio - Image courtesy of Castor

This week, you had 10 opened pull requests while only 6 closed pull requests. The ratio is therefore 5/3. It would be interesting to look into the process of the opened pull requests to see if we can lower their lead time and improve the efficiency of our development process.
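As a small sketch, the ratio can be computed in Python with the standard `fractions` module, which keeps it in reduced form. The function name is an assumption.

```python
from fractions import Fraction


def pull_request_flow_ratio(opened: int, closed: int) -> Fraction:
    """Opened pull requests over closed pull requests for the same period."""
    if closed <= 0:
        raise ValueError("closed must be positive")
    return Fraction(opened, closed)


# 10 pull requests opened and 6 closed this week.
ratio = pull_request_flow_ratio(10, 6)
print(ratio)                   # 5/3
print(round(float(ratio), 2))  # 1.67
```

A value above 1 means pull requests are being opened faster than they are closed.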

What does it look like? 

It is best to monitor the Pull Request Flow Ratio over time, as the goal is a 1-1 ratio. By displaying this KPI on the Y-axis of a line chart with the time (days/weeks/months) on the X-axis, you will be able to assess the balance between the features being developed and the ones that are deployed.

10. Code Churn Rate

What is it?

The Code Churn Rate measures the number of times a piece of code is edited over a period of time.

Why is it important? 

Code churn is essential because it helps assess your development process. The main goal of the code churn is to assess the quality of your code. Generally, the more changes you make to your code, the more likely you are to make mistakes. A high Code Churn Rate can be a sign that there is too much rework on your project. This rework can be caused by problems such as bad communication of the project goals or a lack of coding skills. Moreover, looking at this metric enables you to identify the parts of your code that need to be tested, as well as how your resources are allocated between the different modules of your project.

How to calculate it? 

Code Churn Rate - Image courtesy of Castor

Let’s say you have a script of 1000 lines in total. If you modify 100 lines, add 200 lines, and delete 75 lines, you have a Code Churn Rate of 37.5% (375 changed lines out of 1000). Misunderstanding your clients’ requirements might be the cause of such a high rate, and improving communication with them could help improve the efficiency of the project.
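The script example can be sketched in Python as follows, assuming churn is the sum of modified, added, and deleted lines over the total number of lines. The function name is hypothetical.

```python
def code_churn_rate(modified: int, added: int, deleted: int, total_lines: int) -> float:
    """Percentage of the codebase touched (modified, added, or deleted) in the period."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return 100 * (modified + added + deleted) / total_lines


# A 1000-line script with 100 lines modified, 200 added and 75 deleted.
print(code_churn_rate(100, 200, 75, 1000))  # 37.5
```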

What does it look like? 

The Code Churn Rate is usually monitored over short periods of time such as weeks. Looking at the weekly evolution of this metric will enable you to detect improvements or problems in your code development. It is often said that having a low Code Churn Rate is best. However, it is impossible to achieve a zero churn rate, since you always need to modify your code regardless of your coding skills. In fact, a Code Churn Rate that is too low can mean that you are focusing too much on speed and not on code quality. More generally, it is best to have a Code Churn Rate between 10% and 30%.
