Metrics that Matter: Mean Time to Recovery

By: Code Climate
June 12, 2023

The DevOps Research and Assessment (DORA) group, now part of Google Cloud, identified four key software engineering metrics that, according to its research, directly affect a team's ability to improve deploy velocity and code quality, which in turn affects business outcomes.

The four outcomes-based DORA metrics include two incident metrics, Mean Time to Recovery (MTTR, also referred to as Time to Restore Service) and Change Failure Rate (CFR), and two deploy metrics, Deployment Frequency (DF) and Mean Lead Time for Changes (MLTC).

Gaining visibility into these metrics offers actionable insights to balance and enhance software delivery, so long as they are considered alongside other key engineering metrics and shared and discussed with your team.

The Mean Time to Recovery metric can help teams and leaders understand the risks that incidents pose to the business, since incidents can cause downtime, performance degradation, and feature bugs that make an application unusable.

What is Mean Time to Recovery?

Mean Time to Recovery is a measurement of how long it takes a team to recover from a failure in production, from when the failure was first reported to when it was resolved. We suggest calculating MTTR from actual incident data rather than proxy data, which can be fallible and error-prone, so that you can improve this metric and prevent future incidents. While the team may experience other incidents, MTTR should only look at the recovery time of incidents that cause a failure in production.
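To make the definition concrete, here is a minimal sketch of the calculation in Python. The `Incident` record and its field names are hypothetical and chosen only for illustration; they do not reflect how Velocity or any incident management tool stores its data. The sketch averages the time from report to resolution and skips incidents that did not cause a production failure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape, assumed for this example only.
    reported_at: datetime
    resolved_at: datetime
    caused_production_failure: bool


def mean_time_to_recovery(incidents):
    """Return the mean report-to-resolution time, or None if nothing qualifies."""
    durations = [
        incident.resolved_at - incident.reported_at
        for incident in incidents
        if incident.caused_production_failure  # only production failures count
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


incidents = [
    Incident(datetime(2023, 6, 1, 9, 0), datetime(2023, 6, 1, 11, 0), True),   # 2 hours
    Incident(datetime(2023, 6, 3, 14, 0), datetime(2023, 6, 3, 18, 0), True),  # 4 hours
    Incident(datetime(2023, 6, 5, 8, 0), datetime(2023, 6, 6, 8, 0), False),   # excluded
]

mttr = mean_time_to_recovery(incidents)
print(mttr)  # 3:00:00
```

The third incident lasted a full day but is left out of the average because it did not cause a failure in production.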

Why is Mean Time to Recovery Important?

Even for high-performing teams, failures in production are inevitable. MTTR offers essential insight into how quickly engineering teams respond to and resolve incidents and outages. Digging into this metric can reveal which parts of your processes need extra attention; if you’re delivering quickly but experiencing frequent incidents, your delivery is not balanced. By surfacing the data associated with your teams’ incident response, you can begin to investigate the software delivery pipeline and uncover where changes need to be made to speed up your incident recovery process.

Recovering from failures quickly is key to becoming a top-performing software organization and meeting customer expectations.

What is a good Mean Time to Recovery?

Each year, the DORA group publishes a State of DevOps report, which includes performance benchmarks for each DORA metric and classifies teams as high-, medium-, or low-performing. One of the most encouraging and productive ways to use benchmarking in your organization is to set goals as a team, measure how you improve over time, and congratulate teams on that improvement, rather than using “high,” “medium,” and “low” to label team performance. Additionally, if you notice improvements, you can investigate which processes and changes enabled teams to improve, and scale those best practices across the organization.

More than 33,000 software engineering professionals have participated in the DORA survey in the last eight years, yet the approach to DORA assessment is not canonical and doesn’t require precise calculations from respondents, meaning different participants may interpret the questions differently, and offer only their best assumptions about their teams’ performance. That said, the DevOps report can provide a baseline for setting performance goals.

The results of the 2022 State of DevOps survey showed that high performers had a Mean Time to Recovery of less than one day, while medium-performing organizations were able to restore normal service between one day and one week, and low-performing organizations took between one week and one month to recover from incidents. For organizations managing applications that drive revenue, customer retention, or critical employee work, being a high performer is necessary for business success.
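As a rough illustration of using these ranges as a baseline for goal setting, the sketch below maps an MTTR value to the 2022 ranges quoted above. The function name is hypothetical, and two details are assumptions the survey does not specify: how values that fall exactly on a boundary are treated, and that "one month" means 30 days.

```python
from datetime import timedelta


def benchmark_range(mttr):
    """Map an MTTR to the 2022 survey ranges quoted in this article."""
    if mttr < timedelta(days=1):
        return "high"      # less than one day
    if mttr <= timedelta(weeks=1):
        return "medium"    # between one day and one week
    if mttr <= timedelta(days=30):  # assumption: one month treated as 30 days
        return "low"       # between one week and one month
    return "outside the reported ranges"


print(benchmark_range(timedelta(hours=3)))  # high
print(benchmark_range(timedelta(days=3)))   # medium
print(benchmark_range(timedelta(days=10)))  # low
```

A result like this is best treated as a starting point for a team goal, not as a label for the team.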

How To Improve Your Mean Time to Recovery

Visibility into team performance and all stages of your engineering processes is key to improving MTTR. With more visibility, you can dig into the following aspects of your processes:

Work in Progress (WIP)

A long MTTR could indicate that developers have too much WIP and lack adequate resources to address failures.

Look at Other Metrics and Add Context

One of the benefits of using a Software Engineering Intelligence (SEI) platform is that you can add important context when looking at your MTTR. An SEI platform like Velocity, for example, allows you to annotate when you made organizational changes — like adding headcount or other resources — to see how those changes impacted your delivery.

You can also view DORA metrics side by side with other engineering metrics, like PR Size, to uncover opportunities for improvement. Smaller PRs can move through the development pipeline more quickly, allowing teams to deploy more frequently. If teams make PR Sizes smaller, they can find out what’s causing an outage sooner. For example, is debugging taking up a lot of time for engineers? Looking at other data like reverts or defects can help identify wasted efforts or undesirable changes that are affecting your team’s ability to recover, so you can improve areas of your process that need it most.

Improve Documentation

What did you learn from assessing your team’s incident response health? Documenting an incident-response plan that can be used by other teams in the organization and in developer onboarding can streamline recovery.

Set Up an Automated Incident Management System

To improve your team’s incident response plan, it’s helpful to use an automated incident management system, like Opsgenie or PagerDuty. With an SEI platform like Code Climate Velocity, you can push incident data from these tools, or our Jira incident source, to calculate DORA metrics like MTTR. In Velocity, users can set a board and/or issue type which will tell the platform what to consider an “incident.”

Talk to Your Team

We spoke with Nathen Harvey, Developer Advocate at DORA and Google Cloud, for his perspective on how to best use DORA metrics to drive change in an organization. Harvey emphasized learning from incident recovery by speaking with relevant stakeholders.

Looking at DORA metrics like Mean Time to Recovery is a key starting point for teams who want to improve performance and ensure faster, more stable software delivery. By looking at MTTR in context with organizational changes and alongside other Velocity metrics, speaking with your team after an incident, and documenting and scaling best practices, you can improve MTTR overall and ultimately deliver more value to your customers.

The four DORA metrics are available in Velocity’s Analytics module. Learn how you can use these metrics to enhance engineering performance and software delivery by speaking with a Velocity product specialist.
