Quality and Stability Metrics: Measuring What Keeps Your Software Standing
- Craig Risi

In my previous blog post, I spoke about the importance of measuring flow metrics and how they can expose blockers in your development practices and help you deliver more quickly. However, while speed and flow get software into production, it is quality and stability that determine whether it stays there.
High-performing engineering teams don’t optimise for velocity alone; they balance it with reliability, resilience, and customer trust.
Quality and stability metrics reveal how safe your delivery system is. They show whether fast changes are sustainable or whether they are silently accumulating risk that will eventually surface as defects, outages, and lost confidence.
Why Quality and Stability Metrics Matter
Without quality and stability metrics, organisations operate in the dark. Defects escape into production unnoticed until customers feel the impact. Incidents repeat because root causes are never fully understood or addressed. Teams become trapped in a cycle of firefighting, constantly reacting to failures instead of improving the systems that create them. Over time, this erodes customer trust, damages confidence in engineering, and ultimately slows delivery as more effort is spent fixing problems than building value.
Quality and stability metrics provide the visibility needed to break this cycle. They turn failures into data, and data into learning. Rather than relying on anecdotes or assumptions, teams can see where quality is breaking down, how often systems fail, and how quickly they recover. This insight enables leaders to prioritise improvements that reduce risk and create a more resilient delivery pipeline.
At their core, these metrics help organisations answer the most important questions about their software systems:
Are we building the right thing correctly?
How risky is each release?
Are we learning from failures, or repeating them?
Is our system becoming more stable over time, or more fragile?
When these questions can be answered with confidence, quality stops being an abstract goal and becomes a measurable, improvable capability.
Core Quality & Stability Metrics
Below are some of the key data points to track to better understand the quality and stability of your different systems. These metrics do not exist to punish teams; they exist to illuminate system health. When measured consistently and discussed openly, they enable continuous improvement, safer delivery, and stronger customer trust:
Change Failure Rate
What it measures: The percentage of releases that result in incidents, defects, rollbacks, or hotfixes.
Why it matters: It shows how risky your changes are. High failure rates indicate poor testing, unclear requirements, or fragile architecture.
Use case: Track by service or team to identify areas with unstable release practices.
How to measure it:
Count the number of releases in a given period
Count how many of those caused incidents, hotfixes, or rollbacks
Change Failure Rate = (Failed Releases ÷ Total Releases) × 100
Source data from CI/CD tools and incident management systems.
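As a rough illustration, here is a minimal Python sketch of the calculation, assuming release records have already been exported from your CI/CD and incident tooling into a simple list (the record shape and field names are illustrative, not from any particular tool):

```python
from dataclasses import dataclass

@dataclass
class Release:
    service: str
    version: str
    caused_failure: bool  # an incident, hotfix, or rollback was linked to this release

def change_failure_rate(releases: list[Release]) -> float:
    """Percentage of releases that resulted in a failure."""
    if not releases:
        return 0.0
    failed = sum(1 for r in releases if r.caused_failure)
    return failed / len(releases) * 100

# Illustrative data for one service over a period
releases = [
    Release("payments", "1.4.0", False),
    Release("payments", "1.4.1", True),   # hotfix required
    Release("payments", "1.5.0", False),
    Release("payments", "1.5.1", False),
]
print(f"Change failure rate: {change_failure_rate(releases):.1f}%")  # 25.0%
```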
Defect Escape Rate
What it measures: The number of defects found in production compared to those found earlier in testing.
Why it matters: Production defects are expensive and damaging. A rising escape rate signals breakdowns in quality gates.
Use case: Improve test coverage and pre-release validation.
How to measure it:
Count production defects in a time period
Count total defects found across all environments
Defect Escape Rate = (Production Defects ÷ Total Defects) × 100
Use defect tracking systems like Jira or ServiceNow.
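A similar sketch for defect escape rate, assuming defects exported from a tracker such as Jira carry a field recording where each defect was found (the field name is an assumption for the example):

```python
from collections import Counter

# Each defect record notes the environment where it was found; field names are illustrative.
defects = [
    {"id": "BUG-101", "found_in": "qa"},
    {"id": "BUG-102", "found_in": "staging"},
    {"id": "BUG-103", "found_in": "production"},
    {"id": "BUG-104", "found_in": "qa"},
    {"id": "BUG-105", "found_in": "production"},
]

by_env = Counter(d["found_in"] for d in defects)
escape_rate = by_env["production"] / len(defects) * 100
print(f"Defect escape rate: {escape_rate:.1f}%")  # 40.0%
```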
Mean Time to Detect (MTTD)
What it measures: How long it takes to identify a production issue after it occurs.
Why it matters: The faster you detect failures, the less impact they have on customers.
Use case: Enhance monitoring, alerting, and observability practices.
How to measure it:
Incident start time (from logs or monitoring)
Incident detection time (alert or ticket creation)
MTTD = Detection Time – Incident Start Time
Average across incidents for trends.
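In Python, the averaging might look like this minimal sketch, with made-up timestamps standing in for data from your monitoring and ticketing tools:

```python
from datetime import datetime, timedelta

# (incident_start, detected_at) pairs pulled from monitoring and ticketing;
# the timestamps below are illustrative.
incidents = [
    (datetime(2024, 5, 1, 9, 0),   datetime(2024, 5, 1, 9, 12)),
    (datetime(2024, 5, 3, 14, 30), datetime(2024, 5, 3, 15, 5)),
    (datetime(2024, 5, 7, 22, 10), datetime(2024, 5, 7, 22, 18)),
]

detection_times = [detected - started for started, detected in incidents]
mttd = sum(detection_times, timedelta()) / len(detection_times)
print(f"MTTD: {mttd}")  # average time from incident start to detection
```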
Mean Time to Recover (MTTR)
What it measures: How long it takes to restore service after an incident.
Why it matters: Fast recovery limits business impact and builds customer confidence.
Use case: Identify teams or systems that struggle to recover and improve runbooks and automation.
How to measure it:
Incident start time
Service restoration time
MTTR = Recovery Time – Incident Start Time
Track by service and severity.
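A small sketch of MTTR grouped by severity, again with illustrative incident records rather than a real incident-tool export:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Incident records with start and restore timestamps; the shape is illustrative.
incidents = [
    {"severity": "sev1", "started": datetime(2024, 5, 1, 9, 0),   "restored": datetime(2024, 5, 1, 10, 30)},
    {"severity": "sev2", "started": datetime(2024, 5, 3, 14, 30), "restored": datetime(2024, 5, 3, 15, 10)},
    {"severity": "sev1", "started": datetime(2024, 5, 7, 22, 10), "restored": datetime(2024, 5, 8, 0, 40)},
]

durations = defaultdict(list)
for incident in incidents:
    durations[incident["severity"]].append(incident["restored"] - incident["started"])

for severity, times in durations.items():
    mttr = sum(times, timedelta()) / len(times)
    print(f"MTTR ({severity}): {mttr}")
```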
Incident Frequency
What it measures: How often production incidents occur.
Why it matters: Frequent incidents indicate systemic weaknesses.
Use case: Track trends and correlate spikes to changes in deployment frequency or architecture.
How to measure it:
Count incidents per day, week, or month
Group by severity or system
Trend over time to identify stability patterns.
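A quick sketch of counting incidents per ISO week; in practice the dates would come from your incident management system:

```python
from collections import Counter
from datetime import date

# Incident open dates; illustrative values standing in for real incident data.
incident_dates = [
    date(2024, 5, 1), date(2024, 5, 2), date(2024, 5, 9),
    date(2024, 5, 10), date(2024, 5, 10), date(2024, 5, 21),
]

per_week = Counter(d.isocalendar().week for d in incident_dates)
for week, count in sorted(per_week.items()):
    print(f"Week {week}: {count} incidents")
```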
Test Failure Rate
What it measures: The percentage of failed tests during builds and pipelines.
Why it matters: Frequent failures indicate brittle tests or unstable code.
Use case: Improve test reliability and pipeline health.
How to measure it:
Count total tests executed
Count failed tests
Test Failure Rate = (Failed Tests ÷ Total Tests) × 100
Pull data from CI pipelines.
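As a sketch, the per-run calculation is straightforward; the run counts below are illustrative stand-ins for numbers pulled from your CI API or test reports:

```python
def test_failure_rate(total_tests: int, failed_tests: int) -> float:
    """Percentage of executed tests that failed in a pipeline run."""
    return failed_tests / total_tests * 100 if total_tests else 0.0

# Per-run (total, failed) counts, e.g. parsed from test reports or a CI API.
pipeline_runs = [(1200, 14), (1200, 3), (1230, 25)]

for run, (total, failed) in enumerate(pipeline_runs, start=1):
    print(f"Run {run}: {test_failure_rate(total, failed):.2f}% failed")
```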
Rollback Rate
What it measures: How often releases must be reverted.
Why it matters: Rollbacks signal rushed or unsafe deployments.
Use case: Introduce canary releases, feature flags, and better validation.
How to measure it:
Count production deployments
Count how many required rollbacks
Rollback Rate = (Rollbacks ÷ Total Deployments) × 100
Track via deployment logs.
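A minimal sketch of rollback rate broken down by service, assuming deployment events carry a flag marking whether they were reverted (the record shape is an assumption):

```python
from collections import defaultdict

# Deployment events with an outcome flag; service names and values are illustrative.
deployments = [
    {"service": "checkout", "rolled_back": False},
    {"service": "checkout", "rolled_back": True},
    {"service": "search",   "rolled_back": False},
    {"service": "search",   "rolled_back": False},
]

stats = defaultdict(lambda: {"total": 0, "rollbacks": 0})
for d in deployments:
    stats[d["service"]]["total"] += 1
    stats[d["service"]]["rollbacks"] += d["rolled_back"]

for service, s in stats.items():
    rate = s["rollbacks"] / s["total"] * 100
    print(f"{service}: rollback rate {rate:.0f}%")
```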
Service Availability
What it measures: Uptime and reliability of systems.
Why it matters: Availability is the customer’s perception of stability.
Use case: Track against SLOs and error budgets.
How to measure it:
Measure total service uptime vs downtime
Availability = (Uptime ÷ Total Time) × 100
Use monitoring tools and SLO dashboards.
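A simple sketch of availability over a 30-day window, compared against an illustrative SLO target; the downtime windows are made up for the example:

```python
from datetime import timedelta

# Downtime windows over a 30-day period, e.g. derived from monitoring alerts.
period = timedelta(days=30)
downtime = [timedelta(minutes=12), timedelta(minutes=45), timedelta(minutes=7)]

uptime = period - sum(downtime, timedelta())
availability = uptime / period * 100
slo_target = 99.9  # illustrative SLO

print(f"Availability: {availability:.3f}% (SLO {slo_target}%)")
print("Within SLO" if availability >= slo_target else "Error budget exceeded")
```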
Connecting Quality to Flow
Speed without stability is chaos. An organisation that optimises only for flow metrics - such as deployment frequency, change lead time, and throughput - may appear high-performing on the surface while quietly accumulating hidden risk underneath. When quality is not measured alongside speed, teams can unknowingly trade reliability for velocity, leading to fragile systems, rising defect rates, and a growing operational burden.
Quality metrics must therefore be analysed in context with flow metrics. For example, an increase in deployment frequency should ideally be accompanied by stable or decreasing change failure rates. If faster delivery is paired with rising defects or rollbacks, it signals that quality controls, testing, or design practices are being bypassed. Similarly, long change lead times combined with high defect escape rates often indicate bottlenecks in validation, unclear requirements, or technical debt slowing down safe delivery.
By correlating flow and quality data, organisations can understand not just how fast they are moving, but how safely they are doing so. This balance is the hallmark of mature engineering systems, where speed is not achieved by cutting corners, but by strengthening the delivery pipeline.
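As a rough sketch of what that correlation might look like in practice, the snippet below pairs a flow metric (weekly deployments) with a quality metric (change failure rate) and flags weeks where delivery sped up while failures jumped; the numbers and thresholds are illustrative, not a prescribed rule:

```python
# Weekly snapshots combining a flow metric with a quality metric; values are illustrative.
weeks = [
    {"week": "W18", "deployments": 14, "failure_rate": 4.0},
    {"week": "W19", "deployments": 21, "failure_rate": 4.5},
    {"week": "W20", "deployments": 30, "failure_rate": 11.0},
]

for prev, curr in zip(weeks, weeks[1:]):
    faster = curr["deployments"] > prev["deployments"]
    riskier = curr["failure_rate"] > prev["failure_rate"] * 1.5  # illustrative threshold
    if faster and riskier:
        print(f"{curr['week']}: delivery sped up but failure rate jumped - review quality gates")
```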
Turning Quality Metrics into Improvement
Quality and stability metrics are not tools for blame; they are tools for learning and system improvement. Their purpose is not to point fingers, but to reveal patterns that would otherwise remain invisible.
They help answer critical questions such as:
Where are defects originating in the delivery lifecycle?
Which systems or components are most fragile?
Which types of changes carry the highest risk?
What processes, controls, or skills need strengthening?
When teams review these metrics regularly, they can shift from reactive firefighting to proactive prevention. Instead of repeatedly fixing the same issues in production, they can address root causes through better testing strategies, improved design, automation, and clearer feedback loops. Over time, this creates a virtuous cycle: higher quality reduces rework and incidents, which in turn improves flow, predictability, and trust across the organisation.
Closing Thought
True engineering maturity is not measured by how fast you deliver, but by how reliably you do so. Quality and stability metrics ensure that your delivery system is not just fast, but resilient, trustworthy, and sustainable.



