
Why Your Engineering Metrics Are Lying to You

Most enterprise teams measure engineering metrics wrong. Here is how to fix your measurement approach and get insights that actually drive improvement.

Antonio J. del Águila

Knaisoma

Every engineering leader we talk to these days has a metrics dashboard. Deployment frequency, lead time for changes, change failure rate, mean time to recovery: the four DORA metrics popularized by Accelerate, often supplemented with SPACE dimensions or DevEx signals, are everywhere. And yet, most of the teams we work with are measuring them wrong.

After spending decades helping organizations of every scale improve their engineering practices, we have seen the same pattern repeat itself: teams adopt a metrics framework, celebrate the numbers going up, and then wonder why their actual delivery outcomes have not improved. The dashboard says elite, but the customers say otherwise.

Let us walk through why, and more importantly, how to fix it.

The measurement trap

The first problem is what we call vanity deploys. A team at a large financial services company we advised had a deployment frequency of 45 deploys per day. Impressive, right? When we dug in, over 80% of those were config changes, feature flag toggles, and README updates pushed through the same pipeline. Their actual meaningful feature deployments happened roughly twice a week.

Here is what this looks like in practice. Their CI/CD pipeline made no distinction:

```yaml
# Every merge to main counts as a "deployment"
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh  # Deploys everything, counts everything
```
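A first step toward honest counts is classifying what each merge actually changed before it is counted as a deployment. Here is a minimal sketch: the path conventions (`src/`, `app/` for application code) and the category names are illustrative assumptions, not taken from that team's pipeline:

```shell
# Classify a deployment by the files it changed, so "meaningful"
# deploys can be counted separately from config/docs noise.
# The path conventions below are assumptions for illustration.
classify_deploy() {
  for f in "$@"; do
    case "$f" in
      src/*|app/*) echo "feature"; return 0 ;;  # application code changed
    esac
  done
  echo "noise"  # only docs, flags, or config changed
}

classify_deploy src/checkout.py README.md   # prints "feature"
classify_deploy README.md flags/beta.json   # prints "noise"
```

Emitting the category as a deploy annotation (rather than a gate) keeps the pipeline fast while making the later metric queries honest.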

The second trap is gaming lead time. We have seen teams break large features into dozens of trivial PRs to shrink their lead time metric. One team had an average PR size of 12 lines of code. Their lead time looked phenomenal, but their actual time-to-value for a customer-facing feature was measured in months, not days.

The third, and perhaps most dangerous, is hiding change failure rate. If your definition of “failure” requires a customer support ticket to be filed before it counts, you are lying to yourself. Silent data corruption, degraded performance, and partial outages that self-heal are all failures.

```shell
# What teams report as "change failure rate"
total_incidents=$(grep -c "SEV1\|SEV2" incidents.log)
total_deploys=$(wc -l < deploys.log)
echo "CFR: $(echo "scale=2; $total_incidents / $total_deploys * 100" | bc)%"

# What they should be measuring
total_failures=$(grep -c "SEV1\|SEV2\|rollback\|hotfix\|degraded" incidents.log)
total_meaningful_deploys=$(grep -c "type:feature\|type:fix" deploys.log)
echo "Real CFR: $(echo "scale=2; $total_failures / $total_meaningful_deploys * 100" | bc)%"
```

What elite teams actually measure

The teams that genuinely perform at elite levels do something different. They treat frameworks like DORA, SPACE, and DevEx as menus to select from, not checklists to complete. DORA provides four tightly scoped delivery metrics. SPACE broadens the lens to satisfaction, performance, activity, communication, and efficiency. DevEx focuses on flow state, cognitive load, and feedback loops. Mature organizations blend elements from several of these frameworks to build a measurement system that fits their context.

Meaningful change failure rate. Rather than counting incidents over deploys, elite teams categorize failures by blast radius and customer impact. A deployment that causes a 2% increase in P99 latency for your checkout service is not the same as a typo in an internal admin tool. Weight your failures accordingly.
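As a sketch of what weighting might look like, assume a failure log that records the affected component, the failure mode, and whether the impact was customer-facing. The log format and the 1.0 / 0.1 weights are invented for illustration; real weights should come from your own impact analysis:

```shell
# Weight failures by blast radius instead of counting them flat.
# Log format and the 1.0 / 0.1 weights are illustrative assumptions.
cat > failures.log <<'EOF'
checkout latency-degradation customer-facing
admin-tool typo internal
payments partial-outage customer-facing
EOF

awk '{ sum += ($3 == "customer-facing") ? 1.0 : 0.1; n++ }
     END { printf "weighted failures: %.1f (raw count: %d)\n", sum, n }' failures.log
```

Three raw failures collapse to 2.1 weighted ones here; the point is that the internal typo stops dominating the same metric as a payments outage.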

Real lead time, from idea to impact. The best teams we have worked with measure lead time from when a feature is prioritized (not when coding starts) to when it is validated in production with real users. This captures the organizational overhead that pure code-to-deploy metrics miss entirely.

  • Track cycle time at each stage: ideation, design, development, review, testing, deployment, validation
  • Identify which stages introduce the most delay
  • Separate waiting time from working time. The ratio is usually shocking.
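To make the waiting-versus-working split concrete, here is a sketch over a hypothetical per-stage log: each line is a stage from the list above with start and end expressed as day offsets from prioritization. The format and numbers are invented for illustration:

```shell
# Split idea-to-impact lead time into working vs waiting time.
# Format: <stage> <start_day> <end_day> -- offsets are illustrative.
cat > stages.log <<'EOF'
ideation    0  2
design      4  7
development 11 14
review      18 19
deployment  22 23
validation  25 28
EOF

awk 'NR > 1 { waiting += $2 - prev_end }     # gap between stages = waiting
     { working += $3 - $2; prev_end = $3 }   # time inside a stage = working
     END { printf "working: %d days, waiting: %d days (%d%% waiting)\n",
                  working, waiting, waiting * 100 / (working + waiting) }' stages.log
```

Even in this made-up example, more than half the elapsed time is queuing between stages, which is exactly the overhead a pure code-to-deploy metric never sees.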

Recovery as a capability, not just a metric. MTTR as a number is less useful than understanding your recovery patterns. Do you always roll back? Do you fix forward? How many people need to be involved? One insurance company we worked with had a 15-minute MTTR, but it required paging 8 engineers every time. That is not sustainable.
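One way to see the pattern behind the number is to log how each recovery happened, not just how long it took. A sketch over a hypothetical incident log (the format and figures are invented):

```shell
# Summarize recovery *patterns*, not just average MTTR.
# Format: <incident> <minutes_to_recover> <action> <engineers_paged>
cat > recoveries.log <<'EOF'
INC-1 12 rollback 8
INC-2 15 rollback 7
INC-3 18 fix-forward 2
EOF

awk '{ mins += $2; paged += $4; n++; actions[$3]++ }
     END {
       printf "avg MTTR: %.1f min, avg engineers paged: %.1f\n", mins/n, paged/n
       for (a in actions) printf "  %s: %d\n", a, actions[a]
     }' recoveries.log
```

A 15-minute average MTTR looks elite until the second column shows it costs five to eight engineers per incident.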

From metrics to KPIs: choosing what matters for your organization

Not every metric should be a KPI. A metric provides visibility into how something is performing. A KPI is tied to a strategic objective and triggers action when it moves. Tracking 30 metrics gives you observability; promoting 3 to 5 of them to KPI status gives you focus.

The distinction matters because KPIs shape behavior. When you make something a KPI, people will optimize for it. If you pick the wrong ones, you get Goodhart’s Law in full force: the metric becomes the target and ceases to be a useful measure.

Business type determines metric priority

The right KPIs depend on what your organization actually does.

B2C product companies should prioritize deployment frequency and lead time. Speed to market is the competitive advantage. If a competitor can ship a feature in days while you take weeks, no amount of reliability will save you. Your KPIs should reflect how quickly validated ideas reach real users.

B2B enterprise platforms should prioritize change failure rate and reliability. When your customers have SLAs, contractual uptime commitments, and compliance requirements baked into their procurement process, a fast release that breaks production costs you more than a slow one that works. Your KPIs should reflect stability and predictability.

Regulated industries need metrics that demonstrate compliance alongside delivery performance. A deployment pipeline that cannot produce an audit trail is a liability, regardless of how fast it ships. Consider KPIs that combine delivery speed with evidence of control: percentage of deployments with full traceability, time from vulnerability disclosure to patched deployment, or policy-as-code pass rates.
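The traceability KPI mentioned above can be computed from the deploy log itself, provided traced deployments carry some audit marker. A minimal sketch; the log format and the `audit:<id>` convention are assumptions, not a standard:

```shell
# "Percentage of deployments with full traceability" as a KPI.
# Log format and the audit:<id> convention are illustrative assumptions.
cat > deploy_audit.log <<'EOF'
2024-05-01 deploy svc-a audit:ab12
2024-05-01 deploy svc-b
2024-05-02 deploy svc-a audit:cd34
2024-05-02 deploy svc-c audit:ef56
EOF

traced=$(grep -c "audit:" deploy_audit.log)
total=$(wc -l < deploy_audit.log)
echo "traceable deployments: $((traced * 100 / total))%"
```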

Platform and infrastructure teams should measure developer experience and cognitive load reduction. If your internal platform exists to make product teams faster, then your KPIs should reflect their experience, not your own throughput. Time-to-first-deployment for a new service, developer satisfaction scores, and support ticket volume are stronger signals than the number of features your platform ships.

Aligning metrics with strategic objectives

A practical exercise for choosing KPIs:

  1. Start with business objectives. What does the company need to achieve this year? Revenue growth, market expansion, regulatory compliance, customer retention?
  2. Map to engineering capabilities. What engineering behaviors support those objectives? Faster delivery, higher reliability, better security posture, reduced operational cost?
  3. Identify candidate metrics. Which metrics, from whichever framework, actually measure those behaviors?
  4. Promote the critical few. Select 3 to 5 metrics where movement directly correlates with business outcomes. Those are your KPIs. Everything else remains a supporting metric, tracked but not elevated to strategic status.
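The output of this exercise can live as a small, reviewable artifact next to your dashboards, so the objective-to-KPI mapping is explicit and versioned. A hypothetical example for a B2B platform whose objective is customer retention; every name and target here is invented for illustration:

```yaml
# Hypothetical KPI definition for one objective.
# Names, targets, and structure are illustrative, not a recommended set.
objective: customer-retention
engineering_capability: higher-reliability
kpis:
  - name: weighted-change-failure-rate
    target: "< 5%"
  - name: p99-checkout-latency
    target: "< 800 ms"
supporting_metrics:
  - deployment-frequency
  - pr-cycle-time
```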

The anti-pattern of adopting someone else’s KPIs

One of the most common mistakes we see is teams adopting another company’s KPI framework wholesale. “Spotify uses these metrics, so we should too.” “Google tracks these four signals, so that is what elite looks like.”

Frameworks are menus, not mandates. The metrics that matter for a 50-person startup building a consumer app are fundamentally different from those that matter for a 5,000-person enterprise running a regulated financial platform. Copying someone else’s KPIs is like copying their org chart: it only works if you have the same problems, the same customers, and the same constraints. You almost certainly do not.

Review cadence

KPIs should not be permanent. Build a quarterly review into your operating rhythm:

  • Validate correlation. Is this KPI still correlated with the business outcome it was chosen to represent? If deployment frequency went up but time-to-market did not improve, the correlation is broken.
  • Retire stale KPIs. A KPI that drove improvement for two quarters may have served its purpose. If the team has reached a sustainable level, retire it and promote a new bottleneck metric.
  • Resist KPI inflation. There is constant pressure to add more KPIs. Resist it. Every KPI you add dilutes focus. If someone proposes a new KPI, ask what existing one it replaces.

Building a metrics culture

The hardest part of engineering metrics is not the measurement itself. It is building a culture where the metrics drive improvement rather than anxiety.

Start blameless. If your change failure rate goes up after you improve your detection, that is a win, not a problem. We worked with a healthcare technology company where the engineering VP celebrated their CFR going from 2% to 15% because it meant they were finally catching issues their monitoring had been missing for years.

Make metrics a conversation, not a scorecard. The most effective teams we have seen use engineering metrics as input to their retrospectives, not as performance reviews. When deployment frequency drops, the question is “what obstacles appeared?” not “who is underperforming?”

Iterate on your measurement approach. Your metrics pipeline should be treated like a product:

  • Review metric definitions quarterly
  • Validate that metrics correlate with business outcomes
  • Sunset metrics that drive the wrong behavior
  • Add context dimensions as your understanding matures

The goal is not to be “elite” on a dashboard. The goal is to deliver value to your customers faster and more reliably. If your engineering metrics are not helping you do that, it is time to rethink how you are measuring.

The numbers should serve the mission, not the other way around.

Tags: DevOps, Engineering Metrics, KPIs