← Back to all heuristics

You Cannot Rely on People Under Stress

Design for tired humans.

ReliabilityOperationsCollaborationSRE

Heuristic

Design systems that are safe for tired humans.

Hook

Incidents are the worst time to improvise.

Problem

During outages, humans are stressed and error-prone. If recovery depends on perfect manual execution, failures compound.

Why it matters

Designing for human limitations reduces recovery time and prevents repeat incidents. It also protects teams from burnout.

Signals you are here

  • Incidents rely on heroic manual fixes
  • Runbooks are unclear or missing
  • Recovery steps require special access
  • On-call rotations are exhausting
  • Repeated manual paging for the same class of failures

Anti-patterns

  • Manual gatekeeping during incidents
  • Undocumented emergency procedures
  • Complex recovery steps with no automation
  • Overreliance on a few experts

Try this

  • Automate rollback and failover
  • Build fault-tolerant paths and self-healing to reduce callouts
  • Practice incident drills regularly
  • Keep runbooks concise and current
  • Use safe defaults and guardrails
  • Reduce the number of manual steps

Example

A service introduced automated rollback when error rates spiked. It prevented a widespread outage during a late-night incident when the on-call engineer was alone.

Reflection prompt

Which recovery step requires the most manual effort? Automate or simplify it.

More like this

Heuristic

Runbooks Are a Bridge Between Dev and Ops

Runbooks turn knowledge into action.

OperationsReliabilitySRE

Heuristic

You Build It, You Run It

Build it, run it.

CollaborationOperationsSRE

Heuristic

Fail Closed, Log Everything, Recover Gracefully

Safe failure beats quiet failure.

ReliabilitySecuritySecurity

Heuristic

Every Output Is Someone Else's Input

Handoff quality sets the pace of flow.

FlowCollaboration

Heuristic

Increase Contrast, Not Volume

Prompt length does not guarantee novelty. Context contrast does.

ArchitectureOperations

Heuristic

Blame the Process, Not People

Fix the system.

LearningReliability