Hook
Incidents are the worst time to improvise.
Problem
During outages, people are stressed and error-prone. If recovery depends on flawless manual execution, mistakes compound the original failure.
Why it matters
Designing for human limitations shortens recovery time and makes repeat incidents less likely. It also protects teams from burnout.
Signals you are here
- Incidents rely on heroic manual fixes
- Runbooks are unclear or missing
- Recovery steps require special access
- On-call rotations are exhausting
- The same class of failure keeps paging someone to fix it by hand
Anti-patterns
- Manual gatekeeping during incidents
- Undocumented emergency procedures
- Complex recovery steps with no automation
- Overreliance on a few experts
Try this
- Automate rollback and failover
- Build fault-tolerant paths and self-healing mechanisms to reduce callouts
- Practice incident drills regularly
- Keep runbooks concise and current
- Use safe defaults and guardrails
- Reduce the number of manual steps
Example
A team added automated rollback that triggers when a service's error rate spikes. During a late-night incident, with the on-call engineer working alone, it prevented a widespread outage.
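One way such a trigger might look is sketched below in Python. This is an illustration under stated assumptions, not the mechanism the service in the example actually used: the `RollbackGuard` class, its `threshold`, `window`, and `min_samples` parameters, and the `rollback` callback are all hypothetical names, and the default values are placeholders rather than recommendations.

```python
from collections import deque
from typing import Callable


class RollbackGuard:
    """Hypothetical guard: calls `rollback` once when the recent error rate
    reaches `threshold`. All names and defaults here are illustrative."""

    def __init__(self, rollback: Callable[[], None], threshold: float = 0.2,
                 window: int = 50, min_samples: int = 20) -> None:
        self.rollback = rollback          # assumed hook into the deploy system
        self.threshold = threshold        # error fraction that triggers rollback
        self.min_samples = min_samples    # avoid reacting to a handful of requests
        self.results = deque(maxlen=window)  # sliding window of recent outcomes
        self.rolled_back = False

    def error_rate(self) -> float:
        """Fraction of failed requests in the current window."""
        if not self.results:
            return 0.0
        failures = sum(1 for ok in self.results if not ok)
        return failures / len(self.results)

    def record(self, ok: bool) -> bool:
        """Record one request outcome; return True if this call triggered rollback."""
        self.results.append(ok)
        if self.rolled_back or len(self.results) < self.min_samples:
            return False
        if self.error_rate() >= self.threshold:
            self.rolled_back = True       # fire at most once per guard
            self.rollback()
            return True
        return False
```

In use, the request handler or a metrics consumer would call `guard.record(ok)` for each outcome, and `rollback` would wrap whatever the deployment tooling already provides. The point of the design is that the decision to roll back is made by a rule agreed on in calm conditions, so nobody has to judge it alone in the middle of the night.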
Reflection prompt
Which recovery step requires the most manual effort? Automate or simplify it.