Inside our incident response playbook
A production incident is a test of process, not heroics. The teams with the best mean time to resolution aren't the ones with the smartest engineers - they're the ones with the clearest playbook. Here's ours, from page to post-mortem.
Page (0-1 min)
An alert fires - say, p95 latency crosses 1200ms on the API gateway. Our on-call engineer is paged automatically through PagerDuty. No human triages whether it's "real" first; the runbook for that signal is already attached to the alert.
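To make the trigger concrete, here is a minimal sketch of the kind of check that sits behind such an alert. The 1200 ms threshold comes from the example above; the function names, the nearest-rank percentile method, and the sample window are assumptions for illustration, not a description of how PagerDuty or any particular monitoring system evaluates alerts.

```python
P95_THRESHOLD_MS = 1200  # threshold taken from the example alert above


def p95(samples_ms):
    """Return the 95th percentile latency using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, giving a 1-based rank into the sorted list.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_page(samples_ms, threshold_ms=P95_THRESHOLD_MS):
    """True when p95 latency crosses the threshold."""
    return p95(samples_ms) > threshold_ms


# Hypothetical window of 20 request latencies (ms) with two slow outliers.
window = [180] * 18 + [1500, 2400]
print(p95(window), should_page(window))  # 1500 True
```

Most requests in that window are fast, yet the alert still fires: a tail percentile is chosen precisely because it surfaces the slow minority that an average would hide.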
Acknowledge & assess (1-3 min)
The engineer acknowledges, opens the incident channel, and pulls up the relevant dashboards. The first question is always blast radius: how many users, which regions, is it getting worse?
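The three blast-radius questions can be answered mechanically once failed requests are in hand. The sketch below assumes a hypothetical input shape, (user_id, region) pairs pulled from request logs, plus a list of per-interval error counts; neither is a real dashboard API.

```python
def blast_radius(failed_requests):
    """Count distinct affected users overall and per region.

    `failed_requests` is an iterable of (user_id, region) tuples,
    a hypothetical shape for rows exported from request logs.
    """
    all_users = set()
    users_by_region = {}
    for user_id, region in failed_requests:
        all_users.add(user_id)
        users_by_region.setdefault(region, set()).add(user_id)
    per_region = {region: len(users) for region, users in users_by_region.items()}
    return len(all_users), per_region


def is_worsening(error_counts):
    """True if the latest interval saw more errors than the one before it."""
    return len(error_counts) >= 2 and error_counts[-1] > error_counts[-2]


# Made-up sample: u1 failed twice but is counted once.
rows = [("u1", "eu-west"), ("u2", "eu-west"), ("u1", "eu-west"), ("u3", "us-east")]
total, per_region = blast_radius(rows)
print(total, per_region, is_worsening([4, 9, 15]))
```

Counting distinct users rather than failed requests matters here: one client retrying in a loop would otherwise make the impact look far wider than it is.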
Diagnose (3-5 min)
Root cause analysis starts from the signal, not a guess. Connection pool exhausted on a database replica? The metrics say so before anyone speculates.
- Check recent deploys and config changes first - most incidents are self-inflicted.
- Correlate the latency spike with resource saturation graphs.
- Confirm the hypothesis with logs before acting.
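The first two checks above can be sketched in a few lines. Everything named here is an assumption for illustration: the 30-minute lookback window, the change names, and the sample series are made up, and the Pearson coefficient is one simple way to quantify "correlate", not necessarily what a given dashboard computes.

```python
from datetime import datetime, timedelta
from math import sqrt


def recent_changes(changes, spike_start, lookback=timedelta(minutes=30)):
    """Return names of deploys/config changes in the window before the spike."""
    return [name for name, at in changes
            if spike_start - lookback <= at <= spike_start]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with 2+ points")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one deploy shortly before the spike, one hours earlier.
spike = datetime(2024, 1, 1, 12, 0)
changes = [("api deploy", datetime(2024, 1, 1, 11, 50)),
           ("config change", datetime(2024, 1, 1, 9, 0))]
latency_ms = [200, 220, 640, 1300, 1900]
pool_in_use = [0.40, 0.45, 0.80, 0.98, 1.00]  # fraction of pool connections busy

print(recent_changes(changes, spike))
print(round(pearson(latency_ms, pool_in_use), 2))
```

A high coefficient only says the two graphs move together; it does not say which one is the cause, which is why the last step is still to confirm the hypothesis in the logs.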
Mitigate (5-11 min)
Stop the bleeding before chasing the perfect fix: scale the pool, fail over to a healthy replica, or roll back the bad deploy. The goal is to restore service for users first, then investigate at leisure.
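# Add API replicas to absorb load (failover is a separate step, not shown here)
kubectl scale deploy/api --replicas=12
# Raise the Postgres connection limit; max_connections only takes effect after a server restart
psql -c "ALTER SYSTEM SET max_connections = 120;"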
Our median time from page to resolution sits at around 11 minutes.
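A median and a mean can tell very different stories on incident data, because a single long incident drags the mean upward. The sketch below uses made-up durations, not our real incident history, to show the gap.

```python
from statistics import mean, median

# Hypothetical page-to-resolution times in minutes; the last one is an outlier.
resolution_minutes = [7, 9, 11, 12, 48]

typical = median(resolution_minutes)  # middle value, robust to the outlier
average = mean(resolution_minutes)    # pulled upward by the 48-minute incident

print(f"median: {typical} min, mean: {average:.1f} min")  # median: 11 min, mean: 17.4 min
```

Whichever statistic is reported, it helps to say which one it is, since "MTTR" is usually read as a mean.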
Post-mortem (next day)
Every incident gets a blameless write-up: timeline, root cause, and - most importantly - the action items that make it impossible to recur. Those action items go into the backlog with owners, not into a document that's never read again.
This is what an optional Premium SLA buys: a dedicated engineer on call, proactive monitoring, and war-room support when it matters most. See the pricing model for the full SLA tiers.