Apr 30, 2026 · 7 min read

Inside our incident response playbook

SRE · On-call · Reliability

A production incident is a test of process, not heroics. The teams with the best mean-time-to-resolution aren't the ones with the smartest engineers - they're the ones with the clearest playbook. Here's ours, page to post-mortem.

Page (0-1 min)

An alert fires - say, p95 latency crosses 1200ms on the API gateway. Our on-call engineer is paged automatically through PagerDuty. No human triages whether it's "real" first; the runbook for that signal is already attached to the alert.
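As a rough illustration of the kind of rule behind that page, the sketch below computes a nearest-rank p95 over a window of latency samples and compares it to the 1200ms threshold. The function names, the shape of the sample window, and the nearest-rank method are assumptions for illustration; in practice the rule lives in the monitoring stack, not in application code.

```python
P95_THRESHOLD_MS = 1200  # threshold taken from the example above


def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n), done in integer arithmetic to avoid float rounding
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]  # rank is 1-based


def should_page(samples_ms, threshold_ms=P95_THRESHOLD_MS):
    """True when the p95 of the window is above the threshold."""
    return p95(samples_ms) > threshold_ms
```

With 20 samples, the p95 is the 19th value in sorted order, so two slow requests out of 20 are enough to trip the rule while one is not.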

Acknowledge & assess (1-3 min)

The engineer acknowledges, opens the incident channel, and pulls up the relevant dashboards. The first question is always blast radius: how many users, which regions, is it getting worse?
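The blast-radius questions can be answered mechanically from per-region counters. Here is a minimal sketch, assuming we have request and error counts per region for two consecutive windows. `RegionSample`, `blast_radius`, and the 5% error-rate cutoff are hypothetical names and values, and failed requests stand in as a proxy for affected users.

```python
from dataclasses import dataclass


@dataclass
class RegionSample:
    region: str
    requests: int
    errors: int


def error_rate(sample):
    """Fraction of failed requests; 0.0 when the region saw no traffic."""
    return sample.errors / sample.requests if sample.requests else 0.0


def overall_error_rate(samples):
    total = sum(s.requests for s in samples)
    return sum(s.errors for s in samples) / total if total else 0.0


def blast_radius(previous, current, threshold=0.05):
    """Summarise which regions are affected and whether things are getting worse.

    previous/current: lists of RegionSample for two consecutive windows.
    """
    now = overall_error_rate(current)
    return {
        "affected_regions": sorted(
            s.region for s in current if error_rate(s) > threshold
        ),
        "failed_requests": sum(s.errors for s in current),
        "overall_error_rate": now,
        "getting_worse": now > overall_error_rate(previous),
    }
```

Comparing two windows rather than reading a single number is what answers "is it getting worse?"; one snapshot only tells you how bad it is right now.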

Diagnose (3-5 min)

Root cause analysis starts from the signal, not a guess. Connection pool exhausted on a database replica? The metrics say so before anyone speculates.

  • Check recent deploys and config changes first - most incidents are self-inflicted.
  • Correlate the latency spike with resource saturation graphs.
  • Confirm the hypothesis with logs before acting.
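The first item on that checklist is easy to automate. The sketch below, under the assumption that deploys and config changes are available as `(name, timestamp)` pairs, lists the changes that landed shortly before the spike. The `suspect_changes` helper and the 30-minute lookback are illustrative choices, not a description of any particular tool.

```python
from datetime import datetime, timedelta


def suspect_changes(changes, spike_start, lookback=timedelta(minutes=30)):
    """Return changes that landed within `lookback` before the spike, newest first.

    changes: iterable of (name, deployed_at) tuples.
    """
    window_start = spike_start - lookback
    recent = [
        (name, at) for name, at in changes if window_start <= at <= spike_start
    ]
    return sorted(recent, key=lambda change: change[1], reverse=True)
```

Sorting newest first puts the most likely culprit at the top, but a match here is only a hypothesis; it still needs confirming against the saturation graphs and logs.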

Mitigate (5-11 min)

Stop the bleeding before chasing the perfect fix: scale the pool, fail over to a healthy replica, or roll back the bad deploy. The goal is to get the service healthy for users first, then investigate at leisure.

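# Scale out the API deployment to add capacity
kubectl scale deploy/api --replicas=12
# Raise the Postgres connection limit; max_connections only takes effect after a server restart
psql -c "ALTER SYSTEM SET max_connections = 120;"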

Our median time from page to resolution sits at around 11 minutes.
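For readers who want to track the same number, here is a minimal sketch of the calculation, assuming each incident is recorded as a pair of page and resolution timestamps. The function names are hypothetical.

```python
from datetime import datetime
from statistics import median


def minutes_to_resolve(paged_at, resolved_at):
    """Elapsed minutes between the page and the resolution."""
    return (resolved_at - paged_at).total_seconds() / 60


def median_resolution_minutes(incidents):
    """incidents: non-empty list of (paged_at, resolved_at) datetime pairs."""
    return median(minutes_to_resolve(p, r) for p, r in incidents)
```

The median is a deliberate choice here: a single drawn-out incident drags a mean upward, while the median still reflects the typical case.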

Post-mortem (next day)

Every incident gets a blameless write-up: timeline, root cause, and - most importantly - the action items that keep it from recurring. Those action items go into the backlog with owners, not into a document that's never read again.

This is what an optional Premium SLA buys: a dedicated engineer on-call, proactive monitoring, and war-room support when it matters most. See the pricing model for the full SLA tiers.