Operations

Incident Readiness

Foundational

Incidents are not a question of if, but when. What separates a contained event from a crisis is whether you prepared before it happened: who is in charge, how you communicate, how fast you can detect and roll back, and whether you keep the regulatory clock in mind. You build readiness on calm days so it is there on the worst one.

Incident readiness covers the whole lifecycle. Detect quickly (good observability and alerting). Respond calmly (clear roles, a known process, the ability to roll back). Communicate honestly (internally and, where required, to customers and regulators). Learn afterwards (a blameless review that fixes the system). The aim is to shrink both the time to detect and the time to recover, and to make the response a rehearsed routine rather than improvisation.

Our regulated context adds two duties that ordinary outages do not carry. First, a security or personal-data breach starts a regulatory clock: under GDPR, a personal-data breach must generally be reported to the supervisory authority without undue delay and, where feasible, within 72 hours of becoming aware of it. Recognising "this is a reportable incident" and escalating it is therefore a critical step in its own right. Second, you must preserve the evidence and audit trail throughout; never destroy them in the rush to fix.
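To make the regulatory clock concrete, the sketch below computes a notification deadline from the moment the organisation became aware of a breach. It assumes the 72-hour window of GDPR Article 33(1); the function names are illustrative, and the window that actually applies should be confirmed with your compliance team.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a 72-hour window, as in GDPR Article 33(1).
# Confirm the applicable window with your compliance team.
NOTIFICATION_WINDOW = timedelta(hours=72)


def notification_deadline(aware_at: datetime) -> datetime:
    """Return the latest time by which the regulator should be notified."""
    if aware_at.tzinfo is None:
        # Naive timestamps are ambiguous across time zones, so reject them.
        raise ValueError("aware_at must be timezone-aware")
    return aware_at + NOTIFICATION_WINDOW


def time_remaining(aware_at: datetime, now: datetime) -> timedelta:
    """Return how much of the window is left (negative once it is missed)."""
    return notification_deadline(aware_at) - now


if __name__ == "__main__":
    aware = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
    print("Notify by:", notification_deadline(aware).isoformat())
```

Recording the "became aware" timestamp in UTC at the moment of escalation keeps the deadline unambiguous for everyone involved in the response.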

Be ready before it happens

Do: Alert on the symptoms users feel (error rate, latency, failed payments/KYC) so that monitoring, not customers, is what detects incidents.
Do: Define the response up front: who is on call, who runs the incident, how to escalate, and where coordination happens. Write it down; do not improvise.
Do: Keep recovery fast and rehearsed: tested rollback, runbooks for likely failures, and access ready for the people who will need it.
Consider: Practising with game days and drills, so the process is familiar and the gaps are found before a real incident finds them.
Always: Have a way to recognise and escalate a security or personal-data breach, because it starts a regulatory notification clock measured in hours.
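As an illustration of alerting on symptoms rather than internal causes, the following minimal sketch checks one evaluation window of request statistics against user-facing thresholds. The `WindowStats` type, the `breached_symptoms` function, and all threshold values are hypothetical; in practice this logic would normally live in your monitoring system's alert rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Request statistics for one evaluation window (e.g. the last 5 minutes)."""
    total_requests: int
    failed_requests: int
    p95_latency_ms: float


def breached_symptoms(
    stats: WindowStats,
    max_error_rate: float = 0.02,       # hypothetical: alert above 2% failures
    max_p95_latency_ms: float = 800.0,  # hypothetical latency threshold
    min_requests: int = 50,             # skip windows too small to judge
) -> list[str]:
    """Return the names of user-facing symptoms that exceed their thresholds."""
    if stats.total_requests < min_requests:
        return []
    breached = []
    if stats.failed_requests / stats.total_requests > max_error_rate:
        breached.append("error_rate")
    if stats.p95_latency_ms > max_p95_latency_ms:
        breached.append("latency")
    return breached


if __name__ == "__main__":
    window = WindowStats(total_requests=1000, failed_requests=50, p95_latency_ms=300.0)
    print(breached_symptoms(window))
```

The `min_requests` guard avoids noisy alerts on tiny samples, but it also means a total loss of traffic would go unnoticed by this check, so a separate "no traffic" alert is worth considering.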

Respond and learn

Do: First stop the harm and restore service, then diagnose. Preserve evidence while you do.
Do: Communicate clearly and honestly during an incident (status, impact, and next update) to stakeholders, and to customers and regulators where obligations require.
Do: Run a blameless post-incident review that finds the systemic cause and produces tracked fixes, not a culprit (see Ownership & Accountability).
Consider: Tracking time-to-detect and time-to-recover over time, and closing out post-incident actions as real work, not good intentions.
Always: Preserve the audit trail and evidence during incident response. Record what was done and when, and never destroy regulated evidence in the rush.
Never: Hide, downplay, or fail to escalate a security or data-breach incident. Concealment turns a manageable event into a regulatory and trust catastrophe.
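Tracking time-to-detect and time-to-recover can start very simply. The sketch below derives both from three timestamps per incident and averages them. The `Incident` record is hypothetical, and measuring recovery from the start of impact (rather than from detection) is an assumption; pick one definition and apply it consistently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with the three timestamps that matter here."""
    started_at: datetime    # when user impact began
    detected_at: datetime   # when monitoring or a person noticed
    recovered_at: datetime  # when service was restored

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_recover(self) -> timedelta:
        # Assumption: measured from the start of impact, not from detection.
        return self.recovered_at - self.started_at


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    if not durations:
        raise ValueError("need at least one duration")
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    incidents = [
        Incident(datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 10), datetime(2024, 3, 1, 11, 0)),
        Incident(datetime(2024, 4, 2, 2, 0), datetime(2024, 4, 2, 2, 20), datetime(2024, 4, 2, 4, 0)),
    ]
    print("Mean time to detect:", mean_duration([i.time_to_detect for i in incidents]))
    print("Mean time to recover:", mean_duration([i.time_to_recover for i in incidents]))
```

With only a handful of incidents, the trend across quarters is usually more informative than any single average.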

Quiet fix, clock ignored

// noticed customer data was exposed via a misconfig
// quietly fixed it, didn't tell anyone, no record kept

A personal-data breach was patched in silence. The GDPR notification window is being missed, no evidence was kept, and a contained, reportable incident is now also a concealment, which is far more serious.

Contain, escalate, record

1. Contain the exposure (close the misconfig, rotate affected keys)
2. Escalate immediately as a suspected data breach (starts the clock)
3. Preserve logs/evidence; record the timeline of actions
4. Blameless review → tracked fixes so the class cannot recur

The harm is stopped, the regulatory duty is met on time, the evidence is intact, and the system is improved. That is the difference between an incident handled well and a crisis.

Self-review checklist

Ask: If this broke at 3 a.m., would monitoring catch it, and would the on-call person know what to do?
Ask: Could this be a security or personal-data breach? If so, has the clock been started and the right people told?
Ask: Are we preserving the evidence and audit trail while we fix it?
Ask: After this, will we run a blameless review and actually close the follow-up actions?

Why it matters: Every system fails eventually. Readiness decides whether that failure is a brief, well-handled blip or a long, reputation-damaging crisis. In a regulated business, how an incident is recognised, escalated, evidenced, and reported can matter even more than the original fault. Concealment or a missed notification deadline turns a manageable event into a far graver one.