Incident Management Checklist
Steps an IT operations or MSP response team runs to triage, contain, and recover from a production incident — from first page through post-mortem. Covers security incidents and outage incidents, with branches for SEV1 escalation and breach-specific handling.
Detection and Triage
-
Classify the incident type
Pull the triggering alert from the SIEM, RMM, or ticket and pick the closest type. Misclassification at this step routes the incident to the wrong runbook — a suspected breach handled as a system outage loses forensic evidence in the first hour.
-
Assign severity per the SEV matrix
Use the published SEV matrix: SEV1 is a customer-impacting outage or a confirmed breach of regulated data; SEV2 is partial degradation or contained suspicious activity; SEV3/4 are limited-scope or single-user issues. When in doubt, escalate up; a downgrade is cheap, a missed escalation is expensive.
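The matrix above can be encoded so that triage tooling applies it the same way every time. The following is a minimal sketch, not a definitive implementation: the `TriageFacts` field names are hypothetical, "escalate up" is assumed to mean one level higher, and because the text does not say how SEV3 and SEV4 differ, both are reported as 3.

```python
from dataclasses import dataclass


@dataclass
class TriageFacts:
    """Hypothetical triage inputs; field names are illustrative only."""
    customer_impacting_outage: bool = False
    confirmed_regulated_breach: bool = False
    partial_degradation: bool = False
    contained_suspicious_activity: bool = False
    responder_unsure: bool = False


def assign_severity(facts: TriageFacts) -> int:
    """Return a SEV level (1 is most severe) following the matrix above."""
    if facts.customer_impacting_outage or facts.confirmed_regulated_breach:
        return 1
    if facts.partial_degradation or facts.contained_suspicious_activity:
        sev = 2
    else:
        # The text does not define the SEV3/SEV4 split, so this sketch
        # reports every limited-scope or single-user incident as 3.
        sev = 3
    # "When in doubt, escalate up": assumed here to mean one level higher.
    if facts.responder_unsure:
        sev = max(1, sev - 1)
    return sev
```

A responder who sees partial degradation but is unsure of the scope would get SEV1 from this sketch, which can be downgraded once the picture is clearer.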
-
Page the on-call responder via PagerDuty
Use the PagerDuty (or Opsgenie) escalation policy for the affected service; never DM an engineer directly. The page creates the audit trail and automatically pages the secondary if the primary doesn't acknowledge within the escalation window.
-
Open the incident bridge and ticket
Spin up the incident channel in Slack/Teams, start the Zoom bridge, and open the ticket in ServiceNow / ConnectWise / Halo. Pin the bridge link, ticket number, and incident commander assignment in the channel topic so late joiners can self-orient.
-
Document affected systems and user impact
List the affected hostnames, services, sites, and approximate user count. For MSP incidents, identify which clients are impacted — single-tenant vs. shared-infrastructure scope changes notification obligations.
Containment and Mitigation
-
Isolate affected hosts from the network
Use EDR network containment (CrowdStrike, SentinelOne, Defender) rather than yanking cables — keeps the host reachable for forensic collection while blocking lateral movement. Do not reboot or wipe; volatile memory and running-process state are evidence.
-
Activate the IR team and exec bridge
SEV1 requires the full IR rota: incident commander, comms lead, scribe, and exec sponsor. Bring legal counsel onto the bridge before any substantive discussion of cause or impact; this helps preserve attorney-client privilege over the incident record.
-
Apply temporary mitigation or workaround
Examples: failover to secondary DC, route traffic around affected node, disable the offending feature flag, block the IOC at the firewall. Document the workaround as a known-tech-debt ticket — temporary fixes that survive past the incident become permanent if not tracked.
-
Verify backup integrity and immutability
Confirm the most recent Veeam / Datto / Rubrik restore point predates the incident window and that the immutable / object-locked copy is intact. Ransomware actors target backups first; an encrypted backup repository turns a recoverable incident into a six-figure ransom decision.
-
Snapshot affected VMs for forensic preservation
Take vCenter / Hyper-V / cloud-provider snapshots and export memory captures before any remediation touches the host. Hash the artifacts and store them in a separate evidence locker; chain of custody matters if the incident becomes a legal or insurance matter.
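Hashing the exported artifacts can be scripted with the Python standard library. The sketch below assumes SHA-256 is an acceptable digest for your evidence policy; the manifest layout and the `collected_by` field are illustrative, not a forensic standard.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large memory captures never load fully into RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(artifacts: list[Path], collected_by: str) -> str:
    """Return a JSON manifest (hypothetical layout) listing each artifact's hash."""
    entries = [
        {"file": p.name, "bytes": p.stat().st_size, "sha256": sha256_file(p)}
        for p in artifacts
    ]
    manifest = {
        "collected_by": collected_by,
        "hashed_at_utc": datetime.now(timezone.utc).isoformat(),
        "artifacts": entries,
    }
    return json.dumps(manifest, indent=2)
```

Store the manifest alongside the artifacts in the evidence locker; re-running `sha256_file` later and comparing digests shows whether a copy has changed since collection.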
-
Post containment status to the bridge
Containment confirmed means: blast radius is bounded, no further spread observed for 30+ minutes, and stakeholders agree the situation is stable. State this explicitly — silence reads as ongoing crisis to people on the bridge.
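The three conditions above can be written as one explicit check, which makes it harder to declare containment early. This is a rough sketch: the two boolean flags are judgment calls recorded by the incident commander, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

# Quiet period from the checklist: no further spread for 30+ minutes.
QUIET_WINDOW = timedelta(minutes=30)


def containment_confirmed(
    last_spread_seen: datetime,
    now: datetime,
    blast_radius_bounded: bool,
    stakeholders_agree: bool,
) -> bool:
    """True only when all three containment conditions hold."""
    quiet_for = now - last_spread_seen
    return (
        blast_radius_bounded
        and stakeholders_agree
        and quiet_for >= QUIET_WINDOW
    )
```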
Eradication and Recovery
-
Identify root cause via log review
Pull SIEM (Sentinel, Splunk, QRadar), EDR telemetry, firewall logs, and identity-provider sign-in logs for the relevant window. Correlate by IOC and timestamp; an unsupported root-cause hypothesis leads to incomplete eradication and recurrence within days.
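Correlating by IOC and timestamp amounts to filtering every log source to the incident window, keeping events that mention a known indicator, and ordering the result by time. A minimal sketch, assuming the logs have already been exported and normalized into dictionaries with hypothetical `ts`, `source`, and `message` keys (real SIEM exports will differ):

```python
from datetime import datetime


def correlate(events, iocs, start: datetime, end: datetime):
    """Return events inside [start, end] that mention any IOC, oldest first.

    events: iterable of dicts with 'ts' (datetime), 'source', and 'message'.
    iocs:   iterable of indicator strings (IPs, hashes, domains).
    """
    indicators = list(iocs)
    hits = [
        event
        for event in events
        if start <= event["ts"] <= end
        and any(ioc in event["message"] for ioc in indicators)
    ]
    return sorted(hits, key=lambda event: event["ts"])
```

The substring match is deliberately naive: an indicator such as `203.0.113.7` would also match `203.0.113.70`, so a production query should match on parsed fields rather than raw message text.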
-
Remove malicious artifacts from affected hosts
For confirmed compromise, prefer rebuild from gold image over cleanup — persistence mechanisms (scheduled tasks, WMI subscriptions, BITS jobs, registry run keys) are easy to miss. If rebuild isn't viable, run the EDR's full-system scan and document every artifact removed.
-
Rotate compromised credentials and tokens
Revoke Entra ID / Okta sessions, reset passwords for affected accounts, rotate service-account credentials and API keys, and invalidate cached Kerberos tickets. Don't forget secrets stored in pipelines, RMM, and password vaults — attackers commonly pivot through these.
-
Restore systems from clean backup
Restore into an isolated VLAN first, validate, then cut over. Confirm the restore point predates first known IOC — restoring from a contaminated backup just reintroduces the compromise.
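Choosing the restore point is a timestamp comparison: take the newest backup that is strictly older than the first known IOC. The sketch below assumes restore-point timestamps have been exported from the backup product as `datetime` values; the `safety_margin` parameter is an added assumption for teams that want extra distance from the first IOC.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def pick_restore_point(
    restore_points: Iterable[datetime],
    first_ioc: datetime,
    safety_margin: timedelta = timedelta(0),
) -> Optional[datetime]:
    """Return the newest restore point older than the first known IOC.

    Returns None when every restore point falls inside the incident window,
    in which case no available backup can be assumed clean.
    """
    cutoff = first_ioc - safety_margin
    candidates = [point for point in restore_points if point < cutoff]
    return max(candidates, default=None)
```

The first known IOC is only a lower bound on when the compromise began, so the validation step in the isolated VLAN still matters even when this check passes.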
-
Validate system integrity with an EDR scan
-
Monitor for recurrence indicators
Tune SIEM detections for the IOCs from this incident and watch closely for 48–72 hours. Re-emergence after declared recovery is a strong signal that eradication was incomplete and the IR cycle should restart.
Communication and Coordination
-
Send the initial stakeholder notification
Use the pre-approved template — what is impacted, what we know, what we're doing, when the next update is due. Avoid speculation about cause; a wrong cause statement on the public record is harder to retract than "investigation in progress."
-
Provide regular status updates during the incident
Cadence is hourly during active SEV1, every two hours for SEV2, on milestone change for SEV3. Send the update even if there is no progress — "investigation continues, next update at 14:00" is informative; silence is not.
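The cadence can be turned into a small helper that computes when the next update is due and drafts the holding message. This is a sketch under assumptions: the function names are hypothetical, and because the text gives no cadence for SEV4, it is treated like SEV3 (update on milestone change).

```python
from datetime import datetime, timedelta
from typing import Optional

# Fixed intervals from the checklist; other severities update on milestones.
UPDATE_INTERVAL = {1: timedelta(hours=1), 2: timedelta(hours=2)}


def next_update_due(severity: int, last_update: datetime) -> Optional[datetime]:
    """Return when the next update is due, or None for milestone-driven updates."""
    interval = UPDATE_INTERVAL.get(severity)
    if interval is None:
        return None
    return last_update + interval


def holding_update(severity: int, last_update: datetime) -> str:
    """Draft the 'no progress' message so an update still goes out on time."""
    due = next_update_due(severity, last_update)
    if due is None:
        return "Investigation continues, next update on milestone change."
    return f"Investigation continues, next update at {due:%H:%M}."
```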
-
Coordinate with affected vendors and MSP partners
Open vendor support cases with Microsoft, the firewall vendor, the EDR vendor, the cloud provider — whichever are in scope. Reference existing premier / TAM contacts; the public support queue is too slow for an active incident.
-
Notify legal and compliance teams
Confirmed breach of regulated data starts notification clocks: 72 hours to notify the supervisory authority under GDPR, deadlines that vary by state under US breach-notification laws, and contractually defined windows in customer agreements and HIPAA BAAs. Legal owns the determination; IT provides the facts.
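For the GDPR clock, the arithmetic is simple enough to automate so the deadline is visible on the bridge. A minimal sketch, assuming the clock starts at the moment the organization became aware of the breach; when that moment is, and whether the obligation applies at all, remains legal's determination. Function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

GDPR_WINDOW = timedelta(hours=72)


def gdpr_deadline(aware_at: datetime) -> datetime:
    """Return the notification deadline, 72 hours after awareness."""
    if aware_at.tzinfo is None:
        # Naive timestamps are ambiguous across sites and time zones.
        raise ValueError("aware_at must be timezone-aware")
    return aware_at + GDPR_WINDOW


def hours_remaining(aware_at: datetime, now: datetime) -> float:
    """Hours left on the clock; negative once the deadline has passed."""
    return (gdpr_deadline(aware_at) - now).total_seconds() / 3600
```

Requiring timezone-aware timestamps is a deliberate choice: an MSP working across regions can otherwise lose or gain hours without noticing.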
Post-Incident Review
-
Schedule the blameless post-mortem
Within five business days of recovery, while memory is fresh. Invite responders, system owners, and a facilitator outside the response team — the facilitator's job is to keep the conversation focused on systemic causes rather than individual fault.
-
Document the incident timeline and root cause
Reconstruct minute-by-minute from chat logs, ticket history, and monitoring data. Apply the 5-whys or causal-chain analysis — the first answer ("the disk filled up") is rarely the root cause ("log rotation broke six months ago and nobody noticed").
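Reconstructing the timeline is mostly a merge of timestamped entries from several sources into one ordered list. A rough sketch, assuming each source has already been exported as `(timestamp, text)` pairs; the source names passed as keyword arguments are illustrative.

```python
from datetime import datetime


def build_timeline(**sources):
    """Merge (timestamp, text) pairs from named sources into one ordered list.

    Returns (timestamp, source_name, text) tuples, oldest first. The sort is
    stable, so entries with equal timestamps keep the order in which the
    sources were passed.
    """
    merged = [
        (ts, name, text)
        for name, entries in sources.items()
        for ts, text in entries
    ]
    merged.sort(key=lambda row: row[0])
    return merged
```

For example, `build_timeline(chat=chat_entries, ticket=ticket_entries, monitoring=alert_entries)` yields a single sequence that the scribe can annotate during the post-mortem.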
-
File the final incident report
The report goes to the customer (for MSP incidents), to the auditor (for SOC 2 / ISO evidence), and into the internal knowledge base. Capture the closure status, attach the written report, and record lessons learned for the next responder who hits a similar pattern.
-
Update the runbook and detection rules
Translate the lessons into durable artifacts: new SIEM correlation rules, updated EDR policies, revised runbook steps, added monitoring thresholds. A post-mortem with no runbook change is a post-mortem the team will repeat.