Incident Management Checklist

Steps an IT operations or MSP response team runs to triage, contain, and recover from a production incident — from first page through post-mortem. Covers security incidents and outage incidents, with branches for SEV1 escalation and breach-specific handling.

5 sections, 25 steps

Detection and Triage

Classify the incident type
- Pull the triggering alert from the SIEM, RMM, or ticket and pick the closest type. Misclassification at this step routes the incident to the wrong runbook — a suspected breach handled as a system outage loses forensic evidence in the first hour.
Assign severity per the SEV matrix
- Use the published SEV matrix — SEV1 is customer-impacting outage or confirmed breach of regulated data; SEV2 is partial degradation or contained suspicious activity; SEV3/4 are limited-scope or single-user. When in doubt, escalate up; downgrade is cheap, missed escalation is expensive.
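The matrix above can be sketched as a small lookup, assuming the responder records a few yes/no facts at intake. The field names and the one-level "escalate when unsure" rule are illustrative assumptions, not part of any published matrix.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    # All field names are hypothetical; adapt them to your own intake form.
    customer_impacting_outage: bool = False
    confirmed_regulated_data_breach: bool = False
    partial_degradation: bool = False
    contained_suspicious_activity: bool = False
    uncertain: bool = False  # responder is unsure between two levels


def assign_severity(facts: IncidentFacts) -> str:
    """Map intake facts to a SEV level following the matrix described above."""
    if facts.customer_impacting_outage or facts.confirmed_regulated_data_breach:
        level = 1
    elif facts.partial_degradation or facts.contained_suspicious_activity:
        level = 2
    else:
        # Limited-scope or single-user; SEV3 vs. SEV4 is left to local policy.
        level = 3
    # When in doubt, escalate one level up (never above SEV1).
    if facts.uncertain and level > 1:
        level -= 1
    return f"SEV{level}"


print(assign_severity(IncidentFacts(partial_degradation=True, uncertain=True)))
```

Escalating by exactly one level when unsure is one possible reading of "escalate up"; a team may prefer to jump straight to SEV1 for anything touching regulated data.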
Page the on-call responder via PagerDuty
- Use the PagerDuty (or Opsgenie) escalation policy for the affected service — never DM an engineer directly. The page creates the audit trail and triggers the secondary if primary doesn't ack within the window.
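For teams that page from automation rather than the web UI, the sketch below builds a trigger event for the PagerDuty Events API v2. The routing key is the integration key of the affected service, so the page follows that service's escalation policy. The SEV-to-severity mapping and the use of the ticket number as `dedup_key` are assumptions made for illustration; the request is only constructed here, not sent.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

# Assumed mapping from the SEV matrix to Events API severity values.
SEV_TO_PD_SEVERITY = {
    "SEV1": "critical",
    "SEV2": "error",
    "SEV3": "warning",
    "SEV4": "info",
}


def build_page_request(routing_key, summary, source, sev, ticket_id):
    """Build (but do not send) a trigger event for the affected service."""
    body = {
        "routing_key": routing_key,   # integration key of the affected service
        "event_action": "trigger",
        "dedup_key": ticket_id,       # assumption: reuse the ticket number to avoid duplicate pages
        "payload": {
            "summary": summary,
            "source": source,
            "severity": SEV_TO_PD_SEVERITY[sev],
        },
    }
    return urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values; a real routing key comes from the service's integration settings.
request = build_page_request("EXAMPLE_ROUTING_KEY", "Database cluster unreachable", "db01", "SEV1", "INC-0001")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would submit it; keeping construction separate from sending makes the payload easy to review and test.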
Open the incident bridge and ticket
- Spin up the incident channel in Slack/Teams, start the Zoom bridge, and open the ticket in ServiceNow / ConnectWise / Halo. Pin the bridge link, ticket number, and incident commander assignment in the channel topic so late joiners can self-orient.
Document affected systems and user impact
- List the affected hostnames, services, sites, and approximate user count. For MSP incidents, identify which clients are impacted — single-tenant vs. shared-infrastructure scope changes notification obligations.

Containment and Mitigation

Isolate affected hosts from the network
- Use EDR network containment (CrowdStrike, SentinelOne, Defender) rather than yanking cables — keeps the host reachable for forensic collection while blocking lateral movement. Do not reboot or wipe; volatile memory and running-process state are evidence.
Activate the IR team and exec bridge
- SEV1 requires the full IR rota: incident commander, comms lead, scribe, and exec sponsor. Bring legal counsel onto the bridge before any substantive discussion of cause or impact; counsel's early involvement helps preserve attorney-client privilege over the incident record.
Apply temporary mitigation or workaround
- Examples: failover to secondary DC, route traffic around affected node, disable the offending feature flag, block the IOC at the firewall. Document the workaround as a known-tech-debt ticket — temporary fixes that survive past the incident become permanent if not tracked.
Verify backup integrity and immutability
- Identify the most recent Veeam / Datto / Rubrik restore point that predates the incident window and confirm that the immutable / object-locked copy is intact. Ransomware actors target backups first; an encrypted backup repository turns a recoverable incident into a six-figure ransom decision.
Snapshot affected VMs for forensic preservation
- Take vCenter / Hyper-V / cloud-provider snapshots and export memory captures before any remediation touches the host. Hash the artifacts and store in a separate evidence locker — chain of custody matters if the incident becomes a legal or insurance matter.
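Hashing the exported artifacts might look like the sketch below, which computes a SHA-256 digest in chunks (memory captures can be large) and builds a simple custody record. The record fields are an illustrative assumption; use whatever your evidence locker or legal team requires.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large captures do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_record(path: Path, collected_by: str, incident_id: str) -> dict:
    """Build one chain-of-custody entry (field names are hypothetical)."""
    return {
        "incident_id": incident_id,
        "artifact": path.name,
        "size_bytes": path.stat().st_size,
        "sha256": sha256_file(path),
        "collected_by": collected_by,
        "recorded_at_utc": datetime.now(timezone.utc).isoformat(),
    }
```

Re-hashing the artifact whenever it is moved or handed over, and comparing against the recorded digest, is what lets the team show later that the evidence was not altered.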
Post containment status to the bridge
- Containment confirmed means: blast radius is bounded, no further spread observed for 30+ minutes, and stakeholders agree the situation is stable. State this explicitly — silence reads as ongoing crisis to people on the bridge.
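The three conditions above can be encoded as an explicit check so the scribe posts a clear statement either way. This is a minimal sketch; the function names and message wording are assumptions.

```python
from datetime import datetime, timedelta

QUIET_PERIOD = timedelta(minutes=30)  # "no further spread observed for 30+ minutes"


def containment_confirmed(blast_radius_bounded: bool,
                          last_spread_observed: datetime,
                          stakeholders_agree: bool,
                          now: datetime) -> bool:
    """Return True only when all three containment conditions hold."""
    no_recent_spread = (now - last_spread_observed) >= QUIET_PERIOD
    return blast_radius_bounded and no_recent_spread and stakeholders_agree


def status_line(confirmed: bool) -> str:
    """Produce an explicit statement for the bridge in either case."""
    if confirmed:
        return "Containment CONFIRMED: blast radius bounded, no spread for 30+ minutes."
    return "Containment NOT yet confirmed: response is still active."


now = datetime(2024, 1, 1, 12, 0)
print(status_line(containment_confirmed(True, datetime(2024, 1, 1, 11, 20), True, now)))
```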

Eradication and Recovery

Identify root cause via log review
- Pull SIEM (Sentinel, Splunk, QRadar), EDR telemetry, firewall logs, and identity-provider sign-in logs for the relevant window. Correlate by IOC and timestamp; an unsupported root-cause hypothesis leads to incomplete eradication and recurrence within days.
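Correlating by IOC and timestamp can be sketched as below, assuming events from the different sources have already been exported and normalized into dictionaries with `source`, `timestamp`, and `message` keys. That schema is hypothetical and is not the native format of any SIEM or EDR product.

```python
from datetime import datetime


def correlate_by_ioc(events, iocs, window_start, window_end):
    """Return events inside the window that mention any IOC, ordered by time."""
    hits = []
    for event in events:
        if not (window_start <= event["timestamp"] <= window_end):
            continue
        matched = [ioc for ioc in iocs if ioc in event["message"]]
        if matched:
            # Copy the event and record which IOCs matched, for the timeline.
            hits.append({**event, "matched_iocs": matched})
    return sorted(hits, key=lambda e: e["timestamp"])


events = [
    {"source": "firewall", "timestamp": datetime(2024, 1, 1, 10, 5), "message": "allow out 203.0.113.7:443"},
    {"source": "edr", "timestamp": datetime(2024, 1, 1, 10, 1), "message": "process evil.exe spawned by winword.exe"},
    {"source": "idp", "timestamp": datetime(2024, 1, 1, 9, 0), "message": "sign-in ok for alice"},
]
timeline = correlate_by_ioc(events, ["203.0.113.7", "evil.exe"],
                            datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 11, 0))
for hit in timeline:
    print(hit["timestamp"], hit["source"], hit["matched_iocs"])
```

Plain substring matching is deliberately simple and can over-match (an IP such as `10.0.0.1` also matches `10.0.0.10`); a production query would match on parsed fields instead.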
Remove malicious artifacts from affected hosts
- For confirmed compromise, prefer rebuild from gold image over cleanup — persistence mechanisms (scheduled tasks, WMI subscriptions, BITS jobs, registry run keys) are easy to miss. If rebuild isn't viable, run the EDR's full-system scan and document every artifact removed.
Rotate compromised credentials and tokens
- Revoke Entra ID / Okta sessions, reset passwords for affected accounts, rotate service-account credentials and API keys, and invalidate cached Kerberos tickets. Don't forget secrets stored in pipelines, RMM, and password vaults — attackers commonly pivot through these.
Restore systems from clean backup
- Restore into an isolated VLAN first, validate, then cut over. Confirm the restore point predates first known IOC — restoring from a contaminated backup just reintroduces the compromise.
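Choosing a restore point that predates the first known IOC can be expressed as a small selection function. This is a sketch that works on plain timestamps; the optional `safety_margin` is an assumption added for cases where the true intrusion likely began before the first observed indicator.

```python
from datetime import datetime, timedelta


def latest_clean_restore_point(restore_points, first_ioc_at, safety_margin=timedelta(0)):
    """Return the newest restore point strictly older than the first IOC, or None."""
    cutoff = first_ioc_at - safety_margin
    candidates = [rp for rp in restore_points if rp < cutoff]
    return max(candidates) if candidates else None


points = [datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 2, 2, 0), datetime(2024, 1, 3, 2, 0)]
first_ioc = datetime(2024, 1, 2, 14, 30)
print(latest_clean_restore_point(points, first_ioc))
print(latest_clean_restore_point(points, first_ioc, safety_margin=timedelta(days=1)))
```

A `None` result means no backup predates the compromise, which is a decision point for the incident commander rather than something to work around in code.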
Validate system integrity with an EDR scan
Monitor for recurrence indicators
- Tune SIEM detections for the IOCs from this incident and watch closely for 48–72 hours. Re-emergence after declared recovery is a strong signal that eradication was incomplete and the IR cycle should restart.

Communication and Coordination

Send the initial stakeholder notification
- Use the pre-approved template — what is impacted, what we know, what we're doing, when the next update is due. Avoid speculation about cause; a wrong cause statement on the public record is harder to retract than "investigation in progress."
Provide status updates at a fixed cadence during the incident
- Cadence is hourly during active SEV1, every two hours for SEV2, on milestone change for SEV3. Send the update even if there is no progress — "investigation continues, next update at 14:00" is informative; silence is not.
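The cadence above can be captured in a small helper that tells the comms lead when the next update is due and produces a holding message when there is nothing new to report. The function names are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Intervals taken from the cadence described above; SEV3 updates on milestone change.
UPDATE_INTERVAL = {
    "SEV1": timedelta(hours=1),
    "SEV2": timedelta(hours=2),
}


def next_update_due(sev: str, last_update: datetime):
    """Return when the next update is due, or None if updates are milestone-driven."""
    interval = UPDATE_INTERVAL.get(sev)
    return last_update + interval if interval else None


def holding_message(next_due: datetime) -> str:
    """Update text for when there is no progress to report."""
    return f"Investigation continues, next update at {next_due:%H:%M}."


due = next_update_due("SEV1", datetime(2024, 1, 1, 13, 0))
print(holding_message(due))
```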
Coordinate with affected vendors and MSP partners
- Open vendor support cases with Microsoft, the firewall vendor, the EDR vendor, the cloud provider — whichever are in scope. Reference existing premier / TAM contacts; the public support queue is too slow for an active incident.
Notify legal and compliance teams
- Confirmed breach of regulated data starts notification clocks: 72 hours to the supervisory authority under GDPR, deadlines that vary by state under US breach-notification laws, and contractually defined windows in customer agreements and HIPAA BAAs. Legal owns the determination; IT provides the facts.
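To keep the GDPR clock visible on the bridge, a sketch like the one below computes the 72-hour deadline from the moment the organization became aware of the breach. It covers only the GDPR window mentioned above; when "awareness" legally began, and which other deadlines apply, remain legal's call. Function names are assumptions.

```python
from datetime import datetime, timedelta, timezone

GDPR_WINDOW = timedelta(hours=72)


def gdpr_notification_deadline(aware_at: datetime) -> datetime:
    """Deadline for notifying the supervisory authority, 72 hours after awareness."""
    if aware_at.tzinfo is None:
        # Require timezone-aware timestamps so the clock is unambiguous across sites.
        raise ValueError("aware_at must be timezone-aware")
    return aware_at + GDPR_WINDOW


def hours_remaining(deadline: datetime, now: datetime) -> float:
    """Hours left on the clock; negative once the deadline has passed."""
    return (deadline - now).total_seconds() / 3600


aware = datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc)
deadline = gdpr_notification_deadline(aware)
print(deadline.isoformat(), hours_remaining(deadline, aware + timedelta(hours=24)))
```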

Post-Incident Review

Schedule the blameless post-mortem
- Within five business days of recovery, while memory is fresh. Invite responders, system owners, and a facilitator outside the response team — the facilitator's job is to keep the conversation focused on systemic causes rather than individual fault.
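The five-business-day target can be computed when the incident is closed so the post-mortem gets a due date immediately. This sketch skips weekends only; public holidays are not handled and would need a local calendar.

```python
from datetime import date, timedelta


def add_business_days(start: date, business_days: int) -> date:
    """Advance by the given number of weekdays (Mon-Fri), ignoring holidays."""
    current = start
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def postmortem_due(recovered_on: date) -> date:
    """Latest date for the post-mortem: five business days after recovery."""
    return add_business_days(recovered_on, 5)


print(postmortem_due(date(2024, 1, 3)))
```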
Document the incident timeline and root cause
- Reconstruct minute-by-minute from chat logs, ticket history, and monitoring data. Apply the 5-whys or causal-chain analysis — the first answer ("the disk filled up") is rarely the root cause ("log rotation broke six months ago and nobody noticed").
File the final incident report
- The report goes to the customer (for MSP incidents), to the auditor (for SOC 2 / ISO evidence), and into the internal knowledge base. Capture the closure status, attach the written report, and record lessons learned for the next responder who hits a similar pattern.
Update the runbook and detection rules
- Translate the lessons into durable artifacts: new SIEM correlation rules, updated EDR policies, revised runbook steps, added monitoring thresholds. A post-mortem with no runbook change is a post-mortem the team will repeat.
