Incident Management Checklist

Steps an IT operations or MSP response team runs to triage, contain, and recover from a production incident — from first page through post-mortem. Covers security incidents and outage incidents, with branches for SEV1 escalation and breach-specific handling.

1

Detection and Triage

  1. Classify the incident type
    • Pull the triggering alert from the SIEM, RMM, or ticket and pick the closest type. Misclassification at this step routes the incident to the wrong runbook — a suspected breach handled as a system outage loses forensic evidence in the first hour.

  2. Assign severity per the SEV matrix
    • Use the published SEV matrix — SEV1 is customer-impacting outage or confirmed breach of regulated data; SEV2 is partial degradation or contained suspicious activity; SEV3/4 are limited-scope or single-user. When in doubt, escalate up; downgrade is cheap, missed escalation is expensive.
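
The matrix above can be sketched as a small lookup function. This is a minimal illustration, not a published matrix: the `IncidentFacts` fields, the `unsure` flag, and the choice to map limited-scope to SEV3 and single-user to SEV4 are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    # Hypothetical triage flags; adapt to the fields your ticket form collects.
    customer_impacting_outage: bool = False
    confirmed_regulated_data_breach: bool = False
    partial_degradation: bool = False
    contained_suspicious_activity: bool = False
    single_user: bool = False


def assign_severity(facts: IncidentFacts, unsure: bool = False) -> int:
    """Return a SEV level from 1 (worst) to 4 based on the triage flags."""
    if facts.customer_impacting_outage or facts.confirmed_regulated_data_breach:
        sev = 1
    elif facts.partial_degradation or facts.contained_suspicious_activity:
        sev = 2
    elif facts.single_user:
        sev = 4  # assumption: single-user impact maps to SEV4
    else:
        sev = 3  # assumption: anything else is treated as limited-scope
    # "When in doubt, escalate up": raise by one level, never past SEV1.
    if unsure:
        sev = max(1, sev - 1)
    return sev
```

For example, `assign_severity(IncidentFacts(single_user=True), unsure=True)` returns 3, reflecting the escalate-when-unsure rule.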

  3. Page the on-call responder via PagerDuty
    • Use the PagerDuty (or Opsgenie) escalation policy for the affected service — never DM an engineer directly. The page creates the audit trail and triggers the secondary if primary doesn't ack within the window.
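
For teams that page programmatically, the sketch below builds a trigger event in the shape used by the PagerDuty Events API v2. The SEV-to-severity mapping, the function name, and the sample values are assumptions for illustration; the routing key comes from the service integration, and the escalation policy itself is configured in PagerDuty, not in this code.

```python
from typing import Optional

# Events API v2 endpoint; the event below would be POSTed here as JSON.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

# Assumed mapping from internal SEV levels to Events API severity values.
SEV_TO_PD_SEVERITY = {1: "critical", 2: "error", 3: "warning", 4: "info"}


def build_trigger_event(
    routing_key: str,
    summary: str,
    source: str,
    sev: int,
    dedup_key: Optional[str] = None,
) -> dict:
    """Build (but do not send) a trigger event for the affected service."""
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": SEV_TO_PD_SEVERITY[sev],
        },
    }
    # Reusing a dedup_key groups repeat alerts into the same incident.
    if dedup_key is not None:
        event["dedup_key"] = dedup_key
    return event
```

Sending the event through the service integration, rather than messaging an engineer, is what leaves the audit trail and lets the escalation policy handle a missed acknowledgement.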

  4. Open the incident bridge and ticket
    • Spin up the incident channel in Slack/Teams, start the Zoom bridge, and open the ticket in ServiceNow / ConnectWise / Halo. Pin the bridge link, ticket number, and incident commander assignment in the channel topic so late joiners can self-orient.

  5. Document affected systems and user impact
    • List the affected hostnames, services, sites, and approximate user count. For MSP incidents, identify which clients are impacted — single-tenant vs. shared-infrastructure scope changes notification obligations.

2

Containment and Mitigation

  1. Isolate affected hosts from the network
    • Use EDR network containment (CrowdStrike, SentinelOne, Defender) rather than yanking cables — keeps the host reachable for forensic collection while blocking lateral movement. Do not reboot or wipe; volatile memory and running-process state are evidence.

  2. Activate the IR team and exec bridge
    • SEV1 requires the full IR rota: incident commander, comms lead, scribe, and exec sponsor. Bring legal counsel onto the bridge before substantive discussion of cause or impact; counsel's involvement helps preserve attorney-client privilege over the incident record.

  3. Apply temporary mitigation or workaround
    • Examples: failover to secondary DC, route traffic around affected node, disable the offending feature flag, block the IOC at the firewall. Document the workaround as a known-tech-debt ticket — temporary fixes that survive past the incident become permanent if not tracked.

  4. Verify backup integrity and immutability
    • Confirm the most recent Veeam / Datto / Rubrik restore point predates the incident window and that the immutable / object-locked copy is intact. Ransomware actors target backups first; an encrypted backup repository turns a recoverable incident into a six-figure ransom decision.
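
The check above amounts to picking the newest restore point that predates the incident window and is still immutable. The sketch below assumes restore-point metadata has already been exported from the backup product into simple records; the `RestorePoint` fields and function name are hypothetical and do not correspond to any vendor's API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class RestorePoint:
    # Hypothetical record built from the backup console or an export.
    taken_at: datetime
    immutable: bool  # object-lock / immutability flag still in force
    verified: bool   # last integrity or restore test passed


def latest_clean_restore_point(
    points: Iterable[RestorePoint], incident_start: datetime
) -> Optional[RestorePoint]:
    """Return the newest verified, immutable point taken before the incident."""
    candidates = [
        p for p in points
        if p.taken_at < incident_start and p.immutable and p.verified
    ]
    # None means no usable restore point exists: escalate immediately.
    return max(candidates, key=lambda p: p.taken_at, default=None)
```

A `None` result is the signal to escalate, since it means every candidate is either too recent, mutable, or unverified.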

  5. Snapshot affected VMs for forensic preservation
    • Take vCenter / Hyper-V / cloud-provider snapshots and export memory captures before any remediation touches the host. Hash the artifacts and store in a separate evidence locker — chain of custody matters if the incident becomes a legal or insurance matter.
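
Hashing the exported artifacts can be done with the Python standard library. The sketch below streams a file through SHA-256 and builds a simple custody record; the record fields and the `collected_by` parameter are illustrative assumptions, not a formal chain-of-custody standard.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large memory captures fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_record(path: Path, collected_by: str) -> dict:
    """Build one evidence-log entry for an artifact (fields are assumptions)."""
    return {
        "artifact": path.name,
        "sha256": sha256_file(path),
        "size_bytes": path.stat().st_size,
        "collected_by": collected_by,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
```

Re-running `sha256_file` on the stored copy later and comparing it with the recorded value shows whether the artifact has changed since collection.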

  6. Post containment status to the bridge
    • Containment confirmed means: blast radius is bounded, no further spread observed for 30+ minutes, and stakeholders agree the situation is stable. State this explicitly — silence reads as ongoing crisis to people on the bridge.
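
The three containment criteria above can be expressed as a single check. This is a minimal sketch; the function and parameter names are assumptions, and the two boolean inputs still come from human judgment on the bridge.

```python
from datetime import datetime, timedelta

# Quiet period from the checklist: no further spread for 30+ minutes.
QUIET_PERIOD = timedelta(minutes=30)


def containment_confirmed(
    blast_radius_bounded: bool,
    last_spread_seen_at: datetime,
    stakeholders_agree: bool,
    now: datetime,
) -> bool:
    """True only when all three containment criteria hold at once."""
    quiet_long_enough = (now - last_spread_seen_at) >= QUIET_PERIOD
    return blast_radius_bounded and quiet_long_enough and stakeholders_agree
```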

3

Eradication and Recovery

  1. Identify root cause via log review
    • Pull SIEM (Sentinel, Splunk, QRadar), EDR telemetry, firewall logs, and identity-provider sign-in logs for the relevant window. Correlate by IOC and timestamp; an unsupported root-cause hypothesis leads to incomplete eradication and recurrence within days.
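
Correlating by IOC and timestamp can be sketched as a filter and group over normalized events. The example assumes logs from each source have already been exported and normalized into a common record; the `LogEvent` fields and the `correlate` function are hypothetical, not part of any SIEM's API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class LogEvent:
    # Hypothetical normalized record; real exports need per-source parsing.
    timestamp: datetime
    source: str     # e.g. "siem", "edr", "firewall", "idp"
    host: str
    indicator: str  # IP, hash, domain, or account observed in the event


def correlate(
    events: Iterable[LogEvent],
    iocs: Set[str],
    start: datetime,
    end: datetime,
) -> Dict[str, List[LogEvent]]:
    """Group in-window events by matching IOC, each group in time order."""
    timeline: Dict[str, List[LogEvent]] = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        if start <= ev.timestamp <= end and ev.indicator in iocs:
            timeline.setdefault(ev.indicator, []).append(ev)
    return timeline
```

The earliest event in each group is a candidate for the first known IOC, which is the timestamp later steps compare restore points against.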

  2. Remove malicious artifacts from affected hosts
    • For confirmed compromise, prefer rebuild from gold image over cleanup — persistence mechanisms (scheduled tasks, WMI subscriptions, BITS jobs, registry run keys) are easy to miss. If rebuild isn't viable, run the EDR's full-system scan and document every artifact removed.

  3. Rotate compromised credentials and tokens
    • Revoke Entra ID / Okta sessions, reset passwords for affected accounts, rotate service-account credentials and API keys, and invalidate cached Kerberos tickets. Don't forget secrets stored in pipelines, RMM, and password vaults — attackers commonly pivot through these.

  4. Restore systems from clean backup
    • Restore into an isolated VLAN first, validate, then cut over. Confirm the restore point predates first known IOC — restoring from a contaminated backup just reintroduces the compromise.

  5. Validate system integrity with an EDR scan
  6. Monitor for recurrence indicators
    • Tune SIEM detections for the IOCs from this incident and watch closely for 48–72 hours. Re-emergence after declared recovery is a strong signal that eradication was incomplete and the IR cycle should restart.
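
The watch period can be tracked with a small helper that flags IOC sightings after recovery was declared. This is a sketch under assumptions: the sighting timestamps are taken to come from the tuned SIEM detections, the names are hypothetical, and the window is set to the upper bound of 72 hours.

```python
from datetime import datetime, timedelta
from typing import Iterable, List

# Upper bound of the 48-72 hour watch period from the checklist.
WATCH_WINDOW = timedelta(hours=72)


def recurrences(
    recovery_declared_at: datetime, ioc_sightings: Iterable[datetime]
) -> List[datetime]:
    """Return IOC sightings that fall inside the post-recovery watch window."""
    window_end = recovery_declared_at + WATCH_WINDOW
    return sorted(
        t for t in ioc_sightings if recovery_declared_at < t <= window_end
    )


def watch_complete(recovery_declared_at: datetime, now: datetime) -> bool:
    """True once the full watch window has elapsed."""
    return now >= recovery_declared_at + WATCH_WINDOW
```

Any non-empty result from `recurrences` suggests eradication was incomplete and the IR cycle should restart.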

4

Communication and Coordination

  1. Send the initial stakeholder notification
    • Use the pre-approved template — what is impacted, what we know, what we're doing, when the next update is due. Avoid speculation about cause; a wrong cause statement on the public record is harder to retract than "investigation in progress."

  2. Provide scheduled status updates during the incident
    • Cadence is hourly during active SEV1, every two hours for SEV2, on milestone change for SEV3. Send the update even if there is no progress — "investigation continues, next update at 14:00" is informative; silence is not.
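
The cadence above can be encoded so the comms lead always knows when the next update is due. A minimal sketch, with hypothetical names; SEV3 and below return `None` because their updates are driven by milestones, not by the clock.

```python
from datetime import datetime, timedelta
from typing import Optional

# Cadence from the checklist: hourly for SEV1, every two hours for SEV2.
UPDATE_INTERVAL = {1: timedelta(hours=1), 2: timedelta(hours=2)}


def next_update_due(sev: int, last_update_at: datetime) -> Optional[datetime]:
    """Return when the next update is due, or None for milestone-driven SEVs."""
    interval = UPDATE_INTERVAL.get(sev)
    if interval is None:
        return None  # SEV3 and below: update on milestone change
    return last_update_at + interval
```

For a SEV1 update sent at 13:00, this returns 14:00, matching the "next update at 14:00" wording in the step above.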

  3. Coordinate with affected vendors and MSP partners
    • Open vendor support cases with Microsoft, the firewall vendor, the EDR vendor, the cloud provider — whichever are in scope. Reference existing premier / TAM contacts; the public support queue is too slow for an active incident.

  4. Notify legal and compliance teams
    • Confirmed breach of regulated data triggers notification clocks: 72 hours to the supervisory authority under GDPR, deadlines that vary by state under US breach-notification laws, and contractually defined windows in customer agreements and HIPAA BAAs. Legal owns the determination; IT provides the facts.
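
To keep the 72-hour GDPR clock visible on the bridge, the deadline can be computed from the moment the organization became aware of the breach. This is a sketch only: the function names are hypothetical, the "aware" timestamp is a determination legal makes, and other regimes need their own windows.

```python
from datetime import datetime, timedelta, timezone

# GDPR window for notifying the supervisory authority after becoming aware.
GDPR_WINDOW = timedelta(hours=72)


def gdpr_notification_deadline(aware_at: datetime) -> datetime:
    """Return the UTC deadline; requires a timezone-aware timestamp."""
    if aware_at.tzinfo is None:
        raise ValueError("aware_at must be timezone-aware")
    return aware_at.astimezone(timezone.utc) + GDPR_WINDOW


def hours_remaining(aware_at: datetime, now: datetime) -> float:
    """Hours left on the clock; negative once the deadline has passed."""
    remaining = gdpr_notification_deadline(aware_at) - now
    return remaining.total_seconds() / 3600
```

Requiring timezone-aware timestamps avoids an off-by-hours deadline when responders sit in different regions.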

5

Post-Incident Review

  1. Schedule the blameless post-mortem
    • Within five business days of recovery, while memory is fresh. Invite responders, system owners, and a facilitator outside the response team — the facilitator's job is to keep the conversation focused on systemic causes rather than individual fault.
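
The five-business-day target can be computed from the recovery date. The sketch below counts Monday through Friday only and ignores public holidays, which is an assumption; the function name is hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (weekends skipped)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


# Latest date to hold the post-mortem for a recovery on Friday 2024-03-01.
postmortem_by = add_business_days(date(2024, 3, 1), 5)
```

Here `postmortem_by` is Friday 2024-03-08, since the intervening weekend does not count.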

  2. Document the incident timeline and root cause
    • Reconstruct minute-by-minute from chat logs, ticket history, and monitoring data. Apply the 5-whys or causal-chain analysis — the first answer ("the disk filled up") is rarely the root cause ("log rotation broke six months ago and nobody noticed").

  3. File the final incident report
    • The report goes to the customer (for MSP incidents), to the auditor (for SOC 2 / ISO evidence), and into the internal knowledge base. Capture the closure status, attach the written report, and record lessons learned for the next responder who hits a similar pattern.

  4. Update the runbook and detection rules
    • Translate the lessons into durable artifacts: new SIEM correlation rules, updated EDR policies, revised runbook steps, added monitoring thresholds. A post-mortem with no runbook change is a post-mortem the team will repeat.
