Incident Management Checklist

Detection and Triage

    Pull the triggering alert from the SIEM, RMM, or ticket and classify the incident as the closest matching type. Misclassification at this step routes the incident to the wrong runbook: a suspected breach handled as a system outage can lose forensic evidence in the first hour.

    Use the published SEV matrix — SEV1 is customer-impacting outage or confirmed breach of regulated data; SEV2 is partial degradation or contained suspicious activity; SEV3/4 are limited-scope or single-user. When in doubt, escalate up; downgrade is cheap, missed escalation is expensive.
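
    As a rough illustration of the matrix above, the sketch below maps a handful of triage facts to a SEV level and escalates one level when the responder is unsure. The `IncidentFacts` fields and the `classify_severity` name are hypothetical; the published matrix remains the authority, and the split between SEV3 and SEV4 is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class IncidentFacts:
    """Hypothetical triage inputs; field names are illustrative only."""
    customer_impacting_outage: bool = False
    confirmed_regulated_data_breach: bool = False
    partial_degradation: bool = False
    contained_suspicious_activity: bool = False
    uncertain: bool = False  # responder cannot decide between two levels


def classify_severity(facts: IncidentFacts) -> str:
    """Map triage facts to a SEV level, escalating one level when in doubt."""
    if facts.customer_impacting_outage or facts.confirmed_regulated_data_breach:
        level = 1
    elif facts.partial_degradation or facts.contained_suspicious_activity:
        level = 2
    else:
        level = 3  # limited-scope or single-user; SEV3 vs. SEV4 is left to the matrix
    if facts.uncertain and level > 1:
        level -= 1  # escalate up: a later downgrade is cheap
    return f"SEV{level}"
```

    For example, `classify_severity(IncidentFacts(partial_degradation=True, uncertain=True))` returns `"SEV1"` under these assumptions, because doubt pushes a SEV2 up one level.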

    Use the PagerDuty (or Opsgenie) escalation policy for the affected service — never DM an engineer directly. The page creates the audit trail and triggers the secondary if primary doesn't ack within the window.

    Spin up the incident channel in Slack/Teams, start the Zoom bridge, and open the ticket in ServiceNow / ConnectWise / Halo. Pin the bridge link, ticket number, and incident commander assignment in the channel topic so late joiners can self-orient.

    List the affected hostnames, services, sites, and approximate user count. For MSP incidents, identify which clients are impacted — single-tenant vs. shared-infrastructure scope changes notification obligations.

Containment and Mitigation

    Use EDR network containment (CrowdStrike, SentinelOne, Defender) rather than yanking cables — keeps the host reachable for forensic collection while blocking lateral movement. Do not reboot or wipe; volatile memory and running-process state are evidence.

    SEV1 requires the full IR rota: incident commander, comms lead, scribe, and exec sponsor. Bring legal counsel onto the bridge before any substantive discussion of cause or impact; this helps preserve attorney-client privilege over the incident record.

    Apply a temporary workaround where one exists. Examples: fail over to the secondary DC, route traffic around the affected node, disable the offending feature flag, block the IOC at the firewall. Document the workaround as a known-tech-debt ticket; temporary fixes that survive past the incident become permanent if not tracked.

    Confirm the most recent Veeam / Datto / Rubrik restore point predates the incident window and that the immutable / object-locked copy is intact. Ransomware actors target backups first; an encrypted backup repository turns a recoverable incident into a six-figure ransom decision.

    Take vCenter / Hyper-V / cloud-provider snapshots and export memory captures before any remediation touches the host. Hash the artifacts and store in a separate evidence locker — chain of custody matters if the incident becomes a legal or insurance matter.
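
    The hashing step above could look something like the following sketch, which computes SHA-256 digests with the standard library and emits a small JSON manifest to store alongside the artifacts. The manifest layout and the `sha256_file` / `build_manifest` names are assumptions for illustration, not a prescribed evidence format.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large memory captures are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(artifacts: list[Path], collected_by: str) -> str:
    """Return a JSON manifest (hypothetical layout) listing each artifact's hash."""
    entries = [
        {
            "file": p.name,
            "size_bytes": p.stat().st_size,
            "sha256": sha256_file(p),
        }
        for p in artifacts
    ]
    manifest = {
        "collected_by": collected_by,
        "recorded_at_utc": datetime.now(timezone.utc).isoformat(),
        "artifacts": entries,
    }
    return json.dumps(manifest, indent=2)
```

    Re-running `sha256_file` on an artifact later and comparing the result against the manifest gives a simple integrity check when the evidence changes hands.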

    Containment confirmed means: blast radius is bounded, no further spread observed for 30+ minutes, and stakeholders agree the situation is stable. State this explicitly — silence reads as ongoing crisis to people on the bridge.
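
    The three conditions above can be written down as an explicit check, which makes it harder to declare containment on a hunch. This is a minimal sketch; the function and parameter names are hypothetical, and the 30-minute quiet period is taken from the definition above.

```python
from datetime import datetime, timedelta

QUIET_PERIOD = timedelta(minutes=30)  # "no further spread observed for 30+ minutes"


def containment_confirmed(
    blast_radius_bounded: bool,
    last_spread_observed: datetime,
    stakeholders_agree: bool,
    now: datetime,
) -> bool:
    """Return True only when all three containment conditions hold."""
    quiet_long_enough = now - last_spread_observed >= QUIET_PERIOD
    return blast_radius_bounded and quiet_long_enough and stakeholders_agree
```

    Whatever the check returns, the incident commander should still state the result out loud on the bridge.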

Eradication and Recovery

    Pull SIEM (Sentinel, Splunk, QRadar), EDR telemetry, firewall logs, and identity-provider sign-in logs for the relevant window. Correlate by IOC and timestamp; an unsupported root-cause hypothesis leads to incomplete eradication and recurrence within days.
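
    Correlating by IOC and timestamp can be sketched as below, assuming the logs have already been exported and normalized into simple records. `LogEvent` and `correlate` are hypothetical names, and the plain substring match is a simplification: it would also match `203.0.113.70` when searching for `203.0.113.7`, so a real query should match on parsed fields.

```python
from datetime import datetime
from typing import NamedTuple


class LogEvent(NamedTuple):
    """Hypothetical normalized log record from any source (SIEM, EDR, firewall, IdP)."""
    timestamp: datetime
    source: str
    message: str


def correlate(
    events: list[LogEvent],
    iocs: list[str],
    window_start: datetime,
    window_end: datetime,
) -> dict[str, list[LogEvent]]:
    """Group in-window events by the IOC they mention, in time order."""
    hits: dict[str, list[LogEvent]] = {ioc: [] for ioc in iocs}
    for event in sorted(events, key=lambda e: e.timestamp):
        if not (window_start <= event.timestamp <= window_end):
            continue  # outside the incident window
        for ioc in iocs:
            if ioc in event.message:  # naive substring match; see caveat above
                hits[ioc].append(event)
    return hits
```

    An IOC that appears in only one log source is a weaker basis for a root-cause hypothesis than one that lines up across the firewall, EDR, and sign-in logs.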

    For confirmed compromise, prefer rebuild from gold image over cleanup — persistence mechanisms (scheduled tasks, WMI subscriptions, BITS jobs, registry run keys) are easy to miss. If rebuild isn't viable, run the EDR's full-system scan and document every artifact removed.

    Revoke Entra ID / Okta sessions, reset passwords for affected accounts, rotate service-account credentials and API keys, and invalidate cached Kerberos tickets. Don't forget secrets stored in pipelines, RMM, and password vaults — attackers commonly pivot through these.

    Restore into an isolated VLAN first, validate, then cut over. Confirm the restore point predates first known IOC — restoring from a contaminated backup just reintroduces the compromise.
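
    The restore-point rule above reduces to a timestamp comparison. The sketch below picks the newest restore point that strictly predates the first known IOC; `pick_restore_point` is a hypothetical helper, and the timestamps would come from the backup product's own catalog.

```python
from datetime import datetime
from typing import Optional


def pick_restore_point(
    restore_points: list[datetime], first_known_ioc: datetime
) -> Optional[datetime]:
    """Return the newest restore point taken before the first known IOC, if any."""
    clean = [point for point in restore_points if point < first_known_ioc]
    return max(clean) if clean else None
```

    A `None` result means every available restore point falls inside the incident window, which is a decision for the incident commander rather than something to work around. If the first known IOC moves earlier as the investigation continues, the choice needs to be rechecked.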

    Tune SIEM detections for the IOCs from this incident and watch closely for 48–72 hours. Re-emergence after declared recovery is a strong signal that eradication was incomplete and the IR cycle should restart.

Communication and Coordination

    Use the pre-approved status-update template: what is impacted, what we know, what we're doing, and when the next update is due. Avoid speculation about cause; a wrong cause statement on the public record is harder to retract than "investigation in progress."

    Cadence is hourly during active SEV1, every two hours for SEV2, on milestone change for SEV3. Send the update even if there is no progress — "investigation continues, next update at 14:00" is informative; silence is not.
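
    The cadence above can be encoded so the comms lead always has a concrete "next update at" time to put in each message. A minimal sketch; `UPDATE_INTERVAL` and `next_update_due` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Optional

# Fixed cadences from the checklist; SEV3 updates go out on milestone change.
UPDATE_INTERVAL = {
    "SEV1": timedelta(hours=1),
    "SEV2": timedelta(hours=2),
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None if there is no fixed clock."""
    interval = UPDATE_INTERVAL.get(severity)
    if interval is None:
        return None  # milestone-driven: no timer to compute
    return last_update + interval
```

    For a SEV1 update sent at 13:00, `next_update_due("SEV1", datetime(2024, 1, 1, 13, 0))` gives 14:00 the same day, which is the time to quote even when there is no progress to report.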

    Open vendor support cases with Microsoft, the firewall vendor, the EDR vendor, the cloud provider — whichever are in scope. Reference existing premier / TAM contacts; the public support queue is too slow for an active incident.

    Confirmed breach of regulated data triggers notification clocks: 72 hours to the supervisory authority under GDPR, deadlines that vary by state under US breach-notification laws, and contractually defined windows in customer agreements and HIPAA BAAs. Legal owns the determination; IT provides the facts.
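
    To keep the GDPR clock visible on the bridge, the 72-hour deadline can be computed from the moment the organization became aware of the breach. This is only arithmetic under that assumption: legal decides when the clock actually started, and `gdpr_deadline` and `hours_remaining` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

GDPR_WINDOW = timedelta(hours=72)


def gdpr_deadline(aware_at: datetime) -> datetime:
    """Return the UTC deadline 72 hours after the awareness timestamp."""
    if aware_at.tzinfo is None:
        # Naive timestamps are ambiguous across sites and time zones.
        raise ValueError("aware_at must be timezone-aware")
    return aware_at.astimezone(timezone.utc) + GDPR_WINDOW


def hours_remaining(aware_at: datetime, now: datetime) -> float:
    """Hours left before the deadline; negative once it has passed."""
    return (gdpr_deadline(aware_at) - now).total_seconds() / 3600
```

    State-law and contractual windows differ, so each would need its own entry rather than reusing the 72-hour constant.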

Post-Incident Review

    Hold the review within five business days of recovery, while memory is fresh. Invite responders, system owners, and a facilitator from outside the response team; the facilitator's job is to keep the conversation focused on systemic causes rather than individual fault.

    Reconstruct the timeline minute by minute from chat logs, ticket history, and monitoring data. Apply 5-whys or causal-chain analysis; the first answer ("the disk filled up") is rarely the root cause ("log rotation broke six months ago and nobody noticed").
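
    Merging the sources into one ordered timeline is mostly a sort. The sketch below assumes each source has already been exported into simple timestamped records in a common time zone; `Entry` and `build_timeline` are hypothetical names.

```python
from datetime import datetime
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical timeline record from chat, ticket history, or monitoring."""
    timestamp: datetime
    source: str
    text: str


def build_timeline(*sources: list[Entry]) -> list[str]:
    """Merge entries from all sources into one chronologically ordered list of lines."""
    merged = sorted(
        (entry for source in sources for entry in source),
        key=lambda entry: entry.timestamp,
    )
    return [f"{e.timestamp:%H:%M} [{e.source}] {e.text}" for e in merged]
```

    Reading the merged lines top to bottom makes gaps obvious, for example a monitoring alert that fired well before anyone mentioned it in chat.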

    The report goes to the customer (for MSP incidents), to the auditor (for SOC 2 / ISO evidence), and into the internal knowledge base. Capture the closure status, attach the written report, and record lessons learned for the next responder who hits a similar pattern.

    Translate the lessons into durable artifacts: new SIEM correlation rules, updated EDR policies, revised runbook steps, added monitoring thresholds. A post-mortem with no runbook change is a post-mortem the team will repeat.
