Problem Management Checklist

Identification and Logging

    Create a problem ticket in ServiceNow, Jira Service Management, ConnectWise PSA, or Autotask — separate from the underlying incident tickets. Link the parent problem to every related incident so MTTR and recurrence metrics roll up correctly.

    Capture user count, sites, business units, and the affected services (M365, VPN, file shares, line-of-business apps). For MSPs, name the client tenant. Pull from the RMM (NinjaOne, Datto RMM, Kaseya VSA) and the helpdesk queue, not just user reports.

    Use the priority matrix (impact × urgency). P1/SEV1 means business-stopping for multiple users; P2 is degraded service; P3 is a recurring nuisance. The priority drives SLA timers and whether the major-incident path applies.
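
    As a rough illustration, the matrix can be written as a lookup table. The level names and the cell assignments below are assumptions for the sketch, not a standard; substitute the matrix your ITSM tool is configured with.

```python
# Hypothetical impact x urgency matrix; the cell values are illustrative only.
LEVELS = ("high", "medium", "low")

PRIORITY_MATRIX = {
    ("high", "high"): "P1",      # business-stopping for multiple users
    ("high", "medium"): "P2",    # degraded service
    ("medium", "high"): "P2",
    ("medium", "medium"): "P2",
}


def priority(impact: str, urgency: str) -> str:
    """Look up a priority; any combination not listed falls through to P3."""
    if impact not in LEVELS or urgency not in LEVELS:
        raise ValueError(f"unknown level: {impact!r}, {urgency!r}")
    return PRIORITY_MATRIX.get((impact, urgency), "P3")


print(priority("high", "high"))
print(priority("low", "medium"))
```

    Keeping the matrix as data rather than nested conditionals makes it easy to review against the SLA definitions.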

    A named problem manager owns the record end-to-end. Pull in subject-matter experts by domain: a network engineer for routing/DNS issues, an identity admin for Entra ID / Okta issues, a DBA for database issues. Avoid the anti-pattern of leaving the record assigned to a queue.

Investigation and Diagnosis

    Collect the relevant time window from Splunk, Sentinel, or Datadog; syslog from network gear; Windows Event Log and Linux journalctl from servers. Note any correlations with patch Tuesday rollouts, GPO changes, or recent CRs in the change calendar.
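
    The correlation with recent changes can be sketched as a simple time-window filter. The record fields, the change IDs, and the 72-hour lookback are all assumptions made for this example; a real change-calendar export will have its own schema.

```python
from datetime import datetime, timedelta


def changes_before(first_incident, changes, lookback=timedelta(hours=72)):
    """Return changes implemented within the lookback window, most recent first."""
    window_start = first_incident - lookback
    in_window = [
        c for c in changes
        if window_start <= c["implemented_at"] <= first_incident
    ]
    return sorted(in_window, key=lambda c: c["implemented_at"], reverse=True)


# Hypothetical change-calendar export.
changes = [
    {"id": "CHG-101", "summary": "GPO update", "implemented_at": datetime(2024, 3, 11, 22, 0)},
    {"id": "CHG-102", "summary": "Monthly patch rollout", "implemented_at": datetime(2024, 3, 12, 23, 0)},
    {"id": "CHG-090", "summary": "Firewall firmware", "implemented_at": datetime(2024, 2, 20, 21, 0)},
]

first_incident = datetime(2024, 3, 13, 8, 30)
suspects = changes_before(first_incident, changes)
for change in suspects:
    print(change["id"], change["summary"])
```

    A change that falls inside the window is only a lead to investigate, not evidence of cause.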

    Use 5 Whys, fishbone, or Kepner-Tregoe — pick a method and document the output. Distinguish proximate cause (the thing that broke) from root cause (why it was allowed to break). Common gotcha: stopping at the first plausible explanation.

    Use a lab tenant, dev VLAN, or pilot OU to recreate the conditions under which the problem occurs. If the issue can't be reproduced, document why (intermittent, environment-specific, only triggered under load) and proceed with the strongest hypothesis.

    Add the known error to the KEDB (known error database) in IT Glue, Hudu, or Confluence. Include the symptom, root cause, and any temporary workaround so the helpdesk can resolve recurring incidents at Tier 1 while the permanent fix is staged.

Workaround and Resolution

    Push the workaround to the helpdesk queue and end-user portal so Tier 1 can resolve incidents without escalating. Tag the workaround clearly as temporary — workarounds that ossify into permanent state are a common debt source.

    File the RFC with rollback plan, test evidence, blast radius, and proposed change window. Standard pre-approved fixes can ride the standard-change track; novel fixes go to the next CAB. Emergency change only if business impact is active.

    Deploy to a pilot OU, ring-1 device collection, or pilot client tenant before broad rollout. For patches, three-ring deployment (test → pilot → prod) over 7-14 days catches KB-introduced regressions before they reach the full user base.
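
    A minimal sketch of a three-ring schedule follows. The soak periods (3 days in test, 4 in pilot) are assumed values chosen to land inside the 7-14 day range; the function name and parameters are hypothetical.

```python
from datetime import date, timedelta


def ring_schedule(start: date, test_soak_days: int = 3, pilot_soak_days: int = 4) -> dict:
    """Return the start date of each ring for a test -> pilot -> prod rollout."""
    total = test_soak_days + pilot_soak_days
    if not 7 <= total <= 14:
        raise ValueError("total soak time should fall within 7-14 days")
    return {
        "test": start,
        "pilot": start + timedelta(days=test_soak_days),
        "prod": start + timedelta(days=total),
    }


schedule = ring_schedule(date(2024, 3, 12))
for ring, starts_on in schedule.items():
    print(ring, starts_on.isoformat())
```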

    Execute exactly the approved plan — no off-script changes during the window. If conditions on the ground deviate from the plan, abort and re-CAB. Capture pre/post screenshots and config diffs for the change record.

    Watch PRTG, Auvik, or LogicMonitor dashboards for the affected service; confirm the originating incidents have not recurred. For user-facing issues, contact a sample of affected users to confirm — green dashboards aren't proof of user-perceived resolution.

Review and Closure

    If the verification step came back Partial or No, route back to the SME team with the new evidence. Do not close the problem record on the assumption that another fix attempt will be sufficient — recurrence has to be proven absent over a watch period.
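
    The watch-period rule can be expressed as a small check. The 14-day default and the function name are assumptions for illustration; pick a watch period that matches how often the problem used to recur.

```python
from datetime import datetime, timedelta


def ready_to_close(fix_deployed_at, incident_opened_times, now, watch=timedelta(days=14)):
    """True only if the watch period has elapsed with no linked incident opened after the fix."""
    recurred = any(opened > fix_deployed_at for opened in incident_opened_times)
    watch_elapsed = (now - fix_deployed_at) >= watch
    return watch_elapsed and not recurred


fix = datetime(2024, 3, 19, 22, 0)
linked_incidents = [datetime(2024, 3, 13, 8, 30), datetime(2024, 3, 15, 9, 0)]

print(ready_to_close(fix, linked_incidents, now=datetime(2024, 3, 25)))  # still inside the watch period
print(ready_to_close(fix, linked_incidents, now=datetime(2024, 4, 5)))   # watch period elapsed, no recurrence
```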

    Hold a 30-60 minute PIR with the problem manager, SME team, and a service-owner stakeholder. Cover what happened, what worked, what didn't, and what changes prevent recurrence. Document attendees and action items in the problem record.

    Promote the temporary workaround entry into a permanent known-error record with the validated fix. Update Tier 1 runbooks so the helpdesk handles future occurrences without escalation.

    The problem manager confirms all linked incidents are resolved, the KEDB entry is current, and any preventive actions have owners and due dates. Capture the sign-off in the ticket; closure without a recorded sign-off is a common finding in SOC 2 change-management reviews.

Continuous Improvement

    Pull MTTR, recurrence rate, and top-cause categories from the PSA / ITSM reporting (ConnectWise BrightGauge, ServiceNow Performance Analytics). Look for patterns: is one vendor, one site, or one service generating disproportionate problems?
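
    If the reporting tool only offers a raw export, the same figures can be approximated by hand. The field names and sample records below are hypothetical, and "recurrence rate" is defined here as the share of incidents that repeat an already-seen problem, which is one possible definition among several.

```python
from collections import Counter
from datetime import datetime


def summarize(incidents):
    """Return (MTTR in hours, recurrence rate, top cause categories)."""
    if not incidents:
        raise ValueError("no incidents to summarize")
    hours = [(i["resolved"] - i["opened"]).total_seconds() / 3600 for i in incidents]
    mttr_hours = sum(hours) / len(hours)
    # Assumed definition: incidents beyond the first one per problem count as repeats.
    distinct_problems = len({i["problem"] for i in incidents})
    recurrence_rate = (len(incidents) - distinct_problems) / len(incidents)
    top_causes = Counter(i["category"] for i in incidents).most_common(3)
    return mttr_hours, recurrence_rate, top_causes


# Hypothetical incident export.
incidents = [
    {"problem": "PRB1", "category": "dns", "opened": datetime(2024, 1, 1, 9), "resolved": datetime(2024, 1, 1, 11)},
    {"problem": "PRB1", "category": "dns", "opened": datetime(2024, 1, 3, 9), "resolved": datetime(2024, 1, 3, 13)},
    {"problem": "PRB2", "category": "vpn", "opened": datetime(2024, 1, 4, 9), "resolved": datetime(2024, 1, 4, 12)},
]

mttr_hours, recurrence_rate, top_causes = summarize(incidents)
print(f"MTTR: {mttr_hours:.1f} h, recurrence: {recurrence_rate:.0%}, top causes: {top_causes}")
```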

    Convert PIR action items into trackable work — RMM script updates, monitoring threshold changes, GPO refinements, vendor escalations, project-engineer SOWs. Items without owners and due dates are the failure mode here.
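
    A quick check for the failure mode above, assuming action items are exported as simple records. The field names and sample items are made up for the example.

```python
from datetime import date


def untracked(items):
    """Return action items that are missing an owner or a due date."""
    return [i for i in items if not i.get("owner") or not i.get("due")]


# Hypothetical PIR action items.
action_items = [
    {"title": "Update RMM disk-cleanup script", "owner": "jlee", "due": date(2024, 4, 1)},
    {"title": "Lower VPN latency alert threshold", "owner": None, "due": date(2024, 4, 8)},
    {"title": "Escalate firmware bug to vendor", "owner": "mpatel", "due": None},
]

for item in untracked(action_items):
    print("Needs owner or due date:", item["title"])
```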

    Walk the service desk through the new known error and resolution path so the next ticket gets resolved at Tier 1 instead of escalating. For MSPs, share across all client-supporting techs — silos by client lead to the same problem being re-discovered separately.
