Problem Management Checklist

ITIL-aligned problem management workflow for sysadmin and MSP teams investigating recurring incidents, identifying root cause, and driving permanent fixes back into the environment.

5 sections, 20 steps
1. Identification and Logging

  1. Open the problem record in the PSA or ITSM
    • Create a problem ticket in ServiceNow, Jira Service Management, ConnectWise PSA, or Autotask — separate from the underlying incident tickets. Link the parent problem to every related incident so MTTR and recurrence metrics roll up correctly.

  2. Document affected users and systems
    • Capture user count, sites, business units, and the affected services (M365, VPN, file shares, line-of-business apps). For MSPs, name the client tenant. Pull from the RMM (NinjaOne, Datto RMM, Kaseya VSA) and the helpdesk queue, not just user reports.

  3. Set problem priority and severity
    • Use the priority matrix (impact × urgency). P1/SEV1 means business-stopping for multiple users; P2 is degraded service; P3 is recurring nuisance. The priority drives SLA timers and whether the major-incident path applies.
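
The impact × urgency lookup above can be sketched in a few lines of Python. This is a minimal illustration, not a standard: the three-level `Level` scale, the `priority` function, and the score thresholds are all assumptions chosen to produce the P1 to P3 range described in this step. Substitute the matrix your PSA or ITSM actually uses.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical rating used for both impact and urgency (1 = highest)."""
    HIGH = 1
    MEDIUM = 2
    LOW = 3


def priority(impact: Level, urgency: Level) -> str:
    """Map impact x urgency to P1-P3 using assumed thresholds."""
    score = impact + urgency  # ranges from 2 (high/high) to 6 (low/low)
    if score == 2:
        return "P1"
    if score <= 4:
        return "P2"
    return "P3"


# Example: a high-impact, medium-urgency problem.
label = priority(Level.HIGH, Level.MEDIUM)
```

Keeping the mapping in one function means the SLA timers and the major-incident decision read from a single definition instead of each technician's judgment.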

  4. Assign the problem manager and SME team
    • A named problem manager owns the record end-to-end. Pull in subject-matter experts by domain: a network engineer for routing/DNS issues, an identity admin for Entra ID / Okta issues, a DBA for database issues. Avoid the anti-pattern of leaving the record assigned to a queue rather than a person.

2. Investigation and Diagnosis

  1. Pull logs from SIEM and monitoring
    • Collect the relevant time window from Splunk, Sentinel, or Datadog; syslog from network gear; and Windows Event Log and journald (journalctl) output from servers. Note any correlations with Patch Tuesday rollouts, GPO changes, or recent change requests (CRs) in the change calendar.
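
Correlating incidents with the change calendar can be done by hand, but a short script makes the lookback window explicit. The sketch below assumes change records have already been exported as `(change_id, implemented_at)` pairs. The `changes_before` helper, the 72-hour default window, and the sample CR numbers are all hypothetical.

```python
from datetime import datetime, timedelta


def changes_before(incident_time, changes, lookback=timedelta(hours=72)):
    """Return IDs of changes implemented within `lookback` before the incident.

    `changes` is an iterable of (change_id, implemented_at) pairs.
    """
    window_start = incident_time - lookback
    return [cid for cid, when in changes if window_start <= when <= incident_time]


# Hypothetical sample data for illustration only.
changes = [
    ("CR-101", datetime(2024, 3, 12, 22, 0)),
    ("CR-102", datetime(2024, 3, 1, 9, 0)),
    ("CR-103", datetime(2024, 3, 14, 18, 0)),
]
suspects = changes_before(datetime(2024, 3, 14, 8, 30), changes)
```

A change inside the window is only a candidate cause; it still has to survive the root cause analysis in the next step.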

  2. Run root cause analysis with the SME team
    • Use 5 Whys, fishbone, or Kepner-Tregoe — pick a method and document the output. Distinguish proximate cause (the thing that broke) from root cause (why it was allowed to break). Common gotcha: stopping at the first plausible explanation.

  3. Reproduce the issue in a non-production environment
    • Use a lab tenant, dev VLAN, or pilot OU to recreate the conditions. If the issue can't be reproduced, document why (intermittent, environment-specific, only triggered under load) and proceed with the strongest hypothesis.

  4. Document findings as a known error
    • Add the known error to the KEDB (known error database) in IT Glue, Hudu, or Confluence. Include the symptom, root cause, and any temporary workaround so the helpdesk can resolve recurring incidents at Tier 1 while the permanent fix is staged.

3. Workaround and Resolution

  1. Publish a temporary workaround
    • Push the workaround to the helpdesk queue and end-user portal so Tier 1 can resolve incidents without escalating. Tag the workaround clearly as temporary — workarounds that ossify into permanent state are a common debt source.

  2. Submit a change request to the CAB
    • File the RFC with rollback plan, test evidence, blast radius, and proposed change window. Standard pre-approved fixes can ride the standard-change track; novel fixes go to the next CAB. Emergency change only if business impact is active.

  3. Test the permanent fix in the pilot ring
    • Deploy to a pilot OU, ring-1 device collection, or pilot client tenant before broad rollout. For patches, three-ring deployment (test → pilot → prod) over 7-14 days catches KB-introduced regressions before they reach the full user base.
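
A ring schedule is easier to put in an RFC when the dates are computed rather than estimated. The following sketch assumes per-ring day offsets of 0, 3, and 10, which fall inside the 7-14 day range mentioned above. The offsets, the `RING_OFFSETS` name, and the `ring_schedule` helper are illustrative assumptions, not a recommended cadence.

```python
from datetime import date, timedelta

# Assumed day offsets per ring; adjust to your own soak times.
RING_OFFSETS = {"test": 0, "pilot": 3, "prod": 10}


def ring_schedule(start: date, offsets=RING_OFFSETS):
    """Return the planned start date for each deployment ring."""
    return {ring: start + timedelta(days=days) for ring, days in offsets.items()}


# Example: a rollout that enters the test ring on 11 June 2024.
schedule = ring_schedule(date(2024, 6, 11))
```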

  4. Roll out the fix during the change window
    • Execute exactly the approved plan — no off-script changes during the window. If conditions on the ground deviate from the plan, abort and re-CAB. Capture pre/post screenshots and config diffs for the change record.

  5. Verify resolution with monitoring and end users
    • Watch PRTG, Auvik, or LogicMonitor dashboards for the affected service; confirm the originating incidents have not recurred. For user-facing issues, contact a sample of affected users to confirm — green dashboards aren't proof of user-perceived resolution.

4. Review and Closure

  1. Reopen investigation if the fix did not hold
    • If verification shows the fix held only partially or not at all, route the problem back to the SME team with the new evidence. Do not close the problem record on the assumption that another fix attempt will be sufficient; recurrence has to be proven absent over a watch period.
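
The watch-period rule can be expressed as a small decision function. This is a sketch under stated assumptions: the 14-day default watch period, the `fix_status` name, and the three status strings are hypothetical, and linked incident timestamps are assumed to be available from the ticketing system.

```python
from datetime import datetime, timedelta


def fix_status(fix_deployed, incident_times, now, watch=timedelta(days=14)):
    """Classify a problem record based on incidents seen after the fix.

    Returns "reopen" if any linked incident occurred after the fix,
    "watching" if the watch period has not yet elapsed, and
    "ready_to_close" otherwise.
    """
    if any(t > fix_deployed for t in incident_times):
        return "reopen"
    if now - fix_deployed < watch:
        return "watching"
    return "ready_to_close"


# Example: fix deployed 1 May, checked on 5 May with no new incidents yet.
status = fix_status(
    datetime(2024, 5, 1),
    [datetime(2024, 4, 20)],
    now=datetime(2024, 5, 5),
)
```

Checking for recurrence first means a single post-fix incident reopens the investigation even if the watch period has already elapsed.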

  2. Run the post-implementation review
    • Hold a 30-60 minute PIR with the problem manager, SME team, and a service-owner stakeholder. Cover what happened, what worked, what didn't, and what changes prevent recurrence. Document attendees and action items in the problem record.

  3. Update the KEDB and runbooks
    • Promote the temporary workaround entry into a permanent known-error record with the validated fix. Update Tier 1 runbooks so the helpdesk handles future occurrences without escalation.

  4. Close the problem record with sign-off
    • The problem manager confirms all linked incidents are resolved, the KEDB entry is current, and any preventive actions have owners and due dates. Capture the sign-off in the ticket; closure without sign-off is a common audit finding in SOC 2 change-management reviews.

5. Continuous Improvement

  1. Review problem trends in the monthly ops meeting
    • Pull MTTR, recurrence rate, and top-cause categories from the PSA / ITSM reporting (ServiceNow Performance Analytics, ConnectWise BrightGauge). Look for patterns: is one vendor, one site, or one service generating disproportionate problems?
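
When the reporting tool cannot produce these figures directly, they can be computed from an export. The sketch below covers MTTR and top-cause categories only, and assumes each incident is a dict with `opened`, `resolved`, and `cause` keys. Those field names and both helper functions are hypothetical, so map them to whatever your PSA export provides.

```python
from collections import Counter
from datetime import datetime
from statistics import mean


def mttr_hours(incidents):
    """Mean time to resolve, in hours, across resolved incidents."""
    return mean(
        (i["resolved"] - i["opened"]).total_seconds() / 3600 for i in incidents
    )


def top_causes(incidents, n=3):
    """Return the n most frequent cause categories with their counts."""
    return Counter(i["cause"] for i in incidents).most_common(n)


# Hypothetical export rows for illustration only.
incidents = [
    {"opened": datetime(2024, 1, 1, 8), "resolved": datetime(2024, 1, 1, 10), "cause": "dns"},
    {"opened": datetime(2024, 1, 2, 9), "resolved": datetime(2024, 1, 2, 13), "cause": "dns"},
    {"opened": datetime(2024, 1, 3, 9), "resolved": datetime(2024, 1, 3, 12), "cause": "cert"},
]
monthly_mttr = mttr_hours(incidents)
monthly_causes = top_causes(incidents)
```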

  2. File preventive action items in the backlog
    • Convert PIR action items into trackable work — RMM script updates, monitoring threshold changes, GPO refinements, vendor escalations, project-engineer SOWs. Items without owners and due dates are the failure mode here.

  3. Train Tier 1 and Tier 2 on the updated runbook
    • Walk the service desk through the new known error and resolution path so the next ticket gets resolved at Tier 1 instead of escalating. For MSPs, share across all client-supporting techs — silos by client lead to the same problem being re-discovered separately.

Category: Systems Administration