Software Project Risk Management Checklist

Quarterly risk-management cycle for a software engineering team — identify, score, and mitigate technical, security, vendor, and schedule risks across the project portfolio. Run by an engineering manager or technical program manager with...

1

Risk Identification Kickoff

  1. Schedule the pre-mortem workshop
    • Block 90 minutes with tech leads, SRE on-call, AppSec, and the product manager. Pre-mortem framing: "It's six months from now and the project failed — what went wrong?" Async brainstorming in a shared doc 24 hours ahead surfaces more than a cold-start meeting.

  2. Pull contributing factors from past PIRs
    • Review the last 4 quarters of post-incident reviews in Confluence/Notion. Extract recurring contributing factors — flaky CI, untested rollbacks, certificate expiry, unowned services. Recurring factors are the strongest signal for risks worth registering.
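
One way to find the recurring factors is to tag each review and count how many reviews mention each tag. The sketch below assumes the reviews have already been tagged by hand; the `pirs` data, the tag names, and the `min_count` cutoff are illustrative, not taken from any real export.

```python
from collections import Counter

# Hypothetical input: one list of contributing-factor tags per post-incident review.
pirs = [
    ["flaky-ci", "untested-rollback"],
    ["certificate-expiry"],
    ["flaky-ci", "unowned-service"],
    ["untested-rollback", "flaky-ci"],
]


def recurring_factors(reviews, min_count=2):
    """Return (factor, count) pairs seen in at least min_count reviews."""
    # set(review) counts a factor once per review, even if it was tagged twice.
    counts = Counter(tag for review in reviews for tag in set(review))
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


print(recurring_factors(pirs))
```

Factors that clear the cutoff are candidates for the register; one-off factors may still matter, so treat the count as a prompt for discussion rather than a filter.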

  3. Inventory third-party dependencies and SaaS vendors
    • Generate the SBOM (Syft, Trivy, or your registry's built-in) and list paid vendors from the procurement system. Watch for transitive critical-CVE dependencies (think Log4Shell-class), single-vendor lock-in (auth provider, payments), and packages without a maintained upstream.

  4. Capture risks raised by on-call engineers
    • On-call sees the rough edges first — alert noise, runbooks that don't match reality, services with a single SME. Ask the last two rotations directly; don't rely on tickets alone.

2

Risk Analysis and Scoring

  1. Score each risk on probability and impact
    • Rate probability and impact each on a 1–5 scale and multiply them for a score from 1 to 25. Impact dimensions: customer-facing downtime, data exposure, revenue, and engineering toil. Anything scoring 15+ goes on the top-tier list and needs a named owner this cycle.
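
The scoring arithmetic can be sketched as follows. The 15+ threshold comes from the checklist; taking the worst-rated impact dimension as the overall impact is an assumption made here for illustration, and the function names are hypothetical.

```python
TOP_TIER_THRESHOLD = 15  # from the checklist: 15+ is top tier


def overall_impact(dimension_ratings: dict) -> int:
    """Assumption: the worst-rated dimension sets the overall impact."""
    return max(dimension_ratings.values())


def risk_score(probability: int, impact: int) -> int:
    """Multiply two 1-5 ratings into a 1-25 score."""
    for name, value in (("probability", probability), ("impact", impact)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return probability * impact


def is_top_tier(probability: int, impact: int) -> bool:
    return risk_score(probability, impact) >= TOP_TIER_THRESHOLD


# Illustrative ratings for the four impact dimensions named above.
impact = overall_impact(
    {"downtime": 4, "data_exposure": 5, "revenue": 3, "toil": 2}
)
print(risk_score(3, impact), is_top_tier(3, impact))
```

Teams that prefer an average or weighted impact can swap out `overall_impact` without touching the rest.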

  2. Classify the project's regulatory scope
    • Confirm what regulated data the in-scope services touch. PHI pulls in HIPAA + BAA review; cardholder data pulls in PCI scope; EU resident data pulls in GDPR sub-processor obligations. Misclassification here is the most common reason auditors find a control gap later.

  3. Log entries in the risk register
    • Single source of truth — Jira, Linear, or a Notion table linked from the engineering wiki. Each entry gets: ID, description, category (technical / security / vendor / schedule / compliance), score, owner, mitigation, status. Avoid private spreadsheets; auditors and successors won't find them.
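
If the register is also kept in code or exported for reporting, the fields listed above might map to a record like this. It is a minimal sketch: the `RiskEntry` and `Category` names, the `"open"` default status, and the sample values are hypothetical, and the 1-25 score range assumes the 5 × 5 matrix from the scoring step.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    TECHNICAL = "technical"
    SECURITY = "security"
    VENDOR = "vendor"
    SCHEDULE = "schedule"
    COMPLIANCE = "compliance"


@dataclass
class RiskEntry:
    id: str
    description: str
    category: Category
    score: int
    owner: str
    mitigation: str
    status: str = "open"  # hypothetical default; use your tracker's states

    def __post_init__(self):
        # Reject entries that would be orphaned or mis-scored from day one.
        if not self.owner.strip():
            raise ValueError("every entry needs a named owner")
        if not 1 <= self.score <= 25:
            raise ValueError("score must be between 1 and 25")


entry = RiskEntry(
    id="RISK-001",
    description="Checkout service has a single SME",
    category=Category.TECHNICAL,
    score=16,
    owner="Jane Doe",
    mitigation="Pair rotation and runbook review",
)
print(entry)
```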

  4. Run the SOC 2 / HIPAA / PCI control mapping review
    • Map each compliance-relevant risk to the affected control (CC6.x for access, CC7.x for monitoring, CC8.x for change management under SOC 2). Loop in the compliance lead or your Vanta/Drata/Secureframe owner to confirm the gap is registered and an evidence task exists.

3

Mitigation Planning

  1. Assign a named owner to each top-tier risk
    • One human per risk, not a team. The owner drives the mitigation plan, reports status at the monthly review, and closes the entry. Rotate ownership when people change roles — orphaned risks are how a tracked gap becomes a Sev1.

  2. Draft mitigation plans for top-tier risks
    • Each plan needs: concrete engineering work (linked tickets), a target completion date, and the residual-risk score after mitigation. Vague mitigations ("improve observability") don't ship; "add SLO burn-rate alert on checkout-service p99" does.

  3. Define rollback triggers and kill-switch flags
    • For each release-related risk, document the trigger condition (error rate > X%, p99 > Y ms, customer support tickets > Z/hr) and the operator action (flip the LaunchDarkly flag, redeploy previous container tag, run the rollback migration). The PagerDuty runbook link goes here too.
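
Writing the triggers down as data makes them easy to review and test. The sketch below pairs each trigger condition with its operator action; the metric names and threshold values are placeholders standing in for X, Y, and Z, and `RollbackTrigger` and `actions_to_take` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    metric: str       # name of the observed metric
    threshold: float  # fire when the observed value exceeds this
    action: str       # what the operator does


# Placeholder thresholds; substitute the values agreed for the release.
TRIGGERS = [
    RollbackTrigger("error_rate_pct", 2.0, "flip the kill-switch flag"),
    RollbackTrigger("p99_ms", 800.0, "redeploy previous container tag"),
    RollbackTrigger("support_tickets_per_hr", 20.0, "run the rollback migration"),
]


def actions_to_take(triggers, observed):
    """Return the actions whose metric exceeds its threshold.

    Metrics missing from `observed` never fire.
    """
    return [
        t.action
        for t in triggers
        if observed.get(t.metric, float("-inf")) > t.threshold
    ]


print(actions_to_take(TRIGGERS, {"error_rate_pct": 2.5, "p99_ms": 400.0}))
```

A missing metric is treated as "not firing" here; a stricter variant could raise instead, since a silent metric during a release is itself a warning sign.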

  4. Confirm residual risk is within appetite
    • After applying mitigations, re-score each top-tier risk. Anything still scoring 15+ is residual exposure leadership needs to accept explicitly — it doesn't go away because you wrote a plan.

4

Monitoring and Control

  1. Wire risk indicators into Datadog or Grafana
    • If the risk has a leading indicator (Dependabot critical-CVE count, certificate days-to-expiry, p99 latency budget burn), it goes on a dashboard with an alert routing to the risk owner — not a deprecated #alerts channel. "Backup nightly green for 18 months" without a restore test is not monitoring.
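
As an example of a leading indicator, certificate days-to-expiry reduces to a date subtraction plus a warning window. This sketch assumes the certificate's expiry timestamp has already been fetched by some other means; the 30-day window and the function names are illustrative assumptions, and the routing to the risk owner is left to the alerting tool.

```python
from datetime import datetime, timezone


def days_to_expiry(not_after: datetime, now: datetime) -> int:
    """Whole days until the certificate expires (negative if already expired)."""
    return (not_after - now).days


def expiry_alert(not_after: datetime, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate is inside the warning window or expired."""
    return days_to_expiry(not_after, now) <= warn_days


# Fixed timestamps so the example is reproducible.
now = datetime(2025, 2, 1, tzinfo=timezone.utc)
not_after = datetime(2025, 3, 1, tzinfo=timezone.utc)
print(days_to_expiry(not_after, now), expiry_alert(not_after, now))
```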

  2. Hold the monthly risk register review
    • 30 minutes, calendar-recurring. Owners report status on their entries, retire mitigated risks, add new ones surfaced since last cycle. Skipping the review is how registers become museum pieces.

  3. Re-test rollback and restore procedures
    • Run a quarterly drill in a non-prod environment: restore the latest backup, redeploy the previous container tag, run the down migration. The first restore attempt usually fails on a rotated credential or a missing IAM permission; finding that during a drill is the point.

5

Stakeholder Communication

  1. Brief the CTO on accepted residual risk
    • For any risk still scoring 15+ after mitigation, schedule a 15-minute briefing with the CTO. Capture the explicit accept/reject decision in the register so it's defensible at the next audit walkthrough.

  2. Post the risk summary in #engineering
    • One Slack post per cycle: top three risks, owners, target dates, and a link to the register. Async visibility prevents "nobody told me" surprises during release weeks.

  3. Hold the quarterly risk retrospective
    • Look back at the cycle: which risks materialized despite mitigation, which we missed entirely, and which controls actually held. Feed the answers into next quarter's identification step — that's how risk management compounds instead of resetting.