Software Project Risk Management Checklist
Risk Identification Kickoff
Block 90 minutes with tech leads, SRE on-call, AppSec, and the product manager. Pre-mortem framing: "It's six months from now and the project failed — what went wrong?" Async brainstorming in a shared doc 24 hours ahead surfaces more than a cold-start meeting.
Review the last 4 quarters of post-incident reviews in Confluence/Notion. Extract recurring contributing factors — flaky CI, untested rollbacks, certificate expiry, unowned services. Recurring factors are the strongest signal for risks worth registering.
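As a rough illustration of extracting recurring factors, the sketch below counts how many reviews mention each contributing factor. It assumes each post-incident review has already been hand-tagged; the `reviews` data, the tag names, and the `min_count` cutoff are hypothetical, not something Confluence or Notion exports.

```python
from collections import Counter

# Hypothetical input: one list of contributing-factor tags per post-incident review.
reviews = [
    ["flaky-ci", "untested-rollback"],
    ["certificate-expiry"],
    ["flaky-ci", "unowned-service"],
    ["untested-rollback", "flaky-ci"],
]


def recurring_factors(reviews, min_count=2):
    """Return (factor, count) pairs seen in at least min_count reviews."""
    # set() so a factor tagged twice in one review is counted once.
    counts = Counter(tag for tags in reviews for tag in set(tags))
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


print(recurring_factors(reviews))
```

Factors that clear the cutoff are candidates for the register; one-off factors can still be registered on judgment.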
Generate the SBOM (Syft, Trivy, or your registry's built-in) and list paid vendors from the procurement system. Watch for transitive critical-CVE dependencies (think Log4Shell-class), single-vendor lock-in (auth provider, payments), and packages without a maintained upstream.
On-call sees the rough edges first — alert noise, runbooks that don't match reality, services with a single SME. Ask the last two rotations directly; don't rely on tickets alone.
Risk Analysis and Scoring
Use a 1–5 × 1–5 matrix for probability × impact. Impact dimensions: customer-facing downtime, data exposure, revenue, and engineering toil. Anything scoring 15+ goes on the top-tier list and needs a named owner this cycle.
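A minimal sketch of the scoring rule, assuming the overall impact is taken as the worst of the four dimensions. That roll-up is an assumption for illustration (the checklist does not say how to combine dimensions), and the function names are hypothetical.

```python
TOP_TIER_THRESHOLD = 15  # from the checklist: 15+ goes on the top-tier list


def overall_impact(dimensions: dict) -> int:
    """Assumption: the worst-scoring dimension drives the overall impact."""
    return max(dimensions.values())


def risk_score(probability: int, impact: int) -> int:
    """Probability x impact, each on a 1-5 scale."""
    for name, value in (("probability", probability), ("impact", impact)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return probability * impact


def is_top_tier(probability: int, impact: int) -> bool:
    return risk_score(probability, impact) >= TOP_TIER_THRESHOLD


# Hypothetical example: likely outage, moderate revenue hit.
impact = overall_impact(
    {"downtime": 4, "data_exposure": 2, "revenue": 3, "toil": 1}
)
print(risk_score(4, impact), is_top_tier(4, impact))
```

Teams that prefer a weighted average over the worst-case roll-up can swap `overall_impact` without touching the threshold check.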
Confirm what regulated data the in-scope services touch. PHI pulls in HIPAA + BAA review; cardholder data pulls in PCI scope; EU resident data pulls in GDPR sub-processor obligations. Misclassification here is the most common reason auditors find a control gap later.
Keep the risk register in a single source of truth: Jira, Linear, or a Notion table linked from the engineering wiki. Each entry gets: ID, description, category (technical / security / vendor / schedule / compliance), score, owner, mitigation, status. Avoid private spreadsheets; auditors and successors won't find them.
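The entry fields listed above can be sketched as a small record type. This is only an illustration of the shape; the class names, the example values, and the `missing_fields` helper are hypothetical and not tied to any Jira, Linear, or Notion schema.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Category(Enum):
    TECHNICAL = "technical"
    SECURITY = "security"
    VENDOR = "vendor"
    SCHEDULE = "schedule"
    COMPLIANCE = "compliance"


@dataclass
class RiskEntry:
    id: str
    description: str
    category: Category
    score: int
    owner: str
    mitigation: str
    status: str


def missing_fields(entry: RiskEntry) -> list:
    """Names of fields left empty, so incomplete entries are easy to spot."""
    return [f.name for f in fields(entry) if getattr(entry, f.name) in ("", None)]


# Hypothetical entry with no owner assigned yet.
entry = RiskEntry(
    id="RISK-1",
    description="Rollback path for checkout-service is untested",
    category=Category.TECHNICAL,
    score=16,
    owner="",
    mitigation="Add rollback drill to release checklist",
    status="open",
)
print(missing_fields(entry))
```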
Map each compliance-relevant risk to the affected control (CC6.x for access, CC7.x for monitoring, CC8.x for change management under SOC 2). Loop in the compliance lead or your Vanta/Drata/Secureframe owner to confirm the gap is registered and an evidence task exists.
Mitigation Planning
One human per risk, not a team. The owner drives the mitigation plan, reports status at the monthly review, and closes the entry. Rotate ownership when people change roles — orphaned risks are how a tracked gap becomes a Sev1.
Each plan needs: concrete engineering work (linked tickets), a target completion date, and the residual-risk score after mitigation. Vague mitigations ("improve observability") don't ship; "add SLO burn-rate alert on checkout-service p99" does.
For each release-related risk, document the trigger condition (error rate > X%, p99 > Y ms, customer support tickets > Z/hr) and the operator action (flip the LaunchDarkly flag, redeploy previous container tag, run the rollback migration). The PagerDuty runbook link goes here too.
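One way to make trigger conditions unambiguous is to write them down as data and check live readings against them. The sketch below assumes the readings come from your own monitoring; the threshold values stand in for X, Y, and Z, which the team still has to choose, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTriggers:
    max_error_rate_pct: float
    max_p99_ms: float
    max_tickets_per_hour: float


def breached_triggers(triggers, error_rate_pct, p99_ms, tickets_per_hour):
    """Return the names of every trigger condition currently exceeded."""
    breached = []
    if error_rate_pct > triggers.max_error_rate_pct:
        breached.append("error_rate")
    if p99_ms > triggers.max_p99_ms:
        breached.append("p99_latency")
    if tickets_per_hour > triggers.max_tickets_per_hour:
        breached.append("support_tickets")
    return breached


# Placeholder thresholds for illustration only.
triggers = RollbackTriggers(
    max_error_rate_pct=2.0, max_p99_ms=800.0, max_tickets_per_hour=20.0
)
print(breached_triggers(triggers, error_rate_pct=3.5, p99_ms=420.0, tickets_per_hour=5))
```

A non-empty result means the documented operator action applies; the code only decides, it does not flip flags or redeploy anything.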
After applying mitigations, re-score each top-tier risk. Anything still scoring 15+ is residual exposure leadership needs to accept explicitly — it doesn't go away because you wrote a plan.
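The re-scoring step reduces to a filter over residual scores. A minimal sketch, assuming each risk is a plain dict with hypothetical `id`, `score`, and `residual_score` keys:

```python
TOP_TIER_THRESHOLD = 15  # same 15+ cutoff used for the initial scoring


def needs_leadership_acceptance(risks, threshold=TOP_TIER_THRESHOLD):
    """IDs of risks whose score after mitigation is still top tier."""
    return [r["id"] for r in risks if r["residual_score"] >= threshold]


# Hypothetical register extract.
risks = [
    {"id": "RISK-1", "score": 20, "residual_score": 8},
    {"id": "RISK-2", "score": 16, "residual_score": 16},
    {"id": "RISK-3", "score": 25, "residual_score": 15},
]
print(needs_leadership_acceptance(risks))
```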
Monitoring and Control
If the risk has a leading indicator (Dependabot critical-CVE count, certificate days-to-expiry, p99 latency budget burn), it goes on a dashboard with an alert routing to the risk owner — not a deprecated #alerts channel. "Backup nightly green for 18 months" without a restore test is not monitoring.
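As an example of a leading indicator, certificate days-to-expiry is a simple date calculation once the certificate's not-after timestamp is known. The sketch below assumes that timestamp is already available as a timezone-aware `datetime`; fetching it from the certificate is out of scope here, and the 30-day warning window is a hypothetical default.

```python
from datetime import datetime, timezone
from typing import Optional


def days_to_expiry(not_after: datetime, now: Optional[datetime] = None) -> int:
    """Whole days remaining until the certificate's not-after timestamp."""
    now = now or datetime.now(timezone.utc)
    return (not_after - now).days


def should_alert(not_after: datetime, warn_days: int = 30,
                 now: Optional[datetime] = None) -> bool:
    """True once the certificate is inside the warning window."""
    return days_to_expiry(not_after, now) <= warn_days


# Hypothetical dates, fixed so the example is reproducible.
expiry = datetime(2025, 3, 1, tzinfo=timezone.utc)
today = datetime(2025, 2, 1, tzinfo=timezone.utc)
print(days_to_expiry(expiry, today), should_alert(expiry, now=today))
```

The alert that consumes `should_alert` should route to the risk owner named in the register.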
Hold the monthly review as a 30-minute, calendar-recurring meeting. Owners report status on their entries, retire mitigated risks, and add new ones surfaced since the last cycle. Skipping the review is how registers become museum pieces.
Quarterly drill into a non-prod environment: restore the latest backup, redeploy the previous container tag, run the down migration. The first restore attempt usually fails on a rotated credential or a missing IAM permission — finding that during a drill is the point.
Stakeholder Communication
For any risk still in the top tier (scoring 15+) after mitigation, schedule a 15-minute briefing with engineering leadership. Capture the explicit accept/reject decision in the register so it's defensible at the next audit walkthrough.
One Slack post per cycle: top three risks, owners, target dates, and a link to the register. Async visibility prevents "nobody told me" surprises during release weeks.
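The post itself can be generated from the register so it stays consistent cycle to cycle. A minimal sketch that only builds the message text (it does not call Slack); the dict keys, the example risks, and the register URL are hypothetical.

```python
def format_risk_digest(risks, register_url, top_n=3):
    """Build a plain-text digest of the highest-scoring risks."""
    top = sorted(risks, key=lambda r: r["score"], reverse=True)[:top_n]
    lines = ["Top risks this cycle:"]
    for i, r in enumerate(top, 1):
        lines.append(
            f"{i}. {r['title']} (score {r['score']}) | "
            f"owner: {r['owner']} | target: {r['target_date']}"
        )
    lines.append(f"Register: {register_url}")
    return "\n".join(lines)


# Hypothetical register extract and placeholder URL.
risks = [
    {"title": "Untested rollback", "score": 16, "owner": "alice", "target_date": "2025-03-15"},
    {"title": "Cert expiry on gateway", "score": 20, "owner": "bob", "target_date": "2025-02-28"},
    {"title": "Single-SME service", "score": 12, "owner": "carol", "target_date": "2025-04-01"},
    {"title": "Flaky CI", "score": 9, "owner": "dave", "target_date": "2025-04-15"},
]
digest = format_risk_digest(risks, "https://example.com/risk-register")
print(digest)
```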
Look back at the cycle: which risks materialized despite mitigation, which we missed entirely, and which controls actually held. Feed the answers into next quarter's identification step — that's how risk management compounds instead of resetting.
