CI/CD Pipeline Review Checklist

Version Control & Branch Hygiene

    Pick the strategy in use today, not the one in the wiki from 2021. Trunk-based with short-lived branches works well when the team is shipping daily and feature flags are mature; Gitflow fits scheduled-release shops with a long QA tail. The reviewer should be able to point to a recent PR that exemplifies the strategy.

    Required status checks should include the full PR pipeline (lint, unit, integration, SAST). Required reviewer count should be at least 1 from CODEOWNERS. Force-push and branch-deletion should be disabled. Admins-can-bypass should be off — the most common drift is an admin override that nobody remembers turning on for a hotfix six months ago.
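
    The settings above can be audited mechanically. The sketch below is an illustration, not a definitive implementation: it assumes the protection settings arrive as a dictionary shaped like the GitHub REST API branch-protection payload (verify the field names against your provider's documentation), and the job names in REQUIRED_CHECKS are hypothetical stand-ins for your own CI job names.

```python
from typing import Any

# Hypothetical CI job names; replace with the names your pipeline reports.
REQUIRED_CHECKS = {"lint", "unit", "integration", "sast"}


def audit_branch_protection(protection: dict[str, Any]) -> list[str]:
    """Return human-readable findings; an empty list means the branch passes."""
    findings: list[str] = []

    # Required status checks: every expected job must be listed.
    checks = protection.get("required_status_checks") or {}
    missing = REQUIRED_CHECKS - set(checks.get("contexts", []))
    if missing:
        findings.append(f"missing required checks: {sorted(missing)}")

    # Review rules: at least one approval, and it must come from CODEOWNERS.
    reviews = protection.get("required_pull_request_reviews") or {}
    if reviews.get("required_approving_review_count", 0) < 1:
        findings.append("fewer than one required approving review")
    if not reviews.get("require_code_owner_reviews", False):
        findings.append("CODEOWNERS review not required")

    # Toggles are assumed to be reported as {"enabled": bool}.
    def enabled(key: str) -> bool:
        return bool((protection.get(key) or {}).get("enabled", False))

    if not enabled("enforce_admins"):
        findings.append("admins can bypass protection")
    if enabled("allow_force_pushes"):
        findings.append("force-push is allowed")
    if enabled("allow_deletions"):
        findings.append("branch deletion is allowed")
    return findings
```

    Running a check like this on a schedule, rather than once per review, is one way to catch the forgotten hotfix override soon after it happens.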

    Every directory with production code should map to a team, not an individual. Spot-check three recent PRs and confirm the auto-requested reviewers were the right people. Stale ownership is the most common reason reviews stall.

    Trunk-based development relies on flags to keep main shippable. Pull the LaunchDarkly / Unleash / Flagsmith inventory and flag any toggle older than 90 days without a named owner. Stale flags fork the test matrix and hide dead code paths.
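
    The 90-day rule is easy to apply once the inventory is exported. A minimal sketch, assuming the export has been loaded into a list of records; the Flag class and its fields are hypothetical and do not correspond to any vendor's API.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Flag:
    # Hypothetical record shape for an exported flag inventory.
    key: str
    created: date
    owner: Optional[str]  # None or "" when nobody is named


def stale_unowned_flags(flags: list[Flag], today: date, max_age_days: int = 90) -> list[str]:
    """Return keys of flags older than max_age_days that have no named owner."""
    cutoff = today - timedelta(days=max_age_days)
    return [f.key for f in flags if f.created < cutoff and not f.owner]
```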

Automated Testing

    Capture the current line-coverage percentage from the CI output (Jest, pytest-cov, JaCoCo, etc.). The number itself matters less than the trend — coverage that's been silently sliding for two quarters is the signal to dig into. Confirm the build fails when coverage drops below the floor; coverage-as-warning gets ignored.
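
    The floor-plus-trend check can be sketched as a small gate. This is an illustration under stated assumptions: the coverage history is supplied as a list of percentages ordered oldest to newest, and the three-build trend window is an arbitrary choice, not a recommendation from the tools named above.

```python
def coverage_gate(history: list[float], floor: float, trend_window: int = 3) -> tuple[bool, str]:
    """Decide whether the build passes and explain why.

    history holds line-coverage percentages, oldest first; the last entry
    is the current build.
    """
    current = history[-1]
    if current < floor:
        # Hard failure: coverage-as-warning gets ignored.
        return False, f"coverage {current:.1f}% is below the {floor:.1f}% floor"

    # Trend check: strictly decreasing across the last trend_window builds.
    recent = history[-trend_window:]
    sliding = len(recent) == trend_window and all(
        later < earlier for earlier, later in zip(recent, recent[1:])
    )
    if sliding:
        return True, f"passing, but coverage fell {trend_window - 1} builds in a row"
    return True, "ok"
```

    Many coverage tools can enforce the floor on their own, for example pytest-cov's --cov-fail-under option or Jest's coverageThreshold setting; the trend check is the part that usually needs custom glue.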

    The integration suite should run on every PR, not just nightly. A suite that takes longer than 15 minutes is a separate problem worth a ticket, but skipping it on PRs is how regressions reach main.

    Flakes that have been quarantined for more than 30 days are either dead code waiting to be deleted or real bugs being ignored. Either resolve them or assign owners with a deadline. The cultural failure mode is 'just rerun it' becoming permission to ignore real regressions.

    The Playwright / Cypress / Selenium suite should run against the post-deploy staging build and gate promotion to production. Confirm the suite covers at least the top three customer paths: login, primary workflow, and billing.

Build Pipeline

    Confirm Semgrep / CodeQL / SonarQube run as required PR checks, not nightly batches. New findings should fail the PR with the offending line annotated; existing findings can be tracked as backlog. Lint config (ESLint, RuboCop, golangci-lint) lives in-repo so contributors run it locally.

    Use Syft, the CycloneDX tooling, or the GitHub-native SBOM action to emit an SPDX or CycloneDX file per build artifact. SBOMs are increasingly required for federal contracts (EO 14028) and are a precondition for fast triage when the next Log4Shell-class CVE drops.

    Container images should be signed with cosign / Sigstore and the deployment side should verify signatures before pulling. Unsigned-image-by-default in production is a supply-chain gap auditors flag.

Deployment Automation

    Capture the strategy actually in use for the primary service, not the one in the architecture diagram. Rolling is the default for stateless services on Kubernetes; canary requires a traffic-splitting layer (Istio, Argo Rollouts, or server-side LaunchDarkly flags); blue-green requires duplicate infrastructure and a cutover step.

    The canary needs its own dashboard with error rate, p99 latency, and saturation isolated to the canary cohort — comparing canary vs. baseline. If the dashboard lumps canary metrics with the rest of production, the canary signal is invisible and the strategy is theater.
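
    The canary-versus-baseline comparison can be sketched as follows. The CohortMetrics class and all three thresholds are hypothetical values chosen for illustration; real thresholds depend on the service's normal variance and traffic volume.

```python
from dataclasses import dataclass


@dataclass
class CohortMetrics:
    # Hypothetical per-cohort snapshot pulled from the metrics backend.
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    saturation: float      # e.g. CPU utilisation, 0.0 to 1.0


def canary_breaches(
    canary: CohortMetrics,
    baseline: CohortMetrics,
    max_error_delta: float = 0.005,      # assumed: +0.5 percentage points
    max_latency_ratio: float = 1.2,      # assumed: 20% slower than baseline
    max_saturation_delta: float = 0.10,  # assumed: +10 points of utilisation
) -> list[str]:
    """Return the signals on which the canary is worse than baseline."""
    breaches: list[str] = []
    if canary.error_rate - baseline.error_rate > max_error_delta:
        breaches.append("error_rate")
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        breaches.append("p99_latency")
    if canary.saturation - baseline.saturation > max_saturation_delta:
        breaches.append("saturation")
    return breaches
```

    The comparison only works when both snapshots are filtered to their own cohort, which is exactly what a lumped dashboard cannot provide.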

    The rollback procedure documented in the runbook is fiction until someone has run it end-to-end. Pick a recent release in staging, redeploy it, then roll it back. Common failures: the previous container image was pruned, the migration wasn't reversible, the rollback script depends on a credential that rotated. File a ticket for every gap; do not declare success until the drill works clean.

    Terraform plan / Pulumi preview should run nightly against production state and post drift to a Slack channel that someone reads. Without drift detection, a console-edit during an incident is permanent and silent.
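
    For Terraform, a nightly drift job can lean on the -detailed-exitcode flag of terraform plan, which exits 0 when there are no changes, 2 when changes are present, and 1 on error. The sketch below assumes the terraform binary is on the PATH and the working directory is already initialised; the function names are hypothetical, and posting the result to Slack is left out.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit status to a drift verdict."""
    if code == 0:
        return "in-sync"  # plan succeeded with an empty diff
    if code == 2:
        return "drift"    # plan succeeded and found changes
    return "error"        # 1 (or anything unexpected): the plan itself failed


def check_drift(workdir: str) -> tuple[str, str]:
    """Run a read-only plan in workdir and return (verdict, plan output)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=False,  # non-zero exit codes are expected and carry meaning here
    )
    return classify_plan_exit(result.returncode), result.stdout
```

    Treating "error" as its own outcome matters: a nightly job that fails on expired credentials should alert as loudly as one that finds drift, otherwise the channel goes quiet for the wrong reason.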

Observability & Alerting

    Every production service needs latency, traffic, errors, and saturation visible at a glance. Datadog, Grafana, or New Relic — the tool matters less than coverage. Spot-check that a new on-call engineer could open the dashboard for any service and read the current health in under 30 seconds.

    Trigger a test page through PagerDuty / Opsgenie and confirm it reaches the on-call engineer's phone within two minutes. Common failures: the rotation references a Slack channel that was renamed, or the escalation policy points to someone who left six months ago.

    For each tier-1 service, check the SLO target, current month's burn, and remaining error budget. If a service has been burning budget for two consecutive months without action items, that's the signal to slow feature work and invest in reliability.
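
    The error-budget arithmetic can be sketched for a request-based SLO. This assumes availability is measured as successful requests over total requests; reading "burning budget" as "consumed at least 100% of the month's budget" is an assumption exposed as the threshold parameter, and both function names are hypothetical.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict[str, float]:
    """Compute the month's error budget for a request-based SLO.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("SLO target leaves no error budget for this traffic volume")
    consumed = failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,         # 1.0 means the budget is exhausted
        "budget_remaining": 1.0 - consumed,  # negative when overspent
    }


def needs_reliability_focus(monthly_consumed: list[float], threshold: float = 1.0) -> bool:
    """True when the two most recent months each consumed at least `threshold` of budget."""
    return len(monthly_consumed) >= 2 and all(m >= threshold for m in monthly_consumed[-2:])
```

    For example, a 99.9% target over 1,000,000 requests allows roughly 1,000 failures, so 250 failed requests consume about a quarter of the month's budget.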

Security & Compliance

    Capture the compliance frameworks the pipeline must satisfy. SOC 2 touches change management, access reviews, and vulnerability management. HIPAA adds encryption in transit/at rest and a BAA with infrastructure providers. PCI DSS scope depends on whether you store, process, or transmit PAN.

    Export a sample of merged PRs from the past quarter showing approver, ticket reference, and CI status. Vanta / Drata / Secureframe should be wired to GitHub for automated collection — confirm the integration is healthy and a recent sample looks right. Manual evidence collection at audit time is the slow path.

    Pull the Dependabot / Snyk / Renovate queue. Critical and high CVEs in production deps should have a ticket and a deadline; transitive low-severity findings can be batched. The failure mode is a queue of 80+ ignored notifications hiding the next Log4Shell.
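
    The triage rule above can be sketched as a split of the queue into two piles. The Finding class is a hypothetical, tool-neutral record; each scanner's export uses its own field names, so a mapping step is assumed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    # Hypothetical normalised record for one vulnerability alert.
    package: str
    severity: str                 # "critical", "high", "medium", or "low"
    direct: bool                  # False for transitive dependencies
    ticket: Optional[str] = None  # tracking ticket reference, if any
    due: Optional[str] = None     # remediation deadline, if any


def triage(findings: list[Finding]) -> tuple[list[Finding], list[Finding]]:
    """Split the queue into (needs a ticket and deadline now, safe to batch)."""
    needs_action = [
        f for f in findings
        if f.severity in {"critical", "high"} and not (f.ticket and f.due)
    ]
    batchable = [f for f in findings if f.severity == "low" and not f.direct]
    return needs_action, batchable
```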

    Run gitleaks / trufflehog against the full repo history and confirm zero hits on production secrets. Confirm pre-commit hooks are installed for new contributors. Production credentials live in HashiCorp Vault, AWS Secrets Manager, or Doppler — never in committed .env files or CI variable UIs that aren't audited.

Documentation & Sign-Off

    The runbook in Confluence / Notion / Backstage should match what this review found — current branching strategy, test gates, deployment strategy, paging routes. Diagrams older than 12 months need a refresh. A new engineer should be able to read the runbook and ship their first change without shoulder-tapping a senior.

    Engineering manager or platform lead signs off on the review outcome. Open action items go into the next sprint with named owners; a punt-list with no owners is the same as no review.
