CI/CD Pipeline Review Checklist

A quarterly review that a platform or release engineering team runs to audit a CI/CD pipeline end to end: branch protection, test gating, build provenance, deployment strategy, observability, security controls, and runbook freshness. Designed ...

1

Version Control & Branch Hygiene

  1. Confirm the team's branching strategy
    • Pick the strategy in use today, not the one in the wiki from 2021. Trunk-based with short-lived branches works well when the team is shipping daily and feature flags are mature; Gitflow fits scheduled-release shops with a long QA tail. The reviewer should be able to point to a recent PR that exemplifies the strategy.

  2. Audit branch protection rules on main
    • Required status checks should include the full PR pipeline (lint, unit, integration, SAST). Required reviewer count should be at least 1 from CODEOWNERS. Force-push and branch-deletion should be disabled. Admins-can-bypass should be off — the most common drift is an admin override that nobody remembers turning on for a hotfix six months ago.

  3. Verify CODEOWNERS routes review requests
    • Every directory with production code should map to a team, not an individual. Spot-check three recent PRs and confirm the auto-requested reviewers were the right people. Stale ownership is the most common reason reviews stall.

  4. Review the feature flag inventory
    • Trunk-based development relies on flags to keep main shippable. Pull the LaunchDarkly / Unleash / Flagsmith inventory and flag any toggle older than 90 days without a named owner. Stale flags fork the test matrix and hide dead code paths.
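
A minimal sketch of the 90-day flag check in step 4, assuming the inventory has already been exported to plain records. The `Flag` fields and the `stale_unowned_flags` helper are hypothetical names chosen for illustration; they are not part of the LaunchDarkly, Unleash, or Flagsmith APIs, and the sample flags are made up.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Flag:
    """Hypothetical export record; real inventories use their own field names."""
    key: str
    created: date
    owner: Optional[str] = None


def stale_unowned_flags(flags: List[Flag], today: date, max_age_days: int = 90) -> List[str]:
    """Return keys of flags created more than max_age_days ago with no named owner."""
    cutoff = today - timedelta(days=max_age_days)
    return [f.key for f in flags if f.created < cutoff and not f.owner]


# Made-up sample inventory.
inventory = [
    Flag("new-checkout", date(2024, 5, 20), owner="payments"),  # recent and owned
    Flag("legacy-search", date(2023, 11, 2)),                   # old, no owner
    Flag("beta-banner", date(2024, 1, 15), owner="growth"),     # old, but owned
    Flag("dark-mode", date(2024, 5, 1)),                        # no owner, but recent
]
flagged = stale_unowned_flags(inventory, today=date(2024, 6, 1))
print(flagged)
```

With this sample only `legacy-search` is reported, since it is the one flag that is both past the age limit and unowned.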

2

Automated Testing

  1. Review unit test coverage thresholds
    • Capture the current line-coverage percentage from the CI output (Jest, pytest-cov, JaCoCo, etc.). The number itself matters less than the trend — coverage that's been silently sliding for two quarters is the signal to dig into. Confirm the build fails when coverage drops below the floor; coverage-as-warning gets ignored.

  2. Confirm integration tests gate PR merges
    • Integration suite should run on every PR, not just nightly. If the suite takes longer than 15 minutes, that's a separate problem worth a ticket — but skipping it on PRs is how regressions reach main.

  3. Audit the flaky test quarantine list
    • Flakes that have been quarantined for more than 30 days are either dead code waiting to be deleted or real bugs being ignored. Either resolve them or assign owners with a deadline. The cultural failure mode is 'just rerun it' becoming permission to ignore real regressions.

  4. Verify the e2e suite runs against staging
    • Playwright / Cypress / Selenium suite should run against the post-deploy staging build and gate promotion to production. Confirm the suite covers the top three customer paths — login, primary workflow, billing — at minimum.
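
The coverage step in this section separates two signals: a hard floor that should fail the build, and a downward trend worth investigating. A rough sketch of both checks follows, assuming line-coverage percentages have already been pulled out of the CI output. The function names, the three-reading window, and the sample numbers are illustrative assumptions, not settings from Jest, pytest-cov, or JaCoCo.

```python
from typing import List


def below_floor(current: float, floor: float) -> bool:
    """True when the newest coverage reading should fail the build."""
    return current < floor


def is_sliding(history: List[float], window: int = 3) -> bool:
    """True when the last `window` readings (oldest first) are strictly decreasing."""
    recent = history[-window:]
    if len(recent) < window:
        return False  # not enough data to call it a trend
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))


# Made-up readings, oldest first.
history = [84.0, 82.5, 81.0, 79.2]
fail_build = below_floor(history[-1], floor=80.0)
trend_warning = is_sliding(history)
print(fail_build, trend_warning)
```

In a CI job, `fail_build` would translate to a non-zero exit code, while `trend_warning` is better suited to a review note than a hard gate.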

3

Build Pipeline

  1. Run SAST and lint on every PR
    • Confirm Semgrep / CodeQL / SonarQube run as required PR checks, not nightly batches. New findings should fail the PR with the offending line annotated; existing findings can be tracked as backlog. Lint config (ESLint, RuboCop, golangci-lint) lives in-repo so contributors run it locally.

  2. Generate the SBOM during build
    • Use Syft, a CycloneDX generator, or the GitHub-native SBOM action to emit an SPDX or CycloneDX file per build artifact. SBOMs are increasingly required for federal contracts (EO 14028) and are a precondition for fast triage when the next Log4Shell-class CVE drops.

  3. Sign release artifacts in the registry
    • Container images should be signed with cosign / Sigstore and the deployment side should verify signatures before pulling. Unsigned-image-by-default in production is a supply-chain gap auditors flag.
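
To illustrate why a per-build SBOM speeds up triage, here is a small sketch that searches a CycloneDX JSON document for a package by name. It assumes the standard top-level `components` array with `name`, `version`, and `purl` fields, and it does not walk nested components. The `find_components` helper and the embedded sample SBOM are illustrative, not output from a real build.

```python
import json
from typing import Dict, List


def find_components(sbom: Dict, name: str) -> List[Dict[str, str]]:
    """Return name, version, and purl for each top-level component with an exact name match."""
    hits = []
    for component in sbom.get("components", []):
        if component.get("name") == name:
            hits.append({
                "name": component["name"],
                "version": component.get("version", "unknown"),
                "purl": component.get("purl", ""),
            })
    return hits


# Illustrative SBOM fragment, not taken from a real artifact.
sbom_text = """
{
  "bomFormat": "CycloneDX",
  "specVersion": "1.5",
  "components": [
    {"type": "library", "name": "log4j-core", "version": "2.14.1",
     "purl": "pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1"},
    {"type": "library", "name": "jackson-databind", "version": "2.15.2",
     "purl": "pkg:maven/com.fasterxml.jackson.core/jackson-databind@2.15.2"}
  ]
}
"""
hits = find_components(json.loads(sbom_text), "log4j-core")
print(hits)
```

Running the same lookup across the SBOMs of every deployed artifact answers "where do we ship this package, and at which version" without rebuilding anything.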

4

Deployment Automation

  1. Confirm the deployment strategy
    • Capture the strategy actually in use for the primary service, not the one in the architecture diagram. Rolling is the default for stateless services on Kubernetes; canary requires a traffic-splitting layer (Istio, Argo Rollouts, LaunchDarkly server-side); blue-green requires duplicate infrastructure and a cutover step.

  2. Verify the canary cohort dashboard
    • The canary needs its own dashboard with error rate, p99 latency, and saturation isolated to the canary cohort — comparing canary vs. baseline. If the dashboard lumps canary metrics with the rest of production, the canary signal is invisible and the strategy is theater.

  3. Run a rollback drill against staging
    • The rollback procedure documented in the runbook is fiction until someone has run it end-to-end. Pick a recent release in staging, redeploy it, then roll it back. Common failures: the previous container image was pruned, the migration wasn't reversible, the rollback script depends on a credential that rotated. File a ticket for every gap; do not declare success until the drill works clean.

  4. Check IaC drift detection runs nightly
    • Terraform plan / Pulumi preview should run nightly against production state and post drift to a Slack channel that someone reads. Without drift detection, a console-edit during an incident is permanent and silent.
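
One way to wire up the nightly drift check is to lean on `terraform plan -detailed-exitcode`, which exits 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The sketch below is a minimal wrapper under that assumption. The helper names and the `infra/prod` path in the usage note are hypothetical, and posting the result to Slack is left out.

```python
import subprocess
from typing import List

# Exit codes documented for `terraform plan -detailed-exitcode`.
EXIT_MEANINGS = {
    0: "no drift",
    1: "plan failed",
    2: "drift detected",
}


def classify_plan_exit(code: int) -> str:
    """Map a terraform plan exit code to a short status string."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def drift_command(workdir: str) -> List[str]:
    """Build a read-only plan command for the given working directory."""
    return [
        "terraform", f"-chdir={workdir}", "plan",
        "-detailed-exitcode", "-input=false", "-lock=false", "-no-color",
    ]


def check_drift(workdir: str) -> str:
    """Run the plan and return its status; requires terraform on PATH."""
    result = subprocess.run(drift_command(workdir), capture_output=True, text=True)
    return classify_plan_exit(result.returncode)
```

A nightly job could call `check_drift("infra/prod")` and post anything other than "no drift" to the channel. Treating "plan failed" as an alert too keeps an expired credential from silently disabling the check.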

5

Observability & Alerting

  1. Audit golden-signal dashboards per service
    • Every production service needs latency, traffic, errors, and saturation visible at a glance. Datadog, Grafana, or New Relic — the tool matters less than coverage. Spot-check that a new on-call engineer could open the dashboard for any service and read the current health in under 30 seconds.

  2. Verify PagerDuty routes to the current rotation
    • Trigger a test page through PagerDuty / Opsgenie and confirm it reaches the on-call engineer's phone within two minutes. Common failure: rotation references a Slack channel that was renamed, or an escalation policy that points to someone who left six months ago.

  3. Review SLOs and error budgets
    • For each tier-1 service, check the SLO target, current month's burn, and remaining error budget. If a service has been burning budget for two consecutive months without action items, that's the signal to slow feature work and invest in reliability.
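
The error-budget arithmetic behind the SLO step can be sketched as follows, assuming a request-based SLO where the budget is the fraction of requests allowed to fail in the window. The function name and the sample traffic numbers are illustrative.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute error budget consumption for a request-based SLO.

    slo_target is a fraction, for example 0.999 for "99.9% of requests succeed".
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures > 0:
        consumed = failed_requests / allowed_failures
    else:
        consumed = float("inf")  # a 100% target leaves no budget at all
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1.0 - consumed,
    }


# Made-up month: 10 million requests, 4,000 failures, 99.9% target.
report = error_budget_report(slo_target=0.999, total_requests=10_000_000, failed_requests=4_000)
print(report)
```

In this example the target allows about 10,000 failed requests, so 4,000 failures consume roughly 40% of the month's budget. A `budget_consumed` value above 1.0 means the SLO was missed for the window.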

6

Security & Compliance

  1. Confirm the regulatory scope
    • Capture the compliance frameworks the pipeline must satisfy. SOC 2 touches change management, access reviews, and vulnerability management. HIPAA adds encryption in transit and at rest, plus a BAA with infrastructure providers. PCI DSS scope depends on whether you store, process, or transmit cardholder data such as the PAN.

  2. Pull SOC 2 change-management evidence
    • Export a sample of merged PRs from the past quarter showing approver, ticket reference, and CI status. Vanta / Drata / Secureframe should be wired to GitHub for automated collection — confirm the integration is healthy and a recent sample looks right. Manual evidence collection at audit time is the slow path.

  3. Triage open CVEs from Dependabot
    • Pull the Dependabot / Snyk / Renovate queue. Critical and high CVEs in production deps should have a ticket and a deadline; transitive low-severity findings can be batched. The failure mode is a queue of 80+ ignored notifications hiding the next Log4Shell.

  4. Verify secrets live in Vault, not .env files
    • Run gitleaks / trufflehog against the full repo history and confirm zero hits on production secrets. Confirm pre-commit hooks are installed for new contributors. Production credentials live in HashiCorp Vault, AWS Secrets Manager, or Doppler — never in committed .env files or CI variable UIs that aren't audited.
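
The CVE triage rule in step 3 can be expressed as a small filter. This is a sketch only: the `Finding` record is a hypothetical normalized shape rather than the Dependabot, Snyk, or Renovate export format, the CVE IDs are placeholders, and treating everything below high severity as batchable is an assumption that extends the "low-severity findings can be batched" guidance.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

URGENT = {"critical", "high"}


@dataclass
class Finding:
    """Hypothetical normalized record; scanner exports use their own field names."""
    cve_id: str
    severity: str                 # "critical", "high", "medium", or "low"
    package: str
    ticket: Optional[str] = None  # tracker reference, if one exists
    due: Optional[str] = None     # ISO date string, if a deadline is set


def triage(findings: List[Finding]) -> Dict[str, List[str]]:
    """Split findings into urgent ones missing a ticket or deadline, and batchable ones."""
    needs_action, batch = [], []
    for f in findings:
        if f.severity.lower() in URGENT:
            if not (f.ticket and f.due):
                needs_action.append(f.cve_id)
        else:
            batch.append(f.cve_id)
    return {"needs_ticket_or_deadline": needs_action, "batch": batch}


# Placeholder IDs, not real CVEs.
queue = [
    Finding("CVE-0000-0001", "critical", "libexample", ticket="SEC-1", due="2024-07-01"),
    Finding("CVE-0000-0002", "high", "libother"),
    Finding("CVE-0000-0003", "low", "libtransitive"),
]
result = triage(queue)
print(result)
```

Here the first finding is already tracked, the second is urgent but has neither ticket nor deadline, and the third lands in the batch list.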

7

Documentation & Sign-Off

  1. Update the CI/CD pipeline runbook
    • The runbook in Confluence / Notion / Backstage should match what this review found — current branching strategy, test gates, deployment strategy, paging routes. Diagrams older than 12 months need a refresh. A new engineer should be able to read the runbook and ship their first change without shoulder-tapping a senior.

  2. Capture the review sign-off
    • Engineering manager or platform lead signs off on the review outcome. Open action items go into the next sprint with named owners; a punt-list with no owners is the same as no review.
