Infrastructure as Code (IaC) Checklist

Version Control and Branching

    Use tfenv (or asdf) and commit a .terraform-version file at the repo root so every contributor and CI runner uses the same minor version. State-file incompatibility between 1.5 and 1.6 is a common gotcha when one teammate upgrades locally.
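
    A CI step can compare the pin with the binary that is actually running. The sketch below is a minimal illustration, assuming .terraform-version holds one exact version string and that the second argument is the stdout of terraform version -json; the function name is hypothetical.

```python
import json


def versions_match(pinned_text: str, version_json: str) -> bool:
    """Return True when the pinned version equals the running Terraform version.

    pinned_text  -- contents of .terraform-version (assumed: one exact version)
    version_json -- stdout of `terraform version -json`
    """
    pinned = pinned_text.strip()
    running = json.loads(version_json)["terraform_version"]
    return pinned == running
```

    A runner could fail the job early when this returns False, before any plan is produced.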

    Branch name should reference the ticket — e.g. infra/PLAT-482-rds-encryption. Branch protection on main requires PR + passing checks + 1 CODEOWNERS approval; do not push directly.

    Prefix commit messages with feat:, fix:, chore:, or refactor:. Breaking changes use feat!: with a BREAKING CHANGE: footer; release-please reads this to bump the module's semver tag.
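
    The naming rules in this section can be enforced in a pre-commit hook or CI job. The patterns below are a sketch, not a team standard: the branch shape (prefix/TICKET-123-slug) is inferred from the single example above, and both function names are hypothetical.

```python
import re

# Assumed branch shape: <area>/<TICKET>-<number>-<kebab-case-slug>
BRANCH_RE = re.compile(r"[a-z]+/[A-Z]+-\d+-[a-z0-9]+(?:-[a-z0-9]+)*")
# Subject line: type, optional (scope), optional "!", then ": description"
COMMIT_RE = re.compile(r"^(feat|fix|chore|refactor)(\([\w-]+\))?(!)?: \S.*$")


def branch_ok(name: str) -> bool:
    """Check that a branch name references a ticket."""
    return BRANCH_RE.fullmatch(name) is not None


def commit_ok(message: str) -> bool:
    """Check the subject prefix, and require a footer when '!' marks a breaking change."""
    lines = message.splitlines()
    if not lines:
        return False
    match = COMMIT_RE.match(lines[0])
    if match is None:
        return False
    if match.group(3) == "!":
        return any(line.startswith("BREAKING CHANGE: ") for line in lines[1:])
    return True
```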

Testing and Continuous Integration

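    Run terraform plan -out=tfplan against the staging workspace and attach the plan to the PR. Watch for unintended destroys, especially of stateful resources like RDS or S3. A single ~ on a name attribute often hides a forced replacement: when that line carries the "# forces replacement" note, the resource is destroyed and recreated (shown as -/+) rather than updated in place.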
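
    Reading a long plan by eye is error-prone, so destroys and replacements can also be listed mechanically. The sketch below assumes the input is the output of terraform show -json tfplan and looks only at the resource_changes entries and their change.actions lists; the function name is hypothetical.

```python
import json


def destructive_changes(plan_json: str) -> list[tuple[str, str]]:
    """List (address, kind) for every resource the plan would destroy or replace."""
    plan = json.loads(plan_json)
    found = []
    for resource in plan.get("resource_changes", []):
        actions = resource["change"]["actions"]
        if "delete" in actions:
            # A replacement carries both "delete" and "create" in either order.
            kind = "replace" if "create" in actions else "destroy"
            found.append((resource["address"], kind))
    return found
```

    A CI job could print this list into the PR comment, or fail when it contains an address the author did not name in the description.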

    Run the module's Go-based Terratest suite plus any terraform validate and terraform fmt -check gates. Integration tests that spin up real AWS resources should target the sandbox account, not staging.

    GitHub Actions / GitLab CI runs fmt, validate, plan, tfsec, and Terratest. Required status checks must all be green before merge — never merge with a flaky check waved through.

Security and Compliance Scanning

    Run both tfsec and Checkov: they catch overlapping but distinct rules. tfsec is fast and Terraform-native; Checkov has broader CIS / SOC 2 / HIPAA policy packs. Configure them to fail the build on Critical and High by default; suppressions need an inline comment with the ticket justifying the exception.
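
    The ticket requirement for suppressions can be checked automatically. The sketch below scans Terraform source for tfsec:ignore: and checkov:skip= comments and flags those with no ticket ID on the same line or in a comment on the line directly above. The ticket pattern (uppercase key, hyphen, number) is an assumption based on the PLAT-482 example, and the function name is hypothetical.

```python
import re

DIRECTIVE_RE = re.compile(r"(tfsec:ignore:[\w-]+|checkov:skip=[A-Z0-9_]+)")
TICKET_RE = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")  # assumed ticket format


def unjustified_suppressions(tf_source: str) -> list[int]:
    """Return 1-based line numbers of suppressions that cite no ticket."""
    lines = tf_source.splitlines()
    flagged = []
    for index, line in enumerate(lines):
        if not DIRECTIVE_RE.search(line):
            continue
        # Strip the directive so a rule ID is never mistaken for a ticket.
        context = DIRECTIVE_RE.sub("", line)
        previous = lines[index - 1] if index > 0 else ""
        if previous.lstrip().startswith("#"):
            context += " " + DIRECTIVE_RE.sub("", previous)
        if not TICKET_RE.search(context):
            flagged.append(index + 1)
    return flagged
```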

    Summarize the scanner output. Auditors collecting SOC 2 evidence look for the scan-result artifact attached to the PR; the ticket should also link to the SARIF upload in GitHub Advanced Security.
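
    If the scanners emit SARIF, the summary can be generated rather than typed. The sketch below counts results per level across all runs in a SARIF 2.1.0 document. As a simplification it treats a result with no level as "warning" and ignores rule-level default configuration; the function name is hypothetical.

```python
import json
from collections import Counter


def summarize_sarif(sarif_text: str) -> dict[str, int]:
    """Count SARIF results by level (error, warning, note, none)."""
    sarif = json.loads(sarif_text)
    counts: Counter[str] = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # Simplification: a missing level is counted as "warning".
            counts[result.get("level", "warning")] += 1
    return dict(counts)
```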

    Triage every finding: fix it, accept it with a suppression (after security review), or escalate. Common Critical hits: S3 buckets without encryption, security groups open to 0.0.0.0/0 on port 22, IAM policies with * as the resource. Document any accepted risk in the security register.

Documentation and Knowledge Sharing

    Document new variables, default values, and outputs. If a variable changed type or default, call it out as a breaking change in the changelog — module consumers will hit it on their next terraform init -upgrade.

    Run terraform-docs markdown table --output-file README.md . (or via pre-commit). The CI pipeline fails if the generated section drifts from the committed README; that gate keeps the docs honest.
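
    One way to implement the drift gate is to compare only the generated region of the README, so hand-written text outside it never triggers a failure. The sketch below assumes the README uses the BEGIN_TF_DOCS / END_TF_DOCS comment markers for the injected section; check the markers your terraform-docs configuration actually writes. Function names are hypothetical.

```python
BEGIN = "<!-- BEGIN_TF_DOCS -->"  # assumed marker; confirm against your README
END = "<!-- END_TF_DOCS -->"


def generated_section(readme: str) -> str:
    """Return the text between the markers; raises ValueError if one is missing."""
    start = readme.index(BEGIN) + len(BEGIN)
    end = readme.index(END, start)
    return readme[start:end].strip()


def docs_drifted(committed: str, regenerated: str) -> bool:
    """True when the generated sections of the two README versions differ."""
    return generated_section(committed) != generated_section(regenerated)
```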

    One-paragraph summary: what changed, blast radius (which environments, which services), rollback approach, and a link to the PR. Application-team leads watch this channel to know when shared infra moved underneath them.

Configuration Management

    Run terraform plan twice in a row against the same, already-applied workspace; the second run should show No changes. A diff that persists between runs usually means a local-exec provisioner or a data source returning non-deterministic values; refactor those out.
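
    The double-plan check is easy to script with terraform plan -detailed-exitcode, which exits 0 when there are no changes, 2 when there are changes, and 1 on error. The sketch below is one possible wrapper; the function names are hypothetical, and the plan runner is passed in as a callable so the logic can be exercised without a real workspace.

```python
import subprocess
from typing import Callable


def terraform_plan_exit_code(workdir: str) -> int:
    """Run `terraform plan -detailed-exitcode` in workdir and return the exit code."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        check=False,
    )
    return result.returncode


def plan_has_changes(exit_code: int) -> bool:
    """Interpret -detailed-exitcode: 0 = no changes, 2 = changes, other = error."""
    if exit_code == 0:
        return False
    if exit_code == 2:
        return True
    raise RuntimeError(f"terraform plan failed with exit code {exit_code}")


def second_plan_is_clean(run_plan: Callable[[], int]) -> bool:
    """Run the plan twice; report whether the second run shows no changes."""
    plan_has_changes(run_plan())  # first run: raises if the plan itself errors
    return not plan_has_changes(run_plan())
```

    In CI this might be called as second_plan_is_clean(lambda: terraform_plan_exit_code(".")).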

    Never put secret values in .tfvars or commit them to git — even rotating later doesn't remove the value from history. Reference secrets via vault_generic_secret or aws_secretsmanager_secret_version data sources, and confirm gitleaks runs in pre-commit.

    One module, multiple env/*.tfvars files (dev, staging, prod). Do not fork the module per environment — divergence is the #1 source of "works in staging, breaks in prod" infra incidents.

Monitoring and Performance

    Define monitors as code in the same module, using datadog_monitor or aws_cloudwatch_metric_alarm. Cover the four golden signals (latency, traffic, errors, saturation) at minimum; resources that ship without alerts are how outages slip past on-call.

    Critical alerts go to a PagerDuty service that maps to the on-call schedule for the owning team. Warning-level alerts should route to Slack, not PagerDuty — unactionable pages erode response discipline within weeks.

    Open the service dashboard and confirm new resources appear in the resource-list widgets and SLO panels. Also check that the auto-generated AWS service quota dashboard hasn't gone red (e.g. EIPs per region, RDS instances per account).

Dependency Management

    Use ~> 5.40 style pessimistic constraints, not >= 5.0. Commit the .terraform.lock.hcl — without it, terraform init on different machines pulls different provider versions and CI plans diverge from local plans.
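
    The ~> operator lets only the rightmost listed component increase, which is why ~> 5.40 accepts any 5.x release from 5.40 upward but rejects 6.0. The sketch below reproduces that rule for plain numeric versions; it ignores pre-release suffixes, and the function name is hypothetical.

```python
def _parse(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))


def _pad(parts: tuple[int, ...], width: int) -> tuple[int, ...]:
    return parts + (0,) * (width - len(parts))


def allowed_by_pessimistic(constraint: str, version: str) -> bool:
    """Check `version` against `~> constraint` (pass the constraint without '~>')."""
    base = _parse(constraint)
    if len(base) < 2:
        raise ValueError("expected at least MAJOR.MINOR")
    candidate = _parse(version)
    width = max(len(base), len(candidate))
    lower = _pad(base, width)
    # Upper bound: bump the second-to-last component and drop the last one.
    upper = _pad(base[:-2] + (base[-2] + 1,), width)
    return lower <= _pad(candidate, width) < upper
```

    With ">= 5.0" the same check would have no upper bound, so a 6.x provider could be selected the next time the lock file is regenerated.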

    Don't let upgrade PRs pile up to 80+. Auto-merge passing patch and minor bumps for vetted providers (hashicorp/aws, hashicorp/random, hashicorp/null); major bumps need a human to read the upstream changelog because they often change resource schemas.
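
    The auto-merge policy reduces to a small decision function. The sketch below assumes three-part numeric versions and writes provider names as registry source addresses; the set contents mirror the vetted list above, and the function names are hypothetical.

```python
VETTED_PROVIDERS = {"hashicorp/aws", "hashicorp/random", "hashicorp/null"}


def bump_kind(old: str, new: str) -> str:
    """Classify a version change as 'major', 'minor', or 'patch'."""
    old_parts = [int(part) for part in old.split(".")]
    new_parts = [int(part) for part in new.split(".")]
    if new_parts[0] != old_parts[0]:
        return "major"
    if new_parts[1] != old_parts[1]:
        return "minor"
    return "patch"


def can_auto_merge(provider: str, old: str, new: str, checks_passed: bool) -> bool:
    """Auto-merge only green patch/minor bumps of vetted providers."""
    return (
        checks_passed
        and provider in VETTED_PROVIDERS
        and bump_kind(old, new) != "major"
    )
```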

    Run terraform plan after the upgrade and look for unexpected diffs: provider major versions sometimes rename attributes or change their defaults. Roll the upgrade through sandbox → staging → prod, never straight to prod.

Code Review and Apply

    The CODEOWNERS file routes review to the team that owns the module. PR description should include: blast radius, plan output link, scan results, rollback steps. "LGTM" on a 1,200-line plan is a red flag — break large changes into reviewable PRs under ~400 lines.

    Classify each change by risk. Low = additive, non-prod, easily reversible (new tag, new monitor). Medium = prod-touching but reversible (new resource, parameter change). High = stateful resource changes, IAM scope expansions, breaking module changes, or anything touching shared networking. High requires a second approver and a deploy window outside Friday afternoon.
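
    The risk tiers can be encoded so that a PR bot applies them consistently. The sketch below is one reading of the rules above: the field names are hypothetical, and "Friday afternoon" is assumed to mean Friday from 12:00 local time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Change:
    touches_prod: bool = False
    stateful_resource: bool = False
    iam_scope_expansion: bool = False
    breaking_module_change: bool = False
    shared_networking: bool = False


def risk_level(change: Change) -> str:
    """Map a change to 'low', 'medium', or 'high'."""
    if (change.stateful_resource or change.iam_scope_expansion
            or change.breaking_module_change or change.shared_networking):
        return "high"
    if change.touches_prod:
        return "medium"
    return "low"


def deploy_allowed(change: Change, when: datetime) -> bool:
    """Block high-risk deploys on Friday afternoon (assumed: Friday, 12:00 onward)."""
    friday_afternoon = when.weekday() == 4 and when.hour >= 12
    return not (risk_level(change) == "high" and friday_afternoon)
```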

    Second approver should be a staff engineer or platform lead outside the original author's immediate sub-team. SOC 2 segregation-of-duties evidence pulls directly from this approval; auditors will sample PRs and check for two distinct reviewers.
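
    The two-reviewer requirement can be verified before merge instead of being discovered during an audit. The sketch below assumes a hypothetical team_of mapping from lowercase login to sub-team; it cannot check seniority, only that two distinct non-author approvers exist and that at least one sits on a different known team.

```python
def satisfies_two_approver_rule(
    author: str, approvers: list[str], team_of: dict[str, str]
) -> bool:
    """Require two distinct non-author approvers, one from outside the author's team."""
    reviewers = {name.lower() for name in approvers} - {author.lower()}
    if len(reviewers) < 2:
        return False
    author_team = team_of.get(author.lower())
    # An approver with no known team does not count as an outside reviewer.
    return any(
        team_of.get(reviewer) not in (None, author_team) for reviewer in reviewers
    )
```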

    Apply via the CI runner (Atlantis, Terraform Cloud, or a protected GitHub Actions workflow) — never terraform apply from a laptop against prod. Tag the merged commit (e.g. v2024.45.0), push the changelog, and confirm the post-apply plan shows No changes.