Infrastructure as Code (IaC) Checklist
Steps a platform or DevOps team runs on every Terraform / OpenTofu / Pulumi change — from branch through plan, security scan, review, and apply — to keep infra changes safe, reviewable, and reversible.
Version Control and Branching
-
Pin the Terraform version in .terraform-version
Use tfenv (or asdf) and commit a .terraform-version file at the repo root so every contributor and CI runner uses the same Terraform version. Version skew, such as one teammate upgrading locally from 1.5 to 1.6, is a common gotcha: plans, lock files, and state metadata start to differ between machines.
-
Open a feature branch off main
Branch name should reference the ticket — e.g. infra/PLAT-482-rds-encryption. Branch protection on main requires PR + passing checks + 1 CODEOWNERS approval; do not push directly.
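A branch-name check can be automated in a pre-push hook or CI step. The sketch below assumes a naming convention of infra/<TICKET>-<summary> inferred from the single example above; the regular expression and the function name is_valid_branch are illustrative assumptions, not an established standard.

```python
import re

# Assumed convention: infra/<PROJECT>-<NUMBER>-<kebab-case-summary>,
# modeled on the example "infra/PLAT-482-rds-encryption".
BRANCH_PATTERN = re.compile(r"infra/[A-Z][A-Z0-9]*-\d+-[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_branch(name: str) -> bool:
    """Return True when the branch name references a ticket in the assumed format."""
    return BRANCH_PATTERN.fullmatch(name) is not None
```

Teams with a different ticket-key format would adjust the pattern rather than the function.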
-
Write conventional-commit messages
Prefix with feat:, fix:, chore:, or refactor:. Breaking changes use feat!: with a BREAKING CHANGE: footer — release-please reads this to bump the module's semver tag.
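To make the commit-to-version relationship concrete, here is a minimal sketch of how a commit message could map to a semver bump. It follows the Conventional Commits rules described above and is not release-please's actual implementation; the function name semver_bump and the restriction to four commit types are assumptions taken from this step.

```python
import re

# Header shape: type, optional (scope), optional "!", then ": subject".
HEADER = re.compile(
    r"^(?P<type>feat|fix|chore|refactor)(?:\([^)]+\))?(?P<bang>!)?: .+$"
)


def semver_bump(message: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for a commit message."""
    lines = message.strip().splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        raise ValueError("not a conventional-commit header")
    # A breaking change is signaled by "!" in the header or a footer line.
    breaking = match["bang"] is not None or any(
        line.startswith("BREAKING CHANGE: ") for line in lines[1:]
    )
    if breaking:
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] == "fix":
        return "patch"
    return "none"  # chore and refactor do not change the released version
```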
Testing and Continuous Integration
-
Run terraform plan against the staging workspace
Run terraform plan -out=tfplan against the staging workspace and attach the plan to the PR. Watch for unintended destroys, especially of stateful resources such as RDS or S3. A one-line ~ change to a name attribute is often annotated # forces replacement, which turns the update into a destroy-and-recreate (-/+).
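Destroys and replacements can also be flagged mechanically from the machine-readable plan. The sketch below assumes the plan was exported with terraform show -json tfplan and reads the resource_changes entries of that JSON; the function name destructive_changes and the resource addresses in any example are hypothetical.

```python
import json


def destructive_changes(plan_json: str) -> list[tuple[str, str]]:
    """List (address, kind) for every resource the plan deletes or replaces."""
    plan = json.loads(plan_json)
    flagged = []
    for resource in plan.get("resource_changes", []):
        actions = resource["change"]["actions"]
        if "delete" in actions:
            # A delete paired with a create is a forced replacement.
            kind = "replace" if "create" in actions else "destroy"
            flagged.append((resource["address"], kind))
    return flagged
```

A CI step could fail the build, or require an explicit label on the PR, whenever this list is non-empty.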
-
Execute Terratest unit and integration suites
Run the module's Go-based Terratest suite plus any terraform validate and terraform fmt -check gates. Integration tests that spin real AWS resources should target the sandbox account, not staging.
-
Trigger the IaC pipeline on push
GitHub Actions / GitLab CI runs fmt, validate, plan, tfsec, and Terratest. Required status checks must all be green before merge — never merge with a flaky check waved through.
Security and Compliance Scanning
-
Run tfsec and Checkov against the module
Both tools catch overlapping but distinct rules — tfsec is fast and Terraform-native, Checkov has broader CIS / SOC 2 / HIPAA policy packs. Configure them to fail the build on Critical and High by default; suppressions need an inline comment with the ticket justifying the exception.
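The gating rule can be expressed as a small filter. This is a sketch under assumptions: it expects findings already normalized into dictionaries with severity, suppressed, and ticket keys, which is a hypothetical shape and not the native output of tfsec or Checkov.

```python
BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


def blocking_findings(findings: list[dict]) -> list[dict]:
    """Return the findings that should fail the build."""
    blocking = []
    for finding in findings:
        if finding["severity"].upper() not in BLOCKING_SEVERITIES:
            continue
        # A suppression only counts when it carries a justifying ticket.
        justified = finding.get("suppressed", False) and bool(finding.get("ticket"))
        if not justified:
            blocking.append(finding)
    return blocking
```

The build would exit non-zero when the returned list is non-empty.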
-
Record scan severity findings
Summarize the scanner output. Auditors collecting SOC 2 evidence look for the scan-result artifact attached to the PR; the ticket should also link to the SARIF upload in GitHub Advanced Security.
-
Triage Critical and High findings before merge
Fix, accept-with-suppression (with security review), or escalate. Common Critical hits: S3 buckets without encryption, security groups open to 0.0.0.0/0 on port 22, IAM policies with * as the resource. Document any accepted risk in the security register.
Documentation and Knowledge Sharing
-
Update the module README with input and output changes
Document new variables, default values, and outputs. If a variable changed type or default, call it out as a breaking change in the changelog — module consumers will hit it on their next terraform init -upgrade.
-
Regenerate the terraform-docs reference
Run terraform-docs markdown table --output-file README.md . (or run it via pre-commit). The CI pipeline fails if the generated section drifts from what is committed; that gate keeps the docs honest.
-
Post the change summary in #infra-changes
One-paragraph summary: what changed, blast radius (which environments, which services), rollback approach, and a link to the PR. Application-team leads watch this channel to know when shared infra moved underneath them.
Configuration Management
-
Confirm resources are declarative and idempotent
After a successful apply, run terraform plan again against the same workspace; it should report No changes. A persistent diff usually means a local-exec provisioner or a data source returning non-deterministic values; refactor those out.
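One way to automate the No changes check is terraform plan -detailed-exitcode, which exits 0 for an empty diff, 2 for a non-empty diff, and 1 on error. The sketch below wraps that contract; the function names and the injectable runner (used here so the logic can be exercised without Terraform installed) are illustrative assumptions.

```python
import subprocess
from typing import Callable, Sequence

PLAN_CMD = ("terraform", "plan", "-detailed-exitcode", "-input=false")


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(list(cmd), check=False).returncode


def plan_is_clean(run: Callable[[Sequence[str]], int] = run_command) -> bool:
    """Return True when the plan reports no changes, False when it reports a diff."""
    code = run(PLAN_CMD)
    if code == 0:
        return True
    if code == 2:
        return False
    raise RuntimeError(f"terraform plan failed with exit code {code}")
```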
-
Move new secrets into Vault or AWS Secrets Manager
Never put secret values in .tfvars or commit them to git — even rotating later doesn't remove the value from history. Reference secrets via vault_generic_secret or aws_secretsmanager_secret_version data sources, and confirm gitleaks runs in pre-commit.
-
Parameterize environment differences via tfvars
One module, multiple env/*.tfvars files (dev, staging, prod). Do not fork the module per environment — divergence is the #1 source of "works in staging, breaks in prod" infra incidents.
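As a small illustration of the one-module, many-tfvars layout, the sketch below builds the plan command for a given environment using the -var-file flag. The env/<name>.tfvars path follows the layout named above; the per-environment plan file name and the function name plan_command are assumptions.

```python
ENVIRONMENTS = ("dev", "staging", "prod")


def plan_command(env: str) -> list[str]:
    """Build the terraform plan invocation for one environment's tfvars file."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    return [
        "terraform",
        "plan",
        f"-var-file=env/{env}.tfvars",  # same module, environment-specific inputs
        f"-out={env}.tfplan",  # assumed naming for the saved plan
    ]
```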
Monitoring and Performance
-
Add Datadog or CloudWatch monitors for new resources
Define monitors as code in the same module, using datadog_monitor or aws_cloudwatch_metric_alarm. Cover the four golden signals (latency, traffic, errors, saturation) at minimum; orphan resources without alerts are how outages slip past on-call.
-
Wire alerts to the existing PagerDuty service
Critical alerts go to a PagerDuty service that maps to the on-call schedule for the owning team. Warning-level alerts should route to Slack, not PagerDuty — unactionable pages erode response discipline within weeks.
-
Verify dashboards reflect the new resource set
Open the service dashboard and confirm new resources appear in the resource-list widgets and SLO panels. Also check that the auto-generated AWS service quota dashboard hasn't gone red (e.g. EIPs per region, RDS instances per account).
Dependency Management
-
Pin provider and module versions in versions.tf
Use ~> 5.40 style pessimistic constraints, not >= 5.0. Commit the .terraform.lock.hcl — without it, terraform init on different machines pulls different provider versions and CI plans diverge from local plans.
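The difference between the two constraint styles is easier to see when the ~> rule is spelled out: only the rightmost component given in the constraint may increase. The sketch below implements that rule for plain numeric versions; it is an illustration and not Terraform's own resolver, and it ignores pre-release suffixes.

```python
def _parse(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.strip().split("."))


def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Check a version against a '~> X.Y' or '~> X.Y.Z' constraint."""
    base = _parse(constraint.removeprefix("~>"))
    if len(base) < 2:
        raise ValueError("this sketch needs at least two version components")
    candidate = _parse(version)
    width = max(len(base), len(candidate))
    lower = base + (0,) * (width - len(base))
    candidate = candidate + (0,) * (width - len(candidate))
    # Upper bound: bump the second-to-last component and drop the last one.
    upper_prefix = base[:-2] + (base[-2] + 1,)
    upper = upper_prefix + (0,) * (width - len(upper_prefix))
    return lower <= candidate < upper
```

Under this rule ~> 5.40 accepts 5.41.2 but rejects 6.0.0, whereas >= 5.0 would accept both and let a major provider release in unreviewed.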
-
Review open Renovate or Dependabot upgrade PRs
Don't let upgrade PRs pile up to 80+. Auto-merge passing patch and minor bumps for vetted providers (hashicorp/aws, hashicorp/random, hashicorp/null); major bumps need a human to read the upstream changelog because they often change resource schemas.
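The auto-merge policy above can be sketched as a predicate. The vetted set mirrors the providers named in this step (with the aws provider written as hashicorp/aws); the function names are hypothetical, and in practice this logic would normally live in the Renovate or Dependabot configuration instead of custom code.

```python
VETTED_PROVIDERS = {"hashicorp/aws", "hashicorp/random", "hashicorp/null"}


def bump_kind(current: str, proposed: str) -> str:
    """Classify an upgrade between two X.Y.Z versions as major, minor, or patch."""
    old = tuple(int(part) for part in current.split("."))
    new = tuple(int(part) for part in proposed.split("."))
    if len(old) != 3 or len(new) != 3 or new <= old:
        raise ValueError("expected an upgrade between two X.Y.Z versions")
    if new[0] != old[0]:
        return "major"
    if new[1] != old[1]:
        return "minor"
    return "patch"


def can_automerge(provider: str, current: str, proposed: str, checks_passed: bool) -> bool:
    """Allow auto-merge only for green patch/minor bumps of vetted providers."""
    return (
        checks_passed
        and provider in VETTED_PROVIDERS
        and bump_kind(current, proposed) in {"patch", "minor"}
    )
```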
-
Test pinned upgrades in the sandbox workspace
Run terraform plan after the upgrade and look for unexpected diffs; provider major versions sometimes rename attributes or change their defaults. Roll the upgrade through sandbox → staging → prod, never straight to prod.
Code Review and Apply
-
Request review from the CODEOWNERS infra team
The CODEOWNERS file routes review to the team that owns the module. PR description should include: blast radius, plan output link, scan results, rollback steps. "LGTM" on a 1,200-line plan is a red flag — break large changes into reviewable PRs under ~400 lines.
-
Classify the change risk level
Low = additive, non-prod, easily reversible (new tag, new monitor). Medium = prod-touching but reversible (new resource, parameter change). High = stateful resource changes, IAM scope expansions, breaking module changes, or anything touching shared networking. High requires a second approver and a deploy window outside Friday afternoon.
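The risk tiers can be encoded so that the PR template or a bot applies them consistently. The sketch below is one possible reading of the rules above; the Change fields and function names are hypothetical, and reversibility is assumed to be judged by the author and is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    touches_prod: bool
    stateful_resource: bool = False
    expands_iam_scope: bool = False
    breaking_module_change: bool = False
    touches_shared_networking: bool = False


def classify(change: Change) -> str:
    """Map a change to 'low', 'medium', or 'high' risk."""
    if (
        change.stateful_resource
        or change.expands_iam_scope
        or change.breaking_module_change
        or change.touches_shared_networking
    ):
        return "high"
    if change.touches_prod:
        return "medium"
    return "low"


def required_approvers(level: str) -> int:
    """High-risk changes need a second approver; everything else needs one."""
    return 2 if level == "high" else 1
```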
-
Obtain a second approver for High-risk changes
Second approver should be a staff engineer or platform lead outside the original author's immediate sub-team. SOC 2 segregation-of-duties evidence pulls directly from this approval; auditors will sample PRs and check for two distinct reviewers.
-
Apply the plan and tag the release
Apply via the CI runner (Atlantis, Terraform Cloud, or a protected GitHub Actions workflow) — never terraform apply from a laptop against prod. Tag the merged commit (e.g. v2024.45.0), push the changelog, and confirm the post-apply plan shows No changes.