Infrastructure as Code Checklist

Module and Code Hygiene

    Format with terraform fmt -recursive and lint with tflint (for Bicep, az bicep lint; Pulumi has no lint subcommand, so use the linter for the program's host language). Run both locally to catch noise before the PR; CI should block on lint failures.

    Confirm required_providers and module sources are pinned to exact versions or narrow ranges. A constraint like ~> 5.0 on the AWS provider accepts any 5.x release, which has burned teams when a minor release changed resource defaults.

    Regenerate docs with terraform-docs so input variables, outputs, and examples match the code. Stale README entries are a common reason a downstream consumer passes the wrong inputs.

    Run gitleaks or trufflehog against the diff. Secrets belong in Vault, AWS Secrets Manager, Azure Key Vault, or 1Password references — never inline in .tf, .tfvars, or pipeline YAML.
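
The version-pinning item in this section can be checked mechanically. The sketch below is illustrative only: it assumes that "narrow" means either an exact version or a patch-level pessimistic constraint such as ~> 5.31.0, which is a team policy choice rather than a Terraform rule. The function name is_narrow_constraint is hypothetical.

```python
import re

# Assumption for this sketch: a constraint is "narrow" if it is an exact
# version ("5.31.0" or "= 5.31.0") or a patch-level pessimistic range
# ("~> 5.31.0"). Anything else, such as "~> 5.0" or ">= 5.0", is "loose".
_EXACT = re.compile(r"^=?\s*\d+\.\d+\.\d+$")
_PATCH_PESSIMISTIC = re.compile(r"^~>\s*\d+\.\d+\.\d+$")


def is_narrow_constraint(constraint: str) -> bool:
    """Return True if the constraint pins to an exact or patch-level range."""
    text = constraint.strip()
    return bool(_EXACT.match(text) or _PATCH_PESSIMISTIC.match(text))
```

A CI step could run this over every version string found in required_providers and module blocks and fail the build on the first loose constraint.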

Repository and Branch Discipline

    Branch off main using the team's naming convention (e.g., iac/JIRA-1234-vpc-peering). Direct commits to main should be blocked by branch protection.

    In the pull request description, reference the ticket, summarize the blast radius, and paste the terraform plan output (or link to the CI job). Future incident responders read this first.

    CODEOWNERS should enforce a platform-team approver on the module path. Self-merge of infrastructure changes is a frequent SOC 2 change-management finding.

    Cut a semver tag (e.g., v1.4.2) so downstream stacks can pin to a known-good version. Untagged modules referenced by ref=main drift silently and break reproducibility.
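
The branch naming convention can be enforced in a pre-push hook or CI job. The following is a minimal sketch that assumes the pattern implied by the example iac/JIRA-1234-vpc-peering: an iac/ prefix, an uppercase ticket key with a number, and a lowercase hyphenated slug. The regular expression and the name is_valid_branch are assumptions; adjust them to the team's actual convention.

```python
import re

# Assumed pattern, derived from the example "iac/JIRA-1234-vpc-peering":
#   iac/<TICKETKEY>-<number>-<lowercase-hyphenated-slug>
_BRANCH_PATTERN = re.compile(r"iac/[A-Z][A-Z0-9]+-\d+-[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_branch(name: str) -> bool:
    """Return True if the whole branch name matches the assumed convention."""
    return _BRANCH_PATTERN.fullmatch(name) is not None
```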

Security and Compliance Scanning

    Checkov, tfsec, and Trivy catch open S3 buckets, unrestricted security groups, and unencrypted RDS instances. Treat HIGH/CRITICAL findings as merge blockers unless explicitly waived.

    EBS volumes, RDS, S3, and managed disks must declare KMS / CMK encryption explicitly. Account-level default-encryption settings do not cover every resource type, so declare encryption in the resource block itself.

    No Action: "*" on Resource: "*". Scope IAM policies to the specific actions and resource ARNs the workload needs. Standing administrator privilege is a leading root cause of cloud breaches.

    If a scanner finding cannot be remediated this release, open a ticket in the security backlog with the CVE / rule ID, business justification, compensating control, and expiry date. Auditors will ask.
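
The wildcard rule in this section (no Action: "*" on Resource: "*") is simple enough to check directly against a rendered IAM policy document. This is a rough sketch, not a substitute for Checkov, tfsec, or Trivy: it only flags Allow statements where both Action and Resource contain a bare "*", and it does not catch service-level wildcards such as s3:* or NotAction constructs. The function names are hypothetical.

```python
import json


def _as_list(value):
    """IAM accepts either a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]


def find_wildcard_statements(policy_json: str) -> list[dict]:
    """Return Allow statements that grant Action "*" on Resource "*"."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    # A policy may carry a single statement object instead of a list.
    if isinstance(statements, dict):
        statements = [statements]

    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        if "*" in actions and "*" in resources:
            flagged.append(stmt)
    return flagged
```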

State and Secrets Handling

    Whether the backend is S3 with a DynamoDB lock, a Terraform Cloud workspace, or Azure Storage with a blob lease, verify that state locking is active so another engineer cannot apply concurrently. Two simultaneous applies can corrupt state and require manual surgery to recover.

    If a secret was ever committed, treat it as compromised once it has touched a remote or a public CI log, even if the commit was later force-pushed away. Rewriting history does not un-leak a key.

    Rotate the exposed credential at its source of truth (IAM, Vault, Entra ID app registration), then pull CloudTrail / Entra audit logs for the window during which the secret was exposed. File an incident ticket regardless of whether use was observed.

Plan Review and Apply

    Save the speculative plan from Atlantis, Terraform Cloud, or Spacelift. Reviewers gate on the plan, not the source diff — a one-line variable change can cascade into hundreds of resource replacements.

    The - destroy and -/+ replace lines are where outages live. Replacing an RDS instance destroys the existing database. A resource renamed in code reads as destroy + create to Terraform unless the rename is recorded with a moved block or terraform state mv.

    For RDS, EBS, managed disks, and stateful workloads: take a manual snapshot or final backup before apply. Verify the snapshot is restorable, not just that it was created.

    Apply via the pipeline runner (not from a workstation) so the action is logged and uses the service principal's scoped credentials. Workstation applies bypass change advisory board (CAB) approval and break the audit trail.
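
Reviewers can be pointed at the destroy and replace lines automatically. The sketch below reads the machine-readable plan produced by terraform show -json and lists every resource whose planned actions include a delete. It relies on the resource_changes, address, and change.actions fields of Terraform's JSON plan format; the function name destructive_changes is hypothetical.

```python
import json


def destructive_changes(plan_json: str) -> list[tuple[str, str]]:
    """Return (address, kind) for each planned destroy or replace.

    In Terraform's JSON plan, a replacement appears as actions
    ["delete", "create"] or ["create", "delete"]; a plain destroy
    appears as ["delete"].
    """
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" not in actions:
            continue
        kind = "replace" if "create" in actions else "destroy"
        flagged.append((rc["address"], kind))
    return flagged
```

In a pipeline, the input would typically come from running terraform show -json against the saved plan file. A non-empty result is a reasonable trigger for requiring an extra approver or a verified snapshot before apply.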

Post-Apply Verification

    Confirm Datadog, CloudWatch, or Azure Monitor is collecting metrics from the new resources and that alerting rules cover them. New resources without monitoring become production blind spots.

    AWS Backup plan, Azure Backup vault, or Veeam job must pick up the new RDS / disk / database. With tag-based backup selection, a mistyped tag silently excludes the resource, so confirm it appears in the backup job.

    Run terraform plan again post-apply with no code changes. The output should report no changes. Any proposed change indicates drift, usually a resource modified out of band; investigate before closing the change.

    Attach the full apply output to the ServiceNow / Jira ticket and mark the CR complete. The apply log is the SOX/SOC 2 evidence that the approved change matches what executed.
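
The post-apply no-op check in this section can be scripted with terraform plan -detailed-exitcode, which exits 0 when the plan is empty, 2 when changes are pending, and 1 on error. The following is a minimal sketch under those documented exit codes; the function names and the returned labels are hypothetical.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit status to a label."""
    if code == 0:
        return "clean"  # plan succeeded with no changes
    if code == 2:
        return "drift"  # plan succeeded and changes are pending
    return "error"      # plan itself failed


def check_drift(workdir: str) -> str:
    """Run a plan in workdir and report clean, drift, or error.

    Assumes terraform is on PATH and the directory is already initialized.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

A pipeline step could refuse to close the change request unless check_drift returns "clean".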