Cloud Deployment Checklist

Steps an infrastructure or platform-operations team runs to deploy a workload into a cloud environment, from planning and security review through go-live and the post-deployment review. Designed for in-house IT and MSP project engineers running production cutovers under change...

5 sections 27 steps Collects data

Pre-Deployment Planning

Define scope, region, and workload owner
- Document the workload's named owner, target cloud (AWS / Azure / GCP), primary region, and whether secondary-region failover is in scope. Pin the region early — data-residency commitments (Schrems II, regional contracts) are downstream of this choice.
Collects file
Confirm CAB approval and change window
- Submit the RFC through the PSA / ITSM (ServiceNow, ConnectWise, Jira Service Management). Confirm the change is classified as normal vs. emergency, the approved window does not overlap a freeze (quarter-end, peak season), and the rollback approver is named.
Build the deployment runbook and rollback plan
- The runbook lists every action with owner, timestamp, and verification command. The rollback plan must be tested in staging, not theoretical — a rollback that depends on an untested DNS TTL or a snapshot nobody verified is the most common failed-rollback pattern.
Inventory dependencies and IaC modules
- List upstream and downstream services with versions: Terraform module versions pinned, container base images digest-pinned, vendor SaaS endpoints, certificates, secrets in the vault. Floating tags (latest) in production are a recurring outage source.
Confirm budget and committed-use reservations
- Validate the cost estimate against the approved budget. Reserve instances or commit savings plans before go-live where utilization is predictable; tag every resource with cost-center, environment, and owner so showback works on day one.

Security and Compliance

Classify the workload's data sensitivity
- Classify per the org's data-handling policy. Regulated workloads (PHI, cardholder data, PII subject to GDPR / CCPA) trigger additional controls — BAAs, PCI scoping, encryption-key custody requirements, pen-test scheduling. Get this right before the network design hardens.
Collects list
Run a threat model against the architecture
- Walk the data flow from ingress through processing to storage. Catalog trust boundaries, identify where untrusted input enters, and confirm authentication on every internal hop. STRIDE or a lightweight equivalent is fine — the output is a list of mitigations, not a thick document.
Configure IAM roles with least-privilege boundaries
- No standing admin. Workload roles use short-lived credentials (IRSA, workload identity, managed identities); human access is JIT-elevated through Entra PIM or the PAM (CyberArk, BeyondTrust). Service principals have a documented owner and a rotation schedule — the 6-year-old service account with hardcoded Domain Admin is the failure pattern to avoid.
Enable encryption at rest and TLS in transit
- KMS-managed keys on storage, databases, and snapshots; TLS 1.2+ on every public and internal endpoint with ACME automation for renewal. Confirm key rotation policy and that backup copies are encrypted with the same custody model — an unencrypted snapshot in a cross-account bucket defeats the entire control.
Wire activity logs into the SIEM
- CloudTrail / Azure Activity Log / GCP Audit Logs forward to Splunk, Sentinel, or the MDR provider. Confirm log retention meets the longest applicable requirement (PCI: 1 year, SOX: 7 years), and that detections fire on the canary event you stage as part of validation.
Schedule penetration test before go-live
- Triggered when the workload handles regulated data (PCI ASV scan, HIPAA risk analysis, SOC 2 annual pen test). Book the assessor with enough lead time to complete and remediate before the change window — findings discovered the day before go-live force a delay or a documented risk acceptance.

Infrastructure Setup

Provision the landing zone via Terraform
- Apply the IaC pipeline (Terraform Cloud, Atlantis, or in-CI). State backend is remote with locking; no one runs terraform apply from a laptop. Plan output is reviewed in PR; resources land in the account and OU dictated by the landing-zone framework (Control Tower, Azure Landing Zones).
Configure VPC, subnets, and route tables
- Public, private, and data subnets across at least two AZs. Confirm CIDR ranges do not overlap with on-prem (transit-gateway peering breaks otherwise) and that NAT egress is sized for the workload — bursty egress through a single NAT gateway is a quiet outage source.
Deploy load balancers and auto-scaling groups
- ALB / Application Gateway / GCLB with health checks pointing at a real readiness endpoint, not /. ASG min/max sized for the SLA; cooldowns set so a brief restart storm does not double the fleet. Validate health-check thresholds with a deliberately failing instance.
Apply security groups and WAF rules
- Default-deny, explicit allows, no 0.0.0.0/0 on management ports. WAF rules in count-mode first to validate against legitimate traffic, then flip to block once the false-positive rate is acceptable.
Validate DNS records and TLS certificates
- Confirm each FQDN resolves, the cert chain is valid (intermediates included — partial chains break older clients), and TTLs are pre-lowered to 60s before the cutover so a rollback can flip DNS quickly. Add the cert to the renewal monitor immediately.

Application Deployment

Tag the release artifact and freeze dependencies
- Cut a signed tag in the registry; pin container images by digest, not tag. The artifact deployed to staging must be byte-identical to the one promoted to prod — rebuilding between environments invalidates the staging signal.
Run the CI/CD pipeline to staging
- GitHub Actions / GitLab CI / Azure DevOps deploys to the staging environment, runs database migrations in dry-run mode, and confirms the application boots cleanly. Migration failures in production usually replay a missed staging warning.
Execute smoke and load tests in staging
- Synthetic checks for the named user journeys plus a load test at 1.5x peak. Compare p95 / p99 latency against the prior baseline. A passing functional test with degraded p99 is a fail — note it before promoting.
Collects list
File a remediation ticket for failed staging tests
- Open a P2 in the PSA / ITSM with the failing test output, owner, and target re-test date. Notify the CAB so the change window can be rescheduled rather than blown through; a partial deploy past a known failure is the worst outcome.
Promote the release to production
- Execute the runbook step-by-step within the approved window. Use blue/green or canary where the platform supports it — flip a small traffic percentage first, watch error rates and latency for the agreed soak period, then increase.
Collects list Collects paragraph Collects datetime
Execute the rollback playbook
- Triggered when the deployment outcome is rollback. Flip DNS or the canary weight back to the prior version, restore the database from the pre-deploy snapshot if migrations ran forward, and notify the CAB and downstream consumers. Capture timeline detail for the post-incident review while it is fresh.

Post-Deployment Management

Confirm monitoring alerts and dashboards
- Datadog / Sentinel / LogicMonitor dashboards loaded with the new resources, paging routes pointed at the right PagerDuty / Opsgenie schedule, and a synthetic check fires a low-severity test page that the on-call actually receives. Alerts that route to a deprecated mailbox are how outages go silent.
Validate auto-scaling and failover behavior
- Drive synthetic load to confirm scale-out triggers at the documented threshold and scale-in does not flap. Force a single-AZ failure (terminate an instance, drain a node) to confirm the LB removes it before users notice.
Run a backup and restore drill
- Restore from the latest snapshot into an isolated account or VPC and verify the application boots and serves a known-good record. The 3-2-1 rule (3 copies, 2 media, 1 offsite, immutable) is documentation; the drill is the proof. A green backup metric without a tested restore is the single most common ransomware-day surprise.
Collects number Collects file
Right-size resources from one-week telemetry
- Pull CPU / memory / IOPS percentiles from the first week and reconcile against the provisioned SKU. Over-provisioned instances waste reservation; under-provisioned ones cause paging at the next traffic peak. Update the IaC, do not click in the console.
Hold the post-deployment review
- 30 minutes with the deploy lead, a security rep, and the application owner. What surprised us, what we would change in the runbook, what carries forward to the next deployment template. File the notes against the change record so the next CAB submission inherits the lesson.

Use this template

Copy it to your account, customize the steps, and run it with your team in minutes.

Use this workflow Start free trial

Sections 5

Steps 27

Category Systems Administration

Price Free to start

Need a different process

Browse hundreds of free templates across every team and industry.

Back to template library

Related templates

More workflows your team can run.

Systems Administration

Run Cloud Deployment Checklist with your team

Customize the steps, assign roles, set a schedule, and keep a complete record for every run.