Cloud Deployment Checklist
Pre-Deployment Planning
Document the workload's named owner, target cloud (AWS / Azure / GCP), primary region, and whether secondary-region failover is in scope. Pin the region early — data-residency commitments (Schrems II, regional contracts) are downstream of this choice.
Submit the RFC through the PSA / ITSM (ServiceNow, ConnectWise, Jira Service Management). Confirm the change is classified as normal vs. emergency, the approved window does not overlap a freeze (quarter-end, peak season), and the rollback approver is named.
The runbook lists every action with owner, timestamp, and verification command. The rollback plan must be tested in staging, not theoretical — a rollback that depends on an untested DNS TTL or a snapshot nobody verified is the most common failed-rollback pattern.
List upstream and downstream services with versions: Terraform module versions pinned, container base images digest-pinned, vendor SaaS endpoints, certificates, secrets in the vault. Floating tags (latest) in production are a recurring outage source.
Validate the cost estimate against the approved budget. Reserve instances or commit savings plans before go-live where utilization is predictable; tag every resource with cost-center, environment, and owner so showback works on day one.
Security and Compliance
Classify per the org's data-handling policy. Regulated workloads (PHI, cardholder data, PII subject to GDPR / CCPA) trigger additional controls — BAAs, PCI scoping, encryption-key custody requirements, pen-test scheduling. Get this right before the network design hardens.
Walk the data flow from ingress through processing to storage. Catalog trust boundaries, identify where untrusted input enters, and confirm authentication on every internal hop. STRIDE or a lightweight equivalent is fine — the output is a list of mitigations, not a thick document.
No standing admin. Workload roles use short-lived credentials (IRSA, workload identity, managed identities); human access is JIT-elevated through Entra PIM or the PAM (CyberArk, BeyondTrust). Service principals have a documented owner and a rotation schedule — the 6-year-old service account with hardcoded Domain Admin is the failure pattern to avoid.
KMS-managed keys on storage, databases, and snapshots; TLS 1.2+ on every public and internal endpoint with ACME automation for renewal. Confirm key rotation policy and that backup copies are encrypted with the same custody model — an unencrypted snapshot in a cross-account bucket defeats the entire control.
CloudTrail / Azure Activity Log / GCP Audit Logs forward to Splunk, Sentinel, or the MDR provider. Confirm log retention meets the longest applicable requirement (PCI: 1 year, SOX: 7 years), and that detections fire on the canary event you stage as part of validation.
Triggered when the workload handles regulated data (PCI ASV scan, HIPAA risk analysis, SOC 2 annual pen test). Book the assessor with enough lead time to complete and remediate before the change window — findings discovered the day before go-live force a delay or a documented risk acceptance.
Infrastructure Setup
Apply the IaC pipeline (Terraform Cloud, Atlantis, or in-CI). State backend is remote with locking; no one runs terraform apply from a laptop. Plan output is reviewed in PR; resources land in the account and OU dictated by the landing-zone framework (Control Tower, Azure Landing Zones).
Public, private, and data subnets across at least two AZs. Confirm CIDR ranges do not overlap with on-prem (transit-gateway peering breaks otherwise) and that NAT egress is sized for the workload — bursty egress through a single NAT gateway is a quiet outage source.
ALB / Application Gateway / GCLB with health checks pointing at a real readiness endpoint, not /. ASG min/max sized for the SLA; cooldowns set so a brief restart storm does not double the fleet. Validate health-check thresholds with a deliberately failing instance.
Default-deny, explicit allows, no 0.0.0.0/0 on management ports. WAF rules in count-mode first to validate against legitimate traffic, then flip to block once the false-positive rate is acceptable.
Confirm each FQDN resolves, the cert chain is valid (intermediates included — partial chains break older clients), and TTLs are pre-lowered to 60s before the cutover so a rollback can flip DNS quickly. Add the cert to the renewal monitor immediately.
Application Deployment
Cut a signed tag in the registry; pin container images by digest, not tag. The artifact deployed to staging must be byte-identical to the one promoted to prod — rebuilding between environments invalidates the staging signal.
GitHub Actions / GitLab CI / Azure DevOps deploys to the staging environment, runs database migrations in dry-run mode, and confirms the application boots cleanly. Migration failures in production usually replay a missed staging warning.
Synthetic checks for the named user journeys plus a load test at 1.5x peak. Compare p95 / p99 latency against the prior baseline. A passing functional test with degraded p99 is a fail — note it before promoting.
Open a P2 in the PSA / ITSM with the failing test output, owner, and target re-test date. Notify the CAB so the change window can be rescheduled rather than blown through; a partial deploy past a known failure is the worst outcome.
Execute the runbook step-by-step within the approved window. Use blue/green or canary where the platform supports it — flip a small traffic percentage first, watch error rates and latency for the agreed soak period, then increase.
Triggered when the deployment outcome is rollback. Flip DNS or the canary weight back to the prior version, restore the database from the pre-deploy snapshot if migrations ran forward, and notify the CAB and downstream consumers. Capture timeline detail for the post-incident review while it is fresh.
Post-Deployment Management
Datadog / Sentinel / LogicMonitor dashboards loaded with the new resources, paging routes pointed at the right PagerDuty / Opsgenie schedule, and a synthetic check fires a low-severity test page that the on-call actually receives. Alerts that route to a deprecated mailbox are how outages go silent.
Drive synthetic load to confirm scale-out triggers at the documented threshold and scale-in does not flap. Force a single-AZ failure (terminate an instance, drain a node) to confirm the LB removes it before users notice.
Restore from the latest snapshot into an isolated account or VPC and verify the application boots and serves a known-good record. The 3-2-1 rule (3 copies, 2 media, 1 offsite, immutable) is documentation; the drill is the proof. A green backup metric without a tested restore is the single most common ransomware-day surprise.
Pull CPU / memory / IOPS percentiles from the first week and reconcile against the provisioned SKU. Over-provisioned instances waste reservation; under-provisioned ones cause paging at the next traffic peak. Update the IaC, do not click in the console.
30 minutes with the deploy lead, a security rep, and the application owner. What surprised us, what we would change in the runbook, what carries forward to the next deployment template. File the notes against the change record so the next CAB submission inherits the lesson.
Use this template in Manifestly
- Cloud Migration Checklist
- Cloud Security Checklist
- User Access Review Checklist
- Data Recovery Checklist
- Containerization Rollout Checklist
- Database Backup Checklist
- Password Management Checklist
- Backup and Restore Checklist
- Network Upgrade Checklist
- Server Backup Checklist
- Business Continuity Plan Checklist
- Problem Management Checklist
- Server Decommissioning Checklist
- Cloud Monitoring Checklist
- Hardware Inventory Checklist
- IT Regulatory Compliance Review
- Release Management Checklist
- Server Maintenance Checklist
- Rollback Plan Checklist
- Customer Support Ticket Workflow
- Software Upgrade Checklist
- Quarterly Compliance Reporting Checklist
- Patch Management Checklist
- Hardware Maintenance Checklist
- Server Security Checklist
- IT Emergency Response Checklist
- Incident Management Checklist
- Disaster Recovery Plan Checklist
- User Role Management Checklist
- Software Installation Checklist
- Compliance Audit Checklist
- Access Control Checklist
- Cloud Cost Management Checklist
- IT Staff Performance Review
- Change Management Checklist
- Firewall Configuration Checklist
- Security Audit Checklist
- Quarterly Network Security Review
- Database Migration Checklist
- Employee Onboarding Checklist
- Capacity Planning Checklist
- IT Budgeting Checklist
- Network Monitoring Checklist
- Database Installation Checklist
- IT Service Request Checklist
- Database Security Checklist
- System Monitoring Checklist
- Hardware Troubleshooting Checklist
- IT Strategy Checklist
- Patch Deployment Checklist
- Hardware Upgrade Checklist
- Performance Tuning Checklist
- Application Performance Monitoring Checklist
- Employee Training Checklist
- User Onboarding Checklist
- IT Vendor Management Checklist
- Server Build and Hardening Checklist
- IT Policy Review Checklist
- Help Desk Ticket Handling Checklist
- Infrastructure as Code Checklist
- Hardware Disposal Checklist
- IT Resource Allocation Checklist
- Incident Response Checklist
- Network Troubleshooting Checklist
- User Offboarding Checklist
- Change Management Checklist
- Software Installation Checklist
- Software Update Checklist
- Network Upgrade Checklist
- Server Decommissioning Checklist
- Release Management Checklist
- Rollback Plan Checklist
- Software Upgrade Checklist
- Software Installation Checklist
- Change Management Checklist
- Database Installation Checklist
- Patch Deployment Checklist
- Hardware Upgrade Checklist
- Performance Tuning Checklist
Ready to take control of your recurring tasks?
Start Free 14-Day TrialUse Slack? Sign up with one click
