Containerization Rollout Checklist
Runtime and Registry Setup
Verify the kernel version (4.5 or later for cgroups v2; Kubernetes recommends 5.8+), confirm the overlay2 storage driver is available, and check that SELinux/AppArmor profiles are compatible. RHEL 7 and Ubuntu 18.04 hosts are common gotchas: both ship with cgroups v1. Ubuntu 18.04 needs an explicit boot-parameter change, and RHEL 7's 3.10 kernel predates cgroups v2, so plan an OS upgrade there.
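The kernel and cgroups check above can be scripted per host. The following is a minimal sketch, assuming a kernel floor of 4.5 (adjust `MIN_KERNEL` to your own policy) and the conventional `/sys/fs/cgroup` mount point; the function names are illustrative only.

```python
import os
import platform
import re

# Assumed floor: cgroups v2 became the official interface in Linux 4.5.
MIN_KERNEL = (4, 5)


def parse_kernel(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_ok(release: str, minimum: tuple[int, int] = MIN_KERNEL) -> bool:
    """True when the kernel is at or above the assumed minimum."""
    return parse_kernel(release) >= minimum


def cgroup_v2_mounted(root: str = "/sys/fs/cgroup") -> bool:
    """On a unified (v2) hierarchy, cgroup.controllers exists at the cgroup root."""
    return os.path.exists(os.path.join(root, "cgroup.controllers"))


if __name__ == "__main__":
    release = platform.release()
    print(f"kernel {release}: ok={kernel_ok(release)}, cgroup v2={cgroup_v2_mounted()}")
```

Running it on each candidate host before the runtime install gives a quick pass/fail list; a host that passes the kernel check but reports no unified hierarchy is usually one that still needs the boot-parameter change.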
Install via the vendor repo (Docker CE and the containerd.io package both come from Docker's apt/yum repo), never the distro default, which typically lags by 1-2 minor versions. Pin the version explicitly so unattended-upgrades doesn't bump the runtime mid-week.
Apply CIS Docker Benchmark settings: disable legacy registry v1, enable user namespace remapping, set live-restore, configure log rotation on the json-file driver (default is unbounded — fills /var fast on busy hosts).
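Several of these settings live in `/etc/docker/daemon.json`. Below is a hedged sketch that generates such a file; the keys are standard dockerd options, but the rotation values (`10m`, three files) are assumptions to tune per host, and `build_daemon_config` is a hypothetical helper name.

```python
import json


def build_daemon_config(max_size: str = "10m", max_file: int = 3) -> dict:
    """Return a daemon.json fragment covering log rotation, live-restore, and userns remap."""
    return {
        "live-restore": True,            # keep containers running across daemon restarts
        "userns-remap": "default",       # map container root to an unprivileged host user
        "log-driver": "json-file",
        # log-opts values must be strings in daemon.json, including max-file.
        "log-opts": {"max-size": max_size, "max-file": str(max_file)},
    }


if __name__ == "__main__":
    print(json.dumps(build_daemon_config(), indent=2))
```

Write the output to `/etc/docker/daemon.json` and restart the daemon in a maintenance window. Note that enabling user namespace remapping changes where image and container data is stored, so test it on a non-production host first.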
Configure a registry — Harbor self-hosted, ECR, ACR, GAR, or Artifactory. Set up pull-through cache for Docker Hub to dodge rate-limit outages. Store registry credentials in the orchestrator's secret store, not in /root/.docker/config.json.
Pull and run a known-good image (alpine, hello-world, or your team's canary) to confirm DNS resolution, registry auth, and outbound network egress all work end-to-end before any real workloads land.
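Before pulling the canary image, it can help to separate DNS failures from network failures. This is a rough sketch using only the standard library; `registry.example.com` is a placeholder hostname, and the check covers name resolution and TCP reachability only, not registry authentication.

```python
import socket


def can_resolve(host: str) -> bool:
    """True when the host name resolves to at least one address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """True when a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    registry = "registry.example.com"  # placeholder: substitute your registry host
    print(f"dns={can_resolve(registry)} tcp443={can_connect(registry, 443)}")
```

If both checks pass and the image pull still fails, the problem is most likely credentials or a TLS-intercepting proxy rather than basic egress.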
Image Security and Compliance
Run Trivy, Grype, or Snyk against the base images you plan to allow. Block any image with unpatched critical CVEs from the registry. Distroless and Chainguard images dramatically reduce the attack surface vs. ubuntu:latest.
If accepting a critical CVE, file the exception with named owner, compensating control, and review date. Exceptions without expiry become permanent — set a 90-day re-review on the calendar and tag the exception in the registry.
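One way to keep exceptions from becoming permanent is to record them as structured data with a computed review date. The sketch below assumes the 90-day interval from the step above; the `CveException` class and its field names are illustrative, not part of any scanner's API.

```python
from dataclasses import dataclass
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # re-review period stated in the checklist


@dataclass(frozen=True)
class CveException:
    cve_id: str
    image: str
    owner: str                  # a named person, not a team alias
    compensating_control: str
    granted_on: date

    @property
    def review_due(self) -> date:
        """Date by which the exception must be re-reviewed."""
        return self.granted_on + REVIEW_INTERVAL

    def is_overdue(self, today: date) -> bool:
        """True once the review date has passed."""
        return today > self.review_due
```

A scheduled job could load these records and fail the pipeline, or open a ticket, for any exception where `is_overdue(date.today())` is true.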
Sign images with Cosign or Notation (Notary v2). Configure the orchestrator's admission controller (Kyverno, OPA Gatekeeper, or Connaisseur) to reject unsigned images. Don't ship the signing key on builder nodes — use a KMS-backed key.
Map roles to AD/Entra ID groups via SSO — never local kubeconfig accounts for humans. Grant cluster-admin only via break-glass account stored in PAM. Developers get edit on their namespace, not view at cluster scope.
Default-deny ingress and egress per namespace, then allow only required paths. Calico, Cilium, or the built-in NetworkPolicy work; without them, a flat pod network lets a compromised container reach every other workload.
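A default-deny policy is small enough to generate per namespace. The following sketch builds a standard `networking.k8s.io/v1` NetworkPolicy as JSON, which `kubectl apply -f -` accepts; the policy name `default-deny-all` and the helper name are arbitrary choices.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """NetworkPolicy that selects every pod and allows no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},                    # empty selector matches all pods
            "policyTypes": ["Ingress", "Egress"],  # no rules listed, so both are denied
        },
    }


if __name__ == "__main__":
    print(json.dumps(default_deny_policy("team-a"), indent=2))
```

Because this also blocks egress to cluster DNS, the first allow rule most namespaces need is UDP/TCP port 53 to the DNS pods.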
Ship container stdout/stderr and Kubernetes audit logs to Splunk, Sentinel, or Elastic via Fluent Bit or Vector. Audit logs are required for SOC 2 CC7.2 and the only way to forensically reconstruct a kubectl exec incident.
Resource Governance
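Every pod gets requests + limits. A container without a memory limit can exhaust node memory and trigger the kernel OOM killer; pods whose requests equal their limits in every container land in the Guaranteed QoS class and are the last to be evicted under node pressure. Use VPA recommendations as a starting point, not as truth.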
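The QoS rules can be checked against manifests before they reach the cluster. This is a simplified sketch of the classification logic: it compares quantity strings literally (so `500m` and `0.5` would not match) and ignores init containers, both of which are simplifying assumptions.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' resources.

    Each container is a dict like {"requests": {...}, "limits": {...}}.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults an unset request to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

For example, a pod with a single container requesting and limited to `500m` CPU and `256Mi` memory classifies as Guaranteed, while one that sets only a CPU request is Burstable.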
Set HPA on CPU via metrics-server, or on custom metrics via Prometheus Adapter. Pair with cluster-autoscaler or Karpenter so pod scale-up actually gets nodes. Tune the scale-down stabilization window (300 seconds by default); shortening it too far causes flapping under bursty load.
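When choosing HPA targets, it helps to know the scaling arithmetic. The sketch below follows the formula in the Kubernetes HPA documentation, desired = ceil(current × currentMetric / targetMetric), with the default 10% tolerance; it omits min/max replica clamping and the stabilization window.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
) -> int:
    """Replica count the HPA would compute, before min/max clamping."""
    ratio = current_metric / target_metric
    # Within tolerance of the target, the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)
```

With 4 replicas averaging 90% CPU against a 60% target, the ratio is 1.5 and the result is 6 replicas; at 63% the ratio is within tolerance and nothing changes.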
ResourceQuota caps total CPU/memory/storage per namespace; LimitRange sets per-container defaults and min/max bounds. Without these, one team's runaway CronJob can starve the cluster. Quota a dev namespace tighter than prod to encourage right-sizing.
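Both objects can be stamped out together for each new namespace. This is a sketch only: the object names and the container defaults (`500m`/`256Mi` limits, `100m`/`128Mi` requests) are placeholder values, not recommendations.

```python
import json


def namespace_guardrails(namespace: str, cpu: str, memory: str) -> dict:
    """Return a v1 List holding a ResourceQuota and a LimitRange for one namespace."""
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "compute-quota", "namespace": namespace},
        "spec": {"hard": {
            "requests.cpu": cpu, "requests.memory": memory,
            "limits.cpu": cpu, "limits.memory": memory,
        }},
    }
    limit_range = {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "container-defaults", "namespace": namespace},
        "spec": {"limits": [{
            "type": "Container",
            "default": {"cpu": "500m", "memory": "256Mi"},         # applied as limits
            "defaultRequest": {"cpu": "100m", "memory": "128Mi"},  # applied as requests
        }]},
    }
    return {"apiVersion": "v1", "kind": "List", "items": [quota, limit_range]}


if __name__ == "__main__":
    print(json.dumps(namespace_guardrails("dev-team-a", "8", "16Gi"), indent=2))
```

Calling it with smaller totals for dev namespaces than for prod implements the tighter dev quota described above.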
Deploy kube-prometheus-stack with the standard kubernetes-mixin dashboards. Alert on node memory pressure, persistent OOMKilled events, and throttling > 25%. Page on cluster-level signals; ticket on namespace-level signals.
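The throttling threshold can be made concrete. The sketch below assumes "throttling" means the share of CFS periods in which a container was throttled, computed from increases in the cAdvisor counters `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total` over the same window; that reading of the 25% figure is an assumption.

```python
def throttle_ratio(throttled_periods: float, total_periods: float) -> float:
    """Fraction of CFS periods in which the container was throttled."""
    if total_periods <= 0:
        return 0.0  # no scheduling periods observed, nothing to report
    return throttled_periods / total_periods


def should_alert(throttled_periods: float, total_periods: float, threshold: float = 0.25) -> bool:
    """True when throttling strictly exceeds the threshold."""
    return throttle_ratio(throttled_periods, total_periods) > threshold
```

A container throttled in 30 of 100 periods gives a ratio of 0.3 and would alert; sustained values at that level usually mean the CPU limit is too low for the workload.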
Multi-stage builds drop compilers and build artifacts from the runtime image — a Go service goes from 800MB to 20MB. Smaller images mean faster pulls, faster pod startup, and a smaller CVE attack surface.
Persistent Storage and Backup
Configure a CSI driver — EBS, Azure Disk, GCP PD, Longhorn, or Rook-Ceph. Define StorageClasses with reclaimPolicy=Retain for prod and Delete for dev. WaitForFirstConsumer binding mode prevents zone-mismatch errors on multi-AZ clusters.
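The prod/dev split comes down to one field per StorageClass. Here is a minimal sketch that generates the manifest as JSON; `ebs.csi.aws.com` is the AWS EBS CSI provisioner name, used as an example, and the class names are hypothetical.

```python
import json


def storage_class(name: str, provisioner: str, production: bool) -> dict:
    """StorageClass with Retain for prod, Delete for dev, and topology-aware binding."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        "reclaimPolicy": "Retain" if production else "Delete",
        # Delay volume creation until a pod is scheduled, so the volume
        # is provisioned in the same zone as the node.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


if __name__ == "__main__":
    for sc in (storage_class("prod-ssd", "ebs.csi.aws.com", True),
               storage_class("dev-ssd", "ebs.csi.aws.com", False)):
        print(json.dumps(sc, indent=2))
```

With Retain, released volumes must be cleaned up manually, so pair the prod class with a periodic review of orphaned PersistentVolumes.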
Use Velero or Kasten K10 for cluster-state and PV snapshots, replicated to an object-locked S3 bucket in a separate AWS account. A backup that's writable from the production cluster is not ransomware-resilient, because credentials stolen from the cluster can delete or encrypt it; that is a recurring lesson from 2023-2024 Kubernetes ransomware cases.
If an immutable offsite copy is not in place, do not promote the cluster to production. Stand up a Velero target with S3 Object Lock in compliance mode, or contract a managed alternative such as MSP360 or Veeam Kasten. Re-run the prior verification step before proceeding.
Pull secrets from Vault, AWS Secrets Manager, or Azure Key Vault via the Secrets Store CSI driver. Plain Kubernetes Secrets are base64-encoded, not encrypted: anyone with get access to Secrets in the namespace can read them. Never bake secrets into the image.
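To see why base64 offers no protection, consider how little it takes to reverse. This short sketch uses only the standard library; the value `hunter2` is a made-up example.

```python
import base64


def encode_secret(plaintext: str) -> str:
    """What a Secret's data field holds: base64 of the raw bytes."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret(stored: str) -> str:
    """Reversing it needs no key, only the standard library."""
    return base64.b64decode(stored).decode("utf-8")


if __name__ == "__main__":
    stored = encode_secret("hunter2")
    print(stored, "->", decode_secret(stored))
```

Anyone who can read the Secret object can run the decode step, which is why access to Secrets should be granted as narrowly as access to the credentials themselves.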
Enable EncryptionConfiguration on the API server (KMS provider via Vault or cloud KMS) so etcd-stored secrets are encrypted. Enable volume-level encryption on the underlying storage. Required for HIPAA, PCI DSS, and most SOC 2 controls.
Restore a representative PV and namespace into an isolated test cluster. Time the restore against your stated RTO. Backup that's never restored is a backup that doesn't work — and the first restore attempt is always where the credential rotation, format change, or missing key is discovered.
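Timing the restore is easier to repeat when it is wrapped in a small harness. The sketch below assumes the restore itself is triggered by a callable you supply (for instance, one that shells out to your backup tool and waits for completion); `timed_restore` is a hypothetical helper, not part of any backup product.

```python
import time
from typing import Callable


def timed_restore(restore: Callable[[], None], rto_seconds: float) -> tuple[float, bool]:
    """Run a restore callable and report (elapsed seconds, whether it met the RTO)."""
    start = time.monotonic()  # monotonic clock is unaffected by wall-clock changes
    restore()
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= rto_seconds


if __name__ == "__main__":
    # Stand-in for a real restore; replace with a function that blocks until done.
    elapsed, met = timed_restore(lambda: time.sleep(0.1), rto_seconds=4 * 3600)
    print(f"restore took {elapsed:.1f}s, within RTO: {met}")
```

Record the elapsed time from each drill; a trend that creeps toward the RTO as data volumes grow is worth catching before a real incident.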
