Containerization Rollout Checklist

Runtime and Registry Setup

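    Verify the kernel version (cgroups v2 needs 4.5 at minimum, and Kubernetes recommends 5.8 or later), confirm the overlay2 storage driver is available, and check that SELinux/AppArmor profiles are compatible. RHEL 7 and Ubuntu 18.04 hosts are common gotchas: both default to cgroups v1. Ubuntu 18.04 needs explicit boot-parameter changes to switch, and RHEL 7's 3.10 kernel predates cgroups v2, so plan an OS upgrade there.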
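
    The kernel and cgroups check above can be scripted as a preflight. This is a minimal sketch, not a complete host audit: the (4, 5) floor and the function names are assumptions for illustration, and the cgroups test relies on the unified hierarchy exposing a cgroup.controllers file at its root.

```python
import os
import platform
import re

# Assumed floor for cgroups v2; raise it if your orchestrator needs newer.
MIN_KERNEL = (4, 5)


def parse_kernel_release(release):
    """Return (major, minor) from a string like '5.15.0-105-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release, minimum=MIN_KERNEL):
    """True when the release is at or above the (major, minor) floor."""
    return parse_kernel_release(release) >= minimum


def cgroup_v2_mounted(root="/sys/fs/cgroup"):
    """True when the unified (v2) hierarchy is mounted at root."""
    return os.path.isfile(os.path.join(root, "cgroup.controllers"))


if __name__ == "__main__":
    release = platform.release()
    print("kernel", release, "ok" if kernel_at_least(release) else "too old")
    print("cgroups v2 mounted:", cgroup_v2_mounted())
```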

    Install from the vendor repo (Docker CE, or the containerd.io package from Docker's apt/yum repo), never the distro default, which typically lags by 1-2 minor versions. Pin the version explicitly so unattended-upgrades doesn't bump the runtime mid-week.

    Apply CIS Docker Benchmark settings: disable legacy registry v1, enable user namespace remapping, set live-restore, configure log rotation on the json-file driver (default is unbounded — fills /var fast on busy hosts).
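
    As an illustration of the daemon settings above, the sketch below emits a candidate /etc/docker/daemon.json. The 10m size and 3-file rotation values are assumptions, not benchmark requirements, and the legacy-registry setting is left out because it depends on the Docker version in use.

```python
import json


def build_daemon_config(max_size="10m", max_file=3):
    """Return a dict suitable for serializing to daemon.json."""
    return {
        "live-restore": True,          # keep containers up across daemon restarts
        "userns-remap": "default",     # map container root to an unprivileged host user
        "log-driver": "json-file",
        # log-opts values must be strings in daemon.json
        "log-opts": {"max-size": max_size, "max-file": str(max_file)},
    }


if __name__ == "__main__":
    print(json.dumps(build_daemon_config(), indent=2))
```

    Review the output before writing it to the host; enabling userns-remap on an existing host changes where image and container data is stored.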

    Configure a registry — Harbor self-hosted, ECR, ACR, GAR, or Artifactory. Set up pull-through cache for Docker Hub to dodge rate-limit outages. Store registry credentials in the orchestrator's secret store, not in /root/.docker/config.json.

    Pull and run a known-good image (alpine, hello-world, or your team's canary) to confirm DNS resolution, registry auth, and outbound network egress all work end-to-end before any real workloads land.

Image Security and Compliance

    Run Trivy, Grype, or Snyk against the base images you plan to allow. Block any image with unpatched critical CVEs from the registry. Distroless and Chainguard images dramatically reduce the attack surface vs. ubuntu:latest.

    If accepting a critical CVE, file the exception with named owner, compensating control, and review date. Exceptions without expiry become permanent — set a 90-day re-review on the calendar and tag the exception in the registry.
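
    One way to keep exceptions from going stale is to record them as data and check the review date automatically. A minimal sketch follows; the CveException class and its field names are hypothetical and not tied to any scanner or registry API.

```python
from dataclasses import dataclass
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # the 90-day re-review from the checklist


@dataclass(frozen=True)
class CveException:
    cve_id: str
    image: str
    owner: str
    compensating_control: str
    granted_on: date

    @property
    def review_due(self):
        """Date by which the exception must be re-reviewed."""
        return self.granted_on + REVIEW_INTERVAL

    def is_overdue(self, today):
        """True once the review date has passed."""
        return today > self.review_due


def overdue(exceptions, today):
    """Return the exceptions whose review date has passed."""
    return [e for e in exceptions if e.is_overdue(today)]
```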

    Sign images with Cosign or Notation (Notary v2). Configure the orchestrator's admission controller (Kyverno, OPA Gatekeeper, or Connaisseur) to reject unsigned images. Don't ship the signing key on builder nodes — use a KMS-backed key.

    Map roles to AD/Entra ID groups via SSO — never local kubeconfig accounts for humans. Grant cluster-admin only via break-glass account stored in PAM. Developers get edit on their namespace, not view at cluster scope.

    Default-deny ingress and egress per namespace, then allow only required paths. NetworkPolicy is a built-in API, but it is only enforced when the CNI plugin supports it (Calico and Cilium do); without enforcement, a flat pod network lets a compromised container reach every other workload.
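
    The default-deny baseline can be generated per namespace. The sketch below builds a NetworkPolicy manifest as JSON, which kubectl accepts alongside YAML; the policy name default-deny-all and the team-a namespace are assumptions.

```python
import json


def default_deny_policy(namespace):
    """NetworkPolicy that selects every pod and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},                     # empty selector matches all pods
            "policyTypes": ["Ingress", "Egress"],  # no rules listed, so both are denied
        },
    }


if __name__ == "__main__":
    print(json.dumps(default_deny_policy("team-a"), indent=2))
```

    Denying egress also blocks DNS lookups, so an explicit allow rule for the cluster DNS service is usually the first exception to add.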

    Ship container stdout/stderr and Kubernetes audit logs to Splunk, Sentinel, or Elastic via Fluent Bit or Vector. Audit logs are required for SOC 2 CC7.2 and the only way to forensically reconstruct a kubectl exec incident.

Resource Governance

    Every pod gets requests + limits. No-limit pods can exhaust node memory and trigger OOM kills of their neighbors; pods whose requests equal their limits for every container land in the Guaranteed QoS class and are the last to be evicted under node pressure. Use VPA recommendations as a starting point, not as truth.
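
    The QoS rules can be checked before a manifest is applied. This is a simplified sketch of the classification: it compares quantities as plain strings, so "500m" and "0.5" would be treated as different, and the container dict shape is an assumption for illustration.

```python
RESOURCES = ("cpu", "memory")


def _has_any(container):
    """True when the container sets any cpu/memory request or limit."""
    return any(
        res in container.get(kind, {})
        for kind in ("requests", "limits")
        for res in RESOURCES
    )


def qos_class(containers):
    """Classify a pod from its containers' requests and limits."""
    if not any(_has_any(c) for c in containers):
        return "BestEffort"
    for c in containers:
        limits = c.get("limits", {})
        requests = c.get("requests", {})
        for res in RESOURCES:
            if res not in limits:
                return "Burstable"
            # An omitted request defaults to the limit.
            if requests.get(res, limits[res]) != limits[res]:
                return "Burstable"
    return "Guaranteed"
```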

    Set HPA on CPU or custom metrics via the metrics-server or Prometheus Adapter. Pair with cluster-autoscaler or Karpenter so pod scale-up actually gets nodes. Tune scale-down stabilization window — aggressive defaults cause flapping under bursty load.
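
    For sizing intuition, the core HPA scaling rule is a ratio of observed to target metric, rounded up. The sketch below shows that rule with the default 10% tolerance; it deliberately leaves out min/max replica bounds, stabilization windows, and pod readiness handling.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Replica count from the ratio of observed to target metric value."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, do not scale
    return math.ceil(current_replicas * ratio)
```

    For example, 4 replicas averaging 90% CPU against a 60% target scale to 6, while 63% against the same target stays at 4.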

    ResourceQuota caps total CPU/memory/storage per namespace; LimitRange sets per-container defaults for pods that omit requests or limits. Without these, one team's runaway CronJob can starve the cluster. Quota a dev namespace tighter than prod to encourage right-sizing.
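
    A sketch of the two objects for one namespace, emitted as a JSON List that kubectl can apply. All numbers and object names here are placeholder assumptions; pick values from observed usage.

```python
import json


def resource_quota(namespace, cpu, memory, storage):
    """Namespace-wide ceiling on summed requests, limits, and PVC storage."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "compute-quota", "namespace": namespace},
        "spec": {"hard": {
            "requests.cpu": cpu, "requests.memory": memory,
            "limits.cpu": cpu, "limits.memory": memory,
            "requests.storage": storage,
        }},
    }


def limit_range(namespace, cpu, memory):
    """Defaults applied to containers that omit requests or limits."""
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "container-defaults", "namespace": namespace},
        "spec": {"limits": [{
            "type": "Container",
            "default": {"cpu": cpu, "memory": memory},
            "defaultRequest": {"cpu": cpu, "memory": memory},
        }]},
    }


if __name__ == "__main__":
    items = [resource_quota("dev", "8", "16Gi", "100Gi"),
             limit_range("dev", "250m", "256Mi")]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```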

    Deploy kube-prometheus-stack with the standard kubernetes-mixin dashboards. Alert on node memory pressure, persistent OOMKilled events, and CPU throttling above 25% of CFS periods. Page on cluster-level signals; ticket on namespace-level signals.
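
    The throttling threshold is a ratio of throttled CFS periods to total CFS periods over a window. A minimal sketch of that arithmetic, assuming the inputs are increases of the cAdvisor counters container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total over the same window:

```python
def throttle_ratio(throttled_periods, total_periods):
    """Fraction of CFS periods in which the container was throttled."""
    if total_periods <= 0:
        return 0.0  # no scheduling periods observed in the window
    return throttled_periods / total_periods


def should_alert(throttled_periods, total_periods, threshold=0.25):
    """True when throttling exceeds the checklist's 25% threshold."""
    return throttle_ratio(throttled_periods, total_periods) > threshold
```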

    Multi-stage builds drop compilers and build artifacts from the runtime image — a Go service goes from 800MB to 20MB. Smaller images mean faster pulls, faster pod startup, and a smaller CVE attack surface.

Persistent Storage and Backup

    Configure a CSI driver — EBS, Azure Disk, GCP PD, Longhorn, or Rook-Ceph. Define StorageClasses with reclaimPolicy=Retain for prod and Delete for dev. WaitForFirstConsumer binding mode prevents zone-mismatch errors on multi-AZ clusters.
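
    The prod/dev split can be expressed as two StorageClasses generated from one helper. In this sketch the class names are assumptions and the EBS CSI provisioner is used only as an example; substitute the provisioner for your driver.

```python
import json


def storage_class(name, provisioner, retain):
    """StorageClass with topology-aware binding and an explicit reclaim policy."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        "reclaimPolicy": "Retain" if retain else "Delete",
        # Delay binding until a pod is scheduled, so the volume lands in its zone.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


if __name__ == "__main__":
    classes = [storage_class("prod-ssd", "ebs.csi.aws.com", retain=True),
               storage_class("dev-ssd", "ebs.csi.aws.com", retain=False)]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": classes}, indent=2))
```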

    Use Velero or Kasten K10 for cluster-state + PV snapshots, replicated to an object-locked S3 bucket in a separate AWS account. A backup that's writable from the production cluster is not ransomware-resilient, because an attacker who compromises the cluster can encrypt or delete it along with the primary data. That is a recurring lesson from 2023-2024 Kubernetes ransomware cases.

    If an immutable offsite copy is not in place, do not promote the cluster to production. Stand up a Velero target with S3 Object Lock in compliance mode, or contract MSP360 or Veeam Kasten as a managed alternative. Re-run the backup verification above before proceeding.

    Pull secrets from Vault, AWS Secrets Manager, or Azure Key Vault via the Secrets Store CSI driver. Plain Kubernetes Secrets are base64-encoded, not encrypted: anyone with get permission on Secrets in the namespace can read them. Never bake secrets into the image.
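
    To make the base64 point concrete, the sketch below round-trips a made-up value the way a Secret's data field stores it. No key is involved at any step, which is why read access to the object is equivalent to read access to the plaintext.

```python
import base64

# Hypothetical secret value, encoded as it would appear in a Secret manifest.
plaintext = "s3cr3t-password"
stored = base64.b64encode(plaintext.encode("utf-8")).decode("ascii")

# Anyone who can read the stored field can reverse it without a key.
recovered = base64.b64decode(stored).decode("utf-8")

print(stored)
print(recovered)
```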

    Enable EncryptionConfiguration on the API server (KMS provider via Vault or cloud KMS) so etcd-stored secrets are encrypted. Enable volume-level encryption on the underlying storage. Required for HIPAA, PCI DSS, and most SOC 2 controls.

    Restore a representative PV and namespace into an isolated test cluster. Time the restore against your stated RTO. Backup that's never restored is a backup that doesn't work — and the first restore attempt is always where the credential rotation, format change, or missing key is discovered.
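
    Timing the drill can be as simple as wrapping the restore step and comparing the elapsed time with the RTO. A minimal sketch follows; run_restore stands in for whatever triggers your restore and waits for it to finish, and it is a hypothetical callable, not a Velero or Kasten API.

```python
import time


def timed_restore(run_restore, rto_seconds):
    """Run the restore callable and report (elapsed_seconds, within_rto)."""
    start = time.monotonic()
    run_restore()  # must block until the restore has completed
    elapsed = time.monotonic() - start
    return elapsed, elapsed <= rto_seconds


if __name__ == "__main__":
    # Placeholder restore step; replace with a call that runs the real restore.
    elapsed, ok = timed_restore(lambda: time.sleep(0.1), rto_seconds=4 * 3600)
    print(f"restore took {elapsed:.1f}s, within RTO: {ok}")
```

    Record the measured time with each drill so drift against the RTO is visible before an incident.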
