Cloud Monitoring Checklist
A monthly review, run by the sysadmin or cloud operations lead, to verify that infrastructure, application, security, cost, and backup monitoring all work as designed across AWS, Azure, and SaaS estates.
Infrastructure Monitoring
-
Audit CloudWatch and Datadog agent coverage
Reconcile the asset inventory against agents reporting in the last 24 hours. Common drift sources: new EC2 instances launched outside the golden AMI, autoscaling groups missing the agent in user-data, and RDS / managed services that need agentless integrations rather than host agents.
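A minimal sketch of the reconciliation, assuming the asset inventory and each agent's last check-in time have already been exported into plain Python structures. The function name, host IDs, and data shapes are hypothetical and do not correspond to a CloudWatch or Datadog API.

```python
from datetime import datetime, timedelta, timezone

def find_unmonitored(inventory, last_seen, now, max_age=timedelta(hours=24)):
    """Return inventory hosts with no agent check-in within max_age of now."""
    cutoff = now - max_age
    return sorted(
        host for host in inventory
        if host not in last_seen or last_seen[host] < cutoff
    )

now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
inventory = {"i-web-01", "i-web-02", "i-batch-01"}  # hypothetical host IDs
last_seen = {
    "i-web-01": now - timedelta(minutes=5),  # reporting normally
    "i-web-02": now - timedelta(days=3),     # agent went silent
    # "i-batch-01" never reported at all
}
print(find_unmonitored(inventory, last_seen, now))
```

Managed services that rely on agentless integrations would need to be excluded from the inventory side before running a comparison like this.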
-
Tune CPU, memory, and disk thresholds
Review last month's alert-to-incident ratio per host class. Static 80% CPU thresholds on burstable t-series instances are a classic noisy-alert source; switch to anomaly-based or P95-over-window alerts for those workloads.
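To illustrate why a P95-over-window rule is quieter than a static threshold on bursty hosts, here is a small sketch. It uses a nearest-rank percentile over a list of CPU samples; the window size, sample values, and function names are assumptions for illustration only.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]

def should_alert(cpu_samples, threshold=80.0):
    """Alert only if the window's P95 exceeds the threshold."""
    return p95(cpu_samples) > threshold

# 20 one-minute samples: one short burst among otherwise idle readings.
bursty = [20.0] * 19 + [95.0]
# 20 samples where CPU sits above 80% for most of the window.
sustained = [85.0] * 15 + [30.0] * 5

print(max(bursty) > 80.0, should_alert(bursty))       # static rule vs P95 rule
print(max(sustained) > 80.0, should_alert(sustained))
```

A static check on the peak value fires for both windows, while the P95 rule fires only for the sustained one.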
-
Review network latency and packet loss
Check VPN and SD-WAN tunnels, transit gateway flows, and inter-region replication links. Compare latency P50/P95 against last month's baseline; flag links that drift more than 20%.
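The baseline comparison can be sketched as below. This assumes "drift" means an increase in latency relative to last month, and that per-link P95 values are already available as dictionaries; link names and figures are made up.

```python
def drifted_links(baseline_ms, current_ms, max_drift=0.20):
    """Return {link: fractional increase} for links whose latency rose more than max_drift."""
    flagged = {}
    for link, base in baseline_ms.items():
        current = current_ms.get(link)
        if current is None or base <= 0:
            continue  # nothing comparable for this link
        drift = (current - base) / base
        if drift > max_drift:
            flagged[link] = round(drift, 3)
    return flagged

baseline_p95 = {"vpn-hq": 40.0, "tgw-east-west": 12.0, "repl-us-eu": 90.0}
current_p95 = {"vpn-hq": 42.0, "tgw-east-west": 15.0, "repl-us-eu": 120.0}
print(drifted_links(baseline_p95, current_p95))
```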
-
Verify uptime SLOs across critical services
Pull the synthetic-check results (Pingdom, StatusCake, or equivalent) for each Tier 1 service. If more than 50% of the monthly error budget has already been consumed, schedule a reliability review with the service owner before month-end.
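A worked example of the error-budget arithmetic, reading the 50% figure as the share of the monthly budget already spent (an interpretation, not something the checklist defines). The SLO target and check counts are illustrative.

```python
def budget_consumed(slo_target, total_checks, failed_checks):
    """Fraction of the error budget spent, given an SLO target such as 0.999."""
    allowed_failures = (1.0 - slo_target) * total_checks
    if allowed_failures <= 0:
        raise ValueError("SLO target leaves no error budget")
    return failed_checks / allowed_failures

# 43,200 one-minute synthetic checks in a 30-day month, 99.9% SLO target:
# the budget is about 43 failed checks, and 26 have already failed.
consumed = budget_consumed(0.999, 43_200, 26)
needs_review = consumed > 0.5
print(f"{consumed:.0%} of budget spent, review needed: {needs_review}")
```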
-
Test alert routing to PagerDuty
Fire a synthetic test alert into each routing key and confirm it pages the correct on-call. The most common silent failure is a service that was renamed, leaving the integration key orphaned with no escalation policy attached.
-
Repair broken alert routing keys
For each failed routing test, rebind the integration key to the live escalation policy and re-test. File a ticket back to the service owner so the runbook reflects the current routing.
Application Performance
-
Review APM response times and throughput
Pull New Relic or Datadog APM trends for the top 10 transactions by traffic. Watch for slow regressions — a 50ms creep over four weeks rarely pages but degrades user experience and is harder to diagnose later.
-
Triage open Sentry error groups
Group by service and frequency. Anything firing more than 100 times per day with no assigned owner gets escalated to the service team. Auto-resolved errors with recurring spikes are usually a release-regression signal worth investigating.
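The escalation rule can be expressed as a simple filter. This sketch assumes error groups have been exported as dictionaries; the field names and group IDs are hypothetical and are not Sentry's API schema.

```python
def needs_escalation(groups, daily_limit=100):
    """Return IDs of error groups above daily_limit events per day with no owner."""
    return [
        g["id"] for g in groups
        if g["events_per_day"] > daily_limit and not g.get("owner")
    ]

groups = [
    {"id": "CHECKOUT-1", "service": "checkout", "events_per_day": 450, "owner": None},
    {"id": "SEARCH-7", "service": "search", "events_per_day": 900, "owner": "team-search"},
    {"id": "AUTH-3", "service": "auth", "events_per_day": 12, "owner": None},
]
print(needs_escalation(groups))
```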
-
Map application dependency health
Review the service map for each Tier 1 app: third-party APIs, internal microservices, databases, caches. Document any dependency added or removed since last review so on-call runbooks reflect the current topology.
-
Audit Core Web Vitals on customer-facing apps
Pull LCP, INP, and CLS from real-user monitoring for the past 28 days. Flag pages drifting toward the 'Needs Improvement' band; SEO impact lags behind the metric drift by weeks.
-
Flag slow database queries for optimization
Review RDS Performance Insights or query store for top-N slow queries. File optimization tickets for any query whose total time-spent doubled month-over-month — often a missing index after a schema change.
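A sketch of the month-over-month comparison, assuming total time per query (in seconds) has been exported for both months. Query identifiers and numbers are invented; queries that did not exist last month are skipped here and would need a separate look.

```python
def doubled_queries(last_month, this_month, factor=2.0):
    """Return query IDs whose total time grew by at least factor month over month."""
    flagged = []
    for query_id, prev_total in last_month.items():
        cur_total = this_month.get(query_id, 0.0)
        if prev_total > 0 and cur_total >= factor * prev_total:
            flagged.append(query_id)
    return sorted(flagged)

last_month = {"q_orders_by_user": 1200.0, "q_cart_lookup": 300.0}
this_month = {"q_orders_by_user": 2600.0, "q_cart_lookup": 320.0, "q_new_report": 50.0}
print(doubled_queries(last_month, this_month))
```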
Security and Compliance
-
Review SIEM alerts for anomalous logins
Look at impossible-travel, MFA-fatigue patterns, and legacy basic-auth attempts in Sentinel or Splunk. The classic miss: MFA enabled but legacy IMAP/POP/SMTP not blocked at the conditional access layer.
-
Investigate unauthorized access attempts
Walk every CRITICAL and HIGH alert from the SIEM since the last review. For each, confirm the attacker source, the blast radius, and whether any account or token requires rotation.
-
Activate the incident response playbook
Open the IR ticket, page the security on-call, isolate the affected identity or host, and start the chain-of-custody log. Notify the vCISO or DPO if PHI, PCI, or EU personal data may be in scope — breach-notification clocks start now, not after triage finishes.
-
Reconcile patching against CVE advisories
Cross-check Tenable / Qualys results against the current CISA KEV catalog. Anything on KEV with no compensating control gets a documented exception or a hotfix scheduled before next month's review.
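The cross-check amounts to a set intersection with an exception list. The sketch below assumes scanner findings have been reduced to (host, CVE) pairs and the KEV catalog to a set of CVE IDs; the IDs shown are placeholders, not real advisories, and the function is not part of any scanner's API.

```python
def kev_exposure(findings, kev_ids, compensated=frozenset()):
    """Return {cve: [hosts]} for KEV-listed findings without a compensating control."""
    exposure = {}
    for host, cve in findings:
        if cve in kev_ids and (host, cve) not in compensated:
            exposure.setdefault(cve, []).append(host)
    return exposure

findings = [
    ("web-01", "CVE-0000-0001"),
    ("web-02", "CVE-0000-0001"),
    ("db-01", "CVE-0000-0002"),   # not on the KEV list
    ("db-01", "CVE-0000-0003"),   # on KEV, but has a compensating control
]
kev_ids = {"CVE-0000-0001", "CVE-0000-0003"}
compensated = {("db-01", "CVE-0000-0003")}
print(kev_exposure(findings, kev_ids, compensated))
```

Everything returned needs either a documented exception or a scheduled hotfix.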
-
Verify SOC 2 evidence collection is current
Confirm Vanta, Drata, or Secureframe is pulling current evidence for monitoring and alerting controls (CC7.x). Stale connectors are the most common audit finding — auditors notice gaps, not green dashboards.
-
Review CloudTrail logs for privileged actions
Walk through every IAM policy change, root-account use, and KMS key change for the period. Each privileged action should map to a known change ticket; unknown actions get an investigation ticket opened with the actor's manager.
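One way to sketch the ticket mapping, assuming CloudTrail events have been exported as dictionaries and a lookup from event ID to change ticket already exists. The field names follow the CloudTrail record layout as commonly documented (`eventSource`, `readOnly`, `userIdentity.type`), but the short event IDs, the ticket lookup, and the choice of which sources count as privileged are assumptions.

```python
PRIVILEGED_SOURCES = {"iam.amazonaws.com", "kms.amazonaws.com"}  # assumed scope

def unexplained_privileged(events, tickets):
    """Return IDs of privileged events that have no change ticket on record."""
    flagged = []
    for ev in events:
        is_root = ev.get("userIdentity", {}).get("type") == "Root"
        is_write = not ev.get("readOnly", False)
        is_iam_or_kms = ev.get("eventSource") in PRIVILEGED_SOURCES
        # Any root activity counts; IAM/KMS events count only when they change state.
        if (is_root or (is_iam_or_kms and is_write)) and ev["eventID"] not in tickets:
            flagged.append(ev["eventID"])
    return flagged

events = [
    {"eventID": "e1", "eventSource": "iam.amazonaws.com", "eventName": "PutRolePolicy",
     "readOnly": False, "userIdentity": {"type": "IAMUser"}},
    {"eventID": "e2", "eventSource": "kms.amazonaws.com", "eventName": "ScheduleKeyDeletion",
     "readOnly": False, "userIdentity": {"type": "AssumedRole"}},
    {"eventID": "e3", "eventSource": "signin.amazonaws.com", "eventName": "ConsoleLogin",
     "userIdentity": {"type": "Root"}},
    {"eventID": "e4", "eventSource": "iam.amazonaws.com", "eventName": "ListRoles",
     "readOnly": True, "userIdentity": {"type": "IAMUser"}},
]
tickets = {"e1": "CHG-1042"}  # hypothetical change-ticket lookup
print(unexplained_privileged(events, tickets))
```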
Cost Management
-
Review FinOps dashboard for cost anomalies
Pull the AWS Cost Anomaly Detection or Azure Cost Management view. Investigate every anomaly above $500 — runaway data egress, an autoscaler stuck open, or a NAT gateway in a chatty VPC are the usual suspects.
-
Validate budget alert thresholds
Confirm the 80%, 100%, and 120%-of-budget alarms route to finance and the engineering owner, not just IT. Adjust forecasts for known seasonal spikes so the 100% alarm is meaningful.
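A small audit of the alarm configuration might look like the sketch below. It assumes the alarms have been exported as dictionaries with a threshold percentage and a list of recipient roles; the role names and data shape are hypothetical, not an AWS Budgets or Azure schema.

```python
REQUIRED_THRESHOLDS = {80, 100, 120}
REQUIRED_ROLES = {"finance", "engineering-owner"}  # assumed role labels

def audit_budget_alarms(alarms):
    """Return a list of problems found in one budget's alarm configuration."""
    problems = []
    by_threshold = {a["threshold_pct"]: set(a["recipients"]) for a in alarms}
    for pct in sorted(REQUIRED_THRESHOLDS - set(by_threshold)):
        problems.append(f"missing {pct}% alarm")
    for pct, recipients in sorted(by_threshold.items()):
        for role in sorted(REQUIRED_ROLES - recipients):
            problems.append(f"{pct}% alarm does not notify {role}")
    return problems

alarms = [
    {"threshold_pct": 80, "recipients": ["it-ops"]},
    {"threshold_pct": 100, "recipients": ["finance", "engineering-owner", "it-ops"]},
]
for problem in audit_budget_alarms(alarms):
    print(problem)
```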
-
Identify underutilized EC2 and RDS instances
Pull rightsizing recommendations from AWS Compute Optimizer or Azure Advisor. Anything under 20% average CPU and 40% memory for 14+ days is a rightsizing or termination candidate.
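The utilization rule can be sketched as follows, assuming daily average CPU and memory percentages per instance have been exported. Instance names and values are invented; the function is illustrative and not a Compute Optimizer or Azure Advisor call.

```python
def rightsizing_candidates(daily_stats, cpu_max=20.0, mem_max=40.0, min_days=14):
    """Return instances averaging under both limits across at least min_days of data."""
    candidates = []
    for instance, days in daily_stats.items():
        if len(days) < min_days:
            continue  # not enough history to judge
        avg_cpu = sum(cpu for cpu, _ in days) / len(days)
        avg_mem = sum(mem for _, mem in days) / len(days)
        if avg_cpu < cpu_max and avg_mem < mem_max:
            candidates.append(instance)
    return sorted(candidates)

# Each entry is a list of (avg CPU %, avg memory %) tuples, one per day.
daily_stats = {
    "i-idle": [(5.0, 22.0)] * 14,
    "i-busy": [(65.0, 70.0)] * 14,
    "i-new": [(3.0, 10.0)] * 3,   # too little history
}
print(rightsizing_candidates(daily_stats))
```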
-
Schedule a rightsizing change request
For larger savings opportunities, file a CR with the service owner and CAB. Include the workload's prod / non-prod tier, the proposed instance family change, and the rollback plan in case throughput drops after resize.
-
Reconcile monthly billing against forecast
Compare the closed month against the FinOps forecast at the account / subscription level. Variance over 10% gets a written note in the QBR deck — finance hates surprises more than overruns.
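The variance check is a straightforward percentage comparison. This sketch assumes forecast and actual spend are available per account or subscription; the account names and amounts are illustrative.

```python
def forecast_variances(forecast, actual, tolerance=0.10):
    """Return {account: signed variance} where the gap exceeds tolerance of forecast."""
    notes = {}
    for account, planned in forecast.items():
        if planned <= 0:
            continue  # cannot compute a ratio against a zero forecast
        variance = (actual.get(account, 0.0) - planned) / planned
        if abs(variance) > tolerance:
            notes[account] = round(variance, 3)
    return notes

forecast = {"prod": 50_000.0, "staging": 8_000.0, "sandbox": 2_000.0}
actual = {"prod": 52_000.0, "staging": 9_400.0, "sandbox": 1_500.0}
print(forecast_variances(forecast, actual))
```

Underspend is reported as a negative variance, since a large shortfall against forecast is also worth a note.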
Backup and Recovery
-
Verify backup job success in Veeam and AWS Backup
Walk every protected workload's backup job log for the past 30 days. A green dashboard hides skipped objects and zero-byte snapshots; open the job detail view and confirm bytes-written matches expected volume size.
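A sketch of what "looking past the green dashboard" could mean in code, assuming job details have been exported as dictionaries. The field names are hypothetical rather than Veeam or AWS Backup fields, and the 50% size ratio is an arbitrary assumption: compression, deduplication, and incremental jobs all make written bytes legitimately smaller than the volume.

```python
def suspicious_jobs(jobs, min_ratio=0.5):
    """Return (job_id, reason) pairs for 'successful' jobs that look hollow."""
    findings = []
    for job in jobs:
        if job["status"] != "success":
            continue  # failed jobs already show up on the dashboard
        if job["bytes_written"] == 0:
            findings.append((job["id"], "zero-byte snapshot"))
        elif job["skipped_objects"] > 0:
            findings.append((job["id"], "skipped objects"))
        elif job["bytes_written"] < min_ratio * job["expected_bytes"]:
            findings.append((job["id"], "smaller than expected"))
    return findings

GB = 1024 ** 3
jobs = [
    {"id": "a", "status": "success", "bytes_written": 0, "expected_bytes": 100 * GB, "skipped_objects": 0},
    {"id": "b", "status": "success", "bytes_written": 80 * GB, "expected_bytes": 100 * GB, "skipped_objects": 0},
    {"id": "c", "status": "success", "bytes_written": 90 * GB, "expected_bytes": 100 * GB, "skipped_objects": 3},
    {"id": "d", "status": "success", "bytes_written": 10 * GB, "expected_bytes": 100 * GB, "skipped_objects": 0},
]
print(suspicious_jobs(jobs))
```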
-
Confirm immutable and air-gapped copies exist
Verify S3 Object Lock or Azure immutable Blob Storage is enabled for at least one copy of each Tier 1 dataset. If backup storage can be deleted with the same credentials that touch production, it isn't ransomware-resilient. The 3-2-1 rule, with at least one immutable copy, is non-negotiable.
-
Run a restore drill into an isolated VPC
Pick one Tier 1 dataset and restore it into a sandboxed VPC. Validate row counts, application boot, and credential availability. The classic failure mode: the restore depends on a KMS key or service-account password that has rotated since the last drill.
-
File a remediation ticket for the failed restore
Open a P1 with backup engineering, capture the failure mode (credential, format, network, or capacity), and schedule a re-drill within 7 days. A failed restore is a real outage waiting to happen — treat it as a SEV-2 incident, not a paperwork item.
-
Update RPO and RTO targets per service tier
Reconfirm the documented RPO/RTO with each service owner. New services launched this period need targets assigned; deprecated services come off the protection list so backup spend stays aligned with what actually matters.