Cloud Monitoring Checklist

A monthly review, run by the sysadmin or cloud operations lead, to verify that infrastructure, application, security, cost, and backup monitoring all work as designed across AWS, Azure, and SaaS estates.

Infrastructure Monitoring

Audit CloudWatch and Datadog agent coverage
- Reconcile the asset inventory against agents reporting in the last 24 hours. Common drift sources: new EC2 instances launched outside the golden AMI, autoscaling groups missing the agent in user-data, and RDS / managed services that need agentless integrations rather than host agents.
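The reconciliation above is essentially a set difference between the inventory and the hosts that checked in recently. A minimal sketch, assuming you have already exported inventory IDs and a mapping of host ID to last agent check-in time (the function name and input shapes are hypothetical, not part of any vendor API):

```python
from datetime import datetime, timedelta, timezone


def find_unmonitored(inventory_ids, agent_last_seen, now=None, max_age_hours=24):
    """Return inventory IDs whose agent has not reported within the window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)
    # Hosts with a check-in at or after the cutoff count as covered.
    reporting = {host for host, seen in agent_last_seen.items() if seen >= cutoff}
    # Everything in the inventory that is not reporting is a coverage gap.
    return sorted(set(inventory_ids) - reporting)
```

Managed services that use agentless integrations would need to be excluded from `inventory_ids` or tracked in a separate list, since they never appear as host agents.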
Tune CPU, memory, and disk thresholds
- Review last month's alert-to-incident ratio per host class. Static 80% CPU thresholds on burstable t-series instances are a classic noisy-alert source; switch to anomaly-based or P95-over-window alerts for those workloads.
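To illustrate why a P95-over-window rule is quieter than a static threshold, here is a small sketch. It assumes CPU samples for one window are available as a list of percentages; the nearest-rank percentile and the 80% default are illustrative choices, not a vendor's alerting implementation.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_alert(cpu_samples, threshold=80.0):
    """Alert only when the window's P95 exceeds the threshold."""
    return p95(cpu_samples) > threshold
```

With 60 one-minute samples, three short bursts to 95% leave the P95 at the idle level and raise no alert, whereas a static per-sample threshold would have fired three times.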
Review network latency and packet loss
- Check VPN and SD-WAN tunnels, transit gateway flows, and inter-region replication links. Compare latency P50/P95 against last month's baseline; flag links that drift more than 20%.
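The baseline comparison can be sketched as follows. The input shape (a dict of link name to `p50`/`p95` latency in milliseconds) is an assumption, and this version only flags increases; use the absolute value if unexpected drops also matter to you.

```python
def drifted_links(baseline, current, tolerance=0.20):
    """Return (link, percentile, relative_increase) for links over tolerance."""
    flagged = []
    for link, base in baseline.items():
        latest = current.get(link)
        if latest is None:
            continue  # link has no data this month; review it separately
        for pct in ("p50", "p95"):
            if base[pct] <= 0:
                continue  # no usable baseline to compare against
            increase = (latest[pct] - base[pct]) / base[pct]
            if increase > tolerance:
                flagged.append((link, pct, round(increase, 3)))
    return flagged
```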
Verify uptime SLOs across critical services
- Pull the synthetic / Pingdom / StatusCake results for each Tier 1 service. If a service has already consumed more than 50% of its monthly error budget, schedule a reliability review with the service owner before month-end.
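As a worked example of the error-budget arithmetic: a 99.9% SLO over a 30-day month (43,200 minutes) allows 43.2 minutes of failure, so 25 failed minutes is roughly 58% of the budget. A minimal sketch, assuming failures are measured in minutes of failed synthetic checks (the function names are hypothetical):

```python
def error_budget_consumed(slo_target, window_minutes, bad_minutes):
    """Fraction of the window's error budget used so far (1.0 = fully spent)."""
    budget_minutes = (1.0 - slo_target) * window_minutes
    if budget_minutes <= 0:
        raise ValueError("SLO target must be below 1.0 for a non-zero budget")
    return bad_minutes / budget_minutes


def needs_reliability_review(slo_target, window_minutes, bad_minutes, limit=0.5):
    """True when more than `limit` of the error budget is already spent."""
    return error_budget_consumed(slo_target, window_minutes, bad_minutes) > limit
```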
Test alert routing to PagerDuty
- Fire a synthetic test alert into each routing key and confirm it pages the correct on-call. The most common silent failure is a service that was renamed, leaving the integration key orphaned with no escalation policy attached.
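A routing test can be scripted by building one trigger event per routing key. The sketch below only constructs the payload in the PagerDuty Events API v2 format, which is normally posted to `https://events.pagerduty.com/v2/enqueue`; the function name, the `source` value, and the summary wording are assumptions, and sending the request is left to whatever HTTP client you already use.

```python
import json


def build_test_event(routing_key, service_name, severity="critical"):
    """Build a PagerDuty Events API v2 trigger payload for a routing test."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # A stable dedup_key lets a follow-up "resolve" event close this alert.
        "dedup_key": f"monthly-routing-test-{service_name}",
        "payload": {
            "summary": f"[TEST] Monthly routing check for {service_name}",
            "source": "monitoring-checklist",  # hypothetical source label
            "severity": severity,
        },
    }


def to_request_body(event):
    """Serialize the event as the JSON body for the enqueue request."""
    return json.dumps(event)
```

Warn the on-call before firing at `critical` severity; a lower severity may be routed as low urgency and never page, which would defeat the test.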
Repair broken alert routing keys
- For each failed routing test, rebind the integration key to the live escalation policy and re-test. File a ticket back to the service owner so the runbook reflects the current routing.

Application Performance

Review APM response times and throughput
- Pull New Relic or Datadog APM trends for the top 10 transactions by traffic. Watch for slow regressions — a 50ms creep over four weeks rarely pages but degrades user experience and is harder to diagnose later.
Triage open Sentry error groups
- Group by service and frequency. Anything firing more than 100 times per day with no assigned owner gets escalated to the service team. Auto-resolved errors with recurring spikes are usually a release-regression signal worth investigating.
Map application dependency health
- Review the service map for each Tier 1 app: third-party APIs, internal microservices, databases, caches. Document any dependency added or removed since last review so on-call runbooks reflect the current topology.
Audit Core Web Vitals on customer-facing apps
- Pull LCP, INP, and CLS from real-user monitoring for the past 28 days. Flag pages drifting toward the 'Needs Improvement' band; SEO impact lags behind the metric drift by weeks.
Flag slow database queries for optimization
- Review RDS Performance Insights or query store for top-N slow queries. File optimization tickets for any query whose total time-spent doubled month-over-month — often a missing index after a schema change.
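The month-over-month check can be sketched as below, assuming you have exported total time spent per query (in milliseconds) for the two months, keyed by a query ID or digest. The input shape and function name are hypothetical.

```python
def doubled_queries(previous, current, factor=2.0):
    """Return (query_id, before_ms, now_ms) for queries that grew by `factor`."""
    flagged = []
    for query_id, total_ms in current.items():
        before = previous.get(query_id)
        # Skip new queries and zero baselines; they have no ratio to compare.
        if before and total_ms >= factor * before:
            flagged.append((query_id, before, total_ms))
    # Largest absolute increase first, so the worst offenders get tickets first.
    return sorted(flagged, key=lambda row: row[2] - row[1], reverse=True)
```

Queries that are new this month are skipped here and are worth a separate pass, since a new slow query has no baseline to double.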

Security and Compliance

Review SIEM alerts for anomalous logins
- Look at impossible-travel, MFA-fatigue patterns, and legacy basic-auth attempts in Sentinel or Splunk. The classic miss: MFA enabled but legacy IMAP/POP/SMTP not blocked at the conditional access layer.
Investigate unauthorized access attempts
- Walk every CRITICAL and HIGH from the SIEM since last review. For each: confirm the attacker source, blast radius, and whether any account or token requires rotation.
Activate the incident response playbook
- Open the IR ticket, page the security on-call, isolate the affected identity or host, and start the chain-of-custody log. Notify the vCISO or DPO if PHI, PCI, or EU personal data may be in scope — breach-notification clocks start now, not after triage finishes.
Reconcile patching against CVE advisories
- Cross-check Tenable / Qualys results against the current CISA KEV catalog. Anything on KEV with no compensating control gets a documented exception or a hotfix scheduled before next month's review.
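The cross-check reduces to an intersection between scanner findings and the KEV list. This sketch assumes the KEV catalog has been downloaded and parsed from CISA's JSON feed, where entries sit under `vulnerabilities` with a `cveID` field; the shape of the scanner findings (`cve`, `asset`, `compensating_control`) is hypothetical and will differ per Tenable or Qualys export.

```python
def kev_exposures(kev_catalog, findings):
    """Return findings whose CVE is on KEV and that list no compensating control."""
    kev_ids = {entry["cveID"] for entry in kev_catalog.get("vulnerabilities", [])}
    return [
        finding
        for finding in findings
        if finding["cve"] in kev_ids and not finding.get("compensating_control")
    ]
```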
Verify SOC 2 evidence collection is current
- Confirm Vanta, Drata, or Secureframe is pulling current evidence for monitoring and alerting controls (CC7.x). Stale connectors are the most common audit finding — auditors notice gaps, not green dashboards.
Review CloudTrail logs for privileged actions
- Walk through every IAM policy change, root-account use, and KMS key change for the period. Each privileged action should map to a known change ticket; unknown actions get an investigation ticket opened with the actor's manager.
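One way to shortlist the events to walk through is to filter exported CloudTrail records by identity type and event name, then subtract anything already linked to a change ticket. The field names (`userIdentity.type`, `eventSource`, `eventName`, `eventID`) follow the CloudTrail record format; the watch lists are deliberately non-exhaustive, and `ticketed_event_ids` is a hypothetical set you would build from your change-management system.

```python
# Assumed, non-exhaustive watch lists; extend them to match your own policy.
IAM_POLICY_EVENTS = {
    "PutRolePolicy", "AttachRolePolicy", "DetachRolePolicy", "DeleteRolePolicy",
    "PutUserPolicy", "AttachUserPolicy", "CreatePolicyVersion",
}
KMS_KEY_EVENTS = {"DisableKey", "ScheduleKeyDeletion", "PutKeyPolicy", "CreateGrant"}


def is_privileged(record):
    """True for root-account use, IAM policy changes, or KMS key changes."""
    if record.get("userIdentity", {}).get("type") == "Root":
        return True
    source, name = record.get("eventSource"), record.get("eventName")
    return (source == "iam.amazonaws.com" and name in IAM_POLICY_EVENTS) or (
        source == "kms.amazonaws.com" and name in KMS_KEY_EVENTS
    )


def unmatched_actions(records, ticketed_event_ids):
    """Privileged records with no change ticket; each needs an investigation."""
    return [
        r for r in records
        if is_privileged(r) and r.get("eventID") not in ticketed_event_ids
    ]
```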

Cost Management

Review FinOps dashboard for cost anomalies
- Pull the AWS Cost Anomaly Detection or Azure Cost Management view. Investigate every anomaly above $500 — runaway data egress, an autoscaler stuck open, or a NAT gateway in a chatty VPC are the usual suspects.
Validate budget alert thresholds
- Confirm the 80%, 100%, and 120%-of-budget alarms route to finance and the engineering owner, not just IT. Adjust forecasts for known seasonal spikes so the 100% alarm is meaningful.
Identify underutilized EC2 and RDS instances
- Pull rightsizing recommendations from AWS Compute Optimizer or Azure Advisor. Anything under 20% average CPU and 40% memory for 14+ days is a rightsizing or termination candidate.
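The utilization rule above can be applied to exported metrics with a short filter. This sketch assumes daily average CPU and memory percentages per instance, ordered oldest to newest; the input shape is hypothetical, and Compute Optimizer or Azure Advisor apply their own, more detailed logic.

```python
def rightsizing_candidates(daily_metrics, cpu_max=20.0, mem_max=40.0, min_days=14):
    """daily_metrics maps instance_id -> [(cpu_avg_pct, mem_avg_pct), ...]."""
    candidates = []
    for instance_id, days in daily_metrics.items():
        if len(days) < min_days:
            continue  # not enough history to judge the instance
        recent = days[-min_days:]
        avg_cpu = sum(cpu for cpu, _ in recent) / len(recent)
        avg_mem = sum(mem for _, mem in recent) / len(recent)
        # Both averages must sit under their limits to qualify.
        if avg_cpu < cpu_max and avg_mem < mem_max:
            candidates.append(instance_id)
    return sorted(candidates)
```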
Schedule a rightsizing change request
- For larger savings opportunities, file a CR with the service owner and CAB. Include the workload's prod / non-prod tier, the proposed instance family change, and the rollback plan in case throughput drops after resize.
Reconcile monthly billing against forecast
- Compare the closed month against the FinOps forecast at the account / subscription level. Variance over 10% gets a written note in the QBR deck — finance hates surprises more than overruns.
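The variance check is simple arithmetic: (actual - forecast) / forecast per account, flagged when the absolute value exceeds 10%. A minimal sketch, assuming forecast and actual spend are available as dicts keyed by account or subscription name (hypothetical shapes):

```python
def forecast_variances(forecast, actual, tolerance=0.10):
    """Return (account, variance_pct) for accounts outside the tolerance."""
    notes = []
    for account, planned in forecast.items():
        if planned <= 0:
            continue  # no meaningful forecast to compare against
        spent = actual.get(account, 0.0)
        variance = (spent - planned) / planned
        # Flag overruns and underruns alike; both need a written note.
        if abs(variance) > tolerance:
            notes.append((account, round(variance * 100, 1)))
    return notes
```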

Backup and Recovery

Verify backup job success in Veeam and AWS Backup
- Walk every protected workload's backup job log for the past 30 days. A green dashboard hides skipped objects and zero-byte snapshots; open the job detail view and confirm bytes-written matches expected volume size.
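The job-detail check can be partly automated once job logs are exported. The sketch below assumes each job is a dict with `workload`, `status`, `bytes_written`, `expected_bytes`, and an optional `skipped_objects` count; these field names are hypothetical, not the Veeam or AWS Backup schema. The `min_ratio` cutoff is also an assumption, because incremental, compressed, or deduplicated backups legitimately write less than the full volume size.

```python
def suspicious_backup_jobs(jobs, min_ratio=0.5):
    """Return (workload, reasons) for jobs that look green but may be empty."""
    flagged = []
    for job in jobs:
        reasons = []
        if job["status"] != "success":
            reasons.append("job did not succeed")
        if job["bytes_written"] == 0:
            reasons.append("zero-byte snapshot")
        elif job["bytes_written"] < min_ratio * job["expected_bytes"]:
            reasons.append("bytes written far below expected size")
        if job.get("skipped_objects", 0) > 0:
            reasons.append("skipped objects")
        if reasons:
            flagged.append((job["workload"], reasons))
    return flagged
```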
Confirm immutable and air-gapped copies exist
- Verify S3 Object Lock or Azure Immutable Blob Storage is on for at least one copy of each Tier 1 dataset. If backup storage can be deleted with the same credentials that touch production, it isn't ransomware-resilient. The 3-2-1 rule, with at least one immutable copy, is non-negotiable.
Run a restore drill into an isolated VPC
- Pick one Tier 1 dataset and restore into a sandboxed VPC. Validate row counts, application boot, and credential availability. The classic failure mode: backup script depends on a KMS key or service-account password that rotated since the last drill.
File a remediation ticket for the failed restore
- Open a P1 with backup engineering, capture the failure mode (credential, format, network, or capacity), and schedule a re-drill within 7 days. A failed restore is a real outage waiting to happen — treat it as a SEV-2 incident, not a paperwork item.
Update RPO and RTO targets per service tier
- Reconfirm the documented RPO/RTO with each service owner. New services launched this period need targets assigned; deprecated services come off the protection list so backup spend stays aligned with what actually matters.