Cloud Monitoring Checklist

Infrastructure Monitoring

    Reconcile the asset inventory against agents reporting in the last 24 hours. Common drift sources: new EC2 instances launched outside the golden AMI, autoscaling groups missing the agent in user-data, and RDS / managed services that need agentless integrations rather than host agents.
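
    The reconciliation above is essentially a set difference. A minimal sketch, assuming the inventory is a list of host IDs and the monitoring platform can export a last-check-in timestamp per host (the function name, field shapes, and sample IDs are hypothetical):

```python
from datetime import datetime, timedelta, timezone

def find_unmonitored(inventory_ids, agent_last_seen, now, window=timedelta(hours=24)):
    """Return inventory IDs with no agent check-in inside the window, sorted."""
    cutoff = now - window
    reporting = {host for host, seen in agent_last_seen.items() if seen >= cutoff}
    return sorted(set(inventory_ids) - reporting)

# Hypothetical sample data for illustration only.
now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
inventory = ["i-aaa", "i-bbb", "i-ccc"]
last_seen = {
    "i-aaa": now - timedelta(hours=2),   # checked in recently
    "i-bbb": now - timedelta(hours=30),  # agent went quiet
    # "i-ccc" never reported at all
}
print(find_unmonitored(inventory, last_seen, now))
```

    Hosts that never reported and hosts that went quiet both land in the same list, which is usually what the review needs.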

    Review last month's alert-to-incident ratio per host class. Static 80% CPU thresholds on burstable t-series instances are a classic noisy-alert source; switch to anomaly-based or P95-over-window alerts for those workloads.
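
    To illustrate why a P95-over-window rule is quieter than a static threshold on bursty hosts, here is a small sketch. It assumes CPU samples arrive as a plain list of percentages for one evaluation window and uses a nearest-rank percentile; real monitoring platforms compute percentiles with their own methods.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]

def should_alert(cpu_samples, threshold=80.0):
    """Alert only when the window's P95 exceeds the threshold."""
    return p95(cpu_samples) > threshold

# Hypothetical windows of 20 samples each.
burst = [30.0] * 19 + [100.0]          # one short spike
sustained = [30.0] * 5 + [90.0] * 15   # load stays high
print(should_alert(burst))
print(should_alert(sustained))
```

    A static rule would fire on the single spike in the first window; the P95 rule fires only on the second.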

    Check VPN and SD-WAN tunnels, transit gateway flows, and inter-region replication links. Compare latency P50/P95 against last month's baseline; flag links that drift more than 20%.
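
    The baseline comparison can be sketched as a relative-change check. The link names and values below are hypothetical, and treating drift in either direction as worth flagging is an assumption; the same function can be run once for P50 and once for P95.

```python
def drifted_links(baseline_ms, current_ms, tolerance=0.20):
    """Return {link: relative_drift} for links whose latency moved more than tolerance."""
    flagged = {}
    for link, base in baseline_ms.items():
        cur = current_ms.get(link)
        if cur is None or base <= 0:
            continue  # no current data or unusable baseline
        drift = (cur - base) / base
        if abs(drift) > tolerance:
            flagged[link] = round(drift, 3)
    return flagged

# Hypothetical P95 latencies in milliseconds.
baseline = {"vpn-east": 40.0, "tgw-core": 10.0}
current = {"vpn-east": 50.0, "tgw-core": 11.0}
print(drifted_links(baseline, current))
```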

    Pull the synthetic / Pingdom / StatusCake results for each Tier 1 service. If burn-rate against the monthly error budget exceeds 50%, schedule a reliability review with the service owner before month-end.
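
    As a worked example of the error-budget arithmetic, the sketch below reads the 50% trigger as "more than half of the monthly budget consumed", which is an assumption about the intended meaning. The SLO target and check counts are hypothetical.

```python
def budget_consumed(slo_target, total_checks, failed_checks):
    """Fraction of the error budget used: failures / allowed failures."""
    allowed_failures = (1.0 - slo_target) * total_checks
    if allowed_failures <= 0:
        raise ValueError("SLO target leaves no error budget")
    return failed_checks / allowed_failures

# Hypothetical month: one synthetic check per minute for 30 days, 99.9% SLO.
total = 30 * 24 * 60          # 43,200 checks
consumed = budget_consumed(0.999, total, failed_checks=30)
needs_review = consumed > 0.50
print(f"{consumed:.0%} of budget used, review needed: {needs_review}")
```

    With a 99.9% target the month allows roughly 43 failed checks, so 30 failures already uses about two thirds of the budget.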

    Fire a synthetic test alert into each routing key and confirm it pages the correct on-call. The most common silent failure is a service that was renamed, leaving the integration key orphaned with no escalation policy attached.

    For each failed routing test, rebind the integration key to the live escalation policy and re-test. File a ticket back to the service owner so the runbook reflects the current routing.

Application Performance

    Pull New Relic or Datadog APM trends for the top 10 transactions by traffic. Watch for slow regressions — a 50ms creep over four weeks rarely pages but degrades user experience and is harder to diagnose later.
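
    A slow creep is easier to spot as a trend slope than as week-over-week deltas. A minimal sketch, assuming one latency figure per week per transaction exported from the APM tool (the numbers are hypothetical), using an ordinary least-squares slope:

```python
def weekly_slope_ms(latencies_ms):
    """Least-squares slope in ms per week over evenly spaced weekly samples."""
    n = len(latencies_ms)
    if n < 2:
        raise ValueError("need at least two weekly samples")
    mean_x = (n - 1) / 2
    mean_y = sum(latencies_ms) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(latencies_ms))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

# Hypothetical weekly latencies for one transaction, oldest first.
weekly = [200.0, 215.0, 232.0, 251.0]
slope = weekly_slope_ms(weekly)
print(f"{slope:.1f} ms/week")
```

    No single week here jumps far enough to page, but the fitted trend adds up to roughly 50ms across the period.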

    Group application errors by service and frequency. Anything firing more than 100 times per day with no assigned owner gets escalated to the service team. Auto-resolved errors with recurring spikes are usually a release-regression signal worth investigating.
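
    The grouping rule can be sketched as a counter over exported error events. The event shape, the owner mapping, and the sample names are all hypothetical; an error tracker's own export format will differ.

```python
from collections import Counter

def escalation_list(error_events, owners, days, daily_limit=100):
    """Return (service, signature, per_day) for unowned errors above the daily limit.

    error_events: iterable of (service, error_signature) tuples covering `days` days.
    owners: dict mapping (service, error_signature) to an owner name.
    """
    counts = Counter(error_events)
    flagged = []
    for (service, signature), total in counts.items():
        per_day = total / days
        if per_day > daily_limit and (service, signature) not in owners:
            flagged.append((service, signature, round(per_day, 1)))
    return sorted(flagged)

# Hypothetical 10-day export.
events = (
    [("checkout", "TimeoutError")] * 1500
    + [("search", "KeyError")] * 1200
    + [("checkout", "ValueError")] * 50
)
owners = {("search", "KeyError"): "team-search"}
print(escalation_list(events, owners, days=10))
```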

    Review the service map for each Tier 1 app: third-party APIs, internal microservices, databases, caches. Document any dependency added or removed since last review so on-call runbooks reflect the current topology.

    Pull LCP, INP, and CLS from real-user monitoring for the past 28 days. Flag pages drifting toward the 'Needs Improvement' band; SEO impact lags behind the metric drift by weeks.
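
    One way to make "drifting toward Needs Improvement" concrete is to flag pages whose 75th-percentile value sits within a margin of the published "Good" limits (LCP 2.5 s, INP 200 ms, CLS 0.1). The 10% margin and the metric key names below are assumptions for illustration.

```python
# Upper bounds of the "Good" band for each Core Web Vital.
GOOD_LIMITS = {"lcp_ms": 2500, "inp_ms": 200, "cls": 0.1}

def band_status(metric, p75, margin=0.10):
    """Classify a page's p75 value as good, drifting, or already past the Good band."""
    limit = GOOD_LIMITS[metric]
    if p75 > limit:
        return "needs-improvement-or-worse"
    if p75 > limit * (1 - margin):
        return "drifting"
    return "good"

# Hypothetical 28-day p75 values for one page.
print(band_status("lcp_ms", 2400))
print(band_status("inp_ms", 150))
print(band_status("cls", 0.12))
```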

    Review RDS Performance Insights or query store for top-N slow queries. File optimization tickets for any query whose total time-spent doubled month-over-month — often a missing index after a schema change.
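
    The month-over-month comparison can be sketched as follows, assuming total time per query fingerprint has been exported for both months (the query IDs and totals are hypothetical). Queries with no prior-month baseline are skipped here and would need a separate look.

```python
def doubled_queries(prev_total_ms, curr_total_ms, factor=2.0):
    """Return query IDs whose total time grew by at least `factor` month-over-month."""
    return sorted(
        query
        for query, cur in curr_total_ms.items()
        if prev_total_ms.get(query, 0) > 0 and cur >= factor * prev_total_ms[query]
    )

# Hypothetical total time spent per query, in milliseconds.
previous = {"q1": 1000.0, "q2": 4000.0}
current_month = {"q1": 2600.0, "q2": 4100.0, "q3": 900.0}
print(doubled_queries(previous, current_month))
```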

Security and Compliance

    Look at impossible-travel, MFA-fatigue patterns, and legacy basic-auth attempts in Sentinel or Splunk. The classic miss: MFA enabled but legacy IMAP/POP/SMTP not blocked at the conditional access layer.

    Walk every CRITICAL and HIGH from the SIEM since last review. For each: confirm the attacker source, blast radius, and whether any account or token requires rotation.

    For any alert confirmed as an incident, open the IR ticket, page the security on-call, isolate the affected identity or host, and start the chain-of-custody log. Notify the vCISO or DPO if PHI, PCI, or EU personal data may be in scope: breach-notification clocks start now, not after triage finishes.

    Cross-check Tenable / Qualys results against the current CISA KEV catalog. Anything on KEV with no compensating control gets a documented exception or a hotfix scheduled before next month's review.
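
    Once scanner findings and the KEV list are both reduced to CVE IDs, the cross-check is a filtered intersection. A minimal sketch, assuming findings are exported as (host, CVE) pairs and compensating controls are tracked per pair; the placeholder CVE IDs and host names are not real entries.

```python
def kev_exposures(findings, kev_ids, compensated):
    """Return (host, cve) pairs that are on KEV and have no compensating control.

    findings: iterable of (host, cve_id) from the scanner export.
    kev_ids: set of CVE IDs taken from the KEV catalog.
    compensated: set of (host, cve_id) pairs with a documented control.
    """
    return sorted(
        (host, cve)
        for host, cve in findings
        if cve in kev_ids and (host, cve) not in compensated
    )

# Placeholder data for illustration only.
findings = [
    ("web-1", "CVE-0000-0001"),
    ("web-2", "CVE-0000-0001"),
    ("db-1", "CVE-0000-0002"),
]
kev = {"CVE-0000-0001"}
controls = {("web-2", "CVE-0000-0001")}
print(kev_exposures(findings, kev, controls))
```

    Each pair returned needs either a documented exception or a scheduled hotfix.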

    Confirm Vanta, Drata, or Secureframe is pulling current evidence for monitoring and alerting controls (CC7.x). Stale connectors are the most common audit finding — auditors notice gaps, not green dashboards.

    Walk through every IAM policy change, root-account use, and KMS key change for the period. Each privileged action should map to a known change ticket; unknown actions get an investigation ticket opened with the actor's manager.

Cost Management

    Pull the AWS Cost Anomaly Detection or Azure Cost Management view. Investigate every anomaly above $500 — runaway data egress, an autoscaler stuck open, or a NAT gateway in a chatty VPC are the usual suspects.

    Confirm the 80%, 100%, and 120%-of-budget alarms route to finance and the engineering owner, not just IT. Adjust forecasts for known seasonal spikes so the 100% alarm is meaningful.
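
    The effect of a seasonal adjustment on the alarm tiers can be shown with a small calculation. The spend figures and the 1.1 seasonal multiplier are hypothetical; cloud budgeting tools apply thresholds in their own way.

```python
def crossed_thresholds(spend, budget, thresholds=(0.80, 1.00, 1.20)):
    """Return the budget thresholds that current spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]

# Hypothetical month with a known seasonal spike.
base_budget = 10_000
spend = 10_500
print(crossed_thresholds(spend, base_budget))        # unadjusted forecast
print(crossed_thresholds(spend, base_budget * 1.1))  # forecast adjusted for the spike
```

    Against the unadjusted budget the 100% alarm fires on expected seasonal spend; with the adjustment only the 80% tier fires, so a 100% page keeps its meaning.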

    Pull rightsizing recommendations from AWS Compute Optimizer or Azure Advisor. Anything under 20% average CPU and 40% memory for 14+ days is a rightsizing or termination candidate.
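
    The utilization rule can be sketched as a filter over daily averages. This reads the rule as "both averages under their limits across the most recent 14 days", which is one reasonable interpretation; the instance names and metric values are hypothetical.

```python
def rightsizing_candidates(daily_metrics, cpu_limit=20.0, mem_limit=40.0, min_days=14):
    """Return instances whose recent average CPU and memory are both under the limits.

    daily_metrics: dict of instance -> list of (cpu_pct, mem_pct) daily averages, oldest first.
    """
    candidates = []
    for instance, days in daily_metrics.items():
        if len(days) < min_days:
            continue  # not enough history to judge
        recent = days[-min_days:]
        avg_cpu = sum(cpu for cpu, _ in recent) / len(recent)
        avg_mem = sum(mem for _, mem in recent) / len(recent)
        if avg_cpu < cpu_limit and avg_mem < mem_limit:
            candidates.append(instance)
    return sorted(candidates)

# Hypothetical fleet.
fleet = {
    "web-1": [(10.0, 30.0)] * 14,  # idle on both axes
    "db-1": [(15.0, 70.0)] * 14,   # low CPU but memory-bound
    "new-1": [(5.0, 5.0)] * 3,     # too new to evaluate
}
print(rightsizing_candidates(fleet))
```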

    For larger savings opportunities, file a change request (CR) with the service owner and the change advisory board (CAB). Include the workload's prod / non-prod tier, the proposed instance family change, and the rollback plan in case throughput drops after resize.

    Compare the closed month against the FinOps forecast at the account / subscription level. Variance over 10% gets a written note in the QBR deck — finance hates surprises more than overruns.

Backup and Recovery

    Walk every protected workload's backup job log for the past 30 days. A green dashboard hides skipped objects and zero-byte snapshots; open the job detail view and confirm bytes-written matches expected volume size.
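
    The job-detail check can be sketched as below, assuming each job record exposes bytes written, an expected size, and a skipped-object count (all field names are hypothetical). The 50% floor is an assumption to allow for compression, deduplication, and incremental jobs; tune it per workload.

```python
def suspicious_jobs(jobs, min_ratio=0.5):
    """Return names of jobs that skipped objects, wrote nothing, or wrote far too little."""
    flagged = []
    for job in jobs:
        written = job["bytes_written"]
        expected = job["expected_bytes"]
        if job["skipped_objects"] > 0 or written == 0 or written < min_ratio * expected:
            flagged.append(job["name"])
    return flagged

# Hypothetical job records that all show "success" on the dashboard.
jobs = [
    {"name": "db-nightly", "bytes_written": 480_000_000_000,
     "expected_bytes": 500_000_000_000, "skipped_objects": 0},
    {"name": "files-nightly", "bytes_written": 0,
     "expected_bytes": 200_000_000_000, "skipped_objects": 0},
    {"name": "vm-weekly", "bytes_written": 90_000_000_000,
     "expected_bytes": 100_000_000_000, "skipped_objects": 12},
]
print(suspicious_jobs(jobs))
```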

    Verify S3 Object Lock or Azure Immutable Blob Storage is on for at least one copy of each Tier 1 dataset. If backup storage can be deleted with the same credentials that touch production, it isn't ransomware-resilient; the 3-2-1 rule with one immutable copy is non-negotiable.

    Pick one Tier 1 dataset and restore into a sandboxed VPC. Validate row counts, application boot, and credential availability. The classic failure mode: backup script depends on a KMS key or service-account password that rotated since the last drill.

    If the restore drill fails, open a P1 with backup engineering, capture the failure mode (credential, format, network, or capacity), and schedule a re-drill within 7 days. A failed restore is a real outage waiting to happen; treat it as a SEV-2 incident, not a paperwork item.

    Reconfirm the documented RPO/RTO with each service owner. New services launched this period need targets assigned; deprecated services come off the protection list so backup spend stays aligned with what actually matters.