Service Level Agreement (SLA) Checklist
Service Definition
List the named services, APIs, and customer-facing endpoints covered. Reference the Backstage service catalog or internal service registry. Call out out-of-scope items (sandbox, beta endpoints, third-party SaaS pass-throughs) explicitly, because ambiguity about scope is a common trigger for SLA disputes.
The customer's tier drives uptime targets, support response, and credit structure. Enterprise customers typically get higher SLOs (99.95%+) and named technical account managers (TAMs); Standard customers get the public SLA with shared on-call.
Use a RACI-style split: who patches the runtime, who owns the customer's IAM config, and who is responsible for client-side SDK upgrades. Customer-caused incidents (misconfigured webhooks, expired API tokens) should not consume the provider's error budget.
Performance Monitoring and Reporting
SLIs typically pull from the four golden signals — latency, traffic, errors, saturation. Specify the metric source (Datadog, Prometheus, New Relic), the aggregation window (rolling 30-day vs. calendar month), and how planned maintenance is excluded. Vague language like "reasonable uptime" is unenforceable.
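How planned maintenance is excluded is worth pinning down with arithmetic. The sketch below is one possible reading, not a definitive method: maintenance minutes are removed from both the downtime and the measured time. The `Interval` type and the `availability` function are hypothetical names, and the code assumes outages do not overlap each other, maintenance windows do not overlap each other, and all intervals fall inside the measurement window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Half-open interval [start, end) in minutes since the window began."""
    start: int
    end: int


def overlap(a: Interval, b: Interval) -> int:
    """Minutes shared by two intervals (0 if they do not intersect)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def availability(window_minutes: int,
                 outages: list[Interval],
                 maintenance: list[Interval]) -> float:
    """Availability in percent, with planned maintenance excluded.

    Maintenance minutes are subtracted from the measured time, and any
    part of an outage that falls inside maintenance is not counted.
    """
    excluded = sum(m.end - m.start for m in maintenance)
    measured = window_minutes - excluded
    if measured <= 0:
        raise ValueError("maintenance covers the whole measurement window")
    downtime = 0
    for outage in outages:
        raw = outage.end - outage.start
        during_maintenance = sum(overlap(outage, m) for m in maintenance)
        downtime += raw - during_maintenance
    return 100.0 * (measured - downtime) / measured


# 30-day window, one 2-hour maintenance window, one outage that starts
# during maintenance and runs 30 minutes past it.
window = 30 * 24 * 60
pct = availability(window, [Interval(1100, 1150)], [Interval(1000, 1120)])
print(f"{pct:.3f}%")
```

Whichever convention the SLA adopts, stating it this precisely lets both parties recompute the same number from the same incident log.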
Wire Statuspage or Better Stack to the relevant Datadog monitors so component status reflects reality without manual updates. Define posting cadence for incidents — initial post within 15 minutes of SEV1 detection, updates every 30 minutes until resolution.
Specify the report contents: SLO attainment, an incident summary with root causes, change-management activity, and any service credits owed. Also specify the delivery channel (PDF to a designated contact, customer portal, or shared dashboard) and the delivery deadline (e.g., by the 10th of the following month).
Service Management and Escalation
Identify the named service owner (typically the engineering manager) and the PagerDuty rotation backing first response. Avoid single points of failure — every primary needs a documented secondary so vacations don't break the response chain.
Give concrete examples per severity, not just "high impact." SEV1 = customer-facing outage or data loss: page within 5 minutes, incident commander (IC) assigned within 15. SEV2 = degraded performance: business-hours response, 1-hour acknowledgement. SEV3 = single-customer issues, routed via support tier 2.
Specify customer-notification windows for breaking changes (typically 90 days for API deprecations), how planned maintenance is announced, and the freeze windows during which the customer can request no deploys (peak retail, fiscal close). Reference SOC 2 CC8.1 if the customer is audit-driven.
Security and Compliance
The applicable regulatory regime (for example HIPAA, GDPR, PCI, or none) determines which compliance addenda the SLA needs. Even "None" should be confirmed in writing, because customers sometimes assume a BAA is in place when one was never executed.
Specify encryption at rest (KMS, with customer-managed keys for the Enterprise tier) and in transit (TLS 1.2+). Restrict access to production data to break-glass only, with an audit trail in CloudTrail or an equivalent. State the retention period for audit logs (typically 1 year minimum, 7 years for SOX-relevant logs).
Attach the BAA (HIPAA), DPA with SCCs (GDPR), or PCI responsibility matrix as appropriate. Breach notification timelines vary: GDPR allows 72 hours to notify the supervisory authority, HIPAA allows up to 60 days to notify affected individuals, and customer contracts often require notice within 24 hours of confirmation.
Pricing and Service Credits
Document the metered units (API calls, seats, GB stored), the included quota at each tier, and overage pricing. Specify how spikes are handled — hard cutoff with 429s, soft cap with overage billing, or burst allowance with monthly true-up.
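The three spike-handling options differ only in what happens once the included quota is used up, which a small decision function can make explicit. This is a sketch under stated assumptions: `SpikePolicy`, `decide`, the outcome strings, and the 20% burst headroom are all hypothetical and would be replaced by whatever the contract actually specifies.

```python
from enum import Enum


class SpikePolicy(Enum):
    HARD_CUTOFF = "hard_cutoff"   # reject with HTTP 429 once quota is used
    SOFT_CAP = "soft_cap"         # keep serving, bill each extra unit
    BURST = "burst"               # serve within headroom, settle at true-up


def decide(used: int, quota: int, policy: SpikePolicy,
           burst_percent: int = 20) -> str:
    """Outcome for the next metered unit, given units already used.

    burst_percent is an assumed headroom above quota for the BURST
    policy; it is not a value taken from the checklist.
    """
    if used < quota:
        return "serve"
    if policy is SpikePolicy.HARD_CUTOFF:
        return "reject_429"
    if policy is SpikePolicy.SOFT_CAP:
        return "serve_and_bill_overage"
    burst_limit = quota + quota * burst_percent // 100
    if used < burst_limit:
        return "serve_and_true_up"
    return "reject_429"


for policy in SpikePolicy:
    print(policy.value, decide(used=110, quota=100, policy=policy))
```

Writing the policy as a total function over (usage, quota, policy) is a quick way to check that the SLA text leaves no case undefined, such as what happens after the burst headroom is also exhausted.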
Tiered credits are standard: e.g., a 10% credit for monthly uptime of at least 99.0% but below 99.9%, 25% for at least 95.0% but below 99.0%, and 50% below 95.0%. Specify the claim mechanism (customer must request within 30 days), the cap (typically one month's fees), and that credits are the sole and exclusive remedy.
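The example tiers can be written as a lookup so the boundary cases are unambiguous. The sketch below assumes a 99.9% target (no credit at or above it) and that an uptime exactly on a boundary falls into the smaller credit tier; the function names are hypothetical, and the cap is applied as one month's fees.

```python
def credit_percent(uptime_percent: float) -> int:
    """Credit tier for a month's measured uptime (example tiers only)."""
    if uptime_percent >= 99.9:
        return 0    # target met, no credit
    if uptime_percent >= 99.0:
        return 10
    if uptime_percent >= 95.0:
        return 25
    return 50


def service_credit(monthly_fee: float, uptime_percent: float) -> float:
    """Credit amount for the month, capped at one month's fees."""
    credit = monthly_fee * credit_percent(uptime_percent) / 100
    return min(credit, monthly_fee)


for uptime in (99.95, 99.5, 99.0, 97.0, 94.9):
    print(uptime, credit_percent(uptime), service_credit(1000.0, uptime))
```

With a single monthly credit of at most 50%, the cap only matters if other credits accumulate in the same period, so the SLA should say whether the cap is per month or cumulative.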
Net-30 is the default; Net-60 for enterprise procurement is common. Specify accepted payment methods (ACH, wire, credit card with surcharge), late fee terms, and the price-revision notice window (typically 60 days before renewal).
Service Level Objectives and Error Budgets
Pick the SLO honestly based on historical data, not aspirationally. Over a 30-day month, 99.9% allows ~43 min of downtime; 99.95% allows ~22 min; 99.99% allows ~4 min and in practice usually calls for multi-region active-active. Don't promise four nines on a single-region deployment.
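The downtime figures follow directly from the target and the length of the window. A minimal sketch, assuming a 30-day month (the function name is hypothetical):

```python
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target over the window."""
    total_minutes = days * 24 * 60
    return (1 - slo_percent / 100) * total_minutes


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min per 30 days")
```

Running the same function with `days=31` or over a calendar quarter shows how much the allowance shifts with the window, which is why the aggregation window belongs in the SLA text.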
Specify per-endpoint or per-endpoint-class targets — read APIs typically 200ms p95 / 500ms p99, write APIs 500ms / 1000ms. Include the measurement boundary (server-side, excluding network from client) so disputes don't hinge on client-side variance.
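Percentile targets are only comparable if both sides compute percentiles the same way. The sketch below uses the nearest-rank definition on server-side samples, which is an assumption; monitoring tools often use interpolated or sketch-based estimates, so the SLA should name the method. Function names are hypothetical.

```python
import math


def percentile(samples_ms: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_slo(samples_ms: list[float],
                      p95_target_ms: float,
                      p99_target_ms: float) -> bool:
    """True when both the p95 and p99 targets hold for the samples."""
    return (percentile(samples_ms, 95) <= p95_target_ms
            and percentile(samples_ms, 99) <= p99_target_ms)


# 90 fast requests and 10 slow ones against the read-API targets above.
samples = [100.0] * 90 + [600.0] * 10
print(percentile(samples, 95), meets_latency_slo(samples, 200, 500))
```

In this example 10% of requests are slow, so the p95 already lands on a 600 ms sample and the 200 ms target is missed even though the median looks healthy.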
Define what happens when the error budget is exhausted: feature freeze until burn rate recovers, mandatory reliability work in the next sprint, exec escalation. Configure burn-rate alerts (Datadog SLO monitors, Sloth) to fire when 2% of the 30-day budget is consumed in 1 hour and when 5% is consumed in 6 hours (burn rates of 14.4 and 6), a widely used fast-burn / slow-burn pair.
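The alert percentages translate into burn-rate multipliers and, for a given SLO, into error-rate thresholds. A minimal sketch of that arithmetic, assuming the percentages refer to the share of a 30-day error budget consumed within the alert window (function names are hypothetical, and this is not any vendor's API):

```python
def burn_rate(budget_fraction: float, alert_window_hours: float,
              slo_window_days: int = 30) -> float:
    """Multiple of the sustainable error rate that consumes
    budget_fraction of the error budget within the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


def error_rate_threshold(slo_percent: float, rate: float) -> float:
    """Error ratio (0 to 1) at which the given burn rate is reached."""
    return rate * (1 - slo_percent / 100)


fast = burn_rate(0.02, 1)   # 2% of budget in 1 hour
slow = burn_rate(0.05, 6)   # 5% of budget in 6 hours
print(f"fast burn {fast:.1f}x, slow burn {slow:.1f}x")
print(f"99.9% SLO, fast-burn threshold: {error_rate_threshold(99.9, fast):.2%}")
```

For a 99.9% SLO the fast-burn alert therefore corresponds to an error rate of about 1.44% sustained over the hour, which is a useful sanity check when configuring the monitor.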
Dispute Resolution
Tier 1: account manager + customer contact (5 business days). Tier 2: VP-level both sides (10 business days). Tier 3: formal mediation. Most disputes resolve at Tier 1 if the path is documented; ambiguity is what pushes things to legal.
Specify governing law and venue (e.g., Delaware law, AAA arbitration in San Francisco). For international customers, specify ICC arbitration to avoid jurisdictional fights. Have legal counsel review the clause, since boilerplate from a US contract may be unenforceable in the EU.
Acknowledgement within 5 business days, initial response within 15, resolution targeted within 60. Without timelines, disputes drift indefinitely while goodwill erodes.
Termination and Exit
Cover termination for convenience (typically 60–90 days' notice), termination for cause (material breach uncured after 30 days), and termination for repeated SLO failure (e.g., three consecutive months below target). Specify whether termination triggers a refund of prepaid fees.
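The repeated-failure trigger is easy to state ambiguously (consecutive months versus any three months in a period). The sketch below implements the consecutive reading from the example; the function name and the sample figures are hypothetical.

```python
def repeated_failure_triggered(monthly_uptime: list[float],
                               target: float,
                               consecutive: int = 3) -> bool:
    """True if uptime was below target for `consecutive` months in a row."""
    streak = 0
    for uptime in monthly_uptime:
        streak = streak + 1 if uptime < target else 0
        if streak >= consecutive:
            return True
    return False


history = [99.95, 99.80, 99.70, 99.85]   # illustrative monthly uptime, oldest first
print(repeated_failure_triggered(history, target=99.9))
```

A single month at or above target resets the streak under this reading, so a customer who wants protection against chronic intermittent misses may ask for a rolling-window clause instead.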
Specify the export format (JSON dump, Parquet, CSV per schema), the delivery mechanism (signed S3 URL, customer-supplied bucket), and the support hours included for migration assistance. Customers without an exit plan are locked in by default, so present a documented export path as a feature of the service.
Specify the destruction timeline (typically 30–90 days after termination), covering primary stores, replicas, backups, and any analytics warehouse copies. Provide a written certificate of destruction signed by the security lead. GDPR Article 28(3)(g) requires processors to delete or return personal data at the end of the service, at the controller's choice; align the language to satisfy the customer's DPA.
