Service Level Agreement (SLA) Checklist
Steps an engineering or platform lead runs to draft, negotiate, and sign off on a customer-facing SLA — covering service scope, SLOs, error budgets, security controls, service credits, and exit terms.
Service Definition
-
Inventory the services in scope
List the named services, APIs, and customer-facing endpoints covered. Reference the Backstage service catalog or internal service registry. Call out out-of-scope items (sandbox, beta endpoints, third-party SaaS pass-throughs) explicitly; ambiguity here is a common trigger for SLA disputes.
-
Classify the service tier and customer segment
Tier drives uptime targets, support response, and credit structure. Enterprise customers typically get higher SLOs (99.95%+) and named TAMs; Standard customers get the public SLA with shared on-call.
-
Document provider and customer responsibilities
RACI-style split: who patches the runtime, who owns the customer's IAM config, who is responsible for client-side SDK upgrades. Customer-caused incidents (misconfigured webhooks, expired API tokens) should not consume the provider's error budget.
Performance Monitoring and Reporting
-
Define the SLIs and measurement windows
SLIs typically pull from the four golden signals — latency, traffic, errors, saturation. Specify the metric source (Datadog, Prometheus, New Relic), the aggregation window (rolling 30-day vs. calendar month), and how planned maintenance is excluded. Vague language like "reasonable uptime" is unenforceable.
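How planned maintenance is excluded changes the reported number, so it helps to write the rule down as arithmetic. A minimal sketch, assuming minute-resolution intervals expressed as offsets from the start of the window and assuming maintenance minutes are removed from both the measured period and the downtime (the function names are hypothetical, and the contract may define the exclusion differently):

```python
def minutes_covered(intervals, window_minutes):
    """Return the set of minute offsets covered by (start, end) intervals,
    clipped to the measurement window. End offsets are exclusive."""
    covered = set()
    for start, end in intervals:
        covered.update(range(max(start, 0), min(end, window_minutes)))
    return covered


def availability_pct(window_minutes, outages, maintenance):
    """Availability over the window, excluding planned maintenance.

    Assumption: maintenance minutes are dropped from the denominator, and
    downtime that falls inside a maintenance interval is not counted.
    """
    maint = minutes_covered(maintenance, window_minutes)
    down = minutes_covered(outages, window_minutes) - maint
    measured = window_minutes - len(maint)
    if measured <= 0:
        raise ValueError("window is entirely planned maintenance")
    return 100.0 * (measured - len(down)) / measured


# Rolling 30-day window: a 30-minute outage that overlaps a
# 60-minute maintenance slot by 10 minutes counts as 20 minutes down.
window = 30 * 24 * 60
print(availability_pct(window, outages=[(100, 130)], maintenance=[(120, 180)]))
```

The same function applies to a calendar-month window by passing that month's minute count instead of a fixed 30 days.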
-
Configure the customer status page
Wire Statuspage or Better Stack to the relevant Datadog monitors so component status reflects reality without manual updates. Define posting cadence for incidents — initial post within 15 minutes of SEV1 detection, updates every 30 minutes until resolution.
-
Set the monthly performance reporting cadence
Specify the report contents: SLO attainment, incident summary with root causes, change-management activity, and any service credits owed. Also specify the delivery channel (PDF to designated contact, customer portal, shared dashboard) and the delivery deadline (e.g., by the 10th of the following month).
Service Management and Escalation
-
Name the service owner and on-call rotation
Identify the named service owner (typically the engineering manager) and the PagerDuty rotation backing first response. Avoid single points of failure — every primary needs a documented secondary so vacations don't break the response chain.
-
Define the SEV1, SEV2, and SEV3 escalation matrix
Concrete examples per severity, not just "high impact." SEV1 = customer-facing outage or data loss, page within 5 minutes, IC assigned within 15. SEV2 = degraded performance, business-hours response, 1-hour acknowledgement. SEV3 = single-customer issues routed via support tier-2.
-
Document the change-control process
Specify customer-notification windows for breaking changes (typically 90 days for API deprecations), how planned maintenance is announced, and the freeze windows during which the customer can request no deploys (peak retail, fiscal close). Reference SOC 2 CC8.1 if the customer is audit-driven.
Security and Compliance
-
Identify regulated data categories handled
Determines which compliance addenda the SLA needs. Even "None" should be confirmed in writing — customers sometimes assume a BAA is in place when one was never executed.
-
Document encryption, access, and audit controls
Encryption at rest (KMS, customer-managed keys if Enterprise tier) and in transit (TLS 1.2+). Access controls for production data — break-glass only, with audit trail to CloudTrail or equivalent. Retention period for audit logs (typically 1 year minimum, 7 for SOX-relevant).
-
Attach the regulatory addendum and breach notification terms
Attach the BAA (HIPAA), DPA with SCCs (GDPR), or PCI responsibility matrix as appropriate. Breach notification timelines vary: GDPR requires notice to the supervisory authority within 72 hours of becoming aware of the breach, HIPAA requires notice to affected individuals no later than 60 days after discovery, and customer contracts often require notice within 24 hours of confirmation.
Pricing and Service Credits
-
Define the pricing tiers and overage rates
Document the metered units (API calls, seats, GB stored), the included quota at each tier, and overage pricing. Specify how spikes are handled — hard cutoff with 429s, soft cap with overage billing, or burst allowance with monthly true-up.
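The difference between a hard cutoff and a soft cap is easiest to agree on when both are applied to the same usage figure. A rough sketch, assuming a single metered unit and a flat overage rate; the names `UsageOutcome` and `settle_usage` are hypothetical, and the burst-allowance option is left out because its true-up terms are contract-specific:

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class UsageOutcome:
    served: int            # units actually served
    rejected: int          # units refused, e.g. answered with HTTP 429
    overage_fee: Decimal   # amount billed on top of the tier price


def settle_usage(requested, included, overage_rate, policy):
    """Apply one spike-handling policy to a month's requested units."""
    excess = max(requested - included, 0)
    if policy == "hard_cutoff":
        # Everything above the included quota is refused and nothing extra is billed.
        return UsageOutcome(requested - excess, excess, Decimal("0"))
    if policy == "soft_cap":
        # Everything is served; units above the quota are billed at the overage rate.
        return UsageOutcome(requested, 0, overage_rate * excess)
    raise ValueError(f"unknown policy: {policy}")


rate = Decimal("0.002")  # assumed price per API call above quota
print(settle_usage(12_000, 10_000, rate, "hard_cutoff"))
print(settle_usage(12_000, 10_000, rate, "soft_cap"))
```

Decimal is used for the fee so that invoice amounts are not subject to binary floating-point rounding.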
-
Specify service credit calculation for SLO breaches
Tiered credits are standard: e.g., a 10% credit for monthly uptime from 99.0% up to (but not including) 99.9%, 25% from 95.0% up to 99.0%, and 50% below 95.0%. State which tier each boundary value falls into so the bands do not overlap. Specify the claim mechanism (customer must request within 30 days), the cap (typically one month's fees), and that credits are the sole and exclusive remedy.
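Writing the credit tiers as a function forces a decision on the boundary values. A minimal sketch using the example percentages above, assuming 99.9% is the SLO target (no credit at or above it), that each lower bound belongs to the more favorable tier, and that the cap defaults to one month's fees; the function names are hypothetical:

```python
def credit_pct(uptime_pct):
    """Service credit percentage for a month's measured uptime."""
    if uptime_pct >= 99.9:
        return 0    # SLO met, no credit owed
    if uptime_pct >= 99.0:
        return 10
    if uptime_pct >= 95.0:
        return 25
    return 50


def credit_amount(uptime_pct, monthly_fee, cap=None):
    """Credit in currency units, limited by the cap (one month's fees by default)."""
    cap = monthly_fee if cap is None else cap
    return min(monthly_fee * credit_pct(uptime_pct) / 100, cap)


print(credit_pct(99.5))              # falls in the 10% tier
print(credit_amount(97.0, 10_000))   # 25% of a 10,000 monthly fee
```

The claim window (30 days in the example above) is a process rule rather than arithmetic, so it is not modeled here.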
-
Set invoicing and payment terms
Net-30 is the default; Net-60 for enterprise procurement is common. Specify accepted payment methods (ACH, wire, credit card with surcharge), late fee terms, and the price-revision notice window (typically 60 days before renewal).
Service Level Objectives and Error Budgets
-
Set the uptime SLO target
Pick the SLO honestly based on historical data, not aspirationally. Over a 30-day month, 99.9% allows ~43 min of downtime; 99.95% allows ~22 min; 99.99% allows ~4 min and generally requires multi-region active-active. Don't promise four nines on a single-region deployment.
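The downtime figures above follow directly from the SLO percentage and the window length. A small sketch of the arithmetic, assuming a 30-day month (the function name is hypothetical):

```python
def allowed_downtime_minutes(slo_pct, days=30):
    """Minutes of downtime permitted by an uptime SLO over the window."""
    return (1 - slo_pct / 100) * days * 24 * 60


for slo in (99.9, 99.95, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} min per 30 days")
```

Running the same function against the last few months of measured downtime is a quick check on whether a candidate target would already have been breached.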
-
Set latency targets for p95 and p99
Specify per-endpoint or per-endpoint-class targets — read APIs typically 200ms p95 / 500ms p99, write APIs 500ms / 1000ms. Include the measurement boundary (server-side, excluding network from client) so disputes don't hinge on client-side variance.
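Percentile definitions differ between tools, so the SLA should name one. A minimal sketch that checks server-side latency samples against p95 and p99 targets, assuming the nearest-rank percentile definition (monitoring products may interpolate instead) and hypothetical function names:

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it.
    pct is an integer between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def meets_targets(samples_ms, p95_target_ms, p99_target_ms):
    """True when both the p95 and the p99 latency are within target."""
    return (percentile(samples_ms, 95) <= p95_target_ms
            and percentile(samples_ms, 99) <= p99_target_ms)


# Read-API example targets from above: 200 ms p95, 500 ms p99.
samples = [100] * 98 + [600, 700]
print(percentile(samples, 95), percentile(samples, 99))
print(meets_targets(samples, 200, 500))
```

Here the p95 target is met but two slow requests out of 100 are enough to miss the p99 target, which is why both percentiles are worth stating.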
-
Define the error budget policy
What happens when budget is exhausted: feature freeze until burn rate recovers, mandatory reliability work in next sprint, exec escalation. Configure burn-rate alerts (Datadog SLO monitors, Sloth) that fire when 2% of the window's error budget is consumed in 1 hour and when 5% is consumed in 6 hours, a commonly used fast-burn / slow-burn pair.
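The alert thresholds translate into burn-rate multipliers once the SLO window is fixed. A sketch of that conversion, assuming a 30-day SLO window and hypothetical function names; the burn rate is the observed error ratio divided by the error ratio the SLO allows:

```python
def burn_rate_threshold(budget_fraction, alert_hours, slo_window_days=30):
    """Burn rate at which budget_fraction of the error budget
    is consumed within alert_hours."""
    return budget_fraction * (slo_window_days * 24) / alert_hours


def should_alert(error_ratio, slo_pct, threshold):
    """True when the observed error ratio burns budget at or above the threshold."""
    allowed_error_ratio = 1 - slo_pct / 100
    return error_ratio / allowed_error_ratio >= threshold


fast = burn_rate_threshold(0.02, alert_hours=1)   # 2% of budget in 1 hour
slow = burn_rate_threshold(0.05, alert_hours=6)   # 5% of budget in 6 hours
print(fast, slow)

# With a 99.9% SLO, a 2% error ratio over the last hour burns budget
# roughly 20 times faster than the SLO allows.
print(should_alert(0.02, 99.9, fast))
```

A burn rate of 1 means the budget lasts exactly the length of the window, so any sustained value above 1 will exhaust it early.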
Dispute Resolution
-
Document the dispute escalation workflow
Tier 1: account manager + customer contact (5 business days). Tier 2: VP-level both sides (10 business days). Tier 3: formal mediation. Most disputes resolve at Tier 1 if the path is documented; ambiguity is what pushes things to legal.
-
Specify mediation and arbitration venue
Governing law and venue (e.g., Delaware, AAA arbitration in San Francisco). For international customers, specify ICC arbitration to avoid jurisdictional fights. Have legal review the clause: boilerplate from a US contract may be unenforceable in the EU.
-
Set response and resolution timelines
Acknowledgement within 5 business days, initial response within 15, resolution targeted within 60. Without timelines, disputes drift indefinitely while goodwill erodes.
Termination and Exit
-
Outline termination conditions and notice period
Termination for convenience (typically 60–90 days notice), termination for cause (material breach uncured after 30 days), and termination for repeated SLO failure (e.g., three consecutive months below target). Specify whether termination triggers a refund of prepaid fees.
-
Define the data export format and migration support
Specify the export format (JSON dump, Parquet, CSV per schema), the delivery mechanism (signed S3 URL, customer-supplied bucket), and the support hours included for migration assistance. Customers without an exit plan are locked in by default, so present documented export and migration support as a feature of the agreement.
-
Confirm post-termination data destruction obligations
Destruction timeline (typically 30–90 days after termination) covering primary stores, replicas, backups, and any analytics warehouse copies. Provide a written certificate of destruction signed by the security lead. GDPR Article 28(3)(g) requires processors to delete or return personal data at the end of the service; align language to satisfy the customer's DPA.