Performance Optimization Checklist
A structured pass for engineering teams investigating production latency, throughput, or cost regressions — covering profiling, database tuning, infrastructure, frontend, network, and ongoing monitoring. Run quarterly or whenever SLO bur...
Baseline and Profiling
-
Capture current p50, p95, p99 latency
Pull latency from Datadog / New Relic / Honeycomb for the last 14 days, broken out by the top 5 endpoints by traffic. Record traffic volume and error rate alongside latency — optimizing p99 on a low-traffic endpoint is rarely worth the effort.
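The baseline numbers can also be computed directly from raw request logs when a vendor dashboard is not available. The sketch below assumes a hypothetical input shape (a dict of endpoint name to latency samples and an error count) and uses the nearest-rank percentile definition; monitoring vendors may interpolate differently, so small discrepancies are expected.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def summarize(endpoints):
    """endpoints: {name: {"latencies_ms": [...], "errors": int}} (hypothetical shape).

    Returns one row per endpoint, sorted by request volume, highest first.
    """
    rows = []
    for name, data in endpoints.items():
        latencies = data["latencies_ms"]
        rows.append({
            "endpoint": name,
            "requests": len(latencies),
            "error_rate": data["errors"] / len(latencies),
            "p50": percentile(latencies, 50),
            "p95": percentile(latencies, 95),
            "p99": percentile(latencies, 99),
        })
    return sorted(rows, key=lambda r: r["requests"], reverse=True)
```

Sorting by request volume keeps the high-traffic endpoints at the top, which is where a p99 improvement is most likely to be worth the effort.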
-
Run a CPU and memory profile
Use pprof, py-spy, async-profiler, or the equivalent for your runtime. Capture a 60-second profile under representative load — not idle. Flame graphs from production traffic catch hot paths that synthetic benchmarks miss.
-
Identify the top three bottlenecks
Rank by share of total wall-clock time, not by intuition. Common surprises: JSON serialization, ORM N+1 queries, synchronous logging in hot paths, gzip on already-compressed payloads.
-
Define target SLOs for the optimization pass
Write the goal as a measurable objective on a specific SLI: "p95 checkout latency under 400ms over a 7-day window." Vague targets like "make it faster" produce vague results. Get sign-off from the product manager so the team isn't optimizing in the dark.
Application Code Optimization
-
Eliminate N+1 queries on hot paths
Use the ORM's eager loading (Rails includes, Django select_related/prefetch_related, SQLAlchemy joinedload) or write the join explicitly. Add Bullet (Rails) or nplusone (Python) to the test suite so CI fails builds that introduce regressions.
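To make the N+1 shape concrete without depending on a particular ORM, the sketch below uses the standard-library sqlite3 module with a hypothetical authors/posts schema and counts the SQL statements each approach issues. The join stands in for what an ORM's eager loading would generate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO posts VALUES (1, 1, 'A'), (2, 2, 'B'), (3, 1, 'C');
""")

queries = []
conn.set_trace_callback(queries.append)  # records each SQL statement as it runs

def titles_n_plus_one():
    """One query for the posts, then one more per post to fetch its author."""
    out = []
    posts = conn.execute("SELECT author_id, title FROM posts ORDER BY id").fetchall()
    for author_id, title in posts:
        (name,) = conn.execute(
            "SELECT name FROM authors WHERE id = ?", (author_id,)
        ).fetchone()
        out.append((title, name))
    return out

def titles_joined():
    """Single query: the join fetches posts and authors together."""
    return conn.execute(
        "SELECT p.title, a.name FROM posts p "
        "JOIN authors a ON a.id = p.author_id ORDER BY p.id"
    ).fetchall()

def count_queries(fn):
    """Return (result, number of SQL statements fn executed)."""
    queries.clear()
    result = fn()
    return result, len(queries)
```

Both functions return the same rows, but the first issues one statement per post plus one, so its cost grows with the result size while the joined version stays at a single statement.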
-
Replace synchronous calls with batched or async patterns
Loops that hit external APIs sequentially are a common culprit. Batch through DataLoader, asyncio.gather, or a background job (Sidekiq, Celery, BullMQ) when the work doesn't need to block the request.
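A minimal sketch of the asyncio.gather pattern, assuming a hypothetical fetch_price coroutine that stands in for the external API call (it sleeps instead of doing real I/O). The semaphore cap is an added assumption: unbounded fan-out can move the bottleneck to the upstream service.

```python
import asyncio

async def fetch_price(sku: str) -> dict:
    """Hypothetical stand-in for an external API call; sleeps instead of doing I/O."""
    await asyncio.sleep(0.05)
    return {"sku": sku, "price_cents": len(sku) * 100}

async def fetch_all(skus, max_in_flight=10):
    """Run the calls concurrently, with at most max_in_flight outstanding at once."""
    limit = asyncio.Semaphore(max_in_flight)

    async def bounded(sku):
        async with limit:
            return await fetch_price(sku)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(s) for s in skus))

if __name__ == "__main__":
    prices = asyncio.run(fetch_all([f"sku-{i}" for i in range(20)]))
    print(len(prices), "prices fetched")
```

With 20 SKUs and a cap of 10, the calls complete in roughly two rounds of the per-call latency instead of twenty sequential ones.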
-
Add memoization for expensive pure functions
Cache results of pure, deterministic functions in-process (functools.lru_cache, lodash memoize) or in Redis when the result must be shared across processes. Watch the cache key: using a mutable object as a key is a bug magnet.
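A short sketch of in-process memoization with functools.lru_cache. The shipping_cost_cents function and its pricing rule are hypothetical, as is the cart_key helper, which shows one way to turn a mutable list into a stable, hashable key before it reaches a cache.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def shipping_cost_cents(weight_grams: int, zone: str) -> int:
    """Hypothetical pure function: the same arguments always give the same result."""
    surcharge = 300 if zone == "intl" else 0
    return 500 + (weight_grams // 100) * 25 + surcharge

def cart_key(items: list) -> tuple:
    """Freeze a mutable list of (sku, qty) pairs into a hashable, order-independent key."""
    return tuple(sorted(items))
```

lru_cache requires hashable arguments, so passing a list raises TypeError rather than silently caching against an object that can change later. shipping_cost_cents.cache_info() reports hits and misses, which is a quick way to check that the cache is earning its keep.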
-
Open a code-review PR with profiler-driven changes
Keep PRs under ~400 lines so reviewers actually read them. Include before/after profile snapshots in the PR description. Route through CODEOWNERS to engineers who own the touched modules.
Database Performance
-
Review slow query logs from the last 7 days
In Postgres enable pg_stat_statements; in MySQL enable the slow query log with long_query_time=0.5. Sort by total time (calls × mean), not single-call time — the query running 100k times at 50ms is more important than the one running once at 8s.
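The ranking rule is simple arithmetic, worked here with the two queries from the example above. The tuple layout (query, calls, mean_ms) is an assumption for illustration; pg_stat_statements exposes equivalent columns.

```python
def rank_by_total_time(stats):
    """stats: iterable of (query, calls, mean_ms).

    Returns (query, calls, mean_ms, total_ms) rows sorted by total_ms, highest first.
    """
    rows = [(query, calls, mean_ms, calls * mean_ms) for query, calls, mean_ms in stats]
    return sorted(rows, key=lambda row: row[3], reverse=True)

stats = [
    ("one-off report query", 1, 8000.0),   # 1 call at 8 s
    ("hot-path lookup", 100_000, 50.0),    # 100k calls at 50 ms
]

for query, calls, mean_ms, total_ms in rank_by_total_time(stats):
    print(f"{query}: {total_ms / 1000:.0f} s total")
```

The hot-path lookup accounts for 5,000 seconds of database time against 8 seconds for the one-off query, which is why it ranks first despite being 160 times faster per call.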
-
Run EXPLAIN ANALYZE on the worst offenders
Look for sequential scans on large tables, hash joins where a nested-loop with an index would be faster, and rows-removed-by-filter numbers that dwarf rows-returned. Paste plans into explain.depesz.com or pev2 for readability.
-
Add indexes concurrently in production
Use CREATE INDEX CONCURRENTLY in Postgres, and ALGORITHM=INPLACE or pt-online-schema-change in MySQL. A plain CREATE INDEX in Postgres takes a table lock that blocks writes (reads continue) for the duration of the build, which stalls a busy table.
Verify with EXPLAIN that the planner actually picks up the new index — composite index column order matters.
-
Tune connection pooling and statement timeouts
For Postgres, use PgBouncer in transaction-pooling mode in front of the application. Set a server-side statement_timeout so a runaway query can't tie up a connection forever. Size the pool to roughly (cores × 2) + effective_spindle_count, not arbitrarily large.
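The pool-sizing rule above is plain arithmetic. In this minimal sketch the function name and the io_allowance parameter are illustrative labels for the formula's second term, and the result is a starting point to validate under load rather than a fixed answer.

```python
def suggested_pool_size(cpu_cores: int, io_allowance: int = 1) -> int:
    """Starting pool size for the database server: (cores * 2) + I/O allowance.

    io_allowance is a hypothetical name for the formula's second term.
    """
    if cpu_cores < 1 or io_allowance < 0:
        raise ValueError("cpu_cores must be >= 1 and io_allowance >= 0")
    return cpu_cores * 2 + io_allowance

print(suggested_pool_size(8))  # 8-core database host
```

For an 8-core host this suggests 17 server connections, far below the hundreds that application defaults can add up to across many instances.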
-
Plan a backfill or denormalization if needed
If a query joins five tables on every request, a denormalized read model or materialized view may be the right fix. Backfill in batches with sleeps so replication lag stays under your alert threshold; do not run a single transaction over millions of rows.
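A sketch of the batched backfill pattern, using sqlite3 so it runs anywhere. The orders table and its total_cents / total_cents_cached columns are hypothetical, and the fixed sleep is a placeholder: a production job would typically check replication lag before starting the next batch.

```python
import sqlite3
import time

def backfill_in_batches(conn, batch_size=1000, pause_s=0.1):
    """Copy total_cents into total_cents_cached, one short transaction per batch.

    Walks the table by primary key (keyset pagination) so each batch is a bounded
    range and no single transaction spans the whole table.
    """
    last_id = 0
    updated = 0
    while True:
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM orders WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        )]
        if not ids:
            return updated
        with conn:  # commits this batch on exit, rolls back on error
            conn.execute(
                "UPDATE orders SET total_cents_cached = total_cents "
                "WHERE id > ? AND id <= ?",
                (last_id, ids[-1]),
            )
        updated += len(ids)
        last_id = ids[-1]
        time.sleep(pause_s)  # placeholder for a replication-lag check
```

Because progress is tracked by the last processed id, the job can be stopped and resumed by passing that id back in, which a single long transaction cannot offer.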
Caching and Infrastructure
-
Add a Redis or Memcached layer for hot reads
Cache at the read-model boundary, not inside the ORM. Use cache-aside with a sensible TTL plus explicit invalidation on writes. The two hard problems are still cache invalidation and naming things — write down your key schema before shipping.
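A minimal cache-aside sketch with a TTL and explicit invalidation on writes. TTLCache is an in-memory stand-in for Redis or Memcached so the example runs without a server, the db argument is a plain dict standing in for the source of truth, and the key schema shown is one hypothetical convention.

```python
import time

class TTLCache:
    """In-memory stand-in for Redis/Memcached; entries expire after ttl_s seconds."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl_s):
        self._data[key] = (value, self._clock() + ttl_s)

    def delete(self, key):
        self._data.pop(key, None)

def profile_key(user_id):
    """Key schema kept in one place: <entity>:<schema version>:<id>."""
    return f"user_profile:v1:{user_id}"

def get_profile(cache, db, user_id, ttl_s=300):
    """Cache-aside read: try the cache, fall back to the source, then populate."""
    key = profile_key(user_id)
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = db[user_id]
    cache.set(key, value, ttl_s)
    return value

def update_profile(cache, db, user_id, value):
    """Write path: update the source of truth, then invalidate the cached copy."""
    db[user_id] = value
    cache.delete(profile_key(user_id))
```

Putting a version segment in the key means a change to the cached shape can ship as v2 without a mass purge; the old entries simply age out with their TTL.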
-
Configure CDN caching for static assets
Set long Cache-Control: max-age values for fingerprinted assets, and a short max-age or no-cache for HTML. Verify with curl -I that the second request is served from cache: CloudFront returns X-Cache: Hit from cloudfront, Fastly returns X-Cache: HIT, and Cloudflare returns CF-Cache-Status: HIT.
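One way to express the asset policy is a small function that maps a request path to a Cache-Control value. The fingerprint pattern (a hex hash of at least 8 characters before the extension), the extension list, and the five-minute default are all assumptions for illustration; adjust them to match what the build tool actually emits.

```python
import re

# Assumed fingerprint shape: app.3f9a1c2b.js or app-3f9a1c2b.css
FINGERPRINT = re.compile(r"[.-][0-9a-f]{8,}\.(?:js|css|png|jpg|svg|woff2)$")

def cache_control_for(path: str) -> str:
    """Pick a Cache-Control header value based on the request path."""
    if FINGERPRINT.search(path):
        # Content-addressed file: a new build produces a new URL, so cache for a year.
        return "public, max-age=31536000, immutable"
    if path.endswith(".html") or path.endswith("/"):
        # HTML references the fingerprinted assets, so it must be revalidated.
        return "no-cache"
    # Everything else: short shared caching as a conservative default (assumption).
    return "public, max-age=300"
```

The long lifetime is only safe because the fingerprint changes whenever the file content changes; an unfingerprinted asset with the same header would be stuck in caches for a year.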
-
Right-size autoscaling and HPA thresholds
Default HPA at 80% CPU is often too high for latency-sensitive services — saturation creates queueing well before CPU pegs. Consider scaling on p95 latency or request concurrency instead. Ensure the cluster has burst headroom for sudden scale-ups.
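The Kubernetes documentation gives the HPA scaling rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with scaling skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reimplements only that core formula to show how the target affects when scaling starts; it ignores stabilization windows, pod readiness, and min/max replica bounds.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Core HPA formula only; real controllers apply further rules on top."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # inside the tolerance band: no change
    return math.ceil(current_replicas * ratio)

# Same fleet at 85% CPU, two different targets.
print(desired_replicas(4, 85, 80))  # target 80%
print(desired_replicas(4, 85, 60))  # target 60%
```

At 85% CPU an 80% target is still inside the tolerance band and leaves the fleet at 4 replicas, while a 60% target scales to 6, which illustrates why a high CPU target reacts late for latency-sensitive services.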
-
Decide whether to escalate to a load test
If the regression is moderate or severe, run a k6 or Locust load test in staging before shipping. For minor optimizations on a service that is still within its SLO, a production canary is usually sufficient.
-
Run k6 load test against staging
Replay realistic traffic shapes from production logs — synthetic uniform load misses tail behaviors. Hold for at least 30 minutes at target RPS to surface memory leaks, GC pauses, and connection-pool exhaustion. Compare p95/p99 against the SLO defined earlier.
Frontend and Network
-
Audit Core Web Vitals with Lighthouse
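Run Lighthouse against production from a throttled mobile profile, not your fiber-connected laptop. Track LCP, INP, and CLS; Google's "good" thresholds are 2.5s / 200ms / 0.1. A standard Lighthouse navigation run does not report INP (Total Blocking Time is the lab proxy), so take INP from field data. Field data from CrUX is more honest than lab data.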
-
Code-split and lazy-load heavy routes
Use dynamic import() in webpack/Vite/Next.js for non-critical routes and components. Defer below-the-fold images with native loading="lazy". Watch the bundle analyzer: an entire utility library such as lodash pulled into the main chunk by a single import is a common find.
-
Enable HTTP/2 or HTTP/3 at the edge
Most CDNs support HTTP/3 (QUIC) with a flag flip. The win is largest on lossy mobile networks. Confirm with curl --http3 or Chrome DevTools Network panel that the protocol is actually negotiated.
-
Reduce TLS handshake and DNS lookup time
Enable OCSP stapling and TLS 1.3. Enable 0-RTT only for idempotent requests, since early data can be replayed. Use a low-latency DNS provider (Route53, NS1, Cloudflare) and keep TTLs short only on records that need fast failover; very short TTLs everywhere force extra lookups and hurt more than they help.
Validation and Ongoing Monitoring
-
Canary deploy at 5 percent traffic
Watch error rate and p99 on the canary fleet for at least 30 minutes before progressing. Have the rollback path tested and a kill-switch feature flag ready — performance regressions sometimes only show up under real production cardinality.
-
Compare post-deploy metrics against the baseline
Pull the same dashboard captured at baseline. The optimization is only "done" if the SLI defined in the kickoff actually moved — declaring victory off a green CI run is how regressions ship.
-
Add a perf regression test to CI
Pin the optimization in place with a benchmark gate in GitHub Actions / GitLab CI. Options include bundle-size budgets (size-limit), Lighthouse CI, or a k6 smoke test on every PR; pick the layer where the win was made.
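Whatever tool produces the measurement, the gate itself reduces to a comparison against a budget and, optionally, against the recorded baseline. The function below is a hypothetical sketch of that check; the 10% regression allowance is an assumption to absorb benchmark noise and should be tuned to the variance of the actual test.

```python
def check_budget(measured_p95_ms, budget_ms, baseline_p95_ms=None, max_regression=0.10):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    if measured_p95_ms > budget_ms:
        failures.append(
            f"p95 {measured_p95_ms:.0f}ms exceeds budget {budget_ms:.0f}ms"
        )
    if baseline_p95_ms is not None:
        allowed = baseline_p95_ms * (1 + max_regression)
        if measured_p95_ms > allowed:
            failures.append(
                f"p95 {measured_p95_ms:.0f}ms regressed more than "
                f"{max_regression:.0%} from baseline {baseline_p95_ms:.0f}ms"
            )
    return failures

if __name__ == "__main__":
    problems = check_budget(measured_p95_ms=380, budget_ms=400, baseline_p95_ms=350)
    for problem in problems:
        print(problem)
    raise SystemExit(1 if problems else 0)  # non-zero exit fails the CI job
```

Checking against the baseline as well as the absolute budget catches a slow drift that stays under the SLO for several releases before breaching it.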
-
Open a follow-up incident review
Schedule a blameless review focused on why the regression went undetected for so long — alert tuning, missing dashboard, or a coverage gap in synthetic checks. Track action items to closure in Jira / Linear with named owners.