Database Performance Tuning Checklist

Baseline and Workload Profiling

    Pull at least 24 hours of metrics from your monitoring tool (Datadog DBM, SolarWinds DPA, pganalyze, Percona PMM, or Query Store on SQL Server). Record CPU, buffer cache hit ratio, page life expectancy, wait stats, and average query latency at p50/p95/p99 — these are what you compare against after the tuning pass.

    Use pg_stat_statements (Postgres), Query Store (SQL Server), or Performance Schema (MySQL) to rank queries by total time, not just per-execution time — a 5ms query running 50,000 times an hour matters more than a 30s nightly report.

    Classify the dominant wait type — CPU, IO (PAGEIOLATCH, IO:DataFileRead), lock (LCK_M_X), or memory pressure. The bottleneck class determines whether tuning effort goes to indexes, query rewrites, or hardware.

Index Optimization

    EXPLAIN ANALYZE (Postgres / MySQL) or actual execution plan (SQL Server) on the top offenders from baseline. Watch for sequential scans on large tables, key lookups, and high estimated-vs-actual row count gaps that signal stale statistics rather than a missing index.

    Check pg_stat_user_indexes (idx_scan = 0) or sys.dm_db_index_usage_stats. Every unused index costs write throughput and storage. Duplicate-leading-column indexes are also a common find — consolidate before adding new ones.

    For multi-column WHERE / ORDER BY patterns, evaluate composite indexes with the most-selective leading column. A covering index (INCLUDE columns in SQL Server, or index-only scan in Postgres) eliminates lookup cost on hot read paths.

    Reorganize at 5-30% fragmentation; rebuild above 30%. Use ONLINE = ON (SQL Server Enterprise) or pg_repack (Postgres) to avoid table-level locks. Always run during the change window.

Query Performance

    Replace SELECT * with explicit column lists, eliminate correlated subqueries that can be JOINs, push predicates into the most-selective branch, and avoid functions on indexed columns (WHERE UPPER(email) = ... defeats the index).

    Ad-hoc literal queries blow up the plan cache and prevent reuse. Confirm the application is using prepared statements or parameterized queries; check sys.dm_exec_cached_plans (SQL Server) or pg_stat_statements normalization for ad-hoc patterns.

    A query that runs 10x faster on a 1-million-row dev dataset can run 5x slower on the 800-million-row production table. Run rewrites against a recent restore or read replica before scheduling the change.

    Decide whether the win belongs in Redis / Memcached at the app tier, in materialized views, or in the database buffer pool. Hot read-mostly lookups belong in Redis with a defined TTL and invalidation path.

Resource and Configuration Tuning

    Set shared_buffers (Postgres, ~25% of RAM), innodb_buffer_pool_size (MySQL, ~70-80%), or max server memory (SQL Server, leaving 4-8 GB for OS). Validate cache hit ratio improves before declaring done.

    Run ANALYZE / UPDATE STATISTICS WITH FULLSCAN on tables that have churned heavily. Stale stats are the most common cause of sudden plan regressions and a faster fix than any index change.

    Right-size PgBouncer, ProxySQL, or HikariCP. A connection storm — 2,000 app connections hitting Postgres directly — is the silent killer behind many CPU-bound databases. Pool size should be near (cores × 2) + spindles, not the application's whim.

    For tables exceeding 100M rows or with clear time-based access patterns, evaluate range partitioning. Partition pruning turns scans of a year of data into scans of a month — but only if every hot query's predicate aligns with the partition key.

    Add or refresh thresholds on buffer cache hit ratio, deadlocks/sec, long-running queries (>30s), replication lag, and tempdb / WAL growth. Route to PagerDuty or Opsgenie with severity tied to whether it's degrading users.

Change Rollout and Validation

    Document the proposed changes, blast radius, rollback plan, and expected gain. Schedule against the maintenance window; index rebuilds and partitioning are not standard changes and require explicit CAB approval.

    Apply changes in the approved order with checkpoints between each. Stop and roll back if any post-change probe regresses against baseline by more than the agreed threshold (typically 10%).

    Pull the same 24-hour window of metrics captured in baseline. Confirm p95 latency, top-query total time, and wait-stat distribution all moved the right direction. Regressions go on a follow-up ticket; do not silently accept them.

    Execute the documented rollback: drop new indexes, restore prior config values, revert query changes via deploy. File an incident review note describing what regressed and why the pre-change validation missed it.

Backup and Recovery Verification

    Index rebuilds and partitioning can grow the WAL / transaction log dramatically. Verify backup windows still complete on time and that log shipping or replication has caught up before declaring the tuning pass complete.

    Restore the most recent backup to an isolated environment and validate row counts on the changed tables. A green backup job is not a working backup — only a successful restore proves it.

    Record what was changed, why, the measured before/after, and any follow-up items in the team's runbook (Hudu, IT Glue, Confluence). The next DBA on this database should be able to read the trail without asking you.