Performance Tuning Checklist

Baseline and Change Scope

    Pull at least one full business day of metrics from PRTG, LogicMonitor, Datadog, or the RMM (CPU, memory, disk IOPS / latency, network throughput, app response time). Attach the export so you can compare against post-tuning numbers in validation. Tuning without a baseline is the most common reason engineers cannot prove improvement at QBR.

    The workload drives every downstream tuning choice — swappiness, I/O scheduler, NUMA pinning, buffer pool sizing. Pick the dominant role; mark the host "Mixed / general purpose" only when no role exceeds ~60% of utilization.

    Confirm the window with the application owner and any dependent client (MSP context: notify the account manager so the QBR doesn't surface a surprise). Avoid month-end close, payroll runs, and patch Tuesday collisions.

    File the RFC in ConnectWise PSA, ServiceNow, or Autotask with implementation steps, rollback plan, and CAB approver. Tuning that touches kernel parameters or DB memory settings is never a standard change — keep it normal-track so the rollback plan is reviewed.

Operating System Tuning

    Stay one minor release behind the bleeding edge unless a CVE forces otherwise. Confirm the kernel version is on the application vendor's support matrix (Oracle, SAP, and SQL Server all publish certified kernel ranges).

    Adjust net.core.somaxconn, net.ipv4.tcp_max_syn_backlog, vm.dirty_ratio, and vm.dirty_background_ratio to match the workload profile captured earlier. Persist changes in /etc/sysctl.d so they survive reboot — runtime-only tuning that disappears at next boot is a frequent post-tuning regression.

    Run systemctl list-unit-files --state=enabled and stop anything not required (cups, avahi, bluetooth on a server, sample web stacks). Document each disabled service in the change record so the next engineer knows it was deliberate.

    Database hosts typically run swappiness of 1-10; general-purpose hosts run 60. Verify swap is on a dedicated partition or fast local NVMe — swap on the SAN is a known latency trap during memory pressure.

Storage and File System

    Pull the firmware from the vendor portal (Dell iDRAC Lifecycle, HPE SUM, Lenovo XClarity). Match the firmware against the storage vendor's HCL — a mismatched HBA firmware against a SAN array is a top cause of intermittent latency spikes that masquerade as application bugs.

    For NVMe use "none" or "mq-deadline"; for SSDs "mq-deadline"; for spinning disk "bfq" or "mq-deadline". Persist via udev rule so the scheduler sticks across reboots and disk replacements.

    Add noatime on hot-read volumes; verify discard or run weekly fstrim on SSD-backed file systems; align XFS or ext4 stripe parameters to the underlying RAID geometry. Misalignment doubles small-write latency on RAID-5/6.

    Database workloads with 8K random I/O perform poorly on 256K stripes; sequential workloads (backup target, video) want the larger stripe. If you cannot rebuild the array, document the mismatch as a known ceiling so it doesn't surface later as a "tuning failed" complaint.

Database Optimization

    Confirm the engine version against the vendor lifecycle page (Microsoft SQL Server, PostgreSQL, MySQL, Oracle). Apply the latest cumulative update inside the maintenance window with a rollback snapshot taken first.

    Identify fragmentation above 30% (sys.dm_db_index_physical_stats on SQL Server, pgstattuple on Postgres). Rebuild online where the edition supports it; reorganize where it does not. Schedule a follow-up update of statistics after the rebuild.

    Pull the top 10 queries by total elapsed time from Query Store, pg_stat_statements, or the slow-query log. Add covering indexes, rewrite SELECT * patterns, and flag any queries that need application-side changes for the dev team — those go on the engineering backlog, not this runbook.

    SQL Server max server memory typically lands at 75-85% of host RAM with the OS reservation accounted for; PostgreSQL shared_buffers commonly 25% of RAM; MySQL innodb_buffer_pool_size 60-70% on dedicated hosts. Verify huge pages or large pages are enabled where the engine supports it.

    Postgres VACUUM ANALYZE, SQL Server UPDATE STATISTICS, MySQL ANALYZE TABLE. Confirm autovacuum or auto-stats is enabled going forward — manual maintenance during the tuning window is a one-time payoff; the recurring schedule is what holds gains.

Network Performance

    Pull from Intel, Mellanox, or Broadcom driver pages — the in-box driver lags by 12-24 months and frequently misses TSO / LRO offload fixes that show up as throughput ceilings.

    Raise net.core.rmem_max and wmem_max for high-bandwidth-delay paths; switch congestion control to BBR on long-haul links where supported. On 1 Gb LAN the defaults are usually fine — don't tune what isn't broken.

    Coordinate with the network team on the switch-side LACP configuration; mismatched bonding modes cause flapping under load. Verify with ethtool and /proc/net/bonding/bond0 that both members are active before declaring done.

    Mark database replication, backup, and VoIP signaling with appropriate DSCP values; confirm the upstream switch (Meraki, Catalyst, FortiGate) trusts the markings. Untrusted markings get rewritten at the edge and the QoS work is wasted.

Application Layer Tuning

    Check the vendor support matrix and any extended-support contracts. Out-of-support application versions are the single most common finding when a tuning engagement turns into a remediation engagement.

    Use k6, JMeter, or a vendor-specific load generator to drive expected peak traffic. Compare the captured profile against the baseline; "tuned" without a profile under load is just patched.

    JVM heap, IIS application-pool worker count, Tomcat maxThreads, .NET ThreadPool min/max — pick one per app. Connection-pool exhaustion presents as application timeouts that look like network problems to the helpdesk, so this step prevents tickets later.

    Redis or Memcached for session and object cache; IIS output caching or Varnish in front of web tiers. Set explicit TTLs and an eviction policy — caches without bounds eventually consume all available memory and trigger the swap problem you tuned away in step OS-4.

Validation and Handoff

    Run the same tools and the same business-day window as the pre-tuning capture so the comparison is apples-to-apples. Different tooling between baseline and validation is the easiest way to fool yourself into thinking tuning worked.

    Quantify the delta in CPU saturation, p95 disk latency, network throughput, and application response time. Attach the post-tuning export so the QBR deck has real numbers, not adjectives.

    Execute the rollback plan filed with the change request: revert sysctl, restore database memory settings, roll the snapshot if needed. File a follow-up RFC with revised tuning hypothesis — do not retry inside the same change window without CAB review.

    Capture the final settings in IT Glue, Hudu, or Confluence so the next on-call engineer doesn't reverse-engineer your work at 2 a.m. Update the asset record in ConnectWise or ServiceNow with the post-tuning configuration baseline.

    Application owner or vCIO signs off that the tuning meets the agreed success criteria. Close the change in PSA referencing this checklist run and attach the post-tuning metrics file.

Use this template in Manifestly

Start a Free 14 Day Trial
Use Slack? Start your trial with one click

Related Systems Administration Checklists
Related Performance Tuning Checklists

Ready to take control of your recurring tasks?

Start Free 14-Day Trial


Use Slack? Sign up with one click

With Slack