Server Diagnostics

Mar 25, 2018 17:22 · 629 words · 3 minute read Linux Command Line

Netflix Linux Performance



Check load averages. The load number counts processes that are runnable (currently running or waiting to run) plus processes in uninterruptible sleep (usually waiting on disk I/O). So it is an averaged process count, not a percentage. Because it includes uninterruptible processes, which have little effect on CPU utilization, you cannot reliably infer CPU usage from load averages. This also explains why you may see high load averages but not much load on the CPU.

cat /proc/loadavg
0.00 0.01 0.03 1/120 1500 # 1, 5, 15 min averages; currently running procs/total; last PID
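A minimal sketch of splitting the /proc/loadavg fields into labeled values. The function name `parse_loadavg` is made up for illustration; it only assumes the standard five-field layout with a running/total pair in the fourth field.

```shell
# Hypothetical helper: label the fields of one /proc/loadavg-style line.
# Reads the line from stdin.
parse_loadavg() {
  awk '{
    split($4, procs, "/")   # fourth field is running/total
    printf "load1=%s load5=%s load15=%s running=%s total=%s last_pid=%s\n",
           $1, $2, $3, procs[1], procs[2], $5
  }'
}

# Usage on a live Linux system:
# parse_loadavg < /proc/loadavg
```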


dmesg | tail 

Quick check of the most recent kernel messages for obvious errors (for example, OOM kills).


vmstat 1

r : Number of processes running on a CPU or waiting for a turn. Doesn’t include I/O, so it is a better signal of CPU saturation than load average. r > CPU count means CPU saturation.
free : Free memory in KB. “free -m” gives a better explanation.
si, so : Swap-ins and swap-outs. If these are non-zero, the system is short on memory.
us, sy, id, wa, st : Breakdowns of CPU time, averaged across all CPUs. They are user time, system time (kernel), idle, wait I/O, and stolen time (by other guests, or with Xen, the guest’s own isolated driver domain).

The CPU time breakdowns will confirm if the CPUs are busy, by adding user + system time. A constant degree of wait I/O points to a disk bottleneck; this is where the CPUs are idle, because tasks are blocked waiting for pending disk I/O. System time is necessary for I/O processing. A high system time average, over 20%, can be interesting to explore further: perhaps the kernel is processing the I/O inefficiently.
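A rough sketch of applying the two checks above (r against CPU count, and user + system time) to vmstat output. The function name `vmstat_check` is hypothetical, and the column positions assume the classic vmstat layout where r is column 1, us is column 13 and sy is column 14; verify against the header on your system.

```shell
# Hypothetical helper: summarize vmstat samples.
# stdin = vmstat output, $1 = CPU count to compare the run queue against.
vmstat_check() {
  awk -v cpus="$1" '
    $1 ~ /^[0-9]+$/ {            # skip the two header lines
      busy = $13 + $14           # us + sy
      flag = ($1 > cpus) ? "SATURATED" : "ok"
      printf "r=%d busy=%d%% %s\n", $1, busy, flag
    }'
}

# Usage on a live system (nproc is part of GNU coreutils):
# vmstat 1 5 | vmstat_check "$(nproc)"
```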


mpstat -P ALL 1 

This command prints CPU time breakdowns per CPU, which can be used to check for an imbalance. A single hot CPU can be evidence of a single-threaded application.


pidstat 1

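Very similar to top’s per-process summary, but prints a rolling summary instead of clearing the screen, which makes it easier to watch patterns over time.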


iostat -xz 1 

r/s, w/s, rkB/s, wkB/s : The delivered reads, writes, read Kbytes, and write Kbytes per second to the device. Use these for workload characterization. A performance problem may simply be due to an excessive load applied.
await : The average time for the I/O in milliseconds. This is the time that the application suffers, as it includes both time queued and time being serviced. Larger than expected average times can be an indicator of device saturation, or device problems.
avgqu-sz : The average number of requests issued to the device (shown as aqu-sz in newer sysstat versions). Values greater than 1 can be evidence of saturation (although devices can typically operate on requests in parallel, especially virtual devices which front multiple back-end disks).
%util : Device utilization. This is really a busy percent, showing the time each second that the device was doing work. Values greater than 60% typically lead to poor performance (which should be seen in await), although it depends on the device. Values close to 100% usually indicate saturation.
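A small sketch of the %util rule of thumb above: list devices whose utilization exceeds a threshold. The function name `iostat_busy` and the default of 60 are illustrative assumptions. Column order differs between sysstat versions, so the sketch looks up the %util column from the header row instead of hard-coding a position.

```shell
# Hypothetical helper: print devices whose %util is above a threshold.
# stdin = iostat -x output, $1 = threshold in percent (defaults to 60).
iostat_busy() {
  awk -v limit="${1:-60}" '
    # Header row: remember which column holds %util.
    /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
    # Device rows: print name and %util when above the threshold.
    col && NF >= col && ($col + 0) > (limit + 0) { print $1, $col }
  '
}

# Usage on a live system:
# iostat -xz 1 3 | iostat_busy 60
```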


free -mh

buffers : For the buffer cache, used for block device I/O.
cached : For the page cache, used by file systems.
Check that neither is near zero, which can lead to higher disk I/O.


sar -n DEV 1

Check the throughput of the network interfaces: rxkB/s and txkB/s show the workload and whether any limit has been reached.


sar -n TCP,ETCP 1

active/s : Number of locally-initiated TCP connections per second (e.g., via connect()).
passive/s : Number of remotely-initiated TCP connections per second (e.g., via accept()).
retrans/s : Number of TCP retransmits per second.

Figuring out what process is using which port

sudo netstat -plunt
sudo ss -plunt

Get RAM size

cat /proc/meminfo
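To pull out just the total, a minimal sketch that reads the MemTotal line and converts it to MiB. The function name `mem_total_mib` is made up for illustration; /proc/meminfo reports the value in kB.

```shell
# Hypothetical helper: print total RAM in MiB.
# stdin = /proc/meminfo-style text (MemTotal is given in kB).
mem_total_mib() {
  awk '$1 == "MemTotal:" { printf "%d\n", $2 / 1024 }'
}

# Usage on a live Linux system:
# mem_total_mib < /proc/meminfo
```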

Pin a CPU at 100% (for testing)

cat /dev/urandom > /dev/null &


Get uptime

cat /proc/uptime
9592411.58 9566042.33 # Total seconds up, total seconds idle (summed across all CPUs)
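A quick sketch of turning those two numbers into something readable. The function name `uptime_summary` is hypothetical. Since the idle figure is summed across CPUs, the CPU count is passed in as an argument; the sample call in the note below assumes a single-CPU machine.

```shell
# Hypothetical helper: convert a /proc/uptime line to days up and percent idle.
# stdin = /proc/uptime line, $1 = CPU count (idle seconds are summed over CPUs).
uptime_summary() {
  awk -v cpus="$1" '{
    printf "up %.1f days, %.1f%% idle\n", $1 / 86400, 100 * $2 / ($1 * cpus)
  }'
}

# Usage on a live system (nproc is part of GNU coreutils):
# uptime_summary "$(nproc)" < /proc/uptime
```

With the sample values above and a CPU count of 1, this prints "up 111.0 days, 99.7% idle".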

HTOP colors

CPU:
Blue : Low priority threads (nice > 0)
Green : Normal priority threads
Red : Kernel (system) time

Mem:
Green : Used memory
Blue : Buffers
Orange : Cache