Server Diagnostics
Mar 25, 2018 17:22
1.)
uptime
Check load averages. The load number counts processes that are runnable
(currently running or waiting to run) plus processes in uninterruptible sleep
(usually blocked on disk I/O). So it’s an average count of processes, not a
percentage. Because it also includes uninterruptible processes, which have
little effect on CPU utilization, CPU usage can’t be inferred directly from
load averages. This also explains why you may see high load averages but not
much load on the CPU.
cat /proc/loadavg
0.00 0.01 0.03 1/120 1500 # 1, 5, 15 mins. Current running procs/total. Last pid
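Load averages are easier to read next to the CPU count. A minimal sketch that divides the three averages by a CPU count you pass in; `load_per_core` is a made-up helper name, and the result is still only a rough signal because uninterruptible processes are counted too.

```shell
# load_per_core CPUS: read one /proc/loadavg-style line on stdin and print
# the 1, 5 and 15 minute load averages divided by CPUS.
# (Hypothetical helper, not a standard tool.)
load_per_core() {
    awk -v cpus="$1" '{
        printf "1m=%.2f 5m=%.2f 15m=%.2f\n", $1 / cpus, $2 / cpus, $3 / cpus
    }'
}
```

Typical use on Linux would be `load_per_core "$(nproc)" < /proc/loadavg`. Values that stay well above 1.00 are worth a closer look with vmstat.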
2.)
dmesg | tail
Quick check of the most recent kernel messages for obvious errors (for example, OOM kills)
3.)
vmstat 1
r
: Number of processes running on a CPU or waiting for a turn. Doesn’t include
I/O, so it’s a better signal of CPU saturation than load average. r > CPU
count means CPU saturation.
free
: Free memory in KiB. “free -m” gives a clearer breakdown.
si, so
: Swap-ins and swap-outs. If these are non-zero, you’re out of memory.
us, sy, id, wa, st
: These are breakdowns of CPU time, on average across all
CPUs. They are user time, system time (kernel), idle, wait I/O, and stolen
time (by other guests, or with Xen, the guest’s own isolated driver domain).
The CPU time breakdowns will confirm if the CPUs are busy, by adding user + system time. A constant degree of wait I/O points to a disk bottleneck; this is where the CPUs are idle, because tasks are blocked waiting for pending disk I/O. System time is necessary for I/O processing. A high system time average, over 20%, can be interesting to explore further: perhaps the kernel is processing the I/O inefficiently.
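The two quickest checks above (r versus CPU count, and non-zero si/so) can be scripted. A rough sketch, assuming the usual vmstat layout where the column header row starts with `r`; `vmstat_flags` is a hypothetical name, and columns are looked up by header name rather than by position.

```shell
# vmstat_flags CPUS: read vmstat output on stdin and print a warning for each
# sample where r exceeds CPUS or where swap-ins/swap-outs are non-zero.
# (Hypothetical helper; assumes the header row begins with the "r" column.)
vmstat_flags() {
    awk -v cpus="$1" '
        # Header row: remember which column each name lives in.
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Skip banner lines and anything seen before the header row.
        !("r" in col) || $1 !~ /^[0-9]+$/ { next }
        {
            msg = ""
            if ($(col["r"]) > cpus)
                msg = msg " cpu-saturated(r=" $(col["r"]) ")"
            if ($(col["si"]) > 0 || $(col["so"]) > 0)
                msg = msg " swapping"
            if (msg != "") print "WARN:" msg
        }'
}
```

For example: `vmstat 1 10 | vmstat_flags "$(nproc)"`. No output means neither condition was seen during the sampled interval.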
4.)
mpstat -P ALL 1
This command prints CPU time breakdowns per CPU, which can be used to check for an imbalance. A single hot CPU can be evidence of a single-threaded application.
5.)
pidstat 1
Very similar to top’s per-process summary, but it prints a rolling summary instead of clearing the screen
6.)
iostat -xz 1
r/s, w/s, rkB/s, wkB/s
: These are the delivered reads, writes, read Kbytes,
and write Kbytes per second to the device. Use these for workload
characterization. A performance problem may simply be due to an excessive
load applied.
await
: The average time for the I/O in milliseconds. This is the time that
the application suffers, as it includes both time queued and time being
serviced. Larger than expected average times can be an indicator of device
saturation, or device problems.
avgqu-sz
: The average number of requests issued to the device (shown as aqu-sz in
newer sysstat releases). Values greater than 1 can be evidence of saturation
(although devices can typically operate on requests in parallel, especially
virtual devices which front multiple back-end disks).
%util
: Device utilization. This is really a busy percent, showing the time
each second that the device was doing work. Values greater than 60%
typically lead to poor performance (which should be seen in await),
although it depends on the device. Values close to 100% usually indicate
saturation.
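To pick out busy devices from a long iostat report, something like the following might help. It is a sketch only: `busy_devices` is a hypothetical name, the 60% default simply mirrors the rule of thumb above, and the %util column is found by header name because column order differs between sysstat versions.

```shell
# busy_devices [LIMIT]: read "iostat -xz" output on stdin and print the name
# and %util of every device whose %util is above LIMIT (default 60).
# (Hypothetical helper; assumes the device header row starts with "Device".)
busy_devices() {
    awk -v limit="${1:-60}" '
        # Device header row: locate the %util column and remember the width.
        $1 ~ /^Device/ {
            for (i = 1; i <= NF; i++) if ($i == "%util") u = i
            nf = NF
            next
        }
        # Device rows have the same number of fields as the header row.
        u && NF == nf && $u + 0 > limit { print $1, $u }
    '
}
```

For example: `iostat -xz 1 5 | busy_devices 60`. Treat a hit as a prompt to look at await for that device, not as proof of a problem.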
7.)
free -mh
buffers
: For the buffer cache, used for block device I/O.
cached
: For the page cache, used by file systems.
8.)
sar -n DEV 1
Check the throughput of the interfaces (rxkB/s and txkB/s) as a measure of workload
9.)
sar -n TCP,ETCP 1
active/s
: Number of locally-initiated TCP connections per second (e.g., via connect()).
passive/s
: Number of remotely-initiated TCP connections per second (e.g., via accept()).
retrans/s
: Number of TCP retransmits per second.
Figuring out what process is using which port
sudo netstat -plunt
sudo ss -plunt
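To narrow the listing down to a single port, the output can be filtered. A small sketch, assuming the ss layout where the protocol is the first column and the local address:port is the fifth; `port_owner` is a made-up name, and netstat output would need a different column number.

```shell
# port_owner PORT: read "ss -plunt" output on stdin and print the protocol,
# local address and process column for sockets bound to PORT.
# (Hypothetical helper; assumes local address:port is the fifth column.)
port_owner() {
    awk -v port="$1" '$5 ~ (":" port "$") { print $1, $5, $NF }'
}
```

For example: `sudo ss -plunt | port_owner 22`. Without sudo the process column is usually empty for sockets owned by other users.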
Get RAM size
cat /proc/meminfo
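/proc/meminfo prints a lot more than the RAM size. A minimal sketch that pulls out just the MemTotal line and converts it to MiB (the kernel reports the value in units of 1024 bytes even though the label says kB); `mem_total_mib` is a hypothetical helper name.

```shell
# mem_total_mib: read /proc/meminfo-style text on stdin and print total RAM
# in MiB, truncated to a whole number. (Hypothetical helper name.)
mem_total_mib() {
    awk '$1 == "MemTotal:" { printf "%d MiB\n", $2 / 1024 }'
}
```

Usage: `mem_total_mib < /proc/meminfo`.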
Pin the CPU (keep one core fully busy)
cat /dev/urandom > /dev/null &
Uptime
cat /proc/uptime
9592411.58 9566042.33 # Total seconds up, total seconds idle
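Raw seconds are hard to read at a glance. A short sketch that turns the first field into days, hours and minutes; `uptime_human` is a hypothetical helper name and the idle field is ignored.

```shell
# uptime_human: read one /proc/uptime-style line on stdin and print the
# uptime (first field) as days, hours and minutes. (Hypothetical helper.)
uptime_human() {
    awk '{
        s = int($1)   # drop the fractional seconds
        printf "%dd %dh %dm\n", s / 86400, (s % 86400) / 3600, (s % 3600) / 60
    }'
}
```

Usage: `uptime_human < /proc/uptime`. For the sample line above this gives `111d 0h 33m`.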
HTOP colors
CPU
: Blue: low priority threads (nice > 0). Green: normal priority threads. Red: kernel threads.
Mem
: Green: used memory. Blue: buffers. Orange: cache.