On-Call Playbook for Operating System Checks

Here are the first responder tasks for the basic OS checks performed on most systems.

OS checks depend on ssh

These checks depend on ssh, so when ssh isn’t working the failure shows up as errors in the checks themselves.

An ssh fault eventually spreads to every OS check on the host, so seeing all of them fail together is one way to recognize it.

Troubleshooting ssh

  • Check firewalls and tcp/22 traffic.
  • Check that the nagios_agent account exists on the target host.
  • Check Puppet, which should be managing ssh.
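
A rough shell sketch of the first two checks, for a quick first pass. The function names are hypothetical, and the port probe assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are available on the host you run it from; only tcp/22 and the nagios_agent account name come from this playbook.

```shell
# Hypothetical helpers for first-pass ssh triage; the names are illustrative.

# Succeeds if the named account exists on the local host
# (run this on the target host itself).
account_exists() {
    id -u "$1" >/dev/null 2>&1
}

# Succeeds if tcp/22 on the given host accepts a connection within 5 seconds.
# Assumes bash (for /dev/tcp) and coreutils timeout are installed.
ssh_port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/22"' _ "$1" >/dev/null 2>&1
}

# Example use (not run automatically; TARGET is a placeholder):
#   ssh_port_open "$TARGET" || echo "tcp/22 unreachable: check firewalls"
#   account_exists nagios_agent || echo "nagios_agent account is missing"
```

If the port probe fails from the monitoring host but succeeds from a neighbouring host, a firewall between the two is the likely suspect.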

disk usage

On sysnews

Check Interval   Warning             Critical
20 minutes       free space < 20%    free space < 5%

A host can run short on disk space for many reasons. Two quick fixes: run yum clean all to shrink the yum cache, or trawl /var/log for ancient or unneeded log files.
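
As a minimal sketch of how the thresholds listed above map to check states, and of how to find candidate log files, something like the following could be used. The function names and the 90-day cutoff are assumptions made for illustration; this is not the actual check implementation.

```shell
# Map a free-space percentage to a check state using this playbook's
# thresholds: warning below 20% free, critical below 5% free.
disk_state() {
    free_pct=$1
    if [ "$free_pct" -lt 5 ]; then
        echo CRITICAL
    elif [ "$free_pct" -lt 20 ]; then
        echo WARNING
    else
        echo OK
    fi
}

# Free-space percentage for a mount point, derived from POSIX-format df
# output (column 5 is the percentage used).
free_pct_of() {
    df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print 100 - $5 }'
}

# List log files not modified for more than 90 days. The cutoff is an
# assumption; review the list before deleting anything.
old_logs() {
    find "${1:-/var/log}" -type f -mtime +90 -print
}

# Example use (not run automatically):
#   disk_state "$(free_pct_of /var)"
#   old_logs /var/log
```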

load

On sysnews

Check Interval   Warning   Critical
20 minutes       -w 5      -c 7

Why do we check at these levels? What actions should one take for load problems?

kernel version

On sysnews

Check Interval   Warning                 Critical
1 hour           newer kernel on disk    never

This check should never go critical, but is used to warn us when a newer kernel is available, so we can schedule a reboot for the host.
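
To confirm by hand that a reboot is pending, one option is to compare the running kernel with the newest one installed under /boot. This is only a sketch: the function names are hypothetical, it assumes GNU sort (for -V) and the usual vmlinuz-<version> file naming, and the real check may decide differently.

```shell
# Print the highest version from a newline-separated list on stdin.
# Relies on GNU sort's -V (version sort).
newest_version() {
    sort -V | tail -n 1
}

# Succeeds when the newest installed kernel differs from the running one.
# Arguments: running version, newest installed version.
reboot_pending() {
    [ "$1" != "$2" ]
}

# Kernel versions found under /boot, taken from vmlinuz-<version> file names.
installed_kernels() {
    ls /boot/vmlinuz-* 2>/dev/null | sed 's|^/boot/vmlinuz-||'
}

# Example use (not run automatically):
#   newest=$(installed_kernels | newest_version)
#   reboot_pending "$(uname -r)" "$newest" && echo "reboot needed for $newest"
```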

security updates

On sysnews

Check Interval   Warning                 Critical
30 minutes       Any security updates    never

This advisory check should never be Critical. Clear it by applying security updates. :)

puppet agent

On sysnews

Check Interval   Warning   Critical
20 minutes       yes :)    who knows?

Warns if the puppet agent hasn’t successfully run. How long? Does it go critical? Perhaps. It sure spends a lot of time in UNKNOWN.

ntp timesync

On Sysnews

Check Interval   Warning        Critical
20 minutes       default:50%    default:75%

This nagios check determines the health of NTPd on a system by calculating the overall health of the peers associated with the daemon. This check also verifies other attributes, such as the number of peers available, and whether a peer has been selected to be the sync source. The overall health percentage is a cumulative average of the reach over the peers.

Example: If 3 peers are listed, and 1 of the 3 dropped 2 of the last 8 packets, the health of that peer would be 75%, and the overall health would be about 92% ((100 + 100 + 75) / 3).
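
The arithmetic in this example can be reproduced with a short shell sketch. It assumes that a peer's health is the share of set bits in its octal “reach” value as shown by ntpq (each bit records one of the last eight polls), which matches the 75% figure above but is not confirmed to be how the check computes it. The function names are hypothetical.

```shell
# Health of one peer, in percent, from its octal reach value as printed by
# ntpq: count the set bits (answered polls) out of the last eight.
peer_health() {
    bits=$((0$1))          # leading 0 makes the shell read the value as octal
    count=0
    while [ "$bits" -gt 0 ]; do
        count=$((count + (bits & 1)))
        bits=$((bits >> 1))
    done
    echo $((count * 100 / 8))
}

# Overall health: the average of the per-peer percentages, rounded to the
# nearest whole percent. Arguments: one octal reach value per peer.
overall_health() {
    total=0
    n=0
    for reach in "$@"; do
        total=$((total + $(peer_health "$reach")))
        n=$((n + 1))
    done
    [ "$n" -gt 0 ] || { echo 0; return; }
    echo $(( (total + n / 2) / n ))
}

# Two fully reachable peers (377) and one that missed 2 of 8 polls (363):
#   overall_health 377 377 363    # prints 92
```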

Solving the issue:

1) Put the host into maintenance mode

2) Stop the ntpd daemon

# sudo su -
# systemctl stop ntpd

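3) Force ntpd to sync with its upstream servers. The -g option allows a large initial correction and -q makes ntpd exit once the clock has been set. This can take a little while…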

# /sbin/ntpd -gq

4) Start the ntpd daemon

# systemctl start ntpd

5) Check the status

# /sbin/ntpq -pn

     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 152.1.15.27     152.7.254.50     2 u    2   64    1    0.549  -51.231   1.556
 152.1.15.28     152.7.254.50     2 u    1   64    1    0.663  -50.627   1.538
 152.1.15.140    152.7.254.50     2 u    2   64    1    0.413  -50.643   0.966

You will have to check the status a few times. Every entry in the “st” column should show “2” (meaning “Stratum 2”). If the entries show “16”, the peers are unsynchronized and you’re in a bad state. Give the host a minute or two to self-adjust. If they still show “16”, something more serious is going on and/or you have an NTP configuration problem. Once every entry in the “st” column shows “2”, the host can start catching up. This can take several minutes, and that’s OK.
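
Rather than eyeballing the “st” column on every pass, the test can be scripted. The sketch below is an illustration only: the function name is hypothetical, and it assumes the standard ntpq -pn layout shown above (two header lines, then ten columns per peer with the stratum in the third).

```shell
# Reads `ntpq -pn` output on stdin. Succeeds only if at least one peer is
# listed and no peer reports stratum 16 (unsynchronized).
peers_synced() {
    awk 'NR > 2 && NF >= 10 { n++; if ($3 == 16) bad++ }
         END { exit !(n > 0 && bad == 0) }'
}

# Example use (not run automatically): poll every 30 seconds until synced.
#   until /sbin/ntpq -pn | peers_synced; do sleep 30; done
```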

The “reach” column reflects the host’s confidence in each time server it’s configured to use. It is an octal record of the last eight polls: “1” means only the most recent poll was answered, and “377” means all eight were. Right after restarting ntpd, every “reach” value will be “1” (almost no confidence). As you continue to check the status, the values will keep growing. The higher the “reach” value, the more confidence the host has in that time server.

6) Once the Nagios/Sysnews checks are green, remove the host from maintenance.

Tags: oncall