On-Call Playbook for Operating System Checks
OS checks depend on ssh
These service checks run over ssh, so when ssh is broken the failure shows up as errors in the checks themselves.
An ssh fault should eventually spread to all the OS services, which is one way to recognize it.
- Check firewalls and tcp/22 traffic.
- Check that the `nagios_agent` account exists on the target host
- Check Puppet, which should be managing ssh
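The first two checks above can be scripted. This is a minimal sketch, not the playbook's own tooling: the function names are made up for illustration, and the port probe assumes `bash` and coreutils `timeout` are installed on the machine you run it from.

```shell
# Sketch only: function names are illustrative, not part of any existing tool.

# ssh_port_open HOST
# Succeeds if a TCP connection to port 22 on HOST opens within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device, so bash is invoked explicitly.
ssh_port_open() {
    timeout 3 bash -c 'exec 3<>/dev/tcp/$0/22' "$1" 2>/dev/null
}

# account_exists NAME
# Succeeds if the account NAME is known to the host this runs on.
# Run it on the target host (for example from a console session).
account_exists() {
    id -u "$1" >/dev/null 2>&1
}
```

For example, `ssh_port_open web01.example.com || echo "tcp/22 blocked"` from the Nagios server (the hostname is a placeholder), and `account_exists nagios_agent || echo "account missing"` on the target.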
| Interval | Warning | Critical |
|---|---|---|
| 20 minutes | free space < 20% | free space < 5% |
There are many reasons a host can run short on disk space. Quick fixes are to run `yum clean all` to shrink the yum cache, or to trawl `/var/log` for ancient or unneeded log files.
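To see where a host stands against the thresholds before cleaning up, something like the following can help. It is a sketch under assumptions: the function names are hypothetical, the 20% and 5% cut-offs are taken from the table above, and the 90-day default age for "ancient" logs is an arbitrary choice.

```shell
# classify_free PCT
# Map a free-space percentage to a Nagios-style state using the
# thresholds from the table (warning below 20%, critical below 5%).
classify_free() {
    if [ "$1" -lt 5 ]; then
        echo CRITICAL
    elif [ "$1" -lt 20 ]; then
        echo WARNING
    else
        echo OK
    fi
}

# free_pct MOUNTPOINT
# Free space as an integer percentage, derived from the Capacity
# column of POSIX-format df output.
free_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# old_logs [DAYS]
# List the 20 largest files under /var/log not modified in DAYS days
# (default 90, an arbitrary choice), sizes in KiB.
old_logs() {
    find /var/log -type f -mtime +"${1:-90}" -exec du -k {} + | sort -rn | head -n 20
}
```

Usage: `classify_free "$(free_pct /var)"` prints the state for `/var`; `old_logs 180` lists candidates for removal without deleting anything.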
| Interval | Warning | Critical |
|---|---|---|
| 20 minutes | -w 5 | -c 7 |
Why do we check at these levels? What actions should one take for load problems?
| Interval | Warning | Critical |
|---|---|---|
| 1 hour | newer kernel on disk | never |
This check should never go critical, but is used to warn us when a newer kernel is available, so we can schedule a reboot for the host.
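How the check itself detects a newer kernel is not documented here. One way to confirm the warning by hand is to compare the running kernel with the newest one under `/boot`. The sketch below assumes kernels are installed as `/boot/vmlinuz-<version>` and that GNU `sort -V` is available; the function names are illustrative.

```shell
# newest_version
# Read version strings on stdin, print the highest one.
# Uses GNU sort's version ordering (-V).
newest_version() {
    sort -V | tail -n 1
}

# installed_kernels
# Kernel versions present on disk, assuming /boot/vmlinuz-<version> naming.
installed_kernels() {
    ls /boot/vmlinuz-* 2>/dev/null | sed 's|^/boot/vmlinuz-||'
}

# kernel_state RUNNING NEWEST
# WARNING when the running kernel is not the newest on disk; never critical,
# mirroring the advisory nature of the check.
kernel_state() {
    if [ "$1" = "$2" ]; then
        echo OK
    else
        echo WARNING
    fi
}
```

Usage: `kernel_state "$(uname -r)" "$(installed_kernels | newest_version)"`.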
| Interval | Warning | Critical |
|---|---|---|
| 30 minutes | Any security updates | never |
This advisory check should never be Critical. Clear it by applying security updates. :)
| Interval | Warning | Critical |
|---|---|---|
| 20 minutes | yes :) | who knows? |
Warns if the puppet agent hasn’t successfully run. How long? Does it go critical? Perhaps. It sure spends a lot of time in UNKNOWN.
This Nagios check determines the health of ntpd on a system from the health of the peers associated with the daemon. It also verifies other attributes, such as the number of peers available and whether a peer has been selected as the sync source. The overall health percentage is the average, across all peers, of each peer's reach (the share of recent packets that arrived).
Example: If 3 peers are listed, and 1 of the 3 dropped 2 of the last 8 packets, the health of that peer would be 75%, and the overall health would be about 92% ((100 + 100 + 75) / 3).
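The arithmetic in the example can be reproduced from the `reach` column of `ntpq -pn`, which is an octal bitmask of the last eight polls. The sketch below assumes the check simply counts the successful polls in that mask; the plugin's real implementation may differ, and the function names are illustrative.

```shell
# reach_pct REACH
# REACH is the octal value from ntpq's "reach" column (0 to 377).
# Prints the percentage of the last 8 polls that succeeded.
reach_pct() {
    val=$((0$1))    # the leading 0 makes the shell read the value as octal
    bits=0
    while [ "$val" -gt 0 ]; do
        bits=$((bits + (val & 1)))
        val=$((val >> 1))
    done
    echo $((bits * 100 / 8))
}

# overall_health
# Read one per-peer percentage per line on stdin, print the rounded average.
overall_health() {
    awk '{ sum += $1 } END { if (NR) printf "%.0f\n", sum / NR }'
}
```

A peer that dropped 2 of its last 8 packets has a reach such as 317, so `reach_pct 317` prints 75, and `printf '100\n100\n75\n' | overall_health` prints 92, matching the example.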
##### Solving the issue:
1) Put the host into maintenance mode
2) Stop the ntpd daemon
# sudo su -
# systemctl stop ntpd
3) Force ntpd to sync upstream. This can take a little while.
# /sbin/ntpd -gq
4) Start the ntpd daemon
# systemctl start ntpd
5) Check the status
# /sbin/ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 22.214.171.124  126.96.36.199    2 u    2   64    1    0.549  -51.231   1.556
 188.8.131.52    184.108.40.206   2 u    1   64    1    0.663  -50.627   1.538
 220.127.116.11  18.104.22.168    2 u    2   64    1    0.413  -50.643   0.966
You will have to check the status a few times. Every value in the “st” column should be “2” (meaning “Stratum 2”). A “16” means that peer is unsynchronized and you’re in a bad state. Give the host a minute or two to self-adjust. If the column still shows “16”, something more serious is going on or you have an NTP configuration problem. Once the “st” column shows all “2”s, the host can start catching up. This can take several minutes, and that’s OK.
The “reach” column is the host’s confidence level in each time server it’s configured to use. It is an octal record of the last eight polls, so it tops out at 377 when all eight succeeded. Right after the restart every peer shows “1” (one successful poll, so almost no confidence). As you continue to check the status, the values in the “reach” column will grow. The higher the “reach” value, the more confidence the host has in the server reporting the time to it.
6) Once the Nagios/Sysnews checks are green, remove the host from maintenance.
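The repeated status checks in step 5 can be reduced to a one-word answer by filtering the `ntpq -pn` output. This is a sketch with made-up function names; it assumes the two header lines and the column order shown in step 5.

```shell
# unsynced_peers
# Read `ntpq -pn` output on stdin and print every peer whose
# stratum ("st", third column) is 16.
unsynced_peers() {
    awk 'NR > 2 && $3 == 16 { print $1 }'
}

# sync_state
# Read `ntpq -pn` output on stdin. Prints OK when at least one peer is
# listed and none is at stratum 16, otherwise UNSYNCED.
sync_state() {
    awk 'NR > 2 { peers++; if ($3 == 16) bad++ }
         END { if (peers > 0 && bad == 0) print "OK"; else print "UNSYNCED" }'
}
```

Usage: `/sbin/ntpq -pn | sync_state`, repeated every minute or so until it prints OK. It says nothing about “reach”, which still needs time to climb afterwards.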