On-Call Playbook for Legacy Nagios (uniXXsm)

Nagios is the engine on Sysnews that drives the "field of green."

Ecosystem

The SysNews Server Status page is affectionately known as the “Field of Green”. This page shows all hosts in the nagios_ng_hosts table with the Source tag set to Server, VirtualIP or Hostless. Printers, Kiosks, LabMachines and ClassTech hosts do not appear here.

There are two servers, uni02sm and uni03sm, in the 10.x networks performing service checks, one in each data center. These are behind the Comtech Cisco firewalls. The service checks for all monitored hosts on campus are split between these two hosts dynamically. If either is down, the other takes over responsibility for all checks. The check results are passed between all Nagios nodes, so each knows the state of every host and service. All of this load balancing is done via the Merlin NEB.

The other Nagios servers, uni04sm and uni05sm, are in VLAN 30, also split between DC1 and DC2. These servers have GSM modems attached to them, and perform automatic notification work, as well as database, graphing and front-end display work for the Nagios system. They run Merlin to talk to the hosts behind the firewalls and share check results. Merlin stores the status and performance information in a local MySQL instance on each Nagios node.

Service Windows

Monitoring needs to be up 24x7x365.

You should post an outage notification:

  • If both uni02sm and uni03sm are down at the same time
  • If both uni04sm and uni05sm are down at the same time
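The rule above can be sketched as a small shell helper. This is only an illustration: the function names and the PROBE variable are hypothetical, and using ping as the definition of "down" is an assumption (you may prefer to check that the nagios daemon itself is running).

```shell
# Hypothetical probe command; override PROBE to match your own definition of "down".
PROBE=${PROBE:-"ping -c 1 -W 2"}

# pair_is_down HOST...: exit 0 only if every listed host fails the probe.
pair_is_down() {
    local host
    for host in "$@"; do
        # One reachable host means the pair is still providing service.
        if $PROBE "$host" >/dev/null 2>&1; then
            return 1
        fi
    done
    return 0
}

# needs_outage_post: exit 0 if either pair is completely down.
needs_outage_post() {
    pair_is_down uni02sm uni03sm || pair_is_down uni04sm uni05sm
}
```

For example, `needs_outage_post && echo "post an outage notification"`.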

Firewall restrictions

We have worked with ComTech to generally allow UDP/161 and UDP/1161 as well as TCP/22 (ssh), TCP/24 (unity ssh) and TCP/5666 (nrpe) from the back-end Nagios servers (uni02sm and uni03sm) to any other server protected by a Cisco FWSM firewall.

These are the ports for SNMP, SSH and NRPE, which represent most of the ‘privileged’ ports needed for host status information, beyond ‘is the service responding properly’. (For example, disk space checks, process count, and CPU load averages are done over SNMP or NRPE).
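If you suspect a firewall rule is missing for a host, a quick reachability test from uni02sm or uni03sm can narrow things down. The sketch below is an assumption-laden example, not an official tool: the function names are hypothetical, it needs bash and the coreutils timeout command, and it only covers the TCP ports (the UDP SNMP ports cannot be tested this way).

```shell
# TCP ports from the firewall rule: ssh, unity ssh, nrpe.
TCP_PORTS="22 24 5666"

# check_tcp HOST PORT: exit 0 if a TCP connection opens within 3 seconds.
# Uses bash's built-in /dev/tcp pseudo-device, so no extra tools are required.
check_tcp() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# report_ports HOST: print one status line per TCP port.
report_ports() {
    local host=$1 port
    for port in $TCP_PORTS; do
        if check_tcp "$host" "$port"; then
            echo "$host tcp/$port open"
        else
            echo "$host tcp/$port unreachable"
        fi
    done
}
```

Run it as `report_ports some-host.ncsu.edu`, where the host name is a placeholder. An "unreachable" result can also mean the service is simply not listening, so check the target host before blaming the firewall.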

The majority of the NRPE checks are targeted at Windows hosts, and similarly, the majority of the SNMP checks are targeted at UNIX hosts.

Tests

Operating System checks

Drive space “volume”

On sysnews

| Check Interval | Warning | Critical
| -------------- | ------------------ | ---------------
| 15 minutes | ? | ?

Separate checks are done for /, /var, and /local

If /var starts to fill on uni0[23]sm, check the pnp4nagios setup, which processes the check result data and formats it with rrdtool.

Files to watch are

  • /var/pnp4nagios/host-perfdata
  • /var/pnp4nagios/service-perfdata

Warning: /var is only 4 GB in size and these logs grow fast.
Note: /local has more than 50 GB free if you’re in a jam.
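To see quickly how large the two perfdata files have grown, something like the following may help. The function name is hypothetical; the default directory is the one listed above.

```shell
# perfdata_sizes [DIR]: print the size in bytes of each perfdata file in DIR.
# DIR defaults to /var/pnp4nagios.
perfdata_sizes() {
    local dir=${1:-/var/pnp4nagios} f
    for f in host-perfdata service-perfdata; do
        if [ -f "$dir/$f" ]; then
            # wc -c counts bytes; $(( )) strips any padding some wc versions add.
            echo "$f $(( $(wc -c < "$dir/$f") )) bytes"
        else
            echo "$f missing"
        fi
    done
}
```

Pair it with `df -h /var` to see how much room is left on the filesystem.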

In /etc/nagios/nagios.cfg the “service_perfdata_file_processing_command” is used to run the nagios command “process-service-perfdata-file” every 15 seconds (!)

The actual commands are in /etc/nagios/objects-static/perfdata_commands.cfg; they basically /bin/mv the two files to /afs/unity/adm/monitor/perfdata-spool/$hostname/
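To confirm what a node is actually configured to do, you can pull the perfdata directives out of nagios.cfg. This is a minimal sketch with a hypothetical function name; it relies on the standard Nagios directive prefixes host_perfdata_file and service_perfdata_file.

```shell
# show_perfdata_settings [FILE]: print the active (uncommented) perfdata file
# directives, e.g. the *_processing_command and *_processing_interval lines.
show_perfdata_settings() {
    grep -E '^[[:space:]]*(host|service)_perfdata_file' "${1:-/etc/nagios/nagios.cfg}"
}
```

The 15 second interval mentioned above should appear as service_perfdata_file_processing_interval in the output.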

A separate process, NPCD, launches the pnp4nagios command, which is /usr/lib64/nagios/plugins/process_perfdata.pl if you’re at all curious.

This process can fail, even if there is plenty of afs space, if the number of directory entries gets too high.
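A rough way to check whether the spool directory is approaching that limit is to count its entries. The function name below is hypothetical, and what counts as "too high" is left to you, since the exact AFS limit depends on file name lengths.

```shell
# count_entries DIR: print the number of entries directly inside DIR.
count_entries() {
    # -mindepth 1 skips DIR itself; -maxdepth 1 keeps the count non-recursive.
    find "$1" -mindepth 1 -maxdepth 1 | wc -l
}
```

For example: `count_entries /afs/unity/adm/monitor/perfdata-spool/$(hostname --short)`.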

As a quick fix, on the host in question:

/bin/mv /var/pnp4nagios/service-perfdata \
  /afs/unity/adm/monitor/perfdata-spool/$(hostname --short)/service-perfdata.$(date +%s)
/bin/mv /var/pnp4nagios/host-perfdata \
  /afs/unity/adm/monitor/perfdata-spool/$(hostname --short)/host-perfdata.$(date +%s)

http

On sysnews

| Check Interval | Warning | Critical
| -------------- | ------------------ | ---------------
| --- | --- | ---

httpd pcount

On sysnews

| Check Interval | Warning | Critical
| -------------- | ------------------ | ---------------
| --- | --- | ---

load

On sysnews

| Check Interval | Warning | Critical
| -------------- | ------------------ | ---------------
| --- | --- | ---

memory usage

On sysnews

| Check Interval | Warning | Critical
| -------------- | ------------------ | ---------------
| --- | --- | ---

merlin pcount

On sysnews

| Check Interval | Warning | Critical
| -------------- | ------------------ | ---------------
| --- | --- | ---

mysql

On sysnews

| Check Interval | Warning | Critical
| -------------- | ------------------ | ---------------
| --- | --- | ---

nagios pcount

On sysnews

| Check Interval | Warning | Critical
| -------------- | ------------------ | ---------------
| --- | --- | ---

Counts the number of nagios processes on each server. At any given time, there should be only one. You can troubleshoot this a bit with /afs/unity/adm/monitor/bin/merlin-healthcheck.pl

Here’s an example of both uni04sm and uni05sm being dead (no nagios)

unity% /afs/unity/adm/monitor/bin/merlin-healthcheck.pl 
Merlin    Check Count     Percentages       Health
Nodename  (04sm/05sm)     (04sm/05sm)     (04sm/05sm) 
--------  -----------   ---------------  ------------- 
 uni02sm   4605  4605   %49.752 %49.752    OK   OK
 uni03sm   4605  4605   %49.752 %49.752    OK   OK
 uni04sm     23    23   % 0.248 % 0.248  DEAD DEAD
 uni05sm     23    23   % 0.248 % 0.248  DEAD DEAD

Note that most (possibly all) of the checks automagically migrated to the alive hosts. This is why this check reports WARNING rather than CRITICAL.
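To pull just the dead node names out of the health check output, a one-line awk filter is enough. The function name is hypothetical, and the filter assumes the column layout shown in the example above.

```shell
# dead_nodes: read merlin-healthcheck.pl output on stdin and print the
# node name (first column) of every row that reports DEAD.
dead_nodes() {
    awk '/DEAD/ { print $1 }'
}
```

Usage: `/afs/unity/adm/monitor/bin/merlin-healthcheck.pl | dead_nodes`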

To fix things, ssh to each host with a bad pcount, and run

sudo /sbin/service nagios restart

As noted under Nagios Quirks, it can take a while for this error to clear and propagate.

If nagios fails to restart, you may have a bad configuration file. You can see what the problem is by running

/usr/sbin/nagios -v /etc/nagios/nagios.cfg
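If you want to avoid restarting into a broken configuration in the first place, the check and the restart can be chained. This is a sketch only: the function and variable names are hypothetical, and the defaults simply reuse the paths and commands shown above.

```shell
NAGIOS_BIN=${NAGIOS_BIN:-/usr/sbin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/etc/nagios/nagios.cfg}
RESTART_CMD=${RESTART_CMD:-"sudo /sbin/service nagios restart"}

# safe_restart: run the config check and restart only if it passes.
# The checker's own output is left visible so you can see what failed.
safe_restart() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        $RESTART_CMD
    else
        echo "config check failed; not restarting" >&2
        return 1
    fi
}
```

Run `safe_restart` on the affected node instead of calling the restart directly.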

If Nagios is failing to start due to a NULL record, you need to do two things to fix this:

1) Edit /etc/nagios/objects-dynamic/hosts.cfg and set the host_name field of the bad record to something like ‘temp-host-name’ and then restart nagios as shown above. This will allow Nagios to restart, but it will eventually fail again because the config file is generated from data stored in MySQL, so the record needs to be fixed there as well.

2) To repair the bad record in MySQL:

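# mysql -h sysnews.ncsu.edu -u statusadmin -p
(enter the statusadmin password from KeePass at the prompt; a value given after "-p " with a space is treated as a database name, not a password)
mysql> use status_db;
mysql> select hostid from nagios_ng_hosts where hostname = '';
mysql> update nagios_ng_hosts set hostname = 'temp-host-name' where hostid = $HOST_ID_FROM_ABOVE;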
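For step 1, finding the bad record in a large hosts.cfg by eye can be tedious. The sketch below assumes the NULL record shows up as a host_name directive with no value, which may not match every failure; the function name is hypothetical.

```shell
# find_empty_host_names [FILE]: print "LINE:text" for each host_name
# directive that has nothing after it.
find_empty_host_names() {
    grep -nE '^[[:space:]]*host_name[[:space:]]*$' "${1:-/etc/nagios/objects-dynamic/hosts.cfg}"
}
```

If it prints nothing, fall back to the line number reported by the nagios -v check.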

npcd pcount

On sysnews

| Check Interval | Warning | Critical
| -------------- | ------------------ | ---------------
| --- | --- | ---

SMS gateway

On sysnews

| Check Interval | Warning | Critical
| -------------- | ------------------ | ---------------
| --- | --- | ---

SMS message queue

On sysnews

| Check Interval | Warning | Critical
| -------------- | ------------------ | ---------------
| --- | --- | ---

ssh

On sysnews

| Check Interval | Warning | Critical
| -------------- | ------------------ | ---------------
| --- | --- | ---

First Actions

Always deal with underlying OS issues first.

If the problem is unresolved

Troubleshoot and fix the problem. No whining.

Posting boilerplate

Please fill in as many technical details as possible and appropriate in this Sysnews boilerplate post:

On XXXX there were issues on the monitoring server YYYY.

This server is a member of a cluster of servers that provide the [Sysnews service monitoring pages](https://sysnews.ncsu.edu/ss) used by campus system administrators to monitor computer services for proper operation.

We apologize if you received spurious alerts that computer systems were down, or if you were unable to retrieve system status, due to these issues.

Further details, if they are available, accompany this post under its technical details section.

Tags: oncall