On-Call Playbook for role::oit_linux::tftpserver hosts

This role manages a "standard" tftp server controlled by a Foreman "Smart Proxy".

Ecosystem

In a network boot, the DHCP servers (operated by Comtech) tell client computers which server(s) they should download their boot files from.

The tftp servers (this role) deliver a kernel and RAM disk used to install the operating system.

When the tftp server is unavailable, clients attempting to boot off the network will be unable to download their installers, and will eventually time out.

Service Windows

This role should be available 24x7x365, but the consequences of short unexpected downtime are trivial (delays in installing new machines).

You should post an outage notification if:

  • the total downtime was more than 5 minutes during the production day, or
  • the total downtime was more than 15 minutes during off-hours.
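
The thresholds above can be written down as a small decision helper. This is only a sketch: the function name and its arguments are hypothetical, and it covers nothing beyond the two thresholds listed here.

```python
def should_post_outage(downtime_minutes: float, production_day: bool) -> bool:
    """Return True when the downtime exceeds the threshold for its time window.

    Hypothetical helper: 5 minutes during the production day,
    15 minutes during off-hours, as listed in this playbook.
    """
    limit_minutes = 5 if production_day else 15
    # "more than" in the playbook means strictly greater than the limit.
    return downtime_minutes > limit_minutes
```

For example, a 10 minute outage calls for a notification during the production day but not during off-hours.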

Firewall restrictions

These hosts are on VLAN 30, and should be accessible via ssh on port 22 from anywhere on or off campus, without requiring VPN.
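
A quick way to confirm the ssh rule from wherever you happen to be is a plain TCP connect to port 22. The sketch below assumes nothing about the host beyond its name; port_open is a hypothetical helper, and a successful connect only shows that the port is reachable, not that sshd is healthy.

```python
import socket


def port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run it once from on campus and once from off campus (without VPN) to check both halves of the rule.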

Tests

Operating System checks

OS checks test general operating system health.

foreman smart proxy process

On sysnews

Check Interval    Warning    Critical
20 minutes        Never      No proxy process

This checks to see if the foreman smart proxy is currently running on the host. It’s a pretty dumb check right now.
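
To reproduce the idea of this check by hand on the host, you can scan the process table for the proxy. The sketch below is not the sysnews check itself. It assumes a Linux /proc filesystem and that the proxy's command line contains the string "smart-proxy"; both the match string and the function names are assumptions.

```python
from pathlib import Path


def read_cmdlines(proc_root: str = "/proc") -> list[str]:
    """Collect the command line of every process visible under proc_root."""
    cmdlines = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric directories are processes
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # process exited or is not readable
        if raw:
            # Arguments are NUL-separated in /proc/<pid>/cmdline.
            cmdlines.append(raw.replace(b"\0", b" ").decode("utf-8", "replace").strip())
    return cmdlines


def proxy_running(cmdlines: list[str], needle: str = "smart-proxy") -> bool:
    """Return True if any command line contains needle (an assumed match string)."""
    return any(needle in cmdline for cmdline in cmdlines)
```

On the host, proxy_running(read_cmdlines()) returning False corresponds to the "No proxy process" critical state.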

tftp

On sysnews

Check Interval    Warning    Critical
20 minutes        Never      tftp connection failed

This checks whether the sysnews server can contact the tftp server at all, by requesting a nonexistent file and checking for the negative answer from the server.
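
The same probe can be run by hand when you want to rule out sysnews itself. The sketch below is not the sysnews check; it is a minimal illustration using the packet layout from RFC 1350 (opcode 1 is a read request, opcode 5 is an error reply, error code 1 means "File not found"). The probe filename, timeout, and function names are assumptions.

```python
import socket
import struct

OP_RRQ = 1    # read request
OP_ERROR = 5  # error reply


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP read request: opcode, filename, NUL, mode, NUL."""
    return (struct.pack("!H", OP_RRQ)
            + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")


def parse_error(packet: bytes):
    """Return (error_code, message) for an ERROR packet, else None."""
    if len(packet) < 4:
        return None
    opcode, code = struct.unpack("!HH", packet[:4])
    if opcode != OP_ERROR:
        return None
    message = packet[4:].split(b"\0", 1)[0].decode("ascii", "replace")
    return code, message


def tftp_alive(host: str, port: int = 69,
               filename: str = "no-such-file-probe",  # assumed bogus name
               timeout: float = 5.0) -> bool:
    """Ask for a bogus file; an ERROR reply means the server is answering."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_rrq(filename), (host, port))
            packet, _ = sock.recvfrom(516)
        except OSError:
            return False  # timeout, unreachable host, or resolution failure
    return parse_error(packet) is not None
```

A False result from a machine that can otherwise reach the host suggests the tftp service, or UDP port 69 on the path to it, is the problem rather than sysnews.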

First Actions

Always deal with underlying OS issues first. First responders should ssh to the host and reboot it with sudo /sbin/reboot.

When the host comes back up, “Force an immediate status check” on sysnews, then log in to the Foreman server (build.oit.ncsu.edu) and confirm that the “TFTP on …” entry under the Infrastructure -> Smart Proxies main menu is not showing errors.

If the problem is unresolved

Troubleshoot and fix the problem. If you need to, run puppet agent --disable and fix it by hand. No whining.

Posting boilerplate

Please fill out as many technical details as possible and appropriate in this Sysnews Boilerplate post for tftpservers.

Tags: oncall