Monitoring FAQ

Frequently asked questions about the Zabbix monitoring system

Zabbix @ NC State Monitoring Q&A

Onboarding

What is the plan to onboard systems that are currently monitored in Nagios?

Individual systems will be monitored by both Nagios and Zabbix until they are ready to be managed only in the new system. Templates are the primary mechanism for adding checks to monitored objects.

Can we use existing NRPE checks that are deployed on systems, or will all checks need to be redone using the Zabbix agent on all production systems?

  • The Zabbix agent can execute the NRPE script and pull in the data.
  • A local agent will run on any O/S supported by Zabbix.
  • The agent uses one or more templates for checks on the O/S, applications, SSL certs, etc.
  • We are currently using active checks.
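As a rough illustration of reusing an existing NRPE-style script, the sketch below turns a Nagios plugin result (exit code plus one output line) into a state name and numeric performance values that a Zabbix item could store. The function name and the sample output are hypothetical, quoted perfdata labels containing spaces are not handled, and hooking the script up to the agent is left out.

```python
import re

# Nagios plugin exit codes are standardized: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def parse_plugin_result(exit_code, output):
    """Split a plugin result into state, message, and perfdata values."""
    state = STATES.get(exit_code, "UNKNOWN")
    # Everything after the first "|" is performance data: label=value[UOM];warn;crit;min;max
    message, _, perf = output.partition("|")
    metrics = {}
    for token in perf.split():
        label, sep, rest = token.partition("=")
        match = _NUMBER.match(rest)  # keep the leading number, drop unit and thresholds
        if sep and match:
            metrics[label.strip("'")] = float(match.group())
    return {"state": state, "message": message.strip(), "metrics": metrics}

# Hypothetical output from a disk check script.
result = parse_plugin_result(
    1, "DISK WARNING - /var at 81% | used=81%;80;90 free_mb=1900MB;;;0"
)
print(result["state"], result["metrics"])
```

Keeping the parsing separate from the script execution means the same function can be tried against saved plugin output before any agent is configured.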

How do we get machines added to monitoring?

There are multiple ways:

  1. Scan Networks: process intensive. Not used for servers right now.
  2. When a Zabbix agent can be installed, we can use metadata to set up groups to drop hosts into the correct space in Zabbix at build / ingest time. Any server can be a member of one or more host groups, which can be added during discovery.
  3. Adding a host manually

For DBs, a user will be needed with rights to get data out of the database beyond what the agent can access in the O/S.
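For the manual route, a host can also be added through the Zabbix JSON-RPC API instead of the web interface. The sketch below only builds the request body for the `host.create` method and does not send it. The hostname, IP address, group ID, and template ID are placeholders, and authentication is omitted because it differs between Zabbix versions.

```python
import json

def build_host_create(hostname, ip, group_ids, template_ids, request_id=1):
    """Build a JSON-RPC 2.0 body for the Zabbix API method host.create."""
    return {
        "jsonrpc": "2.0",
        "method": "host.create",
        "params": {
            "host": hostname,
            "interfaces": [{
                "type": 1,        # 1 = Zabbix agent interface
                "main": 1,        # default interface of this type
                "useip": 1,       # connect by IP rather than DNS name
                "ip": ip,
                "dns": "",
                "port": "10050",  # agent listening port
            }],
            "groups": [{"groupid": g} for g in group_ids],
            "templates": [{"templateid": t} for t in template_ids],
        },
        "id": request_id,
    }

# Placeholder values; real group and template IDs come from the Zabbix instance.
payload = build_host_create("test-host.example.edu", "192.0.2.10", ["12"], ["10001"])
print(json.dumps(payload, indent=2))
```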

What are the next steps to get some Windows test hosts in Zabbix?

  • Timeline for testing is now.
  • Need a set of test hosts where we can look at metrics to tune templates and triggers. Azure server testing?
  • CSI will send WMS the discovery information and the stanzas necessary for the Windows agent to work.
  • Request for an MS SQL metrics template as well. https://www.zabbix.com/integrations/microsoft_sql
  • Need to look at standard templates available on the Zabbix site.

The stock Windows agent discovery rules are available from the Zabbix repository: https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/os/windows_agent?at=refs%2Fheads%2Frelease%2F4.4

Monitoring

How does Zabbix handle agents and checks?

  • Hybrid from Nagios where an agent exists on a system that is associated with triggers on the master / server side.
  • Agent can work in passive or active mode
    • Passive: Server reaches out and polls for checks
    • Active: Agent pulls down the checks configured for it, packages up the results, and pushes the data to the server. This is a 1.5 to 2 times performance increase versus passive checks.

What methods does monitoring use to check?

  • Agent checks - The Zabbix agent communicates on ports 10050 and 10051 using a JSON-based protocol.
    • 10050 is the port the agent listens on for passive checks (the server polls the agent)
    • 10051 is the port the server listens on for active checks (the agent pushes data to the server)
  • SNMP checks - Available for network or other devices where an agent cannot be installed.
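To make the JSON protocol a little more concrete, the sketch below frames and unframes a message using the Zabbix wire header as we understand it: the bytes `ZBXD`, a one-byte flag, a 4-byte little-endian data length, and 4 reserved bytes, followed by the JSON body. The hostname is a placeholder, and compression and the large-packet flag are not covered.

```python
import json
import struct

HEADER = b"ZBXD"
FLAG_ZABBIX_PROTOCOL = 0x01
HEADER_LEN = 13  # 4 (ZBXD) + 1 (flags) + 4 (data length) + 4 (reserved)

def frame(payload):
    """Wrap a JSON-serializable payload in the Zabbix protocol header."""
    data = json.dumps(payload).encode("utf-8")
    return HEADER + struct.pack("<BII", FLAG_ZABBIX_PROTOCOL, len(data), 0) + data

def unframe(packet):
    """Validate the header and return the decoded JSON body."""
    if packet[:4] != HEADER:
        raise ValueError("not a Zabbix packet")
    _flags, length, _reserved = struct.unpack("<BII", packet[4:HEADER_LEN])
    return json.loads(packet[HEADER_LEN:HEADER_LEN + length].decode("utf-8"))

# The request an agent in active mode sends to ask for its list of checks.
request = {"request": "active checks", "host": "test-host.example.edu"}
packet = frame(request)
print(len(packet), unframe(packet))
```

In active mode the agent would send a packet like this to the server on port 10051; in passive mode the server opens the connection to the agent on port 10050.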

How do we add checks?

By making changes to the parent template or individual hosts as necessary

Can the agent unregister itself?

ANSWER GOES HERE

Basically, what can the agent do?

  • The agent is active and pushes data to the Zabbix server
  • Active mode collects more information with less load on the Zabbix server than passive (server pull) collection of data

Paging

Is there a sane way to solve the “only page me from 8-5 on this subset of systems” for Dev/QA/etc environments? The current environment is dreadful at that.

Yes

How does Zabbix handle Dev / QA servers paging / notifying during work hours, but not during off hours?

  • Flexible maintenance mode system - like the Nagios downtime editor, but better
  • One-time / daily / weekly, etc. scheduled maintenance periods
  • Hosts will go into maintenance for the designated period, then return to a fully monitored state without user input after initial setup.
  • Metrics are gathered while in maintenance
  • Hosts can be assigned in bulk by Host Group
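The effect of a recurring off-hours maintenance period can be pictured as a simple time-window test. The sketch below is plain Python for illustration, not how Zabbix implements maintenance. The 8-5 window comes from the question above, limiting it to weekdays is an assumption, and the function name is hypothetical.

```python
from datetime import datetime, time

WORK_START = time(8, 0)   # 08:00 local time
WORK_END = time(17, 0)    # 17:00 local time

def should_page(when):
    """Return True only on weekdays between 08:00 and 17:00 local time."""
    is_weekday = when.weekday() < 5  # Monday=0 ... Friday=4 (assumed work week)
    return is_weekday and WORK_START <= when.time() < WORK_END

print(should_page(datetime(2024, 1, 8, 9, 0)))   # Monday morning
print(should_page(datetime(2024, 1, 8, 18, 0)))  # Monday evening
print(should_page(datetime(2024, 1, 6, 9, 0)))   # Saturday morning
```

A maintenance period covering the complement of this window, applied to the Dev / QA host group, would suppress off-hours pages while data collection continues.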

Will there be customized paging groups in Zabbix and/or Iris?

Custom paging groups are defined in OnCall and are associated with corresponding groups in Iris and Zabbix

Can this be tailored per group to see info on all paging groups they are a part of?

Paging event reporting will most likely come out of IRIS; IRIS is the middleman for accepting events and pushing data out to the correct on call group.

What is the product for paging / notification?

IRIS/ONCALL

How are hosts handled for immediate maintenance?

  • This is an ACK when directly working in the Zabbix interface.
  • ACKs can handle putting the problem item in maintenance while other checks can continue.
  • Whole host/object maintenance like Sysnews is also available.
  • Logic is being worked out as we add more groups and see real world ‘behavior’ of monitored objects.

Will the Operators (OPS) still be involved with Monitoring?

Meeting scheduled with Comtech, Dana on the role of OPS going forward.

What does a notification look like?

Variable. Text to voice, with a configurable verbosity level. Can also be pure text using SMS. Email is being worked on now.

How are templates applied to servers / objects?

  • Zabbix uses ‘Host Groups’; a single template or multiple templates can be applied per host group.
  • Custom template groups can be cloned from the base template to customize checks, thresholds, and triggers / notifications.

Authentication/Permissions/Groups

Is Zabbix auth via LDAP or SAML? If LDAP, is it .admin or unity accounts?

  • Frontend is Shibboleth
  • The same credentials ‘should’ work for API via passthrough.
  • Zabbix listening on port 8080 at 127.0.0.1
  • API does something else - local auth or LDAP?

Is an API account needed per group? Can you store a new generic API account in LDAP?

API access is through a special API account

Can we get accounts for work?

  • Customers will be able to test / deploy on QA.
  • We are using a local account for now while cleaning up the RBAC pieces, i.e. loading users / groups out of systools.

Where does the hash data get generated?

  • Literally it is a ‘random’ character string with a 256 character limit.
  • Ex. Puppet is using a variable for Kernel info.

Certificates

How is SSL CERT monitoring accomplished?

There is a template for that from Zabbix we have not yet implemented. This will replace the SSL CERT tool if successful.

Can we specify the URI for cert checking? SNI for web servers makes cert checking weird. One machine != one ssl cert on most web servers. What about non-443 ports / protocols? (e.g. ldaps on 636)

We will likely write an SSL cert checker on the Zabbix server itself or port the existing checker from SysTools.
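As a minimal sketch of what such a checker might do, the code below uses Python's standard `ssl` module to connect to an arbitrary host and port, send a chosen SNI name, and report the days left on the certificate. It is an illustration under assumptions, not the planned checker: the hostnames are placeholders, STARTTLS protocols are not covered, and with the default context a certificate that has already expired or fails validation raises an error instead of returning a date.

```python
import socket
import ssl
import time

def fetch_not_after(host, port=443, server_name=None, timeout=5.0):
    """Connect with TLS and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # server_hostname sets the SNI value, so each site name on a shared
        # web server can be checked separately.
        with context.wrap_socket(sock, server_hostname=server_name or host) as tls:
            return tls.getpeercert()["notAfter"]

def days_remaining(not_after, now=None):
    """Convert a notAfter string (e.g. 'Jan  5 09:34:43 2018 GMT') to days left."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400

# Example calls (need network access; hostnames are hypothetical):
#   days_remaining(fetch_not_after("192.0.2.10", 443, server_name="www.example.edu"))
#   days_remaining(fetch_not_after("ldap.example.edu", 636))
```

Because the port and SNI name are separate parameters, the same function covers several certificates on one machine as well as non-443 services such as ldaps on 636.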

Misc

Do metadata changes reclassify after the fact?

We think so. May have to do some group management manually, but the templates should apply correctly. Need to double-check.

What are the expectations for data retention?

  • One month is a requirement. Three to six months might be a good ‘sweet spot’.
  • May keep trending data for a year, but only keep raw data for several months.
  • FYI: in Nagios graphs / RRDs we currently have this granularity
    • 4 hour graph = 1 min avg
    • 25 hours graph = 1.2? min avg
    • 1 week graph = 5 min avg
    • 1 month graph = 5 min avg
    • 1 year graph = 30 min avg
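To get a feel for what these retention choices mean per item, the arithmetic below counts stored rows for raw data versus trend data. Zabbix trends are hourly aggregates. The 60-second collection interval and the 90-day raw retention are illustrative assumptions, not decided values.

```python
def stored_rows(interval_s, history_days, trend_days):
    """Return (raw rows, trend rows) kept for one item."""
    raw = history_days * 86400 // interval_s  # one row per collected value
    trends = trend_days * 24                  # one aggregate row per hour
    return raw, trends

# Assumed: one value per minute, 90 days of raw data, one year of trends.
raw, trends = stored_rows(60, 90, 365)
print(raw, trends)  # 129600 8760
```

Under these assumptions a year of trend data is far smaller than three months of raw data, which is why keeping trends longer than raw history is cheap.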

Can data retention be configured per template / host or is this a ‘global’ setting?

Configurable per item on any given template.

How are internal rights for configuring items, templates, triggers handled?

The current setup and direction is per user group (tenant). Each group will have a ‘golden’ template from which the user group will assign to their defined host group(s).

Is Grafana being worked on as a part of the Monitoring rollout now?

The Monitoring system will deploy with native dashboard and statistics capabilities.

Is there reporting out of Zabbix dashboard for metrics?

Zabbix can look at existing data for triggers, time series data, paging info, etc. For deeper reporting, Grafana ‘might’ be looked at later, after the transition from Nagios is complete.

Can Zabbix handle long-standing issues where problems are currently being ignored, i.e. disk out of space for a week, high CPU usage over days, etc.?

ANSWER GOES HERE

How would Zabbix tie into vCenter?

ANSWER GOES HERE

Would it be useful to share data from the vCenter / ESX side with Zabbix as the front end for customers?

The Zabbix event correlation function could help with the VMware environment. We will need to experiment together once people are in the QA system.

Has any testing been done in parent / child object dependency checking in Zabbix?

Some access will need to be granted into Zabbix from the network side. If visibility exists, we can look at testing in the future. In a separate OPS meeting on Zabbix, Greg James asked about this as well. For Zabbix 5.x LTS, the Zabbix site states that integration with ServiceNow now exists.
