Centralized Logging

Summary

Foreman orchestrates several other compoments:

  • DHCP
  • DNS
  • TFTP
  • IPMI
  • Puppet Master
  • Puppet CA
  • Compute Resources: libvirt, oVirt, VmWare, OpenStack, Docker...

The list is longer when Katello and other plugins are installed:

  • Candlepin
  • Pulp, Crane
  • Dynflow
  • Cron jobs
  • Other configuration management tools: Salt, Chef

For instance, typical transaction of creating new host in Foreman involves:

  • acquiring next free IP on DHCP server
  • making DHCP reservation
  • creating DNS record
  • deploying PXELinux configuration, kernel and initramdisk on TFTP
  • signing certs on Puppet CA
  • contacting compute resrouce (e.g. oVirt) to configure and start the instance
  • and more (depending on plugins installed)

Pain points

All those components generate logs depending on their configuration and they all log to either files or syslog/journald. Investigation of failures involves searching in various system logs, log files and databases across whole Foreman infrastructure which involves multiple nodes installed with Smart Proxy and backend services.

Users need to execute foreman-debug tool on each of the nodes involved in transaction generating large tarballs with compressed log files sending them over to Foreman team for furtner investigation. This process is error prone, users often do not deliver logs from all nodes, or even supply logs from nodes not involved in the transaction.

Some operations take tens of minutes to complete, for example YUM repository promotion. Foreman user interface only shows a progress bar today. To see what is going on, users must connect to underlaying Pulp service to view log output in realtime.

Lastly, there is no way of finding root causes without accessing Foreman server or Smart Proxy nodes via remote shell.

Stories to solve

As a Foreman user, I would like to investigate logs from all components involved in New Host transaction.
As a Foreman user, I would like to find all logs from DHCP/TFTP/DNS servers paired with a specific host in Foreman database (IP/MAC).
As a Foreman user, I would like to view logs coming out from Pulp/Celery components during content promotion in realtime.

Solution

Centralized logging with metadata brings transparency as we currently struggle to correlate messages coming out of individual systems with orchestration runs, sessions or users. Possibility to pass the metadata IDs into subsystems would be big advantage. Possible correlation identifiers are:

  • Foreman user id
  • Foreman session id
  • Foreman transaction id
  • Foreman host IP address
  • Foreman host MAC address
  • Foreman host provisioning token

Owners

  • Lukáš Zapletal (Foreman) - `lzap_at_redhat_dot_com`

Current status

  • Design WIP
  • Not yet implemented

Implementation

The implementation must build on top of tools which are available in RHEL 6 and 7 today. The ideal integration point for common loogging is therefore rsyslog, which can be configured to send data over the wire reliably and also integrated with dozens of tools, including ElasticSearch/Kibana. It also allowes structured logging (https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/System_Administrators_Guide/s1-structured_logging_with_rsyslog.html) via Lumberjack plugin. We should keep in mind that it is possible that journald will be enhanced to provide better over-the-wire capabilities, so it should be easy to switch to journald instead of rsyslog.

Foreman uses "logging" gem which is modular log4j-like logging framework which unfortunately does not support structured logging at all neither provide integration with journald. But we can leverage logging JSON (CEE) format and pass it in the rsyslog via the standard syslog interface.

Required implementation steps

  • generate correlation (request) ID for each Foreman request (DONE)
  • generate correlation ID for background tasks
  • accept correlation ID via API for multi-request workflows (rhsm registration is multiple calls)
  • generate correlation ID for API calls missing it
  • send correlation ids to Pulp or Candlepin
  • print correlation id on error pages together with commands to run to dig the messages

The format:

  • keep it shortest possible string (base22 encoding or something similar)

References

Pulp uses the system log: http://pulp.readthedocs.org/en/latest/user-guide/troubleshooting.html#logging