System Monitoring

A System Monitor (SM) is a process within a distributed system for collecting and storing state data.

Designing and implementing a system monitor raises several issues, including:

configuration
protocol
performance
data access

Configuration

The configuration for the system monitor takes two forms: 1) configuration data for the application itself, and 2) configuration data for the system being monitored. See: System Configuration

The application needs information such as log file path and number of threads to run with.

Once the application is running, it needs to know what to monitor and must deduce how to monitor it. Because the configuration data describing what to monitor is also needed in other areas of the system, such as Deployment, it should not be tailored specifically to the system monitor but should follow a generalized system configuration model.
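The split between the two forms of configuration can be sketched as follows. This is a minimal illustration, not a prescribed schema: the class names, field names, element kinds, and addresses are all hypothetical, and the rule that maps an element kind to a collection method is an assumption made only for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MonitorAppConfig:
    """Settings for the monitor application itself (hypothetical fields)."""
    log_file_path: str = "monitor.log"
    thread_count: int = 4


@dataclass
class SystemElement:
    """One entry in a generalized system configuration model (hypothetical)."""
    name: str
    kind: str                      # e.g. "host", "device", "application"
    address: str
    attributes: dict = field(default_factory=dict)


def deduce_method(element: SystemElement) -> str:
    """Deduce how to monitor an element from its generic description."""
    if element.kind == "device":
        return "snmp"
    if element.kind == "host":
        return "ssh"
    return "application-protocol"


# The same element list could be read by other consumers, such as deployment.
system_config = [
    SystemElement("router-1", "device", "10.0.0.1"),
    SystemElement("web-1", "host", "10.0.0.20"),
    SystemElement("billing", "application", "10.0.0.20:9000"),
]

app_config = MonitorAppConfig()
plan = {element.name: deduce_method(element) for element in system_config}
```

Nothing in SystemElement mentions monitoring, so the element list stays reusable; the monitor-specific decision lives entirely in deduce_method.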

Protocol

Many tools collect system data from hosts and devices using SNMP. Most computers and networked devices offer some form of SNMP access. Interpreting the SNMP data from a host or device requires either a specialized tool (typically extra software from the vendor) or a Management Information Base (MIB): a mapping of commands and data references to the data elements the host or device provides. The advantages of SNMP for monitoring are its low bandwidth requirements and its widespread use across the industry.
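The role of a MIB as a mapping can be illustrated with a small hand-written stand-in. The three named OIDs below belong to the standard MIB-II "system" group; the raw response, the host name, and the vendor-specific OID are invented for the example, and no real SNMP library or network call is involved.

```python
# A tiny stand-in for a MIB: numeric object identifiers mapped to names.
MINI_MIB = {
    "1.3.6.1.2.1.1.1.0": "sysDescr",
    "1.3.6.1.2.1.1.3.0": "sysUpTime",
    "1.3.6.1.2.1.1.5.0": "sysName",
}


def interpret(raw_pairs):
    """Translate raw (OID, value) pairs into named readings.

    OIDs missing from the mapping are kept under their numeric form,
    which is what happens in practice when no MIB covers them.
    """
    return {MINI_MIB.get(oid, oid): value for oid, value in raw_pairs}


# Hypothetical raw response from a device.
raw_response = [
    ("1.3.6.1.2.1.1.5.0", "router-1"),
    ("1.3.6.1.2.1.1.3.0", 123456),      # TimeTicks: hundredths of a second
    ("1.3.6.1.4.1.99999.1.0", 7),       # made-up vendor OID, not in the mapping
]

readings = interpret(raw_response)
```

The unmapped vendor OID stays as an opaque number, which is the situation that calls for either the vendor's MIB or the vendor's own tool.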

Unfortunately, SNMP is not suitable for collecting application data unless the application itself provides a MIB and exposes its output via SNMP.

Other protocols are suitable for monitoring applications, such as CORBA (language/OS-independent), Java RMI (Java-specific), J2EE (also Java-specific), or proprietary TCP/IP or UDP protocols (language/OS-independent for the most part).

Performance

The performance of the monitoring system has two aspects:

impact on domain functionality
ability to monitor efficiently

Impact on system domain

Any element of the system that prevents the main domain functionality from working is, of course, inappropriate. The risks of each candidate implementation, and its impact on the real goal of the system, must be weighed.

Ideally, monitoring accounts for a tiny fraction of each application's footprint, which requires simplicity.

The monitoring function must be highly tunable to allow for such issues as network performance, improvements to applications in the development life-cycle, appropriate levels of detail, etc.

Efficient monitoring

Monitoring must be efficient: it must handle all monitoring goals in a timely manner, within the desired periodicity. This is most closely related to scalability. Various monitoring modes are discussed below.

Data access

Data access refers to the interface by which the monitor data can be used by other processes. For example, if the System Monitor is a CORBA server, clients can connect and call the monitor for the current state of an element, or for the historical states of an element over some time period.

The System Monitor may be writing data directly into a database, allowing other processes to access the database outside the context of the System Monitor. This is dangerous, however, because the database's table design then dictates how the data can be shared. Ideally, the System Monitor is a wrapper for whatever persistence mechanism is used, providing a consistent and 'safe' interface for others to access the data.
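A wrapper of this kind might look like the following sketch. It assumes SQLite as the persistence mechanism purely for convenience; the class name, method names, and table layout are hypothetical. Callers see only the current-state and historical-state queries described above and never touch the table directly.

```python
import sqlite3


class MonitorStore:
    """Hypothetical access interface that hides the persistence mechanism."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS state (element TEXT, ts REAL, value TEXT)"
        )

    def record(self, element, ts, value):
        """Store one observed state for an element."""
        self._db.execute(
            "INSERT INTO state VALUES (?, ?, ?)", (element, ts, value)
        )
        self._db.commit()

    def current_state(self, element):
        """Return the most recent state of an element, or None if unknown."""
        row = self._db.execute(
            "SELECT value FROM state WHERE element = ? "
            "ORDER BY ts DESC LIMIT 1",
            (element,),
        ).fetchone()
        return row[0] if row else None

    def history(self, element, start, end):
        """Return (timestamp, state) pairs within [start, end], oldest first."""
        return self._db.execute(
            "SELECT ts, value FROM state "
            "WHERE element = ? AND ts BETWEEN ? AND ? ORDER BY ts",
            (element, start, end),
        ).fetchall()


store = MonitorStore()
store.record("web-1", 100.0, "up")
store.record("web-1", 160.0, "degraded")
store.record("web-1", 220.0, "up")
```

Because clients depend only on current_state and history, the table design can change, or SQLite can be swapped for another store, without breaking them.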

Mode

The data collection mode of the System Monitor is critical. The modes are: monitor poll, agent push, and a hybrid scheme.

Monitor poll

In this mode, one or more processes in the system poll the system elements in a loop on some thread. During each pass, devices are polled via SNMP calls; hosts are accessed via Telnet/SSH to execute scripts, dump files, or run other OS-specific commands; and applications are polled for state data or have their state output files dumped.

The advantage of this mode is that there is little impact on the host/device being polled. The host's CPU is loaded only during the poll, and the rest of the time the monitoring function plays no part in CPU loading.

The main disadvantage of this mode is that the monitoring process can do only so much in the time available: if polling takes too long, the intended poll period is stretched.
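The stretching of the poll period can be made concrete with a small loop. This is a sketch under stated assumptions: the function name and parameters are hypothetical, the actual polling call is passed in as a function, and the demonstration uses a simulated clock rather than real time so that it runs instantly.

```python
import time


def poll_cycles(targets, poll_fn, period, cycles,
                clock=time.monotonic, sleep=time.sleep):
    """Poll every target once per cycle and report each cycle's duration.

    Returns one (elapsed_seconds, overran) tuple per cycle, where overran
    is True when the cycle took longer than the intended period.
    """
    report = []
    for _ in range(cycles):
        started = clock()
        for target in targets:
            poll_fn(target)
        elapsed = clock() - started
        overran = elapsed > period
        if not overran:
            sleep(period - elapsed)   # wait out the remainder of the period
        report.append((elapsed, overran))
    return report


class FakeClock:
    """Simulated clock so the demonstration does not actually wait."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()


def slow_poll(target):
    clock.advance(0.4)                # pretend each poll takes 0.4 seconds


report = poll_cycles(["a", "b", "c"], slow_poll, period=1.0, cycles=2,
                     clock=clock, sleep=clock.advance)
```

With three targets at 0.4 simulated seconds each, every cycle takes about 1.2 seconds against a 1.0 second period, so both cycles are flagged as overruns and the effective period grows with the number of targets.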

Agent push

In agent-push mode, the monitored host simply pushes data from itself to the system monitoring application. This can be done periodically, or asynchronously on request from the system monitor.

The advantage of this mode is that the system monitor's load can be reduced to simply accepting and storing data, and it doesn't have to worry about timeouts for SSH calls, parsing OS-specific call results, etc.

The disadvantage of this mode is that the logic for the polling cycle and its options is not centralized at the system monitor, but distributed to each remote node. Thus, changes to the monitoring logic must be pushed out to each node.
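A minimal agent-push exchange might be sketched as below, assuming a made-up protocol of one JSON object per UDP datagram. The function names, message fields, and metric are hypothetical; the demonstration runs both ends in one process over the loopback interface.

```python
import json
import socket
import time


def push_reading(monitor_addr, host, metric, value):
    """Agent side: send one reading to the monitor as a JSON datagram."""
    message = json.dumps(
        {"host": host, "metric": metric, "value": value, "ts": time.time()}
    )
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), monitor_addr)


def receive_reading(sock):
    """Monitor side: accept one datagram and decode it."""
    data, _sender = sock.recvfrom(4096)
    return json.loads(data.decode("utf-8"))


# Loopback demonstration; port 0 lets the operating system pick a free port.
monitor_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
monitor_sock.bind(("127.0.0.1", 0))
monitor_sock.settimeout(2.0)

push_reading(monitor_sock.getsockname(), "web-1", "cpu_load", 0.42)
reading = receive_reading(monitor_sock)
monitor_sock.close()
```

The monitor side contains no collection logic at all, which is the advantage described above; the cost is that push_reading, and any schedule that drives it, lives on every monitored node.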

Hybrid mode

The middle ground between 'monitor-poll' and 'agent-push' is a hybrid approach, in which the System Configuration determines where monitoring occurs: in the System Monitor or in an agent. Thus, when applications come up, they can determine for themselves which system elements they are responsible for polling. Ultimately, however, everything must post its monitored data to the System Monitor process.

This is especially useful when setting up a monitoring infrastructure for the first time, when not all monitoring mechanisms have been implemented. The System Monitor can do all the polling by whatever simple means are available, and as the agents become smarter, they can take on more of the load.
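The hybrid scheme can be sketched as a lookup against the system configuration. Everything here is illustrative: the assignment table, the collector labels, and the class and function names are hypothetical, and the actual reading of an element is replaced by a placeholder function.

```python
# Hypothetical system configuration: each element names its collector.
ASSIGNMENTS = {
    "router-1": "monitor",
    "db-1": "monitor",
    "web-1": "agent:web-1",
    "billing": "agent:web-1",
}


def responsibilities(collector, assignments=ASSIGNMENTS):
    """Return the elements a given collector is responsible for polling."""
    return sorted(
        element for element, owner in assignments.items() if owner == collector
    )


class SystemMonitor:
    """Central sink: every collector ultimately posts its data here."""

    def __init__(self):
        self.latest = {}

    def post(self, element, value):
        self.latest[element] = value


monitor = SystemMonitor()


def collect(collector, read_fn):
    """Run one collection pass for a collector and post the results."""
    for element in responsibilities(collector):
        monitor.post(element, read_fn(element))


# The monitor polls its own share; the agent handles the rest.
collect("monitor", lambda element: "polled-by-monitor")
collect("agent:web-1", lambda element: "pushed-by-agent")
```

Moving an element from the monitor to an agent is then a one-line change to the assignment table, which fits the incremental rollout described above.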
