European DataGrid - Fabric Monitoring

Documentation



EDG fmon 2.5.4
The release notes describe the changes between versions.


Content

Description
This package contains the WP4 monitoring framework.

It provides a client (Monitoring Sensor Agent - MSA) running sensors (Monitoring Sensors - MS) on each node to monitor, and a central server (fmonServer) to collect data.

The server receives samples as they are measured by MSA, and stores them in a flat file database (one file per metric, per node and per day). A C API is provided to extract information from this database.

A server using an Oracle database is also available. Please check the documentation.

A graphical toolkit has been released to set up a web interface for monitoring the flat file database. The system offers CGI data access (numeric values and plots) as well as tools to make daily plots and to aggregate information over a cluster. An example deployment is available (accessible from CERN only).

The client is provided with a sensor (sensorLinuxProc) which uses the /proc file system to measure various basic quantities on Linux (CPU load, network, etc). Detailed information about these metrics and the other sensors is available in the Sensors section of this document.

An example subscription tool (fmonClient) uses the repository API to be notified when new samples are available. The transport can also be optimized on large farms by the use of distributed data concentrators (fmonProxy).

Additional information can be found on the prototype homepage: http://wwwinfo.cern.ch/pdp/monitoring/index.html.


License
Copyright (c) 2001 EU DataGrid.
For license conditions see http://www.eu-datagrid.org/license.html


Installation

Package installation

  • With source tarball
    Under Linux, the following commands may be used :
    	tar -xzf edg-fabricMonitoring.x.x.x.src.tar.gz
    	cd edg-fabricMonitoring.x.x.x
    	make
    	make install
    
    If you don't have a C++ compiler on your system, compilation of CPP files can be avoided (but please note that sensorLinuxProc requires C++ installed to compile) :
    	make CPP=0
    
    By default, files are installed in /opt/edg/fabricMonitoring. You can change the root installation path. The following command installs files in my/install/root/directory/fabricMonitoring :
    	make install prefix=my/install/root/directory
    

  • With RPM
    Use the rpm -i command, like in the following example:
    	rpm -i edg-fabricMonitoring.x.x.x-1.i386.rpm 
    
    By default, files are installed in /opt/edg/fabricMonitoring. You can change the root installation path. The following command installs files in my/install/root/directory/fabricMonitoring :
    	rpm -i edg-fabricMonitoring.x.x.x-1.i386.rpm \
    	       --prefix=my/install/root/directory
    
    Use rpm -e edg-fabricMonitoring.x.x.x to remove the package. Repository and log files are not destroyed with this action. They should be removed manually if needed.

Configuration and running

  • For the monitored nodes:
    MSA is installed with default sensors and configuration. You should edit install_dir/etc/MSA.cfg to set up the server, samples, and other settings. Then launch:
    	install_dir/etc/init.d/edg-fmon-agent start
    
    Options are detailed in edg-fmon-agent script.

  • For the server node:
    Launch:
    	install_dir/etc/init.d/edg-fmon-server start
    
    Options are detailed in edg-fmon-server script. By default, files containing the samples are stored in /var/fmonServer.

  • Automatic startup at boot time:
    To set up automatic startup of MSA and fmonServer when the machine boots, run chkconfig once to register the startup scripts.

    If you don't want fmonServer to be launched, type rm /etc/rc.d/init.d/edg-fmon-server
    If you don't want MSA to be launched, type rm /etc/rc.d/init.d/edg-fmon-agent

    These files are symbolic links to the control scripts in install_dir/etc/init.d. This should be done before launching chkconfig.

Notes

  • If you want to distribute the software with your configuration, get the source tarball, extract it and modify etc/MSA.cfg. Then rebuild a distribution with make rpm. The new RPM file now includes your configuration, and the agent is able to run without reconfiguration. You may also include your sensors in the RPM distribution. Use make rpm "sensors=/path/mysensor1 /path/mysensor2".



MSA - Monitoring Sensor Agent
The following options can be added to the command line:
  • -v : print version and exit
  • -h : print command line help and exit
  • -d : print debug information while running
  • -l file : specify log file
  • -c file : specify configuration file
  • -D : run as a daemon in the background
  • -a : autoload of configuration enabled
It is however recommended to use the control script ../etc/init.d/edg-fmon-agent to start and stop the agent.

MSA uses by default the configuration file ../etc/MSA.cfg relative to the running directory. This is normally install_dir/etc when running from install_dir/sbin. The '-c' option should be used to set an absolute path to the configuration file to use.

MSA loads the configuration file on startup. While MSA is running, the file is reloaded if its timestamp changes (when option '-a' has been specified, which is the case when started with 'edg-fmon-agent start'), or if signal SIGHUP is received.
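
The timestamp-based reload described above can be illustrated with a short Python sketch. It is not MSA code: the class name ConfigWatcher and its methods are hypothetical, and the sketch only shows the general technique of comparing the modification time of the file with the last one seen.

```python
import os

class ConfigWatcher:
    """Detect configuration file changes by comparing modification times."""

    def __init__(self, path):
        self.path = path
        self.last_mtime = os.stat(path).st_mtime_ns

    def changed(self):
        """Return True once each time the file timestamp differs from the last one seen."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self.last_mtime:
            self.last_mtime = mtime
            return True
        return False
```

A polling loop would call changed() periodically and re-read the file when it returns True. A SIGHUP handler could trigger the same re-read explicitly.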

Log messages are by default printed to stdout. A field in the configuration file allows log messages to be written to a file. The command line option '-l' can also be used for this purpose.

MSA can be configured for verbose output with the '-d' option. This can be useful when writing new sensors.

The MSA configuration file is structured in a tree format. The number of tab characters '\t' at the beginning of a line gives the tree node depth. The tabs are followed by a node name and a node value, separated by blanks. Trailing blanks are removed. It is extremely important to respect the leading tab depth. A warning is issued in the log file if space characters are present before the first word on a line. A correct configuration file contains only tabs before the first word of a line.
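
As an illustration of this format, the following Python sketch parses tab-indented text into a nested tree. It is not part of the package: the function name parse_msa_config and the node layout (a dict with name, value and children) are assumptions made for this example, and '$' value substitution is not handled here.

```python
def parse_msa_config(text):
    """Parse tab-indented MSA-style configuration text into a tree.

    Each node is a dict: {"name": str, "value": str, "children": list}.
    """
    root = {"name": None, "value": "", "children": []}
    stack = [root]  # stack[d] is the most recent node at depth d - 1
    for raw in text.splitlines():
        line = raw.rstrip()                # trailing blanks are removed
        stripped = line.lstrip("\t")
        if not stripped.strip() or stripped.lstrip().startswith("#"):
            continue                       # skip blank lines and comments
        depth = len(line) - len(stripped)  # number of leading tabs
        parts = stripped.split(None, 1)    # node name, then optional value
        node = {"name": parts[0],
                "value": parts[1] if len(parts) > 1 else "",
                "children": []}
        if depth + 1 > len(stack):
            raise ValueError("line indented too deeply: %r" % raw)
        del stack[depth + 1:]              # forget nodes deeper than this line
        stack[depth]["children"].append(node)
        stack.append(node)
    return root["children"]
```

Calling parse_msa_config on the content of MSA.cfg would return a list with a single 'MSA' node whose children are the General, Transport, Sensors, Metrics and Samples subtrees.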

If the value of a key/value pair starts with a "$" sign, then the associated environment variable is used as the value. To actually include "$" at the beginning of the value, you need to use "$$".

Examples:
	"my_value" -> val = my_value
	"$$my_value" -> val = $my_value
	"$my_value" -> val = getenv(my_value)
	


Comment lines begin with a hash sign '#'.
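
The "$" substitution rule given above can be sketched in Python as follows. The function name resolve_value is hypothetical, and returning an empty string for an unset environment variable is an assumption, since this document does not say what MSA does in that case.

```python
import os

def resolve_value(raw, env=os.environ):
    """Return the effective value of a configuration entry."""
    if raw.startswith("$$"):
        return raw[1:]               # "$$x" stands for the literal "$x"
    if raw.startswith("$"):
        return env.get(raw[1:], "")  # assumption: unset variable gives ""
    return raw
```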

The following example details the configuration file structure.

# The root of the configuration tree is 'MSA'.
#	By default, MSA uses the host name as agent identifier in the repository,
#	but you can specify a value.
MSA
#or: MSA	my_agent_id

# General configuration section.
	General

#	 	MSA log file
#		If log file is set to 'syslog', log messages are logged to
#		syslog instead of a file.
		LogFile /var/MSA.log

#		MSA local repository :
#			- define path
#			- optionally spool last value separately
#			  for fast access
		LocalCache
			Path	../var/localcache
			SpoolLast


#		Sampling on demand setup :
#			- define a named pipe to be used for queries
#			- external client wanting a metric to be sampled opens this pipe and issues
#			  a query of the form 'timestamp metric \n'
#			  where 'timestamp' is the unix time at which the metric is to be sampled (0 if now)
#			  and 'metric' is the metric to be sampled.
#			  The query is terminated by an end of line.
#			  The named pipe is created when agent starts, and is destroyed on shut down.

		SamplingOnDemand
			PipePath	../var/SamplingOnDemandPipe


# Transport configuration. You can choose between UDP and TCP transports.
# Maximum 5 transports allowed.

	Transport

#		Configure fmonServer location and port
		UDP
			Server	localhost
			Port	12409

# you can also set up a second transport
#		UDP
#			Server	otherhost
#			Port	12409			
# and/or :
#		TCP
#			Server	localhost
#			Port	12409
# optionally for TCP transport:
# - specify agent id (by default using host name)
#			AgentID myagent
# - specify directory for cached data not being transported yet (by default ../var)
#			Spooldir /tmp/MSAcache
#
# Optionally, for any transport, a filter can be set so that only some metrics
# are sent over that transport. Add the following:
#
# 			FilterMetrics KeepOnly
#				9011
#				9012
# or :
#
# 			FilterMetrics Reject
#				9011
#				9012
#
# In the first case, only metrics 9011 and 9012 are transmitted.
# In the second case, all but metrics 9011 and 9012 are transmitted.



# NOTE: DNS lookup for server is done only once after configuration is read.
#       If server IP changes, all agents should be restarted to avoid UDP
#       packets being lost.


# Declaration of sensors available.
# For each sensor, we give the executable command line and
# the list of metric classes implemented in the sensor.
	Sensors		

#		Internal sensor for MSA self-monitoring.
		MSA
			MetricClasses
				MSA.Alive
				MSA.Footprint

#		Proc sensor.
		proc
			CommandLine ./sensorProc
			MetricClasses
				system.uptime
				system.existingProcesses
				system.createdProcesses
				system.numberOfSockets
				system.CPUutil
				system.contextSwitches
				system.interrupts
				system.swapIO
				system.pagingIO
				system.networkIO
				system.memoryUsed
				system.DiskIO


# Optionally you can give arguments to the sensor program
#		mysensor
#			CommandLine ./mysensor arg1 arg2 arg3
#			MetricClasses
#				...



# If the sensor does not support the CHK command (MSA > v2.5), you should set the 'NoCHK' option
# to prevent MSA from issuing such queries:
#		mysensor
#			CommandLine ./mysensor arg1 arg2 arg3
#			NoCHK
#			MetricClasses
#				...


# Now the metric instances are listed, under MSA/Metrics entry.
# Each instance is named by a numeric id (used as metric reference
# in database). Each metric node contains a MetricClass node, with
# a value. Parameters can also be passed to the sensor in a 
# Parameters subtree.
# The metric can also be configured with a different node identifier
# than the default one used by the MSA. This can be done by defining a 'NodeId'
# entry (on the same level as MetricClass). This does not work with MSA internal sensor metrics.

	Metrics
		1
			MetricClass MSA.Alive
		2
			MetricClass MSA.Footprint
		3
			MetricClass system.uptime
		4
			MetricClass system.networkIO
			Parameters
				interface eth0

#		5
#			MetricClass remoteMetric
#			NodeId	remoteNodeId



# A smoothing parameter can be configured for each metric, to discard metric values
# that do not change 'too much', according to the parameter.
# 'Index' selects the word of the sample to compare (0 if the full string is to be compared).
# 'Type' defines if we compare numbers ('number') or strings ('string').
# Optionally, we can set the maximum difference allowed using 'Maxdiff'. This makes sense only with 'number' type.
# In the example, the CPU load will be stored only if the CPU idle time has changed more than 5%.
# You can also optionally set 'Maxtime', the maximum number of seconds of consecutive sample filtering.
# The configuration below guarantees that at least one sample is kept per 10-minute window.
#                6
#                        MetricClass system.CPUutil
#                        Smoothing
#                                Index   1
#                                Type    number
#                                Maxdiff 5
#                                Maxtime 600


# Finally, declaration of the sampling frequencies
# and associated metrics. Each sample is a subtree in
# MSA/Samples. Each sample should contain a Timing and
# a Metrics entry. In this example, metrics 1 and 2
# are sampled every 5 minutes.
	Samples
		sample1
#			For sample timing, we give:
#			- period (in seconds)
#			- initial offset (in seconds).
			Timing	300 0
			Metrics
				1
				2

# Samples can be triggered with reference to a given time for the first
# measurement. For example, if period is 21600s (6 hours), offset 0, and
# reference time 00:00, samples will be measured every day at 00:00, 6:00, 12:00,
# 18:00. The reference time given is for THE FIRST MEASUREMENT (% period). With
# the following example config, the first sample will occur at 12:00 if you
# start the agent at 10:15. If you want daily measures at the same times, be
# sure that 86400 is a multiple of 'period'. Otherwise there will be a shift in
# time of the measures from day to day. If no reference time is given, MSA
# startup time is used.
#
# This feature can be useful to synchronize measurements between machines, or to
# trigger a measure after an event whose time is known (daily rpm update, etc).
# To enable this feature, add a 'ReferenceTime' entry to the sample configuration
# (same tree level as 'Timing'), with a time of the form 'hh:mm' as value.

		sample2
			Timing		21600 0
			ReferenceTime	00:00
			Metrics
				3



A summary is available for quick reference to the MSA structure.
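
The 'Smoothing' option shown in the example configuration can be pictured with the following Python sketch. It is one interpretation, not the MSA implementation: the class name Smoother is hypothetical, 'Index' is assumed to be a 1-based word position (0 meaning the whole sample), and new samples are assumed to be compared with the last sample that was kept.

```python
class Smoother:
    """Decide which samples to keep, following Index/Type/Maxdiff/Maxtime."""

    def __init__(self, index=0, kind="string", maxdiff=0.0, maxtime=None):
        self.index, self.kind = index, kind
        self.maxdiff, self.maxtime = maxdiff, maxtime
        self.last_key = None   # compared field of the last kept sample
        self.last_time = None  # timestamp of the last kept sample

    def _key(self, sample):
        field = sample if self.index == 0 else sample.split()[self.index - 1]
        return float(field) if self.kind == "number" else field

    def keep(self, timestamp, sample):
        """Return True if the sample should be stored, False if discarded."""
        key = self._key(sample)
        if self.last_key is None:
            changed = True                 # always keep the first sample
        elif self.kind == "number":
            changed = abs(key - self.last_key) > self.maxdiff
        else:
            changed = key != self.last_key
        expired = (self.maxtime is not None and self.last_time is not None
                   and timestamp - self.last_time >= self.maxtime)
        if changed or expired:
            self.last_key, self.last_time = key, timestamp
            return True
        return False
```

With Smoother(index=1, kind="number", maxdiff=5, maxtime=600), a sample is dropped while its first word stays within 5 of the last kept value, but a sample is kept once 600 seconds have passed since the last kept one.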



Transport configuration :

MSA can send monitoring data to several servers (maximum 5). This is done by having in the configuration several 'UDP' or 'TCP' entries. When MSA is reconfigured, transports are shut down only if their configuration has changed.

Each transport can be configured to send to its server only a subset of the metrics measured. This is done with the 'FilterMetrics' keyword in the transport configuration, as shown above.
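
The effect of the two 'FilterMetrics' modes can be summarised by the following Python sketch. The function name metric_passes is hypothetical and the code only restates the KeepOnly/Reject rule given in the example configuration.

```python
def metric_passes(metric_id, mode=None, listed=()):
    """Decide whether a transport forwards a metric.

    mode is None (no filter), "KeepOnly" or "Reject"; listed holds metric ids.
    """
    if mode is None:
        return True
    if mode == "KeepOnly":
        return metric_id in listed
    if mode == "Reject":
        return metric_id not in listed
    raise ValueError("unknown FilterMetrics mode: %r" % mode)
```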


fmonServer - Monitoring Repository Collector
The central collector waits for samples sent by remote agents. It receives them and stores them in a database.

The following options can be given to the server on the command line:
  • -p port : set the listening port number
  • -d dir : set the spool directory
  • -v : print version and exit
  • -l file : specify log file
  • -D : run as a daemon in the background
  • -h : print command line help and exit



Measurement repositories

Database structure

The database is very simple: there is one file per host, per metric and per day. Files are stored in the spool directory. Each file contains a list of timestamp/value pairs.

File name:     'agent identifier'/YYYY_MM_DD__'metric identifier'
               'agent identifier' is sent by the agent.
               'metric identifier' is formatted as an 8-character value.
               YYYY MM DD is the date.
File content:  'time' (tab) 'value'
               'time' is the number of seconds since 1st January 1970. 'value' is a string.
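
Besides the C API, the files can be read directly. The following Python sketch, based only on the layout described above, splits a repository file name into its parts and loads the timestamp/value pairs. The function names are hypothetical and are not part of the package.

```python
import os

def parse_spool_name(path):
    """Split '<agent>/YYYY_MM_DD__<metric>' into (agent, (year, month, day), metric)."""
    agent = os.path.basename(os.path.dirname(path))
    date_part, metric = os.path.basename(path).split("__", 1)
    year, month, day = (int(x) for x in date_part.split("_"))
    return agent, (year, month, day), metric

def read_samples(path):
    """Return the (time, value) pairs stored in one repository file."""
    samples = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue                                # ignore empty lines
            time_field, _, value = line.partition("\t")  # 'time' (tab) 'value'
            samples.append((int(time_field), value))
    return samples
```

The value is kept as a string, as stored. Splitting it into words or numbers depends on the metric.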



A file containing the last value (and timestamp) for each metric is also created and updated at run time. These files keep a copy of the last sample received until a new one arrives (for each metric). They are located in each host directory and named "last.'metric identifier'". They have the same format as the spool files. A file named "last.active" in the spool directory tells at run time if these files are updated or not (its content is 1 or 0).


A tool is available to clean the database. Call 'edg-fmon-cleanspool ndays' where ndays is the number of days for which you want to keep data. By default, it is 5. This script should be called on a regular basis to free space taken by the repository. Please consult the man pages for more details.


Sensors
Sensors are launched by MSA, and communicate with it using the interface specified in sensorAPI.pdf. Available sensors are listed below. Configuration parameters are indicated in italics when applicable, and are mandatory unless marked optional. The default configuration file (in ../etc/MSA.cfg) may help you to understand how to configure metrics.
  • MSA internal sensor

    This sensor is special. It is embedded in the agent to implement self-monitoring capabilities. The following metric classes are available:

    MSA.Alive The first number returned is always "1" (can be used as a heart beat). Then follows the number of running sensors out of the number that should be running (x/y).
    MSA.Footprint The sample output gives: agent uptime (in seconds), total cpu used (1/100th second), agent resources used : cpu over last interval, vsize (kB), rss (kB), %mem used, sensors resources used (total): cpu over last interval, vsize (kB), rss (kB), %mem used.
    MSA.HeartBeatTimeout This metric requires a configuration parameter named timeout. The value of this parameter is returned each time the metric is sampled. This can be used as a timeout limit to implement a contact lost alarm.
    MSA.SensorCheck This metric requires a configuration parameter named timeout. The value of this parameter is the maximum response time of the sensor to the 'CHK' command issued when this metric is sampled. The metric returns the number of sensors which have not replied before timeout, and the number of sensors checked.



  • edg-fmon-sensor-linuxProc

    This sensor gathers information from /proc. This is available only with Linux. The following metric classes are available:

    system.uptime the elapsed number of seconds since boot time
    system.bootTime the time of last machine boot, in standard timestamp format
    system.existingProcesses the number of processes existing
    system.createdProcesses the number of processes created in the last time interval (average per second and total)
    system.numberOfSockets the number of sockets in use (total, TCP, UDP, RAW)
    system.CPUutil CPU utilisation in percent over last interval (User, Nice, System, Idle), time interval (seconds), counter discrepancies
    system.contextSwitches number of context switches during last interval (average per second and total)
    system.interrupts number of interrupts during last interval (average per second and total)
    system.swapIO number of swap pages read and written (average per second and total)
    system.pagingIO number of pages read and written (average per second and total)
    system.networkIO number of kilobytes read and written (total since boot time and average per second over last interval) on the given interface (parameter interface, example: lo, eth0)
    system.memoryUsed RSS memory in use (kilobytes) and the corresponding percentage of total RAM
    system.DiskIO name, type, size(MB), used(%), read(kB/s), write(kB/s), use(%) for each partition over last interval (1 sample per partition)
    system.DiskStat name, read(kB/s), write(kB/s), use(%) for each disk over last interval (one disk per sample if parameter 'multiline' set to 1, a list for all disks otherwise)



  • edg-fmon-sensor-systemCheck

    This sensor measures various quantities using common command line utilities. The following metric classes are available:

    spaceUsed Estimate file space usage recursively in a given directory. This metric is measured using command 'du -s path'. Symbolic links are not dereferenced. If some subdirectories are not accessible, the sampling fails. The returned value is the size used in kilobytes.

    Parameters:
              path        Path of root directory to scan (e.g. '/tmp').

    daemonCheck Counts the number of instances of a given process. It is based on the output of command 'ps -fC name'. The returned value is the number of matching processes.

    Parameters:
              name        The command name of the process (e.g. 'inetd').
              user        Only count the processes matching the given user name (e.g. 'root'). Optional. By default, count all.

    executeScript Executes a command in the shell. The returned value is the output of the command. It is reformatted into one single line of maximum 2000 characters.

    Parameters:
              command        The shell command to execute (e.g. 'ls /bin', or 'ls -l /tmp | grep -c root').

    serviceStatus Executes '/sbin/service status' for a given service. Returns 1 if the service is running, 0 otherwise.

    Parameters:
              service        The service name to be checked (as known by chkconfig).
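
To make the executeScript output rule more concrete, here is a Python sketch of running a shell command and reducing its output to a single line of at most 2000 characters. It is not the sensor code: the function names are hypothetical, and replacing each run of whitespace (including line breaks) with one space is an assumption, since the exact reformatting is not specified here.

```python
import subprocess

MAX_LENGTH = 2000  # maximum sample length stated for executeScript

def flatten_output(text, limit=MAX_LENGTH):
    """Collapse multi-line text into one line of at most 'limit' characters."""
    # Assumption: every run of whitespace becomes a single space.
    return " ".join(text.split())[:limit]

def execute_script(command):
    """Run 'command' in the shell and return its standard output as one line."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return flatten_output(result.stdout)
```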




  • edg-fmon-sensor-fileUtil

    This sensor provides various file-related utilities.


    file.dump Once initialized, this metric stores in the monitoring what is read from the file, line by line. This metric does not need to be sampled, it sends samples as the file is filled. This works well with named pipes too. Beware that the whole content of the file is dumped at the beginning. If you do not wish such behavior, please use file.tail metric instead.

    Parameters:
              file        Path to the file to read from.

    Output format:
              (string)        What is written to the file, one sample per line.

    An example use is to get the LCFG log messages into the monitoring DB. To get LCFG log messages, redirect them to a named pipe. The following entry is required in the syslog.conf file to route LCFG events to the sensor:

              local3.* |/var/obj/tmp/monitor.fifo

    You might need to create the pipe with mkfifo.

    file.redump This metric re-dumps the content of a given file each time it is sampled. There is one sample per line.

    Parameters:
              file        Path to the file to read from.

    Output format:
              (string)        The content of the file, one sample per line.

    file.tail This metric returns strings appended to a given file, in the same way the 'tail' command works. This metric only outputs what is appended after it has been initialized. If you want the full content of the file, use the file.dump metric. This metric is useful to trace log messages, etc. You don't need to sample this metric. Samples are sent as the file grows.

    Parameters:
              file        Path to the file to read from.

    Output format:
              (string)        What is written to the file, one sample per line.

    file.processAccounting This metric returns process accounting information appended to a given file. This metric only outputs what is appended after it has been initialized. The binary information is converted to ASCII readable data. You don't need to sample this metric. Samples are sent as the file grows.

    Parameters:
              file        Path to the file to read from (typically /var/log/pacct).
              filter_uid        Reports accounting information only for the user with the given uid. Optional.
              filter_gid        Reports accounting information only for users belonging to the group with the given gid. Optional.

    Output format:
              (string)        Accounting command name.
              (long)        Accounting process exitcode.
              (int)        Accounting user ID.
              (int)        Accounting group ID.
              (int)        Controlling tty.
              (long)        Beginning time. (seconds since 1970)
              (int)        Accounting user time.
              (int)        Accounting system time.
              (int)        Accounting elapsed time.
              (int)        Accounting average memory usage.
              (int)        Accounting chars transferred.
              (int)        Accounting blocks read or written.
              (int)        Accounting minor pagefaults.
              (int)        Accounting major pagefaults.
              (int)        Accounting number of swaps.

    file.size This metric measures the size of a given regular file (no directory, symbolic link, etc).

    Parameters:
              file        Path to the file to measure.

    Output format:
              (int)        The file size in bytes. -1 if file does not exist (or is not a regular file).
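
The file.size output rule can be reproduced with a few lines of Python. This is an illustrative sketch, not the sensor implementation: the function name file_size is hypothetical, and using lstat so that a symbolic link counts as "not a regular file" is an interpretation of the description above.

```python
import os
import stat

def file_size(path):
    """Return the size in bytes of a regular file, or -1 otherwise."""
    try:
        info = os.lstat(path)  # lstat: a symbolic link is not followed
    except OSError:
        return -1              # missing or inaccessible file
    return info.st_size if stat.S_ISREG(info.st_mode) else -1
```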




Contact
Should you have any questions or comments, please write to sylvain.chapeland@cern.ch.