
Differences

This shows you the differences between two versions of the page.

documentation:software_interface [2017/05/23 14:55] – [Nagios API] vordocumentation:software_interface [2020/12/16 11:19] (current) – removed vor
Line 1: Line 1:
-====== Software interface ====== 
  
-There are several software interfaces available to monitor the status of the RECS<sup>(r)</sup>%%|%%Box system: the Management WebGUI, a REST API providing XML-based monitoring and management functionality, and a native NRPE-based Nagios interface.
- 
-===== Management WebGUI ===== 
- 
-The Management WebGUI is available on every RECS<sup>(r)</sup>%%|%%Box unit. It is accessible with any common browser at the assigned IP address on the default port 80. The views described below depend on the device and its assembly.
- 
-In general these symbols have the following meaning on every page: \\ 
- 
-|{{ :documentation:statusok.png?nolink |}}|Everything is OK. Also indicated by a green line in a graph.| 
-|{{ :documentation:statuswarning.png?nolink |}} |Warning. Something is wrong, but the system is still fully functional. The system has to be checked so the problem doesn't get worse. Indicated by a yellow line in a graph.|
-|{{ :documentation:statuscritical.png?nolink |}} |Critical error. The system must be checked immediately and may have to be shut down to prevent hardware damage. Indicated by a red line in a graph.|
- 
-Figure 1 shows the initial view of the Management WebGUI. It is organized into three columns. The first column, on the left-hand side, contains the following:
- 
-[[documentation:software_interface#Overview|Overview:]] General overview of all managed RCUs, RPUs, installed nodes and health status\\
-[[documentation:software_interface#Management|Management:]] Selection of every managed RCU and RPU in the rack with a sensor view button for the Arneb\\ 
-[[documentation:software_interface#Global settings|Global settings:]] IP filter and firmware update\\ 
-[[documentation:software_interface#Log Viewer|Log:]] Logs from the management software about system health and Java messages. The logs can be downloaded as a ZIP file\\
- 
-The second column contains the buttons and sliders for manipulating the system, while the third column mostly shows history information such as power usage and temperature graphs.
- 
-==== Overview ==== 
- 
-All units that are installed in the rack and that are managed by the software are summarized on this page. 
-The total power usage is summed up over all managed units. 
- 
-<imgcaption web-gui-overview|> 
-{{ :documentation:web-gui-overview.jpg?direct&500 |Management Overview}}</imgcaption> 
- 
-==== Management ==== 
- 
-This tab shows an overview of the selected unit. The fan speed can be regulated by dragging the slider to the desired percentage, and multiple nodes can be selected. Clicking on a node opens the [[documentation:management#node management|Node management]] page of that node.
- 
-<imgcaption web-gui-rcu-overview|>{{ :documentation:web-gui-rcu-overview.jpg?direct&500 |Adds an ImageCaption tag}}</imgcaption> 
- 
-A quick menu to control a node can be opened by clicking on the gear next to a CXP node. In this menu the node can be switched on and off, and the KVM can be switched to the node.
- 
-<imgcaption web-gui-node-control|>{{ :documentation:web-gui-node-control.jpg?nolink&300 |Management pop-up for Apalis nodes}}</imgcaption>
- 
-<WRAP round tip> 
-Apalis nodes do not show a management pop-up button due to size constraints.\\ Click on the node button while pressing the "Shift" key to open the management pop-up instead of navigating to the node view. 
-</WRAP> 
- 
-<WRAP round tip> 
-When pressing the "Shift" key while clicking, the "Select all" and "Select none" buttons select only nodes currently on or nodes currently off, respectively. 
-</WRAP> 
- 
-=== Node management === 
- 
-On this page the selected node can be controlled and detailed status values and graphs can be seen.\\ 
-The other nodes of the unit can be chosen by clicking on the downward-pointing arrow next to the node name in the upper bar.
- 
-<imgcaption web-gui-cxp-node-view|>{{ :documentation:web-gui-cxp-node-view.jpg?direct&500 |Node management}}</imgcaption> 
- 
-==== Global settings ==== 
- 
-All IP addresses that are allowed to access the Nagios interface have to be listed here.\\
-The firmware for the whole RECS<sup>(r)</sup>%%|%%Box can be uploaded here by clicking on the "Upload Firmware File" button and selecting the file. The update process starts as soon as the file has been uploaded.\\
-For the update process **all modules will be powered off!**\\ 
- 
-<imgcaption web-gui-global_settings|>{{ :documentation:web-gui-global_settings.jpg?direct&500 |Global settings tab}}</imgcaption>
- 
-==== Log viewer ==== 
- 
-The system health tab of the log page shows the status changes of the sensors, fans and boards.
- 
-<imgcaption web-gui-log-standart_view|>{{ :documentation:web-gui-log-standart_view.jpg?direct&500 |System health log}}</imgcaption> 
- 
-The Java tab of the log page contains all messages regarding the software.
- 
-<imgcaption web-gui-log-java_messages-view|>{{ :documentation:web-gui-log-java_messages-view.jpg?direct&500 |Java messages}}</imgcaption> 
- 
-Several filters can be set for both tabs at the top.\\ 
-At the bottom the whole log can be downloaded as a ZIP file containing the individual logfiles. 
- 
-===== REST API ===== 
- 
-==== Access ==== 
- 
-The RECS<sup>(r)</sup>%%|%%Box Management API is accessible via the IP address or the hostname of the TOR-Master of the cluster. The base URL of the API has the format ''https://TOR-Master/REST/'' or ''http://TOR-Master/REST/''.
- 
-Accessing the REST API requires HTTP Basic authentication. The authenticated user has to be in the "Admin" or "User" group to be able to execute the POST/PUT management calls. 
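
As a minimal sketch of the access scheme described above, the following Python snippet builds an HTTP Basic authenticated request for a resource below ''/REST/''. The host name ''tor-master.example'' and the credentials are hypothetical placeholders, and ''build_request'' is an illustrative helper, not part of the RECS software. The request is only constructed here, not sent.

```python
import base64
import urllib.request


def build_request(host, resource, user, password, method="GET", use_https=True):
    """Build an authenticated request for a resource below /REST/."""
    scheme = "https" if use_https else "http"
    url = f"{scheme}://{host}/REST/{resource.lstrip('/')}"
    # HTTP Basic authentication: base64("user:password") in the Authorization header.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method=method)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical host name and credentials, for illustration only.
request = build_request("tor-master.example", "node", "admin", "secret")
# To actually send it: urllib.request.urlopen(request).read()
```

Building the header by hand keeps the sketch free of third-party dependencies; a client library with built-in Basic authentication support would work just as well.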
- 
-==== Components ==== 
- 
-The RECS<sup>(r)</sup>%%|%%Box Management API makes all hardware components in the cluster available as XML trees in software. The following components are supported by the API: \\ 
- 
-^ Component ^ Description ^
-|''node'' | A single node| 
-|''backplane'' |A backplane can be equipped with zero or more baseboards| 
-|''baseboard'' |A baseboard can be equipped with zero or more nodes| 
-|''rcu'' |A RECS<sup>(r)</sup>%%|%%Box Computing Unit (RCU) can be equipped with zero or more baseboards| 
-|''rack'' |A rack consists of several RCUs|
- 
-Many resources also return lists of components. These are named according to the scheme <component name>List (e.g. nodeList, rcuList) and contain the elements of the list. 
- 
-Example of a backplaneList: 
- 
-<code xml><backplaneList> 
-<backplane position="1" id="RCU_84055620466592_BP_1" infrastructurePower="0.0"> 
-<temperatures>24.0</temperatures> 
-<temperatures>25.0</temperatures> 
-<temperatures>26.0</temperatures> 
-<temperatures>27.0</temperatures> 
-<temperatures>28.0</temperatures> 
-</backplane> 
-</backplaneList></code> 
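
A list response like the one above can be read with a standard XML parser. The following sketch uses Python's ''xml.etree.ElementTree'' on the sample document; ''parse_backplane_list'' is an illustrative helper name, and the sketch assumes that every ''backplane'' element carries the attributes shown in the sample.

```python
import xml.etree.ElementTree as ET

# Sample response copied from the backplaneList example.
SAMPLE = """<backplaneList>
<backplane position="1" id="RCU_84055620466592_BP_1" infrastructurePower="0.0">
<temperatures>24.0</temperatures>
<temperatures>25.0</temperatures>
<temperatures>26.0</temperatures>
<temperatures>27.0</temperatures>
<temperatures>28.0</temperatures>
</backplane>
</backplaneList>"""


def parse_backplane_list(xml_text):
    """Return one dict per <backplane> element of a backplaneList."""
    root = ET.fromstring(xml_text)
    backplanes = []
    for element in root.findall("backplane"):
        backplanes.append({
            "id": element.get("id"),
            "position": int(element.get("position")),
            "infrastructurePower": float(element.get("infrastructurePower")),
            # Repeated <temperatures> elements form the temperature list.
            "temperatures": [float(t.text) for t in element.findall("temperatures")],
        })
    return backplanes


backplanes = parse_backplane_list(SAMPLE)
```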
- 
-=== Node === 
- 
-Example XML: 
- 
-<code xml><node baseBoardPosition="0" maxPowerUsage="44" actualNodePowerUsage="32.426884399865166"  
-actualPEGPowerUsage="15.12053962324833" actualPowerUsage="47.54742402311349" architecture="x86"  
-baseBoardId="RCU_84055620466592_BB_1" health="OK" id="RCU_84055620466592_BB_1_0" inletTemperature="20.0"  
-lastSensorUpdate="1465470151268" macAddressCompute="70:b3:d5:56:40:48" outletTemperature="20.0" state="1"  
-highestTemperature="20.0" voltage="12.072700851453936"/></code> 
- 
-The following table shows the possible attributes (some are optional) and their meaning: \\ 
- 
-^ Attribute ^ Description ^ Unit ^ Data type ^ 
-|''id''|Unique ID for referencing the component|-|String| 
-|''actualPowerUsage'' |Actual power consumption of a node (Node + PEG)|W|Double| 
-|''actualNodePowerUsage'' |Actual power consumption of a node (Node only)|W|Double| 
-|''actualPEGPowerUsage'' |Actual power consumption of a PEG card|W|Double| 
-|''maxPowerUsage'' |Maximum power the node can draw|W|Integer| 
-|''baseBoardId'' |ID of the baseboard which hosts the node|-|String| 
-|''baseBoardPosition'' |Position of the node on the baseboard|-|Integer| 
-|''state'' |Power state of the node (0=Off, 1=On, 2=Soft-off, 3=Standby, 4=Hibernate)|-|Integer| 
-|''architecture'' |Architecture (x86, arm, UNKNOWN)|-|String| 
-|''health'' |Health status of the node (OK, Warning, Critical)|-|String| 
-|''inletTemperature'' |Temperature of the inlet air|°C|Double| 
-|''outletTemperature'' |Temperature of the outlet air|°C|Double| 
-|''highestTemperature'' |Highest temperature measured on the node's baseboard|°C|Double| 
-|''voltage'' |Supply voltage of the baseboard|V|Double| 
-|''lastSensorUpdate'' |Timestamp of the last sensor update|ms|Long| 
-|''macAddressCompute'' |MAC address of the NIC connected to the compute network (optional)|-|String| 
-|''macAddressMgmt'' |MAC address of the NIC connected to the management network (optional)|-|String| 
- 
-Corresponding to the node component, the API offers nodeList, which contains multiple node instances.
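
To illustrate how the node attributes might be consumed, the sketch below parses the sample node XML and translates the numeric ''state'' into the names from the attribute table. ''summarize_node'' and ''POWER_STATES'' are illustrative names; optional attributes such as ''macAddressMgmt'' are read with ''get'', which yields ''None'' when the attribute is absent.

```python
import xml.etree.ElementTree as ET

# Power state codes as listed in the attribute table.
POWER_STATES = {0: "Off", 1: "On", 2: "Soft-off", 3: "Standby", 4: "Hibernate"}

# Sample node copied from the example XML.
SAMPLE_NODE = (
    '<node baseBoardPosition="0" maxPowerUsage="44" '
    'actualNodePowerUsage="32.426884399865166" '
    'actualPEGPowerUsage="15.12053962324833" '
    'actualPowerUsage="47.54742402311349" architecture="x86" '
    'baseBoardId="RCU_84055620466592_BB_1" health="OK" '
    'id="RCU_84055620466592_BB_1_0" inletTemperature="20.0" '
    'lastSensorUpdate="1465470151268" macAddressCompute="70:b3:d5:56:40:48" '
    'outletTemperature="20.0" state="1" highestTemperature="20.0" '
    'voltage="12.072700851453936"/>'
)


def summarize_node(xml_text):
    """Extract a few fields from a single <node> element."""
    node = ET.fromstring(xml_text)
    state = int(node.get("state"))
    return {
        "id": node.get("id"),
        "state": POWER_STATES.get(state, f"unknown ({state})"),
        "health": node.get("health"),
        "node_power_w": float(node.get("actualNodePowerUsage")),
        "peg_power_w": float(node.get("actualPEGPowerUsage")),
        "total_power_w": float(node.get("actualPowerUsage")),
        "mac_mgmt": node.get("macAddressMgmt"),  # optional, None if missing
    }


summary = summarize_node(SAMPLE_NODE)
```

In the sample, ''actualPowerUsage'' matches the sum of the node and PEG values up to floating-point rounding, which is consistent with the "Node + PEG" description in the table.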
- 
-=== Backplane === 
- 
-Example XML: 
- 
-<code xml><backplane position="1" id="RCU_84055620466592_BP_1" infrastructurePower="0.0"> 
-<temperatures>24.0</temperatures> 
-<temperatures>25.0</temperatures> 
-<temperatures>26.0</temperatures> 
-<temperatures>27.0</temperatures> 
-<temperatures>28.0</temperatures> 
-</backplane></code> 
- 
-The attributes have the following meaning: \\ 
- 
-^ Attribute ^ Description ^ Unit ^ Data type ^ 
-|''id'' |Unique ID for referencing the component|-|String| 
-|''position'' |Position of the backplane in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|Integer| 
-|''infrastructurePower'' |Power usage of the infrastructure components on the backplane|W|Double| 
-|''temperatures'' |List of temperatures measured on the backplane|°C|Double| 
- 
-Corresponding to the backplane component, the API offers backplaneList, which contains multiple backplane instances.
- 
-=== Baseboard === 
- 
-Example XML: 
- 
-<code xml><baseBoard rcuPosition="6" baseboardType="APLS" id="RCU_84055620466592_BB_6" infrastructurePower="9.8" rcuId="RCU_84055620466592"> 
-<nodeId>RCU_84055620466592_BB_6_1</nodeId> 
-<nodeId>RCU_84055620466592_BB_6_2</nodeId> 
-<nodeId>RCU_84055620466592_BB_6_3</nodeId> 
-<temperatures>20.0</temperatures> 
-<temperatures>20.0</temperatures> 
-<temperatures>20.0</temperatures> 
-<temperatures>20.0</temperatures> 
-<temperatures>20.0</temperatures> 
-</baseBoard></code> 
- 
-The attributes have the following meaning: \\ 
- 
-^ Attribute ^ Description ^ Unit ^ Data type ^ 
-|''id'' |Unique ID for referencing the component|-|String| 
-|''rcuId'' |Unique ID of the RECS<sup>(r)</sup>%%|%%Box Computing Unit hosting the baseboard|-|String| 
-|''rcuPosition'' |Position of the baseboard inside the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|Integer| 
-|''infrastructurePower'' |Power usage of the infrastructure components on the baseboard|W|Double| 
-|''baseboardType'' |Type of the baseboard (CXP, APLS)|-|String| 
-|''nodeId'' |List of IDs of the nodes installed on the baseboard|-|String|
-|''temperatures'' |List of temperatures measured on the baseboard|°C|Double|
- 
-Corresponding to the baseboard component, the API offers baseboardList, which contains multiple baseboard instances.
- 
-=== RCU === 
- 
-Example XML: 
- 
-<code xml><rcu rcuType="ANTARES" fanSpeed="60" rackId="RCK_1" name="RECSMaster (RCU) on 192.168.56.195" rackPosition="0" id="RCU_84055620466592"> 
-<backplaneId>RCU_84055620466592_BP_1</backplaneId> 
-<baseBoardId>RCU_84055620466592_BB_1</baseBoardId> 
-<baseBoardId>RCU_84055620466592_BB_2</baseBoardId> 
-<baseBoardId>RCU_84055620466592_BB_3</baseBoardId> 
-<baseBoardId>RCU_84055620466592_BB_4</baseBoardId> 
-<baseBoardId>RCU_84055620466592_BB_5</baseBoardId> 
-<baseBoardId>RCU_84055620466592_BB_6</baseBoardId> 
-</rcu></code> 
- 
-The attributes have the following meaning: \\ 
- 
-^ Attribute ^ Description ^ Unit ^ Data type ^ 
-|''id'' |Unique ID for referencing the component|-|String| 
-|''rackId'' |ID of the rack which hosts the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|String| 
-|''rackPosition'' |Position of the RECS<sup>(r)</sup>%%|%%Box Computing Unit in the rack|-|Integer| 
-|''name'' |Name of the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|String| 
-|''ip'' |IP address of the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|String| 
-|''rcuType'' |Type of the RECS<sup>(r)</sup>%%|%%Box Computing Unit (SIRIUS, ARNEB, ANTARES)|-|String| 
-|''kvmNode'' |ID of the node to which the KVM system is switched (optional)|-|String| 
-|''fanSpeed'' |Current speed setting of the fans in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|%|Integer| 
-|''backplaneId'' |List of IDs of backplanes which are installed in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|String|
-|''baseBoardId'' |List of IDs of baseboards which are installed in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|String|
- 
-Corresponding to the rcu component, the API offers rcuList, which contains multiple rcu instances.
- 
-=== Rack === 
- 
-Example XML: 
- 
-<code xml><rack description="Default rack" id="RCK_1"> 
-<rcuId>RCU_84055620466592</rcuId> 
-</rack></code> 
- 
-The attributes have the following meaning: \\ 
- 
-^ Attribute ^ Description ^ Unit ^ Data type ^ 
-|''id'' |Unique ID for referencing the component|-|String| 
-|''description'' |Description of the rack|-|String|
-|''rcuId'' |List of IDs of RECS<sup>(r)</sup>%%|%%Box Computing Units which are installed in the rack|-|String|
- 
-Corresponding to the rack component, the API offers rackList, which contains multiple rack instances.
- 
-==== Resources ==== 
- 
-The resources are split into monitoring resources (for pure information gathering) and management resources (for changing the system configuration or state). 
- 
-=== Monitoring === 
- 
-For monitoring the following resources are available: \\ 
- 
-^ Resource ^ Description ^ HTTP method ^
-|''/node'' |Returns a nodeList with all nodes of the cluster|GET| 
-|''/node/{node_id}'' |Returns information about the node with the given ID|GET| 
-|''/baseboard'' |Returns a baseboardList with all baseboards of the cluster|GET| 
-|''/baseboard/{baseboard_id}'' |Returns information about the baseboard with the given ID|GET| 
-|''/baseboard/{baseboard_id}/node'' |Returns a nodeList with all nodes that are installed on the baseboard with the given ID|GET| 
-|''/backplane'' |Returns a backplaneList with all backplanes of the cluster|GET| 
-|''/backplane/{backplane_id}'' |Returns information about the backplane with the given ID|GET| 
-|''/rcu'' |Returns an rcuList with all RECS<sup>(r)</sup>%%|%%Box Computing Units of the cluster|GET| 
-|''/rcu/{rcu_id}'' |Returns information about the RECS<sup>(r)</sup>%%|%%Box Computing Unit with the given ID|GET| 
-|''/rcu/{rcu_id}/baseboard'' |Returns a baseboardList with all baseboards that are installed in the RECS<sup>(r)</sup>%%|%%Box Computing Unit with the given ID|GET| 
-|''/rcu/{rcu_id}/backplane'' |Returns a backplaneList with all backplanes that are installed in the RECS<sup>(r)</sup>%%|%%Box Computing Unit with the given ID|GET| 
-|''/rcu/{rcu_id}/node'' |Returns a nodeList with all nodes that are installed in the RECS<sup>(r)</sup>%%|%%Box Computing Unit with the given ID|GET| 
-|''/rack'' |Returns a rackList with all racks of the cluster|GET| 
-|''/rack/{rack_id}'' |Returns information about the rack with the given ID|GET| 
-|''/rack/{rack_id}/rcu'' |Returns an rcuList with all RECS<sup>(r)</sup>%%|%%Box Computing Units that are installed in the rack with the given ID|GET|
- 
-=== Management === 
- 
-The management of individual components can be found under the "manage" path of the component. \\ 
- 
-^ Resource ^ Description ^ HTTP method ^ Parameter ^
-|''/node/{node_id}/manage/power_on'' |Turns on the node with the given ID and returns updated node XML|POST| | 
-|''/node/{node_id}/manage/power_off'' |Turns off the node with the given ID and returns updated node XML|POST| | 
-|''/node/{node_id}/manage/reset'' |Resets the node with the given ID and returns updated node XML|POST| | 
-|''/node/{node_id}/manage/select_kvm'' |Switches the KVM port of the RECS<sup>(r)</sup>%%|%%Box Computing Unit containing the node to the node with the given ID and returns updated node XML|PUT| | 
-|''/rcu/{rcu_id}/manage/set_fans'' |Sets the overall fan speed of the RCU with the given ID and returns the current status of the RCU|PUT|percent={value}|
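
The table above maps each management action to an HTTP method. A small helper can derive the method and URL for a call, as sketched below. Note the assumptions: the base URL ''https://tor-master.example/REST/'' is a placeholder, ''management_call'' is an illustrative name, and the sketch passes ''percent'' as a URL query parameter, which the table does not specify (it may equally be expected as form data in the request body).

```python
from urllib.parse import quote, urlencode

# HTTP method per (component, action), taken from the management table.
MANAGEMENT_ACTIONS = {
    ("node", "power_on"): "POST",
    ("node", "power_off"): "POST",
    ("node", "reset"): "POST",
    ("node", "select_kvm"): "PUT",
    ("rcu", "set_fans"): "PUT",
}


def management_call(base_url, component, component_id, action, **params):
    """Return (HTTP method, URL) for a management action.

    Assumption: parameters are sent in the query string.
    """
    method = MANAGEMENT_ACTIONS[(component, action)]
    url = "{}/{}/{}/manage/{}".format(
        base_url.rstrip("/"), component, quote(component_id, safe=""), action
    )
    if params:
        url += "?" + urlencode(params)
    return method, url


# Hypothetical base URL; the RCU ID is taken from the example XML.
method, url = management_call(
    "https://tor-master.example/REST/", "rcu", "RCU_84055620466592",
    "set_fans", percent=60,
)
```

An unknown combination of component and action raises a ''KeyError'', so typos surface before any request is sent.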
- 
-=== Errors === 
- 
-Information about the success or failure of management requests is returned via HTTP status codes. See [[http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html|RFC2616]] for an overview of the defined HTTP status codes.
- 
-===== Nagios API ===== 
- 
-Integrating the RECS<sup>(r)</sup>%%|%%Box system into existing monitoring software is simple, as the TOR-Master provides monitoring information in the native Nagios NRPE format. Only the NRPE plugin has to be installed and configured. A sample call and its output are shown below:
- 
-<code bash>$ /usr/lib/nagios/plugins/check_nrpe -H 10.11.12.244 \ 
-            -c check_temp -a 10.11.12.244 10 2 70:104 105: 
- 
-OK - Temperature: 29 C|temp=29,000000;70:104;105;70,000000;105,000000 
-</code> 
- 
-The options are used as follows: \\
- 
-^ Option ^ Description ^ 
-| ''-H'' | Host to ask for data, this is always the IP of the TOR-Master (example: 10.11.12.244) | 
-| ''-c'' | Plugin to run. Can be ''check_temp'' or ''check_power'' | 
-| ''-a'' | Arguments to pass to the plugin, see more details in tables below | 
- 
-Arguments for ''check_temp'' plugin: \\ 
- 
-^ Argument example ^ Description ^ 
-| ''10.11.12.244'' | Get sensor values from device with this IP (RCU/RPU) | 
-| ''10'' | Get sensor values from this baseboard (''1'' - ''18'') | 
-| ''2'' | Get values from this sensor (''max'', ''inlet'', ''outlet'', ''0'', ''1'', ''2'', ''3'', ''4'') | 
-| ''70:104'' | Range of warning threshold | 
-| ''105:'' | Range of critical threshold (ending with '':'' means open end) |
- 
-Arguments for the ''check_power'' plugin: \\ 
- 
-^ Argument example ^ Description ^ 
-| ''10.11.12.244'' | Get sensor values from device with this IP (RCU/RPU) | 
-| ''10'' | Get sensor values from this baseboard (''1'' - ''18'') |     
-| ''2'' | Get sensor values from this node (''1'', ''2'', ''3'', ''4'') | 
-| ''80:109'' | Range of warning threshold [Watt] | 
-| ''110:'' | Range of critical threshold [Watt] (ending with '':'' means open end) |
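
The sample output shown earlier follows the usual Nagios plugin layout: a status text, then ''|'', then performance data of the form ''label=value;warn;crit;min;max''. The following sketch splits such a line into its parts. ''parse_check_output'' is an illustrative helper; it assumes values without unit suffixes and converts the decimal commas seen in the sample to decimal points.

```python
def _to_float(text):
    """Convert '29,000000' or '29.0' to a float; empty fields become None."""
    return float(text.replace(",", ".")) if text else None


def parse_check_output(line):
    """Split a plugin output line into status, message and performance data."""
    text, _, perfdata = line.strip().partition("|")
    status, _, message = text.partition(" - ")
    metrics = {}
    for item in perfdata.split():
        label, _, fields = item.partition("=")
        # Pad to five fields so that missing trailing fields become empty.
        value, warn, crit, minimum, maximum = (fields.split(";") + [""] * 5)[:5]
        metrics[label] = {
            "value": _to_float(value),
            "warn": warn,   # threshold ranges are kept as strings
            "crit": crit,
            "min": _to_float(minimum),
            "max": _to_float(maximum),
        }
    return status, message, metrics


# Sample output line copied from the check_temp example.
status, message, metrics = parse_check_output(
    "OK - Temperature: 29 C|temp=29,000000;70:104;105;70,000000;105,000000"
)
```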