Differences

This shows you the differences between two versions of the page.

doc_recs4:software_interface [2021/06/24 13:44] vordoc_recs4:software_interface [2023/10/13 10:06] bil
Line 1: Line 1:
 ====== Software interface ====== ====== Software interface ======
  
-There are several software interfaces available to monitor the status of the RECS<sup>(r)</sup>%%|%%Box system. These are the Management Web****GUI and a REST API providing XML based monitoring and management functionality. The Nagios NRPE interface was removed in RECS<sup>(r)</sup>%%|%%Box gen 4 systems.+There are several software interfaces available to monitor the status of the RECS<sup>(r)</sup>%%|%%Box system. These are the Management Web****GUI, a Redfish API and a proprietary REST API providing XML based monitoring and management functionality.
  
 ===== Management WebGUI ===== ===== Management WebGUI =====
Line 13: Line 13:
 |{{ :documentation:statuscritical.png?nolink |}} |Critical Error. The system must be checked immediately and may have to be shut down to prevent hardware damage. Indicated by a red line in a graph.| |{{ :documentation:statuscritical.png?nolink |}} |Critical Error. The system must be checked immediately and may have to be shut down to prevent hardware damage. Indicated by a red line in a graph.|
  
-Figure 1 shows the first call of the Management Web****GUI. It is organized into three columns. The first is on the left-hand side and contains the following:+Figure 1 shows the first call of the Management Web****GUI. The menu on the left side contains the following:
  
 [[documentation:software_interface#Overview|Overview:]] General overview of all managed RCU<sup></sup>s, RPU<sup></sup>s, installed nodes and health status\\ [[documentation:software_interface#Overview|Overview:]] General overview of all managed RCU<sup></sup>s, RPU<sup></sup>s, installed nodes and health status\\
Line 118: Line 118:
 Example XML: Example XML:
  
-<code xml><node baseBoardPosition="0" maxPowerUsage="44" actualNodePowerUsage="32.426884399865166" +<code xml><node baseboardPosition="0" maxPowerUsage="44" actualNodePowerUsage="32.426884399865166" 
 actualPEGPowerUsage="15.12053962324833" actualPowerUsage="47.54742402311349" architecture="x86"  actualPEGPowerUsage="15.12053962324833" actualPowerUsage="47.54742402311349" architecture="x86" 
-baseBoardId="RCU_84055620466592_BB_1" health="OK" id="RCU_84055620466592_BB_1_0" inletTemperature="20.0" +baseboardId="RCU_84055620466592_BB_1" health="OK" id="RCU_84055620466592_BB_1_0" inletTemperature="20.0" 
 lastSensorUpdate="1465470151268" macAddressCompute="70:b3:d5:56:40:48" outletTemperature="20.0" state="1"  lastSensorUpdate="1465470151268" macAddressCompute="70:b3:d5:56:40:48" outletTemperature="20.0" state="1" 
 highestTemperature="20.0" voltage="12.072700851453936"/></code> highestTemperature="20.0" voltage="12.072700851453936"/></code>
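The example above can be read with any XML parser. The following is a minimal sketch in Python, assuming the attribute names of the newer revision (''baseboardId'', ''baseboardPosition''); the function name ''parse_node'' and the keys of the returned dictionary are illustrative and not part of the API. The state mapping is taken from the attribute table of this component.

```python
import xml.etree.ElementTree as ET

# Node XML copied from the example in this section (newer attribute spelling).
NODE_XML = """<node baseboardPosition="0" maxPowerUsage="44" actualNodePowerUsage="32.426884399865166"
actualPEGPowerUsage="15.12053962324833" actualPowerUsage="47.54742402311349" architecture="x86"
baseboardId="RCU_84055620466592_BB_1" health="OK" id="RCU_84055620466592_BB_1_0" inletTemperature="20.0"
lastSensorUpdate="1465470151268" macAddressCompute="70:b3:d5:56:40:48" outletTemperature="20.0" state="1"
highestTemperature="20.0" voltage="12.072700851453936"/>"""

# Power states as documented for the "state" attribute.
POWER_STATES = {0: "Off", 1: "On", 2: "Soft-off", 3: "Standby", 4: "Hibernate"}


def parse_node(xml_text):
    """Return a small summary dict for one <node> element (illustrative helper)."""
    node = ET.fromstring(xml_text)
    return {
        "id": node.get("id"),
        "baseboard_id": node.get("baseboardId"),
        "health": node.get("health"),
        "state": POWER_STATES.get(int(node.get("state")), "Unknown"),
        "power_w": float(node.get("actualPowerUsage")),
        "max_power_w": int(node.get("maxPowerUsage")),
    }


summary = parse_node(NODE_XML)
print(summary)
```

All values arrive as XML attribute strings, so numeric fields have to be converted explicitly according to the types listed in the attribute table.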
Line 132: Line 132:
 |''actualPEGPowerUsage'' |Actual power consumption of a PEG card|W|Double| |''actualPEGPowerUsage'' |Actual power consumption of a PEG card|W|Double|
 |''maxPowerUsage'' |Maximum power the node can draw|W|Integer| |''maxPowerUsage'' |Maximum power the node can draw|W|Integer|
-|''baseBoardId'' |ID of the baseboard which hosts the node|-|String| +|''baseboardId'' |ID of the baseboard which hosts the node|-|String| 
-|''baseBoardPosition'' |Position of the node on the baseboard|-|Integer|+|''baseboardPosition'' |Position of the node on the baseboard|-|Integer|
 |''state'' |Power state of the node (0=Off, 1=On, 2=Soft-off, 3=Standby, 4=Hibernate)|-|Integer| |''state'' |Power state of the node (0=Off, 1=On, 2=Soft-off, 3=Standby, 4=Hibernate)|-|Integer|
 |''architecture'' |Architecture (x86, arm, UNKNOWN)|-|String| |''architecture'' |Architecture (x86, arm, UNKNOWN)|-|String|
Line 175: Line 175:
 Example XML: Example XML:
  
-<code xml><baseBoard rcuPosition="6" baseboardType="APLS" id="RCU_84055620466592_BB_6" infrastructurePower="9.8" +<code xml><baseboard rcuPosition="6" baseboardType="APLS" id="RCU_84055620466592_BB_6" infrastructurePower="9.8" 
 lastSensorUpdate="1465470151268" rcuId="RCU_84055620466592"> lastSensorUpdate="1465470151268" rcuId="RCU_84055620466592">
 <nodeId>RCU_84055620466592_BB_6_1</nodeId> <nodeId>RCU_84055620466592_BB_6_1</nodeId>
Line 185: Line 185:
 <temperatures>20.0</temperatures> <temperatures>20.0</temperatures>
 <temperatures>20.0</temperatures> <temperatures>20.0</temperatures>
-</baseBoard></code>+</baseboard></code>
  
 The attributes have the following meaning: \\ The attributes have the following meaning: \\
Line 205: Line 205:
 Example XML: Example XML:
  
-<code xml><rcu rcuType="ANTARES" fanSpeed="60" rackId="RCK_1" name="RECSMaster (RCU) on 192.168.56.195" rackPosition="0" id="RCU_84055620466592" lastSensorUpdate="1465470151268">+<code xml><rcu rcuType="ANTARES" fanSpeed="60" fanProfile="adjust_by_temperature" rackId="RCK_1" name="RECSMaster (RCU) on 192.168.56.195" rackPosition="0" id="RCU_84055620466592" lastSensorUpdate="1465470151268">
 <backplaneId>RCU_84055620466592_BP_1</backplaneId> <backplaneId>RCU_84055620466592_BP_1</backplaneId>
-<baseBoardId>RCU_84055620466592_BB_1</baseBoardId>+<baseboardId>RCU_84055620466592_BB_1</baseboardId>
-<baseBoardId>RCU_84055620466592_BB_2</baseBoardId>+<baseboardId>RCU_84055620466592_BB_2</baseboardId>
-<baseBoardId>RCU_84055620466592_BB_3</baseBoardId>+<baseboardId>RCU_84055620466592_BB_3</baseboardId>
-<baseBoardId>RCU_84055620466592_BB_4</baseBoardId>+<baseboardId>RCU_84055620466592_BB_4</baseboardId>
-<baseBoardId>RCU_84055620466592_BB_5</baseBoardId>+<baseboardId>RCU_84055620466592_BB_5</baseboardId>
-<baseBoardId>RCU_84055620466592_BB_6</baseBoardId>+<baseboardId>RCU_84055620466592_BB_6</baseboardId>
 </rcu></code> </rcu></code>
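Because ''backplaneId'' and ''baseboardId'' are repeated child elements rather than attributes, a client has to collect them as lists. The following Python sketch illustrates this on a shortened copy of the example, using the newer element spelling ''baseboardId''. The helper name ''summarize_rcu'' is an assumption made for illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the rcu example from this section (two baseboards only).
RCU_XML = """<rcu rcuType="ANTARES" fanSpeed="60" fanProfile="adjust_by_temperature" rackId="RCK_1" name="RECSMaster (RCU) on 192.168.56.195" rackPosition="0" id="RCU_84055620466592" lastSensorUpdate="1465470151268">
<backplaneId>RCU_84055620466592_BP_1</backplaneId>
<baseboardId>RCU_84055620466592_BB_1</baseboardId>
<baseboardId>RCU_84055620466592_BB_2</baseboardId>
</rcu>"""


def summarize_rcu(xml_text):
    """Collect attributes and repeated child IDs of one <rcu> element."""
    rcu = ET.fromstring(xml_text)
    return {
        "id": rcu.get("id"),
        "fan_speed_percent": int(rcu.get("fanSpeed")),
        # fanProfile only appears in the newer revision of the example,
        # so this may be None for responses that lack the attribute.
        "fan_profile": rcu.get("fanProfile"),
        "backplane_ids": [e.text for e in rcu.findall("backplaneId")],
        "baseboard_ids": [e.text for e in rcu.findall("baseboardId")],
    }


rcu_summary = summarize_rcu(RCU_XML)
print(rcu_summary["baseboard_ids"])
```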
  
Line 226: Line 226:
 |''kvmNode'' |ID of the node to which the KVM system is switched (optional)|-|String| |''kvmNode'' |ID of the node to which the KVM system is switched (optional)|-|String|
 |''fanSpeed'' |Current speed setting of the fans in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|%|Integer| |''fanSpeed'' |Current speed setting of the fans in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|%|Integer|
 +|''fanProfile'' |Current fan profile of the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|String|
 |''lastSensorUpdate'' |Timestamp of the last sensor update|ms|Long| |''lastSensorUpdate'' |Timestamp of the last sensor update|ms|Long|
 |''backplaneId'' |List of ID****s of backplanes which are installed in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|String| |''backplaneId'' |List of ID****s of backplanes which are installed in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|String|
-|''baseBoardId'' |List of ID****s of baseboards which are installed in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|String|+|''baseboardId'' |List of ID****s of baseboards which are installed in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|String|
  
 Analogous to the rcu component, the API offers rcuList, which returns multiple instances of rcu. Analogous to the rcu component, the API offers rcuList, which returns multiple instances of rcu.
Line 284: Line 285:
 |''/node/{node_id}/manage/select_kvm'' |Switches the KVM port of the RECS<sup>(r)</sup>%%|%%Box Computing Unit containing the node to the node with the given ID and returns updated node XML|PUT| | |''/node/{node_id}/manage/select_kvm'' |Switches the KVM port of the RECS<sup>(r)</sup>%%|%%Box Computing Unit containing the node to the node with the given ID and returns updated node XML|PUT| |
 |''/rcu/{rcu_id}/manage/set_fans'' |Sets the overall fan speed of the RCU with the given ID and returns the current status of the RCU|PUT|percent={value}| |''/rcu/{rcu_id}/manage/set_fans'' |Sets the overall fan speed of the RCU with the given ID and returns the current status of the RCU|PUT|percent={value}|
 +|''/rcu/{rcu_id}/manage/set_fan_profile'' |Sets the fan profile of the RCU with the given ID and returns the current status of the RCU (Possible values: manual, increase_by_temperature, adjust_by_temperature)|PUT|percent={value}|
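A management call such as ''set_fans'' could be prepared as sketched below in Python. Several details are assumptions, because this section does not state them: the base URL ''https://recs.example/REST'' is a placeholder, the parameter is passed as a URL query string, and authentication is omitted. The helper ''build_manage_request'' is hypothetical and only builds the request object without sending it.

```python
import urllib.parse
import urllib.request

# Assumption: placeholder base URL; the real API root is not given in this section.
BASE_URL = "https://recs.example/REST"


def build_manage_request(base_url, rcu_id, action, **params):
    """Build (but do not send) a PUT request for /rcu/{rcu_id}/manage/{action}."""
    path = f"/rcu/{urllib.parse.quote(rcu_id, safe='')}/manage/{action}"
    # Assumption: parameters such as percent={value} go into the query string.
    query = urllib.parse.urlencode(params)
    url = base_url.rstrip("/") + path + (f"?{query}" if query else "")
    return urllib.request.Request(url, method="PUT")


req = build_manage_request(BASE_URL, "RCU_84055620466592", "set_fans", percent=60)
print(req.get_method(), req.full_url)
```

Sending the request with ''urllib.request.urlopen(req)'' would return the updated RCU XML on success, according to the table above; a failure is reported through the HTTP status code.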
  
 === Errors === === Errors ===
Line 289: Line 291:
 Information about the success or failure of management requests is returned via HTTP status codes. See [[http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html|RFC2616]] for an overview of the defined HTTP status codes. Information about the success or failure of management requests is returned via HTTP status codes. See [[http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html|RFC2616]] for an overview of the defined HTTP status codes.
  
-===== Prometheus Exporter =====+===== Prometheus =====
  
 A Prometheus exporter is built in and can be enabled. It is accessible at ''https://TOR-Master/metrics/'' or ''http://TOR-Master/metrics/'' and requires HTTP basic authentication.  A Prometheus exporter is built in and can be enabled. It is accessible at ''https://TOR-Master/metrics/'' or ''http://TOR-Master/metrics/'' and requires HTTP basic authentication. 
 +
 +The main advantage of the Prometheus exporter over the other APIs is that it exports its metrics dynamically, so metrics can be added or removed at runtime after hardware is changed or hot-plugged. This makes it possible to export metrics only for those microservers that are plugged in. Because the RECS<sup>(r)</sup>%%|%%Box is modular and every RECS<sup>(r)</sup>%%|%%Box can be equipped with different carrier blades and microserver configurations, this is highly relevant. Traditional monitoring tools that do not support dynamic metrics require regular manual changes to their configuration files, which is tedious. 
  
 ==== Prometheus Configuration ==== ==== Prometheus Configuration ====
  
-A typical prometheus could look like this:+Prometheus needs very little configuration to automatically parse all information and write it into a database. This makes all metrics easily accessible. 
  
 <code> <code>
Line 304: Line 308:
      - targets: ['192.168.0.100']      - targets: ['192.168.0.100']
     basic_auth:     basic_auth:
-      username: 'admin'+      username: 'user'
-      password: 'admin'+      password: 'password'
 </code> </code>
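Outside of Prometheus itself, the text returned by the ''/metrics/'' endpoint can also be inspected directly. The Python sketch below parses the plain-text exposition format into a dictionary. The metric names in ''SAMPLE'' are invented for illustration and are not the names the exporter actually uses. Fetching the text from a real system would additionally require the HTTP basic authentication mentioned above.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: value}.

    Comment lines (# HELP, # TYPE) and blank lines are skipped; an optional
    trailing timestamp is ignored.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Series with labels: everything up to the last "}" is the key.
            end = line.rfind("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        samples[series] = float(rest[0])
    return samples


# Hypothetical sample; real metric names depend on the exporter.
SAMPLE = """\
# HELP recs_node_power_watts Hypothetical metric name, for illustration only
# TYPE recs_node_power_watts gauge
recs_node_power_watts{node="RCU_84055620466592_BB_1_0"} 47.5
recs_rcu_fan_speed_percent 60
"""

metrics = parse_metrics(SAMPLE)
print(metrics)
```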
  
 ==== Grafana Dashboard ==== ==== Grafana Dashboard ====
  
-An example grafana dashboard is published on https://grafana.com/grafana/dashboards/14622 and can be integrated in Grafana using the "Import" function in grafana.+It is recommended to use Grafana as a graphical dashboard to view the captured metrics. A pre-built Grafana dashboard is publicly available at https://grafana.com/grafana/dashboards/14622. It can be imported into Grafana using the "Import" function. It automatically reads the available metrics from the database and dynamically adapts to the number of available microservers, as shown in the following picture: 
 + 
 +<imgcaption web-gui-overview|> 
 +{{ :documentation:grafana.png?direct |Grafana Dashboard}}</imgcaption>