Differences

This shows you the differences between two versions of the page.

doc_recs4:software_interface [2020/12/16 11:19] – created vordoc_recs4:software_interface [2023/10/13 10:06] bil
Line 1: Line 1:
 ====== Software interface ====== ====== Software interface ======
  
-There are several software interfaces available to monitor the status of the RECS<sup>(r)</sup>%%|%%Box system. These are the Management Web****GUI and a REST API providing XML based monitoring and management functionality. The Nagios NRPE interface was removed in RECS<sup>(r)</sup>%%|%%Box gen 4 systems.+There are several software interfaces available to monitor the status of the RECS<sup>(r)</sup>%%|%%Box system. These are the Management Web****GUI, a Redfish API and a proprietary REST API providing XML based monitoring and management functionality.
  
 ===== Management WebGUI ===== ===== Management WebGUI =====
Line 13: Line 13:
 |{{ :documentation:statuscritical.png?nolink |}} |Critical error. The system must be checked immediately and may have to be shut down to prevent hardware damage. Indicated by a red line in a graph.| |{{ :documentation:statuscritical.png?nolink |}} |Critical error. The system must be checked immediately and may have to be shut down to prevent hardware damage. Indicated by a red line in a graph.|
  
-Figure 1 shows the first call of the Management Web****GUI. It is organized into three columns. The first is on the left-hand side and contains the following:+Figure 1 shows the first call of the Management Web****GUI. The menu on the left side contains the following:
  
 [[documentation:software_interface#Overview|Overview:]] General overview of all managed RCU<sup></sup>s, RPU<sup></sup>s, installed nodes and health status\\ [[documentation:software_interface#Overview|Overview:]] General overview of all managed RCU<sup></sup>s, RPU<sup></sup>s, installed nodes and health status\\
Line 104: Line 104:
  
 <code xml><backplaneList> <code xml><backplaneList>
-<backplane position="1" id="RCU_84055620466592_BP_1" infrastructurePower="0.0">+<backplane position="1" id="RCU_84055620466592_BP_1" infrastructurePower="0.0"  
 +lastSensorUpdate="1465470151268">
 <temperatures>24.0</temperatures> <temperatures>24.0</temperatures>
 <temperatures>25.0</temperatures> <temperatures>25.0</temperatures>
Line 117: Line 118:
 Example XML: Example XML:
  
-<code xml><node baseBoardPosition="0" maxPowerUsage="44" actualNodePowerUsage="32.426884399865166" +<code xml><node baseboardPosition="0" maxPowerUsage="44" actualNodePowerUsage="32.426884399865166" 
 actualPEGPowerUsage="15.12053962324833" actualPowerUsage="47.54742402311349" architecture="x86"  actualPEGPowerUsage="15.12053962324833" actualPowerUsage="47.54742402311349" architecture="x86" 
-baseBoardId="RCU_84055620466592_BB_1" health="OK" id="RCU_84055620466592_BB_1_0" inletTemperature="20.0" +baseboardId="RCU_84055620466592_BB_1" health="OK" id="RCU_84055620466592_BB_1_0" inletTemperature="20.0" 
 lastSensorUpdate="1465470151268" macAddressCompute="70:b3:d5:56:40:48" outletTemperature="20.0" state="1"  lastSensorUpdate="1465470151268" macAddressCompute="70:b3:d5:56:40:48" outletTemperature="20.0" state="1" 
 highestTemperature="20.0" voltage="12.072700851453936"/></code> highestTemperature="20.0" voltage="12.072700851453936"/></code>
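A minimal sketch of how a client might read the node XML shown above, using only the Python standard library. The attribute names follow the newer revision of the example (''baseboardId'', ''baseboardPosition''); the function name ''parse_node'' and the returned dictionary keys are hypothetical and not part of the API. The power-state codes are taken from the description of the ''state'' attribute.

```python
import xml.etree.ElementTree as ET

# Example node XML, copied from the documentation above.
NODE_XML = '''<node baseboardPosition="0" maxPowerUsage="44" actualNodePowerUsage="32.426884399865166"
actualPEGPowerUsage="15.12053962324833" actualPowerUsage="47.54742402311349" architecture="x86"
baseboardId="RCU_84055620466592_BB_1" health="OK" id="RCU_84055620466592_BB_1_0" inletTemperature="20.0"
lastSensorUpdate="1465470151268" macAddressCompute="70:b3:d5:56:40:48" outletTemperature="20.0" state="1"
highestTemperature="20.0" voltage="12.072700851453936"/>'''

# Power-state codes as listed for the "state" attribute.
POWER_STATES = {0: "Off", 1: "On", 2: "Soft-off", 3: "Standby", 4: "Hibernate"}


def parse_node(xml_text):
    """Convert a <node/> element into a plain dict (hypothetical helper)."""
    attrs = ET.fromstring(xml_text).attrib
    return {
        "id": attrs["id"],
        "baseboard_id": attrs["baseboardId"],
        "baseboard_position": int(attrs["baseboardPosition"]),
        "health": attrs["health"],
        "state": POWER_STATES.get(int(attrs["state"]), "Unknown"),
        "power_w": float(attrs["actualPowerUsage"]),
        "max_power_w": int(attrs["maxPowerUsage"]),
    }


if __name__ == "__main__":
    print(parse_node(NODE_XML))
```

Looking attributes up by name rather than by position keeps the sketch independent of the attribute order, which XML does not guarantee.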
Line 131: Line 132:
 |''actualPEGPowerUsage'' |Actual power consumption of a PEG card|W|Double| |''actualPEGPowerUsage'' |Actual power consumption of a PEG card|W|Double|
 |''maxPowerUsage'' |Maximum power the node can draw|W|Integer| |''maxPowerUsage'' |Maximum power the node can draw|W|Integer|
-|''baseBoardId'' |ID of the baseboard which hosts the node|-|String| +|''baseboardId'' |ID of the baseboard which hosts the node|-|String| 
-|''baseBoardPosition'' |Position of the node on the baseboard|-|Integer|+|''baseboardPosition'' |Position of the node on the baseboard|-|Integer|
 |''state'' |Power state of the node (0=Off, 1=On, 2=Soft-off, 3=Standby, 4=Hibernate)|-|Integer| |''state'' |Power state of the node (0=Off, 1=On, 2=Soft-off, 3=Standby, 4=Hibernate)|-|Integer|
 |''architecture'' |Architecture (x86, arm, UNKNOWN)|-|String| |''architecture'' |Architecture (x86, arm, UNKNOWN)|-|String|
Line 150: Line 151:
 Example XML: Example XML:
  
-<code xml><backplane position="1" id="RCU_84055620466592_BP_1" infrastructurePower="0.0">+<code xml><backplane position="1" id="RCU_84055620466592_BP_1" infrastructurePower="0.0"  
 +lastSensorUpdate="1465470151268">
 <temperatures>24.0</temperatures> <temperatures>24.0</temperatures>
 <temperatures>25.0</temperatures> <temperatures>25.0</temperatures>
Line 164: Line 166:
 |''position'' |Position of the backplane in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|Integer| |''position'' |Position of the backplane in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|Integer|
 |''infrastructurePower'' |Power usage of the infrastructure components on the backplane|W|Double| |''infrastructurePower'' |Power usage of the infrastructure components on the backplane|W|Double|
 +|''lastSensorUpdate'' |Timestamp of the last sensor update|ms|Long|
 |''temperatures'' |List of temperatures measured on the backplane|°C|Double| |''temperatures'' |List of temperatures measured on the backplane|°C|Double|
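The following sketch shows one way to read a backplane element, including the ''lastSensorUpdate'' attribute. It assumes the value is a Unix timestamp in milliseconds (UTC); the table only states the unit, so the epoch is an assumption. The XML is a shortened copy of the example above, and ''parse_backplane'' is a hypothetical helper name.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# Shortened copy of the backplane example (only two temperature entries).
BACKPLANE_XML = '''<backplane position="1" id="RCU_84055620466592_BP_1" infrastructurePower="0.0"
lastSensorUpdate="1465470151268">
<temperatures>24.0</temperatures>
<temperatures>25.0</temperatures>
</backplane>'''


def parse_backplane(xml_text):
    """Convert a <backplane> element into a plain dict (hypothetical helper)."""
    bp = ET.fromstring(xml_text)
    last_update_ms = int(bp.get("lastSensorUpdate"))
    return {
        "id": bp.get("id"),
        "position": int(bp.get("position")),
        "infrastructure_power_w": float(bp.get("infrastructurePower")),
        # Assumption: milliseconds since the Unix epoch, interpreted as UTC.
        "last_update": datetime.fromtimestamp(last_update_ms / 1000, tz=timezone.utc),
        # Repeated <temperatures> children form the list of measured values.
        "temperatures_c": [float(t.text) for t in bp.findall("temperatures")],
    }


if __name__ == "__main__":
    print(parse_backplane(BACKPLANE_XML))
```

Comparing ''last_update'' with the current time is a simple way to detect stale sensor data.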
  
Line 172: Line 175:
 Example XML: Example XML:
  
-<code xml><baseBoard rcuPosition="6" baseboardType="APLS" id="RCU_84055620466592_BB_6" infrastructurePower="9.8" rcuId="RCU_84055620466592">+<code xml><baseboard rcuPosition="6" baseboardType="APLS" id="RCU_84055620466592_BB_6" infrastructurePower="9.8"  
 +lastSensorUpdate="1465470151268" rcuId="RCU_84055620466592">
 <nodeId>RCU_84055620466592_BB_6_1</nodeId> <nodeId>RCU_84055620466592_BB_6_1</nodeId>
 <nodeId>RCU_84055620466592_BB_6_2</nodeId> <nodeId>RCU_84055620466592_BB_6_2</nodeId>
Line 181: Line 185:
 <temperatures>20.0</temperatures> <temperatures>20.0</temperatures>
 <temperatures>20.0</temperatures> <temperatures>20.0</temperatures>
-</baseBoard></code>+</baseboard></code>
  
 The attributes have the following meaning: \\ The attributes have the following meaning: \\
Line 190: Line 194:
 |''rcuPosition'' |Position of the baseboard inside the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|Integer| |''rcuPosition'' |Position of the baseboard inside the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|Integer|
 |''infrastructurePower'' |Power usage of the infrastructure components on the baseboard|W|Double| |''infrastructurePower'' |Power usage of the infrastructure components on the baseboard|W|Double|
 +|''lastSensorUpdate'' |Timestamp of the last sensor update|ms|Long|
 |''baseboardType'' |Type of the baseboard (CXP, APLS)|-|String| |''baseboardType'' |Type of the baseboard (CXP, APLS)|-|String|
 |''nodeId'' |List of ID****s of the nodes installed on the baseboard|-|String| |''nodeId'' |List of ID****s of the nodes installed on the baseboard|-|String|
Line 200: Line 205:
 Example XML: Example XML:
  
-<code xml><rcu rcuType="ANTARES" fanSpeed="60" rackId="RCK_1" name="RECSMaster (RCU) on 192.168.56.195" rackPosition="0" id="RCU_84055620466592">+<code xml><rcu rcuType="ANTARES" fanSpeed="60" fanProfile="adjust_by_temperature" rackId="RCK_1" name="RECSMaster (RCU) on 192.168.56.195" rackPosition="0" id="RCU_84055620466592" lastSensorUpdate="1465470151268">
 <backplaneId>RCU_84055620466592_BP_1</backplaneId> <backplaneId>RCU_84055620466592_BP_1</backplaneId>
-<baseBoardId>RCU_84055620466592_BB_1</baseBoardId>+<baseboardId>RCU_84055620466592_BB_1</baseboardId>
-<baseBoardId>RCU_84055620466592_BB_2</baseBoardId>+<baseboardId>RCU_84055620466592_BB_2</baseboardId>
-<baseBoardId>RCU_84055620466592_BB_3</baseBoardId>+<baseboardId>RCU_84055620466592_BB_3</baseboardId>
-<baseBoardId>RCU_84055620466592_BB_4</baseBoardId>+<baseboardId>RCU_84055620466592_BB_4</baseboardId>
-<baseBoardId>RCU_84055620466592_BB_5</baseBoardId>+<baseboardId>RCU_84055620466592_BB_5</baseboardId>
-<baseBoardId>RCU_84055620466592_BB_6</baseBoardId>+<baseboardId>RCU_84055620466592_BB_6</baseboardId>
 </rcu></code> </rcu></code>
  
Line 221: Line 226:
 |''kvmNode'' |ID of the node to which the KVM system is switched (optional)|-|String| |''kvmNode'' |ID of the node to which the KVM system is switched (optional)|-|String|
 |''fanSpeed'' |Current speed setting of the fans in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|%|Integer| |''fanSpeed'' |Current speed setting of the fans in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|%|Integer|
 +|''fanProfile'' |Current fan profile of the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|String|
 +|''lastSensorUpdate'' |Timestamp of the last sensor update|ms|Long|
 |''backplaneId'' |List of ID****s of backplanes which are installed in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|String| |''backplaneId'' |List of ID****s of backplanes which are installed in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|String|
-|''baseBoardId'' |List of ID****s of baseboards which are installed in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|String|+|''baseboardId'' |List of ID****s of baseboards which are installed in the RECS<sup>(r)</sup>%%|%%Box Computing Unit|-|String|
  
 Analogous to the rcu component, the API offers rcuList, which returns multiple instances of rcu. Analogous to the rcu component, the API offers rcuList, which returns multiple instances of rcu.
Line 278: Line 285:
 |''/node/{node_id}/manage/select_kvm'' |Switches the KVM port of the RECS<sup>(r)</sup>%%|%%Box Computing Unit containing the node to the node with the given ID and returns updated node XML|PUT| | |''/node/{node_id}/manage/select_kvm'' |Switches the KVM port of the RECS<sup>(r)</sup>%%|%%Box Computing Unit containing the node to the node with the given ID and returns updated node XML|PUT| |
 |''/rcu/{rcu_id}/manage/set_fans'' |Sets the overall fan speed of the RCU with the given ID and returns the current status of the RCU|PUT|percent={value}| |''/rcu/{rcu_id}/manage/set_fans'' |Sets the overall fan speed of the RCU with the given ID and returns the current status of the RCU|PUT|percent={value}|
 +|''/rcu/{rcu_id}/manage/set_fan_profile'' |Sets the fan profile of the RCU with the given ID and returns the current status of the RCU (possible values: manual, increase_by_temperature, adjust_by_temperature)|PUT|percent={value}|
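As an illustration, a management request for ''set_fan_profile'' could be assembled as sketched below. Several details are assumptions: the base URL is a placeholder, the parameter is sent as a URL query parameter, and its name is taken from the table row as written (''percent''). The helper ''build_set_fan_profile_request'' is hypothetical. The sketch only builds the request and does not send it.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Profile names as listed in the table row for set_fan_profile.
FAN_PROFILES = ("manual", "increase_by_temperature", "adjust_by_temperature")


def build_set_fan_profile_request(base_url, rcu_id, profile, param_name="percent"):
    """Build (but do not send) a PUT request for set_fan_profile.

    Assumptions: base_url is a placeholder for the REST API root, and the
    value is passed as a query parameter whose name follows the table row.
    """
    if profile not in FAN_PROFILES:
        raise ValueError(f"unknown fan profile: {profile!r}")
    path = f"/rcu/{quote(rcu_id, safe='')}/manage/set_fan_profile"
    query = urlencode({param_name: profile})
    return Request(f"{base_url.rstrip('/')}{path}?{query}", method="PUT")


if __name__ == "__main__":
    req = build_set_fan_profile_request(
        "https://recs-master.example/api",  # hypothetical base URL
        "RCU_84055620466592",
        "adjust_by_temperature",
    )
    print(req.get_method(), req.full_url)
```

Passing the request to ''urllib.request.urlopen'' would send it; a failed request raises ''urllib.error.HTTPError'', whose ''code'' attribute carries the HTTP status code.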
  
 === Errors === === Errors ===
  
 Information about the success or failure of management requests is returned via HTTP status codes. See [[http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html|RFC2616]] for an overview of the defined HTTP status codes. Information about the success or failure of management requests is returned via HTTP status codes. See [[http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html|RFC2616]] for an overview of the defined HTTP status codes.
 +
 +===== Prometheus =====
 +
 +A Prometheus exporter is built in and can be enabled. It is accessible at ''https://TOR-Master/metrics/'' or ''http://TOR-Master/metrics/'' and requires HTTP basic authentication. 
 +
 +The main advantage of the Prometheus exporter over the other APIs is that it exports its metrics dynamically: metrics are added or removed at runtime when hardware is changed or hot-plugged, so only the microservers that are actually plugged in are exported. Because the RECS<sup>(r)</sup>%%|%%Box is modular and every RECS<sup>(r)</sup>%%|%%Box can be equipped with different carrier blades and microserver configurations, this is highly relevant in practice. Traditional monitoring tools that do not support dynamic metrics require their configuration files to be edited manually after every such change. 
 +
 +==== Prometheus Configuration ====
 +
 +Prometheus needs very little configuration to automatically parse all information and write it into a database. This makes all metrics easily accessible. 
 +
 +<code>
 +  - job_name: 'RECS_Master'
 +    scrape_interval: 1s
 +    scrape_timeout: 1s
 +    static_configs:
 +     - targets: ['192.168.0.100']
 +    basic_auth:
 +      username: 'user'
 +      password: 'password'
 +</code>
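To illustrate what a scrape returns, the sketch below parses a small sample in the Prometheus text exposition format. The metric name ''recs_node_power_watts'' and its ''node'' label are invented for illustration; the names the exporter actually publishes are not listed here. ''parse_metrics'' is a hypothetical helper and handles only simple sample lines.

```python
import re

# Hypothetical sample; the real exporter's metric names may differ.
SAMPLE = '''# HELP recs_node_power_watts Hypothetical metric, for illustration only
# TYPE recs_node_power_watts gauge
recs_node_power_watts{node="RCU_84055620466592_BB_1_0"} 47.5
recs_node_power_watts{node="RCU_84055620466592_BB_1_1"} 31.0
'''

# metric_name, optional {labels}, then the sample value.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

Because the set of samples changes with the installed hardware, a consumer written this way picks up newly plugged-in microservers without any configuration change.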
 +
 +==== Grafana Dashboard ====
 +
 +Grafana is recommended as a graphical dashboard for viewing the captured metrics. A pre-built Grafana dashboard is publicly available at https://grafana.com/grafana/dashboards/14622 and can be added to Grafana using the "Import" function. It automatically reads the available metrics from the database and dynamically adapts to the number of available microservers, as shown in the following picture:
 +
 +<imgcaption web-gui-overview|>
 +{{ :documentation:grafana.png?direct |Grafana Dashboard}}</imgcaption>
 +