====== Software interface ======

There are several software interfaces available to monitor the status of the RECS(r)%%|%%Box system. These are the Management Web****GUI, a Redfish API, and a proprietary REST API providing XML-based monitoring and management functionality.

===== Management WebGUI =====

The Management Web****GUI is available on every RECS(r)%%|%%Box unit. It is accessible with any common browser at the assigned IP address on the default port 80. The views described below depend on the device and its assembly. In general, these symbols have the following meaning on every page: \\

|{{ :documentation:statusok.png?nolink |}}|Everything is OK. Also indicated by a green line in a graph.|
|{{ :documentation:statuswarning.png?nolink |}} |Warning. Something is wrong, but the system is still fully functional. The system has to be checked so the problem doesn't get worse. Indicated by a yellow line in a graph.|
|{{ :documentation:statuscritical.png?nolink |}} |Critical Error. The system must be checked immediately and may have to be shut down to prevent hardware damage. Indicated by a red line in a graph.|

On the left side is a menu, which can be toggled by clicking the menu button in the upper left corner of the screen.
The menu contains the following items:

[[doc_recs4:software_interface#Dashboard|Dashboard]]: General overview of the managed system, installed nodes and health status\\
[[doc_recs4:software_interface#Management|Management]]: Power control and monitoring for all nodes and fans\\
[[doc_recs4:software_interface#Network|Network]]: VLAN configuration and configuration of the management network\\
[[doc_recs4:software_interface#Composition|Composition]]: Configuration of PCIe resources\\
[[doc_recs4:software_interface#Users|Users]]: User management\\
[[doc_recs4:software_interface#Settings|Settings]]: System-wide configuration settings\\
[[doc_recs4:software_interface#Time|Time]]: System time settings\\
[[doc_recs4:software_interface#Firmware|Firmware]]: Firmware updates and overview of software versions\\
[[doc_recs4:software_interface#Logs|Logs]]: Logs from the management software about system health and Java messages\\

==== Dashboard ====

The Dashboard is shown first when opening the Web****GUI and displays the summarized system health status.

{{:doc_recs4:deneb_dashboard.png?direct|Dashboard View}}

==== Management ====

In this view, nodes can be turned on or off with a quick menu, which opens when clicking on the gear symbol of a node.\\
Multiple nodes can be controlled at once via the panel "Batch-Control Nodes".\\
The view also shows fan monitoring data and allows a detailed look at the temperature map of the system's baseboard.\\
By clicking on a node label, the respective [[doc_recs4:software_interface#Node Management|Node Management]] view is opened.\\
Furthermore, the view displays the summarized system health status.
{{:doc_recs4:deneb_management.png?direct|Management View}}

=== Node Management ===

This view allows controlling the power state of the selected node and monitoring its detailed status values and graphs.\\
It is also possible to change KVM settings or open a console to the node.\\
If the node is running and the [[documentation:recsdaemon|RECSDaemon]] is installed on it, even more detailed data is shown.

{{:doc_recs4:deneb_node-management.png?direct|Node Management View}}

=== Network ===

The network view allows changing the settings of the management port. This port is used to access the web interface and all APIs.\\
In addition, VLANs of the node network can be configured and assigned to the ports of the nodes and the back panel.

{{:doc_recs4:deneb_network.png?direct|Network View}}

=== Composition ===

This view allows the configuration of the PCIe resources in the form of composed nodes.\\
A composed node is a reserved bundle of resources that utilize PCIe functions.\\
A wizard leads through the process of creating such composed nodes.

{{:doc_recs4:deneb_composition.png?direct|Composition View}}

=== Users ===

This view provides the user management. Users can be created, edited or deleted.\\
Additionally, IPMI passwords can be set.

{{:doc_recs4:deneb_users.png?direct|Users View}}

=== Settings ===

This view allows changing system-wide preferences (e.g. regarding the interfaces of the system).

{{:doc_recs4:deneb_settings.png?direct|Settings View}}

=== Time ===

Here, the system time can be set either manually or using NTP.

{{:doc_recs4:deneb_time.png?direct|Time View}}

=== Firmware ===

This view shows the currently installed versions of the firmware and management software.\\
Furthermore, it is possible to update those software components.
{{:doc_recs4:deneb_firmware.png?direct|Firmware View}}

=== Logs ===

In the System Events tab of this view, the status changes of the sensors, fans and boards can be seen.\\
In the Java Messages tab, all messages regarding the software can be found.\\
Several filters can be set for both tabs at the top.\\
The whole log can be downloaded as a ZIP file containing the individual log files.

{{:doc_recs4:deneb_logs.png?direct|Logs View}}

===== Redfish API =====

The management software also features a Redfish API.\\
The documentation is available on [[https://christmann.github.io/recs-redfish-api/index.html|GitHub]].

===== REST API =====

==== Access ====

The REST API is accessible via the management IP address or the hostname of the system. The base URL of the API has the format ''https://host/REST/''. Accessing the REST API requires HTTP Basic authentication. The authenticated user has to be in the "Admin" or "User" group to be able to execute the POST/PUT management calls.

==== Components ====

The REST API exposes all hardware components of the cluster as XML trees. The following components are supported by the API: \\

^ Component ^ Description ^
|''rcu'' |A RECS Computing Unit (RCU) represents the overall system|
|''backplane'' |A backplane holds sensors and controls fans|
|''baseboard'' |A baseboard can be equipped with zero or more nodes|
|''node'' |A single node|

=== RCU ===

The main entry point of this API is the RECS Computing Unit (RCU).
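
Before turning to the individual components, the access rules from the Access section (base URL ''https://host/REST/'' and HTTP Basic authentication) can be sketched in Python using only the standard library. This is a minimal sketch, not part of the product: the helper names ''build_url'' and ''build_request'', the host ''recs.example'' and the credentials are placeholders chosen for illustration.

```python
import base64
import urllib.request


def build_url(host: str, *segments: str) -> str:
    """Join path segments onto the documented base URL https://host/REST/."""
    path = "/".join(segment.strip("/") for segment in segments)
    return f"https://{host}/REST/{path}"


def build_request(url: str, user: str, password: str,
                  method: str = "GET") -> urllib.request.Request:
    """Create a request that carries an HTTP Basic Authorization header."""
    credentials = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        url,
        method=method,
        headers={"Authorization": f"Basic {token}"},
    )


if __name__ == "__main__":
    # Placeholder host and credentials (assumptions, not real defaults).
    request = build_request(build_url("recs.example", "rcu"), "user", "password")
    # Only show what would be sent; no network access happens here.
    print(request.get_method(), request.full_url)
```

The request object can be sent with ''urllib.request.urlopen''. The curl examples in this document pass ''-k'', which skips certificate verification; the sketch leaves certificate handling to the caller.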
Request: ''curl -X GET -k -i https://host/REST/rcu''

Response:

26,2 32,1 23,0 28,0 47,2 36,1 26,0 32,0 23,1 28,1 47,0 36,1 32,2 44,2 26,1 36,0 74,0 52,1 38.279927272086105 62.24058727899749 74.00064867543587
RCU_10995770589198_BP_1 RCU_10995770589198_BP_2 RCU_10995770589198_BP_3
RCU_10995770589198_BB_1 RCU_10995770589198_BB_2 RCU_10995770589198_BB_3 RCU_10995770589198_BB_4 RCU_10995770589198_BB_6 RCU_10995770589198_BB_7 RCU_10995770589198_BB_8 RCU_10995770589198_BB_9
RCU_10995770589198_Fan_DENEB_1 RCU_10995770589198_Fan_DENEB_2 RCU_10995770589198_Fan_DENEB_3
RCU_10995770589198_BB_1_0 RCU_10995770589198_BB_1_2 RCU_10995770589198_BB_1_3 RCU_10995770589198_BB_1_4 RCU_10995770589198_BB_1_5 RCU_10995770589198_BB_1_6 RCU_10995770589198_BB_1_7 RCU_10995770589198_BB_1_8 RCU_10995770589198_BB_1_9 RCU_10995770589198_BB_1_10 RCU_10995770589198_BB_1_11 RCU_10995770589198_BB_1_12 RCU_10995770589198_BB_1_13 RCU_10995770589198_BB_1_14 RCU_10995770589198_BB_1_15
RCU_10995770589198_BB_2_0 RCU_10995770589198_BB_2_1 RCU_10995770589198_BB_2_2
RCU_10995770589198_BB_3_0 RCU_10995770589198_BB_3_1 RCU_10995770589198_BB_3_2
RCU_10995770589198_BB_4_0 RCU_10995770589198_BB_4_1 RCU_10995770589198_BB_4_2
RCU_10995770589198_BB_6_0 RCU_10995770589198_BB_6_1 RCU_10995770589198_BB_6_2
RCU_10995770589198_BB_7_0 RCU_10995770589198_BB_7_1 RCU_10995770589198_BB_7_2 RCU_10995770589198_BB_7_3 RCU_10995770589198_BB_7_4 RCU_10995770589198_BB_7_5 RCU_10995770589198_BB_7_6 RCU_10995770589198_BB_7_8 RCU_10995770589198_BB_7_9 RCU_10995770589198_BB_7_10 RCU_10995770589198_BB_7_11 RCU_10995770589198_BB_7_12 RCU_10995770589198_BB_7_13 RCU_10995770589198_BB_7_14 RCU_10995770589198_BB_7_15
RCU_10995770589198_BB_8_0 RCU_10995770589198_BB_8_1 RCU_10995770589198_BB_8_2 RCU_10995770589198_BB_8_3 RCU_10995770589198_BB_8_4 RCU_10995770589198_BB_8_5 RCU_10995770589198_BB_8_6 RCU_10995770589198_BB_8_7 RCU_10995770589198_BB_8_9 RCU_10995770589198_BB_8_10 RCU_10995770589198_BB_8_11 RCU_10995770589198_BB_8_12 RCU_10995770589198_BB_8_13 RCU_10995770589198_BB_8_14 RCU_10995770589198_BB_8_15
RCU_10995770589198_BB_9_0 RCU_10995770589198_BB_9_1 RCU_10995770589198_BB_9_2 RCU_10995770589198_BB_9_3 RCU_10995770589198_BB_9_4 RCU_10995770589198_BB_9_5 RCU_10995770589198_BB_9_6 RCU_10995770589198_BB_9_7 RCU_10995770589198_BB_9_8 RCU_10995770589198_BB_9_10 RCU_10995770589198_BB_9_11 RCU_10995770589198_BB_9_12 RCU_10995770589198_BB_9_13 RCU_10995770589198_BB_9_14 RCU_10995770589198_BB_9_15
2024.3027830888711 59.600615599470274 467.0563244508067 1497.6458430385942

Attributes: \\

^ Attribute ^ Description ^ Unit ^ Data type ^
|''name'' |Name of the RCU|-|String|
|''fanSpeed'' |Current speed setting of the fans in the RCU|%|Integer|
|''fanProfile'' |Current fan profile of the RCU|%|Integer|
|''health'' |Health status of the RCU (OK, Warning, Critical)|-|String|
|''ip'' |IP address of the RCU|-|String|
|''kvmNode'' |ID of the node to which the KVM system is switched (optional)|-|String|
|''lastSensorUpdate'' |Timestamp of the last sensor update|ms|Long|
|''type'' |Type of the RCU|-|String|
|''id'' |ID for referencing the component|-|String|

Nested elements: \\

^ Element ^ Description ^ Unit ^ Data type ^
|''temperature'' |List of temperature sensors|°C|Double|
|''backplane'' |ID****s of the backplanes installed in the RCU|-|String|
|''baseboard'' |ID****s of the baseboards installed in the RCU|-|String|
|''fan'' |ID****s of the fans installed in the RCU|-|String|
|''node'' |ID****s of the nodes installed in the RCU|-|String|
|''power'' |List of power sensors|W|Double|

=== Backplane ===

Request: ''curl -X GET -k -i https://host/REST/backplane/RCU_10995770589198_BP_1''

Response:

RCU_10995770589198_Fan_DENEB_1 RCU_10995770589198_Fan_DENEB_2 RCU_10995770589198_Fan_DENEB_3 26,1 32,0 23,0 28,1 47,2 36,2

Attributes: \\

^ Attribute ^ Description ^ Unit ^ Data type ^
|''rcuPosition'' |Position of the backplane inside the RCU|-|Integer|
|''health'' |Health status of the backplane (OK, Warning, Critical)|-|String|
|''lastSensorUpdate'' |Timestamp of the last sensor update|ms|Long|
|''id'' |ID for referencing the component|-|String|

Nested elements: \\

^ Element ^ Description ^ Unit ^ Data type ^
|''fan'' |ID****s of the fans associated with the backplane|-|String|
|''temperature'' |List of temperature sensors|°C|Double|

The API offers backplaneList, which returns a list of the IDs of all backplanes within the system.

RCU_10995770589198_BP_1 RCU_10995770589198_BP_2 RCU_10995770589198_BP_3

=== Baseboard ===

Request: ''curl -X GET -k -i https://host/REST/baseboard/RCU_10995770589198_BB_3''

Response:

RCU_10995770589198_BB_3_0 RCU_10995770589198_BB_3_1 RCU_10995770589198_BB_3_2 7,51 54.91125326153526 54.91125326153526 0.0 42,4 41,4 43,6 25,0 36,5 43,4 46,4 255,0 50,1 45,1 11,90

Attributes: \\

^ Attribute ^ Description ^ Unit ^ Data type ^
|''type'' |Type of the baseboard|-|String|
|''expansionBoardInserted'' |Indicates whether an expansion board is available|-|Boolean|
|''rcuPosition'' |Position of the baseboard inside the RCU|-|Integer|
|''health'' |Health status of the baseboard (OK, Warning, Critical)|-|String|
|''lastSensorUpdate'' |Timestamp of the last sensor update|ms|Long|
|''id'' |ID for referencing the component|-|String|

Nested elements: \\

^ Element ^ Description ^ Unit ^ Data type ^
|''fan'' |ID****s of the fans associated with the baseboard|-|String|
|''node'' |ID****s of the nodes installed on the baseboard|-|String|
|''power'' |List of power sensors|W|Double|
|''temperature'' |List of temperature sensors|°C|Double|
|''voltage'' |List of voltage sensors|V|Double|

The API offers baseboardList, which returns a list of the IDs of all baseboards within the system.
RCU_10995770589198_BB_1 RCU_10995770589198_BB_2 RCU_10995770589198_BB_3 RCU_10995770589198_BB_4 RCU_10995770589198_BB_6 RCU_10995770589198_BB_7 RCU_10995770589198_BB_8 RCU_10995770589198_BB_9

=== Node ===

Request: ''curl -X GET -k -i https://host/REST/node/RCU_10995770589198_BB_3_0''

Response:

RCU_10995770589198_BB_1 20.280571457632558 20,28 20.001274723105443 23.06943277191329 11,91

Attributes: \\

^ Attribute ^ Description ^ Unit ^ Data type ^
|''baseboardPosition'' |Position of the node on the baseboard|-|Integer|
|''name'' |Name of the node|-|String|
|''type'' |Type of the node|-|String|
|''maxPowerUsage'' |Maximum power the node can draw|W|Integer|
|''powerState'' |Power state of the node (Off, On, Soft-off, Standby, Hibernate)|-|String|
|''health'' |Health status of the node (OK, Warning, Critical)|-|String|
|''lastSensorUpdate'' |Timestamp of the last sensor update|ms|Long|
|''id'' |ID for referencing the component|-|String|
|''macAddressCompute'' |MAC address of the NIC connected to the compute network (optional)|-|String|
|''macAddressMgmt'' |MAC address of the NIC connected to the management network (optional)|-|String|

Nested elements: \\

^ Element ^ Description ^ Unit ^ Data type ^
|''baseboard'' |ID of the baseboard hosting the node|-|String|
|''deamon'' |List of daemon sensors (optional)|-|Mixed|
|''power'' |List of power sensors|W|Double|
|''processor'' |List of processors of this node with detailed information|-|-|
|''temperature'' |List of temperature sensors|°C|Double|
|''voltage'' |List of voltage sensors|V|Double|

The API offers nodeList, which returns a list of the IDs of all nodes within the system.
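
The attribute and element names listed in the Node tables above can be read with a standard XML parser. The following is a minimal sketch under stated assumptions: only the names ''id'', ''health'', ''powerState'', ''baseboard'', ''temperature'' and ''power'' come from the tables, while the root tag ''node'', the layout of the sample document and all values are invented for illustration. Treating a comma as a decimal separator is also an assumption, based on the mixed number formats in the sample responses.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. Only the attribute and element names are
# taken from the tables; the root tag, layout and values are assumptions.
SAMPLE_NODE_XML = """
<node id="RCU_1_BB_3_0" name="Node 0" health="OK" powerState="On">
  <baseboard>RCU_1_BB_3</baseboard>
  <temperature>42,4</temperature>
  <temperature>41.5</temperature>
  <power>20.28</power>
</node>
"""


def to_float(text: str) -> float:
    """Parse a sensor value, accepting ',' or '.' as decimal separator."""
    return float(text.strip().replace(",", "."))


def summarize_node(xml_text: str) -> dict:
    """Extract a few documented attributes and sensor lists from a node."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.get("id"),
        "health": root.get("health"),
        "powerState": root.get("powerState"),
        "baseboard": root.findtext("baseboard"),
        "temperatures": [to_float(e.text) for e in root.findall("temperature")],
        "power": [to_float(e.text) for e in root.findall("power")],
    }


if __name__ == "__main__":
    print(summarize_node(SAMPLE_NODE_XML))
```

If the real responses nest sensors differently, only the ''findall'' paths need to change; the rest of the sketch stays the same.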
Request: ''curl -X GET -k -i https://host/REST/node''

Response:

RCU_10995770589198_BB_1_0 RCU_10995770589198_BB_1_2 RCU_10995770589198_BB_1_3 RCU_10995770589198_BB_1_4 RCU_10995770589198_BB_1_5 RCU_10995770589198_BB_1_6 RCU_10995770589198_BB_1_7 RCU_10995770589198_BB_1_8 RCU_10995770589198_BB_1_9 RCU_10995770589198_BB_1_10 RCU_10995770589198_BB_1_11 RCU_10995770589198_BB_1_12 RCU_10995770589198_BB_1_13 RCU_10995770589198_BB_1_14 RCU_10995770589198_BB_1_15
RCU_10995770589198_BB_2_0 RCU_10995770589198_BB_2_1 RCU_10995770589198_BB_2_2
RCU_10995770589198_BB_3_0 RCU_10995770589198_BB_3_1 RCU_10995770589198_BB_3_2
RCU_10995770589198_BB_4_0 RCU_10995770589198_BB_4_1 RCU_10995770589198_BB_4_2
RCU_10995770589198_BB_6_0 RCU_10995770589198_BB_6_1 RCU_10995770589198_BB_6_2
RCU_10995770589198_BB_7_0 RCU_10995770589198_BB_7_1 RCU_10995770589198_BB_7_2 RCU_10995770589198_BB_7_3 RCU_10995770589198_BB_7_4 RCU_10995770589198_BB_7_5 RCU_10995770589198_BB_7_6 RCU_10995770589198_BB_7_8 RCU_10995770589198_BB_7_9 RCU_10995770589198_BB_7_10 RCU_10995770589198_BB_7_11 RCU_10995770589198_BB_7_12 RCU_10995770589198_BB_7_13 RCU_10995770589198_BB_7_14 RCU_10995770589198_BB_7_15
RCU_10995770589198_BB_8_0 RCU_10995770589198_BB_8_1 RCU_10995770589198_BB_8_2 RCU_10995770589198_BB_8_3 RCU_10995770589198_BB_8_4 RCU_10995770589198_BB_8_5 RCU_10995770589198_BB_8_6 RCU_10995770589198_BB_8_7 RCU_10995770589198_BB_8_9 RCU_10995770589198_BB_8_10 RCU_10995770589198_BB_8_11 RCU_10995770589198_BB_8_12 RCU_10995770589198_BB_8_13 RCU_10995770589198_BB_8_14 RCU_10995770589198_BB_8_15
RCU_10995770589198_BB_9_0 RCU_10995770589198_BB_9_1 RCU_10995770589198_BB_9_2 RCU_10995770589198_BB_9_3 RCU_10995770589198_BB_9_4 RCU_10995770589198_BB_9_5 RCU_10995770589198_BB_9_6 RCU_10995770589198_BB_9_7 RCU_10995770589198_BB_9_8 RCU_10995770589198_BB_9_10 RCU_10995770589198_BB_9_11 RCU_10995770589198_BB_9_12 RCU_10995770589198_BB_9_13 RCU_10995770589198_BB_9_14 RCU_10995770589198_BB_9_15

=== Fan ===

Request: ''curl -X GET -k -i https://host/REST/fan/RCU_10995770589198_Fan_TRECS_1''

Response:

Attributes: \\

^ Attribute ^ Description ^ Unit ^ Data type ^
|''position'' |Position of the fan|-|String|
|''installed'' |Indicates whether the fan is installed|-|Boolean|
|''nominalSpeed'' |Nominal speed of the fan|%|Integer|
|''rpm'' |Actual rotational speed of the fan|rpm|Integer|
|''health'' |Health status of the fan (OK, Warning, Critical)|-|String|
|''lastSensorUpdate'' |Timestamp of the last sensor update|ms|Long|
|''id'' |ID for referencing the component|-|String|

The API offers fanList, which returns a list of the IDs of all fans within the system.

Request: ''curl -X GET -k -i https://host/REST/fan''

Response:

RCU_10995770589198_Fan_DENEB_1 RCU_10995770589198_Fan_DENEB_2 RCU_10995770589198_Fan_DENEB_3

==== Endpoints ====

The resources are split into monitoring resources (for pure information gathering) and management resources (for changing the system configuration or state).

=== Monitoring ===

For monitoring, the following resources are available: \\

^ Resource ^ Description ^ HTTP method ^
|''/rcu'' |Returns information about the RCU|GET|
|''/backplane'' |Returns a backplaneList with all backplane IDs of the RCU|GET|
|''/backplane/{backplane_id}'' |Returns information about the backplane with the given ID|GET|
|''/baseboard'' |Returns a baseboardList with all baseboard IDs of the RCU|GET|
|''/baseboard/{baseboard_id}'' |Returns information about the baseboard with the given ID|GET|
|''/baseboard/{baseboard_id}/node'' |Returns a nodeList with all node IDs that are installed on the baseboard with the given ID|GET|
|''/node'' |Returns a nodeList with all node IDs of the RCU|GET|
|''/node/{node_id}'' |Returns information about the node with the given ID|GET|
|''/fan'' |Returns a fanList with all fan IDs of the RCU|GET|
|''/fan/{fan_id}'' |Returns information about the fan with the given ID|GET|

=== Management ===

The management calls for individual components can be found under the "manage" path of the component. \\

^ Resource ^ Description ^ HTTP method ^ Parameter ^
|''/node/{node_id}/manage/power_on'' |Turns on the node with the given ID and returns the updated node|POST| |
|''/node/{node_id}/manage/power_button'' |Turns the node with the given ID on or off and returns the updated node|POST| |
|''/node/{node_id}/manage/power_off'' |Turns off the node with the given ID and returns the updated node|POST| |
|''/node/{node_id}/manage/reset'' |Resets the node with the given ID and returns the updated node|POST| |
|''/node/{node_id}/manage/sleep'' |Puts the node with the given ID into sleep state and returns the updated node|POST| |
|''/node/{node_id}/manage/select_kvm'' |Switches the KVM port of the RCU to the node with the given ID and returns the updated node|PUT| |
|''/node/{node_id}/manage/set_bootsource'' |Sets the boot source of the node with the given ID and returns the updated node|PUT|source={NONE,HDD,CD,PXE,USBSTICK},persistent={true,false}|
|''/rcu/manage/set_fans'' |Sets the overall fan speed of the RCU and returns the current status of the RCU|PUT|percent={value}|
|''/rcu/manage/set_fan_profile'' |Sets the fan profile of the RCU and returns the current status of the RCU|PUT|profile={manual,auto}|
|''/fan/{fan_id}'' |Sets the speed of the fan with the given ID and returns the current status of the fan|PUT|percent={value}|

=== Errors ===

Information about the success or failure of management requests is returned via HTTP status codes. Please have a look at [[http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html|RFC2616]] for an overview of the defined HTTP status codes.

===== Prometheus =====

A Prometheus exporter is built in and can be enabled. It is accessible at ''https://host/metrics/'' or ''http://host/metrics/'' and requires HTTP Basic authentication. The main advantage of the Prometheus exporter compared to the other APIs is that it exports its metrics dynamically, so metrics can be added or removed at runtime after hardware has been changed or hot-plugged.
This makes it possible to export metrics only for those microservers that are plugged in. As the RECS has a modular approach and every RECS can be equipped with different carrier blades and microserver configurations, this is highly relevant. Traditional monitoring tools that do not support dynamic metrics require regular manual changes to their configuration files.

==== Prometheus Configuration ====

Prometheus needs very little configuration to automatically parse all information and write it into its database. This makes all metrics easily accessible.

<code yaml>
- job_name: 'RECS_Master'
  scrape_interval: 1s
  scrape_timeout: 1s
  static_configs:
    - targets: ['192.168.0.100']
  basic_auth:
    username: 'user'
    password: 'password'
</code>

==== Grafana Dashboard ====

It is recommended to use Grafana as a graphical dashboard to visualize the captured metrics. A pre-built Grafana dashboard is publicly available at https://grafana.com/grafana/dashboards/14622. It can be integrated into Grafana using the "Import" function. It automatically reads the available metrics from the database and dynamically adapts to the number of available microservers, as shown in the following picture:

{{ :documentation:grafana.png?direct |Grafana Dashboard}}
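
To check outside of Grafana which metrics the exporter currently exposes, the text returned by the metrics endpoint can be grouped by metric name. The following is a minimal sketch that assumes the exporter returns the standard Prometheus text exposition format. The metric and label names in the sample are invented for illustration and are not taken from the RECS exporter; the function name ''count_series'' is hypothetical as well.

```python
import re
from collections import Counter

# Hypothetical sample in Prometheus text exposition format. The metric and
# label names are placeholders, not real RECS metrics.
SAMPLE_METRICS = """\
# HELP example_node_temperature Example gauge.
# TYPE example_node_temperature gauge
example_node_temperature{node="A"} 42.5
example_node_temperature{node="B"} 40.0
example_fan_rpm 3000
"""

# A sample line is: metric name, optional {labels}, whitespace, value.
_SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def count_series(exposition_text: str) -> Counter:
    """Count how many samples each metric name currently exposes."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for name, series in sorted(count_series(SAMPLE_METRICS).items()):
        print(name, series)
```

Comparing the result of two scrapes, one before and one after hot-plugging hardware, shows which metric names or series were added or removed.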