Additional metric information
Metric Name | Command | Comments | Aggregation Level |
---|---|---|---|
BlockedProcesses | | | Node Level |
CPUSystem | | | Node Level |
CPUWait | | | Node Level |
ErrorsRecv | | Populated with the error packets received on each individual network interface. | Node Level |
ErrorsSent | | Populated with the error packets sent on each individual network interface. | Node Level |
FailedJobs | | | Cluster Level |
HardwareCorruptedMemory | | | Node Level |
MemoryActive | | | Node Level |
MemoryAvailable | | | Node Level |
MemoryCached | | | Node Level |
MemoryFree | | | Node Level |
MemoryInactive | | | Node Level |
MemoryPercent | | | Node Level |
MemoryShared | | | Node Level |
MemoryTotal | | | Node Level |
MemoryUsed | | | Node Level |
NodesDown | | A node is considered down if its state is any of the following: down, drained, draining, fail, failing, future, inval, maint, powered_down, powering_down, unknown, unk. Note: a node state with a * suffix is also considered down. For example, idle* is considered down. | Cluster Level |
NodesTotal | | | Cluster Level |
NodesUp | | A node is considered up if its state is any of the following: idle, mixed, completing. Note: a node state with a * suffix is considered down. For example, idle* is considered down. | Cluster Level |
QueuedJobs | | | Cluster Level |
RunningJobs | | | Cluster Level |
SMARTHDATemp | | | Node Level |
UniqueUserLogin | | | Login Node, or Manager Node if a Login Node is not present |
dmesg | | | Node Level |
Beegfs -beegfsstat | | | Node Level |
gpu_driver_health:gpu | | | Node Level |
gpu_health_nvlink:gpu [1] | NVIDIA: | | Node Level |
gpu_health_pcie:gpu | | | Node Level |
gpu_health_pmu:gpu | For NVIDIA GPU: | PMU (power management unit) | Node Level |
gpu_health_power:gpu | For NVIDIA GPU: | Power consumption | Node Level |
gpu_health_thermal:gpu | For AMD GPU: | GPU temperature health | Node Level |
Kubernetespodsstatus | | Value is Pass when all pods and containers are in the running state; otherwise, Fail. | Cluster Level |
Kuberneteschildnode | | Value is Pass when all child nodes are in the Ready or Ready,SchedulingDisabled state; otherwise, Fail. | Cluster Level |
kubernetesnodesstatus | | Value is Pass when all nodes are in the Ready or Ready,SchedulingDisabled state; otherwise, Fail. | Cluster Level |
kubernetescomponentsstatus | | Value is Pass when health check is passed in | Cluster Level |
Smart | | | Node Level |
gpu_temperature:gpu | | | Node Level |
gpu_utilization: | | | Node Level |
gpu_utilization:average | For NVIDIA GPU: / For AMD GPU: | Value is the average of the utilization values of all GPUs. | Node Level |
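The NodesUp and NodesDown rules in the table can be read as a small classification step. The following is a minimal sketch of that logic, not Omnia's implementation: the function names `classify_node_state` and `count_nodes` are hypothetical, and the "other" bucket for states that the table does not list is an assumption made for illustration.

```python
# Illustrative sketch only: these helper names are hypothetical, not Omnia APIs.

# Node states that the table lists under NodesUp and NodesDown.
UP_STATES = {"idle", "mixed", "completing"}
DOWN_STATES = {
    "down", "drained", "draining", "fail", "failing", "future",
    "inval", "maint", "powered_down", "powering_down", "unknown", "unk",
}


def classify_node_state(state: str) -> str:
    """Return "up", "down", or "other" for a single node state string."""
    normalized = state.strip().lower()
    # A trailing "*" marks the node as down regardless of the base state.
    if normalized.endswith("*"):
        return "down"
    if normalized in UP_STATES:
        return "up"
    if normalized in DOWN_STATES:
        return "down"
    # Assumption: states the table does not name are left unclassified.
    return "other"


def count_nodes(states):
    """Tally a list of node state strings into up/down/other counts."""
    counts = {"up": 0, "down": 0, "other": 0}
    for state in states:
        counts[classify_node_state(state)] += 1
    return counts
```

For example, `count_nodes(["idle", "idle*", "drained"])` counts one node as up and two as down, because the `*` suffix overrides the otherwise "up" state `idle`.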
Note: psutil (Python system and process utilities) is a cross-platform library for retrieving information on running processes and system utilization (CPU, memory, network).
If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.