Metrics collected
Regular metrics
Your cluster in numbers: Regular metrics include information such as CPU, memory, packet errors, and drives.
Metric Name | Unit | Possible Values | Possible error causes |
---|---|---|---|
BlockedProcesses | processes | | |
CPUSystem | seconds | | |
CPUWait | seconds | | |
ErrorsRecv | | | |
ErrorsSent | | | |
FailedJobs | | | |
HardwareCorruptedMemory | kB | | |
MemoryActive | bytes | | |
MemoryAvailable | bytes | | |
MemoryCached | bytes | | |
MemoryFree | bytes | | |
MemoryInactive | bytes | | |
MemoryPercent | percent | | |
MemoryShared | bytes | | |
MemoryTotal | bytes | | |
MemoryUsed | bytes | | |
NodesDown | | | |
NodesTotal | | | |
NodesUp | | | |
QueuedJobs | | | |
RunningJobs | | | |
SMARTHDATemp | °C | | |
UniqueUserLogin | | | |
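The relationship between some of these metrics can be illustrated with a minimal sketch. It assumes that MemoryPercent is derived as (MemoryTotal - MemoryAvailable) / MemoryTotal and that NodesUp plus NodesDown equals NodesTotal; neither relationship is stated in this document, and the sample values and the helper names `memory_percent` and `nodes_consistent` are hypothetical.

```python
# Hypothetical sample of collected metrics; values are illustrative only.
sample = {
    "MemoryTotal": 64 * 1024**3,      # bytes
    "MemoryAvailable": 16 * 1024**3,  # bytes
    "NodesUp": 7,
    "NodesDown": 1,
    "NodesTotal": 8,
}


def memory_percent(total_bytes, available_bytes):
    """Percentage of memory in use.

    Assumption: percent = (total - available) / total * 100.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return round(100.0 * (total_bytes - available_bytes) / total_bytes, 1)


def nodes_consistent(metrics):
    """Check the assumed invariant NodesUp + NodesDown == NodesTotal."""
    return metrics["NodesUp"] + metrics["NodesDown"] == metrics["NodesTotal"]


print(memory_percent(sample["MemoryTotal"], sample["MemoryAvailable"]))
print(nodes_consistent(sample))
```

A check of this kind may help spot a collector that is reporting stale or partial values, for example when the node counts stop adding up.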
Health metrics
The health of your cluster: Health metrics include key performance indicators.
Metric Name | Possible value(s) | Possible failure causes |
---|---|---|
dmesg | | |
beegfs -beegfsstat | | |
gpu_driver_health:gpu | | |
gpu_health_nvlink:gpu [1] | | |
gpu_health_pcie:gpu | | |
gpu_health_pmu:gpu | | |
gpu_health_power:gpu | | |
gpu_health_thermal:gpu | | |
Kubernetespodsstatus | | |
Kuberneteschildnode | | |
kubernetesnodesstatus | | |
kubernetescomponentsstatus | | |
Smart | | |
GPU metrics
The GPUs of your cluster: GPU metrics include information about the GPUs in the cluster.
Metric Name | Unit | Possible value(s) | Potential error cause(s) |
---|---|---|---|
gpu_temperature:gpu | °C | | |
gpu_utilization | percent | | |
gpu_utilization:average | percent | | |
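As a rough illustration of how the two utilization metrics might relate, the sketch below treats gpu_utilization:average as the arithmetic mean of the per-GPU gpu_utilization readings. That interpretation is an assumption, not something this document confirms, and the readings and the `average_utilization` helper are hypothetical.

```python
from statistics import mean

# Hypothetical per-GPU utilization readings (percent), keyed by GPU index.
gpu_utilization = {0: 80.0, 1: 40.0, 2: 0.0, 3: 100.0}


def average_utilization(per_gpu):
    """Arithmetic mean of per-GPU utilization, or None if there are no GPUs.

    Assumption: this is how gpu_utilization:average is derived.
    """
    if not per_gpu:
        return None
    return mean(per_gpu.values())


print(average_utilization(gpu_utilization))
```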
If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.