Metrics collected

Regular metrics

Your cluster in numbers: Regular metrics include information such as CPU, memory, packet errors, and drives.

Metric Name	Unit	Possible values	Possible error causes
BlockedProcesses	processes	Metric value, No data	This could happen if the `/proc/stat` file is inaccessible.
CPUSystem	seconds	Metric value, No data	This could happen when the `psutil` library encounters errors.
CPUWait	seconds	Metric value, No data	This could happen when the `psutil` library encounters errors.
ErrorsRecv		Metric value, No data	This could happen when the `psutil` library encounters errors.
ErrorsSent		Metric value, No data	This could happen when the `psutil` library encounters errors.
FailedJobs		Metric value, No data	Slurm is not installed.
HardwareCorruptedMemory	kB	Metric value, No data	This could happen if the `/proc/meminfo` file is inaccessible.
MemoryActive	bytes	Metric value, No data	This could happen when the `psutil` library encounters errors.
MemoryAvailable	bytes	Metric value, No data	This could happen when the `psutil` library encounters errors.
MemoryCached	bytes	Metric value, No data	This could happen when the `psutil` library encounters errors.
MemoryFree	bytes	Metric value, No data	This could happen when the `psutil` library encounters errors.
MemoryInactive	bytes	Metric value, No data	This could happen when the `psutil` library encounters errors.
MemoryPercent	percent	Metric value, No data	This could happen when the `psutil` library encounters errors.
MemoryShared	bytes	Metric value, No data	This could happen when the `psutil` library encounters errors.
MemoryTotal	bytes	Metric value, No data	This could happen when the `psutil` library encounters errors.
MemoryUsed	bytes	Metric value, No data	This could happen when the `psutil` library encounters errors.
NodesDown		Metric value, No data	Slurm is not installed.
NodesTotal		Metric value, No data	Slurm is not installed.
NodesUp		Metric value, No data	Slurm is not installed.
QueuedJobs		Metric value, No data	Slurm is not installed.
RunningJobs		Metric value, No data	Slurm is not installed.
SMARTHDATemp	°C	Metric value, No data	`smartctl` commands failed.
UniqueUserLogin		Metric value, No data
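
The error causes in the table above suggest where several of these values come from. The following is a minimal sketch, not Omnia's actual collector: it assumes that BlockedProcesses corresponds to the `procs_blocked` line of `/proc/stat`, that HardwareCorruptedMemory corresponds to the `HardwareCorrupted` line of `/proc/meminfo`, and that the CPU, memory, and packet-error metrics map to the `psutil` fields shown. The function names and the metric-to-field mapping are illustrative assumptions.

```python
from pathlib import Path

NO_DATA = "No Data"  # placeholder reported when a value cannot be collected


def parse_blocked_processes(stat_text):
    """Return the `procs_blocked` count from /proc/stat content, else NO_DATA."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "procs_blocked":
            return int(fields[1])
    return NO_DATA


def parse_hardware_corrupted_kb(meminfo_text):
    """Return the `HardwareCorrupted` value (kB) from /proc/meminfo content."""
    for line in meminfo_text.splitlines():
        if line.startswith("HardwareCorrupted:"):
            return int(line.split()[1])
    return NO_DATA


def read_proc_metric(path, parser):
    """Read a /proc file and parse it; an inaccessible file yields NO_DATA."""
    try:
        return parser(Path(path).read_text())
    except (OSError, ValueError, IndexError):
        return NO_DATA


# Assumed mapping of a few table metrics to psutil fields (illustrative only).
PSUTIL_METRICS = [
    "CPUSystem", "CPUWait", "ErrorsRecv", "ErrorsSent", "MemoryTotal",
    "MemoryAvailable", "MemoryUsed", "MemoryFree", "MemoryPercent",
]


def collect_psutil_metrics():
    """Collect psutil-backed metrics; any psutil error yields NO_DATA for all."""
    try:
        import psutil  # third-party; imported here so a missing install is handled

        cpu = psutil.cpu_times()
        mem = psutil.virtual_memory()
        net = psutil.net_io_counters()
        return {
            "CPUSystem": cpu.system,
            "CPUWait": getattr(cpu, "iowait", NO_DATA),  # field exists on Linux only
            "ErrorsRecv": net.errin,
            "ErrorsSent": net.errout,
            "MemoryTotal": mem.total,
            "MemoryAvailable": mem.available,
            "MemoryUsed": mem.used,
            "MemoryFree": mem.free,
            "MemoryPercent": mem.percent,
        }
    except Exception:
        return dict.fromkeys(PSUTIL_METRICS, NO_DATA)


if __name__ == "__main__":
    print("BlockedProcesses:", read_proc_metric("/proc/stat", parse_blocked_processes))
    print("HardwareCorruptedMemory:", read_proc_metric("/proc/meminfo", parse_hardware_corrupted_kb))
    for name, value in collect_psutil_metrics().items():
        print(f"{name}: {value}")
```

Running the script on a cluster node prints either a value or `No Data` for each metric, which can help narrow down whether a missing value is caused by an inaccessible `/proc` file or by a `psutil` problem.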

Health metrics

The health of your cluster: Health metrics include key performance indicators.

Metric Name	Possible values	Possible failure causes
dmesg	Unknown, Fail, Pass	[Unknown] The `dmesg` command was not found on the cluster node. [Fail] The `dmesg` command returned an error log message.
beegfs -beegfsstat	Unknown, Fail, Pass	[Unknown] BeeGFS is not installed or is inactive. [Fail] The BeeGFS client service has failed, or the node is not in the list of reachable BeeGFS clients.
gpu_driver_health:gpu	Unknown, Fail, Pass	AMD or NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed.
gpu_health_nvlink:gpu [1]	Unknown, Fail, Pass	AMD or NVIDIA accelerators are not present. NVLinks are not present. GPU drivers, including ROCm and CUDA, are not installed.
gpu_health_pcie:gpu	Unknown, Fail, Pass	AMD or NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed.
gpu_health_pmu:gpu	Unknown, Fail, Pass	AMD or NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed.
gpu_health_power:gpu	Unknown, Fail, Pass	AMD or NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed.
gpu_health_thermal:gpu	Unknown, Metric value	AMD or NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed.
Kubernetespodsstatus	Unknown, Fail, Pass	Kubernetes is not installed.
Kuberneteschildnode	Unknown, Fail, Pass	Kubernetes is not installed.
kubernetesnodesstatus	Unknown, Fail, Pass	Kubernetes is not installed.
kubernetescomponentsstatus	Unknown, Fail, Pass	Kubernetes is not installed.
Smart	Unknown, Fail, Pass	`smartctl` commands failed.
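
The three-state result used by the health metrics can be illustrated with the `dmesg` row. The sketch below is a hypothetical check, not Omnia's implementation: it assumes that "an error log message" means any output from `dmesg --level=err`, and that a `dmesg` call that exits with an error (for example, because kernel log access is restricted) is reported as Unknown. The function names are illustrative.

```python
import shutil
import subprocess


def classify_health(command_found, error_output):
    """Map a check result to the Unknown / Fail / Pass states from the table."""
    if not command_found:
        return "Unknown"  # the command is not available on the node
    return "Fail" if error_output.strip() else "Pass"


def dmesg_health():
    """Run a dmesg-based health check and return Unknown, Fail, or Pass."""
    if shutil.which("dmesg") is None:
        return classify_health(False, "")
    result = subprocess.run(
        ["dmesg", "--level=err"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        # Assumption: an unreadable kernel log is treated as Unknown, not Fail.
        return "Unknown"
    return classify_health(True, result.stdout)


if __name__ == "__main__":
    print("dmesg:", dmesg_health())
```

Keeping the classification separate from the command invocation makes the mapping easy to reuse for other checks that follow the same Unknown, Fail, Pass pattern.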

GPU metrics

The GPUs of your cluster: GPU metrics include information about the GPUs in the cluster.

Metric Name	Unit	Possible values	Possible error causes
gpu_temperature:gpu	°C	Metric value, No data	AMD or NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed.
gpu_utilization	percent	Metric value, No data	AMD or NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed.
gpu_utilization:average	percent	Metric value, No data	AMD or NVIDIA accelerators are not present. GPU drivers, including ROCm and CUDA, are not installed.

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.