Additional metric information

Telemetry metrics
Metric Name	Command	Comments	Aggregation Level
BlockedProcesses	`grep procs_blocked /proc/stat`		Node Level
CPUSystem	`psutil.cpu_times().system`		Node Level
CPUWait	`psutil.cpu_times().iowait`		Node Level
ErrorsRecv	`psutil.net_io_counters(pernic=True).get(interface_name).errin`	Error packets received for individual network interfaces will be populated.	Node Level
ErrorsSent	`psutil.net_io_counters(pernic=True).get(interface_name).errout`	Error packets sent for individual network interfaces will be populated.	Node Level
FailedJobs	`sacct -P --delimiter=\t`		Cluster Level
HardwareCorruptedMemory	`grep HardwareCorrupted /proc/meminfo`		Node Level
MemoryActive	`psutil.virtual_memory().active`		Node Level
MemoryAvailable	`psutil.virtual_memory().available`		Node Level
MemoryCached	`psutil.virtual_memory().cached`		Node Level
MemoryFree	`psutil.virtual_memory().free`		Node Level
MemoryInactive	`psutil.virtual_memory().inactive`		Node Level
MemoryPercent	`psutil.virtual_memory().percent`		Node Level
MemoryShared	`psutil.virtual_memory().shared`		Node Level
MemoryTotal	`psutil.virtual_memory().total`		Node Level
MemoryUsed	`psutil.virtual_memory().used`		Node Level
NodesDown	`sinfo --format=%N\t%P\t%a\t%C\t%t\t%D\t%m`	A node is considered down if its state is any of the following: down, drained, draining, fail, failing, future, inval, maint, powered_down, powering_down, unknown, unk. Note: a node state with a * suffix is also considered down. For example, idle* is considered down.	Cluster Level
NodesTotal	`sinfo --format=%N\t%P\t%a\t%C\t%t\t%D\t%m`		Cluster Level
NodesUp	`sinfo --format=%N\t%P\t%a\t%C\t%t\t%D\t%m`	A node is considered up if its state is any of the following: idle, mixed, completing. Note: a node state with a * suffix is considered down. For example, idle* is considered down.	Cluster Level
QueuedJobs	`squeue --format=%i\t%P\t%j\t%u\t%T\t%S\t%N`		Cluster Level
RunningJobs	`squeue --format=%i\t%P\t%j\t%u\t%T\t%S\t%N`		Cluster Level
SMARTHDATemp	`smartctl -a <device name>`		Node Level
UniqueUserLogin	`who | cut -f 1 -d " " | sort -u | wc -l`	Users created locally with the `useradd` command are also counted in `UniqueUserLogin`. On login nodes, remotely logged-in LDAP users are not counted, while remotely logged-in FreeIPA users are counted.	Login Node / Manager Node (if Login Node is not present)
dmesg	`dmesg --level=err`		Node Level
Beegfs -beegfsstat	`systemctl is-active beegfs-client` `beegfs-ctl --nodetype=client --listnodes`		Node Level
gpu_driver_health:gpu	For NVIDIA GPU: `nvidia-smi --query-gpu=driver_version --format=csv,nounits` For AMD GPU: `rocm-smi --showdriverversion --csv`		Node Level
gpu_health_nvlink:gpu [1]	NVIDIA: `nvidia-smi nvlink --status`		Node Level
gpu_health_pcie:gpu	For NVIDIA GPU: `nvidia-smi --query-gpu=pci.bus_id --format=csv,nounits` For AMD GPU: `rocm-smi --showbus --csv`		Node Level
gpu_health_pmu:gpu	For NVIDIA GPU: `nvidia-smi --query-gpu=power.management --format=csv,nounits`	PMU - Power management unit	Node Level
gpu_health_power:gpu	For NVIDIA GPU: `nvidia-smi --query-gpu=pci.bus_id --format=csv,nounits`	Power consumption	Node Level
gpu_health_thermal:gpu	For AMD GPU: `rocm-smi --showbus --csv`	GPU temperature health	Node Level
Kubernetespodsstatus	`sudo kubectl get pods -A -o json`	Value is Pass when all pods and containers are in the Running state, otherwise Fail.	Cluster Level
Kuberneteschildnode	`sudo kubectl get nodes -o json`	Value is Pass when all child nodes are in the Ready or Ready,SchedulingDisabled state, otherwise Fail.	Cluster Level
kubernetesnodesstatus	`sudo kubectl get nodes -o json`	Value is Pass when all nodes are in the Ready or Ready,SchedulingDisabled state, otherwise Fail.	Cluster Level
kubernetescomponentsstatus	`sudo kubectl get --raw=/livez?verbose`	Value is Pass when the health check reported by `kubectl get --raw=/livez?verbose` passes, otherwise Fail.	Cluster Level
Smart	`smartctl -a <device name>`		Node Level
gpu_temperature:gpu	For NVIDIA GPU: `nvidia-smi --query-gpu=temperature.gpu --format=csv,nounits` For AMD GPU: `rocm-smi -t --csv`		Node Level
gpu_utilization:	For NVIDIA GPU: `nvidia-smi --query-gpu=utilization.gpu --format=csv,nounits` For AMD GPU: `rocm-smi -u --csv`		Node Level
gpu_utilization:average	For NVIDIA GPU: `nvidia-smi --query-gpu=utilization.gpu --format=csv,nounits` For AMD GPU: `rocm-smi -u --csv`	Value is the average utilization across all GPUs.	Node Level
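
The up/down rules in the NodesUp and NodesDown comments can be sketched as a small classifier. This is only an illustration, not Omnia's implementation: the function names `classify_node_state` and `summarize_states` are hypothetical, the state lists are copied from the table, and the "other" bucket for states that appear in neither list is an assumption.

```python
from collections import Counter

# State lists copied from the NodesDown and NodesUp comments in the table.
DOWN_STATES = {
    "down", "drained", "draining", "fail", "failing", "future",
    "inval", "maint", "powered_down", "powering_down", "unknown", "unk",
}
UP_STATES = {"idle", "mixed", "completing"}


def classify_node_state(state):
    """Return "up", "down", or "other" for a single Slurm node state string."""
    state = state.strip().lower()
    # A trailing * always means down, even for an otherwise "up" state.
    if state.endswith("*"):
        return "down"
    if state in UP_STATES:
        return "up"
    if state in DOWN_STATES:
        return "down"
    # Assumption: states in neither list are reported separately.
    return "other"


def summarize_states(states):
    """Tally a list of node state strings into up/down/other counts."""
    counts = Counter(classify_node_state(s) for s in states)
    return {"up": counts["up"], "down": counts["down"], "other": counts["other"]}
```

For example, `summarize_states(["idle", "idle*", "down"])` counts one node as up and two as down, because `idle*` carries the * suffix.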

Note

psutil (Python system and process utilities) is a cross-platform library for retrieving information on running processes and system utilization (CPU, memory, network).
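
As a rough sketch of how the psutil calls in the table map to metric names, the snippet below reads the memory, CPU, and network error values on a Linux node. The helper names `memory_metrics` and `collect_node_metrics` and the `MEMORY_FIELDS` mapping are hypothetical and only illustrate the calls listed above; they are not Omnia's collector code. psutil is a third-party package and must be installed separately.

```python
# Metric names from the table mapped to psutil.virtual_memory() fields (Linux).
MEMORY_FIELDS = {
    "MemoryActive": "active",
    "MemoryAvailable": "available",
    "MemoryCached": "cached",
    "MemoryFree": "free",
    "MemoryInactive": "inactive",
    "MemoryPercent": "percent",
    "MemoryShared": "shared",
    "MemoryTotal": "total",
    "MemoryUsed": "used",
}


def memory_metrics(vm):
    """Map a psutil.virtual_memory()-style object to the table's metric names."""
    return {name: getattr(vm, field) for name, field in MEMORY_FIELDS.items()}


def collect_node_metrics(interface_name):
    """Collect the psutil-backed node-level metrics for one network interface."""
    import psutil  # third-party; imported lazily so memory_metrics works without it

    metrics = memory_metrics(psutil.virtual_memory())

    cpu = psutil.cpu_times()
    metrics["CPUSystem"] = cpu.system
    metrics["CPUWait"] = cpu.iowait  # iowait is only reported on Linux

    # pernic=True returns a dict keyed by interface name.
    nic = psutil.net_io_counters(pernic=True).get(interface_name)
    if nic is not None:
        metrics["ErrorsRecv"] = nic.errin
        metrics["ErrorsSent"] = nic.errout
    return metrics
```

Calling `collect_node_metrics("eth0")` on a node returns a dictionary keyed by the metric names in the table; `eth0` is a placeholder interface name, and the network error entries are omitted if that interface does not exist.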
