Additional metric information

Telemetry metrics

Metric Name

Command

Comments

Aggregation Level

BlockedProcesses

grep procs_blocked /proc/stat

Node Level

CPUSystem

psutil.cpu_times().system

Node Level

CPUWait

psutil.cpu_times().iowait

Node Level

ErrorsRecv

psutil.net_io_counters(pernic=True).get(interface_name).errin

Error packets received for individual network interfaces will be populated.

Node Level

ErrorsSent

psutil.net_io_counters(pernic=True).get(interface_name).errout

Error packets sent for individual network interfaces will be populated.

Node Level

FailedJobs

sacct -P --delimiter=\t

Cluster Level

HardwareCorruptedMemory

grep HardwareCorrupted /proc/meminfo

Node Level

MemoryActive

psutil.virtual_memory().active

Node Level

MemoryAvailable

psutil.virtual_memory().available

Node Level

MemoryCached

psutil.virtual_memory().cached

Node Level

MemoryFree

psutil.virtual_memory().free

Node Level

MemoryInactive

psutil.virtual_memory().inactive

Node Level

MemoryPercent

psutil.virtual_memory().percent

Node Level

MemoryShared

psutil.virtual_memory().shared

Node Level

MemoryTotal

psutil.virtual_memory().total

Node Level

MemoryUsed

psutil.virtual_memory().used

Node Level

NodesDown

sinfo --format=%N\t%P\t%a\t%C\t%t\t%D\t%m

Node is considered down if node state is any of the following: down, drained, draining, fail, failing, future, inval, maint, powered_down, powering_down, unknown, unk.

Note

Node state with * in suffix will be considered as down. Example, idle* will be considered as down.

Cluster Level

NodesTotal

sinfo --format=%N\t%P\t%a\t%C\t%t\t%D\t%m

Cluster Level

NodesUp

sinfo --format=%N\t%P\t%a\t%C\t%t\t%D\t%m

Node is considered up if node state is any of the following: idle, mixed, completing.

Note

Node state with * in suffix will be considered as down node. Example, idle* will be considered as down node.

Cluster Level

QueuedJobs

squeue --format=%i\t%P\t%j\t%u\t%T\t%S\t%N

Cluster Level

RunningJobs

squeue --format=%i\t%P\t%j\t%u\t%T\t%S\t%N

Cluster Level

SMARTHDATemp

smartctl -a <device name>

Node Level

UniqueUserLogin

who|cut -f 1 -d " "|sort -u|wc -l

  • Locally created users via useradd command are also counted in UniqueUserLogin count.

  • Remote logged in LDAP users are not counted in UniqueUserLogin on login nodes.

  • Remote logged in FreeIPA users are counted in UniqueUserLogin on login nodes.

Login Node/ Manager Node (If Login Node is not present)

dmesg

dmesg --level=err

Node Level

Beegfs -beegfsstat

systemctl is-active beegfs-client beegfs-ctl --nodetype=client --listnodes

Node Level

gpu_driver_health:gpu

  • For NVIDIA GPU: nvidia-smi --query-gpu=driver_version --format=csv,nounits

  • For AMD GPU: rocm-smi --showdriverversion --csv

Node Level

gpu_health_nvlink:gpu [1]

NVIDIA: nvidia-smi nvlink --status

Node Level

gpu_health_pcie:gpu

  • For NVIDIA GPU: nvidia-smi --query-gpu=pci.bus_id --format=csv,nounits

  • For AMD GPU: rocm-smi --showbus --csv

Node Level

gpu_health_pmu:gpu

For NVIDIA GPU: nvidia-smi --query-gpu=power.management --format=csv,nounits

PMU - Power management unit

Node Level

gpu_health_power:gpu

For NVIDIA GPU: nvidia-smi --query-gpu=pci.bus_id --format=csv,nounits

Power consumption

Node Level

gpu_health_thermal:gpu

For AMD GPU: rocm-smi --showbus --csv

GPU temperature health

Node Level

Kubernetespodsstatus

sudo kubectl get pods -A -o json

Value is pass when all pods and containers are in running state, otherwise Fail.

Cluster Level

Kuberneteschildnode

sudo kubectl get nodes -o json

Value is pass when all child nodes are in Ready or Ready,SchedulingDisabled state, otherwise Fail.

Cluster Level

kubernetesnodesstatus

sudo kubectl get nodes -o json

Value is pass when all nodes are in Ready or Ready,SchedulingDisabled state, otherwise Fail.

Cluster Level

kubernetescomponentsstatus

sudo kubectl get --raw=/livez?verbose

Value is Pass when health check is passed in kubectl get --raw=/livez?verbose command , otherwise fail.

Cluster Level

Smart

smartctl -a <device name>

Node Level

gpu_temperature:gpu

  • For NVIDIA GPU: nvidia-smi --query-gpu=temperature.gpu --format=csv,nounits

  • For AMD GPU: rocm-smi -t --csv

Node Level

gpu_utilization:

  • For NVIDIA GPU: nvidia-smi nvidia-smi --query-gpu=utilization.gpu --format=csv,nounits

  • For AMD GPU: rocm-smi -u --csv

Node Level

gpu_utilization:average

*For NVIDIA GPU: nvidia-smi nvidia-smi --query-gpu=utilization.gpu --format=csv,nounits

*For AMD GPU: rocm-smi -u --csv

Value is average of utilization value of all GPUs

Node Level

Note

psutil (python system and process utilities) is a cross-platform library for retrieving information on running processes and system utilization (CPU, memory, network).

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.