Troubleshooting guide

Troubleshooting Kubeadm

For a complete guide to troubleshooting kubeadm, click here.

Connecting to internal databases

  • TimescaleDB
    • Start a bash session within the timescaledb pod: kubectl exec -it pod/timescaledb-0 -n telemetry-and-visualizations -- /bin/bash

    • Connect to psql: psql -U <postgres_username>

    • Connect to database: \c telemetry_metrics

  • MySQL DB
    • Start a bash session within the mysqldb pod: kubectl exec -it pod/mysqldb-0 -n telemetry-and-visualizations -- /bin/bash

    • Connect to mysql: mysql -u <mysqldb_username> -p (enter <mysqldb_password> when prompted)

    • Connect to database: USE idrac_telemetrysource_services_db
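
For a quick one-off check, the interactive TimescaleDB steps above can be collapsed into a single non-interactive command. The following is a minimal Python sketch, not part of Omnia: the helper name psql_exec_argv is hypothetical, and it assumes psql inside the pod accepts the connection without prompting for a password.

```python
import shlex

NAMESPACE = "telemetry-and-visualizations"


def psql_exec_argv(username, sql, pod="timescaledb-0", database="telemetry_metrics"):
    """Build the kubectl argv that runs one SQL statement with psql inside the pod."""
    return [
        "kubectl", "exec", f"pod/{pod}", "-n", NAMESPACE, "--",
        "psql", "-U", username, "-d", database, "-c", sql,
    ]


if __name__ == "__main__":
    # Print the command for review; pass the list to subprocess.run() to execute it.
    print(shlex.join(psql_exec_argv("<postgres_username>", "SELECT now();")))
```

Building the command as a list rather than a single string avoids shell quoting problems when the SQL statement contains spaces or quotes.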

Checking and updating encrypted parameters

  1. Change to the directory where the parameters are saved (this example uses provision_config_credentials.yml):

    cd input/
    
  2. To view the encrypted parameters:

    ansible-vault view provision_config_credentials.yml --vault-password-file .provision_vault_key
    
  3. To edit the encrypted parameters:

    ansible-vault edit provision_config_credentials.yml --vault-password-file .provision_vault_key
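
After editing, it can be useful to confirm that the file is still stored encrypted. Files encrypted by ansible-vault begin with a header line starting with $ANSIBLE_VAULT; so a simple check is possible. The following is a minimal Python sketch under that assumption; the function name is_vault_encrypted is hypothetical and the path is the example file used in the steps above.

```python
from pathlib import Path

VAULT_HEADER_PREFIX = "$ANSIBLE_VAULT;"


def is_vault_encrypted(path):
    """Return True if the file's first line is an Ansible Vault header."""
    with Path(path).open("r", encoding="utf-8", errors="replace") as handle:
        first_line = handle.readline()
    return first_line.startswith(VAULT_HEADER_PREFIX)


if __name__ == "__main__":
    # Example path from the steps above, relative to the current directory.
    target = Path("input/provision_config_credentials.yml")
    if target.exists():
        state = "encrypted" if is_vault_encrypted(target) else "NOT encrypted"
        print(f"{target}: {state}")
```

The check only inspects the header; it does not verify that the file can be decrypted with .provision_vault_key.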
    

Checking pod status on the control plane

  • Use this command to get a list of all available pods: kubectl get pods -A

  • Check the status of any specific pod by running: kubectl describe pod <pod name> -n <namespace name>
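
On a busy cluster the full pod list can be long. The two commands above can be combined by filtering the JSON output of kubectl get pods -A and printing a describe command only for pods that look unhealthy. The following is a minimal Python sketch; the function name unhealthy_pods and the choice of Running and Succeeded as the healthy phases are assumptions made for illustration.

```python
import json
import shutil
import subprocess

# Assumption: any other phase (Pending, Failed, Unknown) deserves a closer look.
HEALTHY_PHASES = {"Running", "Succeeded"}


def unhealthy_pods(pod_list_json):
    """Return (namespace, name, phase) for every pod whose phase is not healthy."""
    items = json.loads(pod_list_json).get("items", [])
    flagged = []
    for pod in items:
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase not in HEALTHY_PHASES:
            meta = pod.get("metadata", {})
            flagged.append((meta.get("namespace", ""), meta.get("name", ""), phase))
    return flagged


if __name__ == "__main__":
    if shutil.which("kubectl") is None:
        print("kubectl not found on PATH")
    else:
        result = subprocess.run(
            ["kubectl", "get", "pods", "-A", "-o", "json"],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            print(result.stderr.strip())
        else:
            for namespace, name, phase in unhealthy_pods(result.stdout):
                print(f"kubectl describe pod {name} -n {namespace}  # phase: {phase}")
```

A pod in the Running phase can still have containers that are not ready, so this filter is a first pass rather than a complete health check.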

Using telemetry information to diagnose node issues

Regular telemetry metrics

Metric Name              Unit       Possible Values        Possible error causes
-----------------------  ---------  ---------------------  ------------------------------------------------------------
BlockedProcesses         processes  Metric Value, No Data  This could happen if the /proc/stat file is inaccessible.
CPUSystem                seconds    Metric Value, No Data  This could happen when the psutil library encounters errors.
CPUWait                  seconds    Metric Value, No Data  This could happen when the psutil library encounters errors.
ErrorsRecv               -          Metric Value, No Data  This could happen when the psutil library encounters errors.
ErrorsSent               -          Metric Value, No Data  This could happen when the psutil library encounters errors.
FailedJobs               -          Metric Value, No Data  Slurm is not installed.
HardwareCorruptedMemory  kB         Metric Value, No Data  This could happen if the /proc/meminfo file is inaccessible.
MemoryActive             bytes      Metric Value, No Data  This could happen when the psutil library encounters errors.
MemoryAvailable          bytes      Metric Value, No Data  This could happen when the psutil library encounters errors.
MemoryCached             bytes      Metric Value, No Data  This could happen when the psutil library encounters errors.
MemoryFree               bytes      Metric Value, No Data  This could happen when the psutil library encounters errors.
MemoryInactive           bytes      Metric Value, No Data  This could happen when the psutil library encounters errors.
MemoryPercent            percent    Metric Value, No Data  This could happen when the psutil library encounters errors.
MemoryShared             bytes      Metric Value, No Data  This could happen when the psutil library encounters errors.
MemoryTotal              bytes      Metric Value, No Data  This could happen when the psutil library encounters errors.
MemoryUsed               bytes      Metric Value, No Data  This could happen when the psutil library encounters errors.
NodesDown                -          Metric Value, No Data  Slurm is not installed.
NodesTotal               -          Metric Value, No Data  Slurm is not installed.
NodesUp                  -          Metric Value, No Data  Slurm is not installed.
QueuedJobs               -          Metric Value, No Data  Slurm is not installed.
RunningJobs              -          Metric Value, No Data  Slurm is not installed.
SMARTHDATemp             C          Metric Value, No Data  smartctl commands failed.
UniqueUserLogin          -          Metric Value, No Data  -
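
When a regular metric reports No Data, the causes listed for it can be checked directly on the affected node. The following is a minimal Python sketch, not part of Omnia: the function name check_metric_prerequisites is hypothetical, and using the sinfo client on PATH as a sign that Slurm is installed is an assumption.

```python
import importlib.util
import os
import shutil


def check_metric_prerequisites():
    """Map each documented 'No Data' cause to True (looks fine) or False (likely cause)."""
    return {
        "/proc/stat readable": os.access("/proc/stat", os.R_OK),
        "/proc/meminfo readable": os.access("/proc/meminfo", os.R_OK),
        "psutil importable": importlib.util.find_spec("psutil") is not None,
        # Assumption: the sinfo client on PATH is a proxy for "Slurm is installed".
        "Slurm (sinfo) on PATH": shutil.which("sinfo") is not None,
        "smartctl on PATH": shutil.which("smartctl") is not None,
    }


if __name__ == "__main__":
    for check, ok in check_metric_prerequisites().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {check}")
```

A passing check narrows the search but does not rule a cause out: psutil can be importable and still raise errors at runtime, and smartctl can be present and still fail on a given disk.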

Health telemetry metrics

Metric Name                 Possible value(s)      Possible failure causes
--------------------------  ---------------------  ------------------------------------------------------------
dmesg                       Unknown, Fail, Pass    [Unknown] The dmesg command was not found on the cluster node.
                                                   [Fail] The dmesg command returned an error log message.
beegfs -beegfsstat          Unknown, Fail, Pass    [Unknown] BeeGFS is not installed or is inactive.
                                                   [Fail] The BeeGFS client service has failed, or the node is not present in the reachable lists of BeeGFS clients.
gpu_driver_health:gpu       Unknown, Fail, Pass    AMD/NVIDIA accelerators are not present.
                                                   GPU drivers (including ROCm and CUDA) are not installed.
gpu_health_nvlink:gpu [1]   Unknown, Fail, Pass    AMD/NVIDIA accelerators are not present.
                                                   NVLinks are not present.
                                                   GPU drivers (including ROCm and CUDA) are not installed.
gpu_health_pcie:gpu         Unknown, Fail, Pass    AMD/NVIDIA accelerators are not present.
                                                   GPU drivers (including ROCm and CUDA) are not installed.
gpu_health_pmu:gpu          Unknown, Fail, Pass    AMD/NVIDIA accelerators are not present.
                                                   GPU drivers (including ROCm and CUDA) are not installed.
gpu_health_power:gpu        Unknown, Fail, Pass    AMD/NVIDIA accelerators are not present.
                                                   GPU drivers (including ROCm and CUDA) are not installed.
gpu_health_thermal:gpu      Unknown, Metric Value  AMD/NVIDIA accelerators are not present.
                                                   GPU drivers (including ROCm and CUDA) are not installed.
Kubernetespodsstatus        Unknown, Fail, Pass    Kubernetes is not installed.
Kuberneteschildnode         Unknown, Fail, Pass    Kubernetes is not installed.
kubernetesnodesstatus       Unknown, Fail, Pass    Kubernetes is not installed.
kubernetescomponentsstatus  Unknown, Fail, Pass    Kubernetes is not installed.
Smart                       Unknown, Fail, Pass    smartctl commands failed.

GPU telemetry metrics

Metric Name              Unit     Possible value(s)      Potential error cause(s)
-----------------------  -------  ---------------------  ------------------------------------------------------------
gpu_temperature:gpu      C        Metric value, No data  AMD/NVIDIA accelerators are not present.
                                                         GPU drivers (including ROCm and CUDA) are not installed.
gpu_utilization          percent  Metric value, No data  AMD/NVIDIA accelerators are not present.
                                                         GPU drivers (including ROCm and CUDA) are not installed.
gpu_utilization:average  percent  Metric value, No data  AMD/NVIDIA accelerators are not present.
                                                         GPU drivers (including ROCm and CUDA) are not installed.
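
Both GPU error causes can be narrowed down by checking whether the vendor management tools are available on the node. The following is a minimal Python sketch, not part of Omnia: the function name detect_gpu_tools is hypothetical, and it assumes that nvidia-smi (NVIDIA driver) and rocm-smi (ROCm) are the tools installed with the respective driver stacks.

```python
import shutil

# Assumption: these vendor CLIs are installed together with the driver stacks.
GPU_TOOLS = {"NVIDIA": "nvidia-smi", "AMD": "rocm-smi"}


def detect_gpu_tools(which=shutil.which):
    """Return {vendor: path or None} for each vendor management CLI on PATH."""
    return {vendor: which(tool) for vendor, tool in GPU_TOOLS.items()}


if __name__ == "__main__":
    found = detect_gpu_tools()
    for vendor, path in found.items():
        print(f"{vendor}: {path or 'tool not found'}")
    if not any(found.values()):
        print("No GPU tooling found; check that accelerators and drivers are installed.")
```

If a tool is found, running it by hand shows whether the driver can actually see an accelerator, which separates a missing driver from missing hardware.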

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.