Troubleshooting guide

Troubleshooting Kubeadm

For a complete guide to troubleshooting kubeadm, see the Kubernetes documentation on troubleshooting kubeadm.

Connecting to internal databases

TimescaleDB
- Start a bash session within the timescaledb pod: `kubectl exec -it pod/timescaledb-0 -n telemetry-and-visualizations -- /bin/bash`
- Connect to psql: `psql -U <postgres_username>`
- Connect to database: `\c telemetry_metrics`
MySQL DB
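- Start a bash session within the mysqldb pod: `kubectl exec -it pod/mysqldb-0 -n telemetry-and-visualizations -- /bin/bash`
- Connect to mysql: `mysql -u <mysqldb_username> -p` (enter `<mysqldb_password>` when prompted; the user flag is a lowercase `-u`, and a value placed after `-p` with a space is read as a database name, not as the password)
- Connect to database: `USE idrac_telemetrysource_services_db;`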
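
The interactive steps above can also be wrapped in small helper functions. The following is a minimal sketch, not part of Omnia: the function names `timescale_query` and `mysql_shell` are hypothetical, and it assumes that `psql` inside the pod accepts the connection without further prompts.

```shell
# Namespace and pod names are taken from the steps above.
NAMESPACE="telemetry-and-visualizations"

# Run a single SQL statement against telemetry_metrics in the TimescaleDB pod.
# $1 = postgres username, $2 = SQL statement
timescale_query() {
    kubectl exec -i pod/timescaledb-0 -n "$NAMESPACE" -- \
        psql -U "$1" -d telemetry_metrics -c "$2"
}

# Open the mysql client directly on idrac_telemetrysource_services_db.
# $1 = mysqldb username; -p with no value makes the client prompt for the password.
mysql_shell() {
    kubectl exec -it pod/mysqldb-0 -n "$NAMESPACE" -- \
        mysql -u "$1" -p idrac_telemetrysource_services_db
}
```

For example, `timescale_query <postgres_username> '\dt'` would list the tables in `telemetry_metrics` without opening a bash session first.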

Checking and updating encrypted parameters

Change to the directory where the parameters are saved. This example uses `provision_config_credentials.yml`:
```
cd input/
```

To view the encrypted parameters:

ansible-vault view provision_config_credentials.yml --vault-password-file .provision_vault_key

To edit the encrypted parameters:

ansible-vault edit provision_config_credentials.yml --vault-password-file .provision_vault_key
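
Before or after editing, it can be useful to confirm that the file is actually stored encrypted. The sketch below relies on the fact that files encrypted by ansible-vault begin with a `$ANSIBLE_VAULT;` header line; the function name `is_vault_encrypted` is hypothetical and not part of Omnia.

```shell
# Succeeds (exit status 0) if the first line of the file is an ansible-vault header.
# A missing or unreadable file counts as "not encrypted".
is_vault_encrypted() {
    head -n 1 "$1" 2>/dev/null | grep -q '^\$ANSIBLE_VAULT;'
}

# Example use (commented out so that sourcing this snippet has no side effects):
# if is_vault_encrypted provision_config_credentials.yml; then
#     ansible-vault view provision_config_credentials.yml --vault-password-file .provision_vault_key
# else
#     echo "provision_config_credentials.yml is not encrypted" >&2
# fi
```

The check only inspects the header line; it does not verify that the vault key can decrypt the file.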

Checking pod status on the control plane

Use this command to get a list of all available pods: `kubectl get pods -A`

Check the status of any specific pod by running: `kubectl describe pod <pod name> -n <namespace name>`
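
On a busy cluster the full pod list can be long. As a rough sketch, the output of `kubectl get pods -A --no-headers` can be filtered down to pods that need attention. The function name `unhealthy_pods` is hypothetical, and the filter assumes the default column layout (NAMESPACE, NAME, READY, STATUS, ...).

```shell
# Reads `kubectl get pods -A --no-headers` output on stdin and prints
# "<namespace> <pod> <status>" for every pod whose STATUS column (field 4)
# is neither Running nor Completed.
unhealthy_pods() {
    awk '$4 != "Running" && $4 != "Completed" { print $1, $2, $4 }'
}

# Example use (commented out so that sourcing this snippet does not call kubectl):
# kubectl get pods -A --no-headers | unhealthy_pods
```

Each reported pod can then be inspected with `kubectl describe pod`. Note that this filter looks only at the STATUS column, so a pod that is Running but not fully ready (for example READY `0/1`) is not reported.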

Using telemetry information to diagnose node issues

Regular telemetry metrics
Metric Name	Unit	Possible Values	Possible error causes
BlockedProcesses	processes	Metric Value, No Data	This could happen if the `/proc/stat` file is inaccessible.
CPUSystem	seconds	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
CPUWait	seconds	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
ErrorsRecv		Metric Value, No Data	This could happen when the `psutil` library encounters errors.
ErrorsSent		Metric Value, No Data	This could happen when the `psutil` library encounters errors.
FailedJobs		Metric Value, No Data	Slurm is not installed.
HardwareCorruptedMemory	kB	Metric Value, No Data	This could happen if the `/proc/meminfo` file is inaccessible.
MemoryActive	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryAvailable	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryCached	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryFree	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryInactive	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryPercent	percent	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryShared	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryTotal	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
MemoryUsed	bytes	Metric Value, No Data	This could happen when the `psutil` library encounters errors.
NodesDown		Metric Value, No Data	Slurm is not installed.
NodesTotal		Metric Value, No Data	Slurm is not installed.
NodesUp		Metric Value, No Data	Slurm is not installed.
QueuedJobs		Metric Value, No Data	Slurm is not installed.
RunningJobs		Metric Value, No Data	Slurm is not installed.
SMARTHDATemp	C	Metric Value, No Data	`smartctl` commands failed.
UniqueUserLogin		Metric Value, No Data
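
When a `/proc`-based metric reports No Data, reading the underlying file by hand on the affected node can show whether the file is accessible. The sketch below assumes that BlockedProcesses corresponds to the `procs_blocked` line in `/proc/stat` and HardwareCorruptedMemory to the `HardwareCorrupted` line in `/proc/meminfo`; this guide does not state which fields the collector reads, and both function names are hypothetical.

```shell
# Print the procs_blocked counter from a /proc/stat-style file.
# $1 = file to read (optional, defaults to /proc/stat)
blocked_processes() {
    awk '$1 == "procs_blocked" { print $2 }' "${1:-/proc/stat}"
}

# Print the HardwareCorrupted value (in kB) from a /proc/meminfo-style file.
# $1 = file to read (optional, defaults to /proc/meminfo)
hardware_corrupted_kb() {
    awk '$1 == "HardwareCorrupted:" { print $2 }' "${1:-/proc/meminfo}"
}
```

If either function prints nothing or `awk` reports that the file cannot be opened, the file is missing the field or is not readable by the current user, which matches the error causes listed in the table.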

Health telemetry metrics
Metric Name	Possible value(s)	Possible failure causes
dmesg	Unknown, Fail, Pass	[Unknown] The dmesg command was not found on the cluster node. [Fail] The dmesg command returned an error log message.
beegfs -beegfsstat	Unknown, Fail, Pass	[Unknown] BeeGFS is not installed or is inactive. [Fail] The BeeGFS client service has failed, or the node is not present in the reachable lists of BeeGFS clients.
gpu_driver_health:gpu	Unknown, Fail, Pass	AMD/NVIDIA accelerators are not present. GPU drivers (including ROCm and CUDA) are not installed.
gpu_health_nvlink:gpu [1]	Unknown, Fail, Pass	AMD/NVIDIA accelerators are not present. NVLinks are not present. GPU drivers (including ROCm and CUDA) are not installed.
gpu_health_pcie:gpu	Unknown, Fail, Pass	AMD/NVIDIA accelerators are not present. GPU drivers (including ROCm and CUDA) are not installed.
gpu_health_pmu:gpu	Unknown, Fail, Pass	AMD/NVIDIA accelerators are not present. GPU drivers (including ROCm and CUDA) are not installed.
gpu_health_power:gpu	Unknown, Fail, Pass	AMD/NVIDIA accelerators are not present. GPU drivers (including ROCm and CUDA) are not installed.
gpu_health_thermal:gpu	Unknown, Metric Value	AMD/NVIDIA accelerators are not present. GPU drivers (including ROCm and CUDA) are not installed.
Kubernetespodsstatus	Unknown, Fail, Pass	Kubernetes is not installed.
Kuberneteschildnode	Unknown, Fail, Pass	Kubernetes is not installed.
kubernetesnodesstatus	Unknown, Fail, Pass	Kubernetes is not installed.
kubernetescomponentsstatus	Unknown, Fail, Pass	Kubernetes is not installed.
Smart	Unknown, Fail, Pass	`smartctl` commands failed.
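
Several of the Unknown results above trace back to a tool that is not installed on the node. A quick way to narrow this down is to check whether the relevant commands are on the `PATH`. This is a sketch under assumptions: the function names are hypothetical, `kubectl` is used only as a proxy for "Kubernetes is installed", and `nvidia-smi` and `rocm-smi` are used as proxies for the NVIDIA and ROCm driver stacks.

```shell
# Print "<tool>: present" or "<tool>: missing" depending on whether the
# command can be found on the PATH.
tool_status() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: present"
    else
        echo "$1: missing"
    fi
}

# Check the tools that the health metrics in the table depend on.
# The GPU and Kubernetes entries are assumed stand-ins (see the note above).
check_health_tools() {
    for tool in dmesg smartctl kubectl nvidia-smi rocm-smi; do
        tool_status "$tool"
    done
}
```

Running `check_health_tools` on the affected node prints one line per tool. A missing tool explains an Unknown value, but a present tool does not by itself guarantee a Pass.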

GPU telemetry metrics
Metric Name	Unit	Possible value(s)	Potential error cause(s)
gpu_temperature:gpu	C	Metric value, No data	AMD/NVIDIA accelerators are not present. GPU drivers (including ROCm and CUDA) are not installed.
gpu_utilization	percent	Metric value, No data	AMD/NVIDIA accelerators are not present. GPU drivers (including ROCm and CUDA) are not installed.
gpu_utilization:average	percent	Metric value, No data	AMD/NVIDIA accelerators are not present. GPU drivers (including ROCm and CUDA) are not installed.

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.