Troubleshooting guide
Troubleshooting Kubeadm
For a complete guide to troubleshooting kubeadm, click here.
Connecting to internal databases
- TimescaleDB
Start a bash session within the timescaledb pod:
kubectl exec -it pod/timescaledb-0 -n telemetry-and-visualizations -- /bin/bash
Connect to psql:
psql -U <postgres_username>
Connect to database:
\c telemetry_metrics
- MySQL DB
Start a bash session within the mysqldb pod:
kubectl exec -it pod/mysqldb-0 -n telemetry-and-visualizations -- /bin/bash
Connect to mysql (enter <mysqldb_password> when prompted):
mysql -u <mysqldb_username> -p
Connect to database:
USE idrac_telemetrysource_services_db
Checking and updating encrypted parameters
Move to the directory where the parameters are saved (as an example, we will be using provision_config_credentials.yml):
cd input/
To view the encrypted parameters:
ansible-vault view provision_config_credentials.yml --vault-password-file .provision_vault_key
To edit the encrypted parameters:
ansible-vault edit provision_config_credentials.yml --vault-password-file .provision_vault_key
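Before viewing or editing, it can help to confirm that the file is actually vault-encrypted. The following is a minimal sketch, assuming the standard Ansible Vault file format, in which the first line of an encrypted file starts with `$ANSIBLE_VAULT;`. The helper name `is_vault_encrypted` is hypothetical and is not part of Omnia or Ansible.

```shell
# Hypothetical helper (not part of Omnia or Ansible): succeeds when the
# first line of the given file carries the Ansible Vault header.
# Assumption: encrypted vault files begin with "$ANSIBLE_VAULT;".
is_vault_encrypted() {
  head -n 1 "$1" 2>/dev/null | grep -q '^\$ANSIBLE_VAULT;'
}

# Example usage from the input/ directory:
#   if is_vault_encrypted provision_config_credentials.yml; then
#     echo "encrypted"
#   else
#     echo "plain text"
#   fi
```

If the check reports plain text, the ansible-vault view and edit commands above are not expected to work on that file, since they operate on encrypted content.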
Checking pod status on the control plane
Use this command to get a list of all available pods:
kubectl get pods -A
Check the status of any specific pod by running:
kubectl describe pod <pod name> -n <namespace name>
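On a busy cluster the full pod list can be long. The following is a minimal sketch for narrowing it to pods that may need attention, assuming the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE). The function name `filter_unhealthy_pods` is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and
# prints "<namespace>/<pod> <status>" for every pod whose STATUS column
# is neither Running nor Completed. The header row (line 1) is skipped.
filter_unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Example usage on the control plane:
#   kubectl get pods -A | filter_unhealthy_pods
```

Each output line gives a namespace and pod name that can be passed to kubectl describe pod. This sketch only inspects the STATUS column, so a pod that is Running but not fully ready will not be listed.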
Using telemetry information to diagnose node issues
Metric Name | Unit | Possible Values | Possible error causes |
---|---|---|---|
BlockedProcesses | processes | | |
CPUSystem | seconds | | |
CPUWait | seconds | | |
ErrorsRecv | | | |
ErrorsSent | | | |
FailedJobs | | | |
HardwareCorruptedMemory | kB | | |
MemoryActive | bytes | | |
MemoryAvailable | bytes | | |
MemoryCached | bytes | | |
MemoryFree | bytes | | |
MemoryInactive | bytes | | |
MemoryPercent | percent | | |
MemoryShared | bytes | | |
MemoryTotal | bytes | | |
MemoryUsed | bytes | | |
NodesDown | | | |
NodesTotal | | | |
NodesUp | | | |
QueuedJobs | | | |
RunningJobs | | | |
SMARTHDATemp | C | | |
UniqueUserLogin | | | |

Metric Name | Possible value(s) | Possible failure causes |
---|---|---|
dmesg | | |
beegfs -beegfsstat | | |
gpu_driver_health:gpu | | |
gpu_health_nvlink:gpu [1] | | |
gpu_health_pcie:gpu | | |
gpu_health_pmu:gpu | | |
gpu_health_power:gpu | | |
gpu_health_thermal:gpu | | |
Kubernetespodsstatus | | |
Kuberneteschildnode | | |
kubernetesnodesstatus | | |
kubernetescomponentsstatus | | |
Smart | | |

Metric Name | Unit | Possible value(s) | Potential error cause(s) |
---|---|---|---|
gpu_temperature:gpu | C | | |
gpu_utilization | percent | | |
gpu_utilization:average | percent | | |
If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.