Troubleshooting guide
Connecting to internal databases
- TimescaleDB
Go inside the pod:
kubectl exec -it pod/timescaledb-0 -n telemetry-and-visualizations -- /bin/bash
Connect to psql:
psql -U <postgres_username>
Connect to database:
\c telemetry_metrics
- MySQL DB
Go inside the pod:
kubectl exec -it pod/mysqldb-0 -n telemetry-and-visualizations -- /bin/bash
Connect to mysql (enter the MySQL password when prompted):
mysql -u <mysqldb_username> -p
Connect to database:
USE idrac_telemetrysource_services_db
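For a quick check, it can be convenient to skip the interactive shell and run a single statement directly from the control plane. The sketch below assumes the pod names, namespace, and database names shown above. The helper names tsdb_query and mysql_query and the KUBECTL override are hypothetical, introduced only for illustration, and the TimescaleDB helper assumes psql inside the pod does not ask for a password.

```shell
# Hypothetical helpers; KUBECTL can be overridden (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="telemetry-and-visualizations"

# Run one SQL statement against telemetry_metrics in the TimescaleDB pod.
# $1: postgres username, $2: SQL statement or psql meta-command
tsdb_query() {
  "$KUBECTL" exec pod/timescaledb-0 -n "$NAMESPACE" -- \
    psql -U "$1" -d telemetry_metrics -c "$2"
}

# Run one SQL statement against idrac_telemetrysource_services_db in the MySQL pod.
# -it keeps a terminal attached so mysql can prompt for the password (-p with no value).
# $1: mysql username, $2: SQL statement
mysql_query() {
  "$KUBECTL" exec -it pod/mysqldb-0 -n "$NAMESPACE" -- \
    mysql -u "$1" -p -e "$2" idrac_telemetrysource_services_db
}
```

For example, tsdb_query <postgres_username> '\dt' should list the tables in telemetry_metrics, and mysql_query <mysqldb_username> 'SHOW TABLES;' should do the same for the MySQL database.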
Checking and updating encrypted parameters
Move to the directory where the parameters are saved (as an example, we will be using provision_config.yml):
cd input/
To view the encrypted parameters:
ansible-vault view provision_config.yml --vault-password-file .provision_vault_key
To edit the encrypted parameters:
ansible-vault edit provision_config.yml --vault-password-file .provision_vault_key
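Before viewing or editing, it can help to confirm that the file is actually encrypted, since ansible-vault is expected to reject plain-text input. Files encrypted by Ansible Vault begin with a header line starting with $ANSIBLE_VAULT;. The helper name is_vault_encrypted below is hypothetical, and the sketch only inspects that first line.

```shell
# Hypothetical helper: return 0 if the file starts with the Ansible Vault
# header, non-zero otherwise (including when the file cannot be read).
# $1: path to the parameter file
is_vault_encrypted() {
  head -n 1 "$1" 2>/dev/null | grep -q '^\$ANSIBLE_VAULT;'
}
```

A possible use: is_vault_encrypted provision_config.yml && ansible-vault view provision_config.yml --vault-password-file .provision_vault_key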
Checking pod status on the control plane
Select the pod you need to troubleshoot from the output of
kubectl get pods -A
Check the status of the pod by running
kubectl describe pod <pod name> -n <namespace name>
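On a busy cluster, the full pod list can be long. A minimal sketch for narrowing it down to pods that need attention is shown below. It assumes the default column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, ...), and the helper name unhealthy_pods is hypothetical.

```shell
# Hypothetical helper: read `kubectl get pods -A --no-headers` output on stdin
# and print "namespace name status" for pods whose STATUS column is neither
# Running nor Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1, $2, $4 }'
}
```

Run it as kubectl get pods -A --no-headers | unhealthy_pods, then pass the reported name and namespace to kubectl describe pod as described above.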
Using telemetry information to diagnose node issues
Metric Name | Unit | Possible Values | Possible error causes |
---|---|---|---|
BlockedProcesses | processes | | |
CPUSystem | seconds | | |
CPUWait | seconds | | |
ErrorsRecv | | | |
ErrorsSent | | | |
FailedJobs | | | |
HardwareCorruptedMemory | kB | | |
MemoryActive | bytes | | |
MemoryAvailable | bytes | | |
MemoryCached | bytes | | |
MemoryFree | bytes | | |
MemoryInactive | bytes | | |
MemoryPercent | percent | | |
MemoryShared | bytes | | |
MemoryTotal | bytes | | |
MemoryUsed | bytes | | |
NodesDown | | | |
NodesTotal | | | |
NodesUp | | | |
QueuedJobs | | | |
RunningJobs | | | |
SMARTHDATemp | C | | |
UniqueUserLogin | | | |

Metric Name | Possible value(s) | Possible failure causes |
---|---|---|
dmesg | | |
beegfs -beegfsstat | | |
gpu_driver_health:gpu | | |
gpu_health_nvlink:gpu [1] | | |
gpu_health_pcie:gpu | | |
gpu_health_pmu:gpu | | |
gpu_health_power:gpu | | |
gpu_health_thermal:gpu | | |
Kubernetespodsstatus | | |
Kuberneteschildnode | | |
kubernetesnodesstatus | | |
kubernetescomponentsstatus | | |
Smart | | |

Metric Name | Unit | Possible value(s) | Potential error cause(s) |
---|---|---|---|
gpu_temperature:gpu | C | | |
gpu_utilization | percent | | |
gpu_utilization:average | percent | | |

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.