Telemetry and visualizations

The telemetry feature lets you set up Omnia telemetry (to poll values from all Omnia-provisioned nodes in the cluster) and/or iDRAC telemetry (to poll values from all eligible iDRACs in the cluster). It also installs Grafana and Loki as Kubernetes pods.

To initiate telemetry support, fill out the following parameters in input/telemetry_config.yml:

Parameters

Parameter

Details

idrac_telemetry_support

boolean [1]

Required

  • Enables iDRAC telemetry support and visualizations.

  • Values:

* false <- Default

* true

Note

When idrac_telemetry_support is true, mysqldb_user, mysqldb_password and mysqldb_root_password become mandatory.

omnia_telemetry_support

boolean [1]

Required

  • Starts or stops Omnia telemetry

  • If omnia_telemetry_support is true, then at least one of collect_regular_metrics or collect_health_check_metrics or collect_gpu_metrics should be true, to collect metrics.

  • If omnia_telemetry_support is false, telemetry acquisition will be stopped.

  • Values:

* false <- Default

* true

visualization_support

boolean [1]

Required

  • Enables visualizations.

  • Values:

* false <- Default

* true

Note

When visualization_support is true, grafana_username and grafana_password become mandatory.

appliance_k8s_pod_net_cidr

string

Required

  • Kubernetes pod network CIDR for appliance k8s network.

  • Make sure this value does not overlap with any of the host networks.

  • Default value: "192.168.0.0/16"

pod_external_ip_start_range

string

Required

  • The start of the range that will be used by Loadbalancer for assigning IPs to K8s services in admin NIC subnet configured on the control plane.

  • The first and second octets (x,y) are not used/validated by Omnia. These values are internally calculated based on the value of admin_nic_subnet in input/provision_config.yml.

  • Example: if pod_external_ip_start_range is "x.y.240.100", pod_external_ip_end_range is "x.y.240.105", and the admin_nic_subnet provided in provision_config.yml is 10.5.0.0, then the effective pod_external_ip_start_range is 10.5.240.100 and the effective pod_external_ip_end_range is 10.5.240.105.

Note

Make sure the IP range is not assigned to any node in the cluster.

  • Default value: "x.y.240.100"

pod_external_ip_end_range

string

Required

  • The end of the range that will be used by Loadbalancer for assigning IPs to K8s services in admin NIC subnet configured on the control plane.

  • The first and second octets (x,y) are not used/validated by Omnia. These values are internally calculated based on the value of admin_nic_subnet in input/provision_config.yml.

  • To create a meaningful range, the third octet of pod_external_ip_end_range should equal or exceed the third octet of pod_external_ip_start_range. If the third octets are equal, the fourth octet of pod_external_ip_end_range should exceed the fourth octet of pod_external_ip_start_range.

  • Example: if pod_external_ip_start_range is "x.y.240.100", pod_external_ip_end_range is "x.y.240.105", and the admin_nic_subnet provided in provision_config.yml is 10.5.0.0, then the effective pod_external_ip_start_range is 10.5.240.100 and the effective pod_external_ip_end_range is 10.5.240.105.

Note

Make sure the IP range is not assigned to any node in the cluster.

  • Default value: "x.y.240.105"
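The octet substitution and ordering rule described for pod_external_ip_start_range and pod_external_ip_end_range can be sketched in a few lines of Python. This is an illustration only, not Omnia's own validation code. The function name resolve_pod_ip_range is hypothetical, and the sketch assumes that the first two octets are always taken from admin_nic_subnet.

```python
import ipaddress


def resolve_pod_ip_range(admin_nic_subnet, start_range, end_range):
    """Return the effective (start, end) addresses as IPv4Address objects.

    Assumption: the first two octets come from admin_nic_subnet and the
    last two octets come from the configured range values.
    """
    prefix = admin_nic_subnet.split(".")[:2]

    def resolve(value):
        # Only the third and fourth octets of the configured value are used,
        # so placeholders such as "x" and "y" are simply discarded.
        suffix = value.split(".")[2:]
        if len(suffix) != 2:
            raise ValueError(f"expected four octets, got {value!r}")
        return ipaddress.IPv4Address(".".join(prefix + suffix))

    start, end = resolve(start_range), resolve(end_range)
    # With a shared prefix, comparing whole addresses is equivalent to the
    # "third octet, then fourth octet" rule.
    if end <= start:
        raise ValueError(
            "pod_external_ip_end_range must exceed pod_external_ip_start_range"
        )
    return start, end


start, end = resolve_pod_ip_range("10.5.0.0", "x.y.240.100", "x.y.240.105")
print(start, end)
```

With the values from the example above, the sketch yields 10.5.240.100 and 10.5.240.105.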

timescaledb_user

string

Required

  • Username used to access timescaleDB.

  • The username must not contain -, \, ', or ".

  • The length of the username should be at least 2 characters.

timescaledb_password

string

Required

  • Password used to access timescaleDB.

  • The password must not contain -, \, ', or ".

  • The length of the password should be at least 2 characters.

  • The first character of the password must be a letter.
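The same three credential rules (forbidden characters, minimum length, leading letter) recur for most usernames and passwords in this file, so a single checker can cover them. The sketch below is hypothetical and is not part of Omnia. It assumes the forbidden set is hyphen, backslash, single quote, and double quote, which is our reading of the rule above, and it treats "letter" as an ASCII letter.

```python
# Assumed forbidden set: hyphen, backslash, single quote, double quote.
FORBIDDEN_CHARS = set("-\\'\"")


def check_credential(value, min_length=2, first_char_letter=False):
    """Return a list of rule violations; an empty list means the value passes."""
    problems = []
    if len(value) < min_length:
        problems.append(f"must be at least {min_length} characters long")
    if any(ch in FORBIDDEN_CHARS for ch in value):
        problems.append("must not contain -, \\, ' or \"")
    first = value[:1]
    # Passwords in this file must start with a letter; usernames need not.
    if first_char_letter and not (first.isascii() and first.isalpha()):
        problems.append("first character must be a letter")
    return problems


print(check_credential("omnia_user"))
print(check_credential("1secret", first_char_letter=True))
```

For the Grafana credentials, the same function can be called with min_length=5.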

idrac_username

string

Optional

  • Username used to authenticate to iDRAC.

  • The username must not contain -, \, ', or ".

  • Required if idrac_telemetry_support is true.

idrac_password

string

Optional

  • Password used to authenticate to iDRAC.

  • The password must not contain -, \, ', or ".

  • Required if idrac_telemetry_support is true.

  • The first character of the password must be a letter.

mysqldb_user

string

Optional

  • Username used to authenticate to mysqldb.

  • The username must not contain -, \, ', or ".

  • The length of the username should be at least 2 characters.

  • Required if idrac_telemetry_support is true.

mysqldb_password

string

Optional

  • Password used to authenticate to mysqldb.

  • The password must not contain -, \, ', or ".

  • The length of the password should be at least 2 characters.

  • Required if idrac_telemetry_support is true.

  • The first character of the password must be a letter.

mysqldb_root_password

string

Optional

  • Password used to authenticate to mysqldb as a root user.

  • The password must not contain -, \, ', or ".

  • The length of the password should be at least 2 characters.

  • Required if idrac_telemetry_support is true.

  • The first character of the password must be a letter.

omnia_telemetry_collection_interval

integer

Required

  • The interval, in seconds, at which telemetry data is collected from the required compute nodes.

  • Range (seconds): 60-3600 [1 minute to 1 hour]

  • Default value: 300

collect_regular_metrics

boolean [1]

Required

  • Enables collection of the metrics that belong to the regular metric group.

  • For a list of regular metrics collected, click here.

  • Values:

* true <- Default

* false

collect_health_check_metrics

boolean [1]

Required

  • Enables collection of the metrics that belong to the health check metric group.

  • For a list of health metrics collected, click here.

  • Values:

* true <- Default

* false

collect_gpu_metrics

boolean [1]

Required

  • Enables collection of GPU-related metrics.

  • For a list of GPU metrics collected, click here.

  • Values:

* true <- Default

* false

fuzzy_offset

integer

Required

  • Sets a time window, in seconds, over which cluster nodes stagger the start of data collection so that they do not congest the admin network.

  • Each node generates a random number between 0 and fuzzy_offset and waits that many seconds before starting telemetry data collection.

  • Default value (seconds): 60

  • For large clusters, a higher value is recommended.

  • This value should be less than or equal to the value of omnia_telemetry_collection_interval but greater than or equal to 60.

metric_collection_timeout

integer

Required

  • This variable is used to define data collection timeout period in seconds.

  • Default value: 5

  • This value should be less than the value of omnia_telemetry_collection_interval but greater than 0.
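The relationships between omnia_telemetry_collection_interval, fuzzy_offset, and metric_collection_timeout can be checked together. The following sketch is an illustration under the constraints stated above, not Omnia's implementation. The function names are hypothetical, and treating the random delay as inclusive of both 0 and fuzzy_offset is an assumption.

```python
import random


def check_timing(interval, fuzzy_offset, timeout):
    """Return a list of violated timing constraints (empty if all hold)."""
    problems = []
    if not 60 <= interval <= 3600:
        problems.append("omnia_telemetry_collection_interval must be 60-3600")
    if not 60 <= fuzzy_offset <= interval:
        problems.append("fuzzy_offset must be >= 60 and <= the collection interval")
    if not 0 < timeout < interval:
        problems.append("metric_collection_timeout must be > 0 and < the collection interval")
    return problems


def initial_delay(fuzzy_offset, rng=random):
    """Pick a per-node start delay in whole seconds (assumed inclusive range)."""
    return rng.randint(0, fuzzy_offset)


print(check_timing(300, 60, 5))
print(initial_delay(60))
```

Because each node draws its own delay, collection requests spread across the window instead of arriving at the same moment, which is why a larger fuzzy_offset suits larger clusters.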

grafana_username

string

Optional

  • The username for the Grafana UI.

  • The username must be at least 5 characters long.

  • The username must not contain -, \, ', or ".

  • Mandatory when visualization_support is true.

grafana_password

string

Optional

  • The password for the Grafana UI.

  • The password must be at least 5 characters long.

  • The password must not contain -, \, ', or ".

  • The password cannot be set to 'admin'.

  • The first character of the password must be a letter.

  • Mandatory when visualization_support is true.

mount_location

string

Optional

  • The location where the Grafana persistent volume is created.

  • If telemetry is used, all telemetry-related files are also stored here, and both the timescale and mysql databases are mounted to this location.

  • The path must end with '/'.

  • Default value: “/opt/omnia/telemetry/”
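Several parameters in the table become mandatory only when a feature flag is true. A short pre-flight check makes those dependencies explicit. The sketch below is hypothetical and is not shipped with Omnia. It works on an already-parsed dictionary (for example, the result of loading input/telemetry_config.yml with a YAML parser of your choice) and covers only the cross-parameter rules stated in the table.

```python
def check_dependencies(cfg):
    """Return a list of missing or inconsistent settings in a parsed config dict."""
    problems = []

    # iDRAC telemetry needs iDRAC and mysqldb credentials.
    if cfg.get("idrac_telemetry_support"):
        for key in ("idrac_username", "idrac_password", "mysqldb_user",
                    "mysqldb_password", "mysqldb_root_password"):
            if not cfg.get(key):
                problems.append(f"{key} is required when idrac_telemetry_support is true")

    # Visualizations need Grafana credentials.
    if cfg.get("visualization_support"):
        for key in ("grafana_username", "grafana_password"):
            if not cfg.get(key):
                problems.append(f"{key} is required when visualization_support is true")

    # Omnia telemetry needs at least one metric group enabled.
    if cfg.get("omnia_telemetry_support"):
        groups = ("collect_regular_metrics", "collect_health_check_metrics",
                  "collect_gpu_metrics")
        if not any(cfg.get(group) for group in groups):
            problems.append("enable at least one collect_* metric group")

    # The mount location must end with a slash.
    if not cfg.get("mount_location", "/opt/omnia/telemetry/").endswith("/"):
        problems.append("mount_location must end with '/'")

    return problems


example = {"omnia_telemetry_support": True, "collect_regular_metrics": True}
print(check_dependencies(example))
```

An empty list means none of the conditional requirements are violated; it does not validate value formats or ranges.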

Once you have executed discovery_provision.yml and provisioned the cluster, initiate telemetry as part of omnia.yml, which also configures the cluster with scheduler, storage, and authentication, using the following command:

ansible-playbook omnia.yml -i inventory

Optionally, you can initiate only telemetry using the below command:

ansible-playbook telemetry.yml -i inventory

Note

  • Depending on the type of telemetry initiated, include the following groups in the inventory:

    • omnia_telemetry: manager, compute, [optional] login

    • idrac_telemetry: idrac

  • If you would like a local backup of the timescaleDB used to store telemetry data, click here.

After initiation, new iDRACs can be added for idrac_telemetry acquisition by running the following command:

ansible-playbook add_idrac_node.yml -i inventory

Modifying telemetry information

To modify how data is collected from the cluster, modify the variables in omnia/input/telemetry_config.yml and re-run the telemetry.yml playbook.

  • When omnia_telemetry_support is set to false, Omnia Telemetry Acquisition service will be stopped on all cluster nodes provided in the passed inventory.

  • When omnia_telemetry_support is set to true, Omnia Telemetry Acquisition service will be restarted on all cluster nodes provided in the passed inventory.

  • To start or stop the collection of regular metrics, health check metrics, or GPU metrics, update the values of collect_regular_metrics, collect_health_check_metrics, or collect_gpu_metrics. For a list of all metrics collected, click here.

Note

  • Currently, changing the grafana_username and grafana_password values is not supported via telemetry.yml.

  • The passed inventory should have an idrac group, if idrac_telemetry_support is true.

  • If omnia_telemetry_support is true, then the inventory should have control plane and cluster node groups (as specified in the sample files) along with optional login group.

  • Rocky 8.7 is not compatible with the Kubernetes installed by telemetry.yml due to known issues with cri-o. For more information, click here.

  • If a subsequent run of telemetry.yml fails, the telemetry_config.yml file will be unencrypted.

To access the Grafana UI

Prerequisites

  • visualization_support should be set to true when running telemetry.yml or omnia.yml.

  1. Find the IP address of the Grafana service using kubectl get svc -n grafana

../../_images/grafanaIP.png
  2. Log in to the Grafana UI by connecting to the cluster IP of the grafana service obtained above via port 5000, that is, http://xx.xx.xx.xx:5000/login

../../_images/Grafana_login.png
  3. Enter the grafana_username and grafana_password as mentioned in input/telemetry_config.yml.

../../_images/Grafana_Dashboards.png

Loki log collections can be viewed in the Explore section of the Grafana UI.

../../_images/Grafana_Loki.png

Datasources configured by Omnia can be viewed as seen below.

../../_images/GrafanaDatasources.png

To use Loki for log filtering

  1. Log in to the Grafana UI by connecting to the cluster IP of the grafana service obtained above via port 5000, that is, http://xx.xx.xx.xx:5000/login

  2. In the Explore page, select control-plane-loki.

../../_images/Grafana_ControlPlaneLoki.png
  3. The log browser allows you to filter logs by job, node and/or user.

Example

{job="cluster deployment logs"} |= "nodename"
{job="compute log messages"} |= "nodename" |= "node_username"
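LogQL, the query language used by Loki, writes the stream selector in curly braces and chains line filters with |=. If you generate such queries from a script, a small helper keeps the syntax consistent. The helper below is a hypothetical sketch; it does not escape quotes inside its arguments, and the job names are the ones used in the example above.

```python
def loki_query(job, *contains):
    """Build a LogQL query: a job stream selector followed by line filters."""
    query = '{job="%s"}' % job
    for text in contains:
        # Each |= filter keeps only log lines that contain the given text.
        query += ' |= "%s"' % text
    return query


print(loki_query("cluster deployment logs", "nodename"))
print(loki_query("compute log messages", "nodename", "node_username"))
```

Replace nodename and node_username with the actual node name and user name you want to filter on.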

To use Grafana to view telemetry data

  1. Log in to the Grafana UI by connecting to the cluster IP of the grafana service obtained above via port 5000, that is, http://xx.xx.xx.xx:5000/login

  2. In the Explore page, select telemetry-postgres.

../../_images/Grafana_Telemetry_PostGRES.png
  3. The query builder allows you to create SQL commands that can be used to query the omnia_telemetry.metrics table. Filter the data required using the following fields:

  • id: The name of the metric.

  • context: The type of metric being collected (Regular Metric, Health Check Metric and GPU metric).

  • label: A combined field listing the id and context row values.

  • value: The value of the metric at the given timestamp.

  • unit: The unit of measure of the metric (for example, seconds, kb, or percent).

  • system: The service tag of the cluster node.

  • hostname: The hostname of the cluster node.

  • time: The timestamp at which the metric was polled from the cluster node.

If you are more comfortable using SQL queries over the query builder, click on Edit SQL to directly provide your query. Optionally, the data returned from a query can be viewed as a graph.
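As an illustration of the kind of query the fields above support, the sketch below builds a SQL statement that returns one metric for one host over a recent time window. It is an assumption-laden example, not a query taken from Omnia: the function name is hypothetical, the metric and host names are placeholders, the time column is assumed to be a timestamp type, and the %s placeholders follow the style of Python PostgreSQL drivers. In Grafana's Edit SQL box you would substitute literal values for the placeholders.

```python
def metrics_query(metric_id, hostname, hours=1):
    """Return (sql, params) selecting one metric for one host over recent hours."""
    sql = (
        "SELECT time, hostname, value, unit "
        "FROM omnia_telemetry.metrics "
        "WHERE id = %s AND hostname = %s "
        "AND time > now() - make_interval(hours => %s) "
        "ORDER BY time"
    )
    # Values are passed separately so the driver can quote them safely.
    return sql, (metric_id, hostname, hours)


# "example_metric" and "node001" are placeholder values, not real names.
sql, params = metrics_query("example_metric", "node001", hours=6)
print(sql)
print(params)
```

Keeping the values out of the SQL text avoids quoting mistakes when the same query is reused for different metrics or hosts.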

Visualizations

If idrac_telemetry_support and visualization_support are both set to true, Parallel Coordinate graphs can be used to view system statistics.

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.