Releases

1.6

  • WACO support with Ubuntu 22.04 OS:

  • Additional Features

1.5.1

  • Omnia now installs Kubernetes 1.26.

1.5

  • Extensive telemetry and monitoring have been added to the Omnia stack, intended for consumption by customers who use Dell systems and Omnia to provide SaaS/IaaS solutions. These include, but are not limited to:

– CPU Utilization and status

– GPU utilization

– Node Count

– Network Packet I/O

– HDD capacity and free space

– Memory capacity and utilization

– Queued and Running Job Count

– User Count

– Cluster HW Health Checks (PCIe, NVLink, BMC, temperatures)

– Cluster SW Health Checks (dmesg, BeeGFS, Kubernetes nodes/pods, MySQL on control plane)

  • Metrics are extracted using a combination of the following tools: psutil, smartctl, beegfs-ctl, nvidia-smi, and rocm-smi. With this groundwork in place, additional metrics from these tools will be quicker to implement in the future (a brief illustrative sketch follows this list).

  • Telemetry and health checks can be optionally disabled.

  • Log Aggregation via xCAT syslog:

– Logs are aggregated on the control plane; the default grouping is by “severity”, with other groupings available.

– Uses Grafana Loki for viewing.

  • The Omnia GitHub repository now hosts a “genesis” image with this functionality baked in for initial boot.

  • Host aliasing for Scheduler and IPA authentication.

  • Login and kube_control_plane access from both public and private NICs.

  • Validation check enhancements:

– Checks are rearranged to occur as early as possible.

– Checks are isolated when running smaller playbooks.
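
The snippet below is a minimal, illustrative sketch of how metrics of this kind can be gathered with the tools named above (psutil for CPU, memory, disk, and network counters; nvidia-smi for GPU utilization). It is not Omnia's telemetry collector, only an indication of the data sources involved:

    # Illustrative only: a one-shot metrics snapshot using the tools listed above.
    import shutil
    import subprocess

    import psutil  # third-party; pip install psutil

    def collect_metrics():
        metrics = {
            "cpu_percent": psutil.cpu_percent(interval=1),
            "memory_percent": psutil.virtual_memory().percent,
            "disk_free_gb": psutil.disk_usage("/").free / 1e9,
            "net_io": psutil.net_io_counters()._asdict(),
        }
        # GPU utilization via nvidia-smi, if an NVIDIA driver is present.
        if shutil.which("nvidia-smi"):
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=utilization.gpu",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True, check=True,
            ).stdout
            metrics["gpu_percent"] = [int(v) for v in out.split()]
        return metrics

    if __name__ == "__main__":
        print(collect_metrics())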

1.4.3

  • XE9640, R760xa, and R760xd2 are now supported as control planes or target nodes with NVIDIA H100 accelerators.

  • Added the ability for split port configuration on the NVIDIA Quantum-2-based QM9700 (NVIDIA InfiniBand NDR400 switches).

  • Extended passwordless SSH support to configure multiple users in a single execution.

  • Input mapping files and inventory files now support commented entries for customized playbook execution.

  • NFS share is now available for hosting user home directories within the cluster.

1.4.2

  • XE9680, R760, R7625, R6615, R7615 are now supported as control planes or target nodes.

  • Added ability for switch-based discovery of remote servers and PXE provisioning.

  • An active Red Hat subscription is no longer required on the control plane and the cluster nodes. Users can configure and use local RHEL repositories.

  • IP ranges can be defined for assignment to remote nodes when discovered via the switch.

1.4.1

  • R660, R6625 and C6620 platforms are now supported as control planes or target nodes.

  • One-touch provisioning now allows for OFED and NVIDIA CUDA Toolkit installation, along with iDRAC and InfiniBand IP configuration, on target nodes.

  • Potential servers can now be discovered via iDRAC.

  • Servers can be provisioned automatically without manual intervention for booting/PXE settings.

  • Target node provisioning status can now be checked on the control plane by viewing the OmniaDB (a brief illustrative sketch follows this list).

  • Omnia clusters can be configured with password-less SSH for seamless execution of HPC jobs run by non-root users.

  • Accelerator drivers can be installed on Rocky target nodes in addition to RHEL.
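
As an illustration of the provisioning-status lookup mentioned above, the sketch below queries a PostgreSQL database on the control plane with psycopg2. The database, table, and column names are placeholders rather than the actual OmniaDB schema; consult the Omnia documentation for the real names:

    # Illustrative only: the schema names below are placeholders, not the real OmniaDB layout.
    import psycopg2  # third-party; pip install psycopg2-binary

    conn = psycopg2.connect(dbname="omniadb", user="postgres", host="localhost")
    with conn, conn.cursor() as cur:
        # Placeholder table and columns for node name, admin IP, and provisioning status.
        cur.execute("SELECT node, admin_ip, status FROM cluster.nodeinfo;")
        for node, admin_ip, status in cur.fetchall():
            print(f"{node:20} {admin_ip:16} {status}")
    conn.close()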

1.4

  • Provisioning of remote nodes through PXE boot by providing TOR switch IP

  • Provisioning of remote nodes through PXE boot by providing mapping file

  • PXE provisioning of remote nodes through admin NIC or shared LOM NIC

  • Database update of MAC address, hostname, and admin IP

  • Optional monitoring support (Grafana installation) on the control plane

  • OFED installation on the remote nodes

  • CUDA installation on the remote nodes

  • AMD accelerator and ROCm support on the remote nodes

  • Omnia playbook execution with Kubernetes, Slurm, and FreeIPA installation on all cluster nodes

  • InfiniBand switch configuration and split port functionality

  • Added support for Ethernet Z series switches.

1.3

  • CLI support for all Omnia playbooks (AWX GUI is now optional/deprecated).

  • Automated discovery and configuration of all devices (including PowerVault, InfiniBand, and Ethernet switches) in shared LOM configuration.

  • Job-based user access with Slurm.

  • AMD server support (R6415, R7415, R7425, R6515, R6525, R7515, R7525, C6525).

  • PowerVault ME5 series support (ME5012, ME5024, ME5084).

  • PowerVault ME4 and ME5 SAS controller configuration, plus NFS server and client configuration.

  • NFS bolt-on support.

  • BeeGFS bolt-on support.

  • Lua and Lmod installation on manager and compute nodes running Red Hat 8.x, Rocky 8.x, and Leap 15.3.

  • Automated setup of FreeIPA client on all nodes.

  • Automated configuration of PXE device settings (active NIC) on iDRAC.

1.2.2

  • Bugfix patch release to address AWX Inventory not being updated.

1.2.1

  • HPC cluster formation using shared LOM network

  • Support for PXE boot on the shared LOM network as well as the high-speed Ethernet or InfiniBand path.

  • Support for BOSS Control Card

  • Support for RHEL 8.x with ability to activate the subscription

  • Ability to upgrade Kernel on RHEL

  • Bolt-on Support for BeeGFS

1.2.0.1

  • Bugfix patch release which addresses the broken Cobbler container issue.

  • Rocky 8.6 Support

1.2

  • Omnia supports Rocky 8.5 full OS on the Control Plane

  • Omnia supports Ansible 2.12 (ansible-core) with Python 3.6 support

  • All packages required to enable the HPC/AI cluster are deployed as a pod on the control plane

  • Omnia now installs Grafana as a single pane of glass to view logs, metrics, and telemetry visualizations

  • Cluster node provisioning can be done via PXE and iDRAC

  • Omnia supports multiple operating systems on the cluster including support for Rocky 8.5 and OpenSUSE Leap 15.3

  • Omnia can deploy cluster nodes with a single NIC.

  • All Cluster metrics can be viewed using Grafana on the Control plane (as opposed to checking the kube_control_plane on each cluster)

  • AWX node inventory now displays service tags with the relevant operating system.

  • Omnia adheres to most of the requirements of NIST 800-53 and NIST 800-171 guidelines on the control plane and login node.

  • Omnia has extended the FreeIPA feature to provide authentication and authorization on Rocky Nodes.

  • Omnia uses 389ds (https://directory.fedoraproject.org/) to provide authentication and authorization on Leap Nodes.

  • Email Alerts have been added in case of login failures.

  • Administrator can restrict users or hosts from accessing the control plane and login node over SSH.

  • Malicious or unwanted network software access can be restricted by the administrator.

  • Admins can restrict the idle time allowed in an ssh session.

  • Omnia installs AppArmor to restrict program access on Leap nodes.

  • Security on audit log access is provided.

  • Program execution on the control plane and login node is logged using the snoopy tool.

  • User activity on the control plane and login node is monitored using the psacct/acct tools installed by Omnia.

  • Omnia fetches key performance indicators from the iDRACs present in the cluster (a brief illustrative sketch follows this list).

  • Omnia also supports fetching performance indicators on the nodes in the cluster when Slurm jobs are running.

  • The telemetry data is plotted on Grafana to provide better visualization capabilities.

  • Four visualization plugins are supported to provide and analyze iDRAC and Slurm data.

    • Parallel Coordinate

    • Spiral

    • Sankey

    • Stream-net (aka Power Map)

  • In addition to the above features, changes have been made to enhance the performance of Omnia.
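
As a rough illustration of the kind of iDRAC data referred to above, the sketch below reads thermal and power telemetry over the standard Redfish REST API using the requests library. It is not Omnia's telemetry implementation; the host, credentials, and chassis path are placeholders to adapt for your environment:

    # Illustrative only: reads thermal and power telemetry from an iDRAC via Redfish.
    # Host and credentials below are placeholders.
    import requests  # third-party; pip install requests

    IDRAC = "https://idrac-host"                                # placeholder
    AUTH = ("user", "password")                                 # placeholder
    CHASSIS = IDRAC + "/redfish/v1/Chassis/System.Embedded.1"   # typical iDRAC chassis path

    def get(url):
        # verify=False because iDRACs commonly use self-signed certificates;
        # use a proper CA bundle in real deployments.
        resp = requests.get(url, auth=AUTH, verify=False, timeout=30)
        resp.raise_for_status()
        return resp.json()

    thermal = get(CHASSIS + "/Thermal")
    for sensor in thermal.get("Temperatures", []):
        print(sensor.get("Name"), sensor.get("ReadingCelsius"), "C")

    power = get(CHASSIS + "/Power")
    for ctrl in power.get("PowerControl", []):
        print("Power consumed (W):", ctrl.get("PowerConsumedWatts"))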

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.