Provision
Before you run the provision tool
(Recommended) Run prereq.sh to get the system ready to deploy Omnia. Alternatively, ensure that Ansible 2.12.10 and Python 3.8 are installed on the system. SELinux should also be disabled.
Set the hostname of the control plane using the hostname.domain name format.
Hostname requirements
The hostname must not contain the following characters: , (comma), . (period) or _ (underscore). However, the domain name may contain commas and periods.
The hostname cannot start or end with a hyphen (-).
No upper case characters are allowed in the hostname.
The hostname cannot start with a number.
The hostname and the domain name (that is: hostname00000x.domain.xxx) cumulatively cannot exceed 64 characters. For example, if the node_name provided in input/provision_config.yml is ‘node’, and the domain_name provided is ‘omnia.test’, Omnia will set the hostname of a target compute node to ‘node00001.omnia.test’. Omnia appends 6 digits to the hostname to individually name each target node.
For example, controlplane.omnia.test is acceptable.
Note
The domain name specified for the control plane should be the same as the one specified under domain_name in input/provision_config.yml.
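The hostname rules above can be checked before provisioning with a short script. The following is a minimal sketch, not part of Omnia: the function name hostname_problems is hypothetical, and it assumes the 64-character limit applies to the full hostname.domain string including the separating period.

```python
import re

MAX_FQDN_LENGTH = 64  # combined limit stated above (assumed to include the period)


def hostname_problems(fqdn):
    """Return a list of rule violations for a 'hostname.domain' string."""
    # Everything before the first period is the hostname, so the hostname
    # itself can never contain a period.
    hostname, _, domain = fqdn.partition(".")
    problems = []
    if not hostname:
        problems.append("missing hostname")
    if not domain:
        problems.append("missing domain name")
    if re.search(r"[,_]", hostname):
        problems.append("hostname contains a comma or underscore")
    if hostname.startswith("-") or hostname.endswith("-"):
        problems.append("hostname starts or ends with a hyphen")
    if hostname != hostname.lower():
        problems.append("hostname contains upper case characters")
    if hostname[:1].isdigit():
        problems.append("hostname starts with a number")
    if len(fqdn) > MAX_FQDN_LENGTH:
        problems.append("hostname and domain name exceed 64 characters")
    return problems


print(hostname_problems("controlplane.omnia.test"))
print(hostname_problems("-Control_plane.omnia.test"))
```

An empty list means the name passed every check in this sketch; it does not replace the validation that Omnia itself performs.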
To provision the bare metal servers, download one of the following ISOs for deployment:
Note the compatibility between cluster OS and control plane OS below:
Control Plane OS    Compute Node OS    Compatibility
RHEL                RHEL               Yes
RHEL                Rocky              Yes
Rocky               RHEL               Yes [1]
Rocky               Rocky              Yes
To set up CUDA and OFED using the provisioning tool, download the required repositories from here:
To dictate IP address/MAC mapping, a host mapping file can be provided. Use the pxe_mapping_file.csv to create your own mapping file.
Ensure that all connection names under the network manager match their corresponding device names. To list the connection names and their devices, run:
nmcli connection
In the event of a mismatch, edit the file /etc/sysconfig/network-scripts/ifcfg-<nic name> using the vi editor.
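On systems with many NICs, the comparison of connection names and device names can be scripted. The sketch below is an illustration only: it assumes the input was captured with nmcli -t -f NAME,DEVICE connection show (terse output, one NAME:DEVICE pair per line), and the function name mismatched_connections is hypothetical.

```python
def mismatched_connections(terse_output):
    """Return (connection_name, device) pairs whose names differ."""
    mismatches = []
    for line in terse_output.splitlines():
        if ":" not in line:
            continue  # blank or unexpected line
        # Split on the last colon: device names contain no colons, while
        # colons inside a connection name are escaped as "\:" in terse mode.
        name, _, device = line.rpartition(":")
        name = name.replace("\\:", ":")
        if device in ("", "--"):
            continue  # profile is not bound to an active device
        if name != device:
            mismatches.append((name, device))
    return mismatches


# Sample text in the assumed NAME:DEVICE format.
sample = "eno1:eno1\nWired connection 1:eno2\nold-profile:\n"
print(mismatched_connections(sample))
```

Each pair reported here corresponds to an ifcfg-<nic name> file that would need its connection name corrected.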
When discovering nodes via snmpwalk or a mapping file, all target nodes should be set up in PXE mode before running the playbook.
Nodes provisioned using the Omnia provision tool do not require a RedHat subscription to run provision.yml on RHEL target nodes.
For RHEL target nodes not provisioned by Omnia, ensure that RedHat subscription is enabled on all target nodes. Every target node will require a RedHat subscription.
Users should also ensure that all repos (AppStream, BaseOS and CRB) are available on the RHEL control plane.
Ensure that the pxe_nic and public_nic are in the firewalld zone: public.
The control plane NIC connected to remote servers (through the switch) should be configured with two IPs in a shared LOM set up. This NIC is configured by Omnia with the IPs xx.yy.255.254 and aa.bb.255.254 (where xx.yy are taken from bmc_nic_subnet and aa.bb are taken from admin_nic_subnet) when network_interface_type is set to lom. For other discovery mechanisms, only the admin NIC is configured with aa.bb.255.254 (where aa.bb is taken from admin_nic_subnet).
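The xx.yy.255.254 pattern described above can be worked out for a given subnet as follows. This is a small sketch for illustration: the function name control_plane_ip and the sample subnet values are assumptions, not values taken from Omnia.

```python
import ipaddress


def control_plane_ip(subnet):
    """Build xx.yy.255.254 from the first two octets of a subnet address."""
    # IPv4Address validates the input and raises ValueError if it is malformed.
    octets = str(ipaddress.IPv4Address(subnet)).split(".")
    return ".".join(octets[:2] + ["255", "254"])


# Hypothetical subnet values for bmc_nic_subnet and admin_nic_subnet.
print(control_plane_ip("10.3.0.0"))
print(control_plane_ip("10.5.0.0"))
```

Knowing these addresses in advance helps confirm that nothing else on the BMC or admin network is already using them.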
Note
After configuration and installation of the cluster, changing the control plane is not supported. If you need to change the control plane, you must redeploy the entire cluster.
If there are errors while executing any of the Ansible playbook commands, then re-run the playbook.
For servers with an existing OS being discovered via BMC, ensure that the first PXE device on target nodes is the designated active NIC for PXE booting.
Provisioning the cluster
Edit the input/provision_config.yml file to update the required variables. A list of the variables required is available by discovery mechanism.
Note
The first PXE device on target nodes should be the designated active NIC for PXE booting.
Optional configurations managed by the provision tool
Installing CUDA
Using the provision tool
If cuda_toolkit_path is provided in input/provision_config.yml and NVIDIA GPUs are available on the target nodes, CUDA packages will be deployed post provisioning without user intervention.
Using the Accelerator playbook
CUDA can also be installed using accelerator.yml after provisioning the servers (assuming the provision tool did not install CUDA packages).
Note
The CUDA package can be downloaded from here
CUDA requires an additional reboot while being installed. While this is taken care of by Omnia, users are required to wait an additional few minutes when running the provision tool with CUDA installation for the target nodes to come up.
Installing OFED
Using the provision tool
If mlnx_ofed_path is provided in input/provision_config.yml and Mellanox NICs are available on the target nodes, OFED packages will be deployed post provisioning without user intervention.
Using the Network playbook
OFED can also be installed using network.yml after provisioning the servers (assuming the provision tool did not install OFED packages).
Note
The OFED package can be downloaded from here.
Assigning infiniband IPs
When ib_nic_subnet is provided in input/provision_config.yml, the infiniband NICs on target nodes are assigned IPv4 addresses within the subnet without user intervention. When PXE range and Infiniband subnet are provided, the infiniband NICs will be assigned IPs with the same 3rd and 4th octets as the PXE NIC.
For example, on a target node where the PXE NIC is assigned 10.5.0.101, the Infiniband NIC is assigned 10.10.0.101 (where ib_nic_subnet is 10.10.0.0).
Note
The IP is assigned to the interface ib0 on target nodes only if the interface is present in active mode. If no such NIC interface is found, xCAT will list the status of the node object as failed.
Assigning BMC IPs
When target nodes are discovered via SNMP or mapping files (that is, discovery_mechanism is set to snmp or mapping in input/provision_config.yml), the bmc_nic_subnet in input/provision_config.yml can be used to assign BMC IPs to iDRAC without user intervention. When PXE range and BMC subnet are provided, the iDRAC NICs will be assigned IPs with the same 3rd and 4th octets as the PXE NIC.
For example, on a target node where the PXE NIC is assigned 10.5.0.101, the iDRAC NIC is assigned 10.3.0.101 (where bmc_nic_subnet is 10.3.0.0).
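The octet rule used for both Infiniband and BMC addresses can be reproduced with a few lines of Python. This is a sketch for illustration only: the function name derive_nic_ip is hypothetical, and it assumes the first two octets of the subnet identify the network, as the examples in this section imply.

```python
import ipaddress


def derive_nic_ip(pxe_ip, nic_subnet):
    """Combine the first two octets of nic_subnet with the last two of pxe_ip."""
    pxe = ipaddress.IPv4Address(pxe_ip).packed         # 4 raw bytes
    subnet = ipaddress.IPv4Address(nic_subnet).packed  # 4 raw bytes
    return str(ipaddress.IPv4Address(subnet[:2] + pxe[2:]))


# The two examples given in this section.
print(derive_nic_ip("10.5.0.101", "10.10.0.0"))  # Infiniband NIC
print(derive_nic_ip("10.5.0.101", "10.3.0.0"))   # iDRAC NIC
```

This is useful for predicting which address a node will receive, for instance when cross-checking the ib_ip and bmc_ip columns of the node database.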
Using multiple versions of a given OS
Omnia now supports deploying different versions of the same OS. With each run of provision.yml, a new deployable OS image is created with a distinct type (rocky or RHEL) and version (8.0, 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 8.7) depending on the values provided in input/provision_config.yml.
Note
While Omnia deploys the minimal version of the OS, the multiple version feature requires that the Rocky full (DVD) version of the OS be provided.
The multiple OS feature is only available with Rocky 8.7 when xCAT 2.16.5 is in use. [Currently, Omnia uses 2.16.4]
DHCP routing for internet access
Omnia now supports DHCP routing via the control plane. To enable routing, update the primary_dns and secondary_dns in input/provision_config.yml with the appropriate IPs (hostnames are currently not supported). For compute nodes that are not directly connected to the internet (that is, only the PXE network is configured), this configuration allows for internet connectivity.
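Because hostnames are not accepted for primary_dns and secondary_dns, it can help to confirm that each value parses as an IP address before running the playbook. The check below is a minimal sketch using the Python standard library; the function name is_ip_address and the sample values are hypothetical, and the check accepts both IPv4 and IPv6 literals, which may be broader than what Omnia accepts.

```python
import ipaddress


def is_ip_address(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


# Hypothetical candidate values for primary_dns and secondary_dns.
for candidate in ("8.8.8.8", "dns.example.com"):
    print(candidate, is_ip_address(candidate))
```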
Disk partitioning
Omnia now allows for customization of disk partitions applied to remote servers. The disk partition desired_capacity has to be provided in MB. Valid mount_point values accepted for disk partition are /home, /var, /tmp, /usr and swap. By default, /boot is allocated 1024MB, /boot/efi is allocated 256MB and the remaining space goes to the / partition. Values are accepted in the form of a JSON list such as:
disk_partition:
- { mount_point: "/home", desired_capacity: "102400" }
- { mount_point: "swap", desired_capacity: "10240" }
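A disk_partition list can be sanity-checked against the rules above before it is placed in the configuration file. The following sketch is illustrative and not part of Omnia: the function name partition_problems is hypothetical, and it assumes desired_capacity must be a positive whole number of MB.

```python
import re

VALID_MOUNT_POINTS = {"/home", "/var", "/tmp", "/usr", "swap"}


def partition_problems(disk_partition):
    """Return a list of problems found in a disk_partition list."""
    problems = []
    for entry in disk_partition:
        mount_point = entry.get("mount_point")
        capacity = str(entry.get("desired_capacity", ""))
        if mount_point not in VALID_MOUNT_POINTS:
            problems.append(f"unsupported mount_point: {mount_point}")
        # Assumption: the size in MB is a positive integer with no unit suffix.
        if not re.fullmatch(r"[1-9][0-9]*", capacity):
            problems.append(f"invalid desired_capacity for {mount_point}: {capacity!r}")
    return problems


# The same entries as the example above, written as Python dictionaries.
example = [
    {"mount_point": "/home", "desired_capacity": "102400"},
    {"mount_point": "swap", "desired_capacity": "10240"},
]
print(partition_problems(example))
```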
Running the provision tool
To deploy the Omnia provision tool, run the following commands:
cd provision
ansible-playbook provision.yml
provision.yml runs in three stages that can be called individually:
Preparing the control plane
Verifies pre-requisites such as SELinux and xCAT services status.
Installs required tool packages.
Verifies and updates firewall settings.
Installs xCAT.
Configures xCAT databases based on input/provision_config.yml.
To call this playbook individually, ensure that input/provision_config.yml is updated and then run:
ansible-playbook prepare_cp.yml
Creating/updating the repositories
Creates and updates all repositories required locally.
This playbook also invokes the airgap.yml playbook for creating offline repositories. For more information on this, click here.
To call this playbook individually, ensure that prepare_cp.yml has run at least once and then run:
ansible-playbook repo_manipulate.yml
Discovering/provisioning the nodes
Discovers all target servers based on specifications in input/provision_config.yml.
Provisions all discovered servers.
PostgreSQL database is set up with all relevant cluster information such as MAC IDs, hostname, admin IP, infiniband IPs, BMC IPs etc.
To access the DB, run:
psql -U postgres
\c omniadb
To view the schema being used in the cluster:
\dn
To view the tables in the database:
\dt
To view the contents of the nodeinfo table:
select * from cluster.nodeinfo;
 id | serial  | node            | hostname                   | admin_mac         | admin_ip   | bmc_ip      | ib_ip       | status     | bmc_mode | switch_ip    | switch_name | switch_port
----+---------+-----------------+----------------------------+-------------------+------------+-------------+-------------+------------+----------+--------------+-------------+-------------
  1 | XXXXXXX | omnia-node00001 | omnia-node00001.omnia.test | ec:2a:72:34:f7:26 | 10.5.0.101 | 10.19.0.101 | 10.10.0.101 | booted     |          | 10.96.28.132 | switch1     | 2
  2 | XXXXXXX | omnia-node00002 | omnia-node00002.omnia.test |                   | 10.5.0.102 | 10.19.0.102 | 10.10.0.102 |            |          | 10.96.28.132 | switch1     | 3
  3 | XXXXXXX | omnia-node00003 | omnia-node00003.omnia.test |                   | 10.5.0.103 | 10.19.0.103 | 10.10.0.103 |            |          | 10.96.28.132 | switch1     | 4
  4 | XXXXXXX | omnia-node00004 | omnia-node00004.omnia.test | 2c:ea:7f:3d:6b:98 | 10.5.0.104 | 10.19.0.104 | 10.10.0.104 | installing |          | 10.96.28.132 | switch1     | 5
  5 | XXXXXXX | omnia-node00005 | omnia-node00005.omnia.test |                   | 10.5.0.105 | 10.19.0.105 | 10.10.0.105 |            |          | 10.96.28.132 | switch1     | 6
  6 | XXXXXXX | omnia-node00006 | omnia-node00006.omnia.test |                   | 10.5.0.106 | 10.19.0.106 | 10.10.0.106 |            |          | 10.96.28.132 | switch1     | 7
  7 | XXXXXXX | omnia-node00007 | omnia-node00007.omnia.test | 4c:d9:8f:76:48:2e | 10.5.0.107 | 10.19.0.107 | 10.10.0.107 | booted     |          | 10.96.28.132 | switch1     | 8
  8 | XXXXXXX | omnia-node00008 | omnia-node00008.omnia.test |                   | 10.5.0.108 | 10.19.0.108 | 10.10.0.108 |            |          | 10.96.28.132 | switch1     | 1
  9 | XXXXXXX | omnia-node00009 | omnia-node00009.omnia.test |                   | 10.5.0.109 | 10.19.0.109 | 10.10.0.109 | failed     |          | 10.96.28.132 | switch1     | 10
 10 | XXXXXXX | omnia-node00010 | omnia-node00010.omnia.test |                   | 10.5.0.110 | 10.19.0.110 | 10.10.0.110 |            |          | 10.96.28.132 | switch1     | 12
 11 | XXXXXXX | omnia-node00011 | omnia-node00011.omnia.test |                   | 10.5.0.111 | 10.19.0.111 | 10.10.0.111 | failed     |          | 10.96.28.132 | switch1     | 13
 12 | XXXXXXX | omnia-node00012 | omnia-node00012.omnia.test |                   | 10.5.0.112 | 10.19.0.112 | 10.10.0.112 |            |          | 10.96.28.132 | switch1     | 14
Possible values of status are static, powering-on, installing, bmcready, booting, post-booting, booted, failed. The status will be updated every 3 minutes.
Note
For nodes listing status as ‘failed’, provisioning logs can be viewed in /var/log/xcat/xcat.log on the target nodes.
To call this playbook individually, ensure that repo_manipulate.yml has run at least once and then run:
ansible-playbook discovery_provision.yml
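Since each stage expects the earlier ones to have run at least once, a re-run that starts partway through should still execute the remaining stages in order. The sketch below only builds the command list; the helper name commands_from is hypothetical, and actually executing the commands (for example with subprocess.run from the provision directory) is left to the operator.

```python
# Stage playbooks in the order described above.
STAGES = ["prepare_cp.yml", "repo_manipulate.yml", "discovery_provision.yml"]


def commands_from(stage):
    """Return the ansible-playbook commands for `stage` and every later stage."""
    start = STAGES.index(stage)  # raises ValueError for an unknown playbook
    return [["ansible-playbook", playbook] for playbook in STAGES[start:]]


for command in commands_from("repo_manipulate.yml"):
    print(" ".join(command))
```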
Note
If the cluster does not have access to the internet, AppStream will not function. To provide internet access through the control plane (via the PXE network NIC), update primary_dns and secondary_dns in provision_config.yml and run provision.yml.
All ports required for xCAT to run will be opened (For a complete list, check out the Security Configuration Document).
After running provision.yml, the file input/provision_config.yml will be encrypted. To edit the file, use the command:
ansible-vault edit provision_config.yml --vault-password-file .provision_vault_key
To re-provision target servers, provision.yml can be re-run with a new inventory file that contains a list of admin (PXE) IPs. For more information, click here.
Post execution of provision.yml, IPs/hostnames cannot be re-assigned by changing the mapping file. However, the addition of new nodes is supported as explained below.
Warning
Once xCAT is installed, restart your SSH session to the control plane to ensure that the newly set up environment variables come into effect.
To avoid breaking the passwordless SSH channel on the control plane, do not run ssh-keygen commands post execution of provision.yml.
After running the provision tool
Once the servers are provisioned, run the post provision script to:
Create node_inventory in /opt/omnia listing provisioned nodes.
cat /opt/omnia/node_inventory
10.5.0.100 service_tag=XXXXXXX operating_system=RedHat
10.5.0.101 service_tag=XXXXXXX operating_system=RedHat
10.5.0.102 service_tag=XXXXXXX operating_system=Rocky
10.5.0.103 service_tag=XXXXXXX operating_system=Rocky
To run the script, use the below command:
ansible-playbook post_provision.yml
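Once /opt/omnia/node_inventory exists, its one-node-per-line layout (an admin IP followed by key=value pairs) is easy to read programmatically, for example to select nodes by operating system. The parser below is a sketch based only on the sample output shown above; the function name parse_node_inventory is hypothetical and the real file may carry additional fields.

```python
def parse_node_inventory(text):
    """Map each admin IP to a dictionary of its key=value attributes."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        ip, attributes = fields[0], fields[1:]
        # Ignore any token that is not in key=value form.
        nodes[ip] = dict(item.split("=", 1) for item in attributes if "=" in item)
    return nodes


# Sample lines in the format shown above.
sample = """\
10.5.0.100 service_tag=XXXXXXX operating_system=RedHat
10.5.0.102 service_tag=XXXXXXX operating_system=Rocky
"""
inventory = parse_node_inventory(sample)
rocky_nodes = [ip for ip, attrs in inventory.items()
               if attrs.get("operating_system") == "Rocky"]
print(rocky_nodes)
```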
If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.