Remove node from the cluster

Use this playbook to remove nodes from the cluster and stop all clustering software on the target nodes.

Note

All target nodes should be drained before executing the playbook. If a job is running on any of the target nodes, the playbook will exit.
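The note above assumes the target nodes are already drained. The following is a minimal sketch of doing that with the standard Slurm (`scontrol`, `squeue`) and Kubernetes (`kubectl drain`) command-line tools. It is not part of Omnia. The helper names, node names, and drain reason are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers (not shipped with Omnia).

# Mark Slurm nodes as DRAIN so the scheduler places no new jobs on them.
drain_slurm_nodes() {
  local nodes="$1"   # comma-separated Slurm node list, e.g. "node001,node002"
  scontrol update NodeName="$nodes" State=DRAIN Reason="omnia node removal"
}

# Cordon each Kubernetes node and evict its pods (DaemonSet pods are skipped).
drain_k8s_nodes() {
  local node
  for node in "$@"; do
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  done
}

# Succeed (return 0) if any job is still allocated to the given nodes.
nodes_have_jobs() {
  local nodes="$1"
  # -h drops the header, -w limits output to the listed nodes, %i prints job IDs.
  [ -n "$(squeue -h -w "$nodes" -o '%i')" ]
}
```

Usage: run `drain_slurm_nodes node001` and `drain_k8s_nodes node001`, then wait until `nodes_have_jobs node001` fails before starting the playbook.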

Configurations performed by the playbook

  • Removes the target nodes from the Slurm and Kubernetes clusters.

  • Updates the Slurm and Kubernetes configuration.

  • Stops the Slurm and Kubernetes services on the target nodes (they are not uninstalled) and disables them in the OS startup service list.
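To confirm on a removed node that the services are stopped and disabled at boot, `systemctl is-active` and `systemctl is-enabled` can be queried per service. This sketch assumes the services are named `slurmd` and `kubelet`. Those names are assumptions and may differ in your deployment. The helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the runtime and boot-time state of each service.
# Service names passed in (e.g. slurmd, kubelet) are assumptions.
check_services() {
  local svc active enabled
  for svc in "$@"; do
    # is-active prints e.g. "active" or "inactive"; is-enabled prints e.g. "enabled" or "disabled".
    active=$(systemctl is-active "$svc" 2>/dev/null)
    enabled=$(systemctl is-enabled "$svc" 2>/dev/null)
    printf '%s: %s, %s\n' "$svc" "${active:-unknown}" "${enabled:-unknown}"
  done
}
```

Usage: `check_services slurmd kubelet` on a removed node should report each service as `inactive, disabled`.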

To run the playbook

Run the playbook using the following commands:

cd utils
ansible-playbook remove_node_config.yml -i inventory

Soft reset the cluster

Use this playbook to stop all Slurm and Kubernetes services. This action will destroy the cluster.

Note

All target nodes should be drained before executing the playbook. If a job is running on any of the target nodes, the playbook will exit.

Configurations performed by the playbook

  • The Slurm or Kubernetes cluster will be reset.

  • The configuration on the kube_control_plane or the slurm_control_plane will be reset.

  • Slurm and Kubernetes services are stopped (not uninstalled).

To run the playbook

Run the playbook using the following commands:

cd utils
ansible-playbook reset_cluster_config.yml -i inventory

Delete node from the cluster

Use this playbook to remove nodes from all inventory files and tables. No changes are made to the Slurm or Kubernetes cluster.

Note

All target nodes should be drained before executing the playbook. If a job is running on any of the target nodes, the playbook will exit.

Configurations performed by the playbook

  • The nodes will be deleted from the Omnia DB, along with their xCAT node objects.

  • Telemetry services will be stopped.
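After the playbook runs, the xCAT `lsdef` command can be used to check that the node objects are gone. This sketch assumes that `lsdef <node>` exits with a non-zero status when no such node object exists. The helper name and node names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether xCAT still has a node object for each node.
# Assumes lsdef returns non-zero when the node object does not exist.
report_xcat_removal() {
  local node
  for node in "$@"; do
    if lsdef "$node" >/dev/null 2>&1; then
      echo "$node: xCAT object still present"
    else
      echo "$node: removed"
    fi
  done
}
```

Usage: `report_xcat_removal node001 node002` prints one status line per node.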

To run the playbook

Run the playbook using the following commands:

cd utils
ansible-playbook delete_node.yml -i inventory

Note

When nodes are added or deleted, the autogenerated inventories (amd_gpu, nvidia_gpu, amd_cpu, and intel_cpu) should be updated to reflect the changes. The Slurm partitions also need to be updated accordingly.
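A Slurm partition's node list can be changed at runtime with `scontrol update PartitionName=... Nodes=...`. This is a minimal sketch. The helper name, partition name, and node list are hypothetical. Runtime changes made with `scontrol` may not persist across a controller restart, so keeping `slurm.conf` in sync is the safer path.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace the node list of one Slurm partition at runtime.
# The same change should also be made in slurm.conf so it is not lost later.
set_partition_nodes() {
  local partition="$1"   # e.g. "normal" (placeholder)
  local nodes="$2"       # comma-separated node list for the partition
  scontrol update PartitionName="$partition" Nodes="$nodes"
}
```

Usage: after deleting `node003`, `set_partition_nodes normal node001,node002` leaves only the remaining nodes in the partition.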

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.