GPU accelerator configuration

The accelerator role sets up the AMD ROCm platform or the NVIDIA CUDA toolkit so that workloads can make use of the GPUs installed on cluster nodes.

Before you begin, ensure that the CUDA and ROCm local repositories have been configured using local_repo.yml.

Enter all required parameters in input/accelerator_config.yml.

Parameters

Details

amd_gpu_version

string Optional

The AMD GPU driver version to install for the RHEL version running on the node. Verify that the version you provide is available in the repository for that OS version; compatible versions are listed at https://repo.radeon.com/amdgpu/. For example, if ‘latest’ is provided and the cluster OS version is RHEL 8.5, the URL resolves to https://repo.radeon.com/amdgpu/latest/rhel/8.5/main/x86_64/

Default values: 22.20.3

amd_rocm_version

string Optional

Required AMD ROCm driver version. Ensure that the RedHat subscription is enabled before installing ROCm, because the ROCm packages are in the CodeReady Builder repository for RHEL. If ‘latest’ is provided, the URL resolves to https://repo.radeon.com/rocm/centos8/latest/main/. Omnia supports only a single instance.

Default values: latest/main

cuda_toolkit_version

string Optional

Required CUDA toolkit version. The latest CUDA toolkit (11.8.0) is installed unless cuda_toolkit_path is specified.

Default values: latest

cuda_toolkit_path

string Optional

Path to an offline copy of the CUDA toolkit installer, for use when the latest CUDA toolkit is not required. (Take an RPM copy of the toolkit from here.) This parameter is mandatory if cuda_toolkit_version is not latest.

cuda_stream

string Optional

A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code.

Default values: latest-dkms
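
As a rough illustration, input/accelerator_config.yml populated with the default values from the table above might look like the following. The quoting style and the empty cuda_toolkit_path value are assumptions made for this sketch and are not taken from the file shipped with Omnia.

```yaml
# Sketch of input/accelerator_config.yml using the defaults listed above.
# Quoting and the empty cuda_toolkit_path value are assumptions.
amd_gpu_version: "22.20.3"
amd_rocm_version: "latest/main"
cuda_toolkit_version: "latest"
cuda_toolkit_path: ""          # mandatory only when cuda_toolkit_version is not latest
cuda_stream: "latest-dkms"
```

To pin a specific CUDA toolkit instead, set cuda_toolkit_version to that version and point cuda_toolkit_path at the offline RPM copy of the installer.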

Note

  • Nodes provisioned using the Omnia provision tool do not require a RedHat subscription to run accelerator.yml on RHEL target nodes.

  • For RHEL target nodes not provisioned by Omnia, ensure that a RedHat subscription is enabled on every target node.

  • AMD ROCm driver installation is not supported by Omnia on Rocky cluster nodes.

To install all the latest GPU drivers and toolkits, run:

cd accelerator
ansible-playbook accelerator.yml -i inventory

Running accelerator.yml performs the following configuration steps:

  1. Servers with AMD GPUs are identified and the latest GPU drivers and ROCm platforms are downloaded and installed.

  2. Servers with NVIDIA GPUs are identified and the specified CUDA toolkit is downloaded and installed.

  3. On the rare servers with both NVIDIA and AMD GPUs installed, all of the drivers and toolkits listed above are installed.

  4. Servers with neither type of GPU are skipped.

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.