Accelerator

The accelerator role sets up the AMD ROCm platform or the NVIDIA CUDA toolkit on target nodes. These tools allow workloads to use the GPUs installed on those nodes.

Enter all required parameters in input/accelerator_config.yml.

| Name | Default, Accepted Values | Required? | Information |
|------|--------------------------|-----------|-------------|
| amd_gpu_version | 22.20.3 | optional | The AMD GPU driver version to install for the RHEL version running on the node. Verify that the version provided is present in the repository for your node's OS version by checking https://repo.radeon.com/amdgpu/ . For example, if ‘latest’ is provided and the compute OS version is RHEL 8.5, the URL becomes https://repo.radeon.com/amdgpu/latest/rhel/8.5/main/x86_64/ |
| amd_rocm_version | latest/main | optional | The AMD ROCm driver version to install. Make sure the Red Hat subscription is enabled before installing ROCm, because ROCm packages are present in the CodeReady Builder repository for RHEL. If ‘latest’ is provided, the URL becomes https://repo.radeon.com/rocm/centos8/latest/main/. Omnia supports only a single instance. |
| cuda_toolkit_version | latest | optional | The CUDA toolkit version to install. The latest CUDA toolkit is installed by default unless cuda_toolkit_path is specified. Default: latest (11.8.0). |
| cuda_toolkit_path | | optional | If the latest CUDA toolkit is not required, provide an offline copy of the toolkit installer at the path specified. (Take an RPM copy of the toolkit from here). If cuda_toolkit_version is not latest, cuda_toolkit_path is mandatory. |
| cuda_stream | latest-dkms | optional | A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code. |
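The URL transformation described for amd_gpu_version can be sketched as a small shell helper. This is only an illustration: the function name amdgpu_repo_url is hypothetical and not part of Omnia, and applying the same URL pattern to versions other than ‘latest’ is an assumption generalized from the single example in the table.

```shell
# Hypothetical helper (not part of Omnia): build the amdgpu repository URL
# for a RHEL node from the configured version and the node's OS version.
#   $1 = amd_gpu_version, for example "latest" or "22.20.3"
#   $2 = RHEL version on the node, for example "8.5"
# Assumption: every version follows the same path layout as the 'latest' example.
amdgpu_repo_url() {
    printf 'https://repo.radeon.com/amdgpu/%s/rhel/%s/main/x86_64/\n' "$1" "$2"
}

# Prints https://repo.radeon.com/amdgpu/latest/rhel/8.5/main/x86_64/
amdgpu_repo_url latest 8.5
```

Opening the printed URL in a browser, or requesting it with a tool such as curl, is one way to confirm that the chosen version exists for the node's OS before running the playbook.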

Note

  • For target nodes running Red Hat, ensure that the Red Hat subscription is enabled before running accelerator.yml.

  • If cuda_toolkit_path is provided in input/provision_config.yml and NVIDIA GPUs are available on the target nodes, CUDA packages are deployed automatically once the nodes are provisioned, as part of the provision.yml run.

To install all the latest GPU drivers and toolkits, run:

cd accelerator
ansible-playbook accelerator.yml -i inventory

(where inventory consists of manager, compute and login nodes)
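As a rough illustration of such an inventory, the sketch below writes a sample file with one group per node type. The group names ([manager], [compute], [login]), the IP addresses, the file name inventory.sample, and the helper write_sample_inventory are all assumptions made for this example; check the inventory format expected by your Omnia release before using it.

```shell
# Hypothetical helper: write a sample Ansible inventory to the path given in $1.
# Group names and IP addresses below are illustrative assumptions only.
write_sample_inventory() {
    cat > "$1" <<'EOF'
[manager]
10.5.0.101

[compute]
10.5.0.102
10.5.0.103

[login]
10.5.0.104
EOF
}

# Write to a separate file name so an existing inventory is not overwritten.
write_sample_inventory ./inventory.sample
```

After adjusting the file to match the cluster, it can be passed to the playbook with the -i option in place of inventory.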

The following configurations take place when running accelerator.yml:
  1. Servers with AMD GPUs are identified, and the latest GPU drivers and ROCm platform are downloaded and installed.

  2. Servers with NVIDIA GPUs are identified and the specified CUDA toolkit is downloaded and installed.

  3. On the rare servers with both NVIDIA and AMD GPUs installed, all of the drivers and toolkits listed above are installed.

  4. Servers with neither type of GPU are skipped.
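The four cases above can be sketched as a classification over lspci output. This is not Omnia's actual detection logic, which is not shown here; the function name classify_gpus and the vendor and device-class strings it matches are assumptions chosen for illustration.

```shell
# Hypothetical sketch (not Omnia's implementation): read `lspci` text on stdin
# and print one of: amd, nvidia, both, none.
classify_gpus() {
    # Keep only display-class devices so other AMD parts (for example host
    # bridges) are not counted as GPUs.
    gpu_lines=$(grep -Ei 'VGA compatible controller|3D controller|Display controller' || true)
    has_amd=0
    has_nvidia=0
    if printf '%s\n' "$gpu_lines" | grep -qi 'Advanced Micro Devices'; then has_amd=1; fi
    if printf '%s\n' "$gpu_lines" | grep -qi 'NVIDIA'; then has_nvidia=1; fi
    case "$has_amd$has_nvidia" in
        11) echo both ;;    # case 3: install AMD and NVIDIA components
        10) echo amd ;;     # case 1: GPU drivers and ROCm
        01) echo nvidia ;;  # case 2: CUDA toolkit
        *)  echo none ;;    # case 4: skip the server
    esac
}
```

On a node where lspci is available, running lspci | classify_gpus would print the case that applies to that node.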