Installing pmix and updating slurm configuration for AMD processors

This topic explains how to automatically install pmix and update the slurm configuration on AMD servers for MPI jobs. To perform these steps manually, click here.

Prerequisites

  • discovery_provision.yml has been executed.

  • An Omnia slurm cluster has been set up by running omnia.yml, with at least 2 nodes: 1 slurm_control_node and 1 slurm_node.

  • Verify that the target nodes are in the booted state. For more information, click here.

  • A local OpenMPI repository has been created. For more information, see <../LocalRepo/openMPI.html>.

To run the playbook:

cd benchmarks
ansible-playbook amd_benchmark.yml -i inventory
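After the playbook completes, one way to verify that slurm can use pmix is to list the MPI plugin types it recognizes (run on the slurm_control_node; this assumes the cluster from the prerequisites is up):

```shell
# List the MPI plugin types known to slurm; pmix_v2 should appear,
# since Omnia currently supports pmix version 2.
srun --mpi=list
```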

To execute multi-node jobs

  • OpenMPI and aocc-compiler-*.tar must be compiled with slurm and installed on all cluster nodes, or be available on the NFS share.

Note

  • Omnia currently supports pmix version 2 (pmix_v2).

  • While compiling OpenMPI, include pmix, slurm, hwloc, and libevent, as shown in the sample command below:

    ./configure --prefix=/home/omnia-share/openmpi-4.1.5 --enable-mpi1-compatibility --enable-orterun-prefix-by-default --with-slurm=/usr --with-pmix=/usr --with-libevent=/usr --with-hwloc=/usr --with-ucx CC=clang CXX=clang++ FC=flang   2>&1 | tee config.out
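    While not shown above, a configure step like this is typically followed by building and installing OpenMPI into the chosen prefix; a minimal sketch, assuming configure completed without errors:

```shell
# Build using all available cores, logging output as in the configure step
make -j "$(nproc)" 2>&1 | tee make.out
# Install into the --prefix passed to configure (/home/omnia-share/openmpi-4.1.5)
make install 2>&1 | tee install.out
```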
    
  • For a job to run on multiple nodes (10.5.0.4 and 10.5.0.5) where OpenMPI is compiled and installed on the NFS share (/home/omnia-share/openmpi/bin/mpirun), initiate the job as follows:

Note

Ensure amd-zen-hpl-2023_07_18 is downloaded before running this command.

srun -N 2 --mpi=pmix_v2 -n 2 ./amd-zen-hpl-2023_07_18/xhpl

For a batch job using the same parameters, the script would be:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test.log
#SBATCH --partition=normal
#SBATCH -N 3
#SBATCH --time=10:00
#SBATCH --ntasks=2

source /home/omnia-share/setenv_AOCC.sh
export PATH=$PATH:/home/omnia-share/openmpi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/omnia-share/openmpi/lib

srun --mpi=pmix_v2 ./amd-zen-hpl-2023_07_18/xhpl
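A script like the one above can be submitted and monitored with the standard slurm commands (the file name test.sbatch is only an example):

```shell
# Submit the batch script; sbatch prints the assigned job ID
sbatch test.sbatch
# Check the job's state in the queue; per --output, results land in test.log
squeue -u "$USER"
```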

Alternatively, to use mpirun, the script would be:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test.log
#SBATCH --partition=normal
#SBATCH -N 3
#SBATCH --time=10:00
#SBATCH --ntasks=2

source /home/omnia-share/setenv_AOCC.sh
export PATH=$PATH:/home/omnia-share/openmpi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/omnia-share/openmpi/lib

/home/omnia-share/openmpi/bin/mpirun --map-by ppr:1:node -np 2 --display-map --oversubscribe --mca orte_keep_fqdn_hostnames 1 ./xhpl

Note

The above scripts are samples that can be modified as required. Because Omnia maintains all hostnames in FQDN format, ensure that --mca orte_keep_fqdn_hostnames 1 is included in the mpirun command in sbatch scripts; omitting it may cause job initiation to fail.

If you have any feedback about Omnia documentation, please reach out at omnia.readme@dell.com.