You have two options for building a machine image. The first is the Deep Learning AMI. This is the easier option, as there are pre-built versions for popular frameworks like TensorFlow, PyTorch, or MXNet.
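If you go the Deep Learning AMI route, you can look up a recent image ID with the AWS CLI. The sketch below is a minimal example; the name filter is an assumption, so adjust it to the framework, framework version, and OS you actually want.

```bash
# Hypothetical lookup of the most recent matching Deep Learning AMI.
# The name filter is an assumption -- adjust framework/version/OS as needed.
aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=Deep Learning AMI GPU PyTorch * (Ubuntu 20.04) *" \
  --query 'sort_by(Images, &CreationDate)[-1].[ImageId,Name]' \
  --output text
```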
The second is the parallelcluster-efa-gpu-preflight repository. This option gives you more control over the software stack, such as custom CUDA versions and container support via Pyxis and Enroot.
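If you choose the custom AMI instead, the image is built with Packer. The sketch below shows the general shape of such a build; the template filename and the region variable are assumptions, so check the repository's README for the actual entry point and variables.

```bash
# Hypothetical Packer build of a custom GPU/EFA AMI.
# Template filename and variables are assumptions; see the repository README.
cd parallelcluster-efa-gpu-preflight
packer init .                                         # install any required Packer plugins
packer build -var "aws_region=us-east-1" ami.pkr.hcl  # build and register the AMI
```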
Here’s an overview of the software included in each image. Pick one and proceed to either a. Deep Learning AMI or b. Custom AMI with Packer (Optional).
For example, in the Ubuntu 20.04 Deep Learning AMI, we get the following software stack:
| Software | Version / Location |
| --- | --- |
| OS | Ubuntu 20.04 |
| Architecture | x86 |
| Python | 3.9 |
| NVIDIA driver | 535.54.03 |
| CUDA | /usr/local/cuda-xx.x/ |
| EFA | 1.19.0 |
| AWS OFI NCCL | 1.5.0-aws, /usr/local/cuda-xx.x/efa |
| NCCL tests | /usr/local/cuda-11.8/efa/test-cuda-xx.x/ |
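Once you have an instance running from the Deep Learning AMI, you can sanity-check this stack with a few commands. This is a minimal sketch that follows the paths from the table above; replace the cuda-xx.x placeholders with the CUDA version actually installed on the instance.

```bash
# Sanity checks against the software stack listed above.
# Replace cuda-xx.x with the CUDA version actually installed.
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # expect 535.54.03
ls -d /usr/local/cuda-*                                        # installed CUDA toolkits
fi_info -p efa                                                 # EFA provider visible to libfabric
ls /usr/local/cuda-xx.x/efa                                    # AWS OFI NCCL plugin location
ls /usr/local/cuda-xx.x/efa/test-cuda-xx.x/                    # prebuilt NCCL tests
```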
Also, the PyTorch conda environment ships with a dynamically linked AWS OFI NCCL plugin as the conda package aws-ofi-nccl-dlc, and PyTorch will use that package instead of the system AWS OFI NCCL.
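To see which plugin the PyTorch environment actually picks up, you can inspect the conda environment. A minimal sketch, assuming the environment is named pytorch (the name is an assumption; list the environments on your AMI with `conda env list`):

```bash
# Inspect the PyTorch conda environment for the bundled AWS OFI NCCL plugin.
# The environment name "pytorch" is an assumption.
source activate pytorch
conda list | grep -i aws-ofi-nccl                              # expect aws-ofi-nccl-dlc
python -c "import torch; print(torch.cuda.nccl.version())"     # NCCL version PyTorch links against
```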
The parallelcluster-efa-gpu-preflight repository includes the following software stack: