This section documents how to build a ParallelCluster AMI based on the Deep Learning Ubuntu 20.04 AMI. This means we’ll take an AMI that’s pre-built with PyTorch, CUDA, NCCL, and EFA, and then add the ParallelCluster components such as Slurm and Lustre on top.
We recommend this approach over starting from an AWS ParallelCluster AMI and customizing it, because installing the CUDA stack yourself can be complicated and error-prone. This approach relies on the stack that is pre-built into the Deep Learning AMI and skips that work entirely.
Find the name of the AMI that you’d like to use on the release notes page. If you’re unsure, the command below uses the AWS Deep Learning Base GPU AMI (Ubuntu 20.04), which is a good AMI to start with.
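Next, run the following command with the name filter set to the AMI you want, e.g. Deep Learning Base GPU AMI (Ubuntu 20.04) ????????. Each ? is a single-character wildcard, so the eight of them match the eight-character suffix at the end of the AMI name: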
aws ec2 describe-images --region us-east-1 --owners amazon --filters 'Name=name,Values=Deep Learning Base GPU AMI (Ubuntu 20.04) ????????' 'Name=state,Values=available' --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' --output text
You’ll get an AMI ID like ami-0528af10692058c25.
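If you script this step, it can help to capture the AMI ID in a variable and check it before going further, since the query can come back empty when no image matches the name filter. The following is a minimal sketch, not part of the original instructions: `is_ami_id` and `latest_ami` are hypothetical helper names, and the check assumes AMI IDs have the form `ami-` followed by 8 or 17 lowercase hex characters.

```shell
# Hypothetical helper: succeed only if $1 looks like an AMI ID,
# i.e. "ami-" followed by 8 or 17 lowercase hex characters (assumed format).
is_ami_id() {
  printf '%s\n' "$1" | grep -Eq '^ami-([0-9a-f]{8}|[0-9a-f]{17})$'
}

# Hypothetical helper: print the ID of the newest available Amazon-owned AMI
# whose name matches the pattern in $1 (region in $2, default us-east-1).
# This wraps the same describe-images call shown above.
latest_ami() {
  aws ec2 describe-images --region "${2:-us-east-1}" --owners amazon \
    --filters "Name=name,Values=$1" 'Name=state,Values=available' \
    --query 'reverse(sort_by(Images, &CreationDate))[:1].ImageId' \
    --output text
}

# Usage (requires the AWS CLI and valid credentials):
#   AMI_ID=$(latest_ami 'Deep Learning Base GPU AMI (Ubuntu 20.04) ????????')
#   is_ami_id "$AMI_ID" || { echo "no matching AMI found" >&2; exit 1; }
```

Failing early here avoids starting an image build with an empty or malformed parent image.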
Next, using the AMI ID we fetched, create a config file dl-ami.yaml like so:
Build:
  InstanceType: g4dn.2xlarge
  ParentImage: ami-0528af10692058c25
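The config file can also be generated from a shell variable instead of being edited by hand, which keeps the AMI ID in one place. This is a sketch under assumptions: `write_build_config` is a hypothetical helper name, and the default instance type and file name are simply the values used in this section.

```shell
# Hypothetical helper: write a minimal build-image config.
#   $1 = parent AMI ID (required)
#   $2 = build instance type (default: g4dn.2xlarge)
#   $3 = output file (default: dl-ami.yaml)
write_build_config() {
  cat > "${3:-dl-ami.yaml}" <<EOF
Build:
  InstanceType: ${2:-g4dn.2xlarge}
  ParentImage: $1
EOF
}

# Example: reproduce the config shown above, then print it for review.
write_build_config "ami-0528af10692058c25"
cat dl-ami.yaml
```

The two nested keys are indented under `Build:`, which YAML requires for them to be read as part of that section.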
Then run the pcluster build-image command and specify that config file:
pcluster build-image --image-id pcluster-3-7-1-deep-learning-ubuntu -c dl-ami.yaml
Think of the --image-id flag as the name of the image. In the example above we call it pcluster-3-7-1-deep-learning-ubuntu so it’s easy to see which version of ParallelCluster the image was built for, along with the framework and OS. Feel free to change this to suit your use case.
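Image builds take a while, so you may want to poll for completion rather than check by hand. The sketch below assumes that `pcluster describe-image` accepts a `--query` expression and that an in-progress build reports `BUILD_IN_PROGRESS`; only `BUILD_COMPLETE` is confirmed by this section, so verify the other status values against your installed ParallelCluster version. `wait_for_image` is a hypothetical helper name.

```shell
# Hypothetical helper: poll an image build until it completes or fails.
#   $1 = image ID passed to `pcluster build-image --image-id`
#   $2 = seconds between polls (default: 60)
wait_for_image() {
  while :; do
    # --query is assumed to return the status as a JSON string; strip the quotes.
    status=$(pcluster describe-image --image-id "$1" \
      --query imageBuildStatus | tr -d '"')
    case "$status" in
      BUILD_COMPLETE)
        echo "image $1 is ready"
        return 0 ;;
      BUILD_IN_PROGRESS)   # assumed status name for a running build
        sleep "${2:-60}" ;;
      *)
        echo "unexpected status for $1: $status" >&2
        return 1 ;;
    esac
  done
}

# Usage (requires the pcluster CLI and valid credentials):
#   wait_for_image pcluster-3-7-1-deep-learning-ubuntu
```

Treating any unrecognized status as a failure keeps the loop from spinning forever if the build ends in a state the script does not know about.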
Once the image has finished building, you’ll see it in the output of pcluster list-images:
$ pcluster list-images --image-status AVAILABLE
{
  "images": [
    {
      "imageId": "pcluster-3-7-1-deep-learning-ubuntu",
      "imageBuildStatus": "BUILD_COMPLETE",
      "ec2AmiInfo": {
        "amiId": "ami-1234abcd5678efgh"
      },
      "region": "us-east-1",
      "version": "3.7.1"
    }
  ]
}