In this section we’ll run the NCCL Tests, a suite of benchmarks for validating the cluster’s network performance. We’ll focus on the All Reduce test.
We’ll check that EFA is enabled and that the measured bandwidth matches the spec for the instance type:
Instance Type | Network Bandwidth | GPU Peer to Peer |
---|---|---|
p4d.24xlarge | 400 Gbps EFAv1 | 600 GB/s NVSwitch |
p4de.24xlarge | 400 Gbps EFAv1 | 600 GB/s NVSwitch |
p5.48xlarge | 3200 Gbps EFAv2 | 900 GB/s NVSwitch |
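
Note that the network column is in gigabits per second (Gbps), while the NCCL Tests report bandwidth in gigabytes per second (GB/s), so the two need converting before they can be compared. The sketch below is a minimal helper for that, not part of the NCCL Tests themselves. The function names are made up for illustration. The bus bandwidth formula for all-reduce, `algbw * 2 * (n - 1) / n`, is the one the NCCL Tests documentation describes. Treat the converted spec as a rough upper bound for multi-node runs; measured values will normally sit somewhat below it.

```python
def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert a link rate in gigabits/s (Gbps) to gigabytes/s (GB/s)."""
    return gbps / 8.0


def algorithm_bandwidth(size_bytes: float, time_s: float) -> float:
    """Algorithm bandwidth in GB/s: data size divided by elapsed time."""
    if time_s <= 0:
        raise ValueError("time_s must be positive")
    return size_bytes / 1e9 / time_s


def allreduce_bus_bandwidth(algbw_gb_s: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s for all-reduce: algbw * 2 * (n - 1) / n."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    return algbw_gb_s * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Hypothetical example: the 3200 Gbps figure for p5.48xlarge from the table.
    spec_gb_s = gbps_to_gigabytes_per_s(3200)
    print(f"3200 Gbps is {spec_gb_s:.0f} GB/s per node")

    # Hypothetical measurement: 1 GB reduced in 10 ms across 16 ranks.
    algbw = algorithm_bandwidth(1e9, 0.010)
    busbw = allreduce_bus_bandwidth(algbw, 16)
    print(f"algbw = {algbw:.1f} GB/s, busbw = {busbw:.1f} GB/s")
```

The bus bandwidth is the figure to compare against hardware limits, because it factors out the number of ranks. The NVSwitch column is already in GB/s and needs no conversion.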