Update aws-efa-k8s-device-plugin version to 0.5.10 #282
Updates the AWS EFA Kubernetes device plugin to version 0.5.10 or later to ensure proper device identification on p6-b200.48xlarge instances and prevent NCCL failures during distributed training.
Problem
On p6-b200.48xlarge instances, the InfiniBand subsystem exposes 10 devices:
- 2 ibp* devices (e.g., ibp115s0f0, ibp116s0f0)
- 8 rdmap* devices (e.g., rdmap79s0, rdmap80s0)
Critical Issue: The ibp* devices are NVLink controllers for GPU interconnect, not EFA devices. However, EFA device plugin versions prior to 0.5.10 incorrectly identify these NVLink controllers as EFA devices.
Impact of Current Version
When using EFA device plugin < 0.5.10:
According to the AWS EFA team: "Everything will eventually break because eventually EKS will schedule them as EFA devices, but they aren't, and then NCCL will get confused."
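The naming split described above can be illustrated with a short Python sketch. It groups the entries under /sys/class/infiniband by prefix, treating rdmap* as EFA devices and ibp* as NVLink controllers. The prefix rule is only an illustration of the naming reported in this PR, not the plugin's actual detection logic, and the function names are hypothetical.

```python
from pathlib import Path

# Standard sysfs location where the kernel lists InfiniBand/RDMA devices.
SYSFS_IB = Path("/sys/class/infiniband")


def classify(names):
    """Group device names by prefix (assumption: rdmap* = EFA, ibp* = NVLink)."""
    groups = {"efa": [], "nvlink": [], "other": []}
    for name in sorted(names):
        if name.startswith("rdmap"):
            groups["efa"].append(name)
        elif name.startswith("ibp"):
            groups["nvlink"].append(name)
        else:
            groups["other"].append(name)
    return groups


def list_infiniband_devices(root=SYSFS_IB):
    """Return device names under sysfs, or an empty list if the path is absent."""
    return [p.name for p in root.iterdir()] if root.is_dir() else []


if __name__ == "__main__":
    groups = classify(list_infiniband_devices())
    for kind, devices in groups.items():
        print(f"{kind}: {len(devices)} {devices}")
```

On a p6-b200.48xlarge node, a plugin that counts every entry would see 10 devices, while only the entries in the "efa" group should be advertised as EFA.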
Solution
Update to EFA device plugin version 0.5.10 or later, which correctly:
- Identifies only rdmap* interfaces as EFA devices
- Excludes ibp* NVLink controllers from the EFA resource pool
Testing
Current State - EFA Device Plugin v0.5.4
Problem: The plugin v0.5.4 incorrectly identifies NVLink controllers (ibp115s0f0, ibp116s0f0) as EFA devices, reporting 10 instead of 8.
Update Process and Results - EFA Device Plugin v0.5.10 (VERIFIED)
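One way to check the EFA count a node advertises is to read the allocatable resources from the node object. The sketch below is a hedged illustration, not the procedure used in this PR: it assumes the plugin advertises devices under the vpc.amazonaws.com/efa resource name and that kubectl is configured for the cluster. The helper names are hypothetical.

```python
import json
import subprocess
import sys

# Assumption: extended resource name under which the EFA plugin advertises devices.
EFA_RESOURCE = "vpc.amazonaws.com/efa"


def efa_allocatable(node_json):
    """Return the allocatable EFA count from `kubectl get node -o json` output."""
    node = json.loads(node_json)
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get(EFA_RESOURCE, "0"))


def fetch_node_json(node_name):
    """Fetch the node object as JSON text using kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "node", node_name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__" and len(sys.argv) == 2:
    count = efa_allocatable(fetch_node_json(sys.argv[1]))
    print(f"{EFA_RESOURCE} allocatable: {count}")
```

Based on the counts reported in this PR, a p6-b200.48xlarge node would be expected to show 8 after the update, compared with 10 on v0.5.4.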
Device Breakdown on p6-b200.48xlarge
Verification Steps
- Only rdmap* devices are allocated
References