Skip to content

Conversation

KeitaW
Copy link
Contributor

@KeitaW KeitaW commented Sep 29, 2025

Updates the AWS EFA Kubernetes device plugin to version 0.5.10 or later to ensure proper device identification on p6-b200.48xlarge instances and prevent NCCL failures during distributed training.

Problem

On p6-b200.48xlarge instances, the InfiniBand subsystem exposes 10 devices:

  • 2 ibp* devices (e.g., ibp115s0f0, ibp116s0f0)
  • 8 rdmap* devices (e.g., rdmap79s0, rdmap80s0, etc.)

Critical Issue: The ibp* devices are NVLink controllers for GPU interconnect, not EFA devices. However, EFA device plugin versions prior to 0.5.10 incorrectly identify these NVLink controllers as EFA devices.

Impact of Current Version

When using EFA device plugin < 0.5.10:

  1. Kubernetes incorrectly reports 10 available EFA devices instead of 8
  2. Pods may be scheduled with NVLink controllers allocated as EFA devices
  3. NCCL gets confused when trying to use NVLink controllers for RDMA communication
  4. Result: Training jobs fail with NCCL errors and communication timeouts

According to AWS EFA team : "Everything will eventually break because eventually EKS will schedule them as EFA devices, but they aren't, and then NCCL will get confused."

Solution

Update to EFA device plugin version 0.5.10 or later, which correctly:

  • Identifies only the 8 rdmap* interfaces as EFA devices
  • Excludes the 2 ibp* NVLink controllers from the EFA resource pool
  • Reports the correct count of 8 EFA devices to Kubernetes

Testing

Current State - EFA Device Plugin v0.5.4

# Check current EFA device plugin version
$ kubectl get daemonset aws-efa-k8s-device-plugin -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.5.4

# Check EFA resources on p6-b200.48xlarge node
$ kubectl describe node ip-10-0-85-35.ec2.internal | grep -A2 -B2 "vpc.amazonaws.com/efa"
memory:                 2092981788Ki
  pods:                   250
  vpc.amazonaws.com/efa:  10    # INCORRECT - includes 2 NVLink controllers
Allocatable:
  cpu:                    191450m
--
  memory:                 2079885852Ki
  pods:                   250
  vpc.amazonaws.com/efa:  10    # INCORRECT - should be 8

Problem: The plugin v0.5.4 incorrectly identifies NVLink controllers (ibp115s0f0, ibp116s0f0) as EFA devices, reporting 10 instead of 8.

Update Process and Results - EFA Device Plugin v0.5.10 (VERIFIED)

# Step 1: Update the EFA device plugin to v0.5.10
$ kubectl set image daemonset/aws-efa-k8s-device-plugin -n kube-system \
    aws-efa-k8s-device-plugin=602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.5.10
daemonset.apps/aws-efa-k8s-device-plugin image updated

# Step 2: Wait for rollout (note: plugin may crash on non-EFA nodes - this is expected)
$ kubectl rollout status daemonset/aws-efa-k8s-device-plugin -n kube-system

# Step 3: Restart the plugin pod on p6-b200 node to pick up changes
$ kubectl delete pod aws-efa-k8s-device-plugin-<pod-id> -n kube-system

# Step 4: Verify the updated version
$ kubectl get daemonset aws-efa-k8s-device-plugin -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/aws-efa-k8s-device-plugin:v0.5.10

# Step 5: Check EFA resources on same node - NOW CORRECT!
$ kubectl describe node ip-10-0-85-35.ec2.internal | grep -A2 -B2 "vpc.amazonaws.com/efa"
  memory:                 2092981788Ki
  pods:                   250
  vpc.amazonaws.com/efa:  8    # CORRECT - Only 8 EFA devices
Allocatable:
  cpu:                    191450m
--
  memory:                 2079885852Ki
  pods:                   250
  vpc.amazonaws.com/efa:  8    # CORRECT - Excludes NVLink controllers

Device Breakdown on p6-b200.48xlarge

# List all InfiniBand devices visible to the system
$ kubectl exec -it <pod-on-p6-b200> -- ls /sys/class/infiniband/
ibp115s0f0    # ❌ NVLink controller (NOT an EFA device)
ibp116s0f0    # ❌ NVLink controller (NOT an EFA device)
rdmap79s0     # ✅ EFA device
rdmap80s0     # ✅ EFA device
rdmap81s0     # ✅ EFA device
rdmap132s0    # ✅ EFA device
rdmap133s0    # ✅ EFA device
rdmap134s0    # ✅ EFA device
rdmap135s0    # ✅ EFA device
rdmap136s0    # ✅ EFA device

# Total: 10 InfiniBand devices (2 NVLink + 8 EFA)

Verification Steps

  1. Deploy test pod requesting all EFA resources
  2. Verify only rdmap* devices are allocated
  3. Confirm NCCL initializes successfully
  4. Run multi-node training job to validate communication

References

@KeitaW KeitaW requested a review from a team as a code owner September 29, 2025 22:49
@pintaoz-aws pintaoz-aws merged commit dc2096a into aws:main Sep 30, 2025
6 checks passed
@KeitaW KeitaW deleted the patch-2 branch September 30, 2025 22:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants