Conversation

@srujithpoondla03
Description

Add ml.p5e.48xlarge to SM_EFA_NCCL_INSTANCES and SM_EFA_RDMA_INSTANCES.
Add ml.p5.48xlarge to SM_EFA_RDMA_INSTANCES (was missing).

Without these entries, NCCL hangs during distributed training initialization on P5e instances due to missing EFA environment variables (FI_PROVIDER, FI_EFA_USE_DEVICE_RDMA, RDMAV_FORK_SAFE).
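
For reference, the change amounts to two list additions. A minimal sketch is below; the surrounding entries and the module the constants live in are illustrative, since the real lists in sagemaker-train/sagemaker-core contain more instance types:

```python
# Illustrative sketch of the change, not the literal diff: the pre-existing
# entries shown here are placeholders and the real lists contain more types.
SM_EFA_NCCL_INSTANCES = [
    "ml.p4d.24xlarge",   # pre-existing entry (placeholder)
    "ml.p5.48xlarge",    # pre-existing entry
    "ml.p5e.48xlarge",   # added: enables the EFA/NCCL env vars on P5e
]

SM_EFA_RDMA_INSTANCES = [
    "ml.p4d.24xlarge",   # pre-existing entry (placeholder)
    "ml.p5.48xlarge",    # added: was missing from the RDMA list
    "ml.p5e.48xlarge",   # added
]
```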

Related

Fixes aws#5491 (Add ml.p5e.48xlarge to EFA instance lists in sagemaker-train and sagemaker-core)

Testing

  • Added a unit test for P5/P5e EFA instance list membership (a sketch of what such a test might look like follows below)
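
A hypothetical sketch of such a membership test is below; the import path and test names are assumptions, not the actual test file touched by this PR:

```python
# Hypothetical test sketch; the module path below is an assumption and may
# differ from where SM_EFA_NCCL_INSTANCES / SM_EFA_RDMA_INSTANCES are defined.
import pytest

from sagemaker.fw_utils import SM_EFA_NCCL_INSTANCES, SM_EFA_RDMA_INSTANCES


@pytest.mark.parametrize("instance_type", ["ml.p5.48xlarge", "ml.p5e.48xlarge"])
def test_p5_family_in_efa_rdma_instances(instance_type):
    # Both P5 and P5e should receive the RDMA-related EFA environment variables.
    assert instance_type in SM_EFA_RDMA_INSTANCES


def test_p5e_in_efa_nccl_instances():
    # P5e should also be recognized for the NCCL/EFA environment variables.
    assert "ml.p5e.48xlarge" in SM_EFA_NCCL_INSTANCES
```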

srujithpoondla03 commented Jan 16, 2026

Is there a specific process for testing EFA/instance-specific changes on actual hardware (e.g., P5e instances) before merging? Or are unit tests sufficient for this type of change?

We've verified the fix works in production with a manual workaround (setting the EFA env vars directly), but wanted to confirm if there's an integration test process for these hardware-specific configurations.
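
For illustration only, that workaround amounts to passing the EFA environment variables to the training job explicitly. The sketch below uses placeholder values everywhere except the three environment variables (role ARN, entry point, framework/Python versions, and instance count are all made up, not the actual production job):

```python
# Sketch of the manual workaround: set the EFA-related env vars explicitly so
# NCCL initialization does not hang on ml.p5e.48xlarge. Everything except the
# three environment variables is a placeholder.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                      # placeholder
    role="arn:aws:iam::111122223333:role/ExampleSageMakerRole",  # placeholder
    instance_type="ml.p5e.48xlarge",
    instance_count=2,
    framework_version="2.4",                                     # placeholder
    py_version="py311",                                          # placeholder
    distribution={"torch_distributed": {"enabled": True}},
    environment={
        "FI_PROVIDER": "efa",
        "FI_EFA_USE_DEVICE_RDMA": "1",
        "RDMAV_FORK_SAFE": "1",
    },
)
estimator.fit()
```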

@mrpic-amzn

> We've verified the fix works in production with a manual workaround

Can you add more details about how this was tested in production? For example, the training job ARNs, the exact commands you used to test, and the CUDA versions used in the training algorithm.
