Conversation

@srujithpoondla03
Description

Add ml.p5e.48xlarge to SM_EFA_NCCL_INSTANCES and SM_EFA_RDMA_INSTANCES.
Add ml.p5.48xlarge to SM_EFA_RDMA_INSTANCES (was missing).

Without these entries, NCCL hangs during distributed training initialization on P5e instances due to missing EFA environment variables (FI_PROVIDER, FI_EFA_USE_DEVICE_RDMA, RDMAV_FORK_SAFE).
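
For reference, the change amounts to two list additions. A minimal sketch is below; the surrounding entries and the module the constants live in are illustrative, since the real lists in sagemaker-train/sagemaker-core contain more instance types:

```python
# Illustrative sketch of the change, not the literal diff: the pre-existing
# entries shown here are placeholders and the real lists contain more types.
SM_EFA_NCCL_INSTANCES = [
    "ml.p4d.24xlarge",   # pre-existing entry (placeholder)
    "ml.p5.48xlarge",    # pre-existing entry
    "ml.p5e.48xlarge",   # added: enables the EFA/NCCL env vars on P5e
]

SM_EFA_RDMA_INSTANCES = [
    "ml.p4d.24xlarge",   # pre-existing entry (placeholder)
    "ml.p5.48xlarge",    # added: was missing from the RDMA list
    "ml.p5e.48xlarge",   # added
]
```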

Related

Fixes aws#5491 (Add ml.p5e.48xlarge to EFA instance lists in sagemaker-train and sagemaker-core)

Testing

  • Added a unit test for P5/P5e EFA instance list membership (a sketch of what such a test might look like follows below)
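
A hypothetical sketch of such a membership test is below; the import path and test names are assumptions, not the actual test file touched by this PR:

```python
# Hypothetical test sketch; the module path below is an assumption and may
# differ from where SM_EFA_NCCL_INSTANCES / SM_EFA_RDMA_INSTANCES are defined.
import pytest

from sagemaker.fw_utils import SM_EFA_NCCL_INSTANCES, SM_EFA_RDMA_INSTANCES


@pytest.mark.parametrize("instance_type", ["ml.p5.48xlarge", "ml.p5e.48xlarge"])
def test_p5_family_in_efa_rdma_instances(instance_type):
    # Both P5 and P5e should receive the RDMA-related EFA environment variables.
    assert instance_type in SM_EFA_RDMA_INSTANCES


def test_p5e_in_efa_nccl_instances():
    # P5e should also be recognized for the NCCL/EFA environment variables.
    assert "ml.p5e.48xlarge" in SM_EFA_NCCL_INSTANCES
```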

srujithpoondla03 commented Jan 16, 2026

Is there a specific process for testing EFA/instance-specific changes on actual hardware (e.g., P5e instances) before merging? Or are unit tests sufficient for this type of change?

We've verified the fix works in production with a manual workaround (setting the EFA env vars directly), but wanted to confirm if there's an integration test process for these hardware-specific configurations.
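
For illustration only, that workaround amounts to passing the EFA environment variables to the training job explicitly. The sketch below uses placeholder values everywhere except the three environment variables (role ARN, entry point, framework/Python versions, and instance count are all made up, not the actual production job):

```python
# Sketch of the manual workaround: set the EFA-related env vars explicitly so
# NCCL initialization does not hang on ml.p5e.48xlarge. Everything except the
# three environment variables is a placeholder.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                                      # placeholder
    role="arn:aws:iam::111122223333:role/ExampleSageMakerRole",  # placeholder
    instance_type="ml.p5e.48xlarge",
    instance_count=2,
    framework_version="2.4",                                     # placeholder
    py_version="py311",                                          # placeholder
    distribution={"torch_distributed": {"enabled": True}},
    environment={
        "FI_PROVIDER": "efa",
        "FI_EFA_USE_DEVICE_RDMA": "1",
        "RDMAV_FORK_SAFE": "1",
    },
)
estimator.fit()
```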

@mrpic-amzn

> We've verified the fix works in production with a manual workaround

Can you add more details about how this was tested in production? For example, the training job ARNs, the exact commands you used to test, and the CUDA versions used in the training algorithm.
