Correctly set GPU device for distributed initialization. #99

romerojosh · 2025-12-05T18:24:36Z

The current distributed model initialization code does not set the current GPU device to the model's device before initializing the model's NCCL communicator. As a result, the communicator can be created on a different GPU than the one the model runs on, which leads to runtime crashes in NCCL operations. This PR sets the device where required to resolve the issue.
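For illustration only, the pattern described above can be sketched as a scoped device guard. This is not the diff from this PR. The `mock` namespace, `DeviceGuard`, and `init_comm_for_model` are hypothetical names, and the mock functions stand in for the real `cudaGetDevice`, `cudaSetDevice`, and `ncclCommInitRank` calls so that the sketch compiles without a GPU.

```cpp
#include <cassert>

// ASSUMPTION: stand-ins for the CUDA runtime and NCCL, used only so this
// sketch builds without a GPU. Real code would call cudaGetDevice,
// cudaSetDevice, and ncclCommInitRank instead.
namespace mock {
static int current_device = 0;  // the "current device" of the calling thread

static void get_device(int* device) { *device = current_device; }
static void set_device(int device) { current_device = device; }

// A communicator that records which device was current when it was created.
struct Comm {
  int device;
};
static Comm comm_init() { return Comm{current_device}; }
}  // namespace mock

// RAII guard (hypothetical): makes `device` current on construction and
// restores the previously current device on destruction.
class DeviceGuard {
 public:
  explicit DeviceGuard(int device) {
    assert(device >= 0);
    mock::get_device(&previous_);
    if (previous_ != device) mock::set_device(device);
  }
  ~DeviceGuard() { mock::set_device(previous_); }

  DeviceGuard(const DeviceGuard&) = delete;
  DeviceGuard& operator=(const DeviceGuard&) = delete;

 private:
  int previous_ = 0;
};

// Hypothetical helper: create the communicator while the model's device is
// current, so the communicator and the model end up on the same GPU.
mock::Comm init_comm_for_model(int model_device) {
  DeviceGuard guard(model_device);
  return mock::comm_init();
}
```

With the guard in place, calling `init_comm_for_model(2)` while device 0 is current yields a communicator bound to device 2 and leaves device 0 current for the caller afterwards. Restoring the previous device is a design choice made for this sketch; whether the actual fix restores it is not stated in this conversation.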

romerojosh · 2025-12-05T18:30:08Z

/build_and_test

github-actions · 2025-12-05T18:30:17Z

🚀 Build workflow triggered! View run

github-actions · 2025-12-05T18:42:03Z

✅ Build workflow passed! View run

romerojosh · 2025-12-05T19:16:09Z

/build_and_test

github-actions · 2025-12-05T19:16:17Z

🚀 Build workflow triggered! View run

github-actions · 2025-12-05T19:28:40Z

✅ Build workflow passed! View run

azrael417

Great, thanks for fixing this. I had a question but I will approve.

src/csrc/distributed.cpp

romerojosh added 3 commits December 5, 2025 10:28

Add missing GPU device handling for NCCL initialization/usage for dis…

41a890c

…tibuted training. Signed-off-by: Josh Romero <joshr@nvidia.com>

Add tests.

e13fa82

Signed-off-by: Josh Romero <joshr@nvidia.com>

Formatting fixes.

6c6554c

Signed-off-by: Josh Romero <joshr@nvidia.com>

romerojosh force-pushed the fix_device_handling_distributed branch from 2d944f8 to 6c6554c Compare December 5, 2025 18:29

Add distributed test to CI.

c2b21f8

Signed-off-by: Josh Romero <joshr@nvidia.com>

romerojosh requested a review from azrael417 December 5, 2025 19:51

azrael417 approved these changes Dec 8, 2025

romerojosh merged commit 96f8a33 into master Dec 8, 2025
4 checks passed

romerojosh mentioned this pull request Dec 8, 2025

Device handling improvements. #100

Open
