@romerojosh
Collaborator

The current distributed model initialization code does not set the current GPU device to the model's device before initializing the model's NCCL communicator. As a result, the communicator can be created on a different GPU than the one the model runs on, which causes NCCL operations to crash at runtime. This PR sets the device where required to resolve the issue.
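
To illustrate the failure mode described above, here is a minimal toy sketch in Python. It is not the code from this PR and it does not call CUDA or NCCL. `FakeRuntime`, `device_guard`, and both `init_comm_*` functions are hypothetical names made up for this example. The only behavior modeled is that a communicator is tied to whichever device is current when it is created, which is how NCCL treats the device set via `cudaSetDevice` at `ncclCommInitRank` time. Restoring the previous device afterwards is an assumption of this sketch, not something the PR description states.

```python
from contextlib import contextmanager


class FakeRuntime:
    """Toy stand-in for a GPU runtime that tracks one 'current device'."""

    def __init__(self):
        self.current_device = 0

    def create_communicator(self):
        # Like an NCCL communicator, the result is tied to whichever
        # device is current at creation time.
        return {"device": self.current_device}


@contextmanager
def device_guard(runtime, device):
    """Make `device` current inside the block, then restore the previous one."""
    previous = runtime.current_device
    runtime.current_device = device
    try:
        yield
    finally:
        runtime.current_device = previous


def init_comm_unguarded(runtime, model_device):
    # Bug pattern: model_device is ignored, so the communicator lands on
    # whatever device happens to be current.
    return runtime.create_communicator()


def init_comm_guarded(runtime, model_device):
    # Fix pattern: select the model's device before creating the communicator.
    with device_guard(runtime, model_device):
        return runtime.create_communicator()


runtime = FakeRuntime()   # current device starts at 0
model_device = 2          # hypothetical: the model lives on device 2

bad = init_comm_unguarded(runtime, model_device)
good = init_comm_guarded(runtime, model_device)
```

In this toy model, `bad` ends up bound to the default device while `good` is bound to the model's device, and the caller's current device is unchanged after the guarded call.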

…tibuted training.

Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh romerojosh force-pushed the fix_device_handling_distributed branch from 2d944f8 to 6c6554c Compare December 5, 2025 18:29
@romerojosh
Collaborator Author

/build_and_test

@github-actions

github-actions bot commented Dec 5, 2025

🚀 Build workflow triggered! View run

@github-actions

github-actions bot commented Dec 5, 2025

✅ Build workflow passed! View run

Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh
Collaborator Author

/build_and_test

@github-actions

github-actions bot commented Dec 5, 2025

🚀 Build workflow triggered! View run

@github-actions

github-actions bot commented Dec 5, 2025

✅ Build workflow passed! View run

@romerojosh romerojosh requested a review from azrael417 December 5, 2025 19:51
Collaborator

@azrael417 azrael417 left a comment

Great, thanks for fixing this. I had a question but I will approve.

@romerojosh romerojosh merged commit 96f8a33 into master Dec 8, 2025
4 checks passed
3 participants