Skip to content

Conversation

@Chao1Han
Copy link
Contributor

@Chao1Han Chao1Han commented Jan 6, 2026

In the default blocking P2P mode, there is no need to wrap P2P operations with group APIs. In oneCCL 2021.17, this pattern can actually lead to a deadlock when a group contains only send or only recv. Removing the group calls around the P2P operations resolves this issue.
disable_e2e
disable_ut

Copilot AI review requested due to automatic review settings January 6, 2026 05:13
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a deadlock issue in blocking P2P operations by removing unnecessary group API wrapping. The change addresses a problem where oneCCL 2021.17 can deadlock when a group contains only send or only recv operations.

  • Removed ccl::group_start() and ccl::group_end() calls around P2P operations in blocking mode
  • Re-enabled previously skipped test_batch_isend_irecv test that was hanging due to this issue

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/xccl/ProcessGroupXCCL.cpp Removed group API calls wrapping P2P operations to prevent deadlock
test/xpu/skip_list_dist.py Re-enabled previously skipped test that was failing due to the deadlock issue

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Chao1Han Chao1Han requested a review from pkourdis January 6, 2026 09:02
Copy link
Contributor

@pkourdis pkourdis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@github-actions
Copy link

github-actions bot commented Jan 6, 2026

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training resnet18 0.887445 0.788249
  • 🟡 [80%, 90%), may be fluctuations
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
huggingface_float16_training XLNetLMHeadModel 0.99703 0.899677

@github-actions
Copy link

github-actions bot commented Jan 7, 2026

Performance outliers, please check!

  • 🔴 [-1, 80%), should be regression
Category Model Target vs. Baseline [Eager] Target vs. Baseline [Inductor]
torchbench_bfloat16_training resnet18 0.908869 0.770172

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants