Fix wrong group usage in blocking p2p #2688

Chao1Han · 2026-01-06T05:13:31Z

In the default blocking P2P mode, there is no need to wrap P2P operations with group APIs. In oneCCL 2021.17, this pattern can actually lead to a deadlock when a group contains only send or only recv. Removing the group calls around the P2P operations resolves this issue.
disable_e2e
disable_ut
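
The description above can be illustrated with a small mock. This is a hedged sketch, not the code in src/xccl/ProcessGroupXCCL.cpp: `MockCcl`, `p2p_before`, `p2p_after`, and `group_is_one_sided` are hypothetical names, and the mock only records call order instead of communicating. It shows that wrapping a single blocking send in group calls produces a group containing only a send, which is the pattern reported to deadlock on oneCCL 2021.17, while issuing the call directly produces no group at all.

```python
class MockCcl:
    """Hypothetical stand-in for the communication library; it only logs calls."""

    def __init__(self):
        self.calls = []

    def group_start(self):
        self.calls.append("group_start")

    def group_end(self):
        self.calls.append("group_end")

    def send(self, peer):
        self.calls.append(f"send->{peer}")

    def recv(self, peer):
        self.calls.append(f"recv<-{peer}")


def p2p_before(ccl, op, peer):
    # Assumed shape of the old behavior: the blocking op is wrapped in a group.
    ccl.group_start()
    getattr(ccl, op)(peer)
    ccl.group_end()


def p2p_after(ccl, op, peer):
    # Assumed shape of the new behavior: the blocking op is issued directly.
    getattr(ccl, op)(peer)


def group_is_one_sided(calls):
    """Return True if any group_start/group_end span holds only sends or only recvs."""
    inside = False
    kinds = []
    for call in calls:
        if call == "group_start":
            inside, kinds = True, []
        elif call == "group_end":
            if kinds and len(set(kinds)) == 1:
                return True
            inside = False
        elif inside:
            kinds.append("send" if call.startswith("send") else "recv")
    return False


before, after = MockCcl(), MockCcl()
p2p_before(before, "send", 1)
p2p_after(after, "send", 1)
print(before.calls, group_is_one_sided(before.calls))
print(after.calls, group_is_one_sided(after.calls))
```

The one-sided check is only a way to make the risky pattern visible in the call log; it does not model oneCCL's actual scheduling.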

Copilot

Pull request overview

This PR fixes a deadlock issue in blocking P2P operations by removing unnecessary group API wrapping. The change addresses a problem where oneCCL 2021.17 can deadlock when a group contains only send or only recv operations.

- Removed ccl::group_start() and ccl::group_end() calls around P2P operations in blocking mode
- Re-enabled previously skipped test_batch_isend_irecv test that was hanging due to this issue

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
src/xccl/ProcessGroupXCCL.cpp	Removed group API calls wrapping P2P operations to prevent deadlock
test/xpu/skip_list_dist.py	Re-enabled previously skipped test that was failing due to the deadlock issue

pkourdis

LGTM.

github-actions · 2026-01-06T13:44:22Z

Performance outliers, please check!

🔴 [-1, 80%), should be regression

Category	Model	Target vs. Baseline [Eager]	Target vs. Baseline [Inductor]
torchbench_bfloat16_training	resnet18	0.887445	0.788249

🟡 [80%, 90%), may be fluctuations

Category	Model	Target vs. Baseline [Eager]	Target vs. Baseline [Inductor]
huggingface_float16_training	XLNetLMHeadModel	0.99703	0.899677
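
The banding in the two tables above can be sketched as a small classifier. This is an assumption-laden sketch, not the bot's actual logic: the function name `classify` is hypothetical, and binning a row by the lower of its two ratios is inferred from the rows shown (resnet18 is listed as red although only its Inductor ratio falls below 80%).

```python
def classify(eager, inductor):
    """Bin a row by its lower target-vs-baseline ratio (assumed rule)."""
    worst = min(eager, inductor)
    if worst < 0.80:
        return "regression"   # red band: [-1, 80%)
    if worst < 0.90:
        return "fluctuation"  # yellow band: [80%, 90%)
    return "ok"


rows = {
    "resnet18": (0.887445, 0.788249),
    "XLNetLMHeadModel": (0.99703, 0.899677),
}
for model, (eager, inductor) in rows.items():
    print(model, classify(eager, inductor))
```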

github-actions · 2026-01-07T11:09:52Z

Performance outliers, please check!

🔴 [-1, 80%), should be regression

Category	Model	Target vs. Baseline [Eager]	Target vs. Baseline [Inductor]
torchbench_bfloat16_training	resnet18	0.908869	0.770172

Fix wrong group usage in blocking p2p

0cf9ac5

Copilot AI review requested due to automatic review settings January 6, 2026 05:13

Copilot AI reviewed Jan 6, 2026

Chao1Han requested a review from pkourdis January 6, 2026 09:02

pkourdis approved these changes Jan 6, 2026

Merge branch 'main' into xccl/fix_hang

e0ae03e

Merge branch 'main' into xccl/fix_hang

0ffe81a
