
fix(consumers): prevent await deadlock on ContextCallable failure #2941

Merged
imbajin merged 5 commits into apache:master from bitflicker64:fix/consumers-latch-deadlock
Jan 26, 2026

Conversation

@bitflicker64
Contributor

@bitflicker64 bitflicker64 commented Jan 20, 2026

Purpose of the PR

Main Changes

Submit safeRun() instead of directly submitting ContextCallable

Add safeRun() wrapper to always call latch.countDown() in finally

Remove latch.countDown() from runAndDone() to avoid double-counting
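
The three changes above can be sketched as follows. This is a minimal illustration of the pattern only, not the code from Consumers.java: the class name `SafeRunSketch`, the helper `remainingAfter`, and the method bodies are assumptions made for the example. Only the names `safeRun()` and `runAndDone()` and the idea of counting down in `finally` come from the description.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for the pattern described above; not the real Consumers class.
class SafeRunSketch {

    // Runs the worker body. In this sketch it no longer touches the latch,
    // so there is exactly one countDown() per worker (in safeRun).
    static void runAndDone(Callable<Void> body) throws Exception {
        body.call();
    }

    // Wrapper submitted to the executor: the latch is released on every path,
    // including failures raised before or inside runAndDone().
    static void safeRun(Callable<Void> body, CountDownLatch latch) {
        try {
            runAndDone(body);
        } catch (Exception e) {
            // A real implementation would record or log the failure here.
            System.err.println("worker failed: " + e);
        } finally {
            latch.countDown();
        }
    }

    // Helper for the example: returns the latch count left after one worker ran.
    static long remainingAfter(Callable<Void> body) {
        CountDownLatch latch = new CountDownLatch(1);
        safeRun(body, latch);
        return latch.getCount();
    }

    public static void main(String[] args) {
        System.out.println(remainingAfter(() -> null));
        System.out.println(remainingAfter(() -> {
            throw new IllegalStateException("boom");
        }));
    }
}
```

With the countdown in one place, a waiter blocked on the latch is released whether the body succeeds or throws.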

Verifying these changes

  • Trivial rework / code cleanup without any test coverage. (No Need)
  • Already covered by existing tests, such as (please modify tests here).
  • Need tests and can be verified as follows:
    • mvn -pl hugegraph-server/hugegraph-core -am test passes; the full build fails in hugegraph-clustertest-minicluster due to a toolchain/JDK issue (unrelated to this change).

Does this PR potentially affect the following parts?

Documentation Status

  • Doc - TODO
  • Doc - Done
  • Doc - No Need

@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. bug Something isn't working labels Jan 20, 2026
@imbajin imbajin requested a review from Copilot January 21, 2026 08:25
@imbajin
Member

imbajin commented Jan 21, 2026

Hi, thanks for your PR. We probably need at least 2 tests to validate the modification.

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a potential deadlock issue in the Consumers class where await() could hang indefinitely if a ContextCallable fails before entering the runAndDone() method. The fix ensures that the countdown latch is always decremented, even when failures occur during worker thread initialization.

Changes:

  • Introduced safeRun() wrapper method to guarantee latch.countDown() is called in all failure scenarios
  • Modified worker submission to use safeRun() instead of directly submitting ContextCallable
  • Moved latch countdown from runAndDone() to safeRun() to prevent double-counting while ensuring it always executes


@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Jan 21, 2026
@bitflicker64
Contributor Author

Pushed an update to this PR on branch fix/consumers-latch-deadlock.

Main changes

  1. hugegraph-server/hugegraph-core/src/main/java/org/apache/hugegraph/util/Consumers.java

    • Start workers using safeRun() so latch.countDown() is guaranteed in a finally block even if ContextCallable fails before entering runAndDone().
    • Removed latch.countDown() from runAndDone() to avoid double countdown and ensure exactly one countdown per worker.
    • Minor style cleanup (removed an extra blank line). Kept LOG.error(...) for worker-start failures since these are real startup/thread-context errors.
  2. Added 2 unit tests

    • File: hugegraph-server/hugegraph-test/src/main/java/org/apache/hugegraph/unit/util/ConsumersTest.java
    • Tests:
      • testStartProvideAwaitNormal: verifies items are consumed and await() returns normally
      • testAwaitThrowsWhenConsumerThrows: verifies await() surfaces consumer exceptions (no hang)
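
What the two tests check can be illustrated without the real class. The sketch below is an assumption-laden stand-in: `AwaitSurfacesFailureSketch` and `consumeAndAwait` are hypothetical names, and the way the failure is recorded (an `AtomicReference`) is chosen for the example rather than taken from Consumers.java. It shows a worker that records the first consumer exception, releases the latch in `finally`, and a waiter that gets the failure back instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical illustration; the real tests exercise Consumers directly.
class AwaitSurfacesFailureSketch {

    // Feeds each item to the consumer on one worker thread, waits for the
    // worker to finish, and returns the first failure (or null if none).
    static Throwable consumeAndAwait(Consumer<Integer> consumer, int... items) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch latch = new CountDownLatch(1);
        AtomicReference<Throwable> failure = new AtomicReference<>();
        executor.submit(() -> {
            try {
                for (int item : items) {
                    consumer.accept(item);
                }
            } catch (Exception e) {
                // Keep only the first failure so the waiter can surface it.
                failure.compareAndSet(null, e);
            } finally {
                latch.countDown();
            }
        });
        try {
            if (!latch.await(2, TimeUnit.SECONDS)) {
                return new IllegalStateException("await timed out");
            }
            return failure.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(consumeAndAwait(v -> { }, 1, 2, 3));
        System.out.println(consumeAndAwait(v -> {
            throw new IllegalArgumentException("bad item " + v);
        }, 1));
    }
}
```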

Verification

  • mvn -pl hugegraph-server/hugegraph-core -am test ✅ PASS locally

Thanks!

@bitflicker64
Contributor Author

Side note: there is another Consumers implementation at
hugegraph-server/hugegraph-core/src/main/java/org/apache/hugegraph/job/algorithm/Consumers.java
which also submits ContextCallable directly. It may have the same latch/await hang corner case. I didn’t include it here to keep this PR focused, but I can open a separate PR to apply the same fix there if you want.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.



}
}

@Test(timeout = 5000)
Member


🧹 Minor: The test doesn't verify the specific deadlock scenario mentioned in the PR description - where ContextCallable fails before entering runAndDone().

Consider adding a test case that explicitly simulates the failure scenario, for example:

@Test(timeout = 1000)
public void testAwaitDoesNotHangWhenContextCallableFails() throws Throwable {
    // Test that simulates ContextCallable constructor/call() failure
    // before runAndDone() is entered
}
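
One way to picture the scenario the reviewer asks for, independent of the HugeGraph classes, is the sketch below. Everything in it is hypothetical: `Wrapper` merely stands in for a ContextCallable-like wrapper that can throw before the wrapped body starts, and `awaitReturns` compares a layout where only the body counts down with one where an outer `finally` does. A bounded `await` is used so the unguarded case reports a timeout instead of hanging.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of "the wrapper fails before the body is entered".
class WrapperFailureSketch {

    // Stand-in for a context-propagating wrapper that may fail during setup.
    static final class Wrapper implements Callable<Void> {
        private final Runnable body;
        private final boolean failEarly;

        Wrapper(Runnable body, boolean failEarly) {
            this.body = body;
            this.failEarly = failEarly;
        }

        @Override
        public Void call() {
            if (this.failEarly) {
                throw new IllegalStateException("context setup failed");
            }
            this.body.run();
            return null;
        }
    }

    // Returns true if the waiter is released within the timeout.
    static boolean awaitReturns(boolean guarded, boolean failEarly) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch latch = new CountDownLatch(1);
        // Unguarded layout: only the body counts down.
        // Guarded layout: only the outer finally counts down.
        Runnable body;
        if (guarded) {
            body = () -> { };
        } else {
            body = latch::countDown;
        }
        Callable<Void> task = new Wrapper(body, failEarly);
        executor.submit(() -> {
            try {
                task.call();
            } catch (Exception e) {
                // A real implementation would record or log the failure.
            } finally {
                if (guarded) {
                    latch.countDown();
                }
            }
        });
        try {
            return latch.await(300, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("guarded, early failure: " + awaitReturns(true, true));
        System.out.println("unguarded, early failure: " + awaitReturns(false, true));
    }
}
```

A regression test along these lines fails on the unguarded layout and passes on the guarded one, which is the property the comment above asks the PR to demonstrate.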

@imbajin
Member

imbajin commented Jan 22, 2026

Side note: there is another Consumers implementation at hugegraph-server/hugegraph-core/src/main/java/org/apache/hugegraph/job/algorithm/Consumers.java which also submits ContextCallable directly. It may have the same latch/await hang corner case. I didn’t include it here to keep this PR focused, but I can open a separate PR to apply the same fix there if you want.

No need to update the another one (but we could merge them into one class/method reference)

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jan 22, 2026
@bitflicker64 bitflicker64 force-pushed the fix/consumers-latch-deadlock branch from 9e52f92 to 260e4ea Compare January 22, 2026 16:27
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Jan 22, 2026
@bitflicker64
Contributor Author

  • The PR adds a safeRun() wrapper around the worker so latch.countDown() is guaranteed in a finally block. This is mainly to prevent await() from hanging if a worker fails early.

  • Context propagation still needs to be preserved, so ContextCallable should be created/wrapped at submission time in start(), not inside the worker thread.

  • Avoid catch(Throwable) in runAndDone(), since that would also catch Error (OutOfMemoryError, StackOverflowError, etc.). Keeping catch(Exception) matches the original behavior.

  • exceptionHandle() should not be called redundantly for the same failure path, and the log message is adjusted to be accurate (the failure can also occur inside the ContextCallable wrapper, not strictly “before runAndDone”).

I’ll work on the test for where ContextCallable fails before entering runAndDone().
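
Why the wrapper has to be created at submission time can be shown with a small thread-local example. This is a sketch under stated assumptions: `CapturingCallable` is a hypothetical stand-in that reads a `ThreadLocal` in its constructor and restores it in `call()`; it is assumed, not confirmed here, that ContextCallable captures its context in a similar way.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of capturing thread context when the task is created.
class ContextCaptureSketch {

    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    // Stand-in wrapper: the context is read on whichever thread constructs it.
    static final class CapturingCallable<V> implements Callable<V> {
        private final String captured = CONTEXT.get();
        private final Callable<V> delegate;

        CapturingCallable(Callable<V> delegate) {
            this.delegate = delegate;
        }

        @Override
        public V call() throws Exception {
            CONTEXT.set(this.captured);
            try {
                return this.delegate.call();
            } finally {
                CONTEXT.remove();
            }
        }
    }

    // Returns the context value the worker thread observes.
    static String seenByWorker(boolean wrapAtSubmission) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CONTEXT.set("graph-A");
        try {
            Callable<String> readContext = CONTEXT::get;
            Future<String> result;
            if (wrapAtSubmission) {
                // Constructed on the submitting thread: captures "graph-A".
                Callable<String> wrapped = new CapturingCallable<String>(readContext);
                result = executor.submit(wrapped);
            } else {
                // Constructed on the worker thread: nothing to capture there.
                Callable<String> lateWrap =
                        () -> new CapturingCallable<String>(readContext).call();
                result = executor.submit(lateWrap);
            }
            return result.get(2, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            CONTEXT.remove();
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(seenByWorker(true));
        System.out.println(seenByWorker(false));
    }
}
```

Wrapping inside the worker silently loses the caller's context, which is why the latch guard goes around an already-wrapped task rather than replacing the wrapping step.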

…ilure

Add a unit test that explicitly covers the failure scenario described in the PR,
where ContextCallable fails before entering runAndDone().

The test verifies that Consumers.await() does not hang when the worker task
fails during ContextCallable execution, relying on safeRun() to always
decrement the latch in its finally block.

This test would deadlock on the previous implementation and passes with the
current fix, ensuring the issue cannot regress.
@codecov

codecov bot commented Jan 24, 2026

Codecov Report

❌ Patch coverage is 0% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 1.57%. Comparing base (37be6cd) to head (ec746c9).
⚠️ Report is 3 commits behind head on master.

Files with missing lines                                  Patch %   Lines
...main/java/org/apache/hugegraph/util/Consumers.java     0.00%     18 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (37be6cd) and HEAD (ec746c9). Click for more details.

HEAD has 2 uploads less than BASE
Flag BASE (37be6cd) HEAD (ec746c9)
4 2
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #2941       +/-   ##
============================================
- Coverage     39.38%   1.57%   -37.82%     
+ Complexity      456      43      -413     
============================================
  Files           812     779       -33     
  Lines         68660   65018     -3642     
  Branches       8968    8332      -636     
============================================
- Hits          27044    1026    -26018     
- Misses        38824   63906    +25082     
+ Partials       2792      86     -2706     


@bitflicker64
Contributor Author

Thanks for the review. I’ve updated the implementation, and all related unit tests pass locally.

@bitflicker64
Contributor Author

Side note: there is another Consumers implementation at hugegraph-server/hugegraph-core/src/main/java/org/apache/hugegraph/job/algorithm/Consumers.java which also submits ContextCallable directly. It may have the same latch/await hang corner case. I didn’t include it here to keep this PR focused, but I can open a separate PR to apply the same fix there if you want.

No need to update the another one (but we could merge them into one class/method reference)

When you said “no need to update the another one (but we could merge them into one class/method reference)”, did you mean that this PR should stay limited to the current Consumers, and that any fix for the other implementation should be handled separately possibly as a later refactor to share the common logic?

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jan 26, 2026
@imbajin
Member

imbajin commented Jan 26, 2026

Side note: there is another Consumers implementation at hugegraph-server/hugegraph-core/src/main/java/org/apache/hugegraph/job/algorithm/Consumers.java which also submits ContextCallable directly. It may have the same latch/await hang corner case. I didn’t include it here to keep this PR focused, but I can open a separate PR to apply the same fix there if you want.

No need to update the another one (but we could merge them into one class/method reference)

When you said “no need to update the another one (but we could merge them into one class/method reference)”, did you mean that this PR should stay limited to the current Consumers, and that any fix for the other implementation should be handled separately possibly as a later refactor to share the common logic?

Exactly! You got it.

We want to avoid code duplication across the project; ideally, these utilities should exist in only one place. If they need to be shared across multiple modules, we can move them to hugegraph-commons in a later refactor.

@imbajin imbajin merged commit fc391a7 into apache:master Jan 26, 2026
13 of 15 checks passed
@bitflicker64 bitflicker64 deleted the fix/consumers-latch-deadlock branch February 13, 2026 12:35

Labels

bug Something isn't working lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

[Bug] Potential silent thread failure in Consumers.java due to unchecked Future

3 participants