
fix(pd): add timeout and null-safety to getLeaderGrpcAddress()#2961

Open
bitflicker64 wants to merge 1 commit into apache:master from bitflicker64:fix/raft-engine-leader-grpc-address

Conversation


@bitflicker64 bitflicker64 commented Mar 5, 2026

Purpose of the PR

In a 3-node PD cluster running in Docker bridge network mode, getLeaderGrpcAddress() makes a bolt RPC call to discover the leader's gRPC address when the current node is a follower. In bridge mode this call fails: the TCP connection is established, but the bolt RPC response never arrives, so CompletableFuture.get() returns null and dereferencing the response throws an NPE.

This causes:

  1. redirectToLeader() fails with NPE
  2. Store registration requests that land on follower PDs are never forwarded
  3. Stores register, but partitions are never distributed (partitionCount:0)
  4. HugeGraph servers get stuck indefinitely in a DEADLINE_EXCEEDED loop

The cluster only works when pd0 wins the raft leader election, since then isLeader() returns true and the broken code path is skipped. If pd1 or pd2 wins, the NPE fires on every redirect attempt.

Related PR: #2952
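The failure mode can be reproduced in isolation: when the bolt response never materializes, the future effectively yields null, and the very next dereference throws. A minimal sketch with the RPC stubbed out (the class and method names here are illustrative, not the actual PD classes):

```java
import java.util.concurrent.CompletableFuture;

public class BrokenRedirectDemo {
    // Mimics the broken follower path: unbounded get(), no null check.
    // Returns a marker string instead of crashing, for demonstration.
    static String redirect(CompletableFuture<String> rpc) throws Exception {
        String leaderGrpcAddress = rpc.get(); // unbounded wait
        try {
            // Equivalent of calling response.getGrpcAddress() on a null response
            return "redirected to " + leaderGrpcAddress.split(":")[0];
        } catch (NullPointerException e) {
            // This is the NPE that redirectToLeader() hits on follower PDs
            return "NPE: leader gRPC address was null";
        }
    }

    public static void main(String[] args) throws Exception {
        // Bridge mode: response never materializes, future holds null
        System.out.println(redirect(CompletableFuture.completedFuture(null)));
        // Healthy path: response carries the leader address
        System.out.println(redirect(CompletableFuture.completedFuture("172.20.0.12:8686")));
    }
}
```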

Main Changes

  • Add a bounded timeout to the bolt RPC call using config.getRpcTimeout() instead of an unbounded .get()
  • Add null-check on the RPC response before accessing .getGrpcAddress()
  • Fall back to deriving the leader address from the raft endpoint IP + local gRPC port when the RPC fails or times out
  • Add TimeUnit and TimeoutException imports
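The changes above combine into a pattern like the following. This is a self-contained sketch, not the actual RaftEngine code: the class name, method signatures, and port values are illustrative, and the real implementation reads the timeout from config.getRpcTimeout() and the fallback IP from the raft peer endpoint:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LeaderAddressResolver {
    // Fallback: raft endpoint IP + local gRPC port (assumes all PD nodes
    // expose gRPC on the same port, as in a homogeneous deployment)
    static String deriveFromRaftEndpoint(String raftIp, int grpcPort) {
        return raftIp + ":" + grpcPort;
    }

    static String getLeaderGrpcAddress(CompletableFuture<String> rpcFuture,
                                       long rpcTimeoutMs,
                                       String leaderRaftIp, int localGrpcPort) {
        try {
            // 1) Bounded wait instead of an unbounded get()
            String address = rpcFuture.get(rpcTimeoutMs, TimeUnit.MILLISECONDS);
            // 2) Null-check the response before using it
            if (address != null) {
                return address;
            }
        } catch (TimeoutException | ExecutionException e) {
            // RPC failed or timed out: fall through to the fallback
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // 3) Derive the leader address from the raft endpoint
        return deriveFromRaftEndpoint(leaderRaftIp, localGrpcPort);
    }

    public static void main(String[] args) {
        // Stuck RPC (bridge mode): the bounded timeout kicks in, fallback used
        System.out.println(getLeaderGrpcAddress(
                new CompletableFuture<>(), 100, "172.20.0.10", 8686));
        // Null response: the null-check triggers the fallback
        System.out.println(getLeaderGrpcAddress(
                CompletableFuture.completedFuture(null), 100, "172.20.0.11", 8686));
        // Healthy RPC: the discovered address wins
        System.out.println(getLeaderGrpcAddress(
                CompletableFuture.completedFuture("172.20.0.12:8686"), 100, "172.20.0.12", 8686));
    }
}
```

Note the InterruptedException branch restores the interrupt flag rather than swallowing it, which is the usual idiom when catching it alongside RPC failures.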

Verifying these changes

  • Done testing and can be verified as follows:
    • Deploy 3-node PD cluster in Docker bridge network mode
    • Verify cluster works regardless of which PD node wins raft leader election
    • Confirm stores show partitionCount:12 on all 3 nodes when pd1 or pd2 is leader
    • Confirm no NPE in pd logs at getLeaderGrpcAddress

Does this PR potentially affect the following parts?

Documentation Status

  • Doc - TODO
  • Doc - Done
  • Doc - No Need

The bolt RPC call in getLeaderGrpcAddress() returns null in Docker
bridge network mode, causing NPE when a follower PD node attempts
to discover the leader's gRPC address. This breaks store registration
and partition distribution when any node other than pd0 wins the
raft leader election.

Add a bounded timeout using the configured rpc-timeout, null-check
the RPC response, and fall back to deriving the address from the
raft endpoint IP when the RPC fails.

Closes apache#2959
@bitflicker64 (Author) commented:

How I tested:

  1. Built a local Docker image from source with this fix applied
  2. Brought up the 3-node cluster (3 PD + 3 Store + 3 Server) in bridge network mode
  3. Confirmed cluster was healthy with pd0 as initial leader
  4. Restarted pd0 to force a new leader election — pd1 won
  5. Checked partition distribution and cluster health with pd1 as leader

Results with pd1 as leader:

partitionCount:12 on all 3 stores ✅
leaderCount:12 on all 3 stores ✅
{"graphs":["hugegraph"]} ✅
All 9 containers healthy ✅

Confirmed fallback triggered in pd1 logs:

[WARN] RaftEngine - Failed to get leader gRPC address via RPC, falling back to endpoint derivation
java.util.concurrent.ExecutionException: com.alipay.remoting.exception.RemotingException:
Create connection failed. The address is 172.20.0.10:8610
    at RaftEngine.getLeaderGrpcAddress(RaftEngine.java:247)
    at PDService.redirectToLeader(PDService.java:1275)

Before this fix: RPC returns null → NPE → follower PDs can't redirect requests to leader → cluster only worked when pd0 won leader election since it never hit the broken code path.

After this fix: RPC failure caught with bounded timeout → fallback to endpoint IP + gRPC port derivation → follower PDs correctly redirect to leader regardless of which PD node wins election.

Related docker bridge networking PR: #2952



Development

Successfully merging this pull request may close these issues.

[Bug] 3-node PD cluster fails when pd0 is not raft leader — getLeaderGrpcAddress() NPE in bridge network mode
