Skip to content

Conversation

@kabragaurav
Copy link
Collaborator

Issues

Espresso storage nodes are encountering NullPointerException in Helix's ZkCallbackCache during ZooKeeper callback processing:

java.lang.NullPointerException: Cannot invoke "org.apache.zookeeper.data.Stat.getCzxid()" because "oldStat" is null
at org.apache.helix.manager.zk.ZkCallbackCache.update(ZkCallbackCache.java:89)
at org.apache.helix.manager.zk.ZkCallbackCache.updateRecursive(ZkCallbackCache.java:117)
at org.apache.helix.manager.zk.ZkCallbackCache.handleChildChange(ZkCallbackCache.java:150

This is a transient error during schema operations, caused by a race condition where oldStat is null when comparing czxid values.
The oldStat can be null in two scenarios:

  1. ZNode created with null stat: When cache.update(path, data, null) is called in ZkCacheBaseDataAccessor.java
  2. Stat set to null: When znode.setStat(stat) is called with stat=null in ZkCallbackCache.java

Both scenarios lead to znode.getStat() returning null on subsequent calls. ZNode ctor does not stop null stat.

  • My PR addresses the following Helix issues and references them in the PR description: CICP-3092, CICP-2712

Description

  • Here are some details about my PR:

Root Cause: In ZkCallbackCache.update() and handleDataChange(), the code retrieves oldStat from a cached ZNode and uses it without null check:

Stat oldStat = znode.getStat();
if (oldStat.getCzxid() != stat.getCzxid()) { // NPE when oldStat is null

The Fix: Add null check to handle the race condition gracefully:

if (oldStat == null || oldStat.getCzxid() != stat.getCzxid()) {
    fireEvents(path, EventType.NodeDeleted);
    fireEvents(path, EventType.NodeCreated);
}
  • When oldStat is null, we cannot determine if the node was recreated
  • fire Delete+Create events to ensure listeners resync
  • This matches the existing behavior when czxid changes
  • File HierarchicalDataHolder.java already handles null oldStat.
  • Espresso's InFlightSchemaStore listener handles these events
Screenshot 2026-01-07 at 6 26 07 PM Screenshot 2026-01-07 at 6 26 21 PM

Tests

  • The following tests are written for this issue:

TestZkCacheSyncOpSingleThread.testUpdateHandlesNullOldStat

Covers:

  1. No NPE when oldStat is null
  2. Delete+Create events fire when oldStat is null
  3. Normal behavior preserved when oldStat is valid (version change → DataChanged, czxid change → Delete+Create)

mvn test -pl helix-core -Dtest=TestZkCacheSyncOpSingleThread 2>&1

Screenshot 2026-01-07 at 6 12 18 PM

@kabragaurav kabragaurav force-pushed the CICP-3092/gkabra/stat-npe branch from d71f6c7 to 686dad1 Compare January 7, 2026 13:48
@ngngwr
Copy link
Collaborator

ngngwr commented Jan 8, 2026

I see new failures in the tests. Can you please check those. testRebalanceTool is existing one but other three needs to be checked.

// see ZkClient.fireDataChangedEvents()
if (oldStat.getCzxid() != stat.getCzxid()) {
// If we don't know the previous czxid (oldState is null), that means node was recreated and listeners need to resync their state
if (oldStat == null || oldStat.getCzxid() != stat.getCzxid()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This oldStat == null is not required, it won't reach here if oldStat is null. line #90 should take care of it.

Copy link
Collaborator Author

@kabragaurav kabragaurav Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We require this. The one you pointed out is one flow. Other is through ZkClient.java when ZK fires a watch event which does not go through update(). In that flow, it directly reads from _cache.get(dataPath); (L164)


if (oldStat.getCzxid() != stat.getCzxid()) {
// If we don't know the previous czxid (oldState is null), that means node was recreated and listeners need to resync their state
if (oldStat == null || oldStat.getCzxid() != stat.getCzxid()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this will happen for the case when the node is seen for the first time so we should just fire NodeCreate event, not consider as recreate.
Incase of recreation, the ZKCache should still have the old stat.
if (oldStat == null) {
fireEvents(path, EventType.NodeCreated);
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this will unnecessarily invoke the node delete event handler.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

accessor._zkCache._cache.put(nodePath, znodeNullStat);
listener.reset();

org.apache.zookeeper.data.Stat newStat = new org.apache.zookeeper.data.Stat();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please move fully qualified path as import. Much cleaner

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@kabragaurav
Copy link
Collaborator Author

I see new failures in the tests. Can you please check those. testRebalanceTool is existing one but other three needs to be checked.

I see TestWagedRebalance fails before my changes.
The changes concern only ZkCallbackCache.java and TestZkCacheSyncOpSingleThread.java and none of the failing tests (TestWagedRebalance, TestGreedyRebalance, TestConfigAccessor, TestHelixConfigAccessor) use ZkCallbackCache. So node not existing, semaphore permits and ZkClient closed are not related to my changes.

Further my changes only makes code more robust by adding null check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants