HDDS-14108. Provide option in ‘scm safemode status’ to show status of all SCM nodes #9611

sreejasahithi · 2026-01-09T07:07:45Z

What changes were proposed in this pull request?

This PR adds an --all option that shows the safemode status of each SCM node in the cluster.
With --verbose, it also shows the status of each safemode exit rule for each SCM node.

This PR also fixes the bug described in HDDS-13832: when the --scm option is used in an HA cluster, the command always shows the status of the leader SCM and silently ignores the node specified via the option.

What is the link to the Apache JIRA?

HDDS-14108

How was this patch tested?

This patch was tested locally in a docker ozone-ha cluster:

bash-5.1$ ozone admin safemode status --all --verbose
Service ID: scmservice
scm1:9860 [scm1]: OUT OF SAFE MODE
validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
scm2:9860 [scm2]: IN SAFE MODE
validated:false, DataNodeSafeModeRule, registered datanodes (=0) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
scm3:9860 [scm3]: OUT OF SAFE MODE
validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);

bash-5.1$ ozone admin safemode status --all --verbose
Service ID: scmservice
scm1:9860 [scm1]: OUT OF SAFE MODE
validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
scm2:9860 [scm2]: OUT OF SAFE MODE
validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
scm3:9860 [scm3]: OUT OF SAFE MODE
validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);

When one of the SCM nodes is down:

bash-5.1$ ozone admin safemode status --all --verbose
Service ID: scmservice
scm1:9860 [scm1]: OUT OF SAFE MODE
validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
scm2:9860 [scm2]: ERROR: Failed to get safe mode status from SCM node scm2
scm3:9860 [scm3]: OUT OF SAFE MODE
validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);

bash-5.1$ ozone admin safemode status --scm=scm2:9860
Service ID: scmservice
scm2:9860 [scm2]: ERROR: Failed to get safe mode status from SCM node scm2
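The output above suggests that a failure on one node does not stop the report for the others. A minimal sketch of that per-node isolation follows. It is not the code from this PR: the class `AllNodesSafeModeSketch`, the `SafeModeQuery` interface, and the node map are hypothetical stand-ins for whatever client call actually fetches a node's status.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; names here are assumptions, not Ozone APIs. */
final class AllNodesSafeModeSketch {

  /** Stand-in for the call that asks one SCM node for its safe mode flag. */
  interface SafeModeQuery {
    boolean inSafeMode(String nodeId) throws Exception;
  }

  /**
   * Builds one status line per node (nodeId -> host:port). A failing node
   * yields an ERROR line instead of aborting the whole loop.
   */
  static List<String> statusLines(Map<String, String> nodes, SafeModeQuery query) {
    List<String> lines = new ArrayList<>();
    for (Map.Entry<String, String> e : nodes.entrySet()) {
      String prefix = e.getValue() + " [" + e.getKey() + "]: ";
      try {
        boolean inSafeMode = query.inSafeMode(e.getKey());
        lines.add(prefix + (inSafeMode ? "IN SAFE MODE" : "OUT OF SAFE MODE"));
      } catch (Exception ex) {
        // Keep going: the remaining nodes are still reported.
        lines.add(prefix + "ERROR: Failed to get safe mode status from SCM node "
            + e.getKey());
      }
    }
    return lines;
  }
}
```

Catching the exception inside the loop, rather than around it, is what keeps scm1 and scm3 visible when scm2 is unreachable.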

Green CI: https://github.com/sreejasahithi/ozone/actions/runs/20842284515


dombizita · 2026-01-10T10:36:35Z

@octachoron would you like to take a look at it if you have time? It's related to what we discussed recently :)

octachoron · 2026-01-11T03:26:17Z

@dombizita, absolutely, thank you! I don't think my vote is enough to merge though.

octachoron

Thank you @sreejasahithi for the patch. I added a few thoughts and questions inline. 🙂

...pache/hadoop/hdds/scm/protocolPB/StorageContainerLocationProtocolClientSideTranslatorPB.java

.../apache/hadoop/hdds/scm/protocol/StorageContainerLocationProtocolServerSideTranslatorPB.java

...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMClientProtocolServer.java

octachoron

Thank you, the changes look good to me. Do you think there is a good way to write tests for the feature? (I do not see straightforward precedent other than actual integration tests, but that does not mean there isn't a way. 🙂)

ashishkumar50

@sreejasahithi Thanks for working on this.

ashishkumar50 · 2026-01-13T11:13:45Z

...op-ozone/cli-admin/src/main/java/org/apache/hadoop/hdds/scm/cli/SafeModeCheckSubcommand.java

+  }
+
+  private void executeForSpecificNodeInHA(ScmClient scmClient, String serviceId) throws IOException {
+    String scmAddress = getScmOption().getScm();


scmAddress is not a mandatory option.

ashishkumar50 · 2026-01-13T11:16:13Z

...op-ozone/cli-admin/src/main/java/org/apache/hadoop/hdds/scm/cli/SafeModeCheckSubcommand.java

+    } else if (StringUtils.isNotEmpty(getScmOption().getScm()) && serviceId != null) {
+      executeForSpecificNodeInHA(scmClient, serviceId);
+    } else {
+      executeForSingleNode(scmClient);


In the normal (existing) behaviour, we need the safemode status from the leader node most of the time. When no SCM address is passed, are we still getting the safemode status from the leader node? Followers can now also accept the safemode request and return their own status.

Thanks @ashishkumar50 for finding this bug. You are right: now that followers are also allowed to accept the status command, running the safemode status command with no additional option could return the status of a follower.

I have fixed this issue.
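The selection order discussed in this thread could be sketched roughly as below. This is an illustration under assumptions, not the merged code: `SafeModeTargetSketch`, the `Target` enum, and the `allFlag` parameter are hypothetical, and the sketch assumes --all and --scm only take effect when a service ID is present (HA).

```java
/** Hypothetical sketch of how the command might pick its target. */
final class SafeModeTargetSketch {

  enum Target { ALL_NODES, SPECIFIC_NODE, LEADER_OR_SINGLE_NODE }

  static Target choose(boolean allFlag, String scmAddress, String serviceId) {
    boolean ha = serviceId != null;
    if (allFlag && ha) {
      return Target.ALL_NODES;
    }
    // --scm is optional, so it must be checked for null/empty before use.
    if (scmAddress != null && !scmAddress.isEmpty() && ha) {
      return Target.SPECIFIC_NODE;
    }
    // No option given: fall back to the leader (or the only SCM in non-HA),
    // never to an arbitrary follower.
    return Target.LEADER_OR_SINGLE_NODE;
  }
}
```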

priyeshkaratha

Thanks @sreejasahithi for working on this. I have one minor comment on handling audit logs.

priyeshkaratha · 2026-01-14T04:11:47Z

...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMClientProtocolServer.java

+  @Override
+  public Map<String, Pair<Boolean, String>> getSafeModeRuleStatusesForNode(String nodeId) throws IOException {
+    Map<String, Pair<Boolean, String>> result = getSafeModeRuleStatuses();
+    AUDIT.logReadSuccess(


Please use logReadSuccess with an auditMap that includes the nodeId, and also add an audit log on failure, since it's called for a specific nodeId.
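One way to read this suggestion, sketched with stand-ins rather than the real audit logger: `NodeAuditSketch`, `withAudit`, the in-memory `AUDIT_LOG`, and the action string passed in are all hypothetical and only illustrate recording the nodeId on both the success and the failure path.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;

/** Hypothetical sketch; the list below stands in for a real audit logger. */
final class NodeAuditSketch {

  static final List<String> AUDIT_LOG = new ArrayList<>();

  static <T> T withAudit(String action, String nodeId, Callable<T> call)
      throws IOException {
    // The audit map carries the node the request was made for.
    Map<String, String> auditMap = new HashMap<>();
    auditMap.put("nodeId", nodeId);
    try {
      T result = call.call();
      AUDIT_LOG.add("SUCCESS " + action + " " + auditMap);
      return result;
    } catch (Exception ex) {
      // Failures are audited too, with the same nodeId attached.
      AUDIT_LOG.add("FAILURE " + action + " " + auditMap + " " + ex.getMessage());
      throw ex instanceof IOException ? (IOException) ex : new IOException(ex);
    }
  }
}
```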

sadanand48 · 2026-01-14T08:08:55Z

...op-ozone/cli-admin/src/main/java/org/apache/hadoop/hdds/scm/cli/SafeModeCheckSubcommand.java

+          target.getAddress().equals(nodeAddr.getAddress());
+    } catch (Exception e) {
+      // If address resolution fails, no match
+      return false;


nit: Log the exception here before returning false.
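A sketch of what logging before the fallback might look like. It is illustrative only: `AddressMatchSketch` and its `parse` helper are hypothetical, the host:port string format is an assumption, and `java.util.logging` is used here just to keep the example self-contained (the project's own logger would be the natural choice in the real code).

```java
import java.net.InetSocketAddress;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch of comparing two host:port strings by resolved address. */
final class AddressMatchSketch {

  private static final Logger LOG =
      Logger.getLogger(AddressMatchSketch.class.getName());

  static boolean matches(String targetHostPort, String nodeHostPort) {
    try {
      InetSocketAddress target = parse(targetHostPort);
      InetSocketAddress nodeAddr = parse(nodeHostPort);
      // getAddress() is null for unresolved hosts; the resulting
      // NullPointerException is handled by the catch block below.
      return target.getPort() == nodeAddr.getPort()
          && target.getAddress().equals(nodeAddr.getAddress());
    } catch (Exception e) {
      // Record why the comparison failed instead of silently returning false.
      LOG.log(Level.WARNING, "Could not compare " + targetHostPort
          + " with " + nodeHostPort + "; treating as no match", e);
      return false;
    }
  }

  private static InetSocketAddress parse(String hostPort) {
    int idx = hostPort.lastIndexOf(':');
    return new InetSocketAddress(hostPort.substring(0, idx),
        Integer.parseInt(hostPort.substring(idx + 1)));
  }
}
```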

HDDS-14108. Provide option in ‘scm safemode status’ to show status of…

14173ff

… all SCM nodes

dombizita self-requested a review January 10, 2026 10:35

octachoron reviewed Jan 11, 2026


Refactored code for better readability and reusability

5958634

sreejasahithi requested a review from octachoron January 12, 2026 10:27

jojochuang requested review from errose28 and sumitagrawl and removed request for octachoron January 12, 2026 17:27

octachoron reviewed Jan 13, 2026


ashishkumar50 reviewed Jan 13, 2026


priyeshkaratha reviewed Jan 14, 2026


sadanand48 reviewed Jan 14, 2026


Sreeja Chintalapati added 2 commits January 16, 2026 15:33

Should return leader node status when no option provided

4b95b95

Addressed review comments wrt logs

dc935e6

sreejasahithi requested review from ashishkumar50, priyeshkaratha and sadanand48 January 16, 2026 10:36
