@sreejasahithi
Contributor

What changes were proposed in this pull request?

This PR adds an --all option that shows the safemode status of each SCM node in the cluster.
With --verbose, it also shows the status of each safemode exit rule for each SCM node.

This PR also fixes the bug described in HDDS-13832: in an HA cluster, the --scm option always showed the status of the leader SCM and silently ignored the node specified via the option.

What is the link to the Apache JIRA?

HDDS-14108

How was this patch tested?

This patch was tested locally in a docker ozone-ha cluster:

bash-5.1$ ozone admin safemode status --all --verbose
Service ID: scmservice
scm1:9860 [scm1]: OUT OF SAFE MODE
validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
scm2:9860 [scm2]: IN SAFE MODE
validated:false, DataNodeSafeModeRule, registered datanodes (=0) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
scm3:9860 [scm3]: OUT OF SAFE MODE
validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);

bash-5.1$ ozone admin safemode status --all --verbose
Service ID: scmservice
scm1:9860 [scm1]: OUT OF SAFE MODE
validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
scm2:9860 [scm2]: OUT OF SAFE MODE
validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
scm3:9860 [scm3]: OUT OF SAFE MODE
validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);

When one of the SCM nodes is down:

bash-5.1$ ozone admin safemode status --all --verbose
Service ID: scmservice
scm1:9860 [scm1]: OUT OF SAFE MODE
validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
scm2:9860 [scm2]: ERROR: Failed to get safe mode status from SCM node scm2
scm3:9860 [scm3]: OUT OF SAFE MODE
validated:true, DataNodeSafeModeRule, registered datanodes (=1) >= required datanodes (=1)
validated:true, RatisContainerSafeModeRule, 100.00% of [RATIS] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
validated:true, HealthyPipelineSafeModeRule, healthy Ratis/THREE pipelines (=0) >= healthyPipelineThresholdCount (=0)
validated:true, StateMachineReadyRule, Refreshed SCM State Machine after leader ready: true
validated:true, OneReplicaPipelineSafeModeRule, reported Ratis/THREE pipelines with at least one datanode (=0) >= threshold (=0)
validated:true, ECContainerSafeModeRule, 100.00% of [EC] Containers(0 / 0) with at least N reported replica (=1.00) >= safeModeCutoff (=0.99);
bash-5.1$ ozone admin safemode status --scm=scm2:9860
Service ID: scmservice
scm2:9860 [scm2]: ERROR: Failed to get safe mode status from SCM node scm2
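
For scripts that consume this output, here is a minimal sketch of parsing the per-node status lines. The line format is inferred only from the sample output above and is not a documented contract; the class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Ozone: parses lines such as
// "scm2:9860 [scm2]: IN SAFE MODE" from the sample output.
final class NodeStatusLineSketch {
  // host:port [nodeId]: status-or-error-text
  private static final Pattern NODE_LINE =
      Pattern.compile("^(\\S+):(\\d+) \\[([^\\]]+)\\]: (.*)$");

  static final class NodeStatus {
    final String host;
    final int port;
    final String nodeId;
    final String status;

    NodeStatus(String host, int port, String nodeId, String status) {
      this.host = host;
      this.port = port;
      this.nodeId = nodeId;
      this.status = status;
    }
  }

  // Returns empty for lines that are not per-node status lines,
  // e.g. the "Service ID: ..." header or the "validated:..." rule lines.
  static Optional<NodeStatus> parse(String line) {
    Matcher m = NODE_LINE.matcher(line);
    if (!m.matches()) {
      return Optional.empty();
    }
    return Optional.of(new NodeStatus(
        m.group(1), Integer.parseInt(m.group(2)), m.group(3), m.group(4)));
  }
}
```

The status text is kept as a raw string, so an unreachable node shows up with its "ERROR: ..." message rather than being dropped.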

Green CI: https://github.com/sreejasahithi/ozone/actions/runs/20842284515

@dombizita dombizita self-requested a review January 10, 2026 10:35
@dombizita
Contributor

@octachoron would you like to take a look at it if you have time? It's related to what we discussed recently :)

@octachoron
Contributor

@dombizita, absolutely, thank you! I don't think my vote is enough to merge though.

Contributor

@octachoron octachoron left a comment

Thank you @sreejasahithi for the patch. I added a few thoughts and questions inline. 🙂

@jojochuang jojochuang requested review from errose28 and sumitagrawl and removed request for octachoron January 12, 2026 17:27
Contributor

@octachoron octachoron left a comment

Thank you, the changes look good to me. Do you think there is a good way to write tests for the feature? (I do not see straightforward precedent other than actual integration tests, but that does not mean there isn't a way. 🙂)

Contributor

@ashishkumar50 ashishkumar50 left a comment

@sreejasahithi Thanks for working on this.

  }

  private void executeForSpecificNodeInHA(ScmClient scmClient, String serviceId) throws IOException {
    String scmAddress = getScmOption().getScm();
Contributor

scmAddress is not a mandatory option.

    } else if (StringUtils.isNotEmpty(getScmOption().getScm()) && serviceId != null) {
      executeForSpecificNodeInHA(scmClient, serviceId);
    } else {
      executeForSingleNode(scmClient);
Contributor

In the existing behaviour, we usually need the safemode status from the leader node. When no SCM address is passed, are we still getting the safemode status from the leader? Followers can now also accept the safemode status command and return their own status.

Contributor Author

@sreejasahithi sreejasahithi Jan 16, 2026

Thanks @ashishkumar50 for finding this bug. You are right: now that followers are also allowed to accept the status command, running the safemode status command with no additional option could return the status of a follower.

I have fixed this issue.
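
To make the intended behaviour discussed in this thread concrete, here is a minimal sketch of the option handling. It is an illustration under assumptions, not the code in this PR: the enum, the method name, and the handling of --all without a service ID are hypothetical. Only the --scm plus service ID condition mirrors the snippet quoted above.

```java
// Hypothetical sketch, not the actual Ozone subcommand.
final class SafeModeStatusDispatchSketch {
  enum Target { ALL_NODES, SPECIFIC_NODE, LEADER_OR_SINGLE_NODE }

  // allNodes:   whether --all was given
  // scmAddress: value of --scm; optional, so it may be null or empty
  // serviceId:  SCM service ID; null when HA is not configured
  static Target chooseTarget(boolean allNodes, String scmAddress, String serviceId) {
    boolean haConfigured = serviceId != null;
    boolean scmGiven = scmAddress != null && !scmAddress.isEmpty();
    if (allNodes && haConfigured) {
      // Assumption: --all only fans out when a service ID is known.
      return Target.ALL_NODES;
    }
    if (scmGiven && haConfigured) {
      // Query exactly the node the user named instead of failing over to the leader.
      return Target.SPECIFIC_NODE;
    }
    // No option given: keep the pre-existing behaviour (leader in HA, the only SCM otherwise).
    return Target.LEADER_OR_SINGLE_NODE;
  }
}
```

The point of the last branch is that a plain status call should not end up answered by a follower just because followers can now serve the request.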

Contributor

@priyeshkaratha priyeshkaratha left a comment

Thanks @sreejasahithi for working on this. I have one minor comment on handling audit logs.

  @Override
  public Map<String, Pair<Boolean, String>> getSafeModeRuleStatusesForNode(String nodeId) throws IOException {
    Map<String, Pair<Boolean, String>> result = getSafeModeRuleStatuses();
    AUDIT.logReadSuccess(
Contributor

Please use logReadSuccess with an auditMap that includes nodeId, and also add an audit log entry on failure, since this is called for a specific nodeId.
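
Roughly the shape I have in mind, as a self-contained sketch. The in-memory AUDIT_ENTRIES list and the RuleStatusSource interface are hypothetical stand-ins so the example compiles on its own; the real method would go through the existing audit logger and message builders instead.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: shows the success/failure audit pattern with nodeId in the audit map.
final class NodeAuditSketch {
  // Stand-in for the audit logger; collects entries in memory.
  static final List<String> AUDIT_ENTRIES = new ArrayList<>();

  // Stand-in for the call that actually fetches the rule statuses.
  interface RuleStatusSource {
    Map<String, Boolean> fetch() throws IOException;
  }

  static Map<String, Boolean> getRuleStatusesForNode(String nodeId, RuleStatusSource source)
      throws IOException {
    // Build the audit map once so success and failure entries carry the same parameters.
    Map<String, String> auditMap = new LinkedHashMap<>();
    auditMap.put("nodeId", nodeId);
    try {
      Map<String, Boolean> result = source.fetch();
      AUDIT_ENTRIES.add("SUCCESS " + auditMap);
      return result;
    } catch (IOException e) {
      // Record the failure with the same nodeId, then let the caller see the exception.
      AUDIT_ENTRIES.add("FAILURE " + auditMap + " " + e.getMessage());
      throw e;
    }
  }
}
```

Building the map before the try block keeps the failure entry from losing the nodeId when the fetch throws.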

      target.getAddress().equals(nodeAddr.getAddress());
    } catch (Exception e) {
      // If address resolution fails, no match
      return false;
Contributor

@sadanand48 sadanand48 Jan 14, 2026

nit: Log the exception here before returning false.
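
Something along these lines, as a minimal sketch. The class name, the host:port parsing, and the port comparison are assumptions made for the example, and java.util.logging is used only to keep it self-contained; the real code would use the project's existing logger.

```java
import java.net.InetSocketAddress;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of an address match that logs before treating a failure as "no match".
final class AddressMatchSketch {
  private static final Logger LOG = Logger.getLogger(AddressMatchSketch.class.getName());

  static boolean matches(String targetHostPort, String nodeHostPort) {
    try {
      InetSocketAddress target = parse(targetHostPort);
      InetSocketAddress nodeAddr = parse(nodeHostPort);
      // getAddress() is null when the host name could not be resolved.
      return target.getPort() == nodeAddr.getPort()
          && target.getAddress() != null
          && target.getAddress().equals(nodeAddr.getAddress());
    } catch (Exception e) {
      // Keep the "no match" result, but leave a trace of why the comparison failed.
      LOG.log(Level.WARNING,
          "Could not compare " + targetHostPort + " with " + nodeHostPort, e);
      return false;
    }
  }

  // Splits "host:port"; throws on malformed input so the caller's catch block handles it.
  private static InetSocketAddress parse(String hostPort) {
    int idx = hostPort.lastIndexOf(':');
    String host = hostPort.substring(0, idx);
    int port = Integer.parseInt(hostPort.substring(idx + 1));
    return new InetSocketAddress(host, port);
  }
}
```

With the log line in place, a typo in --scm or a DNS problem is visible in the logs instead of looking like a node that simply is not in the cluster.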

6 participants