Add replica-only node type for read scalability #20295

synhershko · 2025-12-21T06:05:27Z

This commit introduces a new replica_only node role that provides read scalability without impacting the write path or triggering cluster rebalancing, enabling cost-effective horizontal scaling for read-heavy workloads.

Motivation

Organizations often need to scale read capacity independently from write capacity. The traditional approach of adding more data nodes causes:

Unwanted shard rebalancing across the cluster
Increased write coordination overhead
Higher infrastructure costs for full-featured data nodes

The replica-only node type addresses these challenges by:

Providing read-only shard hosting without participating in primary shard allocation or rebalancing
Enabling cheap, ephemeral nodes that can be added/removed without cluster disruption
Supporting integration with object stores (S3, etc.) for pulling index data on-demand

Why Auto-Expand Replicas 0-all Only

The replica-only node type exclusively supports indices with index.auto_expand_replicas: 0-all for several critical reasons:

Dynamic Replica Management: Auto-expand automatically adjusts replica counts when replica-only nodes join/leave, eliminating manual intervention and preventing under-replication
No Manual Rebalancing: Without auto-expand, adding replica-only nodes would require manual replica count adjustments and could trigger rebalancing on data nodes
Predictable Behavior: The 0-all setting guarantees one copy per eligible node, making replica distribution deterministic and transparent
Operational Safety: Prevents accidental allocation of critical production indices to nodes that may be ephemeral or have different SLAs

High-Level Design

Core Components

New Node Role (DiscoveryNodeRole.REPLICA_ONLY_ROLE)
- Role name: replica_only
- Dedicated role that cannot coexist with any other role

Allocation Decider (ReplicaOnlyAllocationDecider)
- Blocks ALL primary shard allocation to replica-only nodes
- Blocks replica allocation unless index has auto_expand_replicas: 0-all
- Prevents force allocation of primaries (safety guarantee)

Rebalancing Prevention (LocalShardsBalancer)
- Excludes replica-only nodes from rebalancing model entirely
- Adding/removing replica-only nodes causes zero data node rebalancing
- Maintains cluster balance stability

Replica Promotion Prevention (RoutingNodes)
- Blocks promotion of replicas to primaries on replica-only nodes
- Cluster enters YELLOW/RED state when primary fails and only replica-only nodes have copies
- Ensures data integrity by requiring regular data node for primaries
Design Concerns Addressed

1. Data Integrity and Cluster Health

What happens if primary fails and only replica-only nodes have copies?

Replicas on replica-only nodes NEVER promote to primaries. The cluster enters YELLOW/RED state and waits for a regular data node. This prevents data loss scenarios where an ephemeral node becomes the source of truth.

2. Rebalancing Isolation

Will replica-only nodes trigger rebalancing on production data nodes?

Replica-only nodes are completely excluded from the BalancedShardsAllocator model. They are invisible to the balancer, ensuring zero rebalancing impact when nodes join/leave.
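
To illustrate the isolation claim, here is a rough sketch of building a balancer view that skips replica-only nodes. It assumes a simplified `NodeInfo` type with a boolean flag; both `NodeInfo` and `balancerNodes` are hypothetical and only approximate what the change in `LocalShardsBalancer` is described as doing.

```java
import java.util.ArrayList;
import java.util.List;

final class BalancerModelSketch {
    /** Hypothetical minimal node view: an id plus a replica-only flag. */
    static final class NodeInfo {
        final String id;
        final boolean replicaOnly;

        NodeInfo(String id, boolean replicaOnly) {
            this.id = id;
            this.replicaOnly = replicaOnly;
        }
    }

    /** Returns the ids of nodes the balancer would weigh; replica-only nodes are skipped. */
    static List<String> balancerNodes(List<NodeInfo> clusterNodes) {
        List<String> model = new ArrayList<>();
        for (NodeInfo node : clusterNodes) {
            if (node.replicaOnly) {
                continue; // invisible to the balancer, so it cannot shift any weights
            }
            model.add(node.id);
        }
        return model;
    }
}
```

With this filter, the list of balanced nodes is the same before and after a replica-only node joins, so the balancer has no new imbalance to correct.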

3. Role Transitions

What happens if a data node transitions to replica-only role?

Primary shards are relocated to other data nodes
Replicas from non-auto-expand indices are relocated out
Replicas from auto-expand 0-all indices remain
All transitions are safe with no data loss (canRemain() enforcement)

4. Recovery Code Paths

Could recovery logic accidentally create primaries on replica-only nodes?

No, via multiple layers of protection:

AllocationDecider blocks at allocation time
canForceAllocatePrimary() blocks forced allocation
promoteReplicaToPrimary() has explicit replica-only check

5. Auto-Expand Node Counting

How do replica-only nodes integrate with auto-expand replica counting?

shouldAutoExpandToNode() in ReplicaOnlyAllocationDecider is automatically called by AutoExpandReplicas.getDesiredNumberOfReplicas(). Replica-only nodes are counted only for 0-all indices, ensuring correct replica counts.
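
A worked example of the counting may help. The sketch below assumes every node that is not replica-only is a regular data node, and that a 0-all index wants one copy per eligible node (so replicas equal eligible nodes minus the primary). `AutoExpandCountSketch` and its methods are hypothetical and do not reproduce the real `AutoExpandReplicas` logic.

```java
import java.util.List;

final class AutoExpandCountSketch {
    /** Hypothetical minimal node view: an id plus a replica-only flag. */
    static final class NodeInfo {
        final String id;
        final boolean replicaOnly;

        NodeInfo(String id, boolean replicaOnly) {
            this.id = id;
            this.replicaOnly = replicaOnly;
        }
    }

    /** Counts nodes eligible to hold a copy of an index with the given auto_expand_replicas value. */
    static int eligibleNodes(List<NodeInfo> nodes, String autoExpandReplicas) {
        boolean zeroToAll = "0-all".equals(autoExpandReplicas);
        int count = 0;
        for (NodeInfo node : nodes) {
            // Data nodes always count; replica-only nodes count only for 0-all indices
            if (!node.replicaOnly || zeroToAll) {
                count++;
            }
        }
        return count;
    }

    /** For 0-all: one copy per eligible node, one of which is the primary. */
    static int desiredReplicasForZeroToAll(int eligibleNodes) {
        return Math.max(0, eligibleNodes - 1);
    }
}
```

Under these assumptions, a cluster with three data nodes and two replica-only nodes has five eligible nodes for a 0-all index, giving four replicas, while a 0-1 index on the same cluster sees only the three data nodes.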

Configuration Example

Node configuration:

node.roles: [replica_only]

Index configuration:

PUT /my-index
{
  "settings": {
    "index.auto_expand_replicas": "0-all"
  }
}

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

Release Notes

New Features
- Introduced a new replica-only node role enabling nodes dedicated to storing replica shards only.
Bug Fixes
- Enhanced shard allocation logic to prevent primary shards from being assigned to replica-only nodes.
- Added safeguards to prevent replica promotion on replica-only nodes.
- Excluded replica-only nodes from rebalancing calculations.
Tests
- Added comprehensive test coverage for replica-only node functionality and allocation rules.


coderabbitai · 2025-12-21T06:05:49Z

Walkthrough

This PR introduces a new replica-only node role to OpenSearch. It adds REPLICA_ONLY_ROLE with validation enforcing it as an exclusive role, a new ReplicaOnlyAllocationDecider that prevents primary allocation to replica-only nodes while allowing replica allocation under specific auto-expand settings, and extends the routing layer to detect and skip replica-only nodes during promotion and rebalancing.

Changes

Cohort / File(s)	Change Summary
Node Role Definition `server/src/main/java/org/opensearch/cluster/node/DiscoveryNodeRole.java`	Added `REPLICA_ONLY_ROLE` public role with "replica_only" name and "ro" abbreviation; includes validation to enforce exclusive role presence; updated `BUILT_IN_ROLES` to include the new role.
Node Role Detection `server/src/main/java/org/opensearch/cluster/node/DiscoveryNode.java`	Added static helper `isReplicaOnlyNode(Settings)` and instance method `isReplicaOnlyNode()` to query replica-only role presence.
Allocation Decider `server/src/main/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDecider.java`	New allocation decider class preventing primary allocation to replica-only nodes; allows replicas only for indices with 0-all auto_expand_replicas; implements `canAllocate`, `canRemain`, `canForceAllocatePrimary`, and `shouldAutoExpandToNode` decision methods.
Decider Registration `server/src/main/java/org/opensearch/cluster/ClusterModule.java`	Added import and registration of `ReplicaOnlyAllocationDecider` in allocation decider pipeline.
Routing Adjustments `server/src/main/java/org/opensearch/cluster/routing/RoutingNodes.java`	Added logger and guard in `promoteReplicaToPrimary()` to prevent promoting replicas residing on replica-only nodes.
Rebalancing Adjustments `server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/LocalShardsBalancer.java`	Excludes replica-only nodes from rebalancing model construction in `buildModelFromAssigned()`.
Role Tests `server/src/test/java/org/opensearch/cluster/node/DiscoveryNodeRoleTests.java`	Added three unit tests: `testReplicaOnlyRoleIsDedicated`, `testReplicaOnlyRoleProperties`, and `testReplicaOnlyRoleInBuiltInRoles`.
Allocation Decider Tests `server/src/test/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDeciderTests.java`	New comprehensive test class covering allocation decisions, auto-expand scenarios, force-allocation, remain behavior, and eligibility checks across replica-only and data nodes.

Sequence Diagram

sequenceDiagram
    actor User
    participant Allocator as Shard Allocator
    participant Decider as ReplicaOnlyAllocationDecider
    participant NodeRoles as Node Roles<br/>(DiscoveryNode)
    participant IndexMeta as Index Metadata<br/>(auto_expand_replicas)

    User->>Allocator: Request shard allocation decision
    Allocator->>Decider: canAllocate(ShardRouting, Node)
    
    alt Allocating Primary
        Decider->>NodeRoles: Check if target node is replica-only
        alt Is Replica-Only Node
            NodeRoles-->>Decider: true
            Decider-->>Allocator: DENY (primaries not allowed)
        else Is Data Node
            NodeRoles-->>Decider: false
            Decider-->>Allocator: YES (allow primary)
        end
    else Allocating Replica
        Decider->>NodeRoles: Check if target node is replica-only
        alt Is Replica-Only Node
            NodeRoles-->>Decider: true
            Decider->>IndexMeta: Check auto_expand_replicas setting
            alt Is 0-all (auto-expand)
                IndexMeta-->>Decider: 0-all detected
                Decider-->>Allocator: YES (allow replica)
            else Other Settings
                IndexMeta-->>Decider: Not 0-all
                Decider-->>Allocator: DENY (replica only for 0-all)
            end
        else Is Data Node
            NodeRoles-->>Decider: false
            Decider-->>Allocator: YES (allow replica)
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring additional attention:

ReplicaOnlyAllocationDecider logic: Complex conditional branching for auto_expand_replicas eligibility and primary vs. replica handling; review the canAllocate and shouldAutoExpandToNode decision paths carefully.
DiscoveryNodeRole.validateRole enforcement: Verify that the exclusive role constraint is properly enforced and doesn't conflict with existing role logic or node initialization.
RoutingNodes.promoteReplicaToPrimary guard: Ensure the guard correctly identifies replica-only nodes and that early return doesn't break other promotion logic.
LocalShardsBalancer exclusion: Confirm that skipping replica-only nodes in model construction doesn't inadvertently exclude valid replica placements or cause rebalancing issues.
Test coverage comprehensiveness: Review ReplicaOnlyAllocationDeciderTests for edge cases around mixed auto_expand settings and state transitions.

Suggested labels

enhancement, Other

Suggested reviewers

andrross
reta
dbwiddis
cwperks
kotwanikunal
anasalkouz

Poem

🐰 A new role hops into the cluster today,
Replica-only nodes lead shards the right way!
No primaries here, just backups so keen,
The finest rebalance you ever have seen.
Watch allocations dance with fresh delight,
Auto-expand rules keep everything right! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Add replica-only node type for read scalability' accurately summarizes the main change: introducing a new node role for read-focused scaling without write path impact.
Description check	✅ Passed	The description comprehensively covers motivation, design rationale, core components, design concerns, and configuration examples. It aligns with the template's Description section and includes Apache 2.0 license confirmation.


Signed-off-by: Itamar Syn-Hershko <itamar@bigdataboutique.com>

synhershko · 2025-12-21T06:09:53Z

Pinging @msfroh @andrross @prudhvigodithi for a review :)

We haven't tested this extensively just yet but believe it's a strong addition to OpenSearch and would love peer review and assistance with testing this thoroughly.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (4)

server/src/main/java/org/opensearch/cluster/ClusterModule.java (1)

80-81: Minor: Import ordering.

The import for ReplicaOnlyAllocationDecider is placed between SameShardAllocationDecider and SearchReplicaAllocationDecider. While not strictly required to be alphabetical, consider placing it after RemoteStoreMigrationAllocationDecider to maintain alphabetical ordering with other decider imports.

server/src/main/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDecider.java (1)

66-69: Extract duplicated isAutoExpandAll logic into a private helper method.

The same logic for determining if an index has auto_expand_replicas: 0-all is duplicated in both shouldAutoExpandToNode (lines 66-69) and canAllocate (lines 111-113). This violates DRY and risks inconsistency if one location is updated but not the other.
🔎 Proposed refactor to extract helper method
+    private static boolean isAutoExpandAll(IndexMetadata indexMetadata) {
+        AutoExpandReplicas autoExpandReplicas = AutoExpandReplicas.SETTING.get(indexMetadata.getSettings());
+        return autoExpandReplicas.isEnabled()
+            && autoExpandReplicas.getMaxReplicas() == Integer.MAX_VALUE
+            && autoExpandReplicas.getMinReplicas() == 0;
+    }
+
     @Override
     public Decision shouldAutoExpandToNode(IndexMetadata indexMetadata, DiscoveryNode node, RoutingAllocation allocation) {
         if (!node.isReplicaOnlyNode()) {
             // Regular data nodes participate in auto-expand for all indices
             return allocation.decision(Decision.YES, NAME, "node [%s] is a data node, eligible for auto-expand", node.getId());
         }

         // Replica-only nodes only participate in 0-all auto-expand
-        AutoExpandReplicas autoExpandReplicas = AutoExpandReplicas.SETTING.get(indexMetadata.getSettings());
-        boolean isAutoExpandAll = autoExpandReplicas.isEnabled()
-            && autoExpandReplicas.getMaxReplicas() == Integer.MAX_VALUE
-            && autoExpandReplicas.toString().startsWith("0-");
-
-        if (isAutoExpandAll) {
+        if (isAutoExpandAll(indexMetadata)) {
Apply similar change in canAllocate method.

Also applies to: 111-113

server/src/test/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDeciderTests.java (2)

56-62: Remove unused AllocationService instance.

The AllocationService is instantiated but never used in any test method. This is dead code that adds unnecessary complexity to the test setup.
🔎 Proposed fix
         allocationDeciders = new AllocationDeciders(
             Arrays.asList(decider, new SameShardAllocationDecider(Settings.EMPTY, clusterSettings), new ReplicaAfterPrimaryActiveAllocationDecider())
         );

-        AllocationService service = new AllocationService(
-            allocationDeciders,
-            new TestGatewayAllocator(),
-            new BalancedShardsAllocator(Settings.EMPTY),
-            EmptyClusterInfoService.INSTANCE,
-            EmptySnapshotsInfoService.INSTANCE
-        );
-
         // Create a cluster with 2 nodes: one regular data node, one replica-only node

293-342: Consider adding shouldAutoExpandToNode assertions for non-0-all auto-expand settings.

The test correctly verifies that canAllocate rejects replicas with 0-5 and 1-all settings on replica-only nodes. For completeness, consider also testing shouldAutoExpandToNode with these settings to ensure the auto-expand eligibility check is consistent.
🔎 Proposed addition
         decision = decider.canAllocate(replica1all, replicaOnlyNode, allocation);
         assertEquals(Decision.Type.NO, decision.type());
+
+        // Verify shouldAutoExpandToNode also rejects non-0-all settings
+        IndexMetadata index05 = testState.metadata().index("auto-expand-0-5");
+        IndexMetadata index1all = testState.metadata().index("auto-expand-1-all");
+        DiscoveryNode replicaOnlyDiscoveryNode = testState.nodes().get("replica-only-node");
+
+        decision = decider.shouldAutoExpandToNode(index05, replicaOnlyDiscoveryNode, allocation);
+        assertEquals(Decision.Type.NO, decision.type());
+
+        decision = decider.shouldAutoExpandToNode(index1all, replicaOnlyDiscoveryNode, allocation);
+        assertEquals(Decision.Type.NO, decision.type());
     }
 }

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 84bc68e and 9d31c2c.

📒 Files selected for processing (8)

server/src/main/java/org/opensearch/cluster/ClusterModule.java (2 hunks)
server/src/main/java/org/opensearch/cluster/node/DiscoveryNode.java (2 hunks)
server/src/main/java/org/opensearch/cluster/node/DiscoveryNodeRole.java (1 hunks)
server/src/main/java/org/opensearch/cluster/routing/RoutingNodes.java (3 hunks)
server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/LocalShardsBalancer.java (1 hunks)
server/src/main/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDecider.java (1 hunks)
server/src/test/java/org/opensearch/cluster/node/DiscoveryNodeRoleTests.java (1 hunks)
server/src/test/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDeciderTests.java (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

server/src/test/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDeciderTests.java (2)

test/framework/src/main/java/org/opensearch/cluster/OpenSearchAllocationTestCase.java (1)

OpenSearchAllocationTestCase (76-543)

server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java (1)

BalancedShardsAllocator (93-876)

server/src/main/java/org/opensearch/cluster/ClusterModule.java (1)

server/src/main/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDecider.java (1)

ReplicaOnlyAllocationDecider (29-140)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: gradle-check
GitHub Check: Mend Security Check

🔇 Additional comments (12)

server/src/main/java/org/opensearch/cluster/node/DiscoveryNode.java (2)

133-135: LGTM!

The static helper follows the existing pattern used for other roles like isWarmNode and isIngestNode.

549-556: LGTM!

The instance method follows the established pattern for role checks (e.g., isSearchNode, isWarmNode) with consistent Javadoc formatting.

server/src/main/java/org/opensearch/cluster/node/DiscoveryNodeRole.java (2)

340-373: LGTM!

The REPLICA_ONLY_ROLE definition is well-documented and follows the established pattern for exclusive roles (similar to SEARCH_ROLE). The validation logic correctly enforces that replica_only must be the only role on a node.

378-382: LGTM!

The role is correctly added to BUILT_IN_ROLES. The sorted set will maintain proper ordering by role name.

server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/LocalShardsBalancer.java (1)

768-775: LGTM!

Correctly excludes replica-only nodes from the rebalancing model, which ensures that adding/removing these nodes doesn't trigger rebalancing of shards on regular data nodes. The trace log is helpful for debugging allocation decisions.

server/src/main/java/org/opensearch/cluster/routing/RoutingNodes.java (1)

89-90: LGTM!

Logger addition follows the standard pattern used in other classes.

server/src/main/java/org/opensearch/cluster/ClusterModule.java (1)

400-400: LGTM!

The ReplicaOnlyAllocationDecider is correctly registered in the allocation deciders pipeline.

server/src/main/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDecider.java (2)

29-56: LGTM on the core structure and force allocation blocking.

The decider correctly:

Delegates canAllocate and canRemain to a shared helper for consistency

Blocks primary shard allocation to replica-only nodes

Blocks force allocation of primaries to replica-only nodes (critical safety measure)

Provides informative decision messages for debugging

67-69: The suggested fix cannot be implemented: getMinReplicas() is a private method.

The fragility concern about using autoExpandReplicas.toString().startsWith("0-") is valid. However, the proposed fix to use autoExpandReplicas.getMinReplicas() == 0 is not viable—the getMinReplicas() method is private and cannot be called from ReplicaOnlyAllocationDecider.

To address the fragility concern, either:

Make getMinReplicas() public in AutoExpandReplicas, then use the suggested check, or

Add a public helper method to AutoExpandReplicas (e.g., isMinReplicasZero()) that encapsulates this logic.

Likely an incorrect or invalid review comment.
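
For comparison with the `toString().startsWith("0-")` check discussed above, here is a rough sketch of checking the lower bound by parsing it explicitly. It operates on the raw setting string (for example "0-all"), which is an assumption made for illustration; the real decider works with an `AutoExpandReplicas` object, and `AutoExpandParseSketch.isZeroToAll` is a hypothetical helper, not part of OpenSearch.

```java
final class AutoExpandParseSketch {
    /** True only for an enabled range whose lower bound is 0 and whose upper bound is "all". */
    static boolean isZeroToAll(String rawSetting) {
        if (rawSetting == null) {
            return false;
        }
        String value = rawSetting.trim();
        int dash = value.indexOf('-');
        // Rejects values without a range ("false"), or with an empty side ("-5", "0-")
        if (dash <= 0 || dash == value.length() - 1) {
            return false;
        }
        String min = value.substring(0, dash);
        String max = value.substring(dash + 1);
        try {
            return Integer.parseInt(min) == 0 && "all".equals(max);
        } catch (NumberFormatException e) {
            return false; // lower bound is not a number
        }
    }
}
```

Comparing the parsed lower bound as a number avoids depending on how a range happens to be rendered as text, which is the fragility the review comment raises.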

server/src/test/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDeciderTests.java (1)

90-156: Good test structure with thorough assertions.

The tests properly verify both the decision type and the decision message content. Enabling debugDecision(true) ensures meaningful assertion failure messages. The coverage of primary/replica allocation scenarios for both node types is comprehensive.

server/src/test/java/org/opensearch/cluster/node/DiscoveryNodeRoleTests.java (2)

163-187: Thorough validation of the dedicated role constraint.

The test correctly verifies that replica_only cannot coexist with DATA_ROLE, CLUSTER_MANAGER_ROLE, or INGEST_ROLE, and that it passes validation when specified alone. The assertion messages are clear and will help diagnose failures.
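
The dedicated-role constraint these tests exercise can be pictured with a small standalone check. This is only a sketch using plain role-name strings; `ExclusiveRoleSketch.validate` is a hypothetical stand-in and not the validation method in `DiscoveryNodeRole`.

```java
import java.util.List;

final class ExclusiveRoleSketch {
    static final String REPLICA_ONLY = "replica_only";

    /** Throws if replica_only is combined with any other role; otherwise returns normally. */
    static void validate(List<String> roles) {
        if (roles.contains(REPLICA_ONLY) && roles.size() > 1) {
            throw new IllegalArgumentException(
                "[" + REPLICA_ONLY + "] must be the only role on a node, but found " + roles
            );
        }
    }
}
```

In this sketch, a role list of just `replica_only` passes, while `replica_only` combined with `data`, `cluster_manager`, or `ingest` is rejected, mirroring the cases the test is described as covering.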

189-198: LGTM on role property and built-in roles tests.

These tests provide essential coverage for:

Role name (replica_only) and abbreviation (ro)

Data-containing capability (canContainData() returns true)

No legacy setting (expected for new roles)

Presence in BUILT_IN_ROLES set

coderabbitai · 2025-12-21T06:11:39Z

server/src/main/java/org/opensearch/cluster/routing/RoutingNodes.java

+
+        // CRITICAL: Never promote replicas on replica-only nodes
+        RoutingNode routingNode = node(activeReplica.currentNodeId());
+        if (routingNode != null && routingNode.node().isReplicaOnlyNode()) {
+            logger.warn(
+                "Cannot promote replica shard [{}] to primary on replica-only node [{}]. "
+                    + "Shard will remain as replica. Primary must be allocated to a regular data node.",
+                activeReplica.shardId(),
+                routingNode.nodeId()
+            );
+            // Do NOT call promoteActiveReplicaShardToPrimary - just return
+            // The primary will remain unassigned, triggering allocation to a data node
+            return;
+        }


⚠️ Potential issue | 🔴 Critical

Critical: Early return leaves cluster without a primary shard.

The guard returns early without promoting the replica, but at this point movePrimaryToUnassignedAndDemoteToReplica() has already been called in unassignPrimaryAndPromoteActiveReplicaIfExists(). This leaves the shard group with no primary:

Original primary → demoted to unassigned replica

Active replica on replica-only node → remains a replica (not promoted)

Result: No primary exists for this shard

The replica-only check should occur before the primary is demoted, ideally in unassignPrimaryAndPromoteActiveReplicaIfExists() when selecting the active replica. The replica selection logic should skip replicas on replica-only nodes.

🔎 Suggested approach

Modify the replica selection in unassignPrimaryAndPromoteActiveReplicaIfExists() to exclude replicas on replica-only nodes:

 private void unassignPrimaryAndPromoteActiveReplicaIfExists(
     ShardRouting failedShard,
     UnassignedInfo unassignedInfo,
     RoutingChangesObserver routingChangesObserver
 ) {
     assert failedShard.primary();
     ShardRouting activeReplica = null;
     if (isMigratingToRemoteStore(metadata)) {
         activeReplica = activeReplicaOnRemoteNode(failedShard.shardId());
+        // Skip if on replica-only node
+        if (activeReplica != null && isReplicaOnReplicaOnlyNode(activeReplica)) {
+            activeReplica = null;
+        }
     }
     if (activeReplica == null) {
         if (metadata.isSegmentReplicationEnabled(failedShard.getIndexName())) {
             activeReplica = activeReplicaWithOldestVersion(failedShard.shardId());
         } else {
             activeReplica = activeReplicaWithHighestVersion(failedShard.shardId());
         }
+        // Skip if on replica-only node
+        if (activeReplica != null && isReplicaOnReplicaOnlyNode(activeReplica)) {
+            activeReplica = null;
+        }
     }
     // ... rest of method
 }
+
+private boolean isReplicaOnReplicaOnlyNode(ShardRouting replica) {
+    RoutingNode rn = node(replica.currentNodeId());
+    return rn != null && rn.node().isReplicaOnlyNode();
+}

Alternatively, filter replica-only nodes in activeReplicaWithHighestVersion() and activeReplicaWithOldestVersion().
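
The alternative of filtering inside the selection step might look roughly like the sketch below. It assumes a simplified `Replica` type carrying the hosting node's version as a `long` and a flag for replica-only hosts; these types and `pickCandidate` are hypothetical and do not mirror the actual `RoutingNodes` internals.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class PromotionCandidateSketch {
    /** Hypothetical view of an active replica and the node that hosts it. */
    static final class Replica {
        final String nodeId;
        final long nodeVersion;
        final boolean onReplicaOnlyNode;

        Replica(String nodeId, long nodeVersion, boolean onReplicaOnlyNode) {
            this.nodeId = nodeId;
            this.nodeVersion = nodeVersion;
            this.onReplicaOnlyNode = onReplicaOnlyNode;
        }
    }

    /** Picks the replica on the highest-version node, ignoring replica-only hosts. */
    static Optional<Replica> pickCandidate(List<Replica> activeReplicas) {
        return activeReplicas.stream()
            .filter(replica -> !replica.onReplicaOnlyNode)
            .max(Comparator.comparingLong((Replica replica) -> replica.nodeVersion));
    }
}
```

Filtering before selection means an eligible replica on a data node is still found even when a replica-only node would otherwise have been chosen first, and an empty result signals up front that no promotion is possible.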

Committable suggestion skipped: line range outside the PR's diff.

github-actions · 2025-12-21T08:10:02Z

❌ Gradle check result for 9d31c2c: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

prudhvigodithi · 2025-12-30T00:35:10Z

Thanks for this PR @synhershko

The use case of scaling read capacity independently is definitely valuable. I wanted to point out that OpenSearch recently introduced Search Replicas as part of the reader/writer separation feature, which addresses a similar problem. This feature uses:

A dedicated search node role.
New shard types (search replicas vs write replicas).
Segment replication with remote store backend.
A _scale API for search-only mode.
