-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Add replica-only node type for read scalability #20295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add replica-only node type for read scalability #20295
Conversation
WalkthroughThis PR introduces a new replica-only node role to OpenSearch. It adds Changes
Sequence DiagramsequenceDiagram
actor User
participant Allocator as Shard Allocator
participant Decider as ReplicaOnlyAllocationDecider
participant NodeRoles as Node Roles<br/>(DiscoveryNode)
participant IndexMeta as Index Metadata<br/>(auto_expand_replicas)
User->>Allocator: Request shard allocation decision
Allocator->>Decider: canAllocate(ShardRouting, Node)
alt Allocating Primary
Decider->>NodeRoles: Check if target node is replica-only
alt Is Replica-Only Node
NodeRoles-->>Decider: true
Decider-->>Allocator: DENY (primaries not allowed)
else Is Data Node
NodeRoles-->>Decider: false
Decider-->>Allocator: YES (allow primary)
end
else Allocating Replica
Decider->>NodeRoles: Check if target node is replica-only
alt Is Replica-Only Node
NodeRoles-->>Decider: true
Decider->>IndexMeta: Check auto_expand_replicas setting
alt Is 0-all (auto-expand)
IndexMeta-->>Decider: 0-all detected
Decider-->>Allocator: YES (allow replica)
else Other Settings
IndexMeta-->>Decider: Not 0-all
Decider-->>Allocator: DENY (replica only for 0-all)
end
else Is Data Node
NodeRoles-->>Decider: false
Decider-->>Allocator: YES (allow replica)
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Areas requiring additional attention:
Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
This commit introduces a new `replica_only` node role that provides
read scalability without impacting the write path or triggering cluster
rebalancing, enabling cost-effective horizontal scaling for read-heavy
workloads.
## Motivation
Organizations often need to scale read capacity independently from write
capacity. Traditional approaches of adding more data nodes cause:
- Unwanted shard rebalancing across the cluster
- Increased write coordination overhead
- Higher infrastructure costs for full-featured data nodes
The replica-only node type addresses these challenges by:
- Providing read-only shard hosting without participating in primary
shard allocation or rebalancing
- Enabling cheap, ephemeral nodes that can be added/removed without
cluster disruption
- Supporting integration with object stores (S3, etc.) for pulling
index data on-demand
## Why Auto-Expand Replicas 0-all Only
The replica-only node type exclusively supports indices with
`index.auto_expand_replicas: 0-all` for several critical reasons:
1. **Dynamic Replica Management**: Auto-expand automatically adjusts
replica counts when replica-only nodes join/leave, eliminating
manual intervention and preventing under-replication
2. **No Manual Rebalancing**: Without auto-expand, adding replica-only
nodes would require manual replica count adjustments and could
trigger rebalancing on data nodes
3. **Predictable Behavior**: The 0-all setting guarantees one copy per
eligible node, making replica distribution deterministic and
transparent
4. **Operational Safety**: Prevents accidental allocation of critical
production indices to nodes that may be ephemeral or have different
SLAs
## High-Level Design
### Core Components
1. **New Node Role (DiscoveryNodeRole.REPLICA_ONLY_ROLE)**
- Role name: `replica_only`
- Dedicated role that cannot coexist with any other role
2. **Allocation Decider (ReplicaOnlyAllocationDecider)**
- Blocks ALL primary shard allocation to replica-only nodes
- Blocks replica allocation unless index has auto_expand_replicas:
0-all
- Prevents force allocation of primaries (safety guarantee)
3. **Rebalancing Prevention (LocalShardsBalancer)**
- Excludes replica-only nodes from rebalancing model entirely
- Adding/removing replica-only nodes causes zero data node
rebalancing
- Maintains cluster balance stability
4. **Replica Promotion Prevention (RoutingNodes)**
- Blocks promotion of replicas to primaries on replica-only nodes
- Cluster enters YELLOW/RED state when primary fails and only
replica-only nodes have copies
- Ensures data integrity by requiring regular data node for primaries
## Design Concerns Addressed
### 1. Data Integrity and Cluster Health
What happens if primary fails and only replica-only nodes
have copies?
Replicas on replica-only nodes NEVER promote to primaries.
The cluster enters YELLOW/RED state and waits for a regular data node.
This prevents data loss scenarios where an ephemeral node becomes the
source of truth.
### 2. Rebalancing Isolation
Will replica-only nodes trigger rebalancing on production
data nodes?
Replica-only nodes are completely excluded from the
BalancedShardsAllocator model. They are invisible to the balancer,
ensuring zero rebalancing impact when nodes join/leave.
### 3. Role Transitions
What happens if a data node transitions to replica-only role?
- Primary shards are relocated to other data nodes
- Replicas from non-auto-expand indices are relocated out
- Replicas from auto-expand 0-all indices remain
- All transitions are safe with no data loss (canRemain() enforcement)
### 4. Recovery Code Paths
Could recovery logic accidentally create primaries on
replica-only nodes?
No, via multiple layers of protection:
- AllocationDecider blocks at allocation time
- canForceAllocatePrimary() blocks forced allocation
- promoteReplicaToPrimary() has explicit replica-only check
### 5. Auto-Expand Node Counting
How do replica-only nodes integrate with auto-expand
replica counting?
shouldAutoExpandToNode() in ReplicaOnlyAllocationDecider
is automatically called by
AutoExpandReplicas.getDesiredNumberOfReplicas().
Replica-only nodes are counted only for 0-all indices, ensuring correct
replica counts.
## Configuration Example
Node configuration:
```yaml
node.roles: [replica_only]
```
Index configuration:
```json
PUT /my-index
{
"settings": {
"index.auto_expand_replicas": "0-all"
}
}
```
Signed-off-by: Itamar Syn-Hershko <itamar@bigdataboutique.com>
df9b566 to
9d31c2c
Compare
|
Pinging @msfroh @andrross @prudhvigodithi for a review :) We haven't tested this extensively just yet but believe it's a strong addition to OpenSearch and would love peer review and assistance with testing this thoroughly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (4)
server/src/main/java/org/opensearch/cluster/ClusterModule.java (1)
80-81: Minor: Import ordering.The import for
ReplicaOnlyAllocationDecideris placed betweenSameShardAllocationDeciderandSearchReplicaAllocationDecider. While not strictly required to be alphabetical, consider placing it afterRemoteStoreMigrationAllocationDeciderto maintain alphabetical ordering with other decider imports.server/src/main/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDecider.java (1)
66-69: Extract duplicatedisAutoExpandAlllogic into a private helper method.The same logic for determining if an index has
auto_expand_replicas: 0-allis duplicated in bothshouldAutoExpandToNode(lines 66-69) andcanAllocate(lines 111-113). This violates DRY and risks inconsistency if one location is updated but not the other.🔎 Proposed refactor to extract helper method
+ private static boolean isAutoExpandAll(IndexMetadata indexMetadata) { + AutoExpandReplicas autoExpandReplicas = AutoExpandReplicas.SETTING.get(indexMetadata.getSettings()); + return autoExpandReplicas.isEnabled() + && autoExpandReplicas.getMaxReplicas() == Integer.MAX_VALUE + && autoExpandReplicas.getMinReplicas() == 0; + } + @Override public Decision shouldAutoExpandToNode(IndexMetadata indexMetadata, DiscoveryNode node, RoutingAllocation allocation) { if (!node.isReplicaOnlyNode()) { // Regular data nodes participate in auto-expand for all indices return allocation.decision(Decision.YES, NAME, "node [%s] is a data node, eligible for auto-expand", node.getId()); } // Replica-only nodes only participate in 0-all auto-expand - AutoExpandReplicas autoExpandReplicas = AutoExpandReplicas.SETTING.get(indexMetadata.getSettings()); - boolean isAutoExpandAll = autoExpandReplicas.isEnabled() - && autoExpandReplicas.getMaxReplicas() == Integer.MAX_VALUE - && autoExpandReplicas.toString().startsWith("0-"); - - if (isAutoExpandAll) { + if (isAutoExpandAll(indexMetadata)) {Apply similar change in
canAllocatemethod.Also applies to: 111-113
server/src/test/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDeciderTests.java (2)
56-62: Remove unusedAllocationServiceinstance.The
AllocationServiceis instantiated but never used in any test method. This is dead code that adds unnecessary complexity to the test setup.🔎 Proposed fix
allocationDeciders = new AllocationDeciders( Arrays.asList(decider, new SameShardAllocationDecider(Settings.EMPTY, clusterSettings), new ReplicaAfterPrimaryActiveAllocationDecider()) ); - AllocationService service = new AllocationService( - allocationDeciders, - new TestGatewayAllocator(), - new BalancedShardsAllocator(Settings.EMPTY), - EmptyClusterInfoService.INSTANCE, - EmptySnapshotsInfoService.INSTANCE - ); - // Create a cluster with 2 nodes: one regular data node, one replica-only node
293-342: Consider addingshouldAutoExpandToNodeassertions for non-0-all auto-expand settings.The test correctly verifies that
canAllocaterejects replicas with0-5and1-allsettings on replica-only nodes. For completeness, consider also testingshouldAutoExpandToNodewith these settings to ensure the auto-expand eligibility check is consistent.🔎 Proposed addition
decision = decider.canAllocate(replica1all, replicaOnlyNode, allocation); assertEquals(Decision.Type.NO, decision.type()); + + // Verify shouldAutoExpandToNode also rejects non-0-all settings + IndexMetadata index05 = testState.metadata().index("auto-expand-0-5"); + IndexMetadata index1all = testState.metadata().index("auto-expand-1-all"); + DiscoveryNode replicaOnlyDiscoveryNode = testState.nodes().get("replica-only-node"); + + decision = decider.shouldAutoExpandToNode(index05, replicaOnlyDiscoveryNode, allocation); + assertEquals(Decision.Type.NO, decision.type()); + + decision = decider.shouldAutoExpandToNode(index1all, replicaOnlyDiscoveryNode, allocation); + assertEquals(Decision.Type.NO, decision.type()); } }
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
server/src/main/java/org/opensearch/cluster/ClusterModule.java(2 hunks)server/src/main/java/org/opensearch/cluster/node/DiscoveryNode.java(2 hunks)server/src/main/java/org/opensearch/cluster/node/DiscoveryNodeRole.java(1 hunks)server/src/main/java/org/opensearch/cluster/routing/RoutingNodes.java(3 hunks)server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/LocalShardsBalancer.java(1 hunks)server/src/main/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDecider.java(1 hunks)server/src/test/java/org/opensearch/cluster/node/DiscoveryNodeRoleTests.java(1 hunks)server/src/test/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDeciderTests.java(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
server/src/test/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDeciderTests.java (2)
test/framework/src/main/java/org/opensearch/cluster/OpenSearchAllocationTestCase.java (1)
OpenSearchAllocationTestCase(76-543)server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java (1)
BalancedShardsAllocator(93-876)
server/src/main/java/org/opensearch/cluster/ClusterModule.java (1)
server/src/main/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDecider.java (1)
ReplicaOnlyAllocationDecider(29-140)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: gradle-check
- GitHub Check: Mend Security Check
🔇 Additional comments (12)
server/src/main/java/org/opensearch/cluster/node/DiscoveryNode.java (2)
133-135: LGTM!The static helper follows the existing pattern used for other roles like
isWarmNodeandisIngestNode.
549-556: LGTM!The instance method follows the established pattern for role checks (e.g.,
isSearchNode,isWarmNode) with consistent Javadoc formatting.server/src/main/java/org/opensearch/cluster/node/DiscoveryNodeRole.java (2)
340-373: LGTM!The
REPLICA_ONLY_ROLEdefinition is well-documented and follows the established pattern for exclusive roles (similar toSEARCH_ROLE). The validation logic correctly enforces that replica_only must be the only role on a node.
378-382: LGTM!The role is correctly added to
BUILT_IN_ROLES. The sorted set will maintain proper ordering by role name.server/src/main/java/org/opensearch/cluster/routing/allocation/allocator/LocalShardsBalancer.java (1)
768-775: LGTM!Correctly excludes replica-only nodes from the rebalancing model, which ensures that adding/removing these nodes doesn't trigger rebalancing of shards on regular data nodes. The trace log is helpful for debugging allocation decisions.
server/src/main/java/org/opensearch/cluster/routing/RoutingNodes.java (1)
89-90: LGTM!Logger addition follows the standard pattern used in other classes.
server/src/main/java/org/opensearch/cluster/ClusterModule.java (1)
400-400: LGTM!The
ReplicaOnlyAllocationDecideris correctly registered in the allocation deciders pipeline.server/src/main/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDecider.java (2)
29-56: LGTM on the core structure and force allocation blocking.The decider correctly:
- Delegates
canAllocateandcanRemainto a shared helper for consistency- Blocks primary shard allocation to replica-only nodes
- Blocks force allocation of primaries to replica-only nodes (critical safety measure)
- Provides informative decision messages for debugging
67-69: The suggested fix cannot be implemented:getMinReplicas()is a private method.The fragility concern about using
autoExpandReplicas.toString().startsWith("0-")is valid. However, the proposed fix to useautoExpandReplicas.getMinReplicas() == 0is not viable—thegetMinReplicas()method is private and cannot be called fromReplicaOnlyAllocationDecider.To address the fragility concern, either:
- Make
getMinReplicas()public inAutoExpandReplicas, then use the suggested check, or- Add a public helper method to
AutoExpandReplicas(e.g.,isMinReplicasZero()) that encapsulates this logic.Likely an incorrect or invalid review comment.
server/src/test/java/org/opensearch/cluster/routing/allocation/decider/ReplicaOnlyAllocationDeciderTests.java (1)
90-156: Good test structure with thorough assertions.The tests properly verify both the decision type and the decision message content. Enabling
debugDecision(true)ensures meaningful assertion failure messages. The coverage of primary/replica allocation scenarios for both node types is comprehensive.server/src/test/java/org/opensearch/cluster/node/DiscoveryNodeRoleTests.java (2)
163-187: Thorough validation of the dedicated role constraint.The test correctly verifies that
replica_onlycannot coexist withDATA_ROLE,CLUSTER_MANAGER_ROLE, orINGEST_ROLE, and that it passes validation when specified alone. The assertion messages are clear and will help diagnose failures.
189-198: LGTM on role property and built-in roles tests.These tests provide essential coverage for:
- Role name (
replica_only) and abbreviation (ro)- Data-containing capability (
canContainData()returnstrue)- No legacy setting (expected for new roles)
- Presence in
BUILT_IN_ROLESset
|
|
||
| // CRITICAL: Never promote replicas on replica-only nodes | ||
| RoutingNode routingNode = node(activeReplica.currentNodeId()); | ||
| if (routingNode != null && routingNode.node().isReplicaOnlyNode()) { | ||
| logger.warn( | ||
| "Cannot promote replica shard [{}] to primary on replica-only node [{}]. " | ||
| + "Shard will remain as replica. Primary must be allocated to a regular data node.", | ||
| activeReplica.shardId(), | ||
| routingNode.nodeId() | ||
| ); | ||
| // Do NOT call promoteActiveReplicaShardToPrimary - just return | ||
| // The primary will remain unassigned, triggering allocation to a data node | ||
| return; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: Early return leaves cluster without a primary shard.
The guard returns early without promoting the replica, but at this point movePrimaryToUnassignedAndDemoteToReplica() has already been called in unassignPrimaryAndPromoteActiveReplicaIfExists(). This leaves the shard group with no primary:
- Original primary → demoted to unassigned replica
- Active replica on replica-only node → remains a replica (not promoted)
- Result: No primary exists for this shard
The replica-only check should occur before the primary is demoted, ideally in unassignPrimaryAndPromoteActiveReplicaIfExists() when selecting the active replica. The replica selection logic should skip replicas on replica-only nodes.
🔎 Suggested approach
Modify the replica selection in unassignPrimaryAndPromoteActiveReplicaIfExists() to exclude replicas on replica-only nodes:
private void unassignPrimaryAndPromoteActiveReplicaIfExists(
ShardRouting failedShard,
UnassignedInfo unassignedInfo,
RoutingChangesObserver routingChangesObserver
) {
assert failedShard.primary();
ShardRouting activeReplica = null;
if (isMigratingToRemoteStore(metadata)) {
activeReplica = activeReplicaOnRemoteNode(failedShard.shardId());
+ // Skip if on replica-only node
+ if (activeReplica != null && isReplicaOnReplicaOnlyNode(activeReplica)) {
+ activeReplica = null;
+ }
}
if (activeReplica == null) {
if (metadata.isSegmentReplicationEnabled(failedShard.getIndexName())) {
activeReplica = activeReplicaWithOldestVersion(failedShard.shardId());
} else {
activeReplica = activeReplicaWithHighestVersion(failedShard.shardId());
}
+ // Skip if on replica-only node
+ if (activeReplica != null && isReplicaOnReplicaOnlyNode(activeReplica)) {
+ activeReplica = null;
+ }
}
// ... rest of method
}
+
+private boolean isReplicaOnReplicaOnlyNode(ShardRouting replica) {
+ RoutingNode rn = node(replica.currentNodeId());
+ return rn != null && rn.node().isReplicaOnlyNode();
+}Alternatively, filter replica-only nodes in activeReplicaWithHighestVersion() and activeReplicaWithOldestVersion().
Committable suggestion skipped: line range outside the PR's diff.
|
❌ Gradle check result for 9d31c2c: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
|
Thanks for this PR @synhershko The use case of scaling read capacity independently is definitely valuable. I wanted to point out that OpenSearch recently introduced Search Replicas as part of the reader/writer separation feature, which addresses a similar problem. This feature uses:
Related Docs link:
Have you tried this search replicas feature? I want to get some idea on how does this Thank you |
This commit introduces a new
replica_onlynode role that provides read scalability without impacting the write path or triggering cluster rebalancing, enabling cost-effective horizontal scaling for read-heavy workloads.Motivation
Organizations often need to scale read capacity independently from write capacity. Traditional approaches of adding more data nodes cause:
The replica-only node type addresses these challenges by:
Why Auto-Expand Replicas 0-all Only
The replica-only node type exclusively supports indices with
index.auto_expand_replicas: 0-allfor several critical reasons:Dynamic Replica Management: Auto-expand automatically adjusts replica counts when replica-only nodes join/leave, eliminating manual intervention and preventing under-replication
No Manual Rebalancing: Without auto-expand, adding replica-only nodes would require manual replica count adjustments and could trigger rebalancing on data nodes
Predictable Behavior: The 0-all setting guarantees one copy per eligible node, making replica distribution deterministic and transparent
Operational Safety: Prevents accidental allocation of critical production indices to nodes that may be ephemeral or have different SLAs
High-Level Design
Core Components
New Node Role (DiscoveryNodeRole.REPLICA_ONLY_ROLE)
replica_onlyAllocation Decider (ReplicaOnlyAllocationDecider)
Design Concerns Addressed
1. Data Integrity and Cluster Health
What happens if primary fails and only replica-only nodes have copies?
Replicas on replica-only nodes NEVER promote to primaries. The cluster enters YELLOW/RED state and waits for a regular data node. This prevents data loss scenarios where an ephemeral node becomes the source of truth.
2. Rebalancing Isolation
Will replica-only nodes trigger rebalancing on production data nodes?
Replica-only nodes are completely excluded from the BalancedShardsAllocator model. They are invisible to the balancer, ensuring zero rebalancing impact when nodes join/leave.
3. Role Transitions
What happens if a data node transitions to replica-only role?
4. Recovery Code Paths
Could recovery logic accidentally create primaries on replica-only nodes?
No, via multiple layers of protection:
5. Auto-Expand Node Counting
How do replica-only nodes integrate with auto-expand replica counting?
shouldAutoExpandToNode() in ReplicaOnlyAllocationDecider is automatically called by
AutoExpandReplicas.getDesiredNumberOfReplicas().
Replica-only nodes are counted only for 0-all indices, ensuring correct replica counts.
Configuration Example
Node configuration:
Index configuration:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.