@agparadiso agparadiso commented Jan 2, 2026

ShardOrchestrator gRPC Service Implementation

Implements the ShardOrchestrator gRPC service that exposes workflow-to-shard routing decisions from Ring OCR consensus to other shards in the system.

Changes

  • Proto Definition Location
    • Moved shard_orchestrator.proto from ring/pb/ to shardorchestrator/pb/
      Reason: Proto should live with its service implementation, not in the Ring OCR package
  • Service Implementation
    • Store: Thread-safe in-memory store with version tracking for cache invalidation
    • gRPC Server: Implements batch workflow mapping queries and shard registration reporting
    • Test Helpers (test_adapter.go): Test utilities for simulating Ring OCR updates
  • Data Flow
    Ring OCR consensus → Ring Transmitter → Dual update:
    • Ring's internal store (for consensus logic)
    • ShardOrchestrator store (for gRPC exposure to other shards)
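The dual update described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `mappingSink`, `memSink`, and `transmit` are hypothetical names, and aborting when the Ring store update fails is an assumption. Only the best-effort handling of the ShardOrchestrator store is taken from the review discussion further down.

```go
package main

import (
	"errors"
	"fmt"
)

// mappingSink is a hypothetical interface standing in for either store.
type mappingSink interface {
	Update(routes map[string]uint32) error
}

// memSink is a toy in-memory sink used only for this illustration.
type memSink struct {
	routes map[string]uint32
	fail   bool // when true, Update returns an error
}

func (m *memSink) Update(routes map[string]uint32) error {
	if m.fail {
		return errors.New("sink unavailable")
	}
	if m.routes == nil {
		m.routes = make(map[string]uint32)
	}
	for id, shard := range routes {
		m.routes[id] = shard
	}
	return nil
}

// transmit applies one agreed routing outcome to both stores.
// Assumption: a Ring store failure aborts the transmission, while a failure
// in the exposure store is only reported through warn.
func transmit(ring, exposed mappingSink, routes map[string]uint32, warn func(error)) error {
	if err := ring.Update(routes); err != nil {
		return fmt.Errorf("ring store update: %w", err)
	}
	if err := exposed.Update(routes); err != nil {
		warn(err) // best effort: do not fail the transmission
	}
	return nil
}

func main() {
	ring := &memSink{}
	exposed := &memSink{fail: true} // simulate an exposure-store outage
	err := transmit(ring, exposed, map[string]uint32{"wf-1": 2}, func(e error) {
		fmt.Println("warning:", e)
	})
	fmt.Println("transmit error:", err)
	fmt.Println("ring routes:", ring.routes)
}
```

In this sketch the Ring store stays authoritative: it is written first, and nothing that happens in the second write can change the outcome of the transmission.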

CRE-1601

Requires

Supports

github-actions bot commented Jan 2, 2026

⚠️ API Diff Results - Breaking changes detected

📦 Module: github-com-smartcontractkit-chainlink-common

🔴 Breaking Changes (17)

pkg/workflows/ring (2)
  • NewFactory — Type changed:
func(
  *Store, 
  + *github.com/smartcontractkit/chainlink-common/pkg/workflows/shardorchestrator.Store, 
  github.com/smartcontractkit/chainlink-common/pkg/workflows/ring/pb.ArbiterScalerClient, 
  github.com/smartcontractkit/chainlink-common/pkg/logger.Logger, 
  *ConsensusConfig
)
(*Factory, error)
  • NewTransmitter — Type changed:
func(
  github.com/smartcontractkit/chainlink-common/pkg/logger.Logger, 
  *Store, 
  + *github.com/smartcontractkit/chainlink-common/pkg/workflows/shardorchestrator.Store, 
  github.com/smartcontractkit/chainlink-common/pkg/workflows/ring/pb.ArbiterScalerClient, 
  github.com/smartcontractkit/libocr/offchainreporting2plus/types.Account
)
*Transmitter
pkg/workflows/ring/pb (15)
  • File_shard_orchestrator_proto — 🗑️ Removed

  • GetWorkflowShardMappingRequest — 🗑️ Removed

  • GetWorkflowShardMappingResponse — 🗑️ Removed

  • NewShardOrchestratorServiceClient — 🗑️ Removed

  • RegisterShardOrchestratorServiceServer — 🗑️ Removed

  • ReportWorkflowTriggerRegistrationRequest — 🗑️ Removed

  • ReportWorkflowTriggerRegistrationResponse — 🗑️ Removed

  • ShardOrchestratorService_GetWorkflowShardMapping_FullMethodName — 🗑️ Removed

  • ShardOrchestratorService_ReportWorkflowTriggerRegistration_FullMethodName — 🗑️ Removed

  • ShardOrchestratorService_ServiceDesc — 🗑️ Removed

  • ShardOrchestratorServiceClient — 🗑️ Removed

  • ShardOrchestratorServiceServer — 🗑️ Removed

  • UnimplementedShardOrchestratorServiceServer — 🗑️ Removed

  • UnsafeShardOrchestratorServiceServer — 🗑️ Removed

  • WorkflowMappingState — 🗑️ Removed


📄 View full apidiff report

@agparadiso agparadiso force-pushed the cre-1601-shard-orchestrator-implementation branch from 3d888b8 to aecf0cb Compare January 2, 2026 16:20
@agparadiso agparadiso force-pushed the cre-1601-shard-orchestrator-implementation branch from aecf0cb to e193376 Compare January 2, 2026 16:21
Base automatically changed from cre-1601-shard-orchestrator-plugin to main January 5, 2026 08:51
@agparadiso agparadiso force-pushed the cre-1601-shard-orchestrator-implementation branch from bcdcbcb to 4526ebc Compare January 5, 2026 09:51
@agparadiso agparadiso force-pushed the cre-1601-shard-orchestrator-implementation branch from 4526ebc to 273098b Compare January 5, 2026 10:19
@agparadiso agparadiso force-pushed the cre-1601-shard-orchestrator-implementation branch from 273098b to f78484e Compare January 5, 2026 10:21
@agparadiso agparadiso changed the title from "wip shard-orchestrator" to "[CRE-1601] shard-orchestrator implementation" Jan 5, 2026
@agparadiso agparadiso marked this pull request as ready for review January 7, 2026 14:37
@agparadiso agparadiso requested a review from a team as a code owner January 7, 2026 14:37
Copilot AI review requested due to automatic review settings January 7, 2026 14:37
Copilot AI left a comment

Pull request overview

This PR implements the ShardOrchestrator gRPC service that exposes workflow-to-shard routing decisions from Ring OCR consensus to other shards in the system. The implementation moves the proto definition from the Ring package to a dedicated shardorchestrator package and establishes a dual-update pattern where Ring's transmitter updates both its internal store and the ShardOrchestrator store for external consumption.

Key changes:

  • Moved proto definitions and generated code to shardorchestrator/pb/ package for better separation of concerns
  • Implemented thread-safe in-memory store with version tracking for cache invalidation
  • Created gRPC server and client implementations for cross-shard communication
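A thread-safe in-memory store with version tracking, as listed above, might look something like the sketch below. The type and method names (`versionedStore`, `BatchUpdate`, `Get`, `ChangedSince`) are hypothetical and are not the API in store.go. The sketch assumes the version is a counter bumped once per batch, which readers compare against the version they last saw to decide whether a cached view is stale.

```go
package main

import (
	"fmt"
	"sync"
)

// versionedStore is a hypothetical stand-in for a workflow-to-shard store.
type versionedStore struct {
	mu      sync.RWMutex
	version uint64            // bumped once per batch update
	shards  map[string]uint32 // workflow ID -> shard ID
}

func newVersionedStore() *versionedStore {
	return &versionedStore{shards: make(map[string]uint32)}
}

// BatchUpdate applies all mappings under one write lock and returns the new version.
func (s *versionedStore) BatchUpdate(updates map[string]uint32) uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, shard := range updates {
		s.shards[id] = shard
	}
	s.version++
	return s.version
}

// Get returns the shard for a workflow and whether a mapping exists.
func (s *versionedStore) Get(id string) (uint32, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	shard, ok := s.shards[id]
	return shard, ok
}

// ChangedSince reports whether the store has moved past the caller's cached version.
func (s *versionedStore) ChangedSince(seen uint64) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.version > seen
}

func main() {
	s := newVersionedStore()
	v := s.BatchUpdate(map[string]uint32{"wf-1": 0, "wf-2": 1})
	shard, ok := s.Get("wf-2")
	fmt.Println(v, shard, ok, s.ChangedSince(v))
}
```

Bumping the version once per batch rather than once per workflow keeps a batch atomic from a reader's point of view: a reader sees either the whole batch or none of it.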

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pkg/workflows/shardorchestrator/store.go | New in-memory store managing workflow-to-shard mappings with version tracking |
| pkg/workflows/shardorchestrator/service.go | gRPC server implementation exposing workflow mappings and accepting shard registrations |
| pkg/workflows/shardorchestrator/client.go | gRPC client for querying workflow mappings from shard 0 |
| pkg/workflows/shardorchestrator/pb/*.proto | Relocated proto definitions from ring package with timestamp fields removed |
| pkg/workflows/ring/transmitter.go | Updated to perform dual updates to both ring store and orchestrator store |
| pkg/workflows/ring/factory.go | Updated constructor to accept orchestrator store parameter |
| pkg/workflows/shardorchestrator/*_test.go | Comprehensive test coverage for store, service, client, and integration scenarios |

existingMapping, err := t.shardOrchestratorStore.GetWorkflowMapping(ctx, workflowID)
if err != nil {
	// New workflow - no previous assignment
Copilot AI Jan 7, 2026

The comment "New workflow - no previous assignment" is misleading because the error could represent any retrieval failure, not just missing workflows. Consider checking specifically for "not found" errors versus other error types, or update the comment to reflect that any error is treated as a new workflow.

Suggested change
- // New workflow - no previous assignment
+ // Treat any retrieval error as no previous assignment (new workflow or store failure)

Contributor Author

Don't agree: the GetWorkflowMapping implementation only fails if the workflowID is not present in the map, so the comment is correct.


if err := t.shardOrchestratorStore.BatchUpdateWorkflowMappings(ctx, mappings); err != nil {
	t.lggr.Errorw("failed to update ShardOrchestrator store", "err", err, "workflowCount", len(mappings))
	// Don't fail the entire transmission if ShardOrchestrator update fails
Copilot AI Jan 7, 2026

The comment "Don't fail the entire transmission" explains the behavior but doesn't explain the rationale. Consider documenting why it's safe to continue despite this failure, such as "ShardOrchestrator updates are for external consumption only and don't affect Ring's internal consensus state."

Suggested change
- // Don't fail the entire transmission if ShardOrchestrator update fails
+ // Don't fail the entire transmission if ShardOrchestrator update fails.
+ // The Ring store is the authoritative source of routing state and workflow mappings.
+ // ShardOrchestrator updates are best-effort, external-facing metadata that do not
+ // participate in Ring's internal consensus or routing decisions, so failing here
+ // would unnecessarily impact consensus progress without improving safety.

Contributor Author

Don't agree, it's too verbose. The original comment explains why we don't return, and it makes clear this is intentional.

@agparadiso agparadiso requested a review from mchain0 January 7, 2026 14:37
mchain0 previously approved these changes Jan 7, 2026
@mchain0 mchain0 left a comment

Please just fix the suggestions from Copilot.

// Determine if system is in transition state
systemInTransition := false
if outcome.State != nil {
	if _, ok := outcome.State.State.(*pb.RoutingState_Transition); ok {
Contributor

We should log a warning if the conversion fails

Contributor Author

It's not converting it; it's a type assertion that is basically asking: "Is the state of type *pb.RoutingState_Transition? If so, I'll mark systemInTransition = true."
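The point about type assertions can be illustrated with a small standalone sketch. The types below (`isState`, `stateSteady`, `stateTransition`) are hand-written stand-ins for the wrapper types that protobuf generates for a oneof field; they are not the real `pb` types. The comma-ok form of a type assertion never panics and performs no conversion, so a false result simply means "some other variant (or nil)", not a failure worth logging.

```go
package main

import "fmt"

// isState mimics the unexported interface generated for a protobuf oneof.
type isState interface{ isState() }

// stateSteady and stateTransition stand in for the generated wrapper types.
type stateSteady struct{}
type stateTransition struct{ wantShards uint32 }

func (*stateSteady) isState()     {}
func (*stateTransition) isState() {}

// inTransition reports whether the oneof currently holds the transition variant.
// With the comma-ok form, a nil or different variant yields ok == false.
func inTransition(s isState) bool {
	_, ok := s.(*stateTransition)
	return ok
}

func main() {
	fmt.Println(inTransition(&stateTransition{wantShards: 4}))
	fmt.Println(inTransition(&stateSteady{}))
	fmt.Println(inTransition(nil))
}
```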

george-dorin previously approved these changes Jan 7, 2026
@mchain0 mchain0 enabled auto-merge January 8, 2026 09:59
@mchain0 mchain0 added this pull request to the merge queue Jan 8, 2026
Merged via the queue into main with commit 7044bd4 Jan 8, 2026
34 of 36 checks passed
@mchain0 mchain0 deleted the cre-1601-shard-orchestrator-implementation branch January 8, 2026 10:16
WorkflowID string
OldShardID uint32
NewShardID uint32
TransitionState TransitionState
Contributor

I'm late to the game here but why are we storing the state separately for each workflow? The whole system is either in steady or transitioning state.
