[Feature]: Implement Checkpointing for Fault Tolerance

## Summary

Add checkpointing capabilities to Cortex.Streams to enable fault-tolerant stream processing with exactly-once or at-least-once processing semantics, allowing streams to recover from failures without data loss.

## Problem Statement

Currently, if a Cortex stream fails or is restarted:

- **All in-flight data is lost**: Events being processed are not recoverable
- **Window state is lost**: Aggregations in progress are discarded
- **Source offsets are not tracked**: Cannot resume from where processing stopped
- **State stores may be inconsistent**: Partial updates may have been applied
- **No way to upgrade**: Cannot update stream logic without losing state

### Current Behavior

```csharp
var stream = StreamBuilder<Order>.CreateNewStream("OrderProcessor")
    .Stream(kafkaSource)
    .Aggregate(
        keySelector: o => o.CustomerId,
        aggregateFunction: (acc, order) => acc + order.Amount,
        stateStore: rocksDbStore)
    .Sink(SendAlert)
    .Build();

await stream.StartAsync();

// If the application crashes here...
// - Kafka offset is unknown (will re-read from beginning or miss messages)
// - RocksDB state may have partial aggregations
// - No way to recover to a consistent state
```
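The failure modes in the comments above share one root cause: the source offset and the aggregate state are not persisted as a unit. The following is a minimal in-memory sketch of that idea, not a proposed implementation. `Checkpoint` and `CheckpointedAggregator` are hypothetical names for illustration and are not existing Cortex.Streams types.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration only; none of these exist in Cortex.Streams.
public sealed record Checkpoint(long NextOffset, IReadOnlyDictionary<string, decimal> Totals);

public sealed class CheckpointedAggregator
{
    private Dictionary<string, decimal> _totals = new();

    // Offset of the next event to read from the source.
    public long NextOffset { get; private set; }

    // Apply one event and advance the read position past it.
    public void Process(string customerId, decimal amount)
    {
        _totals[customerId] = _totals.GetValueOrDefault(customerId) + amount;
        NextOffset++;
    }

    // Capture the offset and a copy of the state as one unit.
    public Checkpoint Snapshot() =>
        new(NextOffset, new Dictionary<string, decimal>(_totals));

    // Discard anything applied after the checkpoint and rewind the offset,
    // so the source can replay from checkpoint.NextOffset.
    public void Restore(Checkpoint checkpoint)
    {
        _totals = new Dictionary<string, decimal>(checkpoint.Totals);
        NextOffset = checkpoint.NextOffset;
    }

    public decimal TotalFor(string customerId) => _totals.GetValueOrDefault(customerId);
}
```

After `Restore`, events past the checkpointed offset are replayed and applied again to the restored state, which is at-least-once from the sink's point of view. Exactly-once additionally requires the sink to deduplicate or participate in a transaction.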

### Impact

Without checkpointing:
- Production deployments risk data loss on failures
- Cannot guarantee exactly-once processing
- Rolling upgrades require manual state migration
- No disaster recovery capability

## Acceptance Criteria

- [ ] Streams can be configured with checkpointing enabled
- [ ] Checkpoints are automatically created at configured intervals
- [ ] Streams automatically restore from latest checkpoint on startup
- [ ] Source offsets are captured and restored correctly
- [ ] Operator state is captured and restored correctly
- [ ] Old checkpoints are automatically cleaned up
- [ ] Manual checkpoint triggering is supported
- [ ] Restoration from specific checkpoint is supported
- [ ] At-least-once semantics work correctly
- [ ] Exactly-once semantics work with transactional sinks
- [ ] Checkpoint operations emit telemetry/metrics
- [ ] Backward compatible: streams without checkpointing continue to work unchanged
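Three of the criteria above (restore from latest, restore from a specific checkpoint, and automatic cleanup) can be pictured with a small retention catalog. This is a sketch under stated assumptions: `CheckpointCatalog` and its members are hypothetical, checkpoints are held in memory as opaque byte arrays, and retention is by count only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical catalog for illustration; not an existing Cortex.Streams type.
public sealed class CheckpointCatalog
{
    // Keys are checkpoint ids in ascending order, so the first key is the oldest.
    private readonly SortedDictionary<long, byte[]> _checkpoints = new();
    private readonly int _maxRetained;
    private long _nextId = 1;

    public CheckpointCatalog(int maxRetained)
    {
        if (maxRetained < 1) throw new ArgumentOutOfRangeException(nameof(maxRetained));
        _maxRetained = maxRetained;
    }

    // Register a completed checkpoint, then prune the oldest ones beyond the limit.
    public long Add(byte[] payload)
    {
        long id = _nextId++;
        _checkpoints[id] = payload;
        while (_checkpoints.Count > _maxRetained)
            _checkpoints.Remove(_checkpoints.Keys.First());
        return id;
    }

    // Id to restore from on startup, or null when no checkpoint exists yet.
    public long? LatestId =>
        _checkpoints.Count == 0 ? (long?)null : _checkpoints.Keys.Last();

    // Fetch a specific checkpoint; throws KeyNotFoundException if it was pruned.
    public byte[] Get(long id) => _checkpoints[id];

    public IReadOnlyList<long> RetainedIds => _checkpoints.Keys.ToList();
}
```

A real store would also need to prune only after the newer checkpoint is durably written, so that a failed write never leaves the stream without a restore point.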

## Technical Considerations

1. **Serialization**: Need efficient serialization for state. Consider using MessagePack or protobuf.

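2. **Checkpoint Barriers**: For exactly-once, barriers need to be injected into the stream at the sources, and each operator needs to snapshot its state when the barrier reaches it, so that all snapshots of one checkpoint reflect the same position in the input.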

3. **State Store Integration**: `IDataStore` implementations should support snapshotting. RocksDB natively supports both consistent read snapshots and on-disk checkpoints.

4. **Large State**: For large state stores, consider incremental checkpoints (only changed data).

5. **Async Snapshots**: Snapshots should be non-blocking to minimize latency impact.

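6. **Failure During Checkpoint**: If a checkpoint fails, the stream should continue processing and retry; the last successfully completed checkpoint should remain the restore point until a newer one completes.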

7. **Upgrade Compatibility**: Consider state schema versioning for upgrades.
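To make the barrier consideration (item 2) concrete, here is a minimal sketch of barrier alignment for an operator with several inputs, in the style described in the Flink checkpoint documentation referenced below. `AligningSumOperator` is a hypothetical type, not a Cortex.Streams API, and the sketch assumes at most one checkpoint is in flight at a time.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical operator for illustration; not an existing Cortex.Streams type.
// Assumption: at most one checkpoint is in flight at any time.
public sealed class AligningSumOperator
{
    private readonly int _inputCount;
    // Inputs that have already delivered the barrier for the current checkpoint.
    private readonly HashSet<int> _barrierSeen = new();
    // Records that arrived behind a barrier and belong to the next epoch.
    private readonly Queue<long> _held = new();

    public long Sum { get; private set; }
    public List<(long CheckpointId, long Sum)> Snapshots { get; } = new();

    public AligningSumOperator(int inputCount) => _inputCount = inputCount;

    public void OnRecord(int input, long value)
    {
        if (_barrierSeen.Contains(input))
            _held.Enqueue(value); // hold it: it must not leak into the snapshot
        else
            Sum += value;
    }

    public void OnBarrier(int input, long checkpointId)
    {
        _barrierSeen.Add(input);
        if (_barrierSeen.Count < _inputCount) return; // still waiting for other inputs

        // All inputs are aligned: the state reflects exactly the pre-barrier records.
        Snapshots.Add((checkpointId, Sum));
        _barrierSeen.Clear();

        // Release the held records into the next epoch.
        while (_held.Count > 0) Sum += _held.Dequeue();
    }
}
```

With two inputs, a record that arrives on input 0 after its barrier is held until input 1 delivers the same barrier, so the snapshot contains only pre-barrier records from both inputs. The cost is added latency on the faster input while it waits, which is one reason the async and incremental items above matter.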

## References

- [Apache Flink - Checkpoints](https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/state/checkpoints/)
- [Apache Flink - Savepoints](https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/state/savepoints/)
- [Kafka Streams - Exactly-Once Semantics](https://kafka.apache.org/documentation/streams/core-concepts#streams_processing_guarantee)
- [Chandy-Lamport Algorithm](https://en.wikipedia.org/wiki/Chandy%E2%80%93Lamport_algorithm)
