Improve Retry Policy + 408, 5xx Support #45134

@tvaron3

Summary

During fault-injection testing for the Cosmos DB Rust SDK, gaps and inconsistencies were identified in the current retry logic compared to other SDKs.

Current Behavior

  • No retries are performed for HTTP 408 (Request Timeout).
  • Retries for 500, 503, and 410 (substatus 1022) are limited:
    • Only a single retry is attempted.
    • Retries are not executed across all eligible regions.
  • 410 (substatus 1022) does not occur in gateway mode, but it is still handled explicitly.
  • Retry behavior differs from other SDKs (e.g., Python), which retry a broader set of transient errors across regions.

Proposed Changes

1. Retry by Category Instead of Narrow Allowlist

  • Retry based on error classes, not specific hand-picked status codes.
  • Treat the following as retriable by default:
    • All 5xx errors
    • 408
    • 410 (substatus 1022)
  • Maintain a blocklist of non-retriable errors instead.
  • This aligns with:
    • Python SDK behavior
    • Envoy-style retry semantics
  • Improves resilience to future infrastructure or status-code changes.
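
A minimal sketch of the category-based check described above, to make the proposal concrete. `StatusError`, `BLOCKLIST`, and `is_retriable` are hypothetical names, not part of the Cosmos DB Rust SDK, and the single blocklist entry (501) is a placeholder assumption since the issue does not say which errors the blocklist should contain.

```rust
/// Hypothetical error shape for illustration; not the SDK's actual type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StatusError {
    pub status: u16,
    pub sub_status: Option<u32>,
}

/// Non-retriable exceptions to the category rule.
/// ASSUMPTION: 501 is only a placeholder entry; the real list is undecided.
pub const BLOCKLIST: &[u16] = &[501];

/// Returns true when the error falls into a retriable category
/// (all 5xx, 408, or 410 with substatus 1022) and is not blocklisted.
pub fn is_retriable(err: StatusError) -> bool {
    if BLOCKLIST.contains(&err.status) {
        return false;
    }
    match (err.status, err.sub_status) {
        (500..=599, _) => true,
        (408, _) => true,
        (410, Some(1022)) => true,
        _ => false,
    }
}
```

With this shape, a new 5xx code introduced by the service would be retried without any SDK change, and opting a code out is a one-line addition to the blocklist.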

2. Retry Across All Eligible Regions

  • Retry across all applicable regions, excluding explicitly excluded ones.
  • Respect preferred region order.
  • Matches retry semantics in other Cosmos DB SDKs.
  • Improves availability under regional or transient faults.
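
The region behavior above could look roughly like the following sketch. `retry_regions` and `try_regions` are hypothetical helpers, not SDK functions, and the sketch assumes one attempt per region with no backoff, which the issue does not specify.

```rust
/// Builds the retry order: preferred regions in their given order,
/// minus any explicitly excluded region. Hypothetical helper.
pub fn retry_regions<'a>(preferred: &[&'a str], excluded: &[&str]) -> Vec<&'a str> {
    preferred
        .iter()
        .copied()
        .filter(|region| !excluded.iter().any(|ex| ex == region))
        .collect()
}

/// Tries each region in order until one succeeds or a non-retriable
/// error is returned. Returns `None` only when `regions` is empty.
/// ASSUMPTION: one attempt per region, no delay between attempts.
pub fn try_regions<T, E>(
    regions: &[&str],
    mut attempt: impl FnMut(&str) -> Result<T, E>,
    is_retriable: impl Fn(&E) -> bool,
) -> Option<Result<T, E>> {
    let mut last = None;
    for &region in regions {
        match attempt(region) {
            Ok(value) => return Some(Ok(value)),
            // Retriable failure: remember it and move on to the next region.
            Err(err) if is_retriable(&err) => last = Some(Err(err)),
            // Non-retriable failure: stop immediately.
            Err(err) => return Some(Err(err)),
        }
    }
    last
}
```

If every region fails with a retriable error, the caller receives the error from the last region tried.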

3. Different Defaults for Reads vs Writes

  • Reads:
    • Retry by default.
    • Minimal blocklist.
  • Writes:
    • More conservative.
    • Curated set of retriable status codes to avoid side effects.
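
One way to express the read/write split, sketched under stated assumptions. `Operation`, `READ_BLOCKLIST`, `WRITE_ALLOWLIST`, and `should_retry` are hypothetical names, and the entries in both lists are placeholders, because the contents of the write list are still an open question in this issue. Substatus handling (such as 410 with substatus 1022) is left out to keep the sketch short.

```rust
/// Hypothetical operation kind for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operation {
    Read,
    Write,
}

/// ASSUMPTION: placeholder entry; the real read blocklist is undecided.
pub const READ_BLOCKLIST: &[u16] = &[501];

/// ASSUMPTION: placeholder entry; the curated write list is undecided.
pub const WRITE_ALLOWLIST: &[u16] = &[503];

/// Reads retry by default on 408 and all 5xx, minus a small blocklist.
/// Writes retry only on an explicitly curated set of status codes.
pub fn should_retry(op: Operation, status: u16) -> bool {
    match op {
        Operation::Read => {
            let transient = status == 408 || (500..=599).contains(&status);
            transient && !READ_BLOCKLIST.contains(&status)
        }
        Operation::Write => WRITE_ALLOWLIST.contains(&status),
    }
}
```

The asymmetry is deliberate: a read is default-allow with exceptions, while a write is default-deny, so a new status code can never cause a write to be replayed without someone opting in.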

Motivation

  • Improves consistency across Cosmos DB SDKs.
  • Increases resilience to transient and regional failures.
  • Simplifies retry logic by listing only the non-retriable exceptions rather than enumerating every transient case.
  • Reduces the risk of missing new transient status codes.

Open Questions

  1. Reads vs Writes
    • Should reads and writes use different retry strategies by default?
    • What should the write retry blocklist include?
    • Any Rust SDK–specific execution model concerns?
