feat(ray): Add actor resilience with zombie detection and lease recovery #1408

Draft
jioffe502 wants to merge 1 commit into NVIDIA:main from jioffe502:actor-resilience
@jioffe502
Collaborator

Description

Adds defense-in-depth so the pipeline self-heals when an actor dies (OOM, native crash, segfault) instead of silently black-holing jobs and hanging:

  • Zombie detection: stat collector quarantines actors with dead processing threads (~30s detection)
  • Lease-based job recovery: source claims leases on dequeued jobs; sink acks atomically; expired leases are requeued to healthy actors
  • Readiness verification: scale-up retries actors that fail admission checks instead of admitting broken actors to the scheduling pool
  • Min-replica enforcement: quarantined actors are replaced even with dynamic scaling disabled
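
To illustrate the lease-based recovery flow listed above, here is a minimal in-memory sketch. It is not the code in this PR: `LeaseQueue`, `claim`, `ack`, and `sweep_expired` are hypothetical names, the 30-second lease is an arbitrary value, and a real implementation would need a lock or a transactional store for the ack to be atomic across processes.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple


class LeaseQueue:
    """Toy single-process job queue with leases (hypothetical, for illustration)."""

    def __init__(self, lease_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._pending: Deque[Tuple[str, object]] = deque()
        self._leases: Dict[str, Tuple[float, object]] = {}  # job_id -> (deadline, payload)
        self.results: Dict[str, object] = {}
        self._lease_seconds = lease_seconds
        self._clock = clock

    def put(self, job_id: str, payload: object) -> None:
        self._pending.append((job_id, payload))

    def claim(self) -> Optional[Tuple[str, object]]:
        """Source side: dequeue a job and record a lease with a deadline."""
        if not self._pending:
            return None
        job_id, payload = self._pending.popleft()
        self._leases[job_id] = (self._clock() + self._lease_seconds, payload)
        return job_id, payload

    def ack(self, job_id: str, result: object) -> bool:
        """Sink side: store the result and release the lease in one step.

        Returns False if no lease is held, for example because the lease
        expired and the job was already requeued.
        """
        if job_id not in self._leases:
            return False
        del self._leases[job_id]
        self.results[job_id] = result
        return True

    def sweep_expired(self) -> List[str]:
        """Requeue every job whose lease deadline has passed."""
        now = self._clock()
        expired = [jid for jid, (deadline, _) in self._leases.items()
                   if deadline <= now]
        for jid in expired:
            _, payload = self._leases.pop(jid)
            self._pending.append((jid, payload))
        return expired


# Usage with a fake clock so the lease expiry is deterministic.
now = [0.0]
queue = LeaseQueue(lease_seconds=30.0, clock=lambda: now[0])
queue.put("job-1", {"doc": "a.pdf"})

claimed = queue.claim()            # the first actor takes the job, then dies
now[0] = 31.0                      # the lease deadline passes with no ack
requeued = queue.sweep_expired()   # the sweeper puts the job back in the queue
reclaimed = queue.claim()          # a healthy actor picks it up
acked = queue.ack("job-1", "done")
late_ack = queue.ack("job-1", "stale")  # a second ack finds no lease
```

Rejecting an ack that has no matching lease is one way to keep a slow or revived actor from overwriting a result after its job was handed to another actor. Whether the PR handles late acks this way is not stated in the description.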

Discovered while debugging Ray zombie actors in the regression that #1379 fixes.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

Adds defense-in-depth for actor failures:

- Zombie detection: stat collector detects dead processing threads and
  quarantines actors after 3 consecutive unhealthy cycles
- Health probes: actors expose processing_thread_alive in get_stats()
  and a lightweight health_probe() method
- Readiness verification: scale-up retries actors that fail readiness
  checks instead of admitting broken actors to the scheduling pool
- Lease-based job recovery: source claims leases on dequeued jobs;
  sink atomically pushes results and acks leases; expired leases are
  periodically swept and requeued
- Min-replica enforcement: _reconcile_min_replicas runs even when
  dynamic scaling is disabled to replace quarantined actors
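
The quarantine rule in the first bullet can be sketched as a per-actor strike counter. This is an illustrative sketch rather than the stat collector from this commit: `ZombieDetector` and `observe` are hypothetical names, the stats are plain dicts shaped like the `get_stats()` output mentioned above, and treating a missing `processing_thread_alive` key as unhealthy is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ZombieDetector:
    """Quarantine actors after N consecutive unhealthy cycles (hypothetical)."""

    threshold: int = 3  # "3 consecutive unhealthy cycles" per the commit message
    strikes: Dict[str, int] = field(default_factory=dict)
    quarantined: Set[str] = field(default_factory=set)

    def observe(self, stats_by_actor: Dict[str, dict]) -> List[str]:
        """Process one collection cycle; return actors quarantined in this cycle."""
        newly_quarantined: List[str] = []
        for actor_id, stats in stats_by_actor.items():
            if actor_id in self.quarantined:
                continue
            # Assumption: a missing key counts as unhealthy.
            if stats.get("processing_thread_alive", False):
                self.strikes[actor_id] = 0  # a healthy cycle resets the streak
                continue
            self.strikes[actor_id] = self.strikes.get(actor_id, 0) + 1
            if self.strikes[actor_id] >= self.threshold:
                self.quarantined.add(actor_id)
                newly_quarantined.append(actor_id)
        return newly_quarantined


# Usage: actor-a has a dead processing thread, actor-b stays healthy.
detector = ZombieDetector()
cycle = {
    "actor-a": {"processing_thread_alive": False},
    "actor-b": {"processing_thread_alive": True},
}
first = detector.observe(cycle)
second = detector.observe(cycle)
third = detector.observe(cycle)  # third consecutive unhealthy cycle for actor-a
```

Requiring consecutive unhealthy cycles, and resetting the count on a healthy one, avoids quarantining an actor over a single missed or delayed stats report.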

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
@jioffe502 jioffe502 requested a review from a team as a code owner February 18, 2026 22:32
@jioffe502 jioffe502 requested a review from ChrisJar February 18, 2026 22:32
@jioffe502 jioffe502 marked this pull request as draft February 18, 2026 22:32