Fix Kubernetes readiness probe to wait for user code containers #492

Draft
Copilot wants to merge 3 commits into user-code-startup-probe from copilot/fix-6bc8b483-a84c-47da-84bb-60013a365033

Copilot AI commented Sep 1, 2025

Problem

The controller container reported ready before the user code containers had finished installing dependencies, so sampling requests were sent too early. This happened because:

  1. User code containers install dependencies via pip during startup, which can take several minutes
  2. The controller's Kubernetes readiness probe only checked the controller process status
  3. Kubernetes would route traffic to pods before user code servers could handle log probability evaluations within the required ~1s timeout

This resulted in failed sampling requests and degraded system performance.

Solution

Modified the controller's health check callback to wait for all user code containers to be ready before reporting the controller as ready.

Changes Made

Added check_user_code_server_ready() function:

  • Connects to user code server on localhost:50052 (same pod)
  • Uses gRPC health check with configurable timeout (default 1s)
  • Only returns True when server responds with SERVING status
  • Handles connection errors, timeouts, and various health statuses gracefully

Updated controller_state() callback:

  • First checks if controller process is running
  • Then checks if user code server is ready to handle requests
  • Only returns SERVING when both controller process AND user code server are ready
  • Returns NOT_SERVING while user code is still installing dependencies
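The two-stage check described above can be illustrated with a small, dependency-free sketch. The `ServingStatus` enum and the two injected callables are assumptions made for illustration; the actual callback in this PR presumably returns the gRPC health status type and calls the real checks directly.

```python
from enum import Enum
from typing import Callable


class ServingStatus(Enum):
    """Stand-in for the gRPC health statuses named in the description."""
    SERVING = "SERVING"
    NOT_SERVING = "NOT_SERVING"


def controller_state(
    controller_is_running: Callable[[], bool],  # hypothetical process check
    user_code_is_ready: Callable[[], bool],     # e.g. the user code health check
) -> ServingStatus:
    """Report SERVING only when both the controller and user code are ready."""
    # Check the controller first; if it is down, skip probing user code.
    if not controller_is_running():
        return ServingStatus.NOT_SERVING
    # The controller is up, but user code may still be installing dependencies.
    if not user_code_is_ready():
        return ServingStatus.NOT_SERVING
    return ServingStatus.SERVING
```

For example, `controller_state(lambda: True, lambda: False)` yields `ServingStatus.NOT_SERVING`, which corresponds to the dependency installation phase.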

Behavior

Before: Controller reports ready immediately → Kubernetes routes traffic → Sampling requests fail during pip install

After: Controller waits for user code → Reports ready only when log prob evaluations can complete in ~1s → No failed requests

Testing

Comprehensive testing confirms:

  • Controller reports NOT_SERVING during dependency installation phase
  • Controller reports SERVING only after user code containers are ready
  • Kubernetes readiness probe correctly prevents traffic routing until ready
  • All edge cases handled (timeouts, errors, connection failures)

This ensures the controller only starts receiving sampling requests when all user code containers can reliably respond within the expected timeframe.


Copilot AI and others added 2 commits September 1, 2025 09:39
Co-authored-by: simeoncarstens <9200049+simeoncarstens@users.noreply.github.com>
Co-authored-by: simeoncarstens <9200049+simeoncarstens@users.noreply.github.com>
Copilot AI changed the title from "[WIP] This branch contains an attempt of implementing a Kubernetes readiness probe. The problem is that the user code container sometimes installs dependencies, e.g. via pip, that take a while, but the controller container sometimes sends out sampling requ..." to "Fix Kubernetes readiness probe to wait for user code containers" Sep 1, 2025
Copilot AI requested a review from simeoncarstens September 1, 2025 09:44

2 participants