Make pods wait for user code containers to be actual ready #491
Draft
simeoncarstens wants to merge 10 commits into main
This is a WIP attempt to solve #386. The current strategy is to implement a startup or readiness probe that makes sure the user code containers are ready, meaning that the gRPC services for log-prob / gradient can respond in a timely manner. The probe is currently a startup probe, but a readiness probe is probably the appropriate one.
Once the probe succeeds, that container is deemed ready / started, and the pod can be considered ready.
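To make the probe idea concrete, here is a minimal Python sketch of a check that an exec-style probe could run. It is not the code in this PR: the function names are hypothetical, and it only tests whether the port accepts TCP connections, which is a weaker condition than "the gRPC service answers log-prob / gradient requests in time". A real probe would more likely issue an actual gRPC call.

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    Assumption: an open port is used here as a stand-in for "the gRPC server is up".
    It does not prove that the service can answer requests in a timely manner.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe_exit_code(host: str, port: int, timeout: float = 1.0) -> int:
    """Map the check to a process exit code: 0 means ready, 1 means not ready."""
    return 0 if port_accepts_connections(host, port, timeout) else 1
```

A small wrapper script would pass the result of `probe_exit_code` to `sys.exit`, since Kubernetes treats exit code 0 of an exec probe command as success and any other exit code as failure.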
One pitfall is that the controller pod may also run a user code container that is actually used in the calculation. So the controller container should only start sending out sampling requests once all user code containers in the other pods are ready and the user code container in the controller pod is ready, too.
I'm not sure whether a startup probe is enough, or whether we need an init container on the controller pod that lets the controller pod's containers start only once all other pods are ready.
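Independently of which Kubernetes mechanism ends up being used, the gating condition itself can be sketched in Python as follows. This is an illustration under assumptions, not the controller's actual code: `wait_until_all_ready` and the check names are hypothetical, and each check stands for whatever readiness test is chosen for one user code container, including the one in the controller pod.

```python
import time
from typing import Callable, Mapping


def wait_until_all_ready(
    checks: Mapping[str, Callable[[], bool]],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> bool:
    """Poll every readiness check until all have passed or `timeout` seconds elapse.

    `checks` maps a (hypothetical) container name to a zero-argument callable
    that returns True once that user code container is ready.
    A check that has passed once is not polled again.
    """
    deadline = time.monotonic() + timeout
    pending = set(checks)
    while True:
        # Keep only the containers whose check still fails.
        pending = {name for name in pending if not checks[name]()}
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In this sketch the controller would call the function with one check per remote user code container plus one for its own local container, and would start sending sampling requests only if it returns `True`.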