fix(trainer): make get_job_logs optionally fail fast when pod is missing #339
sjiang83 wants to merge 3 commits into kubeflow:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.

This pull-request has been approved by: (none yet). The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
🎉 Welcome to the Kubeflow SDK! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:
- Join the community
- Feel free to ask questions in the comments if you need any help or clarification!
Pull request overview
This PR addresses issue #317 where get_job_logs() silently returned an empty iterator when no pod was found for the requested step, providing no way to distinguish "logs not ready" from "pod not found." The fix adds an opt-in strict flag to get_job_logs() across all backends.
Changes:
- Adds a `strict: bool = False` parameter to `get_job_logs` in all backends and `TrainerClient`, raising `PodNotFoundError` when `strict=True` and no pod is found.
- Introduces a new `PodNotFoundError(RuntimeError)` class in the Kubernetes trainer backend.
- Adds a unit test verifying that `strict=True` raises `PodNotFoundError` when all steps are pending.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| kubeflow/trainer/backends/kubernetes/backend.py | Adds `PodNotFoundError` class and `strict` flag to `get_job_logs` |
| kubeflow/trainer/backends/base.py | Adds `strict` to the abstract `get_job_logs` signature |
| kubeflow/trainer/backends/localprocess/backend.py | Adds `strict` to `get_job_logs` signature (no-op) |
| kubeflow/trainer/backends/container/backend.py | Adds `strict` to `get_job_logs` signature (no-op) |
| kubeflow/trainer/api/trainer_client.py | Exposes `strict` on the public `get_job_logs` API |
| kubeflow/optimizer/backends/kubernetes/backend.py | Adds `strict` flag and corresponding `RuntimeError` raise to optimizer's `get_job_logs` |
| kubeflow/trainer/backends/kubernetes/backend_test.py | Adds unit test for `strict=True` behavior with pending pods |
Signed-off-by: Shanhuizi Jiang <sjiang83@fordham.edu>
Just a quick heads-up that the CI workflows are paused waiting for a maintainer's approval. Whenever someone has a second to kick off the unit and E2E tests, I'd really appreciate it!
Hey @andreyvelich, looks like this needs a quick `/ok-to-test`!

/ok-to-test
Signed-off-by: sjiang83 <sjiang83@fordham.edu>
All the CI tests are green across the board now. Just waiting on an LGTM and an approve label whenever a maintainer has a moment to take a look. Thanks again!
Fixes #317
The Issue
Currently, when calling `TrainerClient.get_job_logs()` with the Kubernetes backend, the method silently returns an empty iterator if no pod is found for the requested step. This usually happens when the step is still Pending or the label selector matches no pods. That behavior is ambiguous, especially for MCP tools and AI agents that rely on clear feedback to know what is actually going on.
What this PR does
I've added an opt-in strict flag to the method:
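The original code snippet from the PR body isn't reproduced here; a minimal, self-contained sketch of the intended behavior (the `logs_by_step` dictionary is a hypothetical stand-in for the real Kubernetes pod lookup) might look like this:

```python
from collections.abc import Iterator


class PodNotFoundError(RuntimeError):
    """Raised when strict=True and no pod matches the requested step."""


def get_job_logs(
    logs_by_step: dict[str, list[str]], step: str, strict: bool = False
) -> Iterator[str]:
    # logs_by_step stands in for the real pod query against the cluster.
    pod_logs = logs_by_step.get(step)
    if pod_logs is None:
        if strict:
            raise PodNotFoundError(f"no pod found for step {step!r}")
        return iter(())  # legacy behavior: silently empty iterator
    return iter(pod_logs)
```

With `strict=False` (the default), existing callers see no change; with `strict=True`, a missing pod surfaces as an exception instead of an empty iterator.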
I also updated the signatures across the other Trainer backends to accept the strict argument so the interface stays consistent (non-Kubernetes backends simply ignore the flag). I applied the same strict logic to the Optimizer Kubernetes backend's log retrieval as well.
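As a sketch of what "accept but ignore" can look like for a non-Kubernetes backend (the class and method names here are simplified stand-ins for the real base/backend modules, not the SDK's exact signatures):

```python
from abc import ABC, abstractmethod
from collections.abc import Iterator


class ExecutionBackend(ABC):
    """Simplified stand-in for the abstract backend interface."""

    @abstractmethod
    def get_job_logs(self, name: str, strict: bool = False) -> Iterator[str]: ...


class LocalProcessBackend(ExecutionBackend):
    def get_job_logs(self, name: str, strict: bool = False) -> Iterator[str]:
        # strict is accepted to keep the interface uniform, but ignored:
        # a local process has no pods, so "pod not found" cannot occur here.
        yield from [f"[{name}] training started"]
```

Keeping the parameter in every backend means callers can pass `strict=True` unconditionally without branching on which backend is active.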
Why this helps MCP and Agents
For agents to debug effectively, they absolutely need to be able to tell the difference between three things: logs that are available, logs that are not ready yet because the pod is still Pending, and a pod that simply does not exist.
By opting into strict=True, we get a much more deterministic observability flow without forcing a breaking change on existing SDK users.
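For example, an agent's log-check step can branch deterministically on the outcome; this sketch uses a hypothetical `fetch_logs` stand-in for the client call, so treat names and statuses as illustrative:

```python
class PodNotFoundError(RuntimeError):
    """Mirrors the error class this PR adds to the Kubernetes backend."""


def fetch_logs(logs_by_step, step, strict=False):
    # Hypothetical stand-in for TrainerClient.get_job_logs(...).
    if step not in logs_by_step:
        if strict:
            raise PodNotFoundError(step)
        return iter(())
    return iter(logs_by_step[step])


def classify_step(logs_by_step, step):
    """Return a deterministic status an agent can branch on instead of guessing."""
    try:
        lines = list(fetch_logs(logs_by_step, step, strict=True))
    except PodNotFoundError:
        return "pod-not-found"  # retry later or surface a scheduling issue
    return "logs-ready" if lines else "logs-empty"
```

Without `strict=True`, both the "pod-not-found" and "logs-empty" cases collapse into the same empty iterator, which is exactly the ambiguity this PR removes.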
Testing and Compatibility