fix(trainer): make get_job_logs optionally fail fast when pod is missing by sjiang83 · Pull Request #339 · kubeflow/sdk

sjiang83 · 2026-02-28T02:59:50Z

Fixes #317

The Issue
Currently, when calling TrainerClient.get_job_logs() using the Kubernetes backend, it silently returns an empty iterator if a pod isn't found for the requested step. This usually happens if the step is still Pending or the selector just comes up empty. This behavior creates a lot of ambiguity, especially for MCP tools and AI agents that rely on clear feedback to know what is actually going on.

What this PR does
I've added an opt-in strict flag to the method:

strict=False (the default): Keeps the current behavior exactly as is, which might yield nothing.
strict=True: Explicitly raises a PodNotFoundError if no pod is found for the requested TrainJob or step.

I also updated the signatures across the other Trainer backends to accept the strict argument so the interface stays consistent (it simply ignores the flag for non-Kubernetes backends). I applied this same strict logic to the Optimizer Kubernetes backend log retrieval as well.

Why this helps MCP and Agents
For agents to debug effectively, they absolutely need to be able to tell the difference between three things:

The job exists but the logs just aren't ready or outputting yet.
The pod actually cannot be found (which we now catch explicitly with strict=True).
Standard permission or RBAC errors (which still bubble up from the Kubernetes client just like before).

By opting into strict=True, we get a much more deterministic observability flow without forcing a breaking change on existing SDK users.

Testing and Compatibility

Tests: I added a unit test to verify that strict=True correctly raises an error when all steps are still in the Pending state.
Backward compatibility: 100% safe. Since the default behavior remains unchanged, existing users won't notice any difference.

google-oss-prow · 2026-02-28T02:59:56Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-02-28T02:59:59Z

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Slack: Join our #kubeflow-ml-experience and #kubeflow-trainer Slack channels
Meetings: Attend the Kubeflow SDK and ML Experience bi-weekly meetings

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot

Pull request overview

This PR addresses issue #317 where get_job_logs() silently returned an empty iterator when no pod was found for the requested step, providing no way to distinguish "logs not ready" from "pod not found." The fix adds an opt-in strict flag to get_job_logs() across all backends.

Changes:

Adds strict: bool = False parameter to get_job_logs in all backends and TrainerClient, raising PodNotFoundError when strict=True and no pod is found.
Introduces a new PodNotFoundError(RuntimeError) class in the Kubernetes trainer backend.
Adds a unit test verifying that strict=True raises PodNotFoundError when all steps are pending.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`kubeflow/trainer/backends/kubernetes/backend.py`	Adds `PodNotFoundError` class and `strict` flag to `get_job_logs`
`kubeflow/trainer/backends/base.py`	Adds `strict` to the abstract `get_job_logs` signature
`kubeflow/trainer/backends/localprocess/backend.py`	Adds `strict` to `get_job_logs` signature (no-op)
`kubeflow/trainer/backends/container/backend.py`	Adds `strict` to `get_job_logs` signature (no-op)
`kubeflow/trainer/api/trainer_client.py`	Exposes `strict` on the public `get_job_logs` API
`kubeflow/optimizer/backends/kubernetes/backend.py`	Adds `strict` flag and corresponding `RuntimeError` raise to optimizer's `get_job_logs`
`kubeflow/trainer/backends/kubernetes/backend_test.py`	Adds unit test for `strict=True` behavior with pending pods

kubeflow/trainer/backends/kubernetes/backend_test.py

kubeflow/trainer/backends/kubernetes/backend.py

kubeflow/optimizer/backends/kubernetes/backend.py

kubeflow/trainer/api/trainer_client.py

kubeflow/trainer/backends/kubernetes/backend.py

Signed-off-by: Shanhuizi Jiang <sjiang83@fordham.edu>

sjiang83 · 2026-02-28T03:40:52Z

Just a quick heads-up that the CI workflows are paused waiting for a maintainer's approval.

Whenever someone has a second to kick off the unit and E2E tests, I'd really appreciate it!

sjiang83 · 2026-02-28T03:50:48Z

Just a quick heads-up that the CI workflows are paused waiting for a maintainer's approval.

Whenever someone has a second to kick off the unit and E2E tests, I'd really appreciate it!

Hey @andreyvelich, looks like this needs a quick /ok-to-test to kick off the checks. No rush, thanks!

jaiakash · 2026-02-28T10:37:34Z

/ok-to-test

Signed-off-by: sjiang83 <sjiang83@fordham.edu>

sjiang83 · 2026-02-28T14:21:50Z

All the CI tests are green across the board now.

Just waiting on an LGTM and an approve label whenever a maintainer has a moment to take a look. Thanks again!

Copilot AI review requested due to automatic review settings February 28, 2026 02:59

google-oss-prow bot requested review from Electronic-Waste, andreyvelich and szaher February 28, 2026 02:59

google-oss-prow bot added the size/M label Feb 28, 2026

Copilot started reviewing on behalf of sjiang83 February 28, 2026 03:00 View session

sjiang83 changed the title ~~trainer: make get_job_logs optionally fail fast when pod is missing~~ fix(trainer): make get_job_logs optionally fail fast when pod is missing Feb 28, 2026

Copilot AI reviewed Feb 28, 2026

View reviewed changes

trainer: add strict mode for get_job_logs when pod is missing

f928e74

Signed-off-by: Shanhuizi Jiang <sjiang83@fordham.edu>

sjiang83 force-pushed the fix-317-get-job-logs-strict branch from 6fa100c to f928e74 Compare February 28, 2026 03:04

fix: expose strict log mode and unify pod-not-found errors

b4acf43

Signed-off-by: Shanhuizi Jiang <sjiang83@fordham.edu>

sjiang83 force-pushed the fix-317-get-job-logs-strict branch from 4e3170a to b4acf43 Compare February 28, 2026 03:32

google-oss-prow bot added the ok-to-test label Feb 28, 2026

chore: apply ruff formatting

5d3e300

Signed-off-by: sjiang83 <sjiang83@fordham.edu>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(trainer): make get_job_logs optionally fail fast when pod is missing#339

fix(trainer): make get_job_logs optionally fail fast when pod is missing#339
sjiang83 wants to merge 3 commits intokubeflow:mainfrom
sjiang83:fix-317-get-job-logs-strict

sjiang83 commented Feb 28, 2026

Uh oh!

google-oss-prow bot commented Feb 28, 2026

Uh oh!

github-actions bot commented Feb 28, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sjiang83 commented Feb 28, 2026

Uh oh!

sjiang83 commented Feb 28, 2026

Uh oh!

jaiakash commented Feb 28, 2026

Uh oh!

sjiang83 commented Feb 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

sjiang83 commented Feb 28, 2026

Uh oh!

google-oss-prow bot commented Feb 28, 2026

Uh oh!

github-actions bot commented Feb 28, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sjiang83 commented Feb 28, 2026

Uh oh!

sjiang83 commented Feb 28, 2026

Uh oh!

jaiakash commented Feb 28, 2026

Uh oh!

sjiang83 commented Feb 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants