feat: Add Logfire error tracking to fleet env#6

Open
dzorlu wants to merge 11 commits into deniz/fleet_client from deniz/fleet-logfire

Conversation


@dzorlu dzorlu commented Feb 26, 2026

Summary

  • Adds structured error tracking via Logfire across fleet env (init failures, tool call failures, MCP timeouts, verifier errors)
  • New telemetry.py wrapper with 4 functions (fleet_info, fleet_warning, fleet_error, fleet_exception)
  • 15 instrumentation sites across task_env.py, client.py, mcp_tools.py
  • No-op if configure_fleet_telemetry() is never called — logfire silently drops events
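The helper names above come from the PR summary; the sketch below is a hypothetical illustration of the no-op-by-default pattern, not the actual telemetry.py (which forwards to Logfire rather than stdlib logging):

```python
# Hypothetical sketch of a telemetry wrapper: all fleet_* helpers silently
# drop events until configure_fleet_telemetry() is called. Internals are
# assumptions for illustration; the real module forwards to logfire.
import logging

logger = logging.getLogger("fleet.telemetry")
_configured = False


def configure_fleet_telemetry(send_to_logfire: bool = True) -> None:
    """Enable telemetry; until this runs, every fleet_* helper is a no-op."""
    global _configured
    _configured = True
    # The real implementation would configure the logfire backend here.


def _emit(level: int, event: str, *, _exc_info: bool = False, **attrs) -> None:
    if not _configured:
        return  # silently drop events when telemetry was never configured
    logger.log(level, "%s %s", event, attrs, exc_info=_exc_info)


def fleet_info(event: str, **attrs) -> None:
    _emit(logging.INFO, event, **attrs)


def fleet_warning(event: str, **attrs) -> None:
    _emit(logging.WARNING, event, **attrs)


def fleet_error(event: str, **attrs) -> None:
    _emit(logging.ERROR, event, **attrs)


def fleet_exception(event: str, **attrs) -> None:
    # Captures the active traceback, so it only makes sense inside except blocks.
    _emit(logging.ERROR, event, _exc_info=True, **attrs)
```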

Test plan

  • All 41 previously-passing tests still pass
  • 4 pre-existing failures unaffected
  • Smoke test with configure_fleet_telemetry(send_to_logfire=False) to verify structured output

🤖 Generated with Claude Code

Structured observability for fleet env errors (init failures, tool call
failures, MCP timeouts, verifier errors). Adds telemetry.py wrapper and
15 instrumentation sites across task_env.py, client.py, mcp_tools.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, modality

- Add set_task_context() to establish base attributes for all events
- All telemetry events now inherit env_key, env_version, task_key, modality
- Parse env_key:version in client.py to log separately
- Add fleet_rollout_started and fleet_rollout_completed events
- Default environment changed to "training_rollouts"
- Update README with new schema and example SQL query

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Deniz and others added 2 commits February 26, 2026 21:52
Tracks MCP server errors (returned in tool results) separately from
Python exceptions:
- fleet_tool_call_failed: Python exception during call_tool()
- fleet_mcp_tool_error: MCP server returned {"error": ...}, {"status": "failed"}, or {"isError": true}

This aligns telemetry with WandB tool_error counting which tracks both
exception-based errors and error patterns in tool results.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… full context

- Move set_task_context() before from_fleet() call in task_env.py
- Remove explicit env_key/env_version from client.py telemetry calls (now from context)
- This ensures fleet_make_failed events include task_key and modality

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused fleet_error import from task_env.py
- Fix _is_tool_error to check truthy values (avoid {"error": null} false positives)
- Make close() exception-safe with try/finally for cleanup
- Emit fleet_rollout_completed on ALL paths (not just verifier success)
- Remove unused _env_name/_env_version parsing in client.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
fleet_exception(
"fleet_tools_list_failed",
step_count=self._step_count,
)

Unprotected telemetry calls in error handlers risk disrupting recovery

Low Severity

fleet_exception calls inside except blocks in reset_async, step_async, and _compute_reward are not wrapped in try/except, unlike the identical pattern in close() which correctly protects against telemetry failures with a comment "Telemetry failure should not break cleanup." If fleet_exception raises, it disrupts the graceful error handling — e.g., at line 274, self._tools_cache = [] is never reached; at line 434, the rest of step_async (done/reward/obs logic) is skipped. This inconsistency means a telemetry failure could leave the environment in a broken state.

Additional Locations (2)

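The guard the reviewer asks for can be sketched as follows, using a stand-in `fleet_exception` that deliberately fails to show that recovery still runs (the helper name is from the PR; the handler body is hypothetical):

```python
def fleet_exception(event: str, **attrs) -> None:
    # Stand-in that misbehaves, to demonstrate the guard. The real helper
    # lives in telemetry.py.
    raise RuntimeError("telemetry backend down")


def on_tools_list_failure(env: dict) -> None:
    """Error handler whose recovery must run even if telemetry itself raises.

    Mirrors the pattern the review points to in close(): telemetry failure
    should not break cleanup.
    """
    try:
        fleet_exception("fleet_tools_list_failed", step_count=env["step_count"])
    except Exception:
        pass  # never let a telemetry failure disrupt error recovery
    env["tools_cache"] = []  # recovery is reached even though telemetry raised
```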

The eval step was hanging for 6+ hours because MCP list_tools/call_tool
were not enforcing timeouts during connection establishment.

Changes:
- Add OPERATION_TIMEOUT_S = 60s hard limit using asyncio.wait_for()
- Reduce internal timeouts (30s connect, 60s SSE read)
- Raise TimeoutError with clear message when operations hang

This prevents a single slow/stuck Fleet env from blocking the entire
training run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
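The hard-limit pattern described in that commit can be sketched with `asyncio.wait_for()`; the wrapper name and `timeout_s` parameter are illustrative, only `OPERATION_TIMEOUT_S = 60` comes from the commit message:

```python
import asyncio

OPERATION_TIMEOUT_S = 60  # hard ceiling from the commit message


async def call_with_timeout(coro, op_name: str,
                            timeout_s: float = OPERATION_TIMEOUT_S):
    """Bound an MCP operation so a stuck connection cannot hang the run.

    asyncio.wait_for() cancels the operation at the deadline, unlike
    library-internal timeouts that may not cover connection establishment.
    """
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        raise TimeoutError(
            f"{op_name} exceeded {timeout_s}s hard limit; Fleet env may be stuck"
        ) from None
```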
env_version=self.env_version,
task_key=self.task_key,
modality=self.modality,
)

Shared ContextVar causes wrong telemetry context across instances

Medium Severity

set_task_context is called only in __init__, writing to a thread-level ContextVar. When multiple FleetTaskEnv instances exist in the same thread (e.g., via from_json_file_all), each constructor overwrites the shared _session_context. All subsequent telemetry events from earlier instances (during reset_async, step_async, _compute_reward) will carry the last-created instance's env_key/task_key/etc., producing silently incorrect monitoring data. The context needs to be re-established before telemetry-emitting methods, not just once at construction.

Additional Locations (1)

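The fix the reviewer suggests — re-establishing the context before each telemetry-emitting method rather than only in `__init__` — might look like this (class and method names are hypothetical; only `set_task_context`/`_session_context` semantics come from the review):

```python
import contextvars

# Shared, thread-level context the review describes: writing it only in
# __init__ lets the last-constructed env clobber earlier instances' attributes.
_session_context: contextvars.ContextVar = contextvars.ContextVar(
    "fleet_session_context", default=None
)


class FleetTaskEnvSketch:
    """Hypothetical fix: refresh the context before every telemetry call."""

    def __init__(self, env_key: str, task_key: str):
        self._base_attrs = {"env_key": env_key, "task_key": task_key}
        _session_context.set(self._base_attrs)  # original, buggy-on-its-own behavior

    def _refresh_context(self) -> None:
        # Proposed fix: each instance re-establishes its own attributes
        # at the top of reset/step/verify, not just at construction.
        _session_context.set(self._base_attrs)

    def step_attrs(self) -> dict:
        self._refresh_context()
        return dict(_session_context.get())
```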

Deniz and others added 3 commits February 27, 2026 13:48
Fleet.make() is synchronous and can block for ~10 minutes per attempt
when an env has health check failures (e.g., fostgres). With 3 retries
and time.sleep(), this blocks the entire event loop for ~30 minutes,
freezing ALL other async trajectories in the batch.

Changes:
- Add FleetEnvClient.from_fleet_async() using AsyncFleet.make() and
  asyncio.sleep() for retries — yields to event loop while waiting
- Defer provisioning from FleetTaskEnv.__init__() to reset_async() via
  _ensure_provisioned(), so fleet.make() runs in async context
- After async provisioning, get sync env handle via Fleet.instance()
  for close() and verify_detailed() compatibility (fast GET, ~100ms)
- Update tests to match new deferred provisioning pattern

No SkyRL changes needed — it already calls __init__() then reset_async().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
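The retry loop that commit describes can be sketched as below; `make_async` stands in for `AsyncFleet.make()` (the function name and signature here are assumptions), and the key point is `asyncio.sleep()` yielding to the event loop between attempts instead of `time.sleep()` freezing every trajectory:

```python
import asyncio


async def make_with_retries(make_async, env_key: str, retries: int = 3,
                            backoff_s: float = 5.0):
    """Provision a Fleet env with retries, without blocking the event loop.

    `make_async` is a stand-in for AsyncFleet.make(). Between attempts we
    await asyncio.sleep(), which yields control so the other concurrent
    trajectories in the batch keep running.
    """
    last_exc: Exception | None = None
    for attempt in range(1, retries + 1):
        try:
            return await make_async(env_key)
        except Exception as exc:
            last_exc = exc
            if attempt < retries:
                await asyncio.sleep(backoff_s)  # yields, unlike time.sleep()
    raise last_exc
```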
Add console-visible logger.warning() alongside logfire events for
fleet_mcp_tool_error and fleet_tool_call_failed. Includes:
- env_key:env_version (e.g., amazon:v0.0.12)
- step N/max_steps (e.g., step 3/50)
- tool_name and error message

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- fleet_env_reset_failed: use fleet_warning (no traceback) instead of
  fleet_exception — reset 502s are expected, one-line warning is enough
- fleet_env_close_failed: silence entirely — "Instance already terminated"
  is expected when TTL expires before cleanup
- fleet_env_created: remove logfire console print — logger.info already
  prints the useful "Fleet instance ready in Xs: {id}" line

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


timeout_s=self.reset_timeout_s,
error_type=type(e).__name__,
error_message=str(e)[:200],
)

Reset failure loses exception traceback using wrong telemetry level

Medium Severity

The fleet_env_reset_failed event uses fleet_warning instead of fleet_exception, which means the exception traceback is not captured. The README documents this as an "exception"-level event, and all other similar failure events in the same file (fleet_tools_list_failed, fleet_screenshot_failed, fleet_tool_call_failed, fleet_verifier_failed) correctly use fleet_exception. Since this code is inside an except Exception as e: block, fleet_exception (which passes _exc_info=True) would capture the traceback needed for debugging reset failures.

Additional Locations (1)


Deniz and others added 2 commits February 27, 2026 16:38
…teps

- Move fleet_rollout_started to fire before provisioning so init failures
  (e.g., fostgres health check) are counted in total_rollouts
- Emit fleet_rollout_completed with failure_reason="init_error" on init
  failure, ensuring completed <= total_rollouts invariant
- Add total_steps (SUM of step_count) and init_errors columns to SQL query
- Clarify verifier_errors = code exceptions, not model failures
- Add test for init failure telemetry path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…me_s

Tracks per-instance provisioning latency in Logfire to diagnose Fleet
queue serialization (96 concurrent make() calls get processed at ~1/10s,
causing 10+ min queue delays for the last instance).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>