feat: Add Logfire error tracking to fleet env #6
dzorlu wants to merge 11 commits into deniz/fleet_client
Conversation
Structured observability for fleet env errors (init failures, tool call failures, MCP timeouts, verifier errors). Adds telemetry.py wrapper and 15 instrumentation sites across task_env.py, client.py, mcp_tools.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, modality
- Add set_task_context() to establish base attributes for all events
- All telemetry events now inherit env_key, env_version, task_key, modality
- Parse env_key:version in client.py to log separately
- Add fleet_rollout_started and fleet_rollout_completed events
- Default environment changed to "training_rollouts"
- Update README with new schema and example SQL query
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tracks MCP server errors (returned in tool results) separately from
Python exceptions:
- fleet_tool_call_failed: Python exception during call_tool()
- fleet_mcp_tool_error: MCP server returned {"error": ...}, {"status": "failed"}, or {"isError": true}
This aligns telemetry with WandB tool_error counting, which tracks both
exception-based errors and error patterns in tool results.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… full context - Move set_task_context() before from_fleet() call in task_env.py - Remove explicit env_key/env_version from client.py telemetry calls (now from context) - This ensures fleet_make_failed events include task_key and modality 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused fleet_error import from task_env.py
- Fix _is_tool_error to check truthy values (avoid {"error": null} false positives)
- Make close() exception-safe with try/finally for cleanup
- Emit fleet_rollout_completed on ALL paths (not just verifier success)
- Remove unused _env_name/_env_version parsing in client.py
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
fleet_exception(
    "fleet_tools_list_failed",
    step_count=self._step_count,
)
Unprotected telemetry calls in error handlers risk disrupting recovery
Low Severity
fleet_exception calls inside except blocks in reset_async, step_async, and _compute_reward are not wrapped in try/except, unlike the identical pattern in close() which correctly protects against telemetry failures with a comment "Telemetry failure should not break cleanup." If fleet_exception raises, it disrupts the graceful error handling — e.g., at line 274, self._tools_cache = [] is never reached; at line 434, the rest of step_async (done/reward/obs logic) is skipped. This inconsistency means a telemetry failure could leave the environment in a broken state.
Additional Locations (2)
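One way to address this review finding is a small guard that mirrors the protected pattern the reviewer cites from close(). The helper name `emit_safely` is hypothetical; it wraps any telemetry emitter so a logfire failure cannot abort the surrounding recovery path:

```python
import logging

logger = logging.getLogger(__name__)


def emit_safely(emit, event: str, **attrs) -> None:
    """Call a telemetry emitter, swallowing its failures.

    Mirrors the guard in close(): "Telemetry failure should not break
    cleanup." With this, statements after the telemetry call in an
    except block (e.g. self._tools_cache = []) are always reached.
    """
    try:
        emit(event, **attrs)
    except Exception:
        # Never let observability break the thing it observes.
        logger.debug("telemetry emit failed for %s", event, exc_info=True)
```

In reset_async this would become `emit_safely(fleet_exception, "fleet_tools_list_failed", step_count=self._step_count)`, making the three unprotected sites consistent with close().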
The eval step was hanging for 6+ hours because MCP list_tools/call_tool were not enforcing timeouts during connection establishment. Changes:
- Add OPERATION_TIMEOUT_S = 60s hard limit using asyncio.wait_for()
- Reduce internal timeouts (30s connect, 60s SSE read)
- Raise TimeoutError with clear message when operations hang
This prevents a single slow/stuck Fleet env from blocking the entire training run.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
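The hard-limit pattern from this commit can be sketched as follows. `call_with_timeout` and its `coro_factory` argument are illustrative names, not the PR's actual API; the mechanism — `asyncio.wait_for()` around the MCP operation, re-raised as a `TimeoutError` with a clear message — is what the commit describes:

```python
import asyncio

OPERATION_TIMEOUT_S = 60  # hard ceiling from the commit message


async def call_with_timeout(coro_factory, op_name: str,
                            timeout_s: float = OPERATION_TIMEOUT_S):
    """Enforce a hard timeout on an MCP operation (list_tools/call_tool).

    asyncio.wait_for() cancels the operation when the deadline passes,
    so one stuck Fleet env cannot block the whole training run.
    """
    try:
        return await asyncio.wait_for(coro_factory(), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise TimeoutError(
            f"MCP operation {op_name!r} exceeded {timeout_s}s; "
            "the server may be stuck during connection establishment"
        ) from None
```

Note that `wait_for` only helps if the wrapped coroutine is cancellable; blocking sync calls inside it would still hang, which is why the commit also reduces the internal connect/SSE-read timeouts.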
env_version=self.env_version,
task_key=self.task_key,
modality=self.modality,
)
Shared ContextVar causes wrong telemetry context across instances
Medium Severity
set_task_context is called only in __init__, writing to a thread-level ContextVar. When multiple FleetTaskEnv instances exist in the same thread (e.g., via from_json_file_all), each constructor overwrites the shared _session_context. All subsequent telemetry events from earlier instances (during reset_async, step_async, _compute_reward) will carry the last-created instance's env_key/task_key/etc., producing silently incorrect monitoring data. The context needs to be re-established before telemetry-emitting methods, not just once at construction.
Additional Locations (1)
Fleet.make() is synchronous and can block for ~10 minutes per attempt when an env has health check failures (e.g., fostgres). With 3 retries and time.sleep(), this blocks the entire event loop for ~30 minutes, freezing ALL other async trajectories in the batch. Changes:
- Add FleetEnvClient.from_fleet_async() using AsyncFleet.make() and asyncio.sleep() for retries — yields to event loop while waiting
- Defer provisioning from FleetTaskEnv.__init__() to reset_async() via _ensure_provisioned(), so fleet.make() runs in async context
- After async provisioning, get sync env handle via Fleet.instance() for close() and verify_detailed() compatibility (fast GET, ~100ms)
- Update tests to match new deferred provisioning pattern
No SkyRL changes needed — it already calls __init__() then reset_async().
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
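The core of the first bullet is an async retry loop that yields to the event loop between attempts. A sketch, with `make_async` standing in for `AsyncFleet.make()` (the retry count and backoff here are illustrative, not the PR's values):

```python
import asyncio


async def from_fleet_async(make_async, env_key: str,
                           retries: int = 3, backoff_s: float = 5.0):
    """Provision an env with retries, without blocking the event loop.

    Unlike the sync Fleet.make() + time.sleep() path, asyncio.sleep()
    suspends only this coroutine, so the other trajectories in the
    batch keep running while this one waits out a health-check failure.
    """
    last_exc: Exception | None = None
    for attempt in range(1, retries + 1):
        try:
            return await make_async(env_key)
        except Exception as exc:
            last_exc = exc
            if attempt < retries:
                await asyncio.sleep(backoff_s)
    raise RuntimeError(
        f"provisioning {env_key!r} failed after {retries} attempts"
    ) from last_exc
```

The deferred-provisioning bullet follows the same logic one level up: moving the call out of `__init__` into `reset_async()` guarantees it runs where `await` is available at all.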
Add console-visible logger.warning() alongside logfire events for fleet_mcp_tool_error and fleet_tool_call_failed. Includes:
- env_key:env_version (e.g., amazon:v0.0.12)
- step N/max_steps (e.g., step 3/50)
- tool_name and error message
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
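A sketch of what such a console companion might look like; the helper name and exact message layout are assumptions, built only from the fields the commit lists:

```python
import logging

logger = logging.getLogger("fleet")


def warn_tool_error(env_key: str, env_version: str, step: int,
                    max_steps: int, tool_name: str, error: str) -> None:
    """Console-visible companion to the logfire event, carrying the
    same fields named in the commit: env_key:env_version, step N/max_steps,
    tool name, and the error message."""
    logger.warning(
        "fleet_mcp_tool_error %s:%s step %d/%d tool=%s: %s",
        env_key, env_version, step, max_steps, tool_name, error,
    )
```

Duplicating the event at WARNING level means operators tailing stdout see failures immediately, even when nobody is watching the Logfire dashboard.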
- fleet_env_reset_failed: use fleet_warning (no traceback) instead of
fleet_exception — reset 502s are expected, one-line warning is enough
- fleet_env_close_failed: silence entirely — "Instance already terminated"
is expected when TTL expires before cleanup
- fleet_env_created: remove logfire console print — logger.info already
prints the useful "Fleet instance ready in Xs: {id}" line
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
timeout_s=self.reset_timeout_s,
error_type=type(e).__name__,
error_message=str(e)[:200],
)
Reset failure loses exception traceback using wrong telemetry level
Medium Severity
The fleet_env_reset_failed event uses fleet_warning instead of fleet_exception, which means the exception traceback is not captured. The README documents this as an "exception"-level event, and all other similar failure events in the same file (fleet_tools_list_failed, fleet_screenshot_failed, fleet_tool_call_failed, fleet_verifier_failed) correctly use fleet_exception. Since this code is inside an except Exception as e: block, fleet_exception (which passes _exc_info=True) would capture the traceback needed for debugging reset failures.
Additional Locations (1)
…teps
- Move fleet_rollout_started to fire before provisioning so init failures (e.g., fostgres health check) are counted in total_rollouts
- Emit fleet_rollout_completed with failure_reason="init_error" on init failure, ensuring completed <= total_rollouts invariant
- Add total_steps (SUM of step_count) and init_errors columns to SQL query
- Clarify verifier_errors = code exceptions, not model failures
- Add test for init failure telemetry path
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…me_s
Tracks per-instance provisioning latency in Logfire to diagnose Fleet queue serialization (96 concurrent make() calls are processed at roughly one per 10 s, causing 10+ minute queue delays for the last instance).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
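A sketch of the measurement itself; the attribute name `provision_time_s` is inferred from the truncated commit title, and `timed_provision`/`emit` are hypothetical names:

```python
import time


def timed_provision(provision_fn, emit):
    """Provision an env and record how long it took.

    time.monotonic() is immune to wall-clock adjustments, so the
    recorded latency reflects real queue wait. Emitting it per
    instance makes queue serialization visible as a staircase of
    provision_time_s values in Logfire.
    """
    start = time.monotonic()
    env = provision_fn()
    emit("fleet_env_created",
         provision_time_s=round(time.monotonic() - start, 3))
    return env
```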
Summary
- telemetry.py wrapper with 4 functions (fleet_info, fleet_warning, fleet_error, fleet_exception)
- 15 instrumentation sites across task_env.py, client.py, mcp_tools.py
- If configure_fleet_telemetry() is never called — logfire silently drops events
Test plan
- configure_fleet_telemetry(send_to_logfire=False) to verify structured output
🤖 Generated with Claude Code