feat: Add Logfire error tracking to fleet env#6

Open
dzorlu wants to merge 11 commits into deniz/fleet_client from deniz/fleet-logfire

Conversation


@dzorlu dzorlu commented Feb 26, 2026

Summary

  • Adds structured error tracking via Logfire across fleet env (init failures, tool call failures, MCP timeouts, verifier errors)
  • New telemetry.py wrapper with 4 functions (fleet_info, fleet_warning, fleet_error, fleet_exception)
  • 15 instrumentation sites across task_env.py, client.py, mcp_tools.py
  • No-op if configure_fleet_telemetry() is never called — logfire silently drops events
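The helper names above come from the PR summary; the sketch below is a hypothetical illustration of the no-op-by-default pattern, not the actual telemetry.py (which forwards to Logfire rather than stdlib logging):

```python
# Hypothetical sketch of a telemetry wrapper: all fleet_* helpers silently
# drop events until configure_fleet_telemetry() is called. Internals are
# assumptions for illustration; the real module forwards to logfire.
import logging

logger = logging.getLogger("fleet.telemetry")
_configured = False


def configure_fleet_telemetry(send_to_logfire: bool = True) -> None:
    """Enable telemetry; until this runs, every fleet_* helper is a no-op."""
    global _configured
    _configured = True
    # The real implementation would configure the logfire backend here.


def _emit(level: int, event: str, *, _exc_info: bool = False, **attrs) -> None:
    if not _configured:
        return  # silently drop events when telemetry was never configured
    logger.log(level, "%s %s", event, attrs, exc_info=_exc_info)


def fleet_info(event: str, **attrs) -> None:
    _emit(logging.INFO, event, **attrs)


def fleet_warning(event: str, **attrs) -> None:
    _emit(logging.WARNING, event, **attrs)


def fleet_error(event: str, **attrs) -> None:
    _emit(logging.ERROR, event, **attrs)


def fleet_exception(event: str, **attrs) -> None:
    # Captures the active traceback, so it only makes sense inside except blocks.
    _emit(logging.ERROR, event, _exc_info=True, **attrs)
```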

Test plan

  • All 41 previously-passing tests still pass
  • 4 pre-existing failures unaffected
  • Smoke test with configure_fleet_telemetry(send_to_logfire=False) to verify structured output

🤖 Generated with Claude Code

Structured observability for fleet env errors (init failures, tool call
failures, MCP timeouts, verifier errors). Adds telemetry.py wrapper and
15 instrumentation sites across task_env.py, client.py, mcp_tools.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, modality

- Add set_task_context() to establish base attributes for all events
- All telemetry events now inherit env_key, env_version, task_key, modality
- Parse env_key:version in client.py to log separately
- Add fleet_rollout_started and fleet_rollout_completed events
- Default environment changed to "training_rollouts"
- Update README with new schema and example SQL query

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Deniz and others added 2 commits February 26, 2026 21:52
Tracks MCP server errors (returned in tool results) separately from
Python exceptions:
- fleet_tool_call_failed: Python exception during call_tool()
- fleet_mcp_tool_error: MCP server returned {"error": ...}, {"status": "failed"}, or {"isError": true}

This aligns telemetry with WandB tool_error counting which tracks both
exception-based errors and error patterns in tool results.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… full context

- Move set_task_context() before from_fleet() call in task_env.py
- Remove explicit env_key/env_version from client.py telemetry calls (now from context)
- This ensures fleet_make_failed events include task_key and modality

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused fleet_error import from task_env.py
- Fix _is_tool_error to check truthy values (avoid {"error": null} false positives)
- Make close() exception-safe with try/finally for cleanup
- Emit fleet_rollout_completed on ALL paths (not just verifier success)
- Remove unused _env_name/_env_version parsing in client.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
fleet_exception(
"fleet_tools_list_failed",
step_count=self._step_count,
)

Unprotected telemetry calls in error handlers risk disrupting recovery

Low Severity

fleet_exception calls inside except blocks in reset_async, step_async, and _compute_reward are not wrapped in try/except, unlike the identical pattern in close() which correctly protects against telemetry failures with a comment "Telemetry failure should not break cleanup." If fleet_exception raises, it disrupts the graceful error handling — e.g., at line 274, self._tools_cache = [] is never reached; at line 434, the rest of step_async (done/reward/obs logic) is skipped. This inconsistency means a telemetry failure could leave the environment in a broken state.

Additional Locations (2)

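The guard the reviewer asks for can be sketched as follows, using a stand-in `fleet_exception` that deliberately fails to show that recovery still runs (the helper name is from the PR; the handler body is hypothetical):

```python
def fleet_exception(event: str, **attrs) -> None:
    # Stand-in that misbehaves, to demonstrate the guard. The real helper
    # lives in telemetry.py.
    raise RuntimeError("telemetry backend down")


def on_tools_list_failure(env: dict) -> None:
    """Error handler whose recovery must run even if telemetry itself raises.

    Mirrors the pattern the review points to in close(): telemetry failure
    should not break cleanup.
    """
    try:
        fleet_exception("fleet_tools_list_failed", step_count=env["step_count"])
    except Exception:
        pass  # never let a telemetry failure disrupt error recovery
    env["tools_cache"] = []  # recovery is reached even though telemetry raised
```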

The eval step was hanging for 6+ hours because MCP list_tools/call_tool
were not enforcing timeouts during connection establishment.

Changes:
- Add OPERATION_TIMEOUT_S = 60s hard limit using asyncio.wait_for()
- Reduce internal timeouts (30s connect, 60s SSE read)
- Raise TimeoutError with clear message when operations hang

This prevents a single slow/stuck Fleet env from blocking the entire
training run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
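The hard-limit pattern described in that commit can be sketched with `asyncio.wait_for()`; the wrapper name and `timeout_s` parameter are illustrative, only `OPERATION_TIMEOUT_S = 60` comes from the commit message:

```python
import asyncio

OPERATION_TIMEOUT_S = 60  # hard ceiling from the commit message


async def call_with_timeout(coro, op_name: str,
                            timeout_s: float = OPERATION_TIMEOUT_S):
    """Bound an MCP operation so a stuck connection cannot hang the run.

    asyncio.wait_for() cancels the operation at the deadline, unlike
    library-internal timeouts that may not cover connection establishment.
    """
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        raise TimeoutError(
            f"{op_name} exceeded {timeout_s}s hard limit; Fleet env may be stuck"
        ) from None
```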
env_version=self.env_version,
task_key=self.task_key,
modality=self.modality,
)

Shared ContextVar causes wrong telemetry context across instances

Medium Severity

set_task_context is called only in __init__, writing to a thread-level ContextVar. When multiple FleetTaskEnv instances exist in the same thread (e.g., via from_json_file_all), each constructor overwrites the shared _session_context. All subsequent telemetry events from earlier instances (during reset_async, step_async, _compute_reward) will carry the last-created instance's env_key/task_key/etc., producing silently incorrect monitoring data. The context needs to be re-established before telemetry-emitting methods, not just once at construction.

Additional Locations (1)

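The fix the reviewer suggests — re-establishing the context before each telemetry-emitting method rather than only in `__init__` — might look like this (class and method names are hypothetical; only `set_task_context`/`_session_context` semantics come from the review):

```python
import contextvars

# Shared, thread-level context the review describes: writing it only in
# __init__ lets the last-constructed env clobber earlier instances' attributes.
_session_context: contextvars.ContextVar = contextvars.ContextVar(
    "fleet_session_context", default=None
)


class FleetTaskEnvSketch:
    """Hypothetical fix: refresh the context before every telemetry call."""

    def __init__(self, env_key: str, task_key: str):
        self._base_attrs = {"env_key": env_key, "task_key": task_key}
        _session_context.set(self._base_attrs)  # original, buggy-on-its-own behavior

    def _refresh_context(self) -> None:
        # Proposed fix: each instance re-establishes its own attributes
        # at the top of reset/step/verify, not just at construction.
        _session_context.set(self._base_attrs)

    def step_attrs(self) -> dict:
        self._refresh_context()
        return dict(_session_context.get())
```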

Deniz and others added 3 commits February 27, 2026 13:48
Fleet.make() is synchronous and can block for ~10 minutes per attempt
when an env has health check failures (e.g., fostgres). With 3 retries
and time.sleep(), this blocks the entire event loop for ~30 minutes,
freezing ALL other async trajectories in the batch.

Changes:
- Add FleetEnvClient.from_fleet_async() using AsyncFleet.make() and
  asyncio.sleep() for retries — yields to event loop while waiting
- Defer provisioning from FleetTaskEnv.__init__() to reset_async() via
  _ensure_provisioned(), so fleet.make() runs in async context
- After async provisioning, get sync env handle via Fleet.instance()
  for close() and verify_detailed() compatibility (fast GET, ~100ms)
- Update tests to match new deferred provisioning pattern

No SkyRL changes needed — it already calls __init__() then reset_async().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
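The retry loop that commit describes can be sketched as below; `make_async` stands in for `AsyncFleet.make()` (the function name and signature here are assumptions), and the key point is `asyncio.sleep()` yielding to the event loop between attempts instead of `time.sleep()` freezing every trajectory:

```python
import asyncio


async def make_with_retries(make_async, env_key: str, retries: int = 3,
                            backoff_s: float = 5.0):
    """Provision a Fleet env with retries, without blocking the event loop.

    `make_async` is a stand-in for AsyncFleet.make(). Between attempts we
    await asyncio.sleep(), which yields control so the other concurrent
    trajectories in the batch keep running.
    """
    last_exc: Exception | None = None
    for attempt in range(1, retries + 1):
        try:
            return await make_async(env_key)
        except Exception as exc:
            last_exc = exc
            if attempt < retries:
                await asyncio.sleep(backoff_s)  # yields, unlike time.sleep()
    raise last_exc
```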
Add console-visible logger.warning() alongside logfire events for
fleet_mcp_tool_error and fleet_tool_call_failed. Includes:
- env_key:env_version (e.g., amazon:v0.0.12)
- step N/max_steps (e.g., step 3/50)
- tool_name and error message

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- fleet_env_reset_failed: use fleet_warning (no traceback) instead of
  fleet_exception — reset 502s are expected, one-line warning is enough
- fleet_env_close_failed: silence entirely — "Instance already terminated"
  is expected when TTL expires before cleanup
- fleet_env_created: remove logfire console print — logger.info already
  prints the useful "Fleet instance ready in Xs: {id}" line

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


timeout_s=self.reset_timeout_s,
error_type=type(e).__name__,
error_message=str(e)[:200],
)

Reset failure loses exception traceback using wrong telemetry level

Medium Severity

The fleet_env_reset_failed event uses fleet_warning instead of fleet_exception, which means the exception traceback is not captured. The README documents this as an "exception"-level event, and all other similar failure events in the same file (fleet_tools_list_failed, fleet_screenshot_failed, fleet_tool_call_failed, fleet_verifier_failed) correctly use fleet_exception. Since this code is inside an except Exception as e: block, fleet_exception (which passes _exc_info=True) would capture the traceback needed for debugging reset failures.

Additional Locations (1)


Deniz and others added 2 commits February 27, 2026 16:38
…teps

- Move fleet_rollout_started to fire before provisioning so init failures
  (e.g., fostgres health check) are counted in total_rollouts
- Emit fleet_rollout_completed with failure_reason="init_error" on init
  failure, ensuring completed <= total_rollouts invariant
- Add total_steps (SUM of step_count) and init_errors columns to SQL query
- Clarify verifier_errors = code exceptions, not model failures
- Add test for init failure telemetry path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…me_s

Tracks per-instance provisioning latency in Logfire to diagnose Fleet
queue serialization (96 concurrent make() calls get processed at ~1/10s,
causing 10+ min queue delays for the last instance).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>