Description
Bug
When a VM agent emits a nightshift.error event (e.g. bad API key, command failure), run_task_pooled() catches the exception internally, publishes an ErrorEvent to the event buffer, and returns normally. The caller, _run_agent_task, then calls registry.complete_run(run_id) without an error string, so the run record ends up as:
{"status": "completed", "error": null}
This happens even though the event stream contains a nightshift.error with the real error message.
Repro
GET /api/runs/5698aa36-f5b2-4601-9af8-ee36abfc0de8
→ {"status": "completed", "error": null}
But run_events for that ID:
nightshift.error {"error": "Command failed with exit code 1 ..."}
Root cause
run_task_pooled() in task.py swallows the exception at lines 192-193:
except Exception as e:
    await log.publish(run_id, ErrorEvent(error=str(e)))
It publishes the error event but does not re-raise, so _run_agent_task in server.py sees a clean return and calls complete_run(run_id) with no error.
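The swallow-and-publish pattern can be reproduced in isolation. The sketch below is not the project's code: EventLog, failing_agent, and the simplified ErrorEvent are hypothetical stand-ins, and only the except block mirrors the handler quoted above. It shows that the caller has no way to observe the failure other than by reading the event buffer.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ErrorEvent:
    # Simplified stand-in for the real ErrorEvent; only `error` is modelled.
    error: str


class EventLog:
    """Hypothetical in-memory event buffer keyed by run_id."""

    def __init__(self):
        self.events = {}

    async def publish(self, run_id, event):
        self.events.setdefault(run_id, []).append(event)


async def failing_agent():
    # Hypothetical agent step that always fails.
    raise RuntimeError("Command failed with exit code 1")


async def run_task_pooled(log, run_id):
    # Same shape as the quoted handler: publish the error, do not re-raise.
    try:
        await failing_agent()
    except Exception as e:
        await log.publish(run_id, ErrorEvent(error=str(e)))


async def demo():
    log = EventLog()
    raised = False
    try:
        await run_task_pooled(log, "run-1")
    except Exception:
        raised = True
    return raised, log.events["run-1"]


# `raised` stays False even though an ErrorEvent was recorded.
raised, events = asyncio.run(demo())
```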
Suggested fix
After run_task_pooled returns in _run_agent_task, check the event buffer for a nightshift.error terminal event and pass its error string to complete_run(). This keeps run_task_pooled non-throwing (which the CLI path depends on) while giving the DB record the correct status and error.
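A minimal sketch of that check, under stated assumptions: the Event and Registry classes, the dict-based buffer, the find_terminal_error helper, the error keyword on complete_run, and the "failed" status value are all hypothetical and only illustrate the shape of the fix. The real event buffer and registry APIs may differ.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    # Hypothetical event shape: a type string plus a payload dict.
    type: str
    data: dict


class Registry:
    """Hypothetical stand-in for the run registry."""

    def __init__(self):
        self.runs = {}

    def complete_run(self, run_id, error=None):
        # Assumption: a non-empty error marks the run as "failed".
        status = "failed" if error else "completed"
        self.runs[run_id] = {"status": status, "error": error}


def find_terminal_error(events) -> Optional[str]:
    # Scan newest-first and return the message of the last nightshift.error.
    for event in reversed(events):
        if event.type == "nightshift.error":
            return event.data.get("error", "unknown error")
    return None


async def run_task_pooled(buffer, run_id):
    # Stays non-throwing: the failure is recorded only as an event.
    buffer.setdefault(run_id, []).append(
        Event("nightshift.error", {"error": "Command failed with exit code 1"})
    )


async def _run_agent_task(registry, buffer, run_id):
    await run_task_pooled(buffer, run_id)
    # The fix: inspect the buffer before marking the run complete.
    error = find_terminal_error(buffer.get(run_id, []))
    registry.complete_run(run_id, error=error)


registry, buffer = Registry(), {}
asyncio.run(_run_agent_task(registry, buffer, "run-1"))
```

With this shape, the CLI path can keep calling run_task_pooled unchanged, and only the server-side caller takes on the extra buffer lookup.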