Runs with nightshift.error events are marked completed with error=null #108

@tensor-ninja

Bug

When a VM agent emits a nightshift.error event (e.g. bad API key, command failure), run_task_pooled() catches the exception internally, publishes an ErrorEvent to the event buffer, and returns normally. The caller, _run_agent_task, then calls registry.complete_run(run_id) without an error string, so the run record ends up as:

{"status": "completed", "error": null}

…even though the event stream contains a nightshift.error with the real error message.

Repro

GET /api/runs/5698aa36-f5b2-4601-9af8-ee36abfc0de8
→ {"status": "completed", "error": null}

But run_events for that ID:

nightshift.error  {"error": "Command failed with exit code 1 ..."}

Root cause

run_task_pooled() in task.py swallows the exception at lines 192-193:

except Exception as e:
    await log.publish(run_id, ErrorEvent(error=str(e)))

It publishes the error event but does not re-raise — so _run_agent_task in server.py sees a clean return and calls complete_run(run_id) with no error.
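
For anyone who wants to see the control flow in isolation, here is a minimal standalone sketch of the pattern. It is not the project's code: EventLog, the module-level runs dict, and this complete_run are hypothetical stand-ins, and the assumption that complete_run marks a run as failed when given an error string is mine.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ErrorEvent:
    # Stand-in for the real ErrorEvent; only the error field is modelled.
    error: str


class EventLog:
    """Hypothetical in-memory event buffer keyed by run_id."""

    def __init__(self):
        self.events = {}

    async def publish(self, run_id, event):
        self.events.setdefault(run_id, []).append(event)


log = EventLog()
runs = {}  # hypothetical stand-in for the run registry / DB table


def complete_run(run_id, error=None):
    # Assumption: an error string marks the run as failed.
    runs[run_id] = {"status": "failed" if error else "completed", "error": error}


async def run_task_pooled(run_id):
    try:
        raise RuntimeError("Command failed with exit code 1")
    except Exception as e:
        # Same shape as the snippet above: publish, but do not re-raise.
        await log.publish(run_id, ErrorEvent(error=str(e)))


async def _run_agent_task(run_id):
    await run_task_pooled(run_id)  # returns normally despite the failure
    complete_run(run_id)           # no error string is passed


asyncio.run(_run_agent_task("run-1"))
print(runs["run-1"])                 # {'status': 'completed', 'error': None}
print(log.events["run-1"][0].error)  # Command failed with exit code 1
```

The run record and the event stream disagree in exactly the way the repro shows.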

Suggested fix

After run_task_pooled returns in _run_agent_task, check the event buffer for a nightshift.error terminal event and pass its error string to complete_run(). This keeps run_task_pooled non-throwing (which the CLI path depends on) while giving the DB record the correct status and error.
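
A rough sketch of what that could look like, under stated assumptions: the events and runs dicts, publish, OutputEvent, terminal_error, and the fail flag are all hypothetical names invented for illustration, and complete_run is assumed to accept an optional error string and mark the run failed when one is given. The real event buffer API may look quite different.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorEvent:
    error: str


@dataclass
class OutputEvent:
    # Hypothetical non-error event, only here to make the buffer realistic.
    text: str


events = {}  # hypothetical event buffer: run_id -> list of events
runs = {}    # hypothetical run registry: run_id -> record


async def publish(run_id, event):
    events.setdefault(run_id, []).append(event)


def complete_run(run_id, error=None):
    # Assumption: passing an error string marks the run as failed.
    runs[run_id] = {"status": "failed" if error else "completed", "error": error}


async def run_task_pooled(run_id, fail):
    # Stays non-throwing, as the CLI path requires.
    try:
        await publish(run_id, OutputEvent("starting"))
        if fail:
            raise RuntimeError("Command failed with exit code 1")
    except Exception as e:
        await publish(run_id, ErrorEvent(error=str(e)))


def terminal_error(run_id) -> Optional[str]:
    """Return the most recent ErrorEvent message for the run, if any."""
    for event in reversed(events.get(run_id, [])):
        if isinstance(event, ErrorEvent):
            return event.error
    return None


async def _run_agent_task(run_id, fail=False):
    await run_task_pooled(run_id, fail)
    # The proposed change: inspect the buffer before completing the run.
    complete_run(run_id, error=terminal_error(run_id))


asyncio.run(_run_agent_task("ok-run"))
asyncio.run(_run_agent_task("bad-run", fail=True))
print(runs["ok-run"])   # {'status': 'completed', 'error': None}
print(runs["bad-run"])  # {'status': 'failed', 'error': 'Command failed with exit code 1'}
```

Only the caller changes here, so anything else that relies on run_task_pooled returning normally is unaffected.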
