RFC: RuntimeDriver Interface
Summary
Introduce a RuntimeDriver / RuntimeInstance protocol pair that abstracts sandbox lifecycle management away from Firecracker. The VM pool, task orchestrator, and server will depend only on these protocols, decoupling core runtime flow from any specific virtualization technology. This enables future backends (Docker, Kata, cloud VMs) and provides clean lifecycle boundaries that future observability plugins can hook into (design only in this RFC).
Scope
In scope
- Define `RuntimeConfig`, `RuntimeInstance`, and `RuntimeDriver` interfaces.
- Refactor the existing Firecracker backend behind the new interfaces.
- Collapse agent execution to a single path that always goes through the VM pool (remove the legacy non-pooled path).
Out of scope
- Implementing a plugin system (design notes only).
- Adding additional runtime backends beyond Firecracker.
Motivation (verified in current codebase)
Today `FirecrackerVM` and `VMConfig` (defined in `src/nightshift/vm/manager.py`) are imported and used directly in:
- `src/nightshift/vm/pool.py`
- `src/nightshift/task.py`
- `src/nightshift/server.py`
This results in:
- Difficult backend portability (Firecracker concepts leak across layers).
- No clean lifecycle boundary for observability hooks.
- Driver-specific config and implementation details spread throughout the code.
There are currently two execution paths in `src/nightshift/task.py`:
- `run_task(...)` (legacy path that constructs `VMConfig`/`FirecrackerVM` directly)
- `run_task_pooled(...)` (pool-based path using `pool.checkout(...)`)
And `src/nightshift/server.py` branches between them inside:
`@app.post("/api/agents/{name}/runs")`
This RFC collapses everything into a single path that always goes through the pool.
High-level design (internal server call flow, not network connections)
```mermaid
flowchart TB
    CLI["CLI (nightshift run)"] --> POST["POST /api/agents/{name}/runs"] --> S["server.py"]
    S --> RUN["_run_agent_task()"] --> TASK["run_task(pool, agent_id, ...)"]
    TASK --> CO["async with pool.checkout(...) as instance"]
    CO --> SUB["await instance.submit_run(...)"]
    SUB --> WAIT["await instance.wait_for_completion(...)"]
    WAIT --> RES["success -> checkin, error -> invalidate"]
    S -. "pool created during lifespan()" .-> POOL["VMPool(driver, idle_timeout, max_vms)"]
    POOL --> CREATE["driver.create_instance(id, RuntimeConfig)"]
```
Proposed new module
Add `src/nightshift/vm/runtime.py`.
RuntimeConfig (dataclass)
- `workspace_path`, `agent_pkg_path`, `env_vars`, `vcpu_count`, `mem_size_mib`, `health_timeout`
- Excludes Firecracker-only fields like `kernel_path`, `base_rootfs_path`, `event_port`
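A minimal sketch of how this dataclass could look. The RFC only names the fields; the types, defaults, and `frozen=True` choice here are illustrative assumptions:

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass(frozen=True)
class RuntimeConfig:
    """Driver-agnostic sandbox configuration (no Firecracker-only fields)."""
    workspace_path: Path
    agent_pkg_path: Path
    env_vars: dict[str, str] = field(default_factory=dict)
    vcpu_count: int = 2            # illustrative default
    mem_size_mib: int = 1024       # illustrative default
    health_timeout: float = 30.0   # seconds; illustrative default
```

A driver would translate this into its own backend-specific config (e.g. the Firecracker driver adding `kernel_path` and `base_rootfs_path` itself).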
RuntimeInstance (protocol)
- `instance_id`
- `start()`, `submit_run(...)`, `wait_for_completion(...)`, `copy_workspace_out(...)`, `destroy()`
- `is_healthy()`, `is_healthy_async()`, `get_serial_log()`
- Async context manager (`__aenter__` starts, `__aexit__` destroys)
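A sketch of the protocol, assuming `typing.Protocol`. The method names come from the list above; the parameter lists are hypothetical, since the RFC elides them:

```python
from __future__ import annotations

from typing import Protocol


class RuntimeInstance(Protocol):
    """Per-sandbox lifecycle contract; parameter lists are hypothetical."""
    instance_id: str

    async def start(self) -> None: ...
    async def submit_run(self, agent_id: str) -> None: ...        # params assumed
    async def wait_for_completion(self, timeout: float | None = None) -> int: ...
    async def copy_workspace_out(self, dest_path: str) -> None: ...  # params assumed
    async def destroy(self) -> None: ...
    def is_healthy(self) -> bool: ...
    async def is_healthy_async(self) -> bool: ...
    def get_serial_log(self) -> str: ...

    # Async context manager: __aenter__ starts, __aexit__ destroys.
    async def __aenter__(self) -> RuntimeInstance: ...
    async def __aexit__(self, exc_type, exc, tb) -> None: ...
```

Because this is a structural protocol, `FirecrackerVM` (or any future Docker/Kata instance) satisfies it by shape alone, without inheriting from it.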
RuntimeDriver (protocol)
- `create_instance(instance_id, RuntimeConfig) -> RuntimeInstance` (unstarted)
- `cleanup_stale_resources()` called once on startup (Firecracker calls `cleanup_stale_taps()` in `src/nightshift/vm/network.py`)
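The driver protocol could be sketched as follows; `RuntimeConfig` and `RuntimeInstance` are the RFC's other proposed types, typed as `Any` here only to keep the sketch self-contained:

```python
from typing import Any, Protocol


class RuntimeDriver(Protocol):
    """Backend factory plus one-time startup cleanup."""

    def create_instance(self, instance_id: str, config: Any) -> Any:
        """Return an *unstarted* RuntimeInstance; the pool starts it later."""
        ...

    def cleanup_stale_resources(self) -> None:
        """Called once at server startup; the Firecracker driver would call
        cleanup_stale_taps() from src/nightshift/vm/network.py here."""
        ...
```

Keeping `create_instance` non-starting matters for the pool: it lets `VMPool` own start/health-check timing rather than the driver.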
Changes (phase 1)
- Firecracker driver: implement `FirecrackerDriver` that maps `RuntimeConfig -> VMConfig` and returns `FirecrackerVM`.
- Pool (`src/nightshift/vm/pool.py`): accept a `RuntimeDriver`, stop importing `FirecrackerVM`/`VMConfig`, create/start instances via the driver.
- Task (`src/nightshift/task.py`): collapse to a single pooled `run_task(...)` path using `pool.checkout(...)`.
- Server (`src/nightshift/server.py`): in lifespan, construct the driver once and call `cleanup_stale_resources()`, then pass it into `VMPool`; in the runs endpoint, remove branching and always use the pooled path.
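The collapsed task path could look like the sketch below. The `pool.checkout()` / `submit_run` / `wait_for_completion` calls follow the flow diagram above; `run_kwargs` is a hypothetical stand-in for whatever per-run parameters the endpoint forwards, and the checkout arguments are elided as in the RFC:

```python
async def run_task(pool, agent_id: str, **run_kwargs):
    """Single pooled execution path (replaces both legacy paths)."""
    # checkout is assumed to checkin on success and invalidate on error,
    # matching the flow: "success -> checkin, error -> invalidate".
    async with pool.checkout() as instance:
        await instance.submit_run(agent_id, **run_kwargs)
        return await instance.wait_for_completion()
```

With this shape, the server's runs endpoint no longer branches: it always calls `run_task(pool, ...)` and never touches `VMConfig` or `FirecrackerVM` directly.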