RFC: RuntimeDriver Interface
Summary
Introduce a RuntimeDriver / RuntimeInstance protocol pair that abstracts sandbox lifecycle management away from Firecracker. The VM pool, task orchestrator, and server will depend only on these protocols, decoupling core runtime flow from any specific virtualization technology. This enables future backends (Docker, Kata, cloud VMs) and provides clean lifecycle boundaries that future observability plugins can hook into (design only in this RFC).
Scope
In scope
- Define `RuntimeConfig`, `RuntimeInstance`, and `RuntimeDriver` interfaces.
- Refactor the existing Firecracker backend behind the new interfaces.
- Collapse agent execution to a single path that always goes through the VM pool (remove the legacy non-pooled path).
Out of scope
- Implementing a plugin system (design notes only).
- Adding additional runtime backends beyond Firecracker.
Motivation (verified in current codebase)
Today `FirecrackerVM` and `VMConfig` (defined in `src/nightshift/vm/manager.py`) are imported and used directly in:
- `src/nightshift/vm/pool.py`
- `src/nightshift/task.py`
- `src/nightshift/server.py`
This results in:
- Difficult backend portability (Firecracker concepts leak across layers).
- No clean lifecycle boundary for observability hooks.
- Driver-specific config and implementation details spread throughout the code.
There are currently two execution paths in `src/nightshift/task.py`:
- `run_task(...)` (legacy path that constructs `VMConfig`/`FirecrackerVM` directly)
- `run_task_pooled(...)` (pool-based path using `pool.checkout(...)`)
And `src/nightshift/server.py` branches between them inside:
`@app.post("/api/agents/{name}/runs")`
This RFC collapses everything into a single path that always goes through the pool.
High-level design (internal server call flow, not network connections)
```mermaid
flowchart TB
    CLI["CLI (nightshift run)"] --> POST["POST /api/agents/{name}/runs"] --> S["server.py"]
    S --> RUN["_run_agent_task()"] --> TASK["run_task(pool, agent_id, ...)"]
    TASK --> CO["async with pool.checkout(...) as instance"]
    CO --> SUB["await instance.submit_run(...)"]
    SUB --> WAIT["await instance.wait_for_completion(...)"]
    WAIT --> RES["success -> checkin, error -> invalidate"]
    S -. "pool created during lifespan()" .-> POOL["VMPool(driver, idle_timeout, max_vms)"]
    POOL --> CREATE["driver.create_instance(id, RuntimeConfig)"]
```
Proposed new module
Add `src/nightshift/vm/runtime.py`.
RuntimeConfig (dataclass)
- `workspace_path`, `agent_pkg_path`, `env_vars`, `vcpu_count`, `mem_size_mib`, `health_timeout`
- Excludes Firecracker-only fields like `kernel_path`, `base_rootfs_path`, `event_port`
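A minimal sketch of how this dataclass could look. The RFC only names the fields; the types, defaults, and `frozen=True` choice here are illustrative assumptions:

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass(frozen=True)
class RuntimeConfig:
    """Driver-agnostic sandbox configuration (no Firecracker-only fields)."""
    workspace_path: Path
    agent_pkg_path: Path
    env_vars: dict[str, str] = field(default_factory=dict)
    vcpu_count: int = 2            # illustrative default
    mem_size_mib: int = 1024       # illustrative default
    health_timeout: float = 30.0   # seconds; illustrative default
```

A driver would translate this into its own backend-specific config (e.g. the Firecracker driver adding `kernel_path` and `base_rootfs_path` itself).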
RuntimeInstance (protocol)
- `instance_id`
- `start()`, `submit_run(...)`, `wait_for_completion(...)`, `copy_workspace_out(...)`, `destroy()`
- `is_healthy()`, `is_healthy_async()`, `get_serial_log()`
- Async context manager (`__aenter__` starts, `__aexit__` destroys)
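A sketch of the protocol, assuming `typing.Protocol`. The method names come from the list above; the parameter lists are hypothetical, since the RFC elides them:

```python
from __future__ import annotations

from typing import Protocol


class RuntimeInstance(Protocol):
    """Per-sandbox lifecycle contract; parameter lists are hypothetical."""
    instance_id: str

    async def start(self) -> None: ...
    async def submit_run(self, agent_id: str) -> None: ...        # params assumed
    async def wait_for_completion(self, timeout: float | None = None) -> int: ...
    async def copy_workspace_out(self, dest_path: str) -> None: ...  # params assumed
    async def destroy(self) -> None: ...
    def is_healthy(self) -> bool: ...
    async def is_healthy_async(self) -> bool: ...
    def get_serial_log(self) -> str: ...

    # Async context manager: __aenter__ starts, __aexit__ destroys.
    async def __aenter__(self) -> RuntimeInstance: ...
    async def __aexit__(self, exc_type, exc, tb) -> None: ...
```

Because this is a structural protocol, `FirecrackerVM` (or any future Docker/Kata instance) satisfies it by shape alone, without inheriting from it.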
RuntimeDriver (protocol)
- `create_instance(instance_id, RuntimeConfig) -> RuntimeInstance` (unstarted)
- `cleanup_stale_resources()` called once on startup (Firecracker calls `cleanup_stale_taps()` in `src/nightshift/vm/network.py`)
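The driver protocol could be sketched as follows; `RuntimeConfig` and `RuntimeInstance` are the RFC's other proposed types, typed as `Any` here only to keep the sketch self-contained:

```python
from typing import Any, Protocol


class RuntimeDriver(Protocol):
    """Backend factory plus one-time startup cleanup."""

    def create_instance(self, instance_id: str, config: Any) -> Any:
        """Return an *unstarted* RuntimeInstance; the pool starts it later."""
        ...

    def cleanup_stale_resources(self) -> None:
        """Called once at server startup; the Firecracker driver would call
        cleanup_stale_taps() from src/nightshift/vm/network.py here."""
        ...
```

Keeping `create_instance` non-starting matters for the pool: it lets `VMPool` own start/health-check timing rather than the driver.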
Changes (phase 1)
- Firecracker driver: implement `FirecrackerDriver` that maps `RuntimeConfig -> VMConfig` and returns `FirecrackerVM`.
- Pool (`src/nightshift/vm/pool.py`): accept a `RuntimeDriver`, stop importing `FirecrackerVM`/`VMConfig`, create/start instances via the driver.
- Task (`src/nightshift/task.py`): collapse to a single pooled `run_task(...)` path using `pool.checkout(...)`.
- Server (`src/nightshift/server.py`): in lifespan, construct the driver once and call `cleanup_stale_resources()`, then pass it into `VMPool`; in the runs endpoint, remove branching and always use the pooled path.
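The collapsed task path could look like the sketch below. The `pool.checkout()` / `submit_run` / `wait_for_completion` calls follow the flow diagram above; `run_kwargs` is a hypothetical stand-in for whatever per-run parameters the endpoint forwards, and the checkout arguments are elided as in the RFC:

```python
async def run_task(pool, agent_id: str, **run_kwargs):
    """Single pooled execution path (replaces both legacy paths)."""
    # checkout is assumed to checkin on success and invalidate on error,
    # matching the flow: "success -> checkin, error -> invalidate".
    async with pool.checkout() as instance:
        await instance.submit_run(agent_id, **run_kwargs)
        return await instance.wait_for_completion()
```

With this shape, the server's runs endpoint no longer branches: it always calls `run_task(pool, ...)` and never touches `VMConfig` or `FirecrackerVM` directly.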