
RFC: Runtime Driver interface and lifecycle manager #111

@dhunganapramod9

RFC: RuntimeDriver Interface

Summary

Introduce a RuntimeDriver / RuntimeInstance protocol pair that abstracts sandbox lifecycle management away from Firecracker. The VM pool, task orchestrator, and server will depend only on these protocols, decoupling core runtime flow from any specific virtualization technology. This enables future backends (Docker, Kata, cloud VMs) and provides clean lifecycle boundaries that future observability plugins can hook into (design only in this RFC).

Scope

In scope

  • Define RuntimeConfig, RuntimeInstance, and RuntimeDriver interfaces.
  • Refactor the existing Firecracker backend behind the new interfaces.
  • Collapse agent execution to a single path that always goes through the VM pool (remove the legacy non-pooled path).

Out of scope

  • Implementing a plugin system (design notes only).
  • Adding additional runtime backends beyond Firecracker.

Motivation (verified in current codebase)

Today FirecrackerVM and VMConfig (defined in src/nightshift/vm/manager.py) are imported and used directly in:

  • src/nightshift/vm/pool.py
  • src/nightshift/task.py
  • src/nightshift/server.py

As a result:

  • Backend portability is difficult (Firecracker concepts leak across layers).
  • No clean lifecycle boundary for observability hooks.
  • Driver-specific config and implementation details spread throughout the code.

There are currently two execution paths in src/nightshift/task.py:

  • run_task(...) (legacy path that constructs VMConfig/FirecrackerVM directly)
  • run_task_pooled(...) (pool-based path using pool.checkout(...))

And src/nightshift/server.py branches between them inside:

  • @app.post("/api/agents/{name}/runs")

This RFC collapses everything into a single path that always goes through the pool.

High-level design (internal server call flow, not network connections)

flowchart TB
  CLI["CLI (nightshift run)"] --> POST["POST /api/agents/{name}/runs"] --> S["server.py"]
  S --> RUN["_run_agent_task()"] --> TASK["run_task(pool, agent_id, ...)"]
  TASK --> CO["async with pool.checkout(...) as instance"]
  CO --> SUB["await instance.submit_run(...)"]
  SUB --> WAIT["await instance.wait_for_completion(...)"]
  WAIT --> RES["success -> checkin, error -> invalidate"]

  S -. "pool created during lifespan()" .-> POOL["VMPool(driver, idle_timeout, max_vms)"]
  POOL --> CREATE["driver.create_instance(id, RuntimeConfig)"]
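
The checkout, submit, wait sequence in the flow above can be sketched with a toy pool. This is only an illustration under stated assumptions: `FakeInstance`, `ToyPool`, the `prompt` argument, and the `fail` flag are hypothetical names introduced here, and the real `VMPool.checkout(...)` signature may differ.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeInstance:
    """Hypothetical stand-in for a RuntimeInstance; does no real work."""

    def __init__(self, instance_id: str, fail: bool = False) -> None:
        self.instance_id = instance_id
        self.fail = fail
        self.prompt = ""

    async def submit_run(self, prompt: str) -> None:
        # The RFC elides submit_run's parameters; `prompt` is an assumption.
        self.prompt = prompt

    async def wait_for_completion(self) -> str:
        if self.fail:
            raise RuntimeError("run failed")
        return f"done: {self.prompt}"


class ToyPool:
    """Hypothetical pool that only records what happened to each instance."""

    def __init__(self) -> None:
        self.checked_in: list[str] = []
        self.invalidated: list[str] = []

    @asynccontextmanager
    async def checkout(self, agent_id: str, fail: bool = False):
        instance = FakeInstance(f"vm-{agent_id}", fail=fail)
        try:
            yield instance
        except Exception:
            # error -> invalidate: the instance is not reused
            self.invalidated.append(instance.instance_id)
            raise
        else:
            # success -> checkin: the instance goes back to the pool
            self.checked_in.append(instance.instance_id)


async def run_task(pool: ToyPool, agent_id: str, prompt: str, fail: bool = False) -> str:
    # Single execution path: every run goes through pool.checkout(...).
    async with pool.checkout(agent_id, fail=fail) as instance:
        await instance.submit_run(prompt)
        return await instance.wait_for_completion()
```

Putting the checkin/invalidate decision inside the context manager keeps `run_task` free of cleanup logic, which is one way to read the "success -> checkin, error -> invalidate" step.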

Proposed new module

Add src/nightshift/vm/runtime.py.

RuntimeConfig (dataclass)

  • Fields: workspace_path, agent_pkg_path, env_vars, vcpu_count, mem_size_mib, health_timeout.
  • Excludes Firecracker-only fields such as kernel_path, base_rootfs_path, and event_port.
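
As a rough sketch, the dataclass might look like the following. The field names come from the list above; every type annotation and default value is an assumption made for illustration, not something this RFC specifies.

```python
from dataclasses import dataclass, field, fields
from pathlib import Path


@dataclass
class RuntimeConfig:
    # Field names are from the RFC; all types and defaults are assumptions.
    workspace_path: Path
    agent_pkg_path: Path
    env_vars: dict[str, str] = field(default_factory=dict)
    vcpu_count: int = 2
    mem_size_mib: int = 2048
    health_timeout: float = 30.0


# Hypothetical paths, used only to show construction.
config = RuntimeConfig(
    workspace_path=Path("/tmp/workspace"),
    agent_pkg_path=Path("/tmp/agent_pkg"),
)

# Backend-neutral by construction: no Firecracker-only fields are present.
field_names = {f.name for f in fields(RuntimeConfig)}
```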

RuntimeInstance (protocol)

  • instance_id
  • start(), submit_run(...), wait_for_completion(...), copy_workspace_out(...), destroy()
  • is_healthy(), is_healthy_async(), get_serial_log()
  • Async context manager (__aenter__ starts, __aexit__ destroys)

RuntimeDriver (protocol)

  • create_instance(instance_id, RuntimeConfig) -> RuntimeInstance (unstarted)
  • cleanup_stale_resources(), called once on startup (the Firecracker driver calls cleanup_stale_taps() in src/nightshift/vm/network.py)
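
One possible shape for the two protocols, using `typing.Protocol`, is sketched below. Since the RFC elides method parameters, they are left open as `*args`/`**kwargs`. Which methods are coroutines (other than `submit_run`, `wait_for_completion`, and `is_healthy_async`) is an assumption, as are all return types. `InMemoryInstance` and `InMemoryDriver` are hypothetical test doubles, and `RuntimeConfig` is trimmed to a single field to keep the example short.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@dataclass
class RuntimeConfig:
    """Trimmed stand-in for the RuntimeConfig described in this RFC."""
    workspace_path: str


@runtime_checkable
class RuntimeInstance(Protocol):
    instance_id: str

    async def start(self) -> None: ...
    async def submit_run(self, *args: Any, **kwargs: Any) -> Any: ...
    async def wait_for_completion(self, *args: Any, **kwargs: Any) -> Any: ...
    async def copy_workspace_out(self, *args: Any, **kwargs: Any) -> Any: ...
    async def destroy(self) -> None: ...
    def is_healthy(self) -> bool: ...
    async def is_healthy_async(self) -> bool: ...
    def get_serial_log(self) -> str: ...
    async def __aenter__(self) -> "RuntimeInstance": ...
    async def __aexit__(self, exc_type: Any, exc: Any, tb: Any) -> None: ...


@runtime_checkable
class RuntimeDriver(Protocol):
    def create_instance(self, instance_id: str, config: RuntimeConfig) -> RuntimeInstance: ...
    def cleanup_stale_resources(self) -> None: ...


class InMemoryInstance:
    """Hypothetical backend that tracks state instead of booting anything."""

    def __init__(self, instance_id: str, config: RuntimeConfig) -> None:
        self.instance_id = instance_id
        self.config = config
        self.state = "created"  # create_instance returns it unstarted

    async def start(self) -> None:
        self.state = "running"

    async def submit_run(self, *args: Any, **kwargs: Any) -> None:
        return None

    async def wait_for_completion(self, *args: Any, **kwargs: Any) -> str:
        return "ok"

    async def copy_workspace_out(self, *args: Any, **kwargs: Any) -> None:
        return None

    async def destroy(self) -> None:
        self.state = "destroyed"

    def is_healthy(self) -> bool:
        return self.state == "running"

    async def is_healthy_async(self) -> bool:
        return self.is_healthy()

    def get_serial_log(self) -> str:
        return ""

    async def __aenter__(self) -> "InMemoryInstance":
        await self.start()  # __aenter__ starts
        return self

    async def __aexit__(self, exc_type: Any, exc: Any, tb: Any) -> None:
        await self.destroy()  # __aexit__ destroys


class InMemoryDriver:
    def create_instance(self, instance_id: str, config: RuntimeConfig) -> InMemoryInstance:
        return InMemoryInstance(instance_id, config)

    def cleanup_stale_resources(self) -> None:
        pass  # nothing to clean up in memory


async def demo() -> list[str]:
    driver = InMemoryDriver()
    driver.cleanup_stale_resources()  # once, on startup
    instance = driver.create_instance("vm-1", RuntimeConfig(workspace_path="/tmp/ws"))
    states = [instance.state]
    async with instance:
        states.append(instance.state)
    states.append(instance.state)
    return states
```

Because protocols are structural, neither test double inherits from `RuntimeInstance` or `RuntimeDriver`; a Firecracker-backed class could satisfy them the same way.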

Changes (phase 1)

  • Firecracker driver: implement FirecrackerDriver, which maps RuntimeConfig -> VMConfig and returns a FirecrackerVM.
  • Pool (src/nightshift/vm/pool.py): accept a RuntimeDriver, stop importing FirecrackerVM/VMConfig, and create/start instances via the driver.
  • Task (src/nightshift/task.py): collapse to a single pooled run_task(...) path using pool.checkout(...).
  • Server (src/nightshift/server.py): in lifespan, construct the driver once, call cleanup_stale_resources(), and pass the driver into VMPool; in the runs endpoint, remove the branching and always use the pooled path.
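
The RuntimeConfig -> VMConfig mapping in the Firecracker driver could work roughly as follows. This is a sketch, not the actual implementation: `VMConfig` and `FirecrackerVM` here are hypothetical stand-ins for the real classes in src/nightshift/vm/manager.py, `RuntimeConfig` is trimmed to three fields, and keeping the Firecracker-only settings on the driver (including a fixed `event_port`) is an assumption. How `event_port` is allocated per instance is not covered by this RFC.

```python
from dataclasses import asdict, dataclass


@dataclass
class RuntimeConfig:
    # Trimmed to three of the RFC's fields for brevity.
    workspace_path: str
    vcpu_count: int
    mem_size_mib: int


@dataclass
class VMConfig:
    # Hypothetical stand-in: the backend-neutral fields plus Firecracker-only ones.
    workspace_path: str
    vcpu_count: int
    mem_size_mib: int
    kernel_path: str
    base_rootfs_path: str
    event_port: int


class FirecrackerVM:
    # Hypothetical stand-in; the real class manages a microVM.
    def __init__(self, instance_id: str, config: VMConfig) -> None:
        self.instance_id = instance_id
        self.config = config


class FirecrackerDriver:
    def __init__(self, kernel_path: str, base_rootfs_path: str, event_port: int) -> None:
        # Assumption: Firecracker-only settings live on the driver,
        # so callers only ever see RuntimeConfig.
        self._kernel_path = kernel_path
        self._base_rootfs_path = base_rootfs_path
        self._event_port = event_port

    def create_instance(self, instance_id: str, config: RuntimeConfig) -> FirecrackerVM:
        # Merge the backend-neutral fields with the driver's own settings.
        vm_config = VMConfig(
            **asdict(config),
            kernel_path=self._kernel_path,
            base_rootfs_path=self._base_rootfs_path,
            event_port=self._event_port,
        )
        return FirecrackerVM(instance_id, vm_config)  # returned unstarted

    def cleanup_stale_resources(self) -> None:
        # The real driver would call cleanup_stale_taps(); omitted in this sketch.
        pass


# Hypothetical paths and port, for illustration only.
driver = FirecrackerDriver("/opt/vmlinux", "/opt/rootfs.ext4", 9000)
vm = driver.create_instance("vm-1", RuntimeConfig("/tmp/ws", vcpu_count=2, mem_size_mib=1024))
```

With this split, pool.py only needs `driver.create_instance(...)` and never imports `VMConfig` or `FirecrackerVM`.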
