Description
RFC Discussion: #115
Problem Statement
OpenViking's core write operations (rm, mv, add_resource, session.commit) coordinate across multiple subsystems — VikingFS, VectorDB, and QueueManager. Currently there is no atomicity guarantee: if a failure occurs mid-operation (e.g., VectorDB update fails after FS move succeeds), the system is left in an inconsistent state with no way to recover.
For example:
- `rm()` deletes files from FS, then fails to clean up VectorDB records → orphan index entries
- `mv()` moves files in FS, then VectorDB URI update fails → search returns stale URIs
- `session.commit()` archives messages, then LLM call times out → partial commit with no recovery
- Process crash during `add_resource` → enqueued semantic generation never happens
Proposed Solution
Implement a write-ahead journal + undo log transaction mechanism that wraps multi-subsystem operations with automatic rollback on failure and crash recovery on restart.
Architecture
TransactionContext (async context manager)
├── TransactionManager (lifecycle, journal, lock coordination)
│ ├── TransactionJournal (AGFS-persisted WAL at /local/_system/transactions/)
│ ├── PathLock (fencing-token-based distributed locks)
│ └── Crash Recovery (journal replay on startup)
├── UndoLog (ordered list of reversible sub-operations)
└── PostActions (deferred work after commit, e.g. enqueue_semantic)
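The PathLock component relies on fencing tokens of the form `{tx_id}:{monotonic_ns}`. As a rough sketch of how such a token could be built, parsed, and checked, assuming hypothetical helper names (`make_token`, `parse_token`, `is_stale`, `still_owner`) and a caller-supplied staleness threshold that this proposal does not specify:

```python
import time
from typing import NamedTuple, Optional


class FencingToken(NamedTuple):
    tx_id: str
    acquired_ns: int


def make_token(tx_id: str, now_ns: Optional[int] = None) -> str:
    """Build the lock-file content '{tx_id}:{monotonic_ns}'."""
    if now_ns is None:
        now_ns = time.monotonic_ns()
    return f"{tx_id}:{now_ns}"


def parse_token(raw: str) -> FencingToken:
    """Split on the last ':' so a tx_id containing ':' still parses."""
    tx_id, _, ns = raw.strip().rpartition(":")
    if not tx_id or not ns.isdigit():
        raise ValueError(f"malformed fencing token: {raw!r}")
    return FencingToken(tx_id, int(ns))


def is_stale(raw: str, now_ns: int, max_age_ns: int) -> bool:
    """A lock is treated as stale once it is older than max_age_ns (assumed policy)."""
    return now_ns - parse_token(raw).acquired_ns > max_age_ns


def still_owner(raw_on_disk: str, my_token: str) -> bool:
    """Compare the whole token, not just tx_id, so a re-acquired lock
    (same tx_id, newer timestamp) is not mistaken for the old one."""
    return raw_on_disk.strip() == my_token
```

Comparing the full token rather than only the transaction ID is what distinguishes two acquisitions by the same transaction, which is the ABA case the design refers to.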
Key Design Decisions
- Undo log, not WAL redo: We record how to reverse each completed sub-operation. On rollback, entries are replayed in reverse order. This is simpler than redo logging since our operations are heterogeneous (FS + VectorDB).
- Fencing tokens for locks: Lock files contain `{tx_id}:{monotonic_ns}` instead of just the transaction ID. This enables stale lock detection and prevents ABA problems during concurrent acquisition.
- Two-phase session commit: `session.commit()` is split into two independent transactions with a checkpoint file between them:
  - Phase 1 (Archive): Lock session → write archive → clear messages → checkpoint(status=archived)
  - LLM call (no transaction; it can safely be retried)
  - Phase 2 (Memory): Lock session → write memories → checkpoint(status=completed) → post_action(enqueue_semantic)
- Post-actions: Deferred work (like `enqueue_semantic`) is registered during the transaction but executed after commit. If the process crashes after commit but before post-actions complete, the journal retains them for replay on restart.
- Best-effort rollback: Each undo step is independently try-caught. A failing rollback entry doesn't prevent subsequent entries from running. `fs_rm` is marked as non-reversible (skip on rollback).
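To make the undo-log and best-effort rollback decisions concrete, here is a minimal in-memory sketch. The class names, the `handlers` mapping, and the `reversible` flag are illustrative assumptions, not the actual OpenViking implementation; journal persistence is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class UndoEntry:
    seq: int
    op: str                  # e.g. "fs_mv", "fs_rm" (hypothetical op names)
    params: dict
    completed: bool = False
    reversible: bool = True  # fs_rm would be recorded with reversible=False


@dataclass
class UndoLog:
    entries: List[UndoEntry] = field(default_factory=list)

    def record(self, op: str, params: dict, reversible: bool = True) -> int:
        """Record how to reverse a sub-operation before running it."""
        seq = len(self.entries)
        self.entries.append(UndoEntry(seq, op, params, reversible=reversible))
        return seq

    def mark_completed(self, seq: int) -> None:
        self.entries[seq].completed = True

    def rollback(self, handlers: Dict[str, Callable[[dict], None]]) -> List[int]:
        """Undo completed entries newest-first; return seqs whose undo failed."""
        failed: List[int] = []
        for entry in reversed(self.entries):
            if not entry.completed or not entry.reversible:
                continue  # nothing to undo, or marked non-reversible
            try:
                handlers[entry.op](entry.params)
            except Exception:
                # Best effort: note the failure and keep undoing older entries.
                failed.append(entry.seq)
        return failed
```

Returning the failed sequence numbers, rather than raising, lets the caller log or journal partial rollbacks while still running every remaining undo step.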
Operation-specific strategies
| Operation | Lock Mode | Order | Rollback |
|---|---|---|---|
| `rm()` | rm (bottom-up tree) or normal (parent for files) | VectorDB delete → FS delete | Restore VectorDB from snapshot |
| `mv()` | mv (src tree + dst parent) | FS move → VectorDB URI update | Reverse FS move |
| `add_resource` (finalize) | normal (parent dir) | FS move + post_action(enqueue_semantic) | Delete final dir |
| `session.commit` | normal (session dir) × 2 | Two-phase with checkpoint | Phase-specific cleanup |
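The two-phase `session.commit` strategy can be illustrated with a small sketch that keeps the session in a plain dict. Locks, the journal, and the `enqueue_semantic` post-action are left out, and resuming from the checkpoint on retry is an assumption about how the checkpoint file would be used; `commit_session` and `summarize` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List


async def commit_session(
    session: dict,
    summarize: Callable[[List[str]], Awaitable[List[str]]],
) -> None:
    # Phase 1 (archive): runs only if no checkpoint has been written yet.
    if session.get("checkpoint") is None:
        session["archive"] = list(session["messages"])
        session["messages"].clear()
        session["checkpoint"] = "archived"

    if session["checkpoint"] == "archived":
        # LLM call happens outside any transaction; if it raises, the
        # "archived" checkpoint stays in place and the call can be retried.
        memories = await summarize(session["archive"])

        # Phase 2 (memory): write memories, then mark the commit completed.
        session["memories"] = memories
        session["checkpoint"] = "completed"
```

Calling `commit_session` again after a timeout skips the archive step and retries only the LLM call and Phase 2, so the messages are never archived twice.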
Crash Recovery
On TransactionManager.start(), scan /local/_system/transactions/ for residual journals:
| Journal Status | Recovery Action |
|---|---|
| `COMMITTED` + post_actions | Replay post_actions → cleanup |
| `COMMITTED` / `RELEASED` | Cleanup locks + journal |
| `EXEC` (age > 300s) | Execute rollback → cleanup |
| `INIT` / `ACQUIRE` | Cleanup locks + journal |
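The recovery table can be read as a pure decision function over a journal's status, age, and pending post-actions. The sketch below uses hypothetical names (`JournalStatus`, `plan_recovery`) and returns step labels instead of performing I/O. Leaving a young `EXEC` journal untouched is an assumption, since the table only specifies the case where the age exceeds 300s.

```python
from enum import Enum
from typing import List


class JournalStatus(str, Enum):
    INIT = "INIT"
    ACQUIRE = "ACQUIRE"
    EXEC = "EXEC"
    COMMITTED = "COMMITTED"
    RELEASED = "RELEASED"


EXEC_TIMEOUT_S = 300.0  # threshold from the recovery table


def plan_recovery(status: JournalStatus, age_s: float, has_post_actions: bool) -> List[str]:
    """Return the ordered recovery steps for one residual journal."""
    if status is JournalStatus.COMMITTED and has_post_actions:
        return ["replay_post_actions", "cleanup"]
    if status in (JournalStatus.COMMITTED, JournalStatus.RELEASED):
        return ["cleanup"]
    if status is JournalStatus.EXEC:
        if age_s > EXEC_TIMEOUT_S:
            return ["rollback", "cleanup"]
        # Assumption: a recent EXEC journal may belong to a live transaction.
        return []
    # INIT / ACQUIRE: no sub-operation ran, so only locks and journal remain.
    return ["cleanup"]
```

Keeping the decision separate from the side effects makes each row of the table testable without touching AGFS.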
Alternatives Considered
- Database-level transactions (SQLite): Only covers VectorDB, not AGFS filesystem operations.
- Saga pattern with compensating transactions: More complex orchestration; undo log achieves the same goal with simpler code.
- Event sourcing: Overkill for the current write patterns; would require rewriting the storage layer.
Use Case
Any production deployment where:
- Process crashes or restarts can leave data inconsistent
- Concurrent operations on the same directory tree need coordination
- `session.commit()` with LLM calls can time out, leaving partial state
- `add_resource` must guarantee that semantic generation is eventually enqueued
Example API (Optional)
from openviking.storage.transaction import TransactionContext, get_transaction_manager

tx_manager = get_transaction_manager()

async with TransactionContext(tx_manager, "my_operation", [path], lock_mode="normal") as tx:
    # Record undo entry before each sub-operation
    seq = tx.record_undo("fs_write_new", {"uri": path})
    await do_something(path)
    tx.mark_completed(seq)

    # Register deferred post-commit work
    tx.add_post_action("enqueue_semantic", {"uri": uri, ...})

    # Commit (or auto-rollback on exception / missing commit)
    await tx.commit()

Additional Context
- Design document: `test_scripts/transaction-design.md`
- Existing skeleton code in `openviking/storage/transaction/` has been extended
- The `Session.commit()` method changes from sync to async (`def commit` → `async def commit`)
- 52 new tests cover unit (undo, journal, path_lock, context_manager) and integration (rm/mv rollback, crash recovery, post_actions, concurrent locks)
Contribution
- I am willing to contribute to implementing this feature