[Feature]: Transaction mechanism for atomic multi-subsystem operations #390

@qin-ctx

RFC Discussion: #115

Problem Statement

OpenViking's core write operations (rm, mv, add_resource, session.commit) coordinate across multiple subsystems — VikingFS, VectorDB, and QueueManager. Currently there is no atomicity guarantee: if a failure occurs mid-operation (e.g., VectorDB update fails after FS move succeeds), the system is left in an inconsistent state with no way to recover.

For example:

  • rm() deletes files from FS, then fails to clean up VectorDB records → orphan index entries
  • mv() moves files in FS, then VectorDB URI update fails → search returns stale URIs
  • session.commit() archives messages, then LLM call times out → partial commit with no recovery
  • Process crash during add_resource → enqueued semantic generation never happens

Proposed Solution

Implement a write-ahead journal + undo log transaction mechanism that wraps multi-subsystem operations with automatic rollback on failure and crash recovery on restart.

Architecture

TransactionContext (async context manager)
  ├── TransactionManager (lifecycle, journal, lock coordination)
  │     ├── TransactionJournal (AGFS-persisted WAL at /local/_system/transactions/)
  │     ├── PathLock (fencing-token-based distributed locks)
  │     └── Crash Recovery (journal replay on startup)
  ├── UndoLog (ordered list of reversible sub-operations)
  └── PostActions (deferred work after commit, e.g. enqueue_semantic)

Key Design Decisions

  1. Undo log, not WAL redo: We record how to reverse each completed sub-operation. On rollback, entries are replayed in reverse order. This is simpler than redo logging since our operations are heterogeneous (FS + VectorDB).

  2. Fencing tokens for locks: Lock files contain {tx_id}:{monotonic_ns} instead of just transaction ID. This enables stale lock detection and prevents ABA problems during concurrent acquisition.

  3. Two-phase session commit: session.commit() is split into two independent transactions with a checkpoint file between them:

    • Phase 1 (Archive): Lock session → write archive → clear messages → checkpoint(status=archived)
    • LLM call (no transaction — can safely retry)
    • Phase 2 (Memory): Lock session → write memories → checkpoint(status=completed) → post_action(enqueue_semantic)

  4. Post-actions: Deferred work (like enqueue_semantic) is registered during the transaction but executed after commit. If the process crashes after commit but before post-actions complete, the journal retains them for replay on restart.

  5. Best-effort rollback: Each undo step runs in its own try/except, so a failing rollback entry does not prevent the remaining entries from running. fs_rm is marked as non-reversible and is skipped on rollback.
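
To make decisions 1 and 5 concrete, here is a minimal in-memory sketch of an undo log with best-effort reverse rollback. The names `UndoEntry` and `UndoLog` and the callback-based design are assumptions for illustration only; they are not taken from the openviking code, which may store undo parameters rather than callables.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class UndoEntry:
    seq: int
    op: str
    # None marks a non-reversible operation (the issue names fs_rm as one).
    undo: Optional[Callable[[], None]] = None
    completed: bool = False


@dataclass
class UndoLog:
    """Hypothetical stand-in for the proposed UndoLog; purely in-memory."""
    entries: list = field(default_factory=list)

    def record(self, op, undo=None):
        # Record the entry BEFORE the sub-operation runs.
        seq = len(self.entries)
        self.entries.append(UndoEntry(seq, op, undo))
        return seq

    def mark_completed(self, seq):
        self.entries[seq].completed = True

    def rollback(self):
        """Undo completed entries newest-first; collect failures, never stop early."""
        failures = []
        for entry in reversed(self.entries):
            if not entry.completed or entry.undo is None:
                continue  # nothing to reverse, or marked non-reversible
            try:
                entry.undo()
            except Exception as exc:  # best effort: keep going
                failures.append((entry.seq, repr(exc)))
        return failures


def failing_undo():
    raise RuntimeError("index offline")


calls = []
log = UndoLog()
s0 = log.record("fs_mv", undo=lambda: calls.append("undo fs_mv"))
log.mark_completed(s0)
s1 = log.record("vectordb_update", undo=failing_undo)
log.mark_completed(s1)
s2 = log.record("fs_rm")  # non-reversible, skipped on rollback
log.mark_completed(s2)

failures = log.rollback()
```

In this run the failing `vectordb_update` undo is reported in `failures`, while the older `fs_mv` entry is still reversed.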

Operation-specific strategies

| Operation | Lock Mode | Order | Rollback |
|---|---|---|---|
| rm() | rm (bottom-up tree) or normal (parent for files) | VectorDB delete → FS delete | Restore VectorDB from snapshot |
| mv() | mv (src tree + dst parent) | FS move → VectorDB URI update | Reverse FS move |
| add_resource (finalize) | normal (parent dir) | FS move + post_action(enqueue_semantic) | Delete final dir |
| session.commit | normal (session dir) × 2 | Two-phase with checkpoint | Phase-specific cleanup |
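
Every lock mode in the table relies on the fencing-token lock files from design decision 2. The sketch below shows one way a `{tx_id}:{monotonic_ns}` token could support stale-lock detection and guard against ABA on release. It is an in-memory, single-process illustration: `PathLockTable`, `STALE_AFTER_NS`, and the 300-second staleness threshold are assumptions, and a real implementation would need an atomic create-if-absent on the backing store.

```python
import time

# Assumed staleness threshold; the issue does not specify one for locks.
STALE_AFTER_NS = 300 * 1_000_000_000


def make_token(tx_id, now_ns):
    return f"{tx_id}:{now_ns}"


def parse_token(token):
    # rpartition tolerates colons inside the transaction ID.
    tx_id, _, ns = token.rpartition(":")
    return tx_id, int(ns)


class PathLockTable:
    """Hypothetical in-memory stand-in for lock files; not the real PathLock."""

    def __init__(self):
        self._locks = {}  # path -> token string

    def acquire(self, path, tx_id, now_ns=None):
        now_ns = time.monotonic_ns() if now_ns is None else now_ns
        current = self._locks.get(path)
        if current is not None:
            _, taken_ns = parse_token(current)
            if now_ns - taken_ns < STALE_AFTER_NS:
                return None  # held by a transaction that is not yet stale
        token = make_token(tx_id, now_ns)
        self._locks[path] = token  # fresh acquire, or takeover of a stale lock
        return token

    def release(self, path, token):
        # Compare the full token, not only the tx_id: a holder whose lock
        # was reclaimed as stale must not delete the new holder's lock.
        if self._locks.get(path) != token:
            return False
        del self._locks[path]
        return True


table = PathLockTable()
t1 = table.acquire("/res/docs", "tx-1", now_ns=0)
blocked = table.acquire("/res/docs", "tx-2", now_ns=1_000)
t2 = table.acquire("/res/docs", "tx-2", now_ns=STALE_AFTER_NS + 1)
late_release = table.release("/res/docs", t1)
```

Here `tx-2` is refused while the lock is fresh, takes over once it is stale, and the late release by `tx-1` is rejected because its token no longer matches.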

Crash Recovery

On TransactionManager.start(), scan /local/_system/transactions/ for residual journals:

| Journal Status | Recovery Action |
|---|---|
| COMMITTED + post_actions | Replay post_actions → cleanup |
| COMMITTED / RELEASED | Cleanup locks + journal |
| EXEC (age > 300s) | Execute rollback → cleanup |
| INIT / ACQUIRE | Cleanup locks + journal |
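
The recovery table can be read as a pure decision function. The sketch below encodes it that way; `JournalStatus`, `plan_recovery`, and the step names are hypothetical. The table does not say what happens to an EXEC journal younger than 300s, so leaving it untouched (it may belong to a transaction that is still running) is an assumption.

```python
from enum import Enum

EXEC_TIMEOUT_S = 300  # from the table: EXEC journals older than this are rolled back


class JournalStatus(Enum):
    INIT = "INIT"
    ACQUIRE = "ACQUIRE"
    EXEC = "EXEC"
    COMMITTED = "COMMITTED"
    RELEASED = "RELEASED"


def plan_recovery(status, age_s=0.0, has_post_actions=False):
    """Return the ordered recovery steps for one residual journal."""
    if status is JournalStatus.COMMITTED and has_post_actions:
        return ["replay_post_actions", "cleanup"]
    if status in (JournalStatus.COMMITTED, JournalStatus.RELEASED):
        return ["cleanup_locks", "cleanup_journal"]
    if status is JournalStatus.EXEC:
        if age_s > EXEC_TIMEOUT_S:
            return ["rollback", "cleanup"]
        # Assumption: a young EXEC journal is left alone for now.
        return []
    # INIT / ACQUIRE: no sub-operation ran, so there is nothing to undo.
    return ["cleanup_locks", "cleanup_journal"]


stale_exec_plan = plan_recovery(JournalStatus.EXEC, age_s=301)
pending_post_plan = plan_recovery(JournalStatus.COMMITTED, has_post_actions=True)
```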

Alternatives Considered

  • Database-level transactions (SQLite): Only covers VectorDB, not AGFS filesystem operations.
  • Saga pattern with compensating transactions: More complex orchestration; undo log achieves the same goal with simpler code.
  • Event sourcing: Overkill for the current write patterns; would require rewriting the storage layer.

Use Case

Any production deployment where:

  • Process crashes or restarts can leave data inconsistent
  • Concurrent operations on the same directory tree need coordination
  • session.commit() with LLM calls can time out, leaving partial state
  • add_resource must guarantee that semantic generation is eventually enqueued
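
The session.commit() timeout case is the one the two-phase design targets. The sketch below shows how a checkpoint between the phases could let a retry resume after a failed LLM call without re-archiving. It is a simplification under stated assumptions: the session is a plain dict, the functions are synchronous (the proposal makes Session.commit() async), locking and the enqueue_semantic post-action are omitted, and all names are hypothetical.

```python
def phase1_archive(session):
    # Phase 1: archive and clear messages, then checkpoint as "archived".
    if session["checkpoint"] is not None:
        return  # already archived by an earlier attempt
    session["archive"] = list(session["messages"])
    session["messages"].clear()
    session["checkpoint"] = "archived"


def phase2_memory(session, memories):
    # Phase 2: persist the extracted memories, then checkpoint as "completed".
    session["memories"] = memories
    session["checkpoint"] = "completed"


def commit(session, llm_extract):
    if session["checkpoint"] == "completed":
        return
    phase1_archive(session)
    # The LLM call runs outside both phases, so a failure here leaves the
    # session at checkpoint "archived" and the whole commit can be retried.
    memories = llm_extract(session["archive"])
    phase2_memory(session, memories)


attempts = {"n": 0}


def flaky_llm(archive):
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise TimeoutError("LLM call timed out")
    return [f"memory of {len(archive)} messages"]


session = {"messages": ["hi", "bye"], "archive": None,
           "memories": None, "checkpoint": None}

try:
    commit(session, flaky_llm)
except TimeoutError:
    pass
checkpoint_after_timeout = session["checkpoint"]

commit(session, flaky_llm)  # retry resumes from the "archived" checkpoint
```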

Example API (Optional)

from openviking.storage.transaction import TransactionContext, get_transaction_manager

tx_manager = get_transaction_manager()

async with TransactionContext(tx_manager, "my_operation", [path], lock_mode="normal") as tx:
    # Record undo entry before each sub-operation
    seq = tx.record_undo("fs_write_new", {"uri": path})
    await do_something(path)
    tx.mark_completed(seq)

    # Register deferred post-commit work
    tx.add_post_action("enqueue_semantic", {"uri": uri, ...})

    # Commit (or auto-rollback on exception / missing commit)
    await tx.commit()
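
For readers who want to try the control flow without the real package, the following toy context manager mirrors the method names used above (record_undo, mark_completed, add_post_action, commit). `ToyTransactionContext` and its handler dictionaries are assumptions for illustration: there is no journal, no locking, and no crash recovery, and the constructor signature differs from the proposed `TransactionContext`.

```python
import asyncio


class ToyTransactionContext:
    """In-memory stand-in; NOT the openviking TransactionContext."""

    def __init__(self, name, undo_handlers, post_handlers):
        self.name = name
        self._undo_handlers = undo_handlers  # op name -> callable(params)
        self._post_handlers = post_handlers  # action name -> callable(params)
        self._undo = []                      # [op, params, completed]
        self._post_actions = []
        self._committed = False

    async def __aenter__(self):
        return self

    def record_undo(self, op, params):
        self._undo.append([op, params, False])
        return len(self._undo) - 1

    def mark_completed(self, seq):
        self._undo[seq][2] = True

    def add_post_action(self, name, params):
        self._post_actions.append((name, params))

    async def commit(self):
        self._committed = True

    async def __aexit__(self, exc_type, exc, tb):
        if self._committed:
            # Deferred work runs only after a successful commit.
            for name, params in self._post_actions:
                self._post_handlers[name](params)
            return False
        # Exception or missing commit: best-effort rollback, newest first.
        for op, params, completed in reversed(self._undo):
            if completed:
                try:
                    self._undo_handlers[op](params)
                except Exception:
                    pass
        return False  # never swallow the caller's exception


async def demo():
    files, queue = set(), []
    undo = {"fs_write_new": lambda p: files.discard(p["uri"])}
    post = {"enqueue_semantic": lambda p: queue.append(p["uri"])}

    async with ToyTransactionContext("ok", undo, post) as tx:
        seq = tx.record_undo("fs_write_new", {"uri": "/a"})
        files.add("/a")
        tx.mark_completed(seq)
        tx.add_post_action("enqueue_semantic", {"uri": "/a"})
        await tx.commit()

    try:
        async with ToyTransactionContext("fails", undo, post) as tx:
            seq = tx.record_undo("fs_write_new", {"uri": "/b"})
            files.add("/b")
            tx.mark_completed(seq)
            raise RuntimeError("VectorDB update failed")
    except RuntimeError:
        pass
    return files, queue


files, queue = asyncio.run(demo())
```

After the demo, only the committed write remains and only its post-action has run; the failed transaction's write is rolled back and the exception still reaches the caller.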

Additional Context

  • Design document: test_scripts/transaction-design.md
  • Existing skeleton code in openviking/storage/transaction/ has been extended
  • The Session.commit() method changes from sync to async (def commit → async def commit)
  • 52 new tests cover unit (undo, journal, path_lock, context_manager) and integration (rm/mv rollback, crash recovery, post_actions, concurrent locks)

Contribution

  • I am willing to contribute to implementing this feature

Metadata

Labels: enhancement (New feature or request)
Status: In progress