[Feature]: Transaction mechanism for atomic multi-subsystem operations #390

> **RFC Discussion**: https://github.com/volcengine/OpenViking/discussions/115

## Problem Statement

OpenViking's core write operations (`rm`, `mv`, `add_resource`, `session.commit`) coordinate across multiple subsystems — VikingFS, VectorDB, and QueueManager. Currently there is no atomicity guarantee: if a failure occurs mid-operation (e.g., VectorDB update fails after FS move succeeds), the system is left in an inconsistent state with no way to recover.

For example:
- `rm()` deletes files from FS, then fails to clean up VectorDB records → orphan index entries
- `mv()` moves files in FS, then VectorDB URI update fails → search returns stale URIs
- `session.commit()` archives messages, then LLM call times out → partial commit with no recovery
- Process crash during `add_resource` → enqueued semantic generation never happens

## Proposed Solution

Implement a **write-ahead journal + undo log** transaction mechanism that wraps multi-subsystem operations with automatic rollback on failure and crash recovery on restart.

### Architecture

```
TransactionContext (async context manager)
  ├── TransactionManager (lifecycle, journal, lock coordination)
  │     ├── TransactionJournal (AGFS-persisted WAL at /local/_system/transactions/)
  │     ├── PathLock (fencing-token-based distributed locks)
  │     └── Crash Recovery (journal replay on startup)
  ├── UndoLog (ordered list of reversible sub-operations)
  └── PostActions (deferred work after commit, e.g. enqueue_semantic)
```

### Key Design Decisions

1. **Undo log, not WAL redo**: We record how to reverse each completed sub-operation. On rollback, entries are replayed in reverse order. This is simpler than redo logging since our operations are heterogeneous (FS + VectorDB).

2. **Fencing tokens for locks**: Lock files contain `{tx_id}:{monotonic_ns}` instead of just transaction ID. This enables stale lock detection and prevents ABA problems during concurrent acquisition.

3. **Two-phase session commit**: `session.commit()` is split into two independent transactions with a checkpoint file between them:
   - **Phase 1 (Archive)**: Lock session → write archive → clear messages → checkpoint(status=archived)
   - **LLM call** (no transaction — can safely retry)
   - **Phase 2 (Memory)**: Lock session → write memories → checkpoint(status=completed) → post_action(enqueue_semantic)

4. **Post-actions**: Deferred work (like `enqueue_semantic`) is registered during the transaction but executed after commit. If the process crashes after commit but before post-actions complete, the journal retains them for replay on restart.

5. **Best-effort rollback**: Each undo step runs inside its own try/except, so a failing rollback entry does not prevent subsequent entries from running. `fs_rm` is marked as non-reversible and is skipped on rollback.
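
To make decisions 1 and 5 concrete, here is a minimal sketch of an undo log with best-effort reverse replay. It is not the OpenViking implementation: the `UndoEntry` and `UndoLog` classes, the `handlers` mapping, and the `vectordb_upsert` and `fs_mv` operation names are hypothetical. Only `fs_write_new` and `fs_rm` appear in this proposal. The sketch also assumes that only entries marked completed are undone.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Per the design, fs_rm cannot be reversed and is skipped on rollback.
NON_REVERSIBLE = {"fs_rm"}


@dataclass
class UndoEntry:
    seq: int
    op: str
    params: dict
    completed: bool = False


@dataclass
class UndoLog:
    """Hypothetical undo log: ordered entries, undone newest-first."""
    entries: List[UndoEntry] = field(default_factory=list)

    def record(self, op: str, params: dict) -> int:
        seq = len(self.entries)
        self.entries.append(UndoEntry(seq, op, params))
        return seq

    def mark_completed(self, seq: int) -> None:
        self.entries[seq].completed = True

    def rollback(self, handlers: Dict[str, Callable[[dict], None]]) -> List[int]:
        """Undo completed entries in reverse order; return seqs whose undo failed."""
        failed: List[int] = []
        for entry in reversed(self.entries):
            if not entry.completed or entry.op in NON_REVERSIBLE:
                continue
            try:
                handlers[entry.op](entry.params)
            except Exception:
                # Best effort: note the failure and keep undoing older entries.
                failed.append(entry.seq)
        return failed


# Demo: four completed sub-operations, one of whose undo handlers fails.
undone: List[str] = []


def failing_undo(params: dict) -> None:
    raise RuntimeError("simulated undo failure")


handlers = {
    "fs_write_new": lambda p: undone.append("fs_write_new"),
    "vectordb_upsert": failing_undo,
    "fs_mv": lambda p: undone.append("fs_mv"),
}

log = UndoLog()
for op, params in [
    ("fs_write_new", {"uri": "/a"}),
    ("fs_rm", {"uri": "/b"}),
    ("vectordb_upsert", {"uri": "/a"}),
    ("fs_mv", {"src": "/x", "dst": "/y"}),
]:
    log.mark_completed(log.record(op, params))

failed = log.rollback(handlers)
```

In the demo the `fs_mv` undo runs first, the `vectordb_upsert` undo raises and is recorded in `failed`, `fs_rm` is skipped, and the `fs_write_new` undo still runs.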

### Operation-specific strategies

| Operation | Lock Mode | Order | Rollback |
|-----------|-----------|-------|----------|
| `rm()` | rm (bottom-up tree) or normal (parent for files) | VectorDB delete → FS delete | Restore VectorDB from snapshot |
| `mv()` | mv (src tree + dst parent) | FS move → VectorDB URI update | Reverse FS move |
| `add_resource` (finalize) | normal (parent dir) | FS move + post_action(enqueue_semantic) | Delete final dir |
| `session.commit` | normal (session dir) × 2 | Two-phase with checkpoint | Phase-specific cleanup |
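
Every lock mode in the table relies on lock files that carry the `{tx_id}:{monotonic_ns}` fencing token from design decision 2. The sketch below shows one way such a token could be built, parsed, and checked. The function names and the 300-second staleness threshold are assumptions for illustration, and the sketch assumes the timestamp in the token is comparable with the reader's clock.

```python
import time
from typing import Optional, Tuple

# Assumed threshold, chosen for illustration only.
STALE_AFTER_NS = 300 * 1_000_000_000


def make_token(tx_id: str, now_ns: Optional[int] = None) -> str:
    """Build a lock-file payload in the {tx_id}:{monotonic_ns} format."""
    ns = time.monotonic_ns() if now_ns is None else now_ns
    return f"{tx_id}:{ns}"


def parse_token(token: str) -> Tuple[str, int]:
    """Split on the last colon so a tx_id that contains ':' still parses."""
    tx_id, _, ns = token.rpartition(":")
    return tx_id, int(ns)


def is_stale(token: str, now_ns: int, stale_after_ns: int = STALE_AFTER_NS) -> bool:
    """A lock is treated as stale once it is older than the threshold."""
    _, acquired_ns = parse_token(token)
    return now_ns - acquired_ns > stale_after_ns


def still_owner(written: str, read_back: str) -> bool:
    """Compare the whole token, not just the tx_id, to detect a re-acquired lock."""
    return written == read_back


# Demo with fixed timestamps so the results are deterministic.
first = make_token("tx-1", now_ns=1_000)
second = make_token("tx-1", now_ns=2_000)  # same tx_id, later acquisition
```

Because the timestamp is part of the token, `still_owner(first, second)` is false even though both tokens carry the same transaction ID, which is the property that guards against ABA during concurrent acquisition.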

### Crash Recovery

On `TransactionManager.start()`, scan `/local/_system/transactions/` for residual journals:

| Journal Status | Recovery Action |
|----------------|-----------------|
| `COMMITTED` + post_actions | Replay post_actions → cleanup |
| `COMMITTED` / `RELEASED` | Cleanup locks + journal |
| `EXEC` (age > 300s) | Execute rollback → cleanup |
| `INIT` / `ACQUIRE` | Cleanup locks + journal |
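
The table can be read as a small decision function. The sketch below encodes it directly. The `JournalStatus` enum, the `recovery_plan` function, and the step names are hypothetical. The table does not say what happens to an `EXEC` journal younger than 300 seconds, so the sketch assumes it is left untouched.

```python
from enum import Enum
from typing import List

EXEC_TIMEOUT_S = 300  # from the table: EXEC journals older than this are rolled back


class JournalStatus(Enum):
    INIT = "INIT"
    ACQUIRE = "ACQUIRE"
    EXEC = "EXEC"
    COMMITTED = "COMMITTED"
    RELEASED = "RELEASED"


def recovery_plan(status: JournalStatus, age_s: float, has_post_actions: bool) -> List[str]:
    """Return the ordered recovery steps for one residual journal."""
    if status is JournalStatus.COMMITTED and has_post_actions:
        return ["replay_post_actions", "cleanup"]
    if status in (JournalStatus.COMMITTED, JournalStatus.RELEASED):
        return ["cleanup"]
    if status is JournalStatus.EXEC:
        if age_s > EXEC_TIMEOUT_S:
            return ["rollback", "cleanup"]
        # Assumption: a young EXEC journal may belong to a live transaction.
        return []
    # INIT / ACQUIRE: nothing was executed, so only locks and the journal remain.
    return ["cleanup"]


plans = {
    "committed_with_actions": recovery_plan(JournalStatus.COMMITTED, 10, True),
    "released": recovery_plan(JournalStatus.RELEASED, 10, False),
    "old_exec": recovery_plan(JournalStatus.EXEC, 301, False),
    "young_exec": recovery_plan(JournalStatus.EXEC, 5, False),
    "acquire": recovery_plan(JournalStatus.ACQUIRE, 999, False),
}
```

Here `"cleanup"` stands for removing both the locks and the journal file.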

## Alternatives Considered

- **Database-level transactions (SQLite)**: Only covers VectorDB, not AGFS filesystem operations.
- **Saga pattern with compensating transactions**: More complex orchestration; undo log achieves the same goal with simpler code.
- **Event sourcing**: Overkill for the current write patterns; would require rewriting the storage layer.

## Use Case

Any production deployment where:
- Process crashes or restarts can leave data inconsistent
- Concurrent operations on the same directory tree need coordination
- `session.commit()` with LLM calls can time out, leaving partial state
- `add_resource` must guarantee that semantic generation is eventually enqueued

## Example API (Optional)

```python
from openviking.storage.transaction import TransactionContext, get_transaction_manager

tx_manager = get_transaction_manager()

async with TransactionContext(tx_manager, "my_operation", [path], lock_mode="normal") as tx:
    # Record undo entry before each sub-operation
    seq = tx.record_undo("fs_write_new", {"uri": path})
    await do_something(path)
    tx.mark_completed(seq)

    # Register deferred post-commit work
    tx.add_post_action("enqueue_semantic", {"uri": uri})  # ... plus any further fields

    # Commit (or auto-rollback on exception / missing commit)
    await tx.commit()
```
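
The "auto-rollback on exception / missing commit" behaviour in the example above can be illustrated with a toy async context manager. `ToyTransaction` is a stand-in written for this sketch, not the real `TransactionContext`. It records events instead of touching any subsystem, and it assumes the caller's exception is re-raised after rollback.

```python
import asyncio
from typing import List


class ToyTransaction:
    """Hypothetical stand-in that only records lifecycle events."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.committed = False
        self.events: List[str] = []

    async def __aenter__(self) -> "ToyTransaction":
        self.events.append("begin")
        return self

    async def commit(self) -> None:
        self.committed = True
        self.events.append("commit")

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        # An exception raised before commit() leaves committed False,
        # so both failure and a forgotten commit end in rollback.
        if not self.committed:
            self.events.append("rollback")
        return False  # assumption: never swallow the caller's exception


async def run(do_commit: bool, fail: bool) -> List[str]:
    tx = ToyTransaction("demo")
    try:
        async with tx:
            if fail:
                raise ValueError("sub-operation failed")
            if do_commit:
                await tx.commit()
    except ValueError:
        pass  # the demo only inspects the recorded events
    return tx.events


committed_events = asyncio.run(run(do_commit=True, fail=False))
forgotten_events = asyncio.run(run(do_commit=False, fail=False))
failed_events = asyncio.run(run(do_commit=True, fail=True))
```

The three runs cover a normal commit, a block that exits without calling `commit()`, and a block that raises before reaching `commit()`. The last two both end in rollback.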

## Additional Context

- Design document: `test_scripts/transaction-design.md`
- Existing skeleton code in `openviking/storage/transaction/` has been extended
- The `Session.commit()` method changes from sync to async (`def commit` → `async def commit`)
- 52 new tests cover unit (undo, journal, path_lock, context_manager) and integration (rm/mv rollback, crash recovery, post_actions, concurrent locks)

## Contribution

- [x] I am willing to contribute to implementing this feature
