Crash-Resilient Structural Undo/Redo for Open Transactions #412

@evomimic

1. Summary (Required)

What is the enhancement?
Implement crash-resilient, structural experience-level undo/redo for pre-commit transaction editing, backed by local SQLite persistence.

This includes:

  • transaction-scoped undo/redo stacks
  • structural checkpoints of staged/transient graph state
  • local crash recovery of the latest consistent open-transaction snapshot
  • cleanup of persisted recovery state on commit/rollback

Checkpoint persistence will be triggered by command metadata policy (CommandDescriptor.snapshot_after), with interim trigger mocking allowed until Phase 3 command ingress cutover is complete.


2. Problem Statement (Required)

Why is this needed?
Editing sessions currently risk losing provisional transaction state after a crash or restart. Undo/redo is not yet guaranteed to be deterministic, structural, or crash-resilient across process restarts.

We need a predictable human-first editing model where:

  • successful commands create deterministic structural history
  • undo/redo restores structure (not semantic replay)
  • open transaction state can be recovered after crash
  • all history is strictly transaction-scoped and removed on commit/rollback

3. Dependencies (Required)

Does this depend on other issues or features?

  • Can proceed in parallel with MAP Commands Phases 2.1, 2.2, and 3.
  • Final production trigger integration depends on Phase 3 single command ingress cutover (dispatch_map_command -> Runtime::dispatch).
  • Interim implementation may mock or adapter-trigger snapshot policy until Runtime descriptor pipeline is final.

Related references:


4. Proposed Solution (Required)

How would you solve it?

4.1 Structural Snapshot Model (No Command Replay)

Persist and restore transaction graph state directly (staged/transient + history), not command input replay.

Rationale:

  • avoids TemporaryId replay instability
  • preserves provider openness (no replay determinism contract)
  • aligns with existing structural undo model

4.2 Snapshot Payloads (Versioned)

Use serialized, versioned payload envelopes stored as SQLite BLOBs:

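// Example conceptual shape. TxId, SerializableHolonPool, HolonReferenceWire,
// and UndoMarkerV1 are defined elsewhere; the derives assume every field type
// is itself serde-serializable.
use serde::{Deserialize, Serialize};

/// Structural snapshot of one open transaction's graph state.
#[derive(Serialize, Deserialize)]
struct TxGraphSnapshotV1 {
    tx_id: TxId,
    staged_holons: SerializableHolonPool,
    transient_holons: SerializableHolonPool,
    local_holon_space: Option<HolonReferenceWire>,
    // optional transaction metadata
}

/// One undo/redo checkpoint: a snapshot plus the metadata of the command that produced it.
#[derive(Serialize, Deserialize)]
struct UndoRedoCheckpointV1 {
    checkpoint_id: String,
    tx_id: TxId,
    snapshot: TxGraphSnapshotV1,
    created_at_ms: i64,
    command_name: Option<String>,
    disable_undo: bool,
}

/// Per-transaction recovery index; the stacks reference checkpoints by id.
#[derive(Serialize, Deserialize)]
struct RecoveryEnvelopeV1 {
    tx_id: TxId,
    undo_stack: Vec<String>, // checkpoint_ids
    redo_stack: Vec<String>, // checkpoint_ids
    markers: Vec<UndoMarkerV1>,
    latest_checkpoint_id: Option<String>,
    updated_at_ms: i64,
}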
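
As a rough illustration of why the payloads are versioned, the sketch below gates decoding on a stored format version and rejects anything it does not understand. It is std-only and hedged throughout: `StoredBlob`, `EnvelopeV1`, `DecodeError`, `encode_v1`, and `decode` are hypothetical names, and the line-based text encoding is only a stand-in for the serde_json encoding proposed in this issue.

```rust
/// Hypothetical stored row: a (format_version, blob) pair as read from storage.
pub struct StoredBlob {
    pub format_version: u32,
    pub bytes: Vec<u8>,
}

/// Simplified stand-in for RecoveryEnvelopeV1 (markers and timestamps omitted).
#[derive(Debug, Clone, PartialEq)]
pub struct EnvelopeV1 {
    pub tx_id: u64,
    pub undo_stack: Vec<String>,
    pub redo_stack: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum DecodeError {
    UnsupportedVersion(u32),
    Malformed,
}

/// Toy encoding: three newline-separated lines (tx_id, undo ids, redo ids).
/// Assumes checkpoint ids contain neither commas nor newlines.
pub fn encode_v1(env: &EnvelopeV1) -> StoredBlob {
    let text = format!(
        "{}\n{}\n{}",
        env.tx_id,
        env.undo_stack.join(","),
        env.redo_stack.join(",")
    );
    StoredBlob { format_version: 1, bytes: text.into_bytes() }
}

fn parse_ids(line: &str) -> Vec<String> {
    line.split(',')
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}

/// Checks the version first, then parses; never guesses at unknown formats.
pub fn decode(blob: &StoredBlob) -> Result<EnvelopeV1, DecodeError> {
    if blob.format_version != 1 {
        return Err(DecodeError::UnsupportedVersion(blob.format_version));
    }
    let text = std::str::from_utf8(&blob.bytes).map_err(|_| DecodeError::Malformed)?;
    let mut lines = text.split('\n');
    let tx_id = lines
        .next()
        .ok_or(DecodeError::Malformed)?
        .parse::<u64>()
        .map_err(|_| DecodeError::Malformed)?;
    let undo_stack = parse_ids(lines.next().ok_or(DecodeError::Malformed)?);
    let redo_stack = parse_ids(lines.next().ok_or(DecodeError::Malformed)?);
    Ok(EnvelopeV1 { tx_id, undo_stack, redo_stack })
}
```

Failing loudly on an unknown version keeps a newer on-disk format from being silently misread by an older build during recovery.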

4.3 Blob Generation Path (Normative)

Generate blobs from existing wire serializers, not custom SQL projection logic:

  1. Export runtime state from TransactionContext:
    • export_staged_holons()
    • export_transient_holons()
  2. Convert to serializable wire pools:
    • SerializableHolonPool::from(&HolonPool)
    • via HolonWire / TransientHolonWire / StagedHolonWire
  3. Build versioned snapshot/envelope structs
  4. Serialize to bytes (serde_json initially; optionally MessagePack later)
  5. Persist in SQLite within one atomic DB transaction

Restore is the inverse:

  1. Read BLOBs
  2. Deserialize envelopes
  3. Rebind pools via SerializableHolonPool.bind(&context)
  4. Import with import_staged_holons(...) / import_transient_holons(...)

4.4 SQLite Schema (Initial)

CREATE TABLE recovery_session (
  tx_id                 INTEGER PRIMARY KEY,
  space_id              TEXT,
  lifecycle_state       TEXT NOT NULL,           -- expected Open while recoverable
  latest_checkpoint_id  TEXT,
  envelope_blob         BLOB NOT NULL,           -- RecoveryEnvelopeV1
  format_version        INTEGER NOT NULL DEFAULT 1,
  updated_at_ms         INTEGER NOT NULL
);

CREATE TABLE recovery_checkpoint (
  checkpoint_id         TEXT PRIMARY KEY,
  tx_id                 INTEGER NOT NULL,
  stack_kind            TEXT NOT NULL,           -- 'undo' | 'redo'
  stack_pos             INTEGER NOT NULL,        -- 0..N per stack
  snapshot_blob         BLOB NOT NULL,           -- UndoRedoCheckpointV1 / TxGraphSnapshotV1
  snapshot_hash         TEXT,                    -- optional integrity/debug
  created_at_ms         INTEGER NOT NULL,
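  -- SQLite enforces foreign keys (and this cascade) only when PRAGMA foreign_keys = ON is set on the connection
  FOREIGN KEY (tx_id) REFERENCES recovery_session(tx_id) ON DELETE CASCADE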
);

CREATE UNIQUE INDEX idx_checkpoint_stack_pos
  ON recovery_checkpoint(tx_id, stack_kind, stack_pos);

CREATE INDEX idx_checkpoint_tx_created
  ON recovery_checkpoint(tx_id, created_at_ms);

4.5 Write/Restore Semantics

  • Post-processing of each successful command runs in exactly one SQLite transaction.
  • If snapshot_after == true: persist the new checkpoint and the updated envelope atomically.
  • If disable_undo == true: skip undo checkpoint creation, but still persist the latest recoverable transaction state when configured to do so.
  • Clear the redo stack on any successful new undoable command.
  • Persist only fully consistent state; never partial command state.
  • On startup, restore only the most recent consistent snapshot.
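
The stack rules above can be sketched in memory as follows. This is a minimal, std-only illustration rather than the proposed implementation: `Snapshot`, `TxHistory`, `HistoryError`, and `record_success` are hypothetical names, and `Snapshot` wraps a plain string where the real payload would be a serialized TxGraphSnapshotV1. It also assumes, following the wording above, that a command with disable_undo replaces the current state without touching either stack.

```rust
/// Stand-in for a serialized structural snapshot (hypothetical).
#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot(pub String);

#[derive(Debug, PartialEq)]
pub enum HistoryError {
    NothingToUndo,
    NothingToRedo,
}

/// Undo/redo history for one open transaction.
pub struct TxHistory {
    current: Snapshot,
    undo_stack: Vec<Snapshot>, // states to return to, most recent last
    redo_stack: Vec<Snapshot>, // undone states, most recently undone last
}

impl TxHistory {
    pub fn new(initial: Snapshot) -> Self {
        Self { current: initial, undo_stack: Vec::new(), redo_stack: Vec::new() }
    }

    /// Call only after a command has completed successfully.
    pub fn record_success(&mut self, new_state: Snapshot, disable_undo: bool) {
        let previous = std::mem::replace(&mut self.current, new_state);
        if !disable_undo {
            // Undoable command: remember where we came from, drop redo history.
            self.undo_stack.push(previous);
            self.redo_stack.clear();
        }
    }

    /// Restores the previous structural state (no command replay).
    pub fn undo(&mut self) -> Result<&Snapshot, HistoryError> {
        let previous = self.undo_stack.pop().ok_or(HistoryError::NothingToUndo)?;
        let undone = std::mem::replace(&mut self.current, previous);
        self.redo_stack.push(undone);
        Ok(&self.current)
    }

    /// Re-applies the most recently undone state.
    pub fn redo(&mut self) -> Result<&Snapshot, HistoryError> {
        let next = self.redo_stack.pop().ok_or(HistoryError::NothingToRedo)?;
        let replaced = std::mem::replace(&mut self.current, next);
        self.undo_stack.push(replaced);
        Ok(&self.current)
    }

    pub fn current(&self) -> &Snapshot {
        &self.current
    }
}
```

Because each entry is a full state rather than a command, undo and redo are plain swaps between `current` and the stacks, which is what makes the history independent of command replay.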

4.6 Lifecycle Cleanup

On commit or rollback:

  • destroy the in-memory undo/redo history
  • delete the persisted recovery rows (recovery_session + recovery_checkpoint) for the transaction
  • no history survives the transaction boundary
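
The cleanup rule can be modeled with a small in-memory store. This is a sketch only: the two maps stand in for the recovery_session and recovery_checkpoint tables, `persist` models the single atomic write by validating everything before applying anything, and `RecoveryStore`, `SessionRow`, `CheckpointRow`, and the method names are hypothetical. A real implementation would issue the equivalent statements inside one SQLite transaction.

```rust
use std::collections::HashMap;

/// Stand-in for a recovery_checkpoint row.
#[derive(Debug, Clone, PartialEq)]
pub struct CheckpointRow {
    pub checkpoint_id: String,
    pub tx_id: u64,
    pub snapshot_blob: Vec<u8>,
}

/// Stand-in for a recovery_session row.
#[derive(Debug, Clone, PartialEq)]
pub struct SessionRow {
    pub latest_checkpoint_id: Option<String>,
    pub envelope_blob: Vec<u8>,
}

#[derive(Default)]
pub struct RecoveryStore {
    sessions: HashMap<u64, SessionRow>,
    checkpoints: HashMap<String, CheckpointRow>,
}

impl RecoveryStore {
    /// Writes the checkpoint and the session envelope together, or neither.
    pub fn persist(
        &mut self,
        tx_id: u64,
        checkpoint: CheckpointRow,
        envelope_blob: Vec<u8>,
    ) -> Result<(), String> {
        // Validate first so a failure leaves the store untouched.
        if checkpoint.tx_id != tx_id {
            return Err("checkpoint belongs to a different transaction".to_string());
        }
        if self.checkpoints.contains_key(&checkpoint.checkpoint_id) {
            return Err("duplicate checkpoint_id".to_string());
        }
        let id = checkpoint.checkpoint_id.clone();
        self.checkpoints.insert(id.clone(), checkpoint);
        self.sessions.insert(
            tx_id,
            SessionRow { latest_checkpoint_id: Some(id), envelope_blob },
        );
        Ok(())
    }

    /// Startup recovery: only the latest consistent checkpoint is returned.
    pub fn recover(&self, tx_id: u64) -> Option<&CheckpointRow> {
        let id = self.sessions.get(&tx_id)?.latest_checkpoint_id.as_ref()?;
        self.checkpoints.get(id)
    }

    /// Commit or rollback: remove every persisted row for the transaction.
    pub fn end_transaction(&mut self, tx_id: u64) {
        self.sessions.remove(&tx_id);
        self.checkpoints.retain(|_, row| row.tx_id != tx_id);
    }

    pub fn checkpoint_count(&self, tx_id: u64) -> usize {
        self.checkpoints.values().filter(|row| row.tx_id == tx_id).count()
    }
}
```

After `end_transaction`, `recover` returns `None` for that transaction while other open transactions are unaffected.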

5. Scope and Impact (Required)

What does this impact?
Impacts:

  • IntegrationHub transaction editing lifecycle
  • transaction state persistence layer (SQLite)
  • undo/redo behavior for experience-level editing
  • command execution post-success checkpoint pipeline

Does not impact:

  • post-commit/domain-level compensating undo semantics
  • trust-channel compensations/inter-agent reversal
  • cross-transaction undo history
  • DHT-visible persistence behavior

6. Testing Considerations (Required)

How will this enhancement be tested?

  • Can it be validated with existing test cases?
    • Partially (existing transaction/lifecycle tests remain relevant).
  • Do new test cases need to be created?
    • Yes:
      • undo/redo stack semantics (LIFO, redo clearing, empty-stack failures)
      • disable_undo behavior
      • snapshot_after-driven persistence behavior
      • blob serialization/deserialization roundtrip tests
      • crash recovery restore from SQLite snapshot
      • consistency guarantees (no partial state persisted)
      • cleanup on commit/rollback
  • Are there specific areas in the test ecosystem impacted by this enhancement?
    • host runtime transaction tests
    • integration tests simulating crash/restart
    • regression tests across loader/bulk operations with undo disabled

7. Definition of Done (Required)

When is this enhancement complete?

  • Structural undo/redo stacks implemented per open transaction
  • Checkpoint creation occurs only after successful command completion
  • Redo stack clears on successful new undoable command
  • disable_undo metadata behavior implemented
  • snapshot_after policy hook implemented (mocked trigger acceptable until Phase 3 cutover)
  • SQLite schema implemented (recovery_session, recovery_checkpoint, indexes)
  • Snapshot blobs generated from wire serializer path (export -> wire -> serialize)
  • Crash/restart restores consistent transaction state + stacks
  • Commit/rollback destroys in-memory history and deletes the persisted recovery rows (recovery_session + recovery_checkpoint)
  • Tests cover stack semantics, blob roundtrip, recovery, lifecycle cleanup, and policy triggers

Optional Details (Expand if needed)

8. Alternatives Considered

What other solutions did you think about?

  • Delta/replay-based undo: rejected for v0 due to TemporaryId fragility and provider determinism burden.
  • Command-specific semantic undo handlers: rejected for v0 to preserve openness and lower integration burden.
  • Cross-transaction persistent history: rejected as out-of-scope and semantically risky.

9. Risks or Concerns

What could go wrong?

  • Snapshot size/performance during high-frequency editing
  • Drift between mocked trigger and final Runtime descriptor trigger
  • Inconsistent recovery if persistence writes are not atomic

Mitigations:

  • enforce DB atomicity and “latest consistent only” restore rule
  • keep trigger integration seam explicit for Phase 3 handoff
  • allow disable_undo for bulk/regenerable operations
  • optimize encoding/compaction only after measurement

10. Additional Context

Any supporting material?
Based on: MAP Core — Structural Pre-Commit Transaction Editing Model / Structural Experience-Level Undo / Redo Specification.

Parallelization intent:

  • implement undo/recovery persistence engine now
  • integrate final command-trigger wiring when Phase 3 single-ingress Runtime path is active
