
Conversation

@johnnygreco (Contributor) commented Jan 29, 2026

Experimenting with this plan-first approach!

The main goal is to build a graph of all the cells that need to be generated, along with their dependencies. The idea is that an execution engine fetches work from this graph as nodes become ready.

Part of #260
Closes #267

NOTE: The reference implementation is just meant to help build some intuition for what a concrete implementation might look like.

📋 Summary

This PR introduces the Execution Graph Builder plan document that designs a memory-efficient graph-based execution framework for async dataset generation. A reference implementation is included to build intuition and validate the design decisions—the plan document should be the primary focus for review.

🎯 Focus: The Plan Document

The core of this PR is plans/async-engine/execution_graph_builder.plan.md which details:

  • Hybrid graph representation: O(C) memory for millions of records (cell nodes computed on-demand; see the sketch after this list)
  • Generator execution traits: START, CELL_BY_CELL, ROW_STREAMABLE, BARRIER inferred from generator properties
  • Cell-level dependency resolution: Enables fine-grained async execution
  • Thread-safe completion tracking: Separate trackers for asyncio vs thread pool execution
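To make the memory tradeoff and trait model concrete, here is a rough sketch of what the hybrid representation could look like. The names (`GeneratorTrait`, `ColumnDescriptor`, `CellNodeId`) and fields are illustrative assumptions for building intuition, not the actual classes in the reference implementation:

```python
from dataclasses import dataclass
from enum import Enum, auto


class GeneratorTrait(Enum):
    # Illustrative traits mirroring the plan: START, CELL_BY_CELL, ROW_STREAMABLE, BARRIER
    START = auto()
    CELL_BY_CELL = auto()
    ROW_STREAMABLE = auto()
    BARRIER = auto()


@dataclass(frozen=True)
class ColumnDescriptor:
    """One explicit node per column -> O(C) storage regardless of row count."""
    name: str
    trait: GeneratorTrait
    depends_on: tuple[str, ...]  # upstream column names


@dataclass(frozen=True)
class CellNodeId:
    """Virtual cell node: never stored, computed on demand from (column, row)."""
    column: str
    row: int


def cell_dependencies(desc: ColumnDescriptor, row: int) -> list[CellNodeId]:
    # A cell in a CELL_BY_CELL column depends only on the same row of its upstream
    # columns, which is what enables fine-grained async scheduling without ever
    # materializing all C x R cell nodes.
    return [CellNodeId(column=dep, row=row) for dep in desc.depends_on]
```

The key point is that only the C column descriptors are stored; cell nodes exist only transiently while dependencies are being resolved.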

🔄 Changes

✨ Added

Plan & Exploration:

Reference Implementation (in packages/data-designer-engine/src/data_designer/engine/execution_graph/):

Generator Properties:

Tests (in packages/data-designer-engine/tests/engine/execution_graph/):

  • Comprehensive test suite covering graph building, node iteration, dependency resolution, and completion tracking

🔧 Changed

  • Minor updates to ModelConfig and schema transform processor

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:


🔄 Update: Checkpoint/Restart Support

Added row-complete batch checkpointing to enable resuming interrupted generation runs. This feature allows saving progress at row boundaries and efficiently restoring state without storing individual cell indices.

Key Design Decisions

| Decision | Rationale |
| --- | --- |
| Row-complete batches | Checkpoint when all cells for a batch of rows are complete |
| Compact format | Store `{"completed_rows": N}` - O(1) regardless of dataset size |
| Contiguous completion | Only count contiguous rows from row 0 (simpler, matches batch-oriented generation) |
| Usable partial datasets | Each checkpoint represents a valid dataset subset |

New APIs

ExecutionGraph (graph.py):

  • is_row_complete(row, completed) - Check if all cells for a row are complete
  • get_completed_row_count(completed) - Count contiguous complete rows from row 0

CompletionTracker / ThreadSafeCompletionTracker (completion.py):

  • to_checkpoint(graph) - Create compact checkpoint: {"version": 1, "completed_rows": N}
  • from_checkpoint(checkpoint, graph) - Restore tracker state from checkpoint
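For intuition, a minimal sketch of the intended semantics, assuming a simple per-row completion predicate; the helper names and signatures here are illustrative rather than the exact API:

```python
from typing import Callable


def get_completed_row_count(num_records: int, is_row_complete: Callable[[int], bool]) -> int:
    """Count contiguous complete rows starting at row 0; stop at the first incomplete row."""
    count = 0
    while count < num_records and is_row_complete(count):
        count += 1
    return count


def to_checkpoint(completed_rows: int) -> dict:
    # Compact O(1) payload: no per-cell indices, just the contiguous row count.
    return {"version": 1, "completed_rows": completed_rows}


def rows_to_restore(checkpoint: dict, num_records: int) -> range:
    # On restore, rows 0..N-1 are treated as complete; later rows are regenerated.
    return range(min(checkpoint.get("completed_rows", 0), num_records))
```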

Usage Example

```python
# Checkpoint during execution
BATCH_SIZE = 1000
completed = graph.get_completed_row_count(tracker)
if completed >= last_checkpoint + BATCH_SIZE:
    storage.update_metadata({"checkpoint": tracker.to_checkpoint(graph)})
    last_checkpoint = completed

# Resume from checkpoint
if "checkpoint" in metadata:
    tracker = CompletionTracker.from_checkpoint(metadata["checkpoint"], graph)
else:
    tracker = CompletionTracker(graph.num_records)
```
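Once restored, iterating ready nodes with the restored tracker skips cells in the already-completed rows, so generation resumes at the first incomplete row rather than starting over.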

Files Changed

| File | Change |
| --- | --- |
| graph.py | Added `is_row_complete()`, `get_completed_row_count()` |
| completion.py | Added `to_checkpoint()`, `from_checkpoint()` to both tracker classes |
| `__init__.py` | Exported `CHECKPOINT_VERSION` |
| test_completion.py | Added 17 new checkpoint tests |
| test_graph.py | Added 12 new row completion tests |
| execution_graph_builder.plan.md | Added checkpoint/restart documentation section |

🤖 Generated with AI

This introduces a design plan for a memory-efficient execution graph
that models cell-level dependencies for async dataset generation.
The reference implementation is included to help build intuition
about how the concepts could work in practice.

The plan is the primary artifact - the code is exploratory.
@johnnygreco johnnygreco requested a review from a team as a code owner January 29, 2026 21:16
@johnnygreco johnnygreco changed the base branch from main to jena/dev/async-engine January 29, 2026 21:17
Add row-complete batch checkpointing for resuming interrupted generation:
- Add is_row_complete() and get_completed_row_count() to ExecutionGraph
- Add to_checkpoint() and from_checkpoint() to CompletionTracker
- Add thread-safe checkpoint methods to ThreadSafeCompletionTracker
- Export CHECKPOINT_VERSION for compatibility checking
- Add comprehensive tests for checkpoint functionality
- Update plan documentation with checkpoint/restart section

@greptile-apps greptile-apps bot left a comment


5 files reviewed, no comments


Add "Open Decision: Checkpointing with Barrier Columns" section to the
execution graph builder plan. This documents that barrier columns
prevent intermediate checkpoints and outlines four potential approaches
to address this in the future.
@johnnygreco (Contributor, Author) commented:

Update: Documented Barrier Checkpoint Limitation

Added a new section to the plan: "Open Decision: Checkpointing with Barrier Columns"

The Issue

With the current checkpoint design, barrier columns (like validation) prevent any intermediate checkpoints. Since barriers require ALL input rows before producing ANY output, get_completed_row_count() returns 0 until the entire barrier completes.

This means if a generation run fails before or during barrier execution, all pre-barrier work is lost.
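A toy illustration of the effect (plain Python, not the actual ExecutionGraph / CompletionTracker API):

```python
# Toy model: completed cell rows per column. The barrier column produces nothing
# until it has consumed ALL input rows, so it stays empty while upstream columns fill up.
num_records = 5
completed = {
    "prompt": set(range(5)),      # all 5 rows generated
    "response": set(range(5)),    # all 5 rows generated
    "validation": set(),          # barrier column: still waiting on the full dataset
}

def contiguous_completed_rows() -> int:
    # Mirrors the "contiguous completion" rule: count rows from 0 until the first gap.
    n = 0
    while n < num_records and all(n in rows for rows in completed.values()):
        n += 1
    return n

print(contiguous_completed_rows())          # 0 -> no intermediate checkpoint possible
completed["validation"] = set(range(5))     # barrier finishes in one shot
print(contiguous_completed_rows())          # 5 -> checkpointable only after the barrier
```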

Options Documented

The plan now outlines four approaches to address this:

| Option | Summary |
| --- | --- |
| A: Accept current behavior | Keep dataset-scoped barriers, no intermediate checkpoints |
| B: Batch-scoped barriers | Execute barriers per-batch for incremental checkpoints |
| C: Column-level checkpointing | Track completed columns in checkpoint format |
| D: Hybrid approach | Add `BATCH_BARRIER` trait alongside `BARRIER` |

Decision Deferred

We're documenting this as an open design question to be decided later based on real-world usage patterns. The current implementation works correctly; this is about checkpoint granularity optimization.

See the updated plan for full details and pros/cons analysis.

@NVIDIA-NeMo NVIDIA-NeMo deleted a comment from greptile-apps bot Jan 30, 2026
@greptile-apps bot commented Jan 30, 2026

Greptile Overview

Greptile Summary

Introduces a memory-efficient execution graph framework for async dataset generation with a hybrid representation that handles millions of records using O(C) memory instead of O(C×R). The implementation uses virtual cell nodes computed on-demand while storing only column-level metadata.

Key Contributions

  • Hybrid Graph Architecture: Explicit column structure with virtual cell nodes for memory efficiency
  • Plugin-Compatible Trait Inference: Uses generator properties (can_generate_from_scratch, is_row_streamable) instead of hardcoded class names, with bytecode inspection to evaluate properties without instantiation
  • Checkpoint/Restart Support: Row-complete batch checkpointing with O(1) checkpoint size, enabling recovery from interrupted generation runs
  • Thread-Safe Variants: Provides both CompletionTracker (asyncio/single-threaded) and ThreadSafeCompletionTracker (thread pool) implementations
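For intuition only, here is one way constant-valued property getters could be read without instantiating a generator class; this sketch is an assumption about the approach and may differ from what builder.py actually does:

```python
import dis


def read_constant_property(cls: type, prop_name: str) -> bool | None:
    """Best-effort: if `cls.prop_name` is a property whose getter simply returns a
    constant bool (e.g. `return True`), recover that value from its bytecode."""
    prop = getattr(cls, prop_name, None)
    if not isinstance(prop, property) or prop.fget is None:
        return None
    # Skip interpreter bookkeeping opcodes (RESUME on Python 3.11+).
    instructions = [i for i in dis.get_instructions(prop.fget) if i.opname != "RESUME"]
    if instructions and instructions[0].opname in ("LOAD_CONST", "RETURN_CONST"):
        value = instructions[0].argval
        if isinstance(value, bool):
            return value
    return None  # getter does real work; a different strategy would be needed


class ExampleGenerator:
    @property
    def can_generate_from_scratch(self) -> bool:
        return True


print(read_constant_property(ExampleGenerator, "can_generate_from_scratch"))  # True
```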

Implementation Quality

The code follows the project's style guidelines with proper type annotations, SPDX license headers, and comprehensive test coverage. The plan document is thorough, including an "Open Decision" section about barrier column checkpoint limitations with four detailed options for future consideration.

Critical Issue Found

  • completion.py:227-229: Barrier nodes are marked complete unconditionally during checkpoint restoration, even when completed_rows < graph.num_records. This violates the barrier semantic where ALL input rows must complete before ANY output can be produced.

Confidence Score: 4/5

  • Safe to merge after fixing the barrier checkpoint restoration logic
  • The implementation is well-designed with excellent documentation and test coverage. The critical bug in barrier checkpoint restoration must be fixed, as it could lead to incorrect execution when resuming from checkpoints with barrier columns. Otherwise, the code demonstrates strong architectural design and follows all project conventions.
  • Pay close attention to completion.py line 227-229 where the barrier restoration logic needs correction

Important Files Changed

| Filename | Overview |
| --- | --- |
| plans/async-engine/execution_graph_builder.plan.md | Comprehensive design document detailing hybrid graph representation, trait inference, dependency resolution, and checkpoint/restart support with clear examples and tradeoff analysis |
| packages/data-designer-engine/src/data_designer/engine/execution_graph/graph.py | Core hybrid graph implementation with on-demand cell node computation, dependency resolution, and row completion tracking for checkpoints |
| packages/data-designer-engine/src/data_designer/engine/execution_graph/builder.py | GraphBuilder with plugin-compatible trait inference using bytecode inspection to evaluate generator properties without instantiation |
| packages/data-designer-engine/src/data_designer/engine/execution_graph/completion.py | Memory-efficient completion tracking with O(C) memory, automatic compaction, and checkpoint/restart support for both single-threaded and thread-safe variants |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant GraphBuilder
    participant Registry
    participant ExecutionGraph
    participant CompletionTracker
    participant Engine

    Note over User,Engine: Graph Construction Phase
    User->>GraphBuilder: build(config, num_records)
    GraphBuilder->>Registry: get_for_config_type(config_type)
    Registry-->>GraphBuilder: generator_class
    GraphBuilder->>GraphBuilder: _infer_traits(gen_cls)
    Note over GraphBuilder: Uses bytecode inspection<br/>to evaluate properties
    GraphBuilder->>GraphBuilder: _build_column_descriptors()
    GraphBuilder->>ExecutionGraph: new(num_records, descriptors)
    ExecutionGraph->>ExecutionGraph: _validate_dependencies()
    ExecutionGraph-->>GraphBuilder: graph
    GraphBuilder-->>User: ExecutionGraph

    Note over User,Engine: Execution Phase
    User->>CompletionTracker: new(num_records)
    CompletionTracker-->>User: tracker
    
    loop For each ready node
        User->>ExecutionGraph: iter_ready_nodes(tracker)
        ExecutionGraph->>ExecutionGraph: Check dependencies satisfied
        ExecutionGraph-->>User: node (CellNodeId or BarrierNodeId)
        User->>ExecutionGraph: get_generator_and_config(node)
        ExecutionGraph-->>User: (generator_cls, config)
        User->>Engine: Execute node with generator
        Engine-->>User: result
        User->>CompletionTracker: mark_complete(node)
        Note over CompletionTracker: Automatic memory<br/>compaction when<br/>column completes
    end

    Note over User,Engine: Checkpoint/Restart
    User->>ExecutionGraph: get_completed_row_count(tracker)
    ExecutionGraph->>CompletionTracker: Check row completion
    ExecutionGraph-->>User: completed_rows
    User->>CompletionTracker: to_checkpoint(graph)
    CompletionTracker-->>User: {"version": 1, "completed_rows": N}
    Note over User: Store checkpoint
    
    Note over User,Engine: Resume from Checkpoint
    User->>CompletionTracker: from_checkpoint(checkpoint, graph)
    CompletionTracker->>CompletionTracker: Mark rows 0..N-1 complete
    CompletionTracker-->>User: restored_tracker
    User->>ExecutionGraph: iter_ready_nodes(restored_tracker)
    Note over ExecutionGraph: Skips completed nodes automatically
```


@greptile-apps greptile-apps bot left a comment


4 files reviewed, 1 comment


Comment on lines +227 to +229
```python
if desc.is_barrier:
    # For barrier columns, mark the barrier as complete
    tracker._completed_barriers.add(col_name)
```

Barrier marked complete unconditionally during checkpoint restoration. According to the plan document's "Open Decision" section, barrier columns prevent intermediate checkpoints since they require ALL input rows before producing ANY output. If completed_rows < graph.num_records, the barrier should not be marked complete.

Suggested change

```diff
 if desc.is_barrier:
-    # For barrier columns, mark the barrier as complete
-    tracker._completed_barriers.add(col_name)
+    # Only mark barrier complete if ALL rows are complete
+    if completed_rows == graph.num_records:
+        tracker._completed_barriers.add(col_name)
```
