
Conversation

@johnnygreco (Contributor) commented Jan 29, 2026

Experimenting with this plan-first approach!

The main goal is to build a graph of all the cells that need to be generated, along with their dependencies. The idea is that an execution engine fetches work from this graph as nodes become ready.

Part of #260
Closes #267

NOTE: The reference implementation is just meant to help build some intuition for what a concrete implementation might look like.

📋 Summary

This PR introduces the Execution Graph Builder plan document that designs a memory-efficient graph-based execution framework for async dataset generation. A reference implementation is included to build intuition and validate the design decisions—the plan document should be the primary focus for review.

🎯 Focus: The Plan Document

The core of this PR is plans/async-engine/execution_graph_builder.plan.md which details:

  • Hybrid graph representation: O(C) memory for millions of records (cell nodes computed on-demand; see the sketch after this list)
  • Generator execution traits: START, CELL_BY_CELL, ROW_STREAMABLE, BARRIER inferred from generator properties
  • Cell-level dependency resolution: Enables fine-grained async execution
  • Thread-safe completion tracking: Separate trackers for asyncio vs thread pool execution
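To make the memory tradeoff and trait model concrete, here is a rough sketch of what the hybrid representation could look like. The names (`GeneratorTrait`, `ColumnDescriptor`, `CellNodeId`) and fields are illustrative assumptions for building intuition, not the actual classes in the reference implementation:

```python
from dataclasses import dataclass
from enum import Enum, auto


class GeneratorTrait(Enum):
    # Illustrative traits mirroring the plan: START, CELL_BY_CELL, ROW_STREAMABLE, BARRIER
    START = auto()
    CELL_BY_CELL = auto()
    ROW_STREAMABLE = auto()
    BARRIER = auto()


@dataclass(frozen=True)
class ColumnDescriptor:
    """One explicit node per column -> O(C) storage regardless of row count."""
    name: str
    trait: GeneratorTrait
    depends_on: tuple[str, ...]  # upstream column names


@dataclass(frozen=True)
class CellNodeId:
    """Virtual cell node: never stored, computed on demand from (column, row)."""
    column: str
    row: int


def cell_dependencies(desc: ColumnDescriptor, row: int) -> list[CellNodeId]:
    # A cell in a CELL_BY_CELL column depends only on the same row of its upstream
    # columns, which is what enables fine-grained async scheduling without ever
    # materializing all C x R cell nodes.
    return [CellNodeId(column=dep, row=row) for dep in desc.depends_on]
```

The key point is that only the C column descriptors are stored; cell nodes exist only transiently while dependencies are being resolved.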

🔄 Changes

✨ Added

Plan & Exploration:

Reference Implementation (in packages/data-designer-engine/src/data_designer/engine/execution_graph/):

Generator Properties:

Tests (in packages/data-designer-engine/tests/engine/execution_graph/):

  • Comprehensive test suite covering graph building, node iteration, dependency resolution, and completion tracking

🔧 Changed

  • Minor updates to ModelConfig and schema transform processor

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:


🔄 Update: Checkpoint/Restart Support

Added row-complete batch checkpointing to enable resuming interrupted generation runs. This feature allows saving progress at row boundaries and efficiently restoring state without storing individual cell indices.

Key Design Decisions

| Decision | Rationale |
| --- | --- |
| Row-complete batches | Checkpoint when all cells for a batch of rows are complete |
| Compact format | Store `{"completed_rows": N}` - O(1) regardless of dataset size |
| Contiguous completion | Only count contiguous rows from row 0 (simpler, matches batch-oriented generation) |
| Usable partial datasets | Each checkpoint represents a valid dataset subset |

New APIs

ExecutionGraph (graph.py):

  • is_row_complete(row, completed) - Check if all cells for a row are complete
  • get_completed_row_count(completed) - Count contiguous complete rows from row 0

CompletionTracker / ThreadSafeCompletionTracker (completion.py):

  • to_checkpoint(graph) - Create compact checkpoint: {"version": 1, "completed_rows": N}
  • from_checkpoint(checkpoint, graph) - Restore tracker state from checkpoint
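For intuition, a minimal sketch of the intended semantics, assuming a simple per-row completion predicate; the helper names and signatures here are illustrative rather than the exact API:

```python
from typing import Callable


def get_completed_row_count(num_records: int, is_row_complete: Callable[[int], bool]) -> int:
    """Count contiguous complete rows starting at row 0; stop at the first incomplete row."""
    count = 0
    while count < num_records and is_row_complete(count):
        count += 1
    return count


def to_checkpoint(completed_rows: int) -> dict:
    # Compact O(1) payload: no per-cell indices, just the contiguous row count.
    return {"version": 1, "completed_rows": completed_rows}


def rows_to_restore(checkpoint: dict, num_records: int) -> range:
    # On restore, rows 0..N-1 are treated as complete; later rows are regenerated.
    return range(min(checkpoint.get("completed_rows", 0), num_records))
```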

Usage Example

```python
# Checkpoint during execution
BATCH_SIZE = 1000
completed = graph.get_completed_row_count(tracker)
if completed >= last_checkpoint + BATCH_SIZE:
    storage.update_metadata({"checkpoint": tracker.to_checkpoint(graph)})
    last_checkpoint = completed

# Resume from checkpoint
if "checkpoint" in metadata:
    tracker = CompletionTracker.from_checkpoint(metadata["checkpoint"], graph)
else:
    tracker = CompletionTracker(graph.num_records)
```
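Once restored, iterating ready nodes with the restored tracker skips cells in the already-completed rows, so generation resumes at the first incomplete row rather than starting over.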

Files Changed

| File | Change |
| --- | --- |
| graph.py | Added `is_row_complete()`, `get_completed_row_count()` |
| completion.py | Added `to_checkpoint()`, `from_checkpoint()` to both tracker classes |
| `__init__.py` | Exported `CHECKPOINT_VERSION` |
| test_completion.py | Added 17 new checkpoint tests |
| test_graph.py | Added 12 new row completion tests |
| execution_graph_builder.plan.md | Added checkpoint/restart documentation section |

🤖 Generated with AI

This introduces a design plan for a memory-efficient execution graph
that models cell-level dependencies for async dataset generation.
The reference implementation is included to help build intuition
about how the concepts could work in practice.

The plan is the primary artifact - the code is exploratory.
@johnnygreco johnnygreco requested a review from a team as a code owner January 29, 2026 21:16
@johnnygreco johnnygreco changed the base branch from main to jena/dev/async-engine January 29, 2026 21:17
Add row-complete batch checkpointing for resuming interrupted generation:
- Add is_row_complete() and get_completed_row_count() to ExecutionGraph
- Add to_checkpoint() and from_checkpoint() to CompletionTracker
- Add thread-safe checkpoint methods to ThreadSafeCompletionTracker
- Export CHECKPOINT_VERSION for compatibility checking
- Add comprehensive tests for checkpoint functionality
- Update plan documentation with checkpoint/restart section

@greptile-apps greptile-apps bot left a comment


5 files reviewed, no comments


Add "Open Decision: Checkpointing with Barrier Columns" section to the
execution graph builder plan. This documents that barrier columns
prevent intermediate checkpoints and outlines four potential approaches
to address this in the future.
@johnnygreco (Contributor, Author) commented:

Update: Documented Barrier Checkpoint Limitation

Added a new section to the plan: "Open Decision: Checkpointing with Barrier Columns"

The Issue

With the current checkpoint design, barrier columns (like validation) prevent any intermediate checkpoints. Since barriers require ALL input rows before producing ANY output, get_completed_row_count() returns 0 until the entire barrier completes.

This means if a generation run fails before or during barrier execution, all pre-barrier work is lost.
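A toy illustration of the effect (plain Python, not the actual ExecutionGraph / CompletionTracker API):

```python
# Toy model: completed cell rows per column. The barrier column produces nothing
# until it has consumed ALL input rows, so it stays empty while upstream columns fill up.
num_records = 5
completed = {
    "prompt": set(range(5)),      # all 5 rows generated
    "response": set(range(5)),    # all 5 rows generated
    "validation": set(),          # barrier column: still waiting on the full dataset
}

def contiguous_completed_rows() -> int:
    # Mirrors the "contiguous completion" rule: count rows from 0 until the first gap.
    n = 0
    while n < num_records and all(n in rows for rows in completed.values()):
        n += 1
    return n

print(contiguous_completed_rows())          # 0 -> no intermediate checkpoint possible
completed["validation"] = set(range(5))     # barrier finishes in one shot
print(contiguous_completed_rows())          # 5 -> checkpointable only after the barrier
```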

Options Documented

The plan now outlines four approaches to address this:

| Option | Summary |
| --- | --- |
| A: Accept current behavior | Keep dataset-scoped barriers, no intermediate checkpoints |
| B: Batch-scoped barriers | Execute barriers per-batch for incremental checkpoints |
| C: Column-level checkpointing | Track completed columns in checkpoint format |
| D: Hybrid approach | Add `BATCH_BARRIER` trait alongside `BARRIER` |

Decision Deferred

We're documenting this as an open design question to be decided later based on real-world usage patterns. The current implementation works correctly; this is about checkpoint granularity optimization.

See the updated plan for full details and pros/cons analysis.

@NVIDIA-NeMo NVIDIA-NeMo deleted a comment from greptile-apps bot Jan 30, 2026
@greptile-apps bot commented Jan 30, 2026

Greptile Overview

Greptile Summary

Introduces a memory-efficient execution graph framework for async dataset generation with a hybrid representation that handles millions of records using O(C) memory instead of O(C×R). The implementation uses virtual cell nodes computed on-demand while storing only column-level metadata.

Key Contributions

  • Hybrid Graph Architecture: Explicit column structure with virtual cell nodes for memory efficiency
  • Plugin-Compatible Trait Inference: Uses generator properties (can_generate_from_scratch, is_row_streamable) instead of hardcoded class names, with bytecode inspection to evaluate properties without instantiation
  • Checkpoint/Restart Support: Row-complete batch checkpointing with O(1) checkpoint size, enabling recovery from interrupted generation runs
  • Thread-Safe Variants: Provides both CompletionTracker (asyncio/single-threaded) and ThreadSafeCompletionTracker (thread pool) implementations
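For intuition only, here is one way constant-valued property getters could be read without instantiating a generator class; this sketch is an assumption about the approach and may differ from what builder.py actually does:

```python
import dis


def read_constant_property(cls: type, prop_name: str) -> bool | None:
    """Best-effort: if `cls.prop_name` is a property whose getter simply returns a
    constant bool (e.g. `return True`), recover that value from its bytecode."""
    prop = getattr(cls, prop_name, None)
    if not isinstance(prop, property) or prop.fget is None:
        return None
    # Skip interpreter bookkeeping opcodes (RESUME on Python 3.11+).
    instructions = [i for i in dis.get_instructions(prop.fget) if i.opname != "RESUME"]
    if instructions and instructions[0].opname in ("LOAD_CONST", "RETURN_CONST"):
        value = instructions[0].argval
        if isinstance(value, bool):
            return value
    return None  # getter does real work; a different strategy would be needed


class ExampleGenerator:
    @property
    def can_generate_from_scratch(self) -> bool:
        return True


print(read_constant_property(ExampleGenerator, "can_generate_from_scratch"))  # True
```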

Implementation Quality

The code follows the project's style guidelines with proper type annotations, SPDX license headers, and comprehensive test coverage. The plan document is thorough, including an "Open Decision" section about barrier column checkpoint limitations with four detailed options for future consideration.

Critical Issue Found

  • completion.py:227-229: Barrier nodes are marked complete unconditionally during checkpoint restoration, even when completed_rows < graph.num_records. This violates the barrier semantic where ALL input rows must complete before ANY output can be produced.

Confidence Score: 4/5

  • Safe to merge after fixing the barrier checkpoint restoration logic
  • The implementation is well-designed with excellent documentation and test coverage. The critical bug in barrier checkpoint restoration must be fixed, as it could lead to incorrect execution when resuming from checkpoints with barrier columns. Otherwise, the code demonstrates strong architectural design and follows all project conventions.
  • Pay close attention to completion.py line 227-229 where the barrier restoration logic needs correction

Important Files Changed

| Filename | Overview |
| --- | --- |
| plans/async-engine/execution_graph_builder.plan.md | Comprehensive design document detailing hybrid graph representation, trait inference, dependency resolution, and checkpoint/restart support with clear examples and tradeoff analysis |
| packages/data-designer-engine/src/data_designer/engine/execution_graph/graph.py | Core hybrid graph implementation with on-demand cell node computation, dependency resolution, and row completion tracking for checkpoints |
| packages/data-designer-engine/src/data_designer/engine/execution_graph/builder.py | GraphBuilder with plugin-compatible trait inference using bytecode inspection to evaluate generator properties without instantiation |
| packages/data-designer-engine/src/data_designer/engine/execution_graph/completion.py | Memory-efficient completion tracking with O(C) memory, automatic compaction, and checkpoint/restart support for both single-threaded and thread-safe variants |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant GraphBuilder
    participant Registry
    participant ExecutionGraph
    participant CompletionTracker
    participant Engine

    Note over User,Engine: Graph Construction Phase
    User->>GraphBuilder: build(config, num_records)
    GraphBuilder->>Registry: get_for_config_type(config_type)
    Registry-->>GraphBuilder: generator_class
    GraphBuilder->>GraphBuilder: _infer_traits(gen_cls)
    Note over GraphBuilder: Uses bytecode inspection<br/>to evaluate properties
    GraphBuilder->>GraphBuilder: _build_column_descriptors()
    GraphBuilder->>ExecutionGraph: new(num_records, descriptors)
    ExecutionGraph->>ExecutionGraph: _validate_dependencies()
    ExecutionGraph-->>GraphBuilder: graph
    GraphBuilder-->>User: ExecutionGraph

    Note over User,Engine: Execution Phase
    User->>CompletionTracker: new(num_records)
    CompletionTracker-->>User: tracker
    
    loop For each ready node
        User->>ExecutionGraph: iter_ready_nodes(tracker)
        ExecutionGraph->>ExecutionGraph: Check dependencies satisfied
        ExecutionGraph-->>User: node (CellNodeId or BarrierNodeId)
        User->>ExecutionGraph: get_generator_and_config(node)
        ExecutionGraph-->>User: (generator_cls, config)
        User->>Engine: Execute node with generator
        Engine-->>User: result
        User->>CompletionTracker: mark_complete(node)
        Note over CompletionTracker: Automatic memory<br/>compaction when<br/>column completes
    end

    Note over User,Engine: Checkpoint/Restart
    User->>ExecutionGraph: get_completed_row_count(tracker)
    ExecutionGraph->>CompletionTracker: Check row completion
    ExecutionGraph-->>User: completed_rows
    User->>CompletionTracker: to_checkpoint(graph)
    CompletionTracker-->>User: {"version": 1, "completed_rows": N}
    Note over User: Store checkpoint
    
    Note over User,Engine: Resume from Checkpoint
    User->>CompletionTracker: from_checkpoint(checkpoint, graph)
    CompletionTracker->>CompletionTracker: Mark rows 0..N-1 complete
    CompletionTracker-->>User: restored_tracker
    User->>ExecutionGraph: iter_ready_nodes(restored_tracker)
    Note over ExecutionGraph: Skips completed nodes automatically
```


@greptile-apps greptile-apps bot left a comment


4 files reviewed, 1 comment


Comment on lines +227 to +229
```python
if desc.is_barrier:
    # For barrier columns, mark the barrier as complete
    tracker._completed_barriers.add(col_name)
```

Barrier marked complete unconditionally during checkpoint restoration. According to the plan document's "Open Decision" section, barrier columns prevent intermediate checkpoints since they require ALL input rows before producing ANY output. If completed_rows < graph.num_records, the barrier should not be marked complete.

Suggested change

```diff
 if desc.is_barrier:
-    # For barrier columns, mark the barrier as complete
-    tracker._completed_barriers.add(col_name)
+    # Only mark barrier complete if ALL rows are complete
+    if completed_rows == graph.num_records:
+        tracker._completed_barriers.add(col_name)
```
