-
Notifications
You must be signed in to change notification settings - Fork 56
feat: add execution graph builder plan with reference implementation #269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: jena/dev/async-engine
Are you sure you want to change the base?
Conversation
This introduces a design plan for a memory-efficient execution graph that models cell-level dependencies for async dataset generation. The reference implementation is included to help build intuition about how the concepts could work in practice. The plan is the primary artifact - the code is exploratory.
Add row-complete batch checkpointing for resuming interrupted generation: - Add is_row_complete() and get_completed_row_count() to ExecutionGraph - Add to_checkpoint() and from_checkpoint() to CompletionTracker - Add thread-safe checkpoint methods to ThreadSafeCompletionTracker - Export CHECKPOINT_VERSION for compatibility checking - Add comprehensive tests for checkpoint functionality - Update plan documentation with checkpoint/restart section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 files reviewed, no comments
Add "Open Decision: Checkpointing with Barrier Columns" section to the execution graph builder plan. This documents that barrier columns prevent intermediate checkpoints and outlines four potential approaches to address this in the future.
Update: Documented Barrier Checkpoint LimitationAdded a new section to the plan: "Open Decision: Checkpointing with Barrier Columns" The IssueWith the current checkpoint design, barrier columns (like validation) prevent any intermediate checkpoints. Since barriers require ALL input rows before producing ANY output, This means if a generation run fails before or during barrier execution, all pre-barrier work is lost. Options DocumentedThe plan now outlines four approaches to address this:
Decision DeferredWe're documenting this as an open design question to be decided later based on real-world usage patterns. The current implementation works correctly; this is about checkpoint granularity optimization. See the updated plan for full details and pros/cons analysis. |
Greptile OverviewGreptile SummaryIntroduces a memory-efficient execution graph framework for async dataset generation with a hybrid representation that handles millions of records using O(C) memory instead of O(C×R). The implementation uses virtual cell nodes computed on-demand while storing only column-level metadata. Key Contributions
Implementation QualityThe code follows the project's style guidelines with proper type annotations, SPDX license headers, and comprehensive test coverage. The plan document is thorough, including an "Open Decision" section about barrier column checkpoint limitations with four detailed options for future consideration. Critical Issue Found
Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant GraphBuilder
participant Registry
participant ExecutionGraph
participant CompletionTracker
participant Engine
Note over User,Engine: Graph Construction Phase
User->>GraphBuilder: build(config, num_records)
GraphBuilder->>Registry: get_for_config_type(config_type)
Registry-->>GraphBuilder: generator_class
GraphBuilder->>GraphBuilder: _infer_traits(gen_cls)
Note over GraphBuilder: Uses bytecode inspection<br/>to evaluate properties
GraphBuilder->>GraphBuilder: _build_column_descriptors()
GraphBuilder->>ExecutionGraph: new(num_records, descriptors)
ExecutionGraph->>ExecutionGraph: _validate_dependencies()
ExecutionGraph-->>GraphBuilder: graph
GraphBuilder-->>User: ExecutionGraph
Note over User,Engine: Execution Phase
User->>CompletionTracker: new(num_records)
CompletionTracker-->>User: tracker
loop For each ready node
User->>ExecutionGraph: iter_ready_nodes(tracker)
ExecutionGraph->>ExecutionGraph: Check dependencies satisfied
ExecutionGraph-->>User: node (CellNodeId or BarrierNodeId)
User->>ExecutionGraph: get_generator_and_config(node)
ExecutionGraph-->>User: (generator_cls, config)
User->>Engine: Execute node with generator
Engine-->>User: result
User->>CompletionTracker: mark_complete(node)
Note over CompletionTracker: Automatic memory<br/>compaction when<br/>column completes
end
Note over User,Engine: Checkpoint/Restart
User->>ExecutionGraph: get_completed_row_count(tracker)
ExecutionGraph->>CompletionTracker: Check row completion
ExecutionGraph-->>User: completed_rows
User->>CompletionTracker: to_checkpoint(graph)
CompletionTracker-->>User: {"version": 1, "completed_rows": N}
Note over User: Store checkpoint
Note over User,Engine: Resume from Checkpoint
User->>CompletionTracker: from_checkpoint(checkpoint, graph)
CompletionTracker->>CompletionTracker: Mark rows 0..N-1 complete
CompletionTracker-->>User: restored_tracker
User->>ExecutionGraph: iter_ready_nodes(restored_tracker)
Note over ExecutionGraph: Skips completed nodes automatically
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4 files reviewed, 1 comment
| if desc.is_barrier: | ||
| # For barrier columns, mark the barrier as complete | ||
| tracker._completed_barriers.add(col_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Barrier marked complete unconditionally during checkpoint restoration. According to the plan document's "Open Decision" section, barrier columns prevent intermediate checkpoints since they require ALL input rows before producing ANY output. If completed_rows < graph.num_records, the barrier should not be marked complete.
| if desc.is_barrier: | |
| # For barrier columns, mark the barrier as complete | |
| tracker._completed_barriers.add(col_name) | |
| if desc.is_barrier: | |
| # Only mark barrier complete if ALL rows are complete | |
| if completed_rows == graph.num_records: | |
| tracker._completed_barriers.add(col_name) |
Experimenting with this plan-first approach!
Main goal is to build out a graph of all the cells that need to be generated along with their dependencies. The idea would be to have an execution engine fetch work from this graph when it is ready.
plans/async-engine/execution_graph_builder.plan.mdplans/async-engine/explore_execution_graph.ipynb- Interactive notebook exploring the implementationPart of #260
Closes #267
NOTE: The reference implementation is just meant to help build some intuition for what a concrete implementation might look like.
📋 Summary
This PR introduces the Execution Graph Builder plan document that designs a memory-efficient graph-based execution framework for async dataset generation. A reference implementation is included to build intuition and validate the design decisions—the plan document should be the primary focus for review.
🎯 Focus: The Plan Document
The core of this PR is
plans/async-engine/execution_graph_builder.plan.mdwhich details:START,CELL_BY_CELL,ROW_STREAMABLE,BARRIERinferred from generator properties🔄 Changes
✨ Added
Plan & Exploration:
plans/async-engine/execution_graph_builder.plan.md- Detailed design document for the execution graph frameworkplans/async-engine/explore_execution_graph.ipynb- Interactive notebook exploring the implementationReference Implementation (in
packages/data-designer-engine/src/data_designer/engine/execution_graph/):node_id.py-CellNodeId,BarrierNodeId, andNodeIdtype aliastraits.py-ExecutionTraitsFlag enumcolumn_descriptor.py- Column metadata with traits and dependenciesgraph.py-ExecutionGraphwith hybrid representationbuilder.py-GraphBuilderfactory with trait inferencecompletion.py-CompletionTrackerandThreadSafeCompletionTrackerGenerator Properties:
is_row_streamableproperty toColumnGeneratorbase class and overrides in subclassesTests (in
packages/data-designer-engine/tests/engine/execution_graph/):🔧 Changed
ModelConfigand schema transform processor🔍 Attention Areas
execution_graph_builder.plan.md- Primary review target: Does the design adequately address async execution requirements?graph.py- Core hybrid representation logicbuilder.py#L175-L210- Trait inference logic (plugin-compatible approach)🔄 Update: Checkpoint/Restart Support
Added row-complete batch checkpointing to enable resuming interrupted generation runs. This feature allows saving progress at row boundaries and efficiently restoring state without storing individual cell indices.
Key Design Decisions
{"completed_rows": N}- O(1) regardless of dataset sizeNew APIs
ExecutionGraph (
graph.py):is_row_complete(row, completed)- Check if all cells for a row are completeget_completed_row_count(completed)- Count contiguous complete rows from row 0CompletionTracker / ThreadSafeCompletionTracker (
completion.py):to_checkpoint(graph)- Create compact checkpoint:{"version": 1, "completed_rows": N}from_checkpoint(checkpoint, graph)- Restore tracker state from checkpointUsage Example
Files Changed
graph.pyis_row_complete(),get_completed_row_count()completion.pyto_checkpoint(),from_checkpoint()to both tracker classes__init__.pyCHECKPOINT_VERSIONtest_completion.pytest_graph.pyexecution_graph_builder.plan.md🤖 Generated with AI