Add implementation plan for Rust engine #11

Open
galpin wants to merge 6 commits into main from claude/rust-pluck-engine-RnHyI

@galpin (Owner) commented Feb 22, 2026

Outlines the architecture and steps for adding a PyO3/maturin-based
Rust engine to accelerate JSON normalization, tree walking, and frame
extraction while preserving the public Python API with graceful fallback.

https://claude.ai/code/session_012euCx7hSyKp8V4yuYCK9Dh

Rebased onto latest main which migrated from Poetry to uv with uv_build
backend and added pre-commit hooks (ruff + ty). Updated plan to reflect:
- Replace uv_build (not poetry-core) with maturin as build backend
- CI steps use uv commands (uv sync, uv run) instead of poetry
- Pre-commit compatibility with ty's unresolved-import = "warn"
- Development workflow uses maturin develop + uv run pytest

https://claude.ai/code/session_012euCx7hSyKp8V4yuYCK9Dh
Add a compiled Rust extension (pluck._pluck_engine) that accelerates
the two most expensive operations: JSON normalization with cross-joins
and frame extraction from GraphQL responses.
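
To illustrate what "normalization with cross-joins" can mean, here is a minimal pure-Python sketch. It is not the code in this PR: the function name `normalize_cross_join`, the dotted column naming, and the handling of empty lists are assumptions made for illustration only. Sibling lists under one object are expanded into the Cartesian product of their rows.

```python
from itertools import product


def normalize_cross_join(node, prefix=""):
    """Flatten nested JSON-like data into a list of row dicts.

    Hypothetical sketch: sibling lists are cross-joined, so an object with
    two lists of lengths m and n produces m * n rows.
    """
    if isinstance(node, dict):
        # Normalize each field independently, then cross-join the results.
        parts = [
            normalize_cross_join(value, f"{prefix}.{key}" if prefix else key)
            for key, value in node.items()
        ]
        rows = []
        for combo in product(*parts):
            merged = {}
            for partial_row in combo:
                merged.update(partial_row)
            rows.append(merged)
        return rows
    if isinstance(node, list):
        # Each list item contributes its own rows under the same prefix.
        rows = []
        for item in node:
            rows.extend(normalize_cross_join(item, prefix))
        # Assumption: an empty list still yields one row with a null value.
        return rows or [{prefix: None}]
    # Scalar leaf: a single row with a single column.
    return [{prefix: node}]
```

For example, `{"id": 1, "tags": ["a", "b"], "sizes": [10, 20]}` expands to four rows, one per (tag, size) pair, which is why this step dominates the cost for deeply nested responses.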

Key changes:
- rust/: PyO3 extension with normalize(), extract_frames(), and walker
- src/pluck/_engine.py: engine selector with graceful Python fallback
- src/pluck/_execution.py: delegates to engine instead of direct calls
- pyproject.toml: maturin build backend replaces uv_build
- CI.yml: adds Rust toolchain and maturin build steps
- tests/test_performance.py: benchmark comparing Python vs Rust (2.2x)

The public API is unchanged. When the Rust extension is unavailable,
the library falls back transparently to the existing Python code.
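
The fallback described above can be sketched roughly as follows. This is an assumption about the shape of such a selector, not the contents of `src/pluck/_engine.py`; the helper `select_engine` and the labels `"compiled"` and `"python"` are hypothetical.

```python
import importlib


def select_engine(module_name, python_fallback, attr="normalize"):
    """Return (callable, label), preferring a compiled extension module.

    Hypothetical helper: if the extension cannot be imported, or does not
    expose the requested attribute, the pure-Python callable is used.
    """
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr), "compiled"
    except (ImportError, AttributeError):
        # The extension is missing or incomplete: fall back silently.
        return python_fallback, "python"
```

A caller would resolve the function once at import time, for example `normalize, engine = select_engine("pluck._pluck_engine", python_normalize)` where `python_normalize` stands in for the existing Python implementation, and use `normalize` everywhere afterwards so the public API stays identical in both cases.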

https://claude.ai/code/session_012euCx7hSyKp8V4yuYCK9Dh
Two key optimizations to the Rust normalization engine:

1. Native Rust Value enum: Convert Python objects to Rust types once at
   the boundary, do all normalization in pure Rust (no GIL interaction),
   then convert back. Eliminates ~33K Py_INCREF calls per benchmark.
   Uses Rc<str> for column names so cross-join cloning is a pointer copy.

2. Columnar output format: New normalize_columnar() returns {col: [vals]}
   instead of [{col: val}]. Creates 1 dict + N_cols lists instead of
   N_rows dicts. Pandas consumes columnar data much faster.
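
The difference between the two output shapes can be shown with a small pure-Python sketch. The helper `rows_to_columns` is hypothetical and only illustrates the layout; the Rust engine builds the columnar form directly rather than converting from rows.

```python
def rows_to_columns(rows):
    """Convert [{col: val}, ...] into {col: [vals]}, padding gaps with None.

    Hypothetical illustration of the columnar layout: one dict plus one list
    per column, instead of one dict per row.
    """
    columns = {}
    for index, row in enumerate(rows):
        for name, value in row.items():
            if name not in columns:
                # Column first seen now: backfill earlier rows with None.
                columns[name] = [None] * index
            columns[name].append(value)
        # Pad any column that this row did not provide.
        for values in columns.values():
            if len(values) <= index:
                values.append(None)
    return columns
```

A dict of equal-length lists like this can be passed straight to `pandas.DataFrame(...)`, which avoids inspecting every row dict to discover the column set.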

End-to-end benchmarks (normalize + DataFrame creation):
  5K rows:  Python 0.077s → Rust 0.014s (5.6x)
  20K rows: Python 0.222s → Rust 0.049s (4.6x)
  60K rows: Python 0.717s → Rust 0.203s (3.5x)

https://claude.ai/code/session_012euCx7hSyKp8V4yuYCK9Dh
…(6-7x)

- Replace path.to_vec() allocations with mutable push/pop path stack
- Add HashMap cache for generate_name to avoid redundant string allocation
- Add normalize_columnar_batch: single Rust call for all items, eliminating
  per-item Python↔Rust round-trips and Python-side merge loop
- Update _execution.py to use batch function
- All 43 tests pass, benchmark shows 5-7x end-to-end speedup
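
The first two bullets describe a common tree-walking pattern, sketched here in Python for readability. The names `flatten_leaves` and `generate_name` and the dotted naming scheme are assumptions; the actual change is in Rust, where avoiding `path.to_vec()` and repeated string building saves real allocations (the Python version only shows the control flow).

```python
def flatten_leaves(obj, separator="."):
    """Walk nested dicts and lists, returning (column_name, value) pairs.

    Hypothetical sketch: one shared path stack is pushed and popped during
    the walk instead of copying the path at every level, and joined column
    names are cached per path.
    """
    path = []        # shared mutable stack of keys
    name_cache = {}  # tuple(path) -> joined column name
    leaves = []

    def generate_name():
        key = tuple(path)
        name = name_cache.get(key)
        if name is None:
            name = separator.join(key)
            name_cache[key] = name
        return name

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                path.append(key)  # push before descending
                walk(value)
                path.pop()        # pop on the way back up
        elif isinstance(node, list):
            for item in node:
                walk(item)        # list items share the parent's path
        else:
            leaves.append((generate_name(), node))

    walk(obj)
    return leaves
```

Because list items reuse the same path, every element of a long list hits the name cache after the first lookup, which is the case the cache is meant to speed up.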

https://claude.ai/code/session_012euCx7hSyKp8V4yuYCK9Dh
- Add arrow crate (v55) with pyarrow FFI for zero-copy Rust→Python transfer
- New normalize_arrow_batch: builds typed Arrow arrays (Int64, Float64,
  Boolean, Utf8) directly from Rust Value enum, passes RecordBatch to
  Python via Arrow C Data Interface — no per-cell PyObject creation
- Add pyarrow as runtime dependency
- Update _execution.py to use Arrow path as primary when Rust engine available
- Add create_from_arrow to DataFrameLibrary using RecordBatch.to_pandas()
- Benchmark: 6.2x (5K rows), 7.6x (20K), 10.1x (60K) end-to-end speedup
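
Building typed arrays requires choosing one type per column. The PR text does not say how mixed or all-null columns are resolved, so the rules below are purely an assumption for illustration: a hypothetical `infer_column_type` that maps Python values onto the four type names listed above, widening mixed integer and float columns to Float64 and falling back to Utf8 otherwise.

```python
def infer_column_type(values):
    """Pick one of "Int64", "Float64", "Boolean", "Utf8" for a column.

    Hypothetical rules (not taken from the PR): nulls are ignored, mixed
    Int64/Float64 widens to Float64, and any other mix becomes Utf8.
    """
    kinds = set()
    for value in values:
        if value is None:
            continue
        # bool must be checked before int because bool subclasses int.
        if isinstance(value, bool):
            kinds.add("Boolean")
        elif isinstance(value, int):
            kinds.add("Int64")
        elif isinstance(value, float):
            kinds.add("Float64")
        else:
            kinds.add("Utf8")
    if not kinds:
        return "Utf8"  # assumption: an all-null column defaults to Utf8
    if len(kinds) == 1:
        return next(iter(kinds))
    if kinds == {"Int64", "Float64"}:
        return "Float64"
    return "Utf8"
```

Once each column has a single type, its values can be written into one contiguous typed buffer, which is what makes handing the batch to Python possible without creating an object per cell.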

https://claude.ai/code/session_012euCx7hSyKp8V4yuYCK9Dh