Summary
Investigated whether valr could benefit from direct C-level integration with bigWig/bigBed files via cpp11bigwig, to avoid inflating entire files to tibbles before intersection operations.
Analysis
What we explored
The idea was to extend bed_intersect() and bed_map() to accept file paths directly:
bed_map(peaks, "signal.bw", mean_signal = mean(value))
bed_intersect(peaks, "annotations.bb")
libBigWig provides indexed, on-demand access:
- bwStats() - computes mean/min/max/sum/coverage/stdev per interval using zoom levels
- bbGetOverlappingEntries() - returns only entries overlapping a query region
Why it probably won't help
Per-interval queries have too much overhead:
Calling libBigWig functions in a loop (one per query interval) means N round trips for N intervals. The R/C++ conversion overhead alone would likely negate any benefits.
Per-chromosome batching is equivalent to current approach:
To reduce libBigWig calls, we'd query once per chromosome using a bounding box (min start to max end), then do intersection in C++. But this is essentially what the current approach does:
bed_map(peaks, read_bigwig("signal.bw"), ...)
The only difference is whether we materialize to an R tibble in between - valr's C++ interval tree operations are already efficient.
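For reference, the bounding-box step could look roughly like the base R sketch below. The `peaks` data frame and its values are made up for illustration, and nothing here calls libBigWig or cpp11bigwig. It only shows how N query intervals collapse to one region per chromosome.

```r
# Hypothetical query intervals using the BED-style chrom/start/end columns.
peaks <- data.frame(
  chrom = c("chr1", "chr1", "chr2", "chr2", "chr2"),
  start = c(100, 5000, 200, 900, 40),
  end   = c(250, 5200, 400, 1500, 90)
)

# One bounding box per chromosome: smallest start to largest end.
bbox <- merge(
  aggregate(start ~ chrom, data = peaks, FUN = min),
  aggregate(end ~ chrom, data = peaks, FUN = max),
  by = "chrom"
)

# Five query intervals reduce to two regions, so two file queries instead of five.
print(bbox)
```

Each bounding box still covers every gap between the peaks on that chromosome, which is why dense, genome-wide query sets end up reading about as much data as the full-file approach.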
When direct access would win:
- Sparse queries: e.g., 100 small intervals scattered across the genome against a huge bigWig
- In that case, querying only those specific regions beats inflating the whole file
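A rough way to see how lopsided the sparse case is, using made-up numbers (100 intervals of 500 bp, one every 2 Mb on a single chromosome). This is a back-of-the-envelope sketch in base R, not a benchmark:

```r
# Hypothetical sparse query set: 100 intervals of 500 bp, one every 2 Mb.
starts <- seq(1e6, by = 2e6, length.out = 100)
queries <- data.frame(chrom = "chr1", start = starts, end = starts + 500)

# Bases actually requested by the query intervals.
queried_bp <- sum(queries$end - queries$start)

# Bases spanned by a single bounding box over the same intervals.
bbox_bp <- max(queries$end) - min(queries$start)

# Fraction of the bounding box that the queries actually need.
queried_bp / bbox_bp
```

With these numbers the queries need well under 0.1% of the bounding box, so anything that reads the whole span (or the whole file) does mostly wasted work.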
For typical genomics workflows (peaks vs. signal across whole genome), the current approach is already reasonable.
Conclusion
No immediate action needed. The current workflow using read_bigwig()/read_bigbed() followed by valr operations is efficient enough for common use cases.
If users report specific performance issues (huge files, sparse queries, memory constraints), we can revisit direct integration. The design work is documented here for future reference.
Related
- cpp11bigwig package: https://github.com/rnabioco/cpp11bigwig
- libBigWig C library used under the hood