Explore direct bigWig/bigBed integration for bed_intersect/bed_map #444

@jayhesselberth

Summary

Investigated whether valr could benefit from direct C-level integration with bigWig/bigBed files via cpp11bigwig, to avoid inflating entire files to tibbles before intersection operations.

Analysis

What we explored

The idea was to extend bed_intersect() and bed_map() to accept file paths directly:

bed_map(peaks, "signal.bw", mean_signal = mean(value))
bed_intersect(peaks, "annotations.bb")

libBigWig provides indexed, on-demand access:

  • bwStats() - computes mean/min/max/sum/coverage/stdev per interval using zoom levels
  • bbGetOverlappingEntries() - returns only entries overlapping a query region

Why it probably won't help

Per-interval queries have too much overhead:
Calling libBigWig functions in a loop (one per query interval) means N round trips for N intervals. The R/C++ conversion overhead alone would likely negate any benefits.
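
To make the two call patterns concrete, here is a minimal base-R sketch on simulated data. It does not touch libBigWig or cpp11bigwig: `query_one()` is a hypothetical stand-in for a single indexed lookup, and the signal vector and intervals are made up for illustration. It only shows that N separate calls and one batched pass produce the same per-interval means, so the choice between them is purely a matter of overhead.

```r
# Simulated per-base signal for one chromosome (stand-in for a bigWig track).
set.seed(1)
signal <- runif(1000)

# Hypothetical query intervals, 1-based and inclusive to keep indexing simple.
starts <- c(1, 101, 501)
ends   <- c(50, 200, 1000)

# Hypothetical stand-in for one indexed lookup (e.g. one bwStats() call).
query_one <- function(start, end) mean(signal[start:end])

# Per-interval pattern: N separate calls for N intervals.
per_interval <- vapply(
  seq_along(starts),
  function(i) query_one(starts[i], ends[i]),
  numeric(1)
)

# Batched pattern: one pass over the data, then every interval mean
# is derived from the cumulative sums in a single vectorized step.
csum    <- c(0, cumsum(signal))
batched <- (csum[ends + 1] - csum[starts]) / (ends - starts + 1)

# Both patterns agree on the result.
stopifnot(isTRUE(all.equal(per_interval, batched)))
```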

Per-chromosome batching is equivalent to current approach:
To reduce libBigWig calls, we'd query once per chromosome using a bounding box (min start to max end), then do intersection in C++. But this is essentially what the current approach does:

bed_map(peaks, read_bigwig("signal.bw"), ...)

The only difference is whether the data is materialized as an R tibble in between, and valr's C++ interval tree operations are already efficient once it is.
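
As a rough sketch of the bounding-box step described above, the per-chromosome query regions could be derived from the peaks like this. The `bounding_boxes()` helper and the example `peaks` table are hypothetical and use base R only; nothing here is part of valr or cpp11bigwig.

```r
# Hypothetical query intervals in BED-like form.
peaks <- data.frame(
  chrom = c("chr1", "chr1", "chr2", "chr2", "chr2"),
  start = c(100, 5000, 200, 900, 50),
  end   = c(250, 5200, 400, 1000, 75),
  stringsAsFactors = FALSE
)

# One bounding box per chromosome: min start to max end.
bounding_boxes <- function(x) {
  starts <- tapply(x$start, x$chrom, min)
  ends   <- tapply(x$end, x$chrom, max)
  data.frame(
    chrom = names(starts),
    start = as.numeric(starts),
    end   = as.numeric(ends[names(starts)]),
    stringsAsFactors = FALSE
  )
}

boxes <- bounding_boxes(peaks)
print(boxes)
```

Each row of `boxes` would correspond to one query against the file. When the peaks span most of each chromosome, those boxes cover nearly the same data as reading the whole file, which is why this strategy ends up close to the current approach.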

When direct access would win:

  • Sparse queries: e.g., 100 small intervals scattered across the genome against a huge bigWig. In that case, querying only those specific regions beats inflating the whole file.
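
One way to gauge whether a query set is "sparse" in this sense is to compare the bases the intervals actually cover with the span of their per-chromosome bounding boxes. The sketch below is a hypothetical heuristic, not something valr provides, and this issue does not propose any threshold. `query_density()` is an illustrative name, and it assumes the intervals do not overlap each other (overlaps would inflate the covered total).

```r
# Fraction of the per-chromosome bounding boxes that the queries cover.
# Assumes non-overlapping intervals with columns chrom, start, end.
query_density <- function(x) {
  per_chrom <- split(x, x$chrom)
  covered <- sum(vapply(per_chrom, function(d) sum(d$end - d$start), numeric(1)))
  spanned <- sum(vapply(per_chrom, function(d) max(d$end) - min(d$start), numeric(1)))
  covered / spanned
}

# Sparse case: three 100 bp intervals scattered across one chromosome.
sparse <- data.frame(
  chrom = "chr1",
  start = c(1e6, 5e7, 9e7),
  end   = c(1e6, 5e7, 9e7) + 100
)

# Dense case: ten adjacent 100 bp intervals tiling a 1 kb region.
dense <- data.frame(
  chrom = "chr1",
  start = seq(0, 900, by = 100),
  end   = seq(100, 1000, by = 100)
)

sparse_density <- query_density(sparse)
dense_density  <- query_density(dense)
```

A density near 1 means the bounding-box read returns mostly data the queries need; a density near 0 is the situation where region-specific queries would avoid reading almost everything.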

For typical genomics workflows (peaks vs. signal across whole genome), the current approach is already reasonable.

Conclusion

No immediate action needed. The current workflow using read_bigwig()/read_bigbed() followed by valr operations is efficient enough for common use cases.

If users report specific performance issues (huge files, sparse queries, memory constraints), we can revisit direct integration. The design work is documented here for future reference.
