Skip to content

Explore streaming / sliding-window support for the Dip Test #92

@prokolyvakis

Description

@prokolyvakis

What

Hello @RUrlus, I hope you are fine! I was recently considering the possibility of extending the current diptst() implementation to support stream-processing scenarios, where data arrive continuously rather than as a static sorted array. The idea is to compute or approximate the dip statistic incrementally — either for an append-only stream (ever-growing sample) or for a sliding window (fixed-size or time-based window).

Why

Right now, every dip calculation requires a full re-scan of all ( n ) sorted elements. In streaming or online systems, this means recomputing O(n) work for each new observation, which quickly becomes infeasible for large or high-rate data.

Supporting online updates would:

  • enable real-time unimodality monitoring in evolving distributions;
  • allow integration with systems like Kafka / Flink / AWS Kinesis;
  • reduce latency for continuous dip-based anomaly detection or clustering tasks.

Ideas / Discussion

We could explore three directions, each with different complexity and precision trade-offs:

  1. Append-only incremental dip

    • Maintain the data in an order-statistics tree (balanced BST with rank queries).
    • Locally update the convex envelopes (GCM/LCM) around the insertion point.
    • Expected complexity: O(log n) per insert + small local convex repairs.
    • Requires deeper refactoring of compute_indices() to support partial updates.
  2. Sliding-window recomputation (baseline)

    • Keep a fixed-length deque or circular buffer of the most recent ( W ) samples.
    • On each new sample, remove the oldest, insert the newest, re-sort, and call diptst().
    • Complexity: O(W) per step.
    • Easiest to implement today; could serve as an MVP for stream compatibility.
  3. Approximate streaming dip

    • Maintain a quantile sketch (e.g., KLL, t-digest, or reservoir sampling).
    • Periodically materialize a synthetic sorted sample from the sketch and run the exact dip on it.
    • Complexity: O(log m) per update, O(m) per dip query (m ≪ n).
    • Offers controllable accuracy–speed trade-off for large or high-velocity streams.

References

  • Hartigan & Hartigan (1985): The Dip Test of Unimodality
  • Incremental convex hull maintenance literature (e.g., Overmars & van Leeuwen, 1981)
  • Quantile sketch algorithms: KLL (Agarwal et al., 2018), t-digest (Dunning, 2013)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions