-
Notifications
You must be signed in to change notification settings - Fork 4
Description
What
Hello @RUrlus, I hope you are fine! I was recently considering the possibility of extending the current diptst() implementation to support stream-processing scenarios, where data arrive continuously rather than as a static sorted array. The idea is to compute or approximate the dip statistic incrementally — either for an append-only stream (ever-growing sample) or for a sliding window (fixed-size or time-based window).
Why
Right now, every dip calculation requires a full re-scan of all ( n ) sorted elements. In streaming or online systems, this means recomputing O(n) work for each new observation, which quickly becomes infeasible for large or high-rate data.
Supporting online updates would:
- enable real-time unimodality monitoring in evolving distributions;
- allow integration with systems like Kafka / Flink / AWS Kinesis;
- reduce latency for continuous dip-based anomaly detection or clustering tasks.
Ideas / Discussion
We could explore three directions, each with different complexity and precision trade-offs:
-
Append-only incremental dip
- Maintain the data in an order-statistics tree (balanced BST with rank queries).
- Locally update the convex envelopes (GCM/LCM) around the insertion point.
- Expected complexity: O(log n) per insert + small local convex repairs.
- Requires deeper refactoring of
compute_indices()to support partial updates.
-
Sliding-window recomputation (baseline)
- Keep a fixed-length deque or circular buffer of the most recent ( W ) samples.
- On each new sample, remove the oldest, insert the newest, re-sort, and call
diptst(). - Complexity: O(W) per step.
- Easiest to implement today; could serve as an MVP for stream compatibility.
-
Approximate streaming dip
- Maintain a quantile sketch (e.g., KLL, t-digest, or reservoir sampling).
- Periodically materialize a synthetic sorted sample from the sketch and run the exact dip on it.
- Complexity: O(log m) per update, O(m) per dip query (m ≪ n).
- Offers controllable accuracy–speed trade-off for large or high-velocity streams.
References
- Hartigan & Hartigan (1985): The Dip Test of Unimodality
- Incremental convex hull maintenance literature (e.g., Overmars & van Leeuwen, 1981)
- Quantile sketch algorithms: KLL (Agarwal et al., 2018), t-digest (Dunning, 2013)