
Batching in query execution #329

@cheb0

Description


In this task we will try to achieve batching in LID (postings) tree processing, as well as in histograms and aggregations. Modern analytical engines also process data in batches. Seq-db currently supports histograms and aggregations, not only full-text search, and these operations should also benefit from batching.

Batches will simply be slices (arrays) of LIDs. In the future, it is possible to support bitmaps or other data structures.

Batching will not make execution more efficient algorithmically, but it benefits from better CPU utilization (cache locality, tighter loops). It should also align well with block skipping (NextGEQ). It will also be interesting to measure simple scenarios, such as histograms and aggregations over plain queries like service:some-service that scroll through a large number of LIDs.

There are also some downsides. At first glance, it seems we can hit additional overhead on certain queries. For example, pod:gateway-* AND request_id:'123'. If fetching LIDs for the token request_id:'123' yields [30_000, 300_000, 700_000], then passing each of those LIDs to the pod:gateway-* tree will produce a large batch each time, basically just to merge it with a single LID. This might be partially addressed by hinting the Next method with how many LIDs we need, so that we use batching only where we have a lot of LIDs on both sides.

Pros

  • faster execution
  • ability to make GetMID, GetRID, histogram and aggregation support batching - increased CPU cache efficiency
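The cache-efficiency point for histograms can be sketched as follows: resolve timestamps for a whole batch in one tight loop (a GetMID analogue), then bucket them in a second pass. The `mids` array, bucket math, and function name are illustrative assumptions, not the real seq-db API:

```go
package main

import "fmt"

// histogramBatched consumes LIDs batch by batch. For each batch it first
// resolves timestamps in a tight lookup loop (hypothetical GetMID analogue
// backed by a plain mids array), then buckets them in a separate pass,
// reusing one temp slice across batches.
func histogramBatched(batches [][]uint32, mids []uint64, bucket uint64) map[uint64]int {
	hist := make(map[uint64]int)
	ts := make([]uint64, 0, 64) // temp slice reused across batches
	for _, b := range batches {
		ts = ts[:0]
		for _, lid := range b { // tight, cache-friendly lookup loop
			ts = append(ts, mids[lid])
		}
		for _, t := range ts { // separate bucketing pass
			hist[t/bucket*bucket]++
		}
	}
	return hist
}

func main() {
	mids := []uint64{0, 10, 25, 30, 55, 90}
	batches := [][]uint32{{1, 2, 3}, {4, 5}}
	fmt.Println(histogramBatched(batches, mids, 30))
}
```

The same split (batched lookup, then batched consumption) would apply to aggregations over GetRID.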

Cons

  • additional memory management (temp slices) in iterators
  • need to make sure we do not do more work on some queries (disk reads, CPU work)
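One way to contain the temp-slice overhead mentioned above is pooling the batch buffers, e.g. with sync.Pool; a minimal sketch (the pool helpers and the 1024 capacity are assumptions, and a real implementation may prefer pooling a pointer to the slice to avoid header copies):

```go
package main

import (
	"fmt"
	"sync"
)

// batchPool reuses temp LID batch slices across iterators, so batching does
// not turn into a fresh allocation per batch. Sizes are illustrative.
var batchPool = sync.Pool{
	New: func() any { return make([]uint32, 0, 1024) },
}

// getBatch returns an empty slice with pooled capacity.
func getBatch() []uint32 { return batchPool.Get().([]uint32)[:0] }

// putBatch returns a slice to the pool for reuse.
func putBatch(b []uint32) { batchPool.Put(b) }

func main() {
	b := getBatch()
	b = append(b, 1, 2, 3)
	fmt.Println(len(b), cap(b))
	putBatch(b)
}
```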

Labels

performance: Features or improvements that positively affect seq-db performance
