refactoring and optimizations in annotation queries #582

karasikov · 2026-01-10T17:11:27Z

Don't re-encode labels in the query graph (helps when the query graph has an extremely large number of labels)
Parallel annotation query for annotations with coords and counts in binary mode (previously was single-threaded)
More efficient top-k selection (std::nth_element + sort instead of sort + resize)
More efficient hit accumulation (adaptive: dense vector or hash table, depending on the number of labels)
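The top-k item above can be illustrated with a minimal sketch. It is not the code from this PR: `LabelCount` and `keep_top_k` are hypothetical names, and the tie-breaking rule is an assumption added only to make the output deterministic. The point is that `std::nth_element` partitions in linear time on average, so only the k surviving entries need a full sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical (label, count) pair, used only for this illustration.
using LabelCount = std::pair<std::string, uint64_t>;

// Keep the k entries with the largest counts, ordered by decreasing count.
// Ties are broken by label (an assumption, to keep the result deterministic).
inline void keep_top_k(std::vector<LabelCount> &hits, size_t k) {
    auto by_count_desc = [](const LabelCount &a, const LabelCount &b) {
        return a.second != b.second ? a.second > b.second : a.first < b.first;
    };
    if (k < hits.size()) {
        // Average O(n): afterwards the k best entries occupy the first k slots
        // (in unspecified order), so the tail can be dropped.
        std::nth_element(hits.begin(), hits.begin() + k, hits.end(), by_count_desc);
        hits.resize(k);
    }
    // O(k log k): sort only the survivors instead of the whole vector.
    std::sort(hits.begin(), hits.end(), by_count_desc);
}
```

Compared with sorting everything and then truncating, this does O(n + k log k) work instead of O(n log n), which matters when the number of candidate labels is much larger than `--num-top-labels`.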

Status of parallelization for different query setups with batch query:

| Annotation type \ Query type | matches | counts | coords |
|---|---|---|---|
| basic | ✅ | n/a | n/a |
| with counts | ❌ -> ✅ | ❌ (TODO, another PR) | n/a |
| with coords | ❌ -> ✅ | ❌ (TODO, another PR) | - (always non-batch) |

Examples

1 - 1.8x faster querying tara_assemblies

(Thanks to adaptive count accumulation, not reencoding the labels, and faster top-n selection.)
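As a rough sketch of what adaptive count accumulation could look like (not the PR's actual `accumulate_counts`): the function name `accumulate_hits`, the row representation, and the `dense_threshold` cut-off below are all assumptions made for illustration. With few labels a flat counter array is cheapest; with very many labels a hash table avoids allocating and scanning a counter per label.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Count in how many rows each column (label) index is set.
// `rows[i]` lists the set columns of row i; `num_columns` is the number of labels.
// `dense_threshold` is an illustrative value, not one taken from the PR.
inline std::vector<std::pair<uint64_t, uint64_t>>
accumulate_hits(const std::vector<std::vector<uint64_t>> &rows,
                uint64_t num_columns,
                uint64_t dense_threshold = 1000000) {
    std::vector<std::pair<uint64_t, uint64_t>> result;
    if (num_columns <= dense_threshold) {
        // Few labels: a dense vector gives O(1) updates without hashing.
        std::vector<uint64_t> counts(num_columns, 0);
        for (const auto &row : rows)
            for (uint64_t j : row)
                ++counts[j];
        for (uint64_t j = 0; j < num_columns; ++j)
            if (counts[j])
                result.emplace_back(j, counts[j]);
    } else {
        // Many labels: only the columns that actually occur take up memory.
        std::unordered_map<uint64_t, uint64_t> counts;
        for (const auto &row : rows)
            for (uint64_t j : row)
                ++counts[j];
        result.assign(counts.begin(), counts.end());
        // Sort by column index so both branches return the same order.
        std::sort(result.begin(), result.end());
    }
    return result;
}
```

Both branches return the same (column, count) pairs, so the choice between them only affects time and memory, not the query result.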

Logs

Before:

srv-metagraph@mex:/scratch/nvme3/tara_assemblies$ /usr/bin/time -v ./metagraph_DNA query /data/random_stud/queries/100_studies_50k_short.fq --mmap -p 40 -v -i graph_small.indexed.dbg -a annotation.v4.row_diff_sparse.annodbg -v --batch-size 10000000000 --query-mode matches --num-top-labels 50 --min-kmers-fraction-label 0 > /dev/null
[2026-01-22 10:23:50.295] [trace] Metagraph started
[2026-01-22 10:24:37.251] [trace] Parsing sequences from file '/data/random_stud/queries/100_studies_50k_short.fq'
[2026-01-22 10:24:45.545] [trace] [Query graph construction] Building the batch graph...
[2026-01-22 10:28:03.930] [trace] [Query graph construction] Batch graph contains 568137958 k-mers and took 198.384456538 sec to construct
[2026-01-22 10:34:33.144] [trace] [Query graph construction] Contig extraction took 389.214268666 sec
[2026-01-22 10:34:33.144] [trace] [Query graph construction] Mapping k-mers back to full graph...
[2026-01-22 10:38:44.811] [trace] [Query graph construction] Contigs mapped to the full graph (found 1721062 / 568137958 k-mers) in 251.666932983 sec
[2026-01-22 10:38:46.510] [trace] [Query graph construction] Building the query graph...
[2026-01-22 10:38:52.060] [trace] [Query graph construction] Query graph contains 1721062 k-mers and took 5.549840243 sec to construct
[2026-01-22 10:38:52.060] [trace] [Query graph construction] Mapping the contigs back to the query graph...
[2026-01-22 10:39:00.283] [trace] [Query graph construction] Mapping between graphs constructed in 8.223766384 sec
[2026-01-22 10:39:01.438] [trace] [Query graph construction] Slicing 1721062 rows out of full annotation...
[2026-01-22 10:39:12.345] [trace] [Query graph construction] Query annotation with 1721062 rows, 1835925 labels, and 34915572 set bits constructed in 20.284933866 sec
[2026-01-22 10:50:04.125] [trace] Batch of 1226845631 bp from '/data/random_stud/queries/100_studies_50k_short.fq': Query graph constructed in 866.80252 sec, redundancy: 712.84 bp/kmer, queried in 651.77136 sec. Batch query time: 1518.57388 sec, 807893.3 bp/s

After:

srv-metagraph@mex:/scratch/nvme3/tara_assemblies$ /usr/bin/time -v ./metagraph_DNA query /data/random_stud/queries/100_studies_50k_short.fq --mmap -p 40 -v -i graph_small.indexed.dbg -a annotation.v4.row_diff_sparse.annodbg -v --batch-size 10000000000 --query-mode matches --num-top-labels 50 --min-kmers-fraction-label 0 > /dev/null
[2026-01-22 10:06:10.704] [trace] Metagraph started
[2026-01-22 10:07:14.529] [trace] Parsing sequences from file '/data/random_stud/queries/100_studies_50k_short.fq'
[2026-01-22 10:07:22.164] [trace] [Query graph construction] Building the batch graph...
[2026-01-22 10:10:10.783] [trace] [Query graph construction] Batch graph contains 568137958 k-mers and took 168.620122111 sec to construct
[2026-01-22 10:16:13.550] [trace] [Query graph construction] Contig extraction took 362.765982662 sec
[2026-01-22 10:16:13.550] [trace] [Query graph construction] Mapping k-mers back to full graph...
[2026-01-22 10:20:29.169] [trace] [Query graph construction] Contigs mapped to the full graph (found 1721062 / 568137958 k-mers) in 255.618956579 sec
[2026-01-22 10:20:30.815] [trace] [Query graph construction] Building the query graph...
[2026-01-22 10:20:34.726] [trace] [Query graph construction] Query graph contains 1721062 k-mers and took 3.910616207 sec to construct
[2026-01-22 10:20:34.726] [trace] [Query graph construction] Mapping the contigs back to the query graph...
[2026-01-22 10:20:41.715] [trace] [Query graph construction] Mapping between graphs constructed in 6.989409407 sec
[2026-01-22 10:20:42.916] [trace] [Query graph construction] Slicing 1721062 rows out of full annotation...
[2026-01-22 10:20:57.171] [trace] [Query graph construction] Query annotation with 1721062 rows, 318205057 labels, and 34915572 set bits constructed in 22.442328573 sec
[2026-01-22 10:21:09.616] [trace] Batch of 1226845631 bp from '/data/random_stud/queries/100_studies_50k_short.fq': Query graph constructed in 815.01424 sec, redundancy: 712.84 bp/kmer, queried in 12.43879 sec. Batch query time: 827.45303 sec, 1482677.1 bp/s

2 - 2.9x faster querying (--query-mode matches) of RefSeq with coords

(Thanks to the parallelization.)

Logs

Before:

srv-metagraph@mex:/scratch/nvme7/refseq_33m$ /usr/bin/time -v ./metagraph_DNA query /data/random_stud/queries/100_studies_100_short.fq --mmap -p 40 -v -i /scratch/nvme7/refseq_33m/graph_k31.indexed.dbg -a /scratch/nvme6/refseq/annotation.relaxed.relabeled.row_diff_brwt_coord.annodbg -v --batch-size 10000000000 --query-mode matches --no-coord-mapping --min-kmers-fraction-label 0 > /dev/null
[2026-01-22 14:03:19.327] [trace] Metagraph started
[2026-01-22 14:04:26.137] [trace] Parsing sequences from file '/data/random_stud/queries/100_studies_100_short.fq'
[2026-01-22 14:04:26.172] [trace] [Query graph construction] Building the batch graph...
[2026-01-22 14:04:26.435] [trace] [Query graph construction] Batch graph contains 1674535 k-mers and took 0.263780349 sec to construct
[2026-01-22 14:04:26.932] [trace] [Query graph construction] Contig extraction took 0.49652752 sec
[2026-01-22 14:04:26.932] [trace] [Query graph construction] Mapping k-mers back to full graph...
[2026-01-22 14:06:42.105] [trace] [Query graph construction] Contigs mapped to the full graph (found 625599 / 1674535 k-mers) in 135.166176331 sec
[2026-01-22 14:06:42.113] [trace] [Query graph construction] Building the query graph...
[2026-01-22 14:06:42.186] [trace] [Query graph construction] Query graph contains 625599 k-mers and took 0.07354974 sec to construct
[2026-01-22 14:06:42.186] [trace] [Query graph construction] Mapping the contigs back to the query graph...
[2026-01-22 14:06:42.269] [trace] [Query graph construction] Mapping between graphs constructed in 0.082363831 sec
[2026-01-22 14:06:42.271] [trace] [Query graph construction] Slicing 625599 rows out of full annotation...
[2026-01-22 14:22:54.974] [trace] [Query graph construction] Query annotation with 625599 rows, 71758 labels, and 34935291 set bits constructed in 972.783420676 sec
[2026-01-22 14:22:55.084] [trace] Batch of 2454716 bp from '/data/random_stud/queries/100_studies_100_short.fq': Query graph constructed in 1108.80454 sec, redundancy: 3.92 bp/kmer, queried in 0.10824 sec. Batch query time: 1108.91278 sec, 2213.6 bp/s
[2026-01-22 14:22:55.204] [trace] File '/data/random_stud/queries/100_studies_100_short.fq' with 2454716 base pairs was processed in 1109.070 sec, throughput: 2213.3 bp/s

After:

srv-metagraph@mex:/scratch/nvme3/tara_assemblies$ /usr/bin/time -v ./metagraph_DNA query /data/random_stud/queries/100_studies_100_short.fq --mmap -p 40 -v -i /scratch/nvme7/refseq_33m/graph_k31.indexed.dbg -a /scratch/nvme6/refseq/annotation.relaxed.relabeled.row_diff_brwt_coord.annodbg -v --batch-size 10000000000 --query-mode matches --no-coord-mapping --min-kmers-fraction-label 0 > /dev/null
[2026-01-22 14:03:08.187] [trace] Metagraph started
[2026-01-22 14:04:26.115] [trace] Parsing sequences from file '/data/random_stud/queries/100_studies_100_short.fq'
[2026-01-22 14:04:26.169] [trace] [Query graph construction] Building the batch graph...
[2026-01-22 14:04:26.432] [trace] [Query graph construction] Batch graph contains 1674535 k-mers and took 0.263120083 sec to construct
[2026-01-22 14:04:26.934] [trace] [Query graph construction] Contig extraction took 0.502187097 sec
[2026-01-22 14:04:26.934] [trace] [Query graph construction] Mapping k-mers back to full graph...
[2026-01-22 14:06:42.116] [trace] [Query graph construction] Contigs mapped to the full graph (found 625599 / 1674535 k-mers) in 135.167619026 sec
[2026-01-22 14:06:42.132] [trace] [Query graph construction] Building the query graph...
[2026-01-22 14:06:42.253] [trace] [Query graph construction] Query graph contains 625599 k-mers and took 0.120320957 sec to construct
[2026-01-22 14:06:42.253] [trace] [Query graph construction] Mapping the contigs back to the query graph...
[2026-01-22 14:06:42.327] [trace] [Query graph construction] Mapping between graphs constructed in 0.073650761 sec
[2026-01-22 14:06:42.329] [trace] [Query graph construction] Slicing 625599 rows out of full annotation...
[2026-01-22 14:10:53.824] [trace] [Query graph construction] Query annotation with 625599 rows, 85375 labels, and 34935291 set bits constructed in 251.552649344 sec
[2026-01-22 14:10:54.172] [trace] Batch of 2454716 bp from '/data/random_stud/queries/100_studies_100_short.fq': Query graph constructed in 387.66273 sec, redundancy: 3.92 bp/kmer, queried in 0.34089 sec. Batch query time: 388.00362 sec, 6326.5 bp/s
[2026-01-22 14:10:54.187] [trace] File '/data/random_stud/queries/100_studies_100_short.fq' with 2454716 base pairs was processed in 388.072 sec, throughput: 6325.4 bp/s

adamant-pwn

Thanks for the pull request! Nice to see the time improvements. Could you please help me figure out where "Parallel annotation query for annotations with coords and counts in binary mode (previously was single-threaded)" happened? I might have missed it somehow

metagraph/src/annotation/binary_matrix/base/binary_matrix.cpp

metagraph/src/common/algorithms.hpp

metagraph/src/annotation/int_matrix/base/int_matrix.cpp

metagraph/src/graph/annotated_dbg.cpp

karasikov · 2026-02-10T14:15:10Z

> Thanks for the pull request! Nice to see the time improvements. Could you please help me figure out where "Parallel annotation query for annotations with coords and counts in binary mode (previously was single-threaded)" happened? I might have missed it somehow

Yeah, it was calling a single-threaded get_rows() for all annotations with counts or coords: https://github.com/ratschlab/metagraph/pull/582/changes#diff-ef08b77b860d3d9169dfced0cca32466901b6f4dea22f6c66bc8d05d1ce97638L602-L616
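To illustrate the general idea of replacing one single-threaded row fetch with batched parallel fetches, here is a minimal sketch. It does not reproduce the change linked above: `GetRows`, `get_rows_parallel`, and `batch_size` are hypothetical, OpenMP is an assumption, and the wrapped getter is assumed to be safe to call concurrently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Row = std::vector<uint64_t>;
// Hypothetical stand-in for a single-threaded row getter.
using GetRows = std::function<std::vector<Row>(const std::vector<uint64_t> &)>;

// Split `row_ids` into batches and fetch the batches in parallel.
// Assumes `get_rows` returns one row per requested id and is thread-safe.
inline std::vector<Row> get_rows_parallel(const GetRows &get_rows,
                                          const std::vector<uint64_t> &row_ids,
                                          int num_threads,
                                          size_t batch_size = 1000) {
    std::vector<Row> result(row_ids.size());
    num_threads = std::max(num_threads, 1);
    batch_size = std::max<size_t>(batch_size, 1);
    const size_t num_batches = (row_ids.size() + batch_size - 1) / batch_size;

    #pragma omp parallel for num_threads(num_threads) schedule(dynamic)
    for (size_t b = 0; b < num_batches; ++b) {
        const size_t begin = b * batch_size;
        const size_t end = std::min(begin + batch_size, row_ids.size());
        std::vector<uint64_t> ids(row_ids.begin() + begin, row_ids.begin() + end);
        std::vector<Row> rows = get_rows(ids);
        // Each batch writes to its own slice of `result`, so no locking is needed.
        std::move(rows.begin(), rows.end(), result.begin() + begin);
    }
    return result;
}
```

Built with `-fopenmp`, the batches are distributed dynamically across threads; without it the pragma is ignored and the loop runs serially with the same result.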

adamant-pwn

See new comments for int_matrix.cpp and algorithms.hpp

adamant-pwn

LGTM

karasikov added 8 commits January 10, 2026 18:10

Don't copy labels. Stable LabelEncoder to keep consistent order of labels in results

c2d14b8

refactoring: to_vector instead of values_container()+casting

61549d2

use the same top-n sorting routine with std::nth_element + resize + std::sort

242c8c8

removed duplicated code: do counting in filter

39b0c4f

extended tests

9965094

refactoring

d37037b

refactoring: filter_and_aggregate for unified filtering, attach counts to presence masks

91cf1e3

Merge remote-tracking branch 'origin/master' into mk/anno

db40c6c

karasikov force-pushed the mk/anno branch from a26bc78 to 93b4b41 Compare January 21, 2026 11:03

accumulate_counts for adaptive match counting in filter_and_aggregate and BinaryMatrix::sum_rows

fd84233

karasikov force-pushed the mk/anno branch from 93b4b41 to fd84233 Compare January 21, 2026 11:04

karasikov added 4 commits January 21, 2026 20:58

Merge branch 'master' into mk/anno

410d8ce

construct query annotations with counts only when query-mode is counts/counts_sum/coord

6e12294

added todo

3d201dc

minor

8c8f3bc

karasikov requested a review from adamant-pwn January 22, 2026 09:12

minor

45d6b4d

adamant-pwn reviewed Feb 10, 2026

addressed review comments

228199c

karasikov requested a review from adamant-pwn February 10, 2026 15:08

fix snakemake

eb873c4

adamant-pwn reviewed Feb 10, 2026

karasikov added 2 commits February 11, 2026 10:20

revert

11e359e

Merge remote-tracking branch 'origin/master' into mk/anno

ed2f56f

karasikov requested a review from adamant-pwn February 11, 2026 09:32

adamant-pwn approved these changes Feb 11, 2026

karasikov merged commit cf7b1db into master Feb 11, 2026
53 of 57 checks passed

karasikov deleted the mk/anno branch February 11, 2026 13:50
