@karasikov karasikov commented Jan 10, 2026

  • Don't re-encode labels in the query graph (helps when the query graph has an extremely large number of labels)
  • Parallel annotation query for annotations with coords and counts in binary mode (previously was single-threaded)
  • More efficient top-k selection (nth_element + sort instead of sort + resize)
  • More efficient hits accumulation (adaptive: a dense vector or a hash table, depending on the number of labels)
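
For illustration, the "nth_element + sort" idea for top-k selection could look roughly like the sketch below. This is not the code from the PR: the `LabelCount` type, the `top_k` name, and the tie-breaking rule are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type for this sketch: (label, number of matched k-mers).
using LabelCount = std::pair<std::string, uint64_t>;

// Keep the k labels with the highest counts, sorted by decreasing count.
// Ties are broken by label name so that the result is deterministic.
std::vector<LabelCount> top_k(std::vector<LabelCount> hits, size_t k) {
    auto better = [](const LabelCount &a, const LabelCount &b) {
        return a.second != b.second ? a.second > b.second : a.first < b.first;
    };
    if (k < hits.size()) {
        // Linear time on average: afterwards the first k elements are
        // the k best ones, in unspecified order.
        std::nth_element(hits.begin(), hits.begin() + k, hits.end(), better);
        hits.resize(k);
    }
    // Only the (at most k) survivors are sorted: O(k log k).
    std::sort(hits.begin(), hits.end(), better);
    return hits;
}
```

Compared with sorting all n hits and then truncating, this does O(n + k log k) work instead of O(n log n), which matters when n is in the millions and k is small (e.g. `--num-top-labels 50`).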

Status of parallelization for different query setups with batch query:

| Annotation type \ Query type | matches | counts | coords |
|---|---|---|---|
| basic | | na | na |
| with counts | ❌ -> ✅ | ❌ (TODO, another PR) | na |
| with coords | ❌ -> ✅ | ❌ (TODO, another PR) | - (always non-batch) |

Examples

1 - 1.8x faster querying tara_assemblies

(Thanks to adaptive count accumulation, not re-encoding the labels, and faster top-k selection.)
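
A minimal sketch of what "adaptive" accumulation might mean: use a dense counter array when the label space is small, and fall back to a hash table when it is huge (as with the 318205057 labels in the "After" log below). The class name, its interface, and the `dense_limit` threshold are all hypothetical and not taken from the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical accumulator of per-label hit counts.
class HitAccumulator {
  public:
    // dense_limit is an illustrative threshold, not a value from the PR.
    explicit HitAccumulator(size_t num_labels, size_t dense_limit = 1 << 20)
        : use_dense_(num_labels <= dense_limit),
          dense_(use_dense_ ? num_labels : 0) {}

    void add(uint64_t label, uint32_t count = 1) {
        if (use_dense_) {
            dense_[label] += count;   // O(1), cache-friendly for small label sets
        } else {
            sparse_[label] += count;  // memory proportional to the labels actually hit
        }
    }

    bool is_dense() const { return use_dense_; }

    // Non-zero (label, count) pairs, sorted by label.
    std::vector<std::pair<uint64_t, uint32_t>> nonzero() const {
        std::vector<std::pair<uint64_t, uint32_t>> out;
        if (use_dense_) {
            for (uint64_t l = 0; l < dense_.size(); ++l) {
                if (dense_[l])
                    out.emplace_back(l, dense_[l]);
            }
        } else {
            out.assign(sparse_.begin(), sparse_.end());
            std::sort(out.begin(), out.end());
        }
        return out;
    }

  private:
    bool use_dense_;
    std::vector<uint32_t> dense_;
    std::unordered_map<uint64_t, uint32_t> sparse_;
};
```

The point of the switch is that a dense array over hundreds of millions of labels would be mostly zeros and costly to allocate and scan per query, while a hash table is needlessly slow when there are only a few thousand labels.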

Logs

Before:

srv-metagraph@mex:/scratch/nvme3/tara_assemblies$ /usr/bin/time -v ./metagraph_DNA query /data/random_stud/queries/100_studies_50k_short.fq --mmap -p 40 -v -i graph_small.indexed.dbg -a annotation.v4.row_diff_sparse.annodbg -v --batch-size 10000000000 --query-mode matches --num-top-labels 50 --min-kmers-fraction-label 0 > /dev/null
[2026-01-22 10:23:50.295] [trace] Metagraph started
[2026-01-22 10:24:37.251] [trace] Parsing sequences from file '/data/random_stud/queries/100_studies_50k_short.fq'
[2026-01-22 10:24:45.545] [trace] [Query graph construction] Building the batch graph...
[2026-01-22 10:28:03.930] [trace] [Query graph construction] Batch graph contains 568137958 k-mers and took 198.384456538 sec to construct
[2026-01-22 10:34:33.144] [trace] [Query graph construction] Contig extraction took 389.214268666 sec
[2026-01-22 10:34:33.144] [trace] [Query graph construction] Mapping k-mers back to full graph...
[2026-01-22 10:38:44.811] [trace] [Query graph construction] Contigs mapped to the full graph (found 1721062 / 568137958 k-mers) in 251.666932983 sec
[2026-01-22 10:38:46.510] [trace] [Query graph construction] Building the query graph...
[2026-01-22 10:38:52.060] [trace] [Query graph construction] Query graph contains 1721062 k-mers and took 5.549840243 sec to construct
[2026-01-22 10:38:52.060] [trace] [Query graph construction] Mapping the contigs back to the query graph...
[2026-01-22 10:39:00.283] [trace] [Query graph construction] Mapping between graphs constructed in 8.223766384 sec
[2026-01-22 10:39:01.438] [trace] [Query graph construction] Slicing 1721062 rows out of full annotation...
[2026-01-22 10:39:12.345] [trace] [Query graph construction] Query annotation with 1721062 rows, 1835925 labels, and 34915572 set bits constructed in 20.284933866 sec
[2026-01-22 10:50:04.125] [trace] Batch of 1226845631 bp from '/data/random_stud/queries/100_studies_50k_short.fq': Query graph constructed in 866.80252 sec, redundancy: 712.84 bp/kmer, queried in 651.77136 sec. Batch query time: 1518.57388 sec, 807893.3 bp/s

After:

srv-metagraph@mex:/scratch/nvme3/tara_assemblies$ /usr/bin/time -v ./metagraph_DNA query /data/random_stud/queries/100_studies_50k_short.fq --mmap -p 40 -v -i graph_small.indexed.dbg -a annotation.v4.row_diff_sparse.annodbg -v --batch-size 10000000000 --query-mode matches --num-top-labels 50 --min-kmers-fraction-label 0 > /dev/null
[2026-01-22 10:06:10.704] [trace] Metagraph started
[2026-01-22 10:07:14.529] [trace] Parsing sequences from file '/data/random_stud/queries/100_studies_50k_short.fq'
[2026-01-22 10:07:22.164] [trace] [Query graph construction] Building the batch graph...
[2026-01-22 10:10:10.783] [trace] [Query graph construction] Batch graph contains 568137958 k-mers and took 168.620122111 sec to construct
[2026-01-22 10:16:13.550] [trace] [Query graph construction] Contig extraction took 362.765982662 sec
[2026-01-22 10:16:13.550] [trace] [Query graph construction] Mapping k-mers back to full graph...
[2026-01-22 10:20:29.169] [trace] [Query graph construction] Contigs mapped to the full graph (found 1721062 / 568137958 k-mers) in 255.618956579 sec
[2026-01-22 10:20:30.815] [trace] [Query graph construction] Building the query graph...
[2026-01-22 10:20:34.726] [trace] [Query graph construction] Query graph contains 1721062 k-mers and took 3.910616207 sec to construct
[2026-01-22 10:20:34.726] [trace] [Query graph construction] Mapping the contigs back to the query graph...
[2026-01-22 10:20:41.715] [trace] [Query graph construction] Mapping between graphs constructed in 6.989409407 sec
[2026-01-22 10:20:42.916] [trace] [Query graph construction] Slicing 1721062 rows out of full annotation...
[2026-01-22 10:20:57.171] [trace] [Query graph construction] Query annotation with 1721062 rows, 318205057 labels, and 34915572 set bits constructed in 22.442328573 sec
[2026-01-22 10:21:09.616] [trace] Batch of 1226845631 bp from '/data/random_stud/queries/100_studies_50k_short.fq': Query graph constructed in 815.01424 sec, redundancy: 712.84 bp/kmer, queried in 12.43879 sec. Batch query time: 827.45303 sec, 1482677.1 bp/s

2 - 2.9x faster querying (--query-mode matches) of RefSeq with coords

(Thanks to the parallelization.)

Logs

Before:

srv-metagraph@mex:/scratch/nvme7/refseq_33m$ /usr/bin/time -v ./metagraph_DNA query /data/random_stud/queries/100_studies_100_short.fq --mmap -p 40 -v -i /scratch/nvme7/refseq_33m/graph_k31.indexed.dbg -a /scratch/nvme6/refseq/annotation.relaxed.relabeled.row_diff_brwt_coord.annodbg -v --batch-size 10000000000 --query-mode matches --no-coord-mapping --min-kmers-fraction-label 0 > /dev/null
[2026-01-22 14:03:19.327] [trace] Metagraph started
[2026-01-22 14:04:26.137] [trace] Parsing sequences from file '/data/random_stud/queries/100_studies_100_short.fq'
[2026-01-22 14:04:26.172] [trace] [Query graph construction] Building the batch graph...
[2026-01-22 14:04:26.435] [trace] [Query graph construction] Batch graph contains 1674535 k-mers and took 0.263780349 sec to construct
[2026-01-22 14:04:26.932] [trace] [Query graph construction] Contig extraction took 0.49652752 sec
[2026-01-22 14:04:26.932] [trace] [Query graph construction] Mapping k-mers back to full graph...
[2026-01-22 14:06:42.105] [trace] [Query graph construction] Contigs mapped to the full graph (found 625599 / 1674535 k-mers) in 135.166176331 sec
[2026-01-22 14:06:42.113] [trace] [Query graph construction] Building the query graph...
[2026-01-22 14:06:42.186] [trace] [Query graph construction] Query graph contains 625599 k-mers and took 0.07354974 sec to construct
[2026-01-22 14:06:42.186] [trace] [Query graph construction] Mapping the contigs back to the query graph...
[2026-01-22 14:06:42.269] [trace] [Query graph construction] Mapping between graphs constructed in 0.082363831 sec
[2026-01-22 14:06:42.271] [trace] [Query graph construction] Slicing 625599 rows out of full annotation...
[2026-01-22 14:22:54.974] [trace] [Query graph construction] Query annotation with 625599 rows, 71758 labels, and 34935291 set bits constructed in 972.783420676 sec
[2026-01-22 14:22:55.084] [trace] Batch of 2454716 bp from '/data/random_stud/queries/100_studies_100_short.fq': Query graph constructed in 1108.80454 sec, redundancy: 3.92 bp/kmer, queried in 0.10824 sec. Batch query time: 1108.91278 sec, 2213.6 bp/s
[2026-01-22 14:22:55.204] [trace] File '/data/random_stud/queries/100_studies_100_short.fq' with 2454716 base pairs was processed in 1109.070 sec, throughput: 2213.3 bp/s

After:

srv-metagraph@mex:/scratch/nvme3/tara_assemblies$ /usr/bin/time -v ./metagraph_DNA query /data/random_stud/queries/100_studies_100_short.fq --mmap -p 40 -v -i /scratch/nvme7/refseq_33m/graph_k31.indexed.dbg -a /scratch/nvme6/refseq/annotation.relaxed.relabeled.row_diff_brwt_coord.annodbg -v --batch-size 10000000000 --query-mode matches --no-coord-mapping --min-kmers-fraction-label 0 > /dev/null
[2026-01-22 14:03:08.187] [trace] Metagraph started
[2026-01-22 14:04:26.115] [trace] Parsing sequences from file '/data/random_stud/queries/100_studies_100_short.fq'
[2026-01-22 14:04:26.169] [trace] [Query graph construction] Building the batch graph...
[2026-01-22 14:04:26.432] [trace] [Query graph construction] Batch graph contains 1674535 k-mers and took 0.263120083 sec to construct
[2026-01-22 14:04:26.934] [trace] [Query graph construction] Contig extraction took 0.502187097 sec
[2026-01-22 14:04:26.934] [trace] [Query graph construction] Mapping k-mers back to full graph...
[2026-01-22 14:06:42.116] [trace] [Query graph construction] Contigs mapped to the full graph (found 625599 / 1674535 k-mers) in 135.167619026 sec
[2026-01-22 14:06:42.132] [trace] [Query graph construction] Building the query graph...
[2026-01-22 14:06:42.253] [trace] [Query graph construction] Query graph contains 625599 k-mers and took 0.120320957 sec to construct
[2026-01-22 14:06:42.253] [trace] [Query graph construction] Mapping the contigs back to the query graph...
[2026-01-22 14:06:42.327] [trace] [Query graph construction] Mapping between graphs constructed in 0.073650761 sec
[2026-01-22 14:06:42.329] [trace] [Query graph construction] Slicing 625599 rows out of full annotation...
[2026-01-22 14:10:53.824] [trace] [Query graph construction] Query annotation with 625599 rows, 85375 labels, and 34935291 set bits constructed in 251.552649344 sec
[2026-01-22 14:10:54.172] [trace] Batch of 2454716 bp from '/data/random_stud/queries/100_studies_100_short.fq': Query graph constructed in 387.66273 sec, redundancy: 3.92 bp/kmer, queried in 0.34089 sec. Batch query time: 388.00362 sec, 6326.5 bp/s
[2026-01-22 14:10:54.187] [trace] File '/data/random_stud/queries/100_studies_100_short.fq' with 2454716 base pairs was processed in 388.072 sec, throughput: 6325.4 bp/s

@karasikov karasikov requested a review from adamant-pwn January 22, 2026 09:12
@adamant-pwn adamant-pwn left a comment

Thanks for the pull request! Nice to see the time improvements. Could you please help me figure out where "Parallel annotation query for annotations with coords and counts in binary mode (previously was single-threaded)" happened? I might have missed it somehow

@karasikov (Member, Author):

> Thanks for the pull request! Nice to see the time improvements. Could you please help me figure out where "Parallel annotation query for annotations with coords and counts in binary mode (previously was single-threaded)" happened? I might have missed it somehow

Yeah, it was calling a single-threaded get_rows() for all annotations with counts or coords: https://github.com/ratschlab/metagraph/pull/582/changes#diff-ef08b77b860d3d9169dfced0cca32466901b6f4dea22f6c66bc8d05d1ce97638L602-L616
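
To make the idea concrete, here is a rough sketch of how one single-threaded batch lookup can be turned into a chunked parallel one. It is not the code from this PR: `Row`, `GetRows`, `get_rows_parallel`, and the chunk size are hypothetical, and the sketch assumes the underlying lookup is safe to call concurrently from several threads.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical types for this sketch.
using Row = std::vector<uint64_t>;  // labels set in one annotation row
// Stand-in for a single-threaded batch lookup such as get_rows().
using GetRows = std::function<std::vector<Row>(const std::vector<uint64_t> &)>;

// Split the row ids into chunks and query the chunks in parallel.
// Assumes get_rows may be called concurrently (an assumption of this sketch).
std::vector<Row> get_rows_parallel(const GetRows &get_rows,
                                   const std::vector<uint64_t> &row_ids,
                                   int num_threads,
                                   size_t chunk_size = 1000) {
    std::vector<Row> result(row_ids.size());
    const size_t n = row_ids.size();
    if (chunk_size == 0)
        chunk_size = 1;
    if (num_threads < 1)
        num_threads = 1;

    // Without OpenMP enabled the pragma is ignored and the loop runs serially.
    #pragma omp parallel for num_threads(num_threads) schedule(dynamic)
    for (size_t begin = 0; begin < n; begin += chunk_size) {
        const size_t end = std::min(begin + chunk_size, n);
        std::vector<uint64_t> chunk(row_ids.begin() + begin, row_ids.begin() + end);
        std::vector<Row> rows = get_rows(chunk);
        // Each chunk writes to its own disjoint slice, so no locking is needed.
        std::move(rows.begin(), rows.end(), result.begin() + begin);
    }
    return result;
}
```

Dynamic scheduling is used here because the cost of decoding a row can vary a lot between rows, so fixed equal-sized partitions per thread could leave some threads idle.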

@adamant-pwn adamant-pwn left a comment

See new comments for int_matrix.cpp and algorithms.hpp

@adamant-pwn adamant-pwn left a comment

LGTM

@karasikov karasikov merged commit cf7b1db into master Feb 11, 2026
53 of 57 checks passed
@karasikov karasikov deleted the mk/anno branch February 11, 2026 13:50