Conversation

@adamant-pwn (Contributor) commented Oct 6, 2025

Hi @jermp, our favorite topic of "issues that happen on small inputs" again 😄

As I investigated our errors further, I narrowed it down to the following:

  • They are only triggered on arm64 runs, not on x86_64 runs. What matters is the system actually running the binary, not the one building it: cross-compiling on arm64 and then running on x86_64 turned out fine.
  • Tests were also passing on the x86_64 version of macOS, so it is really about the architecture rather than the OS.
  • They are only triggered when running multiple threads.

The latter prompted me to also run our tests under a thread sanitizer, which detected a data race in build_sparse_index. This is because bits::compact_vector is not thread-safe on its own and relies on external synchronization. I assume read-modify-write operations like &= and |= are more likely to behave as if atomic on x86_64, while arm64 is stricter about it.

I suppose the actual issue only happens when two adjacent threads update bits at the borders of their assigned ranges, which may sometimes fall into the same 64-bit word, even though the threads are responsible for different parts of that word.

... Or if num_minimizer_positions is very small, so everything happens on just 1-2 words or so, which is likely to be the case in unit tests 🥲
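
To illustrate (a minimal standalone sketch, not the sshash code): |= on a shared 64-bit word is a plain load/or/store, so two threads writing different bits of the same word can lose each other's updates, and this is exactly what a thread sanitizer flags.

    // Two threads or bits into the *same* word, targeting *disjoint* bit
    // ranges. The non-atomic read-modify-write can drop the other thread's
    // bits; ThreadSanitizer reports a data race here.
    #include <cstdint>
    #include <iostream>
    #include <thread>

    int main() {
        uint64_t word = 0;  // stand-in for one word of a compact_vector
        auto set_range = [&word](int lo, int hi) {
            for (int i = lo; i < hi; ++i) word |= uint64_t(1) << i;  // load, or, store
        };
        std::thread a(set_range, 0, 32);   // one thread's half of the word
        std::thread b(set_range, 32, 64);  // the adjacent thread's half
        a.join();
        b.join();
        std::cout << std::hex << word << '\n';  // may differ from ffffffffffffffff
    }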

I added a mutex to safeguard this situation, and our tests seem to be passing now (well, the workflow has not fully finished yet, so there may be surprises), but this appears to be the last issue we have faced on our end.

I'm not sure how time-critical this particular loop in sshash is, so let me know if you would like to go with some other solution here.

@adamant-pwn (Contributor, Author) commented:

I added another commit in reference to what I mentioned in jermp/bits#11 (comment). It fixes the processing of kmer_t with more than 64 bits when appending guard bits to the bit_vector.

@jermp (Owner) commented Oct 6, 2025

> I suppose the actual issue only happens when two adjacent threads update bits at the borders of their assigned ranges, which may sometimes fall into the same 64-bit word, even though the threads are responsible for different parts of that word.

Yes, that's correct. The point is that, as long as each thread processes a sufficiently long range (say, spanning at least a cache line, so that there is no contention), this is perfectly safe.

> I'm not sure how time-critical this particular loop in sshash is, so let me know if you would like to go with some other solution here.

It really is, and locking each set operation is going to kill performance.

Shall we just default to a single thread if the data is tiny? And leave the current code as it was before -- without locking -- for the most general case, where each thread works independently on a large range?
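
For illustration, a minimal sketch of such a fallback (the function name and the 4096 threshold are made up for the example, not taken from sshash):

    #include <algorithm>
    #include <cstdint>

    // Drop to a single thread when the input is too small for each thread
    // to own a long, independent range.
    uint64_t choose_num_threads(uint64_t num_positions, uint64_t requested) {
        const uint64_t min_per_thread = 4096;  // assumed tuning constant
        if (num_positions < 2 * min_per_thread) return 1;  // tiny input
        return std::max<uint64_t>(
            1, std::min<uint64_t>(requested, num_positions / min_per_thread));
    }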

@jermp (Owner) commented Oct 6, 2025

Also, of course, thank you very much for investigating this on multiple architectures!

@adamant-pwn (Contributor, Author) commented:

I'm really not a fan of relying on heuristics for this, and would rather try to design the function in a way that actually guarantees the threads have independent word ranges. That said, I'm looking at the code now and I'm slightly confused: don't you essentially already go over all minimizers in a single thread when doing this?

        m_buckets.bucket_sizes.encode(iterator, num_buckets + 1,
                                      num_minimizer_positions - num_buckets);
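
For illustration, a minimal sketch of how such independent word ranges could be guaranteed, assuming elements of bit width w packed back to back (the names are made up for the example, not taken from sshash):

    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Element i occupies bits [i*w, (i+1)*w), so a split at index s shares
    // no 64-bit word across threads iff s*w is a multiple of 64.
    std::vector<uint64_t> word_aligned_splits(uint64_t n, uint64_t w, uint64_t num_threads) {
        const uint64_t step = 64 / std::gcd(w, uint64_t(64));  // smallest aligned index step
        std::vector<uint64_t> splits = {0};
        for (uint64_t t = 1; t < num_threads; ++t) {
            const uint64_t s = (n * t / num_threads) / step * step;  // round down to aligned
            if (s > splits.back()) splits.push_back(s);
        }
        splits.push_back(n);
        return splits;
    }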

@jermp (Owner) commented Oct 7, 2025

I agree in general, but this is a rather critical loop, and sacrificing performance when the issue can be fixed without locking would be a shame :(

The code you're looking at is single-threaded. The critical part comes after, when we use set for the offsets.

@adamant-pwn (Contributor, Author) commented:

> The code you're looking at is single-threaded.

Yes, that's what I'm asking about... It seems to do the same number of operations as the critical one, as both cumulatively go over all minimizer positions. Which raises the question: which specific part of the multi-threaded loop is so heavy? Writing into the compressed vector?

@jermp (Owner) commented Oct 7, 2025

Yes, setting the positions was very heavy before, and that's why I made it parallel.
"Heavy" because the positions to set were random, as assigned by the MPHF, causing basically one cache miss per set operation.

To make it parallel, I re-sorted the pairs by minimizer id (as assigned by the MPHF).

Actually, now that the ids are sorted and we set consecutive positions... it could already be fast on a single thread :)
Is that what you're asking about? Can you make this experiment?
I'll also do it after lunch :)
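
To illustrate the idea (a minimal sketch with stand-in types, not the actual sshash code):

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // pairs holds (mphf id, offset); offsets stands in for the compact vector.
    // Sorting by id turns one random (cache-missing) write per pair into a
    // near-sequential scan over the destination.
    void set_offsets_sorted(std::vector<std::pair<uint64_t, uint64_t>>& pairs,
                            std::vector<uint64_t>& offsets) {
        std::sort(pairs.begin(), pairs.end());
        for (auto const& [id, pos] : pairs) offsets[id] = pos;
    }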

@adamant-pwn (Contributor, Author) commented:

Sure, I can try. Can you give some quick instructions to benchmark this loop? Ideally, just a few bash commands I can copy-paste 🙂

@jermp (Owner) commented Oct 7, 2025

Sure, thanks!

You just need these two (download + build):

    wget https://zenodo.org/records/7239205/files/human.k31.unitigs.fa.ust.fa.gz
    ./sshash build -i human.k31.unitigs.fa.ust.fa.gz -k 31 -m 21 -t 8 -g 16 --verbose -o human.k31.sshash -d tmp_dir > build.log

assuming the codebase is compiled for k <= 31 and in release mode.

Looking at the latest benchmarks here, https://github.com/jermp/sshash/tree/bench/benchmarks: with 8 threads and under Linux, we are at around 1m 50s for a human index (assuming the file is compressed with gzip).

The line of build.log we are interested in is

computing minimizers offsets: xxxxxxx [sec]

which currently is

computing minimizers offsets: 4.9896 [sec]

@adamant-pwn (Contributor, Author) commented Oct 7, 2025

I compared running this on 32 threads with the mutex:

computing minimizers offsets: 32.3923 [sec]

and without the mutex:

computing minimizers offsets: 0.321266 [sec]

So, yeah, it really kills performance here. At the same time, just changing num_threads to 1 leads to

computing minimizers offsets: 3.03871 [sec]

Correspondingly, with 8 threads my machine processes it in

computing minimizers offsets: 0.413024 [sec]

For comparison, the single-threaded part that is already there runs within the same order of magnitude:

encoding bucket sizes: 1.17531 [sec]

@jermp (Owner) commented Oct 7, 2025

Oooook, then everything is as I expected. So, in the end, we could really just go single-threaded there :)
And perhaps even use push_back for setting the values, which might even be a little faster.

What do you think? I'll try this myself asap.

@jermp (Owner) commented Oct 7, 2025

Small edit: we can't use push_back because we might skip some positions.
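
To illustrate the caveat (a hypothetical sketch, not the bits API): push_back always writes at the next index, so skipped ids would shift every subsequent value unless the gaps are padded explicitly.

    #include <cstdint>
    #include <utility>
    #include <vector>

    // pairs must be sorted by id; out stands in for the vector being built.
    void append_with_gaps(std::vector<std::pair<uint64_t, uint64_t>> const& pairs,
                          std::vector<uint64_t>& out) {
        uint64_t next = 0;
        for (auto const& [id, pos] : pairs) {
            while (next < id) { out.push_back(0); ++next; }  // pad skipped ids
            out.push_back(pos);
            ++next;
        }
    }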

@jermp (Owner) commented Oct 7, 2025

Ok, so... on my end, using a single thread I have

computing minimizers offsets: 6.40686 [sec] 

compared to

computing minimizers offsets: 4.9896 [sec]

using multiple threads :D So I think we can forget about multi-threading here. The speed-up comes entirely from avoiding cache misses in the first place by re-sorting the tuples, as said before.

@adamant-pwn (Contributor, Author) commented Oct 7, 2025

According to my understanding of the code, using push_back here should be perfectly fine. I implemented a single-threaded version of this whole loop (see the new commit). It passes Metagraph's relevant unit tests, in both regular and canonical mode, so I believe it is functionally identical to what you had before. Performance-wise, I get

computing minimizers offsets: 2.80294 [sec]

This is a slight improvement over the previous 3.03 [sec], and now with more straightforward single-threaded code.

@jermp (Owner) commented Oct 7, 2025

Yes, I've also updated the code in the bench branch. Consider that the bench branch will eventually be merged into master; sorry for this :) Thanks for pushing this into master!

@jermp self-requested a review on October 7, 2025, at 17:12
@jermp (Owner) commented Oct 7, 2025

@adamant-pwn can you review my comments above? Thanks!

@adamant-pwn (Contributor, Author) commented:

Which comments? 🤔

If you used the GitHub review feature, you also need to publish the review; otherwise it's visible only to you.

@jermp (Owner) commented Oct 7, 2025

ahah daamn, I thought I did it. Do you see them now?

@jermp (Owner) commented Oct 7, 2025

Ah right, because comments on commits are directly visible, but comments on PRs are not. Bah!

@jermp (Owner) commented Oct 7, 2025

> According to my understanding of the code, using push_back here should be perfectly fine.

Yes, for the version in the master branch, it's absolutely fine. (I was referring to the one in the bench branch, sorry.)

@jermp merged commit 4e86487 into jermp:master on Oct 7, 2025. 9 checks passed.