Conversation

@adamant-pwn (Contributor) commented Oct 6, 2025

Hi @jermp, our favorite topic of "issues that happen on small inputs" again 😄

As I investigated our errors further, I narrowed it down to the following:

  • They are only triggered on arm64 runs, not on x86_64 runs. What matters is the system actually running the binary, not the one building it: cross-compiling on arm64 and then running on x86_64 turned out fine.
  • Tests were also passing on the x86_64 version of macOS, so it is really about the architecture rather than the OS.
  • They are only triggered when running multiple threads.

The latter prompted me to also run our tests under a thread sanitizer, which detected a data race in build_sparse_index. This is because bits::compact_vector is not thread-safe on its own and relies on external synchronization. I assume read-modify-write operations like &= and |= are more likely to behave as if atomic on x86_64, while arm64 is stricter about it.

I suppose the actual issue only happens when two adjacent threads update bits at the borders of their assigned ranges, which may sometimes fall into the same 64-bit word, even though the threads are responsible for different parts of that word.

... Or if num_minimizer_positions is very small, so everything happens on just 1-2 words or so, which is likely to be the case in unit tests 🥲
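
To illustrate (a minimal standalone sketch, not the sshash code): |= on a shared 64-bit word is a plain load/or/store, so two threads writing different bits of the same word can lose each other's updates, and this is exactly what a thread sanitizer flags.

    // Two threads or bits into the *same* word, targeting *disjoint* bit
    // ranges. The non-atomic read-modify-write can drop the other thread's
    // bits; ThreadSanitizer reports a data race here.
    #include <cstdint>
    #include <iostream>
    #include <thread>

    int main() {
        uint64_t word = 0;  // stand-in for one word of a compact_vector
        auto set_range = [&word](int lo, int hi) {
            for (int i = lo; i < hi; ++i) word |= uint64_t(1) << i;  // load, or, store
        };
        std::thread a(set_range, 0, 32);   // one thread's half of the word
        std::thread b(set_range, 32, 64);  // the adjacent thread's half
        a.join();
        b.join();
        std::cout << std::hex << word << '\n';  // may differ from ffffffffffffffff
    }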

I added a mutex to safeguard this situation, and our tests seem to be passing now (well, the workflow has not fully finished yet, so there may be surprises), but this appears to be the last issue we have faced on our end.

I'm not sure how time-critical this particular loop in sshash is, so let me know if you would like to go with some other solution here.

@adamant-pwn (Contributor, Author) commented:

I added another commit in reference to what I mentioned in jermp/bits#11 (comment). It fixes the processing of kmer_t with more than 64 bits when appending guard bits to the bit_vector.

@jermp (Owner) commented Oct 6, 2025

> I suppose the actual issue only happens when two adjacent threads update bits at the borders of their assigned ranges, which may sometimes fall into the same 64-bit word, even though the threads are responsible for different parts of that word.

Yes, that's correct. The point is that, as long as each thread processes a sufficiently long range (say, spanning at least a cache line, so that there is no contention), this is perfectly safe.

> I'm not sure how time-critical this particular loop in sshash is, so let me know if you would like to go with some other solution here.

It really is, and locking each set operation is going to kill performance.

Shall we just default to a single thread if the data is tiny? And leave the current code as it was before -- without locking -- for the most general case, where each thread works independently on a large range?
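
For illustration, a minimal sketch of such a fallback (the function name and the 4096 threshold are made up for the example, not taken from sshash):

    #include <algorithm>
    #include <cstdint>

    // Drop to a single thread when the input is too small for each thread
    // to own a long, independent range.
    uint64_t choose_num_threads(uint64_t num_positions, uint64_t requested) {
        const uint64_t min_per_thread = 4096;  // assumed tuning constant
        if (num_positions < 2 * min_per_thread) return 1;  // tiny input
        return std::max<uint64_t>(
            1, std::min<uint64_t>(requested, num_positions / min_per_thread));
    }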

@jermp (Owner) commented Oct 6, 2025

Also, of course, thank you very much for investigating this on multiple architectures!

@adamant-pwn (Contributor, Author) commented:

I'm really not a fan of relying on heuristics for this, and would rather try to design the function in a way that actually guarantees the threads have independent word ranges. That said, I'm looking at the code now and I'm slightly confused: don't you essentially already go over all minimizers in a single thread when doing this?

        m_buckets.bucket_sizes.encode(iterator, num_buckets + 1,
                                      num_minimizer_positions - num_buckets);
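
For illustration, a minimal sketch of how such independent word ranges could be guaranteed, assuming elements of bit width w packed back to back (the names are made up for the example, not taken from sshash):

    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Element i occupies bits [i*w, (i+1)*w), so a split at index s shares
    // no 64-bit word across threads iff s*w is a multiple of 64.
    std::vector<uint64_t> word_aligned_splits(uint64_t n, uint64_t w, uint64_t num_threads) {
        const uint64_t step = 64 / std::gcd(w, uint64_t(64));  // smallest aligned index step
        std::vector<uint64_t> splits = {0};
        for (uint64_t t = 1; t < num_threads; ++t) {
            const uint64_t s = (n * t / num_threads) / step * step;  // round down to aligned
            if (s > splits.back()) splits.push_back(s);
        }
        splits.push_back(n);
        return splits;
    }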

@jermp (Owner) commented Oct 7, 2025

I agree in general, but this is a rather critical loop, and sacrificing performance when the issue can be fixed without locking would be a shame :(

The code you're looking at is single-threaded. The critical part comes after, when we use set for the offsets.

@adamant-pwn (Contributor, Author) commented:

> The code you're looking at is single-threaded.

Yes, that's what I'm asking about... It seems to do the same number of operations as the critical one, as both cumulatively go over all minimizer positions. Which raises the question: which specific part of the multi-threaded loop is so heavy? Writing into the compressed vector?

@jermp (Owner) commented Oct 7, 2025

Yes, setting the positions was very heavy before, and that's why I made it parallel.
"Heavy" because the positions to set were random, as assigned by the MPHF, causing basically one cache miss per set operation.

To make it parallel, I re-sorted the pairs by minimizer id (as assigned by the MPHF).

Actually, now that the ids are sorted and we set consecutive positions... it could already be fast on a single thread :)
Is that what you're asking about? Can you make this experiment?
I'll also do it after lunch :)
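
To illustrate the idea (a minimal sketch with stand-in types, not the actual sshash code):

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // pairs holds (mphf id, offset); offsets stands in for the compact vector.
    // Sorting by id turns one random (cache-missing) write per pair into a
    // near-sequential scan over the destination.
    void set_offsets_sorted(std::vector<std::pair<uint64_t, uint64_t>>& pairs,
                            std::vector<uint64_t>& offsets) {
        std::sort(pairs.begin(), pairs.end());
        for (auto const& [id, pos] : pairs) offsets[id] = pos;
    }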

@adamant-pwn (Contributor, Author) commented:

Sure, I can try. Can you give some quick instructions to benchmark this loop? Ideally, just a few bash commands I can copy-paste 🙂

@jermp (Owner) commented Oct 7, 2025

Sure, thanks!

You just need these two (download + build):

    wget https://zenodo.org/records/7239205/files/human.k31.unitigs.fa.ust.fa.gz
    ./sshash build -i human.k31.unitigs.fa.ust.fa.gz -k 31 -m 21 -t 8 -g 16 --verbose -o human.k31.sshash -d tmp_dir > build.log

assuming the codebase is compiled for k <= 31 and in release mode.

Looking at the latest benchmarks here, https://github.com/jermp/sshash/tree/bench/benchmarks: with 8 threads and under Linux, we are at around 1m 50s for a human index (assuming the file is compressed with gzip).

The line of build.log we are interested in is

computing minimizers offsets: xxxxxxx [sec]

which currently is

computing minimizers offsets: 4.9896 [sec]

@adamant-pwn (Contributor, Author) commented Oct 7, 2025

I compared running this on 32 threads with the mutex:

computing minimizers offsets: 32.3923 [sec]

and without the mutex:

computing minimizers offsets: 0.321266 [sec]

So, yeah, it really kills performance here. At the same time, just changing num_threads to 1 leads to

computing minimizers offsets: 3.03871 [sec]

Correspondingly, with 8 threads my machine processes it in

computing minimizers offsets: 0.413024 [sec]

For comparison, the single-threaded part that is already there runs within the same order of magnitude:

encoding bucket sizes: 1.17531 [sec]

@jermp (Owner) commented Oct 7, 2025

Oooook, then everything is as I expected. So, in the end, we could really just go single-threaded there :)
And perhaps even use push_back for setting the values, which might even be a little faster.

What do you think? I'll try this myself asap.

@jermp (Owner) commented Oct 7, 2025

Small edit: we can't use push_back because we might skip some positions.
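
To illustrate the caveat (a hypothetical sketch, not the bits API): push_back always writes at the next index, so skipped ids would shift every subsequent value unless the gaps are padded explicitly.

    #include <cstdint>
    #include <utility>
    #include <vector>

    // pairs must be sorted by id; out stands in for the vector being built.
    void append_with_gaps(std::vector<std::pair<uint64_t, uint64_t>> const& pairs,
                          std::vector<uint64_t>& out) {
        uint64_t next = 0;
        for (auto const& [id, pos] : pairs) {
            while (next < id) { out.push_back(0); ++next; }  // pad skipped ids
            out.push_back(pos);
            ++next;
        }
    }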

@jermp (Owner) commented Oct 7, 2025

Ok, so... on my end, using a single thread I have

computing minimizers offsets: 6.40686 [sec] 

compared to

computing minimizers offsets: 4.9896 [sec]

using multiple threads :D So I think we can forget about multi-threading here. The speed-up comes entirely from avoiding cache misses in the first place by re-sorting the tuples, as said before.

@adamant-pwn (Contributor, Author) commented Oct 7, 2025

According to my understanding of the code, using push_back here should be perfectly fine. I implemented a single-threaded version of this whole loop (see the new commit). It passes Metagraph's relevant unit tests, in both regular and canonical mode, so I believe it is functionally identical to what you had before. Performance-wise, I get

computing minimizers offsets: 2.80294 [sec]

This is a slight improvement over the previous 3.03 [sec], and now with more straightforward single-threaded code.

@jermp (Owner) commented Oct 7, 2025

Yes, I've also updated the code in the bench branch. Consider that the bench branch will eventually be merged into master; sorry for this :) Thanks for pushing this into master!

@jermp self-requested a review on October 7, 2025, at 17:12
@jermp (Owner) commented Oct 7, 2025

@adamant-pwn can you review my comments above? Thanks!

@adamant-pwn (Contributor, Author) commented:

Which comments? 🤔

If you used the GitHub review feature, you also need to publish the review; otherwise it's visible only to you.

@jermp (Owner) commented Oct 7, 2025

ahah daamn, I thought I did it. Do you see them now?

@jermp (Owner) commented Oct 7, 2025

Ah right, because comments on commits are directly visible, but comments on PRs are not. Bah!

@jermp (Owner) commented Oct 7, 2025

> According to my understanding of the code, using push_back here should be perfectly fine.

Yes, for the version in the master branch, it's absolutely fine. (I was referring to the one in the bench branch, sorry.)

@jermp merged commit 4e86487 into jermp:master on Oct 7, 2025. 9 checks passed.