Safe-guard offsets_builder.set with mutex #83
Conversation
I added another commit in reference to what I mentioned in jermp/bits#11 (comment). It fixes processing of …
Yes, that's correct. The point is that, as long as threads process a sufficiently long range (say, spanning at least a cache line, so that there is no contention), then this is perfectly safe.
It really is, and locking each … Shall we just default to a single thread if the data is tiny? And leave the current code as it was before -- without locking -- for the most general case where each thread works independently on a large range?
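As an alternative to a size heuristic, here is a minimal sketch (a hypothetical helper, not the actual sshash/bits API) of how per-thread ranges could be snapped to 64-bit word boundaries of the packed array, so that two threads can never read-modify-write the same word:

```cpp
#include <cstdint>
#include <numeric>  // std::lcm
#include <utility>
#include <vector>

// Hypothetical helper (not the sshash/bits API): split n entries of a
// width-bit packed array, stored in 64-bit words, into per-thread ranges
// whose cuts fall exactly on word boundaries. Entry i occupies bits
// [i*width, (i+1)*width), so a cut at entry e is word-aligned iff
// e*width % 64 == 0, i.e. e is a multiple of lcm(64, width) / width.
std::vector<std::pair<uint64_t, uint64_t>>  // [begin, end) entry ranges
word_aligned_chunks(uint64_t n, uint64_t width, uint64_t num_threads) {
    const uint64_t align = std::lcm(uint64_t(64), width) / width;
    std::vector<std::pair<uint64_t, uint64_t>> chunks;
    uint64_t begin = 0;
    for (uint64_t t = 0; t != num_threads && begin < n; ++t) {
        uint64_t end = begin + (n - begin) / (num_threads - t);  // even split...
        end = (end + align - 1) / align * align;                 // ...snapped up to a word boundary
        if (t + 1 == num_threads || end > n) end = n;            // last chunk takes the rest
        if (end > begin) chunks.emplace_back(begin, end);
        begin = end;
    }
    return chunks;
}
```

Each thread would then call `set` only on entries inside its own chunk, so no locking is needed even for tiny inputs: small inputs simply produce fewer chunks.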
Also, of course, thank you very much for investigating this on multiple architectures!
I'm really not a fan of relying on heuristics for this, and would rather try to design the function in a way that actually guarantees the threads have independent word ranges. That said, I'm looking at the code now and it makes me slightly confused: don't you essentially already go over all minimizers in a single thread when doing this?

```cpp
m_buckets.bucket_sizes.encode(iterator, num_buckets + 1,
                              num_minimizer_positions - num_buckets);
```
I agree in general, but this is a rather critical loop, and sacrificing performance when the issue can be fixed without locking would be a shame :( The code you're looking at is single-threaded. The critical one comes after, when we use …
Yes, that's what I'm asking about... It seems to be doing the same number of operations as the critical one, as both cumulatively go over all minimizer positions. Which poses the question: which specific part of the multi-threaded loop is so heavy? Writing stuff into the compressed vector?
Yes, setting the positions was very heavy before, and that's why I've made it parallel. To make it parallel, I've re-sorted the pairs by minimizer id (as assigned by the MPHF). Actually, now that the pairs are sorted by id and we set consecutive positions... it could already be fast on a single thread :)
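For illustration only, a sketch of that idea under assumed names (`Pair`, `PackedVector`, and `positions.set` are placeholders, not the actual sshash types): once the (minimizer id, position) pairs are re-sorted by id, the writes target strictly increasing slots of the packed vector, which is the access pattern that stays cache-friendly even on a single thread.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Pair {
    uint64_t minimizer_id;  // id assigned by the MPHF
    uint64_t position;      // value to store at slot minimizer_id
};

// PackedVector is a stand-in for a compact vector exposing set(index, value).
template <typename PackedVector>
void fill_positions(std::vector<Pair>& pairs, PackedVector& positions) {
    // Re-sort the pairs by minimizer id, so the target slots are visited in
    // increasing order...
    std::sort(pairs.begin(), pairs.end(),
              [](Pair const& a, Pair const& b) { return a.minimizer_id < b.minimizer_id; });

    // ...so the loop below writes consecutive slots, i.e. consecutive words and
    // cache lines of the packed vector, instead of jumping around in memory.
    for (auto const& p : pairs) positions.set(p.minimizer_id, p.position);
}
```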
Sure, I can try. Can you give some quick instructions to benchmark this loop? Ideally, just a few bash commands I can copy-paste 🙂
Sure, thanks! You just need these two (download + build):

```bash
wget https://zenodo.org/records/7239205/files/human.k31.unitigs.fa.ust.fa.gz
./sshash build -i human.k31.unitigs.fa.ust.fa.gz -k 31 -m 21 -t 8 -g 16 --verbose -o human.k31.sshash -d tmp_dir > build.log
```

assuming the codebase is compiled for k <= 31 and in release mode. Looking at the latest benchmarks (https://github.com/jermp/sshash/tree/bench/benchmarks), with 8 threads and under Linux we are around 1m 50s for a human index (assuming the file is compressed with gzip). The line we are interested in is the one in the …, which currently is …
I compared running this on 32 threads with mutex: … and without mutex: … So, yeah, it really kills performance here. At the same time, just changing … Correspondingly, with 8 threads my machine processes it in … For comparison, the single-threaded part that is already there runs on the same order of magnitude: …
Oooook, then everything is as I expected. So, in the end, we could really just go single-threaded there :) What do you think? I'll try this myself asap.
Small edit: we can't use …
Ok, so... on my end, using a single thread I have … compared to … using multiple threads :D So I think we can forget about multi-threading here. The speed-up entirely comes from avoiding cache misses in the first place by re-sorting the tuples, as said before.
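To make the cache-miss point concrete, here is a small self-contained toy (not sshash code): it performs the exact same number of writes twice, once through a sorted index array and once through a shuffled one. On arrays much larger than the cache, the scattered version is typically several times slower, which is the gap the re-sorting removes.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const uint64_t n = 1ull << 25;  // ~33M entries, well beyond typical caches
    std::vector<uint64_t> out(n);
    std::vector<uint64_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);  // idx starts out sorted

    auto run = [&](char const* name) {
        auto t0 = std::chrono::steady_clock::now();
        for (uint64_t i = 0; i != n; ++i) out[idx[i]] = i;  // write through idx
        auto t1 = std::chrono::steady_clock::now();
        std::cout << name << ": "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
                  << " ms\n";
    };

    run("sequential writes");  // idx sorted: consecutive cache lines
    std::shuffle(idx.begin(), idx.end(), std::mt19937_64(0));
    run("scattered writes");   // same work, random targets
    return 0;
}
```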
According to my understanding of the code, using … Which is a slight improvement over the previous …
Yes, I've also updated the code in the …
@adamant-pwn can you review my comments above? Thanks!
Which comments? 🤔 If you used the GitHub review feature, you also need to publish it; otherwise it's only visible to you.
ahah daamn, I thought I did it. Do you see them now?
Ah right, because comments on commits are directly visible, but comments on PRs are not. Bah!
Yes, for the version in the master branch, it's absolutely fine. (I was referring to the one in the …)
Hi @jermp, our favorite topic of "issues that happen on small inputs" again 😄
As I investigated our errors further, I narrowed it down to the following:
- The issue shows up on `arm64` runs, but not on `x86_64` runs. This means it's the system actually running the binary that matters, not the one building it: cross-compiling on `arm64` and then running on `x86_64` turned out fine.
- It also doesn't reproduce with the `x86_64` version of MacOS, so it's really about architecture, rather than OS.

The latter prompted me to also run our tests under thread sanitizers, and they detected a race condition in `build_sparse_index`. This is because `bits::compact_vector` is on its own not thread-safe and relies on external management of this stuff. I assume operations like `&=` and `|=` are more likely to behave as atomic on `x86_64`, while `arm64` is more strict about it.

I suppose the actual issue only happens when two adjacent threads update bits on the borders of their assigned ranges, and it may sometimes fall into the same 64-bit word, despite the threads being responsible for different parts of the word.
... Or if `num_minimizer_positions` is very small, so everything happens on just 1-2 words or so, which is likely to be the case in unit tests 🥲

I added a mutex to safe-guard this situation, and our tests seem to be passing now (well, the workflow is not fully finished yet, so there may be surprises), but this actually appears to be the latest issue we faced on our end.
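As a hedged illustration of that failure mode (a simplified width-bit packed vector, not the actual `bits::compact_vector` implementation): `set` is a plain read-modify-write of one or, when an entry straddles a boundary, two 64-bit words, so when the last entry of one thread's range and the first entry of the next thread's range land in the same word, the two unsynchronized updates can overwrite each other's bits. Serializing the calls with a mutex, as this PR does, removes the race at the cost of contention.

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Simplified width-bit packed vector, for illustration only.
struct PackedVector {
    uint64_t width;               // bits per entry, 1..64
    std::vector<uint64_t> words;  // backing storage

    // Unsynchronized read-modify-write: unsafe if two threads hit the same word.
    void set(uint64_t i, uint64_t value) {
        uint64_t pos = i * width, w = pos / 64, off = pos % 64;
        uint64_t mask = (width == 64) ? ~0ull : ((1ull << width) - 1);
        words[w] = (words[w] & ~(mask << off)) | (value << off);
        if (off + width > 64) {  // entry straddles two words
            uint64_t spill = off + width - 64;  // bits that belong to the next word
            words[w + 1] = (words[w + 1] & ~(mask >> (width - spill))) |
                           (value >> (width - spill));
        }
    }
};

// The safeguard used in this PR, conceptually: serialize all set() calls.
std::mutex set_mutex;
void safe_set(PackedVector& v, uint64_t i, uint64_t value) {
    std::lock_guard<std::mutex> lock(set_mutex);
    v.set(i, value);
}
```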
I'm not sure how time-critical this particular loop in `sshash` is, so let me know if you would prefer some other solution here.