Add NCCL backend #55
base: main
Conversation
Mostly just copies the MPI implementation to a new directory; not functional yet.
```diff
 template<typename T>
-recv_request recv(message_buffer<T>& msg, rank_type src, tag_type tag)
+recv_request recv(message_buffer<T>& msg, rank_type src, tag_type tag, void* stream = nullptr)
```
Is this a good API?
This means that for NCCL the default stream is used if nothing is specified (a stream is always required for NCCL). For other backends the stream is ignored.
```diff
 template<typename T, typename CallBack>
-recv_request recv(message_buffer<T>&& msg, rank_type src, tag_type tag, CallBack&& callback)
+recv_request recv(message_buffer<T>&& msg, rank_type src, tag_type tag, CallBack&& callback, void* stream = nullptr)
```
These signatures can lead to ambiguous calls: leaving out the callback but supplying a stream can match this overload as well, with the stream taking the place of CallBack. Is this OK?
I would add some SFINAE tests such as std::enable_if_t<std::is_invocable_v<CallBack>>, but I am not sure if this is a good idea.
Yeah, it's a bit unfortunate. There's OOMPH_CHECK_CALLBACK* that's used essentially for that in the body of the functions, but that's not SFINAE. I'm also unsure what's best here.
Force-pushed from e81de62 to 8a854e4.
```cpp
// TODO: The sreq.wait was previously called immediately. With NCCL
// groups can't call wait so early (communication hasn't started yet).
```
Note the semantic change here: if one attempts to call env.comm.send(...).wait() within the NCCL group, it will hang; wait will block forever since the group never starts. Should that just throw an exception instead (we can easily query whether the group has already been ended)?
I would say it should throw an exception.
Sounds good, I'll (try to) add that.
```cpp
    bool user_alloc)
{
    if (ctxt.get_transport_option("name") == std::string("nccl")) {
        // Skip for NCCL. Recursive comms hangs. TODO: Does it have to hang?
```
Check this.
This now seems to work in ICON Fortran. While I still have some open TODOs, I'd be grateful for feedback already. The general implementation is pretty much what I want it to be, though I still have some profiling to do with NCCL to check whether I'm missing some additional low-hanging fruit. Besides any comments you may have on the implementation itself (in particular, I'm grateful for comments on where I've misunderstood oomph's requirements for backends), I guess we may need to discuss some sort of CI for the NCCL backend... I can't request reviews, so pinging @boeschf @biddisco @philip-paul-mueller.
philip-paul-mueller left a comment
I have some comments/suggestions, but I am not sure what they are worth; probably not much.
```diff
 template<typename T, typename CallBack>
-recv_request recv(message_buffer<T>&& msg, rank_type src, tag_type tag, CallBack&& callback)
+recv_request recv(message_buffer<T>&& msg, rank_type src, tag_type tag, CallBack&& callback, void* stream = nullptr)
```
I would add some SFINAE tests such as std::enable_if_t<std::is_invocable_v<CallBack>>, but I am not sure if this is a good idea.
```diff
 send_request send(context_impl::heap_type::pointer const& ptr, std::size_t size, rank_type dst,
     oomph::tag_type tag, util::unique_function<void(rank_type, oomph::tag_type)>&& cb,
-    std::size_t* scheduled)
+    std::size_t* scheduled, void*)
```
```diff
-    std::size_t* scheduled, void*)
+    std::size_t* scheduled, void* /*stream*/)
```
Well spotted, thanks. I think I prefer [[maybe_unused]] void* stream, but can go with either. Do you have a preference?
Not really, but I think the rest of the code uses [[maybe_unused]].
```diff
 recv_request recv(context_impl::heap_type::pointer& ptr, std::size_t size, rank_type src,
     oomph::tag_type tag, util::unique_function<void(rank_type, oomph::tag_type)>&& cb,
-    std::size_t* scheduled)
+    std::size_t* scheduled, void*)
```
```diff
-    std::size_t* scheduled, void*)
+    std::size_t* scheduled, void* /*stream*/)
```
```cpp
}

nccl_request recv(context_impl::heap_type::pointer& ptr, std::size_t size, rank_type src,
    [[maybe_unused]] tag_type tag, void* stream)
```
When you added the stream to the other backends you essentially added an unnamed void* argument (which is okay), but here you used [[maybe_unused]].
I would not use both but stick to one style, i.e. "unnamed argument" vs. [[maybe_unused]].
But this is super unrelated to anything.
It's a good point. Question in #55 (comment).
```cpp
    void* stream)
{
    auto req = send(ptr, size, dst, tag, stream);
    auto s = m_req_state_factory.make(m_context, this, scheduled, dst, tag, std::move(cb),
```
What is s?
Good question. State? It's the same variable name as in other backends.
```cpp
static cuda_event_pool pool{128};
return pool;
```
```diff
-static cuda_event_pool pool{128};
-return pool;
+static cuda_event_pool* pool = new cuda_event_pool(128);
+return *pool;
```
See: https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2
I have to say I disagree with that motivation, or at least the solution. IMO if the events outlive the pool, then the events should be returned earlier, not the pool leaked. But I can be convinced otherwise...
```cpp
ncclResult_t result;
do {
    OOMPH_CHECK_NCCL_RESULT(ncclCommGetAsyncError(m_comm, &result));
} while (result == ncclInProgress);
```
This is more of a question for myself, but this loop can technically spin indefinitely. Would it be a good idea to include a timeout?
I think NCCL internally has enough timeouts that this should not be a problem, but not completely sure... If there's a timeout, the question is what value it should be and how it's configured.
```cpp
//
// Same semantics as cuda_event, but the event is retrieved from a static
// cuda_event_pool on construction and returned to the pool on destruction.
struct cached_cuda_event
```
I kind of like this idea.
```cpp
{
}

void progress();
```
Maybe I'm missing something, but where is the implementation of this function?
I think it's this:
Lines 144 to 148 in 2814e2a:

```cpp
void
detail::request_state::progress()
{
    m_comm->progress();
}
```
But you created this new struct, so the function also needs a new definition.
Right. That's the "think" part. I think that definition is shared among backends, and only one set of headers with the declaration is pulled in. I wish I understood it better.
Maybe @boeschf can confirm or refute this theory?
```cpp
// TODO: The sreq.wait was previously called immediately. With NCCL
// groups can't call wait so early (communication hasn't started yet).
```
I would say it should throw an exception.
This adds a NCCL backend, with some strong constraints compared to the MPI, libfabric, and UCX backends:
If one sticks to these requirements one should be able to use any backend. If one needs any of the above features, NCCL can't be used.
Adds a few extra features to communicators:

- `start_group`/`end_group`: these map to `ncclGroupStart`/`ncclGroupEnd` for NCCL, and are no-ops for other backends.
- `is_stream_aware`: the NCCL backend is the only one that returns `true` for this. If a backend `is_stream_aware`, it will take into account the optional `stream` argument that can be passed to `send`/`recv`.