Conversation

@msimberg msimberg commented Dec 22, 2025

This adds a NCCL backend, with some strong constraints compared to the MPI, libfabric, and UCX backends:

  • Cancellation isn't supported
  • Tags aren't supported (they are ignored)
  • Send/recv submission requirements are stronger (communication should mostly be launched within groups)
  • Recursive communication in callbacks may not work with NCCL (TODO: check)
  • Multithreading is not allowed with NCCL

If one sticks to these requirements, any backend can be used. If any of the above features are needed, NCCL can't be used.

Adds a few extra features to communicators:

  • start_group/end_group: These map to ncclGroupStart/ncclGroupEnd for NCCL, and no-ops for other backends.
  • is_stream_aware: The NCCL backend is the only one that returns true for this. If a backend is stream-aware, it takes into account the optional stream argument that can be passed to send/recv.


template<typename T>
recv_request recv(message_buffer<T>& msg, rank_type src, tag_type tag)
recv_request recv(message_buffer<T>& msg, rank_type src, tag_type tag, void* stream = nullptr)
Contributor Author

Is this a good API?

This means that for NCCL the default stream is used if nothing is specified (a stream is always required for NCCL). For other backends the stream is ignored.


template<typename T, typename CallBack>
recv_request recv(message_buffer<T>&& msg, rank_type src, tag_type tag, CallBack&& callback)
recv_request recv(message_buffer<T>&& msg, rank_type src, tag_type tag, CallBack&& callback, void* stream = nullptr)
Contributor Author

These signatures can lead to ambiguous calls: leaving out the callback but supplying a stream can match this overload as well, with the stream taking the place of CallBack. Is this ok?


I would add some SFINAE tests such as std::enable_if_t<std::is_invocable_v<CallBack>>, but I am not sure if this is a good idea.

Contributor Author

Yeah, it's a bit unfortunate. There's OOMPH_CHECK_CALLBACK* that's used essentially for that in the body of the functions, but that's not SFINAE. Also unsure what's best here.

Comment on lines +257 to +258
// TODO: The sreq.wait was previously called immediately. With NCCL
// groups can't call wait so early (communication hasn't started yet).
Contributor Author

Note the semantic change here: if one attempts to call env.comm.send(...).wait() within the NCCL group it will hang; wait blocks forever because the communication never starts until the group ends. Should that just throw an exception instead (we can easily query whether the group has already been ended)?


I would say it should throw an exception.

@msimberg msimberg Jan 15, 2026

Sounds good, I'll (try to) add that.

bool user_alloc)
{
if (ctxt.get_transport_option("name") == std::string("nccl")) {
// Skip for NCCL. Recursive comms hangs. TODO: Does it have to hang?
Contributor Author

Check this.

msimberg commented Jan 8, 2026

This now seems to work in ICON Fortran. While I still have some open TODOs, I'd be grateful for feedback on this already. The general implementation is pretty much what I want it to be, though I still have some profiling to do with NCCL to check whether I'm missing any additional low-hanging fruit.

Besides any comments you may have on the implementation itself (I'm particularly grateful for comments on anything I've misunderstood about oomph's requirements for backends), I guess we may need to discuss some sort of CI for the NCCL backend...

I can't request reviews so pinging @boeschf @biddisco @philip-paul-mueller.

@msimberg msimberg marked this pull request as ready for review January 9, 2026 12:46
@philip-paul-mueller philip-paul-mueller left a comment

I have some comments/suggestions, but I am not sure what they are worth; probably not much.



send_request send(context_impl::heap_type::pointer const& ptr, std::size_t size, rank_type dst,
oomph::tag_type tag, util::unique_function<void(rank_type, oomph::tag_type)>&& cb,
std::size_t* scheduled)
std::size_t* scheduled, void*)


Suggested change
std::size_t* scheduled, void*)
std::size_t* scheduled, void* /*stream*/)

Contributor Author

Well spotted, thanks. I think I prefer [[maybe_unused]] void* stream, but can go with either. Do you have a preference?


Not really, but I think the rest is [[maybe_unused]].

recv_request recv(context_impl::heap_type::pointer& ptr, std::size_t size, rank_type src,
oomph::tag_type tag, util::unique_function<void(rank_type, oomph::tag_type)>&& cb,
std::size_t* scheduled)
std::size_t* scheduled, void*)


Suggested change
std::size_t* scheduled, void*)
std::size_t* scheduled, void* /*stream*/)

}

nccl_request recv(context_impl::heap_type::pointer& ptr, std::size_t size, rank_type src,
[[maybe_unused]] tag_type tag, void* stream)


When you added the stream to the other backends you essentially added an unnamed void* argument (which is okay), but here you used [[maybe_unused]].
I would not use both but stick to one style, i.e. "unnamed argument" vs. [[maybe_unused]].
But this is super unrelated to anything.

Contributor Author

It's a good point. Question in #55 (comment).

void* stream)
{
auto req = send(ptr, size, dst, tag, stream);
auto s = m_req_state_factory.make(m_context, this, scheduled, dst, tag, std::move(cb),


What is s?

Contributor Author

Good question. State? It's the same variable name as in other backends.

Comment on lines +18 to +19
static cuda_event_pool pool{128};
return pool;


Suggested change
static cuda_event_pool pool{128};
return pool;
static cuda_event_pool* pool = new cuda_event_pool(128);
return *pool;

See: https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2

Contributor Author

I have to say I disagree with that motivation, or at least the solution. IMO if the events outlive the pool, then the events should be returned earlier, not the pool leaked. But I can be convinced otherwise...

ncclResult_t result;
do {
OOMPH_CHECK_NCCL_RESULT(ncclCommGetAsyncError(m_comm, &result));
} while (result == ncclInProgress);


This is more of a question for myself, but this can technically go on indefinitely.
So would it be a good idea to include a timeout?

Contributor Author

I think NCCL internally has enough timeouts that this should not be a problem, but not completely sure... If there's a timeout, the question is what value it should be and how it's configured.

//
// Same semantics as cuda_event, but the event is retrieved from a static
// cuda_event_pool on construction and returned to the pool on destruction.
struct cached_cuda_event


I kind of like this idea.

{
}

void progress();


Maybe I miss something, but where is the implementation of this function?

Contributor Author

I think it's this:

oomph/src/request.cpp

Lines 144 to 148 in 2814e2a

void
detail::request_state::progress()
{
m_comm->progress();
}
(backend independent).


But you created this new struct so the function must also need a new definition.

Contributor Author

Right. That's the "think" part. I think that definition is shared among backends, and only one set of headers with the declaration is pulled in. I wish I understood it better.

Maybe @boeschf can confirm or refute this theory?

