Conversation

@msimberg msimberg commented Dec 22, 2025

This adds a NCCL backend, with some strong constraints compared to the MPI, libfabric, and UCX backends:

  • Cancellation isn't supported
  • Tags aren't supported (they are ignored)
  • Send/recv submission requirements are stronger (communication should mostly be launched within groups)
  • Recursive communication in callbacks may not work with NCCL (TODO: check)
  • Multithreading is not allowed with NCCL

If one sticks to these requirements, any backend can be used. If any of the above features are needed, NCCL can't be used.

Adds a few extra features to communicators:

  • start_group/end_group: These map to ncclGroupStart/ncclGroupEnd for NCCL, and no-ops for other backends.
  • is_stream_aware: The NCCL backend is the only one that returns true for this. If a backend is stream-aware, it takes into account the optional stream argument that can be passed to send/recv.


template<typename T>
recv_request recv(message_buffer<T>& msg, rank_type src, tag_type tag)
recv_request recv(message_buffer<T>& msg, rank_type src, tag_type tag, void* stream = nullptr)
Contributor Author

Is this a good API?

This means that for NCCL the default stream is used if nothing is specified (a stream is always required for NCCL). For other backends the stream is ignored.


template<typename T, typename CallBack>
recv_request recv(message_buffer<T>&& msg, rank_type src, tag_type tag, CallBack&& callback)
recv_request recv(message_buffer<T>&& msg, rank_type src, tag_type tag, CallBack&& callback, void* stream = nullptr)
Contributor Author

These signatures can lead to ambiguous calls: leaving out the callback but supplying a stream can match this overload as well, with the stream taking the place of CallBack. Is this ok?


I would add some SFINAE tests such as std::enable_if_t<std::is_invocable_v<CallBack>>, but I am not sure if this is a good idea.

Contributor Author

Yeah, it's a bit unfortunate. There's OOMPH_CHECK_CALLBACK* that's used essentially for that in the body of the functions, but that's not SFINAE. Also unsure what's best here.

Comment on lines +257 to +258
// TODO: The sreq.wait was previously called immediately. With NCCL
// groups can't call wait so early (communication hasn't started yet).
Contributor Author

Note the semantic change here: if one attempts to call env.comm.send(...).wait() within the NCCL group it will hang; wait blocks forever because the communication never starts until the group ends. Should that just throw an exception instead (we can easily query whether the group has already been ended)?


I would say it should throw an exception.

@msimberg msimberg Jan 15, 2026

Sounds good, I'll (try to) add that.

bool user_alloc)
{
if (ctxt.get_transport_option("name") == std::string("nccl")) {
// Skip for NCCL. Recursive comms hangs. TODO: Does it have to hang?
Contributor Author

Check this.

msimberg commented Jan 8, 2026

This now seems to work in ICON Fortran. While I still have some open TODOs, I'd be grateful for feedback on this already. The general implementation is pretty much what I want it to be, though I still have some profiling to do with NCCL to check whether I'm missing any additional low-hanging fruit.

Besides any comments you may have on the implementation itself (I'm particularly grateful for comments on anything I've misunderstood about oomph's requirements for backends), I guess we may need to discuss some sort of CI for the NCCL backend...

I can't request reviews so pinging @boeschf @biddisco @philip-paul-mueller.

@msimberg msimberg marked this pull request as ready for review January 9, 2026 12:46
@philip-paul-mueller philip-paul-mueller left a comment

I have some comments/suggestions, but I am not sure what they are worth; probably not much.



send_request send(context_impl::heap_type::pointer const& ptr, std::size_t size, rank_type dst,
oomph::tag_type tag, util::unique_function<void(rank_type, oomph::tag_type)>&& cb,
std::size_t* scheduled)
std::size_t* scheduled, void*)


Suggested change
std::size_t* scheduled, void*)
std::size_t* scheduled, void* /*stream*/)

Contributor Author

Well spotted, thanks. I think I prefer [[maybe_unused]] void* stream, but can go with either. Do you have a preference?


Not really, but I think the rest is [[maybe_unused]].

recv_request recv(context_impl::heap_type::pointer& ptr, std::size_t size, rank_type src,
oomph::tag_type tag, util::unique_function<void(rank_type, oomph::tag_type)>&& cb,
std::size_t* scheduled)
std::size_t* scheduled, void*)


Suggested change
std::size_t* scheduled, void*)
std::size_t* scheduled, void* /*stream*/)

}

nccl_request recv(context_impl::heap_type::pointer& ptr, std::size_t size, rank_type src,
[[maybe_unused]] tag_type tag, void* stream)


When you added the stream to the other backends you essentially added an unnamed void* argument (which is okay), but here you used [[maybe_unused]].
I would not use both but stick to one style, i.e. "unnamed argument" vs. [[maybe_unused]].
But this is super unrelated to anything.

Contributor Author

It's a good point. Question in #55 (comment).

void* stream)
{
auto req = send(ptr, size, dst, tag, stream);
auto s = m_req_state_factory.make(m_context, this, scheduled, dst, tag, std::move(cb),


What is s?

Contributor Author

Good question. State? It's the same variable name as in other backends.

Comment on lines +18 to +19
static cuda_event_pool pool{128};
return pool;


Suggested change
static cuda_event_pool pool{128};
return pool;
static cuda_event_pool* pool = new cuda_event_pool(128);
return *pool;

See: https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2

Contributor Author

I have to say I disagree with that motivation, or at least the solution. IMO if the events outlive the pool, then the events should be returned earlier, not the pool leaked. But I can be convinced otherwise...

ncclResult_t result;
do {
OOMPH_CHECK_NCCL_RESULT(ncclCommGetAsyncError(m_comm, &result));
} while (result == ncclInProgress);


This is more of a question for myself, but this can technically go on indefinitely.
So would it be a good idea to include a timeout?

Contributor Author

I think NCCL internally has enough timeouts that this should not be a problem, but not completely sure... If there's a timeout, the question is what value it should be and how it's configured.

//
// Same semantics as cuda_event, but the event is retrieved from a static
// cuda_event_pool on construction and returned to the pool on destruction.
struct cached_cuda_event


I kind of like this idea.

{
}

void progress();


Maybe I miss something, but where is the implementation of this function?

Contributor Author

I think it's this:

oomph/src/request.cpp

Lines 144 to 148 in 2814e2a

void
detail::request_state::progress()
{
m_comm->progress();
}
(backend independent).


But you created this new struct so the function must also need a new definition.

Contributor Author

Right. That's the "think" part. I think that definition is shared among backends, and only one set of headers with the declaration is pulled in. I wish I understood it better.

Maybe @boeschf can confirm or refute this theory?

