Conversation

@romerojosh (Collaborator)
This PR adds CUDA graphs support for supervised learning problems. The feature is enabled via a new general configuration entry, enable_cuda_graphs; see the updated documentation.

Since we are targeting high-performance use cases, this functionality is kept fairly minimal in terms of features. In particular, we do not maintain internal static entry points to the captured graphs, support graph recapture for dynamic shapes, etc. Instead, we expect users to provide consistent input data (memory locations, shapes) compatible with the CUDA graphs operating model.

Marking this as a draft for now as I still need to implement some tests.

Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh (Collaborator, Author)

/build_and_test

@github-actions

github-actions bot commented Dec 3, 2025

🚀 Build workflow triggered! View run

@github-actions

github-actions bot commented Dec 4, 2025

✅ Build workflow passed! View run

Signed-off-by: Josh Romero <joshr@nvidia.com>
@azrael417 (Collaborator) left a comment

Looks good, thanks a lot. Much cleaner, but I still have a few comments.


private:
// Input signature for validating consistent inputs
struct InputSignature {

This structure is the same for all the states, wouldn't it be better to move it outside?

namespace torchfort {

// Action to take for current iteration
enum class GraphAction {

do we need this outside the ENABLE_GPU context?

void launch(cudaStream_t stream) { CHECK_CUDA(cudaGraphLaunch(graph_exec_.get(), stream)); }

// Get static loss (valid after CAPTURE or REPLAY)
const torch::Tensor& get_loss() const { return static_loss_; }

shall we add asserts to return error when not fully initialized?

}

// Extract loss value
*loss_val = loss.item<float>();

Does this work? .item() copies the loss back to the CPU, but the all-reduce needs a tensor, right? We could just clone the loss tensor, call all-reduce on it, and then extract the scalar with .item().
