@niehao100
Collaborator

Add CUDA Graph Support for Push-Pull Operations with Wait Kernel

Summary

This PR adds CUDA Graph support for push-pull operations through a wait kernel mechanism, which synchronizes the CPU and the GPU in a CUDA Graph-compatible way. It consists of two main commits: the first adds the wait kernel functionality, and the second extends it with CUDA Graph support.
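The flag handshake behind the wait kernel can be modeled on the host alone. The sketch below is a CPU-only C++ analogue, not the CUDA code from this PR. A `std::atomic` with release/acquire ordering stands in for the shared flag and the system-level fence, and a second thread plays the CPU worker. The function names mirror the `write_flag` and `wait_flag` interfaces listed under Changes. `Flag`, `run_replays`, the payload, and the "flag >= target" wait condition are assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a flag that both sides can see.
struct Flag {
    std::atomic<std::int64_t> value{0};
};

// Analogue of write_flag: publish a sequence number. The release store
// plays the role of the system-level memory fence.
void write_flag(Flag& flag, std::int64_t seq) {
    flag.value.store(seq, std::memory_order_release);
}

// Analogue of wait_flag: spin until the flag reaches the target.
// Assumption: "reaches" means flag >= seq, so a late waiter never hangs.
void wait_flag(const Flag& flag, std::int64_t seq) {
    while (flag.value.load(std::memory_order_acquire) < seq) {
        std::this_thread::yield();
    }
}

// Runs `replays` rounds of the handshake and returns the payload that the
// "GPU" side (the calling thread) observed in each round.
std::vector<std::int64_t> run_replays(int replays) {
    assert(replays >= 0);
    Flag gpu_ready;            // written by the "GPU" side, awaited by the CPU worker
    Flag cpu_done;             // written by the CPU worker, awaited by the "GPU" side
    std::int64_t payload = 0;  // data the CPU worker produces each round

    std::thread cpu([&] {
        for (std::int64_t k = 1; k <= replays; ++k) {
            wait_flag(gpu_ready, k);  // wait until the "GPU" reached round k
            payload = k * 10;         // stand-in for the push-pull work
            write_flag(cpu_done, k);  // tell the "GPU" that round k is done
        }
    });

    std::vector<std::int64_t> seen;
    for (std::int64_t seq = 1; seq <= replays; ++seq) {
        write_flag(gpu_ready, seq);  // signal the CPU worker
        wait_flag(cpu_done, seq);    // block until its result is published
        seen.push_back(payload);     // safe to read: ordered by acquire/release
    }
    cpu.join();
    return seen;
}
```

Because the sequence number only grows, a flag value left over from an earlier round can never satisfy a later wait, so the flag does not need to be reset between rounds.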

Changes

Core Features:

  • Implemented CUDA kernels for flag-based synchronization:
    • write_flag_kernel: Writes a sequence number to a flag with a system-level memory fence
    • wait_flag_kernel: Waits until a flag reaches a target sequence number
  • Added utility functions:
    • map_pinned_tensor: Maps pinned host memory to device memory for zero-copy access
    • write_flag: Host interface for writing flags on GPU
    • wait_flag: Host interface for waiting on flags on GPU
  • Refactored header files:
    • Renamed util.h to util.hpp for consistency
    • Added conditional compilation for CUDA-dependent code
  • Modified kernel signatures to use torch::Tensor instead of int64_t for sequence numbers:
    • This enables CUDA Graph capture: a Python integer is passed by value and fixed at capture time, whereas a tensor lets the kernel read the current value from memory on every replay
    • Updated write_flag and wait_flag to accept tensor-based sequence numbers
  • Added seq_add_one kernel for incrementing sequence numbers within CUDA Graph
  • Enhanced push_pull function:
    • Added optional need_event parameter (default: true)
    • Allows disabling event recording when used inside CUDA Graph
    • Enables more efficient graph execution without unnecessary event overhead
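
The switch from int64_t to tensor-based sequence numbers can be illustrated without a GPU. The sketch below is a host-only C++ analogy, not the API of this PR or of CUDA. A "graph" is modeled as a list of recorded closures. An argument captured by value is frozen at capture time, while an argument held in memory can be updated by a recorded step that plays the role of seq_add_one. `Graph`, `replay`, `capture_by_value`, `capture_by_memory`, and the two `flags_from_*` helpers are hypothetical names.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for a captured graph: recorded steps, replayed in order.
using Graph = std::vector<std::function<void()>>;

void replay(const Graph& graph) {
    for (const auto& step : graph) step();
}

// Records a "write_flag" step whose sequence number is a plain integer.
// The value is copied into the closure, so every replay writes the same number.
Graph capture_by_value(std::int64_t seq, std::vector<std::int64_t>& flag_log) {
    Graph graph;
    graph.push_back([seq, &flag_log] { flag_log.push_back(seq); });
    return graph;
}

// Records a "seq_add_one" step followed by a "write_flag" step. Both hold the
// address of the sequence number, as a kernel would hold a tensor's data pointer.
Graph capture_by_memory(std::int64_t* seq, std::vector<std::int64_t>& flag_log) {
    Graph graph;
    graph.push_back([seq] { ++*seq; });                           // seq_add_one
    graph.push_back([seq, &flag_log] { flag_log.push_back(*seq); });  // write_flag
    return graph;
}

// Replays the by-value graph and returns the flag values written.
std::vector<std::int64_t> flags_from_scalar_graph(int replays) {
    std::vector<std::int64_t> flag_log;
    Graph graph = capture_by_value(1, flag_log);
    for (int i = 0; i < replays; ++i) replay(graph);
    return flag_log;
}

// Replays the in-memory graph and returns the flag values written.
std::vector<std::int64_t> flags_from_memory_graph(int replays) {
    std::vector<std::int64_t> flag_log;
    std::int64_t seq = 0;
    Graph graph = capture_by_memory(&seq, flag_log);
    for (int i = 0; i < replays; ++i) replay(graph);
    return flag_log;
}
```

In this model, three replays of the by-value graph write 1, 1, 1, while the in-memory graph writes 1, 2, 3. A waiter that expects a fresh sequence number on every replay therefore needs the in-memory form.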

Testing

Run the test suite:

ROLE=joint RNIC=brainpf0 BIN=../fserver/test_kernel_wait bash tests/fserver/run_single_gpu.sh

@niehao100 changed the title from "Feat/wait kernel" to "Feat: wait kernel" on Nov 13, 2025
@niehao100 merged commit 740ac12 into stepfun-ai:main on Nov 13, 2025
1 check passed