Feat: wait kernel #41
Merged
Add CUDA Graph Support for Push-Pull Operations with Wait Kernel
Summary
This MR introduces CUDA Graph support for push-pull operations by implementing a wait kernel mechanism that enables synchronization between CPU and GPU threads in a CUDA Graph-compatible manner. The implementation consists of two main commits that progressively add the wait kernel functionality and enhance it with CUDA Graph support.
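The flag handshake described above can be illustrated with a host-only analogue. This is a minimal sketch, not the PR's kernels: it assumes the flag is a monotonically increasing sequence number, that "reaches" means greater than or equal to the target, and it substitutes a `std::atomic` with release/acquire ordering for the system-level memory fence and the mapped pinned memory. The names `Flag`, `write_flag_host`, `wait_flag_host`, and `run_handshake` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for the flag; the PR keeps its flag in memory
// visible to both CPU and GPU, which this host-only sketch does not model.
using Flag = std::atomic<std::int64_t>;

// Analogue of write_flag: publish a sequence number. The release store
// plays the role that the system-level fence plays in the kernel version.
inline void write_flag_host(Flag& flag, std::int64_t seq) {
    flag.store(seq, std::memory_order_release);
}

// Analogue of wait_flag: spin until the flag reaches the target
// (assumed here to mean flag >= target).
inline void wait_flag_host(const Flag& flag, std::int64_t target) {
    while (flag.load(std::memory_order_acquire) < target) {
        std::this_thread::yield();
    }
}

// One producer/waiter round trip. Returns the payload the waiter observed.
// seq must be positive because the flag starts at 0.
inline std::int64_t run_handshake(std::int64_t seq) {
    assert(seq > 0);
    Flag flag{0};
    std::int64_t payload = 0;  // plain data guarded by the flag

    std::thread producer([&] {
        payload = seq * 10;          // write the data first
        write_flag_host(flag, seq);  // then publish the sequence number
    });

    wait_flag_host(flag, seq);       // blocks until the producer has published
    std::int64_t seen = payload;     // safe: acquire load pairs with release store
    producer.join();
    return seen;
}
```

Calling `run_handshake(3)` runs one round trip in which the waiter reads the payload only after the sequence number has been published. The ordering is the point of the sketch: data is written before the flag, and the waiter reads data only after observing the flag.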
Changes
Core Features:
- write_flag_kernel: Writes a sequence number to a flag with a system-level memory fence
- wait_flag_kernel: Waits until a flag reaches a target sequence number
- map_pinned_tensor: Maps pinned host memory to device memory for zero-copy access
- write_flag: Host interface for writing flags on GPU
- wait_flag: Host interface for waiting on flags on GPU
- Renamed util.h to util.hpp for consistency
- Use torch::Tensor instead of int64_t for sequence numbers:
  - Updated write_flag and wait_flag to accept tensor-based sequence numbers
  - Added seq_add_one kernel for incrementing sequence numbers within a CUDA Graph
- Updated push_pull function:
  - Added need_event parameter (default: true)
Testing
Run the test suite: