Issue/791: Add the add_rms_norm fused operator #842
Conversation
Pull request overview
This PR adds a fused add_rms_norm operator that combines element-wise addition with RMS normalization in a single operation. This fusion improves performance by reducing memory bandwidth requirements and kernel launch overhead, particularly beneficial for transformer architectures where residual connections and normalization are common patterns.
Key changes:
- Implements the fused operator for both CPU and NVIDIA GPU backends with support for FP16, BF16, and FP32 data types
- Adds comprehensive test coverage for both infiniop and infinicore layers with various tensor shapes and strides
- Provides Python bindings and high-level API in infinicore with both out-of-place and in-place variants
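To make the fused computation concrete, here is a hedged NumPy reference sketch (function and variable names are illustrative, not the PR's API): the operator adds the two inputs, then applies RMS normalization scaled by a learned weight, returning both the normalized result and the raw sum for the residual path.

```python
import numpy as np

def add_rms_norm_ref(a, b, weight, epsilon=1e-5):
    """Reference for the fused op: residual add followed by RMS normalization.

    Returns (normalized_result, add_result), matching the operator's
    two outputs: RMSNorm(a + b) * weight and a + b.
    """
    residual = a + b  # the fused element-wise add
    # Root-mean-square over the last (hidden) dimension, with epsilon for stability
    rms = np.sqrt(np.mean(residual**2, axis=-1, keepdims=True) + epsilon)
    return residual / rms * weight, residual

rng = np.random.default_rng(0)
a = rng.standard_normal((2, 4)).astype(np.float32)
b = rng.standard_normal((2, 4)).astype(np.float32)
w = np.ones(4, dtype=np.float32)
y, res = add_rms_norm_ref(a, b, w)
```

Fusing these two steps avoids materializing `a + b` in global memory twice, which is where the bandwidth savings described above come from.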
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| test/infiniop/libinfiniop/op_register.py | Registers C API bindings for the new add_rms_norm operator |
| test/infiniop/add_rms_norm.py | Comprehensive test suite for infiniop implementation with multiple tensor shapes, strides, and data types |
| test/infinicore/ops/add_rms_norm.py | Framework-integrated test suite for infinicore with out-of-place and in-place operation tests |
| src/infiniop/ops/add_rms_norm/operator.cc | Device dispatch logic for operator creation, execution, and destruction |
| src/infiniop/ops/add_rms_norm/nvidia/add_rms_norm_nvidia.cuh | NVIDIA GPU implementation header |
| src/infiniop/ops/add_rms_norm/nvidia/add_rms_norm_nvidia.cu | NVIDIA GPU kernel launcher supporting multiple data type combinations |
| src/infiniop/ops/add_rms_norm/info.h | Validation and metadata management for operator parameters |
| src/infiniop/ops/add_rms_norm/cuda/kernel.cuh | CUDA kernel implementation computing fused add and RMS normalization |
| src/infiniop/ops/add_rms_norm/cpu/add_rms_norm_cpu.h | CPU implementation header |
| src/infiniop/ops/add_rms_norm/cpu/add_rms_norm_cpu.cc | CPU implementation with OpenMP parallelization |
| src/infiniop/ops/add_rms_norm/add_rms_norm.h | Descriptor template macro definition |
| src/infinicore/pybind11/ops/add_rms_norm.hpp | Python binding definitions for both functional and in-place APIs |
| src/infinicore/pybind11/ops.hpp | Registers add_rms_norm bindings in the ops module |
| src/infinicore/ops/add_rms_norm/add_rms_norm_infiniop.cc | Infinicore implementation using infiniop backend with caching |
| src/infinicore/ops/add_rms_norm/add_rms_norm.cc | Infinicore dispatcher and high-level API implementation |
| python/infinicore/ops/add_rms_norm.py | Python wrapper providing functional and in-place interfaces |
| python/infinicore/\_\_init\_\_.py | Exports add_rms_norm and add_rms_norm_ to public API |
| include/infiniop/ops/add_rms_norm.h | Public C API header defining operator functions |
| include/infiniop.h | Includes add_rms_norm header in main public API |
| include/infinicore/ops/add_rms_norm.hpp | Public C++ API header for infinicore |
| include/infinicore/ops.hpp | Includes add_rms_norm in ops header |
```cuda
INFINIOP_CUDA_KERNEL add_rmsnormKernel(
    Tdata *__restrict__ y,
    Tdata *__restrict__ residual_out,
    ptrdiff_t stride_y_batch,
    ptrdiff_t stride_y_nhead,
    ptrdiff_t stride_residual_out_batch,
    ptrdiff_t stride_residual_out_nhead,
    const Tdata *__restrict__ a,
    ptrdiff_t stride_a_batch,
    ptrdiff_t stride_a_nhead,
    const Tdata *__restrict__ b,
    ptrdiff_t stride_b_batch,
    ptrdiff_t stride_b_nhead,
    const Tweight *__restrict__ w,
    size_t nhead,
    size_t dim,
    float epsilon) {
    add_rmsnormBlock<BLOCK_SIZE, Tcompute>(
```
Copilot AI commented on Jan 7, 2026:
The kernel function is named add_rmsnormKernel but the device function is named add_rmsnormBlock. The naming convention is inconsistent - one uses 'Kernel' suffix and the other uses 'Block' suffix. Consider standardizing to either add_rms_norm_kernel and add_rms_norm_block (snake_case) or addRmsNormKernel and addRmsNormBlock (camelCase) for consistency with the codebase style.
```python
def add_rms_norm_(y, residual_out, a, b, weight, epsilon=1e-5):
    """In-place Fused Add and RMS Normalization."""
```
Copilot AI commented on Jan 7, 2026:
The docstring for the in-place function add_rms_norm_ lacks parameter documentation. While the out-of-place version has detailed documentation for all parameters and return values, the in-place version only has a one-line description. Add parameter documentation (y, residual_out, a, b, weight, epsilon) for consistency and to help users understand the API.
| """In-place Fused Add and RMS Normalization.""" | |
| """ | |
| In-place Fused Add and RMS Normalization. | |
| Args: | |
| y: Output tensor for the normalized result (RMSNorm(a + b) * weight). | |
| residual_out: Output tensor for the residual result (a + b). | |
| a: First input tensor. | |
| b: Second input tensor. | |
| weight: Scale weights applied after RMS normalization. | |
| epsilon: Small constant for numerical stability, default is 1e-5. | |
| Returns: | |
| None. The results are written in-place into ``y`` and ``residual_out``. | |
| """ |
```python
b = TestTensor(b_shape, b_stride, dtype, device, scale=0.01)
w = TestTensor(w_shape, None, w_dtype, device)

eps = 1e-6
```
Copilot AI commented on Jan 7, 2026:
The epsilon value differs between infiniop tests (1e-6) and infinicore tests (1e-5). This inconsistency could lead to different numerical behavior and test results between the two implementations. Consider using the same epsilon value across all test files for consistency, or document why different values are needed.
Suggested change:

```diff
-eps = 1e-6
+eps = get_tolerance(dtype)
```
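To illustrate why the epsilon choice is more than cosmetic, here is a small standalone sketch (not from the PR): when the mean-square of the activations is small relative to epsilon, switching between 1e-5 and 1e-6 changes the normalized output substantially.

```python
import numpy as np

# Illustrative only: epsilon dominates when mean(x^2) is comparable to it.
x = np.full(8, 1e-4, dtype=np.float32)   # small-magnitude activations
for eps in (1e-5, 1e-6):
    rms = np.sqrt(np.mean(x * x) + eps)  # mean(x^2) = 1e-8, far below eps
    print(f"eps={eps:g} -> normalized value {x[0] / rms:.4f}")
```

For typical unit-scale activations the difference is negligible, but mismatched epsilons can still make the two test suites disagree near their tolerance thresholds.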
```python
        epsilon: Small constant for numerical stability, default is 1e-5

    Returns:
        Tuple of (normalized_result, add_result): (RMSNorm(a + b) * weight, a + b)
```
Copilot AI commented on Jan 7, 2026:
The function signature in the Python binding documentation states the return type as "Tuple of (normalized_result, add_result): (RMSNorm(a + b) * weight, a + b)", but the actual computation includes the weight multiplication in the normalized result. The documentation should clarify that the normalized result is (a + b) * weight / sqrt(mean((a+b)^2) + epsilon), not just RMSNorm(a + b) * weight, to avoid ambiguity about whether weight is inside or outside the normalization.
Suggested change:

```diff
-        Tuple of (normalized_result, add_result): (RMSNorm(a + b) * weight, a + b)
+        Tuple of (normalized_result, add_result):
+            normalized_result = (a + b) * weight / sqrt(mean((a + b)^2) + epsilon)
+            add_result = a + b
```
```python
b.data(),
w.data(),
residual_out.data(),
None,
```
Copilot AI commented on Jan 7, 2026:
The last parameter to infiniopAddRMSNorm is being passed as None (Python's None), but the function signature expects a void *stream parameter. While this may work if the underlying C code accepts NULL, it would be clearer to explicitly pass ctypes.c_void_p(0) or handle the stream parameter properly. This is inconsistent with how other operators might handle stream parameters.
Suggested change:

```diff
-None,
+ctypes.c_void_p(0),
```
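For context on this suggestion: ctypes does convert Python `None` to a NULL pointer wherever the declared parameter type is `c_void_p`, so the existing call is functionally correct and the suggestion is mainly about explicitness. A quick standalone sketch of that conversion behavior (the callback signature here is illustrative, not the operator's real API):

```python
import ctypes

# An illustrative C-callable signature taking a void* "stream" argument.
STREAM_FN = ctypes.CFUNCTYPE(ctypes.c_bool, ctypes.c_void_p)

def receives_null(stream):
    # ctypes delivers a NULL void* argument to the callback as Python None
    return stream is None

fn = STREAM_FN(receives_null)
print(fn(None))                  # Python None is marshalled as NULL
print(fn(ctypes.c_void_p(0)))    # an explicit NULL wrapper behaves the same
```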
```python
# Check if tensors support in-place operations
a_supports_inplace = not is_broadcast(a_strides)
b_supports_inplace = not is_broadcast(b_strides)
```
Copilot AI commented on Jan 7, 2026:
Variable a_supports_inplace is not used.
Suggested change:

```diff
-# Check if tensors support in-place operations
-a_supports_inplace = not is_broadcast(a_strides)
-b_supports_inplace = not is_broadcast(b_strides)
+# Check if output tensor supports in-place operations
```
```python
a_supports_inplace = not is_broadcast(a_strides)
b_supports_inplace = not is_broadcast(b_strides)
```
Copilot AI commented on Jan 7, 2026:
Variable b_supports_inplace is not used.
Suggested change:

```diff
-a_supports_inplace = not is_broadcast(a_strides)
-b_supports_inplace = not is_broadcast(b_strides)
```
```diff
@@ -0,0 +1,185 @@
+import torch
+import ctypes
```
Copilot AI commented on Jan 7, 2026:
Module 'ctypes' is imported with both 'import' and 'import from'.
Suggested change:

```diff
-import ctypes
```
infiniop operator tests

CPU and GPU test results: screenshots attached in the original PR (image links not preserved in this capture).

infinicore tests

CPU and GPU test results: screenshots attached in the original PR (image links not preserved in this capture).