Issue/791: Add the add_rms_norm fused operator #842
Conversation
Pull request overview
This PR adds a fused add_rms_norm operator that combines element-wise addition with RMS normalization in a single operation. This fusion improves performance by reducing memory bandwidth requirements and kernel launch overhead, particularly beneficial for transformer architectures where residual connections and normalization are common patterns.
Key changes:
- Implements the fused operator for both CPU and NVIDIA GPU backends with support for FP16, BF16, and FP32 data types
- Adds comprehensive test coverage for both infiniop and infinicore layers with various tensor shapes and strides
- Provides Python bindings and high-level API in infinicore with both out-of-place and in-place variants
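To make the fused computation concrete, here is a hedged NumPy reference sketch (function and variable names are illustrative, not the PR's API): the operator adds the two inputs, then applies RMS normalization scaled by a learned weight, returning both the normalized result and the raw sum for the residual path.

```python
import numpy as np

def add_rms_norm_ref(a, b, weight, epsilon=1e-5):
    """Reference for the fused op: residual add followed by RMS normalization.

    Returns (normalized_result, add_result), matching the operator's
    two outputs: RMSNorm(a + b) * weight and a + b.
    """
    residual = a + b  # the fused element-wise add
    # Root-mean-square over the last (hidden) dimension, with epsilon for stability
    rms = np.sqrt(np.mean(residual**2, axis=-1, keepdims=True) + epsilon)
    return residual / rms * weight, residual

rng = np.random.default_rng(0)
a = rng.standard_normal((2, 4)).astype(np.float32)
b = rng.standard_normal((2, 4)).astype(np.float32)
w = np.ones(4, dtype=np.float32)
y, res = add_rms_norm_ref(a, b, w)
```

Fusing these two steps avoids materializing `a + b` in global memory twice, which is where the bandwidth savings described above come from.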
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| test/infiniop/libinfiniop/op_register.py | Registers C API bindings for the new add_rms_norm operator |
| test/infiniop/add_rms_norm.py | Comprehensive test suite for infiniop implementation with multiple tensor shapes, strides, and data types |
| test/infinicore/ops/add_rms_norm.py | Framework-integrated test suite for infinicore with out-of-place and in-place operation tests |
| src/infiniop/ops/add_rms_norm/operator.cc | Device dispatch logic for operator creation, execution, and destruction |
| src/infiniop/ops/add_rms_norm/nvidia/add_rms_norm_nvidia.cuh | NVIDIA GPU implementation header |
| src/infiniop/ops/add_rms_norm/nvidia/add_rms_norm_nvidia.cu | NVIDIA GPU kernel launcher supporting multiple data type combinations |
| src/infiniop/ops/add_rms_norm/info.h | Validation and metadata management for operator parameters |
| src/infiniop/ops/add_rms_norm/cuda/kernel.cuh | CUDA kernel implementation computing fused add and RMS normalization |
| src/infiniop/ops/add_rms_norm/cpu/add_rms_norm_cpu.h | CPU implementation header |
| src/infiniop/ops/add_rms_norm/cpu/add_rms_norm_cpu.cc | CPU implementation with OpenMP parallelization |
| src/infiniop/ops/add_rms_norm/add_rms_norm.h | Descriptor template macro definition |
| src/infinicore/pybind11/ops/add_rms_norm.hpp | Python binding definitions for both functional and in-place APIs |
| src/infinicore/pybind11/ops.hpp | Registers add_rms_norm bindings in the ops module |
| src/infinicore/ops/add_rms_norm/add_rms_norm_infiniop.cc | Infinicore implementation using infiniop backend with caching |
| src/infinicore/ops/add_rms_norm/add_rms_norm.cc | Infinicore dispatcher and high-level API implementation |
| python/infinicore/ops/add_rms_norm.py | Python wrapper providing functional and in-place interfaces |
| python/infinicore/\_\_init\_\_.py | Exports add_rms_norm and add_rms_norm_ to public API |
| include/infiniop/ops/add_rms_norm.h | Public C API header defining operator functions |
| include/infiniop.h | Includes add_rms_norm header in main public API |
| include/infinicore/ops/add_rms_norm.hpp | Public C++ API header for infinicore |
| include/infinicore/ops.hpp | Includes add_rms_norm in ops header |
```cuda
INFINIOP_CUDA_KERNEL add_rmsnormKernel(
    Tdata *__restrict__ y,
    Tdata *__restrict__ residual_out,
    ptrdiff_t stride_y_batch,
    ptrdiff_t stride_y_nhead,
    ptrdiff_t stride_residual_out_batch,
    ptrdiff_t stride_residual_out_nhead,
    const Tdata *__restrict__ a,
    ptrdiff_t stride_a_batch,
    ptrdiff_t stride_a_nhead,
    const Tdata *__restrict__ b,
    ptrdiff_t stride_b_batch,
    ptrdiff_t stride_b_nhead,
    const Tweight *__restrict__ w,
    size_t nhead,
    size_t dim,
    float epsilon) {
    add_rmsnormBlock<BLOCK_SIZE, Tcompute>(
```
Copilot AI commented on Jan 7, 2026:
The kernel function is named add_rmsnormKernel but the device function is named add_rmsnormBlock. The naming convention is inconsistent - one uses 'Kernel' suffix and the other uses 'Block' suffix. Consider standardizing to either add_rms_norm_kernel and add_rms_norm_block (snake_case) or addRmsNormKernel and addRmsNormBlock (camelCase) for consistency with the codebase style.
```python
def add_rms_norm_(y, residual_out, a, b, weight, epsilon=1e-5):
    """In-place Fused Add and RMS Normalization."""
```
Copilot AI commented on Jan 7, 2026:
The docstring for the in-place function add_rms_norm_ lacks parameter documentation. While the out-of-place version has detailed documentation for all parameters and return values, the in-place version only has a one-line description. Add parameter documentation (y, residual_out, a, b, weight, epsilon) for consistency and to help users understand the API.
| """In-place Fused Add and RMS Normalization.""" | |
| """ | |
| In-place Fused Add and RMS Normalization. | |
| Args: | |
| y: Output tensor for the normalized result (RMSNorm(a + b) * weight). | |
| residual_out: Output tensor for the residual result (a + b). | |
| a: First input tensor. | |
| b: Second input tensor. | |
| weight: Scale weights applied after RMS normalization. | |
| epsilon: Small constant for numerical stability, default is 1e-5. | |
| Returns: | |
| None. The results are written in-place into ``y`` and ``residual_out``. | |
| """ |
```python
b = TestTensor(b_shape, b_stride, dtype, device, scale=0.01)
w = TestTensor(w_shape, None, w_dtype, device)

eps = 1e-6
```
Copilot AI commented on Jan 7, 2026:
The epsilon value differs between infiniop tests (1e-6) and infinicore tests (1e-5). This inconsistency could lead to different numerical behavior and test results between the two implementations. Consider using the same epsilon value across all test files for consistency, or document why different values are needed.
Suggested change:

```diff
-eps = 1e-6
+eps = get_tolerance(dtype)
```
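To illustrate why the epsilon choice is more than cosmetic, here is a small standalone sketch (not from the PR): when the mean-square of the activations is small relative to epsilon, switching between 1e-5 and 1e-6 changes the normalized output substantially.

```python
import numpy as np

# Illustrative only: epsilon dominates when mean(x^2) is comparable to it.
x = np.full(8, 1e-4, dtype=np.float32)   # small-magnitude activations
for eps in (1e-5, 1e-6):
    rms = np.sqrt(np.mean(x * x) + eps)  # mean(x^2) = 1e-8, far below eps
    print(f"eps={eps:g} -> normalized value {x[0] / rms:.4f}")
```

For typical unit-scale activations the difference is negligible, but mismatched epsilons can still make the two test suites disagree near their tolerance thresholds.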
```python
        epsilon: Small constant for numerical stability, default is 1e-5

    Returns:
        Tuple of (normalized_result, add_result): (RMSNorm(a + b) * weight, a + b)
```
Copilot AI commented on Jan 7, 2026:
The function signature in the Python binding documentation states the return type as "Tuple of (normalized_result, add_result): (RMSNorm(a + b) * weight, a + b)", but the actual computation includes the weight multiplication in the normalized result. The documentation should clarify that the normalized result is (a + b) * weight / sqrt(mean((a+b)^2) + epsilon), not just RMSNorm(a + b) * weight, to avoid ambiguity about whether weight is inside or outside the normalization.
Suggested change:

```diff
-        Tuple of (normalized_result, add_result): (RMSNorm(a + b) * weight, a + b)
+        Tuple of (normalized_result, add_result):
+            normalized_result = (a + b) * weight / sqrt(mean((a + b)^2) + epsilon)
+            add_result = a + b
```
```python
b.data(),
w.data(),
residual_out.data(),
None,
```
Copilot AI commented on Jan 7, 2026:
The last parameter to infiniopAddRMSNorm is being passed as None (Python's None), but the function signature expects a void *stream parameter. While this may work if the underlying C code accepts NULL, it would be clearer to explicitly pass ctypes.c_void_p(0) or handle the stream parameter properly. This is inconsistent with how other operators might handle stream parameters.
Suggested change:

```diff
-None,
+ctypes.c_void_p(0),
```
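For context on this suggestion: ctypes does convert Python `None` to a NULL pointer wherever the declared parameter type is `c_void_p`, so the existing call is functionally correct and the suggestion is mainly about explicitness. A quick standalone sketch of that conversion behavior (the callback signature here is illustrative, not the operator's real API):

```python
import ctypes

# An illustrative C-callable signature taking a void* "stream" argument.
STREAM_FN = ctypes.CFUNCTYPE(ctypes.c_bool, ctypes.c_void_p)

def receives_null(stream):
    # ctypes delivers a NULL void* argument to the callback as Python None
    return stream is None

fn = STREAM_FN(receives_null)
print(fn(None))                  # Python None is marshalled as NULL
print(fn(ctypes.c_void_p(0)))    # an explicit NULL wrapper behaves the same
```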
```python
# Check if tensors support in-place operations
a_supports_inplace = not is_broadcast(a_strides)
b_supports_inplace = not is_broadcast(b_strides)
```
Copilot AI commented on Jan 7, 2026:
Variable a_supports_inplace is not used.
Suggested change:

```diff
-# Check if tensors support in-place operations
-a_supports_inplace = not is_broadcast(a_strides)
-b_supports_inplace = not is_broadcast(b_strides)
+# Check if output tensor supports in-place operations
```
```python
a_supports_inplace = not is_broadcast(a_strides)
b_supports_inplace = not is_broadcast(b_strides)
```
Copilot AI commented on Jan 7, 2026:
Variable b_supports_inplace is not used.
Suggested change:

```diff
-a_supports_inplace = not is_broadcast(a_strides)
-b_supports_inplace = not is_broadcast(b_strides)
```
```diff
@@ -0,0 +1,185 @@
+import torch
+import ctypes
```
Copilot AI commented on Jan 7, 2026:
Module 'ctypes' is imported with both 'import' and 'import from'.
Suggested change:

```diff
-import ctypes
```
infiniop operator tests

CPU and GPU test results: screenshots attached in the original PR (image links not preserved in this capture).

infinicore tests

CPU and GPU test results: screenshots attached in the original PR (image links not preserved in this capture).