
Conversation

@gongchensu (Collaborator)

infiniop operator tests
CPU: (screenshot)
GPU: (screenshot)

infinicore tests
CPU: (screenshot)
GPU: (screenshot)

@gongchensu gongchensu self-assigned this Dec 25, 2025
@gongchensu gongchensu linked an issue Dec 25, 2025 that may be closed by this pull request

Copilot AI left a comment


Pull request overview

This PR adds a fused add_rms_norm operator that combines element-wise addition with RMS normalization in a single operation. This fusion improves performance by reducing memory bandwidth requirements and kernel launch overhead, particularly beneficial for transformer architectures where residual connections and normalization are common patterns.

Key changes:

  • Implements the fused operator for both CPU and NVIDIA GPU backends with support for FP16, BF16, and FP32 data types
  • Adds comprehensive test coverage for both infiniop and infinicore layers with various tensor shapes and strides
  • Provides Python bindings and high-level API in infinicore with both out-of-place and in-place variants
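
To make the fused semantics concrete, here is a minimal NumPy sketch of what the operator computes (a hypothetical reference, not the PR's actual implementation; the function name and layout assumptions are illustrative):

```python
import numpy as np

def add_rms_norm_reference(a, b, weight, epsilon=1e-5):
    """Reference semantics of the fused operator: residual add, then RMSNorm."""
    residual = a + b  # element-wise residual addition (also returned as residual_out)
    # Root-mean-square over the last (hidden) dimension
    rms = np.sqrt(np.mean(residual ** 2, axis=-1, keepdims=True) + epsilon)
    normalized = residual / rms * weight  # weight is applied after normalization
    return normalized, residual
```

Fusing both steps into one kernel avoids writing `a + b` to global memory and reading it back for the normalization pass, which is where the bandwidth saving described above comes from.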

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 15 comments.

File Description
test/infiniop/libinfiniop/op_register.py Registers C API bindings for the new add_rms_norm operator
test/infiniop/add_rms_norm.py Comprehensive test suite for infiniop implementation with multiple tensor shapes, strides, and data types
test/infinicore/ops/add_rms_norm.py Framework-integrated test suite for infinicore with out-of-place and in-place operation tests
src/infiniop/ops/add_rms_norm/operator.cc Device dispatch logic for operator creation, execution, and destruction
src/infiniop/ops/add_rms_norm/nvidia/add_rms_norm_nvidia.cuh NVIDIA GPU implementation header
src/infiniop/ops/add_rms_norm/nvidia/add_rms_norm_nvidia.cu NVIDIA GPU kernel launcher supporting multiple data type combinations
src/infiniop/ops/add_rms_norm/info.h Validation and metadata management for operator parameters
src/infiniop/ops/add_rms_norm/cuda/kernel.cuh CUDA kernel implementation computing fused add and RMS normalization
src/infiniop/ops/add_rms_norm/cpu/add_rms_norm_cpu.h CPU implementation header
src/infiniop/ops/add_rms_norm/cpu/add_rms_norm_cpu.cc CPU implementation with OpenMP parallelization
src/infiniop/ops/add_rms_norm/add_rms_norm.h Descriptor template macro definition
src/infinicore/pybind11/ops/add_rms_norm.hpp Python binding definitions for both functional and in-place APIs
src/infinicore/pybind11/ops.hpp Registers add_rms_norm bindings in the ops module
src/infinicore/ops/add_rms_norm/add_rms_norm_infiniop.cc Infinicore implementation using infiniop backend with caching
src/infinicore/ops/add_rms_norm/add_rms_norm.cc Infinicore dispatcher and high-level API implementation
python/infinicore/ops/add_rms_norm.py Python wrapper providing functional and in-place interfaces
python/infinicore/init.py Exports add_rms_norm and add_rms_norm_ to public API
include/infiniop/ops/add_rms_norm.h Public C API header defining operator functions
include/infiniop.h Includes add_rms_norm header in main public API
include/infinicore/ops/add_rms_norm.hpp Public C++ API header for infinicore
include/infinicore/ops.hpp Includes add_rms_norm in ops header


Comment on lines +12 to +29
INFINIOP_CUDA_KERNEL add_rmsnormKernel(
Tdata *__restrict__ y,
Tdata *__restrict__ residual_out,
ptrdiff_t stride_y_batch,
ptrdiff_t stride_y_nhead,
ptrdiff_t stride_residual_out_batch,
ptrdiff_t stride_residual_out_nhead,
const Tdata *__restrict__ a,
ptrdiff_t stride_a_batch,
ptrdiff_t stride_a_nhead,
const Tdata *__restrict__ b,
ptrdiff_t stride_b_batch,
ptrdiff_t stride_b_nhead,
const Tweight *__restrict__ w,
size_t nhead,
size_t dim,
float epsilon) {
add_rmsnormBlock<BLOCK_SIZE, Tcompute>(

Copilot AI Jan 7, 2026


The kernel function is named add_rmsnormKernel but the device function is named add_rmsnormBlock. The naming is inconsistent: one uses a 'Kernel' suffix and the other a 'Block' suffix. Consider standardizing to either add_rms_norm_kernel and add_rms_norm_block (snake_case) or addRmsNormKernel and addRmsNormBlock (camelCase) for consistency with the codebase style.



def add_rms_norm_(y, residual_out, a, b, weight, epsilon=1e-5):
"""In-place Fused Add and RMS Normalization."""

Copilot AI Jan 7, 2026


The docstring for the in-place function add_rms_norm_ lacks parameter documentation. While the out-of-place version has detailed documentation for all parameters and return values, the in-place version only has a one-line description. Add parameter documentation (y, residual_out, a, b, weight, epsilon) for consistency and to help users understand the API.

Suggested change
    """In-place Fused Add and RMS Normalization."""
    """
    In-place Fused Add and RMS Normalization.

    Args:
        y: Output tensor for the normalized result (RMSNorm(a + b) * weight).
        residual_out: Output tensor for the residual result (a + b).
        a: First input tensor.
        b: Second input tensor.
        weight: Scale weights applied after RMS normalization.
        epsilon: Small constant for numerical stability, default is 1e-5.

    Returns:
        None. The results are written in-place into ``y`` and ``residual_out``.
    """

b = TestTensor(b_shape, b_stride, dtype, device, scale=0.01)
w = TestTensor(w_shape, None, w_dtype, device)

eps = 1e-6

Copilot AI Jan 7, 2026


The epsilon value differs between infiniop tests (1e-6) and infinicore tests (1e-5). This inconsistency could lead to different numerical behavior and test results between the two implementations. Consider using the same epsilon value across all test files for consistency, or document why different values are needed.

Suggested change
eps = 1e-6
eps = get_tolerance(dtype)

epsilon: Small constant for numerical stability, default is 1e-5

Returns:
Tuple of (normalized_result, add_result): (RMSNorm(a + b) * weight, a + b)

Copilot AI Jan 7, 2026


The function signature in the Python binding documentation states the return type as "Tuple of (normalized_result, add_result): (RMSNorm(a + b) * weight, a + b)", but the actual computation includes the weight multiplication in the normalized result. The documentation should clarify that the normalized result is (a + b) * weight / sqrt(mean((a+b)^2) + epsilon), not just RMSNorm(a + b) * weight, to avoid ambiguity about whether weight is inside or outside the normalization.

Suggested change
Tuple of (normalized_result, add_result): (RMSNorm(a + b) * weight, a + b)
Tuple of (normalized_result, add_result):
normalized_result = (a + b) * weight / sqrt(mean((a + b)^2) + epsilon)
add_result = a + b
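
A quick numeric check of the clarified formula, using made-up values (a pure-Python sketch, not code from the PR):

```python
import math

# Hypothetical single row: a = [1, 3], b = [1, 1], weight = [2, 2], epsilon = 0
row = [1 + 1, 3 + 1]                          # a + b = [2, 4]
mean_sq = sum(x * x for x in row) / len(row)  # mean((a + b)^2) = (4 + 16) / 2 = 10
rms = math.sqrt(mean_sq + 0.0)                # sqrt(mean((a + b)^2) + epsilon)
normalized = [x / rms * 2 for x in row]       # weight multiplies the normalized value
# normalized == [4 / sqrt(10), 8 / sqrt(10)]: weight sits outside the normalization
```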

b.data(),
w.data(),
residual_out.data(),
None,

Copilot AI Jan 7, 2026


The last parameter to infiniopAddRMSNorm is passed as Python's None, but the function signature expects a void *stream parameter. While this works because ctypes converts None to a NULL pointer, it would be clearer to pass ctypes.c_void_p(0) explicitly or handle the stream parameter properly, consistent with how other operators handle stream parameters.

Suggested change
None,
ctypes.c_void_p(0),

Comment on lines +68 to +70
# Check if tensors support in-place operations
a_supports_inplace = not is_broadcast(a_strides)
b_supports_inplace = not is_broadcast(b_strides)

Copilot AI Jan 7, 2026


Variable a_supports_inplace is not used.

Suggested change
# Check if tensors support in-place operations
a_supports_inplace = not is_broadcast(a_strides)
b_supports_inplace = not is_broadcast(b_strides)
# Check if output tensor supports in-place operations

Comment on lines +69 to +70
a_supports_inplace = not is_broadcast(a_strides)
b_supports_inplace = not is_broadcast(b_strides)

Copilot AI Jan 7, 2026


Variable b_supports_inplace is not used.

Suggested change
a_supports_inplace = not is_broadcast(a_strides)
b_supports_inplace = not is_broadcast(b_strides)

@@ -0,0 +1,185 @@
import torch
import ctypes

Copilot AI Jan 7, 2026


Module 'ctypes' is imported with both 'import' and 'import from'.

Suggested change
import ctypes

@whjthu whjthu merged commit 3b5afff into InfiniTensor:main Jan 7, 2026
16 checks passed


Development

Successfully merging this pull request may close these issues.

[DEV] Add RmsNorm fused operator

3 participants