use fp16 for rope frequencies to reduce rounding errors #60
Conversation
Pull request overview
This PR changes the data type used to store RoPE (Rotary Position Embedding) frequencies in BF16 models from BF16 to FP16. The motivation is that sine/cosine values are bounded in [-1, 1], so BF16's extended exponent range is unnecessary, and FP16's additional mantissa bits reduce rounding error.
Changes:
- Modified frequency tensor allocation to use FP16 instead of BF16 when the model uses BF16
- Updated RoPE kernel templates to support mixed precision (BF16 I/O with FP16 frequencies); a sketch of this pattern follows the list below
- Updated function signatures across the codebase to reflect the new type pairing
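As a rough illustration of the mixed-precision pairing, here is a minimal RoPE kernel templated separately on the activation type and the frequency type. All names, the frequency layout ([position][head_size][cos, sin]), and the one-head-per-token indexing are assumptions for illustration, not the PR's actual code:

```cuda
#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Sketch: RoPE forward templated on the activation (I/O) type and the
// frequency type, so BF16 activations can be paired with FP16 frequencies.
template <typename TIo, typename TFreq>
__global__ void rope_forward_sketch(TIo* x, const TFreq* freqs,
                                    int n_tokens, int head_size) {
    const int half_hs = head_size / 2;
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per rotated pair
    if (idx >= n_tokens * half_hs) return;

    const int tok  = idx / half_hs;   // token index == position in this sketch
    const int pair = idx % half_hs;   // pair (2*pair, 2*pair + 1) within the head
    const int d    = 2 * pair;

    // Load the precomputed cos/sin in the frequency type, compute in FP32.
    const float c = static_cast<float>(freqs[(tok * head_size + d) * 2 + 0]);
    const float s = static_cast<float>(freqs[(tok * head_size + d) * 2 + 1]);

    const float x0 = static_cast<float>(x[tok * head_size + d + 0]);
    const float x1 = static_cast<float>(x[tok * head_size + d + 1]);

    // Standard rotary update, stored back in the I/O type.
    x[tok * head_size + d + 0] = static_cast<TIo>(x0 * c - x1 * s);
    x[tok * head_size + d + 1] = static_cast<TIo>(x0 * s + x1 * c);
}

// A BF16-activation / FP16-frequency instantiation would then look like:
//   rope_forward_sketch<__nv_bfloat16, half><<<grid, block>>>(x, freqs, n_tokens, head_size);
```

Since FP16 and BF16 are both 16 bits wide, swapping the frequency type changes only the template argument; buffer sizes and strides stay the same.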
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/models/llama_run_state.cpp | Allocates frequency tensors as FP16 for BF16 models; precomputes frequencies using FP16 |
| src/kernels/rope.cu | Adds FP16 precompute function; templates kernel to support different I/O and frequency types; updates function signatures |
| src/kernels/kernels.h | Updates function declarations to use half* for frequencies with BF16 I/O |
| src/kernels/kernels.cpp | Updates tensor dispatchers to extract frequencies as half when model is BF16 |
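The dtype-based dispatch described in the last two rows might look roughly like the following. The `Tensor`, `ETensorDType`, and `rope_forward_*` definitions here are self-contained stand-ins for illustration; the repository's real declarations will differ:

```cpp
#include <cuda_fp16.h>
#include <cuda_bf16.h>
#include <cuda_runtime.h>
#include <stdexcept>

// Minimal stand-ins so this sketch is self-contained.
enum class ETensorDType { BF16, FP16, FP32 };
struct Tensor { void* Data = nullptr; ETensorDType DType = ETensorDType::FP32; };

// Hypothetical typed entry points (declared only), one per supported pairing.
void rope_forward_bf16(__nv_bfloat16* out, const __nv_bfloat16* in, const half* freqs,
                       int n_tokens, int head_size, cudaStream_t stream);
void rope_forward_fp32(float* out, const float* in, const float* freqs,
                       int n_tokens, int head_size, cudaStream_t stream);

// Dispatcher sketch: a BF16 model reads its frequency tensor as half*, an FP32
// model as float*; anything else is rejected explicitly rather than ignored.
void rope_forward_dispatch_sketch(Tensor& out, const Tensor& in, const Tensor& freq,
                                  int n_tokens, int head_size, cudaStream_t stream) {
    if (in.DType == ETensorDType::BF16) {
        rope_forward_bf16(static_cast<__nv_bfloat16*>(out.Data),
                          static_cast<const __nv_bfloat16*>(in.Data),
                          static_cast<const half*>(freq.Data),
                          n_tokens, head_size, stream);
    } else if (in.DType == ETensorDType::FP32) {
        rope_forward_fp32(static_cast<float*>(out.Data),
                          static_cast<const float*>(in.Data),
                          static_cast<const float*>(freq.Data),
                          n_tokens, head_size, stream);
    } else {
        throw std::runtime_error("rope_forward: unsupported dtype");
    }
}
```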
Force-pushed from da1361c to ea66cd6.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
src/models/llama_run_state.cpp:84
- The function should handle unsupported dtypes. If Config.DType is set to an FP8 type or another unsupported dtype, the function will silently return an uninitialized tensor. Consider adding an else clause that throws an error for unsupported dtypes, similar to the pattern used in rope_forward and rope_backward functions.
```cpp
if(dtype == ETensorDType::FP16) {
    std::vector<half> freq_cpu(Config.MaxPositionEmbeddings * 2 * Config.head_size());
    precompute_freqs_cis(freq_cpu.data(), Config.head_size(), Config.MaxPositionEmbeddings, Config.RopeTheta);
    CUDA_CHECK(cudaMemcpy(freq.Data, freq_cpu.data(), freq_cpu.size() * sizeof(half), cudaMemcpyHostToDevice));
} else if (dtype == ETensorDType::FP32) {
    std::vector<float> freq_cpu(Config.MaxPositionEmbeddings * 2 * Config.head_size());
    precompute_freqs_cis(freq_cpu.data(), Config.head_size(), Config.MaxPositionEmbeddings, Config.RopeTheta);
    CUDA_CHECK(cudaMemcpy(freq.Data, freq_cpu.data(), freq_cpu.size() * sizeof(float), cudaMemcpyHostToDevice));
}
```
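For reference, a minimal host-side sketch of a templated `precompute_freqs_cis` compatible with the two call sites above, assuming an interleaved (cos, sin)-per-dimension layout of `max_pos * 2 * head_size` values; the repository's actual layout and pairing convention may differ:

```cpp
#include <cuda_fp16.h>
#include <cmath>

// Sketch: fill an interleaved (cos, sin) table for rotary embeddings.
// Layout assumption: out[(pos * head_size + d) * 2 + {0,1}] holds cos/sin for
// dimension d at position pos, i.e. max_pos * 2 * head_size values in total,
// matching the allocation in the snippet above. Angles are computed in FP32
// and rounded to T only on store, so an FP16 table keeps three more mantissa
// bits than a BF16 one would.
template <typename T>
void precompute_freqs_cis_sketch(T* out, int head_size, int max_pos, float theta) {
    for (int pos = 0; pos < max_pos; ++pos) {
        for (int d = 0; d < head_size; ++d) {
            const int i = d / 2;  // dims (2i, 2i+1) form one rotated pair
            const float freq  = std::pow(theta, -2.0f * i / head_size);
            const float angle = pos * freq;
            out[(pos * head_size + d) * 2 + 0] = static_cast<T>(std::cos(angle));
            out[(pos * head_size + d) * 2 + 1] = static_cast<T>(std::sin(angle));
        }
    }
}
```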
Due to the boundedness of sin/cos, we don't need the BF16 exponent range there and can instead invest those bits in precision. By keeping it 16-bit, no changes to the kernels are needed.
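To illustrate that precision argument, a minimal host-side sketch (not part of this PR) that rounds a bounded cosine value through both formats, assuming CUDA's `cuda_fp16.h` / `cuda_bf16.h` conversion helpers and compilation with nvcc:

```cpp
#include <cuda_fp16.h>
#include <cuda_bf16.h>
#include <cmath>
#include <cstdio>

// Round a bounded value (a cosine) through FP16 and BF16 and compare the
// rounding error. For magnitudes in [0.5, 1), FP16 spacing is 2^-11 versus
// 2^-8 for BF16, i.e. 8x finer resolution where sin/cos values live.
int main() {
    const float x = std::cos(0.3f);  // bounded in [-1, 1], like every RoPE table entry
    const float as_fp16 = __half2float(__float2half(x));
    const float as_bf16 = __bfloat162float(__float2bfloat16(x));
    std::printf("value:      %.8f\n", x);
    std::printf("fp16 error: %g\n", std::fabs(x - as_fp16));
    std::printf("bf16 error: %g\n", std::fabs(x - as_bf16));
    return 0;
}
```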