This project implements INT8 KV-cache quantization for transformer inference and benchmarks multiple CUDA kernel variants against a CPU baseline.
The core idea is per-channel (column-wise) symmetric linear quantization:
- Compute one scale per column: `s_d = max_t |K_{t,d}| / 127`
- Quantize each element: `q = clamp(round(K / s), -128, 127)`
- Dequantize: `K̃ = q * s`
This mirrors common inference-engine practice: keep KV compressed in memory, and reconstruct values when needed.
The CPU baseline provides:

- `compute_scales()`: OpenMP-parallel over columns (each thread scans one column)
- `quantize_matrix()` / `dequantize_matrix()`: serial reference loops (conservative baseline)
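A condensed sketch of these routines, assuming a row-major `T×D` float matrix (the actual signatures in `quant_cpu.c` may differ):

```c
#include <math.h>
#include <stdint.h>

/* One scale per column: s_d = max_t |K[t][d]| / 127. Parallel over columns. */
void compute_scales(const float *K, int T, int D, float *s) {
    #pragma omp parallel for
    for (int d = 0; d < D; ++d) {
        float m = 0.0f;
        for (int t = 0; t < T; ++t) {
            float a = fabsf(K[t * D + d]);
            if (a > m) m = a;
        }
        s[d] = (m > 0.0f) ? m / 127.0f : 1.0f;  /* guard all-zero columns (assumed policy) */
    }
}

/* Serial reference: q = clamp(round(K / s), -128, 127). */
void quantize_matrix(const float *K, int T, int D, const float *s, int8_t *q) {
    for (int t = 0; t < T; ++t)
        for (int d = 0; d < D; ++d) {
            float r = roundf(K[t * D + d] / s[d]);
            if (r < -128.0f) r = -128.0f;
            if (r >  127.0f) r =  127.0f;
            q[t * D + d] = (int8_t)r;
        }
}

/* Serial reference: K~ = q * s. */
void dequantize_matrix(const int8_t *q, int T, int D, const float *s, float *K_hat) {
    for (int t = 0; t < T; ++t)
        for (int d = 0; d < D; ++d)
            K_hat[t * D + d] = (float)q[t * D + d] * s[d];
}
```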
All GPU kernels treat quantization/dequantization as an embarrassingly parallel, memory-bound workload (one thread ≈ one element, or a small bundle of elements).
- **Naive**: each thread handles one `(row, col)` element; good coalescing by mapping `col` to the fast-changing x dimension.
- **Tiled (shared memory)**: loads a tile into shared memory, then quantizes from shared. In this workload there is little to no reuse, so tiling often helps minimally (or hurts).
- **Thread-coarsened**: each thread processes `COARSEN` columns of the same row, reducing per-thread overhead and increasing per-thread work.
- **Vectorized (`float4` / `char4`)**: each thread processes 4 values per load/store using vector types, improving effective bandwidth. Requires `D % 4 == 0`.
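For illustration, here is a sketch of the naive and vectorized quantize kernels; the names, launch configuration, and rounding details are assumptions, and the actual kernels live in `quant_gpu.cu`:

```cuda
#include <stdint.h>

// Naive: one thread per (row, col). col maps to threadIdx.x so consecutive
// threads touch consecutive addresses (coalesced loads/stores).
__global__ void quantize_naive(const float *K, const float *s,
                               int8_t *q, int T, int D) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < T && col < D) {
        float r = roundf(K[row * D + col] / s[col]);
        q[row * D + col] = (int8_t)fminf(fmaxf(r, -128.0f), 127.0f);
    }
}

// Vectorized: one thread per float4, i.e. four consecutive columns of one
// row per load/store. Requires D % 4 == 0.
__global__ void quantize_vec4(const float4 *K, const float *s,
                              char4 *q, int T, int D) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // float4 index
    if (i < T * (D / 4)) {
        int c = (i % (D / 4)) * 4;                  // first of this thread's 4 columns
        float4 v = K[i];
        char4 o;
        o.x = (signed char)fminf(fmaxf(roundf(v.x / s[c + 0]), -128.0f), 127.0f);
        o.y = (signed char)fminf(fmaxf(roundf(v.y / s[c + 1]), -128.0f), 127.0f);
        o.z = (signed char)fminf(fmaxf(roundf(v.z / s[c + 2]), -128.0f), 127.0f);
        o.w = (signed char)fminf(fmaxf(roundf(v.w / s[c + 3]), -128.0f), 127.0f);
        q[i] = o;
    }
}
```

The vectorized variant moves four elements per memory transaction (a 16-byte `float4` load and a 4-byte `char4` store), which is why it tends to win on this bandwidth-bound workload whenever the `D % 4 == 0` alignment requirement holds.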
The benchmark runs multiple matrix sizes (from small synthetic to “realistic LLM workload” shapes) and reports:
- quant time, dequant time, total
- speedup over CPU
- reconstruction error (L2, max-abs)
- “attention surrogate” error: mean |Q·K − Q·K̃| over rows (see the sketch below)
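A sketch of how such a surrogate can be computed (illustrative; the real metric lives in `matrix.c` and may differ, e.g., in how `Q` is drawn):

```c
#include <math.h>

/* Mean |Q.K_t - Q.K~_t| over rows t, for a query vector Q of length D.
 * K and K_hat are row-major T x D. */
double attn_surrogate_error(const float *Q, const float *K,
                            const float *K_hat, int T, int D) {
    double acc = 0.0;
    for (int t = 0; t < T; ++t) {
        double dot = 0.0, dot_hat = 0.0;
        for (int d = 0; d < D; ++d) {
            dot     += (double)Q[d] * (double)K[t * D + d];
            dot_hat += (double)Q[d] * (double)K_hat[t * D + d];
        }
        acc += fabs(dot - dot_hat);
    }
    return acc / (double)T;
}
```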
The driver cycles through 4 GPU modes: Naive, Tiled, Coarsened, Vectorized.
Timing approach (sketched below):
- 1 warmup iteration
- 3 timed iterations (average kernel milliseconds via CUDA events)
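In code, the measurement loop looks roughly like this (a sketch: `grid`, `block`, and the device pointers `dK`/`dS`/`dQ` are assumed to be set up elsewhere, and error checking is omitted):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

quantize_naive<<<grid, block>>>(dK, dS, dQ, T, D);   // 1 warmup iteration
cudaDeviceSynchronize();

cudaEventRecord(start);
for (int i = 0; i < 3; ++i)                          // 3 timed iterations
    quantize_naive<<<grid, block>>>(dK, dS, dQ, T, D);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);              // total over the 3 launches
printf("avg kernel time: %.3f ms\n", ms / 3.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```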
- NVIDIA GPU + CUDA toolkit (`nvcc`)
- OpenMP-capable host compiler (used for CPU scale computation)
Example (adjust filenames if needed):
```bash
nvcc -O3 -Xcompiler -fopenmp main.cu quant_gpu.cu quant_cpu.c matrix.c -o kv_quant_bench
```

The compiled binary `kv_quant_bench` automatically runs all stress-test cases described in the report, covering workloads from small synthetic tests to realistic large-context LLM scenarios. No additional arguments or configuration are required.
```bash
./kv_quant_bench
```

The benchmark evaluates performance across various KV-cache dimensions (T×D):
| Test Case | Dimensions (T×D) | Description |
|---|---|---|
| Trivial Correctness | | Minimal case for basic validation. |
| Small | 2048×128 | Baseline synthetic workload. |
| Medium | 16384×256 | Intermediate synthetic workload. |
| Large | 65536×256 | Large-context synthetic workload. |
| Very Large | 131072×256 | Extended-context synthetic workload. |
| Realistic Small LLM | 131072×1024 | Realistic hidden dimension for small models. |
| Realistic Medium LLM | 131072×2048 | Realistic hidden dimension for medium models. |
| Realistic Large LLM | 131072×4096 | Realistic hidden dimension for large models (e.g., Llama 2 70B). |
| Realistic V. Large LLM | 131072×8192 | Estimate for very large models (e.g., Claude, GPT-4 class). |
| Massive Attention | | Testing extreme sequence-length handling. |
```bash
nvcc -O3 -Xcompiler -fopenmp tests.cu quant_gpu.cu quant_cpu.c matrix.c -o unit_tests
./unit_tests
```

To ensure robustness and compliance with correctness requirements, `tests.cu` implements 25 distinct unit tests. These cover basic allocation, core logic, GPU kernel correctness, edge cases, pattern checks, and stress testing.
| Category | Count | Tests | Rationale/Description |
|---|---|---|---|
| Basic Allocations | 2 | `test_create_fp32`, `test_create_int8` | Verify that FP32 and INT8 matrix structs can be allocated, initialized with correct dimensions, and freed without errors. |
| Data Helpers | 2 | `test_fill_range`, `test_query_fill` | Ensure random number generators (matrix & vector) produce values strictly within specified min/max bounds. |
| Metric Identity | 3 | `test_l2_identical`, `test_max_abs_identical`, `test_attn_identical` | Sanity check: the error between a matrix and itself must be 0 (or within floating-point tolerance). |
| CPU Core Logic | 3 | `test_compute_scales_simple`, `test_cpu_quant_values`, `test_cpu_dequant_values` | White-box testing of the reference CPU implementation. Checks that specific known inputs (e.g., 63.5) produce the expected quantized integers (e.g., 64) and scales (see the sketch below). |
| GPU Correctness | 4 | `test_gpu_naive`, `test_gpu_tiled`, `test_gpu_coarsened`, `test_gpu_vectorized` | Cross-implementation checks: runs each GPU kernel variant on random data and verifies the output matches the CPU reference implementation within integer-rounding tolerance. |
| Edge Cases | 3 | `test_1x1_cpu`, `test_1x1_gpu_naive`, `test_1x4_gpu_vec` | Boundary testing: ensures the code handles minimal matrix sizes (1×1; 1×4 for the vectorized kernel). |
| Patterns | 3 | `test_all_zeros`, `test_all_ones`, `test_alternating` | Deterministic patterns: checks behavior on structured data (all 0s, max 127s, alternating signs). |
| Consistency | 3 | `test_consistency_naive_tiled`, `test_consistency_naive_coarsened`, `test_consistency_naive_vectorized` | Kernel-to-kernel validation: ensures that optimized kernels (Shared Mem, Coarsened, Vectorized) produce bit-exact (or near-exact) outputs compared to the simple Naive kernel. |
| Stress Tests | 2 | `test_stress_cpu_large`, `test_stress_gpu_large` | Scalability check: runs on larger matrices to verify correctness and stability at scale. |
Total Tests: 25
Executing `./unit_tests` runs all of the above and reports a PASS/FAIL status for each.
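As an illustration of the white-box checks: a column whose max magnitude is 127 gets a scale of exactly 127/127 = 1.0, so an input of 63.5 must quantize to round(63.5) = 64. A hypothetical test in that spirit, reusing the reference routines sketched earlier (the real assertions in `tests.cu` differ in structure):

```c
#include <assert.h>
#include <stdint.h>

void example_quant_value_check(void) {
    float K[2] = { 63.5f, -127.0f };  /* T = 2 rows, D = 1 column */
    float s[1];
    int8_t q[2];

    compute_scales(K, 2, 1, s);       /* max |K| = 127 -> s = 1.0 exactly */
    assert(s[0] == 1.0f);

    quantize_matrix(K, 2, 1, s, q);
    assert(q[0] == 64);               /* round(63.5 / 1.0) = 64 */
    assert(q[1] == -127);
}
```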
- GPU kernels achieve ~200× to ~1700× speedup vs CPU across tested sizes.
- Vectorized is consistently best (when alignment holds).
- Tiled provides minimal benefit, matching the “no reuse → tiling won’t help much” expectation.
| Test | KV Shape (T×D) | Best GPU Variant | Speedup |
|---|---|---|---|
| Small | 2048×128 | Vectorized | 211.97× |
| Medium | 16384×256 | Vectorized | 659.65× |
| Large | 65536×256 | Vectorized | 1175.89× |
| Very Large | 131072×256 | Vectorized | 1147.34× |
| Realistic Small LLM | 131072×1024 | Vectorized | 1494.63× |
| Realistic Medium LLM | 131072×2048 | Vectorized | 1600.12× |
| Realistic Large LLM | 131072×4096 | Vectorized | 1632.33× |
| Realistic V. Large LLM | 131072×8192 | Vectorized | 1694.08× |
Reconstruction and attention-surrogate errors match across CPU and GPU variants in the logs (functional equivalence), consistent with the report’s conclusion.
- Vectorized kernel requires `D % 4 == 0` (or you must pad).
- This benchmark measures kernel time only (CUDA events). If you care about end-to-end performance, you'd also measure:
  - H2D/D2H copies
  - allocations/frees in wrappers
  - interaction with the downstream attention computation
- Scale computation is CPU OpenMP-parallel over columns and is typically amortized over decoding steps (since scales can be reused until the KV cache changes).
- `main.cu`: benchmark harness + reporting
- `quant_gpu.cu`: CUDA kernels + wrappers (naive/tiled/coarsened/vectorized)
- `quant_cpu.c`: CPU reference implementation (`compute_scales`, `quantize_matrix`, `dequantize_matrix`)
- `matrix.c`/`matrix.h`: matrix allocation, random fill, error metrics, attention-surrogate error