This project implements INT8 KV-cache quantization for transformer inference and benchmarks multiple CUDA kernel variants against a CPU baseline.
The core idea is per-channel (column-wise) symmetric linear quantization:
- Compute one scale per column: `s_d = max_t |K_{t,d}| / 127`
- Quantize each element: `q = clamp(round(K / s), -128, 127)`
- Dequantize: `K̃ = q * s`
This mirrors common inference-engine practice: keep KV compressed in memory, and reconstruct values when needed.
The CPU baseline provides:

- `compute_scales()`: OpenMP-parallel over columns (each thread scans one column)
- `quantize_matrix()` / `dequantize_matrix()`: serial reference loops (conservative baseline)
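A condensed sketch of these routines, assuming a row-major `T×D` float matrix (the actual signatures in `quant_cpu.c` may differ):

```c
#include <math.h>
#include <stdint.h>

/* One scale per column: s_d = max_t |K[t][d]| / 127. Parallel over columns. */
void compute_scales(const float *K, int T, int D, float *s) {
    #pragma omp parallel for
    for (int d = 0; d < D; ++d) {
        float m = 0.0f;
        for (int t = 0; t < T; ++t) {
            float a = fabsf(K[t * D + d]);
            if (a > m) m = a;
        }
        s[d] = (m > 0.0f) ? m / 127.0f : 1.0f;  /* guard all-zero columns (assumed policy) */
    }
}

/* Serial reference: q = clamp(round(K / s), -128, 127). */
void quantize_matrix(const float *K, int T, int D, const float *s, int8_t *q) {
    for (int t = 0; t < T; ++t)
        for (int d = 0; d < D; ++d) {
            float r = roundf(K[t * D + d] / s[d]);
            if (r < -128.0f) r = -128.0f;
            if (r >  127.0f) r =  127.0f;
            q[t * D + d] = (int8_t)r;
        }
}

/* Serial reference: K~ = q * s. */
void dequantize_matrix(const int8_t *q, int T, int D, const float *s, float *K_hat) {
    for (int t = 0; t < T; ++t)
        for (int d = 0; d < D; ++d)
            K_hat[t * D + d] = (float)q[t * D + d] * s[d];
}
```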
All GPU kernels treat quantization/dequantization as an embarrassingly parallel, memory-bound workload (one thread ≈ one element, or a small bundle of elements).
- **Naive**: each thread handles one `(row, col)` element; good coalescing by mapping `col` to the fast-changing x dimension.
- **Tiled (shared memory)**: loads a tile into shared memory, then quantizes from shared. In this workload there is little to no reuse, so tiling often helps minimally (or hurts).
- **Thread-coarsened**: each thread processes `COARSEN` columns of the same row, reducing per-thread overhead and increasing per-thread work.
- **Vectorized (`float4` / `char4`)**: each thread processes 4 values per load/store using vector types, improving effective bandwidth. Requires `D % 4 == 0`.
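For illustration, here is a sketch of the naive and vectorized quantize kernels; the names, launch configuration, and rounding details are assumptions, and the actual kernels live in `quant_gpu.cu`:

```cuda
#include <stdint.h>

// Naive: one thread per (row, col). col maps to threadIdx.x so consecutive
// threads touch consecutive addresses (coalesced loads/stores).
__global__ void quantize_naive(const float *K, const float *s,
                               int8_t *q, int T, int D) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < T && col < D) {
        float r = roundf(K[row * D + col] / s[col]);
        q[row * D + col] = (int8_t)fminf(fmaxf(r, -128.0f), 127.0f);
    }
}

// Vectorized: one thread per float4, i.e. four consecutive columns of one
// row per load/store. Requires D % 4 == 0.
__global__ void quantize_vec4(const float4 *K, const float *s,
                              char4 *q, int T, int D) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // float4 index
    if (i < T * (D / 4)) {
        int c = (i % (D / 4)) * 4;                  // first of this thread's 4 columns
        float4 v = K[i];
        char4 o;
        o.x = (signed char)fminf(fmaxf(roundf(v.x / s[c + 0]), -128.0f), 127.0f);
        o.y = (signed char)fminf(fmaxf(roundf(v.y / s[c + 1]), -128.0f), 127.0f);
        o.z = (signed char)fminf(fmaxf(roundf(v.z / s[c + 2]), -128.0f), 127.0f);
        o.w = (signed char)fminf(fmaxf(roundf(v.w / s[c + 3]), -128.0f), 127.0f);
        q[i] = o;
    }
}
```

The vectorized variant moves four elements per memory transaction (a 16-byte `float4` load and a 4-byte `char4` store), which is why it tends to win on this bandwidth-bound workload whenever the `D % 4 == 0` alignment requirement holds.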
The benchmark runs multiple matrix sizes (from small synthetic to “realistic LLM workload” shapes) and reports:
- quant time, dequant time, total
- speedup over CPU
- reconstruction error (L2, max-abs)
- “attention surrogate” error: mean |Q·K − Q·K̃| over rows (see the sketch below)
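A sketch of how such a surrogate can be computed (illustrative; the real metric lives in `matrix.c` and may differ, e.g., in how `Q` is drawn):

```c
#include <math.h>

/* Mean |Q.K_t - Q.K~_t| over rows t, for a query vector Q of length D.
 * K and K_hat are row-major T x D. */
double attn_surrogate_error(const float *Q, const float *K,
                            const float *K_hat, int T, int D) {
    double acc = 0.0;
    for (int t = 0; t < T; ++t) {
        double dot = 0.0, dot_hat = 0.0;
        for (int d = 0; d < D; ++d) {
            dot     += (double)Q[d] * (double)K[t * D + d];
            dot_hat += (double)Q[d] * (double)K_hat[t * D + d];
        }
        acc += fabs(dot - dot_hat);
    }
    return acc / (double)T;
}
```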
The driver cycles through 4 GPU modes: Naive, Tiled, Coarsened, Vectorized.
Timing approach (sketched below):
- 1 warmup iteration
- 3 timed iterations (average kernel milliseconds via CUDA events)
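In code, the measurement loop looks roughly like this (a sketch: `grid`, `block`, and the device pointers `dK`/`dS`/`dQ` are assumed to be set up elsewhere, and error checking is omitted):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

quantize_naive<<<grid, block>>>(dK, dS, dQ, T, D);   // 1 warmup iteration
cudaDeviceSynchronize();

cudaEventRecord(start);
for (int i = 0; i < 3; ++i)                          // 3 timed iterations
    quantize_naive<<<grid, block>>>(dK, dS, dQ, T, D);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);              // total over the 3 launches
printf("avg kernel time: %.3f ms\n", ms / 3.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```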
- NVIDIA GPU + CUDA toolkit (`nvcc`)
- OpenMP-capable host compiler (used for CPU scale computation)
Example (adjust filenames if needed):
```bash
nvcc -O3 -Xcompiler -fopenmp main.cu quant_gpu.cu quant_cpu.c matrix.c -o kv_quant_bench
```

The compiled binary `kv_quant_bench` automatically runs all stress-test cases described in the report, covering workloads from small synthetic tests to realistic large-context LLM scenarios. No additional arguments or configuration are required.
```bash
./kv_quant_bench
```

The benchmark evaluates performance across various KV-cache dimensions (T×D):
| Test Case | Dimensions (T×D) | Description |
|---|---|---|
| Trivial Correctness | | Minimal case for basic validation. |
| Small | 2048×128 | Baseline synthetic workload. |
| Medium | 16384×256 | Intermediate synthetic workload. |
| Large | 65536×256 | Large-context synthetic workload. |
| Very Large | 131072×256 | Extended-context synthetic workload. |
| Realistic Small LLM | 131072×1024 | Realistic hidden dimension for small models. |
| Realistic Medium LLM | 131072×2048 | Realistic hidden dimension for medium models. |
| Realistic Large LLM | 131072×4096 | Realistic hidden dimension for large models (e.g., Llama 2 70B). |
| Realistic V. Large LLM | 131072×8192 | Estimate for very large models (e.g., Claude, GPT-4 class). |
| Massive Attention | | Testing extreme sequence-length handling. |
```bash
nvcc -O3 -Xcompiler -fopenmp tests.cu quant_gpu.cu quant_cpu.c matrix.c -o unit_tests
./unit_tests
```

To ensure robustness and compliance with correctness requirements, `tests.cu` implements 25 distinct unit tests. These cover basic allocation, core logic, GPU kernel correctness, edge cases, pattern checks, and stress testing.
| Category | Count | Tests | Rationale/Description |
|---|---|---|---|
| Basic Allocations | 2 | `test_create_fp32`, `test_create_int8` | Verify that FP32 and INT8 matrix structs can be allocated, initialized with correct dimensions, and freed without errors. |
| Data Helpers | 2 | `test_fill_range`, `test_query_fill` | Ensure random number generators (matrix & vector) produce values strictly within specified min/max bounds. |
| Metric Identity | 3 | `test_l2_identical`, `test_max_abs_identical`, `test_attn_identical` | Sanity check: the error between a matrix and itself must be 0 (or within floating-point tolerance). |
| CPU Core Logic | 3 | `test_compute_scales_simple`, `test_cpu_quant_values`, `test_cpu_dequant_values` | White-box testing of the reference CPU implementation. Checks that specific known inputs (e.g., 63.5) produce the expected quantized integers (e.g., 64) and scales (see the sketch below). |
| GPU Correctness | 4 | `test_gpu_naive`, `test_gpu_tiled`, `test_gpu_coarsened`, `test_gpu_vectorized` | Cross-implementation checks: runs each GPU kernel variant on random data and verifies the output matches the CPU reference implementation within integer-rounding tolerance. |
| Edge Cases | 3 | `test_1x1_cpu`, `test_1x1_gpu_naive`, `test_1x4_gpu_vec` | Boundary testing: ensures the code handles minimal matrix sizes (1×1; 1×4 for the vectorized kernel). |
| Patterns | 3 | `test_all_zeros`, `test_all_ones`, `test_alternating` | Deterministic patterns: checks behavior on structured data (all 0s, max 127s, alternating signs). |
| Consistency | 3 | `test_consistency_naive_tiled`, `test_consistency_naive_coarsened`, `test_consistency_naive_vectorized` | Kernel-to-kernel validation: ensures that optimized kernels (Shared Mem, Coarsened, Vectorized) produce bit-exact (or near-exact) outputs compared to the simple Naive kernel. |
| Stress Tests | 2 | `test_stress_cpu_large`, `test_stress_gpu_large` | Scalability check: runs on larger matrices to verify correctness and stability at scale. |
Total Tests: 25
Executing `./unit_tests` runs all of the above and reports a PASS/FAIL status for each.
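As an illustration of the white-box checks: a column whose max magnitude is 127 gets a scale of exactly 127/127 = 1.0, so an input of 63.5 must quantize to round(63.5) = 64. A hypothetical test in that spirit, reusing the reference routines sketched earlier (the real assertions in `tests.cu` differ in structure):

```c
#include <assert.h>
#include <stdint.h>

void example_quant_value_check(void) {
    float K[2] = { 63.5f, -127.0f };  /* T = 2 rows, D = 1 column */
    float s[1];
    int8_t q[2];

    compute_scales(K, 2, 1, s);       /* max |K| = 127 -> s = 1.0 exactly */
    assert(s[0] == 1.0f);

    quantize_matrix(K, 2, 1, s, q);
    assert(q[0] == 64);               /* round(63.5 / 1.0) = 64 */
    assert(q[1] == -127);
}
```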
- GPU kernels achieve ~200× to ~1700× speedup vs CPU across tested sizes.
- Vectorized is consistently best (when alignment holds).
- Tiled provides minimal benefit, matching the “no reuse → tiling won’t help much” expectation.
| Test | KV Shape (T×D) | Best GPU Variant | Speedup |
|---|---|---|---|
| Small | 2048×128 | Vectorized | 211.97× |
| Medium | 16384×256 | Vectorized | 659.65× |
| Large | 65536×256 | Vectorized | 1175.89× |
| Very Large | 131072×256 | Vectorized | 1147.34× |
| Realistic Small LLM | 131072×1024 | Vectorized | 1494.63× |
| Realistic Medium LLM | 131072×2048 | Vectorized | 1600.12× |
| Realistic Large LLM | 131072×4096 | Vectorized | 1632.33× |
| Realistic V. Large LLM | 131072×8192 | Vectorized | 1694.08× |
Reconstruction and attention-surrogate errors match across CPU and GPU variants in the logs (functional equivalence), consistent with the report’s conclusion.
- Vectorized kernel requires `D % 4 == 0` (or you must pad).
- This benchmark measures kernel time only (CUDA events). If you care about end-to-end performance, you'd also measure:
  - H2D/D2H copies
  - allocations/frees in wrappers
  - interaction with the downstream attention computation
- Scale computation is CPU OpenMP-parallel over columns and is typically amortized over decoding steps (since scales can be reused until the KV cache changes).
- `main.cu`: benchmark harness + reporting
- `quant_gpu.cu`: CUDA kernels + wrappers (naive/tiled/coarsened/vectorized)
- `quant_cpu.c`: CPU reference implementation (`compute_scales`, `quantize_matrix`, `dequantize_matrix`)
- `matrix.c`/`matrix.h`: matrix allocation, random fill, error metrics, attention-surrogate error