Custom CUDA megakernel for Qwen3-0.6B inference achieving 530 tok/s decode on RTX 3090 (3.9x faster than HuggingFace).
| Backend | Decode (tok/s) | Speedup |
|---|---|---|
| Megakernel | 531 | 3.9x |
| TensorRT-LLM | 355 | 2.6x |
| vLLM | 107 | 0.8x |
| SGLang | 107 | 0.8x |
| HuggingFace | 136 | 1.0x |
Note: Decode throughput depends on context length. At position 1: 525 tok/s, at position 200: 422 tok/s. See experiments/RESULTS.md for full benchmarks.
Credit where it's due: TensorRT-LLM, vLLM, SGLang, and other frameworks are excellently optimized for production workloads with dynamic shapes, variable batch sizes, and long contexts. This megakernel exploits several assumptions they intentionally don't make:
- Static shapes: All dimensions (hidden size, head count, MLP width) are compile-time constants. Production frameworks must handle arbitrary model architectures at runtime.
- Short-context bias: The benchmarks favor positions 1-100, where KV-cache overhead is minimal. At longer contexts, TensorRT-LLM's steady 355 tok/s beats the megakernel, which degrades to 158 tok/s.
- Single model, single GPU: No tensor parallelism, no continuous batching, no dynamic memory allocation. Real serving systems need all of these.
- Learning exercise: This project was built to understand GPU optimization, not to replace production inference engines.
The speedup is real, but it comes from exploiting a narrow regime (batch=1, short context, static shapes) where the texture cache (via `__ldg()`) provides massive benefits: weights stay in the read-only cache path while L1/L2 handles activations. Production frameworks can't make these assumptions.
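To make that concrete, here is a minimal sketch of the access pattern (illustrative, not the repo's actual kernel): immutable weights are read through `__ldg()` and take the read-only/texture cache path, while activations use ordinary loads served by L1/L2.

```cuda
#include <cuda_bf16.h>

// Illustrative sketch of the access pattern, not the repo's kernel:
// weights go through __ldg() (read-only/texture cache path),
// activations through ordinary loads (L1/L2).
__global__ void matvec_ldg(const __nv_bfloat16* __restrict__ W,  // [rows x cols]
                           const __nv_bfloat16* __restrict__ x,  // [cols]
                           float* __restrict__ y, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;
    for (int c = 0; c < cols; ++c) {
        float w = __bfloat162float(__ldg(&W[row * cols + c]));  // read-only cache path
        float a = __bfloat162float(x[c]);                       // normal L1/L2 path
        acc += w * a;
    }
    y[row] = acc;
}
```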
TL;DR: Use TensorRT-LLM or vLLM for production. Use this to learn how GPUs actually work.
A megakernel fuses an entire transformer block into a single CUDA kernel launch, eliminating kernel launch overhead and intermediate memory traffic. This implementation:
- Fuses RMSNorm, QKV projection, RoPE, attention, O projection, and MLP into one kernel
- Uses `__ldg()` for cached weight reads via the texture cache
- Employs cooperative groups for grid-wide synchronization (see the sketch after this list)
- Implements online softmax for memory-efficient attention
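Structurally, the fused block looks roughly like the sketch below (stage bodies elided, names invented for illustration). Each `grid.sync()` replaces what would otherwise be a kernel launch boundary, and the attention stage keeps a running max/sum in the online-softmax style instead of materializing the full score vector.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Structural sketch only: stage bodies are elided and names invented.
// One cooperative launch walks an entire transformer block; grid.sync()
// stands in for what would otherwise be separate kernel launches.
__global__ void transformer_block_megakernel(/* weights, activations, KV cache */) {
    cg::grid_group grid = cg::this_grid();

    // rmsnorm(...);           // each block normalizes a slice of the hidden state
    grid.sync();               // every block must see the normalized activations
    // qkv_projection(...);    // fused Q/K/V GEMV, weights read via __ldg()
    // rope(...);              // rotary embedding applied in registers
    grid.sync();
    // attention(...);         // online softmax over KV positions: keep a running
    //                         // max m and sum l, rescaling as new scores s arrive:
    //                         //   m' = max(m, s); l = l * exp(m - m') + exp(s - m')
    grid.sync();
    // o_projection(...);      // output projection + residual add
    grid.sync();
    // rmsnorm(...); mlp(...); // gate/up GEMV, SiLU, down GEMV, residual add
}

// The kernel must be launched with cudaLaunchCooperativeKernel for
// grid.sync() to be valid:
//   cudaLaunchCooperativeKernel((void*)transformer_block_megakernel,
//                               gridDim, blockDim, args, smemBytes, stream);
```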
Requirements:

- NVIDIA GPU with compute capability 8.6+ (e.g., RTX 3090)
- CUDA 11.8+
- Python 3.10+
Quick start:

```bash
git clone https://github.com/Infatoshi/MegaQwen.git
cd MegaQwen

# Create virtual environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
uv pip install transformers triton
```

Then chat, benchmark, or verify:

```bash
python chat.py
python benchmark_suite.py
python verify_correctness.py
```

After exhaustive optimization, we discovered the kernel is latency-bound by synchronization, not memory bandwidth:
- 5% memory bandwidth utilization (47 GB/s effective vs 936 GB/s peak)
- 140+ `grid.sync()` calls per token at ~0.7 µs each ≈ 100 µs of sync overhead
- ~530 tok/s is the architectural ceiling for batch=1 bf16 cooperative megakernels on RTX 3090
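The per-sync cost can be sanity-checked with a microbenchmark along these lines (illustrative; not part of the repo): a cooperative kernel that does nothing but grid-wide barriers.

```cuda
#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;

// Illustrative microbenchmark (not from the repo): time N back-to-back
// grid-wide barriers to estimate the per-sync latency.
__global__ void sync_bench(int iters) {
    cg::grid_group grid = cg::this_grid();
    for (int i = 0; i < iters; ++i)
        grid.sync();
}

int main() {
    int sms = 0;
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0);
    dim3 gridDim(sms), blockDim(256);  // one block per SM; cooperative launch needs all blocks resident
    int iters = 10000;
    void* args[] = { &iters };

    cudaEvent_t beg, end;
    cudaEventCreate(&beg); cudaEventCreate(&end);
    cudaEventRecord(beg);
    cudaLaunchCooperativeKernel((void*)sync_bench, gridDim, blockDim, args, 0, 0);
    cudaEventRecord(end);
    cudaEventSynchronize(end);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, beg, end);
    printf("~%.3f us per grid.sync()\n", 1000.0f * ms / iters);
    return 0;
}
```

At ~0.7 µs per sync, 140+ syncs cost roughly 100 µs per token, matching the figure above.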
What worked:

| Optimization | Impact |
|---|---|
| Block divergence + L2 prefetch | +2x |
| 128-bit vectorized loads | +3.5% |
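For the vectorized-loads row, the idea is to pull eight bf16 weights (16 bytes) per instruction by viewing the weight row as `uint4`. A minimal sketch, assuming 16-byte alignment and a width divisible by 8 (function name is invented):

```cuda
#include <cuda_bf16.h>

// Illustrative sketch of 128-bit vectorized weight loads: view the
// bf16 weight row as uint4 so each __ldg() issues one 16-byte load
// (LDG.E.128). Assumes 16-byte alignment and cols % 8 == 0.
__device__ float dot_row_vec128(const __nv_bfloat16* __restrict__ w_row,
                                const __nv_bfloat16* __restrict__ x,
                                int cols) {
    const uint4* w4 = reinterpret_cast<const uint4*>(w_row);
    float acc = 0.0f;
    for (int i = 0; i < cols / 8; ++i) {
        uint4 packed = __ldg(&w4[i]);  // 8 bf16 weights in one load
        const __nv_bfloat16* w = reinterpret_cast<const __nv_bfloat16*>(&packed);
        #pragma unroll
        for (int j = 0; j < 8; ++j)
            acc += __bfloat162float(w[j]) * __bfloat162float(x[8 * i + j]);
    }
    return acc;
}
```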
What didn't:

| Optimization | Impact | Why |
|---|---|---|
| Warp producer/consumer split | 0% | Reduces compute parallelism |
| Shared memory caching | 0% | L1/L2 already effective |
| cp.async double-buffering | +1% | Can't overlap enough compute |
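For context on the `cp.async` row, the shape that was tried looks roughly like the double-buffered loop below (a sketch using the cooperative-groups `memcpy_async` API, not the repo's code). At batch=1 each staged tile feeds only a handful of FMAs, so there is too little compute to hide the copies behind, which is consistent with the ~1% result.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch of cp.async double-buffering (via cg::memcpy_async), not the
// repo's code: stage tile t+1 into shared memory while computing on
// tile t. Assumes cols % TILE == 0 and y zero-initialized.
template <int TILE>
__global__ void gemv_double_buffered(const float* __restrict__ W,
                                     const float* __restrict__ x,
                                     float* __restrict__ y, int cols) {
    __shared__ float buf[2][TILE];
    cg::thread_block block = cg::this_thread_block();
    int tiles = cols / TILE;

    int stage = 0;
    cg::memcpy_async(block, buf[stage], x, sizeof(float) * TILE);
    float acc = 0.0f;
    for (int t = 0; t < tiles; ++t) {
        if (t + 1 < tiles) {
            // prefetch the next tile into the other buffer
            cg::memcpy_async(block, buf[stage ^ 1], x + (t + 1) * TILE,
                             sizeof(float) * TILE);
            cg::wait_prior<1>(block);  // current tile done; prefetch stays in flight
        } else {
            cg::wait(block);           // last tile: drain all copies
        }
        for (int c = threadIdx.x; c < TILE; c += blockDim.x)
            acc += W[blockIdx.x * cols + t * TILE + c] * buf[stage][c];
        block.sync();                  // finish reading before the buffer is reused
        stage ^= 1;
    }
    atomicAdd(&y[blockIdx.x], acc);    // crude block-wide reduction for brevity
}
```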
See DEVLOG.md for the complete optimization journey.
Documentation:

- Development Log - Complete optimization journey and learnings
- Benchmark Results - Full benchmark data
- Memory Analysis - Bandwidth and SASS analysis
- Architecture - Kernel architecture details
- Specification - Technical specification
License: MIT