Custom CUDA megakernel for Qwen3-0.6B inference achieving 530 tok/s decode on RTX 3090 (3.9x faster than HuggingFace).
| Backend | Decode (tok/s) | Speedup |
|---|---|---|
| Megakernel | 531 | 3.9x |
| TensorRT-LLM | 355 | 2.6x |
| vLLM | 107 | 0.8x |
| SGLang | 107 | 0.8x |
| HuggingFace | 136 | 1.0x |
Note: Decode throughput depends on context length. At position 1: 525 tok/s, at position 200: 422 tok/s. See experiments/RESULTS.md for full benchmarks.
Credit where it's due: TensorRT-LLM, vLLM, SGLang, and other frameworks are excellently optimized for production workloads with dynamic shapes, variable batch sizes, and long contexts. This megakernel exploits several assumptions they intentionally don't make:
- Static shapes: All dimensions (hidden size, head count, MLP width) are compile-time constants. Production frameworks must handle arbitrary model architectures at runtime.
- Short-context bias: The benchmarks favor positions 1-100, where KV-cache overhead is minimal. At longer contexts, TensorRT-LLM's steady 355 tok/s beats the megakernel, which degrades to 158 tok/s.
- Single model, single GPU: No tensor parallelism, no continuous batching, no dynamic memory allocation. Real serving systems need all of these.
- Learning exercise: This project was built to understand GPU optimization, not to replace production inference engines.
The speedup is real, but it comes from exploiting a narrow regime (batch=1, short context, static shapes) where the texture cache (via `__ldg()`) provides massive benefits: weights stay in the read-only cache path while L1/L2 handles activations. Production frameworks can't make these assumptions.
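To make that concrete, here is a minimal sketch of the access pattern (illustrative, not the repo's actual kernel): immutable weights are read through `__ldg()` and take the read-only/texture cache path, while activations use ordinary loads served by L1/L2.

```cuda
#include <cuda_bf16.h>

// Illustrative sketch of the access pattern, not the repo's kernel:
// weights go through __ldg() (read-only/texture cache path),
// activations through ordinary loads (L1/L2).
__global__ void matvec_ldg(const __nv_bfloat16* __restrict__ W,  // [rows x cols]
                           const __nv_bfloat16* __restrict__ x,  // [cols]
                           float* __restrict__ y, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;
    for (int c = 0; c < cols; ++c) {
        float w = __bfloat162float(__ldg(&W[row * cols + c]));  // read-only cache path
        float a = __bfloat162float(x[c]);                       // normal L1/L2 path
        acc += w * a;
    }
    y[row] = acc;
}
```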
TL;DR: Use TensorRT-LLM or vLLM for production. Use this to learn how GPUs actually work.
A megakernel fuses an entire transformer block into a single CUDA kernel launch, eliminating kernel launch overhead and intermediate memory traffic. This implementation:
- Fuses RMSNorm, QKV projection, RoPE, attention, O projection, and MLP into one kernel
- Uses `__ldg()` for cached weight reads via the texture cache
- Employs cooperative groups for grid-wide synchronization (see the sketch after this list)
- Implements online softmax for memory-efficient attention
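Structurally, the fused block looks roughly like the sketch below (stage bodies elided, names invented for illustration). Each `grid.sync()` replaces what would otherwise be a kernel launch boundary, and the attention stage keeps a running max/sum in the online-softmax style instead of materializing the full score vector.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Structural sketch only: stage bodies are elided and names invented.
// One cooperative launch walks an entire transformer block; grid.sync()
// stands in for what would otherwise be separate kernel launches.
__global__ void transformer_block_megakernel(/* weights, activations, KV cache */) {
    cg::grid_group grid = cg::this_grid();

    // rmsnorm(...);           // each block normalizes a slice of the hidden state
    grid.sync();               // every block must see the normalized activations
    // qkv_projection(...);    // fused Q/K/V GEMV, weights read via __ldg()
    // rope(...);              // rotary embedding applied in registers
    grid.sync();
    // attention(...);         // online softmax over KV positions: keep a running
    //                         // max m and sum l, rescaling as new scores s arrive:
    //                         //   m' = max(m, s); l = l * exp(m - m') + exp(s - m')
    grid.sync();
    // o_projection(...);      // output projection + residual add
    grid.sync();
    // rmsnorm(...); mlp(...); // gate/up GEMV, SiLU, down GEMV, residual add
}

// The kernel must be launched with cudaLaunchCooperativeKernel for
// grid.sync() to be valid:
//   cudaLaunchCooperativeKernel((void*)transformer_block_megakernel,
//                               gridDim, blockDim, args, smemBytes, stream);
```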
Requirements:

- NVIDIA GPU with compute capability 8.6+ (e.g., RTX 3090)
- CUDA 11.8+
- Python 3.10+
Quick start:

```bash
git clone https://github.com/Infatoshi/MegaQwen.git
cd MegaQwen

# Create virtual environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
uv pip install transformers triton
```

Then chat, benchmark, or verify:

```bash
python chat.py
python benchmark_suite.py
python verify_correctness.py
```

After exhaustive optimization, we discovered the kernel is latency-bound by synchronization, not memory bandwidth:
- 5% memory bandwidth utilization (47 GB/s effective vs 936 GB/s peak)
- 140+ `grid.sync()` calls per token at ~0.7 µs each ≈ 100 µs of sync overhead
- ~530 tok/s is the architectural ceiling for batch=1 bf16 cooperative megakernels on RTX 3090
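The per-sync cost can be sanity-checked with a microbenchmark along these lines (illustrative; not part of the repo): a cooperative kernel that does nothing but grid-wide barriers.

```cuda
#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;

// Illustrative microbenchmark (not from the repo): time N back-to-back
// grid-wide barriers to estimate the per-sync latency.
__global__ void sync_bench(int iters) {
    cg::grid_group grid = cg::this_grid();
    for (int i = 0; i < iters; ++i)
        grid.sync();
}

int main() {
    int sms = 0;
    cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0);
    dim3 gridDim(sms), blockDim(256);  // one block per SM; cooperative launch needs all blocks resident
    int iters = 10000;
    void* args[] = { &iters };

    cudaEvent_t beg, end;
    cudaEventCreate(&beg); cudaEventCreate(&end);
    cudaEventRecord(beg);
    cudaLaunchCooperativeKernel((void*)sync_bench, gridDim, blockDim, args, 0, 0);
    cudaEventRecord(end);
    cudaEventSynchronize(end);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, beg, end);
    printf("~%.3f us per grid.sync()\n", 1000.0f * ms / iters);
    return 0;
}
```

At ~0.7 µs per sync, 140+ syncs cost roughly 100 µs per token, matching the figure above.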
What worked:

| Optimization | Impact |
|---|---|
| Block divergence + L2 prefetch | +2x |
| 128-bit vectorized loads | +3.5% |
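For the vectorized-loads row, the idea is to pull eight bf16 weights (16 bytes) per instruction by viewing the weight row as `uint4`. A minimal sketch, assuming 16-byte alignment and a width divisible by 8 (function name is invented):

```cuda
#include <cuda_bf16.h>

// Illustrative sketch of 128-bit vectorized weight loads: view the
// bf16 weight row as uint4 so each __ldg() issues one 16-byte load
// (LDG.E.128). Assumes 16-byte alignment and cols % 8 == 0.
__device__ float dot_row_vec128(const __nv_bfloat16* __restrict__ w_row,
                                const __nv_bfloat16* __restrict__ x,
                                int cols) {
    const uint4* w4 = reinterpret_cast<const uint4*>(w_row);
    float acc = 0.0f;
    for (int i = 0; i < cols / 8; ++i) {
        uint4 packed = __ldg(&w4[i]);  // 8 bf16 weights in one load
        const __nv_bfloat16* w = reinterpret_cast<const __nv_bfloat16*>(&packed);
        #pragma unroll
        for (int j = 0; j < 8; ++j)
            acc += __bfloat162float(w[j]) * __bfloat162float(x[8 * i + j]);
    }
    return acc;
}
```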
What didn't:

| Optimization | Impact | Why |
|---|---|---|
| Warp producer/consumer split | 0% | Reduces compute parallelism |
| Shared memory caching | 0% | L1/L2 already effective |
| cp.async double-buffering | +1% | Can't overlap enough compute |
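For context on the `cp.async` row, the shape that was tried looks roughly like the double-buffered loop below (a sketch using the cooperative-groups `memcpy_async` API, not the repo's code). At batch=1 each staged tile feeds only a handful of FMAs, so there is too little compute to hide the copies behind, which is consistent with the ~1% result.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch of cp.async double-buffering (via cg::memcpy_async), not the
// repo's code: stage tile t+1 into shared memory while computing on
// tile t. Assumes cols % TILE == 0 and y zero-initialized.
template <int TILE>
__global__ void gemv_double_buffered(const float* __restrict__ W,
                                     const float* __restrict__ x,
                                     float* __restrict__ y, int cols) {
    __shared__ float buf[2][TILE];
    cg::thread_block block = cg::this_thread_block();
    int tiles = cols / TILE;

    int stage = 0;
    cg::memcpy_async(block, buf[stage], x, sizeof(float) * TILE);
    float acc = 0.0f;
    for (int t = 0; t < tiles; ++t) {
        if (t + 1 < tiles) {
            // prefetch the next tile into the other buffer
            cg::memcpy_async(block, buf[stage ^ 1], x + (t + 1) * TILE,
                             sizeof(float) * TILE);
            cg::wait_prior<1>(block);  // current tile done; prefetch stays in flight
        } else {
            cg::wait(block);           // last tile: drain all copies
        }
        for (int c = threadIdx.x; c < TILE; c += blockDim.x)
            acc += W[blockIdx.x * cols + t * TILE + c] * buf[stage][c];
        block.sync();                  // finish reading before the buffer is reused
        stage ^= 1;
    }
    atomicAdd(&y[blockIdx.x], acc);    // crude block-wide reduction for brevity
}
```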
See DEVLOG.md for the complete optimization journey.
Documentation:

- Development Log - Complete optimization journey and learnings
- Benchmark Results - Full benchmark data
- Memory Analysis - Bandwidth and SASS analysis
- Architecture - Kernel architecture details
- Specification - Technical specification
License: MIT