pegainfer

Pure Rust + CUDA LLM inference engine. No PyTorch. No frameworks. Just metal.

Quickstart · Architecture · Performance · API


What is this?

pegainfer is a from-scratch LLM inference engine written in Rust with hand-written CUDA kernels. It currently runs Qwen3-4B at ~70 tokens/sec on a single GPU.

The goal is not to replace vLLM or TensorRT-LLM — it's to understand every layer of the inference stack by building it from the ground up, and to explore what a Rust-native inference engine can look like.

What's implemented:

  • Full Qwen3 transformer: GQA, RoPE, SwiGLU MLP, RMSNorm
  • 11 custom CUDA kernels + cuBLAS GEMV
  • BF16 storage, FP32 accumulators
  • KV cache with tiled fused attention (online softmax, TILE_SIZE=64; see the sketch after this list)
  • OpenAI-compatible /v1/completions HTTP API
  • Safetensors weight loading, HuggingFace tokenizer
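
A minimal CPU sketch of the online-softmax recurrence the tiled attention kernel relies on. Scalar values stand in for head_dim-wide vectors here, and the real kernel in csrc/fused_attention.cu works on BF16 tiles in shared memory, so names and layout are illustrative only:

const TILE_SIZE: usize = 64;

/// Attention for one query, streaming over score/value pairs one tile at a
/// time without ever materializing the full softmax.
fn online_softmax_attention(scores: &[f32], values: &[f32]) -> f32 {
    assert_eq!(scores.len(), values.len());
    let mut m = f32::NEG_INFINITY; // running maximum of all scores seen
    let mut d = 0.0f32;            // running softmax denominator
    let mut acc = 0.0f32;          // running weighted sum of values

    for (s_tile, v_tile) in scores.chunks(TILE_SIZE).zip(values.chunks(TILE_SIZE)) {
        let tile_max = s_tile.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let new_m = m.max(tile_max);
        let rescale = (m - new_m).exp(); // shrink old partial sums if the max grew
        d *= rescale;
        acc *= rescale;
        for (&s, &v) in s_tile.iter().zip(v_tile) {
            let w = (s - new_m).exp();
            d += w;
            acc += w * v;
        }
        m = new_m;
    }
    acc / d
}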

What's not (yet):

  • Batching, PagedAttention, streaming (SSE)
  • FlashAttention-level kernel optimization
  • Multi-GPU / tensor parallelism
  • Quantization (INT8/INT4)

Quickstart

Prerequisites

  • Rust (2024 edition)
  • CUDA Toolkit (nvcc, cuBLAS)
  • A CUDA-capable GPU
  • Qwen3-4B model weights in models/Qwen3-4B/

Build & Run

export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Build (compiles CUDA kernels via build.rs)
cargo build --release

# Run inference server on port 8000
cargo run --release

# Disable CUDA Graph capture/replay on decode path
cargo run --release -- --cuda-graph=false

# Run tests
cargo test --release

Download Model

# Using huggingface-cli
huggingface-cli download Qwen/Qwen3-4B --local-dir models/Qwen3-4B

API

OpenAI-compatible completions endpoint:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The capital of France is", "max_tokens": 32}'

Architecture

Tokenize → Embedding → 28× TransformerBlock → RMSNorm → LM Head → Argmax
                              │
                              ├── RMSNorm → Fused GQA Attention → Residual
                              └── RMSNorm → Fused SwiGLU MLP    → Residual
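
For reference, hedged CPU sketches of two building blocks the diagram names, RMSNorm and the SwiGLU gate. The epsilon value and the exact fusion layout are assumptions; the GPU versions live in csrc/kernels.cu:

/// RMSNorm: x_i * w_i / sqrt(mean(x^2) + eps)
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = (mean_sq + eps).sqrt().recip();
    x.iter().zip(weight).map(|(v, w)| v * inv * w).collect()
}

/// SwiGLU gating: silu(gate) * up, where silu(x) = x * sigmoid(x).
fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter()
        .zip(up)
        .map(|(&g, &u)| (g / (1.0 + (-g).exp())) * u)
        .collect()
}
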
src/
├── main.rs           # HTTP server (axum)
├── model.rs          # Qwen3Model, Attention, MLP, TransformerBlock
├── tensor.rs         # DeviceVec, DeviceMatrix — GPU tensor types
├── ops.rs            # GPU operators (linear, rms_norm, rope, fused_mlp, fused_attention)
├── kv_cache.rs       # KV cache for autoregressive generation
├── weight_loader.rs  # Safetensors loading + RoPE precomputation
├── ffi.rs            # FFI bindings to CUDA kernels
├── qwen3_config.rs   # Model config parsing
├── tokenizer.rs      # HuggingFace tokenizers wrapper
└── trace_reporter.rs # Chrome Trace JSON profiling output

csrc/
├── kernels.cu          # RMSNorm, RoPE, SiLU, embedding, GEMV, fused MLP, sampling
├── fused_attention.cu  # Fused GQA attention with tiled online softmax
└── common.cuh          # Shared CUDA utilities
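
The bindings in ffi.rs follow the usual pattern of declaring the kernel launchers that build.rs compiles from csrc/ as extern "C" functions. The symbol and signature below are hypothetical, purely to show the shape (Rust 2024 edition requires the unsafe keyword on extern blocks):

unsafe extern "C" {
    /// Hypothetical launcher for the RMSNorm kernel: normalizes `n` BF16
    /// values from `input` into `output`, scaling by `weight`.
    fn rms_norm_launch(
        output: *mut u16, // raw bf16 bit patterns
        input: *const u16,
        weight: *const u16,
        n: i32,
        eps: f32,
    );
}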

Key design decisions

  • All computation on GPU — no CPU fallback, no hybrid execution
  • Custom CUDA kernels for everything except matrix multiplication (cuBLAS)
  • Fused operators — attention and MLP are each a single kernel launch
  • BF16 storage, FP32 accumulation — numerical stability without memory overhead (see the sketch after this list)
  • Synchronous execution — simple and debuggable, no overlap optimization yet
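
The BF16/FP32 split looks like this in miniature. This is a CPU illustration using the half crate, which is an assumption; on the GPU the same widening happens inside the kernels:

use half::bf16;

/// Dot product over bf16 storage with an f32 accumulator: each operand is
/// widened to f32 before multiply-accumulate, so long reductions keep precision.
fn dot_bf16(a: &[bf16], b: &[bf16]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| x.to_f32() * y.to_f32())
        .sum::<f32>()
}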

Performance

Measured on RTX 5070 Ti, Qwen3-4B, BF16:

Metric                 Value
TTFT (prompt_len=4)    ~17 ms
TPOT                   ~14 ms/token
Throughput             ~70 tokens/sec
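
The numbers are self-consistent: at ~14 ms per output token, steady-state throughput works out to roughly 1000 / 14 ≈ 70 tokens/sec.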

Profiling traces are written to traces/ in Chrome Trace JSON format — open with Perfetto UI.

License

MIT
