Skip to content

Pranshu-Bahadur/4Bit-Forge

Repository files navigation

4Bit-Forge logo

This repo builds upon the solid foundation laid in MoE-Quant. Thank you for your amazing work!

Supports: MiniMax M2.1, DeepSeek V3.2, Qwen3, (Prolly more models) (technicallly kimi k2 as well, but need multigpu support to confirm / downcasting Hessians / dispatch solver kernel)

Runs GPTQ quantization on a Single A100...and its like supa fast...!

Unpacks model weights accordingly (fp8, uint8, packed int32, using checkpoint model's scales etc -> fp16/bf16) instead of neaively just setting weights to fp16/bf16.

Note: Still...WIP..

Todo(s):

  • pack uint8 <- int4?

  • save final model checkpoint for formatting for vllm compatibility

  • confirm frontend didnt break quantization...

  • test inference

  • fp8, fp16, bf16 kernel dispatcher (important for speed and efficieny.)

  • sparsegptq 2:4 vllm inference compatible format

    • build interblock Mask update sparsegptq 2:4 kernel
    • Test sparsegptq on qwen3 ("mvp")
  • metrics.py / stat view during compression run (?)

  • multigpu support

  • R&D

    • spargptq 1:8
    • Custom Sparse 1:8 Kernels for inference
    • Monkey patch with transformers/torch nn.Linear
    • Explore monkey patching with vllm
  • improve cpu offloading for the gpu poor

  • io.py + handle quantized layers

  • scaffold initial quant.py

  • engine.py

  • preprocess.py (openplatypus)

About

big llm become smol using CUDA

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors