
MXFP4 Dequantizer in Zig

A std.io.Reader that takes MXFP4 bytes from a safetensors file and outputs a stream of dequantized float bytes. When it's done, the interface should look like:

var reader = TensorReader.init(&quantized_tensor);
defer reader.deinit();

var sample: [16]f32 = undefined;
const written = reader.read(std.mem.asBytes(&sample));

Steps to run:

1. Download gpt-oss weights

hf download openai/gpt-oss-20b

2. Update constants in main.zig

You'll need to update these hardcoded values:

Change this to the absolute path where you downloaded the model:

const SAFETENSORS_PATH =
    "PLACEHOLDER";

Choose the layer to inspect. See layer.zig for available LayerKind options.

const lyr: layer.Layer = .{
    .block_idx = 22,  // Which transformer block (0-based)
    .kind = .Mlp1WeightQuant,  // Which tensor to load
};

3. Compile and run

zig run main.zig

Key Terminology

A tensor is a multi-dimensional array. Ours contain model weights in FP4 format.

A layer is a semantic grouping of tensors; the term comes from neural network layers.

Splatting is broadcasting a single value (here, a block's scale) so it can be applied to a whole block of values at once.
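
For example (illustrative Zig, not this repository's code), Zig's @splat broadcasts one value into every lane of a vector:

const scale: f32 = 0.125; // one block's scale (made-up value)
const scale_vec: @Vector(8, f32) = @splat(scale); // all 8 lanes now hold 0.125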

Microscaling FP4 (abbreviated MXFP4) is a 4-bit quantization format in which every two FP4 values are packed into one byte and each block of 32 values shares one scale; the total number of values depends on the tensor's shape. This is the quantization format this program assumes.

Dequantization is the reverse of quantization, where we go from low precision -> high precision.

Quantization

Quantization is a way to "compress" trained AI model weights into smaller file sizes, by representing the higher precision weights in a lower precision data type.

Let's say our higher precision data type is FP16 and our lower precision format is MXFP4. In high precision, the data is a weight matrix, but it is stored linearly. So,

W = [
  [a, b, c, d],
  [e, f, g, h],
  [i, j, k, l],
  ...
]

is stored as

[a, b, c, d, e, f, g, h, i, j, k, l, ...]

The rows are side by side instead of on top of each other.

Our matrix is very large, so we split it into "blocks" of length 32 (as per MXFP4).

Our goal is to constrain all values in the block to a small interval [-a, a] centered around 0. To squeeze the block into those bounds, we divide every value by a scaling factor derived from the block (roughly, its largest magnitude divided by a). In quantization terminology, the bounds -a and a are usually written q_min and q_max.

So, for each block, we record that scaling value (the value each block element was divided by), so we know what to multiply by later to get back to approximately the original weights. This scale is stored alongside the block.
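
As a minimal sketch of that idea (illustrative Zig only, assuming a symmetric scheme where q_max is the largest magnitude the low-precision format can represent; not the code used to quantize gpt-oss):

// Pick a per-block scale so that every value in the block,
// once divided by it, lands inside [-q_max, q_max].
fn blockScale(block: []const f32, q_max: f32) f32 {
    var max_abs: f32 = 0;
    for (block) |w| max_abs = @max(max_abs, @abs(w));
    // Divide by this scale to quantize; multiply by it later to dequantize.
    return max_abs / q_max;
}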

Our model weights are stored as tensors in the Safetensors format, whose JSON header describes each tensor's dtype, shape, and byte offsets within the file.
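
For illustration, a safetensors file starts with an 8-byte little-endian length, followed by that many bytes of JSON mapping each tensor name to its metadata; data_offsets are byte offsets into the data section that follows the header (the tensor name and values below are placeholders):

{
  "<tensor name>": { "dtype": "U8", "shape": [ ... ], "data_offsets": [begin, end] },
  "__metadata__": { ... }
}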

What this program does

In this program, we read the Safetensors JSON header and then follow its offsets to the locations of the first tensor. Location is plural because scale values and block values are stored in separate tensors in the gpt-oss weights.

There, we decode the low precision values and then dequantize by multiplying the FP4 data by its block's scaling factor. Dequantization happens on the fly.

We repeat this process for all tensors.

This process is accelerated with Single Instruction, Multiple Data (SIMD), which uses vectors to perform many operations at once instead of one at a time. Per block (a sketch in Zig follows this list):

  • Instead of decoding a single nibble, we unpack 8 or 16 FP4 values from a few bytes.
  • We convert those nibbles into floats.
  • We broadcast the scale for that block into a SIMD vector.
  • We do a vector multiply to dequantize a whole block in one go, and then write the results back to memory.
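
A minimal sketch of that per-block path (illustrative Zig, not this repository's code; it assumes the standard FP4 E2M1 value table and a block scale that has already been decoded to f32):

// Map each 4-bit pattern to its FP4 (E2M1) value:
// nibbles 0-7 are the positive values, 8-15 the same magnitudes negated.
const FP4_LUT = [16]f32{
    0.0,  0.5,  1.0,  1.5,  2.0,  3.0,  4.0,  6.0,
    -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0,
};

// Dequantize one 32-value MXFP4 block: 16 packed bytes -> 32 floats.
fn dequantBlock(bytes: *const [16]u8, scale: f32, out: *[32]f32) void {
    var vals: [32]f32 = undefined;
    // Unpacking is written scalar-wise here for clarity; this step can also be vectorized.
    for (bytes, 0..) |b, i| {
        vals[2 * i] = FP4_LUT[b & 0x0F]; // low nibble
        vals[2 * i + 1] = FP4_LUT[b >> 4]; // high nibble
    }
    // Splat the block's scale into a vector and multiply 8 lanes at a time.
    const scale_vec: @Vector(8, f32) = @splat(scale);
    var i: usize = 0;
    while (i < 32) : (i += 8) {
        const v: @Vector(8, f32) = vals[i..][0..8].*;
        out[i..][0..8].* = v * scale_vec;
    }
}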

This program outputs the "decompressed" FP4 values to a stream in a higher precision form (still deciding between F32 and BF16).

References:

The above (codebase included) was written without AI.

TODO:

  • fp4 lookup table
  • get_safetensors_content without an allocator
  • use a fixed-capacity representation for shapes
  • make LayerMetadata not depend on heap-allocated []u64 shapes (use fixed [MAX_DIMS]u64 + rank).
