
[DRAFT][quantization] Full quantization of LLama compatible models #436

Draft: stamalakhov wants to merge 1 commit into Samsung:main from stamalakhov:full_quantization_br

Conversation

stamalakhov (Contributor) commented on Jan 13, 2026:

This draft tries to produce fully quantized Circle layers for the Llama model.

TODO:

  • tests/cleanup

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

stamalakhov self-assigned this on Jan 13, 2026.
stamalakhov force-pushed the full_quantization_br branch 9 times, most recently from 0253cb9 to 5201525 on January 20, 2026 at 11:37.
gptq[name] = GPTQ(subset[name])
gptq[name].quantizer.configure(
-    bits=8, perchannel=True, sym=False, mse=False
+    bits=4, perchannel=True, sym=False, mse=False
mhs4670go (Contributor) commented:

FYI, you can add an option for this in this PR.

stamalakhov (Contributor, Author) replied:

> FYI, you can add an option for this in this PR.

@mhs4670go
Thank you. I'll rebase after merging of #441.
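
A rough sketch of what such an option could look like, reusing the GPTQ call from the diff above; the helper name `configure_gptq` and the `weight_bits` parameter are hypothetical, not the PR's actual API:

```python
def configure_gptq(gptq, subset, weight_bits: int = 4):
    # Build a GPTQ instance per layer and expose the weight bit-width as an
    # argument instead of hardcoding bits=8 or bits=4.
    # (GPTQ here refers to the class used in the PR's diff context.)
    for name in subset:
        gptq[name] = GPTQ(subset[name])
        gptq[name].quantizer.configure(
            bits=weight_bits, perchannel=True, sym=False, mse=False
        )
    return gptq
```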

mhs4670go (Contributor) commented Jan 21, 2026:

Could you explain the reasons for the observer-related changes?

  1. Deleting some attributes and registering them as buffers.
  2. Changing ObserverBase's parent from ABC to torch.nn.Module (and MinMaxObserver likewise).

stamalakhov (Contributor, Author) replied Jan 21, 2026:

> 1. Deleting some attributes and registering them as buffers.

Ahh. It turned out that model.to("cuda") or model.to("cpu") did not transfer scales and zero_points to the GPU/CPU because they were not registered as buffers or parameters; that is why they are now registered as buffers. Deleting the attributes first is needed because otherwise torch fails to register the already-existing attributes as buffers.

> 2. Changing ObserverBase's parent from ABC to torch.nn.Module

It enables buffer registration and the correct automatic transfer of scales/zero_points between CPU and GPU. The same approach is used in gptq/quant.py for the same reason (I suppose); please see:

class Quantizer(nn.Module):

This draft is just a PoC (quick and dirty).
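
For illustration, a minimal sketch (not the PR's actual observer code; the class name is hypothetical) of why inheriting from torch.nn.Module and registering scale/zero_point as buffers matters: buffers follow model.to("cuda") / model.to("cpu"), while plain attributes do not.

```python
import torch


class MinMaxObserverSketch(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Registered buffers move together with the module on .to()/.cuda()/.cpu()
        # and appear in state_dict(), unlike plain Python attributes.
        # In the PR's observers, pre-existing plain attributes reportedly had to
        # be deleted first, since register_buffer() raises an error when an
        # ordinary attribute with the same name already exists.
        self.register_buffer("scale", torch.tensor(1.0))
        self.register_buffer("zero_point", torch.tensor(0, dtype=torch.int32))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Recompute an asymmetric uint8 scale/zero_point from the observed range.
        qmin, qmax = 0, 255
        min_val, max_val = x.min(), x.max()
        self.scale = (max_val - min_val).clamp(min=1e-8) / (qmax - qmin)
        self.zero_point = (qmin - torch.round(min_val / self.scale)).to(torch.int32)
        return x


obs = MinMaxObserverSketch()
obs(torch.randn(4, 8))
# scale and zero_point now follow the module across devices:
obs.to("cuda" if torch.cuda.is_available() else "cpu")
```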

mhs4670go (Contributor) replied:

Ah, thanks for the clarification. I'll reconsider those and apply the changes soon.

stamalakhov (Contributor, Author) replied:

> Ah, thanks for the clarification. I'll reconsider those and apply the changes soon.

OK. Thank you very much.

continue
if (
    dq.target
    != torch.ops.circle_custom.dequantize_mx_to_float.default
mhs4670go (Contributor) commented:

It seems that just quantize_mx and dequantize_mx would be simpler. Is there some consideration behind exposing the dtypes in the name?

stamalakhov (Contributor, Author) replied:

There is no fake_quantize for MX types (just circle_custom::quantize_mx). So quantize_float_to_mx is an attempt (maybe a failed one) to distinguish it from quantize_mx. If circle_custom::quantize_mx later becomes circle_custom::fakequantize_mx, then the usual quantize/dequantize naming scheme applies.

stamalakhov (Contributor, Author) added:

@mhs4670go
It can be renamed to any other (more appropriate) name.

mhs4670go (Contributor) replied Jan 21, 2026:

Ah, I see. How about going with quantize_mx_decomposed and dequantize_mx_decomposed? That aligns with torch.ops.quantized_decomposed.

stamalakhov (Contributor, Author) replied:

> Ah, I see. How about going with quantize_mx_decomposed and dequantize_mx_decomposed? That aligns with torch.ops.quantized_decomposed.

@mhs4670go
OK. Got it. Thank you.

stamalakhov force-pushed the full_quantization_br branch 7 times, most recently from 60fcd6a to f7bb4d9 on January 27, 2026 at 13:51.
stamalakhov force-pushed the full_quantization_br branch 6 times, most recently from 06581cb to 542db37 on January 30, 2026 at 06:07.
stamalakhov changed the title from "[DRAFT][NO_MERGE][quantization] Full quantization" to "[DRAFT][quantization] Full quantization" on Jan 30, 2026.
stamalakhov changed the title from "[DRAFT][quantization] Full quantization" to "[DRAFT][quantization] Full quantization of LLama compatible models" on Feb 2, 2026.
stamalakhov force-pushed the full_quantization_br branch 2 times, most recently from 546d33e to f6d4a2b on February 2, 2026 at 11:51.
stamalakhov force-pushed the full_quantization_br branch 3 times, most recently from 02f9367 to ab81153 on February 4, 2026 at 09:20.
stamalakhov force-pushed the full_quantization_br branch 2 times, most recently from 628786e to 234d069 on February 5, 2026 at 06:56.
-try:
-    k_total, v_total = past_key_value.update(k_rot, v)
+if len(sig.parameters) == 2:
+    k_total, v_total = past_key_value.update(k_rot, v)
mhs4670go (Contributor) commented:

@stamalakhov Just out of curiosity, can this be exported? Or is it just a fallback for the float model?

stamalakhov (Contributor, Author) replied:

Yep, it works. There are a lot of model outputs (which are just the k, v outputs) when use_cache == True.

stamalakhov (Contributor, Author) added:

The if len(sig.parameters) == 2 check was added to make the test pass.

mhs4670go (Contributor) replied Feb 5, 2026:

When is sig.parameters equal to 2? I'm asking because I'm going to trim the cache logic.

stamalakhov (Contributor, Author) replied:

I've added if len(sig.parameters) == 2 to make quantization.wrapq.wrappers.llama.test_quant_attn.TestQuantLlamaAttention pass. MockCache.update takes just two inputs, while export of the whole Llama requires 3 inputs.

> I'm gonna trim the cache logic.

@mhs4670go
You mean remove it completely?
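
A rough sketch of the dispatch being discussed (the helper name `update_cache` is made up; the real wrapper code in the PR may differ):

```python
import inspect


def update_cache(past_key_value, k_rot, v, layer_idx):
    # MockCache.update(key, value) exposes 2 parameters (self is excluded for a
    # bound method), while a HuggingFace-style Cache.update also takes layer_idx,
    # which is needed when exporting the whole Llama model.
    sig = inspect.signature(past_key_value.update)
    if len(sig.parameters) == 2:
        return past_key_value.update(k_rot, v)
    return past_key_value.update(k_rot, v, layer_idx)
```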

mhs4670go (Contributor) replied:

Ah, I got it. I think you can just modify MockCache to take a third parameter, layer_idx, which will make the code simpler.

> You mean remove it completely?

No. I just checked whether it has redundant logic, but it seems there isn't that much :)

stamalakhov (Contributor, Author) replied:

> Ah, I got it. I think you can just modify MockCache to take a third parameter, layer_idx, which will make the code simpler.

@mhs4670go
Understood. I'll update MockCache. Thank you.
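
A minimal sketch of that suggestion, i.e. a MockCache whose update accepts layer_idx so the wrapper can always pass three arguments (hypothetical code, not taken from the PR):

```python
from typing import Dict, Tuple

import torch


class MockCache:
    def __init__(self):
        self._kv: Dict[int, Tuple[torch.Tensor, torch.Tensor]] = {}

    def update(self, key: torch.Tensor, value: torch.Tensor, layer_idx: int = 0):
        # Concatenate along the sequence dimension, like a real KV cache.
        if layer_idx in self._kv:
            k_prev, v_prev = self._kv[layer_idx]
            key = torch.cat([k_prev, key], dim=-2)
            value = torch.cat([v_prev, value], dim=-2)
        self._kv[layer_idx] = (key, value)
        return key, value
```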

stamalakhov force-pushed the full_quantization_br branch 6 times, most recently from fb0bd17 to 19cc002 on February 9, 2026 at 10:46.
# Case A: HuggingFace-style transformers: model.model.layers
lm = getattr(root, "model", None)

embeddings = (
stamalakhov (Contributor, Author) commented:

@mhs4670go
Maybe it would be better to introduce something like QuantLlamaModel and wrap all the internal structure (embed, lm_head, norm) inside it, and leave this general code as a fallback? Something like this:

try:
    wrap(the_whole_model)
except:
    # no specific wrapper for this class, so use the general logic
    ...
mhs4670go (Contributor) replied Feb 10, 2026:

It would be good to have QuantLlamaModel for convenience when we evaluate it or something like that. Just as a note: even though the whole model is quantized, only the decoder layers would need to be exported because of the runtime requirements. Nothing has been finalized yet, though.

stamalakhov (Contributor, Author) replied:

OK. Got it. Thank you.

stamalakhov force-pushed the full_quantization_br branch 3 times, most recently from 348cb52 to 83f9b1e on February 12, 2026 at 07:12.
This draft tries to get a fully quantized model.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>