[DRAFT][quantization] Full quantization of LLama compatible models #436
stamalakhov wants to merge 1 commit into Samsung:main
Conversation
force-pushed from 0253cb9 to 5201525
```diff
     gptq[name] = GPTQ(subset[name])
     gptq[name].quantizer.configure(
-        bits=8, perchannel=True, sym=False, mse=False
+        bits=4, perchannel=True, sym=False, mse=False
```
FYI, you can give the option for this with this PR.
@mhs4670go
Thank you. I'll rebase after #441 is merged.
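For illustration, a minimal sketch of how the bit width could be exposed as an option instead of being hard-coded in the diff above. `GPTQ` and `quantizer.configure` follow the snippet; `configure_gptq` and `weight_bits` are illustrative names, not part of this PR:

```python
# Sketch only: assumes `GPTQ` is importable from the surrounding gptq module.
def configure_gptq(subset, weight_bits: int = 4):
    """Create GPTQ handlers for a layer subset with a configurable bit width."""
    gptq = {}
    for name in subset:
        gptq[name] = GPTQ(subset[name])
        gptq[name].quantizer.configure(
            bits=weight_bits, perchannel=True, sym=False, mse=False
        )
    return gptq
```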
Could you give me some explanation of the reasoning behind the changes related to the observers?
- Deleting some attributes and registering them as buffers.
- Changing `ObserverBase`'s parent from `ABC` to `torch.nn.Module` (and `MinMaxObserver`).
> Deleting some attributes and registering them as buffers.
Ahh. It turned out that `model.to("cuda")` or `model.to("cpu")` did not transfer the scales and zero_points to the GPU/CPU because they were not registered as buffers or parameters; that is why they are now registered as buffers. Deleting the old attributes is needed because otherwise torch fails to register attributes with the same names as buffers.
> Changing `ObserverBase`'s parent from `ABC` to `torch.nn.Module`
This enables buffer registration and the correct automatic transfer of scales/zero_points between CPU and GPU. The same approach is used in gptq/quant.py for the same reason (I suppose).
Please see
This draft is just a PoC (quick and dirty).
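For illustration, a minimal sketch of the buffer-registration idea described above. The class name and tensor shapes are illustrative, not the actual TICO observer code:

```python
import torch

class MinMaxObserverSketch(torch.nn.Module):
    """Toy observer showing why scale/zero_point are registered as buffers."""

    def __init__(self):
        super().__init__()
        # register_buffer() fails if a plain attribute with the same name
        # already exists, which is why the old attributes had to be deleted.
        self.register_buffer("scale", torch.ones(1))
        self.register_buffer("zero_point", torch.zeros(1, dtype=torch.int64))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Observation logic omitted; the point is that the buffers now follow
        # model.to("cuda") / model.to("cpu") automatically.
        return x

obs = MinMaxObserverSketch()
obs.to("cpu")  # scale and zero_point move together with the module
```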
Ah, thanks for the clarification. I'll reconsider those and apply the changes soon.
ok. Thank you very much.
```python
continue
if (
    dq.target
    != torch.ops.circle_custom.dequantize_mx_to_float.default
```
It seems that just `quantize_mx` and `dequantize_mx` would be simpler. Is there some consideration behind exposing the dtypes in the names?
There is no fake_quantize for mx types (just `circle_custom::quantize_mx`). So `quantize_float_to_mx` is an attempt (maybe a failed one) to distinguish it from `quantize_mx`. If `circle_custom::quantize_mx` ever becomes `circle_custom::fakequantize_mx`, then the usual quantize/dequantize naming scheme applies.
@mhs4670go
It can be renamed to any other (more appropriate) name.
Ah, I see. How about going with `quantize_mx_decomposed` and `dequantize_mx_decomposed`? This aligns with `torch.ops.quantized_decomposed`.
@mhs4670go
Ok. Got it. Thank you.
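For reference, the naming scheme being aligned with: the stock decomposed ops come in quantize_*/dequantize_* pairs under `torch.ops.quantized_decomposed`. The per-tensor sketch below is only an illustration of that pairing (not the mx ops themselves); depending on the PyTorch version, an explicit import of `torch.ao.quantization` may be needed to register the op library:

```python
import torch
import torch.ao.quantization  # may be needed to register quantized_decomposed ops

x = torch.randn(4, 4)
scale, zero_point = 0.05, 0

# quantize_* / dequantize_* pair, mirroring the proposed
# quantize_mx_decomposed / dequantize_mx_decomposed naming.
q = torch.ops.quantized_decomposed.quantize_per_tensor(
    x, scale, zero_point, -128, 127, torch.int8
)
dq = torch.ops.quantized_decomposed.dequantize_per_tensor(
    q, scale, zero_point, -128, 127, torch.int8
)
```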
force-pushed from 60fcd6a to f7bb4d9
force-pushed from 06581cb to 542db37
force-pushed from 542db37 to 7f684a3
force-pushed from 4fa1e40 to 369dabb
force-pushed from 546d33e to f6d4a2b
force-pushed from 02f9367 to ab81153
force-pushed from 628786e to 234d069
```diff
-    try:
-        k_total, v_total = past_key_value.update(k_rot, v)
+    if len(sig.parameters) == 2:
+        k_total, v_total = past_key_value.update(k_rot, v)
```
@stamalakhov Just out of curiosity, can this be exported? Or is it just a fallback for the float model?
Yep, it works. There are a lot of model outputs (which are just the k, v outputs) for `use_cache==True`.
The `if len(sig.parameters) == 2` check was added to make the test pass.
When is `len(sig.parameters)` equal to 2? I'm asking this because I'm gonna trim the cache logic.
I've added `if len(sig.parameters) == 2` to make quantization.wrapq.wrappers.llama.test_quant_attn.TestQuantLlamaAttention pass.
`MockCache.update` takes just two inputs.
For export of the whole Llama, 3 inputs are required.
> I'm gonna trim the cache logic.
@mhs4670go
You mean remove it completely?
Ah, I got it. I think you can just modify `MockCache` to have a third parameter, layer_idx, which will make the code simpler.
> You mean remove it completely?
No. I just checked whether it has redundant logic. But it seems there isn't that much :)
@mhs4670go
Understood. I'll update MockCache. Thank you.
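A hypothetical sketch of the suggested change, assuming `MockCache` keeps its role as a test double. The body below is illustrative; the point is only the third layer_idx parameter, which makes the signature-inspection branch unnecessary:

```python
import torch

class MockCache:
    """Test double whose update() takes layer_idx, roughly like transformers' Cache."""

    def __init__(self):
        self.key_cache = []
        self.value_cache = []

    def update(self, key, value, layer_idx: int):
        # With three parameters, the `if len(sig.parameters) == 2` branch in
        # the attention wrapper is no longer needed.
        if len(self.key_cache) <= layer_idx:
            self.key_cache.append(key)
            self.value_cache.append(value)
        else:
            self.key_cache[layer_idx] = torch.cat(
                [self.key_cache[layer_idx], key], dim=-2
            )
            self.value_cache[layer_idx] = torch.cat(
                [self.value_cache[layer_idx], value], dim=-2
            )
        return self.key_cache[layer_idx], self.value_cache[layer_idx]
```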
force-pushed from fb0bd17 to 19cc002
force-pushed from 19cc002 to b3dac32
```python
# Case A: HuggingFace-style transformers: model.model.layers
lm = getattr(root, "model", None)

embeddings = (
```
@mhs4670go
Maybe it would be better to introduce something like `QuantLlamaModel` and wrap all the internal structure (embed, lm_head, norm) inside of it, and leave this general code as a fallback? Like this:
```python
try:
    wrap(the_whole_model)
except Exception:
    # no specific wrapper for this class, so use the general logic
    ...
```
It would be good to have QuantLlamaModel for convenience when we evaluate it or something like that. Just to note, even though the whole model is quantized, only the decoder layers would need to be exported because of the runtime requirements. Nothing has been fixed yet, though.
OK. Got it. Thank you.
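A hypothetical sketch of what such a wrapper-plus-fallback could look like. Apart from the HuggingFace attribute layout (model.model.layers, norm, lm_head), all names here are illustrative and not part of this PR:

```python
import torch

class QuantLlamaModel(torch.nn.Module):
    """Wraps the HuggingFace-style layout: embed_tokens, decoder layers, norm, lm_head."""

    def __init__(self, model: torch.nn.Module):
        super().__init__()
        lm = model.model  # Case A from the diff above: model.model.layers
        self.embed_tokens = lm.embed_tokens
        self.layers = lm.layers
        self.norm = lm.norm
        self.lm_head = model.lm_head


def wrap_for_quantization(model: torch.nn.Module) -> torch.nn.Module:
    try:
        return QuantLlamaModel(model)  # model-specific wrapper
    except AttributeError:
        # No specific wrapper applies; fall back to the general
        # getattr(root, "model", None)-style probing.
        return model
```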
force-pushed from 348cb52 to 83f9b1e
This draft tries to get a fully quantized model.
TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
force-pushed from 83f9b1e to 93aa4c8
This draft tries to get fully quantized circle layers for the Llama model.
TODO:
TICO-DCO-1.0-Signed-off-by: s.malakhov s.malakhov@partner.samsung.com