Open
Description
What?
Problem Statement
When TICO converts (lowers/decomposes) a quantized operator into several operators, some intermediate compute operations remain in float32.
For example,
Conv3d can be lowered into Conv2d + Add + Reshape... or Conv2d + Reshape.
NOTE: our Qwen3-VL patch embedding is intended to be lowered to Conv2d + Reshape, without Add, when patch size == stride size. The pass for this is scheduled to be implemented soon.
However, our current pass (26-02-13, main branch) converts Conv3d into Conv2d + Add + Reshape...
Because this conversion runs after quantization, it generates an additional Add, and this extra Add remains in float32.
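To make the origin of the extra Add concrete, here is a minimal sketch of one possible Conv3d decomposition: one Conv2d per kernel-depth slice, with the partial results joined by Add. It uses toy single-channel nested lists in plain Python and is not TICO's actual pass; the helper names (`conv2d`, `conv3d`, `add`, `conv3d_as_conv2d_add`) are hypothetical and exist only for this illustration.

```python
# Toy single-channel tensors as nested lists; this is not TICO code.

def conv2d(x, w, stride):
    """Valid 2D cross-correlation of x[H][W] with w[kh][kw]."""
    kh, kw = len(w), len(w[0])
    out_h = (len(x) - kh) // stride + 1
    out_w = (len(x[0]) - kw) // stride + 1
    return [
        [
            sum(x[i * stride + a][j * stride + b] * w[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def conv3d(x, w, stride):
    """Valid 3D cross-correlation of x[D][H][W] with w[kd][kh][kw] (reference)."""
    kd, kh, kw = len(w), len(w[0]), len(w[0][0])
    out_d = (len(x) - kd) // stride + 1
    out_h = (len(x[0]) - kh) // stride + 1
    out_w = (len(x[0][0]) - kw) // stride + 1
    return [
        [
            [
                sum(x[d * stride + c][i * stride + a][j * stride + b] * w[c][a][b]
                    for c in range(kd) for a in range(kh) for b in range(kw))
                for j in range(out_w)
            ]
            for i in range(out_h)
        ]
        for d in range(out_d)
    ]

def add(p, q):
    """Elementwise Add of two 2D maps; this is the extra compute op."""
    return [[u + v for u, v in zip(rp, rq)] for rp, rq in zip(p, q)]

def conv3d_as_conv2d_add(x, w, stride):
    """Decompose Conv3d into one Conv2d per kernel-depth slice, joined by Add."""
    kd = len(w)
    out_d = (len(x) - kd) // stride + 1
    frames = []
    for d in range(out_d):
        acc = conv2d(x[d * stride], w[0], stride)
        for c in range(1, kd):
            acc = add(acc, conv2d(x[d * stride + c], w[c], stride))
        frames.append(acc)
    return frames  # stacking the frames plays the role of Reshape

# Integer data keeps the comparison exact.
x = [[[(3 * d + 2 * i + j) % 7 for j in range(4)] for i in range(4)] for d in range(4)]
w = [[[1, -1], [2, 0]], [[0, 3], [-2, 1]]]

reference = conv3d(x, w, stride=2)
decomposed = conv3d_as_conv2d_add(x, w, stride=2)
```

In this sketch `add` is a compute op that did not exist in the original graph, so a calibration run performed on the undecomposed Conv3d never observes its output.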
How to resolve?
- Operator conversion should be done before quantization calibration if it produces any compute operations.
- If required, we could manipulate qparams for specific operators.
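
The two options above can be sketched with a toy pipeline. Everything here is an assumption made for illustration: the `Op` record, the `decompose` and `calibrate` stand-ins, the placeholder qparam `(0.1, 0)`, and the `inherit_from_next` helper are hypothetical and do not reflect TICO's API. The toy lowering hands Conv3d's qparam to Conv2d and Reshape but has nothing to give the newly created Add.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Op:
    name: str
    # (scale, zero_point); None means the op still runs in float32.
    qparam: Optional[Tuple[float, int]] = None

def decompose(ops: List[Op]) -> List[Op]:
    """Replace each conv3d with conv2d + add + reshape (hypothetical lowering)."""
    out: List[Op] = []
    for op in ops:
        if op.name == "conv3d":
            # conv2d and reshape reuse conv3d's qparam; add is new and gets none.
            out += [Op("conv2d", op.qparam), Op("add"), Op("reshape", op.qparam)]
        else:
            out.append(op)
    return out

def calibrate(ops: List[Op]) -> List[Op]:
    """Stand-in for calibration: every op present gets a placeholder qparam."""
    return [Op(op.name, op.qparam if op.qparam is not None else (0.1, 0))
            for op in ops]

def float32_ops(ops: List[Op]) -> List[str]:
    """Names of ops that have no qparam and therefore stay in float32."""
    return [op.name for op in ops if op.qparam is None]

def inherit_from_next(ops: List[Op]) -> List[Op]:
    """Give a float32 op the qparam of a following reshape.

    Reshape does not change values, so its input and output can share qparams.
    """
    fixed = list(ops)
    for i in range(len(fixed) - 1):
        if fixed[i].qparam is None and fixed[i + 1].name == "reshape":
            fixed[i] = Op(fixed[i].name, fixed[i + 1].qparam)
    return fixed

graph = [Op("conv3d"), Op("relu")]

late = decompose(calibrate(graph))   # conversion after calibration
early = calibrate(decompose(graph))  # conversion before calibration (option 1)
patched = inherit_from_next(late)    # qparam manipulation (option 2)
```

With conversion after calibration, `float32_ops(late)` reports the Add; with conversion first, or with the qparam patch applied, no op is left in float32. The patch in this sketch is only valid because the Add feeds a value-preserving op; other cases would need their own rule.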