Problem: Inference currently runs at full FP32 precision, which drives high VRAM usage (typically >12 GB for standard checkpoints) and slows generation.
Proposed Solution:
Wrap the model's forward pass in torch.cuda.amp.autocast in the inference script.
Expose a flag (--half) to switch between FP32 and mixed-precision modes.
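A minimal sketch of the two steps above, assuming a generic `model` and input tensor (the function and flag wiring here are illustrative, not the project's actual script). Autocast is only meaningful on CUDA, so the sketch falls back to full precision on CPU-only machines:

```python
import argparse

import torch


def run_inference(model: torch.nn.Module, inputs: torch.Tensor, half: bool) -> torch.Tensor:
    # Enable mixed precision only when requested AND a CUDA device exists;
    # with enabled=False the autocast context is a no-op, so the same code
    # path works for both precision modes.
    use_amp = half and torch.cuda.is_available()
    with torch.no_grad(), torch.cuda.amp.autocast(enabled=use_amp):
        return model(inputs)


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--half",
        action="store_true",
        help="run inference under FP16 autocast instead of full FP32",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    model = torch.nn.Linear(4, 4)  # stand-in for the real checkpoint
    out = run_inference(model, torch.randn(1, 4), half=args.half)
    print(out.shape)
```

Because autocast is a context manager rather than a model conversion (unlike `model.half()`), weights stay in FP32 and only eligible ops run in FP16, which avoids permanent precision loss in numerically sensitive layers.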
Benchmarking: Compare WER and speaker-similarity (SIM) scores between FP32 and FP16 runs to quantify any degradation in audio fidelity under reduced precision.
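As a cheap first-pass fidelity check before running full WER/SIM evaluation (which needs an ASR model and a speaker encoder), one can compare raw model outputs under both precision modes; the helper below is a hypothetical sketch, not part of the existing codebase:

```python
import torch


def max_abs_diff(model: torch.nn.Module, inputs: torch.Tensor) -> float:
    # Run the same forward pass twice: once in FP32, once under autocast
    # (active only on CUDA), and report the largest elementwise gap.
    with torch.no_grad():
        ref = model(inputs)
        use_amp = torch.cuda.is_available()
        with torch.cuda.amp.autocast(enabled=use_amp):
            amp = model(inputs)
    return (ref.float() - amp.float()).abs().max().item()
```

A small gap here does not guarantee unchanged WER/SIM downstream, but a large one is an early warning that FP16 is hurting specific layers.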
Impact: Running eligible ops in FP16 roughly halves activation memory, which would allow users with 8 GB VRAM GPUs to run the model locally.