Problem: Inference currently runs at full FP32 precision, which drives high VRAM usage (typically >12 GB for standard checkpoints) and slows generation.
Proposed Solution:
Wrap the model's forward pass in torch.cuda.amp.autocast in the inference script.
Expose a flag (--half) to switch between FP32 and mixed-precision modes.
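A minimal sketch of the two steps above, assuming a generic `model` and input tensor (the function and flag wiring here are illustrative, not the project's actual script). Autocast is only meaningful on CUDA, so the sketch falls back to full precision on CPU-only machines:

```python
import argparse

import torch


def run_inference(model: torch.nn.Module, inputs: torch.Tensor, half: bool) -> torch.Tensor:
    # Enable mixed precision only when requested AND a CUDA device exists;
    # with enabled=False the autocast context is a no-op, so the same code
    # path works for both precision modes.
    use_amp = half and torch.cuda.is_available()
    with torch.no_grad(), torch.cuda.amp.autocast(enabled=use_amp):
        return model(inputs)


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--half",
        action="store_true",
        help="run inference under FP16 autocast instead of full FP32",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    model = torch.nn.Linear(4, 4)  # stand-in for the real checkpoint
    out = run_inference(model, torch.randn(1, 4), half=args.half)
    print(out.shape)
```

Because autocast is a context manager rather than a model conversion (unlike `model.half()`), weights stay in FP32 and only eligible ops run in FP16, which avoids permanent precision loss in numerically sensitive layers.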
Benchmarking: Compare WER and speaker-similarity (SIM) scores between FP32 and FP16 runs to quantify any degradation in audio fidelity under reduced precision.
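As a cheap first-pass fidelity check before running full WER/SIM evaluation (which needs an ASR model and a speaker encoder), one can compare raw model outputs under both precision modes; the helper below is a hypothetical sketch, not part of the existing codebase:

```python
import torch


def max_abs_diff(model: torch.nn.Module, inputs: torch.Tensor) -> float:
    # Run the same forward pass twice: once in FP32, once under autocast
    # (active only on CUDA), and report the largest elementwise gap.
    with torch.no_grad():
        ref = model(inputs)
        use_amp = torch.cuda.is_available()
        with torch.cuda.amp.autocast(enabled=use_amp):
            amp = model(inputs)
    return (ref.float() - amp.float()).abs().max().item()
```

A small gap here does not guarantee unchanged WER/SIM downstream, but a large one is an early warning that FP16 is hurting specific layers.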
Impact: Running eligible ops in FP16 roughly halves activation memory, which would allow users with 8 GB VRAM GPUs to run the model locally.