feat: chunked TTS generation with quality selector #99
glaucusj-sai wants to merge 1 commit into jamiepine:main
Long text that exceeds the model's max_new_tokens limit is now automatically split at sentence boundaries, generated per chunk, and concatenated with a short crossfade. A runtime-configurable quality setting lets users choose between standard (24 kHz native) and high (44.1 kHz via soxr VHQ resampling).

Changes:
- Add backend/utils/chunked_tts.py with text splitting, audio concatenation, and resampling utilities
- Integrate chunking directly into PyTorchTTSBackend.generate() so both the UI /generate and any API consumer benefit
- Add GET/POST /tts/settings endpoints for runtime quality control
- Bump GenerationRequest.text max_length from 5000 to 50000
- Add soxr to requirements.txt

Tested with 9K+ character input producing ~12 minutes of seamless audio on an NVIDIA DGX Spark (Qwen3-TTS 1.7B).
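For reviewers, the sentence-boundary splitting described in the commit message could be sketched roughly as follows. This is not the code in backend/utils/chunked_tts.py; the function name `split_into_chunks`, the regex, and the hard-split fallback for oversized sentences are assumptions made for illustration.

```python
import re


def split_into_chunks(text: str, max_chars: int = 800) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    Hypothetical helper: the name and the splitting rule are illustrative.
    """
    # Split on whitespace that follows ., ! or ? so punctuation stays attached.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Assumed fallback: a single sentence longer than max_chars is cut hard.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then be passed to the model separately, so no single call has to exceed the token limit.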
AI-generated pull request. Please review the code to make sure it works.
Yes, it has been tested. I used AI to post because I didn't have much experience with forking. This was done for my project, which needs long podcast scripts converted to voice where quality matters. I have generated more than 10 hours of voice with this so far. Thanks for your comments.
Thank you for being honest. I approve of the commit.
Summary
Long text that exceeds the Qwen3-TTS model's `max_new_tokens=2048` limit (~170 s of audio) is now handled automatically:
- Text is split at sentence boundaries, generated per chunk, and concatenated with a short crossfade
- Quality setting: `standard` (24 kHz native) and `high` (44.1 kHz via soxr VHQ resampling)
- `GET/POST /tts/settings` endpoints for runtime quality control without restart

Short text (<800 chars) uses the original single-shot fast path with zero overhead.
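The crossfade concatenation mentioned in the commit message can be illustrated with a linear blend at each seam. This is a minimal pure-Python sketch, not the PR's implementation: the function name `crossfade_concat` and the linear ramp are assumptions, and real code would more likely operate on NumPy arrays.

```python
def crossfade_concat(chunks: list[list[float]], fade_samples: int) -> list[float]:
    """Join audio chunks, blending each seam with a linear crossfade.

    Hypothetical helper: samples are plain floats, one list per chunk.
    """
    if not chunks:
        return []
    out = list(chunks[0])
    for nxt in chunks[1:]:
        # The overlap cannot be longer than either side of the seam.
        n = min(fade_samples, len(out), len(nxt))
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming chunk, rising toward 1
            tail_index = len(out) - n + i
            out[tail_index] = out[tail_index] * (1.0 - w) + nxt[i] * w
        # Samples after the overlapped region are appended unchanged.
        out.extend(nxt[n:])
    return out
```

At the 24 kHz native rate, a fade of 1200 samples would correspond to 50 ms; that duration is only an example, since the PR does not state the crossfade length it uses.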
Changes
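| File | Change |
| --- | --- |
| `backend/utils/chunked_tts.py` | Text splitting, audio concatenation, and resampling utilities |
| `backend/backends/pytorch_backend.py` | Integrate chunking into `generate()`, extract `_generate_single()` |
| `backend/main.py` | `GET/POST /tts/settings` endpoints |
| `backend/models.py` | `TTSSettingsUpdate` model, bump text `max_length` to 50000 |
| `backend/requirements.txt` | `soxr>=0.3.0` for high-quality resampling |

Environment variables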
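| Variable | Default | Description |
| --- | --- | --- |
| `TTS_QUALITY` | `standard` | `standard` = 24 kHz, `high` = 44.1 kHz |
| `TTS_MAX_CHUNK_CHARS` | `800` | |
| `TTS_UPSAMPLE_RATE` | `44100` | |

Test plan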
- `GET /tts/settings` returns current config
- `POST /tts/settings` with `{"quality":"high"}` updates at runtime

Tested on NVIDIA DGX Spark with Qwen3-TTS 1.7B: 9K-character input produced ~12 minutes of seamless audio.
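To make the settings behaviour easier to review, here is a rough sketch of how the three environment variables and a runtime quality update might fit together. The class `TTSSettings`, its methods, and the validation rules are hypothetical and not taken from the PR; only the variable names, their defaults, and the 24 kHz / 44.1 kHz rates come from the description above.

```python
import os
from dataclasses import dataclass

NATIVE_RATE = 24_000  # native output rate stated in the PR description
VALID_QUALITIES = ("standard", "high")


@dataclass
class TTSSettings:
    """Hypothetical container for the runtime TTS settings."""

    quality: str = "standard"
    max_chunk_chars: int = 800
    upsample_rate: int = 44_100

    @classmethod
    def from_env(cls, env=None) -> "TTSSettings":
        # Accept any mapping so the loader can be exercised without os.environ.
        env = os.environ if env is None else env
        settings = cls(
            quality=env.get("TTS_QUALITY", "standard"),
            max_chunk_chars=int(env.get("TTS_MAX_CHUNK_CHARS", "800")),
            upsample_rate=int(env.get("TTS_UPSAMPLE_RATE", "44100")),
        )
        settings.update(settings.quality)  # reuse the quality validation
        return settings

    def update(self, quality: str) -> None:
        """Apply a runtime change, as a POST /tts/settings handler might."""
        if quality not in VALID_QUALITIES:
            raise ValueError(f"quality must be one of {VALID_QUALITIES}")
        self.quality = quality

    @property
    def output_rate(self) -> int:
        # Only "high" resamples; "standard" keeps the native rate.
        return self.upsample_rate if self.quality == "high" else NATIVE_RATE
```

With this shape, a body of `{"quality":"high"}` would map to `settings.update("high")`, after which `settings.output_rate` would report 44100 without a restart.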