
feat: chunked TTS generation with quality selector#99

Open
glaucusj-sai wants to merge 1 commit into jamiepine:main from glaucusj-sai:feat/chunked-tts-quality

Conversation

@glaucusj-sai

Summary

Long text that exceeds the Qwen3-TTS model's max_new_tokens=2048 limit (roughly 170 s of audio) is now handled automatically:

  • Text splitting: Splits at sentence boundaries (with clause/word fallbacks) into configurable chunks (default 800 chars)
  • Crossfade concatenation: Joins audio chunks with a 50ms crossfade to eliminate clicks at boundaries
  • Quality selector: Runtime-switchable between standard (24kHz native) and high (44.1kHz via soxr VHQ resampling)
  • Settings API: New GET/POST /tts/settings endpoints for runtime quality control without restart

Short text (<800 chars) uses the original single-shot fast path with zero overhead.
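The splitting strategy described above (sentence boundaries with clause and word fallbacks, default 800-char chunks) could be sketched roughly as follows. Note this is an illustrative reconstruction, not the actual backend/utils/chunked_tts.py API; split_text and its regexes are assumptions.

```python
import re

def split_text(text: str, max_chars: int = 800) -> list[str]:
    """Greedily pack sentences into chunks of at most max_chars characters."""
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) > max_chars:
            # Fallback 1: split an over-long sentence at clause boundaries.
            parts = re.split(r"(?<=[,;])\s+", sentence)
            if any(len(p) > max_chars for p in parts):
                # Fallback 2: split on plain whitespace.
                parts = sentence.split()
            pieces = parts
        else:
            pieces = [sentence]
        for piece in pieces:
            # Start a new chunk when adding this piece would overflow.
            if current and len(current) + 1 + len(piece) > max_chars:
                chunks.append(current)
                current = piece
            else:
                current = f"{current} {piece}".strip() if current else piece
    if current:
        chunks.append(current)
    return chunks
```

Because splitting prefers sentence boundaries, each chunk stays a natural prosodic unit, which is what makes the later crossfade joins inaudible.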

Changes

  • backend/utils/chunked_tts.py: new module with text splitting, audio concatenation, and resampling utilities
  • backend/backends/pytorch_backend.py: integrate chunking into generate(), extract _generate_single()
  • backend/main.py: add GET/POST /tts/settings endpoints
  • backend/models.py: add TTSSettingsUpdate model, bump text max_length to 50000
  • backend/requirements.txt: add soxr>=0.3.0 for high-quality resampling
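The 50 ms crossfade join mentioned in the summary might look roughly like this. crossfade_concat is an illustrative name, not the actual chunked_tts.py code, and audio is represented as plain float lists for clarity (the real implementation presumably operates on numpy arrays).

```python
def crossfade_concat(chunks: list[list[float]], sample_rate: int = 24000,
                     fade_ms: float = 50.0) -> list[float]:
    """Join audio chunks, blending each boundary with a linear crossfade."""
    fade_len = int(sample_rate * fade_ms / 1000)
    out = list(chunks[0])
    for chunk in chunks[1:]:
        n = min(fade_len, len(out), len(chunk))
        for i in range(n):
            w = (i + 1) / n  # weight ramps 0 -> 1 across the overlap
            # Fade out the tail of the accumulated audio while fading in
            # the head of the next chunk, so there is no click at the join.
            out[-n + i] = out[-n + i] * (1.0 - w) + chunk[i] * w
        out.extend(chunk[n:])
    return out
```

At the 24 kHz native rate, 50 ms corresponds to a 1200-sample overlap per boundary, which is negligible next to chunks of ~170 s each.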

Environment variables

  • TTS_QUALITY (default: standard): output quality (standard = 24 kHz, high = 44.1 kHz)
  • TTS_MAX_CHUNK_CHARS (default: 800): maximum characters per chunk
  • TTS_UPSAMPLE_RATE (default: 44100): target sample rate for high quality
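Reading these variables presumably looks something like the sketch below; the variable names and defaults match the table above, but the function name and parsing are illustrative, not the actual backend code.

```python
import os

def load_tts_config() -> dict:
    """Read TTS settings from the environment, falling back to defaults."""
    return {
        "quality": os.environ.get("TTS_QUALITY", "standard"),  # "standard" or "high"
        "max_chunk_chars": int(os.environ.get("TTS_MAX_CHUNK_CHARS", "800")),
        "upsample_rate": int(os.environ.get("TTS_UPSAMPLE_RATE", "44100")),
    }
```

These only set the startup defaults; the PR's /tts/settings endpoints can then change quality at runtime without a restart.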

Test plan

  • Short text (<800 chars): uses single-shot path, no chunking overhead
  • Long text (9K+ chars): splits into ~12 chunks, generates and concatenates seamlessly
  • Quality switch to high: output sample rate changes to 44100
  • Switch back to standard: output returns to 24000
  • GET /tts/settings returns current config
  • POST /tts/settings with {"quality":"high"} updates at runtime

Tested on NVIDIA DGX Spark with Qwen3-TTS 1.7B — 9K character input produced ~12 minutes of seamless audio.

Long text that exceeds the model's max_new_tokens limit now gets
automatically split at sentence boundaries, generated per-chunk,
and concatenated with a short crossfade. A runtime-configurable
quality setting lets users choose between standard (24 kHz native)
and high (44.1 kHz via soxr VHQ resampling).

Changes:
- Add backend/utils/chunked_tts.py with text splitting, audio
  concatenation, and resampling utilities
- Integrate chunking directly into PyTorchTTSBackend.generate()
  so both the UI /generate and any API consumer benefit
- Add GET/POST /tts/settings endpoints for runtime quality control
- Bump GenerationRequest.text max_length from 5000 to 50000
- Add soxr to requirements.txt

Tested with 9K+ character input producing ~12 minutes of
seamless audio on an NVIDIA DGX Spark (Qwen3-TTS 1.7B).
@TacoDark

AI-generated pull request. Please review the code to make sure it works.

@glaucusj-sai
Author

> AI-generated pull request. Please review the code to make sure it works.

Yes, it has been tested. I used AI to write the post since I didn't have much experience with forking. This was done for my project, which needs large podcast scripts where quality matters and long text has to be converted into the voice. I have generated more than 10 hours of audio so far with this. Thanks for your comments.

@TacoDark

> AI-generated pull request. Please review the code to make sure it works.

> Yes, it has been tested. I used AI to write the post since I didn't have much experience with forking. This was done for my project, which needs large podcast scripts where quality matters and long text has to be converted into the voice. I have generated more than 10 hours of audio so far with this. Thanks for your comments.

Thank you for being honest; I approve of the commit.

