A Python application that reads PDF books aloud using high-quality text-to-speech synthesis. It automatically detects chapters, starts reading from the first one, and offers several TTS engines so you can choose the voice quality that suits you.
- Multiple TTS Engines: Choose from traditional engines (Festival, eSpeak NG, eSpeak) or advanced neural models
- Neural TTS Models: High-quality AI-powered speech synthesis with GPU acceleration support
- Intelligent Text Chunking: Automatic text segmentation respects model token limits for error-free processing
- Lazy Loading: Heavy ML libraries are loaded only when a neural model is selected
- Interactive Voice Selection: Pick your preferred TTS engine and voice at startup
- Intelligent Chapter Detection: Automatically finds actual chapters (not subsections) using "Chapter" keyword
- Chapter Selection: Choose specific chapters to read or save as audio files
- Dual Reading Modes: Live audio playback or save chapters to WAV files
- Smart Audio Organization: Saves files in pdf-tts/<book-title>/<chapter>.wav structure
- Segfault Prevention: Large chapters processed in chunks to prevent crashes
- Smart Text Processing: Cleans and formats text for optimal speech synthesis
- Multiple Book Support: Browse and select from organized book collections
- TTS Testing Suite: Comprehensive testing tool for all TTS engines and models
- Cross-Platform: Works on Linux, WSL, and Windows
- WSL Audio Support: Automatic audio configuration for Windows Subsystem for Linux
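The lazy-loading behaviour listed above can be sketched roughly as follows. This is only an illustration, not the application's actual code: `NEURAL_MODULES`, `load_neural_backend`, `select_engine`, and the engine keys are hypothetical names, and the sketch assumes the neural stack is imported by module name only once a neural engine has been chosen.

```python
import importlib

# Hypothetical list of heavy modules needed only by neural engines.
NEURAL_MODULES = ("torch", "transformers")


def load_neural_backend(module_names=NEURAL_MODULES):
    """Import heavy modules on demand and return them keyed by name."""
    loaded = {}
    for name in module_names:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError as exc:
            # Re-raise with a message that names the missing dependency.
            raise ImportError(f"Neural TTS requires '{name}': {exc}") from exc
    return loaded


def select_engine(choice, neural_choices=("speecht5", "mms", "vits")):
    """Return the modules needed for the chosen engine (empty for traditional ones)."""
    if choice in neural_choices:
        return load_neural_backend()
    return {}
```

With this shape, choosing a traditional engine never touches `torch` or `transformers`, so startup stays fast on machines without them.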
- Python 3.11+
- Conda (Anaconda/Miniconda)
- Linux or Windows environment
- Create and activate conda environment:
conda create -n pdf-tts python=3.11
conda activate pdf-tts
- Install Python dependencies:
pip install -r requirements.txt
conda install -c conda-forge ffmpeg libstdcxx-ng
Note: Neural TTS dependencies are included in requirements.txt and will be installed automatically.
- Configure PDF directory paths:
Create a .env file in the project directory to specify your PDF book locations:
# Copy the example and edit with your paths
cp .env.example .env
Edit .env file with your book directory paths:
# Linux/WSL base directory
LINUX_BASEDIR=/home/username/Documents/Books/
# Windows base directory
WINDOWS_BASEDIR=C:\Users\Username\Documents\Books\
Important: The .env file is ignored by git to keep personal paths private.
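As a rough sketch of how these two variables might be read, the snippet below parses a .env file by hand and picks the variable that matches the operating system. It assumes plain KEY=VALUE lines; the function names are hypothetical, and the application itself may well use a library such as python-dotenv instead.

```python
import platform
from pathlib import Path


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def base_dir_for(values, system=None):
    """Pick the variable that matches the operating system.

    WSL reports itself as "Linux", so it uses LINUX_BASEDIR.
    """
    system = system or platform.system()
    key = "WINDOWS_BASEDIR" if system == "Windows" else "LINUX_BASEDIR"
    return values.get(key)


def load_base_dir(env_path=".env"):
    """Return the configured base directory, or None if it cannot be found."""
    path = Path(env_path)
    if not path.is_file():
        return None
    return base_dir_for(parse_env_text(path.read_text(encoding="utf-8")))
```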
- Install system dependencies:
For Linux/WSL:
sudo apt update
sudo apt install -y pulseaudio pulseaudio-utils alsa-utils ffmpeg libstdcxx-ng festival espeak-ng espeak-ng-data festvox-kallpc16k
For WSL specifically: Run the provided audio setup script for automatic configuration:
./setup_audio.sh
This script installs and configures:
- PulseAudio for WSL audio support
- Festival TTS engine (natural sounding voices)
- eSpeak NG (high quality synthetic voices)
- eSpeak (basic fallback)
conda activate pdf-tts
python pdf_reader.py
Test all TTS engines and GPU acceleration with the comprehensive testing suite:
# Interactive mode - menu-driven testing
python test-tts.py --interactive
# List all available TTS options
python test-tts.py --list
# Test specific neural model
python test-tts.py --model microsoft/speecht5_tts
# Test with custom text
python test-tts.py --model facebook/mms-tts-eng --text "Custom test text"
# Run full test suite
python test-tts.py
Interactive Testing Features:
- Test all libraries - Tests pyttsx3, festival, espeak-ng, espeak, GPU acceleration
- Test all neural models - Tests HuggingFace transformer models
- Test specific library - Choose individual TTS engines from numbered list
- Test specific model - Choose neural models with custom text input
- GPU acceleration testing - Validates CUDA availability and PyTorch GPU support
Available Neural TTS Models:
- microsoft/speecht5_tts - SpeechT5 (Highest quality, most natural speech with speaker embeddings)
- facebook/mms-tts-eng - MMS TTS English (Fast, multilingual with good quality)
- kakao-enterprise/vits-ljs - VITS LJSpeech (Natural sounding with advanced tokenization)
Neural TTS Features:
- Smart Chunking: Automatically splits long text into model-appropriate chunks
- Token Limit Handling: Prevents sequence length errors with intelligent text segmentation
- Audio Combination: Seamlessly combines multiple chunks into single audio output
- Error Recovery: Graceful fallbacks and enhanced error handling for each model type
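The chunking idea described above can be sketched as follows. This is a simplified illustration rather than the project's implementation: `chunk_text` is a hypothetical helper, and `max_chars` is an assumed character budget standing in for a model's real token limit.

```python
import re


def chunk_text(text, max_chars=400):
    """Split text into chunks of at most max_chars, breaking at sentence ends.

    A sentence longer than max_chars is split on whitespace; a single
    word longer than max_chars is kept whole.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # Break an oversized sentence into word-level pieces first.
        pieces = [sentence]
        if len(sentence) > max_chars:
            pieces, piece = [], ""
            for word in sentence.split():
                candidate = f"{piece} {word}".strip()
                if len(candidate) > max_chars and piece:
                    pieces.append(piece)
                    piece = word
                else:
                    piece = candidate
            if piece:
                pieces.append(piece)
        # Pack pieces greedily into the current chunk.
        for part in pieces:
            candidate = f"{current} {part}".strip()
            if len(candidate) > max_chars and current:
                chunks.append(current)
                current = part
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately and the resulting audio segments joined in order.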
When you run the application, you'll be presented with TTS engine choices:
Traditional Engines:
- pyttsx3 + espeak - Cross-platform engine with voice selection options
- eSpeak NG (High Quality) - Good quality synthetic voices, faster processing
- Festival (Natural Sounding) - Best traditional quality, most human-like voices
- eSpeak (Basic) - Basic quality, reliable fallback
Neural AI Models:
- SpeechT5 - High-quality neural TTS with speaker embeddings [GPU/CPU]
- MMS TTS English - Fast multilingual neural TTS [GPU/CPU]
- VITS LJSpeech - Natural sounding neural speech synthesis [GPU/CPU]
For quick testing with predefined selections:
# Test with Festival TTS engine (adjust indices based on your setup)
echo -e "2\n0\n37\n1\n1" | python pdf_reader.py
# Test with eSpeak NG (adjust indices based on your setup)
echo -e "1\n0\n37\n1\n1" | python pdf_reader.py
Note: The numbers correspond to:
- TTS engine selection (0=pyttsx3, 1=eSpeak NG, 2=Festival, 3=eSpeak, 4=SpeechT5, 5=MMS TTS, 6=VITS)
- Folder selection (0=first folder)
- Book selection (37=specific book index - adjust for your collection)
- Chapter selection (1=first chapter)
- Mode selection (1=live audio, 2=save to file)
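The same scripted run can be driven from Python, which makes the five answers easier to read than a string of digits. This is a sketch under the assumption that pdf_reader.py reads its menu answers line by line from standard input, as the piped examples suggest; `build_menu_input` and `run_reader` are hypothetical helper names.

```python
import subprocess
import sys


def build_menu_input(engine, folder, book, chapter, mode):
    """Join the five menu answers into newline-separated stdin text."""
    answers = (engine, folder, book, chapter, mode)
    return "".join(f"{answer}\n" for answer in answers)


def run_reader(stdin_text, script="pdf_reader.py"):
    """Run the reader with the prepared answers fed to its stdin."""
    return subprocess.run(
        [sys.executable, script], input=stdin_text, text=True, check=False
    )
```

For example, `run_reader(build_menu_input(engine=2, folder=0, book=37, chapter=1, mode=2))` would select Festival and save the first chapter to a file, with the book index adjusted for your own collection.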
- TTS Engine Selection: Choose your preferred text-to-speech engine and voice
- Book Directory Selection: Pick a folder containing PDF books
- Book Selection: Choose a specific PDF to read
- Chapter Selection: Pick a specific chapter from automatically detected chapters
- Reading Mode: Choose between live audio playback or saving to WAV file
- Audio Processing: Enjoy high-quality speech synthesis or organized audio files
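Chapter detection by the "Chapter" keyword could look something like the sketch below. It is not the application's actual logic: `CHAPTER_RE` and `find_chapters` are hypothetical, the heading shape ("Chapter" followed by a number at the start of a line) is an assumption, and table-of-contents pages are not handled.

```python
import re

# Assumed heading shape: a line that starts with "Chapter" and a number.
CHAPTER_RE = re.compile(
    r"^[ \t]*Chapter[ \t]+(\d+)\b[ \t:.\-]*(.*)$",
    re.IGNORECASE | re.MULTILINE,
)


def find_chapters(pages):
    """Return (chapter_number, title, page_index) for the first hit of each chapter.

    pages is a list of page texts, for example as extracted from a PDF.
    """
    seen = set()
    chapters = []
    for page_index, text in enumerate(pages):
        for match in CHAPTER_RE.finditer(text):
            number = int(match.group(1))
            if number in seen:
                continue  # Ignore repeats such as running headers.
            seen.add(number)
            chapters.append((number, match.group(2).strip(), page_index))
    return chapters
```

Because the pattern is anchored to the start of a line, numbered subsections such as "1.1 Subsection" and in-text mentions such as "see Chapter 2" are not picked up.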
When saving chapters to audio files, the system creates an organized directory structure:
pdf-tts/
├── <Book-Title-From-PDF-Metadata>/
│ ├── 1-Chapter-1-Title.wav
│ ├── 2-Chapter-2-Title.wav
│ └── ...
└── <Another-Book>/
├── 1-Chapter-1-Title.wav
└── ...
- Book Title Detection: Uses PDF metadata /Title field when available, falls back to filename
- Clean Naming: Removes special characters and normalizes spaces for filesystem compatibility
- Chapter Organization: Each chapter saved as individual WAV file with descriptive naming
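The naming rules above can be sketched like this. The exact cleaning rules used by the application are not documented here, so `clean_name` and `chapter_wav_path` are hypothetical, and the sketch assumes that anything other than ASCII letters, digits, spaces, and hyphens is dropped.

```python
import re
from pathlib import Path


def clean_name(text):
    """Drop special characters and join the remaining words with hyphens."""
    text = re.sub(r"[^A-Za-z0-9 \-]+", " ", text)
    return "-".join(text.split())


def chapter_wav_path(pdf_filename, chapter_number, chapter_title,
                     metadata_title=None, root="pdf-tts"):
    """Build <root>/<book>/<n>-<chapter>.wav.

    The book folder comes from the PDF metadata title when one is given,
    otherwise from the PDF filename without its extension.
    """
    book = clean_name(metadata_title or Path(pdf_filename).stem)
    filename = f"{chapter_number}-{clean_name(chapter_title)}.wav"
    return Path(root) / book / filename
```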
Python packages:
- PyPDF2: PDF reading and processing
- pyttsx3: Text-to-speech synthesis interface
- torch: Deep learning framework for neural TTS models
- transformers: HuggingFace transformer models for neural TTS
- datasets: Speaker embeddings and model data
- soundfile: Audio file reading/writing for neural TTS
- phonemizer: Text to phoneme conversion for VITS models
System packages:
- pulseaudio: Audio server for Linux/WSL
- alsa-utils: Audio utilities
- ffmpeg: Audio codec support
- libstdcxx-ng: C++ standard library compatibility
TTS engines and voices:
- festival: High-quality natural sounding TTS engine
- espeak-ng: Modern eSpeak with improved voice quality
- espeak-ng-data: Voice data for eSpeak NG
- festvox-kallpc16k: Festival voice pack for better audio quality
- espeak: Basic fallback TTS engine
| Engine | Quality | Speed | GPU | Notes |
|---|---|---|---|---|
| Neural Models | | | | |
| SpeechT5 | ★★★★★ | ★★☆☆☆ | ✓ | Highest quality, most natural speech with smart chunking |
| MMS TTS | ★★★★☆ | ★★★★☆ | ✓ | Fast neural TTS, multilingual with robust processing |
| VITS | ★★★★☆ | ★★★☆☆ | ✓ | Natural sounding with advanced tokenization handling |
| Traditional Engines | | | | |
| Festival | ★★★★☆ | ★★★☆☆ | ✗ | Best traditional quality |
| eSpeak NG | ★★★☆☆ | ★★★★☆ | ✗ | Good synthetic quality |
| eSpeak (pyttsx3) | ★★★☆☆ | ★★★★★ | ✗ | Configurable voices |
| eSpeak (basic) | ★★☆☆☆ | ★★★★★ | ✗ | Basic compatibility |
If you experience audio problems in WSL, run the audio setup script:
./setup_audio.sh
Install missing engines:
sudo apt install festival espeak-ng espeak-ng-data
Try different TTS engines in order of preference:
- Neural Models (highest quality):
  - SpeechT5 - Best overall quality and naturalness
  - MMS TTS - Fast with good quality
  - VITS - Natural sounding speech
- Traditional engines:
  - Festival - Best traditional quality
  - eSpeak NG - Good synthetic quality
  - eSpeak with pyttsx3 - Configurable voices
If you see errors about missing base directory configuration:
"Error: Base directory not configured":
- Create a .env file in the project directory
- Add the appropriate variable for your operating system:
# For Linux/WSL users
LINUX_BASEDIR=/path/to/your/pdf/books/
# For Windows users
WINDOWS_BASEDIR=C:\path\to\your\pdf\books\
"Error: Configured base directory does not exist":
- Verify the path exists and is accessible
- Check for typos in the directory path
- Ensure you have read permissions for the directory
- Create the directory if it doesn't exist
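The checks behind these two messages can be approximated as below, which may help when debugging a path by hand. This is a sketch only: `check_base_dir` is a hypothetical helper, and the read-permission message is an assumption, not something the application is known to print.

```python
import os


def check_base_dir(base_dir):
    """Return an error message for an unusable base directory, or None if it is fine."""
    if not base_dir:
        return "Error: Base directory not configured"
    if not os.path.isdir(base_dir):
        return "Error: Configured base directory does not exist"
    if not os.access(base_dir, os.R_OK):
        # Assumed wording; the real application may report this differently.
        return f"Error: No read permission for {base_dir}"
    return None
```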
If you encounter errors with neural models:
Token Length Errors:
- The application automatically handles token limits with smart chunking
- If you still see token errors, try selecting shorter chapters or text sections
CUDA/GPU Issues:
- Neural models work on both GPU and CPU
- GPU provides faster processing but is not required
- Check GPU availability with:
python -c "import torch; print(torch.cuda.is_available())"
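Inside a script, the same check can be used to pick a device with a CPU fallback. This is a small sketch, with `pick_device` as a hypothetical helper name; it relies only on `torch.cuda.is_available()`.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # Imported here so setups without torch still get an answer.
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```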
Model Loading Errors:
- Ensure stable internet connection for initial model downloads
- Models are cached locally after first download
- Try switching to a different neural model if one fails to load
Audio Playback Issues:
- Ensure audio system is properly configured (run ./setup_audio.sh for WSL)
- Try different audio players: aplay, paplay, or save to file mode
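Trying players in turn can be sketched as follows. The helper names are hypothetical and the order (paplay before aplay) is an assumption; the sketch only checks which command-line players are on the PATH and reports failure so the caller can switch to save-to-file mode.

```python
import shutil
import subprocess


def find_player(candidates=("paplay", "aplay"), which=shutil.which):
    """Return the first candidate player found on PATH, or None."""
    for name in candidates:
        if which(name):
            return name
    return None


def play_wav(path, player=None):
    """Play a WAV file; return False if no player is available or playback fails."""
    player = player or find_player()
    if player is None:
        return False  # Caller can fall back to save-to-file mode.
    return subprocess.run([player, path], check=False).returncode == 0
```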
hello_pdf/
├── pdf_reader.py # Main PDF reader application
├── test-tts.py # Comprehensive TTS testing suite
├── setup_audio.sh # WSL audio configuration script
├── requirements.txt # Python dependencies
├── claude.md # Project instructions and session history
└── README.md # This file
Feel free to submit issues and enhancement requests!
This project is open source and available under the MIT License.