A full-stack application for podcast transcription, summarization, and interactive Q&A using Retrieval-Augmented Generation (RAG).
PodNotes transforms your podcast listening experience by:
- Transcribing audio files using Whisper
- Identifying speakers with advanced diarization
- Summarizing podcast content with AI
- Enabling natural language Q&A about the podcast content
- Storing podcasts for future reference
The application uses a modern tech stack with a FastAPI backend, React frontend, and leverages AWS services for production deployment.
- Audio Processing: Upload and transcribe podcast audio files
- Transcription: Convert speech to text using OpenAI's Whisper
- Speaker Diarization: Identify different speakers using DOVER-Lap fusion technology
- AI Summarization: Generate concise summaries of podcast content
- Interactive Q&A: Ask questions about podcast content using RAG
- Cloud Storage: Store podcasts, transcripts, and metadata in AWS
- Vector Search: Semantic search capabilities using ChromaDB
PodNotes uses a hybrid architecture that can run locally for development or on AWS for production:
┌─────────────┐ ┌──────────────────────────────────────┐
│ │ │ Backend │
│ Frontend │ │ │
│ (React/TS) │◄───┤ FastAPI + Whisper + ChromaDB + LLM │
│ │ │ │
└─────────────┘ └──────────────────────────────────────┘
Note: This architecture uses ChromaDB deployed alongside the backend (e.g., on EC2 or ECS) for vector storage, simplifying the stack compared to previous OpenSearch integration.
┌─────────────┐ ┌──────────────────────────────┐ ┌───────────────┐
│ │ │ Backend (FastAPI/ECS) │ │ AWS S3 │
│ Frontend │ │ + Whisper + ChromaDB + LLM │ │ (Audio & │
│ (React/TS) │◄───┤ │◄───┤ Transcript │
│ │ │ │ │ Storage) │
└─────────────┘ └──────────────────────────────┘ └───────────────┘
│
▼
┌─────────────┐
│ AWS DynamoDB│
│ (Metadata │
│ Storage) │
└─────────────┘
The backend logic is organized into services within the `backend/services/` directory:

- `aws_service.py`: Handles interactions with AWS services such as S3 and DynamoDB.
- `chromadb_service.py`: Manages vector storage and retrieval using ChromaDB for the RAG system.
- `langchain_service.py`: Orchestrates language model interactions (transcription, summarization, Q&A) using LangChain.
- `ollama_service.py`: Provides specific integration points for Ollama models, if used.
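As a purely illustrative sketch of how these pieces fit together, `main.py` could wire the services roughly as below; every class and helper name (`AWSService`, `index_transcript`, etc.) is hypothetical and not the repository's actual API:

```python
# Hypothetical wiring sketch -- service classes and helper methods are illustrative only.
from fastapi import FastAPI, UploadFile

from services.aws_service import AWSService              # S3 + DynamoDB access (assumed name)
from services.chromadb_service import ChromaDBService    # vector storage (assumed name)
from services.langchain_service import LangChainService  # transcription / summaries / Q&A (assumed name)

app = FastAPI(title="PodNotes API")
aws = AWSService()
vectors = ChromaDBService()
llm = LangChainService(vector_store=vectors)

@app.post("/podcasts")
async def upload_podcast(file: UploadFile):
    """Store the audio, transcribe it, and index it for RAG (illustrative flow)."""
    audio = await file.read()
    podcast_id = aws.upload_audio(file.filename, audio)               # push raw audio to S3
    transcript = llm.transcribe(audio)                                # Whisper under the hood
    vectors.index_transcript(podcast_id, transcript)                  # chunk + embed into ChromaDB
    aws.save_metadata(podcast_id, summary=llm.summarize(transcript))  # metadata to DynamoDB
    return {"podcast_id": podcast_id}
```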
PodNotes uses a hybrid Retrieval-Augmented Generation (RAG) approach to provide accurate and context-rich answers to questions about podcast content:
- Document Processing:
  - Podcast audio is transcribed to text using Whisper
  - Text is split into smaller chunks
  - Each chunk is converted to a vector embedding
  - Vector embeddings and transcript metadata are stored in ChromaDB
  - Original audio files and structured transcripts are stored in S3 (AWS) or locally
  - Podcast metadata (like summaries) is stored in DynamoDB (AWS) or locally
- Storage:
  - ChromaDB stores vector embeddings
  - A BM25 index is built over transcript text for keyword-based retrieval
  - Metadata and references are stored in DynamoDB
- Retrieval:
  - When a question is asked, two retrieval strategies are run in parallel:
    - BM25 lexical retrieval: finds chunks with exact keyword matches
    - Semantic retrieval (vector search): finds chunks semantically similar to the question
  - The two result sets are fused (hybrid scoring) to maximize coverage and relevance (a sketch of this hybrid retrieval appears after this list)
  - Retrieved chunks provide context for the LLM
- Generation:
  - The LLM generates an answer using the retrieved context
  - The system maintains conversation history for follow-up questions
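To make the hybrid retrieval step concrete, here is a minimal sketch using LangChain-style components with Ollama embeddings and ChromaDB. The chunk sizes, embedding model name, fusion weights, and the `build_hybrid_retriever` helper are illustrative assumptions rather than the project's actual code (the BM25 retriever additionally needs the `rank_bm25` package):

```python
# Illustrative hybrid-RAG sketch; not the repository's actual implementation.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain.retrievers import EnsembleRetriever

def build_hybrid_retriever(transcript_text: str) -> EnsembleRetriever:
    # 1. Split the transcript into smaller chunks (sizes are illustrative).
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_text(transcript_text)

    # 2. Semantic index: embed each chunk and store it in ChromaDB.
    vector_store = Chroma.from_texts(chunks, OllamaEmbeddings(model="nomic-embed-text"))
    semantic = vector_store.as_retriever(search_kwargs={"k": 4})

    # 3. Lexical index: BM25 over the same chunks for exact keyword matches.
    lexical = BM25Retriever.from_texts(chunks)
    lexical.k = 4

    # 4. Fuse both result lists; the retrieved chunks become the LLM's context.
    return EnsembleRetriever(retrievers=[lexical, semantic], weights=[0.5, 0.5])

# Usage:
# docs = build_hybrid_retriever(transcript).invoke("Who was the guest in this episode?")
```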
PodNotes uses DOVER-Lap (Diarization Output Voting Error Reduction - Label-Propagation) for accurate speaker identification in podcasts:
- Multiple Diarization Systems:
  - The system runs multiple speaker diarization algorithms in parallel:
    - Pyannote.audio: state-of-the-art neural speaker diarization
    - PvFalcon: Picovoice's speaker diarization technology
- System Fusion:
  - DOVER-Lap combines the outputs from multiple diarization systems
  - Uses a graph-based label propagation algorithm to resolve disagreements
  - Produces a more accurate consensus diarization than any single system
- Integration with Whisper:
  - Speaker labels are mapped to Whisper transcript segments
  - Each segment is assigned to the speaker with maximum temporal overlap (see the sketch after this list)
  - Results in a structured transcript with accurate speaker attribution
- Benefits:
  - Improved speaker identification accuracy (10-20% error reduction)
  - More robust to different acoustic conditions and overlapping speakers
  - Enhanced transcript readability with clear speaker labels
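The overlap-based assignment can be illustrated with a small, self-contained sketch; the simple `(start, end, ...)` tuple shapes are assumptions for illustration, not the project's internal data structures:

```python
# Illustrative sketch: label each Whisper segment with the diarization speaker
# that overlaps it the most in time. Data shapes are assumptions.
from collections import defaultdict

def assign_speakers(segments, diarization):
    """segments: [(start, end, text)]; diarization: [(start, end, speaker)]."""
    labeled = []
    for seg_start, seg_end, text in segments:
        overlap = defaultdict(float)
        for turn_start, turn_end, speaker in diarization:
            # Duration of the intersection between the segment and the speaker turn.
            overlap[speaker] += max(0.0, min(seg_end, turn_end) - max(seg_start, turn_start))
        best = max(overlap, key=overlap.get) if overlap else "UNKNOWN"
        labeled.append({"start": seg_start, "end": seg_end, "speaker": best, "text": text})
    return labeled

# Example: a 0.0-4.2s segment overlapping SPEAKER_00 for 3.0s and SPEAKER_01 for 1.2s
# is attributed to SPEAKER_00.
# assign_speakers([(0.0, 4.2, "Welcome to the show.")],
#                 [(0.0, 3.0, "SPEAKER_00"), (3.0, 5.0, "SPEAKER_01")])
```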
To use the DOVER-Lap diarization feature:
- HuggingFace Token:
  - Create an account at HuggingFace
  - Accept the user agreements for the pyannote models used for diarization
  - Generate a token at HuggingFace Settings
  - Add the token to your `.env` file as `HUGGINGFACE_TOKEN=your-token-here` (a loading sketch follows this list)
- Enable Diarization:
  - Set `DIARIZATION=true` in your environment or `.env` file
  - The system will automatically use DOVER-Lap when diarization is enabled
- System Requirements:
  - Requires PyTorch and additional dependencies
  - Recommended: a GPU for faster processing of longer podcasts
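As a quick check that the token is picked up, a pyannote pipeline can be loaded directly like this. The exact model identifier and the use of `python-dotenv` are assumptions; PodNotes itself wires this up through its diarization service and fuses the result with PvFalcon via DOVER-Lap:

```python
# Sanity-check sketch: load a pyannote pipeline with the HuggingFace token.
# The model name below is an assumption; use the model whose user agreement you accepted.
import os
from dotenv import load_dotenv          # pip install python-dotenv
from pyannote.audio import Pipeline     # pip install pyannote.audio (needs PyTorch)

load_dotenv()  # reads HUGGINGFACE_TOKEN from .env
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HUGGINGFACE_TOKEN"],
)

diarization = pipeline("episode.wav")   # any local audio file
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:5.1f}s - {turn.end:5.1f}s  {speaker}")
```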
PodNotes supports fully containerized local development using Docker Compose. This will start the backend, frontend, and Ollama LLM services with a single command.
- Docker and Docker Compose installed
- Clone the repository:

      git clone https://github.com/yourusername/PodNotes.git
      cd PodNotes
- Configure environment variables:
  - Copy and edit the example env files as needed:

        cp backend/env.example backend/.env
        # Edit backend/.env with your credentials and settings

  - (Optional) Configure HuggingFace, AWS, and other tokens as needed in `backend/.env`.
- Build and start all services:

      docker-compose up --build
This will:
- Build and run the backend (FastAPI, ChromaDB integration, etc.) on port 8001
- Build and run the frontend (React, Vite) on port 80
- Start the Ollama LLM service on port 11434
- Access the application:
  - Frontend: http://localhost
  - Backend API docs: http://localhost:8001/docs
  - Ollama API: http://localhost:11434
- Stopping services:

      docker-compose down
- The backend is configured to talk to Ollama at `http://ollama:11434` (as defined in `docker-compose.yml`); a quick connectivity check follows these notes.
- Ollama model data is persisted in a Docker volume (`ollama_data`).
- Uploaded files and temporary data are mapped to the host for development convenience.
- You can run `docker-compose up --build` anytime you change code or dependencies.
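If the backend cannot reach the LLM, this small snippet (not part of the project) confirms from the host that the Ollama container is up and lists the models it has pulled, using Ollama's standard `/api/tags` endpoint:

```python
# Quick connectivity check for the Ollama service started by docker-compose.
import requests

resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("Ollama is reachable. Pulled models:", models or "none yet")
```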
PodNotes/
├── backend/ # FastAPI backend, AI services, vector DB logic
│ ├── services/ # Modular service files (AWS, ChromaDB, LangChain, Ollama, etc.)
│ ├── main.py # FastAPI app entrypoint
│ ├── Dockerfile # Backend Docker build config
│ └── ...
├── frontend/ # React (Vite) frontend
│ ├── src/ # React components, pages, utils
│ ├── Dockerfile # Frontend Docker build config
│ └── ...
├── docker-compose.yml # Multi-service orchestration (backend, frontend, ollama)
└── README.md # Project documentation
- backend/services/: Contains all major service modules (AWS, ChromaDB, LangChain, Ollama integration, etc.)
- frontend/src/: All frontend React/TypeScript code
- docker-compose.yml: Defines and networks all services for local development
Here is a reference for the provided docker-compose.yml:
    services:
      backend:
        build:
          context: ./backend
        env_file:
          - ./backend/.env
        environment:
          - IS_LOCAL=false
          - IS_AWS=${IS_AWS:-false}
          - MOCK_MODE=${MOCK_MODE:-false}
          - DIARIZATION=${DIARIZATION:-false}
          - VECTOR_STORE_DIR=/app/data/vector_stores
          - OLLAMA_BASE_URL=http://ollama:11434
        volumes:
          - ./backend/temp:/app/temp
        ports:
          - "8001:8001"
        restart: unless-stopped
        depends_on:
          - ollama
      frontend:
        build:
          context: ./frontend
        ports:
          - "80:80"
        depends_on:
          - backend
        environment:
          - VITE_API_URL=http://backend:8001
      ollama:
        image: ollama/ollama:latest
        ports:
          - "11434:11434"
        volumes:
          - ollama_data:/root/.ollama
        restart: unless-stopped
    volumes:
      ollama_data:

- Backend cannot connect to Ollama: Ensure the Ollama service is running and the backend is using `OLLAMA_BASE_URL=http://ollama:11434`.
- File permission errors: Make sure your host user has permission to write to the mapped `backend/temp` directory.
- Port conflicts: Make sure ports 80, 8001, and 11434 are free on your host.
- AWS/Cloud issues: Double-check your `.env` configuration and IAM permissions.
- Python 3.9+
- Node.js 18+
- AWS account (for production deployment)
- OpenAI API key (optional, for OpenAI models)
- Clone the repository:

      git clone https://github.com/yourusername/PodNotes.git
      cd PodNotes/backend

- Create and activate a virtual environment:

      python -m venv PN
      source PN/bin/activate  # On Windows: PN\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt

- Set up environment variables: Create a `.env` file in the backend directory with:

      IS_LOCAL=true

- Start the backend server:

      ./start_backend.sh
      # Or manually:
      # uvicorn main:app --reload --host 0.0.0.0 --port 8000
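Once the server reports it is running, a quick smoke test is to request FastAPI's auto-generated docs page (port 8000 matches the manual `uvicorn` command above; the Docker setup publishes the API on 8001 instead):

```python
# Smoke test: confirm the locally started backend is serving requests.
import requests

resp = requests.get("http://localhost:8000/docs", timeout=5)
print("Backend is up" if resp.ok else f"Unexpected status: {resp.status_code}")
```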
- Navigate to the frontend directory:

      cd ../frontend

- Install dependencies:

      npm install

- Start the development server:

      npm run dev

- Access the application: Open your browser and go to http://localhost:5173
- S3 Bucket:
  - Create an S3 bucket for storing audio files and transcripts
  - Configure CORS settings to allow frontend access
- DynamoDB:
  - Create a DynamoDB table named `Podcasts` with:
    - Primary key: `PodcastID` (String)
    - Sort key: `Type` (String)
  - Ensure appropriate IAM permissions for the backend to access this table (a boto3 sketch of this table definition follows this list)
- ChromaDB Deployment:
  - ChromaDB needs to be accessible by the backend. This could involve:
    - Running ChromaDB as a separate container/service (e.g., on ECS/EKS or a dedicated EC2 instance)
    - Running ChromaDB persistently on the same instance/container as the FastAPI backend (simpler, suitable for smaller scale)
  - Ensure the `CHROMA_HOST` and `CHROMA_PORT` environment variables point to the correct ChromaDB instance
- IAM Permissions:
  - Create an IAM user or role for the backend application
  - Grant the necessary permissions for S3 (GetObject, PutObject, ListBucket) and DynamoDB (GetItem, PutItem, Query, Scan)
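For reference, a hedged boto3 sketch of creating the `Podcasts` table with the key schema described above; the billing mode and region are assumptions, and the table can just as well be created through the console or infrastructure-as-code:

```python
# Sketch: create the Podcasts table with the key schema described above.
# Billing mode and region are assumptions; adjust to your environment.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")
dynamodb.create_table(
    TableName="Podcasts",
    KeySchema=[
        {"AttributeName": "PodcastID", "KeyType": "HASH"},   # primary (partition) key
        {"AttributeName": "Type", "KeyType": "RANGE"},       # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "PodcastID", "AttributeType": "S"},
        {"AttributeName": "Type", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
dynamodb.get_waiter("table_exists").wait(TableName="Podcasts")
print("Podcasts table is ready")
```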
- Set up an EC2 instance or other compute environment
- Install Docker
- Create a `.env` file on the server with production settings:

      IS_LOCAL=false
      AWS_REGION=your-aws-region
      # Add other necessary variables like LLM API keys, HuggingFace token, etc.
      CHROMA_HOST=your_chromadb_host_or_ip  # Or 127.0.0.1 if running on same instance
      CHROMA_PORT=your_chromadb_port        # e.g., 8000
      # Ensure AWS credentials are configured (e.g., via IAM role attached to EC2)

- Build and run the backend Docker container (you might need a Dockerfile):

      # Example docker run command (adapt as needed)
      docker run -d --env-file .env -p 8000:8000 your-backend-image-name

- (If running ChromaDB separately) Ensure the ChromaDB container/service is running and accessible (see the connectivity sketch after this list)
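When ChromaDB runs as a separate service, a quick way to confirm the backend host can actually reach it is the ChromaDB HTTP client; the defaults below are placeholders that mirror the `CHROMA_HOST`/`CHROMA_PORT` variables:

```python
# Connectivity check for a remote ChromaDB instance (values are placeholders).
import os
import chromadb

client = chromadb.HttpClient(
    host=os.environ.get("CHROMA_HOST", "127.0.0.1"),
    port=int(os.environ.get("CHROMA_PORT", "8000")),
)
print("Heartbeat:", client.heartbeat())  # nanosecond timestamp if reachable
print("Collections:", [c.name for c in client.list_collections()])
```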
- Build the frontend:

      npm run build

- Upload the contents of the `frontend/dist` directory to an S3 bucket configured for static website hosting
- (Optional) Configure CloudFront as a CDN in front of the S3 bucket for better performance and HTTPS
- Update frontend API endpoint: Ensure the frontend code points to the deployed backend URL
The `tests/` directory contains utilities for testing various components:

    # Test OpenSearch connectivity
    cd backend
    ./tests/opensearch/run_opensearch_test.sh

Common issues to check:

- AWS Credentials: Ensure correct IAM permissions and that credentials (or an IAM role) are properly configured for the backend environment
- DynamoDB Key Schema: Verify the `Podcasts` table uses `PodcastID` (String) as the HASH key and `Type` (String) as the RANGE key
- ChromaDB Connection: Check the `CHROMA_HOST` and `CHROMA_PORT` environment variables and network connectivity between the backend and ChromaDB
- LLM API Keys: Make sure API keys for Whisper, summarization, or Q&A models are correctly set in the environment
- HuggingFace Token: Required for the `pyannote` models used in diarization
Contributions are welcome! Please feel free to submit a Pull Request.
- Whisper for transcription
- LangChain for LLM orchestration
- ChromaDB for vector storage
- FastAPI for the backend framework
- React for the frontend framework
This project is licensed under the MIT License - see the LICENSE file for details