A real-time multimodal AI assistant that combines voice interaction with screen context awareness. Talk to the AI while it can see and understand what's on your screen.
- 🎤 Real-time Voice Chat - Continuous conversation with VAD (Voice Activity Detection)
- 🖥️ Screen Context Awareness - AI can see and analyze your screen when relevant
- 🧠 Smart Screen Triggers - Automatically captures screen based on conversation context
- ⚡ Fast Response Times - Optimized for real-time interaction
- 🔌 WebSocket Communication - Low-latency bidirectional communication
- Speech-to-Text: OpenAI Whisper API
- Text-to-Speech: OpenAI TTS API
- Multimodal AI: Google Gemini 2.0 Pro
- Voice Activity Detection: Silero VAD (via @ricky0123/vad-react)
- Frontend: React 18 + TypeScript + Vite + Chakra UI
- Backend: FastAPI + Python 3.11 + WebSockets
- Deployment: Docker + AWS EC2 + GitHub Actions CI/CD
- Development: Dev Containers + Hot Reload
# 1. Open in VS Code with Dev Container extension
# 2. The container automatically installs dependencies
# 3. Start services:
# Terminal 1 - Backend
cd backend && python main.py
# Terminal 2 - Frontend
cd frontend && npm run dev

# Backend setup
cd backend
uv pip install -r requirements.txt
cp env.example .env # Configure your API keys
python main.py
# Frontend setup (new terminal)
cd frontend
npm install
npm run dev

- Frontend: http://localhost:3000
- Backend: http://localhost:8000
- Health Check: http://localhost:8000/health
Copy backend/env.example to backend/.env:
# Required for full functionality
OPENAI_API_KEY=your_openai_api_key_here
GEMINI_API_KEY=your_gemini_api_key_here
# Optional
HUGGINGFACE_API_TOKEN=your_huggingface_token_here
SECRET_KEY=your_secret_key_here

Important: For voice features to work:
- Use HTTPS in production (required for microphone access)
- For local development: Chrome/Firefox will ask for microphone permission
- Allow microphone access when prompted
Why HTTPS? Browsers only allow microphone access (getUserMedia) in secure contexts, so production must be served over HTTPS; localhost is exempt during local development.
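As a rough illustration (not code from this repo), a frontend can verify up front that microphone access is possible; `requestMicrophone` below is a hypothetical helper built on the standard getUserMedia API:

```typescript
// Hypothetical helper: request microphone access and surface a useful error.
// getUserMedia is only exposed in secure contexts (HTTPS or localhost).
export async function requestMicrophone(): Promise<MediaStream> {
  if (!window.isSecureContext || !navigator.mediaDevices?.getUserMedia) {
    throw new Error("Microphone access requires HTTPS (or localhost).");
  }
  try {
    return await navigator.mediaDevices.getUserMedia({ audio: true });
  } catch (err) {
    // NotAllowedError usually means the user or a browser policy denied permission.
    throw new Error(`Microphone permission denied or unavailable: ${err}`);
  }
}
```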
🌐 Best Option: Cloudflare Tunnel with Custom Domain
# Run the migration script on your EC2 instance
ssh -i your-key.pem ec2-user@54.211.160.83
cd /opt/app/background-multimodal-llm
./deployment/scripts/migrate-to-cloudflare.sh

Benefits of Cloudflare Tunnel:
- ✅ Custom domain support (back-agent.com)
- ✅ No bandwidth limits or timeouts
- ✅ Better performance and reliability
- ✅ Enterprise-grade security
- ✅ Free or much cheaper than alternatives ($10/year vs $36-96/year)
Quick URL Updates:
# Update all configurations with new URL
./deployment/scripts/quick-update.sh https://your-domain.com

├── backend/                      # FastAPI backend
│   ├── main.py                   # Main application entry point
│   ├── models/                   # AI model integrations
│   ├── services/                 # Core services & managers
│   ├── env.example               # Environment configuration template
│   └── requirements.txt          # Python dependencies
├── frontend/                     # React frontend
│   ├── src/
│   │   ├── components/           # React components
│   │   ├── hooks/                # Custom React hooks
│   │   └── services/             # API service layers
│   └── package.json              # Node dependencies
├── deployment/                   # All deployment files
│   ├── docker-compose.dev.yml    # Development deployment
│   ├── scripts/setup-aws-dev.sh  # AWS infrastructure setup
│   └── infrastructure/           # CloudFormation templates
├── docs/                         # Documentation
└── .github/workflows/            # CI/CD pipelines
# Backend
cd backend
python main.py # Start development server
uv pip install -r requirements.txt # Install dependencies
# Frontend
cd frontend
npm run dev # Start development server
npm run build # Build for production
npm run preview # Preview production build
# Deployment
docker-compose -f deployment/docker-compose.dev.yml up -d  # Start with Docker

# View logs
docker-compose -f deployment/docker-compose.dev.yml logs -f
# Check specific service
docker-compose -f deployment/docker-compose.dev.yml logs backend
docker-compose -f deployment/docker-compose.dev.yml logs frontend
# Restart services
docker-compose -f deployment/docker-compose.dev.yml restart

# 1. Setup AWS infrastructure
chmod +x deployment/scripts/setup-aws-dev.sh
./deployment/scripts/setup-aws-dev.sh
# 2. Configure GitHub Secrets (for CI/CD)
# Go to: https://github.com/your-repo/settings/secrets/actions
# Add these secrets:
# - DEV_EC2_INSTANCE_IP: Your EC2 public IP
# - DEV_EC2_SSH_PRIVATE_KEY: Your EC2 private key content
# - OPENAI_API_KEY: Your OpenAI API key
# - GEMINI_API_KEY: Your Gemini API key
# 3. Deploy via GitHub Actions
git push origin main  # Triggers automatic deployment

We now use Cloudflare Tunnel instead of ngrok for HTTPS access:
# Run the migration script
chmod +x deployment/scripts/migrate-to-cloudflare.sh
./deployment/scripts/migrate-to-cloudflare.sh

Benefits of Cloudflare Tunnel vs ngrok:
- ✅ Custom domain support (back-agent.com)
- ✅ No bandwidth limits or timeouts
- ✅ Better performance and reliability
- ✅ Enterprise-grade security
- ✅ Free or much cheaper than ngrok ($10/year vs $36-96/year)
After migration:
- Frontend: https://back-agent.com
- Backend API: https://api.back-agent.com
- WebSockets: wss://api.back-agent.com/ws
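For reference, a browser client talks to that endpoint with the standard WebSocket API. The sketch below is illustrative only; the JSON message shape is an assumption, not the backend's actual protocol:

```typescript
// Minimal client connection to the backend WebSocket (message format is assumed).
const ws = new WebSocket("wss://api.back-agent.com/ws");

ws.onopen = () => {
  ws.send(JSON.stringify({ type: "text", content: "Hello from the browser" }));
};

ws.onmessage = (event: MessageEvent) => {
  console.log("Assistant message:", event.data); // text, audio, or status payloads
};

ws.onerror = (err) => console.error("WebSocket error:", err);
ws.onclose = () => console.log("Connection closed");
```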
- Push to main → Automatic production deployment
- Pull requests → Automatic testing
- Health checks → Automatic validation
- Rollback support → Safe deployments
- Development: ~$5-15/month (free tier eligible)
- Production: ~$25-50/month (depends on usage)
- Click "Start Voice Assistant"
- Grant microphone permission when prompted
- Start talking - VAD automatically detects speech
- AI responds with voice and text
- Smart Triggers: The AI automatically captures your screen when you say things like:
  - "Can you see my screen?"
  - "What's this error?"
  - "Help me with this"
- Manual Capture: Click "Share Screen" for continuous sharing (see the capture sketch below)
- Privacy: The screen is only captured when explicitly needed
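For background, browser screen capture goes through `navigator.mediaDevices.getDisplayMedia`. The sketch below shows one plausible way to grab a single frame as a JPEG for analysis; the function is hypothetical, not the repo's actual implementation:

```typescript
// Hypothetical sketch: capture one frame of the shared screen as a JPEG data URL.
async function captureScreenFrame(quality = 0.8): Promise<string> {
  const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
  const video = document.createElement("video");
  video.srcObject = stream;
  await new Promise((resolve) => (video.onloadedmetadata = resolve));
  await video.play();

  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  canvas.getContext("2d")!.drawImage(video, 0, 0);

  stream.getTracks().forEach((track) => track.stop()); // stop sharing immediately
  return canvas.toDataURL("image/jpeg", quality);      // base64 JPEG, quality 0-1
}
```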
- Clear Speech: Speak clearly for better transcription
- Context Clues: Use phrases like "look at this" to trigger screen capture
- Error Debugging: Say "what's wrong here?" while viewing errors
- Natural Conversation: Talk naturally - the AI understands context
Adjust in frontend/src/hooks/useVoiceAgent.ts:
const vadOptions = {
positiveSpeechThreshold: 0.8, // Higher = less sensitive
negativeSpeechThreshold: 0.2, // Lower = speech segments end less eagerly
minSpeechFrames: 3, // Minimum frames for speech detection
};
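For context, these thresholds correspond to options of the `useMicVAD` hook from @ricky0123/vad-react. The wrapper below is a rough sketch of how they might be wired up, not the repo's actual `useVoiceAgent` hook:

```typescript
import { useMicVAD } from "@ricky0123/vad-react";

// Rough sketch: forward each finished utterance (raw audio samples) to a callback.
export function useVoiceCapture(onUtterance: (audio: Float32Array) => void) {
  const vad = useMicVAD({
    positiveSpeechThreshold: 0.8,               // higher = harder to trigger speech start
    negativeSpeechThreshold: 0.2,               // lower = speech segments end less eagerly
    minSpeechFrames: 3,                         // ignore very short blips
    onSpeechEnd: (audio) => onUtterance(audio), // Float32Array of the utterance
  });

  return vad; // exposes flags such as userSpeaking plus start/pause controls
}
```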
Configure in backend/main.py:
SCREEN_TRIGGER_CONFIDENCE = 0.7  # Confidence threshold for auto-capture
SCREEN_CAPTURE_QUALITY = 0.8     # Image quality (0.1-1.0)

🎤 Microphone not working
- Ensure HTTPS (required in production)
- Check browser permissions
- Try refreshing the page
🖥️ Screen sharing not working
- Use Chrome/Firefox (Safari has limitations)
- Grant screen sharing permission when prompted
- Check for browser extensions that block screen capture
⚡ Slow responses
- Check your internet connection
- Verify API keys are configured
- Monitor backend logs for errors
🔌 Connection issues
- Check WebSocket connection in browser dev tools
- Verify backend is running on port 8000
- Check firewall settings
- Health Check: http://localhost:8000/health
- Performance: http://localhost:8000/performance
- Logs: docker-compose logs -f
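As a quick sanity check, the health endpoint can also be probed from a script; the snippet below is illustrative, and the JSON response shape is an assumption:

```typescript
// Hypothetical health probe against the local backend.
async function checkBackend(): Promise<void> {
  const res = await fetch("http://localhost:8000/health");
  if (!res.ok) throw new Error(`Health check failed: HTTP ${res.status}`);
  console.log("Backend healthy:", await res.json());
}

checkBackend().catch((err) => console.error(err));
```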