Realtime Voice AI Agent

A web application that enables real-time voice conversations with an AI agent using Azure's Realtime API (Microsoft Foundry). The agent can use MCP tools to access Microsoft Learn documentation, supports interruptions, and maintains natural conversation flow.

Features

  • Web-based Interface: Accessible from a web browser; no client installation required
  • Real-time Voice: Bidirectional audio streaming via WebSocket
  • Interruptions: User can interrupt agent mid-speech
  • MCP Tool Integration: Agent can access Microsoft Learn documentation via native Azure Realtime API MCP support
  • Natural Flow: Conversation continues until explicit goodbye
  • Comprehensive Logging: All interactions logged with timestamps
  • Browser Audio: Uses Web Audio API for microphone capture and audio playback
  • Audio Visualization: Real-time visual feedback for both user and agent speech
  • Error Handling: Graceful error handling and recovery
  • Session Management: Support for multiple concurrent sessions

Prerequisites

  • Python 3.9 or higher
  • Azure OpenAI account with Realtime API access
  • Azure OpenAI resource with gpt-realtime model deployed

Quick Start

  1. Clone the repository

    git clone <repository-url>
    cd realtime-speech
  2. Create virtual environment

    python -m venv venv
    # Windows
    venv\Scripts\activate
    # Linux/Mac
    source venv/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Configure environment

    copy .env.example .env
    # Linux/Mac: cp .env.example .env
    # Edit .env with your credentials
  5. Verify environment

    python verify_env.py
  6. Start the server

    python app.py
  7. Access the web interface

    • Open browser to http://localhost:8000
    • Grant microphone permissions when prompted
    • Click "Start Conversation" button

Detailed Setup Instructions

1. Prerequisites

Python Installation

  • Ensure Python 3.9+ is installed
  • Verify with: python --version

Azure OpenAI Setup

  1. Create an Azure OpenAI resource in the Azure portal
  2. Deploy the gpt-realtime model in your resource
  3. Get your endpoint URL and API key from the Azure portal
  4. Note your deployment name (typically gpt-realtime)

MCP Server (Optional)

The Microsoft Learn MCP server is pre-configured and requires no setup. It provides free access to official Microsoft documentation.

To use a different MCP server, update these settings in .env:

  • MCP_SERVER_URL: URL of your MCP server
  • MCP_SERVER_LABEL: Label for the MCP server
  • MCP_REQUIRE_APPROVAL: Approval mode (never, always, or auto)

The agent has access to the following tools via the Microsoft Learn MCP Server:

Tool                           Description
microsoft_docs_search          Semantic search against Microsoft official technical documentation
microsoft_docs_fetch           Fetch and convert a Microsoft documentation page into markdown
microsoft_code_sample_search   Search for official Microsoft/Azure code snippets and examples
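
Under the hood, the MCP server is passed to the Realtime API as a native tool definition rather than handled manually. The sketch below illustrates what that session configuration might look like; the field names mirror the MCP_* settings above, but the exact payload assembled in realtime_client.py may differ.

import os

# Sketch of a native MCP tool entry in the Realtime session configuration.
# Field names mirror the MCP_* settings above; the payload actually built
# in realtime_client.py may differ.
mcp_tool = {
    "type": "mcp",
    "server_label": os.getenv("MCP_SERVER_LABEL", "microsoft-learn"),
    "server_url": os.getenv("MCP_SERVER_URL", "https://learn.microsoft.com/api/mcp"),
    "require_approval": os.getenv("MCP_REQUIRE_APPROVAL", "never"),
}

session_update = {
    "type": "session.update",
    "session": {"tools": [mcp_tool]},
}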

2. Installation Steps

Step 1: Clone/Download Repository

git clone <repository-url>
cd realtime-speech

Step 2: Create Virtual Environment

python -m venv venv

Windows:

venv\Scripts\activate

Linux/Mac:

source venv/bin/activate

Step 3: Install Dependencies

pip install -r requirements.txt

Step 4: Configure Environment

  1. Copy the example environment file:

    copy .env.example .env
    # Linux/Mac: cp .env.example .env
  2. Edit .env and fill in your values (a sample sketch follows this list):

    • AZURE_OPENAI_ENDPOINT: Your Azure OpenAI endpoint URL
    • AZURE_OPENAI_API_KEY: Your Azure OpenAI API key
    • AZURE_OPENAI_DEPLOYMENT_NAME: Your deployment name (usually gpt-realtime)
    • AZURE_OPENAI_API_VERSION: API version (e.g., 2025-08-28)
    • MCP_SERVER_URL: MCP server URL (optional; defaults to https://learn.microsoft.com/api/mcp)
    • MCP_SERVER_LABEL: Label for the MCP server (optional; e.g., microsoft-learn)
    • MCP_REQUIRE_APPROVAL: Tool approval mode (optional; never, always, or auto)
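
A filled-in .env might look like the sketch below (all values are placeholders; use your own endpoint, key, and deployment name):

AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-realtime
AZURE_OPENAI_API_VERSION=2025-08-28
MCP_SERVER_URL=https://learn.microsoft.com/api/mcp
MCP_SERVER_LABEL=microsoft-learn
MCP_REQUIRE_APPROVAL=never
LOG_LEVEL=INFO
LOG_FILE=logs/conversation.log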

Step 5: Verify Environment

python verify_env.py

This will check:

  • ✓ .env file exists
  • ✓ All required environment variables are set
  • ✓ MCP server configuration is present
  • ✓ Optional: Test Azure API connection (with --test-api)
  • ✓ Optional: Test MCP server (with --test-mcp)
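
A simplified sketch of the required-variable part of that check, assuming python-dotenv for .env loading (the real verify_env.py does more, including the optional API and MCP tests):

import os
from dotenv import load_dotenv

# Simplified sketch of the required-variable check; the real verify_env.py
# also offers the optional --test-api and --test-mcp connectivity tests.
REQUIRED_VARS = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_API_VERSION",
]

def check_required_vars() -> bool:
    load_dotenv()  # pull values from .env into the process environment
    missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
    for name in missing:
        print(f"Missing required variable: {name}")
    return not missing

if __name__ == "__main__":
    print("Environment OK" if check_required_vars() else "Environment incomplete")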

3. Running the Application

Start the Server

Option 1: Using the app.py script

python app.py

Option 2: Using uvicorn directly

uvicorn app:app --host 0.0.0.0 --port 8000

Access the Web Interface

  1. Open your browser to http://localhost:8000
  2. You should see the conversation interface
  3. Click "Start Conversation" to begin
  4. Grant microphone permissions when prompted

Using the Application

  • Start Conversation: Click the "Start Conversation" button
  • Stop Conversation: Click "Stop Conversation" to end the session
  • Interrupt Agent: Simply start speaking while the agent is talking
  • End Conversation: Say "goodbye", "bye", "exit", or "quit"
  • View Transcripts: See real-time transcripts in the transcript area
  • Audio Visualization: Watch the visual feedback for both your voice and the agent's voice
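
Goodbye detection is a plain transcript check on the server side. A hypothetical sketch is shown below; the actual logic lives in conversation_manager.py and may differ.

# Hypothetical sketch of explicit-goodbye detection on the user transcript.
# The actual implementation in conversation_manager.py may differ.
GOODBYE_PHRASES = {"goodbye", "bye", "exit", "quit"}

def is_goodbye(transcript: str) -> bool:
    words = [word.strip(".,!?") for word in transcript.lower().split()]
    return any(word in GOODBYE_PHRASES for word in words)

assert is_goodbye("Okay, goodbye!")
assert not is_goodbye("Tell me about Azure Container Apps")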

Configuration Reference

Environment Variables

Variable                      Required  Description                                    Example
AZURE_OPENAI_ENDPOINT         Yes       Azure OpenAI endpoint URL                      https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY          Yes       Azure OpenAI API key                           your-api-key
AZURE_OPENAI_DEPLOYMENT_NAME  Yes       Deployment name                                gpt-realtime
AZURE_OPENAI_API_VERSION      Yes       API version                                    2025-08-28
MCP_SERVER_URL                No        MCP server URL (Azure native support)          https://learn.microsoft.com/api/mcp
MCP_SERVER_LABEL              No        Label for the MCP server                       microsoft-learn
MCP_REQUIRE_APPROVAL          No        Tool approval mode (never, always, or auto)    never
LOG_LEVEL                     No        Logging level (DEBUG, INFO, WARNING, ERROR)    INFO
LOG_FILE                      No        Log file path                                  logs/conversation.log
WEB_HOST                      No        Web server host                                0.0.0.0
WEB_PORT                      No        Web server port                                8000

Default Values

  • LOG_LEVEL: INFO
  • LOG_FILE: logs/conversation.log
  • WEB_HOST: 0.0.0.0
  • WEB_PORT: 8000
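
For reference, the sketch below shows how these variables and defaults can be read in Python, assuming python-dotenv for .env loading; the project's config.py is the source of truth and may organize this differently.

import os
from dotenv import load_dotenv

# Minimal sketch of environment loading with the defaults listed above;
# the project's config.py may organize this differently.
load_dotenv()

AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]                # required
AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]                  # required
AZURE_OPENAI_DEPLOYMENT_NAME = os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"]  # required

LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
LOG_FILE = os.getenv("LOG_FILE", "logs/conversation.log")
WEB_HOST = os.getenv("WEB_HOST", "0.0.0.0")
WEB_PORT = int(os.getenv("WEB_PORT", "8000"))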

Troubleshooting

Common Issues

Audio Permission Denied

Problem: Browser blocks microphone access

Solution:

  • Check browser settings for microphone permissions
  • Ensure you're using HTTPS or localhost (required for getUserMedia)
  • Try a different browser (Chrome, Firefox, Edge recommended)

WebSocket Connection Failed

Problem: Cannot connect to server

Solution:

  • Verify server is running on correct port
  • Check firewall settings
  • Ensure WEB_HOST and WEB_PORT are correct
  • Check browser console for errors

Azure API Errors

Problem: 401 Unauthorized or 404 Not Found

Solution:

  • Verify AZURE_OPENAI_ENDPOINT is correct
  • Check AZURE_OPENAI_API_KEY is valid
  • Ensure AZURE_OPENAI_DEPLOYMENT_NAME matches your deployment
  • Verify gpt-realtime model is deployed in your resource

MCP Tools Not Working

Problem: MCP tools not being called or returning errors

Solution:

  • Verify MCP_SERVER_URL is correct (default: https://learn.microsoft.com/api/mcp)
  • Check network connectivity to the MCP server
  • Verify Azure Realtime API version supports MCP (2025-08-28 or later recommended)
  • Check logs for MCP-related errors
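
If you suspect a network problem, the snippet below is a quick reachability check for the configured MCP server host; any HTTP response, even an error status, means the network path works. It does not exercise the MCP protocol itself.

import os
import urllib.error
import urllib.request

# Quick reachability check for the configured MCP server host.
# Any HTTP response (even 4xx/5xx) means the network path works;
# this does not exercise the MCP protocol itself.
url = os.getenv("MCP_SERVER_URL", "https://learn.microsoft.com/api/mcp")
try:
    with urllib.request.urlopen(url, timeout=10) as response:
        print(f"Reachable: HTTP {response.status}")
except urllib.error.HTTPError as exc:
    print(f"Reachable: HTTP {exc.code} (server responded)")
except OSError as exc:
    print(f"Not reachable: {exc}")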

No Audio Visualization

Problem: Audio visualizations not showing

Solution:

  • Check browser console for JavaScript errors
  • Ensure microphone permissions are granted
  • Try refreshing the page
  • Check browser compatibility (Chrome, Firefox, Edge recommended)

Browser Compatibility

  • Chrome/Edge: Full support (recommended)
  • Firefox: Full support
  • Safari: Limited support (may have audio issues)
  • Mobile browsers: Limited support

Development

Running in Development Mode

uvicorn app:app --host 0.0.0.0 --port 8000 --reload

Enabling Debug Logging

Set in .env:

LOG_LEVEL=DEBUG
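
Internally, LOG_LEVEL and LOG_FILE map onto standard Python logging. The sketch below is illustrative only; logger.py may configure handlers and formats differently.

import logging
import os
from pathlib import Path

# Illustrative logging setup driven by LOG_LEVEL and LOG_FILE;
# the project's logger.py may configure handlers and formats differently.
log_file = Path(os.getenv("LOG_FILE", "logs/conversation.log"))
log_file.parent.mkdir(parents=True, exist_ok=True)

logging.basicConfig(
    level=os.getenv("LOG_LEVEL", "INFO"),
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    handlers=[logging.FileHandler(log_file), logging.StreamHandler()],
)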

Project Structure

realtime-speech/
├── app.py                  # FastAPI application entry point
├── config.py               # Configuration management
├── websocket_handler.py    # WebSocket connection management
├── realtime_client.py      # Azure Realtime API client
├── mcp_client.py           # MCP protocol client
├── tool_handler.py         # Tool registration and execution
├── conversation_manager.py # Conversation state management
├── logger.py               # Logging configuration
├── verify_env.py           # Environment verification
├── static/                 # Static files
│   ├── index.html          # Web interface
│   ├── app.js              # Frontend JavaScript
│   └── styles.css          # CSS styles
├── logs/                   # Log files
├── requirements.txt        # Python dependencies
├── .env.example            # Example configuration
└── README.md               # This file
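
To make the layout concrete, the sketch below shows how app.py could wire these pieces together with FastAPI. Route names and handler interfaces are assumptions; the real application delegates the WebSocket traffic to websocket_handler.py and realtime_client.py.

import uvicorn
from fastapi import FastAPI, WebSocket
from fastapi.staticfiles import StaticFiles

# Minimal wiring sketch; route names and handler interfaces are assumptions,
# and the real app.py is more complete.
app = FastAPI()

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    # In the real app, websocket_handler.py bridges browser audio to the
    # Azure Realtime API via realtime_client.py.
    async for message in websocket.iter_text():
        await websocket.send_text(message)  # placeholder echo

# Serve index.html, app.js, and styles.css from static/
app.mount("/", StaticFiles(directory="static", html=True), name="static")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)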

Examples and Use Cases

Example Conversations

User: "What are the Azure CLI commands to create an Azure Container App?"

Agent: [Calls microsoft_docs_search tool] "According to the official Microsoft documentation, you can create an Azure Container App using..."

User: "Show me how to implement IHttpClientFactory in .NET 8"

Agent: [Calls microsoft_code_sample_search tool] "Here's an example from the official Microsoft documentation..."

MCP Tool Integration

The agent automatically uses Microsoft Learn MCP tools when relevant:

  • microsoft_docs_search: Semantic search against Microsoft official documentation
  • microsoft_docs_fetch: Fetch full content from a specific documentation page
  • microsoft_code_sample_search: Search for official code samples and examples

These tools are executed automatically by the Azure Realtime API; no manual tool handling is required.

License

[Add your license here]

Support

For issues and questions:

  1. Check the troubleshooting section above
  2. Review the logs in logs/conversation.log
  3. Run python verify_env.py to verify configuration
  4. Open an issue in the repository
