diff --git a/docs/daily.mdx b/docs/daily.mdx index a60d307..c8c4745 100644 --- a/docs/daily.mdx +++ b/docs/daily.mdx @@ -4,13 +4,514 @@ title: Daily import { IntegrationHeader } from '/snippets/integration-header.mdx' - -[Daily](https://daily.co/) is the team behind Pipecat, empowering developers to build voice agents at scale using ultra low latency, open source SDKs and enterprise reliability. Building the future of voice, video, and real-time AI, Daily helps you imagine and create innovative communication experiences with infrastructure built on WebRTC. +This guide demonstrates how to build a real-time voice agent using [Pipecat](https://github.com/pipecat-ai/pipecat), Daily's open-source framework for building voice agents. Rime provides natural-sounding speech synthesis. -Rime's text-to-speech (TTS) synthesis model is available through the Daily API. With Daily's Rime integration and the Pipecat framework, you can develop responsive AI voice applications that deliver natural, lifelike interactions. +You can mix and match different services for each component of your Pipecat pipeline. This tutorial uses: +- `silero` for voice activity detection (VAD) +- `gpt-4o-transcribe` for speech-to-text (STT) +- `gpt-4o-mini` for generating responses +- `rime` for text-to-speech (TTS) -View our [Rime Pipecat demo agents](https://github.com/rimelabs/rime-pipecat-agents) for ready-to-use examples, from basic voice agents to multilingual agents that switch languages dynamically. For more details on the Pipecat framework, visit [Pipecat's documentation](https://docs.pipecat.ai/getting-started/introduction). \ No newline at end of file +The result is a working voice agent that runs locally and opens in your browser. + +Demo of a voice agent conversation using Pipecat and Rime + +The guide uses the following Pipecat terminology: +- A **pipeline** is a sequence of frame processors. Audio frames flow in, are transcribed, processed by the LLM, and synthesized into speech, then flow back out. +- A **transport** handles real-time audio input and output (I/O). Pipecat supports multiple transports, including WebRTC (browser), WebSocket, and local audio devices. +- **Frame processors** are the building blocks. Each service (STT, LLM, and TTS, respectively) is a processor that transforms frames as they flow through the pipeline. + +If you'd like to experiment directly with Rime's TTS API before building a full voice agent, check out: [TTS in five minutes](/docs/quickstart-five-minute). + +## Step 1: Prerequisites + +Gather the following API keys and tools before starting: + +### 1.1 A Rime API key + +Sign up for a [Rime account](https://app.rime.ai/signup/) and copy your API key from the [API Tokens](https://app.rime.ai/tokens/) page. This enables access to the Rime API for generating TTS. + +### 1.2 An OpenAI API key + +Create an [OpenAI account](https://platform.openai.com/signup) and generate an API key from the [API keys](https://platform.openai.com/api-keys) page. This key enables STT transcription and LLM responses. + +### 1.3 Python + +Install [Python 3.10 or later](https://www.python.org/downloads/). Verify your installation by running the following command in your terminal: + +```bash +python --version +``` + +## Step 2: Project setup + +Set up your project folder, environment variables, and dependencies. 
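For reference, here is the project layout you'll end up with by the end of this guide (the `personality.py` file is optional and is added in [Step 5](#step-5-customize-your-agent)):

```
rime-pipecat-agent/
├── .env              # API keys (keep this file out of version control)
├── pyproject.toml    # project metadata and dependencies
├── agent.py          # the voice agent pipeline
└── personality.py    # optional custom system prompt (Step 5)
```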
+ +### 2.1 Create the project folder + +Create a new folder for your project and navigate into it: + +```bash +mkdir rime-pipecat-agent +cd rime-pipecat-agent +``` + +### 2.2 Set up environment variables + +In the new directory, create a file called `.env` and add the keys that you gathered in [Step 1](#step-1-prerequisites): + +``` +RIME_API_KEY=your_rime_api_key +OPENAI_API_KEY=your_openai_api_key +``` + +Replace the placeholder values with your actual API keys. + +### 2.3 Configure dependencies + +Install the `uv` package manager: + + +```bash macOS/Linux +curl -LsSf https://astral.sh/uv/install.sh | sh +``` + +```bash pip +pip install uv +``` + +```bash Homebrew (macOS) +brew install uv +``` + + +Create a `pyproject.toml` file and add the following dependencies to it: + +```toml +[project] +name = "rime-pipecat-agent" +version = "0.1.0" +requires-python = ">=3.10" +dependencies = [ + "python-dotenv>=1.1.1", + "pipecat-ai[openai,rime,silero,webrtc,runner]>=0.0.100", + "pipecat-ai-small-webrtc-prebuilt>=2.0.4", +] +``` + +Pipecat uses a plugin system where each service integration is a separate package. In this code, the extras in brackets (`[openai,rime,silero,webrtc,runner]`) install the following plugins: +- `openai` adds STT and LLM services for transcription and generating responses. +- `rime` adds a TTS service for synthesizing speech. +- `silero` adds VAD for detecting when the user starts and stops speaking. +- `webrtc` provides the transport for browser-based audio via WebRTC. +- `runner` adds a development runner that handles server setup and WebRTC connections. + +The `pipecat-ai-small-webrtc-prebuilt` package provides a ready-to-use browser client that connects to your agent. + +Then, install the dependencies by running this command: + +```bash +uv sync +``` + +## Step 3: Create the agent + +Create an `agent.py` file to contain all the code that gets your agent talking. If you're in a rush and just want to run it, skip to [Step 3.5: Full agent code](#3-5-full-agent-code). Otherwise, continue reading to code the agent step-by-step. + +### 3.1 Load environment variables and configure imports + +Add the following imports and initialization code to `agent.py`: + +```python +import os +from dotenv import load_dotenv + +from pipecat.pipeline.pipeline import Pipeline +from pipecat.pipeline.runner import PipelineRunner +from pipecat.pipeline.task import PipelineParams, PipelineTask +from pipecat.frames.frames import LLMMessagesAppendFrame +from pipecat.services.openai.stt import OpenAISTTService +from pipecat.services.openai.llm import OpenAILLMService +from pipecat.services.rime.tts import RimeNonJsonTTSService +from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext +from pipecat.audio.vad.silero import SileroVADAnalyzer +from pipecat.transports.base_transport import TransportParams +from pipecat.transports.smallwebrtc.transport import SmallWebRTCTransport +from pipecat.runner.run import main +from pipecat.runner.types import SmallWebRTCRunnerArguments + +load_dotenv() +``` + +Each import corresponds to a frame processor or utility: +- `Pipeline` chains processors together in sequence. +- `PipelineRunner` manages the event loop and runs the pipeline. +- `LLMMessagesAppendFrame` triggers the LLM to respond when queued. +- Services like `OpenAISTTService`, `OpenAILLMService`, and `RimeNonJsonTTSService` are the frame processors that do the actual work. +- `OpenAILLMContext` maintains conversation history across turns. 
+- `SileroVADAnalyzer` detects speech boundaries so the agent knows when you've finished talking. +- `SmallWebRTCTransport` handles peer-to-peer WebRTC connections for browser-based audio. +- `SmallWebRTCRunnerArguments` provides connection details when a user connects to the agent. + +### 3.2 Define the system prompt + +Add the following configuration below the imports: + +```python +SYSTEM_PROMPT = """You are a helpful voice assistant. +Keep your responses short and conversational - no more than 2-3 sentences. +Be friendly and natural.""" +``` + +This system prompt defines your agent's personality. It can be as simple or complex as you like. Later in the guide, you'll see an example of a detailed system prompt that fully customizes the agent's behavior. + +### 3.3 Code the conversation pipeline + +Add the following `bot` function to `agent.py`: + +```python +async def bot(runner_args: SmallWebRTCRunnerArguments): +``` + +The Pipecat runner automatically discovers any function named `bot` in your module. When a user connects via WebRTC, the runner calls this function and passes connection details through `runner_args`. + +Inside the `bot` function, add the WebRTC transport configuration: + +```python + transport = SmallWebRTCTransport( + runner_args.webrtc_connection, + TransportParams( + audio_in_enabled=True, + audio_out_enabled=True, + vad_analyzer=SileroVADAnalyzer(), + ), + ) +``` + +This creates the WebRTC transport and enables audio I/O as well as Silero VAD for detecting when the user starts and stops speaking. + +Next, add the AI services for transcription, response generation, and speech synthesis: + +```python + stt = OpenAISTTService( + api_key=os.getenv("OPENAI_API_KEY"), + model="gpt-4o-transcribe", + ) + + llm = OpenAILLMService( + api_key=os.getenv("OPENAI_API_KEY"), + model="gpt-4o-mini", + ) + + tts = RimeNonJsonTTSService( + api_key=os.getenv("RIME_API_KEY"), + voice_id="atrium", + model="arcana", + ) +``` + +These configure OpenAI for STT and LLM responses, and the Rime `arcana` model for TTS. + +Add the conversation context: + +```python + context = OpenAILLMContext( + messages=[{"role": "system", "content": SYSTEM_PROMPT}] + ) + context_aggregator = llm.create_context_aggregator(context) +``` + +This maintains the conversation history, so the LLM can reference previous messages. + +Add the pipeline that connects all the components: + +```python + pipeline = Pipeline([ + transport.input(), + stt, + context_aggregator.user(), + llm, + tts, + transport.output(), + context_aggregator.assistant(), + ]) +``` + +Frames flow through the processors in order. + +1. **Audio in:** Raw microphone input from the user +2. **Transcription:** Converting speech to text via the STT provider +3. **User context:** Aggregating the user's message into the conversation history +4. **LLM response:** Generating a reply based on the conversation so far +5. **Speech synthesis:** Converting the LLM's text response to audio via Rime TTS +6. **Audio out:** Streaming the synthesized speech back to the user +7. **Assistant context:** Recording the assistant's response in the conversation history + +The context aggregator appears twice to capture both sides of the conversation. 
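To make the two context aggregators concrete, here is an illustrative sketch (not part of `agent.py`) of the shape the conversation history takes after one exchange. The user and assistant turns below are placeholder text; `context_aggregator.user()` appends the transcribed user message, and `context_aggregator.assistant()` appends the agent's reply:

```python
# Illustrative only: roughly what the conversation context holds after one round trip.
conversation_history = [
    {"role": "system", "content": "You are a helpful voice assistant. ..."},  # from SYSTEM_PROMPT
    {"role": "user", "content": "What can you help me with?"},                # appended by context_aggregator.user()
    {"role": "assistant", "content": "I can answer questions and chat."},     # appended by context_aggregator.assistant()
]
```

On the next turn, the LLM receives this full list, which is how the agent remembers earlier parts of the conversation.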
+ +Finally, add the task runner and an event handler for greeting the user: + +```python + task = PipelineTask( + pipeline, + params=PipelineParams(enable_metrics=True), + ) + + @transport.event_handler("on_client_connected") + async def on_client_connected(transport, client): + await task.queue_frames([LLMMessagesAppendFrame( + messages=[{"role": "system", "content": "Say hello and introduce yourself."}], + run_llm=True + )]) + + runner = PipelineRunner(handle_sigint=runner_args.handle_sigint) + await runner.run(task) +``` + +The `on_client_connected` event fires when a user connects to the agent. It appends a system message prompting the LLM to greet the user and triggers an immediate response with `run_llm=True`. + +### 3.4 Create the main entrypoint + +Add the following code at the bottom of `agent.py`: + +```python +if __name__ == "__main__": + main() +``` + +Pipecat's `main` helper from `pipecat.runner.run` automatically: +- Discovers the `bot` function in your module +- Starts a FastAPI server with WebRTC endpoints +- Serves a prebuilt browser client at `/client` +- Sets up the WebRTC connection and passes the connection to your `bot` function + +When you run the agent, Pipecat starts a local HTTP server. Open the browser client to connect via WebRTC. The server runs locally, but the agent makes API calls to OpenAI and Rime. + +### 3.5 Full agent code + +At this point, your `agent.py` file should look like the complete example below: + + +```python +import os +from dotenv import load_dotenv + +from pipecat.pipeline.pipeline import Pipeline +from pipecat.pipeline.runner import PipelineRunner +from pipecat.pipeline.task import PipelineParams, PipelineTask +from pipecat.frames.frames import LLMMessagesAppendFrame +from pipecat.services.openai.stt import OpenAISTTService +from pipecat.services.openai.llm import OpenAILLMService +from pipecat.services.rime.tts import RimeNonJsonTTSService +from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext +from pipecat.audio.vad.silero import SileroVADAnalyzer +from pipecat.transports.base_transport import TransportParams +from pipecat.transports.smallwebrtc.transport import SmallWebRTCTransport +from pipecat.runner.run import main +from pipecat.runner.types import SmallWebRTCRunnerArguments + +load_dotenv() + +SYSTEM_PROMPT = """You are a helpful voice assistant. +Keep your responses short and conversational - no more than 2-3 sentences. 
+Be friendly and natural.""" + + +async def bot(runner_args: SmallWebRTCRunnerArguments): + transport = SmallWebRTCTransport( + runner_args.webrtc_connection, + TransportParams( + audio_in_enabled=True, + audio_out_enabled=True, + vad_analyzer=SileroVADAnalyzer(), + ), + ) + + stt = OpenAISTTService( + api_key=os.getenv("OPENAI_API_KEY"), + model="gpt-4o-transcribe", + ) + + llm = OpenAILLMService( + api_key=os.getenv("OPENAI_API_KEY"), + model="gpt-4o-mini", + ) + + tts = RimeNonJsonTTSService( + api_key=os.getenv("RIME_API_KEY"), + voice_id="atrium", + model="arcana", + ) + + context = OpenAILLMContext( + messages=[{"role": "system", "content": SYSTEM_PROMPT}] + ) + context_aggregator = llm.create_context_aggregator(context) + + pipeline = Pipeline([ + transport.input(), + stt, + context_aggregator.user(), + llm, + tts, + transport.output(), + context_aggregator.assistant(), + ]) + + task = PipelineTask( + pipeline, + params=PipelineParams(enable_metrics=True), + ) + + @transport.event_handler("on_client_connected") + async def on_client_connected(transport, client): + await task.queue_frames([LLMMessagesAppendFrame( + messages=[{"role": "system", "content": "Say hello and introduce yourself."}], + run_llm=True + )]) + + runner = PipelineRunner(handle_sigint=runner_args.handle_sigint) + await runner.run(task) + + +if __name__ == "__main__": + main() +``` + + +## Step 4: Test your agent + +The full pipeline is now ready for you to test. You can run the agent from the terminal using `uv` and interact with it in your browser. + +### 4.1 Start the agent + +Run the following command to start your agent: + +```bash +uv run agent.py +``` + +You'll see output indicating the server is starting. + +### 4.2 Connect to your agent + +Open a browser and navigate to `http://localhost:7860/client`. Allow microphone access when prompted. + +You can now talk to your agent using your microphone. + +## Step 5: Customize your agent + +Now that your agent is running, you can experiment with different voices and personalities. + +### 5.1 Change the voice + +Update the `tts` initialization in your `bot` function to try a different voice: + +```python +tts = RimeNonJsonTTSService( + api_key=os.getenv("RIME_API_KEY"), + voice_id="celest", + model="arcana", +) +``` + +Rime offers many voices with different personalities. See the full list on the [Voices](/docs/voices) page. + +### 5.2 Fine-tune agent personalities + +Create a new file called `personality.py` with the following content: + + +```python +SYSTEM_PROMPT = """ +CHARACTER: +You are Detective Marlowe, a world-weary noir detective from the 1940s who +somehow ended up as an AI assistant. You treat every question like it's a +case to be cracked and speak in dramatic, hard-boiled metaphors. + +PERSONALITY: +- Cynical but secretly caring underneath the tough exterior +- Treats mundane tasks like high-stakes mysteries +- References your "years on the force" and "cases that still haunt you" +- Suspicious of technology but grudgingly impressed by it +- Has strong opinions about coffee and rain + +SPEECH STYLE: +- Keep responses to 2-3 sentences maximum +- Use noir metaphors like "this code is messier than a speakeasy on a Saturday night" +- Dramatic pauses with "..." for effect +- Call the user "kid" or "pal" occasionally +- End with ominous or philosophical observations + +RESTRICTIONS: +- Never break character +- Don't use emojis or special characters +- Stay family-friendly despite the noir tone +""" + +INTRO_MESSAGE = "The name's Marlowe... 
I've seen things that would make your code freeze, pal. So what case are you bringing to my desk tonight?" +``` + + +Update your `agent.py` to import and use this prompt: + +```python +from personality import SYSTEM_PROMPT, INTRO_MESSAGE +``` + +Then update the `on_client_connected` handler to use your custom intro message: + +```python +@transport.event_handler("on_client_connected") +async def on_client_connected(transport, client): + await task.queue_frames([LLMMessagesAppendFrame( + messages=[{"role": "system", "content": f"Say: {INTRO_MESSAGE}"}], + run_llm=True + )]) +``` + +Storing your system prompt in a separate file keeps your personality configuration separate from your agent logic, making it easy to experiment with different characters. + +## Next steps + +Pipecat's modular design makes it easy to swap components. Experiment with your agent by: +- Replacing OpenAI with another STT provider, such as Deepgram or AssemblyAI +- Using a different LLM, such as Anthropic, Gemini, or a local model +- Switching transports to use WebSocket for server-to-server or Daily's hosted rooms for production deployments + +To learn more about the Pipecat framework, including its transport options, deployment patterns, and advanced features, browse the [Pipecat documentation](https://docs.pipecat.ai/getting-started/introduction). + +View Rime's [Pipecat demo agents](https://github.com/rimelabs/rime-pipecat-agents) for a ready-to-use multilingual agent example that switches languages dynamically. + +## Troubleshooting + +If you encounter problems while following this guide, consult the quick fixes below. + +### No audio output and other TTS errors + +- **Check your TTS service class:** The `arcana` model requires `RimeNonJsonTTSService`. If you see WebSocket HTTP `400` errors in the logs, you may be using `RimeTTSService` (which is only compatible with models like `mistv2`). +- **Verify your Rime API key:** Ensure the key is valid and has TTS permissions. + +### The agent doesn't respond to speech + +- **Check microphone permissions:** Ensure you've enabled microphone access in your browser. +- **Verify VAD is working:** Look for logs indicating speech detection. If the logs are missing, check your Silero installation. +- **Test audio input:** Use a different microphone or headset. + +### "API key not set" errors + +- **Check environment variables:** Ensure you set all keys correctly (with no extra spaces) in `.env`. +- **Verify the `.env` file location:** The file should be in the same directory as `agent.py`. + +### Audio quality issues + +- **Check your microphone:** Test using a different input device or headset. +- **Reduce background noise:** The VAD may struggle to detect speech in noisy environments. diff --git a/docs/replit.mdx b/docs/replit.mdx index 952c5ac..bedd295 100644 --- a/docs/replit.mdx +++ b/docs/replit.mdx @@ -4,82 +4,87 @@ title: Replit import { IntegrationHeader } from '/snippets/integration-header.mdx' - -[Replit](https://replit.com/) is an AI-powered platform that lets you create and publish apps from a single browser tab. No local setup, no environment headaches. Just describe what you want to build, and Replit's AI agent handles the heavy lifting. +This guide demonstrates how to add Rime's text-to-speech (TTS) functionality to apps built with [Replit](https://replit.com/), an AI-powered platform that generates code from users' text descriptions. Rime is a TTS API that generates natural-sounding voice audio. 
When your Replit app needs to speak, it sends text to Rime's API and receives audio back. -With Rime's text-to-speech integration, you can add natural, lifelike voices to your Replit apps in minutes. Whether you're building a voiceover tool, an accessibility feature, or an AI assistant, Rime + Replit makes it surprisingly simple. - -## What You'll Need - -Before you start, make sure you have: - -- A [Replit account](https://replit.com/signup) -- A Rime API key from [app.rime.ai/tokens](https://app.rime.ai/tokens/) +