An autonomous AI agent with multi-modal capabilities (vision + voice + chat) that controls a remote Linux desktop environment through natural language. Built with Claude's Computer Use API, Deepgram live transcription, and E2B Desktop Sandbox.
- 🤖 Fully autonomous agentic loops - Agent perceives, reasons, acts, and adapts based on visual feedback
- 🎤 Multi-modal input - Voice (via Deepgram) and text chat interfaces
- 👁️ Vision-powered control - Agent analyzes screenshots to plan and execute actions
- 🖱️ Computer use tools - Mouse clicks, keyboard input, bash commands, file editing
- 🖥️ Real-time desktop streaming - Live Linux desktop (Xfce) streamed to browser via VNC
- 🔄 Persistent sessions - Reconnect to existing sandbox sessions
- 📋 Clipboard integration - Read/write clipboard access
- ⚡ Streaming responses - Real-time agent reasoning and action updates
The application implements a complete autonomous agent system with perception-action loops:
```
┌──────────────────────────────────────────────────────────────────────┐
│                          User Input Layer                            │
│  Voice Input ──► Deepgram ──► WebSocket ──► Live Transcription       │
│  Text Input ──────────────────────────────► Chat Interface           │
└─────────────────────────────────┬────────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────────┐
│                      Agent Orchestration Layer                       │
│        Next.js API Routes + Server-Sent Events (SSE) Streaming       │
└─────────────────────────────────┬────────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────────┐
│                        Agentic Loop (Claude)                         │
│                                                                      │
│  1. Perception: Take screenshot of desktop                           │
│  2. Reasoning:  Analyze visual state + user intent                   │
│  3. Planning:   Decide which tool(s) to use                          │
│  4. Action:     Execute computer/bash/editor tools                   │
│  5. Feedback:   Capture new screenshot                               │
│  6. Iterate:    Loop until task complete                             │
│                                                                      │
│  Tools: computer_use (mouse/keyboard), bash, text_editor             │
└─────────────────────────────────┬────────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────────┐
│                       Desktop Execution Layer                        │
│      E2B Desktop Sandbox - Isolated Linux VM with VNC streaming      │
│       • Resolution scaling for Claude's vision API                   │
│       • Action executor (clicks, typing, scrolling, bash)            │
│       • Screenshot capture and base64 encoding                       │
└──────────────────────────────────────────────────────────────────────┘
```
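The resolution scaling in the execution layer can be sketched in TypeScript. This is a hedged illustration, not the project's actual code: the `1280×800` target box and the helper names (`scaleFactor`, `toDisplayCoords`) are assumptions, based on the general approach of downscaling screenshots for the vision model and mapping the model's coordinates back to real display pixels.

```typescript
// Sketch of a resolution scaler (illustrative, not the project's actual API).
// Screenshots are downscaled to fit a target box before being sent to the
// vision model; coordinates the model returns are mapped back to the display.

const MAX_WIDTH = 1280;  // assumed target width for the vision API
const MAX_HEIGHT = 800;  // assumed target height

interface Size { width: number; height: number; }

// Uniform scale factor that fits the display inside the target box
// (never upscale: cap at 1).
function scaleFactor(display: Size): number {
  return Math.min(MAX_WIDTH / display.width, MAX_HEIGHT / display.height, 1);
}

// Map a coordinate from the scaled screenshot back to real display pixels.
function toDisplayCoords(x: number, y: number, display: Size): [number, number] {
  const f = scaleFactor(display);
  return [Math.round(x / f), Math.round(y / f)];
}
```

For example, on a 1920×1080 display the factor is 2/3, so a click the model places at (640, 360) in the screenshot lands at (960, 540) on the real desktop.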
- Voice Pipeline: Browser MediaRecorder → WebSocket → Deepgram Live API → Real-time transcript
- Agent Provider: Claude Agent with Computer Use API (beta 2025-01-24)
- Action Executor: Translates agent decisions into desktop interactions
- Resolution Scaler: Adapts between display resolution and Claude's vision constraints
- Streaming Protocol: SSE for real-time agent reasoning, actions, and status updates
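The SSE streaming protocol can be sketched as a small framing helper. This is a minimal illustration, assuming event names like `reasoning`, `action`, and `status` for the agent's updates; the real event shapes may differ.

```typescript
// Minimal sketch of SSE framing for agent events (event names are assumed).
type AgentEvent =
  | { type: "reasoning"; text: string }
  | { type: "action"; tool: string; input: unknown }
  | { type: "status"; state: "running" | "done" | "error" };

// Encode one event as an SSE frame: an "event:" line, a "data:" line,
// and a blank line terminating the frame.
function toSSEFrame(ev: AgentEvent): string {
  return `event: ${ev.type}\ndata: ${JSON.stringify(ev)}\n\n`;
}
```

On the browser side, an `EventSource` (or a streamed `fetch`) can subscribe to these named events and render reasoning and actions as they arrive.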
The agent operates in a continuous perception-action loop:
- User sends a command (voice or text) - Natural language instruction
- Agent initializes the sandbox - Spins up an isolated Linux VM if needed
- Agentic loop begins:
  - Agent takes a screenshot of the desktop
  - Claude analyzes the visual state and user intent
  - Claude plans which computer use tools to invoke
  - Agent executes the actions (mouse clicks, typing, bash commands)
  - Agent takes a new screenshot to verify the results
  - Claude reasons about the next steps
  - The loop continues until the task is complete or the user intervenes
- Desktop streams live - User watches the agent work in real time via the VNC iframe
- Frontend: Next.js
- Agent: Claude Agent with Computer Use tools
- Voice: Deepgram live transcription
- Sandbox: E2B Desktop Sandbox (isolated Linux VM with VNC)
- Streaming: WebSocket (voice), Server-Sent Events (agent responses)
- Tools: Computer use, Bash execution, Text editor
- Node.js 20+
- E2B API key (get one here)
- Anthropic API key (get one here)
- Deepgram API key (get one here)
- Clone and install dependencies

```bash
bun install
```

- Configure environment variables

Create `.env.local` from `env.example`:

```bash
cp env.example .env.local
```

Add your API keys:

```
E2B_API_KEY=your_e2b_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
DEEPGRAM_API_KEY=your_deepgram_key_here
```

- Run the development server

```bash
bun dev
```

Open http://localhost:3000 and start chatting or speaking to control the desktop.


