An autonomous, multimodal AI agent that controls a live Linux virtual machine via natural language using vision, voice, and computer-use tools.

arvindrk/computer-use-agent

Autonomous Agent for Remote Desktop Control

An autonomous AI agent with multimodal capabilities (vision + voice + chat) that controls a remote Linux desktop environment through natural language. Built with Claude's Computer Use API, Deepgram live transcription, and the E2B Desktop Sandbox.

Features

  • πŸ€– Fully autonomous agentic loops - Agent perceives, reasons, acts, and adapts based on visual feedback
  • 🎀 Multi-modal input - Voice (via Deepgram) and text chat interfaces
  • πŸ‘οΈ Vision-powered control - Agent analyzes screenshots to plan and execute actions
  • πŸ–±οΈ Computer use tools - Mouse clicks, keyboard input, bash commands, file editing
  • πŸ–₯️ Real-time desktop streaming - Live Linux desktop (Xfce) streamed to browser via VNC
  • πŸ”„ Persistent sessions - Reconnect to existing sandbox sessions
  • πŸ“‹ Clipboard integration - Read/write clipboard access
  • ⚑ Streaming responses - Real-time agent reasoning and action updates

Architecture

The application implements a complete autonomous agent system with perception-action loops:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                          User Input Layer                           β”‚
β”‚  Voice Input ──► Deepgram ──► WebSocket ──► Live Transcription      β”‚
β”‚  Text Input  ──────────────────────────────► Chat Interface         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Agent Orchestration Layer                      β”‚
β”‚  Next.js API Routes + Server-Sent Events (SSE) Streaming            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       Agentic Loop (Claude)                         β”‚
β”‚                                                                     β”‚
β”‚  1. Perception:   Take screenshot of desktop                        β”‚
β”‚  2. Reasoning:    Analyze visual state + user intent                β”‚
β”‚  3. Planning:     Decide which tool(s) to use                       β”‚
β”‚  4. Action:       Execute computer/bash/editor tools                β”‚
β”‚  5. Feedback:     Capture new screenshot                            β”‚
β”‚  6. Iterate:      Loop until task complete                          β”‚
β”‚                                                                     β”‚
β”‚  Tools: computer_use (mouse/keyboard), bash, text_editor            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Desktop Execution Layer                          β”‚
β”‚  E2B Desktop Sandbox - Isolated Linux VM with VNC streaming         β”‚
β”‚  β€’ Resolution scaling for Claude's vision API                       β”‚
β”‚  β€’ Action executor (clicks, typing, scrolling, bash)                β”‚
β”‚  β€’ Screenshot capture and base64 encoding                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Components

  • Voice Pipeline: Browser MediaRecorder β†’ WebSocket β†’ Deepgram Live API β†’ Real-time transcript
  • Agent Provider: Claude Agent with Computer Use API (beta 2025-01-24)
  • Action Executor: Translates agent decisions into desktop interactions
  • Resolution Scaler: Adapts between display resolution and Claude's vision constraints
  • Streaming Protocol: SSE for real-time agent reasoning, actions, and status updates
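The SSE streaming protocol can be sketched as follows. The event names ("reasoning", "action", "status") and the helper functions are illustrative assumptions, not taken from the actual codebase; the framing itself (an `event:` line, a `data:` line of JSON, and a blank line per frame) is standard Server-Sent Events.

```typescript
type AgentEvent =
  | { type: "reasoning"; text: string }
  | { type: "action"; tool: string; input: unknown }
  | { type: "status"; done: boolean };

// Encode one agent event as a single SSE frame.
function encodeSSE(event: AgentEvent): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}

// In a Next.js route handler, frames can be pushed through a
// ReadableStream returned as a text/event-stream response.
function streamOf(events: AgentEvent[]): Response {
  const body = new ReadableStream({
    start(controller) {
      const enc = new TextEncoder();
      for (const e of events) controller.enqueue(enc.encode(encodeSSE(e)));
      controller.close();
    },
  });
  return new Response(body, {
    headers: { "Content-Type": "text/event-stream" },
  });
}
```

On the browser side, an `EventSource` (or a streamed `fetch`) subscribes to these frames and renders reasoning, actions, and status as they arrive.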

How the Agent Works

The agent operates in a continuous perception-action loop:

  1. User sends command (voice or text) - Natural language instruction
  2. Agent initializes sandbox - Spins up isolated Linux VM if needed
  3. Agentic loop begins:
    • Agent takes screenshot of desktop
    • Claude analyzes visual state and user intent
    • Plans which computer use tools to invoke
    • Executes actions (mouse clicks, typing, bash commands)
    • Takes new screenshot to verify results
    • Reasons about next steps
  4. Loop continues until task is complete or user intervenes
  5. Desktop streams live - User watches agent work in real-time via VNC iframe

Screenshots

Home Page

🏈 Who's performing at the Super Bowl halftime show in 2026?

Super Bowl Search

πŸ›’ Find highly-rated dog toys on Amazon under $30

Amazon Search

Technical Stack

  • Frontend: Next.js
  • Agent: Claude Agent with Computer Use tools
  • Voice: Deepgram live transcription
  • Sandbox: E2B Desktop Sandbox (isolated Linux VM with VNC)
  • Streaming: WebSocket (voice), Server-Sent Events (agent responses)
  • Tools: Computer use, Bash execution, Text editor

Prerequisites

  • Bun runtime (used for bun install and bun dev)
  • API keys for E2B, Anthropic, and Deepgram (see Setup below)

Setup

  1. Clone the repository and install dependencies

git clone https://github.com/arvindrk/computer-use-agent.git
cd computer-use-agent
bun install

  2. Configure environment variables

Create .env.local from env.example:

cp env.example .env.local

Add your API keys:

E2B_API_KEY=your_e2b_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
DEEPGRAM_API_KEY=your_deepgram_key_here

  3. Run the development server

bun dev

Open http://localhost:3000 and start chatting or speaking to control the desktop.
