An autonomous, multimodal AI agent that controls a live Linux virtual machine via natural language using vision, voice, and computer-use tools.

arvindrk/computer-use-agent

Autonomous Agent for Remote Desktop Control

An autonomous AI agent with multimodal capabilities (vision + voice + chat) that controls a remote Linux desktop environment through natural language. Built with Claude's Computer Use API, Deepgram live transcription, and the E2B Desktop Sandbox.

Features

  • πŸ€– Fully autonomous agentic loops - Agent perceives, reasons, acts, and adapts based on visual feedback
  • 🎀 Multi-modal input - Voice (via Deepgram) and text chat interfaces
  • πŸ‘οΈ Vision-powered control - Agent analyzes screenshots to plan and execute actions
  • πŸ–±οΈ Computer use tools - Mouse clicks, keyboard input, bash commands, file editing
  • πŸ–₯️ Real-time desktop streaming - Live Linux desktop (Xfce) streamed to browser via VNC
  • πŸ”„ Persistent sessions - Reconnect to existing sandbox sessions
  • πŸ“‹ Clipboard integration - Read/write clipboard access
  • ⚑ Streaming responses - Real-time agent reasoning and action updates

Architecture

The application implements a complete autonomous agent system with perception-action loops:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                          User Input Layer                           β”‚
β”‚  Voice Input ──► Deepgram ──► WebSocket ──► Live Transcription      β”‚
β”‚  Text Input  ──────────────────────────────► Chat Interface         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Agent Orchestration Layer                      β”‚
β”‚  Next.js API Routes + Server-Sent Events (SSE) Streaming            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       Agentic Loop (Claude)                         β”‚
β”‚                                                                     β”‚
β”‚  1. Perception:   Take screenshot of desktop                        β”‚
β”‚  2. Reasoning:    Analyze visual state + user intent                β”‚
β”‚  3. Planning:     Decide which tool(s) to use                       β”‚
β”‚  4. Action:       Execute computer/bash/editor tools                β”‚
β”‚  5. Feedback:     Capture new screenshot                            β”‚
β”‚  6. Iterate:      Loop until task complete                          β”‚
β”‚                                                                     β”‚
β”‚  Tools: computer_use (mouse/keyboard), bash, text_editor            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                                β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Desktop Execution Layer                          β”‚
β”‚  E2B Desktop Sandbox - Isolated Linux VM with VNC streaming         β”‚
β”‚  β€’ Resolution scaling for Claude's vision API                       β”‚
β”‚  β€’ Action executor (clicks, typing, scrolling, bash)                β”‚
β”‚  β€’ Screenshot capture and base64 encoding                           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Components

  • Voice Pipeline: Browser MediaRecorder β†’ WebSocket β†’ Deepgram Live API β†’ Real-time transcript
  • Agent Provider: Claude Agent with Computer Use API (beta 2025-01-24)
  • Action Executor: Translates agent decisions into desktop interactions
  • Resolution Scaler: Adapts between display resolution and Claude's vision constraints
  • Streaming Protocol: SSE for real-time agent reasoning, actions, and status updates
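The SSE streaming protocol can be sketched as follows. The event names ("reasoning", "action", "status") and the helper functions are illustrative assumptions, not taken from the actual codebase; the framing itself (an `event:` line, a `data:` line of JSON, and a blank line per frame) is standard Server-Sent Events.

```typescript
type AgentEvent =
  | { type: "reasoning"; text: string }
  | { type: "action"; tool: string; input: unknown }
  | { type: "status"; done: boolean };

// Encode one agent event as a single SSE frame.
function encodeSSE(event: AgentEvent): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}

// In a Next.js route handler, frames can be pushed through a
// ReadableStream returned as a text/event-stream response.
function streamOf(events: AgentEvent[]): Response {
  const body = new ReadableStream({
    start(controller) {
      const enc = new TextEncoder();
      for (const e of events) controller.enqueue(enc.encode(encodeSSE(e)));
      controller.close();
    },
  });
  return new Response(body, {
    headers: { "Content-Type": "text/event-stream" },
  });
}
```

On the browser side, an `EventSource` (or a streamed `fetch`) subscribes to these frames and renders reasoning, actions, and status as they arrive.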

How the Agent Works

The agent operates in a continuous perception-action loop:

  1. User sends command (voice or text) - Natural language instruction
  2. Agent initializes sandbox - Spins up isolated Linux VM if needed
  3. Agentic loop begins:
    • Agent takes screenshot of desktop
    • Claude analyzes visual state and user intent
    • Plans which computer use tools to invoke
    • Executes actions (mouse clicks, typing, bash commands)
    • Takes new screenshot to verify results
    • Reasons about next steps
  4. Loop continues until task is complete or user intervenes
  5. Desktop streams live - User watches agent work in real-time via VNC iframe

Screenshots

Home Page

🏈 Who's performing at the Super Bowl halftime show in 2026?

Super Bowl Search

πŸ›’ Find highly-rated dog toys on Amazon under $30

Amazon Search

Technical Stack

  • Frontend: Next.js
  • Agent: Claude Agent with Computer Use tools
  • Voice: Deepgram live transcription
  • Sandbox: E2B Desktop Sandbox (isolated Linux VM with VNC)
  • Streaming: WebSocket (voice), Server-Sent Events (agent responses)
  • Tools: Computer use, Bash execution, Text editor

Prerequisites

  • Bun runtime (used for bun install and bun dev)
  • API keys for E2B, Anthropic, and Deepgram (see Setup below)

Setup

  1. Clone the repository and install dependencies

git clone https://github.com/arvindrk/computer-use-agent.git
cd computer-use-agent
bun install

  2. Configure environment variables

Create .env.local from env.example:

cp env.example .env.local

Add your API keys:

E2B_API_KEY=your_e2b_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
DEEPGRAM_API_KEY=your_deepgram_key_here

  3. Run the development server

bun dev

Open http://localhost:3000 and start chatting or speaking to control the desktop.
