A macOS native AI desktop companion — a pixel-art 저승사자 (Grim Reaper) developer that haunts your screen, watches what you code, and speaks to you with Gemini's native audio.
Bugs are 귀마 (demons). Fixing code is 퇴마 (exorcism). You are the idol developer.
Built for the Gemini 3 Seoul Hackathon (Feb 28, 2026).
KpopDemonCoders sits on your desktop as a pixel-art 저승사자 that follows your cursor across all monitors. It captures the window under your cursor every 5 seconds, analyzes it with Gemini 3 Flash, and decides whether to comment — out loud — using Gemini's native audio WebSocket.
- Sees your screen — VisionAgent reads what's around your cursor and scores its significance
- Speaks with personality — Zubenelgenubi voice with darkly witty Korean supernatural references
- Thinks before speaking — Mediator gates speech with cooldowns, significance thresholds, and adaptive timing
- Reads full windows — On high-significance events (errors, build results), captures the entire active window for deeper context
- Adapts to you — MARL-inspired scheduler learns your response patterns and adjusts how often the saja speaks
- Gesture input — Draw 3 circles with your mouse to force analysis and comment
- Google Search Grounding — Can reference real-time web information in responses
- Character switching — Switch between 6 characters (saja, cat, derpy, jinwoo, kimjongun, trump) at runtime with full persona swap
- Voice chat — Global hotkey ⌘⇧G + wake word "잼민아" for interactive conversation
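
The speech gating described in the list above can be sketched roughly as follows. This is a minimal illustration, not the app's actual Mediator: the `Mediator` struct, its 30-second cooldown, and its threshold of 5 are assumed values chosen for the example, and only the `MediatorDecision` name comes from this README.

```swift
import Foundation

// Hypothetical shape of the typed decision; the real cases are not shown in this README.
enum MediatorDecision: Equatable {
    case speak
    case silent(reason: String)
}

struct Mediator {
    var cooldown: TimeInterval = 30        // assumed minimum gap between comments, in seconds
    var significanceThreshold = 5          // assumed minimum score required to speak
    var lastSpokeAt: TimeInterval? = nil   // timestamp of the last spoken comment, if any

    // Decide whether an observation with the given significance may be spoken at time `now`.
    mutating func evaluate(significance: Int, now: TimeInterval) -> MediatorDecision {
        if let last = lastSpokeAt, now - last < cooldown {
            return .silent(reason: "cooldown")
        }
        guard significance >= significanceThreshold else {
            return .silent(reason: "low significance")
        }
        lastSpokeAt = now
        return .speak
    }
}

var mediator = Mediator()
print(mediator.evaluate(significance: 8, now: 0))
print(mediator.evaluate(significance: 9, now: 10))
print(mediator.evaluate(significance: 2, now: 60))
```

Returning a typed decision instead of a bare Boolean keeps the reason for silence available to logging and to the adaptive timing logic.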
┌─────────────────────┐
│ScreenCaptureService │
│ ScreenCaptureKit │
│ + change detection │
└────────┬────────────┘
│ JPEG image
▼
┌─────────────────────┐
│ VisionAgent │
│ gemini-3-flash │
│ REST + JSON schema │
│ thinkingLevel: │
│ minimal │
└────────┬────────────┘
│ VisionAnalysis
│ {significance, content, emotion, shouldSpeak}
▼
┌──────────────┐ ┌─────────────────────┐ ┌────────────────────┐
│ Engagement │──▶│ Mediator │◀──│ AdaptiveScheduler │
│ Agent │ │ Speech gating │ │ MARL reward-based │
│ silence │ │ Cooldown + urgency │ │ adaptive timing │
│ monitor │ │ evaluate() → typed │ │ │
│ (dynamic) │ │ MediatorDecision │ │ responseRate → │
└──────────────┘ └────────┬────────────┘ │ silence threshold │
│ │ interruptRate → │
┌────────┴────────┐ │ cooldown │
│ │ └────────────────────┘
speak=true speak=false
│ │
▼ └──▶ (silent)
┌─────────────────┐
│ ScreenAnalyzer │
│ (Orchestrator) │
│ │
│ sig ≥ 7? │
│ ├─ yes → full │
│ │ window │
│ │ capture │
│ └─ no → text │
│ only │
└────┬────────────┘
│
┌────────┴────────┐
│ │
WS connected? WS down?
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ GeminiLive │ │ REST Fallback│
│ Client (WS) │ │ gemini-3- │
│ │ │ flash + │
│ gemini-2.5- │ │ TTSClient │
│ flash-native │ │ │
│ -audio │ └──────────────┘
│ │
│ Configurable │
│ voice + │
│ affective │
│ dialog │
└──────┬───────┘
│ audio chunks + outputTranscription
│ (sentence-level via `finished` flag)
▼
┌──────────────┐ ┌──────────────┐
│ AudioPlayer │ │ KDCView │
│ AVAudioEngine│ │ SwiftUI │
│ 24kHz F32 │ │ bubble + │
│ streaming │ │ sprite │
└──────────────┘ └──────────────┘
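
As an illustration of the `VisionAnalysis` hand-off in the diagram, the sketch below decodes the four fields and applies the `sig ≥ 7` routing rule. The field names come from the diagram; the Swift types, the `CaptureMode` enum, the `captureMode(for:)` helper, and the sample JSON values are assumptions made for this example.

```swift
import Foundation

// Field names follow the diagram; the types (Int, String, Bool) are assumptions.
struct VisionAnalysis: Codable, Equatable {
    let significance: Int
    let content: String
    let emotion: String
    let shouldSpeak: Bool
}

// Hypothetical routing result: full-window capture or text-only context.
enum CaptureMode { case fullWindow, textOnly }

// Apply the threshold shown in the diagram: significance of 7 or more requests a full window.
func captureMode(for analysis: VisionAnalysis) -> CaptureMode {
    analysis.significance >= 7 ? .fullWindow : .textOnly
}

// Sample payload invented for illustration.
let sample = """
{"significance": 8, "content": "Build failed: missing return", "emotion": "alert", "shouldSpeak": true}
"""
let analysis = try JSONDecoder().decode(VisionAnalysis.self, from: Data(sample.utf8))
print(analysis.significance, captureMode(for: analysis))
```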
Inspired by Multi-Agent Reinforcement Learning's Centralized Training with Decentralized Execution:
| Agent | Role | Communication |
|---|---|---|
| VisionAgent | Analyzes screen captures → structured JSON | Text-only, no personality |
| Mediator | Centralized critic — gates speech decisions | Typed MediatorDecision |
| EngagementAgent | Monitors silence → proactive triggers | Neutral intent packets |
| AdaptiveScheduler | MARL reward signals → dynamic timing | responseRate, interruptRate |
| ScreenAnalyzer | Orchestrator / router (LangGraph-style) | Coordinates all agents |
| GeminiLiveClient | WebSocket transport + keep-alive | Audio + transcription |
| TTSClient | REST TTS fallback when WS is down | PCM audio |
Key principle: Agents communicate in structured text. Character personality is applied ONLY at the output boundary (Prompts.swift) based on the active KDCCharacterPreset.
| Model | Purpose | Method |
|---|---|---|
| `gemini-3-flash-preview` | Screen analysis (VisionAgent) | REST generateContent |
| `gemini-2.5-flash-native-audio-latest` | Live conversation + audio | WebSocket BidiGenerateContent |
| `gemini-2.5-flash-preview-tts` | TTS fallback when WS is down | REST generateContent |
Features used:
- `responseSchema` — structured JSON output from VisionAgent
- `thinkingConfig: { thinkingLevel: "minimal" }` — fast analysis
- `mediaResolution: "MEDIA_RESOLUTION_HIGH"` — 1102 tokens per image
- `enableAffectiveDialog: true` — emotion-responsive voice
- `outputAudioTranscription` — text alongside audio for bubble display
- `sessionResumption: { transparent: true }` — reconnection without context loss
- `contextWindowCompression` — sliding window for long sessions
- `proactivity: { proactiveAudio: true }` — model-initiated speech
- `tools: [{ googleSearch: {} }]` — real-time grounding
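
To show how the VisionAgent-related features might combine in one REST request, here is a hedged sketch of a `generateContent` body. The `makeVisionRequestBody` helper, the prompt text, and the schema details are hypothetical; placing `responseSchema`, `thinkingConfig`, and `mediaResolution` under `generationConfig` reflects the public Gemini REST API as best understood here, and the app's real request may differ.

```swift
import Foundation

// Build a JSON-serializable request body for a screen-analysis call (illustrative only).
func makeVisionRequestBody(jpegBase64: String) -> [String: Any] {
    // Schema mirroring the VisionAnalysis fields; the types are assumptions.
    let properties: [String: Any] = [
        "significance": ["type": "INTEGER"],
        "content": ["type": "STRING"],
        "emotion": ["type": "STRING"],
        "shouldSpeak": ["type": "BOOLEAN"],
    ]
    let schema: [String: Any] = [
        "type": "OBJECT",
        "properties": properties,
        "required": ["significance", "content", "emotion", "shouldSpeak"],
    ]
    let imagePart: [String: Any] = [
        "inlineData": ["mimeType": "image/jpeg", "data": jpegBase64]
    ]
    // Placeholder prompt, not the app's actual prompt.
    let textPart: [String: Any] = ["text": "Describe what is happening near the cursor."]
    let parts: [[String: Any]] = [imagePart, textPart]
    let contents: [[String: Any]] = [["parts": parts]]
    let generationConfig: [String: Any] = [
        "responseMimeType": "application/json",
        "responseSchema": schema,
        "thinkingConfig": ["thinkingLevel": "minimal"],
        "mediaResolution": "MEDIA_RESOLUTION_HIGH",
    ]
    return ["contents": contents, "generationConfig": generationConfig]
}

let body = makeVisionRequestBody(jpegBase64: "<base64 JPEG>")
let payload = try JSONSerialization.data(withJSONObject: body, options: [])
print(payload.count)
```

Building the body from typed intermediate dictionaries avoids heterogeneous-literal inference errors and keeps each feature flag easy to spot.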
| Character | ID | Voice | Size | Theme |
|---|---|---|---|---|
| 저승사자 | saja | Zubenelgenubi | Large | 귀마/퇴마, darkly witty |
| White Cat | cat | Zephyr | Medium | nya~/meow~, playful |
| Derpy | derpy | Zephyr | Medium | goofy, chaotic energy |
| Jinwoo | jinwoo | Zubenelgenubi | Large | solo leveling vibes |
| Kim Jong Un | kimjongun | Zubenelgenubi | Large | supreme leader energy |
| Trump | trump | Zubenelgenubi | Large | tremendous commentary |
Characters are complete persona bundles: sprite set + voice + size + prompt profile. Switching characters changes everything at once. Add new characters by creating a directory in Assets/Sprites/{name}/ with 16 sprites and a preset.json.
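
A preset loader along these lines could validate a new character before it is used. This is only a sketch: the README does not show the `preset.json` schema, so every field name below (`id`, `voice`, `size`, `spriteCount`) and the `loadPreset(from:)` helper are hypothetical.

```swift
import Foundation

// All field names are hypothetical; the actual preset.json schema lives in the repository.
struct CharacterPreset: Codable, Equatable {
    let id: String
    let voice: String
    let size: String
    let spriteCount: Int

    // The README states that each character ships 16 sprites (4 emotions x 4 frames).
    var isValid: Bool { !id.isEmpty && !voice.isEmpty && spriteCount == 16 }
}

// Decode a preset from a JSON string; throws if required fields are missing.
func loadPreset(from json: String) throws -> CharacterPreset {
    try JSONDecoder().decode(CharacterPreset.self, from: Data(json.utf8))
}

let catPreset = try loadPreset(from: """
{"id": "cat", "voice": "Zephyr", "size": "medium", "spriteCount": 16}
""")
print(catPreset.id, catPreset.voice, catPreset.isValid)
```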
KpopDemonCoders/
├── Package.swift # Swift 6.2, macOS 26, zero dependencies
├── Sources/Core/ # 11 files — shared, testable library
│ ├── AudioMessageParser.swift # WebSocket message parsing
│ ├── CharacterPresetConfig.swift # Character preset loading/validation
│ ├── ChatMessage.swift # Chat message model
│ ├── ImageDiffer.swift # Pixel-level change detection
│ ├── ImageEncoder.swift # Base64 encoding for Gemini API
│ ├── ImageProcessor.swift # JPEG encoding + resizing
│ ├── KeychainHelper.swift # Secure API key storage
│ ├── PCMConverter.swift # Int16 PCM → Float32 conversion
│ ├── PromptBuilder.swift # Prompt assembly (saja/cat profiles)
│ ├── SettingsTypes.swift # 15 shared enums (30 voices)
│ └── KDCCore.swift # Module exports and shared utilities
├── Sources/KpopDemonCoders/ # 28 files — main application
│ ├── main.swift # Entry point
│ ├── KDCAppDelegate.swift # Component wiring (20 components)
│ ├── Config.swift # API key management
│ ├── KDCViewModel.swift # Mouse tracking, lerp smoothing
│ ├── KDCSpriteAnimator.swift # 8fps pixel-art animation
│ ├── KDCScreenCaptureService.swift # ScreenCaptureKit integration
│ ├── KDCVisionAgent.swift # Gemini 3 Flash REST analysis
│ ├── KDCMediator.swift # Speech gating
│ ├── KDCEngagementAgent.swift # Silence monitor
│ ├── KDCAdaptiveScheduler.swift # MARL adaptive timing
│ ├── KDCScreenAnalyzer.swift # Multi-agent orchestrator
│ ├── KDCLiveClient.swift # WebSocket + dual ping
│ ├── KDCAudioPlayer.swift # AVAudioEngine 24kHz streaming
│ └── ... # + 15 more files
├── Tests/KpopDemonCodersTests/ # 11 test suites, 182 tests
├── Assets/
│ ├── Sprites/ # Pixel-art sprites (6 characters)
│ ├── TrayIcons/ # Menu bar icons (1x/2x/3x)
│ ├── TrayIcons_Clean/ # Minimal tray icon variants
│ └── Music/ # Lo-fi background tracks (WAV)
└── docs/ # Full documentation
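
The tree lists `PCMConverter.swift` as an Int16 PCM → Float32 conversion. A minimal sketch of that kind of conversion is shown below, assuming little-endian 16-bit samples and a scale factor of 32768; the function name `int16PCMToFloat32` is invented here and the real converter may work differently.

```swift
import Foundation

// Convert little-endian 16-bit PCM bytes into Float samples in the range [-1, 1).
// A trailing odd byte, if present, is ignored.
func int16PCMToFloat32(_ data: Data) -> [Float] {
    let sampleCount = data.count / 2
    var samples = [Float]()
    samples.reserveCapacity(sampleCount)
    for i in 0..<sampleCount {
        let lo = UInt16(data[data.startIndex + 2 * i])
        let hi = UInt16(data[data.startIndex + 2 * i + 1])
        let value = Int16(bitPattern: (hi << 8) | lo)
        samples.append(Float(value) / 32768.0)
    }
    return samples
}

// Three samples: zero, Int16.max, and Int16.min.
let floats = int16PCMToFloat32(Data([0x00, 0x00, 0xFF, 0x7F, 0x00, 0x80]))
print(floats)
```

Dividing by 32768 maps `Int16.min` to exactly -1.0, which is the usual convention for feeding Float32 buffers to an audio engine.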
# Requirements: macOS 26+, Xcode 26+ with Swift 6.2
# 1. Configure API key (first time only)
cp .env.test.example .env.test
# Edit .env.test and add your Gemini API key
# Get one at: https://aistudio.google.com/apikey
# 2. Build + codesign + run
make run
# Other targets
make build # Build only
make run-log # Run with full logging
make test # Run 182 tests
make clean # Clean build artifacts

Grant Screen Recording permission when prompted (System Settings → Privacy & Security → Screen Recording).
All visual and audio assets are AI-generated:
- Sprites: fal.ai Nano Banana 2 pipeline — text-to-image base → image-to-image editing per emotion/frame → background removal → PNG with alpha. 16 sprites per character (4 emotions × 4 frames). 6 characters included.
- Tray Icons: Animated menu bar icons in 1x/2x/3x Retina resolutions, 4 emotions × 4 frames × 3 resolutions = 48 icons.
- Background Music: Google Lyria RealTime — lo-fi tracks `golden_lofi.wav` and `coding_lofi.wav` (48kHz stereo WAV).
See docs/reference/asset-pipeline.md for the full pipeline architecture, engine comparison, and how to generate sprites for new characters.
| Input | How | What Happens |
|---|---|---|
| Passive | Automatic every 5s | VisionAgent analyzes cursor area, Mediator gates response |
| Silence break | After adaptive threshold | EngagementAgent triggers with screen-specific observation |
| Circle gesture | Draw 3 circles in 6s | Forces a full-window capture and bypasses the Mediator |
| High significance | Auto-detected (sig ≥ 7) | Full active window captured for deeper analysis |
| Voice chat | ⌘⇧G or "잼민아" | Interactive conversation via text or voice |
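
One way to recognize the circle gesture from the table above is to sum how much the cursor's heading turns and fire once it reaches three full rotations inside a 6-second window. The sketch below takes that approach; it is an assumed technique, not necessarily the one the app uses, and the `Sample` and `CircleGestureDetector` types are invented for illustration.

```swift
import Foundation

// One cursor observation: position plus timestamp in seconds.
struct Sample { let x: Double; let y: Double; let t: TimeInterval }

struct CircleGestureDetector {
    let requiredTurns = 3.0          // from the README: 3 circles
    let window: TimeInterval = 6.0   // from the README: within 6 seconds
    private var samples: [Sample] = []

    // Add a sample; returns true when the accumulated turning reaches the required rotations.
    mutating func add(_ s: Sample) -> Bool {
        samples.append(s)
        samples.removeAll { s.t - $0.t > window }   // keep only the recent window
        guard samples.count >= 3 else { return false }
        var total = 0.0
        for i in 2..<samples.count {
            let a = samples[i - 2], b = samples[i - 1], c = samples[i]
            let h1 = atan2(b.y - a.y, b.x - a.x)    // heading of the previous segment
            let h2 = atan2(c.y - b.y, c.x - b.x)    // heading of the current segment
            var d = h2 - h1
            if d > .pi { d -= 2 * .pi }             // wrap the turn into (-pi, pi]
            if d < -.pi { d += 2 * .pi }
            total += d
        }
        if abs(total) >= requiredTurns * 2 * .pi {
            samples.removeAll()                     // reset so one gesture fires once
            return true
        }
        return false
    }
}

// Simulate a cursor tracing circles at 16 points per turn, 20 samples per second.
var detector = CircleGestureDetector()
var fired = false
for i in 0..<60 {
    let angle = Double(i) * (2 * Double.pi / 16)
    if detector.add(Sample(x: cos(angle), y: sin(angle), t: Double(i) * 0.05)) { fired = true }
}
print(fired)
```

Summing signed turns means back-and-forth scribbling largely cancels out, while consistent clockwise or counterclockwise motion accumulates. A production version would likely also ignore samples where the cursor has not moved.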
- Zero dependencies — Pure Swift + Apple frameworks. No SPM packages. Ship-ready binary.
- Personality at the boundary — Agents analyze in neutral structured text. Only the output adapter adds character personality.
- Immediate reconnect — No exponential backoff for the WebSocket; a fixed 1s delay keeps reconnection near-instant.
- Adaptive, not annoying — MARL-inspired scheduler tracks response rates. Ignore it → speaks less. Engage → speaks more.
- Specific, not generic — Prompts are written to spot errors, suggest fixes, and ask concrete questions.
- Native audio only — No macOS TTS. Gemini's native audio with configurable voice and affective dialog.
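
The "adaptive, not annoying" principle can be illustrated with a small scheduler that maps the two reward signals to timing. Everything numeric here is an assumption (the 120-second base silence threshold, the 30-second base cooldown, and the linear formulas), and what counts as a response or an interrupt is left to the app; only the `responseRate` and `interruptRate` names come from this README.

```swift
import Foundation

struct AdaptiveScheduler {
    var baseSilenceThreshold: TimeInterval = 120   // assumed base value, in seconds
    var baseCooldown: TimeInterval = 30            // assumed base value, in seconds
    private(set) var comments = 0
    private(set) var responses = 0
    private(set) var interrupts = 0

    // Fraction of comments the user responded to; neutral 0.5 before any data exists.
    var responseRate: Double { comments == 0 ? 0.5 : Double(responses) / Double(comments) }
    // Fraction of comments that counted as interruptions.
    var interruptRate: Double { comments == 0 ? 0 : Double(interrupts) / Double(comments) }

    // Record the outcome of one spoken comment.
    mutating func record(responded: Bool, interrupted: Bool) {
        comments += 1
        if responded { responses += 1 }
        if interrupted { interrupts += 1 }
    }

    // Engaged users get a shorter silence threshold; ignored comments lengthen it.
    var silenceThreshold: TimeInterval { baseSilenceThreshold * (1.5 - responseRate) }
    // Frequent interruptions stretch the cooldown, up to twice the base value.
    var cooldown: TimeInterval { baseCooldown * (1 + interruptRate) }
}

var scheduler = AdaptiveScheduler()
scheduler.record(responded: true, interrupted: false)
scheduler.record(responded: false, interrupted: true)
print(scheduler.responseRate, scheduler.silenceThreshold, scheduler.cooldown)
```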
| Constraint | Value |
|---|---|
| macOS | 26+ (Tahoe) |
| Swift | 6.2+ |
| Architecture | Apple Silicon (arm64) |
| Dependencies | Zero (pure Swift + Apple frameworks) |
| Vision Interval | 5 seconds (configurable) |
| Audio | 24kHz Float32 mono streaming |
| WebSocket | Dual-ping keep-alive, session resumption |
Gemini 3 Seoul Hackathon · February 28, 2026 · Sebitsom 3F Vista, Seoul
| Criteria | Weight | How KpopDemonCoders Addresses It |
|---|---|---|
| Demo | 50% | Live desktop saja — visible, audible, interactive |
| Impact | 25% | Developer companion that spots errors and adapts to behavior |
| Creativity | 15% | MARL-inspired multi-agent 저승사자 with emotion-responsive native audio |
| Pitch | 10% | "A demon that haunts your screen and exorcises your bugs" |