macOS native AI desktop companion — pixel-art 저승사자 (Grim Reaper) developer that haunts your screen, watches your code, and speaks with Gemini native audio. Built with Gemini 3 Flash + Live API. Zero dependencies.

Two-Weeks-Team/KpopDemonCoders

KpopDemonCoders

Swift 6.2 · macOS 26 · Gemini · Zero Dependencies

A macOS native AI desktop companion — a pixel-art 저승사자 (Grim Reaper) developer that haunts your screen, watches what you code, and speaks to you with Gemini's native audio.

Bugs are 귀마 (demons). Fixing code is 퇴마 (exorcism). You are the idol developer.

Built for the Gemini 3 Seoul Hackathon (Feb 28, 2026).

What It Does

KpopDemonCoders sits on your desktop as a pixel-art 저승사자 that follows your cursor across all monitors. It captures the window under your cursor every 5 seconds, analyzes it with Gemini 3 Flash, and decides whether to comment — out loud — using Gemini's native audio WebSocket.

  • Sees your screen — VisionAgent reads what's around your cursor and scores its significance
  • Speaks with personality — Zubenelgenubi voice with darkly witty Korean supernatural references
  • Thinks before speaking — Mediator gates speech with cooldowns, significance thresholds, and adaptive timing
  • Reads full windows — On high-significance events (errors, build results), captures the entire active window for deeper context
  • Adapts to you — MARL-inspired scheduler learns your response patterns and adjusts how often the saja speaks
  • Gesture input — Draw 3 circles with your mouse to force analysis and comment
  • Google Search Grounding — Can reference real-time web information in responses
  • Character switching — Switch between 6 characters (saja, cat, derpy, jinwoo, kimjongun, trump) at runtime with full persona swap
  • Voice chat — Global hotkey ⌘⇧G + wake word "잼민아" for interactive conversation

Architecture

                    ┌─────────────────────┐
                    │ScreenCaptureService │
                    │  ScreenCaptureKit   │
                    │  + change detection │
                    └────────┬────────────┘
                             │ JPEG image
                             ▼
                    ┌─────────────────────┐
                    │    VisionAgent      │
                    │  gemini-3-flash     │
                    │  REST + JSON schema │
                    │  thinkingLevel:     │
                    │    minimal          │
                    └────────┬────────────┘
                             │ VisionAnalysis
                             │ {significance, content, emotion, shouldSpeak}
                             ▼
┌──────────────┐   ┌─────────────────────┐   ┌────────────────────┐
│  Engagement  │──▶│     Mediator        │◀──│ AdaptiveScheduler  │
│    Agent     │   │  Speech gating      │   │  MARL reward-based │
│  silence     │   │  Cooldown + urgency │   │  adaptive timing   │
│  monitor     │   │  evaluate() → typed │   │                    │
│  (dynamic)   │   │  MediatorDecision   │   │  responseRate →    │
└──────────────┘   └────────┬────────────┘   │  silence threshold │
                            │                │  interruptRate →   │
                   ┌────────┴────────┐       │  cooldown          │
                   │                 │       └────────────────────┘
              speak=true        speak=false
                   │                 │
                   ▼                 └──▶ (silent)
          ┌─────────────────┐
          │  ScreenAnalyzer │
          │  (Orchestrator) │
          │                 │
          │  sig ≥ 7?       │
          │  ├─ yes → full  │
          │  │   window     │
          │  │   capture    │
          │  └─ no → text   │
          │     only        │
          └────┬────────────┘
               │
      ┌────────┴────────┐
      │                 │
  WS connected?     WS down?
      │                 │
      ▼                 ▼
┌──────────────┐  ┌──────────────┐
│ GeminiLive   │  │ REST Fallback│
│ Client (WS)  │  │ gemini-3-    │
│              │  │ flash +      │
│ gemini-2.5-  │  │ TTSClient    │
│ flash-native │  │              │
│ -audio       │  └──────────────┘
│              │
│ Configurable │
│ voice +      │
│ affective    │
│   dialog     │
└──────┬───────┘
       │ audio chunks + outputTranscription
       │ (sentence-level via `finished` flag)
       ▼
┌──────────────┐   ┌──────────────┐
│ AudioPlayer  │   │   KDCView    │
│ AVAudioEngine│   │  SwiftUI     │
│ 24kHz F32    │   │  bubble +    │
│ streaming    │   │  sprite      │
└──────────────┘   └──────────────┘
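
The gating path in the diagram (VisionAnalysis → Mediator → ScreenAnalyzer) can be sketched roughly as below. This is a minimal illustration, not the project's code: the field names and the `sig ≥ 7` branch come from the diagram, while the Swift types, the `MediatorDecision` cases, the 20-second cooldown, and the minimum significance of 4 are assumptions.

```swift
import Foundation

// Field names are taken from the diagram; the Swift types are assumptions.
struct VisionAnalysis {
    let significance: Int      // assumed 0...10 scale
    let content: String
    let emotion: String
    let shouldSpeak: Bool
}

// Hypothetical shape: the README only says evaluate() returns a typed MediatorDecision.
enum MediatorDecision: Equatable {
    case speak
    case silent(reason: String)
}

struct Mediator {
    var cooldown: TimeInterval = 20     // assumed value
    var minSignificance = 4             // assumed value
    var lastSpokeAt: Date? = nil

    // Gate speech on the agent's own flag, a significance threshold, and a cooldown.
    mutating func evaluate(_ analysis: VisionAnalysis, now: Date = Date()) -> MediatorDecision {
        guard analysis.shouldSpeak else { return .silent(reason: "agent declined") }
        guard analysis.significance >= minSignificance else {
            return .silent(reason: "below threshold")
        }
        if let last = lastSpokeAt, now.timeIntervalSince(last) < cooldown {
            return .silent(reason: "cooldown")
        }
        lastSpokeAt = now
        return .speak
    }
}

// The orchestrator's "sig ≥ 7" branch from the diagram.
func needsFullWindowCapture(_ analysis: VisionAnalysis) -> Bool {
    analysis.significance >= 7
}
```

Passing `now` explicitly keeps the gate deterministic, which makes cooldown behaviour easy to unit-test.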

Multi-Agent Design (MARL CTDE)

Inspired by Multi-Agent Reinforcement Learning's Centralized Training with Decentralized Execution:

| Agent | Role | Communication |
| --- | --- | --- |
| VisionAgent | Analyzes screen captures → structured JSON | Text-only, no personality |
| Mediator | Centralized critic: gates speech decisions | Typed MediatorDecision |
| EngagementAgent | Monitors silence → proactive triggers | Neutral intent packets |
| AdaptiveScheduler | MARL reward signals → dynamic timing | responseRate, interruptRate |
| ScreenAnalyzer | Orchestrator / router (LangGraph-style) | Coordinates all agents |
| GeminiLiveClient | WebSocket transport + keep-alive | Audio + transcription |
| TTSClient | REST TTS fallback when WS is down | PCM audio |

Key principle: Agents communicate in structured text. Character personality is applied ONLY at the output boundary (Prompts.swift) based on the active KDCCharacterPreset.
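
One way to picture the key principle is a single function that turns a neutral intent into a persona-flavoured prompt. The sketch below is illustrative only: `NeutralIntent`, `PersonaProfile`, and `buildSpeechPrompt` are hypothetical names, and the prompt layout is an assumption, not the contents of Prompts.swift.

```swift
// Hypothetical types; only the idea (neutral text in, persona applied last) is from the README.
struct NeutralIntent {
    let observation: String
    let suggestion: String?
}

struct PersonaProfile {
    let id: String
    let styleInstruction: String
}

// The only place where character voice enters the pipeline.
func buildSpeechPrompt(intent: NeutralIntent, persona: PersonaProfile) -> String {
    var lines = [
        "Style: \(persona.styleInstruction)",
        "Observation: \(intent.observation)"
    ]
    if let suggestion = intent.suggestion {
        lines.append("Suggestion: \(suggestion)")
    }
    return lines.joined(separator: "\n")
}
```

Because the intent carries no style, swapping the active character only changes the `PersonaProfile` argument; the upstream agents are untouched.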

Gemini API Usage

| Model | Purpose | Method |
| --- | --- | --- |
| gemini-3-flash-preview | Screen analysis (VisionAgent) | REST generateContent |
| gemini-2.5-flash-native-audio-latest | Live conversation + audio | WebSocket BidiGenerateContent |
| gemini-2.5-flash-preview-tts | TTS fallback when WS is down | REST generateContent |

Features used:

  • responseSchema — structured JSON output from VisionAgent
  • thinkingConfig: { thinkingLevel: "minimal" } — fast analysis
  • mediaResolution: "MEDIA_RESOLUTION_HIGH" (1120 tokens per image)
  • enableAffectiveDialog: true — emotion-responsive voice
  • outputAudioTranscription — text alongside audio for bubble display
  • sessionResumption: { transparent: true } — reconnection without context loss
  • contextWindowCompression — sliding window for long sessions
  • proactivity: { proactiveAudio: true } — model-initiated speech
  • tools: [{ googleSearch: {} }] — real-time grounding
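
As a rough illustration of how the REST-side features above could fit into one generateContent request body, here is a sketch that builds the JSON with Foundation only. The schema keys mirror the VisionAnalysis fields in the architecture diagram; the function name, the prompt parameter, and the exact body layout are assumptions and should be checked against the Gemini API reference before use.

```swift
import Foundation

// Builds a generateContent request body as JSON data.
// Hypothetical helper: the project's actual request code may differ.
func makeVisionRequestBody(jpegBase64: String, prompt: String) throws -> Data {
    // Structured-output schema mirroring the VisionAnalysis fields.
    let schema: [String: Any] = [
        "type": "OBJECT",
        "properties": [
            "significance": ["type": "INTEGER"],
            "content": ["type": "STRING"],
            "emotion": ["type": "STRING"],
            "shouldSpeak": ["type": "BOOLEAN"]
        ],
        "required": ["significance", "content", "emotion", "shouldSpeak"]
    ]
    let parts: [[String: Any]] = [
        ["inlineData": ["mimeType": "image/jpeg", "data": jpegBase64]],
        ["text": prompt]
    ]
    let generationConfig: [String: Any] = [
        "responseMimeType": "application/json",
        "responseSchema": schema,
        "thinkingConfig": ["thinkingLevel": "minimal"],
        "mediaResolution": "MEDIA_RESOLUTION_HIGH"
    ]
    let body: [String: Any] = [
        "contents": [["parts": parts]],
        "generationConfig": generationConfig
    ]
    return try JSONSerialization.data(withJSONObject: body)
}
```

The Live API options in the list (affective dialog, session resumption, proactivity) belong to the WebSocket setup message rather than this REST body, so they are left out here.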

Character System

| Character | ID | Voice | Size | Theme |
| --- | --- | --- | --- | --- |
| 저승사자 | saja | Zubenelgenubi | Large | 귀마/퇴마, darkly witty |
| White Cat | cat | Zephyr | Medium | nya~/meow~, playful |
| Derpy | derpy | Zephyr | Medium | goofy, chaotic energy |
| Jinwoo | jinwoo | Zubenelgenubi | Large | solo leveling vibes |
| Kim Jong Un | kimjongun | Zubenelgenubi | Large | supreme leader energy |
| Trump | trump | Zubenelgenubi | Large | tremendous commentary |

Characters are complete persona bundles: sprite set + voice + size + prompt profile. Switching characters changes everything at once. Add new characters by creating a directory in Assets/Sprites/{name}/ with 16 sprites and a preset.json.
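
Loading and validating such a bundle might look like the sketch below. The README names preset.json but not its keys, so the `id`, `voice`, and `size` fields, the `PresetError` type, and the `.png` check are all assumptions made for illustration.

```swift
import Foundation

// Hypothetical preset.json schema; the actual keys are not documented here.
struct CharacterPreset: Codable, Equatable {
    enum Size: String, Codable { case medium, large }
    let id: String
    let voice: String
    let size: Size
}

enum PresetError: Error, Equatable {
    case wrongSpriteCount(Int)
}

// Decode preset.json and enforce the "16 sprites" rule (4 emotions × 4 frames).
func loadPreset(json: Data, spriteFileNames: [String]) throws -> CharacterPreset {
    let pngs = spriteFileNames.filter { $0.lowercased().hasSuffix(".png") }
    guard pngs.count == 16 else {
        throw PresetError.wrongSpriteCount(pngs.count)
    }
    return try JSONDecoder().decode(CharacterPreset.self, from: json)
}
```

In practice `spriteFileNames` would come from listing the character's directory under Assets/Sprites/; it is a plain array here so the validation can be tested without touching the file system.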

File Structure

KpopDemonCoders/
├── Package.swift                    # Swift 6.2, macOS 26, zero dependencies
├── Sources/Core/                    # 11 files — shared, testable library
│   ├── AudioMessageParser.swift     # WebSocket message parsing
│   ├── CharacterPresetConfig.swift  # Character preset loading/validation
│   ├── ChatMessage.swift            # Chat message model
│   ├── ImageDiffer.swift            # Pixel-level change detection
│   ├── ImageEncoder.swift           # Base64 encoding for Gemini API
│   ├── ImageProcessor.swift         # JPEG encoding + resizing
│   ├── KeychainHelper.swift         # Secure API key storage
│   ├── PCMConverter.swift           # Int16 PCM → Float32 conversion
│   ├── PromptBuilder.swift          # Prompt assembly (saja/cat profiles)
│   ├── SettingsTypes.swift          # 15 shared enums (30 voices)
│   └── KDCCore.swift                # Module exports and shared utilities
├── Sources/KpopDemonCoders/         # 28 files — main application
│   ├── main.swift                   # Entry point
│   ├── KDCAppDelegate.swift         # Component wiring (20 components)
│   ├── Config.swift                 # API key management
│   ├── KDCViewModel.swift           # Mouse tracking, lerp smoothing
│   ├── KDCSpriteAnimator.swift      # 8fps pixel-art animation
│   ├── KDCScreenCaptureService.swift # ScreenCaptureKit integration
│   ├── KDCVisionAgent.swift         # Gemini 3 Flash REST analysis
│   ├── KDCMediator.swift            # Speech gating
│   ├── KDCEngagementAgent.swift     # Silence monitor
│   ├── KDCAdaptiveScheduler.swift   # MARL adaptive timing
│   ├── KDCScreenAnalyzer.swift      # Multi-agent orchestrator
│   ├── KDCLiveClient.swift          # WebSocket + dual ping
│   ├── KDCAudioPlayer.swift         # AVAudioEngine 24kHz streaming
│   └── ...                          # + 15 more files
├── Tests/KpopDemonCodersTests/      # 11 test suites, 182 tests
├── Assets/
│   ├── Sprites/                     # Pixel-art sprites (6 characters)
│   ├── TrayIcons/                   # Menu bar icons (1x/2x/3x)
│   ├── TrayIcons_Clean/             # Minimal tray icon variants
│   └── Music/                       # Lo-fi background tracks (WAV)
└── docs/                            # Full documentation
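
To give a feel for what pixel-level change detection (ImageDiffer.swift in the tree above) could involve, here is a minimal sketch that compares two raw pixel buffers. It is not the project's implementation: the function name, the per-byte tolerance of 16, and the idea of returning a changed fraction are assumptions.

```swift
// Returns the fraction of bytes that differ by more than `tolerance`,
// or nil when the buffers are empty or have different sizes.
func changedFraction(_ previous: [UInt8], _ current: [UInt8], tolerance: UInt8 = 16) -> Double? {
    guard previous.count == current.count, !previous.isEmpty else { return nil }
    var changed = 0
    for i in previous.indices {
        // Absolute difference without overflowing UInt8.
        let diff = previous[i] > current[i] ? previous[i] - current[i] : current[i] - previous[i]
        if diff > tolerance { changed += 1 }
    }
    return Double(changed) / Double(previous.count)
}
```

A caller could skip the Gemini request whenever the fraction stays under some small threshold, which avoids re-analyzing a screen that has not meaningfully changed.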

Build & Run

# Requirements: macOS 26+, Xcode 26+ with Swift 6.2

# 1. Configure API key (first time only)
cp .env.test.example .env.test
# Edit .env.test and add your Gemini API key
# Get one at: https://aistudio.google.com/apikey

# 2. Build + codesign + run
make run

# Other targets
make build       # Build only
make run-log     # Run with full logging
make test        # Run 182 tests
make clean       # Clean build artifacts

Grant Screen Recording permission when prompted (System Settings → Privacy & Security → Screen Recording).

Asset Generation

All visual and audio assets are AI-generated:

  • Sprites: fal.ai Nano Banana 2 pipeline — text-to-image base → image-to-image editing per emotion/frame → background removal → PNG with alpha. 16 sprites per character (4 emotions × 4 frames). 6 characters included.
  • Tray Icons: Animated menu bar icons in 1x/2x/3x Retina resolutions, 4 emotions × 4 frames × 3 resolutions = 48 icons.
  • Background Music: Google Lyria RealTime — lo-fi tracks golden_lofi.wav and coding_lofi.wav (48kHz stereo WAV).

See docs/reference/asset-pipeline.md for the full pipeline architecture, engine comparison, and how to generate sprites for new characters.

Interaction Model

| Input | How | What Happens |
| --- | --- | --- |
| Passive | Automatic every 5s | VisionAgent analyzes cursor area, Mediator gates response |
| Silence break | After adaptive threshold | EngagementAgent triggers with screen-specific observation |
| Circle gesture | Draw 3 circles in 6s | Force captures full window, bypasses Mediator |
| High significance | Auto-detected (sig ≥ 7) | Full active window captured for deeper analysis |
| Voice chat | ⌘⇧G or "잼민아" | Interactive conversation via text or voice |
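
The "3 circles in 6s" rule for the circle gesture can be expressed as a small sliding-window counter. The sketch below assumes some other component already recognizes a completed circle and reports its timestamp; `CircleGestureCounter` and `registerCircle(at:)` are hypothetical names.

```swift
import Foundation

// Counts completed circles and fires once three fall inside a 6-second window.
struct CircleGestureCounter {
    var requiredCircles = 3
    var window: TimeInterval = 6
    var timestamps: [TimeInterval] = []

    // Call whenever a full circle has been detected; returns true when the gesture triggers.
    mutating func registerCircle(at time: TimeInterval) -> Bool {
        timestamps.append(time)
        let cutoff = time - window
        timestamps.removeAll { $0 < cutoff }   // drop circles older than the window
        if timestamps.count >= requiredCircles {
            timestamps.removeAll()             // reset so one gesture fires only once
            return true
        }
        return false
    }
}
```

Resetting after a trigger means a fourth circle drawn right afterwards starts a new count instead of firing again.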

Key Design Decisions

  1. Zero dependencies: pure Swift + Apple frameworks. No SPM packages. Ship-ready binary.
  2. Personality at the boundary: agents analyze in neutral structured text. Only the output adapter adds character personality.
  3. Fast reconnect: no exponential backoff for the WebSocket, only a fixed 1s delay, so the companion is back almost immediately after a drop.
  4. Adaptive, not annoying: the MARL-inspired scheduler tracks response rates. Ignore it and it speaks less; engage and it speaks more.
  5. Specific, not generic: prompts are written to spot errors, suggest fixes, and ask concrete questions.
  6. Native audio only: no macOS TTS. Gemini's native audio with configurable voice and affective dialog.
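
The adaptive behaviour in point 4 (responseRate → silence threshold, interruptRate → cooldown, as in the architecture diagram) could be approximated with simple linear mappings. Everything numeric below is an assumption: the 30 to 120 second silence range, the 10 to 60 second cooldown range, the neutral starting rate of 0.5, and the `Outcome` cases are made up for illustration.

```swift
import Foundation

// Hypothetical outcomes of one spoken comment.
enum Outcome { case responded, ignored, interrupted }

struct AdaptiveScheduler {
    var responded = 0
    var interrupted = 0
    var total = 0

    mutating func record(_ outcome: Outcome) {
        total += 1
        switch outcome {
        case .responded: responded += 1
        case .interrupted: interrupted += 1
        case .ignored: break
        }
    }

    // Neutral defaults until there is any feedback.
    var responseRate: Double { total == 0 ? 0.5 : Double(responded) / Double(total) }
    var interruptRate: Double { total == 0 ? 0.0 : Double(interrupted) / Double(total) }

    // More replies → shorter silence before speaking up (assumed 120s down to 30s).
    var silenceThreshold: TimeInterval { 120 - 90 * responseRate }

    // More interruptions → longer cooldown between comments (assumed 10s up to 60s).
    var cooldown: TimeInterval { 10 + 50 * interruptRate }
}
```

A real scheduler would likely weight recent outcomes more heavily than old ones; plain lifetime counts are used here to keep the mapping easy to follow.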

Technical Constraints

| Constraint | Value |
| --- | --- |
| macOS | 26+ (Tahoe) |
| Swift | 6.2+ |
| Architecture | Apple Silicon (arm64) |
| Dependencies | Zero (pure Swift + Apple frameworks) |
| Vision Interval | 5 seconds (configurable) |
| Audio | 24kHz Float32 mono streaming |
| WebSocket | Dual-ping keep-alive, session resumption |
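
The Float32 audio constraint implies a conversion step from the 16-bit PCM chunks that arrive over the wire (the role listed for PCMConverter.swift). A minimal sketch, assuming little-endian signed 16-bit samples and a hypothetical function name:

```swift
import Foundation

// Converts little-endian signed 16-bit PCM bytes to Float32 samples in [-1.0, 1.0).
// Assumption: input is mono, little-endian Int16; a trailing odd byte is ignored.
func int16PCMToFloat32(_ data: Data) -> [Float] {
    let sampleCount = data.count / 2
    var samples: [Float] = []
    samples.reserveCapacity(sampleCount)
    for i in 0..<sampleCount {
        let low = UInt16(data[data.startIndex + 2 * i])
        let high = UInt16(data[data.startIndex + 2 * i + 1])
        let value = Int16(bitPattern: (high << 8) | low)
        samples.append(Float(value) / 32768.0)
    }
    return samples
}
```

Dividing by 32768 maps the most negative Int16 value to exactly -1.0, which is the range AVAudioEngine expects for Float32 buffers.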

Hackathon

Gemini 3 Seoul Hackathon · February 28, 2026 · Sebitsom 3F Vista, Seoul

| Criteria | Weight | How KpopDemonCoders Addresses It |
| --- | --- | --- |
| Demo | 50% | Live desktop saja: visible, audible, interactive |
| Impact | 25% | Developer companion that spots errors and adapts to behavior |
| Creativity | 15% | MARL-inspired multi-agent 저승사자 with emotion-responsive native audio |
| Pitch | 10% | "A demon that haunts your screen and exorcises your bugs" |
