KpopDemonCoders

A macOS native AI desktop companion — a pixel-art 저승사자 (Grim Reaper) developer that haunts your screen, watches what you code, and speaks to you with Gemini's native audio.

Bugs are 귀마 (demons). Fixing code is 퇴마 (exorcism). You are the idol developer.

Built for the Gemini 3 Seoul Hackathon (Feb 28, 2026).

What It Does

KpopDemonCoders sits on your desktop as a pixel-art 저승사자 that follows your cursor across all monitors. It captures the window under your cursor every 5 seconds, analyzes it with Gemini 3 Flash, and decides whether to comment — out loud — using Gemini's native audio WebSocket.

Sees your screen — VisionAgent reads what's around your cursor and scores its significance
Speaks with personality — Zubenelgenubi voice with darkly witty Korean supernatural references
Thinks before speaking — Mediator gates speech with cooldowns, significance thresholds, and adaptive timing
Reads full windows — On high-significance events (errors, build results), captures the entire active window for deeper context
Adapts to you — MARL-inspired scheduler learns your response patterns and adjusts how often the saja speaks
Gesture input — Draw 3 circles with your mouse to force analysis and comment
Google Search Grounding — Can reference real-time web information in responses
Character switching — Switch between 6 characters (saja, cat, derpy, jinwoo, kimjongun, trump) at runtime with full persona swap
Voice chat — Global hotkey ⌘⇧G + wake word "잼민아" for interactive conversation

Architecture

                    ┌─────────────────────┐
                    │ScreenCaptureService │
                    │  ScreenCaptureKit   │
                    │  + change detection │
                    └────────┬────────────┘
                             │ JPEG image
                             ▼
                    ┌─────────────────────┐
                    │    VisionAgent      │
                    │  gemini-3-flash     │
                    │  REST + JSON schema │
                    │  thinkingLevel:     │
                    │    minimal          │
                    └────────┬────────────┘
                             │ VisionAnalysis
                             │ {significance, content, emotion, shouldSpeak}
                             ▼
┌──────────────┐   ┌─────────────────────┐   ┌────────────────────┐
│  Engagement  │──▶│     Mediator        │◀──│ AdaptiveScheduler  │
│    Agent     │   │  Speech gating      │   │  MARL reward-based │
│  silence     │   │  Cooldown + urgency │   │  adaptive timing   │
│  monitor     │   │  evaluate() → typed │   │                    │
│  (dynamic)   │   │  MediatorDecision   │   │  responseRate →    │
└──────────────┘   └────────┬────────────┘   │  silence threshold │
                            │                │  interruptRate →   │
                   ┌────────┴────────┐       │  cooldown          │
                   │                 │       └────────────────────┘
              speak=true        speak=false
                   │                 │
                   ▼                 └──▶ (silent)
          ┌─────────────────┐
          │  ScreenAnalyzer │
          │  (Orchestrator) │
          │                 │
          │  sig ≥ 7?       │
          │  ├─ yes → full  │
          │  │   window     │
          │  │   capture    │
          │  └─ no → text   │
          │     only        │
          └────┬────────────┘
               │
      ┌────────┴────────┐
      │                 │
  WS connected?     WS down?
      │                 │
      ▼                 ▼
┌──────────────┐  ┌──────────────┐
│ GeminiLive   │  │ REST Fallback│
│ Client (WS)  │  │ gemini-3-    │
│              │  │ flash +      │
│ gemini-2.5-  │  │ TTSClient    │
│ flash-native │  │              │
│ -audio       │  └──────────────┘
│              │
│ Configurable │
│ voice +      │
│ affective    │
│   dialog     │
└──────┬───────┘
       │ audio chunks + outputTranscription
       │ (sentence-level via `finished` flag)
       ▼
┌──────────────┐   ┌──────────────┐
│ AudioPlayer  │   │   KDCView    │
│ AVAudioEngine│   │  SwiftUI     │
│ 24kHz F32    │   │  bubble +    │
│ streaming    │   │  sprite      │
└──────────────┘   └──────────────┘
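
The two branch points in the lower half of the diagram (context depth and transport) can be sketched as a pure function. This is only an illustration: the type names `AnalysisContext`, `SpeechTransport`, and `Route` and the function `route` are hypothetical and not taken from the repository; only the `sig ≥ 7` threshold and the WebSocket/REST split come from the diagram.

```swift
// Hypothetical types for illustration; the real ScreenAnalyzer may differ.
enum AnalysisContext: Equatable { case textOnly, fullWindowCapture }
enum SpeechTransport: Equatable { case liveWebSocket, restFallbackWithTTS }

struct Route: Equatable {
    let context: AnalysisContext
    let transport: SpeechTransport
}

/// Chooses how much context to send and which transport to use,
/// assuming the Mediator has already approved speech.
func route(significance: Int, webSocketConnected: Bool) -> Route {
    // Threshold of 7 is the "sig ≥ 7?" check shown in the diagram.
    let context: AnalysisContext = significance >= 7 ? .fullWindowCapture : .textOnly
    let transport: SpeechTransport = webSocketConnected ? .liveWebSocket : .restFallbackWithTTS
    return Route(context: context, transport: transport)
}
```

Keeping the decision free of side effects makes it easy to unit test without a live capture session or network connection.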

Multi-Agent Design (MARL CTDE)

Inspired by Multi-Agent Reinforcement Learning's Centralized Training with Decentralized Execution:

Agent	Role	Communication
VisionAgent	Analyzes screen captures → structured JSON	Text-only, no personality
Mediator	Centralized critic — gates speech decisions	Typed `MediatorDecision`
EngagementAgent	Monitors silence → proactive triggers	Neutral intent packets
AdaptiveScheduler	MARL reward signals → dynamic timing	`responseRate`, `interruptRate`
ScreenAnalyzer	Orchestrator / router (LangGraph-style)	Coordinates all agents
GeminiLiveClient	WebSocket transport + keep-alive	Audio + transcription
TTSClient	REST TTS fallback when WS is down	PCM audio

Key principle: Agents communicate in structured text. Character personality is applied ONLY at the output boundary (Prompts.swift) based on the active KDCCharacterPreset.
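
A minimal sketch of how a Mediator might gate speech and return a typed decision. The shape of `MediatorDecision`, the `SilenceReason` cases, and the default values (minimum significance 5, cooldown 20 s) are assumptions made for illustration; the README only states that the Mediator uses cooldowns and significance thresholds and returns a typed `MediatorDecision`.

```swift
import Foundation

// Assumed shapes; the repository's actual MediatorDecision may carry more data.
enum SilenceReason: Equatable { case coolingDown, belowThreshold }
enum MediatorDecision: Equatable {
    case speak
    case silent(SilenceReason)
}

struct Mediator {
    var minSignificance = 5            // assumed default
    var cooldown: TimeInterval = 20    // assumed default, in seconds
    var lastSpokeAt: Date? = nil

    /// Returns a neutral, typed decision. No character personality is applied here.
    mutating func evaluate(significance: Int, shouldSpeak: Bool, now: Date) -> MediatorDecision {
        if let last = lastSpokeAt, now.timeIntervalSince(last) < cooldown {
            return .silent(.coolingDown)
        }
        guard shouldSpeak, significance >= minSignificance else {
            return .silent(.belowThreshold)
        }
        lastSpokeAt = now
        return .speak
    }
}
```

Passing `now` in explicitly, rather than reading the clock inside `evaluate`, keeps the gating logic deterministic under test.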

Gemini API Usage

Model	Purpose	Method
`gemini-3-flash-preview`	Screen analysis (VisionAgent)	REST `generateContent`
`gemini-2.5-flash-native-audio-latest`	Live conversation + audio	WebSocket `BidiGenerateContent`
`gemini-2.5-flash-preview-tts`	TTS fallback when WS is down	REST `generateContent`

Features used:

responseSchema — structured JSON output from VisionAgent
thinkingConfig: { thinkingLevel: "minimal" } — fast analysis
mediaResolution: "MEDIA_RESOLUTION_HIGH" — 1120 tokens per image
enableAffectiveDialog: true — emotion-responsive voice
outputAudioTranscription — text alongside audio for bubble display
sessionResumption: { transparent: true } — reconnection without context loss
contextWindowCompression — sliding window for long sessions
proactivity: { proactiveAudio: true } — model-initiated speech
tools: [{ googleSearch: {} }] — real-time grounding
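
Because `responseSchema` constrains the VisionAgent's output to JSON, the reply can be decoded straight into a value type. The sketch below assumes the four fields shown in the architecture diagram (`significance`, `content`, `emotion`, `shouldSpeak`); the concrete Swift types and the helper name `parseVisionAnalysis` are guesses for illustration.

```swift
import Foundation

// Field names follow the diagram; the Swift types are assumptions.
struct VisionAnalysis: Codable, Equatable {
    let significance: Int
    let content: String
    let emotion: String
    let shouldSpeak: Bool
}

/// Decodes the model's JSON text. Returns nil if any field is missing or mistyped.
func parseVisionAnalysis(_ json: String) -> VisionAnalysis? {
    try? JSONDecoder().decode(VisionAnalysis.self, from: Data(json.utf8))
}
```

Returning `nil` on a malformed reply lets the caller skip that analysis cycle instead of crashing.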

Character System

Character	ID	Voice	Size	Theme
저승사자	`saja`	Zubenelgenubi	Large	귀마/퇴마, darkly witty
White Cat	`cat`	Zephyr	Medium	nya~/meow~, playful
Derpy	`derpy`	Zephyr	Medium	goofy, chaotic energy
Jinwoo	`jinwoo`	Zubenelgenubi	Large	solo leveling vibes
Kim Jong Un	`kimjongun`	Zubenelgenubi	Large	supreme leader energy
Trump	`trump`	Zubenelgenubi	Large	tremendous commentary

Characters are complete persona bundles: sprite set + voice + size + prompt profile. Switching characters changes everything at once. Add new characters by creating a directory in Assets/Sprites/{name}/ with 16 sprites and a preset.json.
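
A sketch of loading and validating a character preset. The schema here is hypothetical: the README does not document the fields of preset.json, so `id`, `voice`, `size`, and `sprites` are assumed purely for illustration. Only the 16-sprite requirement comes from the text above.

```swift
import Foundation

// Hypothetical preset.json schema; the real file may use different keys.
struct CharacterPreset: Codable, Equatable {
    let id: String
    let voice: String
    let size: String
    let sprites: [String]
}

enum PresetError: Error, Equatable {
    case wrongSpriteCount(Int)
}

/// Decodes a preset and checks that it lists 16 sprites (4 emotions × 4 frames).
func loadPreset(from data: Data) throws -> CharacterPreset {
    let preset = try JSONDecoder().decode(CharacterPreset.self, from: data)
    guard preset.sprites.count == 16 else {
        throw PresetError.wrongSpriteCount(preset.sprites.count)
    }
    return preset
}
```

Validating at load time means a half-finished character directory fails loudly instead of producing missing animation frames at runtime.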

File Structure

KpopDemonCoders/
├── Package.swift                    # Swift 6.2, macOS 26, zero dependencies
├── Sources/Core/                    # 11 files — shared, testable library
│   ├── AudioMessageParser.swift     # WebSocket message parsing
│   ├── CharacterPresetConfig.swift  # Character preset loading/validation
│   ├── ChatMessage.swift            # Chat message model
│   ├── ImageDiffer.swift            # Pixel-level change detection
│   ├── ImageEncoder.swift           # Base64 encoding for Gemini API
│   ├── ImageProcessor.swift         # JPEG encoding + resizing
│   ├── KeychainHelper.swift         # Secure API key storage
│   ├── PCMConverter.swift           # Int16 PCM → Float32 conversion
│   ├── PromptBuilder.swift          # Prompt assembly (saja/cat profiles)
│   ├── SettingsTypes.swift          # 15 shared enums (30 voices)
│   └── KDCCore.swift                # Module exports and shared utilities
├── Sources/KpopDemonCoders/         # 28 files — main application
│   ├── main.swift                   # Entry point
│   ├── KDCAppDelegate.swift         # Component wiring (20 components)
│   ├── Config.swift                 # API key management
│   ├── KDCViewModel.swift           # Mouse tracking, lerp smoothing
│   ├── KDCSpriteAnimator.swift      # 8fps pixel-art animation
│   ├── KDCScreenCaptureService.swift # ScreenCaptureKit integration
│   ├── KDCVisionAgent.swift         # Gemini 3 Flash REST analysis
│   ├── KDCMediator.swift            # Speech gating
│   ├── KDCEngagementAgent.swift     # Silence monitor
│   ├── KDCAdaptiveScheduler.swift   # MARL adaptive timing
│   ├── KDCScreenAnalyzer.swift      # Multi-agent orchestrator
│   ├── KDCLiveClient.swift          # WebSocket + dual ping
│   ├── KDCAudioPlayer.swift         # AVAudioEngine 24kHz streaming
│   └── ...                          # + 15 more files
├── Tests/KpopDemonCodersTests/      # 11 test suites, 182 tests
├── Assets/
│   ├── Sprites/                     # Pixel-art sprites (6 characters)
│   ├── TrayIcons/                   # Menu bar icons (1x/2x/3x)
│   ├── TrayIcons_Clean/             # Minimal tray icon variants
│   └── Music/                       # Lo-fi background tracks (WAV)
└── docs/                            # Full documentation
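
As an illustration of what a file like PCMConverter.swift does, here is a minimal Int16 PCM to Float32 conversion. The function names are hypothetical, and the little-endian byte order is an assumption about the incoming audio chunks, not something the README states.

```swift
/// Reassembles little-endian byte pairs into signed 16-bit samples.
/// A trailing odd byte, if any, is ignored. Byte order is an assumption.
func decodeLittleEndianInt16(_ bytes: [UInt8]) -> [Int16] {
    var samples: [Int16] = []
    samples.reserveCapacity(bytes.count / 2)
    var i = 0
    while i + 1 < bytes.count {
        let raw = UInt16(bytes[i]) | (UInt16(bytes[i + 1]) << 8)
        samples.append(Int16(bitPattern: raw))
        i += 2
    }
    return samples
}

/// Scales Int16 samples into the range [-1.0, 1.0) expected by Float32 audio buffers.
func int16ToFloat32(_ samples: [Int16]) -> [Float] {
    samples.map { Float($0) / 32768.0 }
}
```

Dividing by 32768 maps `Int16.min` to exactly -1.0 and keeps every sample strictly below 1.0.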

Build & Run

# Requirements: macOS 26+, Xcode 26+ with Swift 6.2

# 1. Configure API key (first time only)
cp .env.test.example .env.test
# Edit .env.test and add your Gemini API key
# Get one at: https://aistudio.google.com/apikey

# 2. Build + codesign + run
make run

# Other targets
make build       # Build only
make run-log     # Run with full logging
make test        # Run 182 tests
make clean       # Clean build artifacts

Grant Screen Recording permission when prompted (System Settings → Privacy & Security → Screen Recording).

Asset Generation

All visual and audio assets are AI-generated:

Sprites: fal.ai Nano Banana 2 pipeline — text-to-image base → image-to-image editing per emotion/frame → background removal → PNG with alpha. 16 sprites per character (4 emotions × 4 frames). 6 characters included.
Tray Icons: Animated menu bar icons in 1x/2x/3x Retina resolutions, 4 emotions × 4 frames × 3 resolutions = 48 icons.
Background Music: Google Lyria RealTime — lo-fi tracks golden_lofi.wav and coding_lofi.wav (48kHz stereo WAV).

See docs/reference/asset-pipeline.md for the full pipeline architecture, engine comparison, and how to generate sprites for new characters.

Interaction Model

Input	How	What Happens
Passive	Automatic every 5s	VisionAgent analyzes cursor area, Mediator gates response
Silence break	After adaptive threshold	EngagementAgent triggers with screen-specific observation
Circle gesture	Draw 3 circles in 6s	Force captures full window, bypasses Mediator
High significance	Auto-detected (sig ≥ 7)	Full active window captured for deeper analysis
Voice chat	⌘⇧G or "잼민아"	Interactive conversation via text or voice
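
One way the circle gesture could be detected is by summing the angle the cursor sweeps around the centroid of its recent path. This is a sketch of that idea, not the repository's actual detector: the `Sample` type, the function name `completedCircles`, and the centroid-based winding approach are all assumptions. Only the "3 circles in 6s" rule comes from the table above.

```swift
import Foundation

// Hypothetical cursor sample: position plus timestamp in seconds.
struct Sample { let x: Double; let y: Double; let t: Double }

/// Counts full loops drawn within the last `window` seconds by accumulating
/// the signed angle swept around the centroid of the recent samples.
func completedCircles(_ samples: [Sample], window: Double = 6) -> Int {
    guard let latest = samples.last?.t else { return 0 }
    let recent = samples.filter { latest - $0.t <= window }
    guard recent.count >= 3 else { return 0 }

    let cx = recent.map(\.x).reduce(0, +) / Double(recent.count)
    let cy = recent.map(\.y).reduce(0, +) / Double(recent.count)

    var swept = 0.0
    for (a, b) in zip(recent, recent.dropFirst()) {
        var delta = atan2(b.y - cy, b.x - cx) - atan2(a.y - cy, a.x - cx)
        // Unwrap jumps across the ±π boundary so each step stays small.
        if delta > .pi { delta -= 2 * .pi }
        if delta < -.pi { delta += 2 * .pi }
        swept += delta
    }
    return Int(abs(swept) / (2 * .pi))
}
```

A caller would trigger the forced analysis when `completedCircles(history) >= 3`. Taking the absolute value means clockwise and counter-clockwise circles both count.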

Key Design Decisions

Zero dependencies — Pure Swift + Apple frameworks. No SPM packages. Ship-ready binary.
Personality at the boundary — Agents analyze in neutral structured text. Only the output adapter adds character personality.
Fast reconnect — No exponential backoff for the WebSocket. It retries after a fixed 1s delay, so the companion is back almost immediately.
Adaptive, not annoying — MARL-inspired scheduler tracks response rates. Ignore it → speaks less. Engage → speaks more.
Specific, not generic — Prompts are written to spot errors, suggest fixes, and ask concrete questions.
Native audio only — No macOS TTS. Gemini's native audio with configurable voice and affective dialog.
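
The "adaptive, not annoying" behavior can be sketched as two rates that steer two timing values. This is an illustrative simplification, not the project's scheduler: the counters, the linear mappings, and the bounds (30 to 120 s of silence, 10 to 60 s of cooldown) are all assumed. Only the `responseRate` → silence threshold and `interruptRate` → cooldown relationships come from the architecture diagram.

```swift
// Hypothetical scheduler; all numeric bounds are assumptions for illustration.
struct AdaptiveScheduler {
    private(set) var utterances = 0
    private(set) var responses = 0
    private(set) var interrupts = 0

    /// Records how the user reacted to one spoken comment.
    mutating func record(userResponded: Bool, userInterrupted: Bool) {
        utterances += 1
        if userResponded { responses += 1 }
        if userInterrupted { interrupts += 1 }
    }

    // With no history yet, start from a neutral response rate.
    var responseRate: Double { utterances == 0 ? 0.5 : Double(responses) / Double(utterances) }
    var interruptRate: Double { utterances == 0 ? 0.0 : Double(interrupts) / Double(utterances) }

    /// Seconds of silence before a proactive comment: engaged users get a shorter wait.
    var silenceThreshold: Double { 120 - 90 * responseRate }
    /// Seconds between comments: frequent interruptions lengthen the cooldown.
    var cooldown: Double { 10 + 50 * interruptRate }
}
```

A user who always responds ends up with a 30 s silence threshold, while one who always ignores the companion gets the full 120 s.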

Technical Constraints

Constraint	Value
macOS	26+ (Tahoe)
Swift	6.2+
Architecture	Apple Silicon (arm64)
Dependencies	Zero (pure Swift + Apple frameworks)
Vision Interval	5 seconds (configurable)
Audio	24kHz Float32 mono streaming
WebSocket	Dual-ping keep-alive, session resumption

Hackathon

Gemini 3 Seoul Hackathon · February 28, 2026 · Sebitsom 3F Vista, Seoul

Criteria	Weight	How KpopDemonCoders Addresses It
Demo	50%	Live desktop saja — visible, audible, interactive
Impact	25%	Developer companion that spots errors and adapts to behavior
Creativity	15%	MARL-inspired multi-agent 저승사자 with emotion-responsive native audio
Pitch	10%	"A demon that haunts your screen and exorcises your bugs"
