A Claude Code plugin providing powerful video analysis skills that combine visual frame extraction and audio transcription to enable comprehensive multimodal video understanding.
| Skill | Description | Usage |
|---|---|---|
| query-video-segment | Analyze video segments with multimodal AI (frames + audio) | `/codey-video:query-video-segment <path> <start> <end> <query>` |
| generate-video-outline | Create timestamped chapter outline of full video | `/codey-video:generate-video-outline <path>` |
| generate-transcript | Generate complete timestamped audio transcript | `/codey-video:generate-transcript <path>` |
- 100% Local Preprocessing - All video processing and transcription happen on your machine; only the AI analysis relies on Claude Code
- Multimodal Analysis - Combines visual frames (1fps) with Whisper audio transcription for comprehensive understanding
- Fast Haiku Subagent - Uses Claude Haiku with native vision for cost-effective analysis
- Pure Node.js - No Python dependencies. Uses TypeScript + npm for all preprocessing
- Parallel Processing - Frame extraction and transcription run simultaneously for speed
- Flexible - Analyze short segments or process hours of video
System requirements:
- macOS: `brew install ffmpeg`
- Linux (Ubuntu/Debian): `sudo apt install build-essential ffmpeg`
- Windows: Install MinGW-w64 or MSYS2, then `choco install ffmpeg`
- Node.js: Version 18+ (for ESM support)
```bash
# Clone the repository
git clone https://github.com/yourusername/codey-video.git
cd codey-video

# Install dependencies and download Whisper model
./scripts/post-install.sh

# Use with Claude Code
claude code --plugin-dir $(pwd)
```

Skills are available under the `codey-video:` namespace.
Note: You'll need to pass `--plugin-dir $(pwd)` each time you start Claude Code from this directory. Consider creating a shell alias:

```bash
# Add to ~/.zshrc or ~/.bashrc
alias codey='claude code --plugin-dir ~/path/to/codey-video'
```

The post-install.sh script installs dependencies. To download a Whisper model for transcription:

```bash
cd scripts
npm run download-whisper base.en
```

Recommended model: base.en (142 MB) - Best balance of speed and accuracy for English
Analyze what happens in a specific time range:
```
/codey-video:query-video-segment ~/Videos/presentation.mp4 60 120 "What is being discussed in this segment?"
```
Arguments:
- `path` - Path to video file
- `start` - Start time in seconds
- `end` - End time in seconds
- `query` - What you want to know about the segment
Create a timestamped chapter outline with visual and audio descriptions:
```
/codey-video:generate-video-outline ~/Videos/tutorial.mp4
```
Output includes:
- Chapter titles and time ranges
- Visual descriptions of what's shown
- Key points from the audio
- Scene transitions and topic changes
Get a complete timestamped transcript:
```
/codey-video:generate-transcript ~/Videos/interview.mp4
```
Returns segment-level timestamps in the format:

```
[MM:SS.S → MM:SS.S] Transcribed text
```
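As a sketch of that timestamp layout (assuming zero-padded minutes and tenth-of-a-second precision; these helper names are illustrative, not the plugin's actual code):

```typescript
// Format seconds as MM:SS.S, matching the transcript layout above.
// Illustrative helpers; transcribe.ts may format differently.
function fmtTimestamp(t: number): string {
  const m = Math.floor(t / 60);
  const s = (t - m * 60).toFixed(1).padStart(4, "0"); // e.g. "05.3"
  return `${String(m).padStart(2, "0")}:${s}`;
}

function transcriptLine(start: number, end: number, text: string): string {
  return `[${fmtTimestamp(start)} → ${fmtTimestamp(end)}] ${text}`;
}
```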
Download models with:

```bash
cd scripts
npm run download-whisper <model-name>
```

Available models (English-only .en models are faster):
| Model | Size | Speed | Accuracy | Best For |
|---|---|---|---|---|
| tiny.en | 75 MB | Fastest | Lowest | Quick drafts, low-resource systems |
| base.en | 142 MB | Fast | Good | **Recommended** - Best balance |
| small.en | 466 MB | Slow | Better | When accuracy is critical |
| tiny | 75 MB | Fast | Low | Multilingual (100+ languages) |
| base | 142 MB | Medium | Good | Multilingual |
| small | 466 MB | Slow | Better | Multilingual, high accuracy |
Models are stored in `resources/models/` and auto-detected in preference order: base.en, base, tiny.en, tiny, small.en, small.
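That auto-detection amounts to a first-match scan over the preference list. A minimal sketch (here the downloaded models are represented as a set of filenames; `pickModel` is an illustrative name, not the plugin's actual function):

```typescript
// Preference order from the docs: first match among downloaded models wins.
const PREFERENCE = ["base.en", "base", "tiny.en", "tiny", "small.en", "small"];

// Return the first preferred model present on disk, or undefined if none.
function pickModel(downloaded: Set<string>): string | undefined {
  return PREFERENCE.find((name) => downloaded.has(name));
}
```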
- Preprocessing Scripts (TypeScript/Node.js)
  - `extract-frames.ts` - Uses ffmpeg (via execa) + sharp to extract frames at 1fps and convert them to WebP
  - `transcribe.ts` - Uses ffmpeg + @fugood/whisper.node for local CPU transcription
  - `get-duration.ts` - Uses ffprobe to query video duration
- Haiku Subagent (Forked Context)
  - Runs preprocessing scripts in parallel via the Bash tool
  - Reads extracted frame images via the Read tool
  - Synthesizes visual and audio information to answer queries
  - Uses native multimodal vision (no extra API calls)
- Skill Definitions (SKILL.md)
  - Define prompt templates that instruct the subagent
  - Specify allowed tools (Bash, Read)
  - Map arguments to script parameters
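The parallel preprocessing step can be sketched as a `Promise.all` over two independent child processes. This is a simplified illustration: the `node -e` commands below are stand-ins for the real extract-frames.ts / transcribe.ts invocations, and `preprocess` is not the plugin's actual API.

```typescript
import { promisify } from "node:util";
import { execFile } from "node:child_process";

const run = promisify(execFile);

// Frame extraction and transcription are independent, so both child
// processes can run concurrently; Promise.all awaits both results.
async function preprocess(video: string): Promise<string[]> {
  const [frames, transcript] = await Promise.all([
    run("node", ["-e", "console.log('frames extracted')"]),   // stand-in for extract-frames.ts
    run("node", ["-e", "console.log('audio transcribed')"]),  // stand-in for transcribe.ts
  ]);
  return [frames.stdout.trim(), transcript.stdout.trim()];
}
```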
- Deterministic Preprocessing - Expensive ffmpeg/Whisper operations run once, results are cached
- No Auth Issues - Subagent inherits Claude Code authentication automatically
- Clean Separation - Scripts do data processing, AI does reasoning
- Cost-Effective - Only sends relevant frames/transcript to fast Haiku model
- Privacy-First - All processing local, no external API calls
Download a model:

```bash
cd scripts
npm run download-whisper base.en
```

Install ffmpeg via your system package manager (see System Requirements above).
- Verify video file path is correct
- Check start/end times are within video duration
- Test with ffprobe: `ffprobe <video-file>`
- Check if the video has an audio stream: `ffprobe <video-file>`
- Some screen recordings lack audio tracks
- Video files without audio will show "No audio stream found"
Try a larger model:

```bash
cd scripts
npm run download-whisper small.en
```

The transcription script auto-detects and uses the best available model.
Simply stop using the `--plugin-dir` flag. To remove models and dependencies:

```bash
cd /path/to/codey-video
rm -rf resources/models/*.bin scripts/node_modules
```

Edit `scripts/extract-frames.ts`, line 60 (the line where `fps=1` is defined):

```typescript
'-vf', 'fps=2' // Change from fps=1 to fps=2
```

The extract-frames.ts script also supports extracting exactly N frames:

```bash
npx tsx scripts/extract-frames.ts video.mp4 0 120 --count 50
```

This extracts exactly 50 frames uniformly distributed across the time range.
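One plausible way to compute those uniformly distributed sample times, endpoints included (an illustrative sketch; extract-frames.ts may use a different scheme, and `uniformTimestamps` is a hypothetical name):

```typescript
// Evenly space `count` timestamps across [start, end], including both endpoints.
function uniformTimestamps(start: number, end: number, count: number): number[] {
  if (count === 1) return [(start + end) / 2]; // single frame: take the midpoint
  const step = (end - start) / (count - 1);
  return Array.from({ length: count }, (_, i) => start + i * step);
}
```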
The transcribe.ts script supports multiple time formats:
- Seconds: `120`
- MM:SS: `2:00`
- HH:MM:SS: `1:30:00`
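Parsing all three formats reduces to folding the colon-separated parts with a base of 60. A minimal sketch, assuming each part is a plain number (`parseTime` is illustrative, not necessarily what transcribe.ts does):

```typescript
// "120" → 120, "2:00" → 120, "1:30:00" → 5400
function parseTime(input: string): number {
  return input
    .split(":")
    .map(Number)
    .reduce((total, part) => total * 60 + part, 0);
}
```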
For videos longer than a few minutes, consider:
- Using `/generate-video-outline` first to understand structure
- Then using `/query-video-segment` to deep-dive into specific segments
- Breaking analysis into multiple queries
Future skill ideas (contributions welcome):
- `/extract-frames` - Just extract frames without AI analysis (for screenshots)
- `/video-search <path> <query>` - Search for when something specific happens
- `/video-to-slides <path>` - Extract unique slides from presentation recordings (deduplicate similar frames)
- `/video-chapters <path>` - Output YouTube-style chapter markers
Built with:
- ffmpeg - Video/audio processing
- execa - Process execution
- sharp - High-performance image processing
- @fugood/whisper.node - Fast Whisper bindings for Node.js
- Claude Code - AI-powered CLI and skill system
Contributions welcome! Please see CLAUDE.md for developer documentation.
Copyright (c) 2026 The University of Texas Southwestern Medical Center.
Licensed for academic research use only. See LICENSE for details. Commercial use is expressly prohibited.