Automated video analysis toolkit for human interaction research - Extract comprehensive behavioral annotations from videos using AI pipelines, with an intuitive web interface for visualization and analysis.
VideoAnnotator automatically analyzes videos of human interactions and extracts rich behavioral data including:
- 👥 Person tracking - Multi-person detection and pose estimation with persistent IDs
- 😊 Facial analysis - Emotions, expressions, gaze direction, and action units
- 🎬 Scene detection - Environment classification and temporal segmentation
- 🎤 Audio analysis - Speech recognition, speaker identification, and emotion detection
Perfect for researchers studying parent-child interactions, social behavior, developmental psychology, and human-computer interaction.
VideoAnnotator provides both automated processing and interactive visualization:
VideoAnnotator - AI-powered video processing pipeline:
- Processes videos to extract behavioral annotations
- REST API for integration with research workflows
- Supports batch processing and custom configurations
- Outputs standardized JSON data
Video Annotation Viewer - Interactive web-based visualization tool:
- Load and visualize VideoAnnotator results
- Synchronized video playback with annotation overlays
- Timeline scrubbing with pose, face, and audio data
- Export tools for further analysis
Complete workflow: Your Videos → [VideoAnnotator Processing] → Annotation Data → [Video Annotation Viewer] → Interactive Analysis
Recommended: Run VideoAnnotator in Docker for the most reliable experience (consistent dependencies, easier GPU support, fewer host setup issues).
CPU (works anywhere):
```bash
docker compose up --build
```

GPU (faster processing; requires NVIDIA Container Toolkit):

```bash
docker compose --profile gpu up --build videoannotator-gpu
```

Then open the interactive API docs at http://localhost:18011/docs.
If you want to initialize the database and create an admin API key explicitly:
```bash
docker compose exec videoannotator setupdb --admin-email you@example.com --admin-username you
```

Alternatively, install locally with uv (advanced):

```bash
# Install modern Python package manager
curl -LsSf https://astral.sh/uv/install.sh | sh              # Linux/Mac
# powershell -c "irm https://astral.sh/uv/install.ps1 | iex" # Windows
# Clone and install
git clone https://github.com/InfantLab/VideoAnnotator.git
cd VideoAnnotator
uv sync # Fast dependency installation (30 seconds)
# Initialize the local database (creates tables + admin user/token)
uv run videoannotator setup-db --admin-email you@example.com --admin-username you
```

If you are using the provided Docker/devcontainer images, a few convenience commands are available on PATH.
These are optional shortcuts; the canonical CLI remains `uv run videoannotator ...`.
If you are running via Docker Compose, you can use these shortcuts without "shelling in" manually:
```bash
docker compose exec videoannotator setupdb --admin-email you@example.com --admin-username you
docker compose exec videoannotator server --host 0.0.0.0 --port 18011

# If you launched the GPU service instead:
docker compose exec videoannotator-gpu setupdb --admin-email you@example.com --admin-username you
docker compose exec videoannotator-gpu server --host 0.0.0.0 --port 18011
```

| Action | Shortcut | Equivalent |
|---|---|---|
| Initialize the database + create an admin token | `setupdb --admin-email you@example.com --admin-username you` | `uv run videoannotator setup-db --admin-email you@example.com --admin-username you` |
| Run the VideoAnnotator CLI (any subcommand) | `va ...` | `uv run videoannotator ...` |
| Start the API server (recommended defaults) | `va` | `uv run videoannotator` |
| Start the API server (explicit subcommand) | `server ...` | `uv run videoannotator server ...` |
| Generate a new API token | `newtoken ...` | `uv run videoannotator generate-token ...` |
| Run all tests (quick/quiet) | `vatest` | `uv run pytest -q` |
| Run a subset of tests (quick/quiet) | `vatest tests/unit/` | `uv run pytest -q tests/unit/` |
For more copy-pasteable CLI workflows, see docs/usage/demo_commands.md.
```bash
# Start the API server
uv run videoannotator   # Local install (advanced). In Docker: `docker compose up`

# Use the API key printed by `setup-db` (or the server's first-start output)
```

```bash
# Process your first video (in another terminal)
curl -X POST "http://localhost:18011/api/v1/jobs/" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "video=@your_video.mp4" \
  -F "selected_pipelines=person,face,scene,audio"

# Check results at http://localhost:18011/docs
```
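If you prefer Python to curl, the same job submission can be scripted with the `requests` library. This is a minimal sketch mirroring the curl call above; the endpoint, auth header, and form fields come from that example, while the file name and key are placeholders:

```python
# Submit a video processing job to a locally running VideoAnnotator server.
import requests

API_URL = "http://localhost:18011/api/v1/jobs/"
API_KEY = "YOUR_API_KEY"  # printed by `setup-db` or the server's first start

with open("your_video.mp4", "rb") as video:  # placeholder file name
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"video": video},
        data={"selected_pipelines": "person,face,scene,audio"},
    )

response.raise_for_status()
print(response.json())  # job metadata; see http://localhost:18011/docs for the full schema
```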
```bash
# Install the companion web viewer
git clone https://github.com/InfantLab/video-annotation-viewer.git
cd video-annotation-viewer
npm install
npm run dev

# Open http://localhost:3000 and load your VideoAnnotator results
```

Note: Ensure Node.js and npm are installed. On macOS with Homebrew: `brew install node`.

🎉 That's it! You now have both automated video processing and interactive visualization.
Authoritative pipeline metadata (names, tasks, modalities, capabilities) is generated from the registry:
- Pipeline specification table: `docs/pipelines_spec.md` (auto-generated; do not edit by hand)
- Emotion output format spec: `docs/specs/emotion_output_format.md`
Additional Specs:
- Output Naming Conventions: `docs/specs/output_naming_conventions.md` (stable patterns for downstream tooling)
- Emotion Validator Utility: `src/validation/emotion_validator.py` (programmatic validation of `.emotion.json` files)
- CLI Validation: `videoannotator validate-emotion path/to/file.emotion.json` returns a non-zero exit code on failure

Client tools (e.g. the Video Annotation Viewer) should rely on those sources or the `/api/v1/pipelines` endpoint rather than hard-coding pipeline assumptions.
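For example, a client can discover the available pipelines at runtime rather than hard-coding them. The sketch below assumes a locally running server on port 18011 and a valid API key; the exact response schema is documented in the interactive API docs:

```python
# List the registered pipelines from a running VideoAnnotator server.
import requests

API_KEY = "YOUR_API_KEY"
response = requests.get(
    "http://localhost:18011/api/v1/pipelines",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()

for pipeline in response.json():  # assumed: a JSON array of pipeline descriptors
    print(pipeline)
```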
Person tracking:
- Technology: YOLO11 + ByteTrack multi-object tracking
- Outputs: Bounding boxes, pose keypoints, persistent person IDs
- Use cases: Movement analysis, social interaction tracking, activity recognition
Face analysis:
- Technology: OpenFace 3.0, LAION Face, OpenCV backends
- Outputs: 68-point landmarks, emotions, action units, gaze direction, head pose
- Use cases: Emotional analysis, attention tracking, facial expression studies
Scene detection:
- Technology: PySceneDetect + CLIP environment classification
- Outputs: Scene boundaries, environment labels, temporal segmentation
- Use cases: Context analysis, setting classification, behavioral context
Audio analysis:
- Technology: OpenAI Whisper + pyannote speaker diarization
- Outputs: Speech transcripts, speaker identification, voice emotions
- Use cases: Conversation analysis, language development, vocal behavior
- No coding required - Web interface and REST API
- Standardized outputs - JSON formats compatible with analysis tools
- Reproducible results - Version-controlled processing pipelines
- Batch processing - Handle multiple videos efficiently
- State-of-the-art models - YOLO11, OpenFace 3.0, Whisper
- Validated pipelines - Tested on developmental psychology datasets
- Comprehensive metrics - Confidence scores, validation tools
- Flexible configuration - Adjust parameters for your research needs
- Fast processing - GPU acceleration, optimized pipelines
- Scalable architecture - Docker containers, API-first design
- Cross-platform - Windows, macOS, Linux support
- Enterprise features - Authentication, logging, monitoring
- 100% Local Processing - All analysis runs on your hardware, no cloud dependencies
- No Data Transmission - Videos and results never leave your infrastructure
- GDPR Compliant - Full control over sensitive research data
- Foundation Model Free - No external API calls to commercial AI services
- Research Ethics Ready - Designed for studies requiring strict data confidentiality
VideoAnnotator generates rich, structured data like this:
```json
{
"person_tracking": [
{
"timestamp": 12.34,
"person_id": 1,
"bbox": [0.2, 0.3, 0.4, 0.5],
"pose_keypoints": [...],
"confidence": 0.87
}
],
"face_analysis": [
{
"timestamp": 12.34,
"person_id": 1,
"emotion": "happy",
"confidence": 0.91,
"facial_landmarks": [...],
"gaze_direction": [0.1, -0.2]
}
],
"scene_detection": [
{
"start_time": 0.0,
"end_time": 45.6,
"scene_type": "living_room",
"confidence": 0.95
}
],
"audio_analysis": [
{
"start_time": 1.2,
"end_time": 3.8,
"speaker": "adult",
"transcript": "Look at this toy!",
"emotion": "excited"
}
]
}
```

- Python: Import JSON data into pandas, matplotlib, seaborn (see the example after this list)
- R: Load data with jsonlite, analyze with tidyverse
- MATLAB: Process JSON with built-in functions
- CVAT: Computer Vision Annotation Tool integration
- LabelStudio: Machine learning annotation platform
- ELAN: Linguistic annotation software compatibility
- Video Annotation Viewer: Interactive web-based analysis (recommended)
- Custom dashboards: Build with our REST API
- Jupyter notebooks: Examples included in repository
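For instance, the person-tracking records shown above can be loaded straight into a pandas DataFrame. This is a minimal sketch: `results.json` is a placeholder path, and the real file layout follows `docs/specs/output_naming_conventions.md`:

```python
# Load VideoAnnotator output (structured as in the example above) and summarize tracking.
import json
import pandas as pd

with open("results.json") as f:  # placeholder path; see the output naming conventions spec
    results = json.load(f)

tracking = pd.DataFrame(results["person_tracking"])
print(f"{tracking['person_id'].nunique()} unique people across {len(tracking)} detections")
print(tracking.groupby("person_id")["confidence"].mean())
```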
```bash
# Modern Python environment
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/InfantLab/VideoAnnotator.git
cd VideoAnnotator
uv sync
# Start processing
uv run videoannotator
```

```bash
# CPU version (lightweight)
docker build -f Dockerfile.cpu -t videoannotator:cpu .
docker run -p 18011:18011 videoannotator:cpu
# GPU version (faster processing)
docker build -f Dockerfile.gpu -t videoannotator:gpu .
docker run -p 18011:18011 --gpus all videoannotator:gpu
# Development version (pre-cached models)
docker build -f Dockerfile.dev -t videoannotator:dev .
docker run -p 18011:18011 --gpus all videoannotator:dev
```

```python
# Python API for custom workflows
from videoannotator import VideoAnnotator
annotator = VideoAnnotator()
results = annotator.process("video.mp4", pipelines=["person", "face"])
# Analyze results
import pandas as pd
df = pd.DataFrame(results['person_tracking'])
print(f"Detected {df['person_id'].nunique()} unique people")| Resource | Description |
|---|---|
| 📚 Interactive Docs | Complete documentation with examples |
| 🎮 Live API Testing | Interactive API when the server is running |
| 🚀 Getting Started Guide | Step-by-step setup and first video |
| 🔧 Installation Guide | Detailed installation instructions |
| ⚙️ Pipeline Specifications | Technical pipeline documentation |
| 🎯 Demo Commands | Example commands and workflows |
- Parent-child interaction studies with synchronized behavioral coding
- Social development research with multi-person tracking
- Language acquisition studies with audio-visual alignment
- Autism spectrum behavioral analysis with facial expression tracking
- Therapy session analysis with emotion and engagement metrics
- Developmental assessment with standardized behavioral measures
- User experience research with attention and emotion tracking
- Interface evaluation with gaze direction and facial feedback
- Accessibility studies with comprehensive behavioral data
- FastAPI - High-performance REST API with automatic documentation
- YOLO11 - State-of-the-art object detection and pose estimation
- OpenFace 3.0 - Comprehensive facial behavior analysis
- Whisper - Robust speech recognition and transcription
- PyTorch - GPU-accelerated machine learning inference
- Processing speed: ~2-4x real-time with GPU acceleration
- Memory usage: 4-8GB RAM for typical videos
- Storage: ~100MB output per hour of video
- Accuracy: 90%+ for person detection, 85%+ for emotion recognition
- Batch processing: Handle multiple videos simultaneously
- Container deployment: Docker support for cloud platforms
- Distributed processing: API-first design for microservices
- Resource optimization: CPU and GPU variants available
- 🐛 Report issues: GitHub Issues
- 💬 Discussions: GitHub Discussions
- 📧 Contact: Caspar Addyman at infantologist@gmail.com
- 🔬 Collaborations: Open to research partnerships
- Code quality: 83% test coverage, modern Python practices
- Documentation: Comprehensive guides and API documentation
- CI/CD: Automated testing and deployment pipelines
- Standards: Following research software engineering best practices
If you use VideoAnnotator in your research, please cite:
Addyman, C. (2025). VideoAnnotator: Automated video analysis toolkit for human interaction research.
Zenodo. https://doi.org/10.5281/zenodo.16961751
MIT License - Full terms in LICENSE
- The Global Parenting Initiative (Funded by The LEGO Foundation)
- Caspar Addyman (infantologist@gmail.com) - Lead Developer & Research Director
Built with and grateful to:
- YOLO & Ultralytics - Object detection and tracking
- OpenFace 3.0 - Facial behavior analysis
- OpenAI Whisper - Speech recognition
- FastAPI - Modern web framework
- PyTorch - Machine learning infrastructure
Development was greatly helped by:
- Visual Studio Code - Primary development environment
- GitHub Copilot - AI pair programming assistance
- Claude Code - Architecture design and documentation
- GPT-4 & Claude Models - Code generation and debugging help
This project demonstrates how AI-assisted development can accelerate research software creation while maintaining code quality and comprehensive testing.
🎥 Ready to start analyzing videos? Follow the quick start above!