A real-time transcription system for macOS audio output, designed for meetings, chats, and similar scenarios. It captures the device's audio output as it plays and runs speech recognition on it to produce a live transcript.
- Real-time Transcription: Supports real-time speech recognition and transcription of audio streams
- Audio Capture: Uses BlackHole + Multi-Output Device technology to capture system audio output
- Smart Detection: Integrates WebRTC VAD for speech detection, improving transcription accuracy
- Multi-language Support: Supports speech recognition and transcription in multiple languages
- Virtual Environment: Uses a Python virtual environment to isolate dependencies
- One-click Installation: Automated installation script to simplify deployment
System Audio Output → Multi-Output Device → Outputs to Two Destinations
├── Speakers/Headphones (Normal playback, no impact)
└── BlackHole → Transcription System (Background capture)
How it works:
- Multi-Output Device: A virtual audio device that sends the same audio stream to multiple destinations simultaneously
  - ✅ Speakers/Headphones: Audio plays normally, so what the user hears is unaffected
  - ✅ BlackHole: Receives the same stream at the same time and hands it to the transcription system
- BlackHole: Virtual loopback audio driver that makes the audio stream available for capture without producing any audible output
  - Generates no sound and runs silently
  - Does not affect normal audio playback
- Audio Processing: Uses WebRTC VAD to detect speech activity and filter out silence
  - Identifies speech segments so that only those are transcribed, improving efficiency
  - Reduces noise interference, improving transcription quality
- Transcription Engine: Real-time speech recognition based on Whisper.cpp
  - Runs in the background without affecting audio playback
  - Generates transcription text in real time
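The voice-activity gating step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `split_frames`, `energy_is_speech`, and `speech_segments` are hypothetical names, the 16 kHz sample rate and 30 ms frame size are assumptions chosen because WebRTC VAD accepts them, and the energy threshold is a simple stand-in for the real detector so the example runs with the standard library alone.

```python
import array
from typing import Callable, Iterable, Iterator

SAMPLE_RATE = 16_000   # Hz; assumed capture rate (one of the rates WebRTC VAD accepts)
FRAME_MS = 30          # WebRTC VAD works on 10, 20, or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM


def split_frames(pcm: bytes) -> Iterator[bytes]:
    """Yield fixed-size frames, dropping any trailing partial frame."""
    for start in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[start:start + FRAME_BYTES]


def energy_is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Stand-in detector (assumption): mean absolute amplitude above a threshold."""
    samples = array.array("h", frame)
    return sum(abs(s) for s in samples) / len(samples) > threshold


def speech_segments(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    max_silence: int = 10,
) -> Iterator[bytes]:
    """Group speech frames into segments.

    Silent frames are dropped; a segment is closed once `max_silence`
    consecutive silent frames follow it.
    """
    current: list[bytes] = []
    silence = 0
    for frame in frames:
        if is_speech(frame):
            current.append(frame)
            silence = 0
        elif current:
            silence += 1
            if silence >= max_silence:
                yield b"".join(current)
                current, silence = [], 0
    if current:
        yield b"".join(current)
```

With the `webrtcvad` package installed, the stand-in could presumably be swapped for the real detector by passing `lambda f: webrtcvad.Vad(2).is_speech(f, SAMPLE_RATE)` (ideally with the `Vad` object created once) as `is_speech`. Each yielded segment would then be handed to the transcription engine.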
- BlackHole 2ch: Virtual audio driver for audio stream capture
- WebRTC VAD: Voice Activity Detection for identifying speech segments
- Whisper.cpp: Efficient speech recognition engine
- Python Virtual Environment: Dependency management and environment isolation
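One plausible way for the Python side to hand a captured segment to Whisper.cpp is through its command-line tool. The sketch below only builds the command; `build_whisper_command` is a hypothetical helper, the binary name `whisper-cli` is an assumption (older whisper.cpp builds name it `main`), and the default model path is taken from the project layout. The flags `-m`, `-f`, `-l`, and `-nt` are whisper.cpp CLI options for model, input file, language, and no timestamps.

```python
from pathlib import Path


def build_whisper_command(
    wav_path: str,
    model_path: str = "models/ggml-small.bin",
    language: str = "auto",
    binary: str = "whisper-cli",  # assumption: depends on how whisper.cpp was built
) -> list[str]:
    """Return the argument list for transcribing one 16 kHz WAV file."""
    return [
        binary,
        "-m", str(Path(model_path)),  # model file
        "-f", str(Path(wav_path)),    # input audio
        "-l", language,               # spoken language, or "auto" to detect
        "-nt",                        # omit timestamps from the printed text
    ]
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)`, with the transcript read from `stdout`. Building the command as a list rather than a shell string avoids quoting problems with paths that contain spaces.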
- Operating System: macOS 10.15+ (Catalina and above)
- Python: 3.11+
- Memory: Recommended 8GB+
- Storage: At least 500MB available space
- Network: Required for downloading model files on first installation
git clone https://github.com/your-username/audio-captions-rt.git
cd audio-captions-rt
python3 setup-env.py
The installation script automatically performs the following steps:
- ✅ Check Python version
- ✅ Create Python virtual environment
- ✅ Install whisper.cpp
- ✅ Install BlackHole audio driver
- ✅ Install Python dependencies
- ✅ Download Whisper models
- ✅ Check audio devices
- ✅ Create startup scripts
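The first checks in a setup script of this kind could look something like the sketch below. It is not the contents of `setup-env.py`; `preflight` is a hypothetical function, and the choice of `brew` and `git` as required tools is an assumption based on the commands used elsewhere in this README.

```python
import shutil
import sys


def preflight(
    min_python: tuple[int, int] = (3, 11),
    required_tools: tuple[str, ...] = ("brew", "git"),  # assumption
) -> list[str]:
    """Return a list of human-readable problems; empty means ready to install."""
    problems = []
    if sys.version_info < min_python:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        wanted = ".".join(map(str, min_python))
        problems.append(f"Python {wanted}+ required, found {found}")
    for tool in required_tools:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems
```

Collecting all problems before exiting lets the user fix everything in one pass instead of rerunning the script after each failure.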
# macOS/Linux
./start_translator.sh
# Windows
start_translator.bat
# Activate virtual environment
source venv/bin/activate
# Start transcription system
python simple_transcriber.py --source en --target zh
Supported language codes:
- en: English
- zh: Chinese
- ja: Japanese
- ko: Korean
- fr: French
- de: German
- es: Spanish
- it: Italian
- pt: Portuguese
- ru: Russian
- ar: Arabic
- auto: Auto-detect
- English → Chinese: Transcribe English audio to Chinese text
- Chinese → English: Transcribe Chinese audio to English text
- Auto-detect → Chinese: Automatically detect language and transcribe to Chinese
- Custom: Manually specify source and target languages
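A command-line interface matching the `--source` and `--target` options shown above might be declared along these lines. This is a sketch, not the actual argument handling in `simple_transcriber.py`: the defaults (`auto` and `zh`) and the rule that `auto` is valid only as a source are assumptions.

```python
import argparse

# Language codes listed in this README
LANGUAGES = {
    "en": "English", "zh": "Chinese", "ja": "Japanese", "ko": "Korean",
    "fr": "French", "de": "German", "es": "Spanish", "it": "Italian",
    "pt": "Portuguese", "ru": "Russian", "ar": "Arabic", "auto": "Auto-detect",
}


def parse_args(argv=None) -> argparse.Namespace:
    """Parse --source/--target, rejecting codes outside the supported set."""
    parser = argparse.ArgumentParser(prog="simple_transcriber.py")
    parser.add_argument(
        "--source", choices=sorted(LANGUAGES), default="auto",
        help="language spoken in the audio (default: auto-detect)",
    )
    parser.add_argument(
        "--target", choices=sorted(c for c in LANGUAGES if c != "auto"),
        default="zh", help="language of the output text",
    )
    return parser.parse_args(argv)
```

Using `choices` means an unsupported code fails immediately with a usage message rather than surfacing later as a recognition error.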
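The system automatically detects available audio devices and recommends output devices that support input capture. The Multi-Output Device itself is created in the Audio MIDI Setup app and then selected in System Preferences:
- Open Audio MIDI Setup (Applications → Utilities)
- Click "+" and choose Create Multi-Output Device
- Check BlackHole 2ch and your speakers in the device list
- Open System Preferences → Sound → Output and set the Multi-Output Device as the default output device
See the project wiki for more detailed instructions.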
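Automatic device detection could be approximated as follows. The sketch assumes device records shaped like the dictionaries returned by `sounddevice.query_devices()` (with `name` and `max_input_channels` keys); whether the project uses that library is an assumption, and `find_capture_device` is a hypothetical helper.

```python
from typing import Mapping, Optional, Sequence


def find_capture_device(
    devices: Sequence[Mapping[str, object]],
    name_hint: str = "BlackHole",
) -> Optional[int]:
    """Return the index of the first input-capable device whose name matches."""
    for index, device in enumerate(devices):
        name = str(device["name"])
        channels = int(device["max_input_channels"])
        # The capture side must expose input channels; output-only devices
        # such as the built-in speakers are skipped.
        if name_hint.lower() in name.lower() and channels > 0:
            return index
    return None
```

If the function returns `None`, BlackHole is probably not installed or not visible to the process, which is a good moment to point the user at the setup steps above.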
audio-captions-rt/
├── README.md # Project documentation
├── setup-env.py # Automated installation script
├── simple_transcriber.py # Main transcription program
├── requirements.txt # Python dependency list
├── start_translator.sh # macOS/Linux startup script
├── start_translator.bat # Windows startup script
├── venv/ # Python virtual environment
└── models/ # Whisper model files
├── ggml-small.bin # Recommended model (244MB)
└── ggml-base.bin # Basic model (147MB)
We are actively developing the following features. Stay tuned!
- Implement real-time translation based on transcription
- Support translation between multiple languages
- Integrate high-quality translation models
- Real-time subtitle display
- Intelligent meeting audio content summarization
- Key point extraction and tagging
- Timeline annotation
- Export to multiple formats (Text, PDF, Markdown)
- Microphone Interception: Real-time capture of microphone input
- Voice Cloning: Learn and replicate specific speaker's voice characteristics
- TTS Translation: Real-time translation of speaker's language into other voices
- Cross-language Dialogue: Support real-time conversation between users of different languages
- Multi-speaker identification and separation
- Emotion analysis and tone recognition
- Custom vocabulary and professional terminology support
- Cloud synchronization and collaboration features
💡 Contributions welcome: If any of these features interest you, feel free to open an Issue or submit a Pull Request to help implement them!
# Manually install BlackHole
brew install blackhole-2ch
- Ensure BlackHole is properly installed
- Check system audio permission settings
- Restart the audio service (for example, `sudo killall coreaudiod` restarts Core Audio)
- Use higher quality Whisper models
- Adjust audio input volume
- Ensure low environmental noise
# Recreate virtual environment
rm -rf venv
python3 setup-env.py
The system displays detailed log information during operation, including:
- Audio device detection results
- Transcription progress and status
- Error messages and warnings
We welcome Issue submissions and Pull Requests!
- Fork the project
- Create a feature branch
- Commit changes
- Push to the branch
- Create a Pull Request
- Use Python 3.11+ syntax
- Follow PEP 8 code style
- Add appropriate comments and documentation
- Ensure code passes tests
This project is licensed under the MIT License - see the LICENSE file for details.
- Whisper.cpp - Efficient speech recognition engine
- BlackHole - Virtual audio driver
- WebRTC VAD - Voice Activity Detection
- Homebrew - macOS package manager
- Project Homepage: GitHub Repository
- Issue Feedback: Issues
- Feature Suggestions: Discussions
⭐ If this project helps you, please give us a star!