Extract YouTube channel knowledge into clean text files for learning & research.
- ⚖️ Legal Notice & Disclaimer
- ✨ Features
- 🤖 RAG & LLM Readiness
- 🧠 How it Works
- 🧹 Before & After
- 🛠️ Built With
- 🛠️ Installation
- 📖 Usage
- ❓ FAQ
- 🤝 Contributing
This tool is for:
- ✅ Personal, non-commercial use
- ✅ Fair use, education, research
- ✅ Offline learning & knowledge extraction
- ✅ Archiving your own content
YouTube Terms of Service allow:
- Downloading your own videos
- Offline viewing for personal use
- Subtitle extraction for accessibility
DO NOT:
- ❌ Re-upload content
- ❌ Use it for commercial services
- ❌ Mass-download without rate limiting
yt-dlp is used as the core engine (the industry standard for open-source media extraction).
Most YouTube subtitle downloaders just give you the raw VTT, which is full of repetition because YouTube "builds" sentences word-by-word in auto-generated captions.
ytknow uses a Prefix-Matching Algorithm:
- It strips all millisecond-level timestamps and inline tags.
- It compares each new line with the previous one.
- If the new line starts with the previous text, it "evolves" the line instead of repeating it.
- If it's a duplicate or a subset, it's discarded.
Result: You get a clean, human-readable paragraph instead of a 10,000-line stuttering mess.
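The steps above can be sketched in a few lines of Python. This is a simplified illustration of the prefix-matching idea, not the actual ytknow implementation, and it assumes timestamps and inline tags have already been stripped:

```python
def dedupe_captions(lines):
    """Collapse YouTube's word-by-word caption build-up into clean lines."""
    result = []
    for line in (l.strip() for l in lines):
        if not line:
            continue
        if result:
            prev = result[-1]
            if line.startswith(prev):
                result[-1] = line  # the line "evolved": keep the longer version
                continue
            if prev.startswith(line):
                continue  # duplicate or subset of the previous line: discard
        result.append(line)
    return result
```

Each utterance collapses to its final, longest form, so the stuttering word-by-word build-up becomes one complete sentence.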
ytknow cleans up the messy duplication and timing tags common in YouTube auto-captions.

Before (raw VTT):

```text
00:00:00.480 --> 00:00:03.070
das<00:00:00.640><c> heutige</c><00:00:01.079><c> Video</c>

00:00:03.070 --> 00:00:03.080
das heutige Video bedarf eines Vorworts
```

After (cleaned):

```text
Das heutige Video bedarf eines Vorworts...
```
- ✨ Interactive Menu: New TUI to easily select subtitle languages and download modes.
- 🚀 Lightning Fast: Uses `yt-dlp` with `--lazy-playlist` to start processing immediately.
- 🧹 Deep Cleaning: Removes all VTT timing codes, word-level tags, and alignment metadata.
- 🧠 Smart Deduplication: Automatically resolves sentence-building repetition in YouTube's auto-captions.
- 🤖 LLM-Optimized: Generates clean TXT and MD files with rich metadata headers and consolidated JSONL files.
- 🎙️ Whisper Fallback: Automatically transcribes videos using OpenAI Whisper if no subtitles are found.
- 🧠 AI Summarization: Generate high-quality summaries and key takeaways using the OpenAI API.
- 💬 Comments Integration: Download video comments along with transcripts or as a standalone task.
- 🔄 Smart Fallback: Automatically prefers `en-orig` if standard `en` is unavailable but requested.
- 🛡️ Resilient: Gracefully handles unavailable or private videos in large playlists.
ytknow is specifically designed for Retrieval-Augmented Generation (RAG).
Output files include header metadata (Source URL, Upload Date), allowing LLMs to cite sources and prioritize recent information.
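For example, a transcript file header could look roughly like this. The layout below is illustrative only, not the exact ytknow output format:

```text
# Video Title
Source: https://www.youtube.com/watch?v=VIDEO_ID
Upload Date: 2024-01-15
---
(transcript text follows)
```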
Every session generates a knowledge_master.jsonl file. This format (one JSON object per line) is well suited for:
- Model Fine-tuning: directly usable as a training dataset.
- Archiving: keeps the full context of each video.
ytknow automatically generates a second file called knowledge_chunks.jsonl.
- Ready-to-Embed: Splits text into ~1000-character chunks, respecting sentence boundaries.
- Overlapping: Includes a 100-character overlap to preserve context between chunks.
- Metadata Preserved: Each chunk carries the video URL, title, and upload date.
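The chunking behaviour can be approximated like this. It is a simplified sketch of the splitting and overlap described above, not the exact ytknow splitter:

```python
import re

def chunk_text(text, size=1000, overlap=100):
    """Split text into roughly `size`-char chunks on sentence boundaries,
    carrying `overlap` trailing characters into the next chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > size:
            chunks.append(current)
            current = current[-overlap:]  # overlap preserves context across chunks
        current = f"{current} {sent}".strip() if current else sent
    if current:
        chunks.append(current)
    return chunks
```

Splitting only at sentence boundaries keeps each chunk semantically coherent, which generally improves embedding quality.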
Just upload `knowledge_chunks.jsonl` to your vector DB (Pinecone, Chroma, Weaviate) and you're done!
```bash
# Clone the repository
git clone https://github.com/egohead/ytknow.git
cd ytknow

# Run the installer (macOS/Linux)
chmod +x install.sh
./install.sh
```

ytknow now features an interactive mode. Simply run it with a URL:
```bash
# Start interactive processing
ytknow [YOUTUBE_URL]
```

When you run ytknow, it will guide you through:
- Language Selection: Choose from all available subtitles (Original, Manual, or Auto-Translated).
- Download Mode:
- Knowledge Base: Subtitles + Metadata + AI Summary.
- Comments Only: Just the user comments.
- All: Everything combined.
```bash
# Skip the menu by providing a language code
ytknow [VIDEO_URL] -l en

# Summarize a video (requires OPENAI_API_KEY)
export OPENAI_API_KEY="sk-..."
ytknow [VIDEO_URL] --summarize

# Transcribe with a specific Whisper model
ytknow [VIDEO_URL] --model small

# Survey a channel for available languages
ytknow --survey [CHANNEL_URL]
```

ytknow comes with a dedicated sub-tool for bulk comment extraction: yt-comments.
```bash
# Download comments for a video
yt-comments video "[VIDEO_URL]" --format json --output ./comments

# Download comments for a whole channel
yt-comments channel "[CHANNEL_URL]" --max-videos 20

# Download from a list of URLs
yt-comments batch urls.txt --parallel 4
```

Create `~/.config/yt-comments/config.yaml`. An example is provided in `config_example.yaml`.
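A minimal config might look like the following; the keys shown are assumptions for illustration, and `config_example.yaml` in the repo is the authoritative reference:

```yaml
# ~/.config/yt-comments/config.yaml (keys are illustrative)
output: ./comments
format: json
max_videos: 20
parallel: 4
```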
- Python 3.8+
- yt-dlp: Installed automatically via `install.sh`.
- ffmpeg: Required for metadata extraction and audio transcription.
The app creates a structured knowledge base for each source.
```text
downloads/
└── ChannelName/
    ├── ChannelName_master.jsonl   <-- Full context for all videos
    ├── ChannelName_chunks.jsonl   <-- 1000-char semantic chunks (RAG ready)
    ├── ChannelName_master.txt     <-- All transcripts in one file
    ├── ChannelName_master.md      <-- All transcripts in one markdown file
    └── videos/
        └── Video_Title_1/
            ├── Video_Title_1.txt        <-- Human readable with metadata headers
            ├── Video_Title_1.md         <-- Markdown version
            ├── Video_Title_1.json       <-- YouTube comments (if enabled)
            └── Video_Title_1_summary.md <-- AI summary (if --summarize used)
```
Q: Does it work with private videos?
A: No, ytknow can only access public or unlisted content that you provide a URL for.
Q: Is it safe against YouTube bans?
A: We use yt-dlp's optimized extraction methods. For massive channels, we recommend being patient as YouTube may temporarily throttle requests.
Q: Can I use this for my RAG project?
A: Yes! The JSONL output is designed specifically for tools like LangChain, LlamaIndex, or OpenAI fine-tuning.
Contributions are welcome! Feel free to open an issue or submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
This project respects content creators and YouTube's ToS.

