📥 ytknow

ytknow logo

ytknow terminal screenshot

Extract YouTube channel knowledge into clean text files for learning & research.

License: MIT · Python 3.8+ · yt-dlp

โš–๏ธ Legal Notice & Disclaimer

This tool is for:

  • ✅ Personal, non-commercial use
  • ✅ Fair use, education, research
  • ✅ Offline learning & knowledge extraction
  • ✅ Archiving your own content

YouTube Terms of Service allow:

  • Downloading your own videos
  • Offline viewing for personal use
  • Subtitle extraction for accessibility

DO NOT:

  • โŒ Re-upload content
  • โŒ Commercial services
  • โŒ Mass downloading without rate limiting

yt-dlp is used as the core engine (the industry standard for open-source media extraction).


🧠 How it Works: Smart Deduplication

Most YouTube subtitle downloaders just give you the raw VTT, which is full of repetition because YouTube "builds" sentences word-by-word in auto-generated captions.

ytknow uses a Prefix-Matching Algorithm:

  1. It strips all millisecond-level timestamps and inline tags.
  2. It compares each new line with the previous one.
  3. If the new line starts with the previous text, it "evolves" the line instead of repeating it.
  4. If it's a duplicate or a subset, it's discarded.

Result: You get a clean, human-readable paragraph instead of a 10,000-line stuttering mess.
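The steps above can be sketched in a few lines of Python. This is a simplified illustration of the technique, not ytknow's actual implementation; the function name and the tag-stripping regex are my own:

```python
import re

def dedupe_prefix(lines):
    """Collapse YouTube's word-by-word caption build-up into final sentences."""
    cleaned = []
    for raw in lines:
        # Step 1: strip inline word-level timing tags like <00:00:00.640> and <c>...</c>
        line = re.sub(r"<[^>]+>", "", raw).strip()
        if not line:
            continue
        if cleaned and line.startswith(cleaned[-1]):
            # Step 3: the new line "evolves" the previous one -> replace it
            cleaned[-1] = line
        elif cleaned and cleaned[-1].startswith(line):
            # Step 4: duplicate or subset of what we already have -> discard
            continue
        else:
            cleaned.append(line)
    return cleaned

print(dedupe_prefix([
    "das<00:00:00.640><c> heutige</c><00:00:01.079><c> Video</c>",
    "das heutige Video bedarf eines Vorworts",
]))
# -> ['das heutige Video bedarf eines Vorworts']
```

The key design point is that the list only ever keeps the longest "evolution" of each sentence, so the output length is bounded by the number of distinct sentences, not the number of caption cues.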


🧹 Before & After

ytknow cleans up the messy duplication and timing tags common in YouTube auto-captions:

โŒ Before (Standard VTT)

00:00:00.480 --> 00:00:03.070
das<00:00:00.640><c> heutige</c><00:00:01.079><c> Video</c>
00:00:03.070 --> 00:00:03.080
das heutige Video bedarf eines Vorworts

✅ After (ytknow Output)

"Das heutige Video bedarf eines Vorworts..."
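As a minimal illustration of the first cleanup stage (not ytknow's actual code), the timestamp cue lines in the "Before" sample can be filtered out with a simple pattern check, leaving only the caption text for deduplication:

```python
import re

# A VTT cue timing line: "HH:MM:SS.mmm --> HH:MM:SS.mmm" (optionally followed by settings)
CUE = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")

vtt = """00:00:00.480 --> 00:00:03.070
das<00:00:00.640><c> heutige</c><00:00:01.079><c> Video</c>
00:00:03.070 --> 00:00:03.080
das heutige Video bedarf eines Vorworts"""

# Keep only the lines that are not timing cues
text_lines = [ln for ln in vtt.splitlines() if not CUE.match(ln)]
```

After this pass only the two text lines remain; the inline `<c>` word tags are then removed in the deduplication step described above.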


✨ Features

  • ✨ Interactive Menu: New TUI to easily select subtitle languages and download modes.
  • 🚀 Lightning Fast: Uses yt-dlp with --lazy-playlist to start processing immediately.
  • 🧹 Deep Cleaning: Removes all VTT timing codes, word-level tags, and alignment metadata.
  • 🧠 Smart Deduplication: Automatically resolves sentence-building repetition in YouTube's auto-captions.
  • 🤖 LLM-Optimized: Generates clean TXT and MD files with rich metadata headers and consolidated JSONL files.
  • 🎙️ Whisper Fallback: Automatically transcribes videos using OpenAI Whisper if no subtitles are found.
  • 🧠 AI Summarization: Generate high-quality summaries and key takeaways using the OpenAI API.
  • 💬 Comments Integration: Download video comments along with transcripts or as a standalone task.
  • 🔄 Smart Fallback: Automatically prefers en-orig if standard en is unavailable but requested.
  • 🛡️ Resilient: Gracefully handles unavailable or private videos in large playlists.

🤖 RAG & LLM Readiness

ytknow is specifically designed for Retrieval-Augmented Generation (RAG).

Metadata Enrichment

Output files include header metadata (Source URL, Upload Date), allowing LLMs to cite sources and prioritize recent information.

Master JSONL Export

Every session generates a knowledge_master.jsonl file. This format is the industry standard for:

  • Model Fine-tuning: directly usable as a training dataset.
  • Archiving: keeps full context per video.
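A master record might look like the line below. The field names here are illustrative assumptions, not a documented schema; inspect your generated knowledge_master.jsonl for the exact keys:

```json
{"title": "Example Video", "url": "https://www.youtube.com/watch?v=...", "upload_date": "2024-01-15", "transcript": "Das heutige Video bedarf eines Vorworts..."}
```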

🧩 Built-in Semantic Chunking

ytknow automatically generates a second file called knowledge_chunks.jsonl.

  • Ready-to-Embed: Splits text into ~1000 char chunks, respecting sentence boundaries.
  • Overlapping: Includes 100 char overlap to preserve context between chunks.
  • Metadata Preserved: Each chunk carries the video URL, title, and upload date.

Just upload knowledge_chunks.jsonl to your Vector DB (Pinecone, Chroma, Weaviate) and you're done!
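The chunking behavior can be approximated like this. It is a sketch under the parameters stated in the list above (~1000-char chunks, 100-char overlap, sentence-boundary splitting), not ytknow's actual code:

```python
import re

CHUNK_SIZE = 1000   # target chunk length in characters
OVERLAP = 100       # characters of trailing context carried into the next chunk

def chunk_text(text, size=CHUNK_SIZE, overlap=OVERLAP):
    """Split text into ~size-char chunks, respecting sentence boundaries."""
    # Split into sentences on ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > size:
            chunks.append(current)
            # carry the last `overlap` characters forward to preserve context
            current = current[-overlap:] + " " + sent
        else:
            current = (current + " " + sent).strip()
    if current:
        chunks.append(current)
    return chunks
```

Because a chunk is only closed at a sentence boundary, no sentence is ever cut in half, and the overlap means an embedding model still sees the tail of the previous chunk when encoding the next one.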


๐Ÿ› ๏ธ Built With

  • Python
  • yt-dlp
  • FFmpeg

๐Ÿ› ๏ธ Installation

# Clone the repository
git clone https://github.com/egohead/ytknow.git
cd ytknow

# Run the installer (macOS/Linux)
chmod +x install.sh
./install.sh

🚀 Usage

ytknow now features an interactive mode. Simply run it with a URL:

# Start interactive processing
ytknow [YOUTUBE_URL]

🎮 Interactive Options

When you run ytknow, it will guide you through:

  1. Language Selection: Choose from all available subtitles (Original, Manual, or Auto-Translated).
  2. Download Mode:
    • Knowledge Base: Subtitles + Metadata + AI Summary.
    • Comments Only: Just the user comments.
    • All: Everything combined.

๐Ÿ› ๏ธ CLI Overrides

# Skip the menu by providing a language code
ytknow [VIDEO_URL] -l en

# Summarize a video (requires OPENAI_API_KEY)
export OPENAI_API_KEY="sk-..."
ytknow [VIDEO_URL] --summarize

# Transcribe with a specific Whisper model
ytknow [VIDEO_URL] --model small

# Survey a channel for available languages
ytknow --survey [CHANNEL_URL]

💬 YouTube Comments Downloader

ytknow comes with a dedicated sub-tool for bulk comment extraction: yt-comments.

Usage

# Download comments for a video
yt-comments video "[VIDEO_URL]" --format json --output ./comments

# Download comments for a whole channel
yt-comments channel "[CHANNEL_URL]" --max-videos 20

# Download from list of URLs
yt-comments batch urls.txt --parallel 4

Configuration

Create ~/.config/yt-comments/config.yaml. Example provided in config_example.yaml.

📋 Requirements

  • Python 3.8+
  • yt-dlp: (Installed automatically via install.sh)
  • ffmpeg: Required for metadata extraction and audio transcription.

๐Ÿ“ Output Format

The app creates a structured knowledge base for each source.

downloads/
└── ChannelName/
    ├── ChannelName_master.jsonl  <-- Full context for all videos
    ├── ChannelName_chunks.jsonl  <-- 1000-char semantic chunks (RAG ready)
    ├── ChannelName_master.txt    <-- All transcripts in one file
    ├── ChannelName_master.md     <-- All transcripts in one markdown file
    └── videos/
        └── Video_Title_1/
            ├── Video_Title_1.txt         <-- Human readable with metadata headers
            ├── Video_Title_1.md          <-- Markdown version
            ├── Video_Title_1.json        <-- YouTube Comments (if enabled)
            └── Video_Title_1_summary.md  <-- AI Summary (if --summarize used)

โ“ FAQ

Q: Does it work with private videos?
A: No. ytknow can only access public or unlisted content for which you provide a URL.

Q: Is it safe against YouTube bans?
A: ytknow uses yt-dlp's optimized extraction methods. For very large channels, be patient: YouTube may temporarily throttle requests.

Q: Can I use this for my RAG project?
A: Yes! The JSONL output is designed specifically for tools like LangChain, LlamaIndex, or OpenAI fine-tuning.
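For example, loading knowledge_chunks.jsonl into (text, metadata) pairs ready for embedding takes only a short loop. The field names "text", "url", "title", and "upload_date" are assumptions for illustration; check the keys in your own output file:

```python
import json

def load_chunks(path):
    """Parse a chunks JSONL file into (text, metadata) pairs for embedding.

    NOTE: field names are assumed, not a documented schema.
    """
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            text = rec.get("text", "")
            meta = {k: rec.get(k) for k in ("url", "title", "upload_date")}
            pairs.append((text, meta))
    return pairs
```

Each pair can then be handed to whatever embedding client your vector database provides.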

๐Ÿค Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request.

โš–๏ธ MIT License

This project is licensed under the MIT License - see the LICENSE file for details.


This project respects content creators and YouTube's ToS.
