Kindle Ebook Highlighter

Automatically generate Word documents from your Kindle ebooks with all your highlights marked in yellow. Works with your Calibre library or individual ebook files.

✨ Features

📦 Batch mode by default - Processes ALL books from your Calibre library that match your highlights
📝 Preserves formatting - Bold, italic, underline, and paragraph structure maintained
🎯 Three matching methods - Choose between regex (fast), difflib (balanced), or vector (most accurate)
📚 Calibre integration - Automatically scans your library and matches books to highlights
🔍 Fuzzy title matching - Handles different editions and formatting variations
🌍 Multi-language support - Recognizes highlights in English, German, French, Spanish, and Italian
💾 HTML preservation - Optionally keep intermediate files for debugging
📊 Detailed logging - Verbose modes (-v, -vv) for troubleshooting

📋 Requirements

Required Dependencies

pip install python-docx beautifulsoup4 thefuzz python-Levenshtein lxml

Required Software

Calibre (for ebook conversion) - Download from https://calibre-ebook.com/

Optional Dependencies (for vector matching)

pip install sentence-transformers torch nltk

Optional Dependencies (for native conversion)

pip install mobi ebooklib

🚀 Quick Start

1. Export Your Kindle Highlights

Connect your Kindle to your computer and copy the My Clippings.txt file from:

Kindle/documents/My Clippings.txt

2. Run the Script

Default: Process ALL matched books from Calibre library

python kindle_highlighter.py --clippings "My Clippings.txt" -v

Process only the best matching book

python kindle_highlighter.py --clippings "My Clippings.txt" --no-batch -v

Process a specific ebook file

python kindle_highlighter.py --ebook "book.mobi" --clippings "My Clippings.txt" -v

📖 Usage Examples

List all books in your Calibre library

python kindle_highlighter.py --list-books

Batch process with debugging output

python kindle_highlighter.py --clippings "My Clippings.txt" --preserve-html -vv

Compare all three matching methods

python kindle_highlighter.py --ebook "book.mobi" --clippings "My Clippings.txt" --compare -v

Use vector matching for best accuracy

python kindle_highlighter.py --clippings "My Clippings.txt" -m vector -v

Specify custom Calibre library path

python kindle_highlighter.py --library-path "/path/to/library" --clippings "My Clippings.txt" -v

🎛️ Command Line Options

Required

--clippings PATH - Path to your "My Clippings.txt" file (the --list-books example above runs without it)

Mode Selection

--list-books - List all books in Calibre library and exit
--ebook PATH - Process a specific ebook file (overrides library mode)

Library Options

--library-path PATH - Custom Calibre library path (default: ~/Calibre Library/)
--no-batch - Process only the best match instead of all books with 95%+ confidence

Matching Options

-m, --method {regex,diff,vector} - Matching method (default: diff)
- regex - Fast, exact matches only
- diff - Balanced, handles minor variations (default)
- vector - Most accurate, uses AI semantic matching
--similarity-threshold FLOAT - Similarity threshold for the diff (difflib) method (default: 0.9)
--vector-threshold FLOAT - Threshold for vector method (default: 0.65)
--compare - Run all three methods and compare results

Output Options

-o, --output PATH - Custom output path (single file mode only)
--preserve-html - Keep intermediate HTML files for inspection

Advanced Options

--calibre-path PATH - Custom path to Calibre's ebook-convert binary
--try-native - Attempt native Python conversion if Calibre fails
-v, --verbose - Increase verbosity (-v for info, -vv for debug)
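
As a rough illustration of how the options listed above fit together, here is a minimal argparse sketch. It only mirrors the documented flags and defaults; the function name build_parser is hypothetical, and the actual script may define its parser differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options documented in this README."""
    p = argparse.ArgumentParser(prog="kindle_highlighter.py")
    p.add_argument("--clippings", metavar="PATH")
    p.add_argument("--list-books", action="store_true")
    p.add_argument("--ebook", metavar="PATH")
    p.add_argument("--library-path", metavar="PATH")
    p.add_argument("--no-batch", action="store_true")
    p.add_argument("-m", "--method",
                   choices=["regex", "diff", "vector"], default="diff")
    p.add_argument("--similarity-threshold", type=float, default=0.9)
    p.add_argument("--vector-threshold", type=float, default=0.65)
    p.add_argument("--compare", action="store_true")
    p.add_argument("-o", "--output", metavar="PATH")
    p.add_argument("--preserve-html", action="store_true")
    p.add_argument("--calibre-path", metavar="PATH")
    p.add_argument("--try-native", action="store_true")
    # "count" makes -v yield 1 and -vv yield 2.
    p.add_argument("-v", "--verbose", action="count", default=0)
    return p


args = build_parser().parse_args(
    ["--clippings", "My Clippings.txt", "-m", "vector", "-vv"]
)
print(args.method, args.verbose, args.no_batch)  # vector 2 False
```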

🔧 How It Works

1. Scans your Calibre library for all ebooks
2. Parses your My Clippings.txt file
3. Matches ebook titles to clipping titles using fuzzy matching
4. Converts ebooks to HTML format
5. Extracts text while preserving paragraph structure and formatting
6. Finds your highlights in the book text
7. Generates Word documents with highlights marked in yellow
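
The clippings-parsing step above could look roughly like the sketch below. It assumes the usual My Clippings.txt layout (a title line, a metadata line, a blank line, the highlighted text, then a line of ten equals signs). The function name parse_clippings is hypothetical, and this sketch only recognizes the English "Highlight" label, whereas the script itself handles several languages.

```python
from collections import defaultdict

SEPARATOR = "=========="  # Kindle writes this line after every entry


def parse_clippings(text: str) -> dict[str, list[str]]:
    """Group highlight texts by book title (English entries only)."""
    highlights = defaultdict(list)
    for entry in text.split(SEPARATOR):
        lines = [ln.strip() for ln in entry.strip().splitlines()]
        lines = [ln for ln in lines if ln]
        if len(lines) < 3:
            continue  # bookmarks and empty entries have no body text
        title, meta, body = lines[0], lines[1], " ".join(lines[2:])
        if "highlight" not in meta.lower():
            continue  # skip notes and other entry types
        # The file often starts with a byte-order mark; drop it from the title.
        highlights[title.lstrip("\ufeff")].append(body)
    return dict(highlights)


sample = """Example Book (Jane Doe)
- Your Highlight on page 3 | Location 40-41 | Added on Monday, 1 January 2024 10:00:00

A first highlighted sentence.
==========
Example Book (Jane Doe)
- Your Bookmark on page 5 | Location 70 | Added on Monday, 1 January 2024 10:05:00


==========
"""
print(parse_clippings(sample))
```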

📊 Matching Methods Explained

Regex (Fast)

Uses regular expressions for exact matching
Handles minor spacing and hyphen variations
Best for: Books with identical text to Kindle version
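
One way to tolerate spacing and hyphen variations while still matching exactly is to compile each highlight into a pattern, as in this sketch. The helper name highlight_to_pattern and the specific set of hyphen characters are assumptions for illustration, not necessarily what the script does.

```python
import re

# ASCII hyphen, Unicode hyphens/dashes U+2010..U+2015, and the soft hyphen.
HYPHENS = r"[-\u2010-\u2015\u00ad]"


def highlight_to_pattern(highlight: str) -> re.Pattern:
    """Build a regex matching the highlight with flexible whitespace/hyphens."""
    parts = []
    for word in highlight.split():
        # Split on any hyphen variant and rejoin with an optional hyphen.
        pieces = re.split(HYPHENS, word)
        parts.append((HYPHENS + "?").join(re.escape(p) for p in pieces))
    # Any run of whitespace (including line breaks) may separate words.
    return re.compile(r"\s+".join(parts))


text = "It is a well\u2010known   fact\nabout books."
match = highlight_to_pattern("a well-known fact about books").search(text)
print(match is not None)
```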

Difflib (Balanced) - DEFAULT

Uses Python's SequenceMatcher for fuzzy matching
Adjustable similarity threshold
Best for: Most books, handles minor formatting differences
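
To illustrate the idea, the sketch below slides a window of the highlight's word count across a paragraph and scores each window with SequenceMatcher. The function name best_fuzzy_match and the windowing strategy are assumptions; only the 0.9 default threshold comes from the options above.

```python
from difflib import SequenceMatcher


def best_fuzzy_match(highlight: str, paragraph: str, threshold: float = 0.9):
    """Return (window_text, ratio) for the best-scoring window, or None."""
    words = paragraph.split()
    n = len(highlight.split())
    best = None
    for i in range(max(1, len(words) - n + 1)):
        window = " ".join(words[i:i + n])
        ratio = SequenceMatcher(None, highlight.lower(), window.lower()).ratio()
        if best is None or ratio > best[1]:
            best = (window, ratio)
    # Accept the best window only if it clears the similarity threshold.
    return best if best and best[1] >= threshold else None


paragraph = "The quick brown fox jumps over the lazy dog near the river bank."
found = best_fuzzy_match("quick brown fox jumped over the lazy dog", paragraph)
print(found)
```

Lowering the threshold accepts looser matches at the cost of more false positives, which is the trade-off behind --similarity-threshold.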

Vector (Most Accurate)

Uses AI-powered semantic similarity
Requires additional dependencies (sentence-transformers, torch)
Best for: Books with significant reformatting or different editions
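
Semantic matching generally means embedding the highlight and each candidate sentence as vectors and ranking by cosine similarity. The sketch below shows that ranking step only. The embedding function is passed in as a parameter so the example runs without extra dependencies; in practice it would presumably be a sentence-transformers model, and toy_embed here is a deliberately crude stand-in. All names are hypothetical, and 0.65 is the documented default for --vector-threshold.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_semantic_match(
    highlight: str,
    sentences: Sequence[str],
    embed: Callable[[str], Vector],
    threshold: float = 0.65,
) -> Optional[tuple[str, float]]:
    """Return (sentence, score) for the most similar sentence, or None."""
    if not sentences:
        return None
    target = embed(highlight)
    score, sentence = max((cosine(target, embed(s)), s) for s in sentences)
    return (sentence, score) if score >= threshold else None


# Toy stand-in for a real embedding model: word counts over a tiny vocabulary.
VOCAB = ["cat", "dog", "sat", "mat", "ran", "park"]


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


sentences = ["the dog ran in the park", "the cat sat on the mat"]
hit = best_semantic_match("a cat sat on a mat", sentences, toy_embed)
print(hit)
```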

🎯 Matching Confidence

The script automatically matches books based on title similarity:

95-100% - Excellent match; processed by default in batch mode
90-94% - Good match; use --no-batch to process manually
85-89% - Uncertain match; verify manually with --ebook
<85% - Low confidence; not processed automatically
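
The tiers above can be read as a simple lookup on a 0-100 title-similarity score. The sketch below uses difflib from the standard library as a stand-in scorer so it runs without extra packages; the script itself lists thefuzz for fuzzy matching, so real scores will differ. Both function names are hypothetical.

```python
from difflib import SequenceMatcher


def title_score(library_title: str, clipping_title: str) -> float:
    """Stand-in similarity score on a 0-100 scale (not thefuzz's scoring)."""
    a = library_title.casefold().strip()
    b = clipping_title.casefold().strip()
    return 100 * SequenceMatcher(None, a, b).ratio()


def confidence_tier(score: float) -> str:
    """Map a 0-100 score to the confidence tiers described above."""
    if score >= 95:
        return "excellent"  # processed by default in batch mode
    if score >= 90:
        return "good"
    if score >= 85:
        return "uncertain"
    return "low"  # not processed automatically


score = title_score("Dune", "dune")
print(score, confidence_tier(score))
```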

🐛 Troubleshooting

No highlights found in output

Try a different matching method: -m vector
Lower the similarity threshold: --similarity-threshold 0.8
Use --preserve-html to inspect the extracted text
Verify the ebook is not DRM-protected

Book not matched to highlights

Check title format in Calibre vs. My Clippings.txt
Use -vv to see matching scores
Process manually with --ebook if automatic matching fails

Calibre not found

Install Calibre: https://calibre-ebook.com/
Or specify path: --calibre-path /path/to/ebook-convert
Or use native conversion: --try-native (requires mobi/ebooklib)
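
If you want to check by hand whether the converter is reachable, a lookup along these lines may help. This is a sketch of one plausible search order (explicit path first, then PATH); the function name find_ebook_convert is hypothetical and the script's own lookup may differ.

```python
import os
import shutil
from typing import Optional


def find_ebook_convert(custom_path: Optional[str] = None) -> Optional[str]:
    """Return a usable path to ebook-convert, or None if it cannot be found."""
    # An explicit path (as with --calibre-path) wins if the file exists.
    if custom_path and os.path.isfile(custom_path):
        return custom_path
    # Otherwise fall back to searching the directories on PATH.
    return shutil.which("ebook-convert")


path = find_ebook_convert()
print(path or "ebook-convert not found; install Calibre or pass --calibre-path")
```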

Paragraphs not preserved

Use --preserve-html -vv to inspect HTML structure
The script should automatically handle most HTML structures
If issues persist, file a bug report with sample HTML

📝 Output Format

Generated Word documents include:

✅ Book title as heading
✅ Proper paragraph breaks
✅ Bold, italic, and underlined text
✅ Highlights marked in yellow
✅ Clean, professional formatting

🔒 Privacy & DRM

This tool:

✅ Works with your own ebooks and highlights
✅ Processes everything locally on your computer
❌ Cannot process DRM-protected ebooks
❌ Does not remove DRM

💡 Tips

Batch processing is default - Run once to process all your highlighted books
Use -vv for debugging - See exactly what's happening at each step
Keep HTML files - Use --preserve-html to inspect extraction issues
Try vector matching - If default matching misses highlights, use -m vector
Organize your library - Good metadata in Calibre = better matching

🤝 Contributing

Issues and pull requests welcome! Please include:

Sample (anonymized) My Clippings.txt entry
Ebook format and source
Command used and full output with -vv

📄 License

This script is released under the MIT License and provided as-is. Use it responsibly and respect copyright laws.

🙏 Credits

Uses these excellent libraries:

python-docx - Word document generation
BeautifulSoup4 - HTML parsing
thefuzz - Fuzzy string matching
Calibre - Ebook conversion
sentence-transformers - AI semantic matching (optional)
