Automatically generate Word documents from your Kindle ebooks with all your highlights marked in yellow. Works with your Calibre library or individual ebook files.
- Batch mode by default - Processes ALL books from your Calibre library that match your highlights
- Preserves formatting - Bold, italic, underline, and paragraph structure maintained
- Three matching methods - Choose between regex (fast), difflib (balanced), or vector (most accurate)
- Calibre integration - Automatically scans your library and matches books to highlights
- Fuzzy title matching - Handles different editions and formatting variations
- Multi-language support - Recognizes highlights in English, German, French, Spanish, and Italian
- HTML preservation - Optionally keep intermediate files for debugging
- Detailed logging - Verbose modes (-v, -vv) for troubleshooting
pip install python-docx beautifulsoup4 thefuzz python-Levenshtein lxml
- Calibre (for ebook conversion) - Download from https://calibre-ebook.com/
Optional, for the vector matching method:
pip install sentence-transformers torch nltk
Optional, for native conversion (--try-native):
pip install mobi ebooklib
Connect your Kindle to your computer and copy the My Clippings.txt file from:
Kindle/documents/My Clippings.txt
Default: Process ALL matched books from Calibre library
python kindle_highlighter.py --clippings "My Clippings.txt" -v
Process only the best matching book
python kindle_highlighter.py --clippings "My Clippings.txt" --no-batch -v
Process a specific ebook file
python kindle_highlighter.py --ebook "book.mobi" --clippings "My Clippings.txt" -v
List all books in the Calibre library
python kindle_highlighter.py --list-books
Keep intermediate HTML files and show debug output
python kindle_highlighter.py --clippings "My Clippings.txt" --preserve-html -vv
Run all three matching methods on one ebook and compare results
python kindle_highlighter.py --ebook "book.mobi" --clippings "My Clippings.txt" --compare -v
Use vector matching
python kindle_highlighter.py --clippings "My Clippings.txt" -m vector -v
Use a custom Calibre library path
python kindle_highlighter.py --library-path "/path/to/library" --clippings "My Clippings.txt" -v
- --clippings PATH - Path to your "My Clippings.txt" file
- --list-books - List all books in Calibre library and exit
- --ebook PATH - Process a specific ebook file (overrides library mode)
- --library-path PATH - Custom Calibre library path (default: ~/Calibre Library/)
- --no-batch - Process only the best match instead of all books with 95%+ confidence
- -m, --method {regex,diff,vector} - Matching method (default: diff)
  - regex - Fast, exact matches only
  - diff - Balanced, handles minor variations (default)
  - vector - Most accurate, uses AI semantic matching
- --similarity-threshold FLOAT - Threshold for difflib method (default: 0.9)
- --vector-threshold FLOAT - Threshold for vector method (default: 0.65)
- --compare - Run all three methods and compare results
- -o, --output PATH - Custom output path (single file mode only)
- --preserve-html - Keep intermediate HTML files for inspection
- --calibre-path PATH - Custom path to Calibre's ebook-convert binary
- --try-native - Attempt native Python conversion if Calibre fails
- -v, --verbose - Increase verbosity (-v for info, -vv for debug)
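For readers who want to see how the documented options fit together, here is a minimal argparse sketch that mirrors them. It is an illustration reconstructed from the option list, not the script's actual source; the `build_parser` name and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options (not the real source)."""
    p = argparse.ArgumentParser(prog="kindle_highlighter.py")
    p.add_argument("--clippings", metavar="PATH", help='path to "My Clippings.txt"')
    p.add_argument("--list-books", action="store_true", help="list library books and exit")
    p.add_argument("--ebook", metavar="PATH", help="process one specific ebook file")
    p.add_argument("--library-path", metavar="PATH", default="~/Calibre Library/")
    p.add_argument("--no-batch", action="store_true", help="process only the best match")
    p.add_argument("-m", "--method", choices=["regex", "diff", "vector"], default="diff")
    p.add_argument("--similarity-threshold", type=float, default=0.9)
    p.add_argument("--vector-threshold", type=float, default=0.65)
    p.add_argument("--compare", action="store_true", help="run all three methods")
    p.add_argument("-o", "--output", metavar="PATH", help="output path (single file mode)")
    p.add_argument("--preserve-html", action="store_true")
    p.add_argument("--calibre-path", metavar="PATH", help="path to ebook-convert")
    p.add_argument("--try-native", action="store_true")
    # action="count" turns -v into 1 and -vv into 2.
    p.add_argument("-v", "--verbose", action="count", default=0)
    return p


args = build_parser().parse_args(["--clippings", "My Clippings.txt", "-m", "vector", "-vv"])
print(args.method, args.verbose, args.no_batch)
```

Counting `-v` occurrences is one common way to get the two verbosity levels described above; the real script may implement them differently.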
- Scans your Calibre library for all ebooks
- Parses your My Clippings.txt file
- Matches ebook titles to clipping titles using fuzzy matching
- Converts ebooks to HTML format
- Extracts text while preserving paragraph structure and formatting
- Finds your highlights in the book text
- Generates Word documents with highlights marked in yellow
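The clippings-parsing step above can be sketched roughly as follows. This assumes the usual My Clippings.txt layout (a title line, a metadata line, the clipped text, then a line of ten equals signs) and only detects English metadata; the `Clipping` class and `parse_clippings` function are hypothetical names, not taken from the script.

```python
from dataclasses import dataclass

SEPARATOR = "=========="  # Kindle writes this line after every entry


@dataclass
class Clipping:
    title: str  # first line of the entry, usually "Title (Author)"
    kind: str   # "highlight", "note", "bookmark", or "unknown"
    text: str   # clipped text; empty for bookmarks


def parse_clippings(raw: str) -> list[Clipping]:
    """Split the raw file into entries and pull out title, kind, and text."""
    clippings = []
    for block in raw.split(SEPARATOR):
        # Drop the byte-order mark, surrounding whitespace, and empty lines.
        lines = [ln.strip("\ufeff \t\r") for ln in block.strip().splitlines()]
        lines = [ln for ln in lines if ln]
        if len(lines) < 2:
            continue  # trailing fragment after the last separator
        meta = lines[1].lower()
        # English-only keyword check; other languages would need more keywords.
        kind = next((k for k in ("highlight", "note", "bookmark") if k in meta), "unknown")
        clippings.append(Clipping(title=lines[0], kind=kind, text="\n".join(lines[2:])))
    return clippings


SAMPLE = """Moby Dick (Herman Melville)
- Your Highlight on page 3 | Location 45-46 | Added on Monday, January 1, 2024 10:00:00 AM

Call me Ishmael.
==========
Moby Dick (Herman Melville)
- Your Bookmark on page 10 | Location 150 | Added on Monday, January 1, 2024 10:05:00 AM


==========
"""

clips = parse_clippings(SAMPLE)
highlights = [c for c in clips if c.kind == "highlight"]
print(highlights)
```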
Regex method (-m regex):
- Uses regular expressions for exact matching
- Handles minor spacing and hyphen variations
- Best for: Books whose text is identical to the Kindle version
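One way exact matching can still tolerate spacing and hyphen differences is to build the pattern word by word. The sketch below is an assumption about how such a pattern might look, not the script's actual regex; `highlight_to_pattern` and the chosen set of dash characters are illustrative.

```python
import re

# ASCII hyphen plus a few Unicode dashes that ebooks commonly substitute for it.
DASHES = "[-\u2010\u2011\u2013]"


def highlight_to_pattern(highlight: str) -> re.Pattern:
    """Compile a regex matching the highlight with flexible whitespace and dashes."""
    parts = []
    for word in highlight.split():
        # Escape the literal pieces and let any dash variant join them.
        parts.append(DASHES.join(re.escape(piece) for piece in word.split("-")))
    # Any run of whitespace (including line breaks) may separate words.
    return re.compile(r"\s+".join(parts))


book_text = "It was a well\u2011known fact that\n  whales are mammals."
match = highlight_to_pattern("a well-known fact that whales").search(book_text)
no_match = highlight_to_pattern("a little-known fact").search(book_text)
print(match.span() if match else None)
```

Anything beyond whitespace and dash differences (a changed word, a typo fix between editions) makes this pattern miss, which is where the fuzzier methods come in.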
Difflib method (-m diff, default):
- Uses Python's SequenceMatcher for fuzzy matching
- Adjustable similarity threshold (--similarity-threshold)
- Best for: Most books, handles minor formatting differences
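As a rough illustration of threshold-based matching with SequenceMatcher, the sketch below slides a word window over the text and keeps the best-scoring span. The windowing strategy and the `best_window` and `find_highlight` names are assumptions made for this example; the script may search differently.

```python
from difflib import SequenceMatcher


def best_window(highlight: str, text: str) -> tuple[float, str]:
    """Return (ratio, span) for the word window most similar to the highlight."""
    words = text.split()
    size = len(highlight.split())
    best_ratio, best_span = 0.0, ""
    for start in range(max(1, len(words) - size + 1)):
        candidate = " ".join(words[start:start + size])
        ratio = SequenceMatcher(None, highlight, candidate).ratio()
        if ratio > best_ratio:
            best_ratio, best_span = ratio, candidate
    return best_ratio, best_span


def find_highlight(highlight: str, text: str, threshold: float = 0.9):
    """Return the matching span, or None if nothing reaches the threshold."""
    ratio, span = best_window(highlight, text)
    return span if ratio >= threshold else None


text = "Call me Ishmael. Some years ago, never mind how long precisely, I went to sea."
found = find_highlight("Some years ago, never mind how long precisley", text)
missing = find_highlight("The quick brown fox jumps over the lazy dog", text)
print(found, missing)
```

Lowering `threshold` (the counterpart of --similarity-threshold) accepts looser matches at the cost of more false positives.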
Vector method (-m vector):
- Uses AI-powered semantic similarity
- Requires additional dependencies (sentence-transformers, torch)
- Best for: Books with significant reformatting or different editions
The script automatically matches books based on title similarity:
- 95-100% - Excellent match, processed by default in batch mode
- 90-94% - Good match, use --no-batch to process manually
- 85-89% - Uncertain match, verify manually with --ebook
- <85% - Low confidence, not processed automatically
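The confidence bands above amount to a simple threshold lookup. In this sketch difflib stands in for thefuzz so the example runs with the standard library alone, and its scores will not be identical to thefuzz's; `title_score` and `confidence_tier` are hypothetical helper names.

```python
from difflib import SequenceMatcher


def title_score(clipping_title: str, library_title: str) -> int:
    """Similarity on a 0-100 scale (difflib used here as a stand-in for thefuzz)."""
    a = clipping_title.casefold().strip()
    b = library_title.casefold().strip()
    return round(100 * SequenceMatcher(None, a, b).ratio())


def confidence_tier(score: int) -> str:
    """Map a 0-100 title score to the bands listed above."""
    if score >= 95:
        return "excellent"  # processed by default in batch mode
    if score >= 90:
        return "good"       # process with --no-batch
    if score >= 85:
        return "uncertain"  # verify manually with --ebook
    return "low"            # not processed automatically


score = title_score("Moby Dick", "moby dick ")
print(score, confidence_tier(score))
```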
If highlights are not found:
- Try a different matching method: -m vector
- Lower the similarity threshold: --similarity-threshold 0.8
- Use --preserve-html to inspect the extracted text
- Verify the ebook is not DRM-protected
If a book is not matched:
- Check title format in Calibre vs. My Clippings.txt
- Use -vv to see matching scores
- Process manually with --ebook if automatic matching fails
If Calibre's ebook-convert is not found:
- Install Calibre: https://calibre-ebook.com/
- Or specify path: --calibre-path /path/to/ebook-convert
- Or use native conversion: --try-native (requires mobi/ebooklib)
If the extracted text or formatting looks wrong:
- Use --preserve-html -vv to inspect HTML structure
- The script should automatically handle most HTML structures
- If issues persist, file a bug report with sample HTML
Generated Word documents include:
- Book title as heading
- Proper paragraph breaks
- Bold, italic, and underlined text
- Highlights marked in yellow
- Clean, professional formatting
This tool:
- Works with your own ebooks and highlights
- Processes everything locally on your computer
- Cannot process DRM-protected ebooks
- Does not remove DRM
- Batch processing is default - Run once to process all your highlighted books
- Use -vv for debugging - See exactly what's happening at each step
- Keep HTML files - Use --preserve-html to inspect extraction issues
- Try vector matching - If default matching misses highlights, use -m vector
- Organize your library - Good metadata in Calibre = better matching
Issues and pull requests welcome! Please include:
- Sample (anonymized) My Clippings.txt entry
- Ebook format and source
- Command used and full output with -vv
This script is provided as-is for personal use under the MIT License. Use responsibly and respect copyright laws.
Uses these excellent libraries:
- python-docx - Word document generation
- BeautifulSoup4 - HTML parsing
- thefuzz - Fuzzy string matching
- Calibre - Ebook conversion
- sentence-transformers - AI semantic matching (optional)