A Python script to extract all images from Microsoft Word documents (.docx) and optionally convert EMF files to PNG format.
- ✅ Extracts all images from .docx files
- ✅ Automatically identifies file types (PNG, JPG, EMF, etc.)
- ✅ Renames files with correct extensions
- ✅ Automatically renames images based on content (using OCR)
- ✅ Attempts to convert EMF files to PNG (when possible)
- ✅ No external dependencies required for basic extraction
- ✅ Command-line interface with options
This project uses uv for dependency management.
# On macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
You can run the script directly with uv run. It will automatically handle dependencies.
uv run extract_word_images.py o1.docx
If you want to work on the project or install dependencies in a virtual environment:
uv sync
If you prefer not to use uv, you can install dependencies manually:
pip install aspose-words pillow
Extract all images from a Word document, convert EMFs, and rename based on content:
uv run extract_word_images.py document.docx
This will create an extracted_images folder containing all images.
To write the images to a different folder, pass -o:
uv run extract_word_images.py document.docx -o my_images
If you want to skip the OCR process (faster):
uv run extract_word_images.py document.docx --no-ocr
If you just want to extract the files without attempting conversion:
uv run extract_word_images.py document.docx --no-convert
To list all available options:
uv run extract_word_images.py --help
- Extraction: Word documents (.docx) are actually ZIP archives. The script:
  - Unzips the document
  - Finds all files in the word/media/ folder
  - Extracts them to the output directory
- File Type Detection: The script reads the file signature (magic bytes) to identify the actual file type, regardless of the extension.
- EMF Conversion: The script attempts to convert EMF files to PNG using:
  - Aspose.Words (primary, pure Python)
  - LibreOffice (if installed on macOS)
  - ImageMagick (fallback)
  - Pillow (fallback)
- OCR Renaming: The script uses EasyOCR to:
  - Read text from each image
  - Generate a descriptive filename (e.g., Biceps_brachii_Site_4.png)
  - Rename the file automatically
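The extraction step can be illustrated with the standard library alone. The following is a minimal sketch, not the script's actual code: the function name extract_media and its parameters are hypothetical, and it assumes only what is stated above, namely that a .docx is a ZIP archive whose images live under word/media/.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_media(docx_path, out_dir):
    """Copy every file under word/media/ in a .docx into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything outside word/media/.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            # Keep only the base name so an entry cannot escape out_dir.
            target = out / PurePosixPath(name).name
            target.write_bytes(archive.read(name))
            written.append(target)
    return written
```

Calling extract_media("document.docx", "extracted_images") would return the list of paths it wrote. Because only the standard library is involved, this matches the claim that basic extraction needs no external dependencies.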
$ uv run extract_word_images.py o1.docx
Processing: o1.docx
Output directory: extracted_images
------------------------------------------------------------
Found 9 image(s) in the document.
Extracted: image8.tmp -> image8.emf
Extracted: image1.png
...
Extracted 9 file(s)
Converting EMF files to PNG...
✓ Converted: image2.emf -> image2.png (using Aspose.Words)
...
Analyzing images and renaming based on content...
(This requires OCR and may take some time)
✓ Renamed: image5.png -> Biceps_brachii_Site_4.png
(Text: 'Biceps brachii Site 4...')
✓ Renamed: image6.png -> Motor_NCS_L_Median.png
(Text: 'Motor NCS L Median...')
Renamed 9 file(s) based on content
============================================================
✓ Processing complete! Images saved to: extracted_images
============================================================
OCR is computationally intensive. It may take a few seconds per image. If you have many images and don't need descriptive names, use --no-ocr.
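As a rough illustration of how recognized text could become a file name such as Biceps_brachii_Site_4.png, the sketch below joins the words found in the text with underscores. The function name text_to_filename and the max_words limit are hypothetical. The script's real naming rules may differ.

```python
import re


def text_to_filename(text, ext="png", max_words=6):
    """Turn recognized text into a filesystem-friendly file name."""
    # Keep runs of letters, digits and underscores; punctuation is dropped.
    words = re.findall(r"\w+", text)[:max_words]
    if not words:
        return None  # nothing readable, so the caller keeps the original name
    return "_".join(words) + "." + ext
```

Returning None for unreadable text lets the caller leave such files under their original names instead of inventing one.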
EMF is a Windows-specific format. The script tries multiple methods. If all fail:
- Ensure aspose-words is installed (uv sync)
- Install LibreOffice: brew install --cask libreoffice
- Use an online converter: https://convertio.co/emf-png/
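If LibreOffice is installed, you can also try a conversion by hand from Python. This is a sketch under assumptions, not the script's own fallback code: the helper names are hypothetical, and the macOS path is the default location used by the LibreOffice app bundle. The soffice flags --headless, --convert-to and --outdir are LibreOffice's standard command-line options.

```python
import shutil
import subprocess
from pathlib import Path

# Default binary location inside the macOS app bundle (assumed install path).
MAC_SOFFICE = "/Applications/LibreOffice.app/Contents/MacOS/soffice"


def find_soffice():
    """Return a path to the soffice binary, or None if it is not found."""
    found = shutil.which("soffice")
    if found:
        return found
    return MAC_SOFFICE if Path(MAC_SOFFICE).exists() else None


def build_convert_command(soffice, emf_path, out_dir):
    """Build the headless LibreOffice command that converts one file to PNG."""
    return [soffice, "--headless", "--convert-to", "png",
            "--outdir", str(out_dir), str(emf_path)]


def convert_emf(emf_path, out_dir):
    """Try the conversion; return the PNG path on success, else None."""
    soffice = find_soffice()
    if soffice is None:
        return None
    try:
        result = subprocess.run(
            build_convert_command(soffice, emf_path, out_dir),
            capture_output=True, timeout=120)
    except (OSError, subprocess.TimeoutExpired):
        return None
    png = Path(out_dir) / (Path(emf_path).stem + ".png")
    return png if result.returncode == 0 and png.exists() else None
```

A None result from convert_emf means this method did not work on your machine, and the online converter remains an option.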
The script can extract and identify:
- PNG, JPEG/JPG, GIF, BMP, TIFF
- EMF (Enhanced Metafile)
- ICO
- And other formats embedded in Word documents
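Signature-based detection for the formats listed above can be sketched as follows. The function name detect_image_type is hypothetical and the checks are a simplified illustration using the well-known leading bytes of each format. The script's own detection logic may cover more cases.

```python
def detect_image_type(data):
    """Return an extension guessed from the leading bytes, or None."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    if data.startswith(b"BM"):
        return "bmp"
    # TIFF has a little-endian ("II") and a big-endian ("MM") variant.
    if data.startswith((b"II*\x00", b"MM\x00*")):
        return "tiff"
    # EMF: header record type 1, plus the " EMF" signature at byte offset 40.
    if data.startswith(b"\x01\x00\x00\x00") and data[40:44] == b" EMF":
        return "emf"
    if data.startswith(b"\x00\x00\x01\x00"):
        return "ico"
    return None
```

Reading the first 44 bytes of a file is enough for every check here, for example detect_image_type(open(path, "rb").read(44)).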
Free to use and modify.
Created to simplify image extraction from Word documents.