Word Document Image Extractor

A Python script to extract all images from Microsoft Word documents (.docx) and optionally convert EMF files to PNG format.

Features

✅ Extracts all images from .docx files
✅ Automatically identifies file types (PNG, JPG, EMF, etc.)
✅ Renames files with correct extensions
✅ Automatically renames images based on content (using OCR)
✅ Attempts to convert EMF files to PNG (when possible)
✅ No external dependencies required for basic extraction
✅ Command-line interface with options

Installation & Usage with uv (Recommended)

This project uses uv for dependency management.

1. Install uv

# On macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

2. Run the script

You can run the script directly with uv run. It will automatically handle dependencies.

uv run extract_word_images.py o1.docx

3. Project Setup (Optional)

If you want to work on the project or install dependencies in a virtual environment:

uv sync

Manual Installation (Legacy)

If you prefer not to use uv, you can install dependencies manually:

pip install aspose-words pillow

Usage

Basic Usage (with uv)

Extract all images from a Word document, convert EMFs, and rename based on content:

uv run extract_word_images.py document.docx

This will create an extracted_images folder containing all images.

Specify Output Directory

uv run extract_word_images.py document.docx -o my_images

Disable OCR Renaming

If you want to skip the OCR process (faster):

uv run extract_word_images.py document.docx --no-ocr

Extract Without Converting EMF Files

If you just want to extract the files without attempting conversion:

uv run extract_word_images.py document.docx --no-convert

Help

uv run extract_word_images.py --help

How It Works

Extraction: Word documents (.docx) are actually ZIP archives. The script:
- Unzips the document
- Finds all files in the word/media/ folder
- Extracts them to the output directory
File Type Detection: The script reads the file signature (magic bytes) to identify the actual file type, regardless of the extension.
EMF Conversion: The script attempts to convert EMF files to PNG using:
- Aspose.Words (primary, pure Python)
- LibreOffice (if installed on macOS)
- ImageMagick (fallback)
- Pillow (fallback)
OCR Renaming: The script uses EasyOCR to:
- Read text from each image
- Generate a descriptive filename (e.g., Biceps_brachii_Site_4.png)
- Rename the file automatically

Example

$ uv run extract_word_images.py o1.docx
Processing: o1.docx
Output directory: extracted_images
------------------------------------------------------------
Found 9 image(s) in the document.
  Extracted: image8.tmp -> image8.emf
  Extracted: image1.png
  ...

Extracted 9 file(s)

Converting EMF files to PNG...
  ✓ Converted: image2.emf -> image2.png (using Aspose.Words)
  ...

Analyzing images and renaming based on content...
(This requires OCR and may take some time)
  ✓ Renamed: image5.png -> Biceps_brachii_Site_4.png
    (Text: 'Biceps brachii Site 4...')
  ✓ Renamed: image6.png -> Motor_NCS_L_Median.png
    (Text: 'Motor NCS L Median...')

Renamed 9 file(s) based on content

============================================================
✓ Processing complete! Images saved to: extracted_images
============================================================

Troubleshooting

OCR Slowness

OCR is computationally intensive. It may take a few seconds per image. If you have many images and don't need descriptive names, use --no-ocr.

EMF Files Won't Convert

EMF is a Windows-specific format. The script tries multiple methods. If all fail:

Ensure aspose-words is installed (uv sync)
Install LibreOffice: brew install --cask libreoffice
Use an online converter: https://convertio.co/emf-png/

Supported Image Formats

The script can extract and identify:

PNG, JPEG/JPG, GIF, BMP, TIFF
EMF (Enhanced Metafile)
ICO
And other formats embedded in Word documents

License

Free to use and modify.

Author

Created to simplify image extraction from Word documents.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
image_extractor		image_extractor
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
example_usage.py		example_usage.py
extract_word_images.py		extract_word_images.py
o1.docx		o1.docx
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Word Document Image Extractor

Features

Installation & Usage with uv (Recommended)

1. Install uv

2. Run the script

3. Project Setup (Optional)

Manual Installation (Legacy)

Usage

Basic Usage (with uv)

Specify Output Directory

Disable OCR Renaming

Extract Without Converting EMF Files

Help

How It Works

Example

Troubleshooting

OCR Slowness

EMF Files Won't Convert

Supported Image Formats

License

Author

About

Uh oh!

Releases

Packages

Languages

vikbht/image

Folders and files

Latest commit

History

Repository files navigation

Word Document Image Extractor

Features

Installation & Usage with uv (Recommended)

1. Install uv

2. Run the script

3. Project Setup (Optional)

Manual Installation (Legacy)

Usage

Basic Usage (with uv)

Specify Output Directory

Disable OCR Renaming

Extract Without Converting EMF Files

Help

How It Works

Example

Troubleshooting

OCR Slowness

EMF Files Won't Convert

Supported Image Formats

License

Author

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages