πŸ–ΌοΈ OCR Text Extractor

This Python script automates the extraction of text from images using Tesseract OCR. It processes all images in the test_images/ folder and saves the extracted text as .txt files in the extracted_texts/ directory, maintaining the original image filenames.


πŸ“ Project Structure


OCR-Text-Extractor/
├── OCR.py
├── test_images/
│   ├── image1.jpg
│   └── image2.png
├── extracted_texts/
│   ├── image1.txt
│   └── image2.txt
└── README.md

βš™οΈ Features

  • Batch processes .jpg, .jpeg, and .png images.
  • Supports multiple languages (default: English and Hindi).
  • Automatically creates the extracted_texts/ folder if it doesn't exist.
  • Provides informative logging for each processed file.
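
The features above can be sketched as a single batch loop. This is a minimal illustration, not the actual contents of OCR.py: the names `process_folder`, `tesseract_ocr`, and `IMAGE_SUFFIXES` are assumptions made for this example, and the OCR call is passed in as a parameter so the loop can be tried without Tesseract installed.

```python
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

# Extensions listed in the Features section (compared case-insensitively).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def tesseract_ocr(image_path, langs=("eng", "hin")):
    """Hypothetical default OCR step using Pillow and pytesseract."""
    # Imported lazily so the batch loop itself has no OCR dependencies.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        # Tesseract expects multiple languages joined with "+".
        return pytesseract.image_to_string(img, lang="+".join(langs))


def process_folder(input_dir="test_images", output_dir="extracted_texts",
                   ocr=tesseract_ocr):
    """Run `ocr` on every supported image and write one .txt file per image."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the folder if missing

    written = []
    for image_path in sorted(Path(input_dir).iterdir()):
        if not image_path.is_file():
            continue
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not .jpg, .jpeg, or .png
        target = out / (image_path.stem + ".txt")
        target.write_text(ocr(image_path), encoding="utf-8")
        logging.info("Saved %s -> %s", image_path.name, target)
        written.append(target)
    return written
```

Calling `process_folder()` with no arguments would read from test_images/ and write to extracted_texts/, matching the layout described in Project Structure.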

🚀 Getting Started

1. Clone the Repository

git clone https://github.com/Mrigank005/OCR
cd OCR

2. Install Dependencies

Ensure you have Python 3 installed. Then, install the required Python libraries:

pip install pillow pytesseract

3. Install Tesseract OCR Engine

  • Windows: Download and install from Tesseract OCR Windows Installer.

  • macOS: Use Homebrew:

    brew install tesseract

  • Linux (Debian/Ubuntu):

    sudo apt-get install tesseract-ocr

Ensure Tesseract is added to your system's PATH.
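
One way to confirm that the tesseract executable is reachable through PATH is a small standard-library check like the one below. The helper names `find_tesseract` and `tesseract_version` are made up for this sketch and are not part of OCR.py.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some builds print the version banner to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH")
else:
    print("Found", path)
```

If the check reports that nothing was found, either add the install directory to PATH or set the executable location explicitly as described under Customization.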

4. Add Images

Place the images you want to process into the test_images/ directory.

5. Run the Script

python OCR.py

The extracted text files will be saved in the extracted_texts/ directory.


πŸ“ Customization

  • Language Support: The script defaults to English and Hindi. To modify the languages, edit the langs parameter in the extract_text_and_save function within OCR.py:

    def extract_text_and_save(image_path, langs=["eng", "hin"]):

Refer to Tesseract OCR Language Data for available language codes.
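
As a rough illustration of how a list such as `["eng", "hin"]` reaches Tesseract: pytesseract takes a single `lang` string in which codes are joined with `+`. The helpers below, `build_lang_arg` and `missing_languages`, are hypothetical and only show that conversion plus a check against the installed language data. Note that each language also needs its trained data installed (on Debian/Ubuntu, Hindi is packaged as `tesseract-ocr-hin`).

```python
def build_lang_arg(langs):
    """Join language codes into the "eng+hin" form Tesseract expects."""
    codes = [code.strip() for code in langs if code and code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def missing_languages(requested, installed):
    """Return the requested codes that are absent from the installed list."""
    available = set(installed)
    return [code for code in requested if code not in available]


# Usage sketch (requires pytesseract and Tesseract to be installed):
#   import pytesseract
#   installed = pytesseract.get_languages(config="")
#   print(missing_languages(["eng", "hin"], installed))
#   text = pytesseract.image_to_string(img, lang=build_lang_arg(["eng", "hin"]))
```

`pytesseract.get_languages` is only available in reasonably recent pytesseract releases, so treat the commented usage as an assumption about your installed version.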

  • Tesseract Path: If Tesseract isn't in your system's PATH, specify its location in OCR.py:

    import pytesseract
    pytesseract.pytesseract.tesseract_cmd = r'/path/to/tesseract'
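
To avoid hard-coding a machine-specific path, the location could be resolved at startup instead. The sketch below assumes an environment variable named `TESSERACT_CMD`, which is an invented name for this example and not something OCR.py or pytesseract reads on its own.

```python
import os
import shutil


def resolve_tesseract_cmd(env=None):
    """Pick a tesseract executable: explicit override first, then PATH."""
    env = os.environ if env is None else env
    explicit = env.get("TESSERACT_CMD")  # hypothetical variable name
    if explicit:
        return explicit
    # Fall back to whatever `tesseract` resolves to on PATH (may be None).
    return shutil.which("tesseract")


# Usage sketch (requires pytesseract to be installed):
#   import pytesseract
#   cmd = resolve_tesseract_cmd()
#   if cmd is None:
#       raise SystemExit("Tesseract not found; set TESSERACT_CMD or update PATH")
#   pytesseract.pytesseract.tesseract_cmd = cmd
```

On Windows the override would typically point at the installed `tesseract.exe`; the exact location depends on where the installer placed it.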

🧪 Sample Output

For an image named page1.jpg in test_images/, the script will generate page1.txt in extracted_texts/ containing the recognized text.
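
The filename mapping described above can be expressed with `pathlib`. The function name `output_path_for` is an assumption for this sketch; the script may build the path differently.

```python
from pathlib import Path


def output_path_for(image_path, output_dir="extracted_texts"):
    """Map an image path to a .txt path with the same base name."""
    # Path.stem drops only the final extension: "page1.jpg" -> "page1".
    return Path(output_dir) / (Path(image_path).stem + ".txt")


print(output_path_for("test_images/page1.jpg"))
```

Under this mapping, two images that differ only by extension (for example page1.jpg and page1.png) would resolve to the same page1.txt, so distinct base names are the safer choice.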


🙌 Acknowledgements

