Author: Vilal Ali
AudioMitra is a platform for document and content transformation. It combines AI-based conversion (OCR and TTS) with human-in-the-loop validation so that converted text and audio are accurate and trustworthy. It aims to streamline language processing tasks, enhancing accessibility and efficiency.
Many individuals and organizations require efficient and accurate conversion of documents and content into various formats. Traditional methods are often time-consuming, prone to errors, and lack the flexibility for diverse output needs. Specifically, there's a need for:
- Reliable Text Extraction: Converting digital images or scanned documents into editable text files using Optical Character Recognition (OCR).
- Accessible Audio Content: Transforming text into audio formats through Text-to-Speech (TTS) conversion, enabling the creation of audiobooks or accessible versions of content.
- Quality Assurance: Ensuring the authenticity and reliability of these transformations, often requiring a blend of automated and human validation.
AudioMitra addresses these challenges by offering a flexible solution that adapts human and machine involvement based on the required accuracy and nature of the transformations.
The primary users include:
- Educational Institutions (Schools, Colleges, Universities): For digitizing textbooks, lecture notes, and other educational materials into various accessible formats.
- Libraries: For converting physical books into digital text and audio formats.
- Content Creators: For generating audio versions of their written content.
- Individuals with Accessibility Needs: Requiring text and audio formats of documents.
AudioMitra provides a comprehensive solution for text extraction, content validation, and text-to-speech/audio conversion. The system uses an open-source OCR engine (Tesseract) and a pluggable TTS service (Google Cloud TTS or an open-source alternative), which keeps it flexible and scalable and leaves room to integrate other OCR and TTS APIs.
Key aspects of the solution include:
- Text Extraction: Utilizes an Open-Source OCR API (with options for others) to accurately convert text from images.
- Content Validation: OCR'd content undergoes a validation process, allowing for human intervention to edit and refine text, ensuring high accuracy.
- Output Formats: Generates both validated text content and high-quality audio files from the refined text.
- Web Accessibility & Multi-user System: Accessible via web browsers with user authentication, supporting various roles such as validators, authors, and voice-over contributors, promoting collaborative efficiency and accessibility.
The technology stack includes:
- Backend: Flask (Python), Node.js, MySQL, MongoDB (optional), Tesseract OCR, Google Cloud TTS (or an open-source TTS engine)
- Frontend: React Framework, Bootstrap, Tailwind CSS
- Orchestration/Architecture: Microservices, API Gateway, Event-Driven Architecture, Security Best Practices.
- Design Patterns: Factory Pattern, Observer Pattern, Singleton Pattern.
AudioMitra is primarily categorized under the Education domain. Its core focus is on transforming educational documents and content from image to text and then to speech, directly enhancing learning experiences and the accessibility of educational materials. This also extends to Digital Libraries and Content Accessibility initiatives.
Key features include:
- Image to Text Transformation: Utilizes an Open-Source API for OCR, converting images into editable text.
- Text Editing and Review: Provides robust tools for text editing and human review, ensuring the accuracy and quality of the extracted text.
- Text to Audio Transformation: Converts validated text into audio format using an Open-Source/Cloud TTS API, facilitating the creation of audiobooks and accessible content.
- Browser-Based Access: Offers a user-friendly and accessible web interface for document conversion.
- Role-Based Access Control: Supports various user roles, including validators, authors, and voice-over artists, enabling collaborative workflows.
The architecture is built on:
- Microservices Architecture: For modularity, scalability, and independent deployment of services.
- API Gateway: Centralized entry point for managing and securing API requests.
- Event-Driven Architecture: For asynchronous communication and responsiveness between services.
- Security: Implementing authentication, authorization, and data encryption.
The following design patterns are used:
- Factory Pattern: To create diverse objects seamlessly (e.g., different OCR or TTS service instances).
- Observer Pattern: To efficiently track state changes (e.g., document status updates during validation).
- Singleton Pattern: To ensure singular instances of critical resources (e.g., database connections).
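The three patterns above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the AudioMitra repository: the class names (`OcrEngine`, `TesseractEngine`, `CloudOcrEngine`, `OcrEngineFactory`, `Document`, `DatabaseConnection`) are hypothetical, and the engines return placeholder strings instead of calling a real OCR service.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type


class OcrEngine(ABC):
    """Hypothetical common interface for every OCR backend."""

    @abstractmethod
    def extract_text(self, image_path: str) -> str:
        ...


class TesseractEngine(OcrEngine):
    def extract_text(self, image_path: str) -> str:
        # Placeholder result; a real engine would run OCR on the image.
        return f"[tesseract text from {image_path}]"


class CloudOcrEngine(OcrEngine):
    def extract_text(self, image_path: str) -> str:
        # Placeholder result standing in for a hosted OCR API.
        return f"[cloud text from {image_path}]"


class OcrEngineFactory:
    """Factory: turns a configuration name into an engine instance."""

    _registry: Dict[str, Type[OcrEngine]] = {
        "tesseract": TesseractEngine,
        "cloud": CloudOcrEngine,
    }

    @classmethod
    def create(cls, name: str) -> OcrEngine:
        if name not in cls._registry:
            raise ValueError(f"unknown OCR engine: {name!r}")
        return cls._registry[name]()


class Document:
    """Observer subject: calls every subscriber when the status changes."""

    def __init__(self, doc_id: str) -> None:
        self.doc_id = doc_id
        self.status = "pending_validation"
        self._observers: List[Callable[[str, str], None]] = []

    def subscribe(self, callback: Callable[[str, str], None]) -> None:
        self._observers.append(callback)

    def set_status(self, status: str) -> None:
        self.status = status
        for notify in self._observers:
            notify(self.doc_id, status)


class DatabaseConnection:
    """Singleton: constructing it repeatedly yields the same object."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


if __name__ == "__main__":
    engine = OcrEngineFactory.create("tesseract")
    doc = Document("ocr-1")
    doc.subscribe(lambda doc_id, status: print(f"{doc_id} is now {status}"))
    doc.set_status("validated")
    print(engine.extract_text("page1.png"))
    print(DatabaseConnection() is DatabaseConnection())
```

With this layout, adding another OCR or TTS provider means registering one more class in the factory, while callers keep using the same `extract_text` interface.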
This timeline assumes a dedicated team of 4 members, with each member potentially taking on multiple roles to ensure the project's success.
- Research and Planning: 1 week
- Design: 2 weeks
- Development: 4 weeks
- Testing: 1 week
- Refinement: 1 week
- Total Estimated Time: 9 weeks
AudioMitra aims to make content more accessible and easier to manage. By combining OCR and TTS technologies with human-in-the-loop validation, it keeps content transformations efficient while ensuring they remain authentic and reliable. Its support for multilingual content and a collaborative multi-user environment makes it a useful tool for enhancing educational resources, enriching digital libraries, and widening access to knowledge. Its React, Node.js, and Flask stack and microservices-based design leave room to integrate additional OCR and TTS services over time.
Follow these steps to set up and run the AudioMitra application locally.
Ensure you have the following installed on your development machine:
- Node.js (LTS version recommended, v18.x or higher) - Download from nodejs.org.
- npm (comes with Node.js)
- Python 3.9+ - Download from python.org.
- virtualenv: sudo apt install python3-virtualenv (or pip install virtualenv)
- ffmpeg: sudo apt install ffmpeg (required for audio processing)
- tesseract-ocr: sudo apt install tesseract-ocr (required for OCR functionality)
- MySQL Server: Install and configure locally (refer to MySQL Installation Guide).
- MongoDB Server (Optional, for future use/alternative data storage; currently MySQL is defined)
Start by cloning the AudioMitra repository to your local machine:
git clone https://github.com/vilalali/audioMitra.git
cd audioMitra
The frontend provides the user interface for document and content transformation.
- Navigate to the frontend directory:
cd audioMitra-frontend
- Install Node.js Packages:
npm install --no-package-lock
- Update API Base URL:
- Edit the creds.js file located at audioMitra-frontend/src/creds.js.
- Set API_URL to point to your Node.js backend server. Important: If running on the same machine, use http://localhost:port or your machine's local IP address.
// audioMitra-frontend/src/creds.js
const API_URL = 'http://localhost:8002'; // Ensure this matches your Node.js backend port
export { API_URL };
- Start the Frontend Server:
npm start
The frontend should now be running, typically accessible at http://localhost:3000 (check your console for the exact port if different).
The Python backend handles OCR and TTS processing.
- Navigate to the Python backend directory:
cd ../audioMitra-backend
- Establish Python Virtual Environment:
python3 -m venv audioMitraVenv
source audioMitraVenv/bin/activate
- Install Python Packages:
pip install -r audioMitraRequirement.txt
audioMitraRequirement.txt content:
blinker==1.7.0
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
cryptography==42.0.5
Flask==3.0.2
Flask-Cors==4.0.0
idna==3.6
itsdangerous==2.1.2
Jinja2==3.1.3
jwt==1.3.1
MarkupSafe==2.1.5
mysql-connector-python==8.2.0
packaging==24.0
pillow==10.2.0
protobuf==4.21.12
pycparser==2.22
pydub==0.25.1
pytesseract==0.3.10
pytz==2024.1
requests==2.31.0
urllib3==2.2.1
Werkzeug==3.0.1
- Start Python Backend Server:
python3 app.py
- Test the Python API:
- Open your web browser and navigate to: http://localhost:5000/home (assuming Flask runs on port 5000).
- Expected Output: Hello, your backend is running successfully.
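The same check can be run from a terminal instead of a browser. The sketch below uses only the Python standard library; the function name `backend_is_up` is hypothetical, and the URL and expected text are taken from the step above (port 5000 is an assumption carried over from it).

```python
import urllib.request

# Text the /home route is expected to return, per the step above.
EXPECTED = "Hello, your backend is running successfully."

# The backend is local, so skip any system-wide HTTP proxy settings.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def backend_is_up(url: str = "http://localhost:5000/home",
                  timeout: float = 3.0) -> bool:
    """Return True if the URL answers 200 and the body contains EXPECTED."""
    try:
        with _opener.open(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and EXPECTED in body
    except OSError:
        # URLError, HTTPError, refused connections and timeouts
        # are all OSError subclasses.
        return False


if __name__ == "__main__":
    print("backend up:", backend_is_up())
```

If the Flask server listens on a different port, pass the full URL explicitly, for example `backend_is_up("http://localhost:8000/home")`.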
The Node.js backend likely serves as an API Gateway or handles specific services.
- Ensure you're in the audioMitra-backend directory:
# If you're still in the Python venv, deactivate it first:
# deactivate
cd ../audioMitra-backend
# Or ensure you are in the directory containing package.json for Node.js
- Install Node.js Packages:
npm install --no-package-lock
- Start Node.js Backend Server:
npm run start
# or if 'start' script uses nodemon:
# npm start
# or you might have a custom script like:
# npm run run
# or
# npm run-script run
Ensure this server is running on the port specified in your audioMitra-frontend/src/creds.js (e.g., 8002).
AudioMitra uses MySQL for user and content data storage.
- Install MySQL if you haven't already.
- Create a MySQL user and database:
CREATE USER 'audioMitra'@'localhost' IDENTIFIED BY 'XXXX';
CREATE DATABASE audioMitraData;
GRANT ALL PRIVILEGES ON audioMitraData.* TO 'audioMitra'@'localhost';
FLUSH PRIVILEGES;
(Adjust username, password, and host as needed for your setup.)
- Create a .env file in the root directory of your Node.js backend (audioMitra-backend/.env) with your MySQL credentials. SQL_PASSWORD must match the password you set in the CREATE USER statement:
# audioMitra-backend/.env
SQL_HOST="localhost"
SQL_USER="audioMitra"
SQL_PASSWORD="password"
SQL_DATABASE="audioMitraData"
- Database Schema:
- You'll need to create the tables in the audioMitraData database.
- User Table: Stores user authentication and profile information.
CREATE TABLE User (
    timeStamp VARCHAR(255) NOT NULL,
    userID VARCHAR(255) PRIMARY KEY,
    -- Add other user fields like username, password_hash, email, role, etc.
    username VARCHAR(255) UNIQUE,
    password_hash VARCHAR(255),
    email VARCHAR(255)
);
- Ocr Table: Stores details of extracted text content.
CREATE TABLE Ocr (
    ocrTimeStamp VARCHAR(255) NOT NULL,
    ocrID VARCHAR(255) PRIMARY KEY,
    userID VARCHAR(255), -- Foreign key to User
    originalImagePath VARCHAR(255),
    extractedText TEXT,
    status VARCHAR(50), -- e.g., 'pending_validation', 'validated'
    FOREIGN KEY (userID) REFERENCES User(userID)
);
- Translation Table (If translation is implemented):
CREATE TABLE Translation (
    translationTimeStamp VARCHAR(255) NOT NULL,
    translationID VARCHAR(255) PRIMARY KEY,
    ocrID VARCHAR(255), -- Foreign key to Ocr
    targetLanguage VARCHAR(50),
    translatedText TEXT,
    status VARCHAR(50), -- e.g., 'pending_review', 'reviewed'
    FOREIGN KEY (ocrID) REFERENCES Ocr(ocrID)
);
- Speech Table (If TTS is implemented):
CREATE TABLE Speech (
    speechTimeStamp VARCHAR(255) NOT NULL,
    speechID VARCHAR(255) PRIMARY KEY,
    contentID VARCHAR(255), -- Could link to Ocr or Translation
    sourceText TEXT,
    audioFilePath VARCHAR(255),
    language VARCHAR(50),
    voiceType VARCHAR(50),
    FOREIGN KEY (contentID) REFERENCES Ocr(ocrID) -- Or Translation(translationID)
);
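To see how the status column supports human-in-the-loop validation, the following sketch runs a trimmed version of the User and Ocr tables in an in-memory SQLite database. SQLite is swapped in here only so the example runs without a MySQL server; the helper names `open_db` and `mark_validated`, the sample user "alice", and the sample text are all hypothetical.

```python
import sqlite3

# Trimmed-down User and Ocr tables, adapted to SQLite types.
SCHEMA = """
CREATE TABLE User (
    timeStamp TEXT NOT NULL,
    userID TEXT PRIMARY KEY,
    username TEXT UNIQUE
);
CREATE TABLE Ocr (
    ocrTimeStamp TEXT NOT NULL,
    ocrID TEXT PRIMARY KEY,
    userID TEXT,
    extractedText TEXT,
    status TEXT,
    FOREIGN KEY (userID) REFERENCES User(userID)
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def mark_validated(conn: sqlite3.Connection, ocr_id: str, corrected: str) -> bool:
    """Store the reviewer's corrected text and move the row to 'validated'.

    Returns False if the row does not exist or was already validated.
    """
    cur = conn.execute(
        "UPDATE Ocr SET extractedText = ?, status = 'validated' "
        "WHERE ocrID = ? AND status = 'pending_validation'",
        (corrected, ocr_id),
    )
    conn.commit()
    return cur.rowcount == 1


if __name__ == "__main__":
    db = open_db()
    db.execute("INSERT INTO User VALUES (?, ?, ?)",
               ("2024-01-01T00:00:00", "u1", "alice"))
    db.execute("INSERT INTO Ocr VALUES (?, ?, ?, ?, ?)",
               ("2024-01-01T00:05:00", "o1", "u1", "He1lo wor1d",
                "pending_validation"))
    print(mark_validated(db, "o1", "Hello world"))
    print(db.execute("SELECT extractedText, status FROM Ocr").fetchone())
```

Restricting the UPDATE to rows still in 'pending_validation' keeps a second reviewer from silently overwriting text that has already been validated.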