🏷️ Project Title

OCR & PDF Text Extraction Microservice

🧾 Executive Summary

The Personal AI Factory v1 – OCR & PDF Text Extraction Microservice is a production-grade, serverless backend component that extracts clean, machine-readable text from text-based PDF documents. It is written in TypeScript and deployed as a Vercel Serverless Function. The service exposes a single HTTP API endpoint that fetches a remote PDF file, parses its text content with pdf-parse, and returns structured JSON.

This service is optimized for automation-first architectures, specifically downstream integration with n8n pipelines. Version 1 explicitly supports text-based PDFs only and does not perform OCR on scanned documents or images.

📑 Table of Contents

🧩 Project Overview
🎯 Objectives & Goals
✅ Acceptance Criteria
💻 Prerequisites
⚙️ Installation & Setup
🔗 API Documentation
🖥️ UI / Frontend
🔢 Status Codes
🚀 Features
🧱 Tech Stack & Architecture
🛠️ Workflow & Implementation
🧪 Testing & Validation
🔍 Validation Summary
🧰 Verification Testing Tools
🧯 Troubleshooting & Debugging
🔒 Security & Secrets
☁️ Deployment (Vercel)
⚡ Quick-Start Cheat Sheet
🧾 Usage Notes
🧠 Performance & Optimization
🌟 Enhancements & Features
🧩 Maintenance & Future Work
🏆 Key Achievements
🧮 High-Level Architecture
🗂️ Folder Structure
🧭 How to Demonstrate Live
💡 Summary, Closure & Compliance

🧩 Project Overview

This microservice functions as a stateless PDF text extraction API within the Personal AI Factory ecosystem.

Accepts a publicly accessible PDF URL
Downloads the PDF at runtime
Parses textual content using pdf-parse
Returns extracted text as structured JSON
Designed for synchronous HTTP execution

🎯 Objectives & Goals

Provide a reliable text extraction layer for automation workflows
Eliminate dependency on paid OCR services for textual PDFs
Maintain fast cold-start and execution times
Enable seamless integration with n8n HTTP Request nodes
Serve as a foundational V1 component for future OCR and AI expansion

✅ Acceptance Criteria

HTTP 200 returned for valid textual PDFs
Structured error responses for invalid input
No OCR processing in Version 1
Deployable on Vercel without custom infrastructure
JSON output compatible with automation tools

💻 Prerequisites

Node.js 18 or higher
Vercel CLI (for deployment)
Publicly accessible PDF URLs
Basic REST API knowledge

⚙️ Installation & Setup

Clone the repository
Install dependencies
Verify Node.js version compatibility
Review TypeScript configuration
Prepare Vercel deployment

🔗 API Documentation

Endpoint: /api/ocr-summarize

Methods: GET, POST

Input: fileURL, the URL of a publicly accessible PDF

Output: JSON with extracted text
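
The README does not state how fileURL is transmitted. The sketch below assumes it is sent as a query-string parameter on a GET request. The helper name buildExtractionUrl and both host names are placeholders, not part of the project.

```typescript
// Hypothetical helper: builds a GET request URL for the extraction endpoint.
// Assumption: fileURL travels as a query-string parameter.
function buildExtractionUrl(baseUrl: string, fileURL: string): string {
  const endpoint = new URL("/api/ocr-summarize", baseUrl);
  // searchParams.set percent-encodes the value, so the PDF URL survives intact.
  endpoint.searchParams.set("fileURL", fileURL);
  return endpoint.toString();
}

// Placeholder deployment host and PDF location.
const requestUrl = buildExtractionUrl(
  "https://your-deployment.vercel.app",
  "https://example.com/sample.pdf",
);
console.log(requestUrl);
```

The resulting string can be pasted into curl, Postman, or the URL field of an n8n HTTP Request node.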

🖥️ UI / Frontend

This project does not include a frontend or UI layer. It is designed for backend-to-backend and automation-based consumption via n8n, Postman, or curl.

🔢 Status Codes

Status	Description
200	Successful extraction
400	Invalid or missing fileURL
500	Internal server error

🚀 Features

Textual PDF parsing
Serverless execution
Automation-friendly JSON output
No paid OCR dependencies

🧱 Tech Stack & Architecture

Runtime: Vercel Serverless Functions (Node.js 18)
Language: TypeScript
PDF Parsing: pdf-parse
HTTP Client: node-fetch
Deployment: Vercel

Client / n8n
     |
     v
Vercel Serverless Function
     |
     v
pdf-parse
     |
     v
JSON Response

🛠️ Workflow & Implementation

Receive HTTP request
Validate input parameters
Fetch PDF from URL
Parse text using pdf-parse
Return structured JSON response
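
The five steps above can be sketched as a single function. This is a minimal illustration, not the repository's actual handler. The names runExtraction, ExtractionDeps, and ExtractionReply, as well as the text and error response fields, are assumptions. Fetching and parsing are injected as functions so the flow runs without network access. In a real deployment they would presumably wrap node-fetch and pdf-parse.

```typescript
// Injected dependencies (hypothetical): stand-ins for node-fetch and pdf-parse.
interface ExtractionDeps {
  fetchPdf: (url: string) => Promise<Uint8Array>;
  parsePdf: (data: Uint8Array) => Promise<string>;
}

// Assumed response shape; the field names are illustrative only.
interface ExtractionReply {
  status: number;
  body: { text: string } | { error: string };
}

async function runExtraction(
  fileURL: unknown,
  deps: ExtractionDeps,
): Promise<ExtractionReply> {
  const invalid: ExtractionReply = {
    status: 400,
    body: { error: "Invalid or missing fileURL" },
  };

  // Validate input parameters: must be a non-empty http(s) URL.
  if (typeof fileURL !== "string" || fileURL.trim() === "") return invalid;
  let target: URL;
  try {
    target = new URL(fileURL);
  } catch {
    return invalid;
  }
  if (target.protocol !== "http:" && target.protocol !== "https:") return invalid;

  try {
    // Fetch the PDF, then parse its text.
    const data = await deps.fetchPdf(target.toString());
    const text = await deps.parsePdf(data);
    return { status: 200, body: { text } };
  } catch (err) {
    // Any download or parse failure is reported as an internal error.
    const message = err instanceof Error ? err.message : "Unknown error";
    return { status: 500, body: { error: message } };
  }
}

// Usage with stubbed dependencies (no network, no real PDF).
const fakeDeps: ExtractionDeps = {
  fetchPdf: async () => new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d]),
  parsePdf: async () => "Hello from a stubbed parser",
};

runExtraction("https://example.com/sample.pdf", fakeDeps).then((reply) => {
  console.log(reply.status, reply.body);
});
```

Keeping the flow separate from the HTTP layer means the same function can be called from a Vercel handler and from local checks.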

🧪 Testing & Validation

ID	Area	Command	Expected Output	Explanation
T-01	API	GET with valid PDF	200 + text	Valid textual PDF
T-02	API	Missing fileURL	400 error	Validation check
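
Test cases T-01 and T-02 could be automated roughly as follows. The CallEndpoint type, runChecks, and stubEndpoint are hypothetical names introduced for illustration. The stub stands in for the deployed endpoint so the sketch runs offline. Only the status code is checked, since the README does not specify the response field names.

```typescript
// Abstraction over "call the endpoint", so checks can target a stub or a real deployment.
type CallEndpoint = (
  fileURL?: string,
) => Promise<{ status: number; body: unknown }>;

interface CheckResult {
  id: string;
  passed: boolean;
}

async function runChecks(
  call: CallEndpoint,
  validPdfUrl: string,
): Promise<CheckResult[]> {
  const t01 = await call(validPdfUrl); // T-01: valid textual PDF
  const t02 = await call(undefined); // T-02: missing fileURL
  return [
    { id: "T-01", passed: t01.status === 200 },
    { id: "T-02", passed: t02.status === 400 },
  ];
}

// Stub that mimics the documented status codes; replace with a real HTTP call to test a deployment.
const stubEndpoint: CallEndpoint = async (fileURL) =>
  fileURL
    ? { status: 200, body: { text: "stub text" } }
    : { status: 400, body: { error: "Invalid or missing fileURL" } };

runChecks(stubEndpoint, "https://example.com/sample.pdf").then((results) => {
  for (const r of results) {
    console.log(`${r.id}: ${r.passed ? "pass" : "fail"}`);
  }
});
```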

🔍 Validation Summary

Input validation enforced
Error handling implemented
Automation compatibility verified

🧰 Verification Testing Tools

curl
Postman
n8n HTTP Request node

🧯 Troubleshooting & Debugging

Ensure PDF is publicly accessible
Confirm PDF is text-based
Check Vercel function logs
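
Two of these checks can be approximated in code. The sketch below is a heuristic, not part of the project. hasPdfSignature tests for the "%PDF-" marker that PDF files normally begin with, which helps spot a URL that returned an HTML login or error page instead of the file. looksImageOnly treats an empty extraction result as a sign that the PDF may be scanned. Both function names are hypothetical.

```typescript
// PDF files normally begin with the ASCII signature "%PDF-".
function hasPdfSignature(bytes: Uint8Array): boolean {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  return (
    bytes.length >= signature.length &&
    signature.every((byte, i) => bytes[i] === byte)
  );
}

// Heuristic only: no non-whitespace text usually means an image-only (scanned) PDF,
// which Version 1 does not support.
function looksImageOnly(extractedText: string): boolean {
  return extractedText.trim().length === 0;
}

// Example inputs: the first bytes of a PDF ("%PDF-1") versus an HTML page ("<html").
const pdfStart = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31]);
const htmlStart = new Uint8Array([0x3c, 0x68, 0x74, 0x6d, 0x6c]);

console.log(hasPdfSignature(pdfStart), hasPdfSignature(htmlStart));
console.log(looksImageOnly("  \n"), looksImageOnly("Invoice 2024"));
```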

🔒 Security & Secrets

No secrets or API keys required
Stateless execution
Public-file access only

☁️ Deployment (Vercel)

Node.js 18 runtime
2048 MB memory
60-second execution limit

⚡ Quick-Start Cheat Sheet

Deploy to Vercel
Copy endpoint URL
Provide public PDF URL
Receive extracted text

🧾 Usage Notes

Textual PDFs only
No OCR support in V1
Designed for preprocessing pipelines

🧠 Performance & Optimization

Lightweight dependencies
Fast cold-start execution
Performance dependent on PDF size

🌟 Enhancements & Features

Current version supports textual PDF extraction only.

🧩 Maintenance & Future Work

OCR for scanned PDFs
AI summarization layer
Chunking and vector storage

🏆 Key Achievements

Production-ready serverless microservice
Zero-cost alternative to paid OCR services for textual PDF extraction
Automation-first design

🧮 High-Level Architecture

The service acts as an independent extraction node within the Personal AI Factory, feeding structured text into downstream automation and AI systems.

🗂️ Folder Structure

ocr-summarizer-microservice/
├── api/
│   └── ocr-summarize.ts
├── types/
│   └── pdf-parse.d.ts
├── node_modules/
├── package.json
├── tsconfig.json
├── README.md
└── .gitignore

🧭 How to Demonstrate Live

Deploy to Vercel
Send GET request with PDF URL
Observe extracted text response

💡 Summary, Closure & Compliance

This repository delivers a compliant, production-ready, serverless PDF text extraction microservice aligned with Personal AI Factory v1 standards.

License: MIT
Author: Ansh Srivastava
Status: Stable – Production Ready (V1)

Repository: bitsandbrains/ocr-pdf-text-extraction-service
