Serverless OCR & PDF Text Extraction microservice for Personal AI Factory v1. Built with TypeScript and Vercel Serverless Functions, using pdf-parse and node-fetch for high-performance parsing of machine-readable PDFs. Extracts clean text from textual PDFs and exposes an HTTP API that returns structured JSON output for downstream n8n workflows.

bitsandbrains/ocr-pdf-text-extraction-service

Project Title

OCR & PDF Text Extraction Microservice

Executive Summary

The Personal AI Factory v1 – OCR & PDF Text Extraction Microservice is a production-grade, serverless backend component designed to extract clean, machine-readable text from textual PDF documents. Built using TypeScript and deployed as a Vercel Serverless Function, this microservice exposes a single HTTP API endpoint that fetches a remote PDF file, parses its textual content using pdf-parse, and returns structured JSON output.

This service is optimized for automation-first architectures, specifically downstream integration with n8n pipelines. Version 1 explicitly supports text-based PDFs only and does not perform OCR on scanned documents or images.

Table of Contents

  1. Project Overview
  2. Objectives & Goals
  3. Acceptance Criteria
  4. Prerequisites
  5. Installation & Setup
  6. API Documentation
  7. UI / Frontend
  8. Status Codes
  9. Features
  10. Tech Stack & Architecture
  11. Workflow & Implementation
  12. Testing & Validation
  13. Validation Summary
  14. Verification Testing Tools
  15. Troubleshooting & Debugging
  16. Security & Secrets
  17. Deployment (Vercel)
  18. Quick-Start Cheat Sheet
  19. Usage Notes
  20. Performance & Optimization
  21. Enhancements & Features
  22. Maintenance & Future Work
  23. Key Achievements
  24. High-Level Architecture
  25. Folder Structure
  26. How to Demonstrate Live
  27. Summary, Closure & Compliance

Project Overview

This microservice functions as a stateless PDF text extraction API within the Personal AI Factory ecosystem.

  • Accepts a publicly accessible PDF URL
  • Downloads the PDF at runtime
  • Parses textual content using pdf-parse
  • Returns extracted text as structured JSON
  • Designed for synchronous HTTP execution

Objectives & Goals

  • Provide a reliable text extraction layer for automation workflows
  • Eliminate dependency on paid OCR services for textual PDFs
  • Maintain fast cold-start and execution times
  • Enable seamless integration with n8n HTTP Request nodes
  • Serve as a foundational V1 component for future OCR and AI expansion

Acceptance Criteria

  • HTTP 200 returned for valid textual PDFs
  • Structured error responses for invalid input
  • No OCR processing in Version 1
  • Deployable on Vercel without custom infrastructure
  • JSON output compatible with automation tools

Prerequisites

  • Node.js 18 or higher
  • Vercel CLI (for deployment)
  • Publicly accessible PDF URLs
  • Basic REST API knowledge

Installation & Setup

  1. Clone the repository
  2. Install dependencies
  3. Verify Node.js version compatibility
  4. Review TypeScript configuration
  5. Prepare Vercel deployment

API Documentation

Endpoint: /api/ocr-summarize

Method: GET / POST

Input: Public PDF URL (fileURL)

Output: JSON with extracted text
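
As a rough illustration of the contract above, a client might build and send the request as follows. This is a sketch under stated assumptions: the base URL is a placeholder for your own deployment, and sending `fileURL` as a query-string parameter on GET is a guess, since the README names the parameter but not where it goes. The response is left untyped because the JSON field names are not documented here.

```typescript
// Path taken from the README; the base URL passed in is a placeholder.
const ENDPOINT_PATH = "/api/ocr-summarize";

// Build the GET URL, assuming fileURL travels as a query-string parameter.
function buildExtractionUrl(baseUrl: string, pdfUrl: string): string {
  const url = new URL(ENDPOINT_PATH, baseUrl);
  url.searchParams.set("fileURL", pdfUrl); // percent-encodes the PDF URL
  return url.toString();
}

// Call the service with the global fetch available in Node.js 18+.
// The body is returned as `unknown` because its field names are unspecified.
async function requestExtraction(baseUrl: string, pdfUrl: string): Promise<unknown> {
  const response = await fetch(buildExtractionUrl(baseUrl, pdfUrl));
  if (!response.ok) {
    throw new Error(`Extraction failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

In n8n, the same URL can be pasted into an HTTP Request node instead of calling `requestExtraction` from code.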

UI / Frontend

This project does not include a frontend or UI layer. It is designed for backend-to-backend and automation-based consumption via n8n, Postman, or curl.

Status Codes

| Status | Description                |
|--------|----------------------------|
| 200    | Successful extraction      |
| 400    | Invalid or missing fileURL |
| 500    | Internal server error      |
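
One way to keep these three codes consistent is to model each outcome as a tagged union and map it to a reply in a single place. The sketch below is illustrative only: the type names and the `text` / `error` body fields are hypothetical, since the README does not specify the JSON shape.

```typescript
// The three outcomes that the status table distinguishes.
type ExtractionOutcome =
  | { kind: "ok"; text: string }
  | { kind: "invalid_input"; message: string }
  | { kind: "internal_error"; message: string };

interface HttpReply {
  status: 200 | 400 | 500;
  body: { text: string } | { error: string }; // field names are hypothetical
}

// Map each outcome to exactly one status code from the table.
function toHttpReply(outcome: ExtractionOutcome): HttpReply {
  switch (outcome.kind) {
    case "ok":
      return { status: 200, body: { text: outcome.text } };
    case "invalid_input":
      return { status: 400, body: { error: outcome.message } };
    case "internal_error":
      return { status: 500, body: { error: outcome.message } };
  }
}
```

Because the switch is exhaustive over `kind`, adding a fourth outcome later makes the compiler flag this function until a status code is chosen for it.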

Features

  • Textual PDF parsing
  • Serverless execution
  • Automation-friendly JSON output
  • No paid OCR dependencies

Tech Stack & Architecture

  • Runtime: Vercel Serverless Functions (Node.js 18)
  • Language: TypeScript
  • PDF Parsing: pdf-parse
  • HTTP Client: node-fetch
  • Deployment: Vercel

Client / n8n
     |
     v
Vercel Serverless Function
     |
     v
pdf-parse
     |
     v
JSON Response
  

Workflow & Implementation

  1. Receive HTTP request
  2. Validate input parameters
  3. Fetch PDF from URL
  4. Parse text using pdf-parse
  5. Return structured JSON response
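
The five steps above can be sketched as one function. This is not the repository's actual handler: the download and parse steps are injected as hypothetical `download` and `parse` dependencies so the flow can be read and exercised without network access, and the `text` / `error` body fields are assumptions. In the deployed function these dependencies would presumably wrap node-fetch and pdf-parse (whose result object exposes the extracted string as `text`).

```typescript
// Injected dependencies (hypothetical names) standing in for node-fetch and pdf-parse.
interface Deps {
  download: (url: string) => Promise<Uint8Array>;
  parse: (pdfBytes: Uint8Array) => Promise<string>;
}

interface Reply {
  status: 200 | 400 | 500;
  body: Record<string, string>; // field names are hypothetical
}

async function handleExtraction(fileURL: unknown, deps: Deps): Promise<Reply> {
  // Steps 1-2: receive the request and validate the input parameter.
  if (typeof fileURL !== "string" || fileURL.trim() === "") {
    return { status: 400, body: { error: "Missing fileURL" } };
  }
  let target: URL;
  try {
    target = new URL(fileURL);
  } catch {
    return { status: 400, body: { error: "fileURL is not a valid URL" } };
  }
  if (target.protocol !== "http:" && target.protocol !== "https:") {
    return { status: 400, body: { error: "fileURL must use http or https" } };
  }

  try {
    // Step 3: fetch the PDF from the URL.
    const pdfBytes = await deps.download(target.toString());
    // Step 4: parse its text.
    const text = await deps.parse(pdfBytes);
    // Step 5: return structured JSON.
    return { status: 200, body: { text } };
  } catch (err) {
    // Any download or parse failure becomes a 500.
    const message = err instanceof Error ? err.message : "Unknown error";
    return { status: 500, body: { error: message } };
  }
}
```

Keeping the I/O behind `Deps` is a design choice made for this sketch; it lets the validation and error mapping be checked with stubbed functions.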

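Testing & Validation

| ID   | Area | Command            | Expected Output | Explanation       |
|------|------|--------------------|-----------------|-------------------|
| T-01 | API  | GET with valid PDF | 200 + text      | Valid textual PDF |
| T-02 | API  | Missing fileURL    | 400 error       | Validation check  |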

Validation Summary

  • Input validation enforced
  • Error handling implemented
  • Automation compatibility verified

Verification Testing Tools

  • curl
  • Postman
  • n8n HTTP Request node

Troubleshooting & Debugging

  • Ensure PDF is publicly accessible
  • Confirm PDF is text-based
  • Check Vercel function logs
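
The first two checks can be partly automated. The sketch below is a heuristic, not part of the service: `looksLikePdf` and `diagnose` are hypothetical helper names. It relies on the fact that PDF files begin with the `%PDF-` signature, and it treats an empty extraction result as a sign that the file is scanned.

```typescript
// "%PDF-" as bytes; PDF files begin with this signature.
const PDF_SIGNATURE = [0x25, 0x50, 0x44, 0x46, 0x2d];

// True when the downloaded bytes start with the PDF signature.
function looksLikePdf(bytes: Uint8Array): boolean {
  return PDF_SIGNATURE.every((byte, index) => bytes[index] === byte);
}

// Hypothetical helper: turn the two checks into a hint for the logs.
function diagnose(bytes: Uint8Array, extractedText: string): string {
  if (!looksLikePdf(bytes)) {
    return "Download is not a PDF; the URL may be private or point to an HTML page.";
  }
  if (extractedText.trim().length === 0) {
    return "PDF has no extractable text; it is probably scanned, which V1 does not support.";
  }
  return "PDF looks text-based.";
}
```

A login page served in place of a private file typically fails the first check, which separates access problems from scanned-document problems.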

Security & Secrets

  • No secrets or API keys required
  • Stateless execution
  • Public-file access only
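
Because the service downloads whatever URL it is given, "public-file access only" could be enforced with a guard along these lines. This is an assumption about how such a check might look, not something the repository is known to implement, and `isLikelyPublicHttpUrl` is a hypothetical name. It inspects only the literal host and does not resolve DNS, so it is a first filter rather than complete protection.

```typescript
// Heuristic: accept only http(s) URLs whose literal host is not obviously private.
function isLikelyPublicHttpUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;

  const host = url.hostname.toLowerCase();
  if (host === "localhost" || /\.localhost$/.test(host)) return false;
  if (host.charAt(0) === "[") return false; // IPv6 literals rejected to keep the sketch short

  const ipv4 = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (ipv4 === null) return true; // a DNS name; assumed public in this sketch

  const a = Number(ipv4[1]);
  const b = Number(ipv4[2]);
  const isPrivate =
    a === 0 || a === 10 || a === 127 || // "this network", private, loopback
    (a === 169 && b === 254) ||         // link-local
    (a === 172 && b >= 16 && b <= 31) || // private
    (a === 192 && b === 168);           // private
  return !isPrivate;
}
```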

Deployment (Vercel)

  • Node.js 18 runtime
  • 2048 MB memory
  • 60-second execution limit

Quick-Start Cheat Sheet

  1. Deploy to Vercel
  2. Copy endpoint URL
  3. Provide public PDF URL
  4. Receive extracted text

Usage Notes

  • Textual PDFs only
  • No OCR support in V1
  • Designed for preprocessing pipelines

Performance & Optimization

  • Lightweight dependencies
  • Fast cold-start execution
  • Performance dependent on PDF size

Enhancements & Features

The current version supports textual PDF extraction only.

Maintenance & Future Work

  • OCR for scanned PDFs
  • AI summarization layer
  • Chunking and vector storage

Key Achievements

  • Production-ready serverless microservice
  • Zero-cost alternative for textual PDF extraction
  • Automation-first design

High-Level Architecture

The service acts as an independent extraction node within the Personal AI Factory, feeding structured text into downstream automation and AI systems.

Folder Structure

ocr-summarizer-microservice/
├── api/
│   └── ocr-summarize.ts
├── types/
│   └── pdf-parse.d.ts
├── node_modules/
├── package.json
├── tsconfig.json
├── README.md
└── .gitignore
  

How to Demonstrate Live

  1. Deploy to Vercel
  2. Send GET request with PDF URL
  3. Observe extracted text response

Summary, Closure & Compliance

This repository delivers a compliant, production-ready, serverless PDF text extraction microservice aligned with Personal AI Factory v1 standards.

License: MIT
Author: Ansh Srivastava
Status: Stable – Production Ready (V1)
