Intelligent document processing using state-of-the-art AI models.

If you find Docuglean helpful, please ⭐ this repository to show your support!

What is Docuglean AI?

Docuglean is a unified SDK for intelligent document processing using state-of-the-art AI models. It provides multilingual and multimodal capabilities with plug-and-play APIs for document OCR, structured data extraction, annotation, classification, summarization, and translation. It also ships with built-in tools and supports a range of document types out of the box.

Features

  • 🚀 Easy to Use: Simple, intuitive API with detailed documentation. Just pass in a file and get markdown in response.
  • 🔍 OCR Capabilities: Extract text from images and scanned documents
  • 📊 Structured Data Extraction: Use Zod schemas for type-safe data extraction
  • 📑 Document Classification: Intelligently split multi-section documents by category with automatic chunking
  • 📄 Multimodal Support: Process PDFs and images with ease
  • 🤖 Multiple AI Providers: Support for OpenAI, Mistral, and Google Gemini, with more coming soon
  • Batch Processing: Process multiple documents concurrently with automatic error handling
  • 🔒 Type Safety: Full TypeScript support with comprehensive types
  • 📝 Document Parsers: Local parsing for DOC, DOCX, PPTX, XLSX, XLS, ODS, ODT, ODP, CSV, TSV, and PDF files (no API required)

Coming Soon

  • 📝 summarize(): TLDRs of long documents
  • 🌐 translate(): Support for multilingual documents
  • 🏷️ classify(): Document type classifier (receipt, ID, invoice, etc.)
  • 🔍 search(query): LLM-powered search across documents
  • 🤖 More Models. More Providers: Integration with Meta's Llama, Together AI, OpenRouter and lots more.
  • 🌍 Multilingual: Support for multiple languages
  • 🎯 Smart Classification: Automatic document type detection

Quick Start

Installation

npm i docuglean-ocr

Features in Detail

OCR Function - Pure OCR Processing

Extracts text from documents and images. Returns text content with basic metadata (varies by provider).

import { ocr, extract } from 'docuglean-ocr';

// Extract raw text from documents (supports URLs and local files)
const ocrResult = await ocr({
  filePath: 'https://arxiv.org/pdf/2302.12854',
  provider: 'openai',
  model: 'gpt-4o-mini',
  apiKey: 'your-api-key'
});

// Mistral OCR with local file
const mistralResult = await ocr({
  filePath: './document.pdf',
  provider: 'mistral',
  model: 'mistral-ocr-latest',
  apiKey: 'your-api-key'
});

// Local OCR (no API, PDFs only) using pdf2json
const localResult = await ocr({
  filePath: './document.pdf',
  provider: 'local',
  apiKey: 'local'
});
console.log('Local text:', (localResult as any).text.substring(0, 200) + '...');
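
The examples above pass the API key as a string literal. In practice you will probably want to read it from the environment instead. The helper below is a minimal sketch of that idea: `apiKeyFor` and the environment variable names are assumptions for illustration, not part of docuglean-ocr.

```typescript
// Hypothetical helper, not part of docuglean-ocr.
type Provider = 'openai' | 'mistral' | 'gemini' | 'local';

// Assumed environment variable names; adjust them to your own setup.
const ENV_VAR_BY_PROVIDER: Record<Exclude<Provider, 'local'>, string> = {
  openai: 'OPENAI_API_KEY',
  mistral: 'MISTRAL_API_KEY',
  gemini: 'GEMINI_API_KEY',
};

function apiKeyFor(
  provider: Provider,
  env: Record<string, string | undefined>,
): string {
  // The local provider makes no API call, so a placeholder is enough.
  if (provider === 'local') return 'local';
  const name = ENV_VAR_BY_PROVIDER[provider];
  const value = env[name];
  if (!value) {
    throw new Error(`Missing ${name} for provider "${provider}"`);
  }
  return value;
}
```

In Node.js you would call it as `apiKeyFor('openai', process.env)` and pass the result as `apiKey`. Failing early with a named variable tends to be easier to debug than an authentication error from the provider.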

Extract Function - Document Analysis & Information Extraction

Structured extraction for analyzing document content and extracting specific information based on custom prompts.

import { z } from 'zod';

// Define schema for structured extraction
const ReceiptSchema = z.object({
  date: z.string(),
  total: z.number(),
  items: z.array(z.object({
    name: z.string(),
    price: z.number()
  }))
});

// Extract structured data from documents
const extractResult = await extract({
  filePath: './receipt.pdf',
  provider: 'openai',
  model: 'gpt-4o-mini',
  apiKey: 'your-api-key',
  responseFormat: ReceiptSchema,
  prompt: 'Extract receipt details including date, total, and items'
});

// You can now access fields directly:
console.log('Date:', extractResult.date);
console.log('Total:', extractResult.total);
console.log('First item name:', extractResult.items[0]?.name);
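
Model output can be well-formed and still wrong, so it may be worth sanity-checking extracted numbers before using them. The sketch below checks that item prices add up to the reported total. The `Receipt` interface mirrors the receipt shape used above in plain TypeScript, and `itemsMatchTotal` is a hypothetical helper, not a library function.

```typescript
// Plain TypeScript mirror of the receipt shape used above.
interface Receipt {
  date: string;
  total: number;
  items: { name: string; price: number }[];
}

// Hypothetical post-extraction check, not part of docuglean-ocr.
// The tolerance absorbs floating-point rounding (0.1 + 0.2 !== 0.3).
function itemsMatchTotal(receipt: Receipt, tolerance = 0.01): boolean {
  const sum = receipt.items.reduce((acc, item) => acc + item.price, 0);
  return Math.abs(sum - receipt.total) <= tolerance;
}
```

Receipts that include tax, tips, or discounts will not pass this check as written, so treat a mismatch as a prompt for review, not as proof of an extraction error.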

Document Classification - Split Documents by Category

Intelligently classify and split documents into categories based on content. Perfect for processing multi-section documents like medical records, legal contracts, or research papers.

import { classify } from 'docuglean-ocr';

// Classify a patient medical record
const result = await classify(
  './patient-record.pdf',
  [
    {
      name: 'Patient Intake Forms',
      description: 'Pages with patient registration, insurance information, and consent forms'
    },
    {
      name: 'Medical History',
      description: 'Pages containing past medical history, medications, allergies, and family history'
    },
    {
      name: 'Lab Results',
      description: 'Pages with laboratory test results, blood work, and diagnostic reports'
    },
    {
      name: 'Treatment Notes',
      description: 'Pages with doctor\'s notes, treatment plans, and prescriptions'
    }
  ],
  'your-api-key',
  'mistral' // or 'openai', 'gemini'
);

// Access the results
result.splits.forEach(split => {
  console.log(`\n${split.name}:`);
  console.log(`  Pages: ${split.pages}`);
  console.log(`  Confidence: ${split.conf}`);
});

// Example output:
// Patient Intake Forms:
//   Pages: 1,2,3,4
//   Confidence: high
// Medical History:
//   Pages: 5,6,7
//   Confidence: high
// Lab Results:
//   Pages: 8,9,10,11,12
//   Confidence: high
// Treatment Notes:
//   Pages: 13,14,15,16
//   Confidence: high
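
For long documents the page lists get unwieldy. Assuming `split.pages` is an array of page numbers (the example output suggests so, but this is an assumption), a small helper can collapse it into ranges for display. `formatPageRanges` is a hypothetical name, not a library function.

```typescript
// Collapse page numbers into compact ranges for display.
function formatPageRanges(pages: number[]): string {
  // Sort a copy and drop duplicates so the input is left untouched.
  const sorted = pages
    .slice()
    .sort((a, b) => a - b)
    .filter((page, idx, arr) => idx === 0 || page !== arr[idx - 1]);

  const ranges: string[] = [];
  let i = 0;
  while (i < sorted.length) {
    let j = i;
    // Extend the run while the next page is consecutive.
    while (j + 1 < sorted.length && sorted[j + 1] === sorted[j] + 1) j++;
    ranges.push(i === j ? `${sorted[i]}` : `${sorted[i]}-${sorted[j]}`);
    i = j + 1;
  }
  return ranges.join(', ');
}
```

With this, the "Lab Results" split above would print as `8-12` instead of `8,9,10,11,12`.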

Key Features:

  • 🎯 Automatic Chunking: Handles large documents (100+ pages) by automatically splitting into chunks
  • Concurrent Processing: Processes chunks in parallel for faster results
  • 🎚️ Confidence Scores: Returns "high" or "low" confidence for each classification
  • 📊 Page-Level Granularity: Get exact page numbers for each category
  • 🔧 Configurable: Adjust chunk size and concurrency limits
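
To make the chunking idea concrete, here is a minimal sketch of how a page range could be divided into fixed-size chunks. It illustrates the general technique only and is not the library's actual implementation; `chunkPages` and `PageChunk` are hypothetical names.

```typescript
// 1-based, inclusive page range.
interface PageChunk {
  start: number;
  end: number;
}

// Split pages 1..totalPages into consecutive chunks of at most chunkSize pages.
function chunkPages(totalPages: number, chunkSize = 75): PageChunk[] {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  const chunks: PageChunk[] = [];
  for (let start = 1; start <= totalPages; start += chunkSize) {
    // The last chunk may be shorter than chunkSize.
    chunks.push({ start, end: Math.min(start + chunkSize - 1, totalPages) });
  }
  return chunks;
}
```

A 160-page document with the default size of 75 yields three chunks: pages 1-75, 76-150, and 151-160.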

Advanced Options:

const result = await classify(
  './large-document.pdf',
  [...], // categories, as in the previous example
  'your-api-key',
  'openai',
  {
    model: 'gpt-4o-mini', // Optional: specify model
    chunkSize: 75, // Pages per chunk (default: 75)
    maxConcurrent: 5 // Max parallel requests (default: 5)
  }
);
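
A setting like `maxConcurrent` caps how many requests are in flight at once. The sketch below shows one common way to implement such a cap with a small pool of workers. It is an illustration of the technique under that assumption, not Docuglean's internal code, and `mapWithConcurrency` is a hypothetical name.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results keep the order of `items`. If any call rejects, the returned
// promise rejects with that error.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  // Never start more runners than items; always start at least one.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

A lower limit is gentler on provider rate limits; a higher one finishes sooner when the provider allows it.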

Batch Processing - Process Multiple Documents Concurrently

Process multiple documents concurrently. A failure in one file is reported in that file's result and does not stop the rest of the batch.

import { batchOcr, batchExtract } from 'docuglean-ocr';
import { z } from 'zod';

// Batch OCR - Process multiple files
const ocrResults = await batchOcr([
  {
    filePath: './invoice1.pdf',
    provider: 'openai',
    apiKey: 'your-api-key',
    model: 'gpt-4o-mini'
  },
  {
    filePath: './invoice2.pdf',
    provider: 'mistral',
    apiKey: 'your-api-key',
    model: 'pixtral-12b-2409'
  },
  {
    filePath: './receipt.pdf', // local OCR handles PDFs only
    provider: 'local',
    apiKey: 'not-needed'
  }
]);

// Handle results - errors don't stop processing
ocrResults.forEach((result, index) => {
  if (result.success) {
    console.log(`File ${index + 1} processed:`, result.result);
  } else {
    console.error(`File ${index + 1} failed:`, result.error);
  }
});

// Batch Extract - Extract structured data from multiple files
const InvoiceSchema = z.object({
  invoice_number: z.string(),
  vendor: z.string(),
  total: z.number()
});

const extractResults = await batchExtract([
  {
    filePath: './invoice1.pdf',
    provider: 'openai',
    apiKey: 'your-api-key',
    responseFormat: InvoiceSchema
  },
  {
    filePath: './invoice2.pdf',
    provider: 'openai',
    apiKey: 'your-api-key',
    responseFormat: InvoiceSchema
  }
]);

// Get successful extractions
const successful = extractResults.filter(r => r.success);
console.log(`Processed ${successful.length}/${extractResults.length} files`);
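
Because batch results come back in the same order as the inputs, failed files can be matched to their original configs by index, for example to retry them. The helper below is a sketch that relies only on the `success` flag shown above; `failedInputs` and `Outcome` are hypothetical names, not library exports.

```typescript
// Assumed minimal result shape: only the `success` flag is relied on.
interface Outcome {
  success: boolean;
}

// Return the inputs whose result at the same index did not succeed.
function failedInputs<I>(
  inputs: readonly I[],
  results: readonly Outcome[],
): I[] {
  if (inputs.length !== results.length) {
    throw new Error('inputs and results must have the same length');
  }
  return inputs.filter((_, index) => !results[index].success);
}
```

The returned array can be passed straight back into another batch call.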

Key Features:

  • ✅ Automatic error handling
  • ✅ Results returned in same order as input
  • ✅ Mix different providers in single batch
  • ✅ Simple success/failure status for each file
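
The behavior listed above (per-file status, input order preserved, one failure not aborting the batch) can be sketched with `Promise.all` and a per-item `try`/`catch`. This illustrates the pattern only and is not Docuglean's actual implementation. `runBatch` and `Settled` are hypothetical names, and the result shape is inferred from the `success`, `result`, and `error` fields used in the example.

```typescript
// Assumed result shape, inferred from the fields used in the batch example.
type Settled<T> =
  | { success: true; result: T }
  | { success: false; error: string };

// Run `task` for every input concurrently. Each failure is captured in
// its own slot, so the returned promise never rejects.
async function runBatch<I, O>(
  inputs: readonly I[],
  task: (input: I) => Promise<O>,
): Promise<Settled<O>[]> {
  // Promise.all keeps results in input order, whichever task finishes first.
  return Promise.all(
    inputs.map(async (input): Promise<Settled<O>> => {
      try {
        return { success: true, result: await task(input) };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        return { success: false, error: message };
      }
    }),
  );
}
```

Catching inside each mapped function is what keeps one rejected task from rejecting the whole `Promise.all`.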

Provider Options

Currently supported providers and models:

  • OpenAI: gpt-4.1-mini, gpt-4.1, gpt-4o-mini, gpt-4o, o1-mini, o1, o3, o4-mini
  • Mistral: mistral-ocr-latest for OCR. All currently available models except for codestral-mamba are supported for structured outputs.
  • Google Gemini: gemini-2.5-flash, gemini-2.5-pro, gemini-1.5-flash, gemini-1.5-pro
  • Local: No API required - supports DOC, DOCX, PPTX, XLSX, XLS, ODS, ODT, ODP, CSV, TSV, and PDF files
  • More coming soon: Together AI, OpenRouter, Anthropic, and others

Document Parsers (Local - No API Required)

Extract text from various document formats without any AI provider:

import {
  parseDocumentLocal,
  parsePdf,
  parseDocx,
  parsePptx,
  parseSpreadsheet,
  parseCsv,
  parseOdt,
  parseOdp,
  parseOds
} from 'docuglean-ocr';

// Parse any supported document format
const result = await parseDocumentLocal('./document.pdf');
console.log(result.text);

// Or use specific parsers
const pdf = await parsePdf('./document.pdf');           // PDF
const docx = await parseDocx('./document.docx');        // DOCX (also supports DOC)
const pptx = await parsePptx('./presentation.pptx');    // PowerPoint
const xlsx = await parseSpreadsheet('./data.xlsx');     // Excel (XLSX, XLS)
const csv = await parseCsv('./data.csv');               // CSV/TSV
const odt = await parseOdt('./document.odt');           // OpenDocument Text
const odp = await parseOdp('./presentation.odp');       // OpenDocument Presentation
const ods = await parseOds('./spreadsheet.ods');        // OpenDocument Spreadsheet

Supported Formats:

  • Word: DOC, DOCX (via mammoth)
  • PowerPoint: PPTX (via officeparser)
  • Excel: XLSX, XLS, ODS (via officeparser)
  • CSV/TSV: CSV, TSV (via d3-dsv)
  • OpenDocument: ODT, ODP, ODS (via officeparser)
  • PDF: PDF (via pdf2json, or convert to images via pdf-poppler)
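
As a rough illustration of how a single entry point can route files to the right backend, the sketch below maps a file extension to the parsing library named in the list above. It shows the dispatch idea only. `libraryFor` is a hypothetical helper and does not describe how `parseDocumentLocal` is actually implemented.

```typescript
type ParserLibrary = 'mammoth' | 'officeparser' | 'd3-dsv' | 'pdf2json';

// Extension-to-library table, taken from the supported formats list.
const LIBRARY_BY_EXTENSION: Record<string, ParserLibrary> = {
  doc: 'mammoth',
  docx: 'mammoth',
  pptx: 'officeparser',
  xlsx: 'officeparser',
  xls: 'officeparser',
  odt: 'officeparser',
  odp: 'officeparser',
  ods: 'officeparser',
  csv: 'd3-dsv',
  tsv: 'd3-dsv',
  pdf: 'pdf2json',
};

// Return the library for a path, or undefined for unsupported formats.
function libraryFor(filePath: string): ParserLibrary | undefined {
  const dot = filePath.lastIndexOf('.');
  if (dot < 0) return undefined;
  const ext = filePath.slice(dot + 1).toLowerCase();
  // Own-property check so names like "constructor" do not match.
  return Object.prototype.hasOwnProperty.call(LIBRARY_BY_EXTENSION, ext)
    ? LIBRARY_BY_EXTENSION[ext]
    : undefined;
}
```

Checking the extension up front lets you reject unsupported files (images, for instance) before attempting a local parse.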

Configuration

OCR Configuration

interface OCRConfig {
  filePath: string;
  provider?: 'openai' | 'mistral' | 'gemini' | 'local';
  model?: string;
  apiKey: string;
  prompt?: string;
  options?: {
    mistral?: {
      includeImageBase64?: boolean;
    };
    openai?: {
      maxTokens?: number;
    };
    gemini?: {
      temperature?: number;
      topP?: number;
      topK?: number;
    };
  };
}

Extraction Configuration

interface ExtractConfig {
  filePath: string;
  apiKey: string;
  provider?: 'openai' | 'mistral' | 'gemini';
  model?: string;
  prompt?: string;
  responseFormat?: z.ZodType<any>;
  systemPrompt?: string;
}

Additional Examples

// Structured extraction with Gemini
const geminiReceipt = await extract({
  filePath: './receipt.pdf',
  provider: 'gemini',
  apiKey: 'your-gemini-api-key',
  responseFormat: ReceiptSchema,
  prompt: 'Extract receipt information including date, total, and all items'
});

// Structured extraction with different schema
const DocumentSchema = z.object({
  title: z.string(),
  authors: z.array(z.string()),
  summary: z.string()
});

const documentInfo = await extract({
  filePath: './research-paper.pdf',
  provider: 'openai',
  apiKey: 'your-api-key',
  responseFormat: DocumentSchema,
  prompt: 'Extract document metadata and summary'
});

// Summarization via extract
const SummarySchema = z.object({
  title: z.string().optional(),
  summary: z.string(),
  keyPoints: z.array(z.string()),
});
const summary = await extract({
  filePath: './long-report.pdf',
  provider: 'openai',
  apiKey: 'your-api-key',
  responseFormat: SummarySchema,
  prompt: 'Provide a concise 3-sentence summary of the document.'
});
console.log('Summary:', summary.summary);

Note: you can also use extract with a targeted "search" prompt (e.g., "Find all occurrences of X and return matching passages") to perform semantic search within a document.

Check out our test folder for more comprehensive examples and use cases, including:

  • Receipt parsing
  • Document summarization
  • Image OCR
  • Structured data extraction
  • Custom schema validation

Stay Up to Date

⭐ Star this repo to get notified about new releases and updates!

Contributing

We welcome contributions, including issues, questions, and pull requests. Please refer to the CONTRIBUTING.md file for information about how to get involved.

License

Apache 2.0 - see the LICENSE file for details.