Docuglean is a unified SDK for intelligent document processing using state-of-the-art AI models. It provides multilingual and multimodal capabilities with plug-and-play APIs for document OCR, structured data extraction, annotation, classification, summarization, and translation. It also ships with built-in tools and supports many document types out of the box.
- 🚀 Easy to Use: Simple, intuitive API with detailed documentation. Just pass in a file and get markdown in response.
- 🔍 OCR Capabilities: Extract text from images and scanned documents
- 📊 Structured Data Extraction: Use Zod schemas for type-safe data extraction
- 📑 Document Classification: Intelligently split multi-section documents by category with automatic chunking
- 📄 Multimodal Support: Process PDFs and images with ease
- 🤖 Multiple AI Providers: Support for OpenAI, Mistral, and Google Gemini, with more coming soon
- ⚡ Batch Processing: Process multiple documents concurrently with automatic error handling
- 🔒 Type Safety: Full TypeScript support with comprehensive types
- 📝 Document Parsers: Local parsing for DOC, DOCX, PPTX, XLSX, XLS, ODS, ODT, ODP, CSV, TSV, and PDF files (no API required)
- 📝 summarize(): TLDRs of long documents
- 🌐 translate(): Support for multilingual documents
- 🏷️ classify(): Document type classifier (receipt, ID, invoice, etc.)
- 🔍 search(query): LLM-powered search across documents
- 🤖 More Models, More Providers: Integration with Meta's Llama, Together AI, OpenRouter, and more.
- 🌍 Multilingual: Support for multiple languages (coming soon)
- 🎯 Smart Classification: Automatic document type detection (coming soon)
npm i docuglean-ocr

Extracts text from documents and images. Returns text content with basic metadata (varies by provider).
import { ocr, extract } from 'docuglean-ocr';
// Extract raw text from documents (supports URLs and local files)
const ocrResult = await ocr({
filePath: 'https://arxiv.org/pdf/2302.12854',
provider: 'openai',
model: 'gpt-4o-mini',
apiKey: 'your-api-key'
});
// Mistral OCR with local file
const mistralResult = await ocr({
filePath: './document.pdf',
provider: 'mistral',
model: 'mistral-ocr-latest',
apiKey: 'your-api-key'
});
// Local OCR (no API, PDFs only) using pdf2json
const localResult = await ocr({
filePath: './document.pdf',
provider: 'local',
apiKey: 'local'
});
console.log('Local text:', (localResult as any).text.substring(0, 200) + '...');

Structured extraction for analyzing document content and extracting specific information based on custom prompts.
import { z } from 'zod';
// Define schema for structured extraction
const ReceiptSchema = z.object({
date: z.string(),
total: z.number(),
items: z.array(z.object({
name: z.string(),
price: z.number()
}))
});
// Extract structured data from documents
const extractResult = await extract({
filePath: './receipt.pdf',
provider: 'openai',
model: 'gpt-4o-mini',
apiKey: 'your-api-key',
responseFormat: ReceiptSchema,
prompt: 'Extract receipt details including date, total, and items'
});
// You can now access fields directly:
console.log('Date:', extractResult.date);
console.log('Total:', extractResult.total);
console.log('First item name:', extractResult.items[0]?.name);

Intelligently classify and split documents into categories based on content. Perfect for processing multi-section documents like medical records, legal contracts, or research papers.
import { classify } from 'docuglean-ocr';
// Classify a patient medical record
const result = await classify(
'./patient-record.pdf',
[
{
name: 'Patient Intake Forms',
description: 'Pages with patient registration, insurance information, and consent forms'
},
{
name: 'Medical History',
description: 'Pages containing past medical history, medications, allergies, and family history'
},
{
name: 'Lab Results',
description: 'Pages with laboratory test results, blood work, and diagnostic reports'
},
{
name: 'Treatment Notes',
description: 'Pages with doctor\'s notes, treatment plans, and prescriptions'
}
],
'your-api-key',
'mistral' // or 'openai', 'gemini'
);
// Access the results
result.splits.forEach(split => {
console.log(`\n${split.name}:`);
console.log(` Pages: ${split.pages}`);
console.log(` Confidence: ${split.conf}`);
});
// Example output:
// Patient Intake Forms:
// Pages: 1,2,3,4
// Confidence: high
// Medical History:
// Pages: 5,6,7
// Confidence: high
// Lab Results:
// Pages: 8,9,10,11,12
// Confidence: high
// Treatment Notes:
// Pages: 13,14,15,16
// Confidence: high

Key Features:
- 🎯 Automatic Chunking: Handles large documents (100+ pages) by automatically splitting into chunks
- ⚡ Concurrent Processing: Processes chunks in parallel for faster results
- 🎚️ Confidence Scores: Returns "high" or "low" confidence for each classification
- 📊 Page-Level Granularity: Get exact page numbers for each category
- 🔧 Configurable: Adjust chunk size and concurrency limits
Advanced Options:
const result = await classify(
'./large-document.pdf',
[...],
'your-api-key',
'openai',
{
model: 'gpt-4o-mini', // Optional: specify model
chunkSize: 75, // Pages per chunk (default: 75)
maxConcurrent: 5 // Max parallel requests (default: 5)
}
);

Process multiple documents concurrently with automatic error handling for maximum speed.
import { batchOcr, batchExtract } from 'docuglean-ocr';
import { z } from 'zod';
// Batch OCR - Process multiple files
const ocrResults = await batchOcr([
{
filePath: './invoice1.pdf',
provider: 'openai',
apiKey: 'your-api-key',
model: 'gpt-4o-mini'
},
{
filePath: './invoice2.pdf',
provider: 'mistral',
apiKey: 'your-api-key',
model: 'pixtral-12b-2409'
},
{
filePath: './receipt.png',
provider: 'local',
apiKey: 'not-needed'
}
]);
// Handle results - errors don't stop processing
ocrResults.forEach((result, index) => {
if (result.success) {
console.log(`File ${index + 1} processed:`, result.result);
} else {
console.error(`File ${index + 1} failed:`, result.error);
}
});
// Batch Extract - Extract structured data from multiple files
const InvoiceSchema = z.object({
invoice_number: z.string(),
vendor: z.string(),
total: z.number()
});
const extractResults = await batchExtract([
{
filePath: './invoice1.pdf',
provider: 'openai',
apiKey: 'your-api-key',
responseFormat: InvoiceSchema
},
{
filePath: './invoice2.pdf',
provider: 'openai',
apiKey: 'your-api-key',
responseFormat: InvoiceSchema
}
]);
// Get successful extractions
const successful = extractResults.filter(r => r.success);
console.log(`Processed ${successful.length}/${extractResults.length} files`);

Key Features:
- ✅ Automatic error handling
- ✅ Results returned in same order as input
- ✅ Mix different providers in single batch
- ✅ Simple success/failure status for each file
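Because results come back in input order with a simple success flag, retrying failures is straightforward. The helper below is a minimal sketch (not part of docuglean-ocr); it assumes each result has the `{ success, result?, error? }` shape shown above:

```typescript
// Illustrative helper: collect the positions of failed batch items so their
// original configs can be re-submitted in a follow-up batchOcr/batchExtract call.
interface BatchResult<T> {
  success: boolean;
  result?: T;
  error?: string;
}

function failedIndexes<T>(results: BatchResult<T>[]): number[] {
  return results
    .map((r, i) => (r.success ? -1 : i)) // mark successes with -1
    .filter((i) => i !== -1);            // keep only failed positions
}
```

Usage: since results and inputs share ordering, `failedIndexes(ocrResults).map(i => configs[i])` yields exactly the configs to pass back into `batchOcr` for a retry.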
Currently supported providers and models:
- OpenAI: gpt-4.1-mini, gpt-4.1, gpt-4o-mini, gpt-4o, o1-mini, o1, o3, o4-mini
- Mistral: mistral-ocr-latest for OCR. All currently available models except codestral-mamba are supported for structured outputs.
- Google Gemini: gemini-2.5-flash, gemini-2.5-pro, gemini-1.5-flash, gemini-1.5-pro
- Local: No API required - supports DOC, DOCX, PPTX, XLSX, XLS, ODS, ODT, ODP, CSV, TSV, and PDF files
- More coming soon: Together AI, OpenRouter, Anthropic, etc.
Extract text from various document formats without any AI provider:
import { parseDocumentLocal, parsePdf, parseDocx, parsePptx, parseSpreadsheet, parseCsv, parseOdt, parseOdp, parseOds } from 'docuglean-ocr';
// Parse any supported document format
const result = await parseDocumentLocal('./document.pdf');
console.log(result.text);
// Or use specific parsers
const pdf = await parsePdf('./document.pdf'); // PDF
const docx = await parseDocx('./document.docx'); // DOCX (also supports DOC)
const pptx = await parsePptx('./presentation.pptx'); // PowerPoint
const xlsx = await parseSpreadsheet('./data.xlsx'); // Excel (XLSX, XLS)
const csv = await parseCsv('./data.csv'); // CSV/TSV
const odt = await parseOdt('./document.odt'); // OpenDocument Text
const odp = await parseOdp('./presentation.odp'); // OpenDocument Presentation
const ods = await parseOds('./spreadsheet.ods'); // OpenDocument Spreadsheet

Supported Formats:
- Word: DOC, DOCX (via mammoth)
- PowerPoint: PPTX (via officeparser)
- Excel: XLSX, XLS (via officeparser)
- CSV/TSV: CSV, TSV (via d3-dsv)
- OpenDocument: ODT, ODP, ODS (via officeparser)
- PDF: PDF (via pdf2json, or convert to images via pdf-poppler)
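When scanning a directory, it can help to pre-filter paths against the locally supported formats before handing them to `parseDocumentLocal`. This is a hypothetical helper, not part of docuglean-ocr; the extension list mirrors the table above:

```typescript
// Extensions the local parsers handle, per the Supported Formats list.
const LOCAL_FORMATS = new Set([
  'doc', 'docx', 'pptx', 'xlsx', 'xls',
  'ods', 'odt', 'odp', 'csv', 'tsv', 'pdf',
]);

// Returns true when the file extension is one parseDocumentLocal can handle.
function isLocallyParseable(filePath: string): boolean {
  const ext = filePath.split('.').pop()?.toLowerCase() ?? '';
  return LOCAL_FORMATS.has(ext);
}
```

For example, `fileList.filter(isLocallyParseable)` gives the subset safe to parse offline; anything else (say, a PNG) would need an AI-provider OCR call instead.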
interface OCRConfig {
filePath: string;
provider?: 'openai' | 'mistral' | 'gemini';
model?: string;
apiKey: string;
prompt?: string;
options?: {
mistral?: {
includeImageBase64?: boolean;
};
openai?: {
maxTokens?: number;
};
gemini?: {
temperature?: number;
topP?: number;
topK?: number;
};
};
}

interface ExtractConfig {
filePath: string;
apiKey: string;
provider?: 'openai' | 'mistral' | 'gemini';
model?: string;
prompt?: string;
responseFormat?: z.ZodType<any>;
systemPrompt?: string;
}

// Structured extraction with Gemini
const geminiReceipt = await extract({
filePath: './receipt.pdf',
provider: 'gemini',
apiKey: 'your-gemini-api-key',
responseFormat: ReceiptSchema,
prompt: 'Extract receipt information including date, total, and all items'
});
// Structured extraction with different schema
const DocumentSchema = z.object({
title: z.string(),
authors: z.array(z.string()),
summary: z.string()
});
const documentInfo = await extract({
filePath: './research-paper.pdf',
provider: 'openai',
apiKey: 'your-api-key',
responseFormat: DocumentSchema,
prompt: 'Extract document metadata and summary'
});
// Summarization via extract
const SummarySchema = z.object({
title: z.string().optional(),
summary: z.string(),
keyPoints: z.array(z.string()),
});
const summary = await extract({
filePath: './long-report.pdf',
provider: 'openai',
apiKey: 'your-api-key',
responseFormat: SummarySchema,
prompt: 'Provide a concise 3-sentence summary of the document.'
});
console.log('Summary:', summary.summary);

Note: you can also use extract with a targeted "search" prompt (e.g., "Find all occurrences of X and return matching passages") to perform semantic search within a document.
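One way to build such a search prompt is sketched below. The helper and its prompt wording are illustrative, not part of docuglean-ocr:

```typescript
// Hypothetical helper: turn a user query into a targeted prompt for extract().
function buildSearchPrompt(query: string): string {
  return (
    `Find every passage in the document relevant to "${query}". ` +
    `Return each match verbatim along with the page it appears on.`
  );
}
```

You would then pass `buildSearchPrompt('late payment penalties')` as the `prompt` to `extract()`, together with a matching Zod schema such as `z.object({ matches: z.array(z.object({ page: z.number(), text: z.string() })) })`.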
Check out our test folder for more comprehensive examples and use cases, including:
- Receipt parsing
- Document summarization
- Image OCR
- Structured data extraction
- Custom schema validation
⭐ Star this repo to get notified about new releases and updates!
We welcome contributions! Please see the CONTRIBUTING.md file for how to get involved. Issues, questions, and pull requests are all appreciated.
Apache 2.0 - see the LICENSE file for details.
