
n8layman/ohseer

ohseer

Lifecycle: experimental

A unified R interface to multiple OCR (Optical Character Recognition) APIs. Process documents with Claude (Opus 4.6/Sonnet 4.5), Mistral OCR 3, Tensorlake, or AWS Textract using a single, consistent function.

Documentation

📚 Full documentation: https://n8layman.github.io/ohseer/

Part of the EcoExtract Suite

OhSeeR is the foundational first step in the EcoExtract Suite, a collection of R packages designed for extracting and structuring ecological data from academic literature.

Workflow: Source PDF Documents → OhSeeR (OCR) → sanitizeR (text cleaning) → whispeR (prompts) → LLM API → structuR (structured data) → auditR (validation) → Structured Dataset

Features

  • Unified interface: Use ohseer_ocr() with any provider
  • Provider fallback: Automatic failover if one provider fails
  • Multiple OCR providers:
    • Claude Opus 4.6: #1 on OCR Arena leaderboards, structured outputs with JSON schemas
    • Tensorlake: Highest accuracy (91.7%), best for tables and forms
    • Mistral OCR 3: Native markdown output, cost-effective
    • AWS Textract: Reliable option for structured data extraction
  • Consistent output: Same top-level result structure (provider, pages, raw, error_log) across all providers
  • Lightweight: No heavy dependencies, uses httr2 for all API calls

Installation

# Using pak (recommended)
pak::pak("n8layman/ohseer")

# Using devtools
devtools::install_github("n8layman/ohseer")

# Using remotes
remotes::install_github("n8layman/ohseer")

Authentication

Set up API keys as environment variables:

# Set for the current session
Sys.setenv(
  ANTHROPIC_API_KEY = "your-claude-key",        # For Claude
  TENSORLAKE_API_KEY = "your-tensorlake-key",   # For Tensorlake
  MISTRAL_API_KEY = "your-mistral-key",         # For Mistral
  AWS_ACCESS_KEY_ID = "your-aws-key",           # For AWS Textract
  AWS_SECRET_ACCESS_KEY = "your-aws-secret"     # For AWS Textract
)

Or create a .env file in your project directory:

# .env
ANTHROPIC_API_KEY=your-claude-key
TENSORLAKE_API_KEY=your-tensorlake-key
MISTRAL_API_KEY=your-mistral-key
AWS_ACCESS_KEY_ID=your-aws-key
AWS_SECRET_ACCESS_KEY=your-aws-secret

⚠️ Security: Never commit .env files to version control. Add .env to your .gitignore.
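
The README does not say whether ohseer reads the .env file on its own, so the following is a minimal sketch that loads it explicitly with base R's readRenviron() and then checks which providers have credentials set. The helper available_providers() and the label aws_textract are hypothetical names used only for this illustration; they are not part of ohseer.

```r
# Load KEY=value pairs from a .env file into the current R session.
# readRenviron() is base R and sets the variables via the environment.
env_file <- ".env"
if (file.exists(env_file)) {
  readRenviron(env_file)
}

# Environment variables each provider needs (variable names taken from above).
# "aws_textract" is only a label for this sketch, not necessarily the string
# that ohseer_ocr() accepts for AWS Textract.
required_keys <- list(
  claude       = "ANTHROPIC_API_KEY",
  tensorlake   = "TENSORLAKE_API_KEY",
  mistral      = "MISTRAL_API_KEY",
  aws_textract = c("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
)

# Hypothetical helper: return the providers whose variables are all non-empty.
available_providers <- function(keys = required_keys) {
  has_keys <- vapply(
    keys,
    function(vars) all(nzchar(Sys.getenv(vars))),
    logical(1)
  )
  names(keys)[has_keys]
}

print(available_providers())
```

Checking credentials up front can make a missing key easier to spot than a failed API request.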

Getting API Keys

Quick Start

Basic Usage

library(ohseer)

# Process with default provider (Tensorlake)
result <- ohseer_ocr("document.pdf")

# Access extracted pages
pages <- result$pages
provider_used <- result$provider

Choose a Specific Provider

# Use Claude (ranked #1 on OCR Arena)
result <- ohseer_ocr("document.pdf", provider = "claude")

# Use Mistral for cost-effectiveness
result <- ohseer_ocr("document.pdf", provider = "mistral")

# Use Tensorlake for best table extraction
result <- ohseer_ocr("document.pdf", provider = "tensorlake")

Provider Fallback

Automatically try multiple providers in order until one succeeds:

# Try Tensorlake first (highest quality), fall back to Mistral (lower cost), then Claude
result <- ohseer_ocr(
  "document.pdf",
  provider = c("tensorlake", "mistral", "claude")
)

# Check which provider succeeded
message("Used provider: ", result$provider)

# Check if any providers failed
if (!is.na(result$error_log)) {
  errors <- jsonlite::fromJSON(result$error_log)
  print(errors)
}
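
To make the fallback behaviour concrete, here is a rough sketch of how trying providers in order and recording failures could be written with tryCatch(). It is not ohseer's actual implementation: try_providers() and the stub handlers are hypothetical, and errors are kept in a plain named list rather than the JSON string that ohseer returns in error_log.

```r
# Hypothetical sketch of ordered provider fallback (not ohseer's internals).
# `handlers` is a named list of functions, one per provider, each taking a file path.
try_providers <- function(file, providers, handlers) {
  errors <- list()
  for (p in providers) {
    result <- tryCatch(
      handlers[[p]](file),
      error = function(e) {
        # Record the failure and signal "no result" so the loop moves on.
        errors[[p]] <<- conditionMessage(e)
        NULL
      }
    )
    if (!is.null(result)) {
      return(list(provider = p, pages = result, errors = errors))
    }
  }
  stop("All providers failed: ", paste(names(errors), collapse = ", "))
}

# Stub handlers standing in for real API calls.
handlers <- list(
  tensorlake = function(file) stop("simulated outage"),
  mistral    = function(file) list("text of page 1")
)

out <- try_providers("document.pdf", c("tensorlake", "mistral"), handlers)
message("Used provider: ", out$provider)
```

In this sketch the first provider fails, its message is stored under its name, and the second provider's result is returned.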

Select Specific Pages

# Process only the first 2 pages
result <- ohseer_ocr("document.pdf", pages = c(1, 2))

# Process specific pages
result <- ohseer_ocr("document.pdf", pages = c(1, 5, 10))

Provider-Specific Options

Each provider accepts its own custom parameters via ...:

# Mistral: extract headers and footers separately
result <- ohseer_ocr(
  "document.pdf",
  provider = "mistral",
  extract_header = TRUE,
  extract_footer = TRUE
)

# Claude: use Sonnet instead of Opus, custom schema
result <- ohseer_ocr(
  "document.pdf",
  provider = "claude",
  model = "claude-sonnet-4-5",
  schema = my_custom_schema
)

# Tensorlake: use different model
result <- ohseer_ocr(
  "document.pdf",
  provider = "tensorlake",
  model = "high-quality-v1"
)

Output Format

All providers return a consistent structure when using ohseer_ocr():

result <- ohseer_ocr("document.pdf")

# Result structure:
# $provider  - Character: which provider was used
# $pages     - List: extracted page data (format varies by provider)
# $raw       - List: raw API response
# $error_log - Character (JSON): errors from failed providers, or NA

Note: Each provider returns pages in its own native format. See the provider-specific vignettes in the full documentation for details.
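
As an illustration of working with the four top-level fields, the sketch below builds a mock result and summarizes it without calling any API. The helper describe_result() is hypothetical, and the contents of the mock pages are invented, since the real page format depends on the provider.

```r
# A mock result with the four documented top-level fields.
# The page contents are made up; real pages follow each provider's format.
mock_result <- list(
  provider  = "mistral",
  pages     = list(list(text = "Page one"), list(text = "Page two")),
  raw       = list(),
  error_log = NA_character_
)

# Hypothetical helper: validate the top-level fields and summarize the result.
describe_result <- function(result) {
  expected <- c("provider", "pages", "raw", "error_log")
  missing_fields <- setdiff(expected, names(result))
  if (length(missing_fields) > 0) {
    stop("Result is missing fields: ", paste(missing_fields, collapse = ", "))
  }
  list(
    provider   = result$provider,
    n_pages    = length(result$pages),
    had_errors = !is.na(result$error_log)
  )
}

summary_info <- describe_result(mock_result)
str(summary_info)
```

Relying only on the top-level fields keeps downstream code independent of the provider that produced the pages.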

Provider Comparison

| Provider | Accuracy | Speed | Cost | Best For |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 | ⭐⭐⭐⭐⭐ (#1 OCR Arena) | Medium | High | Structured outputs, custom schemas |
| Tensorlake | ⭐⭐⭐⭐⭐ (91.7%) | Fast | $0.01/page | Tables, forms, batch processing |
| Mistral OCR 3 | ⭐⭐⭐ | Very Fast | Low | Markdown output, cost-sensitive |
| AWS Textract | ⭐⭐⭐⭐ (88.4%) | Fast | Medium | AWS ecosystem, reliability |

Advanced Usage

For provider-specific functions and advanced features, see the full documentation.

Notes

License

MIT License
