ohseer

A unified R interface to multiple OCR (Optical Character Recognition) APIs. Process documents with Claude (Opus 4.6/Sonnet 4.5), Mistral OCR 3, Tensorlake, or AWS Textract using a single, consistent function.

Documentation

📚 Full documentation: https://n8layman.github.io/ohseer/

Part of the EcoExtract Suite

OhSeeR is the first step in the EcoExtract Suite, a collection of R packages designed for extracting and structuring ecological data from academic literature.

Workflow: Source PDF Documents → OhSeeR (OCR) → sanitizeR (text cleaning) → whispeR (prompts) → LLM API → structuR (structured data) → auditR (validation) → Structured Dataset

Features

- Unified interface: Use ohseer_ocr() with any provider
- Provider fallback: Automatic failover if one provider fails
- Multiple OCR providers:
  - Claude Opus 4.6: #1 on OCR Arena leaderboards, structured outputs with JSON schemas
  - Tensorlake: Highest accuracy (91.7%), best for tables and forms
  - Mistral OCR 3: Native markdown output, cost-effective
  - AWS Textract: Reliable option for structured data extraction
- Consistent output: Same interface across all providers
- Lightweight: No heavy dependencies, uses httr2 for all API calls

Installation

# Using pak (recommended)
pak::pak("n8layman/ohseer")

# Using devtools
devtools::install_github("n8layman/ohseer")

# Using remotes
remotes::install_github("n8layman/ohseer")

Authentication

Set up API keys as environment variables:

# Set for the current session
Sys.setenv(
  ANTHROPIC_API_KEY = "your-claude-key",        # For Claude
  TENSORLAKE_API_KEY = "your-tensorlake-key",   # For Tensorlake
  MISTRAL_API_KEY = "your-mistral-key",         # For Mistral
  AWS_ACCESS_KEY_ID = "your-aws-key",           # For AWS Textract
  AWS_SECRET_ACCESS_KEY = "your-aws-secret"     # For AWS Textract
)

Or create a .env file in your project directory:

# .env
ANTHROPIC_API_KEY=your-claude-key
TENSORLAKE_API_KEY=your-tensorlake-key
MISTRAL_API_KEY=your-mistral-key
AWS_ACCESS_KEY_ID=your-aws-key
AWS_SECRET_ACCESS_KEY=your-aws-secret

⚠️ Security: Never commit .env files to version control. Add .env to your .gitignore.
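The text above does not say whether ohseer reads a .env file automatically. If it does not, one option is base R's readRenviron(), which parses KEY=value lines into the session environment. The sketch below uses a throwaway file and a made-up variable name (OHSEER_DEMO_KEY) so it runs anywhere; in a real project you would point it at your own .env.

```r
# Write a throwaway .env-style file so the example is self-contained.
# OHSEER_DEMO_KEY is a made-up variable used only for this demonstration.
env_file <- tempfile(fileext = ".env")
writeLines(c("# demo file", "OHSEER_DEMO_KEY=not-a-real-key"), env_file)

# readRenviron() is base R; it returns TRUE (invisibly) if the file was read.
loaded <- readRenviron(env_file)

# Report which provider keys are set, without printing their values.
required <- c("ANTHROPIC_API_KEY", "TENSORLAKE_API_KEY", "MISTRAL_API_KEY",
              "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
key_status <- nzchar(Sys.getenv(required))
names(key_status) <- required
print(key_status)
```

Only the keys for the providers you plan to call need to be present.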

Getting API Keys

Claude: console.anthropic.com → API Keys
Tensorlake: cloud.tensorlake.ai → Dashboard → API Key
Mistral: mistral.ai → Try the API → API keys
AWS Textract: aws.amazon.com → IAM → Create access key with AmazonTextractFullAccess

Quick Start

Basic Usage

library(ohseer)

# Process with default provider (Tensorlake)
result <- ohseer_ocr("document.pdf")

# Access extracted pages
pages <- result$pages
provider_used <- result$provider

Choose a Specific Provider

# Use Claude for structured outputs (ranked #1 on OCR Arena)
result <- ohseer_ocr("document.pdf", provider = "claude")

# Use Mistral for cost-effectiveness
result <- ohseer_ocr("document.pdf", provider = "mistral")

# Use Tensorlake for best table extraction
result <- ohseer_ocr("document.pdf", provider = "tensorlake")

Provider Fallback

Automatically try multiple providers in order until one succeeds:

# Try Tensorlake first, then fall back to Mistral, then Claude
result <- ohseer_ocr(
  "document.pdf",
  provider = c("tensorlake", "mistral", "claude")
)

# Check which provider succeeded
message("Used provider: ", result$provider)

# Check if any providers failed
if (!is.na(result$error_log)) {
  errors <- jsonlite::fromJSON(result$error_log)
  print(errors)
}
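To make the fallback behavior concrete, here is a minimal sketch of the general pattern: try each provider in order, record failures, and return on the first success. It is not ohseer's implementation. The provider functions are hypothetical mocks, and try_providers() is a name invented for this illustration.

```r
# Hypothetical stand-ins for provider calls; these are not part of ohseer.
mock_providers <- list(
  tensorlake = function(path) stop("simulated timeout"),
  mistral    = function(path) list(list(page = 1, text = "hello"))
)

# Try each provider in order and keep a log of the ones that failed.
try_providers <- function(path, providers) {
  errors <- list()
  for (name in names(providers)) {
    outcome <- tryCatch(
      list(ok = TRUE, pages = providers[[name]](path)),
      error = function(e) list(ok = FALSE, message = conditionMessage(e))
    )
    if (outcome$ok) {
      return(list(provider = name, pages = outcome$pages, errors = errors))
    }
    errors[[name]] <- outcome$message
  }
  stop("All providers failed: ", paste(names(errors), collapse = ", "))
}

demo <- try_providers("document.pdf", mock_providers)
message("Used provider: ", demo$provider)
```

Here the first mock fails and the second succeeds, so demo$provider is "mistral" and demo$errors holds the message from the failed attempt.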

Select Specific Pages

# Process only first 2 pages
result <- ohseer_ocr("document.pdf", pages = c(1, 2))

# Process specific pages
result <- ohseer_ocr("document.pdf", pages = c(1, 5, 10))

Provider-Specific Options

Each provider accepts its own custom parameters via ...:

# Mistral: extract headers and footers separately
result <- ohseer_ocr(
  "document.pdf",
  provider = "mistral",
  extract_header = TRUE,
  extract_footer = TRUE
)

# Claude: use Sonnet instead of Opus, with a custom schema
# (my_custom_schema is a placeholder; define your schema object first)
result <- ohseer_ocr(
  "document.pdf",
  provider = "claude",
  model = "claude-sonnet-4-5",
  schema = my_custom_schema
)

# Tensorlake: use different model
result <- ohseer_ocr(
  "document.pdf",
  provider = "tensorlake",
  model = "high-quality-v1"
)
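The mechanism behind provider-specific options is R's `...` argument, which lets a wrapper forward arguments it does not name itself. The sketch below shows that pattern with two hypothetical backends. The function names and default values are invented for illustration and do not reflect ohseer's internals.

```r
# Hypothetical backends with different optional arguments; not ohseer internals.
mock_mistral <- function(path, extract_header = FALSE, extract_footer = FALSE) {
  list(backend = "mistral", header = extract_header, footer = extract_footer)
}
mock_claude <- function(path, model = "default-model", schema = NULL) {
  list(backend = "claude", model = model, has_schema = !is.null(schema))
}

# The wrapper does not name provider options; it passes `...` through untouched.
ocr_dispatch <- function(path, provider, ...) {
  backends <- list(mistral = mock_mistral, claude = mock_claude)
  if (!provider %in% names(backends)) {
    stop("Unknown provider: ", provider)
  }
  backends[[provider]](path, ...)
}

mistral_call <- ocr_dispatch("document.pdf", "mistral", extract_header = TRUE)
claude_call  <- ocr_dispatch("document.pdf", "claude", model = "claude-sonnet-4-5")
```

In this sketch, passing an option the chosen backend does not define fails with R's usual "unused argument" error, so a mistyped option name surfaces immediately.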

Output Format

All providers return a consistent structure when using ohseer_ocr():

result <- ohseer_ocr("document.pdf")

# Result structure:
# $provider  - Character: which provider was used
# $pages     - List: extracted page data (format varies by provider)
# $raw       - List: raw API response
# $error_log - Character (JSON): errors from failed providers, or NA

Note: Each provider returns pages in its own native format. See the provider-specific vignettes for details.
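Because the top-level fields are the same for every provider, downstream code can check them before touching provider-specific page data. The sketch below uses a mock list shaped like the structure documented above. The page contents are made up, and the real contents of $pages depend on the provider.

```r
# Mock object shaped like the documented return value; page content is invented.
mock_result <- list(
  provider  = "mistral",
  pages     = list(list(text = "Page one"), list(text = "Page two")),
  raw       = list(),
  error_log = NA_character_
)

# Confirm the four documented top-level fields are present.
expected_fields <- c("provider", "pages", "raw", "error_log")
stopifnot(all(expected_fields %in% names(mock_result)))

# Summarize the result without assuming anything about the page format.
n_pages <- length(mock_result$pages)
had_failures <- !is.na(mock_result$error_log)
message("Provider: ", mock_result$provider,
        "; pages: ", n_pages,
        "; earlier failures: ", had_failures)
```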

Provider Comparison

| Provider | Accuracy | Speed | Cost | Best For |
|---|---|---|---|---|
| Claude Opus 4.6 | ⭐⭐⭐⭐⭐ (#1 OCR Arena) | Medium | High | Structured outputs, custom schemas |
| Tensorlake | ⭐⭐⭐⭐⭐ (91.7%) | Fast | $0.01/page | Tables, forms, batch processing |
| Mistral OCR 3 | ⭐⭐⭐ | Very Fast | Low | Markdown output, cost-sensitive |
| AWS Textract | ⭐⭐⭐⭐ (88.4%) | Fast | Medium | AWS ecosystem, reliability |
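Tensorlake is the only provider listed with a per-page price, so a rough budget for it is simple arithmetic. In this sketch the page counts are invented, and actual billing may differ from the listed figure, so treat the result as an estimate.

```r
price_per_page <- 0.01               # Tensorlake figure from the comparison table
page_counts <- c(12L, 48L, 190L)     # hypothetical page counts for three PDFs

total_pages <- sum(page_counts)
estimated_cost <- total_pages * price_per_page

message(sprintf("Estimated cost: $%.2f for %d pages", estimated_cost, total_pages))
```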

Advanced Usage

For provider-specific functions and advanced features, see:

Complete Function Reference
Unified Interface Guide
Provider guides: Tensorlake | Mistral | Claude

Notes

- This package is experimental and the API may change.
- Large files may take time to process, depending on the provider.
- Check each provider's documentation for pricing and rate limits.

License

MIT License
