
Implement two-tier Docling parsing for faster ingestion #388

Open
ArnavAgrawal03 wants to merge 2 commits into main from arnav/investigate-ingestion-slowdown

@ArnavAgrawal03
Collaborator

Summary

Implement a two-tier document parsing strategy to resolve the significant ingestion slowdown caused by always-on OCR. The parser now tries fast text-layer extraction first (no OCR, no table detection), and only falls back to the full OCR pipeline if no text is found, which is exactly when it's needed (scanned/image PDFs).

Impact

For text-based PDFs (the majority), parsing time drops from minutes back to seconds, matching the old unstructured library behavior. Scanned documents still get OCR'd automatically. This should reduce typical ingestion times from 2+ hours back to minutes for batch uploads.

Changes

  • Split Docling converter into two cached instances: fast (no OCR) and full (OCR+tables)
  • _parse_document_local() now tries fast first, returns immediately if text is found
  • Falls back to full OCR only when needed, with clear logging of the strategy used
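The fast-first flow in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `get_converter`, `parse_two_tier`, and the in-memory `_FAKE_TEXT_LAYER` table are hypothetical stand-ins, and the real Docling converters are replaced by plain callables so the example runs on its own.

```python
import logging
from functools import lru_cache
from typing import Callable, Tuple

logger = logging.getLogger(__name__)

# Hypothetical interface: a converter takes a file path and returns extracted text.
Converter = Callable[[str], str]

# Stand-in for real documents: one PDF with a text layer, one scanned PDF without.
_FAKE_TEXT_LAYER = {"report.pdf": "Quarterly results", "scan.pdf": ""}


@lru_cache(maxsize=None)
def get_converter(tier: str) -> Converter:
    """Build each converter once and reuse it (assumed analogue of a cached instance)."""
    if tier == "fast":
        # Assumption: text-layer extraction only, no OCR, no table detection.
        return lambda path: _FAKE_TEXT_LAYER.get(path, "")
    if tier == "full":
        # Assumption: the expensive OCR + table pipeline.
        return lambda path: f"[OCR text of {path}]"
    raise ValueError(f"unknown tier: {tier}")


def parse_two_tier(path: str) -> Tuple[str, str]:
    """Return (text, strategy): try the fast tier, fall back to OCR only if it finds no text."""
    text = get_converter("fast")(path)
    if text.strip():
        logger.info("Parsed %s with fast text-layer extraction", path)
        return text, "fast"
    logger.info("No text layer found in %s; falling back to full OCR", path)
    return get_converter("full")(path), "full"
```

With these stand-ins, `parse_two_tier("report.pdf")` returns from the fast tier without touching the OCR path, while `parse_two_tier("scan.pdf")` falls through to the full tier. Treating whitespace-only output as "no text" is a choice made for this sketch; the PR text only says the fallback happens when no text is found.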

🤖 Generated with Claude Code

ArnavAgrawal03 and others added 2 commits February 18, 2026 17:09
…tion

Replace always-on OCR with a fast-first approach: try text-layer extraction
without OCR/table detection first (seconds), then fall back to full OCR only
if no text is found (scanned/image PDFs). This restores the performance of
the old unstructured parser while maintaining quality for documents that need it.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

…_URI in entrypoint

The docker-build CI was failing because:
- The entrypoint script hardcoded `pg_isready -h postgres` instead of
  parsing the host from POSTGRES_URI
- No PostgreSQL instance was available during the CI test

The entrypoint now extracts host/port/user/db from POSTGRES_URI, and
the CI workflow spins up a pgvector/pgvector:pg16 container on a
shared Docker network before running the health check.
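
To illustrate the host/port/user/db extraction described above, here is a hedged Python sketch. The entrypoint in this PR is a shell script, so this is not its implementation; `parse_postgres_uri`, `PgTarget`, and `pg_isready_args` are hypothetical names, and the fallback values (port 5432, user and database `postgres`) are assumptions rather than anything stated in the commit.

```python
from typing import List, NamedTuple
from urllib.parse import urlsplit


class PgTarget(NamedTuple):
    host: str
    port: int
    user: str
    db: str


def parse_postgres_uri(uri: str) -> PgTarget:
    """Split a POSTGRES_URI-style connection string into its components."""
    parts = urlsplit(uri)
    if not parts.hostname:
        raise ValueError("POSTGRES_URI has no host component")
    return PgTarget(
        host=parts.hostname,
        port=parts.port or 5432,                  # assumed default port
        user=parts.username or "postgres",        # assumed default user
        db=parts.path.lstrip("/") or "postgres",  # assumed default database
    )


def pg_isready_args(uri: str) -> List[str]:
    """Build a pg_isready command line from the URI instead of hardcoding the host."""
    t = parse_postgres_uri(uri)
    return ["pg_isready", "-h", t.host, "-p", str(t.port), "-U", t.user, "-d", t.db]
```

For example, `pg_isready_args("postgresql://u@localhost/app")` yields a command targeting `localhost` on the assumed default port rather than a fixed `postgres` host. Driver-qualified schemes such as `postgresql+asyncpg://` parse the same way, since only the authority and path portions are read.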