DocuCheck is a Python-based tool that extracts factual claims from documents (like PDFs) and performs automated fact-checking using a generative AI model (Google’s Gemini).
It produces a human-readable HTML report (report.html) summarizing extracted claims, internal consistency checks, and external verification results.
- Structural Text Extraction: Parses PDFs to understand document structure (headings vs. paragraphs) for better context.
- AI-Powered Claim Extraction: Uses a generative model to identify and extract factual claims from text.
- Internal Consistency Analysis: Detects contradictions within the document’s own claims.
- External Fact-Checking: Verifies claims against the model’s external knowledge to identify outdated or invalid information.
- Modern HTML Reporting: Generates a clean, single-page HTML report with a visual summary dashboard.
- Result Caching: Caches analysis results to avoid re-processing and repeated API calls.
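As an illustration of the structural extraction feature, here is a minimal sketch of telling headings from paragraphs by font size. It assumes the input has the shape PyMuPDF returns from `page.get_text("dict")` (blocks, then lines, then spans carrying `text` and `size`). The function name `classify_blocks` and the 1.2 size ratio are hypothetical, and DocuCheck's own heuristic may differ.

```python
from statistics import median


def classify_blocks(page_dict, heading_ratio=1.2):
    """Label each text block as 'heading' or 'paragraph' by font size.

    `page_dict` should look like the output of PyMuPDF's
    page.get_text("dict"). The ratio is an illustrative threshold,
    not a value taken from DocuCheck.
    """
    blocks = []
    for block in page_dict.get("blocks", []):
        if block.get("type", 0) != 0:  # type 0 is a text block; skip images
            continue
        spans = [s for line in block.get("lines", []) for s in line.get("spans", [])]
        text = " ".join(s["text"].strip() for s in spans if s["text"].strip())
        if text:
            # Use the largest span size as the block's representative size.
            blocks.append((text, max(s["size"] for s in spans)))

    if not blocks:
        return []

    # Treat the median block size as the body-text size.
    body_size = median(size for _, size in blocks)
    return [
        ("heading" if size >= body_size * heading_ratio else "paragraph", text)
        for text, size in blocks
    ]
```

With PyMuPDF installed, the dictionary for each page can be obtained by opening the file with `fitz.open(path)` and calling `page.get_text("dict")` on each page.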
Docucheck/
├── main.py # Main CLI entry point with argument parsing
├── extractor.py # Handles PDF parsing and claim extraction
├── verifier.py # Handles internal/external fact-checking
├── reporter.py # Generates the final HTML report
├── caching.py # Manages file hashing and result caching
└── utils.py # Shared utilities (e.g., JSON parsing)
run.py # The main script to execute the package
requirements.txt # Python dependencies
.env.example # Example environment variables
.gitignore # Ignores .env, .venv, pycache, etc.
LICENSE # MIT License
README.md # This file
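The tree above describes `utils.py` as holding shared utilities such as JSON parsing. Generative models often wrap JSON replies in a Markdown code fence, so a helper along the following lines may be useful. This is a sketch only: `parse_model_json` is a hypothetical name, and the actual helper in `utils.py` may work differently.

```python
import json
import re

# Built from single backticks so this example can sit inside a fenced block.
FENCE = "`" * 3
_FENCED = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def parse_model_json(raw):
    """Parse JSON from a model reply, tolerating a surrounding code fence.

    Returns the decoded object, or None if the reply is not valid JSON.
    """
    match = _FENCED.search(raw)
    candidate = match.group(1) if match else raw
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None
```

Returning `None` instead of raising lets the caller decide whether to retry the request or skip the claim.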
Clone the repository and navigate into the project directory:

    git clone https://github.com/<your-username>/docucheck.git
    cd docucheck

Create and activate a virtual environment.

Windows (PowerShell):

    python -m venv .venv
    .\.venv\Scripts\Activate.ps1

macOS/Linux (bash):

    python3 -m venv .venv
    source .venv/bin/activate

Install the dependencies:

    pip install -r requirements.txt

Copy the example .env file and add your Gemini API key.
Windows (PowerShell):

    copy .env.example .env
    # Now edit the .env file with Notepad or VS Code

macOS/Linux (bash):

    cp .env.example .env
    # Now edit the .env file with nano, vim, or VS Code

Your .env file should look like this:

    GEMINI_API_KEY=your-api-key-goes-here

Run the application using run.py, passing the path to the document you want to analyze.

    python run.py "path/to/your/document.pdf"

This will analyze the PDF and save the report as report.html in the same directory.
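For reference, the `.env` step can be pictured as follows. This is a minimal standard-library sketch of reading `GEMINI_API_KEY` at startup. The helper names `load_env_file` and `get_api_key` are hypothetical, and the project itself may rely on a package such as python-dotenv instead.

```python
import os


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a .env file into os.environ if not already set."""
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip("'\""))


def get_api_key():
    """Return the Gemini API key, failing early with a clear message."""
    key = os.environ.get("GEMINI_API_KEY", "")
    if not key or key == "your-api-key-goes-here":
        raise RuntimeError("GEMINI_API_KEY is not set; edit your .env file.")
    return key
```

Checking for the placeholder value catches the common case where the example file was copied but never edited.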
| Argument | Description | Required | Default |
|---|---|---|---|
| `input_file` | Path to the input file (.pdf, .txt, etc.) | ✅ | n/a |
| `-o`, `--output` | Path to save the output HTML report | ❌ | `report.html` |
| `-l`, `--limit` | Limit the number of claims to externally fact-check (0 = all) | ❌ | `0` |
| `--force` | Force re-analysis and bypass cached results | ❌ | `False` |
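The arguments in the table map naturally onto Python's `argparse`. The following is a sketch of how such a parser could be declared, not necessarily how `main.py` does it. The function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser matching the documented arguments and defaults."""
    parser = argparse.ArgumentParser(
        prog="run.py",
        description="Extract and fact-check claims from a document.",
    )
    parser.add_argument("input_file", help="Path to the input file (.pdf, .txt, etc.)")
    parser.add_argument("-o", "--output", default="report.html",
                        help="Path to save the output HTML report")
    parser.add_argument("-l", "--limit", type=int, default=0,
                        help="Number of claims to externally fact-check (0 = all)")
    parser.add_argument("--force", action="store_true",
                        help="Force re-analysis and bypass cached results")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["my_research.pdf", "-o", "MyReport.html", "-l", "5", "--force"]
)
```

Using `action="store_true"` gives `--force` its documented default of `False` without requiring a value on the command line.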
    python run.py "my_research.pdf" -o "MyReport.html" -l 5 --force

This command:
- Analyzes `my_research.pdf`
- Saves the report as `MyReport.html`
- Only fact-checks the first 5 claims
- Bypasses the cache
- `run.py` executes the `main()` function in `Docucheck/__main__.py`.
- The script parses command-line arguments.
- `caching.py` generates a SHA-256 hash of the input file and checks for a cached result.
  - If a cached result exists (and `--force` is not used), analysis is skipped and report generation begins.
  - If there is no cached result:
    - `extractor.py` uses PyMuPDF to extract structured text.
    - The text is sent to the Gemini API for claim extraction.
    - `verifier.py`:
      - Checks for internal contradictions.
      - Performs external fact-checking using Gemini’s knowledge.
- `reporter.py` compiles all results into a single HTML report.
- `caching.py` saves all results (claims, contradictions, checks) to a `.Docucheck_Cache` JSON file.
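The caching behavior in this workflow can be sketched as below: hash the input file with SHA-256, reuse a stored result unless a force flag is set, and otherwise run the analysis and store its output. The function names, the `analyze` callback, and the layout of the cache (one JSON object keyed by file hash) are assumptions made for illustration; the real `caching.py` may organize things differently.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def analyze_with_cache(path, analyze, cache_file=".Docucheck_Cache", force=False):
    """Return cached results for `path`, or run `analyze(path)` and cache them.

    `analyze` stands in for the extraction and verification steps and must
    return JSON-serializable data.
    """
    cache_path = Path(cache_file)
    cache = {}
    if cache_path.exists():
        cache = json.loads(cache_path.read_text(encoding="utf-8"))

    key = file_sha256(path)
    if not force and key in cache:
        return cache[key]  # cache hit: skip analysis entirely

    results = analyze(path)
    cache[key] = results
    cache_path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return results
```

Keying on the file's content hash instead of its name means a renamed document still hits the cache, while an edited document is re-analyzed.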
Distributed under the MIT License.
See LICENSE for more information.