An AI-powered fact-checking service that processes video transcriptions and verifies factual claims against authoritative sources. The system extracts fact-checkable statements with precise timecodes from Iconik CSV transcriptions and runs them through a multi-stage verification pipeline.
- Statement Extraction: Uses a two-stage AI pipeline (Claude Haiku for filtering, Claude Sonnet for analysis) to identify fact-checkable claims from transcriptions
- Multi-Source Verification: Cross-references claims against fact-checking organizations, official government data, and authoritative web sources
- Precise Timecodes: Maintains exact timestamps (MM:SS.mmm) linking statements to their location in the source video
- Credibility Scoring: Automatically classifies sources (fact-checkers, government, academic, major news) and weights evidence accordingly
- Cost-Optimized: Two-stage extraction reduces API costs by ~60% compared to single-stage processing
CSV Transcription → Haiku Filter (fast/cheap) → Sonnet Analysis (accurate) → Fact-Checkable Statements
The two-stage approach:
- Stage 1 (Haiku): Quickly scans all segments to identify which contain potential factual claims
- Stage 2 (Sonnet): Deep analysis of candidate segments to extract precise statements with categories and confidence scores
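A minimal sketch of this flow using the `anthropic` SDK; the prompts, helper names, and model IDs here are illustrative assumptions, not the project's actual internals:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def filter_candidates(segments: list[str]) -> list[int]:
    """Stage 1: cheap Haiku pass that flags segments worth deeper analysis."""
    numbered = "\n".join(f"{i}: {text}" for i, text in enumerate(segments))
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # fast/cheap filter model (illustrative)
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Return the comma-separated indices of segments that "
                       f"contain potential factual claims:\n{numbered}",
        }],
    )
    text = response.content[0].text
    return [int(i) for i in text.split(",") if i.strip().isdigit()]

def analyze_candidates(candidates: list[str]) -> str:
    """Stage 2: Sonnet extracts precise statements from surviving segments."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # accurate analysis model (illustrative)
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": "Extract fact-checkable statements with category and "
                       "confidence from these segments:\n" + "\n".join(candidates),
        }],
    )
    return response.content[0].text
```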
Statement → Google Fact Check → FRED API → Tavily Search → AI Synthesis → Verdict
Four verification stages:
- Existing Fact-Checks: Queries Google Fact Check API for existing reviews from PolitiFact, Snopes, etc.
- Official Data: Queries FRED (Federal Reserve Economic Data) for economic statistics
- Web Search: Finds authoritative sources via Tavily with source credibility classification
- AI Synthesis: Claude analyzes all evidence and produces a verdict with citations
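A compact sketch of how the stages might be chained; the stub functions stand in for the real clients under `src/sources/` and `src/verifiers/`, whose actual signatures may differ:

```python
from typing import Any

# Stubs standing in for the real source clients (assumed interfaces).
def query_google_fact_check(statement: str) -> list[dict[str, Any]]: return []
def query_fred(statement: str) -> list[dict[str, Any]]: return []
def search_tavily(statement: str) -> list[dict[str, Any]]: return []
def synthesize_verdict(statement: str, evidence: dict) -> dict[str, Any]:
    return {"verdict": "unverifiable", "explanation": "stub"}

def verify(statement: str) -> dict[str, Any]:
    """Chain the four stages: fact-checks, official data, web search, synthesis."""
    evidence = {
        "fact_checks": query_google_fact_check(statement),  # existing reviews
        "official_data": query_fred(statement),             # FRED statistics
        "web_sources": search_tavily(statement),            # credibility-scored search
    }
    return synthesize_verdict(statement, evidence)          # verdict with citations

print(verify("unemployment is at a 50 year low"))
```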
The synthesizer returns one of seven verdicts:

- `verified_true` - Claim is accurate (requires 2+ supporting sources)
- `verified_false` - Claim is inaccurate (requires 2+ contradicting sources)
- `mostly_true` - Mostly accurate with minor issues
- `mostly_false` - Mostly inaccurate with some truth
- `mixed` - Contains both accurate and inaccurate elements
- `needs_context` - Technically true but misleading without context
- `unverifiable` - Insufficient evidence
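These values map naturally onto an enum; a sketch mirroring the list above (the actual definition lives in `src/models/verification.py` and may differ):

```python
from enum import Enum

class Verdict(Enum):
    VERIFIED_TRUE = "verified_true"
    VERIFIED_FALSE = "verified_false"
    MOSTLY_TRUE = "mostly_true"
    MOSTLY_FALSE = "mostly_false"
    MIXED = "mixed"
    NEEDS_CONTEXT = "needs_context"
    UNVERIFIABLE = "unverifiable"
```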
```bash
# Clone the repository
git clone <repository-url>
cd fact-checker
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate # On fish: source .venv/bin/activate.fish
# Install dependencies
pip3 install -r requirements.txt
# Set up environment variables
cp .env.example .env
# Edit .env with your API keys
```

Create a `.env` file with your API keys:

```
# Required
ANTHROPIC_API_KEY=your_anthropic_key
# Optional (enhances verification)
GOOGLE_FACT_CHECK_API_KEY=your_google_key
FRED_API_KEY=your_fred_key
TAVILY_API_KEY=your_tavily_key
```

| API | Cost | Sign Up |
|---|---|---|
| Anthropic | Paid | https://console.anthropic.com/ |
| Google Fact Check | Free | https://developers.google.com/fact-check/tools/api |
| FRED | Free | https://fred.stlouisfed.org/docs/api/api_key.html |
| Tavily | Free tier | https://tavily.com/ |
```python
from src.parsers.csv_parser import parse_iconik_csv
from src.extractors.statement_extractor import extract_statements
# Parse Iconik CSV transcription
segments = parse_iconik_csv("path/to/transcription.csv")
# Extract fact-checkable statements
result = extract_statements(segments, source_file="transcription.csv")
for stmt in result.statements:
print(f"[{stmt.start_timecode}] {stmt.statement}")
print(f" Category: {stmt.category}, Confidence: {stmt.confidence}")from src.verifiers import verify_statements
# Verify extracted statements
verification_result = verify_statements(
    result.statements,
    source_file="transcription.csv"
)
for verified in verification_result.verified_statements:
print(f"Statement: {verified.original_statement.statement}")
print(f"Verdict: {verified.verdict.value}")
print(f"Explanation: {verified.verdict_explanation}")# Run extraction on sample data
python3 test_extraction.py
# Run verification on extracted statements
python3 test_verification.py
```

```
src/
├── models/
│   ├── transcript.py            # Segment, FactCheckableStatement, ExtractionResult
│   └── verification.py          # Verdict, EvidenceSource, VerifiedStatement
├── parsers/
│   └── csv_parser.py            # Iconik CSV parser
├── extractors/
│   └── statement_extractor.py   # Two-stage AI extraction
├── sources/
│   ├── google_fact_check.py     # Google Fact Check API client
│   ├── fred.py                  # FRED economic data client
│   └── tavily_search.py         # Tavily search with source classification
└── verifiers/
    ├── pipeline.py              # Multi-stage verification orchestration
    └── synthesizer.py           # AI evidence synthesis
```
The extraction pipeline uses Claude Haiku as a fast pre-filter before Claude Sonnet for detailed analysis. This reduces costs by ~60% since most transcript segments don't contain fact-checkable claims. The filtering stage processes 500 segments per API call, while the analysis stage processes 200 at a time.
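A sketch of that batching, using the sizes quoted above; the helper is illustrative:

```python
from itertools import islice
from typing import Iterable, Iterator

FILTER_BATCH_SIZE = 500    # segments per Haiku filter call
ANALYSIS_BATCH_SIZE = 200  # candidate segments per Sonnet analysis call

def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield successive fixed-size chunks (the last chunk may be smaller)."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk

# e.g. 1,200 segments -> 3 filter calls; if ~10% survive,
# 120 candidates -> a single analysis call (under the 200 limit)
```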
Sources are automatically classified and scored:
- Fact-check organizations (0.95): PolitiFact, Snopes, FactCheck.org
- Government (0.90): .gov domains, WHO, World Bank, IMF
- Academic (0.85): .edu domains, Nature, Science, peer-reviewed journals
- Major news (0.70): NYT, WSJ, Reuters, AP, BBC
- Other news (0.50): Other news sources
- Other (0.30): Everything else
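A minimal sketch of domain-based classification with these weights; the domain sets and function name are illustrative, not the project's exact tables:

```python
from urllib.parse import urlparse

FACT_CHECKERS = {"politifact.com", "snopes.com", "factcheck.org"}
GOVERNMENT = {"who.int", "worldbank.org", "imf.org"}
ACADEMIC = {"nature.com", "science.org"}
MAJOR_NEWS = {"nytimes.com", "wsj.com", "reuters.com", "apnews.com", "bbc.com"}

def credibility_score(url: str) -> float:
    """Map a source URL to the credibility weight used when scoring evidence."""
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    if domain in FACT_CHECKERS:
        return 0.95
    if domain.endswith(".gov") or domain in GOVERNMENT:
        return 0.90
    if domain.endswith(".edu") or domain in ACADEMIC:
        return 0.85
    if domain in MAJOR_NEWS:
        return 0.70
    return 0.30  # "other"; a known news-domain list would return 0.50 here
```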
Definitive verdicts (`verified_true` or `verified_false`) require at least 2 corroborating sources. Claims with only a single source are downgraded to `mostly_true` or `mostly_false` to reflect the uncertainty.
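The downgrade rule might look like this; a sketch assuming a simple count of corroborating sources:

```python
def apply_two_source_rule(verdict: str, supporting_sources: int) -> str:
    """Downgrade definitive verdicts that rest on a single source."""
    if verdict == "verified_true" and supporting_sources < 2:
        return "mostly_true"
    if verdict == "verified_false" and supporting_sources < 2:
        return "mostly_false"
    return verdict
```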
The AI synthesizer is constrained to only use information from provided evidence and must cite specific sources for every factual assertion. This prevents hallucination and ensures verifiable explanations.
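One way to enforce that constraint is in the synthesis prompt itself; a sketch assuming evidence dicts with `title`, `url`, and `snippet` keys (the project's actual prompt and schema may differ):

```python
def build_synthesis_prompt(statement: str, evidence: list[dict]) -> str:
    """Assemble a prompt that restricts the model to the provided evidence."""
    numbered = "\n".join(
        f"[{i}] {e['title']} ({e['url']}): {e['snippet']}"
        for i, e in enumerate(evidence, start=1)
    )
    return (
        f"Statement: {statement}\n\n"
        f"Evidence:\n{numbered}\n\n"
        "Using ONLY the evidence above, give a verdict and explanation. "
        "Cite sources as [n] for every factual assertion. If the evidence "
        "is insufficient, answer 'unverifiable' rather than guessing."
    )
```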
The service accepts Iconik CSV transcriptions with the following columns:
- `segment_text`: The transcribed text for each segment
- `time_start_milliseconds`: Segment start time in milliseconds
- `time_end_milliseconds`: Segment end time in milliseconds
- `transcription`: JSON with word-level timing and speaker info (optional)
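The MM:SS.mmm timecodes in the output below are derived from these millisecond columns; a minimal conversion sketch:

```python
def ms_to_timecode(ms: int) -> str:
    """Convert milliseconds to the MM:SS.mmm format used in extraction output."""
    minutes, remainder = divmod(ms, 60_000)
    seconds, millis = divmod(remainder, 1_000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"

assert ms_to_timecode(135_000) == "02:15.000"  # matches the example output below
```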
Example extraction output:

```json
{
"source_file": "transcription.csv",
"extraction_date": "2025-01-26T12:00:00Z",
"statements": [
{
"statement": "unemployment is at a 50 year low",
"start_timecode": "02:15.000",
"end_timecode": "02:18.000",
"context": "Unemployment is at a 50 year low.",
"category": "claim",
"confidence": 0.85
}
]
}
```

Example verification output:

```json
{
"source_file": "transcription.csv",
"verification_date": "2025-01-26T12:05:00Z",
"total_cost": 0.15,
"verified_statements": [
{
"original_statement": { ... },
"verdict": "mostly_true",
"verdict_explanation": "According to FRED data, the unemployment rate...",
"evidence_sources": [ ... ],
"existing_fact_checks": [ ... ],
"official_data": [ ... ],
"confidence": 0.85
}
]
}
```

| Operation | Approximate Cost |
|---|---|
| Extract statements (per 100 segments) | ~$0.01 |
| Verify statement (per statement) | ~$0.03 |
Costs come primarily from Claude API calls; the external APIs (Google, FRED, Tavily) have free tiers sufficient for moderate usage. As a worked example, a 100-segment transcript that yields 20 statements costs roughly $0.01 + 20 × $0.03 ≈ $0.61 to extract and verify.
```bash
# Run unit tests
pytest tests/
# Run with coverage
pytest tests/ --cov=src
```

[Add license information]