🔍 Deep Research Agent
Autonomous Multi-Hop Investigation, Risk Analysis & Evaluation System
This project is a take-home-assessment-ready implementation of an autonomous deep research agent. It investigates a person or company, performs multi-hop web research, extracts facts and risks, assigns confidence scores, produces an overall risk ranking, and ships with a formal evaluation framework based on hidden facts.
The system is designed to show:

- strong agent orchestration
- reliable source-grounded extraction
- explainable risk scoring
- a measurable evaluation methodology
⸻
📌 What this agent does

- Investigates a person or company
- Performs 3–4 hop web research ("digital breadcrumbs")
- Uses multiple search providers (Tavily, Brave, hybrid)
- Extracts structured facts with evidence & confidence
- Identifies risk signals across multiple categories
- Assigns confidence per fact and per risk
- Builds a relationship graph
- Computes category-level risk (Low / Medium / High)
- Computes an overall risk ranking
- Evaluates itself against hidden facts via run_eval.py
⸻
🧠 Models & Search Providers
LLMs

- Claude → planning, judging, reporting, evaluation
- Gemini → fast extraction & fallback judging
Search

- Tavily
- Brave
- Hybrid mode (both, with verification)
⸻
🧭 Core Agent Architecture (LangGraph)
```
plan_queries
  ↓
search (Tavily / Brave / Hybrid)
  ↓
fetch_pages
  ↓
extract_facts
  ↓
validate_and_merge
  ↓
update_leads
  ↓
judge (VOI + rubric)
  ├─ continue → next iteration
  └─ stop → write reports
```
This loop continues until value-of-information (VOI) and a stop/continue rubric determine that additional searching no longer adds meaningful signal.
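A minimal plain-Python sketch of this loop, using hypothetical names (the actual project implements these steps as LangGraph nodes, with an LLM judge applying the VOI rubric):

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    iteration: int = 0
    facts: list = field(default_factory=list)
    new_facts_last_iter: int = 0

def judge_should_continue(state, max_iters=5, min_new_facts=1):
    """Stop/continue rubric: stop when iterations are exhausted or the last
    iteration produced too little new signal (a crude VOI proxy)."""
    if state.iteration >= max_iters:
        return False
    return state.new_facts_last_iter >= min_new_facts

def run_loop(fetch_and_extract, max_iters=5):
    state = AgentState()
    while True:
        # plan → search → fetch → extract, collapsed into one injected callable
        new_facts = fetch_and_extract(state)
        state.new_facts_last_iter = len(new_facts)
        state.facts.extend(new_facts)
        state.iteration += 1
        if not judge_should_continue(state, max_iters):
            break
    return state
```

The design point is that the judge sees only aggregate state (iteration count, fresh-signal volume), so the loop terminates deterministically even if individual steps are noisy.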
⸻
🔍 Parallel Search (Tavily + Brave)
The agent supports hybrid parallel search using multiple providers to improve recall and verification.
Search behavior:

- Executes Tavily and Brave searches concurrently within each iteration
- Merges and deduplicates results before page fetching
- Uses independent providers to reduce bias and increase coverage
This approach balances:

- breadth (via Tavily)
- independent verification (via Brave)
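A sketch of the concurrent-search-then-merge step; the provider clients are stand-ins for the real Tavily/Brave calls, and deduplication here is by URL:

```python
import asyncio

async def hybrid_search(query, providers):
    """Run all provider searches concurrently, then merge results in provider
    order, keeping the first occurrence of each URL."""
    results_per_provider = await asyncio.gather(
        *(provider(query) for provider in providers)
    )
    merged, seen = [], set()
    for results in results_per_provider:
        for r in results:  # each r: {"url": ..., "title": ...}
            if r["url"] not in seen:
                seen.add(r["url"])
                merged.append(r)
    return merged
```

`asyncio.gather` preserves provider order, so the merge is deterministic even though the searches run in parallel.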
⸻
⚡ Latency & Concurrency
Network-bound steps are optimized to keep multi-hop research responsive.
Optimizations include:

- Parallel execution of search queries with bounded concurrency
- Concurrent page fetching using asynchronous I/O
- Respect for provider rate limits
This allows deeper investigations without excessive runtime cost.
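A minimal sketch of bounded-concurrency fetching with `asyncio.Semaphore`; `fetch_page` is a hypothetical stand-in for the real HTTP fetcher:

```python
import asyncio

async def fetch_all(urls, fetch_page, max_concurrency=5):
    """Fetch all URLs concurrently, but never more than max_concurrency at
    once — a simple way to respect provider rate limits."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:
            return await fetch_page(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(u) for u in urls))
```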
⸻
🧩 Breadcrumb Reliability Improvements
To support deep multi-hop research, this version includes:
Lead Queue (deterministic)

- Extracts candidate "next-hop" entities (companies, affiliates, related parties, subsidiaries, executives) from validated facts and relationship edges
- Prioritizes leads based on relevance and evidence strength
- Persists top leads in the agent state so the planner can intentionally chase deeper breadcrumbs across iterations
Filing / Registry Query Templates

- Injected only when relevant
- Examples:
  - SEC / EDGAR filings
  - ownership structures
  - related-party disclosures
Strong Visited-URL Memory

- Canonicalizes URLs
- Avoids re-fetching duplicates
- Hash-based deduplication
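A sketch of the canonicalize-then-hash idea; the tracking-parameter list and normalization rules below are illustrative, not the project's exact ones:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that change the URL string but not the page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "fbclid"}

def canonicalize(url):
    """Normalize a URL: lowercase scheme/host, drop fragments and tracking
    params, sort the remaining query, and strip trailing slashes."""
    parts = urlsplit(url)
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    ))
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/") or "/", query, ""))

def url_key(url):
    """Stable dedup key for the visited-URL set."""
    return hashlib.sha256(canonicalize(url).encode()).hexdigest()
```

With this key, cosmetic variants of the same page (tracking params, trailing slash, host casing) collapse to one entry and are fetched only once.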
⸻
🧾 Facts Extraction
During each iteration, the agent extracts structured facts from fetched content.
Each fact includes:

- a clear factual statement
- the entity it refers to
- one or more supporting evidence URLs
- a confidence score, adjusted by:
  - source agreement
  - repetition across independent sources
  - verification checks

Facts are merged and deduplicated across iterations so that:

- repeated confirmations increase confidence
- contradictions trigger additional verification steps
These extracted facts are later used for:

- risk detection
- confidence calibration
- evaluation against hidden facts
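The merge-and-boost behavior can be sketched as follows; the field names, dedup key, and boost constant are illustrative, not the project's actual schema:

```python
def merge_fact(store, fact, boost=0.1):
    """Merge a fact into the store, keyed by (entity, normalized statement).
    New evidence URLs for an existing fact count as independent confirmation
    and raise confidence, capped at 0.99."""
    key = (fact["entity"].lower(), fact["statement"].lower().strip())
    if key not in store:
        store[key] = dict(fact, evidence=set(fact["evidence"]))
        return store[key]
    existing = store[key]
    new_urls = set(fact["evidence"]) - existing["evidence"]
    if new_urls:  # confirmation from a source we have not seen before
        existing["evidence"] |= new_urls
        existing["confidence"] = min(0.99, existing["confidence"] + boost)
    return existing
```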
⸻
🌐 Sources & Evidence Handling
The agent relies strictly on publicly accessible sources, including:
- reputable news outlets
- company disclosures
- government and regulatory sites
- Wikipedia / Wikidata
- archived pages (Wayback Machine)
- public LinkedIn metadata (no login, no scraping)
For each source, the system records:

- URL
- title
- snippet or extracted passage
- which facts or risks reference it
If a page is unavailable or blocked:

- the agent attempts a Wayback Machine fallback
- otherwise it logs the failure and continues with independent sources
Sources are reused during:

- fact validation
- risk confirmation
- evaluation (run_eval.py) when checking hidden facts
🌐 Wikipedia & Wikidata Enrichment
The pipeline uses Wikipedia and Wikidata at two distinct stages.
Usage:

- Bootstrap (before search):
  - Resolve aliases
  - Reduce name ambiguity for query planning
- Enrichment (after validation):
  - Attach canonical IDs and URLs
  - Add short contextual descriptions to the final report
This improves clarity, grounding, and consistency across sources.
🕰 Wayback Machine Fallback
To handle unavailable or blocked pages, the agent supports a Wayback Machine fallback.
Fallback behavior:

- Queries the Wayback Machine CDX API when fetches fail
- Retrieves the closest archived snapshot
- Extracts text from the archived version
This increases robustness against link rot, blocking, and transient failures.
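A hedged sketch of the fallback; for simplicity this version uses the public Wayback "availability" endpoint rather than the CDX API, and takes an injected `fetch_json` callable so the network layer can be swapped or stubbed:

```python
from urllib.parse import urlencode

def wayback_fallback(url, fetch_json):
    """Return the URL of the closest archived snapshot of `url`, or None if
    no snapshot is available. fetch_json(api_url) -> dict is injected."""
    api = "https://archive.org/wayback/available?" + urlencode({"url": url})
    data = fetch_json(api)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]  # fetch this instead of the dead original
    return None
```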
🔒 LinkedIn Handling (Public-Only)
This project does not scrape behind login walls.
If a LinkedIn URL appears:

- The fetch step attempts public-only extraction using OpenGraph metadata
- If access is blocked, the URL is logged as partial evidence
- The agent continues using independent public sources such as:
  - SEC filings
  - corporate registries
  - press releases
  - reputable media
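Public-only OpenGraph extraction needs nothing beyond the standard library; a minimal sketch (not the project's actual fetcher) that pulls `og:*` tags out of a page's `<head>`:

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> tags into a dict."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            prop = d.get("property", "")
            if prop.startswith("og:") and "content" in d:
                self.og[prop[3:]] = d["content"]

def extract_opengraph(html):
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.og
```

Because OpenGraph metadata is served in the public HTML for link previews, reading it requires no login and no scraping of gated content.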
⸻
Risk Categories

The agent evaluates risk across multiple categories, such as:

- Legal / Regulatory
- Financial
- Governance
- Security
- Ethics & Compliance
- Reputation & Public Controversy
⸻
🧮 Overall Risk Ranking & Category-Level Decisions (Detailed)
At the top of each generated report, the system presents:

- Risk level per category: Low, Medium, or High
- Overall risk ranking for the entity: Low, Medium, or High
This section explains exactly how those levels are computed.
📊 Risk Categories
Each extracted risk signal is first classified into one of the following categories:

- Legal / Regulatory
- Financial
- Governance
- Security
- Ethics & Compliance
- Operational
- Reputational / Public Controversy
Each category is evaluated independently first, before computing the overall risk.
⚖️ Risk Weighting Strategy
Not all risk categories contribute equally to the final decision.
The system applies higher weight to risks with direct material or regulatory impact, and lower weight to softer signals.

Higher-weight categories

- Legal / Regulatory
- Financial
- Security
- Governance

Lower-weight categories

- Reputation
- Public controversy
- Personal or character-based issues
This prevents reputational noise from overpowering material risk.
🧩 Risk Signal Scoring (Per Risk)
Each individual risk signal includes:

- Severity (low / medium / high)
- Confidence score (based on evidence strength and source agreement)
- Category weight
A simplified internal scoring model is: risk_score = severity × confidence × category_weight
Where:

- severity reflects how serious the issue is
- confidence reflects how well-supported it is
- category_weight reflects material importance
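A worked example of the scoring model, using illustrative severity values and category weights rather than the project's calibrated constants:

```python
# Illustrative constants — not the project's tuned values.
SEVERITY = {"low": 1, "medium": 2, "high": 3}
CATEGORY_WEIGHT = {"legal": 1.0, "financial": 1.0, "security": 0.9,
                   "governance": 0.9, "reputational": 0.4}

def risk_score(severity, confidence, category):
    """risk_score = severity × confidence × category_weight"""
    return SEVERITY[severity] * confidence * CATEGORY_WEIGHT[category]

# A well-supported high-severity legal risk (3 × 0.9 × 1.0 ≈ 2.7) outweighs
# an equally supported reputational one (3 × 0.9 × 0.4 ≈ 1.08).
```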
⸻
🧠 Category-Level Risk Decision
For each category:

1. All risk scores in that category are aggregated
2. The aggregate score is compared against thresholds
Example thresholds (conceptual):

- Low: minimal or weakly supported risk signals
- Medium: multiple signals, or one strong signal with moderate confidence
- High: strong, well-supported risk signals with material impact
This produces a clear Low / Medium / High decision per category.
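The category decision can be sketched as a sum-then-threshold step; the cutoff values below are illustrative design choices:

```python
def category_level(scores, low_max=1.5, medium_max=4.0):
    """Aggregate per-risk scores in one category and map the total to a
    Low / Medium / High label. Thresholds are illustrative."""
    total = sum(scores)
    if total <= low_max:
        return "Low"
    if total <= medium_max:
        return "Medium"
    return "High"
```

With these cutoffs, one strong signal lands in Medium while several strong signals (or one very strong plus supporting ones) push the category to High — matching the conceptual thresholds above.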
🧾 Overall Risk Ranking
The overall risk ranking is computed by:

1. Taking all category-level scores
2. Applying category weights again
3. Aggregating into a final score
4. Mapping the final score to Low, Medium, or High
This ensures that:

- One high-confidence legal or financial risk can elevate overall risk
- Multiple medium risks across categories are reflected
- Low-impact reputational issues alone do not dominate the outcome
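The overall ranking step can be sketched the same way; the level scores, weights, and cutoffs below are illustrative values chosen to exhibit the three properties above, not the project's tuned constants:

```python
CATEGORY_WEIGHT = {"legal": 1.0, "financial": 1.0, "security": 0.9,
                   "governance": 0.9, "operational": 0.7, "reputational": 0.4}
LEVEL_SCORE = {"Low": 0.0, "Medium": 1.0, "High": 2.0}

def overall_ranking(category_levels, low_max=0.8, medium_max=2.5):
    """Map {category: level} to an overall Low / Medium / High ranking by
    re-weighting each category's level and thresholding the sum."""
    total = sum(LEVEL_SCORE[lvl] * CATEGORY_WEIGHT.get(cat, 0.5)
                for cat, lvl in category_levels.items())
    if total <= low_max:
        return "Low"
    if total <= medium_max:
        return "Medium"
    return "High"
```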
⸻
🧪 Consensus-Style Verification (Independent Source Check)
When medium or high risk signals are detected, the agent performs a second-pass verification step.
Verification process:

- Triggers an independent Brave Search sweep
- Fetches and re-extracts corroborating or contradicting evidence
- Adjusts confidence scores based on cross-source agreement
This reduces false positives and increases trust in elevated risk assessments.
⸻
🔄 Contradiction-Driven Follow-Up
When verification evidence contradicts earlier extracted facts, the agent responds explicitly.
Behaviour:

- Records a medium-severity risk flag
- Forces at least one additional research iteration
- Injects follow-up queries derived from contradiction signals into the planner
This ensures unresolved or conflicting claims are not ignored.
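A sketch of this behaviour with hypothetical state fields and an illustrative follow-up query template:

```python
def handle_contradiction(state, fact, contradicting_url):
    """Record a medium-severity flag, queue a follow-up query, and force
    another iteration. All field names here are illustrative."""
    state["risk_flags"].append({
        "severity": "medium",
        "description": f"Contradictory evidence for: {fact['statement']}",
        "evidence": [contradicting_url],
    })
    state["pending_queries"].append(
        f'"{fact["entity"]}" {fact["statement"]} dispute OR denial OR correction'
    )
    state["force_next_iteration"] = True
    return state
```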
✅ Why This Design
This approach makes the system:

- Explainable — every risk is traceable to facts and sources
- Robust — avoids overreacting to weak signals
- Auditable — decisions can be inspected category by category
- Realistic — mirrors how human analysts assess risk
⸻
🔧 Setup (Prerequisites)

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Copy the example env file:

```bash
cp .env.example .env
```

🔑 API Keys (.env)

Fill in the keys you plan to use:

Models

- ANTHROPIC_API_KEY (Model A: planning / judging / reporting)
- GOOGLE_API_KEY (Model B: extraction)

Search

- TAVILY_API_KEY / BRAVE_API_KEY — required if using --search-provider tavily or --search-provider hybrid
Run a Single Entity
```bash
python -m deep_research_agent.cli \
  --entity "Sam Altman" \
  --entity-type person \
  --max-iters 5 \
  --outdir outputs/demo_sam_altman
```
Run Evaluation
```bash
python eval/run_eval.py \
  --cases eval/cases.json \
  --outdir outputs/eval_runs \
  --max-iters 6 \
  --search-provider hybrid
```
⸻
📊 Outputs (per run)
Each run writes to --outdir:

- report_full.json – full structured output
- report.json – slim structured report
- report.md – human-readable report
- graph.json – relationship graph (nodes, edges, evidence)
- stdout.txt / stderr.txt
⸻
🧪 Evaluation Framework (Hidden Facts)
The project includes an evaluation set in eval/cases.json.
Each case contains:

- an entity (person or company)
- multiple hidden facts — non-obvious facts discoverable only via deeper research
The evaluation goal is to measure how many hidden facts the agent's research actually recovered.
🔬 How Evaluation Works (run_eval.py)

1. Runs the agent normally (CLI)
2. Loads report_full.json
3. Builds a large evidence pool from:
   - summary
   - key findings
   - facts (all sub-fields)
   - risks (descriptions, evidence)
   - markdown report
   - sources (titles, snippets, URLs)
4. Groups evidence into candidate packs
5. Sends aggregated evidence to a judge model
6. If needed, opens agent-collected URLs
7. Produces:
   - match / no-match
   - similarity score
   - evidence quote
   - match explanation
Output

- eval_result.md
- eval_result.json
- eval_summary.json
Each clearly shows:

- which hidden facts matched
- where the evidence came from
- overall match percentage
⸻
🕸 Relationship Graph
Each run outputs:

- graph.json (nodes + edges + evidence)
- Optional .gml for visualization
⸻
🧠 Why this matters
This project demonstrates:

- autonomous research
- multi-provider verification
- structured reasoning
- explainable risk decisions
- measurable evaluation, not just output generation
It is designed to be auditable, defensible, and extensible.