🔍 Deep Research Agent
Autonomous Multi-Hop Investigation, Risk Analysis & Evaluation System
This project is a take-home-assessment-ready implementation of an autonomous deep research agent. It investigates a person or company, performs multi-hop web research, extracts facts and risks, assigns confidence scores, produces an overall risk ranking, and ships with a formal evaluation framework based on hidden facts.
The system is designed to show:

- strong agent orchestration
- reliable source-grounded extraction
- explainable risk scoring
- a measurable evaluation methodology
⸻
📌 What this agent does

- Investigates a person or company
- Performs 3–4 hop web research ("digital breadcrumbs")
- Uses multiple search providers (Tavily, Brave, hybrid)
- Extracts structured facts with evidence & confidence
- Identifies risk signals across multiple categories
- Assigns confidence per fact and per risk
- Builds a relationship graph
- Computes category-level risk (Low / Medium / High)
- Computes an overall risk ranking
- Evaluates itself against hidden facts via run_eval.py
⸻
🧠 Models & Search Providers
LLMs

- Claude → planning, judging, reporting, evaluation
- Gemini → fast extraction & fallback judging
Search

- Tavily
- Brave
- Hybrid mode (both, with verification)
⸻
🧭 Core Agent Architecture (LangGraph)
```
plan_queries
  ↓
search (Tavily / Brave / Hybrid)
  ↓
fetch_pages
  ↓
extract_facts
  ↓
validate_and_merge
  ↓
update_leads
  ↓
judge (VOI + rubric)
  ├─ continue → next iteration
  └─ stop → write reports
```
This loop continues until value-of-information (VOI) and a stop/continue rubric determine that additional searching no longer adds meaningful signal.
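A minimal plain-Python sketch of this loop, using hypothetical names (the actual project implements these steps as LangGraph nodes, with an LLM judge applying the VOI rubric):

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    iteration: int = 0
    facts: list = field(default_factory=list)
    new_facts_last_iter: int = 0

def judge_should_continue(state, max_iters=5, min_new_facts=1):
    """Stop/continue rubric: stop when iterations are exhausted or the last
    iteration produced too little new signal (a crude VOI proxy)."""
    if state.iteration >= max_iters:
        return False
    return state.new_facts_last_iter >= min_new_facts

def run_loop(fetch_and_extract, max_iters=5):
    state = AgentState()
    while True:
        # plan → search → fetch → extract, collapsed into one injected callable
        new_facts = fetch_and_extract(state)
        state.new_facts_last_iter = len(new_facts)
        state.facts.extend(new_facts)
        state.iteration += 1
        if not judge_should_continue(state, max_iters):
            break
    return state
```

The design point is that the judge sees only aggregate state (iteration count, fresh-signal volume), so the loop terminates deterministically even if individual steps are noisy.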
⸻
🔍 Parallel Search (Tavily + Brave)
The agent supports hybrid parallel search using multiple providers to improve recall and verification.
Search behavior:

- Executes Tavily and Brave searches concurrently within each iteration
- Merges and deduplicates results before page fetching
- Uses independent providers to reduce bias and increase coverage
This approach balances:

- breadth (via Tavily)
- independent verification (via Brave)
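A sketch of the concurrent-search-then-merge step; the provider clients are stand-ins for the real Tavily/Brave calls, and deduplication here is by URL:

```python
import asyncio

async def hybrid_search(query, providers):
    """Run all provider searches concurrently, then merge results in provider
    order, keeping the first occurrence of each URL."""
    results_per_provider = await asyncio.gather(
        *(provider(query) for provider in providers)
    )
    merged, seen = [], set()
    for results in results_per_provider:
        for r in results:  # each r: {"url": ..., "title": ...}
            if r["url"] not in seen:
                seen.add(r["url"])
                merged.append(r)
    return merged
```

`asyncio.gather` preserves provider order, so the merge is deterministic even though the searches run in parallel.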
⸻
⚡ Latency & Concurrency
Network-bound steps are optimized to keep multi-hop research responsive.
Optimizations include:

- Parallel execution of search queries with bounded concurrency
- Concurrent page fetching using asynchronous I/O
- Respect for provider rate limits
This allows deeper investigations without excessive runtime cost.
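A minimal sketch of bounded-concurrency fetching with `asyncio.Semaphore`; `fetch_page` is a hypothetical stand-in for the real HTTP fetcher:

```python
import asyncio

async def fetch_all(urls, fetch_page, max_concurrency=5):
    """Fetch all URLs concurrently, but never more than max_concurrency at
    once — a simple way to respect provider rate limits."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:
            return await fetch_page(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(u) for u in urls))
```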
⸻
🧩 Breadcrumb Reliability Improvements
To support deep multi-hop research, this version includes:
Lead Queue (deterministic)

- Extracts candidate "next-hop" entities (companies, affiliates, related parties, subsidiaries, executives) from validated facts and relationship edges
- Prioritizes leads based on relevance and evidence strength
- Persists top leads in the agent state so the planner can intentionally chase deeper breadcrumbs across iterations
Filing / Registry Query Templates

- Injected only when relevant
- Examples:
  - SEC / EDGAR filings
  - ownership structures
  - related-party disclosures
Strong Visited-URL Memory

- Canonicalizes URLs
- Avoids re-fetching duplicates
- Hash-based deduplication
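A sketch of the canonicalize-then-hash idea; the tracking-parameter list and normalization rules below are illustrative, not the project's exact ones:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that change the URL string but not the page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "fbclid"}

def canonicalize(url):
    """Normalize a URL: lowercase scheme/host, drop fragments and tracking
    params, sort the remaining query, and strip trailing slashes."""
    parts = urlsplit(url)
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    ))
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.rstrip("/") or "/", query, ""))

def url_key(url):
    """Stable dedup key for the visited-URL set."""
    return hashlib.sha256(canonicalize(url).encode()).hexdigest()
```

With this key, cosmetic variants of the same page (tracking params, trailing slash, host casing) collapse to one entry and are fetched only once.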
⸻
🧾 Facts Extraction
During each iteration, the agent extracts structured facts from fetched content.
Each fact includes:

- a clear factual statement
- the entity it refers to
- one or more supporting evidence URLs
- a confidence score, adjusted by:
  - source agreement
  - repetition across independent sources
  - verification checks

Facts are merged and deduplicated across iterations so that:

- repeated confirmations increase confidence
- contradictions trigger additional verification steps
These extracted facts are later used for:

- risk detection
- confidence calibration
- evaluation against hidden facts
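The merge-and-boost behavior can be sketched as follows; the field names, dedup key, and boost constant are illustrative, not the project's actual schema:

```python
def merge_fact(store, fact, boost=0.1):
    """Merge a fact into the store, keyed by (entity, normalized statement).
    New evidence URLs for an existing fact count as independent confirmation
    and raise confidence, capped at 0.99."""
    key = (fact["entity"].lower(), fact["statement"].lower().strip())
    if key not in store:
        store[key] = dict(fact, evidence=set(fact["evidence"]))
        return store[key]
    existing = store[key]
    new_urls = set(fact["evidence"]) - existing["evidence"]
    if new_urls:  # confirmation from a source we have not seen before
        existing["evidence"] |= new_urls
        existing["confidence"] = min(0.99, existing["confidence"] + boost)
    return existing
```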
⸻
🌐 Sources & Evidence Handling
The agent relies strictly on publicly accessible sources, including:
- reputable news outlets
- company disclosures
- government and regulatory sites
- Wikipedia / Wikidata
- archived pages (Wayback Machine)
- public LinkedIn metadata (no login, no scraping)
For each source, the system records:

- URL
- title
- snippet or extracted passage
- which facts or risks reference it
If a page is unavailable or blocked:

- the agent attempts a Wayback Machine fallback
- otherwise it logs the failure and continues with independent sources
Sources are reused during:

- fact validation
- risk confirmation
- evaluation (run_eval.py) when checking hidden facts
🌐 Wikipedia & Wikidata Enrichment
The pipeline uses Wikipedia and Wikidata at two distinct stages.
Usage:

- Bootstrap (before search):
  - Resolve aliases
  - Reduce name ambiguity for query planning
- Enrichment (after validation):
  - Attach canonical IDs and URLs
  - Add short contextual descriptions to the final report
This improves clarity, grounding, and consistency across sources.
🕰 Wayback Machine Fallback
To handle unavailable or blocked pages, the agent supports a Wayback Machine fallback.
Fallback behavior:

- Queries the Wayback Machine CDX API when fetches fail
- Retrieves the closest archived snapshot
- Extracts text from the archived version
This increases robustness against link rot, blocking, and transient failures.
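A hedged sketch of the fallback; for simplicity this version uses the public Wayback "availability" endpoint rather than the CDX API, and takes an injected `fetch_json` callable so the network layer can be swapped or stubbed:

```python
from urllib.parse import urlencode

def wayback_fallback(url, fetch_json):
    """Return the URL of the closest archived snapshot of `url`, or None if
    no snapshot is available. fetch_json(api_url) -> dict is injected."""
    api = "https://archive.org/wayback/available?" + urlencode({"url": url})
    data = fetch_json(api)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]  # fetch this instead of the dead original
    return None
```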
🔒 LinkedIn Handling (Public-Only)
This project does not scrape behind login walls.
If a LinkedIn URL appears:

- The fetch step attempts public-only extraction using OpenGraph metadata
- If access is blocked, the URL is logged as partial evidence
- The agent continues using independent public sources such as:
  - SEC filings
  - corporate registries
  - press releases
  - reputable media
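Public-only OpenGraph extraction needs nothing beyond the standard library; a minimal sketch (not the project's actual fetcher) that pulls `og:*` tags out of a page's `<head>`:

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> tags into a dict."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            prop = d.get("property", "")
            if prop.startswith("og:") and "content" in d:
                self.og[prop[3:]] = d["content"]

def extract_opengraph(html):
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.og
```

Because OpenGraph metadata is served in the public HTML for link previews, reading it requires no login and no scraping of gated content.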
⸻
Risk Categories

The agent evaluates risk across multiple categories, such as:

- Legal / Regulatory
- Financial
- Governance
- Security
- Ethics & Compliance
- Reputation & Public Controversy
⸻
🧮 Overall Risk Ranking & Category-Level Decisions (Detailed)
At the top of each generated report, the system presents:

- Risk level per category: Low, Medium, or High
- Overall risk ranking for the entity: Low, Medium, or High
This section explains exactly how those levels are computed.
📊 Risk Categories
Each extracted risk signal is first classified into one of the following categories:

- Legal / Regulatory
- Financial
- Governance
- Security
- Ethics & Compliance
- Operational
- Reputational / Public Controversy
Each category is evaluated independently first, before computing the overall risk.
⚖️ Risk Weighting Strategy
Not all risk categories contribute equally to the final decision.
The system applies higher weight to risks with direct material or regulatory impact, and lower weight to softer signals.

Higher-weight categories

- Legal / Regulatory
- Financial
- Security
- Governance

Lower-weight categories

- Reputation
- Public controversy
- Personal or character-based issues
This prevents reputational noise from overpowering material risk.
🧩 Risk Signal Scoring (Per Risk)
Each individual risk signal includes:

- Severity (low / medium / high)
- Confidence score (based on evidence strength and source agreement)
- Category weight
A simplified internal scoring model is: risk_score = severity × confidence × category_weight
Where:

- severity reflects how serious the issue is
- confidence reflects how well-supported it is
- category_weight reflects material importance
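A worked example of the scoring model, using illustrative severity values and category weights rather than the project's calibrated constants:

```python
# Illustrative constants — not the project's tuned values.
SEVERITY = {"low": 1, "medium": 2, "high": 3}
CATEGORY_WEIGHT = {"legal": 1.0, "financial": 1.0, "security": 0.9,
                   "governance": 0.9, "reputational": 0.4}

def risk_score(severity, confidence, category):
    """risk_score = severity × confidence × category_weight"""
    return SEVERITY[severity] * confidence * CATEGORY_WEIGHT[category]

# A well-supported high-severity legal risk (3 × 0.9 × 1.0 ≈ 2.7) outweighs
# an equally supported reputational one (3 × 0.9 × 0.4 ≈ 1.08).
```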
⸻
🧠 Category-Level Risk Decision
For each category:

1. All risk scores in that category are aggregated
2. The aggregate score is compared against thresholds
Example thresholds (conceptual):

- Low: minimal or weakly supported risk signals
- Medium: multiple signals, or one strong signal with moderate confidence
- High: strong, well-supported risk signals with material impact
This produces a clear Low / Medium / High decision per category.
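The category decision can be sketched as a sum-then-threshold step; the cutoff values below are illustrative design choices:

```python
def category_level(scores, low_max=1.5, medium_max=4.0):
    """Aggregate per-risk scores in one category and map the total to a
    Low / Medium / High label. Thresholds are illustrative."""
    total = sum(scores)
    if total <= low_max:
        return "Low"
    if total <= medium_max:
        return "Medium"
    return "High"
```

With these cutoffs, one strong signal lands in Medium while several strong signals (or one very strong plus supporting ones) push the category to High — matching the conceptual thresholds above.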
🧾 Overall Risk Ranking
The overall risk ranking is computed by:

1. Taking all category-level scores
2. Applying category weights again
3. Aggregating into a final score
4. Mapping the final score to Low, Medium, or High
This ensures that:

- One high-confidence legal or financial risk can elevate overall risk
- Multiple medium risks across categories are reflected
- Low-impact reputational issues alone do not dominate the outcome
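The overall ranking step can be sketched the same way; the level scores, weights, and cutoffs below are illustrative values chosen to exhibit the three properties above, not the project's tuned constants:

```python
CATEGORY_WEIGHT = {"legal": 1.0, "financial": 1.0, "security": 0.9,
                   "governance": 0.9, "operational": 0.7, "reputational": 0.4}
LEVEL_SCORE = {"Low": 0.0, "Medium": 1.0, "High": 2.0}

def overall_ranking(category_levels, low_max=0.8, medium_max=2.5):
    """Map {category: level} to an overall Low / Medium / High ranking by
    re-weighting each category's level and thresholding the sum."""
    total = sum(LEVEL_SCORE[lvl] * CATEGORY_WEIGHT.get(cat, 0.5)
                for cat, lvl in category_levels.items())
    if total <= low_max:
        return "Low"
    if total <= medium_max:
        return "Medium"
    return "High"
```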
⸻
🧪 Consensus-Style Verification (Independent Source Check)
When medium or high risk signals are detected, the agent performs a second-pass verification step.
Verification process:

- Triggers an independent Brave Search sweep
- Fetches and re-extracts corroborating or contradicting evidence
- Adjusts confidence scores based on cross-source agreement
This reduces false positives and increases trust in elevated risk assessments.
⸻
🔄 Contradiction-Driven Follow-Up
When verification evidence contradicts earlier extracted facts, the agent responds explicitly.
Behaviour:

- Records a medium-severity risk flag
- Forces at least one additional research iteration
- Injects follow-up queries derived from contradiction signals into the planner
This ensures unresolved or conflicting claims are not ignored.
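A sketch of this behaviour with hypothetical state fields and an illustrative follow-up query template:

```python
def handle_contradiction(state, fact, contradicting_url):
    """Record a medium-severity flag, queue a follow-up query, and force
    another iteration. All field names here are illustrative."""
    state["risk_flags"].append({
        "severity": "medium",
        "description": f"Contradictory evidence for: {fact['statement']}",
        "evidence": [contradicting_url],
    })
    state["pending_queries"].append(
        f'"{fact["entity"]}" {fact["statement"]} dispute OR denial OR correction'
    )
    state["force_next_iteration"] = True
    return state
```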
✅ Why This Design
This approach makes the system:

- Explainable — every risk is traceable to facts and sources
- Robust — avoids overreacting to weak signals
- Auditable — decisions can be inspected category by category
- Realistic — mirrors how human analysts assess risk
⸻
🔧 Setup (Prerequisites)

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Copy the example env file:

```bash
cp .env.example .env
```

🔑 API Keys (.env)

Fill in the keys you plan to use:

Models

- ANTHROPIC_API_KEY (Model A: planning / judging / reporting)
- GOOGLE_API_KEY (Model B: extraction)

Search

- TAVILY_API_KEY / BRAVE_API_KEY — required if using --search-provider tavily or --search-provider hybrid
Run a Single Entity
```bash
python -m deep_research_agent.cli \
  --entity "Sam Altman" \
  --entity-type person \
  --max-iters 5 \
  --outdir outputs/demo_sam_altman
```
Run Evaluation
```bash
python eval/run_eval.py \
  --cases eval/cases.json \
  --outdir outputs/eval_runs \
  --max-iters 6 \
  --search-provider hybrid
```
⸻
📊 Outputs (per run)
Each run writes to --outdir:

- report_full.json – full structured output
- report.json – slim structured report
- report.md – human-readable report
- graph.json – relationship graph (nodes, edges, evidence)
- stdout.txt / stderr.txt
⸻
🧪 Evaluation Framework (Hidden Facts)
The project includes an evaluation set in eval/cases.json.
Each case contains:

- an entity (person or company)
- multiple hidden facts — non-obvious facts discoverable only via deeper research
The evaluation goal is to measure how many hidden facts the agent's research actually recovered.
🔬 How Evaluation Works (run_eval.py)

1. Runs the agent normally (CLI)
2. Loads report_full.json
3. Builds a large evidence pool from:
   - summary
   - key findings
   - facts (all sub-fields)
   - risks (descriptions, evidence)
   - markdown report
   - sources (titles, snippets, URLs)
4. Groups evidence into candidate packs
5. Sends aggregated evidence to a judge model
6. If needed, opens agent-collected URLs
7. Produces:
   - match / no-match
   - similarity score
   - evidence quote
   - match explanation
Output

- eval_result.md
- eval_result.json
- eval_summary.json
Each clearly shows:

- which hidden facts matched
- where the evidence came from
- overall match percentage
⸻
🕸 Relationship Graph
Each run outputs:

- graph.json (nodes + edges + evidence)
- Optional .gml for visualization
⸻
🧠 Why this matters
This project demonstrates:

- autonomous research
- multi-provider verification
- structured reasoning
- explainable risk decisions
- measurable evaluation, not just output generation
It is designed to be auditable, defensible, and extensible.