A comprehensive multi-agent research system that conducts autonomous scientific research from literature review through publication-quality PDF generation. The system coordinates 8 specialized subagents to execute a complete research pipeline: literature review, theory formalization, experimental design, data collection, experimentation, statistical analysis, and report writing.
```bash
# Install dependencies
pip install -r requirements.txt

# Make Django migrations
cd research_platform
python manage.py makemigrations
python manage.py migrate

# Run the server
python manage.py runserver
```

The system executes a complete 11-step scientific research pipeline:
1. Lead Agent Orchestration - Decomposes the research query into 2-4 distinct subtopics
2. Parallel Literature Review - Spawns 2-4 literature-reviewer subagents simultaneously; each creates `evidence_sheet.json` with quantitative metrics
3. Wait & Verify - Lead agent confirms all literature reviews are complete and the evidence sheet exists
4. Theory Formalization - Theorist subagent formalizes the mathematical/conceptual framework and hypothesis
5. Experimental Design - Experimental-designer creates `experiment_plan.json` specifying parameter grids, ablations, and robustness checks
6. Data Collection - Data-collector identifies real-world datasets or justifies synthetic data
7. Experimentation - Experimentalist implements and executes all configurations → `results_table.json`
8. Statistical Analysis - Analyst performs hypothesis tests with 95% CIs and p-values → `comparison_*.json`
9. Follow-up Experiments - If the primary hypothesis fails (discovery mode), automatically proposes and executes diagnostic experiments
10. Report Writing - Report-writer synthesizes all outputs into a publication-ready LaTeX manuscript
11. PDF Compilation - LaTeX-compiler generates the final PDF with error handling
The system uses Anthropic's Claude Agent SDK to define 8 specialized subagents, each with specific models, tools, and outputs:
| Agent | Model | Tools | Purpose | Outputs |
|---|---|---|---|---|
| lead-agent | Haiku | Task | Orchestrates entire pipeline; spawns subagents sequentially | Session logs |
| literature-reviewer | Haiku | WebSearch, Write | Surveys academic literature; creates quantitative evidence sheet | lit_review_*.md, evidence_sheet.json |
| theorist | Opus | Write | Formalizes mathematical/conceptual framework; writes pseudocode blueprint | theory_*.md |
| experimental-designer | Sonnet | Read, Write | Designs experiment configurations with parameter grids and ablations | experiment_plan.json |
| data-collector | Sonnet | WebSearch, Write | Identifies real-world datasets; justifies synthetic data if needed | dataset_*.md |
| experimentalist | Opus | Read, Write, Bash | Implements and executes all experiment configurations | results_table.json, results_table.csv, experiment code |
| analyst | Sonnet | Read, Write, Bash | Performs statistical analysis; tests hypotheses; proposes follow-ups | comparison_*.json, analysis_summary.json, followup_plan.json |
| report-writer | Sonnet | Glob, Read, Write | Synthesizes all outputs into publication-ready LaTeX manuscript | *_paper.tex |
| latex-compiler | Sonnet | Read, Write, Bash | Compiles .tex to PDF; fixes compilation errors | Final PDF report |
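For illustration, a single subagent declaration might look like the sketch below. This assumes the `claude_agent_sdk` package's `AgentDefinition` interface; the description string is illustrative, and the prompt file mapping mirrors `research_agent/prompts/`.

```python
# A sketch, assuming the claude_agent_sdk package's AgentDefinition interface.
from claude_agent_sdk import AgentDefinition, ClaudeAgentOptions

options = ClaudeAgentOptions(
    agents={
        "literature-reviewer": AgentDefinition(
            description="Surveys academic literature for one subtopic",
            prompt=open("research_agent/prompts/researcher.txt").read(),
            tools=["WebSearch", "Write"],  # tool allowlist from the table above
            model="haiku",                 # cost-effective tier for literature review
        ),
    },
)
```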
Sequential Dependency Chain:
```
Literature Review → Theorist (reads evidence_sheet.json)
                  → Experimental Designer (reads evidence_sheet + theory)
                  → Data Collector (reads experiment_plan)
                  → Experimentalist (reads experiment_plan + data docs)
                  → Analyst (reads results_table + experiment_plan + evidence_sheet)
                  → Report Writer (reads ALL outputs)
                  → LaTeX Compiler (compiles .tex)
```
Parallel Execution:
- Lead agent spawns 2-4 literature reviewers simultaneously for different subtopics
- All must complete before theorist stage begins
Mixed Model Strategy:
- Opus: Complex reasoning tasks (theory formalization, experimentation)
- Sonnet: Intermediate tasks (experimental design, analysis, report writing)
- Haiku: Orchestration and literature review (cost-effective for coordination)
The system uses type-safe data classes for structured communication between agents. These enable explicit handoffs and prevent misunderstandings:
Core Classes (from `research_agent/data_structures.py`):
- `EvidenceSheet`: Quantitative findings from literature
  - Metric ranges, sample sizes, known pitfalls, academic references
  - Provides the baseline for hypothesis testing
- `ExperimentPlan`: Specifies all configurations to test
  - Parameter grids (e.g., `learning_rate: [0.001, 0.01, 0.1]`)
  - Ablations (e.g., remove dropout, change activation function)
  - Robustness checklists (domain-specific requirements)
  - Data collection guidelines
- `ExperimentConfig`: Single experiment specification
  - Parameter sweep definitions
  - Expected runtime estimates
- `ResultsTable`: Structured output from the experimentalist
  - Config name, parameters, metrics, standard errors
  - Enables programmatic analysis
- `AnalysisSummary`: Statistical comparison results
  - Metric name, 95% confidence intervals, p-values
  - Conclusions backed by statistical tests
- `FollowUpPlan`: Diagnostic hypotheses
  - Generated when the primary hypothesis fails
  - Proposes targeted experiments to identify root causes
- `RobustnessChecklist`: Domain-specific robustness requirements
  - E.g., for ML: convergence analysis, sensitivity to hyperparameters
All classes support JSON serialization/deserialization for file-based agent communication.
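A minimal sketch of this pattern, using `AnalysisSummary` as an example; the field names beyond those described above are illustrative rather than the exact schema in `data_structures.py`:

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AnalysisSummary:
    """Statistical comparison results handed from the analyst to the report writer."""
    metric_name: str
    ci_95: tuple[float, float]  # 95% confidence interval (lower, upper)
    p_value: float
    conclusion: str

    def to_json(self, path: str) -> None:
        # Write the dataclass to disk so the next agent can pick it up
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)

    @classmethod
    def from_json(cls, path: str) -> "AnalysisSummary":
        with open(path) as f:
            return cls(**json.load(f))
```

File-based JSON handoffs let each subagent run independently while downstream agents read exactly the fields they expect.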
- Parallel Research: Multiple subagents research different subtopics simultaneously for faster literature coverage
- Statistical Rigor: Bootstrap confidence intervals, Diebold-Mariano tests, hypothesis tests with p-values
- Structured Communication: Type-safe data classes prevent inter-agent misunderstandings
- Adaptive Inquiry: Automatically proposes follow-up diagnostic experiments if primary hypothesis fails
- Reproducibility: All code, configurations, data, and analysis saved; full audit trail in session logs
- Mixed Model Strategy: Optimizes cost/performance by using Opus for complex reasoning, Sonnet for intermediate tasks, Haiku for orchestration
- Web Integration Ready: Programmatic API (`agent_api.py`) enables integration with web applications
Scientific Research:
- "Research quantum error correction codes and compare stabilizer vs. surface codes"
- "Investigate transformer attention mechanisms and test scaled dot-product vs. alternative variants"
- "Analyze renewable energy storage solutions and benchmark lithium-ion vs. flow batteries"
Machine Learning:
- "Compare gradient descent optimizers (SGD, Adam, RMSprop) on image classification tasks"
- "Evaluate regularization techniques (dropout, L2, early stopping) for preventing overfitting"
Algorithm Analysis:
- "Benchmark sorting algorithms (quicksort, mergesort, heapsort) across different data distributions"
Research outputs are organized in two directories:
```
files/
├── research_notes/          # Literature review outputs
│   ├── lit_review_*.md
│   └── evidence_sheet.json
├── theory/                  # Theory formalization documents
│   └── theory_*.md
├── data/                    # Dataset documentation
│   └── dataset_*.md
├── experiments/             # Experiment code and configurations
│   ├── experiment_*.py
│   └── experiment_plan.json
├── results/                 # Experiment results
│   ├── results_table.json
│   ├── results_table.csv
│   ├── comparison_*.json
│   ├── analysis_summary.json
│   └── followup_plan.json   # (if needed)
├── charts/                  # PNG visualizations (referenced in paper)
│   └── *.png
└── reports/                 # Final LaTeX manuscript and PDF
    ├── *_paper.tex
    └── *_paper.pdf

logs/
└── session_YYYYMMDD_HHMMSS/
    ├── transcript.txt       # Human-readable conversation
    ├── tool_calls.jsonl     # Structured tool usage log
    └── agent_prompts.txt    # Full system prompts for debugging
```
```
Research Agent/
│
├── research_agent/           # Core multi-agent research system
│   ├── agent.py              # CLI entry point (interactive mode)
│   ├── agent_api.py          # Programmatic API (for web integration)
│   ├── data_structures.py    # Type-safe data classes for inter-agent communication
│   ├── statistics.py         # Statistical analysis tools (bootstrap CIs, hypothesis tests)
│   ├── prompts/              # Agent prompt templates (12 specialized prompts)
│   │   ├── lead_agent.txt    # Pipeline orchestration logic
│   │   ├── researcher.txt    # Literature review strategy
│   │   ├── theory.txt        # Theory formalization guidelines
│   │   ├── experimental_design.txt
│   │   ├── data_collector.txt
│   │   ├── experimentalist.txt
│   │   ├── analyst.txt
│   │   ├── report_writer.txt
│   │   └── latex_compiler.txt
│   └── utils/
│       ├── subagent_tracker.py  # Tracks tool calls via SDK hooks
│       ├── transcript.py        # Session logging
│       └── message_handler.py   # Processes assistant responses
│
├── research_platform/        # Django web application
│   ├── agents/               # Main Django app
│   │   ├── models.py         # Database models (UserProfile, ResearchSession, etc.)
│   │   ├── views.py          # Web views (dashboard, session detail, downloads)
│   │   ├── services.py       # ResearchAgentService (bridge to research_agent/)
│   │   └── encryption.py     # API key encryption with Fernet
│   ├── research_platform/    # Django settings
│   ├── templates/            # HTML templates
│   ├── static/               # CSS, JavaScript
│   └── manage.py             # Django management commands
│
├── backend/                  # FastAPI REST + WebSocket server
│   ├── main.py               # FastAPI app initialization
│   ├── api/                  # REST endpoints
│   │   ├── research.py       # Research submission
│   │   ├── sessions.py       # Session management
│   │   └── websocket.py      # Real-time updates
│   └── services/
│       ├── session_manager.py  # Session discovery and parsing
│       └── file_watcher.py     # Monitors tool_calls.jsonl for updates
│
├── frontend/                 # React + TypeScript UI
│   └── src/
│       ├── pages/            # Dashboard, NewResearch, SessionDetail
│       ├── components/       # PipelinePhaseIndicator, SubagentCard, ToolCallTimeline
│       ├── contexts/         # SessionContext (state management)
│       └── services/         # API client (Axios)
│
└── files/                    # Research outputs (generated at runtime)
```
research_agent/ - Core Multi-Agent Research System
- Standalone CLI tool for running research
- Can be used directly via `python research_agent/agent.py`
- Generates research papers through multi-agent coordination
- Uses Anthropic's Claude Agent SDK
- Entry points:
  - `agent.py` - interactive CLI mode
  - `agent_api.py` - programmatic API (used by web integration)
research_platform/ - Django Web Application
- User authentication and profile management
- Encrypted API key storage (Fernet symmetric encryption)
- Session persistence in relational database
- File management for research outputs
- Peer review feedback mechanism for iterative improvements
- Admin dashboard
backend/ - FastAPI REST + WebSocket Server
- REST API for research submission and session management
- WebSocket streaming for real-time progress updates
- File watcher monitors `tool_calls.jsonl` for new events
- Broadcasts tool calls and subagent spawns to connected clients
frontend/ - React + TypeScript UI
- Modern web interface for research management
- Dashboard with session overview and status tracking
- Live progress visualization (pipeline phases, subagent activity, tool calls)
- Real-time updates via WebSocket connection
The system has two operational modes:
```
User (Terminal)
  → research_agent/agent.py
  → Claude API (multi-agent execution)
  → files/ (research outputs)
```
Use this for direct research execution without the web interface.
```
User (Browser)
  → React Frontend (UI)
  → FastAPI Backend (REST + WebSocket)
  → Django Platform (auth, persistence, file management)
  → research_agent/agent_api.py (programmatic API)
  → Claude API (multi-agent execution)
  → files/ (research outputs)
```
The web application provides:
- User authentication and API key encryption
- Session history and management
- Real-time progress tracking with visual pipeline indicators
- File downloads (PDFs, CSVs, logs)
- Peer review feedback for iterative improvements
Integration Points:
- Django's `ResearchAgentService` calls `research_agent.agent_api.run_research_query()`
- FastAPI's `FileWatcher` monitors `logs/session_*/tool_calls.jsonl` for real-time updates
- React components subscribe to the WebSocket for live progress display
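As an illustration, a hypothetical call into the programmatic API might look like this; the parameters of `run_research_query()` are assumptions for the sketch, not a documented signature:

```python
import os

from research_agent.agent_api import run_research_query

# Hypothetical parameters: only the function name comes from the integration
# notes above; the actual signature may differ.
run_research_query(
    query="Compare gradient descent optimizers (SGD, Adam, RMSprop)",
    mode="demo",                              # "discovery" (default) or "demo"
    api_key=os.environ["ANTHROPIC_API_KEY"],  # decrypted in memory by the platform
)
```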
The system tracks all tool calls using SDK hooks to enable debugging, logging, and real-time progress visualization in the web UI.
- Who: Which agent (LITERATURE-REVIEWER-1, EXPERIMENTALIST-1, etc.)
- What: Tool name (WebSearch, Write, Bash, etc.)
- When: Timestamp of invocation
- Input/Output: Parameters passed and results returned
Hooks intercept every tool call before and after execution:
```python
from anthropic_agent.hooks import Hooks

hooks = Hooks(
    pre_tool_use=[tracker.pre_tool_use_hook],
    post_tool_use=[tracker.post_tool_use_hook]
)
```

The `parent_tool_use_id` links tool calls to their subagent:
- Lead Agent spawns a Researcher via the `Task` tool → gets ID `"task_123"`
- All tool calls from that Researcher include `parent_tool_use_id = "task_123"`
- Hooks use this ID to identify which subagent made the call
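A hypothetical sketch of how a tracker's pre-tool-use hook could use this ID; the event fields shown here are assumptions, not the SDK's exact payload shape:

```python
# Hypothetical sketch: attribute each tool call to the subagent that made it.
class SubagentTracker:
    def __init__(self):
        # Maps a Task tool-use ID to the subagent it spawned,
        # e.g. "task_123" -> "LITERATURE-REVIEWER-1"
        self.subagents: dict[str, str] = {}

    def pre_tool_use_hook(self, event: dict) -> None:
        # Calls with no parent ID come from the lead agent itself
        agent = self.subagents.get(event.get("parent_tool_use_id"), "lead-agent")
        print(f"[{agent}] → {event['tool_name']}")

        # When the lead agent spawns a subagent via the Task tool, remember the
        # tool-use ID so that later calls can be attributed to that subagent.
        if event["tool_name"] == "Task":
            self.subagents[event["tool_use_id"]] = event["input"].get(
                "subagent_type", "unknown"
            )
```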
`transcript.txt` - Human-readable conversation:

```
You: Research quantum error correction codes...
Agent: [Spawning LITERATURE-REVIEWER-1: stabilizer codes]
[LITERATURE-REVIEWER-1] → WebSearch (query='stabilizer codes quantum error correction')
[LITERATURE-REVIEWER-1] → Write (file='files/research_notes/lit_review_stabilizer_codes.md')
[Spawning EXPERIMENTALIST-1: implement experiments]
[EXPERIMENTALIST-1] → Read (file='files/theory/experiment_plan.json')
[EXPERIMENTALIST-1] → Bash (command='python experiments/run_qec_simulation.py')
```
`tool_calls.jsonl` - Structured JSON (enables web UI real-time updates):

```
{"event":"tool_call_start","agent_id":"LITERATURE-REVIEWER-1","tool_name":"WebSearch","timestamp":"2025-01-15T10:23:45Z","query":"stabilizer codes"}
{"event":"tool_call_complete","agent_id":"LITERATURE-REVIEWER-1","success":true,"output_size":15234}
{"event":"subagent_spawn","agent_id":"EXPERIMENTALIST-1","parent":"lead-agent","timestamp":"2025-01-15T10:25:12Z"}
```

The FastAPI backend's `FileWatcher` monitors `tool_calls.jsonl`:
- Polls every 500ms for new entries
- Parses JSON events
- Broadcasts via WebSocket to connected React clients
- React components update in real-time:
  - Pipeline phase indicators advance
  - Subagent cards display active agents
  - Tool call timeline shows chronological activity
This enables users to watch research progress live in the browser without refreshing.
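A simplified sketch of the tail-and-broadcast loop, assuming `broadcast` is a coroutine that pushes one parsed event to all connected WebSocket clients:

```python
import asyncio
import json


async def tail_tool_calls(path: str, broadcast) -> None:
    """Poll tool_calls.jsonl every 500ms and broadcast each complete new line."""
    offset = 0
    while True:
        with open(path) as f:
            f.seek(offset)
            while True:
                line = f.readline()
                if not line.endswith("\n"):
                    break  # EOF or a partially written line; retry on the next poll
                await broadcast(json.loads(line))
                offset = f.tell()
        await asyncio.sleep(0.5)
```

Tracking a byte offset instead of re-reading the whole file keeps each poll cheap even for long research sessions.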
The system includes comprehensive statistical tools in `research_agent/statistics.py`:
Bootstrap Confidence Intervals:
- Non-parametric resampling for metric uncertainty quantification
- Configurable confidence levels (default: 95%)
- Handles small sample sizes robustly
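A minimal sketch of a percentile-bootstrap helper along these lines (the actual implementation in `statistics.py` may differ):

```python
import numpy as np


def bootstrap_ci(samples, stat=np.mean, n_resamples=10_000, confidence=0.95, seed=0):
    """Non-parametric percentile bootstrap CI for an arbitrary statistic."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples)
    # Resample with replacement and recompute the statistic on each resample
    stats = np.array([
        stat(rng.choice(samples, size=samples.size, replace=True))
        for _ in range(n_resamples)
    ])
    alpha = (1.0 - confidence) / 2.0
    return float(np.quantile(stats, alpha)), float(np.quantile(stats, 1.0 - alpha))
```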
Hypothesis Testing:
- Diebold-Mariano test for comparing predictive accuracy
- Paired t-tests for metric comparisons
- Multiple testing correction (Bonferroni, Holm-Bonferroni)
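For example, paired comparisons with Holm-Bonferroni correction could be sketched as follows (illustrative, not the exact code in `statistics.py`):

```python
from scipy import stats


def paired_comparisons(metric_pairs, alpha=0.05):
    """Paired t-tests over (name, baseline, candidate) triples, Holm-corrected."""
    results = [(name, stats.ttest_rel(a, b).pvalue) for name, a, b in metric_pairs]
    results.sort(key=lambda r: r[1])  # ascending p-values
    decisions, m = {}, len(results)
    for i, (name, p) in enumerate(results):
        # Holm-Bonferroni: compare the i-th smallest p-value against alpha/(m-i)
        if p <= alpha / (m - i):
            decisions[name] = (p, "reject H0")
        else:
            # Once one test fails, all remaining (larger) p-values fail too
            decisions.update({n: (q, "fail to reject H0") for n, q in results[i:]})
            break
    return decisions
```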
Risk-Adjusted Metrics:
- Sharpe ratio calculations
- Drawdown analysis
- Custom risk metrics per domain
All statistical claims in generated papers are backed by these rigorous tests, with p-values and confidence intervals reported transparently.
The system supports two research modes:
Discovery Mode (default):
- If the primary hypothesis fails statistical tests, automatically generates `followup_plan.json`
- Executes highest-priority follow-up automatically
- Iterates until hypothesis supported or conclusive negative result
Demo Mode (`mode=demo`):
- Single-pass execution without follow-ups
- Faster execution for demonstrations
- Still includes full statistical analysis
Specify mode in initial query or via command-line argument.
The system automatically detects available system RAM and applies memory limits:
Default Behavior:
- Limits research agent to 25% of system RAM
- Prevents runaway processes during experimentation
- Configurable via the `RESEARCH_AGENT_MEMORY_LIMIT` environment variable
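A sketch of how such a cap could be enforced on Linux with `resource.setrlimit`; treating `RESEARCH_AGENT_MEMORY_LIMIT` as a byte count is an assumption here:

```python
import os
import resource


def apply_memory_limit(fraction: float = 0.25) -> None:
    """Cap this process's address space at a fraction of total system RAM."""
    total = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    # Assumed format: an explicit byte count overrides the default fraction
    limit = int(os.environ.get("RESEARCH_AGENT_MEMORY_LIMIT", total * fraction))
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
```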
Production Recommendation:
- Set explicit limits based on workload
- Monitor memory usage during large-scale experiments
- Consider containerization (Docker) with resource constraints
API Key Encryption:
- User API keys encrypted with Fernet (symmetric encryption)
- Master key stored in the `ENCRYPTION_KEY` environment variable
- Keys only decrypted in memory during research execution
- Never logged or exposed in plaintext
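The underlying pattern with the `cryptography` library looks roughly like this (helper names are illustrative):

```python
import os

from cryptography.fernet import Fernet

# Master key from the ENCRYPTION_KEY environment variable
# (a urlsafe base64-encoded 32-byte Fernet key)
fernet = Fernet(os.environ["ENCRYPTION_KEY"])

def encrypt_api_key(plaintext: str) -> bytes:
    """Encrypt a user's API key before it is stored in the database."""
    return fernet.encrypt(plaintext.encode())

def decrypt_api_key(token: bytes) -> str:
    """Decrypt only in memory, immediately before a research run."""
    return fernet.decrypt(token).decode()
```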
Authentication:
- Django user authentication required for all operations
- Session-based auth for web interface
- CORS configured for localhost development
File Access:
- Users can only access their own research sessions
- File downloads require authentication
- Session directories isolated per user
Every research session is fully reproducible:
Saved Artifacts:
- All experiment code with parameter configurations
- Complete datasets or dataset documentation
- Statistical analysis scripts
- Raw results (CSV, JSON)
- Session logs with full conversation history
- Agent prompts used for each subagent
Audit Trail:
- `transcript.txt` provides a human-readable execution flow
- `tool_calls.jsonl` provides a machine-readable structured log
- `agent_prompts.txt` shows the exact prompts given to each agent
To reproduce a session:
1. Navigate to `logs/session_YYYYMMDD_HHMMSS/`
2. Review `transcript.txt` for research context
3. Check `files/experiments/` for code and configurations
4. Rerun the experiments with the same parameters
5. Compare results against `files/results/results_table.csv`
This project is based on the research agent from the Anthropic team's Claude Agent SDK documentation. The original research agent was a search-and-summarization agent that looked up information on a specified topic and returned a report of its findings. This project expands significantly on that agent, enabling it to conduct scientific research and simulations and giving it a more accessible user interface.