A modular framework for orchestrating structured debates between multiple large language models (LLMs) with specialized judge evaluation. This project implements an adversarial training approach to enhance LLM argumentative reasoning.
- Multi-Agent Architecture: Orchestrates debates between opposing LLM agents
- Structured Debate Protocol: Implements formal opening, rebuttal, and closing rounds
- Adversarial Critique System: Agents analyze and critique opposing arguments
- Evidence Self-Check Mechanism: Ensures factual accuracy and reduces source fabrication
- Multi-Dimensional Judge Framework: Seven specialized judges evaluate different aspects of argument quality
- Local Deployment: Compatible with Ollama-hosted models
- Python 3.8+
- Ollama for local model hosting
- YAML for configuration files
- Required Python packages (see Environment Setup)
- Clone this repository:

  git clone https://github.com/[username]/multi-agent-llm-debate.git
  cd multi-agent-llm-debate

- Create and activate the conda environment:

  conda env create -f debate-env.yml
  conda activate debate-env

- Install Ollama following the instructions at ollama.ai

- Download the required models via Ollama: the model list is defined in the first notebook cell and can be edited, so models can be downloaded selectively or all at once.
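For reference, the download cell reduces to something like the sketch below; the model names here are placeholders rather than the notebook's actual list:

```python
# Minimal sketch of a model-download cell. The exact model list lives in the
# notebook's first cell; the names below are placeholders, not requirements.
import subprocess

MODELS = ["llama3.1:8b", "mistral:7b"]  # placeholder names; edit to match the notebook

for name in MODELS:
    # `ollama pull` downloads the model into the local Ollama store; remove
    # entries you do not need for a selective download.
    subprocess.run(["ollama", "pull", name], check=True)
```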
┌──────────────────────────────────────────────────────────────────────────────┐
│ PREPARATION & CONFIG LAYER │
│ YAML Prompts → Theory Integration → Judge Config → Model Selection │
└──────────────────────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────────────────┐
│ DEBATE EXECUTION LAYER │
│ Agent Init → Round Control → Evidence Check → Critique Gen → Response │
│ (FOR/AGAINST) → (Opening/Rebuttal/Closing) → (Invisible Prep) → (Output) │
└──────────────────────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────────────────┐
│ MULTI-JUDGE EVALUATION LAYER │
│ 7 Specialized Judges → Parallel Scoring → Consensus Algorithm → Meta-Judge │
│ (Logic/Fact/Rhetoric/Strategy/Ethics/Belief/Audience) → Weighted Aggregate │
└──────────────────────────────────────────────────────────────────────────────┘
↓
┌──────────────────────────────────────────────────────────────────────────────┐
│ STORAGE & PERSISTENCE LAYER │
│ JSON Debate Logs → Judgment Records → Transcript Generation → Results API │
└──────────────────────────────────────────────────────────────────────────────┘
═══════════════════════════════════════════════════════════════════════════════════════════════
EVIDENCE CHECK TRACK:
─────────────────────
Previous Response → Claim Extraction → Source Verification → Strength Analysis
↓ ↓ ↓ ↓
[Self-Critique] [Find Claims] [Check Citations] [Rate: Strong/Med/Weak]
↓
{evidence_report.json}
↓
═══════════════════════════════════════════════════════════════════════════════════════════════
↓
ADVERSARIAL CRITIQUE TRACK: ↓
──────────────────────────── ↓
Opponent Argument → 5-Dim Weakness Detection → Vulnerability Map ↓
↓ ↓ ↓ ↓
[Latest Args] [Logic/Fact/Assume/Rhetoric/Strategy] [Counter] ↓
↓ ↓
{critique.json} ↓
↓ ↓
═══════════════════════════════════════════════════════════════════════════════════════════════
↓ ↓
ENHANCED PROMPT ASSEMBLY: ↓ ↓
────────────────────────── ↓ ↓
Base Debate Prompt + {evidence_report} + {critique} → Merge → Token Optimize → Final Prompt
↓
[Generate Response]
↓
═══════════════════════════════════════════════════════════════════════════════════════════════
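The prompt-assembly step shown above boils down to concatenating the base debate prompt with the two JSON artifacts before generation. The sketch below is illustrative only; the function name, field names, and the length-trim stand-in for token optimization are assumptions, not the framework's actual API:

```python
# Illustrative sketch of the enhanced-prompt assembly step. Names and the
# JSON shapes are assumptions for illustration.
import json

def build_enhanced_prompt(base_prompt: str, evidence_report: dict, critique: dict,
                          max_chars: int = 6000) -> str:
    """Merge the base debate prompt with the agent's own evidence self-check
    and its critique of the opponent, then trim to a rough length budget."""
    sections = [
        base_prompt,
        "EVIDENCE SELF-CHECK:\n" + json.dumps(evidence_report, indent=2),
        "WEAKNESSES FOUND IN OPPONENT'S ARGUMENT:\n" + json.dumps(critique, indent=2),
    ]
    merged = "\n\n".join(sections)
    return merged[:max_chars]  # crude stand-in for the token-optimization step

# Hypothetical input shapes
evidence_report = {"claims": [{"text": "example claim", "strength": "medium"}]}
critique = {"logic": ["circular reasoning in paragraph 2"], "strategy": ["ignores cost"]}
prompt = build_enhanced_prompt("Argue FOR the motion...", evidence_report, critique)
```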
INPUT: {combined_arguments, topic, stance, word_limit}
↓
┌───────────────────────────────────────────────────────┐
│ PARALLEL JUDGE EVALUATION │
└───────────────────────────────────────────────────────┘
↓
╔═══════════════════════════════════════════════════════╗
║ ║
║ LOGICAL_JUDGE → Fallacy Detection ║ → Score: 8.1/10
║ Internal Consistency ║ Critique: 300 words
║ Reasoning Chains ║
║ ║
║ FACTUAL_JUDGE → Source Verification ║ → Score: 7.4/10
║ Evidence Quality ║ Critique: 300 words
║ Citation Integrity ║
║ ║
║ RHETORICAL_JUDGE → Persuasion Analysis ║ → Score: 8.5/10
║ Emotional Appeal ║ Critique: 300 words
║ Language Effectiveness ║
║ ║
║ STRATEGIC_JUDGE → Argument Selection ║ → Score: 7.8/10
║ Adaptive Response ║ Critique: 300 words
║ Framing Control ║
║ ║
║ ETHICAL_JUDGE → Fair Representation ║ → Score: 9.2/10
║ Intellectual Honesty ║ Critique: 300 words
║ Respectful Conduct ║
║ ║
║ BELIEF_JUDGE → Audience Impact ║ → Score: 6.9/10
║ Mind-Change Potential ║ Critique: 300 words
║ Cross-Segment Appeal ║
║ ║
║ AUDIENCE_JUDGE → Comprehension (4 dims) ║ → Score: 7.5/10
║ Engagement Metrics ║ Panel Response: 300 words
║ ║
╚═══════════════════════════════════════════════════════╝
↓
┌───────────────────────────────────────────────────────┐
│ META-JUDGE CONSENSUS │
│ │
│ • Inter-Judge Correlation (r = 0.64-0.91) │
│ • Composite Score Calculation │
│ │
│ FINAL OUTPUT: │
│ ───────────── │
│ Composite Score: 7.7/10 │
│ Consensus Strengths: [...] │
│ Consensus Weaknesses: [...] │
│ Definitive Assessment: 300 words │
└───────────────────────────────────────────────────────┘
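The consensus step amounts to a weighted aggregation of the seven judge scores. The sketch below shows that arithmetic with illustrative equal weights; the framework's actual weights (which produce the 7.7/10 example above) are not reproduced here:

```python
# Minimal sketch of a weighted composite score over the seven judges.
# The weights are illustrative (equal), not the framework's actual values.
scores = {
    "logical": 8.1, "factual": 7.4, "rhetorical": 8.5, "strategic": 7.8,
    "ethical": 9.2, "belief": 6.9, "audience": 7.5,
}
weights = {judge: 1 / len(scores) for judge in scores}  # equal weights as a placeholder

composite = sum(scores[j] * weights[j] for j in scores)
print(f"Composite score: {composite:.1f}/10")  # ~7.9 with equal weights
```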
.
├── .ipynb_checkpoints/              # Jupyter notebook checkpoints
├── prompts/                         # YAML configuration files for debate prompts
│   ├── debate_prompts.yml           # Core debate prompts
│   └── judge_prompts.yml            # Judge evaluation prompts
├── results/                         # Debate outputs and judge evaluations
│   ├── agent_records/               # Saved debate transcripts
│   ├── judge_records/               # Evaluation results
│   └── perfect_debate_transcripts/  # Curated debate examples for the judgment pipeline
├── debate-env.yml                   # Conda environment configuration
├── MultiLLM Debate.ipynb            # Main notebook for running debates
└── OLLAMA EDA, Test Scripts.ipynb   # Ollama exploration and testing scripts
The PromptManager class loads and formats debate prompts from YAML files:
- Modular design allows testing different prompt strategies
- Phase-specific guidance for opening, rebuttal, and closing rounds
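A stripped-down version of this pattern looks roughly like the sketch below; the method name and the assumption of one template per phase key are approximations, not the project's exact interface:

```python
# Approximate sketch of the PromptManager pattern: load YAML prompts and fill
# in debate-specific fields. Method names and YAML keys are assumptions.
import yaml

class PromptManager:
    def __init__(self, path="prompts/debate_prompts.yml"):
        with open(path, "r", encoding="utf-8") as f:
            self.prompts = yaml.safe_load(f)

    def format(self, phase, topic, stance, word_limit=400):
        """Return the prompt template for a debate phase (opening/rebuttal/closing)
        with the topic, stance, and word limit filled in."""
        template = self.prompts[phase]  # assumes one template per phase key
        return template.format(topic=topic, stance=stance, word_limit=word_limit)
```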
The MultiAgentDebate class orchestrates structured agent interactions:
- Implements preparation, critique, and rebuttal phases
- Manages context and maintains debate state
- Generates enhanced arguments based on adversarial feedback
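At a high level, the round control reduces to a loop like the one sketched here (reusing the PromptManager sketch above); the agent-to-model mapping, helper names, and response handling are illustrative assumptions:

```python
# High-level sketch of the round-control loop. Model names and helper
# functions are illustrative assumptions, not the project's actual API.
import ollama

PHASES = ["opening", "rebuttal", "closing"]
AGENTS = {"FOR": "llama3.1:8b", "AGAINST": "mistral:7b"}  # placeholder models

def ask(model, prompt):
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]

def run_debate(topic, prompt_manager):
    transcript = {phase: {} for phase in PHASES}
    for phase in PHASES:
        for stance, model in AGENTS.items():
            # The real pipeline also runs the evidence self-check and adversarial
            # critique here and merges them into the prompt before generation.
            prompt = prompt_manager.format(phase, topic, stance)
            transcript[phase][stance] = ask(model, prompt)
    return transcript
```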
The JudgeEvaluator class assesses debate quality across multiple dimensions:
- Specialized judges for logical, factual, rhetorical, and ethical aspects
- Meta-judge synthesizes evaluations into composite assessment
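Conceptually, the evaluation loop queries each judge persona separately and hands the collected outputs to the meta-judge. The sketch below is an approximation with assumed prompt keys and a simplified output format:

```python
# Approximate sketch of the multi-judge evaluation loop. Judge prompt keys,
# the scoring format, and the omitted parsing step are simplified assumptions.
import ollama

JUDGES = ["logical", "factual", "rhetorical", "strategic", "ethical", "belief", "audience"]

def evaluate(combined_arguments, judge_model, judge_prompts):
    results = {}
    for judge in JUDGES:
        prompt = judge_prompts[judge].format(arguments=combined_arguments)
        reply = ollama.chat(model=judge_model,
                            messages=[{"role": "user", "content": prompt}])
        # Each judge is expected to return a score plus a short critique; parsing
        # into structured fields is left out here for brevity.
        results[judge] = reply["message"]["content"]
    return results
```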
Edit the YAML files in the prompts/ directory to customize:
- Debate instructions and structure
- Critique guidelines
- Evidence check parameters
- Judge evaluation criteria
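Before editing, it can help to inspect what the prompt files currently contain, for example:

```python
# Quick inspection of the prompt configuration before editing. Only the file
# paths come from the project layout; this assumes each file is a mapping of
# named prompt sections.
import yaml

for path in ["prompts/debate_prompts.yml", "prompts/judge_prompts.yml"]:
    with open(path, "r", encoding="utf-8") as f:
        config = yaml.safe_load(f)
    print(path, "->", list(config.keys()))  # top-level sections available for editing
```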
Update the OllamaDebateManager.models dictionary to include new models:
self.models = {
    "custom_model": "model_name:tag",
    # Add more models here
}

Debate results and judge evaluations are saved to:
- results/agent_records/ - Full debate transcripts
- results/judge_records/ - Judge evaluations and scores
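A saved run can be reloaded for analysis along these lines; the glob pattern and JSON structure are illustrative, since record names follow whatever the notebook writes:

```python
# Illustrative sketch for loading saved debate and judgment records. File
# naming and the JSON structure are assumptions; adjust to the actual outputs.
import json
from pathlib import Path

def load_records(directory):
    records = []
    for path in sorted(Path(directory).glob("*.json")):
        with open(path, "r", encoding="utf-8") as f:
            records.append(json.load(f))
    return records

debates = load_records("results/agent_records")
judgments = load_records("results/judge_records")
print(f"Loaded {len(debates)} debate transcripts and {len(judgments)} judgment records")
```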
- Agent Coordination, Persistent Memory Systems, Inter-Agent Communication, Scalable Agent Framework.
- API Integration.
- Model Orchestration: custom class handling model lifecycle, health checks, and failover mechanisms.
- Assessment: Scoring Algorithms, Meta-Evaluation, Performance Metrics.
- YAML-Based Configuration, Parameter Management.
If you use this framework in your research, please cite:
@misc{markapudi2025socraiticcircle,
title={SocrAItic Circle: Enhancing LLM Reasoning Through Multi-Agent Debate Frameworks},
author={Markapudi, Joel},
year={2025},
institution={Northeastern University}
}
TBD
All code notebooks, environment setup, and prompt files were authored by Joel Markapudi.