---
title: Multi-Judge Grading Demo
emoji: ⚖️
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---
# Leveraging Artificial Intelligence to Enhance Institutional Quality Assurance: Automating Annual Action Plan Scoring as a Use Case

**2026 ACGME Annual Educational Conference — Meaning in Medicine**

## Video Presentation of Technique
This project demonstrates a method for simulating expert evaluator panels using calibrated AI judge personas. Three AI judges—each trained with different human rater examples—independently evaluate medical residency program action plans against the same rubric. Their assessments are then reconciled by a consensus arbiter that produces a final score with transparent agreement statistics.
The technique showcases how few-shot calibration (providing example evaluations in the prompt context, without any model training or weight updates) can create meaningfully different evaluation perspectives from the same language model. When judges disagree, their different scoring tendencies reflect the distinct evaluation philosophies of the human raters they were calibrated against. This disagreement becomes a feature rather than a bug, helping program directors understand which aspects of their proposals are universally strong versus which might need refinement based on different evaluative lenses.
The system provides real-time feedback as each judge completes their evaluation, with per-item comments, scores, and a final consensus assessment. After grading, users can ask follow-up questions through an interactive chat interface to better understand the results.
This demonstration was developed for the 2026 ACGME Annual Educational Conference to showcase practical applications of AI in medical education quality assurance. The calibrated multi-judge approach addresses a common challenge: expert evaluators often disagree on assessment scores, and understanding why they disagree can be as valuable as the scores themselves.
Traditional inter-rater reliability approaches treat evaluator disagreement as measurement error to be minimized. This project takes a different perspective: disagreement can be informative when it reflects different but equally valid evaluative philosophies. By making judge calibration explicit and providing transparent consensus statistics, this system helps program directors understand evaluation results in terms of different evaluative priorities (structure vs. feasibility vs. actionability) rather than dismissing disagreement as noise.
The technique demonstrates that AI can support quality assurance not by replacing human judgment, but by making the diversity of expert perspectives more visible and actionable.
Each judge is calibrated with five example evaluations from a specific human rater, giving them distinct scoring tendencies:
| Judge | Persona | Calibration Tendency |
|---|---|---|
| Rater A | The Professor | Strict on structure, quantitative targets, metric specificity |
| Rater B | The Editor | Generous on feasibility and clarity, values achievable plans |
| Rater C | The Practitioner | Strict on actionability, data richness, practical impact |
How calibration works: Each judge receives the same evaluation rubric but sees different example evaluations in its prompt context (no model training or weight updates involved). These examples demonstrate how the assigned human rater scored proposals and what aspects they emphasized in their feedback. The AI judges emulate these evaluation patterns—Rater A demands detailed metrics, Rater B focuses on clear communication, and Rater C prioritizes real-world implementation.
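The mechanics of this few-shot calibration can be sketched in a few lines. This is a minimal illustration, not the project's actual code: `CalibrationExample` and `buildJudgePrompt` are hypothetical names, and the prompt template is simplified.

```typescript
// Minimal sketch of few-shot judge calibration: every judge shares the same
// rubric, but each is primed with a different rater's example evaluations.
// All identifiers here are illustrative, not the project's real ones.

interface CalibrationExample {
  proposal: string; // excerpt of a past action plan
  score: number;    // how this rater scored it (1-5)
  comment: string;  // what the rater emphasized
}

function buildJudgePrompt(
  rubric: string,
  raterName: string,
  examples: CalibrationExample[],
): string {
  const shots = examples
    .map(
      (ex, i) =>
        `Example ${i + 1}:\nProposal: ${ex.proposal}\nScore: ${ex.score}\nComment: ${ex.comment}`,
    )
    .join("\n\n");
  // The rubric is identical for every judge; only the examples differ,
  // which is what produces distinct scoring tendencies with no training.
  return `You are calibrated to emulate ${raterName}.\n\nRubric:\n${rubric}\n\n${shots}`;
}
```

Because only the in-context examples vary, swapping in a different rater's examples yields a judge with a different scoring tendency from the very same model.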
The shared rubric uses a 5-point scale with clear anchors:
- 1 = Poor: fundamental gaps; lacks feasibility, clarity, or alignment
- 2 = Weak: notable issues; partial feasibility or unclear execution
- 3 = Adequate: meets minimum; feasible but needs improvements
- 4 = Strong: solid plan with minor refinements suggested
- 5 = Excellent: clear, feasible, well-aligned, high impact
Each judge applies this scale through their calibrated lens. The same action plan might score a 5 from Rater B (who values clear articulation) while receiving a 3 from Rater A (who demands more quantitative detail). Both are valid assessments reflecting different evaluative priorities.
After all judges complete their evaluations, a consensus arbiter synthesizes their perspectives:
- Score constraint: The final score must fall within the range of judge scores (between the minimum and maximum). The arbiter cannot invent scores outside what the judges found.
- Agreement levels (calculated using a simple mathematical rule, not AI interpretation):
  - Strong: Judges' scores differ by 0-1 point
  - Moderate: Judges' scores differ by 2 points
  - Weak: Judges' scores differ by 3-4 points
- Statistics: Mean, median, and score spread are calculated mathematically to provide objective benchmarks alongside the AI-generated synthesis.
- Synthesis approach: The arbiter uses AI to combine and explain the judges' feedback, but only reads the judges' comments—not the original proposal—ensuring it truly reconciles different perspectives rather than adding a fourth independent opinion.
When judges agree, the consensus highlights shared themes. When they disagree, the arbiter explains why based on the judges' different calibration perspectives and the specific evidence each cited.
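The deterministic parts of the consensus step can be sketched directly from the rules above. This is an illustrative sketch, not the project's source; the function names are invented here.

```typescript
// Sketch of the consensus arithmetic described above: agreement level from
// score spread, simple statistics, and clamping the arbiter's final score
// into the judges' range. Function names are illustrative.

type Agreement = "Strong" | "Moderate" | "Weak";

function agreementLevel(scores: number[]): Agreement {
  const spread = Math.max(...scores) - Math.min(...scores);
  if (spread <= 1) return "Strong";    // scores differ by 0-1 point
  if (spread === 2) return "Moderate"; // scores differ by 2 points
  return "Weak";                       // scores differ by 3-4 points
}

function consensusStats(scores: number[]) {
  const sorted = [...scores].sort((a, b) => a - b);
  const mean = scores.reduce((sum, x) => sum + x, 0) / scores.length;
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return { mean, median, spread: sorted[sorted.length - 1] - sorted[0] };
}

// The arbiter may not invent a score outside what the judges found.
function clampFinalScore(proposed: number, scores: number[]): number {
  return Math.min(Math.max(proposed, Math.min(...scores)), Math.max(...scores));
}
```

Keeping these computations in plain arithmetic (rather than asking the model for them) is what makes the agreement statistics objective benchmarks alongside the AI-generated synthesis.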
```
User enters proposal
  ↓
Browser (React UI with real-time updates)
  ↓
Server (Express + CopilotKit Runtime)
  ↓
Three AI Judges evaluate in parallel
  ↓
Consensus Arbiter synthesizes results
  ↓
Azure OpenAI (gpt-5.1-codex-mini)
  ↓
Final assessment displayed with chat Q&A
```
As each judge completes their evaluation, the backend immediately sends updated state to the frontend. Users see the timeline update in real time—first Rater A completes, then Rater B, then Rater C, and finally the consensus emerges. No polling or refresh required.
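The incremental update can be modeled as a pure merge step over a shared grading state, with a fresh snapshot emitted after each judge finishes. The types and names below are illustrative, not the real AG-UI event payloads.

```typescript
// Sketch of progressive state updates: each completed judge result is merged
// into a grading state, and each returned snapshot is a consistent
// point-in-time view suitable for pushing to the frontend.
// Identifiers are illustrative, not the project's actual ones.

interface JudgeResult {
  judge: "Rater A" | "Rater B" | "Rater C";
  score: number;
  comment: string;
}

interface GradingState {
  results: JudgeResult[];
  status: "grading" | "consensus";
}

const TOTAL_JUDGES = 3;

// Pure merge: returns a new snapshot rather than mutating in place, so the
// frontend never observes a half-updated state.
function applyJudgeResult(state: GradingState, result: JudgeResult): GradingState {
  const results = [...state.results, result];
  return {
    results,
    status: results.length < TOTAL_JUDGES ? "grading" : "consensus",
  };
}
```

In the real system these snapshots ride on the AG-UI protocol's state events; only the merge logic is sketched here.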
| Layer | Technology |
|---|---|
| LLM | gpt-5.1-codex-mini via Azure OpenAI v1 API |
| Backend | Express.js 5 + CopilotKit Runtime 1.51 |
| Frontend | React 18 + CopilotKit UI components |
| Structured Output | Zod 4 schemas + OpenAI SDK's zodTextFormat |
| Orchestration | Reactive Extensions for JavaScript (RxJS) 7.8 Observables for event streaming |
| Deployment | Docker (single container on Hugging Face Spaces, port 7860) |
| Runtime | Node.js 24+ |
Note: The implementation uses the OpenAI SDK v6 directly to access Azure's Responses API. LangChain is not used—the OpenAI SDK provides the necessary abstractions for reasoning model interactions and structured output with Zod integration.
```bash
# 1. Clone repository and install dependencies (monorepo)
npm install --workspaces

# 2. Create .env file in project root with Azure OpenAI credentials
cat > .env << EOF
AZURE_OPENAI_API_KEY=your_api_key_here
AZURE_OPENAI_RESOURCE=your_resource_name
AZURE_OPENAI_DEPLOYMENT=your_deployment_name
EOF

# 3. Start development servers
./start-dev-server.sh   # Backend (port 7860, loads .env automatically)
./start-dev-client.sh   # Frontend (port 5173, new terminal)

# 4. Open browser to http://localhost:5173
```

Development workflow: The client dev server proxies `/api/*` requests to the backend server on port 7860. Changes to React components hot-reload automatically. Changes to server code trigger automatic restarts via `tsx` watch mode.
```bash
# Build image
docker build -t grading-demo .

# Run with environment file
docker run -p 7860:7860 --env-file .env grading-demo
```

The Dockerfile uses multi-stage builds to optimize image size—dependencies are installed in a builder stage, then only production artifacts are copied to the final image.
Required configuration for Azure OpenAI connection:
| Variable | Required | Purpose |
|---|---|---|
| `AZURE_OPENAI_API_KEY` | Yes | Azure OpenAI authentication key |
| `AZURE_OPENAI_RESOURCE` | Yes | Azure resource name (e.g., `my-openai-resource`) |
| `AZURE_OPENAI_DEPLOYMENT` | Yes | Model deployment name (must be `gpt-5.1-codex-mini`) |
These variables are validated at server startup. If any are missing, the server will fail fast with a clear error message rather than starting with incomplete configuration.
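The fail-fast check can be sketched as a small pure function over the environment map. This is an illustrative sketch assuming exactly the three variables listed above; the server's actual validation code may differ.

```typescript
// Sketch of fail-fast environment validation: collect every missing variable
// and refuse to start with a message naming all of them at once, rather than
// failing one variable at a time.

const REQUIRED_VARS = [
  "AZURE_OPENAI_API_KEY",
  "AZURE_OPENAI_RESOURCE",
  "AZURE_OPENAI_DEPLOYMENT",
] as const;

function validateEnv(env: Record<string, string | undefined>): void {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    // Fail fast with a clear message instead of starting half-configured.
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
}
```

At startup this would be called once with `process.env` before any server routes are registered.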
This reasoning model has non-standard parameter support compared to standard chat models:

- ❌ Not supported: `temperature`, `max_tokens`, `top_p` (will cause API errors if passed)
- ✅ Use instead: `max_output_tokens` parameter for controlling response length
- ✅ API endpoint: Azure Responses API (`/openai/v1/responses` path)
- ✅ Reasoning effort: defaults to `none`; can be adjusted if needed for more complex reasoning
**Why Responses API?** Reasoning models use a different API surface than chat models:

- Standard Chat Completions API uses a `messages` array and `choices[].message`
- Responses API uses `input`/`instructions` and `output.content[].text`
- Only the Responses API supports the parameter constraints of reasoning models
- Structured output works differently—JSON Schema strict mode is fully supported
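The parameter constraints above can be made concrete with a small request-building helper. This is a hedged sketch: `buildResponsesRequest` and the token limit are invented here, and only the field names stated above (`input`, `instructions`, `max_output_tokens`) are taken from the text.

```typescript
// Sketch of assembling a Responses API request body for a reasoning model.
// Field names follow the constraints described above; the helper name and
// the 2048-token limit are illustrative choices, not the project's values.

interface ResponsesRequest {
  model: string;
  instructions: string;
  input: string;
  max_output_tokens: number;
  // Deliberately no temperature / max_tokens / top_p: reasoning models
  // reject those parameters with API errors.
}

function buildResponsesRequest(
  deployment: string,
  systemPrompt: string,
  proposal: string,
): ResponsesRequest {
  return {
    model: deployment,
    instructions: systemPrompt, // Responses API analogue of a system message
    input: proposal,            // Responses API analogue of the user message
    max_output_tokens: 2048,    // illustrative response-length cap
  };
}
```

Keeping request construction in one typed helper makes it hard to accidentally reintroduce a chat-only parameter elsewhere in the codebase.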
Implementation note: The 3-tier structured output fallback system accounts for these differences, attempting strict JSON Schema validation first and gracefully degrading if needed.
- `SPEC.md` — Complete technical specification (source of truth for implementation details)
- `CODE_STANDARDS.md` — Development workflow, tooling conventions, and quality standards
- `CLAUDE.md` — Project context and guidance for AI-assisted development
- `prompts/README.md` — Prompt template documentation (judge system, consensus arbiter, few-shot calibration examples, and utility prompts for data analysis)
This section provides detailed architecture information for developers who want to understand or extend the implementation.
```
┌──────────────────────────────────────────────────────────┐
│                         Browser                          │
│ ┌──────────────────────────────┐ ┌────────────────────┐  │
│ │ Grading UI                   │ │ Chat Sidebar       │  │
│ │ • Timeline (judge progress)  │ │ • Follow-up Q&A    │  │
│ │ • Judge cards (3 columns)    │ │ • Context-aware    │  │
│ │ • Consensus panel            │ │   explanations     │  │
│ │ (React + CopilotKit hooks)   │ │ (CopilotChat)      │  │
│ └──────────────────────────────┘ └────────────────────┘  │
│        ↕ Agent-User Interaction (AG-UI) protocol         │
├──────────────────────────────────────────────────────────┤
│                Express Server (port 7860)                │
│ ┌────────────────────────────────────────────────────┐   │
│ │ CopilotKit Runtime                                 │   │
│ │ ├─ gradeDocument agent                             │   │
│ │ │   └─ Orchestrator                                │   │
│ │ │       ├─ Judge A (Rater A calibration)           │   │
│ │ │       ├─ Judge B (Rater B calibration)           │   │
│ │ │       ├─ Judge C (Rater C calibration)           │   │
│ │ │       └─ Consensus arbiter                       │   │
│ │ │                                                  │   │
│ │ └─ default agent (BuiltInAgent for chat Q&A)       │   │
│ └────────────────────────────────────────────────────┘   │
│ ├── GET  /              → React static build             │
│ ├── POST /api/copilotkit → CopilotKit Runtime            │
│ └── GET  /api/health    → liveness probe                 │
│                                                          │
│        ↕ HTTPS (Azure OpenAI v1 API)                     │
│   Azure OpenAI (gpt-5.1-codex-mini deployment)           │
└──────────────────────────────────────────────────────────┘
```
**Progressive state emission:** The backend uses the Agent-User Interaction (AG-UI) protocol to stream state updates as events. Each judge completion triggers a `STATE_SNAPSHOT` event that updates the frontend UI immediately—no polling or page refresh required.
**3-tier structured output fallback:** To ensure judges always return valid JSON, the system attempts three increasingly permissive validation strategies:

1. JSON Schema with strict validation (preferred)
2. JSON Schema without strict mode (fallback)
3. JSON Object mode with runtime validation (last resort)
This graceful degradation ensures reliable structured output even when the AI service has temporary limitations.
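The fallback pattern itself is simple: try each strategy in order and return the first success. The sketch below stubs out the strategies; in the real system each tier would issue a differently configured model call, and `withFallback` is an invented name for illustration.

```typescript
// Sketch of an ordered fallback chain: run strategies from strictest to most
// permissive and return the first result that succeeds, remembering the last
// error so a total failure is still diagnosable.

type Strategy<T> = { name: string; run: () => T };

function withFallback<T>(strategies: Strategy<T>[]): { tier: string; value: T } {
  let lastError: unknown;
  for (const s of strategies) {
    try {
      return { tier: s.name, value: s.run() };
    } catch (err) {
      lastError = err; // remember and fall through to the next tier
    }
  }
  throw new Error(`All strategies failed: ${String(lastError)}`);
}
```

Returning the tier name alongside the value makes it easy to log how often the system had to degrade below strict JSON Schema validation.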
**Agent-User Interaction protocol:** CopilotKit's AG-UI protocol handles all real-time communication over a single HTTP endpoint. Events like `STATE_SNAPSHOT` and `RUN_FINISHED` stream from server to client automatically—no custom WebSocket or Server-Sent Events implementation needed.
**Azure Responses API:** The reasoning model (gpt-5.1-codex-mini) requires the Responses API rather than the standard Chat Completions API. This specialized endpoint supports reasoning-specific parameters and JSON Schema strict mode validation.