Version: 0.1.0 (Planning)
Planning documentation for the ACCESS-CI intelligent documentation agent.
The agent is an AI-powered question-answering system for ACCESS (Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support). ACCESS allocates computing resources from Resource Providers (supercomputers, cloud platforms, and storage systems) to researchers across the US.
Users ask questions like:
- "What GPUs does Delta have?"
- "How do I request an allocation?"
- "Is Expanse currently down?"
This tool answers those questions accurately, with citations to source data.
Current State: A RAG-based QA system trained on PDFs and documentation. It answers questions, but its knowledge becomes stale over time and it has no access to live system data.
Problem:
- Static knowledge goes stale between retraining runs
- No access to real-time data (outages, current events, user allocations)
- MCP servers exist but aren't integrated into the QA system
Solution: An intelligent agent that routes queries to the optimal handler:
- Static questions (hardware specs, documentation) → fine-tuned model with baked-in knowledge
- Dynamic questions (outages, user allocations) → live MCP API calls
- Combined questions → both working together
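The three-way routing above can be illustrated with a minimal sketch. It assumes a simple keyword heuristic; the `Route` enum, the hint lists, and the `classify` function are hypothetical names chosen for illustration, not the classifier this plan commits to.

```python
from enum import Enum


class Route(Enum):
    STATIC = "static"      # answered by the fine-tuned model
    DYNAMIC = "dynamic"    # answered by a live MCP call
    COMBINED = "combined"  # needs both handlers


# Hypothetical keyword lists, chosen only to illustrate the idea.
DYNAMIC_HINTS = ("down", "outage", "currently", "right now", "available now", "my allocation")
STATIC_HINTS = ("gpu", "cpu", "memory", "spec", "hardware", "how do i", "documentation")


def classify(question: str) -> Route:
    """Pick a route from substring matches; falls back to STATIC."""
    text = question.lower()
    is_dynamic = any(hint in text for hint in DYNAMIC_HINTS)
    is_static = any(hint in text for hint in STATIC_HINTS)
    if is_dynamic and is_static:
        return Route.COMBINED
    if is_dynamic:
        return Route.DYNAMIC
    return Route.STATIC


if __name__ == "__main__":
    for q in ("What GPUs does Delta have?",
              "Is Delta down?",
              "What GPU resources are available now?"):
        print(q, "->", classify(q).value)
```

Substring matching is deliberately naive (for example, "down" also matches "download"), so a real classifier would likely need a model or stricter rules; the sketch only shows the shape of the decision.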
Key Benefits:
- Fresh answers for dynamic data via MCP integration
- Faster responses for static questions via fine-tuned model
- Maintained citation capability
- Human review ensures quality before and after training
┌─────────────────────────────────────────────────────────────────────────────┐
│                            USER ASKS A QUESTION                             │
└─────────────────────────────────────┬───────────────────────────────────────┘
                                      │
                 ┌────────────────────┼────────────────────┐
                 │                    │                    │
                 ▼                    ▼                    ▼
        ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
        │ "What GPUs does  │ │ "What GPU        │ │ "Is Delta down?" │
        │  Delta have?"    │ │  resources are   │ │                  │
        │                  │ │  available now?" │ │     DYNAMIC      │
        │      STATIC      │ │     COMBINED     │ │                  │
        └────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
                 │                    │                    │
                 ▼                    ▼                    ▼
        ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
        │ Fine-tuned model │ │ Model + live MCP │ │  Live MCP call   │
        │  (fast, cached)  │ │ (comprehensive)  │ │   (real-time)    │
        └────────┬─────────┘ └────────┬─────────┘ └────────┬─────────┘
                 │                    │                    │
                 └────────────────────┼────────────────────┘
                                      │
                                      ▼
       ┌─────────────────────────────────────────────────────────────┐
       │                   RESPONSE WITH CITATIONS                   │
       │   "Delta has 4x NVIDIA A100 GPUs per node [source link]"    │
       └─────────────────────────────────────────────────────────────┘
Replace the current RAG LLM with an intelligent agent system:
- Query Classification: Route queries to the right handler (static, dynamic, or combined)
- Fine-Tuned Model: Handles static queries (baked-in knowledge from MCP data + docs)
- Live MCP Calls: Handles dynamic queries (outages, events, metrics, user-specific data)
- Citations Preserved: Maintain source links users rely on
- Action Tools: Future authenticated operations (create events, etc.)
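One way to keep citations intact when a combined query draws on both the fine-tuned model and a live MCP call is to carry sources alongside the answer text and merge them at the end. The sketch below is an assumption about how that could look; `Citation`, `Answer`, `merge`, and `render` are hypothetical names, and the URLs are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    title: str
    url: str


@dataclass
class Answer:
    text: str
    citations: list[Citation] = field(default_factory=list)


def merge(static: Answer, live: Answer) -> Answer:
    """Combine a static and a live answer, keeping each source URL once."""
    seen: set[str] = set()
    citations: list[Citation] = []
    for citation in static.citations + live.citations:
        if citation.url not in seen:
            seen.add(citation.url)
            citations.append(citation)
    return Answer(text=f"{static.text} {live.text}", citations=citations)


def render(answer: Answer) -> str:
    """Render the answer text followed by a numbered source list."""
    lines = [answer.text]
    for number, citation in enumerate(answer.citations, start=1):
        lines.append(f"[{number}] {citation.title}: {citation.url}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder content and URLs, for illustration only.
    static = Answer("Delta has 4x NVIDIA A100 GPUs per node.",
                    [Citation("Delta docs", "https://example.org/delta")])
    live = Answer("No outages are reported right now.",
                  [Citation("Outage feed", "https://example.org/outages")])
    print(render(merge(static, live)))
```

Keeping citations as structured data rather than inline text means either handler can be swapped without changing how sources are displayed.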
| # | Document | Purpose | Key Sections |
|---|---|---|---|
| 01 | agent-architecture.md | System design + roadmap | Architecture, phases, success metrics, data governance |
| 02 | training-data.md | Data preparation | MCP extraction, Q&A templates, deduplication |
| 03 | review-system.md | Human review (Argilla) | Pre-training approval, post-deployment feedback, domain reviewers |
| 04 | model-training.md | Model & infrastructure | GH200 setup, model selection, pilot comparison |
| 05 | events-actions.md | MCP action tools | Announcements (Phase 1), Events (Phase 2) |
| 06 | mcp-authentication.md | Authentication architecture | OAuth 2.1, CILogon proxy, token strategy |
| 07 | backend-integration-spec.md | Backend API contract | Service tokens, X-Acting-User, authorization patterns |
| 08 | observability.md | Distributed tracing & audit | Grafana Cloud, OpenTelemetry, dashboards |
| Document | Purpose |
|---|---|
| drupal-announcements-api-spec.md | Drupal API spec for Announcements (Phase 1 pilot) |
Start with 01-agent-architecture.md, which covers the full system design and implementation phases.
The data pipeline docs describe a continuous flow:
- 02-training-data.md - Sources, extraction, Q&A generation
- 03-review-system.md - Human review (before training + after deployment)
- 04-model-training.md - Training the fine-tuned model
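The Q&A generation and deduplication steps of this pipeline can be sketched roughly as follows. The templates, the record fields, and the normalization rule are all assumptions made for illustration; 02-training-data.md defines the actual approach.

```python
import re

# Hypothetical templates; the placeholders are filled from one resource record.
TEMPLATES = [
    ("What GPUs does {name} have?", "{name} has {gpus}."),
    ("Which GPUs are on {name}?", "{name} has {gpus}."),
    ("Who operates {name}?", "{name} is operated by {operator}."),
]


def normalize(question: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical questions collide."""
    return re.sub(r"[^a-z0-9]+", " ", question.lower()).strip()


def generate_pairs(record: dict[str, str]) -> list[dict[str, str]]:
    """Fill every template with the fields of one extracted record."""
    return [
        {"question": q.format(**record), "answer": a.format(**record)}
        for q, a in TEMPLATES
    ]


def deduplicate(pairs: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep the first pair for each normalized question."""
    seen: set[str] = set()
    unique = []
    for pair in pairs:
        key = normalize(pair["question"])
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


if __name__ == "__main__":
    # Placeholder record, not real ACCESS data.
    record = {"name": "ExampleCluster",
              "gpus": "4 example GPUs per node",
              "operator": "Example University"}
    for pair in deduplicate(generate_pairs(record)):
        print(pair["question"], "->", pair["answer"])
```

This only removes exact matches after normalization; catching paraphrases would need a similarity measure, which is outside this sketch.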
For AI agents to take actions on behalf of users:
- 05-events-actions.md - Overview: phased approach, key patterns
- 06-mcp-authentication.md - OAuth 2.1 authentication with CILogon
- 07-backend-integration-spec.md - Contract for backend API teams
- drupal-announcements-api-spec.md - Phase 1: Drupal developer spec
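The document index lists service tokens and an `X-Acting-User` header as part of the backend contract. As a rough sketch of that pattern, the MCP server could authenticate with its own service token while naming the user it acts for. Everything beyond the header name is an assumption here: the bearer scheme, the URL, the user identifier format, and the `build_backend_request` helper are hypothetical, and 07-backend-integration-spec.md is the authoritative source.

```python
import json
import urllib.request


def build_backend_request(url: str, service_token: str,
                          acting_user: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST carrying the service token and acting user."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Assumption: the service token is sent as a bearer token.
            "Authorization": f"Bearer {service_token}",
            # The user on whose behalf the agent acts (identifier format assumed).
            "X-Acting-User": acting_user,
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_backend_request(
        "https://backend.example.org/api/announcements",  # placeholder URL
        "example-service-token",
        "user@example.org",
        {"title": "Maintenance window"},
    )
    # urllib stores header names capitalized (e.g. "X-acting-user");
    # HTTP header names are case-insensitive, so this does not change the request.
    print(request.get_method(), request.full_url)
    print(request.header_items())
```

Separating the caller's identity (service token) from the user's identity (acting-user header) lets the backend apply its own authorization checks per user.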
- Phase: Planning / Pilot Preparation
- Next Steps:
  - Export existing "good" Q&A pairs from production
  - Run MCP extraction for compute-resources + software-discovery
  - Set up training infrastructure on GH200
  - Begin model pilot comparing architectures
- access_mcp - MCP servers for ACCESS data
- n8n workflows - Query routing and orchestration
- cyberteam_drupal - Drupal integration for events
This is a planning repository. To propose changes:
- Create a branch
- Edit the relevant document(s)
- Open a PR with a description of what changed and why