AIDA (Autonomous Incident Diagnostic Agent) is a state-of-the-art MLOps project that demonstrates a complete, end-to-end system for an AI agent that acts as a junior Site Reliability Engineer (SRE).
When a production alert fires, AIDA is automatically dispatched to investigate. It uses a suite of tools and a RAG-based knowledge system to diagnose the problem, proposes a root cause, and crucially, learns from human feedback to improve its performance over time via an automated fine-tuning pipeline.
- ⚡ Dual LLM Engine: Seamlessly switch between a powerful cloud-based model (OpenAI GPT-4o) for maximum reasoning capability and a self-hosted, fine-tuned model (Google Gemma-2B) for a private, cost-effective, and specialized solution.
- 🧠 Full MLOps Loop: Implements a complete machine learning lifecycle. AIDA's investigations are reviewed by a human expert, and this feedback is used to generate a training dataset to automatically fine-tune and improve the agent.
- 🤖 Autonomous Agent: A sophisticated agent built with LangChain that can reason, plan, and use a "Tool Belt" to interact with its environment (Kubernetes, Prometheus, etc.).
- 📚 RAG-Powered Knowledge: AIDA consults a knowledge base of technical runbooks stored in a ChromaDB vector database, ensuring its diagnostic steps are grounded in best practices.
- 🔧 Production-Grade Infrastructure: The entire system is containerized with Docker Compose, featuring a robust, event-driven architecture with a FastAPI webhook, Redis job queue, and a persistent PostgreSQL backend for MLflow (a minimal sketch of the webhook-to-queue flow follows this list).
- ☁️ Hybrid Cloud Fine-Tuning: A pragmatic training pipeline that exports human-validated data from the local environment to be fine-tuned on a high-powered cloud GPU (like a Google Colab A100), with the resulting model deployed back to the local agent.
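To make the event-driven core concrete, the sketch below shows one plausible shape for the webhook-to-queue handoff: a FastAPI endpoint receives the alert and pushes it onto a Redis list for the agent worker. The endpoint path, queue name, and payload handling are illustrative assumptions, not the project's actual `webhook_api` code.

```python
# Hypothetical sketch of an alert webhook that enqueues jobs on Redis.
# The endpoint path ("/alert") and queue name ("aida:alerts") are assumptions.
import json

import redis
from fastapi import FastAPI, Request

app = FastAPI()
queue = redis.Redis(host="redis", port=6379, decode_responses=True)


@app.post("/alert")
async def receive_alert(request: Request):
    payload = await request.json()                    # Alertmanager-style JSON body
    queue.lpush("aida:alerts", json.dumps(payload))   # hand the job to the agent worker
    return {"status": "queued"}
```

A worker process (the agent container) could then block on the same list with `BRPOP`, which is what makes the flow event-driven rather than poll-based.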
AIDA is a multi-component, event-driven system designed for robustness and scalability. The architecture is designed as a hybrid model, with core services running locally while the intensive model training process is offloaded to the cloud.
This project demonstrates a complete, closed-loop MLOps cycle.
An incident is triggered via a curl command simulating Prometheus. The AIDA agent picks up the job, loads its self-hosted, fine-tuned Gemma-2B model, and begins its investigation. Every action is logged as a new "Run" in MLflow.
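For reference, the trigger can be reproduced with a small script equivalent to that curl call. The endpoint URL and the Alertmanager-style payload below are assumptions; adapt them to the actual `webhook_api` route.

```python
# Hypothetical trigger script, equivalent to the curl command described above.
import requests

alert = {
    "status": "firing",
    "labels": {"alertname": "HighPodRestartRate", "namespace": "payments", "severity": "critical"},
    "annotations": {"summary": "Pod restart rate above threshold for 10 minutes"},
}

# Assumed webhook endpoint exposed by the webhook_api service.
resp = requests.post("http://localhost:8000/alert", json=alert, timeout=10)
print(resp.status_code, resp.json())
```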
The agent uses its tools to gather data and formulates a conclusion. All generated artifacts, including the raw alert and the agent's final report, are logged to MLflow for full traceability.
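A rough sketch of that logging step, assuming the agent writes to a named MLflow experiment; the experiment name, tag keys, and placeholder values are illustrative only.

```python
# Hypothetical MLflow logging for one investigation; values here are placeholders.
import mlflow

mlflow.set_tracking_uri("http://mlflow:5000")       # assumed service name and port
mlflow.set_experiment("aida-investigations")        # assumed experiment name

raw_alert = {"alertname": "HighPodRestartRate", "namespace": "payments"}        # placeholder
final_report = "Root cause: the payments deployment was OOMKilled after ..."    # placeholder

with mlflow.start_run(run_name="incident-HighPodRestartRate"):
    mlflow.log_dict(raw_alert, "raw_alert.json")        # the alert that triggered the run
    mlflow.log_text(final_report, "final_report.md")    # the agent's conclusion
    mlflow.set_tag("aida.status", "awaiting_review")    # illustrative review-state tag
```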
A human engineer opens the Feedback Console to review the case. They can see the initial alert and AIDA's final conclusion side-by-side, providing a clear and efficient overview for validation.
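A minimal Streamlit sketch of that side-by-side layout is shown below; in the real Feedback Console the alert and report would be fetched from the selected MLflow run rather than hard-coded.

```python
# Hypothetical Feedback Console layout: alert and conclusion side by side.
import streamlit as st

raw_alert = {"alertname": "HighPodRestartRate", "namespace": "payments"}        # placeholder
final_report = "Root cause: the payments deployment was OOMKilled after ..."    # placeholder

st.title("AIDA Feedback Console")

left, right = st.columns(2)
with left:
    st.subheader("Initial alert")
    st.json(raw_alert)
with right:
    st.subheader("AIDA's conclusion")
    st.markdown(final_report)

verdict = st.radio("Was the diagnosis correct?", ["Correct", "Incorrect"])
notes = st.text_area("Corrected root cause / notes")
submitted = st.button("Submit feedback")
```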
The engineer provides feedback, creating a high-quality data point that associates the agent's trajectory with a correct, human-validated answer. This feedback is saved back to the MLflow run, and the UI updates to reflect the new status.
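Writing the verdict back could look like the following sketch, which attaches the feedback to the run as MLflow tags. The tag names are assumptions; the actual schema lives in the Feedback Console code.

```python
# Hypothetical feedback write-back: attach the human verdict to the MLflow run.
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow:5000")   # assumed tracking URI


def save_feedback(run_id: str, verdict: str, corrected_root_cause: str) -> None:
    # Tag names are illustrative; they mark the run as human-validated.
    client.set_tag(run_id, "aida.feedback.verdict", verdict)
    client.set_tag(run_id, "aida.feedback.root_cause", corrected_root_cause)
    client.set_tag(run_id, "aida.status", "validated")
```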
The export_data.py script gathers all such validated runs. This dataset is then used in the Google Colab notebook to fine-tune the model, creating a new, smarter adapter. This completes the learning loop, and the improved model is ready for deployment.
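Conceptually, the export step filters MLflow for validated runs and writes them out as a fine-tuning dataset, as in this sketch. The tag names, artifact names, and JSONL record schema are assumptions rather than the actual export_data.py.

```python
# Hypothetical export of human-validated runs into a JSONL fine-tuning dataset.
import json

import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://mlflow:5000")                       # assumed tracking URI
client = MlflowClient()

experiment = client.get_experiment_by_name("aida-investigations")   # assumed experiment name
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="tags.`aida.status` = 'validated'",               # assumed tag schema
)

with open("sre_finetune_dataset.jsonl", "w") as out:
    for run in runs:
        # Pair what the agent saw (the raw alert) with the human-validated answer.
        alert_path = client.download_artifacts(run.info.run_id, "raw_alert.json")
        with open(alert_path) as fh:
            alert_text = fh.read()
        record = {
            "prompt": alert_text,
            "completion": run.data.tags.get("aida.feedback.root_cause", ""),
        }
        out.write(json.dumps(record) + "\n")
```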
| Category | Technology |
|---|---|
| AI & Machine Learning | LangChain, Hugging Face (Transformers, PEFT, TRL), Sentence-Transformers, PyTorch, bitsandbytes |
| LLM Engine | OpenAI API (GPT-4o) and/or self-hosted fine-tuned Google Gemma-2B |
| MLOps & Experiment Tracking | MLflow |
| Backend & API | Python 3.11, FastAPI, Redis |
| Frontend & UI | Streamlit |
| Databases | PostgreSQL (for MLflow), ChromaDB (Vector Store), Redis (Queue) |
| DevOps & Infrastructure | Docker, Docker Compose, NVIDIA Container Toolkit |
AIDA/
├── aida_agent/
│ ├── training/
│ │ ├── aida-gemma-2b-sre-adapter-v1/ # Fine-tuned model adapter
│ │ ├── AIDA_Fine_Tuning.ipynb # Colab notebook for training
│ │ └── export_data.py # Script to export training data
│ ├── Dockerfile
│ ├── agent.py
│ ├── entrypoint.sh
│ ├── ingest.py
│ ├── requirements.txt
│ └── tools.py
├── docs/
│ ├── AIDA_architecture.png
│ └── screenshots/
├── feedback_ui/
├── runbooks/
├── webhook_api/
├── .env
├── .gitignore
├── docker-compose.yml
├── huggingface_token.txt
└── Mlflow.Dockerfile
- Docker & Docker Compose: Ensure they are installed on your system.
- NVIDIA GPU & Drivers: A CUDA-enabled NVIDIA GPU is required for the self-hosted model.
- NVIDIA Container Toolkit: Must be installed and configured to grant Docker access to the GPU.
- .env File: Create a .env file in the project root and add your OpenAI API key (this is optional if you only plan to use the local model):
OPENAI_API_KEY="sk-..."
- Hugging Face Secret: Create a file named huggingface_token.txt in the project root. Paste your Hugging Face access token into this file. This is required to download the Gemma-2B model. Add this file to your .gitignore:
echo "huggingface_token.txt" >> .gitignore
- Create External Volume: The PostgreSQL database requires an external volume to ensure its data persists. Create it once with:
docker volume create aida_postgres_data
- Build and Start: Perform a clean build and start all services. This will take a very long time on the first run as it downloads the large NVIDIA CUDA base image and all Python dependencies. It is highly recommended to use --no-cache on the first build:
docker compose build --no-cache
docker compose up -d
- Populate the Knowledge Base: Execute the ingestion script to process the runbooks into the vector database.
docker compose exec aida_agent python3 ingest.py
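As a mental model for what ingest.py does, the sketch below chunks the runbooks and stores them in a ChromaDB collection. The storage path, collection name, and chunking strategy are assumptions; the real logic is in ingest.py.

```python
# Hypothetical runbook ingestion into ChromaDB; paths and names are illustrative.
from pathlib import Path

import chromadb

client = chromadb.PersistentClient(path="/data/chroma")            # assumed storage path
collection = client.get_or_create_collection(name="runbooks")      # assumed collection name

for runbook in Path("/app/runbooks").glob("*.md"):                 # assumed runbook location
    text = runbook.read_text()
    # Naive fixed-size chunking; a production pipeline would split more carefully.
    chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]
    collection.add(
        documents=chunks,
        ids=[f"{runbook.stem}-{i}" for i in range(len(chunks))],
        metadatas=[{"source": runbook.name}] * len(chunks),
    )
```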
You can easily switch between your self-hosted model and the OpenAI API.
- Open docker-compose.yml.
- Find the aida_agent service.
- Change the USE_LOCAL_MODEL environment variable: USE_LOCAL_MODEL=true to use your fine-tuned Gemma-2B model, or USE_LOCAL_MODEL=false to use the OpenAI API.
- Restart the stack: docker compose up -d --build (a build is needed to copy any code changes).
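Inside the agent, the switch amounts to a branch on that variable. The sketch below shows one plausible implementation, assuming LangChain's OpenAI and Hugging Face wrappers, the google/gemma-2b base model, and the adapter directory from the project tree; the actual wiring lives in agent.py.

```python
# Hypothetical model selection based on USE_LOCAL_MODEL; the real code is in agent.py.
import os


def build_llm():
    if os.getenv("USE_LOCAL_MODEL", "false").lower() == "true":
        # Self-hosted path: base Gemma-2B plus the fine-tuned LoRA adapter (path assumed).
        import torch
        from langchain_huggingface import HuggingFacePipeline
        from peft import PeftModel
        from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

        base = AutoModelForCausalLM.from_pretrained(
            "google/gemma-2b", torch_dtype=torch.float16, device_map="auto"
        )
        model = PeftModel.from_pretrained(base, "training/aida-gemma-2b-sre-adapter-v1")
        tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
        pipe = pipeline(
            "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=512
        )
        return HuggingFacePipeline(pipeline=pipe)

    # Cloud path: OpenAI GPT-4o via LangChain.
    from langchain_openai import ChatOpenAI

    return ChatOpenAI(model="gpt-4o", temperature=0)
```

Building the model once at startup keeps the rest of the agent code agnostic to which engine is active.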