This project implements an advanced conversational agent that can:
- Ingest documents (e.g., PDFs) and store them in a Chroma vector database for retrieval-augmented generation (RAG).
- Dynamically route queries to web search or vectorstore retrieval.
- Integrate with external APIs (Foursquare, OpenStreetMap, Overpass, IPData) for location-based or IP-based queries.
- Generate natural language responses via Large Language Models (LLMs) such as HuggingFace Transformers or Llama Cpp.
- Leverage GPU acceleration for improved performance on EC2 instances.
It includes FastAPI endpoints for synchronous/asynchronous interaction, LangGraph-based state machines for multi-step conversation flows, and classification tools.
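As a rough illustration of the dynamic-routing idea, the sketch below builds a tiny LangGraph state machine that sends a query either to a retrieval node or a web-search node. It is a minimal sketch with assumed node names and a toy routing heuristic, not the project's actual graph (see `my_furhat_backend/agents/` for the real workflows).

```python
# Illustrative routing sketch using LangGraph; node names and the routing
# heuristic are assumptions, not the project's actual workflow.
from typing import TypedDict

from langgraph.graph import StateGraph, END


class AgentState(TypedDict):
    question: str
    answer: str


def route_question(state: AgentState) -> str:
    # Hypothetical heuristic: send time-sensitive queries to web search.
    return "web_search" if "latest" in state["question"].lower() else "retrieve"


def retrieve(state: AgentState) -> dict:
    return {"answer": f"(vectorstore answer to: {state['question']})"}


def web_search(state: AgentState) -> dict:
    return {"answer": f"(web search answer to: {state['question']})"}


graph = StateGraph(AgentState)
graph.add_node("router", lambda state: {"question": state["question"]})  # branching point only
graph.add_node("retrieve", retrieve)
graph.add_node("web_search", web_search)
graph.set_entry_point("router")
graph.add_conditional_edges("router", route_question, {"retrieve": "retrieve", "web_search": "web_search"})
graph.add_edge("retrieve", END)
graph.add_edge("web_search", END)

app = graph.compile()
print(app.invoke({"question": "What are the latest results?", "answer": ""}))
```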
- Conversational Agent Project
- Table of Contents
- Project Structure
- Requirements & Installation
- Configuration
- EC2 Instance Setup
- Connecting to the EC2 Instance
- Cloning the Repository & Running the Dev Container on the EC2 Instance
- Starting the Dev Container
- Usage
- Managing Caches
- Managing Dependencies with Poetry and pip
- Key Components
- Notes
- Contributing
A simplified view of the repository (showing the most important directories and files):
```
.
├── furhat_skills/ # (Optional) Skills or scripts for Furhat robot integration
├── middleware/
│ └── main.py # FastAPI endpoints for document agent interaction
├── my_furhat_backend/
│ ├── agents/
│ │ ├── document_agent.py # State-graph-based DocumentAgent for RAG conversations
│ │ ├── test_2_conversational_agent.py # Extended test workflow with dynamic routing & grading
│ │ └── test_conversational_agent.py # Basic RAG-based conversation test
│ ├── api_clients/
│ │ ├── foursquare_client.py # Foursquare API client
│ │ ├── osm_client.py # OSM (Nominatim) API client
│ │ ├── overpass_client.py # Overpass API client
│ │ └── ipdata_client.py # IPData API client
│ ├── config/
│ │ └── settings.py # Loads environment variables (dotenv)
│ ├── ingestion/
│ │ └── CMRPublished.pdf # Example PDF for ingestion (or other docs)
│ ├── llm_tools/
│ │ └── tools.py # Defines location-based search tools referencing api_clients
│ ├── models/
│ │ ├── chatbot_factory.py # Factory to build HuggingFace/Llama-based chatbots
│ │ ├── classifier.py # Zero-shot classifier
│ │ ├── llm_factory.py # Creates HuggingFace/Llama LLMs
│ │ └── model_pipeline.py # Manages HF pipelines (auth, loading, etc.)
│ ├── RAG/
│ │ └── rag_flow.py # RAG logic (loading, chunking, storing in Chroma)
│ ├── utils/
│ │ └── util.py # Utility functions for text cleaning & formatting
│ └── main.py # FastAPI app with /ask, /transcribe, /response endpoints
├── tests/ # Additional tests or integration tests
├── .env # Environment variables (API keys, etc.) – typically in .gitignore
├── .envrc # direnv configuration file
├── pyproject.toml # Poetry or other build-system configuration
├── poetry.lock # Poetry lockfile
└── README.md # Project documentation (this file)
```
The project is optimized for running on an EC2 instance with the following specifications:
- Instance Type: g4dn.xlarge (or similar GPU-enabled instance)
- GPU: NVIDIA Tesla T4
- Memory: 16 GB
- Storage: 100 GB GP3
The EC2 instance uses the following storage structure:
```
/mnt/
├── data/
│ ├── documents/ # PDF documents for RAG
│ └── vector_store/ # Chroma vector store
├── models/
│ ├── caches/ # Model caches
│ │ ├── huggingface/ # HuggingFace model cache
│ │ └── document_agent/ # Document agent cache
│ └── gguf/ # GGUF model files
```
- Model Files:
  - GGUF models are stored in `/mnt/models/gguf/`
  - Supported models include:
    - Mistral-7B-Instruct-v0.3.Q4_K_M.gguf
    - Mistral-Nemo-Instruct-2407-Q4_K_M.gguf
    - Mistral-Small-24B-Instruct-2501-Q4_K_M.gguf
    - SmolLM2-1.7B-Instruct-Q4_K_M.gguf
- Cache Management:
  - Model caches are stored in `/mnt/models/caches/`
  - To clear caches:
    ```bash
    sudo rm -rf /mnt/models/caches/huggingface/* /mnt/models/caches/document_agent/*
    ```
- Vector Store:
  - Located at `/mnt/data/vector_store/`
  - To clear the vector store:
    ```bash
    sudo rm -rf /mnt/data/vector_store/*
    ```
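As a quick illustration of how one of the GGUF models above can be loaded with llama-cpp-python, here is a minimal sketch; the context size and GPU-offload settings are assumptions, not the project's configured values.

```python
# Illustrative only; n_ctx and n_gpu_layers are assumptions, not project settings.
from llama_cpp import Llama

llm = Llama(
    model_path="/mnt/models/gguf/Mistral-7B-Instruct-v0.3.Q4_K_M.gguf",
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers to the GPU when built with CUDA support
)

output = llm("Q: What is retrieval-augmented generation? A:", max_tokens=128)
print(output["choices"][0]["text"])
```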
The project includes several GPU optimizations:
- LLM Processing:
  - Uses CUDA acceleration for model inference
  - Optimized batch processing for embeddings
  - GPU memory management with automatic cache clearing
- RAG System:
  - GPU-accelerated document embeddings
  - Optimized chunking parameters for better performance
  - Parallel processing for document chunking
  - Efficient vector store operations
- Memory Management:
  - Automatic GPU cache clearing
  - Memory usage monitoring
  - Efficient batch processing
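For reference, a minimal sketch (assuming PyTorch with CUDA available) of the kind of GPU cache clearing and memory monitoring described above:

```python
import torch


def report_gpu_memory(tag: str = "") -> None:
    """Print allocated/reserved GPU memory in MiB (no-op without CUDA)."""
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"[{tag}] allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")


def clear_gpu_cache() -> None:
    """Release cached GPU memory held by PyTorch's allocator."""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


report_gpu_memory("before")
clear_gpu_cache()
report_gpu_memory("after")
```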
- Security Groups
  - Inbound Rules:
    - SSH (Port 22) from your IP
    - HTTP (Port 80) from anywhere
    - HTTPS (Port 443) from anywhere
    - Custom TCP (Port 8000) for the FastAPI server
  - Outbound Rules:
    - All traffic allowed
- IAM Roles and Permissions
  - Create an IAM role with the following permissions:
    - AmazonS3FullAccess (for model storage)
    - CloudWatchLogsFullAccess (for logging)
    - AmazonEC2ContainerRegistryFullAccess (if using ECR)
- Server Configuration
  - Install required system packages:
    ```bash
    sudo apt-get update
    sudo apt-get install -y python3-pip python3-venv nvidia-driver-525
    ```
  - Configure NVIDIA drivers:
    ```bash
    sudo apt-get install -y nvidia-cuda-toolkit
    nvidia-smi  # Verify GPU is recognized
    ```
  - Set up environment variables:
    ```bash
    echo 'export PYTHONPATH=/mnt/data/app/my_furhat_backend:$PYTHONPATH' >> ~/.bashrc
    echo 'export CUDA_VISIBLE_DEVICES=0' >> ~/.bashrc
    source ~/.bashrc
    ```
- Production Deployment
  - Use a systemd service for automatic startup:
    ```bash
    sudo nano /etc/systemd/system/furhat-backend.service
    ```
  - Add the following configuration:
    ```ini
    [Unit]
    Description=Furhat Backend Service
    After=network.target

    [Service]
    User=ubuntu
    WorkingDirectory=/mnt/data/app/my_furhat_backend
    Environment="PATH=/mnt/data/app/my_furhat_backend/.venv/bin"
    ExecStart=/mnt/data/app/my_furhat_backend/.venv/bin/uvicorn middleware.main:app --host 0.0.0.0 --port 8000
    Restart=always

    [Install]
    WantedBy=multi-user.target
    ```
  - Enable and start the service:
    ```bash
    sudo systemctl enable furhat-backend
    sudo systemctl start furhat-backend
    ```
- Monitoring and Logging
  - Set up CloudWatch logging:
    ```bash
    sudo apt-get install -y amazon-cloudwatch-agent
    ```
  - Configure log rotation:
    ```bash
    sudo nano /etc/logrotate.d/furhat-backend
    ```
  - Add the configuration:
    ```
    /mnt/data/app/my_furhat_backend/logs/*.log {
        daily
        rotate 7
        compress
        delaycompress
        missingok
        notifempty
        create 0640 ubuntu ubuntu
    }
    ```
This project uses a hybrid dependency management approach: some dependencies are installed via pip into your environment, and others are managed by Poetry (tracked in the poetry.lock file).
- Clone the Repository
  ```bash
  git clone https://github.com/yourusername/yourproject.git
  cd yourproject
  ```
- Install Dependencies via Poetry
  If you haven't already, install Poetry globally. Then run:
  ```bash
  poetry lock
  poetry install
  ```
  This will update the `poetry.lock` file and install dependencies into Poetry's virtual environment.
- Activate the Poetry Environment
  It's important to activate your Poetry environment before installing any additional pip dependencies. See the Activating the Poetry Environment section below.
- (Optional) Install Additional pip Dependencies
  If there is a `requirements.txt` file with extra dependencies, activate your Poetry environment first and then run:
  ```bash
  pip install -r requirements.txt
  ```
  If you have issues building the wheel for `llama.cpp`, remove it from the requirements file and download the applicable version manually based on your hardware. More details can be found here: https://github.com/abetlen/llama-cpp-python.
Because the .env file is typically listed in .gitignore, you'll need to create your own locally:
- Create a `.env` file in the root directory of the project (same level as `pyproject.toml`).
- Add your API keys and environment variables. For example:
  ```
  FSQ_KEY=<YOUR_FOURSQUARE_API_KEY>
  IP_KEY=<YOUR_IPDATA_API_KEY>
  HF_KEY=<YOUR_HUGGINGFACE_API_KEY>
  ```
- Save the file. It will be loaded at runtime by `my_furhat_backend/config/settings.py`.
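For reference, a minimal sketch of the dotenv-based loading that `settings.py` performs; the exact variable names and error handling in the repository may differ.

```python
# Hypothetical sketch of dotenv-based settings loading; the real settings.py may differ.
import os

from dotenv import load_dotenv

# Read key/value pairs from the .env file into the process environment.
load_dotenv()

FSQ_KEY = os.getenv("FSQ_KEY")
IP_KEY = os.getenv("IP_KEY")
HF_KEY = os.getenv("HF_KEY")

if HF_KEY is None:
    raise RuntimeError("HF_KEY is not set; check your .env file.")
```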
If you use direnv to automatically load environment variables:
- You might see an error like:
  ```
  direnv: error /Users/{home_directory}/my_furhat_backend/.envrc is blocked. Run `direnv allow` to approve its content
  ```
- To approve the content of `.envrc`, run:
  ```bash
  direnv allow
  ```
  You may also need to run:
  ```bash
  source ~/.zshrc
  ```
  This loads the file and exports the necessary environment variables.
- Your `.envrc` may include:
  ```bash
  export PATH="$HOME/.local/bin:$PATH"
  ```
  This ensures that your local binaries are on your PATH.
This section provides instructions for connecting to the EC2 instance hosting the Conversational Agent both via a terminal and through VS Code.
- Instance Details:
  - Instance ID: `i-04ea4401b15d43473`
  - Public DNS: `ec2-51-21-200-54.eu-north-1.compute.amazonaws.com`
  - Private Key File: `Furhat_W25.pem` (the key used to launch the instance; you will have to create your own)
- Prepare Your Private Key: Ensure your private key is secure (i.e., not publicly viewable) by running:
  ```bash
  chmod 400 "path/to/Furhat_W25.pem"
  ```
- Connect Using SSH: Open your terminal and run:
  ```bash
  ssh -i "path/to/Furhat_W25.pem" ec2-user@ec2-51-21-200-54.eu-north-1.compute.amazonaws.com
  ```
  When prompted to verify the host's authenticity, type `yes`. Once connected, you'll be logged in as `ec2-user` on your EC2 instance.
VS Code's Remote - SSH extension allows you to seamlessly edit and develop on your remote EC2 instance.
- Install the Remote - SSH Extension:
  - Open VS Code and go to the Extensions view (press `Cmd+Shift+X` on Mac).
  - Search for "Remote - SSH" (by Microsoft) and install it.
- Configure Your SSH Host in VS Code:
  - Open the Command Palette (`Cmd+Shift+P`).
  - Select Remote-SSH: Add New SSH Host...
  - Enter the following SSH command:
    ```bash
    ssh -i "path/to/Furhat_W25.pem" ec2-user@ec2-51-21-200-54.eu-north-1.compute.amazonaws.com
    ```
  - Choose to save the configuration (this will typically update your `~/.ssh/config` file).
- Connect to the EC2 Instance:
  - Open the Command Palette again and select Remote-SSH: Connect to Host...
  - Choose your newly added host.
  - Accept any host key verification prompts.
  - VS Code will then open a new window connected to your EC2 instance, allowing you to work on files and run terminals on the remote server.
If you haven't already set up your EC2 instance with Git and Docker, follow these steps:
- Install Git:
  ```bash
  sudo yum update -y
  sudo yum install git -y
  ```
- Install Docker:
  ```bash
  sudo yum install docker -y
  ```
  (For Amazon Linux 2, you might also use `sudo amazon-linux-extras install docker -y` if available.)
- Start Docker and Configure Permissions:
  ```bash
  sudo systemctl start docker
  sudo systemctl enable docker
  sudo usermod -aG docker ec2-user
  newgrp docker  # Alternatively, log out and log back in to apply the group change
  ```
- Clone the Repository:
  ```bash
  git clone https://github.com/yourusername/yourproject.git
  cd yourproject
  ```
- Build and Run the Dev Container:
  Assuming your project structure is as follows:
  ```
  /my-furhat-backend
  ├── .devcontainer
  │   ├── devcontainer.json
  │   └── Dockerfile
  ├── pyproject.toml
  ├── poetry.lock
  └── ... (other project files)
  ```
  From the project root, build the Docker image:
  ```bash
  docker build -t my-furhat-backend -f .devcontainer/Dockerfile .
  ```
  Then, run the container interactively:
  ```bash
  docker run -it --rm -p 8000:8000 -v "$(pwd)":/app my-furhat-backend
  ```
  This command:
  - Runs the container in interactive mode with a terminal (`-it`)
  - Automatically removes the container when it exits (`--rm`)
  - Maps port 8000 of the container to port 8000 on the host (`-p 8000:8000`)
  - Mounts your current project directory into `/app` inside the container (`-v "$(pwd)":/app`)
This section explains how to launch an interactive development container using VS Code's Remote - Containers extension. You can use this container for interactive development either locally or on your EC2 instance.
- Open in Dev Container:
  - Open your project in VS Code.
  - Press `Ctrl+Shift+P` (or `Cmd+Shift+P` on macOS) and select Remote-Containers: Reopen in Container.
  - VS Code will build the container based on your `.devcontainer` configuration and open a new window connected to the container.
  - You now have an interactive development environment with all your dependencies installed.
- Clone Your Repository on the EC2 Instance:
  - SSH into your EC2 instance:
    ```bash
    ssh -i "path/to/Furhat_W25.pem" ec2-user@ec2-51-21-200-54.eu-north-1.compute.amazonaws.com
    ```
  - Clone your repository:
    ```bash
    git clone https://github.com/yourusername/yourproject.git
    cd yourproject
    ```
- Connect to the EC2 Instance Using VS Code Remote - SSH:
  - On your local machine, open VS Code and use the Remote - SSH extension to connect to your EC2 instance (as described in the Connecting to the EC2 Instance section).
- Open the Dev Container on EC2:
  - In the VS Code window connected to the EC2 instance, open the Command Palette and select Remote-Containers: Reopen in Container.
  - VS Code will use the dev container configuration (the same `.devcontainer` folder) to build and attach the dev container on the EC2 instance.
  - Now you have an interactive development environment running on your EC2 instance that mirrors your local dev container setup.
To start the server (after ensuring your environment is activated):
```bash
poetry run uvicorn middleware.main:app --host 0.0.0.0 --port 8000
```
Or, if using pip/venv:
```bash
uvicorn middleware.main:app --host 0.0.0.0 --port 8000
```
Endpoints:
- `POST /ask` - Synchronously processes a user query and returns an answer.
- `POST /transcribe` - Asynchronously handles transcriptions (stores the response for later retrieval).
- `GET /response` - Fetches the latest response generated by the agent.
- `POST /get_docs` - Fetches the name of the document that is most similar to the user's request.
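For orientation, here is a minimal sketch of what a handler such as `/ask` could look like; the request model and the way the agent is invoked are assumptions, not the actual code in `middleware/main.py`.

```python
# Hypothetical /ask endpoint sketch; middleware/main.py differs in its details.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Query(BaseModel):
    content: str


@app.post("/ask")
def ask(query: Query) -> dict:
    # The real service delegates to the DocumentAgent's RAG pipeline;
    # this sketch just echoes the query to stay self-contained.
    answer = f"(answer to: {query.content})"
    return {"response": answer}
```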
There are multiple ways to test or interact with the conversation agent:
- Basic RAG-based Conversation
  ```bash
  poetry run python my_furhat_backend/agents/test_conversational_agent.py
  ```
  This script starts a conversation loop using document ingestion and retrieval.
- Extended Workflow with Grading & Routing
  ```bash
  poetry run python my_furhat_backend/agents/test_2_conversational_agent.py
  ```
  This version adds dynamic routing (vector store vs. web search) and grading chains (document/answer grading).
- Middleware Service
  If you have a middleware service at `middleware/main.py`, run:
  ```bash
  poetry run python middleware/main.py
  ```
  Or, if it's a FastAPI service:
  ```bash
  uvicorn middleware.main:app --reload
  ```
The middleware provides several endpoints for interacting with the document agent:
- Interactive API Documentation
  - Access the Swagger UI at `http://localhost:8000/docs`
  - Access the ReDoc UI at `http://localhost:8000/redoc`
- Using cURL Commands
  a. Ask a Question (Synchronous):
  ```bash
  curl -X POST "http://localhost:8000/ask" \
    -H "Content-Type: application/json" \
    -d '{"content": "What is the MIMIR project?"}'
  ```
  b. Transcribe (Asynchronous):
  ```bash
  curl -X POST "http://localhost:8000/transcribe" \
    -H "Content-Type: application/json" \
    -d '{"content": "Tell me about the project timeline"}'
  ```
  c. Get Response (for async requests):
  ```bash
  curl "http://localhost:8000/response"
  ```
  d. Get Documents:
  ```bash
  curl -X POST "http://localhost:8000/get_docs" \
    -H "Content-Type: application/json" \
    -d '{"content": "Show me the annual report"}'
  ```
  e. Engage (Generate follow-up):
  ```bash
  curl -X POST "http://localhost:8000/engage" \
    -H "Content-Type: application/json" \
    -d '{
      "document": "NorwAi annual report 2023.pdf",
      "answer": "The project aims to investigate copyright-protected content in language models."
    }'
  ```
- Using a Python Script with Requests
  ```python
  import requests

  BASE_URL = "http://localhost:8000"


  def ask_question(question: str) -> dict:
      response = requests.post(f"{BASE_URL}/ask", json={"content": question})
      return response.json()


  def transcribe(text: str) -> dict:
      response = requests.post(f"{BASE_URL}/transcribe", json={"content": text})
      return response.json()


  def get_response() -> dict:
      response = requests.get(f"{BASE_URL}/response")
      return response.json()


  def get_docs(query: str) -> dict:
      response = requests.post(f"{BASE_URL}/get_docs", json={"content": query})
      return response.json()


  def engage(document: str, answer: str) -> dict:
      response = requests.post(
          f"{BASE_URL}/engage",
          json={"document": document, "answer": answer},
      )
      return response.json()


  # Example usage
  if __name__ == "__main__":
      # Ask a question
      result = ask_question("What is the MIMIR project?")
      print("Answer:", result["response"])

      # Generate a follow-up
      follow_up = engage("NorwAi annual report 2023.pdf", result["response"])
      print("Follow-up:", follow_up["prompt"])
  ```
- Document Agent
  - Handles document ingestion and RAG-based conversations
  - Uses GPU-accelerated embeddings for efficient retrieval
  - Implements state-graph-based conversation flow
  - Generates natural follow-up questions
- RAG System
  - Efficient document chunking with optimized parameters
  - GPU-accelerated embeddings using HuggingFace or LlamaCpp
  - Chroma vector store for fast similarity search
  - Cross-encoder reranking for improved relevance
- LLM Integration
  - Support for multiple GGUF models
  - GPU-accelerated inference
  - Optimized memory management
  - Efficient batch processing
- API Layer
  - FastAPI-based REST endpoints
  - Asynchronous request handling
  - Real-time response generation
  - Document management and retrieval
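As a rough illustration of the retrieval flow described above (Chroma similarity search followed by cross-encoder reranking), here is a minimal sketch; the embedding and reranker model names, import paths, and `k` values are assumptions, not the ones used in `rag_flow.py`.

```python
# Hypothetical retrieval sketch; rag_flow.py may use different models and parameters.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from sentence_transformers import CrossEncoder

# Embed queries on the GPU if available (device choice is an assumption).
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda"},
)

# Open the persisted Chroma store (path matches the EC2 layout described earlier).
vector_store = Chroma(
    persist_directory="/mnt/data/vector_store",
    embedding_function=embeddings,
)

query = "What is the MIMIR project?"
candidates = vector_store.similarity_search(query, k=10)

# Rerank the candidates with a cross-encoder and keep the best few.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc.page_content) for doc in candidates])
top_docs = [doc for _, doc in sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)[:3]]

for doc in top_docs:
    print(doc.page_content[:200])
```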
- GPU Usage
  - Monitor GPU memory usage with `nvidia-smi`
  - Clear the GPU cache when needed using the provided utilities
  - Adjust batch sizes based on available GPU memory
- Performance Optimization
  - Use appropriate chunk sizes for document processing
  - Monitor vector store size and performance
  - Clear caches periodically to prevent memory issues
  - Use batch processing for large document sets
- Error Handling
  - Check GPU memory before large operations
  - Monitor vector store integrity
  - Handle model loading errors gracefully
  - Implement proper error logging
- Security
  - Secure API endpoints with authentication
  - Protect sensitive document content
  - Implement rate limiting
  - Monitor system resources
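To make the GPU and error-handling notes above concrete, here is a minimal sketch; the helper names and the memory threshold are assumptions, not utilities that ship with the project.

```python
# Illustrative error-handling sketch; helper names and thresholds are assumptions.
import logging

import torch

logger = logging.getLogger("furhat_backend")


def has_free_gpu_memory(required_gib: float = 2.0) -> bool:
    """Return True if at least `required_gib` GiB of GPU memory appears free."""
    if not torch.cuda.is_available():
        return False
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes >= required_gib * 1024**3


def load_model_or_log(loader, *args, **kwargs):
    """Run a model-loading callable, logging and re-raising on failure."""
    try:
        return loader(*args, **kwargs)
    except Exception:
        logger.exception("Model loading failed")
        torch.cuda.empty_cache()  # release any partially allocated GPU memory
        raise
```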
The system uses several types of caches to improve performance:
- Question cache: Stores previously asked questions and answers
- Context cache: Stores document context for faster retrieval
- GPU cache: Stores model weights and computations
- Conversation memory: Stores recent conversation history
- Summary cache: Stores document summaries
There are several ways to clear the caches:
You can clear all caches by making a POST request to the /clear_caches endpoint:
```bash
curl -X POST http://localhost:8000/clear_caches
```
You can clear caches programmatically using the DocumentAgent class:
```python
from my_furhat_backend.agents.document_agent import DocumentAgent

# Initialize the agent
agent = DocumentAgent()

# Clear all caches
agent.clear_all_caches()
```
You can also manually clear the caches by:
- Deleting the cache files:
  ```bash
  rm -rf /mnt/models/caches/huggingface/question_cache.json
  ```
- Clearing the GPU cache:
  ```python
  import torch
  torch.cuda.empty_cache()
  ```
- Restarting the FastAPI server:
  ```bash
  # Stop the current server
  pkill -f "uvicorn middleware.main:app"
  # Start a new server
  uvicorn middleware.main:app --host 0.0.0.0 --port 8000
  ```
Note: Caches are automatically cleared when:
- The context window is exceeded
- Processing errors occur
- The server is restarted
You can also clear the caches from the command line:
```bash
python -c "from my_furhat_backend.agents.document_agent import DocumentAgent; agent = DocumentAgent(); agent.clear_all_caches()"
```
This project employs a hybrid dependency strategy:
- Poetry: Manages the core dependencies (tracked in `poetry.lock`).
- pip: Some additional dependencies may be installed via pip.
Correct Flow:
- Install Poetry and run:
  ```bash
  poetry lock
  poetry install
  ```
- Activate the Poetry environment (see Activating the Poetry Environment).
- Once inside the Poetry environment, if needed, install extra dependencies from `requirements.txt`:
  ```bash
  pip install -r requirements.txt
  ```
When using Poetry, you typically run:
```bash
poetry shell
```
If that command does not work as expected or you prefer a manual approach, activate the environment with:
```bash
source $(poetry env info --path)/bin/activate
```
This command retrieves the virtual environment's path from `poetry env info --path` and activates it manually.
- Project Directory: Ensure you're in the directory containing your `pyproject.toml` file, as some Poetry commands require it.
- Check for Aliases: Verify there is no shell alias or function named `poetry` or `shell` interfering with the command.
- Reinstall/Upgrade Poetry: Although version 2.0.1 should work, if issues persist, consider reinstalling or updating Poetry.
- Manual Activation: If `poetry shell` fails, using `source $(poetry env info --path)/bin/activate` should let you work within your Poetry virtual environment.
- `main.py` in `my_furhat_backend/`: FastAPI app exposing `/ask`, `/transcribe`, and `/response` endpoints.
- `rag_flow.py` in `RAG/`: Handles document ingestion, chunking, storage in a Chroma vector store, and retrieval with optional cross-encoder reranking.
- `document_agent.py`: Defines a StateGraph workflow for capturing user input, retrieving context, checking uncertainty, and generating responses.
- `chatbot_factory.py`, `llm_factory.py`, `model_pipeline.py`: Provide factory methods and pipelines for constructing language models (HuggingFace or Llama Cpp) and chatbots.
- `llm_tools/tools.py`: Integration points (tools) that the agent can invoke for location or place queries, referencing the various API clients in `api_clients/`.
- `middleware/main.py`: Additional FastAPI or bridging service to connect the conversation agent with other systems.
- `tests/`: Contains additional tests or integration checks.
- Currently, the tools in `llm_tools/tools.py` and the APIs in `api_clients` are not being used. They were created at the early stages of the project, when the original scope was for an agent acting as a concierge.
Contributions are welcome! If you have ideas or bug fixes:
- Fork this repository.
- Create a new branch for your changes.
- Submit a pull request describing your enhancements.
Enjoy using your Conversational Agent! If you have any questions or run into issues, feel free to open an issue or reach out for support.