ChaseH01/NorthJerseyProject
North Jersey Project: Tony Soprano AI

The North Jersey Project is a scalable, cloud-native conversational AI service that simulates the persona of Tony Soprano. It showcases the full lifecycle of a production-ready AI application: specialized model fine-tuning, containerized deployment, and comprehensive system observability.


🤌 Project Overview

We built a specialized chatbot capable of character-accurate dialogue generation by fine-tuning a Large Language Model (LLM) and deploying it within a high-performance, auto-scaling infrastructure.


🧠 Model & Fine-Tuning

Base Model

  • Utilized Microsoft Phi-3 Mini Instruct, a lightweight 3.8B parameter model selected for its efficiency and reasoning capabilities.

Fine-Tuning Process

  • Performed Low-Rank Adaptation (LoRA) on a Google Colab TPU.
  • Trained LoRA adapter weights amounting to only ~0.33% of the base model's parameter count.
  • Achieved a character-specific linguistic style with a training loss of 0.91.
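The ~0.33% figure can be sanity-checked with back-of-envelope arithmetic: LoRA adds two low-rank factors (r × d_in and d_out × r) per adapted weight matrix. The rank and the set of adapted projections below are illustrative assumptions, not the actual training configuration:

```python
# Back-of-envelope check of the "~0.33% of parameters" figure.
# Rank, adapted projections, and layer dims are assumptions for
# illustration, not the repo's actual LoRA config.

def lora_params(r: int, shapes: list[tuple[int, int]]) -> int:
    """Trainable parameter count for LoRA adapters over the given
    weight shapes: each (d_out, d_in) matrix gains factors
    A (r x d_in) and B (d_out x r)."""
    return sum(r * d_in + d_out * r for d_out, d_in in shapes)

hidden = 3072                      # Phi-3 Mini hidden size
layers = 32                        # transformer layers
attn = [(hidden, hidden)] * 4      # assume q/k/v/o projections adapted
r = 16                             # assumed LoRA rank

trainable = lora_params(r, attn * layers)
total = 3_800_000_000              # ~3.8B base parameters
print(f"{trainable:,} trainable (~{100 * trainable / total:.2f}% of base)")
```

With these assumed hyperparameters the adapter comes out to roughly 12.6M weights, about 0.33% of 3.8B, which is consistent with the figure above.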

Dataset

  • Curated a specialized dataset of The Sopranos dialogue.
  • Supplemented with synthetic data to improve character behavior.

Optimization

  • Converted the model from PyTorch to GGUF format.
  • Served the model with llama.cpp, which runs the quantized GGUF weights with markedly faster inference on CPU in production.

โ˜๏ธ Cloud Infrastructure & Deployment

Orchestration

  • Deployed on Google Kubernetes Engine (GKE).
  • Used C2 (Compute-Optimized) nodes with 8 vCPUs to handle heavy inference loads.

Scalability

  • Implemented a Horizontal Pod Autoscaler (HPA).
  • Automatically spins up new pods when CPU usage exceeds 50%.
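The behavior above maps to a short HPA manifest. The resource names and replica bounds below are placeholders; only the 50% CPU target reflects the configuration described:

```yaml
# Sketch of the HPA; names and replica bounds are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: soprano-bot-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: soprano-bot
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```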

Cost Efficiency

  • Leveraged GKE Spot Instances to reduce infrastructure costs while maintaining high availability.

Frontend

  • Built with Next.js (TypeScript/React).
  • Deployed on Vercel with automatic SSL provisioning.

📊 Observability & Reliability

Monitoring

  • Integrated Prometheus to scrape real-time cluster metrics.
  • Provides visibility into CPU and memory spikes.

Visualization

  • Built custom Grafana dashboards.
  • Tracks pod health and request latency.
  • Verifies that auto-scaling triggers correctly under load.
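Dashboards like these are typically driven by a handful of PromQL queries. The metric names below are common defaults (cAdvisor container metrics and a standard request-duration histogram) and may differ from the actual instrumentation in this repo:

```promql
# Per-pod CPU usage, to watch the HPA's 50% trigger (cAdvisor metric):
sum(rate(container_cpu_usage_seconds_total{pod=~"soprano-bot.*"}[5m])) by (pod)

# p95 request latency, assuming a conventional Prometheus histogram:
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```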

Infrastructure as Code

  • Managed deployments using Helm for simplified Kubernetes configuration.
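With Helm, the knobs above (image, CPU resources, autoscaling bounds) usually live in a values.yaml. The fragment below is a hedged sketch; every key and value is a placeholder rather than the chart's real schema:

```yaml
# Illustrative values.yaml fragment; keys and values are placeholders.
image:
  repository: mdallolmo1/soprano-bot
  tag: latest
resources:
  requests:
    cpu: "4"
  limits:
    cpu: "8"
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 5
  targetCPUUtilizationPercentage: 50
```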

๐Ÿ› ๏ธ Tech Stack

AI / ML

  • Microsoft Phi-3
  • LoRA
  • Hugging Face
  • llama.cpp (GGUF)

Backend

  • FastAPI
  • Python
  • Pydantic

Frontend

  • Next.js
  • React
  • TypeScript
  • Tailwind CSS

DevOps

  • Docker
  • Kubernetes (GKE)
  • Helm
  • Prometheus
  • Grafana
  • Vercel

Architecture

  • Backend: FastAPI server using llama.cpp with GGUF model format
  • Frontend: Next.js application with TypeScript and Tailwind CSS
  • Model: Fine-tuned Phi-3 model with custom adapter weights

Requirements

  • Python 3.9+
  • Node.js 18+
  • tony.gguf model file (place in project root)

Setup

Backend

pip install -r requirements.txt
pip install llama-cpp-python
python main.py

Server runs on http://localhost:8000
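Once the server is up, requests go to the chat endpoint. The route path and JSON payload shape below are assumptions (check main.py for the actual API); this sketch only constructs the request with the standard library and does not send it:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/chat"  # hypothetical route; see main.py

def build_chat_request(messages: list[dict]) -> urllib.request.Request:
    """Build a POST request for the chat endpoint. The payload shape
    {"messages": [...]} is an assumption about the API schema."""
    body = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Accept": "text/event-stream",  # SSE streaming response
        },
        method="POST",
    )

req = build_chat_request([{"role": "user", "content": "How you doin'?"}])
print(req.full_url, req.get_method())
# Actually sending it (urllib.request.urlopen(req)) needs the server running.
```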

Frontend

cd frontend
npm install
npm run dev

Set NEXT_PUBLIC_API_ENDPOINT environment variable to your backend URL.

Features

  • Streaming responses via Server-Sent Events (SSE)
  • Multi-turn conversation with context management
  • Token-based input limits and history truncation
  • Prometheus metrics instrumentation
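Two of these features are easy to sketch in pure Python: truncating history to a token budget (whitespace counts stand in here for the model's real tokenizer) and formatting SSE frames:

```python
def truncate_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most recent messages that fit the token budget.
    Whitespace token counts approximate the real tokenizer."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk newest-first
        cost = len(msg["content"].split())
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))             # restore chronological order

def sse_frame(chunk: str) -> str:
    """Format one Server-Sent Events frame: 'data: ...' plus a blank line."""
    return f"data: {chunk}\n\n"

history = [
    {"role": "user", "content": "a b c d e"},   # 5 "tokens"
    {"role": "assistant", "content": "f g h"},  # 3 "tokens"
    {"role": "user", "content": "i j"},         # 2 "tokens"
]
print(truncate_history(history, max_tokens=6))  # drops the oldest message
print(repr(sse_frame("gabagool")))
```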

Deployment

Docker and Kubernetes configurations are included. The Dockerfile builds a containerized version of the backend service.

Model Training

Training data and scripts are located in data/ and training/. The model combines base Phi-3 weights with a custom LoRA adapter trained on character-specific dialogue.
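Chat fine-tuning data is commonly serialized as JSONL, one message list per line. The exact schema used in data/ is not shown here, so the record shape below is an assumption for illustration:

```python
import json

# Hypothetical raw dialogue pairs; the real examples live in data/.
pairs = [
    ("What do you do for a living?", "Waste management consultant."),
]

def to_chat_record(user: str, assistant: str) -> str:
    """Serialize one training example as a JSON line in a common
    chat-messages schema (an assumption about this repo's format)."""
    record = {"messages": [
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]}
    return json.dumps(record)

lines = [to_chat_record(u, a) for u, a in pairs]
print(lines[0])
```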

Docker Hub

https://hub.docker.com/r/mdallolmo1/soprano-bot
