The North Jersey Project is a scalable, cloud-native conversational AI service that simulates the persona of Tony Soprano. It showcases the full lifecycle of a production-ready AI application: specialized model fine-tuning, containerized deployment, and comprehensive system observability.
We built a specialized chatbot capable of character-accurate dialogue generation by fine-tuning a Large Language Model (LLM) and deploying it on high-performance, auto-scaling infrastructure.
- Utilized Microsoft Phi-3 Mini Instruct, a lightweight 3.8B parameter model selected for its efficiency and reasoning capabilities.
- Performed Low-Rank Adaptation (LoRA) on a Google Colab TPU.
- Adjusted only ~0.33% of the model parameters.
- Achieved a character-specific linguistic style with a training loss of 0.91.
- Curated a specialized dataset of The Sopranos dialogue.
- Supplemented with synthetic data to improve character behavior.
- Converted the model from PyTorch to GGUF format.
- Utilized llama.cpp for significantly improved inference speed in production.
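As a rough sanity check on the ~0.33% figure above, the sketch below estimates LoRA adapter size using Phi-3 Mini's published dimensions (32 layers, hidden size 3072, MLP intermediate size 8192, fused `qkv_proj` and `gate_up_proj` projections). The rank and target modules are assumptions for illustration, not this project's actual training configuration.

```python
# Estimate the fraction of parameters a LoRA adapter trains on Phi-3 Mini.
# Architecture dims come from Phi-3 Mini's published config; the rank and
# target modules are illustrative assumptions, not this project's setup.

HIDDEN = 3072          # hidden size
INTERMEDIATE = 8192    # MLP intermediate size
LAYERS = 32            # transformer layers
TOTAL_PARAMS = 3.8e9   # ~3.8B base parameters
RANK = 8               # assumed LoRA rank

# (in_features, out_features) of the linear layers LoRA commonly targets;
# Phi-3 fuses q/k/v into qkv_proj and gate/up into gate_up_proj.
targets = {
    "qkv_proj": (HIDDEN, 3 * HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_up_proj": (HIDDEN, 2 * INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each LoRA pair adds an (in x r) and an (r x out) matrix per layer.
adapter_params = LAYERS * sum(
    RANK * (fan_in + fan_out) for fan_in, fan_out in targets.values()
)
fraction = adapter_params / TOTAL_PARAMS
print(f"{adapter_params:,} trainable params ≈ {fraction:.2%} of the base model")
# → 12,582,912 trainable params ≈ 0.33% of the base model
```

With these assumed dimensions, a rank-8 adapter over the attention and MLP projections lands almost exactly on the ~0.33% trainable fraction quoted above.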
- Deployed on Google Kubernetes Engine (GKE).
- Used C2 (Compute-Optimized) nodes with 8 vCPUs to handle heavy inference loads.
- Implemented a Horizontal Pod Autoscaler (HPA).
- Automatically spins up new pods when CPU usage exceeds 50%.
- Leveraged GKE Spot VMs to reduce infrastructure costs while maintaining high availability.
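The autoscaling behavior above can be expressed as a standard `autoscaling/v2` HPA manifest; the sketch below is illustrative (the deployment name and replica bounds are assumptions, not taken from the project's actual Helm chart).

```yaml
# Illustrative HPA sketch; deployment name and replica bounds are assumed.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tony-backend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tony-backend
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```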
- Built with Next.js (TypeScript/React).
- Deployed on Vercel with automatic SSL provisioning.
- Integrated Prometheus to scrape real-time cluster metrics.
- Provides visibility into CPU and memory spikes.
- Built custom Grafana dashboards.
- Tracks pod health and request latency.
- Verifies that auto-scaling triggers correctly under load.
- Managed deployments using Helm for simplified Kubernetes configuration.
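Dashboards like those described above typically sit on a handful of PromQL queries. The examples below are illustrative: the first uses the standard cAdvisor CPU metric GKE exposes, and the second assumes the backend exports a conventional `http_request_duration_seconds` histogram (the namespace and metric names are assumptions).

```promql
# Per-pod CPU usage over the last 5 minutes (cAdvisor metric)
rate(container_cpu_usage_seconds_total{namespace="default"}[5m])

# p95 request latency, assuming the backend exports a
# http_request_duration_seconds histogram
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m]))
```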
- Microsoft Phi-3
- LoRA
- Hugging Face
- llama.cpp (GGUF)
- FastAPI
- Python
- Pydantic
- Next.js
- React
- TypeScript
- Tailwind CSS
- Docker
- Kubernetes (GKE)
- Helm
- Prometheus
- Grafana
- Vercel
- Backend: FastAPI server using llama.cpp with GGUF model format
- Frontend: Next.js application with TypeScript and Tailwind CSS
- Model: Fine-tuned Phi-3 model with custom adapter weights
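A minimal sketch of how these pieces fit together: the backend renders multi-turn conversations into Phi-3's instruct chat template before handing the string to llama.cpp for generation. The function name and example content are illustrative, not the project's actual code.

```python
# Render a multi-turn conversation into Phi-3's instruct chat template.
# Illustrative sketch, not the project's actual backend code; the resulting
# string would be passed to a llama.cpp model loaded from tony.gguf.

def build_prompt(system: str, turns: list[tuple[str, str]]) -> str:
    """turns is a list of (role, text) pairs, role in {"user", "assistant"}."""
    parts = [f"<|system|>\n{system}<|end|>\n"]
    for role, text in turns:
        parts.append(f"<|{role}|>\n{text}<|end|>\n")
    # A trailing assistant tag tells the model to generate the next reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)

prompt = build_prompt(
    "You are Tony Soprano.",
    [("user", "How's business?")],
)
print(prompt)
```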
- Python 3.9+
- Node.js 18+
- `tony.gguf` model file (place in project root)
```bash
pip install -r requirements.txt
pip install llama-cpp-python
python main.py
```

The server runs on http://localhost:8000.
```bash
cd frontend
npm install
npm run dev
```

Set the `NEXT_PUBLIC_API_ENDPOINT` environment variable to your backend URL.
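For local development, this can live in a `.env.local` file under `frontend/` (the URL below assumes the default backend address from the setup steps above):

```bash
# frontend/.env.local
NEXT_PUBLIC_API_ENDPOINT=http://localhost:8000
```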
- Streaming responses via Server-Sent Events (SSE)
- Multi-turn conversation with context management
- Token-based input limits and history truncation
- Prometheus metrics instrumentation
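The token-based history truncation can be sketched as below. The whitespace "tokenizer" and the budget are stand-ins for the real model tokenizer and context window; function names are illustrative, not the project's actual code.

```python
# Drop the oldest conversation turns until the history fits a token budget.
# A real implementation would count tokens with the model's tokenizer;
# whitespace splitting here is a stand-in, and the budget is illustrative.

def count_tokens(text: str) -> int:
    return len(text.split())

def truncate_history(turns: list[str], budget: int) -> list[str]:
    """Keep the most recent turns whose combined token count fits the budget."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):       # walk newest-first
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))        # restore chronological order

history = [
    "so what did the doctor say",
    "nothing you need to worry about",
    "come on Tone talk to me",
    "eh what are you gonna do",
]
trimmed = truncate_history(history, budget=12)
print(trimmed)
# → ['come on Tone talk to me', 'eh what are you gonna do']
```

Walking the history newest-first guarantees the most recent context survives; older turns are the ones sacrificed when the budget is exceeded.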
Docker and Kubernetes configurations are included. The Dockerfile builds a containerized version of the backend service.
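The included Dockerfile is not reproduced here; a minimal equivalent for the backend might look like the sketch below (base image, file names, and port are assumptions derived from the setup steps above, not the project's actual Dockerfile).

```dockerfile
# Illustrative sketch, not the project's actual Dockerfile.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt llama-cpp-python
COPY . .
# The tony.gguf model is expected in the project root (see setup above).
EXPOSE 8000
CMD ["python", "main.py"]
```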
Training data and scripts are located in `data/` and `training/`. The model combines base Phi-3 weights with a custom LoRA adapter trained on character-specific dialogue.