A personal LLM leveraging my corpus as its knowledge base.
Data sources:
- Google
  - Emails (mbox)
  - Calendar (ics)
  - Google Search History (html, json)
  - YouTube Search History (html)
  - Contacts (vcf)
  - Drive (docx)
  - Maps (csv)
- Apple
  - Notes (txt)
  - Contacts (vcf)
  - Messages (csv)
  - Calendar (ics)
Currently included data:
- Apple notes
- Apple contacts
- Apple messages
- Google calendar
- Google maps
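
As an illustration of how one of these exports might be turned into text for embedding, here is a minimal sketch for contacts (vcf). It is not the project's actual loader: the `Contact` class and both function names are hypothetical, and it reads only the FN, TEL, and EMAIL properties, ignoring line folding and escaping.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Contact:
    # Hypothetical container for the few fields this sketch extracts.
    name: str = ""
    phones: list = field(default_factory=list)
    emails: list = field(default_factory=list)


def parse_vcf(text: str) -> list:
    """Parse a small subset of vCard: FN, TEL, and EMAIL properties only."""
    contacts = []
    current: Optional[Contact] = None
    for raw in text.splitlines():
        line = raw.strip()
        upper = line.upper()
        if upper == "BEGIN:VCARD":
            current = Contact()
        elif upper == "END:VCARD":
            if current is not None:
                contacts.append(current)
            current = None
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            # Drop parameters ("TEL;type=CELL") and group prefixes ("item1.TEL").
            prop = key.split(";", 1)[0].split(".")[-1].upper()
            if prop == "FN":
                current.name = value
            elif prop == "TEL":
                current.phones.append(value)
            elif prop == "EMAIL":
                current.emails.append(value)
    return contacts


def contact_to_text(contact: Contact) -> str:
    """Flatten a contact into one line of text suitable for embedding."""
    parts = [contact.name]
    if contact.phones:
        parts.append("phone: " + ", ".join(contact.phones))
    if contact.emails:
        parts.append("email: " + ", ".join(contact.emails))
    return " | ".join(parts)
```

Flattening each record into a single line keeps one contact per chunk, which tends to make retrieval results easier to attribute.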
Approach:
- Download data from the sources via Google Takeout and iCloud exports
- Process the data: clean, chunk, and generate embeddings
- Store the embeddings in a vector DB
- Use an open-source LLM for inference with RAG
- Use Discord or another app to serve as MemEx's UI
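
The RAG step above comes down to placing retrieved chunks in front of the question before calling the LLM. The sketch below shows one way to assemble such a prompt. The function name, the prompt wording, and the `max_context_chars` budget are assumptions for illustration, not the project's actual prompt.

```python
def build_rag_prompt(question: str, chunks: list, max_context_chars: int = 4000) -> str:
    """Assemble a prompt that asks the model to answer from retrieved chunks."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_context_chars:
            break  # stop before exceeding the context budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the chunks makes it possible to ask the model to cite which snippet it relied on.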
Tech stack:
- Parsing, chunking, preprocessing
  - LangChain + Unstructured.io
- Embedding
  - BAAI/bge-large-en-v1.5 + Together.ai
- Vector DB
  - Chroma
- Retriever & RAG Model
  - meta-llama/Llama-3.3-70B-Instruct-Turbo
  - Qwen/Qwen2.5-7B-Instruct-Turbo
- Optional UI
  - Discord / iMessage (using BlueBubbles) / Gradio
- Backend hosting
  - Fly.io
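
To make the role of the vector DB concrete, here is a toy in-memory stand-in that ranks stored texts by cosine similarity to a query embedding. It is not Chroma and does not use its API. `TinyVectorStore` is a hypothetical name, and real embeddings would come from the embedding model rather than hand-written vectors.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Brute-force nearest-neighbour store, for illustration only."""

    def __init__(self):
        self._items = []  # list of (text, embedding) pairs

    def add(self, text: str, embedding) -> None:
        self._items.append((text, list(embedding)))

    def query(self, embedding, k: int = 3) -> list:
        # Score every stored item, then return the k most similar texts.
        scored = [(cosine_similarity(embedding, e), t) for t, e in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]
```

A real vector DB adds persistence and approximate search on top of this idea, but the retrieval contract is the same: embedding in, top-k texts out.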
Setup:
- Create .env with the required env variables (reference: .env.example)
- Generate embeddings from the data directory and store them in the Chroma vector DB
python3 src/scripts/generate_embeddings.py --folder_path ./data --chunk_max_characters 1500
- Launch FastAPI server to handle requests between LLM and messaging services
python3 -m src.app
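
The `--chunk_max_characters 1500` flag above caps the size of each chunk. The sketch below shows one plausible way to enforce such a cap: pack paragraphs greedily and hard-split any paragraph that is too long on its own. This is an assumption about the behaviour, not the script's actual logic (which per the tech stack relies on Unstructured.io), and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, max_characters: int = 1500) -> list:
    """Greedily pack paragraphs into chunks of at most max_characters."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split paragraphs that exceed the limit on their own.
        pieces = [
            paragraph[i:i + max_characters]
            for i in range(0, len(paragraph), max_characters)
        ]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_characters:
                current = candidate
            else:
                # The piece does not fit: close the current chunk, start a new one.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Smaller chunks give more precise retrieval hits, while larger ones carry more surrounding context per hit. The cap trades one off against the other.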
To deploy onto fly.io:
- fly auth login
- fly deploy (fly launch if first time)
  - a volume should be provisioned alongside
- Transfer local chroma_db data into the volume
  - Copy the sqlite3 file
    - fly ssh sftp shell
    - put ./chroma_db/chroma.sqlite3 /chroma_db/chroma.sqlite3
    - exit
  - Create the 36e27b04-a9... directory
    - fly ssh console
    - mkdir -p /chroma_db/36e27b04-a9...
    - exit
  - Copy the remaining files
    - fly ssh sftp shell
    - put ./chroma_db/…/data_level0.bin
    - repeat for each remaining file
If the connection is lost:
- restart the machine on the fly.io dashboard, OR
- exit and then run fly ssh sftp shell again
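
Typing one put line per file during the transfer is tedious. A small helper could print the lines to paste into the sftp shell. This is a hypothetical convenience sketch, not part of the repo. It assumes the remote root is /chroma_db as in the steps above, that remote subdirectories already exist (see the mkdir -p step), and that paths contain no spaces.

```python
from pathlib import Path


def sftp_put_commands(local_root: str, remote_root: str = "/chroma_db") -> list:
    """Build one 'put <local> <remote>' line for every file under local_root."""
    root = Path(local_root)
    commands = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Mirror the local layout under the remote root.
            relative = path.relative_to(root).as_posix()
            commands.append(f"put {path.as_posix()} {remote_root}/{relative}")
    return commands


if __name__ == "__main__":
    for command in sftp_put_commands("./chroma_db"):
        print(command)
```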
Troubleshooting:
[nltk_data] Error loading punkt_tab: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1000)>
Fix: sudo /Applications/Python\ 3.12/Install\ Certificates.command
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
Fix: brew install poppler
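
The poppler error above means pdf2image could not find poppler's command-line tools. A quick preflight check before generating embeddings might look like the sketch below. `missing_tools` is a hypothetical helper, and the default tool names are the poppler binaries pdf2image calls.

```python
import shutil


def missing_tools(tools=("pdfinfo", "pdftoppm")) -> list:
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
        print("On macOS, try: brew install poppler")
    else:
        print("poppler tools found")
```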
- BlueBubbles for iMessage API reference: https://documenter.getpostman.com/view/765844/UV5RnfwM#0d8e0e67-fa3b-4446-aa2c-062dca2ce4cd
This project idea was inspired by this post from Linus: https://x.com/thesephist/status/1629272600156176386