Original Specification

Phase 1

Build System: Use Poetry for dependency and environment management.
REST API: Use FastAPI to create the RESTful interface. This will expose endpoints for querying the indexed documents.
Indexing and Vector Store: Use LlamaIndex for indexing the documents, applying context-aware chunking via RecursiveCharacterTextSplitter (a LangChain splitter, which LlamaIndex can wrap through its LangchainNodeParser). Use Chroma as the vector store, which is thread-safe and suited for concurrent requests.
Embedding Model: Use OpenAI’s latest embedding model, currently “text-embedding-3-large” (or whichever model is the latest at build time). This model will convert text chunks into embeddings for indexing and querying.
LLM for Summarization: Use Claude 4.5 Haiku for generating summaries of documents and surrounding chunks. This helps enable context-aware chunking and enhances the quality of the retrieval.
Tokenizer/Embedding Generation: Use OpenAI’s tokenizer (such as tiktoken) for tokenization, which is thread-safe for concurrent use. The embedding model will use this tokenization under the hood when generating vectors.
Claude Skill Integration: Add a Claude skill that interfaces with the REST API, allowing you to query the vector store during generation tasks. This enables dynamic look-ups from the indexed corpus while generating content.
Chunk Sizes: For context-aware chunking, consider a base chunk size of around 512 to 1024 tokens, with overlap of around 50 to 100 tokens. You can adjust these sizes after testing with your specific documents.
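
To make the chunk size and overlap numbers concrete, here is a minimal sketch of fixed-size windowing with overlap. It is not the splitter named above; the function name chunk_tokens is hypothetical, and the input is any pre-tokenized list (whitespace tokens in the example stand in for real tokenizer output).

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 512, overlap: int = 50) -> List[List[str]]:
    """Split a token sequence into windows of `chunk_size` that share `overlap` tokens.

    Hypothetical helper for illustration; defaults follow the ranges in the spec.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Whitespace splitting is only a stand-in for a real tokenizer.
example = chunk_tokens("a b c d e f g h i j".split(), chunk_size=4, overlap=1)
```

With a chunk size of 4 and an overlap of 1, each window starts 3 tokens after the previous one, so the last token of one chunk is repeated as the first token of the next.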

This combination gives you a powerful, scalable stack for indexing, embedding, querying, and generating context-aware results via a REST API.

Mono repo

/
   docs/ 
   agent-brain-skill/
   agent-brain-server/
   agent-brain-cli/ (Command line interface to agent-brain-server)

The command-line tool, called "agent-brain", takes a path to a folder containing documents and a port number to run on, then launches the server.
When it starts, it indexes all the documents in that folder using OpenAI embeddings and stores them in the Chroma vector store.
The tool will expose health endpoints—likely something like a /health or /status route—to indicate if it's up, if indexing is in progress, or if it's finished and ready for querying.
The skill will know how to check this health endpoint to see whether Agent Brain is running. If not, it can spin it up with the proper folder path and port.
Once indexing is complete and the server is ready, the skill can query the vector store over HTTP, sending text queries and getting back relevant document chunks or summaries.
Everything will be running locally, so it stays efficient and fast.
The agent-brain CLI can query the DB for easy testing, shut the server down, add directories to index, and so on.
agent-brain-server exposes an OpenAPI schema.
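
As a rough sketch of the launcher's command-line surface described above, the folder path and port could be parsed as follows. The argument names (docs_path, --port) and the default port 8000 are assumptions for illustration; the spec only states that a folder and a port are supplied.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the hypothetical `agent-brain <docs_path> --port N` form."""
    parser = argparse.ArgumentParser(
        prog="agent-brain",
        description="Index a folder of documents and serve queries over HTTP.",
    )
    parser.add_argument("docs_path", type=Path, help="folder containing documents to index")
    parser.add_argument("--port", type=int, default=8000, help="port to run the server on (assumed default)")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse arguments and reject ports outside the valid TCP range."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not 1 <= args.port <= 65535:
        parser.error("port must be between 1 and 65535")
    return args


args = parse_args(["./docs", "--port", "9000"])
```

Keeping parsing separate from server start-up makes it easy to test the CLI without launching anything.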

It is a fully self-contained system that the skill can start, check, and query as needed. This design gives you flexibility and scalability.
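
The start/check/query cycle depends on the health endpoint reporting whether indexing is still running. One possible shape for that status, sketched with standard-library types only, is shown here. The state names, payload keys, and the should_query helper are all assumptions; the spec does not fix a response format.

```python
from dataclasses import dataclass
from enum import Enum


class IndexState(str, Enum):
    """Hypothetical lifecycle states reported by the health endpoint."""
    INDEXING = "indexing"
    READY = "ready"
    FAILED = "failed"


@dataclass
class HealthStatus:
    """Server-side record of indexing progress."""
    state: IndexState = IndexState.INDEXING
    documents_indexed: int = 0

    def to_payload(self) -> dict:
        # Body a /health or /status route might return as JSON.
        return {
            "status": self.state.value,
            "documents_indexed": self.documents_indexed,
            "ready": self.state is IndexState.READY,
        }


def should_query(payload: dict) -> bool:
    """Skill-side check: only send queries once the server reports ready."""
    return payload.get("status") == IndexState.READY.value


health = HealthStatus()
payload_while_indexing = health.to_payload()
health.state = IndexState.READY
payload_when_ready = health.to_payload()
```

The skill would poll the endpoint, start the server if the request fails to connect, and begin querying once should_query returns True.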

Phase 2

Yes, you can add BM25-style keyword search alongside vector search, and LlamaIndex actually has first-class support for that plus hybrid retrieval.

BM25 and hybrid in LlamaIndex

LlamaIndex ships a BM25Retriever that runs classic sparse retrieval (BM25) over your corpus.
You can pair that with a standard vector retriever (your Chroma-backed index) and either:
- Expose them as separate modes (keyword vs semantic), or
- Wrap them in a “hybrid” or “fusion” retriever that merges BM25 and vector results (often via reciprocal rank fusion or weighted scores).

Conceptually you end up with:

Dense retriever: semantic similarity over embeddings (your current Chroma + OpenAI embeddings).
Sparse retriever: BM25 over raw text.
Hybrid retriever: calls both, merges ranked lists, returns a unified set of nodes.
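
The merge step of the hybrid retriever can be illustrated with reciprocal rank fusion, one of the fusion methods mentioned above. This is a plain-Python sketch over lists of node IDs, not LlamaIndex's own implementation; the function name and the constant k = 60 (a commonly used value) are choices made for this example.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(ranked_lists: List[List[str]], k: int = 60, top_n: int = 5) -> List[str]:
    """Merge several ranked lists of node IDs into one ranking.

    Each node gets 1 / (k + rank) from every list it appears in (rank starts at 1),
    and nodes are ordered by the sum of those contributions.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by node ID so the output is deterministic.
    return sorted(scores, key=lambda node_id: (-scores[node_id], node_id))[:top_n]


dense_hits = ["a", "b", "c"]   # e.g. results of the vector retriever
sparse_hits = ["b", "d", "a"]  # e.g. results of the BM25 retriever
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because it uses only ranks, this method needs no calibration between BM25 scores and cosine similarities, which live on different scales.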

Where BM25 actually lives

LlamaIndex can do BM25 internally with BM25Retriever over its document store (no external search engine required).
Some vector backends (e.g., Milvus, Qdrant) and newer Chroma “sparse search” features also expose BM25-like sparse vectors or full-text/BM25 integrations, which LlamaIndex can use via their hybrid vector store integrations.
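
Wherever BM25 runs, the score it computes has the same shape. The sketch below is a small standalone Okapi BM25 scorer over pre-tokenized documents, included only to show what "BM25 over raw text" means. It is not the code BM25Retriever uses; the function name, the k1 and b defaults, and the smoothed IDF variant are assumptions chosen for this example.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document for the given query terms."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    scores: List[float] = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization: long documents need more hits for the same score.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    ["chroma", "vector", "store"],
    ["bm25", "keyword", "search"],
    ["vector", "search"],
]
keyword_scores = bm25_scores(["bm25"], corpus)
```

Only the second document contains the query term, so it is the only one with a non-zero score.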

With your current design (LlamaIndex + Chroma):

Keep Chroma for dense vectors.
Add a BM25Retriever over the same Document/Node objects (LlamaIndex’s internal store).
Create a hybrid retriever that combines:
- VectorIndexRetriever (Chroma-backed)
- BM25Retriever (keyword/BM25)

How you’d expose it in your REST API

You could define something like:

mode=vector → only dense retrieval
mode=bm25 → pure keyword/BM25
mode=hybrid (default) → fusion of both lists, N results per type then merged and reranked.
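
The three modes above could be dispatched roughly as follows. This sketch is framework-independent so it can be read without FastAPI: the retrievers are plain callables returning node IDs, all names are hypothetical, and the hybrid branch uses a simple round-robin merge with de-duplication as a stand-in for a real fusion and rerank step.

```python
from enum import Enum
from itertools import zip_longest
from typing import Callable, List, Optional

Retriever = Callable[[str], List[str]]  # query text -> ranked node IDs


class RetrievalMode(str, Enum):
    VECTOR = "vector"
    BM25 = "bm25"
    HYBRID = "hybrid"


def parse_mode(raw: Optional[str] = None) -> RetrievalMode:
    """Map the `mode` query parameter to an enum, defaulting to hybrid."""
    if not raw:
        return RetrievalMode.HYBRID
    return RetrievalMode(raw.lower())  # raises ValueError for unknown modes


def retrieve(query: str, mode: RetrievalMode, dense: Retriever, sparse: Retriever,
             n_per_type: int = 5) -> List[str]:
    """Run the retriever(s) selected by `mode` and return node IDs."""
    if mode is RetrievalMode.VECTOR:
        return dense(query)[:n_per_type]
    if mode is RetrievalMode.BM25:
        return sparse(query)[:n_per_type]
    # Hybrid: take N results per type, then interleave and drop duplicates.
    merged: List[str] = []
    for pair in zip_longest(dense(query)[:n_per_type], sparse(query)[:n_per_type]):
        for node_id in pair:
            if node_id is not None and node_id not in merged:
                merged.append(node_id)
    return merged


def fake_dense(query: str) -> List[str]:
    return ["a", "b"]  # placeholder for the Chroma-backed retriever


def fake_sparse(query: str) -> List[str]:
    return ["b", "c"]  # placeholder for the BM25 retriever


hybrid_hits = retrieve("what is chunking?", parse_mode(None), fake_dense, fake_sparse)
```

Rejecting unknown mode values early lets the API return a clear client error instead of silently falling back to a default.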
