Knowledge Base Platform is a production-ready RAG backend with a clean API and a modern web UI. It ingests documents, builds a semantic index in Qdrant, and answers questions with grounded citations. It also generates document structure (TOC metadata) to enable section-aware retrieval and precise "show me question X" queries, and provides a retrieve-only API for automation and MCP-style tools.
It can be used as a standalone service or integrated into other products via its API (plugin-style: you bring the data, it provides retrieval, citations, and answers).
- High-signal retrieval: Vector search with Qdrant and configurable chunking.
- Structured navigation: LLM-based TOC extraction enables section-aware search.
- API-first: Use the backend independently from the UI.
- Provider-flexible: OpenAI by default, with optional Anthropic, Voyage, or Ollama.
- Grounded answers: Responses are built from your documents, not guesses.
- Retrieve-only access: Get chunks + context without creating chat history.
- Document ingestion with intelligent chunking (txt, md, fb2, docx):
- Simple: Fast fixed-size chunking with overlap
- Smart: Recursive chunking respecting sentence/paragraph boundaries (recommended for most cases)
- Semantic: Advanced embedding-based boundary detection using cosine similarity to find natural topic changes
- Embedding-based semantic search over unstructured documents
- Qdrant-backed vector index for fast similarity search
- MMR (Maximal Marginal Relevance) for diversity-aware search
- Optional BM25 lexical index (OpenSearch) for hybrid retrieval
- Self-Check Validation (optional): Two-stage answer generation with validation for improved accuracy
- Retrieve-only API for MCP/search tools (no chat side-effects)
- KB export/import: portable archives for backup, migration, and QA
- Chat export (Markdown): separate archive for human reading (not importable)
- Chat history controls: delete individual Q/A pairs from a conversation
- BM25 phrase matching: adds an exact match_phrase clause for strict wording
- Windowed retrieval (context expansion) for neighboring chunk context
- Structured document analysis and TOC metadata
- RAG answers with citations
- FastAPI backend + React frontend
- Docker-first dev setup
- KB-level retrieval defaults stored per knowledge base
- JWT-based admin auth for protected endpoints
- API: FastAPI, async SQLAlchemy
- Vector DB: Qdrant
- Lexical Search: OpenSearch (BM25)
- Metadata DB: PostgreSQL
- Embeddings: provider text embeddings for unstructured data (OpenAI by default; Voyage or Ollama optional)
- RAG: Custom retrieval (dense or hybrid) + LLM generation pipeline
- Frontend: Vite + React
- Documents are uploaded and chunked.
- Each chunk is embedded into vectors.
- Vectors are stored in Qdrant.
- A query is embedded and matched by similarity.
- The top chunks are assembled into context for the LLM.
- The LLM returns a grounded answer with sources.
- Optional: a TOC/structure pass enables section-aware retrieval.
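For orientation, here is a minimal sketch of that loop written directly against the openai and qdrant-client SDKs. The collection name, model names, and payload fields are illustrative assumptions; this is not the platform's internal code.

```python
# Minimal RAG loop sketch using the openai and qdrant-client SDKs directly.
# Collection name, model names, and the "text"/"document_id" payload fields are
# illustrative assumptions, not the platform's internals.
from openai import OpenAI
from qdrant_client import QdrantClient

oai = OpenAI()                                      # reads OPENAI_API_KEY from the env
qdrant = QdrantClient(url="http://localhost:6333")

def answer(query: str, collection: str = "kb_chunks", top_k: int = 5) -> dict:
    # 1. Embed the query.
    query_vec = oai.embeddings.create(
        model="text-embedding-3-small", input=[query]
    ).data[0].embedding

    # 2. Find the nearest chunk vectors in Qdrant.
    hits = qdrant.search(collection_name=collection, query_vector=query_vec, limit=top_k)

    # 3. Assemble the top chunks into a context window.
    context = "\n\n".join(h.payload["text"] for h in hits)

    # 4. Generate a grounded answer from that context only.
    completion = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    sources = [{"document_id": h.payload.get("document_id"), "score": h.score} for h in hits]
    return {"answer": completion.choices[0].message.content, "sources": sources}
```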
Vectorization turns unstructured text into numeric vectors that capture meaning, not just keywords. This lets the system retrieve semantically similar chunks even when the wording differs. It is especially useful for study notes, specs, or large documents where exact keyword matches miss relevant sections.
The system performs semantic retrieval: it embeds the user query, finds the closest chunk vectors, and assembles them into a context window for the LLM. You can also enable hybrid retrieval (BM25 + vectors), which boosts exact keyword matches while preserving semantic recall. Because the answer is grounded in retrieved chunks, we can return citations (source snippets) alongside the response.
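As a rough illustration of hybrid scoring, the sketch below min-max normalizes dense and BM25 scores per query and blends them with a BM25 weight. The platform's exact fusion formula is not documented here; this only shows why an exact keyword match can climb the final ranking.

```python
# Generic weighted blend of normalized dense and BM25 scores per query
# (not the platform's exact fusion formula).
def blend_scores(dense: dict[str, float], bm25: dict[str, float],
                 bm25_weight: float = 0.3) -> dict[str, float]:
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {cid: (s - lo) / span for cid, s in scores.items()}

    d, b = normalize(dense), normalize(bm25)
    return {
        cid: (1 - bm25_weight) * d.get(cid, 0.0) + bm25_weight * b.get(cid, 0.0)
        for cid in set(d) | set(b)
    }

# Example: chunk "c2" is only the second-best semantic match but the best lexical
# match, so it tops the blended ranking.
print(blend_scores(dense={"c1": 0.82, "c2": 0.74, "c4": 0.60},
                   bm25={"c2": 11.3, "c3": 9.1}))
```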
For structured documents, an optional Structure‑Aware Retrieval step builds a TOC and section metadata. This enables section‑targeted queries (e.g., "show Question 2"), returning full, verbatim excerpts rather than a generic summary.
If you need retrieval without LLM generation (for tools like MCP search/retrieve), use:
POST /api/v1/retrieve/
This returns chunks + assembled context without creating chat conversations or messages.
You can also set KB-level retrieval defaults (e.g., top_k, retrieval_mode, BM25 settings):
- GET /api/v1/knowledge-bases/{kb_id}/retrieval-settings
- PUT /api/v1/knowledge-bases/{kb_id}/retrieval-settings
- DELETE /api/v1/knowledge-bases/{kb_id}/retrieval-settings
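A hedged example of storing KB-level defaults with Python's requests library; the body field names (top_k, retrieval_mode, bm25_weight) are assumptions based on the settings listed above, so check /docs for the exact schema.

```python
# Hedged example: store retrieval defaults for a KB. The body field names are
# assumptions -- verify the schema in Swagger (/docs).
import requests

API = "http://localhost:8004/api/v1"
headers = {"Authorization": "Bearer <access-token>"}   # admin JWT (see Authentication)

resp = requests.put(
    f"{API}/knowledge-bases/your-kb-id/retrieval-settings",
    headers=headers,
    json={"top_k": 20, "retrieval_mode": "hybrid", "bm25_weight": 0.3},
)
resp.raise_for_status()
print(resp.json())
```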
- KB export/import: POST /api/v1/kb/export, POST /api/v1/kb/import
- Chat export (Markdown): POST /api/v1/kb/export-chats-md (separate archive for reading)
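A hedged sketch of calling the export endpoint from Python; the request body and the archive handling are assumptions for illustration, so consult docs/KB_EXPORT_IMPORT.md for the real contract.

```python
# Hedged sketch of a KB export call. The request body and the archive handling
# are assumptions -- see docs/KB_EXPORT_IMPORT.md for the actual contract.
import requests

API = "http://localhost:8004/api/v1"
headers = {"Authorization": "Bearer <access-token>"}

resp = requests.post(f"{API}/kb/export", headers=headers,
                     json={"knowledge_base_id": "your-kb-id"})   # body schema assumed
resp.raise_for_status()

with open("kb_export_archive", "wb") as fh:   # archive format/filename assumed
    fh.write(resp.content)
```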
The platform supports three chunking strategies with different trade-offs:
- How it works: Splits text at fixed character positions with configurable overlap
- Pros: Fastest, predictable chunk sizes
- Cons: May split mid-sentence or mid-word
- Use when: Speed is critical, document structure doesn't matter
- Overlap: Required (15-20% recommended)
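A minimal illustration of the fixed-size strategy (not the platform's implementation):

```python
# Minimal illustration of fixed-size chunking with overlap (not the platform's code).
def simple_chunks(text: str, chunk_size: int = 800, overlap: int = 120) -> list[str]:
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# With ~15% overlap, text cut at a chunk boundary is repeated at the start of the
# next chunk, so a sentence split by the boundary still appears whole somewhere.
```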
- How it works: Uses LangChain's RecursiveCharacterTextSplitter to split at natural boundaries (paragraphs → sentences → words)
- Pros: Respects document structure, maintains coherent chunks
- Cons: Slightly slower than simple
- Use when: General-purpose chunking for most documents
- Overlap: Required (15-20% recommended)
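A minimal usage sketch of LangChain's splitter; the import path can vary by LangChain version and the size/overlap values are illustrative.

```python
# Minimal RecursiveCharacterTextSplitter usage; the import path can differ by
# LangChain version and the parameter values are illustrative.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,     # target chunk size in characters
    chunk_overlap=120,  # ~15% overlap between neighboring chunks
)
with open("document.md", encoding="utf-8") as fh:
    chunks = splitter.split_text(fh.read())
```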
- How it works:
- Splits text into sentences (NLTK)
- Generates embeddings for each sentence
- Calculates cosine similarity between consecutive sentences
- Detects boundaries where similarity drops (topic changes)
- Groups semantically related sentences into chunks
- Pros: Chunks align with natural topic boundaries, better retrieval quality
- Cons: Slowest (requires embedding each sentence), GPU recommended
- Use when: Documents have clear topic changes, retrieval quality is critical
- Overlap: Not used (boundaries are semantic, not positional)
- Parameters:
  - chunk_size: maximum chunk size (acts as a soft limit, default 800)
  - min_chunk_size: minimum size before merging (default 100)
  - boundary_method: "adaptive" (mean - k*std) or "fixed" (constant threshold)
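A sketch of the boundary-detection idea, assuming an embed_sentences callable supplied by whichever embedding provider is configured; this is not the platform's implementation.

```python
# Sketch of embedding-based boundary detection (not the platform's implementation).
# `embed_sentences` stands in for whichever embedding provider is configured.
import nltk                    # sentence tokenizer; may require nltk.download("punkt")
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def semantic_chunks(text: str, embed_sentences, k: float = 1.0,
                    min_chunk_size: int = 100) -> list[str]:
    sentences = nltk.sent_tokenize(text)
    if len(sentences) < 2:
        return [text]

    vectors = np.asarray(embed_sentences(sentences))
    # Cosine similarity between each sentence and the next one.
    sims = np.array([
        cosine_similarity(vectors[i:i + 1], vectors[i + 1:i + 2])[0, 0]
        for i in range(len(sentences) - 1)
    ])
    threshold = sims.mean() - k * sims.std()      # "adaptive" boundary method

    chunks, current = [], [sentences[0]]
    for i, sim in enumerate(sims):
        # Cut where similarity drops below the threshold and the chunk is big enough.
        if sim < threshold and len(" ".join(current)) >= min_chunk_size:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i + 1])
    chunks.append(" ".join(current))
    return chunks
```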
Dependencies:
- Simple: None
- Smart: LangChain
- Semantic: NLTK, NumPy, scikit-learn
The steps below start the full stack (API + DB + Qdrant + OpenSearch + frontend).
- Create env file
cp .env.example .env
# add your OPENAI_API_KEY (or other provider keys)
- Read the runbook (dev ops notes, CORS, restart rules)
- Start the stack
docker compose up -d --build
- Open API docs
http://localhost:8004/docs
URLs:
- UI: http://localhost:5174
- API: http://localhost:8004/api/v1
Comprehensive API documentation is available in multiple formats:
Complete documentation with navigation, examples, and visual diagrams:
- 📚 Documentation Home - Start here
- 📖 API Documentation - Complete API reference with request/response examples, data models, and error handling
- ⚡ Quick Reference - Endpoint tables, common patterns, and bash snippets for rapid development
- 🗺️ API Map - Visual API structure, data flow diagrams, and integration examples
Explore and test endpoints directly in your browser:
- Swagger UI - Interactive API documentation with "Try it out" functionality
- ReDoc - Clean, responsive API documentation
- OpenAPI JSON - Machine-readable API specification
Documentation source files are available in the docs/ directory:
- API_DOCUMENTATION.md - Full API reference
- ENDPOINTS_QUICK_REFERENCE.md - Quick lookup tables
- API_MAP.md - Architecture and diagrams
- KB_EXPORT_IMPORT.md - KB export/import (API + format + behavior)
- Create a knowledge base
- Upload a document
- Ask a question
- (Optional) Retrieve-only without creating chats
API details are available in Swagger (/docs).
curl -X POST http://localhost:8004/api/v1/retrieve/ \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is this document about?",
    "knowledge_base_id": "your-kb-id",
    "top_k": 5
  }'

Pagination note: list endpoints accept page and page_size (default 10, max 100).
You can expand retrieval context by including neighboring chunks from the same document. This helps recover surrounding text that was split during chunking.
API fields:
- context_expansion: ["window"]
- context_window: N (number of chunks on each side, 0–5)
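An example retrieve-only request with windowed expansion, using the fields above; the KB id and bearer token are placeholders, and the token is required once setup is complete.

```python
# Retrieve-only request with windowed context expansion, using the documented fields.
# The KB id and bearer token are placeholders.
import requests

resp = requests.post(
    "http://localhost:8004/api/v1/retrieve/",
    headers={"Authorization": "Bearer <access-token>"},
    json={
        "query": "What are the rounding rules?",
        "knowledge_base_id": "your-kb-id",
        "top_k": 5,
        "context_expansion": ["window"],
        "context_window": 2,   # include 2 neighboring chunks on each side
    },
)
resp.raise_for_status()
print(resp.json())
```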
All configuration lives in .env. The sample file is .env.example.
Key settings:
- OPENAI_API_KEY (or alternate provider keys)
- QDRANT_URL
- OLLAMA_BASE_URL (optional for local models)
- MAX_CONTEXT_CHARS (0 = unlimited)
- STRUCTURE_ANALYSIS_REQUESTS_PER_MINUTE (TOC analysis throttle; 0 = unlimited)
- OPENSEARCH_URL (optional; required for BM25/hybrid)
After the Setup Wizard creates the admin account, all API routes (except /health, /setup, and /auth) require authentication.
- UI login: http://<host>:5174/login
- API login: POST /api/v1/auth/login (username + password)
- Access token: sent as Authorization: Bearer <token>
- Refresh token: stored as an httpOnly cookie; rotate via POST /api/v1/auth/refresh
- Logout: POST /api/v1/auth/logout
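A hedged sketch of this flow with Python's requests: the login body fields follow the bullets above, while the response field name "access_token" and the knowledge-bases listing route are assumptions to verify against /docs.

```python
# Sketch of the auth flow with requests. Login body fields follow the bullets above;
# the "access_token" response field and the listing route are assumptions.
import requests

API = "http://localhost:8004/api/v1"
session = requests.Session()              # keeps the httpOnly refresh-token cookie

login = session.post(f"{API}/auth/login",
                     json={"username": "admin", "password": "<password>"})
login.raise_for_status()
access_token = login.json()["access_token"]           # assumed response field

# Authenticated call with the access token.
kbs = session.get(f"{API}/knowledge-bases/",           # assumed listing route
                  headers={"Authorization": f"Bearer {access_token}"})

# When the access token expires, rotate it via the refresh-token cookie.
access_token = session.post(f"{API}/auth/refresh").json()["access_token"]

# End the session.
session.post(f"{API}/auth/logout")
```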
Use the interactive script to reset an admin password without manual hash copying:
./scripts/reset_admin_password.sh

If the frontend and API are on different origins (different host or port), CORS is required.
Common cases:
- Vite frontend on 5174 + API on 8004 → CORS required
- Nginx proxy (frontend serves /api on the same host) → CORS not required
To allow a browser origin, set CORS_ORIGINS in .env (comma-separated). For example:
http://localhost:5174
We now use a Docker Compose secret for the database password. This avoids storing the password in .env or the database.
- Create the secret file:
./scripts/setup_secrets.sh
- Rebuild and restart:
docker compose down
docker compose up -d --build
If you don't know the current DB password, reset it first:
docker exec -u postgres -it kb-platform-db psql -U postgres -d postgres \
-c "ALTER USER kb_user WITH PASSWORD 'NEW_PASSWORD';"
Then rerun ./scripts/setup_secrets.sh with the new password.
Global Settings define defaults for new chats and retrieval behavior:
- Default LLM model
- Top K / Max context / Score threshold / Temperature
- Use Document Structure default
- TOC requests per minute (throttles structure analysis to avoid rate limits)
- General Knowledge Base Configuration (chunk size/overlap, batch size, chunking strategy)
These defaults are applied unless a specific knowledge base overrides them. They are saved in the backend and used to initialize new chats and KBs.
The chat UI exposes retrieval controls to tune answer quality:
- Top K: number of chunks retrieved from the vector store. Typical range 10–50. Higher values add recall but can bring more noise.
- Max context chars: limit for assembled context (0 = unlimited). Lower values reduce cost/latency; higher values preserve more context.
- Score threshold: minimum similarity score (0–1) to filter low‑relevance chunks. 0 disables filtering; 0.2–0.4 is a good starting range.
- Temperature: response randomness. Use 0–0.3 for factual extraction, higher for exploratory/creative explanations.
- Use Document Structure: enables TOC‑aware, section‑targeted retrieval (e.g., "show question 2").
- Use MMR (Maximal Marginal Relevance): enables diversity-aware search to avoid retrieving too many similar chunks from the same section. Balances relevance and diversity.
- MMR Diversity (when MMR enabled): controls the relevance-diversity tradeoff (0.0–1.0). See detailed guidance below.
- Windowed retrieval: expands context by adding neighboring chunks (prevents truncated citations; useful for multi-part questions).
- Retrieval mode: dense (vectors) or hybrid (BM25 + vectors).
- BM25 controls (hybrid only): lexical top‑K and weight blending.
MMR (Maximal Marginal Relevance) balances relevance and diversity in search results. Higher diversity values sacrifice some relevance to retrieve chunks from more varied sources.
Diversity parameter (0.0 - 1.0):
| Value | Documents | Behavior | Use Case |
|---|---|---|---|
| 0.0 | ~4 docs | Pure relevance (standard vector search) | Highest precision needed |
| 0.3 | ~5 docs | Slight diversity, high relevance | Legal docs, technical specs |
| 0.5 | ~6 docs | Balanced (recommended default) ⭐ | General use, Q&A |
| 0.7 | ~8 docs | High diversity, varied sources | Research, exploration |
| 1.0 | ~8 docs | Maximum diversity (lower relevance) | Broad topic overview |
Trade-off:
- Higher diversity → More different documents, but average relevance score drops (0.67 → 0.59)
- Lower diversity → Higher relevance, but chunks may come from same sections
When to use:
- 0.3-0.4: Precision-critical tasks (legal, medical, technical specifications)
- 0.5-0.6: Default balanced mode for most queries
- 0.7-0.8: Exploratory research, brainstorming, broad topic surveys
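For intuition, here is a generic greedy MMR selection sketch; with diversity = 0.0 it reduces to plain relevance ranking, and higher values increasingly penalize candidates similar to chunks already chosen. The platform's exact scoring may differ.

```python
# Generic greedy MMR selection sketch (the platform's exact scoring may differ).
# diversity = 0.0 reduces to plain relevance ranking; higher values penalize
# candidates that are similar to chunks already selected.
import numpy as np

def mmr_select(query_vec: np.ndarray, chunk_vecs: np.ndarray,
               top_k: int, diversity: float = 0.5) -> list[int]:
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    relevance = [cos(query_vec, c) for c in chunk_vecs]
    selected: list[int] = []
    candidates = list(range(len(chunk_vecs)))

    while candidates and len(selected) < top_k:
        def mmr_score(i: int) -> float:
            max_sim = max((cos(chunk_vecs[i], chunk_vecs[j]) for j in selected), default=0.0)
            return (1 - diversity) * relevance[i] - diversity * max_sim

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```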
Example impact (Top K = 8):
- Without MMR: 4 chunks from Unit 1, 2 from Unit 14, 1 from Unit 2, 1 from Unit 8
- With MMR (0.6): 3 chunks from Unit 1, 2 from Unit 5, 1 each from Units 4, 8, 12 → more diverse sources
MMR is not always better. It can hurt answer quality when sequential information is needed:
❌ Don't use MMR for:
| Query Type | Why MMR Hurts | Example |
|---|---|---|
| Sequential explanations | Breaks logical flow by skipping intermediate steps | "Explain the rounding rules" → With MMR: gets intro + conclusion but misses rules 1-3. Without MMR: gets complete sequential explanation |
| Step-by-step instructions | Scatters steps across different sections | "How to install the software" → With MMR: step 1, step 5, troubleshooting. Without: steps 1-6 in order |
| Mathematical proofs | Misses critical intermediate steps | "Prove theorem X" → With MMR: theorem statement + conclusion, missing proof steps |
| Technical procedures | Jumps between prerequisites and advanced steps | "Configure authentication" → With MMR: overview + edge cases, missing basic setup |
| Definition lookups | Gets related concepts instead of the definition | "What is an API?" → With MMR: mentions of APIs in different contexts vs. focused definition |
✅ Use MMR for:
| Query Type | Why MMR Helps | Example |
|---|---|---|
| Comparative questions | Brings perspectives from different sections | "Compare Python vs JavaScript" → gets examples from multiple contexts |
| Topic overviews | Samples diverse aspects of a subject | "What is machine learning?" → gets theory, applications, examples from different chapters |
| Exploratory research | Discovers unexpected connections | "Applications of blockchain" → finds use cases across finance, healthcare, supply chain |
| Multi-faceted questions | Needs information from multiple independent sources | "Pros and cons of microservices" → gets architectural, operational, cost perspectives |
| Brainstorming | Maximum idea diversity from varied sources | "Innovation strategies" → collects diverse approaches from different case studies |
Real example from production:
Query: "Tell me about rounding methods"
- With MMR (0.6): Retrieved chunks from 5 different sections → mentioned "preliminary rounding" but missed the 3 main rounding rules (most important part). Answer was incomplete.
- Without MMR: Retrieved chunks from 1 section sequentially → got all 3 rules + examples + edge cases. Complete answer.
Rule of thumb: If your query expects information from one logical section of a document (rules, procedures, definitions), disable MMR. If you're exploring a topic across multiple independent sources, enable MMR.
- Score threshold vs TOC: TOC/structure queries can return chunks with lower similarity scores. If you see missing sections or "not found" responses, set Score threshold = 0 (no filtering) before running TOC‑style queries.
- Top K and Max context: higher Top K increases recall, but you may need a higher Max context to avoid truncation.
- Hybrid mode: BM25 improves exact‑term matches. For paraphrases, keep some weight on dense vectors.
- MMR vs Top K: MMR is most effective with larger Top K (20+). With small Top K (5-10), diversity impact is limited.
When you first enable hybrid search on an existing KB, use Reindex for BM25 to populate the lexical index.
Each knowledge base can override the TOC / Structure model used for document structure analysis. If override is disabled, the global LLM model is used. KB‑level configuration (chunk size/overlap, batch size, chunking strategy) is set per KB and affects only new or reprocessed documents.
app/ # Backend
frontend/ # UI
docker/ # Docker assets
This project is actively used and evolving. If you want to adapt it to a new domain or provider, the API layer and retrieval engine are designed to be modular.
