This module provides complete knowledge base initialization, management, and query functionality.
knowledge/
├── __init__.py # Module initialization
├── config.py # Path configuration (unified management)
├── start_kb.py # Startup script (main entry) ⭐
├── kb.py # Quick startup script (recommended)
├── initializer.py # Knowledge base initializer
├── add_documents.py # Incremental document addition ⭐ New feature
├── manager.py # Knowledge base manager
├── extract_numbered_items.py # Numbered items extractor
├── progress_tracker.py # Progress tracking
└── README.md # This file
Recommended usage (direct execution):
# Run from project root directory (DeepTutor/)
python -m src.knowledge.start_kb [command]
Or using module import:
# Ensure you are in the project root directory (DeepTutor/)
python -m src.knowledge.start_kb [command]
# List all knowledge bases
python -m src.knowledge.start_kb list
# View default knowledge base
python -m src.knowledge.start_kb info
# View specified knowledge base
python -m src.knowledge.start_kb info ai_textbook
# Set the default knowledge base
python -m src.knowledge.start_kb set-default math2211
# Create from single or multiple documents
python -m src.knowledge.start_kb init my_textbook --docs textbook.pdf notes.pdf
# Create from directory
python -m src.knowledge.start_kb init my_course --docs-dir ./course_materials/
# Skip numbered items extraction (faster)
python -m src.knowledge.start_kb init my_kb --docs doc.pdf --skip-extract
# Custom batch size (for large documents)
python -m src.knowledge.start_kb init my_kb --docs large_book.pdf --batch-size 30
# Extract all numbered items for specified knowledge base
python -m src.knowledge.start_kb extract --kb ai_textbook
# Process specific file only
python -m src.knowledge.start_kb extract --kb ai_textbook --content-file chapter1.json
# Debug mode (process first file only)
python -m src.knowledge.start_kb extract --kb ai_textbook --debug
# Custom concurrency and batch processing
python -m src.knowledge.start_kb extract --kb ai_textbook --batch-size 30 --max-concurrent 10
Add new documents to an existing knowledge base without recreating the entire knowledge base:
# Add single document to knowledge base
python -m src.knowledge.add_documents ai_textbook --docs new_chapter.pdf
# Add multiple documents
python -m src.knowledge.add_documents math2211 --docs chapter1.pdf chapter2.pdf
# Add documents from directory
python -m src.knowledge.add_documents ai_textbook --docs-dir ./new_materials/
# Allow overwriting duplicate files
python -m src.knowledge.add_documents ai_textbook --docs document.pdf --allow-duplicates
# Add files only, skip processing (process later manually)
python -m src.knowledge.add_documents ai_textbook --docs file.pdf --skip-processing
# Skip numbered items extraction
python -m src.knowledge.add_documents ai_textbook --docs file.pdf --skip-extract
Benefits of incremental addition:
- ✅ Only processes new documents, saves time and cost
- ✅ Automatically merges with existing knowledge graph without affecting existing content
- ✅ Automatically skips duplicate files to avoid reprocessing
- ✅ Automatically updates knowledge base metadata and history
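The duplicate-skipping behavior listed above can be pictured with a small sketch. It assumes, purely for illustration, that a file counts as a duplicate when a file with the same name is already present in the knowledge base; the real check in add_documents.py may differ (it could compare content hashes, for example). The function name select_new_files is hypothetical.

```python
from pathlib import Path


def select_new_files(source_files, existing_names, allow_duplicates=False):
    """Split candidate files into (to_add, skipped).

    Assumption: a "duplicate" is a candidate whose file name already
    appears in existing_names (the names found in the KB's raw/ folder).
    """
    existing = set(existing_names)
    to_add, skipped = [], []
    for src in map(Path, source_files):
        if src.name in existing and not allow_duplicates:
            skipped.append(src)  # already in the knowledge base
        else:
            to_add.append(src)   # new file, or overwrite explicitly allowed
    return to_add, skipped
```

In practice existing_names could be built with something like `{p.name for p in Path("knowledge_bases/ai_textbook/raw").iterdir()}`.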
Permanently delete a knowledge base:
# Delete knowledge base (requires confirmation)
python -m src.knowledge.start_kb delete old_kb
# Force delete (skip confirmation, dangerous!)
python -m src.knowledge.start_kb delete old_kb --force
When RAG storage is corrupted (e.g., GraphML file corruption), clean and rebuild:
# Clean RAG storage for specified knowledge base (auto backup)
python -m src.knowledge.start_kb clean-rag C2-test
# Clean default knowledge base RAG storage
python -m src.knowledge.start_kb clean-rag
# Clean without backup (not recommended)
python -m src.knowledge.start_kb clean-rag C2-test --no-backup
Use cases:
- ❌ RAG initialization failure (e.g., GraphML parsing error)
- ❌ Knowledge graph data corruption
- 🔄 Need to rebuild knowledge graph
After cleaning:
Use add_documents.py (or the refresh command) to reprocess documents and rebuild RAG storage
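If clean-rag is run with --no-backup, a manual copy of rag_storage/ can be taken first. The sketch below is one way to do that with the standard library; the function name backup_rag_storage and the timestamped folder naming are assumptions for illustration and are not necessarily what the built-in auto backup does.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_rag_storage(kb_dir, now=None):
    """Copy <kb_dir>/rag_storage to a timestamped sibling folder; return its path."""
    kb_dir = Path(kb_dir)
    src = kb_dir / "rag_storage"
    if not src.is_dir():
        raise FileNotFoundError(f"no rag_storage directory in {kb_dir}")
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    # The backup folder name is a hypothetical convention.
    dest = kb_dir / f"rag_storage_backup_{stamp}"
    shutil.copytree(src, dest)  # the original folder is left untouched
    return dest
```

For example, `backup_rag_storage("knowledge_bases/C2-test")` would copy the storage folder before the clean-rag command is run.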
Reprocess all documents in knowledge base:
# Refresh knowledge base (rebuild RAG only)
python -m src.knowledge.start_kb refresh ai_textbook
# Full refresh (clean all extracted content and images)
python -m src.knowledge.start_kb refresh ai_textbook --full
# Refresh without backing up RAG storage
python -m src.knowledge.start_kb refresh ai_textbook --no-backup
# Skip numbered items extraction
python -m src.knowledge.start_kb refresh ai_textbook --skip-extract
Use cases:
- 🔧 Updated parsing algorithm or model
- 🐛 Fixed previous processing errors
- 📊 Need to completely rebuild knowledge graph
Difference:
- refresh: Reprocesses all documents
- add_documents: Only processes newly added documents (incremental)
Initialized knowledge base directory structure:
knowledge_bases/
├── kb_config.json # Knowledge base configuration
└── my_textbook/ # Knowledge base directory
    ├── metadata.json # Metadata (includes update history)
    ├── numbered_items.json # Numbered items (Definition, Theorem, etc.)
    ├── raw/ # Original documents
    │   ├── textbook.pdf # Initial documents
    │   └── new_chapter.pdf # Incrementally added documents
    ├── images/ # Extracted images
    │   ├── figure_1_1.jpg
    │   └── new_figure.jpg # Images from new documents
    ├── content_list/ # Document content list
    │   ├── textbook.json
    │   └── new_chapter.json # Content list for new documents
    └── rag_storage/ # RAG knowledge graph (incremental updates)
        ├── kv_store_full_entities.json
        ├── kv_store_full_relations.json
        └── kv_store_text_chunks.json
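The layout above can be checked programmatically. The following is a minimal sketch using only pathlib; the entry names come from the tree above, while the function name missing_entries is hypothetical. Note that numbered_items.json may legitimately be absent if extraction was skipped with --skip-extract.

```python
from pathlib import Path

# Top-level entries of a knowledge base directory, as shown in the tree above.
EXPECTED_ENTRIES = (
    "metadata.json",
    "numbered_items.json",
    "raw",
    "images",
    "content_list",
    "rag_storage",
)


def missing_entries(kb_dir):
    """Return the expected files/directories that are absent from kb_dir."""
    kb_dir = Path(kb_dir)
    return [name for name in EXPECTED_ENTRIES if not (kb_dir / name).exists()]
```

Calling `missing_entries("knowledge_bases/my_textbook")` returns an empty list when every expected entry is present.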
metadata.json structure example:
{
  "name": "my_textbook",
  "created_at": "2025-01-15 10:30:00",
  "last_updated": "2025-01-20 14:25:00",
  "description": "Knowledge base: my_textbook",
  "version": "1.0",
  "update_history": [
    {
      "timestamp": "2025-01-15 10:30:00",
      "action": "create",
      "files_added": 1
    },
    {
      "timestamp": "2025-01-20 14:25:00",
      "action": "add_documents",
      "files_added": 2
    }
  ]
}
Required in project root .env file:
LLM_BINDING_API_KEY=your_openai_api_key
LLM_BINDING_HOST=https://api.openai.com/v1
When the knowledge base already exists and you only need to add new documents:
# Wrong approach: Recreate entire knowledge base (wastes time and API costs)
python -m src.knowledge.start_kb init ai_textbook --docs chapter10.pdf
# Correct approach: Use incremental addition
python -m src.knowledge.add_documents ai_textbook --docs chapter10.pdf
Benefits:
- Only processes new documents, saves API costs
- Preserves existing knowledge graph, new content automatically merged
- Automatically skips duplicate files to avoid reprocessing
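The choice between init and incremental addition can be automated. The sketch below builds the appropriate command line, assuming that a knowledge base "exists" exactly when its directory is present under the base directory; that test and the function name build_command are illustrative assumptions. The module paths and flags are the ones shown in the commands above.

```python
import sys
from pathlib import Path


def build_command(kb_name, docs, base_dir="./knowledge_bases"):
    """Return the argv list for adding docs: incremental if the KB exists, else init."""
    kb_exists = (Path(base_dir) / kb_name).is_dir()  # assumed existence check
    if kb_exists:
        # Existing knowledge base: only process the new documents.
        cmd = [sys.executable, "-m", "src.knowledge.add_documents", kb_name]
    else:
        # No knowledge base yet: create it from the documents.
        cmd = [sys.executable, "-m", "src.knowledge.start_kb", "init", kb_name]
    return cmd + ["--docs", *map(str, docs)]
```

The result can be passed to `subprocess.run(cmd, check=True)` from the project root directory (DeepTutor/).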
For large documents, consider increasing the batch size and concurrency:
python -m src.knowledge.start_kb init large_book \
    --docs book.pdf \
    --batch-size 30
python -m src.knowledge.start_kb extract \
    --kb large_book \
    --batch-size 30 \
    --max-concurrent 10
Use debug mode for testing; it processes only the first file:
python -m src.knowledge.start_kb extract --kb test_kb --debug
Unified path management to avoid path conflicts:
from src.knowledge.config import (
    PROJECT_ROOT,         # Project root directory
    KNOWLEDGE_BASES_DIR,  # Knowledge base directory
    setup_paths,          # Setup Python paths
    get_env_config,       # Get environment variables
)
Manages multiple knowledge bases:
from src.knowledge import KnowledgeBaseManager
manager = KnowledgeBaseManager("./knowledge_bases")
kb_list = manager.list_knowledge_bases()
manager.set_default("ai_textbook")
info = manager.get_info("ai_textbook")
Creates and initializes new knowledge bases:
from src.knowledge import KnowledgeBaseInitializer
initializer = KnowledgeBaseInitializer(
    kb_name="my_kb",
    base_dir="./knowledge_bases",
    api_key="your_key"
)
# process_documents() is awaited, so call it from inside an async function
await initializer.process_documents()
Adds new documents to existing knowledge base (incremental updates):
from src.knowledge.add_documents import DocumentAdder
adder = DocumentAdder(
    kb_name="ai_textbook",
    base_dir="./knowledge_bases",
    api_key="your_key"
)
# Add documents
new_files = adder.add_documents(
    source_files=["new_chapter.pdf"],
    skip_duplicates=True
)
# Process new documents (awaited, so run this inside an async function)
processed = await adder.process_new_documents(new_files)
# Extract numbered items
adder.extract_numbered_items_for_new_docs(processed)
Features:
- Only processes newly added documents, saves time and API costs
- Automatically merges with existing knowledge graph
- Intelligently skips duplicate files
- Updates knowledge base metadata
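The snippets above use await directly, which only works inside an async function. A minimal sketch of driving the same three steps from synchronous code with asyncio.run follows. StubAdder is a stand-in written for this example so that it runs without the project installed; it mirrors only the method names shown above, and the real DocumentAdder behaves differently. The helper name add_and_process is also hypothetical.

```python
import asyncio


class StubAdder:
    """Stand-in for DocumentAdder (illustration only, not the real class)."""

    def add_documents(self, source_files, skip_duplicates=True):
        return list(source_files)

    async def process_new_documents(self, new_files):
        await asyncio.sleep(0)  # placeholder for the real parsing work
        return list(new_files)

    def extract_numbered_items_for_new_docs(self, processed):
        self.extracted = list(processed)


async def add_and_process(adder, source_files):
    """Add files, process them, then extract numbered items; return processed files."""
    new_files = adder.add_documents(source_files=source_files, skip_duplicates=True)
    if not new_files:
        return []  # nothing new, so skip processing entirely
    processed = await adder.process_new_documents(new_files)
    adder.extract_numbered_items_for_new_docs(processed)
    return processed
```

From a regular script, the coroutine can be run with `asyncio.run(add_and_process(adder, ["new_chapter.pdf"]))`, where adder would be a real DocumentAdder instance.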
Extracts Definition, Theorem, Formula, Figure, etc.:
from pathlib import Path
from src.knowledge.extract_numbered_items import process_content_list
process_content_list(
    content_list_file=Path("content_list/chapter1.json"),
    output_file=Path("numbered_items.json"),
    api_key="your_key",
    base_url="your_url"
)
- Definition - Definitions
- Proposition - Propositions
- Theorem - Theorems
- Lemma - Lemmas
- Corollary - Corollaries
- Example - Examples
- Remark - Remarks
- Figure - Figures
- Equation - Equations (numbered)
- Table - Tables
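To make the shape of these labels concrete, the sketch below finds references such as "Theorem 2.1" in plain text with a regular expression. This is only an illustration of what a numbered item label looks like. It is not how extract_numbered_items.py works (that module takes an API key and base URL, which suggests model-based extraction), and the names ITEM_TYPES, LABEL_RE, and find_labels are hypothetical.

```python
import re

# The item types listed above.
ITEM_TYPES = (
    "Definition", "Proposition", "Theorem", "Lemma", "Corollary",
    "Example", "Remark", "Figure", "Equation", "Table",
)

# A type name followed by a dotted number, e.g. "Lemma 3" or "Figure 1.2.3".
LABEL_RE = re.compile(r"\b(" + "|".join(ITEM_TYPES) + r")\s+(\d+(?:\.\d+)*)")


def find_labels(text):
    """Return (type, number) pairs found in text, in order of appearance."""
    return LABEL_RE.findall(text)
```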
A: Call the setup_paths() function from config.py to set paths consistently:
from src.knowledge.config import setup_paths
setup_paths()
A: Ensure raganything/RAG-Anything is in the project's parent directory, or modify the path in config.py.
A: Check LLM_BINDING_API_KEY and LLM_BINDING_HOST configuration in .env file.
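A quick way to run that check is sketched below. It assumes the values from .env have already been loaded into the process environment (for instance by a dotenv loader); the function name missing_env_keys is hypothetical, and the two key names are the ones required above.

```python
import os

REQUIRED_KEYS = ("LLM_BINDING_API_KEY", "LLM_BINDING_HOST")


def missing_env_keys(environ=None):
    """Return the required keys that are unset or blank in the given mapping."""
    env = os.environ if environ is None else environ
    return [key for key in REQUIRED_KEYS if not str(env.get(key, "")).strip()]
```

An empty result from `missing_env_keys()` means both settings are present; it does not verify that the key or host is actually valid.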
A: This is caused by corrupted RAG storage files. Fix using:
# Clean corrupted RAG storage
python -m src.knowledge.start_kb clean-rag your_kb_name
# Then reprocess documents
python -m src.knowledge.start_kb refresh your_kb_name
A: Use the refresh command:
# Method 1: Full refresh (preserve original documents)
python -m src.knowledge.start_kb refresh kb_name --full
# Method 2: Delete and recreate
python -m src.knowledge.start_kb delete kb_name
python -m src.knowledge.start_kb init kb_name --docs-dir ./documents/
- Use start_kb.py or kb.py - Unified entry point, avoid directly calling submodules
- Configure default knowledge base - Reduces need to specify knowledge base name each time
- Incremental document addition ⭐ - Use add_documents.py when adding new documents to an existing knowledge base instead of reinitializing
- Regular cleanup 🧹 - Use the clean-rag command when encountering RAG errors
- Careful deletion 🗑️ - Ensure important data is backed up before deleting a knowledge base
- Regular backups - numbered_items.json and rag_storage/ are important
- Version control - Recommend adding knowledge_bases/ to .gitignore
- Check update history - Use the info command to view update_history in metadata
- Error recovery - When RAG is corrupted, first clean-rag, then refresh or add_documents
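The update history can also be read directly from metadata.json. The sketch below totals the files added and reports the most recent action, assuming the file follows the structure shown in the metadata.json example; the function name summarize_history is hypothetical.

```python
import json


def summarize_history(metadata_text):
    """Return (total_files_added, last_action) from the contents of metadata.json."""
    meta = json.loads(metadata_text)
    history = meta.get("update_history", [])
    total = sum(entry.get("files_added", 0) for entry in history)
    last_action = history[-1]["action"] if history else None
    return total, last_action
```

For example, `summarize_history(Path("knowledge_bases/my_textbook/metadata.json").read_text())` would work after importing Path from pathlib.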
- src/tools/rag_tool.py - RAG query tool
- src/tools/query_item_tool.py - Query numbered items
- src/agents/solve/ - Problem solving system
- src/agents/research/ - Deep research system