A complete custom implementation of a knowledge graph builder using Qdrant for vector storage and Neo4j for graph storage. Process PDFs, extract entities and relationships, and perform intelligent search using GraphRAG.
- ✅ Custom Implementation: Built from scratch, no dependencies on neo4j_graphrag
- ✅ Qdrant Integration: Fast vector similarity search
- ✅ Neo4j Integration: Powerful graph traversal
- ✅ LLM-Powered: Uses GPT for entity/relationship extraction
- ✅ Configurable: All parameters are customizable
- ✅ Scalable: Handles large documents efficiently
- ✅ Async Processing: Parallel batch processing for faster ingestion
- ✅ CLI Interface: Easy-to-use command-line interface
```
Text Input
    ↓
[Text Chunker] → Chunks
    ↓
[Embedding Generator] → Embeddings
    ↓
[Entity Extractor] → Entities & Relationships
    ↓
[Qdrant Store] ← Chunks + Embeddings
[Neo4j Store] ← Entities + Relationships
```
```bash
pip install -r requirements.txt
```

Set environment variables in a `.env` file or export them:

```bash
export OPENAI_API_KEY="your-openai-api-key"
export NEO4J_URI="neo4j+s://your-instance.databases.neo4j.io"
export NEO4J_USERNAME="neo4j"
export NEO4J_PASSWORD="your-password"

# Optional: Qdrant cloud
export QDRANT_URL="https://your-cluster.qdrant.io"
export QDRANT_API_KEY="your-qdrant-api-key"
```

Or use `config.py` with a `.env` file (recommended).
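Per the project layout, these variables are read through `config.py`. A minimal sketch of what such a loader might look like, using only the standard library (the actual `config.py` may differ, e.g. by loading `.env` via `python-dotenv`):

```python
import os

def load_config() -> dict:
    """Collect required and optional settings from the environment.

    Raises KeyError for a missing required variable so misconfiguration
    fails early, instead of at the first API call.
    """
    return {
        # Required
        "openai_api_key": os.environ["OPENAI_API_KEY"],
        "neo4j_uri": os.environ["NEO4J_URI"],
        "neo4j_username": os.environ["NEO4J_USERNAME"],
        "neo4j_password": os.environ["NEO4J_PASSWORD"],
        # Optional: falls back to a local Qdrant instance (assumed default port)
        "qdrant_url": os.environ.get("QDRANT_URL", "http://localhost:6333"),
        "qdrant_api_key": os.environ.get("QDRANT_API_KEY"),
    }
```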
```bash
python src/main.py upload static/docs/note.pdf
python src/main.py search "What is Kleros?"
```

Upload and process PDF files to build the knowledge graph:
```
python src/main.py upload <pdf_path> [options]
```

Arguments:
- `pdf_path` - Path to the PDF file to process

Options:
- `--chunk-size <int>` - Size of text chunks (default: 500)
- `--chunk-overlap <int>` - Overlap between chunks (default: 100)
- `--pages-per-batch <int>` - Number of pages per batch (default: 10)
- `--max-concurrent-batches <int>` - Maximum concurrent batches (default: 3)
- `--clear` - Clear existing data before uploading
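To see how `--chunk-size` and `--chunk-overlap` interact, here is an illustrative sliding-window chunker. This is the standard technique, not the project's actual `text_chunker.py`: each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so consecutive chunks share an overlap region.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into windows of chunk_size characters, where each
    window starts chunk_size - chunk_overlap after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

With the defaults, a 1000-character text yields three chunks, and the last 100 characters of one chunk repeat at the start of the next, so sentences cut at a boundary stay intact in at least one chunk.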
Examples:

```bash
# Basic upload
python src/main.py upload static/docs/note.pdf

# Upload with custom chunk size
python src/main.py upload static/docs/note.pdf --chunk-size 1000 --chunk-overlap 200

# Upload with custom batch settings
python src/main.py upload static/docs/note.pdf --pages-per-batch 5 --max-concurrent-batches 5

# Clear existing data and upload
python src/main.py upload static/docs/note.pdf --clear
```

Search the knowledge graph using natural language queries:
```
python src/main.py search "<query>" [options]
```

Arguments:
- `query` - Search query (use quotes for multi-word queries)

Options:
- `--top-k <int>` - Number of top chunks to retrieve (default: 5)
- `--max-depth <int>` - Maximum graph traversal depth (default: 2)
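In Neo4j, a traversal-depth bound like `--max-depth` typically becomes a variable-length relationship pattern in Cypher. A hypothetical query builder illustrating the idea (the label, relationship, and parameter names are assumptions; the project's actual Cypher may differ):

```python
def build_traversal_query(max_depth: int = 2) -> str:
    """Build a Cypher query that walks up to max_depth hops out
    from entities mentioned in the retrieved chunks."""
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    return (
        "MATCH (c:Chunk)-[:MENTIONS]->(e:Entity) "
        "WHERE c.id IN $chunk_ids "
        # [*1..N] is Cypher's variable-length pattern: 1 to N hops
        f"MATCH path = (e)-[*1..{max_depth}]-(related:Entity) "
        "RETURN e, related, relationships(path)"
    )
```

A larger `--max-depth` pulls in more distant entities at the cost of a bigger (and slower) traversal.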
Examples:

```bash
# Basic search
python src/main.py search "What is Kleros?"

# Search with more chunks
python src/main.py search "What are the main concepts?" --top-k 10

# Search with deeper graph traversal
python src/main.py search "How does X relate to Y?" --max-depth 3
```

Delete all data from Qdrant and Neo4j:
```
python src/main.py delete --confirm
```

Options:
- `--confirm` (required) - Confirmation flag to proceed with deletion
Example:

```bash
# Without confirmation (shows warning)
python src/main.py delete

# With confirmation (deletes all data)
python src/main.py delete --confirm
```

```bash
# General help
python src/main.py --help

# Command-specific help
python src/main.py upload --help
python src/main.py search --help
python src/main.py delete --help
```

You can also use the components directly from Python:

```python
from src.builders.kg_builder import KnowledgeGraphBuilder
from src.builders.graphrag import GraphRAG

# Initialize builder
kg_builder = KnowledgeGraphBuilder(
    openai_api_key="your-key",
    neo4j_uri="neo4j+s://...",
    neo4j_username="neo4j",
    neo4j_password="password",
)

# Build knowledge graph
text = "Your text here..."
result = kg_builder.build_from_text(text)

# Initialize GraphRAG
graphrag = GraphRAG(
    openai_api_key="your-key",
    vector_store=kg_builder.vector_store,
    graph_store=kg_builder.graph_store,
)

# Search
query = "What is X?"
answer = graphrag.search(query)
print(answer["answer"])
```

For large PDFs, batches of pages can be processed in parallel:

```python
import asyncio

from src.processors.pdf_processor import PDFProcessor

# Split the PDF into page ranges
pdf_processor = PDFProcessor("static/docs/note.pdf")
page_batches = pdf_processor.get_page_batches(pages_per_batch=10)

text_batches = []
for start_page, end_page in page_batches:
    text_batches.append(pdf_processor.process_batch(start_page, end_page))

# Process batches in parallel
result = await kg_builder.async_build_from_text_batches(
    text_batches,
    max_concurrent_batches=3,
)
```

The codebase is organized into logical modules:
Core processing components:
- text_chunker.py: Splits text into chunks with overlap
- embeddings.py: Generates vector embeddings (async support)
- entity_extractor.py: Extracts entities and relationships using LLM
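LLM-based extractors like `entity_extractor.py` usually prompt the model for structured output and then parse it defensively. A sketch of that parsing step, assuming the model is asked to return JSON with `entities` and `relationships` keys (the project's actual prompt and schema may differ):

```python
import json

def parse_extraction(llm_output: str) -> tuple[list[dict], list[dict]]:
    """Parse the LLM's JSON response into entity and relationship lists,
    dropping malformed items rather than failing the whole batch."""
    data = json.loads(llm_output)
    # Keep only entities with the fields downstream code needs
    entities = [e for e in data.get("entities", []) if "name" in e and "type" in e]
    # Keep only relationships with a source, target, and type
    relationships = [
        r for r in data.get("relationships", [])
        if {"source", "target", "type"} <= r.keys()
    ]
    return entities, relationships
```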
Storage backends:
- qdrant_store.py: Vector storage (local or cloud)
- neo4j_store.py: Graph storage with graceful error handling
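The graceful error handling noted for `neo4j_store.py` amounts to catching connection failures and continuing vector-only. A minimal sketch of the pattern (illustrative, not the project's actual code; the real implementation would catch the Neo4j driver's exception types):

```python
def store_graph(graph_store, entities, relationships) -> bool:
    """Try to persist the graph; on connection failure, report and
    return False so the caller can continue with Qdrant only."""
    try:
        graph_store.write(entities, relationships)
        return True
    except ConnectionError as exc:  # stand-in for neo4j driver errors
        print(f"Neo4j unavailable, continuing with Qdrant only: {exc}")
        return False
```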
High-level orchestrators:
- kg_builder.py: Main knowledge graph building pipeline
- graphrag.py: GraphRAG search and retrieval
File processing:
- pdf_reader.py: PDF text extraction
- pdf_processor.py: PDF processing with batch support
Configuration:
- config.py: Environment variable management
- models.py: Model enums
Command-line interface:
- main.py: Upload, search, and delete commands
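A CLI with these subcommands maps naturally onto `argparse` subparsers; a sketch of how `main.py` might wire them up (the project's actual structure may differ):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    # upload: process a PDF into the knowledge graph
    upload = sub.add_parser("upload")
    upload.add_argument("pdf_path")
    upload.add_argument("--chunk-size", type=int, default=500)
    upload.add_argument("--chunk-overlap", type=int, default=100)
    upload.add_argument("--pages-per-batch", type=int, default=10)
    upload.add_argument("--max-concurrent-batches", type=int, default=3)
    upload.add_argument("--clear", action="store_true")

    # search: query the knowledge graph
    search = sub.add_parser("search")
    search.add_argument("query")
    search.add_argument("--top-k", type=int, default=5)
    search.add_argument("--max-depth", type=int, default=2)

    # delete: remove all stored data (requires --confirm)
    delete = sub.add_parser("delete")
    delete.add_argument("--confirm", action="store_true")
    return parser
```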
Ingestion pipeline:
- PDF Processing: Extracts text from the PDF in batches
- Text Chunking: Splits text into manageable pieces
- Embedding: Creates vector representations (parallel)
- Extraction: LLM extracts entities and relationships (parallel)
- Storage: Stores chunks in Qdrant, graph in Neo4j
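The parallel embedding and extraction steps are typically bounded by a semaphore so that at most `max_concurrent_batches` batches run at once. A minimal `asyncio` sketch of the pattern (illustrative; the batch body is a placeholder for the real embed/extract calls):

```python
import asyncio

async def process_batches(batches, max_concurrent_batches: int = 3):
    """Run one coroutine per batch, at most max_concurrent_batches at a time."""
    sem = asyncio.Semaphore(max_concurrent_batches)

    async def process(batch):
        async with sem:  # blocks when the concurrency limit is reached
            await asyncio.sleep(0)  # placeholder for embedding + extraction I/O
            return len(batch)       # placeholder result

    # gather preserves input order in its results
    return await asyncio.gather(*(process(b) for b in batches))
```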
Search pipeline (GraphRAG):
- Vector Search: Find similar chunks using embeddings
- Graph Traversal: Get entities connected to relevant chunks
- Context Building: Combine chunk text + entity information
- Answer Generation: LLM generates answer from context
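The context-building step concatenates retrieved chunk text with graph facts before prompting the LLM. A simplified sketch (the layout and triple format are illustrative, not the project's actual prompt):

```python
def build_context(chunks: list[str], triples: list[tuple[str, str, str]]) -> str:
    """Combine retrieved chunk text and graph triples into one prompt context."""
    chunk_part = "\n\n".join(chunks)
    # Render each (subject, relation, object) triple as a readable fact line
    fact_part = "\n".join(f"{s} -[{rel}]-> {o}" for s, rel, o in triples)
    return f"Context:\n{chunk_part}\n\nKnown facts:\n{fact_part}"
```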
- Architecture Overview - System architecture and design
- Usage Guide - Detailed usage instructions
- CLI Reference - Complete CLI documentation
- The implementation uses the Neo4j driver for graph operations (standard approach)
- Qdrant can run locally (default) or in the cloud
- All LLM calls use OpenAI API
- The system handles 'id' property conflicts automatically
- Neo4j connection failures are handled gracefully (system continues with Qdrant only)
MIT