Custom Knowledge Graph Builder from Scratch

A complete custom implementation of a knowledge graph builder using Qdrant for vector storage and Neo4j for graph storage. Process PDFs, extract entities and relationships, and perform intelligent search using GraphRAG.

Features

✅ Custom Implementation: Built from scratch, no dependencies on neo4j_graphrag
✅ Qdrant Integration: Fast vector similarity search
✅ Neo4j Integration: Powerful graph traversal
✅ LLM-Powered: Uses GPT for entity/relationship extraction
✅ Configurable: All parameters are customizable
✅ Scalable: Handles large documents efficiently
✅ Async Processing: Parallel batch processing for faster ingestion
✅ CLI Interface: Easy-to-use command-line interface

Architecture

Text Input
    ↓
[Text Chunker] → Chunks
    ↓
[Embedding Generator] → Embeddings
    ↓
[Entity Extractor] → Entities & Relationships
    ↓
[Qdrant Store] ← Chunks + Embeddings
[Neo4j Store] ← Entities + Relationships

Installation

pip install -r requirements.txt

Configuration

Set environment variables in .env file or export them:

export OPENAI_API_KEY="your-openai-api-key"
export NEO4J_URI="neo4j+s://your-instance.databases.neo4j.io"
export NEO4J_USERNAME="neo4j"
export NEO4J_PASSWORD="your-password"

# Optional: Qdrant cloud
export QDRANT_URL="https://your-cluster.qdrant.io"
export QDRANT_API_KEY="your-qdrant-api-key"

Or use the config.py file with a .env file (recommended).

Quick Start

1. Upload and Process a PDF

python src/main.py upload static/docs/note.pdf

2. Search the Knowledge Graph

python src/main.py search "What is Kleros?"

CLI Usage

Upload Command

Upload and process PDF files to build the knowledge graph:

python src/main.py upload <pdf_path> [options]

Arguments:

pdf_path - Path to the PDF file to process

Options:

--chunk-size <int> - Size of text chunks (default: 500)
--chunk-overlap <int> - Overlap between chunks (default: 100)
--pages-per-batch <int> - Number of pages per batch (default: 10)
--max-concurrent-batches <int> - Maximum concurrent batches (default: 3)
--clear - Clear existing data before uploading

Examples:

# Basic upload
python src/main.py upload static/docs/note.pdf

# Upload with custom chunk size
python src/main.py upload static/docs/note.pdf --chunk-size 1000 --chunk-overlap 200

# Upload with custom batch settings
python src/main.py upload static/docs/note.pdf --pages-per-batch 5 --max-concurrent-batches 5

# Clear existing data and upload
python src/main.py upload static/docs/note.pdf --clear

Search Command

Search the knowledge graph using natural language queries:

python src/main.py search "<query>" [options]

Arguments:

query - Search query (use quotes for multi-word queries)

Options:

--top-k <int> - Number of top chunks to retrieve (default: 5)
--max-depth <int> - Maximum graph traversal depth (default: 2)

Examples:

# Basic search
python src/main.py search "What is Kleros?"

# Search with more chunks
python src/main.py search "What are the main concepts?" --top-k 10

# Search with deeper graph traversal
python src/main.py search "How does X relate to Y?" --max-depth 3

Delete Command

Delete all data from Qdrant and Neo4j:

python src/main.py delete --confirm

Options:

--confirm (required) - Confirmation flag to proceed with deletion

Example:

# Without confirmation (shows warning)
python src/main.py delete

# With confirmation (deletes all data)
python src/main.py delete --confirm

Help

# General help
python src/main.py --help

# Command-specific help
python src/main.py upload --help
python src/main.py search --help
python src/main.py delete --help

Programmatic Usage

Basic Usage

from src.builders.kg_builder import KnowledgeGraphBuilder
from src.builders.graphrag import GraphRAG

# Initialize builder
kg_builder = KnowledgeGraphBuilder(
    openai_api_key="your-key",
    neo4j_uri="neo4j+s://...",
    neo4j_username="neo4j",
    neo4j_password="password"
)

# Build knowledge graph
text = "Your text here..."
result = kg_builder.build_from_text(text)

# Initialize GraphRAG
graphrag = GraphRAG(
    openai_api_key="your-key",
    vector_store=kg_builder.vector_store,
    graph_store=kg_builder.graph_store
)

# Search
query = "What is X?"
answer = graphrag.search(query)
print(answer["answer"])

Async Batch Processing

import asyncio
from src.processors.pdf_processor import PDFProcessor

pdf_processor = PDFProcessor("static/docs/note.pdf")
page_batches = pdf_processor.get_page_batches(pages_per_batch=10)

text_batches = []
for start_page, end_page in page_batches:
    text_batches.append(pdf_processor.process_batch(start_page, end_page))

# Process batches in parallel
result = await kg_builder.async_build_from_text_batches(
    text_batches,
    max_concurrent_batches=3
)

Project Structure

The codebase is organized into logical modules:

Core (`src/core/`)

Core processing components:

text_chunker.py: Splits text into chunks with overlap
embeddings.py: Generates vector embeddings (async support)
entity_extractor.py: Extracts entities and relationships using LLM

Storage (`src/storage/`)

Storage backends:

qdrant_store.py: Vector storage (local or cloud)
neo4j_store.py: Graph storage with graceful error handling

Builders (`src/builders/`)

High-level orchestrators:

kg_builder.py: Main knowledge graph building pipeline
graphrag.py: GraphRAG search and retrieval

Processors (`src/processors/`)

File processing:

pdf_reader.py: PDF text extraction
pdf_processor.py: PDF processing with batch support

Config (`src/config/`)

Configuration:

config.py: Environment variable management
models.py: Model enums

CLI (`src/cli/`)

Command-line interface:

main.py: Upload and search commands

How It Works

Upload Process

PDF Processing: Extracts text from PDF in batches
Text Chunking: Splits text into manageable pieces
Embedding: Creates vector representations (parallel)
Extraction: LLM extracts entities and relationships (parallel)
Storage: Stores chunks in Qdrant, graph in Neo4j

Search Process

Vector Search: Find similar chunks using embeddings
Graph Traversal: Get entities connected to relevant chunks
Context Building: Combine chunk text + entity information
Answer Generation: LLM generates answer from context

Documentation

Architecture Overview - System architecture and design
Usage Guide - Detailed usage instructions
CLI Reference - Complete CLI documentation

Notes

The implementation uses Neo4j driver for graph operations (standard approach)
Qdrant can run locally (default) or in the cloud
All LLM calls use OpenAI API
The system handles 'id' property conflicts automatically
Neo4j connection failures are handled gracefully (system continues with Qdrant only)

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.vscode		.vscode
docs		docs
src		src
static/docs		static/docs
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Custom Knowledge Graph Builder from Scratch

Features

Architecture

Installation

Configuration

Quick Start

1. Upload and Process a PDF

2. Search the Knowledge Graph

CLI Usage

Upload Command

Search Command

Delete Command

Help

Programmatic Usage

Basic Usage

Async Batch Processing

Project Structure

Core (src/core/)

Storage (src/storage/)

Builders (src/builders/)

Processors (src/processors/)

Config (src/config/)

CLI (src/cli/)

How It Works

Upload Process

Search Process

Documentation

Notes

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Core (`src/core/`)

Storage (`src/storage/`)

Builders (`src/builders/`)

Processors (`src/processors/`)

Config (`src/config/`)

CLI (`src/cli/`)

Packages