Citations should guide readers to exact evidence, not just point to entire papers. Research today suffers from widespread citation inaccuracies and the challenge of locating specific supporting content within referenced documents.
SemanticCite transforms citation verification by analysing complete source documents and providing nuanced classification through four categories: Supported, Partially Supported, Unsupported, and Uncertain. Beyond simple validation, the system delivers detailed reasoning, confidence scores, and evidence reference snippets that show researchers exactly how their claims connect to the supporting literature.
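For example, a single check returns a classification, a confidence score, reasoning, and the evidence snippets retrieved from the source. The structure below is only illustrative: the `classification`, `reasoning`, and `metadata['confidence_score']` fields follow the Python API shown later, the `evidence` field name and all values are invented for demonstration.

```python
# Illustrative result shape (values invented; "evidence" key is an assumption)
result = {
    "classification": "Partially Supported",
    "reasoning": "The reference reports an AMOC decline over 2004-2012, but "
                 "attributes part of it to internal variability rather than a trend.",
    "metadata": {"confidence_score": 0.78},
    "evidence": [
        "We find a weakening of the AMOC of about 15% between 2004 and 2012 ...",
    ],
}
```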
- Full-text document analysis (not just abstracts)
- 4-class classification: Supported, Partially Supported, Unsupported, Uncertain
- Evidence-based reasoning with relevant text snippets
- Fine-tuned Qwen3 models (1.7B & 4B parameters)
- Performance comparable to GPT-4 with 100x fewer resources
- Local deployment option for privacy
- Dense vector search + sparse BM25 matching
- Neural reranking with FlashRank
- Optimized for accuracy and efficiency
- Web interface (Streamlit)
- Python API
- Local/cloud deployment
- Enterprise licensing available
```bash
# Clone and setup
git clone https://github.com/your-org/SemanticCite
cd SemanticCite
conda env create -f environment.yaml
conda activate cite

# Set API keys (copy from template)
cp .env.example .env
# Edit .env with your API keys (see Environment Setup below)

# Run web interface
streamlit run src/app.py
# Visit http://localhost:8501
```

Alternatively, install the package via pip:

```bash
pip install semanticcite
```

Create a `.env` file from the template and add your API keys:

```bash
cp .env.example .env
```

Edit `.env` with the providers you plan to use:
```bash
# For OpenAI models (LLM and/or embeddings)
OPENAI_API_KEY=sk-...

# For Claude models
ANTHROPIC_API_KEY=sk-ant-...

# For Gemini models
GEMINI_API_KEY=AIza...

# Optional: For custom endpoints
NVIDIA_API_KEY=nvapi-...
```

Note: Only the API keys for your chosen providers are required. For fully local deployment with Ollama, no API keys are needed.
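As a minimal sketch of how these keys can be read at runtime, assuming the standard python-dotenv package (the project's own configuration loading may differ):

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

# Load variables from .env into the process environment
load_dotenv()

# Only the key for your selected provider needs to be present
openai_key = os.getenv("OPENAI_API_KEY")
if openai_key is None:
    raise RuntimeError("OPENAI_API_KEY not set - add it to your .env file")
```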
On first run, SemanticCite will automatically download required models. Download times vary based on your configuration:
Cloud Providers (OpenAI/Claude/Gemini):
- FlashRank reranking model: ~150MB (~30-60 seconds)
- Local embeddings (if selected): ~400MB-1GB (~1-3 minutes)
Local Deployment (Ollama):
- FlashRank reranking model: ~150MB (~30-60 seconds)
- SemanticCite-Refiner-Qwen3-1B: ~1GB (~2-4 minutes)
- SemanticCite-Checker-Qwen3-4B: ~2.5GB (~4-8 minutes)
- Local embeddings: ~400MB-1GB (~1-3 minutes)
Total first-run setup time: 2-15 minutes depending on configuration
After initial setup, typical analysis times:
- Cloud providers: 5-15 seconds per citation
- Local models (Ollama): 10-30 seconds per citation (first analysis may take longer as models load into memory)
- Document Processing (2-5s): Splits reference into chunks, creates vector embeddings
- Claim Extraction (1-3s): Extracts core claim from citation text
- Retrieval (1-2s): Finds relevant chunks using hybrid search (BM25 + dense vectors)
- Reranking (1-2s): Reorders chunks by semantic relevance
- Classification (2-5s): Analyses support level and generates reasoning
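A minimal sketch of the retrieval and reranking stages above, assuming the rank_bm25, sentence-transformers, and flashrank packages; the chunks, fusion weights, and model choices here are illustrative rather than the project's internal implementation:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util
from flashrank import Ranker, RerankRequest

# Chunks produced by the document-processing stage (shortened examples)
chunks = [
    "We observe a weakening of the AMOC of roughly 15% between 2004 and 2012.",
    "Methods: transport estimates are derived from the RAPID mooring array at 26N.",
    "Discussion of Pacific decadal variability unrelated to the Atlantic overturning.",
]
claim = "A decline in the AMOC has been observed over the period 2004-2012"

# Sparse retrieval: BM25 over whitespace-tokenised chunks
bm25 = BM25Okapi([c.lower().split() for c in chunks])
sparse = bm25.get_scores(claim.lower().split())

# Dense retrieval: cosine similarity between claim and chunk embeddings
encoder = SentenceTransformer("all-mpnet-base-v2")
dense = util.cos_sim(encoder.encode(claim), encoder.encode(chunks))[0]

# Naive score fusion (illustrative only); keep the two best chunks
shortlist = sorted(range(len(chunks)),
                   key=lambda i: 0.5 * float(sparse[i]) + 0.5 * float(dense[i]),
                   reverse=True)[:2]

# Neural reranking of the shortlist with FlashRank
ranker = Ranker(model_name="ms-marco-MultiBERT-L-12")
reranked = ranker.rerank(RerankRequest(
    query=claim,
    passages=[{"id": i, "text": chunks[i]} for i in shortlist],
))
for passage in reranked:
    print(f"{passage['score']:.3f}  {passage['text'][:60]}")
```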
After completing installation, launch the web interface:
```bash
streamlit run src/app.py
# Visit http://localhost:8501
```

Features:
- Upload files (PDF, TXT, Markdown) or download from URLs
- Choose LLM providers: OpenAI, Claude, Gemini, or Local (Ollama)
- Multiple embedding options: Local SentenceTransformers, OpenAI, Custom endpoint
- Optional metadata input for enhanced context
- Interactive results with reasoning and evidence snippets in collapsible expanders
- Export results to Markdown format for documentation
```python
# Basic usage
from src.citecheck import ReferenceChecker

# Initialize with default OpenAI models
checker = ReferenceChecker()

# Or configure specific providers
checker = ReferenceChecker(
    llm_provider="openai",
    llm_config={
        "model": "gpt-4.1-mini",
        "temperature": 0.7
    },
    embedding_provider="local",
    embedding_config={
        "model_name": "all-mpnet-base-v2"
    }
)

# Check a citation
result = checker.check_citation(
    citation="Your citation text here",
    reference_text="Reference document text",
    metadata="Optional document metadata"
)

print(f"Classification: {result['classification']}")
print(f"Confidence: {result['metadata']['confidence_score']}")
print(f"Reasoning: {result['reasoning']}")
```

```bash
# Basic CLI usage
python src/citecheck.py \
  --citation "Over the period 2004–2012, a decline in the AMOC has been observed" \
  --reference "path/to/reference.pdf"

# Interactive mode (prompts for inputs)
python src/citecheck.py
```

Powered by LiteLLM, supporting 100+ AI providers including OpenAI, Claude, Gemini, and local endpoints via Ollama.
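As a rough illustration of what LiteLLM's provider routing looks like, here is a sketch using LiteLLM's public completion API (not SemanticCite's internal calls); the model names are examples only:

```python
from litellm import completion

# The same call signature works across providers; only the model string changes
for model in ["gpt-4.1-mini", "claude-3-5-haiku-20241022", "ollama/llama3"]:
    response = completion(
        model=model,
        messages=[{"role": "user", "content": "Does the passage support the claim?"}],
    )
    print(model, "->", response.choices[0].message.content[:80])
```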
- Local: SentenceTransformers models (`all-mpnet-base-v2`, `Qwen/Qwen3-Embedding-0.6B`)
- OpenAI: `text-embedding-3-small`, `text-embedding-ada-002`
- Custom Endpoint: Any OpenAI-compatible embedding API
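A minimal sketch of generating local embeddings with SentenceTransformers, using one of the listed models directly (this is the underlying library, not SemanticCite's wrapper):

```python
from sentence_transformers import SentenceTransformer

# Either listed local model works; all-mpnet-base-v2 produces 768-dim vectors
model = SentenceTransformer("all-mpnet-base-v2")

chunks = ["The AMOC declined between 2004 and 2012.",
          "Observations come from the RAPID mooring array."]
embeddings = model.encode(chunks)
print(embeddings.shape)  # (2, 768)
```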
Need to verify entire documents automatically? Visit semanticcite.com for tailored solutions:
- Complete Citation System: Automatic document processing with citation extraction and verification of all references in one workflow
- Batch Processing: Verify hundreds or thousands of citations efficiently with automated pipelines
- API Integration: RESTful API for seamless integration into editorial and publishing workflows
- On-premise Deployment: Secure, private installation with custom model training on your domain
- Hybrid Retrieval: BM25 + Dense Vector Search
- Reranking: FlashRank neural reranking
- Classification: Fine-tuned Qwen3 models
- Frontend: Streamlit web interface
- Storage: ChromaDB vector database
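For the storage component, a minimal sketch of holding reference chunks in ChromaDB using the standard chromadb client API; the collection name and documents are illustrative, not the project's internal schema:

```python
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient for disk storage
collection = client.create_collection("reference_chunks")

# Store chunks; ChromaDB embeds them with its default embedding function
collection.add(
    ids=["chunk-0", "chunk-1"],
    documents=["The AMOC declined between 2004 and 2012.",
               "Methods: RAPID array observations at 26N."],
)

# Retrieve the chunks most similar to the extracted claim
results = collection.query(query_texts=["decline in the AMOC 2004-2012"], n_results=2)
print(results["documents"][0])
```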
For cloud providers (OpenAI, Claude, Gemini), a single model handles both claim extraction and classification:
- Model: Provider-specific (e.g., GPT-4, Claude Sonnet, Gemini Flash)
- Embedding: Local SentenceTransformers or OpenAI embeddings
- Supported Formats: PDF, TXT, Markdown
For local deployment with Ollama, two specialized models work together:
- Preprocessing Model: `SemanticCite-Refiner-Qwen3-1B` (extracts core claims from citations)
- Classification Model: `SemanticCite-Checker-Qwen3-4B` (analyses support level)
- Embedding Model: Local SentenceTransformers (e.g., `Qwen/Qwen3-Embedding-0.6B`)
- Advantage: Optimized models with better performance and lower resource usage
The SemanticCite models are available on Hugging Face and can be used with Ollama:
Models:
- SemanticCite-Refiner-Qwen3-1B - Claim extraction (1.7B parameters)
- SemanticCite-Checker-Qwen3-4B - Citation verification (4B parameters)
Installation:
- Install Ollama from ollama.ai
- Download the models:
  ```bash
  ollama pull sebsigma/semanticcite-refiner-qwen3-1b
  ollama pull sebsigma/semanticcite-checker-qwen3-4b
  ```
- Verify installation:
  ```bash
  ollama list
  ```
- In the Streamlit interface, select "Local Ollama" as your LLM provider
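To confirm the pulled models respond before launching the interface, you can optionally query Ollama's local HTTP API directly. A minimal sketch using the standard `/api/generate` endpoint (the prompt is only a placeholder; the first call also loads the model into memory, so it may take a while):

```python
import json
import urllib.request

# Ollama serves a local HTTP API on port 11434 by default
payload = {
    "model": "sebsigma/semanticcite-checker-qwen3-4b",
    "prompt": "Reply with OK if you are loaded.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```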
```bash
# Run test suite
python tests/run_tests.py

# Test specific functionality
python tests/test_citecheck.py
```

Problem: Analysis fails with "Connection timed out after 120.0 seconds" when using local Ollama models.
Solutions:
- Increase timeout in code: The default timeout is 120 seconds. For slower systems, this may be insufficient.
- Check Ollama is running: `curl http://localhost:11434/api/tags`
- Verify models are installed: `ollama list`
- Test model loading time: `python tests/test_ollama_diagnostics.py`
Problem: Analysis fails with "API key required" error.
Solution: Ensure you've:
- Created a `.env` file from `.env.example`
- Added the correct API key for your selected provider
- Restarted the Streamlit app after adding keys
Problem: First run fails during FlashRank model download.
Solutions:
- Check internet connection
- Retry - the download will resume from where it left off
- Manually download and cache the model:
  ```python
  from flashrank import Ranker
  ranker = Ranker(model_name="ms-marco-MultiBERT-L-12")
  ```
Problem: PDF file upload returns an error.
Solutions:
- Verify the PDF is not corrupted or password-protected
- Try converting to TXT format first
- Check file size is reasonable (<50MB recommended)
Problem: System crashes or becomes unresponsive during analysis.
Solutions:
- Close other applications to free up RAM
- Use cloud providers instead of local models
- Reduce chunk size in processing parameters
- Process one citation at a time
Problem: Analysis completes but shows no evidence chunks.
This is normal behaviour when:
- The citation is not well-supported by the reference document
- The reference document doesn't contain relevant information
- The citation refers to a different section/paper
To investigate:
- Check if you uploaded the correct reference document
- Verify the citation actually refers to this paper
- Try adjusting the relevance threshold (advanced configuration)
If you encounter issues not covered here:
- Check GitHub Issues
- Review logs in the `logs/` directory
- Open a new issue with:
- Error message
- Configuration details (LLM/embedding providers)
- Steps to reproduce
This project is licensed under the MIT License - see the LICENSE file for details.
If you use SemanticCite in your research, please cite our paper:
```bibtex
@article{semanticcite2025,
  title={SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning},
  author={Sebastian Haan},
  journal={ArXiv Preprint},
  year={2025},
  url={https://arxiv.org/abs/2511.16198}
}
```

- Built with LangChain, LiteLLM, and Streamlit
- Models fine-tuned on Qwen3 architecture using Unsloth
- Vector search powered by ChromaDB
- Neural reranking via FlashRank
- Supported by the Sydney Informatics Hub at the University of Sydney
SemanticCite - Enhancing research quality through AI-powered citation verification and insight

