AI-powered citation verification with full-text analysis and contextual insight. Automatically validates citations with evidence-based reasoning, confidence scores, and support classifications. Features fine-tuned models, hybrid retrieval, and local deployment options.

SemanticCite: AI-Powered Citation Verification


Automated full-text citation verification and contextual insights

[Figure: SemanticCite input/output example]

🎯 Project Mission

Citations should guide readers to exact evidence, not just point to entire papers. Research today suffers from widespread citation inaccuracies, and locating the specific supporting content within a referenced document remains a challenge.

SemanticCite transforms citation verification by analysing complete source documents and providing nuanced classification through four categories: Supported, Partially Supported, Unsupported, and Uncertain. Beyond simple validation, the system delivers detailed reasoning, confidence scores, and evidence reference snippets that show researchers exactly how their claims connect to the supporting literature.


✨ Features

πŸ” Deep Semantic Analysis

  • Full-text document analysis (not just abstracts)
  • 4-class classification: Supported, Partially Supported, Unsupported, Uncertain
  • Evidence-based reasoning with relevant text snippets
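
The four-class scheme with confidence and evidence can be pictured as a small data structure. This is an illustrative sketch only; the class and field names here are assumptions, not SemanticCite's actual API (the real return shape is shown in the Python API section below).

```python
from dataclasses import dataclass, field
from enum import Enum

class SupportLevel(Enum):
    SUPPORTED = "Supported"
    PARTIALLY_SUPPORTED = "Partially Supported"
    UNSUPPORTED = "Unsupported"
    UNCERTAIN = "Uncertain"

@dataclass
class CitationCheck:
    classification: SupportLevel
    confidence: float                                  # 0.0 to 1.0
    reasoning: str
    evidence: list[str] = field(default_factory=list)  # supporting snippets

# Hypothetical result for a partially supported claim
result = CitationCheck(
    classification=SupportLevel.PARTIALLY_SUPPORTED,
    confidence=0.82,
    reasoning="The reference reports a decline, but over a shorter period.",
    evidence=["...observed a weakening between 2004 and 2010..."],
)
print(result.classification.value)  # Partially Supported
```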

🧠 Lightweight AI Models

  • Fine-tuned Qwen3 models (1.7B & 4B parameters)
  • Performance comparable to GPT-4 with 100x fewer resources
  • Local deployment option for privacy

πŸ”„ Triple Retrieval System

  • Dense vector search + sparse BM25 matching
  • Neural reranking with FlashRank
  • Optimized for accuracy and efficiency
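
One common way to merge dense and sparse result lists before neural reranking is reciprocal rank fusion. The README does not state which fusion method SemanticCite uses, so this pure-Python sketch is illustrative only:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) across the lists it appears
    in, so items ranked highly by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc3", "doc1", "doc7"]    # vector-search order
sparse = ["doc1", "doc3", "doc9"]   # BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Documents found by both retrievers (doc1, doc3) outrank documents found by only one; a reranker such as FlashRank would then reorder the fused list by semantic relevance.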

🌐 Multiple Deployment Options

  • Web interface (Streamlit)
  • Python API
  • Local/cloud deployment
  • Enterprise licensing available

πŸ“¦ Installation

Option 1: Quick Start (Recommended)

# Clone and setup
git clone https://github.com/your-org/SemanticCite
cd SemanticCite
conda env create -f environment.yaml
conda activate cite

# Set API keys (copy from template)
cp .env.example .env
# Edit .env with your API keys (see Environment Setup below)

# Run web interface
streamlit run src/app.py
# Visit http://localhost:8501

Option 2: Python Package (Coming Soon)

pip install semanticcite

Environment Setup

Create a .env file from the template and add your API keys:

cp .env.example .env

Edit .env with the providers you plan to use:

# For OpenAI models (LLM and/or embeddings)
OPENAI_API_KEY=sk-...

# For Claude models
ANTHROPIC_API_KEY=sk-ant-...

# For Gemini models
GEMINI_API_KEY=AIza...

# Optional: For custom endpoints
NVIDIA_API_KEY=nvapi-...

Note: Only the API keys for your chosen providers are required. For fully local deployment with Ollama, no API keys are needed.
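
A quick way to confirm the right key is present before launching the app. The provider names mirror the interface options above; the helper itself is hypothetical, not part of SemanticCite:

```python
import os

# Environment variable expected for each provider option (Ollama needs none)
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "ollama": None,
}

def check_provider(provider: str) -> bool:
    """Return True if the provider is usable in the current environment."""
    var = PROVIDER_KEYS[provider]
    return var is None or bool(os.getenv(var))

print(check_provider("ollama"))  # True: local deployment needs no key
```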


⏱️ First-Run Expectations

Initial Model Downloads

On first run, SemanticCite will automatically download required models. Download times vary based on your configuration:

Cloud Providers (OpenAI/Claude/Gemini):

  • FlashRank reranking model: ~150MB (~30-60 seconds)
  • Local embeddings (if selected): ~400MB-1GB (~1-3 minutes)

Local Deployment (Ollama):

  • FlashRank reranking model: ~150MB (~30-60 seconds)
  • SemanticCite-Refiner-Qwen3-1B: ~1GB (~2-4 minutes)
  • SemanticCite-Checker-Qwen3-4B: ~2.5GB (~4-8 minutes)
  • Local embeddings: ~400MB-1GB (~1-3 minutes)

Total first-run setup time: 2-15 minutes depending on configuration

Processing Times

After initial setup, typical analysis times:

  • Cloud providers: 5-15 seconds per citation
  • Local models (Ollama): 10-30 seconds per citation (first analysis may take longer as models load into memory)

What Happens During Analysis

  1. Document Processing (2-5s): Splits reference into chunks, creates vector embeddings
  2. Claim Extraction (1-3s): Extracts core claim from citation text
  3. Retrieval (1-2s): Finds relevant chunks using hybrid search (BM25 + dense vectors)
  4. Reranking (1-2s): Reorders chunks by semantic relevance
  5. Classification (2-5s): Analyses support level and generates reasoning
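
The shape of stages 1 and 3 can be sketched with simple stand-ins: fixed-size word chunking, and keyword overlap in place of real embeddings and BM25. Everything here is a deliberate simplification for illustration, not the actual implementation:

```python
def chunk(text, size=40):
    """Stage 1 stand-in: split the reference into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(claim, chunks, top_k=3):
    """Stage 3 stand-in: rank chunks by word overlap with the claim."""
    claim_words = set(claim.lower().split())
    scored = [(len(claim_words & set(c.lower().split())), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:top_k] if score > 0]

doc = ("The AMOC weakened between 2004 and 2012. "
       "Unrelated text about methods. ") * 3
hits = retrieve("a decline in the AMOC has been observed", chunk(doc, size=8))
print(len(hits))
```

In the real pipeline, the overlap score is replaced by hybrid BM25 + dense-vector scores, the top chunks are reranked, and only then passed to the classifier.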

πŸ“š Usage

For Researchers & Students: Streamlit Web Interface

After completing installation, launch the web interface:

streamlit run src/app.py
# Visit http://localhost:8501

Features:

  • πŸ“ Upload files (PDF, TXT, Markdown) or download from URLs
  • βš™οΈ Choose LLM providers: OpenAI, Claude, Gemini, or Local (Ollama)
  • πŸ” Multiple embedding options: Local SentenceTransformers, OpenAI, Custom endpoint
  • πŸ“ Optional metadata input for enhanced context
  • πŸ“Š Interactive results with reasoning and evidence snippets in collapsible expanders
  • πŸ“₯ Export results to Markdown format for documentation

For Developers: Python API

# Basic usage
from src.citecheck import ReferenceChecker

# Initialize with default OpenAI models
checker = ReferenceChecker()

# Or configure specific providers
checker = ReferenceChecker(
    llm_provider="openai",
    llm_config={
        "model": "gpt-4.1-mini",
        "temperature": 0.7
    },
    embedding_provider="local",
    embedding_config={
        "model_name": "all-mpnet-base-v2"
    }
)

# Check a citation
result = checker.check_citation(
    citation="Your citation text here",
    reference_text="Reference document text",
    metadata="Optional document metadata"
)

print(f"Classification: {result['classification']}")
print(f"Confidence: {result['metadata']['confidence_score']}")
print(f"Reasoning: {result['reasoning']}")

Command Line Interface

# Basic CLI usage
python src/citecheck.py \
  --citation "Over the period 2004–2012, a decline in the AMOC has been observed" \
  --reference "path/to/reference.pdf"

# Interactive mode (prompts for inputs)
python src/citecheck.py

Supported Providers & Models

LLM Providers

Powered by LiteLLM, which supports 100+ AI providers including OpenAI, Claude, Gemini, and local endpoints via Ollama.

Embedding Providers

  • Local: SentenceTransformers models (all-mpnet-base-v2, Qwen/Qwen3-Embedding-0.6B)
  • OpenAI: text-embedding-3-small, text-embedding-ada-002
  • Custom Endpoint: Any OpenAI-compatible embedding API

πŸ’Ό Tailored Solutions to Scale

Need to verify entire documents automatically? Visit semanticcite.com for tailored solutions:

  • Complete Citation System: Automatic document processing with citation extraction and verification of all references in one workflow
  • Batch Processing: Verify hundreds or thousands of citations efficiently with automated pipelines
  • API Integration: RESTful API for seamless integration into editorial and publishing workflows
  • On-premise Deployment: Secure, private installation with custom model training on your domain

πŸ”§ Technical Details

System Architecture

  • Hybrid Retrieval: BM25 + Dense Vector Search
  • Reranking: FlashRank neural reranking
  • Classification: Fine-tuned Qwen3 models
  • Frontend: Streamlit web interface
  • Storage: ChromaDB vector database

Model Configuration

Cloud Deployment (Single Model)

For cloud providers (OpenAI, Claude, Gemini), a single model handles both claim extraction and classification:

  • Model: Provider-specific (e.g., GPT-4, Claude Sonnet, Gemini Flash)
  • Embedding: Local SentenceTransformers or OpenAI embeddings
  • Supported Formats: PDF, TXT, Markdown

Local Deployment (Dual Model)

For local deployment with Ollama, two specialized models work together:

  • Preprocessing Model: SemanticCite-Refiner-Qwen3-1B (extracts core claims from citations)
  • Classification Model: SemanticCite-Checker-Qwen3-4B (analyses support level)
  • Embedding Model: Local SentenceTransformers (e.g., Qwen/Qwen3-Embedding-0.6B)
  • Advantage: Optimized models with better performance and lower resource usage

Setting Up Ollama Models

The SemanticCite models are available on Hugging Face and can be used with Ollama:

Models:

  • SemanticCite-Refiner-Qwen3-1B (sebsigma/semanticcite-refiner-qwen3-1b)
  • SemanticCite-Checker-Qwen3-4B (sebsigma/semanticcite-checker-qwen3-4b)

Installation:

  1. Install Ollama from ollama.ai
  2. Download the models:
    ollama pull sebsigma/semanticcite-refiner-qwen3-1b
    ollama pull sebsigma/semanticcite-checker-qwen3-4b
  3. Verify installation:
    ollama list
  4. In the Streamlit interface, select "Local Ollama" as your LLM provider
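
After pulling the models, you can sanity-check the output of `ollama list` programmatically. This helper is purely illustrative (not part of SemanticCite); it parses the table that `ollama list` prints, where the first column is the model name:

```python
def models_installed(ollama_list_output, required):
    """Check `ollama list` output for required model names (prefix match,
    so tags such as ':latest' are tolerated)."""
    installed = [line.split()[0]
                 for line in ollama_list_output.splitlines()[1:]  # skip header
                 if line.strip()]
    return all(any(name.startswith(req) for name in installed)
               for req in required)

# Sample output as `ollama list` might print it (column widths vary)
sample = """NAME                                            ID      SIZE
sebsigma/semanticcite-refiner-qwen3-1b:latest   abc123  1.0 GB
sebsigma/semanticcite-checker-qwen3-4b:latest   def456  2.5 GB
"""
required = ["sebsigma/semanticcite-refiner-qwen3-1b",
            "sebsigma/semanticcite-checker-qwen3-4b"]
print(models_installed(sample, required))  # True
```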

πŸ§ͺ Testing

# Run test suite
python tests/run_tests.py

# Test specific functionality
python tests/test_citecheck.py

πŸ”§ Troubleshooting

Common Issues

Ollama Connection Timeout

Problem: Analysis fails with "Connection timed out after 120.0 seconds" when using local Ollama models.

Solutions:

  1. Increase timeout in code: The default timeout is 120 seconds. For slower systems, this may be insufficient.
  2. Check Ollama is running:
    curl http://localhost:11434/api/tags
  3. Verify models are installed:
    ollama list
  4. Test model loading time:
    python tests/test_ollama_diagnostics.py

Missing API Keys

Problem: Analysis fails with "API key required" error.

Solution: Ensure you've:

  1. Created a .env file from .env.example
  2. Added the correct API key for your selected provider
  3. Restarted the Streamlit app after adding keys

FlashRank Download Fails

Problem: First run fails during FlashRank model download.

Solutions:

  1. Check internet connection
  2. Retry - the download will resume from where it left off
  3. Manually download and cache the model:
    from flashrank import Ranker
    ranker = Ranker(model_name="ms-marco-MultiBERT-L-12")

PDF Upload Fails

Problem: PDF file upload returns an error.

Solutions:

  1. Verify the PDF is not corrupted or password-protected
  2. Try converting to TXT format first
  3. Check file size is reasonable (<50MB recommended)

Low Memory / Out of Memory

Problem: System crashes or becomes unresponsive during analysis.

Solutions:

  1. Close other applications to free up RAM
  2. Use cloud providers instead of local models
  3. Reduce chunk size in processing parameters
  4. Process one citation at a time

"No chunks met the relevance threshold" Warning

Problem: Analysis completes but shows no evidence chunks.

This is normal behaviour when:

  • The citation is not well-supported by the reference document
  • The reference document doesn't contain relevant information
  • The citation refers to a different section/paper

To investigate:

  • Check if you uploaded the correct reference document
  • Verify the citation actually refers to this paper
  • Try adjusting the relevance threshold (advanced configuration)

Getting Help

If you encounter issues not covered here:

  1. Check GitHub Issues
  2. Review logs in the logs/ directory
  3. Open a new issue with:
    • Error message
    • Configuration details (LLM/embedding providers)
    • Steps to reproduce

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

Academic Citation

If you use SemanticCite in your research, please cite our paper:

@article{semanticcite2025,
  title={SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning},
  author={Sebastian Haan},
  journal={arXiv preprint arXiv:2511.16198},
  year={2025},
  url={https://arxiv.org/abs/2511.16198}
}

πŸ™ Acknowledgments


SemanticCite - Enhancing research quality through AI-powered citation verification and insight
