A comprehensive plagiarism detection system that combines multiple similarity algorithms, web search integration, and AI-powered suggestions to analyze text documents for potential plagiarism. Built with Python using a single-agent LangGraph workflow architecture and featuring a modern web interface.
This system provides multi-layered plagiarism detection through:
- Statistical Analysis: Jaccard similarity and TF-IDF cosine similarity
- Semantic Analysis: Sentence embeddings for contextual similarity
- Web Verification: Google Custom Search integration for online source checking
- AI Enhancement: Integrated OpenAI API for content improvement
- Automated Workflow: Single LangGraph agent with state management for reliable processing
The system uses 1 primary LangGraph agent (PlagiarismDetectionAgent) that manages the entire workflow through state transitions and conditional routing. This single-agent architecture ensures consistent state management and simplified error handling across the entire pipeline.
Place the workflow diagram image as workflow_dag.png in your repository root directory
The workflow follows a directed acyclic graph pattern with the following processing stages:
- parse_input: Entry point that validates and categorizes input data
- extract_text: File processing for PDF, DOCX, and TXT formats
- preprocess_text: Text cleaning, tokenization, and chunking
- detect_plagiarism: Core similarity analysis using multiple algorithms
- analyze_results: Results processing and metric extraction
- generate_suggestions: AI-powered improvement recommendations (conditional based on plagiarism score)
- create_report: Final report generation and formatting
- handle_error: Error handling and recovery with multiple entry points
The diagram shows the complete workflow with:
- Solid arrows: Direct sequential flow
- Dotted arrows: Conditional routing based on decision points
- 3 decision points: File type routing, error handling, and suggestion generation threshold
Detection Engine (plagiarism_detector.py)
- Multiple similarity algorithms (Jaccard, TF-IDF, Semantic)
- Text preprocessing and chunking
- Web search integration
- OpenAI API integration for suggestions
Single Agent Workflow (langgraph_plagiarism_agent.py)
- LangGraph-based state management with single agent
- Conditional routing and error handling
- File format support (PDF, DOCX, TXT)
- Batch processing capabilities
Web Interface (streamlit_plagiarism_app.py)
- Interactive Streamlit dashboard
- Real-time analysis visualization
- Export functionality
- Progress tracking
Input Text/File β Parse & Validate β Extract Text (if file) β
Preprocess & Clean β Detect Plagiarism β Analyze Results β
Generate AI Suggestions (if needed) β Create Final Report
The single agent handles all state transitions, conditional branching, and error recovery through the LangGraph framework.
- Jaccard Similarity: Word overlap analysis between texts
- TF-IDF Cosine Similarity: Document vector comparison
- Semantic Similarity: Context-aware similarity using sentence transformers
- Web Search Verification: Compare against online sources via Google API
- PDF Files: Automatic text extraction using PyPDF2
- Word Documents: DOCX file processing with python-docx
- Plain Text: Direct text file reading
- Raw Text Input: Direct text paste functionality
- Smart Suggestions: OpenAI-powered rewriting recommendations
- Context Awareness: Suggestions tailored to specific flagged content
- Multiple Options: Various paraphrasing and restructuring approaches
- Real-time Dashboard: Interactive web interface with live updates
- Visual Analytics: Plotly charts showing similarity scores and source distribution
- Batch Processing: Analyze multiple documents simultaneously
- Export Options: JSON, CSV, and formatted text reports
- Python 3.8 or higher
- Internet connection (for web search and AI features)
- 2GB RAM minimum (4GB recommended)
streamlit>=1.28.0
plotly>=5.15.0
pandas>=2.0.0
openai>=1.3.0
sentence-transformers>=2.2.0
scikit-learn>=1.3.0
nltk>=3.8.0
langgraph>=0.0.40
PyPDF2>=3.0.0
python-docx>=0.8.11
requests>=2.31.0
python-dotenv>=1.0.0
git clone <repository-url>
cd plagiarism-detection-systempython -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activatepip install -r requirements.txtpython -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab'); nltk.download('stopwords')"Create a .env file in the project root:
# Required: OpenAI API Key
OPENAI_API_KEY=your_openai_api_key_here
# Optional: Google Custom Search (recommended for enhanced detection)
GOOGLE_API_KEY=your_google_api_key_here
GOOGLE_CSE_ID=your_custom_search_engine_id_hereOpenAI API Key:
- Visit OpenAI API Dashboard
- Create account and generate API key
- Add billing information for usage
Google Custom Search (Optional):
- Go to Google Cloud Console
- Enable "Custom Search JSON API"
- Create API credentials
- Set up Custom Search Engine at cse.google.com
- Configure to search entire web (
*in sites to search)
streamlit run streamlit_plagiarism_app.pyAccess the application at http://localhost:8501
# Analyze text directly
python langgraph_plagiarism_agent.py --text "Your text here" --openai-key "your_key"
# Analyze a file
python langgraph_plagiarism_agent.py --file "document.pdf" --openai-key "your_key"
# Batch processing
python langgraph_plagiarism_agent.py --batch "input_files.json" --openai-key "your_key"from plagiarism_detector import PlagiarismDetector
# Initialize detector
detector = PlagiarismDetector(
openai_api_key="your_openai_key",
google_api_key="your_google_key", # Optional
google_cse_id="your_cse_id" # Optional
)
# Analyze text
results = detector.analyze_text("Your text to analyze")
print(f"Plagiarism: {results['overall_plagiarism_percentage']:.1f}%")
# Generate report
report = detector.format_report(results)
print(report)- 0-15%: Low risk (likely original content)
- 15-30%: Medium risk (requires review)
- 30%+: High risk (significant similarities found)
- Jaccard: Measures word overlap between texts
- TF-IDF: Compares document term frequency patterns
- Semantic: Analyzes contextual meaning similarity
- JSON: Complete analysis data with metadata
- CSV: Tabular chunk-by-chunk analysis
- Text Report: Human-readable formatted summary
Edit configuration in the web interface sidebar:
- Chunk Size: Number of sentences per analysis chunk (2-6)
- Web Search: Enable/disable online source checking
- Similarity Thresholds: Adjust sensitivity levels
- Detection Mode: Standard vs Enhanced analysis
Modify thresholds in plagiarism_detector.py:
PLAGIARISM_THRESHOLD = 0.5 # Consider as plagiarized
WEB_SEARCH_THRESHOLD = 0.3 # Include in web search results
SUGGESTION_THRESHOLD = 20 # Generate suggestions above this percentageGenerate workflow diagrams:
python graph.pyThis creates:
- Mermaid diagram code (for web visualization)
- ASCII art workflow chart
- Processing pipeline overview
# Test installation
python test_plagiarism.py
# Test OpenAI integration
python debug_suggestions.py
# Test web interface
python test_streamlit.py# Quick test with sample text
python simple_example.pyplagiarism-detection-system/
βββ plagiarism_detector.py # Core detection engine
βββ enhanced_plagiarism_detector.py # Extended detection methods
βββ langgraph_plagiarism_agent.py # LangGraph workflow agent
βββ streamlit_plagiarism_app.py # Web interface
βββ graph.py # Workflow visualization
βββ requirements.txt # Python dependencies
βββ .env # API keys (create this)
βββ README.md # This file
βββ test_scripts/
β βββ test_plagiarism.py # Installation tests
β βββ debug_suggestions.py # OpenAI debugging
β βββ simple_example.py # Quick start example
βββ docs/
βββ setup_guide.md # Detailed setup instructions
βββ api_documentation.md # API reference
Import Errors:
pip install --upgrade -r requirements.txtNLTK Data Missing:
python -c "import nltk; nltk.download('all')"OpenAI API Errors:
- Verify API key is correct
- Check billing status in OpenAI dashboard
- Monitor rate limits and usage
Google Search Errors:
- Ensure Custom Search JSON API is enabled
- Verify CSE ID format (should be alphanumeric with colon, not email)
- Check API quotas in Google Cloud Console
Streamlit Issues:
streamlit --version
pip install --upgrade streamlitFor Faster Analysis:
- Disable web search for local-only checking
- Use smaller chunk sizes for shorter texts
- Limit batch processing to 10 documents at once
For Better Accuracy:
- Enable all detection methods
- Use both Google and OpenAI integrations
- Increase chunk overlap for better coverage
- State Consistency: Single agent ensures reliable state management across all processing steps
- Error Recovery: Centralized error handling with multiple recovery paths shown in DAG
- Memory Efficiency: Single agent instance handles entire workflow without inter-agent communication overhead
- Processing Speed: ~1-2 seconds per 1000 words (local algorithms)
- Web Search: +2-3 seconds per chunk (depends on network)
- AI Suggestions: +3-5 seconds (depends on OpenAI response time)
- Memory Usage: ~200MB base + ~50MB per document
- Accuracy: 85-95% detection rate for academic text similarities
- Agent Overhead: Minimal due to single-agent design (~10-20ms state transitions)
- Fork the repository
- Create feature branch (
git checkout -b feature/improvement) - Commit changes (
git commit -am 'Add feature') - Push to branch (
git push origin feature/improvement) - Create Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
For issues, questions, or contributions:
- Create an issue on GitHub
- Check the troubleshooting section above
- Review the setup guide in
/docs/setup_guide.md
- Support for more file formats (RTF, ODT)
- Integration with academic databases (ArXiv, PubMed)
- Advanced citation detection
- Multi-language support
- REST API for integration
- Docker containerization
- Cloud deployment options
Built with β€οΈ using Python, OpenAI, LangGraph, and Streamlit
