FastAPI-based backend for code similarity/plagiarism scoring using preprocessing + TF-IDF + cosine similarity.
- Accepts code and language id through an HTTP API.
- Removes comments based on language style.
- Creates a temporary source file in temp/.
- Tokenizes code and vectorizes it with TfidfVectorizer.
- Computes a similarity score using cosine similarity (computed internally as a percentage).
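As an illustration of language-aware comment removal, here is a minimal regex-based sketch. It is not the code in modules/preprocess.py: the function name `strip_comments` and the grouping of language ids by comment style are assumptions, and the regexes do not understand string literals or Rust's nested block comments.

```python
import re

# Assumed grouping of the API's language ids by comment style.
C_STYLE = {0, 1, 2, 3, 4}  # C, C++, Java, JavaScript, Rust: // and /* ... */
HASH_STYLE = {5}           # Python: #


def strip_comments(code: str, language: int) -> str:
    """Remove comments with simple regexes and drop lines left empty."""
    if language in C_STYLE:
        # Block comments first, so a "//" inside them is not treated separately.
        code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)
        code = re.sub(r"//[^\n]*", "", code)
    elif language in HASH_STYLE:
        code = re.sub(r"#[^\n]*", "", code)
    else:
        raise ValueError(f"unknown language id: {language}")
    # Trim trailing whitespace and skip lines that held only a comment.
    return "\n".join(line.rstrip() for line in code.splitlines() if line.strip())


c_source = "#include <stdio.h>\n// entry point\nint main(){return 0;} /* done */"
print(strip_comments(c_source, 0))
```

Note that `#include` survives for C because `#` is only treated as a comment marker for Python. A tokenizer-based approach would be needed to handle comment markers inside string literals correctly.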
- Python 3.12+
- FastAPI
- scikit-learn
.
├── main.py # FastAPI app and routes
├── base/model.py # TF-IDF vectorization + similarity search
├── model/code.py # Request body schema (Pydantic)
├── modules/preprocess.py # Language-aware comment removal + token prep
├── modules/genrate_code_file.py # Writes submitted code to temp file
├── temp/ # Generated temporary source files
├── setup.bash # Create virtual environment (uv)
└── activate.bash # Activate virtual environment
Language ids used by the API:
- 0 → C
- 1 → C++
- 2 → Java
- 3 → JavaScript (node)
- 4 → Rust
- 5 → Python
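The ids above decide which extension the temporary file gets. A minimal sketch of that step follows, assuming a hypothetical `EXTENSIONS` mapping and `write_temp_source` helper; the real modules/genrate_code_file.py may name files and map extensions differently.

```python
import tempfile
from pathlib import Path

# Assumed extensions for the language ids above.
EXTENSIONS = {0: ".c", 1: ".cpp", 2: ".java", 3: ".js", 4: ".rs", 5: ".py"}


def write_temp_source(code: str, language: int, directory: str = "temp") -> Path:
    """Write code to a uniquely named file in `directory` and return its path."""
    suffix = EXTENSIONS[language]  # raises KeyError for an unknown id
    Path(directory).mkdir(parents=True, exist_ok=True)
    with tempfile.NamedTemporaryFile(
        "w", suffix=suffix, dir=directory, delete=False, encoding="utf-8"
    ) as handle:
        handle.write(code)
        return Path(handle.name)
```

Using a generated unique name avoids collisions when two requests arrive at the same time. Files created this way are not deleted automatically, so a cleanup step would still be needed.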
./setup.bash
source .venv/bin/activate
# or
./activate.bash
pip install -e .
fastapi dev main.py
App runs at http://127.0.0.1:8000 by default.
Health check/sample route.
Response:
{
"Hello": "World"
}
Request body:
{
"language": 0,
"code": "#include <stdio.h>\nint main(){return 0;}"
}
Current response:
{
"message": "Done"
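}
- Input code is preprocessed (comment removal by language).
- Cleaned code is saved in temp/ with language extension.
- File content is split into tokens and used as corpus.
- TF-IDF matrix is built from corpus.
- Cosine similarity score is computed against a query snippet.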
- TF-IDF matrix is built from corpus.
- Cosine similarity score is computed against a query snippet.
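The scoring arithmetic in the last two steps can be sketched with the standard library alone. This is a stand-in for scikit-learn's TfidfVectorizer and cosine similarity, not the code in base/model.py. It assumes each snippet is one document (the project may build its corpus differently), uses a simple identifier-and-number tokenizer, and applies the smoothed idf formula that TfidfVectorizer uses by default, idf = ln((1 + n) / (1 + df)) + 1. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(code: str) -> list[str]:
    """Split code into identifiers and integer literals; punctuation is ignored."""
    return re.findall(r"[A-Za-z_]\w*|\d+", code)


def tfidf_vectors(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in documents
    ]


def cosine_percent(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors, scaled to 0..100."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document has no direction to compare
    return 100.0 * dot / (norm_a * norm_b)


def similarity_percent(submitted: str, query: str) -> float:
    """Score two code snippets against each other."""
    vec_submitted, vec_query = tfidf_vectors([tokenize(submitted), tokenize(query)])
    return cosine_percent(vec_submitted, vec_query)


print(similarity_percent("int main(){return 0;}", "int main(){int x = 0; return x;}"))
```

Because cosine similarity normalizes vector length, a short snippet copied into a longer file can still score high when most of its tokens are shared.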
The current POST / route computes a score internally but uses a hardcoded query snippet and does not return the score in the API response yet.
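To try the route from a script, a small client sketch using only the standard library is shown here. The helper names `build_request` and `submit` are hypothetical; the URL and the request body fields come from this README, and the server must already be running for `submit` to succeed.

```python
import json
import urllib.request


def build_request(
    code: str, language: int, base_url: str = "http://127.0.0.1:8000"
) -> urllib.request.Request:
    """Build the POST / request with the JSON body described in this README."""
    payload = json.dumps({"language": language, "code": code}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(code: str, language: int) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_request(code, language), timeout=10) as response:
        return json.load(response)
```

With the app running, `submit("#include <stdio.h>\nint main(){return 0;}", 0)` sends the sample request body shown in this README.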
This project is licensed under the MIT License.