Code Plagiarism Checker

FastAPI-based backend for code similarity/plagiarism scoring using preprocessing + TF-IDF + cosine similarity.

Features

Accepts code and language id through an HTTP API.
Removes comments based on language style.
Creates a temporary source file in temp/.
Tokenizes code and vectorizes with TfidfVectorizer.
Computes a similarity score using cosine similarity (calculated internally as a percentage; not yet included in the API response).
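
To illustrate the comment-removal step, here is a minimal regex-based sketch. It is not the code in modules/preprocess.py; the function name `strip_comments` and the patterns are assumptions made for illustration. It treats language id 5 (Python) as hash-style and every other id as C-style, and it skips over quoted strings so that comment markers inside string literals survive. It does not handle Rust lifetimes, Python triple-quoted strings, or JavaScript template and regex literals.

```python
import re

# Each pattern matches either a comment or a quoted string. Strings are
# matched only so they can be skipped and left untouched.
_STRING = r'"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\''
C_STYLE = re.compile(r"//[^\n]*|/\*.*?\*/|" + _STRING, re.DOTALL)
HASH_STYLE = re.compile(r"#[^\n]*|" + _STRING, re.DOTALL)


def strip_comments(code: str, language: int) -> str:
    """Remove comments from code (hypothetical helper, naive by design)."""
    pattern = HASH_STYLE if language == 5 else C_STYLE

    def replace(match: re.Match) -> str:
        text = match.group(0)
        if text[0] in "\"'":
            return text  # string literal: keep as-is
        if text.startswith("/*"):
            return " "   # block comment: keep neighbouring tokens apart
        return ""        # line comment: drop up to the end of the line

    return pattern.sub(replace, code)


c_source = 'int main() { /* entry */ return 0; } // done\nchar *s = "// kept";'
py_source = 'x = 1  # set x\ns = "# kept"\n'
print(strip_comments(c_source, 0))
print(strip_comments(py_source, 5))
```

A real preprocessor would likely need a per-language tokenizer to cover the cases listed above; the regex approach is only meant to show the idea.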

Tech Stack

Python 3.12+
FastAPI
scikit-learn

Project Structure

.
├── main.py                    # FastAPI app and routes
├── base/model.py              # TF-IDF vectorization + similarity search
├── model/code.py              # Request body schema (Pydantic)
├── modules/preprocess.py      # Language-aware comment removal + token prep
├── modules/genrate_code_file.py # Writes submitted code to temp file
├── temp/                      # Generated temporary source files
├── setup.bash                 # Create virtual environment (uv)
└── activate.bash              # Activate virtual environment

Supported Languages

Language ids used by the API:

0 → C
1 → C++
2 → Java
3 → JavaScript (node)
4 → Rust
5 → Python
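
As a sketch, the id table above could be represented as a lookup that also yields the file extension used when the code is written to temp/. The ids and language names come from the list; the extensions, the `LANGUAGES` dict, and the `extension_for` helper are assumptions for illustration and may not match what the project uses.

```python
# Language id -> (display name, assumed file extension).
LANGUAGES = {
    0: ("C", ".c"),
    1: ("C++", ".cpp"),
    2: ("Java", ".java"),
    3: ("JavaScript (node)", ".js"),
    4: ("Rust", ".rs"),
    5: ("Python", ".py"),
}


def extension_for(language_id: int) -> str:
    """Return the assumed file extension for a language id."""
    try:
        return LANGUAGES[language_id][1]
    except KeyError:
        raise ValueError(f"unsupported language id: {language_id}") from None


print(extension_for(4))
```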

Setup

1) Create virtual environment

./setup.bash

2) Activate environment

source .venv/bin/activate
# or
./activate.bash

3) Install dependencies

pip install -e .

Run Server

fastapi dev main.py

App runs at http://127.0.0.1:8000 by default.

API

`GET /`

Health check/sample route.

Response:

{
	"Hello": "World"
}

`POST /`

Request body:

{
	"language": 0,
	"code": "#include <stdio.h>\nint main(){return 0;}"
}

Current response:

{
	"message": "Done"
}
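
For reference, a client call to this route might look like the following sketch, which uses only the Python standard library. The helper name `build_request` is hypothetical, and the base URL assumes the default address from the Run Server section.

```python
import json
import urllib.request


def build_request(language: int, code: str,
                  base_url: str = "http://127.0.0.1:8000") -> urllib.request.Request:
    """Build a POST / request carrying the JSON body shown above."""
    payload = json.dumps({"language": language, "code": code}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request(0, "#include <stdio.h>\nint main(){return 0;}")
print(req.get_method(), req.full_url)
```

With the server running, `urllib.request.urlopen(req)` sends the request, and the response body can be decoded with `json.load`.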

How Scoring Works (Current Implementation)

Input code is preprocessed (comment removal by language).
Cleaned code is saved in temp/ with language extension.
File content is split into tokens and used as corpus.
TF-IDF matrix is built from corpus.
Cosine similarity score is computed against a query snippet.
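
The last three steps can be sketched with the standard library alone. This is a stand-in for scikit-learn's TfidfVectorizer and cosine similarity, not the code in base/model.py: the tokenizer, the function names, and the choice to compare two whole snippets as a two-document corpus are all assumptions. The IDF term uses the smoothed form ln((1 + n) / (1 + df)) + 1, which is scikit-learn's default.

```python
import math
import re
from collections import Counter


def tokenize(code: str) -> list[str]:
    """Split code into identifiers, numbers, and single symbols (assumed scheme)."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Return one sparse TF-IDF vector (token -> weight) per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n = len(docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())  # document frequency: one count per document
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_percent(submission: str, query: str) -> float:
    """Score two snippets on a 0-100 scale."""
    vec_a, vec_b = tfidf_vectors([submission, query])
    return round(100 * cosine(vec_a, vec_b), 2)


print(similarity_percent("int main(){return 0;}", "int main(){return 1;}"))
```

Because cosine similarity ignores vector length, a snippet repeated twice scores the same against a query as the single snippet does; only the relative token weights matter.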

Current Limitation

The POST / route currently computes a score internally, but it compares the submission against a hardcoded query snippet and does not yet return the score in the API response.

License

This project is licensed under the MIT License.
