Persian News Search Engine

A Python-based search engine designed to index and search Persian news articles efficiently. This project processes, normalizes, tokenizes, and indexes text data to enable fast and accurate search results.

Features

Indexing: Creates a positional index of tokens for efficient document retrieval.
Normalization: Normalizes Persian text by standardizing spacing and punctuation and converting Western (ASCII) digits to Persian digits.
Stemming: Implements Persian stemming for better token matching.
Search Engine: Provides a search interface to query indexed documents and return relevant results with their scores and metadata.
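
To make the normalization feature concrete, here is a minimal sketch of the kind of transformation it describes. It is not the code from indexer.py: the function name `normalize`, the punctuation set, and the Arabic-to-Persian letter mapping are all assumptions chosen for illustration.

```python
import re

# Translation table: ASCII digits and Arabic-Indic digits -> Persian digits.
_TABLE = {ord(str(d)): chr(0x06F0 + d) for d in range(10)}
_TABLE.update({0x0660 + d: chr(0x06F0 + d) for d in range(10)})
# Assumed extra step: Arabic yeh and kaf -> their Persian forms.
_TABLE.update({0x064A: "\u06CC", 0x0643: "\u06A9"})

# Hypothetical punctuation set (Latin marks, guillemets, Persian comma,
# semicolon, and question mark); the project's own list lives in utils.py.
_PUNCT = re.compile(r"[.,;:!?()\[\]{}\"'\u00AB\u00BB\u060C\u061B\u061F]")


def normalize(text: str) -> str:
    """Return a normalized copy of `text` (illustrative only)."""
    text = text.translate(_TABLE)      # unify digits and letter variants
    text = _PUNCT.sub(" ", text)       # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace runs


if __name__ == "__main__":
    print(normalize("سال 2024،  خبر!"))
```

The zero-width non-joiner used inside Persian words is not matched by `\s`, so this sketch leaves it untouched.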

Installation

Clone the repository:

```
git clone https://github.com/12ali21/project.git
```

Navigate to the project directory:
```
cd project
```
Install the required dependencies (make sure you have pip installed):
```
pip install -r requirements.txt
```

Usage

Run the main script to index the data and perform searches:
```
python main.py
```
Enter your query when prompted, and the search engine will return the most relevant results.

Project Structure

indexer.py: Handles indexing and preprocessing of text data, including tokenization, normalization, stemming, and positional indexing.
search_engine.py: Implements the search functionality, including calculating document scores and retrieving the top results.
utils.py: Contains utility functions and constants used across the project, such as stopword handling, punctuation definitions, and common word counting.
main.py: The entry point of the project; it ties together the indexing and search functionality.
data/: Contains the JSON files for storing indexed data, stopwords, and other intermediate results.
.gitignore: Specifies ignored files and directories such as data/, pack/, and cache files.
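
The scoring step attributed to search_engine.py can be sketched as follows. This README does not state which ranking formula the project uses, so the example assumes a plain TF-IDF sum with logarithmic term frequency and no length normalization. The function name `top_k` and the index layout (token to a dict of document ID to position list) are hypothetical.

```python
import heapq
import math
from collections import defaultdict


def top_k(query_tokens, index, num_docs, k=10):
    """Rank documents by a simple TF-IDF sum (assumed scoring scheme).

    `index` maps token -> {doc_id: [positions]} (assumed layout).
    Returns up to `k` (doc_id, score) pairs, best first.
    """
    scores = defaultdict(float)
    for token in query_tokens:
        postings = index.get(token, {})
        if not postings:
            continue  # token does not occur in any document
        idf = math.log10(num_docs / len(postings))
        for doc_id, positions in postings.items():
            tf = 1 + math.log10(len(positions))  # dampened term frequency
            scores[doc_id] += tf * idf
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

`heapq.nlargest` avoids sorting every scored document when only the top few results are shown.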

Data Processing

The project processes Persian text through the following stages:

Tokenization: Splits text into individual words or tokens.
Normalization: Adjusts spacing, punctuation, and character representation to standardize tokens.
Stemming: Reduces words to their root forms using a Persian stemmer.
Indexing: Creates a positional index for fast retrieval of documents based on token positions.
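
As an illustration of the indexing stage, the sketch below builds a positional index from token lists that have already been tokenized, normalized, and stemmed, then uses the stored positions to answer a phrase query. The names `build_positional_index` and `phrase_match` and the in-memory layout are assumptions; the actual structures in indexer.py may differ.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map token -> {doc_id: [positions]} for a dict of doc_id -> token list."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    """Return sorted doc_ids where the tokens of `phrase` occur consecutively."""
    if not phrase:
        return []
    # Only documents containing every token can contain the phrase.
    candidates = set(index.get(phrase[0], {}))
    for token in phrase[1:]:
        candidates &= set(index.get(token, {}))
    hits = []
    for doc_id in candidates:
        starts = index[phrase[0]][doc_id]
        # The phrase matches if, for some start s, token i sits at s + i.
        if any(
            all(s + i in index[t][doc_id] for i, t in enumerate(phrase))
            for s in starts
        ):
            hits.append(doc_id)
    return sorted(hits)


if __name__ == "__main__":
    docs = {1: ["خبر", "جدید", "امروز"], 2: ["جدید", "خبر"]}
    print(phrase_match(build_positional_index(docs), ["خبر", "جدید"]))
```

Storing positions rather than only document IDs is what makes the phrase check possible: document 2 above contains both words but not in the queried order.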

Example Input/Output

Input

Enter your query: خبر جدید

Output

Score: 0.85
Title: Example News Title
URL: http://example.com/news

Requirements

Python 3.x
Libraries: math and json (both part of the Python standard library) and PersianStemmer (third-party).
JSON-formatted data files for news articles.
