A Python-based search engine designed to index and search Persian news articles efficiently. This project processes, normalizes, tokenizes, and indexes text data to enable fast and accurate search results.
- Indexing: Creates a positional index of tokens for efficient document retrieval.
- Normalization: Normalizes Persian text, including fixing spacing and punctuation and converting English digits to Persian digits.
- Stemming: Implements Persian stemming for better token matching.
- Search Engine: Provides a search interface to query indexed documents and return relevant results with their scores and metadata.
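As a rough illustration of the normalization feature listed above, the sketch below converts ASCII digits to Persian digits, unifies two Arabic letter variants, and tidies spacing around punctuation. The function name `normalize`, the mapping tables, and the punctuation set are assumptions made for this example; the project's own rules in `indexer.py` may differ.

```python
import re

# ASCII digits 0-9 mapped to Persian digits U+06F0..U+06F9.
# (Assumed mapping for illustration, not copied from the project.)
ASCII_TO_PERSIAN_DIGITS = {ord(str(i)): chr(0x06F0 + i) for i in range(10)}

# Arabic letter variants often unified with their Persian forms
# (assumption: the project may or may not do this).
ARABIC_TO_PERSIAN_LETTERS = {
    ord("\u064A"): "\u06CC",  # Arabic yeh -> Persian yeh
    ord("\u0643"): "\u06A9",  # Arabic kaf -> Persian kaf
}

# Whitespace that appears directly before a punctuation mark
# (Persian comma, semicolon, question mark, plus ".", "!", ":").
SPACE_BEFORE_PUNCT = re.compile(r"\s+([\u060C\u061B\u061F.!:])")


def normalize(text: str) -> str:
    """Return text with unified digits, letters, and spacing."""
    text = text.translate(ASCII_TO_PERSIAN_DIGITS)
    text = text.translate(ARABIC_TO_PERSIAN_LETTERS)
    text = SPACE_BEFORE_PUNCT.sub(r"\1", text)  # drop space before punctuation
    text = re.sub(r"\s+", " ", text)            # collapse repeated whitespace
    return text.strip()


if __name__ == "__main__":
    print(normalize("news  2023 ."))
```

Applying the same function to both documents and queries keeps the two sides comparable, so a query typed with ASCII digits can still match text stored with Persian digits.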
- Clone the repository:
git clone https://github.com/12ali21/project.git
- Navigate to the project directory:
cd project
- Install the required dependencies (make sure you have pip installed):
pip install -r requirements.txt
- Run the main script to index the data and perform searches:
python main.py
- Enter your query when prompted, and the search engine will return the most relevant results.
- indexer.py: Handles indexing and preprocessing of text data, including tokenization, normalization, stemming, and positional indexing.
- search_engine.py: Implements the search functionality, including calculating document scores and retrieving the top results.
- utils.py: Contains utility functions and constants used across the project, such as stopword handling, punctuation definitions, and common word counting.
- main.py: The entry point of the project; integrates the indexing and search functionalities.
- data/: Contains the JSON files for storing indexed data, stopwords, and other intermediate results.
- .gitignore: Specifies ignored files and directories such as data/, pack/, and cache files.
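To make the scoring step in search_engine.py more concrete, here is a minimal sketch of one common approach: log-scaled TF-IDF over a positional index. The weighting scheme, the function name `score_documents`, and the index layout (`token -> {doc_id: [positions]}`) are assumptions for illustration; the README does not state which formula the project uses.

```python
import math
from collections import defaultdict


def score_documents(query_tokens, index, total_docs, top_k=10):
    """Rank documents by summed TF-IDF weight of the query tokens.

    index is assumed to map token -> {doc_id: [positions]}.
    """
    scores = defaultdict(float)
    for token in query_tokens:
        postings = index.get(token)
        if not postings:
            continue  # token never seen while indexing
        # Rarer tokens (lower document frequency) weigh more.
        idf = math.log10(total_docs / len(postings))
        for doc_id, positions in postings.items():
            # Term frequency is the number of recorded positions, log-scaled.
            tf = 1 + math.log10(len(positions))
            scores[doc_id] += tf * idf
    # Highest score first, truncated to the top_k results.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    toy_index = {"a": {1: [0, 3], 2: [1]}, "b": {1: [2]}}
    for doc_id, score in score_documents(["a", "b"], toy_index, total_docs=4):
        print(doc_id, round(score, 4))
```

The returned `(doc_id, score)` pairs could then be joined with stored metadata such as title and URL before being shown to the user.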
The project processes Persian text through the following stages:
- Tokenization: Splits text into individual words or tokens.
- Normalization: Adjusts spacing, punctuation, and character representation to standardize tokens.
- Stemming: Reduces words to their root forms using a Persian stemmer.
- Indexing: Creates a positional index for fast retrieval of documents based on token positions.
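The stages above can be sketched end to end in a few lines. This is a simplified illustration, not the project's implementation: tokenization is plain whitespace splitting, normalization and stemming are omitted, and the names `build_positional_index` and `docs_with_phrase` are hypothetical.

```python
from collections import defaultdict


def build_positional_index(documents, stopwords=frozenset()):
    """Build token -> {doc_id: [positions]} from {doc_id: text}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        # Whitespace split stands in for the real tokenizer.
        for position, token in enumerate(text.split()):
            if token in stopwords:
                continue  # positions keep their original offsets
            index[token][doc_id].append(position)
    return index


def docs_with_phrase(index, first, second):
    """Return doc ids where `second` appears immediately after `first`."""
    hits = set()
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.add(doc_id)
    return hits


if __name__ == "__main__":
    docs = {1: "خبر جدید امروز", 2: "جدید خبر"}
    idx = build_positional_index(docs)
    print(docs_with_phrase(idx, "خبر", "جدید"))
```

Storing positions rather than only document ids is what allows phrase-style checks like `docs_with_phrase`, at the cost of a larger index.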
Enter your query: خبر جدید
Score: 0.85
Title: Example News Title
URL: http://example.com/news
- Python 3.x
- Libraries: math, json, PersianStemmer
- JSON-formatted data files for news articles.