Repository for storing and improving the site parser

KGFSB-agent/Site_parser

Site Parser

This is a test parser that extracts news articles and saves them to a CSV file using the Selectolax library.

Features

  • Configurable page range:
    You can set the first and last page to collect data from in the .env file.

  • Asynchronous translation into Russian:
    The parser can also asynchronously translate all incoming text from p and li tags.

  • Planned improvements:
    Parsing options for other sites are planned, as well as the ability to parse several sites at the same time.
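The asynchronous translation feature can be illustrated with a minimal sketch. It is not the project's actual code: `translate_to_russian` is a hypothetical stand-in for a blocking translator call (for example one made through deep-translator), and `translate_all` and `max_concurrent` are names invented for this illustration. The sketch only shows one way blocking calls can be run concurrently while keeping the results in order.

```python
import asyncio


def translate_to_russian(text: str) -> str:
    """Hypothetical stand-in for a blocking translator call."""
    # Placeholder so the sketch runs offline; a real translator
    # would return Russian text here.
    return f"[ru] {text}"


async def translate_all(texts: list[str], max_concurrent: int = 5) -> list[str]:
    """Translate texts concurrently while keeping their original order."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def translate_one(text: str) -> str:
        async with semaphore:
            # Run the blocking call in a worker thread so the event loop stays free.
            return await asyncio.to_thread(translate_to_russian, text)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(translate_one(t) for t in texts))


if __name__ == "__main__":
    paragraphs = ["First paragraph", "A list item"]
    print(asyncio.run(translate_all(paragraphs)))
```

The semaphore caps how many translations run at once, which may help when the translation service limits request rates.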

Setup

Requirements

  • httpx
  • selectolax
  • deep-translator

Installation

  1. Clone the repository:

    git clone https://github.com/yourusername/Site_parser.git
    cd Site_parser
  2. Install Poetry (the command below is for Windows PowerShell):

    (Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | python -
    poetry --version
  3. Install Project Dependencies:

    poetry install
  4. Set Up Environment Variables: The project uses environment variables to store settings such as the start and end page. Create a .env file in the root of the project directory.

    • 4.1. Create a .env file:

      touch .env
    • 4.2. Add the first and last page numbers to parse and the news category (example below):

      START_PAGE=1
      END_PAGE=3
      NEWS_CATEGORY=economy-trade
      
  5. Running the Code:

    • 5.1. Activate the virtual environment created by Poetry:

      poetry shell
    • 5.2. Run the script to start parsing the news pages:

      poetry run python src/main.py
  6. Output:

    • 6.1 Example output:
      The data from the page 1 is collected
      The data from the page 2 is collected

    After running the script, the extracted news articles will be saved in a CSV file located at data/data/results.csv. The file includes the title, news date, news href, news short text, news main text, country and category.
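The configuration and output steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in src/main.py: the helper names `load_page_range` and `save_rows`, the default values, and the exact CSV column names are hypothetical. It also assumes the variables are already present in the process environment, since the source does not show how the project loads the .env file.

```python
import csv
import os
from pathlib import Path

# Column names are assumptions derived from the field list in the README.
FIELDNAMES = [
    "title", "news_date", "news_href", "news_short_text",
    "news_main_text", "country", "category",
]


def load_page_range(env=os.environ) -> tuple[int, int, str]:
    """Read START_PAGE, END_PAGE and NEWS_CATEGORY and validate the range."""
    start = int(env.get("START_PAGE", "1"))      # default is an assumption
    end = int(env.get("END_PAGE", str(start)))   # default is an assumption
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {start}..{end}")
    return start, end, env.get("NEWS_CATEGORY", "")


def save_rows(rows, path) -> None:
    """Write dictionaries to a CSV file, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    start, end, category = load_page_range()
    for page in range(start, end + 1):  # END_PAGE is treated as inclusive here
        print(f"The data from the page {page} is collected")
```

Validating the range before any requests are made means a mistyped .env value fails immediately instead of producing an empty CSV file.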
