data-silence/timemachine

Built with: Pandas, NumPy, PostgreSQL, Docker, aiogram, scikit-learn

About project

Timemachine is an NLP project based on the newsru.com dataset.

It is an attempt to build an aggregator of past news, delivered through a Telegram bot: @time_mashine_bot

So far, the first stage of the project is complete: an aggregator of past news has been built from the materials of the newsru.com agency.

This work demonstrates my skills as a data science professional across a full range of tasks:

  • collecting and processing data;
  • analyzing data;
  • generating ideas for how the data can be used;
  • building the data storage infrastructure;
  • training the machine learning models needed for the project's tasks;
  • writing the Telegram bot code with the aiogram library;
  • deploying and supporting the finished bot with Docker.

Project structure

Materials related to collecting and analyzing the dataset (parser, EDA, etc.) can be found in the researh_notebooks directory.

The other directories make up the Telegram bot:

  • imports: contains files that import the necessary libraries;
  • models: contains the ready-made embedding and classifier models used by the bot;
  • scripts: stores the scripts that implement separate parts of the bot:
    • time_machine.py: the main script for obtaining data and converting it into the required output format;
    • handlers.py and common_handlers.py: the dispatcher and the handlers for main and basic user interactions;
    • keyboard.py: keyboards;
    • utils: auxiliary functions and variables;
  • graphs: stores auxiliary image files.

The app.py file is the entry point for the bot.
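
The role described above for time_machine.py (obtaining data and converting it into an output format) can be illustrated with a minimal sketch. Everything in it is hypothetical: the `NewsItem` record, the `news_for_day` and `format_digest` functions, and the placeholder in-memory list are assumptions made for illustration and are not taken from the repository, which stores its data in PostgreSQL.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class NewsItem:
    # Hypothetical record layout; the real schema is not described in this README.
    published: date
    title: str
    summary: str


def news_for_day(items, day):
    """Return the items published on the given day, in their original order."""
    return [item for item in items if item.published == day]


def format_digest(items, day):
    """Render the items for one day as a plain-text message for a chat bot."""
    selected = news_for_day(items, day)
    if not selected:
        return f"No news found for {day.isoformat()}."
    lines = [f"News for {day.isoformat()}:"]
    for number, item in enumerate(selected, start=1):
        lines.append(f"{number}. {item.title}: {item.summary}")
    return "\n".join(lines)


# Placeholder data standing in for the real archive.
ARCHIVE = [
    NewsItem(date(2005, 3, 1), "Example headline A", "Short summary A"),
    NewsItem(date(2005, 3, 1), "Example headline B", "Short summary B"),
    NewsItem(date(2010, 7, 15), "Example headline C", "Short summary C"),
]

print(format_digest(ARCHIVE, date(2005, 3, 1)))
```

In a real bot, a handler would presumably pass the date chosen by the user to a function of this kind and send the returned string as the reply.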

About datasource

Newsru.com is a Russian online media outlet that operated as a news agency from August 28, 2000 to May 31, 2021. Since June 1, 2021 it has existed as an archive of the news published over its entire period of operation.

This is a dataset of Russian-language news obtained from a single agency:

  • Russian news spanning 21 years;
  • more than 600,000 news articles;
  • a short summary for every article, which can be used to train summarization models.
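
Because each article comes with a short summary, the data can be arranged as (text, summary) pairs for training a summarization model. The sketch below shows one way this might be done with the standard library. The column names `text` and `summary`, the `load_pairs` function, and the sample rows are assumptions, since this README does not document the dataset's actual file layout.

```python
import csv
import io

# Hypothetical CSV layout with placeholder rows; the real column names may differ.
SAMPLE_CSV = """text,summary
"Full text of the first article.","First summary."
"Full text of the second article.",""
"Full text of the third article.","Third summary."
"""


def load_pairs(handle):
    """Yield (text, summary) pairs, skipping rows where either field is empty."""
    for row in csv.DictReader(handle):
        text = row["text"].strip()
        summary = row["summary"].strip()
        if text and summary:
            yield text, summary


pairs = list(load_pairs(io.StringIO(SAMPLE_CSV)))
print(len(pairs), "usable pairs")
```

Skipping rows with an empty field keeps incomplete records out of the training set; whether the real dataset contains such rows is not stated here.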

License

This project is licensed under the MIT license. For more information, see the LICENSE file. All text materials on NEWSru.com are available under the Creative Commons Attribution 4.0 International license.
