NewsPeeking API is a RESTful API built with FastAPI that allows you to crawl news websites, extract article data, and classify news articles using Natural Language Processing (NLP) techniques. It's designed to be modular, configurable, and easy to use for developers who need to programmatically access and analyze news content.
- Web Crawling: Efficiently crawls news websites using `BeautifulSoup4` and `requests`.
- Article Extraction: Extracts key information from news articles:
- 📰 Headline
- 📝 Article Text
- 📅 Publication Date
- ✍️ Author Information
- Intelligent Crawling Modes:
- List Articles Mode (Default): Extracts and lists article URLs from news website listing pages (homepages, category pages).
- Crawl Articles Mode (Optional): Crawls individual articles, extracts content, classifies them, and stores them in a database.
- Article Classification: Categorizes articles using NLP (NLTK) and keyword-based classification with configurable categories (a minimal sketch follows this list).
- Structured Storage: Stores extracted and classified data in a structured SQLite database.
- Website-Specific Configuration: Highly adaptable to different news website structures through YAML configuration files. Define custom CSS selectors for each website.
- RESTful API: Built with FastAPI for a modern, fast, and well-documented API.
- Error Handling & Rate Limiting: Robust error handling for invalid URLs and website issues. Basic rate limiting to be respectful to websites.
- Database Reset Endpoint: Includes an endpoint to easily reset/flush the database for development and testing.
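
The keyword-based classification feature can be pictured as a simple bag-of-words vote over per-category keyword lists. The sketch below is illustrative only, not the project's actual code: `classify_article` and the hard-coded `CATEGORIES` dict are assumptions standing in for the categories defined in `config.yaml`.

```python
from collections import Counter

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer models (older NLTK)
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK releases

# Illustrative categories; the real ones are defined in config.yaml.
CATEGORIES = {
    "technology": {"ai", "software", "startup", "chip", "app"},
    "politics": {"election", "senate", "policy", "vote"},
}

def classify_article(text: str) -> str:
    """Return the category whose keywords occur most often in the text."""
    tokens = [token.lower() for token in word_tokenize(text)]
    hits = Counter()
    for category, keywords in CATEGORIES.items():
        hits[category] = sum(token in keywords for token in tokens)
    best, count = hits.most_common(1)[0]
    return best if count > 0 else "uncategorized"
```

Counting raw keyword hits is crude but cheap; the Future Enhancements section lists TF-IDF and machine-learning classifiers as possible upgrades.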
`POST /crawl`

- Description: Crawls a news website URL. Operates in two modes:
- Default Mode (List Articles): Returns a list of article URLs from a listing page.
- Crawl Articles Mode: Crawls individual articles, extracts data, classifies, and stores in the database.
- Request Body:

  ```json
  {
    "url": "https://www.example-news-website.com/",
    "crawl_articles": false
  }
  ```

  - `url` (string, required): The URL of the news website (listing page or article page).
  - `crawl_articles` (boolean, optional, default: `false`): Set to `true` to enable crawling and storing article content. If `false` or not provided, the API will only list article URLs.
- Response (Default Mode - List Articles):

  ```json
  {
    "message": "Article URLs extracted from listing page.",
    "article_urls": [
      "https://www.example-news-website.com/article1",
      "https://www.example-news-website.com/article2",
      ...
    ]
  }
  ```

- Response (Crawl Articles Mode - Success):

  ```json
  {
    "message": "Crawling from listing page successful, 15 new articles stored, 3 articles updated.",
    "articles_crawled": 18
  }
  ```

- Response (Error - 400 Bad Request):

  ```json
  {
    "detail": "Failed to crawl article data from URL."
  }
  ```
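
For reference, the same endpoint can be exercised from Python using `requests` (already a project dependency). The URL and port below are the local defaults assumed throughout this README:

```python
import requests

response = requests.post(
    "http://127.0.0.1:8000/crawl",
    json={"url": "https://www.example-news-website.com/", "crawl_articles": False},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # default mode: {"message": "...", "article_urls": [...]}
```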
`POST /reset_db`

- Description: Resets the database by deleting all stored articles. Useful for development and testing.
- Request Body: None
- Response (Success - 200 OK):

  ```json
  {
    "message": "Database reset successful: All articles deleted."
  }
  ```
1. Clone the Repository:

   ```bash
   git clone https://github.com/maksha/NewsPeeking.git
   cd NewsPeeking
   ```

2. Create a Virtual Environment:

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Linux/macOS
   venv\Scripts\activate     # On Windows
   ```

3. Install Dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Configuration:
   - Edit the `config.yaml` file in the project root to configure:
     - `default` settings: `rate_limit_delay`, `categories`, `database_url` (default is `sqlite:///./news_articles.db`).
     - `websites` settings: Website-specific configurations including:
       - `listing_page`: `article_link_selectors`, `url_pattern_inclusion`.
       - `article_page`: `headline_selector`, `article_text_selector`, `publication_date_selector`, `author_selector`.
   - See the example `config.yaml` for detailed structure.

5. Run the API:

   ```bash
   uvicorn newspeeking.main:app --reload
   ```

   The API will be accessible at `http://127.0.0.1:8000`.
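
Because the API is built with FastAPI, its auto-generated interactive documentation (Swagger UI) should also be available at `http://127.0.0.1:8000/docs` once the server is running, which is a convenient way to explore and test the endpoints.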
The `config.yaml` file allows you to customize the behavior of NewsPeeking, especially for different news websites.
```yaml
default:
  rate_limit_delay: 1
  categories: # ... (Category definitions) ...

websites:
  nytimes.com: # Website-specific settings for nytimes.com
    listing_page:
      article_link_selectors: # CSS selectors to find article links on listing pages
        - "..."
      url_pattern_inclusion: "..." # URL path pattern for article URLs
    article_page:
      headline_selector: "..." # CSS selector for article headline
      article_text_selector: "..." # CSS selector for article text paragraphs
      publication_date_selector: "..." # CSS selector for publication date
      author_selector: "..." # CSS selector for author
  inet.detik.com: # Website-specific settings for inet.detik.com
    # ... (Similar structure as nytimes.com) ...
```

- `default.rate_limit_delay`: Delay (in seconds) between requests to a website (default: 1). Adjust for politeness and to avoid getting blocked.
- `default.categories`: Defines the categories and keywords used for article classification. Customize these to suit your needs.
- `websites.[domain].listing_page.article_link_selectors`: A list of CSS selectors used to extract article URLs from listing pages. Crucially, you need to inspect the HTML of target websites and update these selectors.
- `websites.[domain].listing_page.url_pattern_inclusion`: A URL path pattern used to filter extracted URLs and identify likely article URLs.
- `websites.[domain].article_page.[selectors]`: CSS selectors used to extract the headline, article text, publication date, and author from individual article pages. You MUST inspect website HTML and update these for each website you want to crawl.
- List Article URLs from the NYTimes Homepage (Default Mode):

  ```bash
  curl -X POST \
    -H "Content-Type: application/json" \
    -d '{"url": "https://www.nytimes.com/"}' \
    http://127.0.0.1:8000/crawl
  ```

- Crawl and Store Articles from the NYTimes Technology Section Page:

  ```bash
  curl -X POST \
    -H "Content-Type: application/json" \
    -d '{"url": "https://www.nytimes.com/section/technology", "crawl_articles": true}' \
    http://127.0.0.1:8000/crawl
  ```

- Reset the Database:

  ```bash
  curl -X POST http://127.0.0.1:8000/reset_db
  ```
- FastAPI - A modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints. (Used for building the REST API endpoints.)
- Uvicorn - An ASGI web server for Python. (Used to run the FastAPI application.)
- Requests - Python HTTP for Humans. (Used for making HTTP requests to fetch web page content.)
- BeautifulSoup4 - Python library for pulling data out of HTML and XML files. (Used for parsing HTML content and extracting data.)
- NLTK (Natural Language Toolkit) - A leading platform for building Python programs to work with human language data. (Used for Natural Language Processing tasks, specifically article classification.)
- Validators - Python validation library. (Used for validating URL formats.)
- SQLAlchemy - Python SQL toolkit and Object-Relational Mapper. (Used as an ORM to interact with the SQLite database.)
- python-dateutil - Extensions to the standard Python datetime module. (Used for robust parsing of dates from web pages.)
- PyYAML - YAML parser and emitter for Python. (Used for loading configuration settings from YAML files.)
Thank you to the developers of these amazing open-source dependencies!
- Advanced NLP Classification: Implement more sophisticated NLP techniques (e.g., TF-IDF, word embeddings, machine learning classifiers) for improved article categorization.
- Pagination Handling: Implement pagination crawling to fetch articles from multi-page listing pages.
- More Robust Rate Limiting: Implement more advanced rate limiting strategies to be even more respectful to websites and handle large-scale crawling.
- Asynchronous Crawling: Convert crawling to asynchronous operations for improved performance and speed.
- Data Validation & Cleaning: Add more robust data validation and cleaning steps.
- User Interface: Develop a web UI to interact with the API and visualize crawled data.
- Expand Website Configurations: Add configurations for more news websites.
...
Made with ❤️ by maksha
