News Scraper and BigQuery Integration Challenge

A Python-based web scraper that collects news articles from Yogonet, processes them, and stores them in Google BigQuery. The application is containerized with Docker and can be deployed to Google Cloud Run.

Prerequisites

Environment Variables

The application requires the following environment variables:

Variable                         Description
GCP_PROJECT_ID                   Your Google Cloud Project ID
BQ_DATASET_ID                    BigQuery dataset name (default: news_data)
BQ_TABLE_ID                      BigQuery table name (default: articles)
GOOGLE_APPLICATION_CREDENTIALS   Path to the service account credentials file
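
As a minimal sketch, this is how scraper.py might resolve these variables at startup; the actual names inside the repository may differ:

    import os

    # Required: the target Google Cloud project.
    GCP_PROJECT_ID = os.environ["GCP_PROJECT_ID"]

    # Optional, with the defaults listed above.
    BQ_DATASET_ID = os.environ.get("BQ_DATASET_ID", "news_data")
    BQ_TABLE_ID = os.environ.get("BQ_TABLE_ID", "articles")

    # The BigQuery client reads this variable automatically when it is set.
    CREDENTIALS_PATH = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")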

Setup

  1. Clone the repository:

    git clone <repository-url>
    cd <repository-name>
  2. Configure Google Cloud:

    # Login to Google Cloud
    gcloud auth login
    
    # Set your project ID
    gcloud config set project YOUR_PROJECT_ID
    
    # Create a service account
    gcloud iam service-accounts create news-scraper \
        --display-name="News Scraper Service Account"
    
    # Grant necessary permissions
    gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
        --member="serviceAccount:news-scraper@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
        --role="roles/bigquery.admin"
    
    # Download service account key (saved as `credentials.json`)
    gcloud iam service-accounts keys create credentials.json \
        --iam-account=news-scraper@YOUR_PROJECT_ID.iam.gserviceaccount.com
  3. Create BigQuery Dataset:

    bq mk --dataset YOUR_PROJECT_ID:news_data
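
As an optional sanity check, the google-cloud-bigquery client (already listed in requirements.txt) can confirm that the credentials and dataset are wired up. This sketch assumes credentials.json sits in the current directory:

    import os
    from google.cloud import bigquery

    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "credentials.json"

    client = bigquery.Client(project="YOUR_PROJECT_ID")
    # Raises google.api_core.exceptions.NotFound if the dataset is missing.
    dataset = client.get_dataset("news_data")
    print(f"Dataset {dataset.dataset_id} is reachable")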

Configuration

Update the following variables in deploy.sh:

PROJECT_ID="YOUR_PROJECT_ID"
REGION="YOUR_PREFERRED_REGION"
ARTIFACT_REGISTRY="YOUR_REGISTRY_NAME"

Local Development

  1. Build the Docker image:

    docker build -t news-scraper .
  2. Run locally:

    docker run --rm \
      -e GCP_PROJECT_ID=YOUR_PROJECT_ID \
      -e BQ_DATASET_ID=news_data \
      -e BQ_TABLE_ID=articles \
      -e GOOGLE_APPLICATION_CREDENTIALS=/app/credentials.json \
      -v "$(pwd)/credentials.json:/app/credentials.json:ro" \
      -v "$(pwd)/output:/app/output" \
      news-scraper
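
After a successful local run, you can spot-check the loaded rows from Python (a sketch; substitute your own project, dataset, and table names):

    from google.cloud import bigquery

    client = bigquery.Client(project="YOUR_PROJECT_ID")
    query = """
        SELECT title, scrape_date
        FROM `YOUR_PROJECT_ID.news_data.articles`
        ORDER BY scrape_date DESC
        LIMIT 5
    """
    # Print the five most recently scraped article titles.
    for row in client.query(query).result():
        print(row.title, row.scrape_date)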

Deployment

  1. Create Artifact Registry repository:

    gcloud artifacts repositories create YOUR_REGISTRY_NAME \
        --repository-format=docker \
        --location=YOUR_PREFERRED_REGION
  2. Deploy to Cloud Run:

    chmod +x deploy.sh
    ./deploy.sh

Testing

To run the unit tests, execute the following command in the project directory:

pytest tests/ -v
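
As a purely hypothetical illustration of the kind of logic under test (the real helpers in scraper.py may be named and shaped differently), here is how the derived title metrics from the schema below could be computed and verified:

    def compute_title_metrics(title):
        # Hypothetical helper mirroring the derived fields in the BigQuery schema.
        words = title.split()
        return {
            "title_word_count": len(words),
            "title_char_count": len(title),
            "capital_words": [w for w in words if w[:1].isupper()],
        }

    def test_compute_title_metrics():
        metrics = compute_title_metrics("Vegas Casinos Report Record Revenue")
        assert metrics["title_word_count"] == 5
        assert metrics["title_char_count"] == 35
        assert metrics["capital_words"] == ["Vegas", "Casinos", "Report", "Record", "Revenue"]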

Project Structure

├── scraper.py          # Main scraping logic
├── Dockerfile          # Container configuration
├── deploy.sh           # Deployment script
├── requirements.txt    # Python dependencies
├── tests/
│   └── test_scraper.py # Unit tests
└── README.md          # Documentation

Requirements

See requirements.txt for Python dependencies:

google-cloud-bigquery==3.17.2
mock==5.0.1 
pandas==2.2.0
pyarrow==15.0.0
pytest==7.3.1
requests==2.31.0
selenium==4.17.2

Common Issues

  1. Credentials not found: Ensure your credentials.json is in the correct location and mounted properly in Docker.
  2. Permission denied: Verify your service account has the correct IAM roles.
  3. Docker mount issues on Windows: Use the Windows path format for the host side of the volume mount, e.g. C:\path\to\credentials.json:/app/credentials.json:ro

BigQuery Schema

The scraper creates a table with the following schema:

Field              Type           Description
title              STRING         Article title
kicker             STRING         Text above the headline
link               STRING         Article URL
image              STRING         Image URL
title_word_count   INTEGER        Number of words in the title
title_char_count   INTEGER        Number of characters in the title
capital_words      ARRAY<STRING>  Words in the title that start with a capital letter
scrape_date        TIMESTAMP      Time the article was scraped
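
For reference, the same schema expressed with the google-cloud-bigquery client (a sketch; scraper.py may declare it differently). Note that in the client API an array column is a STRING field with mode="REPEATED":

    from google.cloud import bigquery

    schema = [
        bigquery.SchemaField("title", "STRING"),
        bigquery.SchemaField("kicker", "STRING"),
        bigquery.SchemaField("link", "STRING"),
        bigquery.SchemaField("image", "STRING"),
        bigquery.SchemaField("title_word_count", "INTEGER"),
        bigquery.SchemaField("title_char_count", "INTEGER"),
        bigquery.SchemaField("capital_words", "STRING", mode="REPEATED"),
        bigquery.SchemaField("scrape_date", "TIMESTAMP"),
    ]

    # Create the table if the scraper has not already done so.
    client = bigquery.Client(project="YOUR_PROJECT_ID")
    table = client.create_table(
        bigquery.Table("YOUR_PROJECT_ID.news_data.articles", schema=schema),
        exists_ok=True,
    )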

License

This project is licensed under the MIT License. See LICENSE.md for details.
