research-engineering-intern-assignment

📊 The Political Spectrum: Reddit Communities & Their Political Sentiments

Overview

This project is an interactive dashboard that visualizes political discussions on Reddit. It provides insights into how information narratives evolved, which communities were most active, and how users engaged with different types of content.


Features

✅ Interactive Dashboard – Built with Streamlit to explore Reddit data dynamically.
✅ Sentiment Analysis – Uses VADER to classify posts into five sentiment categories, from Very Negative to Very Positive.
✅ Topic Modeling – Implements BERTopic to identify key discussion topics.
✅ Engagement Metrics – Shows post scores, comments, and subreddit activity.
✅ Time-Series Analysis – Tracks sentiment and post frequency trends over time.



Deployment

Deployed using Streamlit Cloud.

Reddit's Political Spectrum Analysis


Project Structure

📂 research-engineering-intern-assignment/
│
├── 📂 src/
│   ├── 📂 dashboard/
│   │   ├── app.py                      # Main Streamlit dashboard
│   │   ├── data_loader.py              # Loads Reddit dataset & topic model
│   │   ├── 📂 static/                  # Static assets (CSS, images, etc.)
│   │   │   ├── styles.css              # Custom CSS for UI enhancements
│   │   ├── 📂 models/                  # Machine learning models for dashboard
│   │   │   ├── sentiment_analysis.py   # Sentiment scoring
│   │   │   ├── topic_modeling.py       # Topic modeling with BERTopic
│   │
│   ├── 📂 preprocessing/
│   │   ├── clean_data.py               # Data cleaning and preprocessing
│
├── 📂 models/                          # Folder for trained models
│   ├── 📂 topic_model/                 # Trained BERTopic models and data
│   │   ├── topic_model.pkl             # Trained BERTopic model
│   │   ├── topics.npy                  # Topic assignments per post
│   │   ├── probs.npy                   # Probability scores of topics
│   │   ├── topic_labels.pkl            # Topic names generated from BERTopic
│   │   ├── topic_words.pkl             # Top words per topic
│   │   ├── topic_counts.csv            # Number of posts per topic
│   │   ├── topic_info.csv              # Topic metadata for visualization
│   │
│   ├── 📂 sentiment_analysis/          # Trained sentiment analysis models and data
│   │   ├── topic_sentiment.csv             # Sentiment data per topic
│   │   ├── topic_sentiment_pivot.csv       # Pivot table of topics vs sentiment
│   │   ├── topic_sentiment_pivot_pct.csv   # Percentage-based topic sentiment
│   │   ├── sentiment_stats.pkl             # Sentiment statistics
│   │   ├── sentiment_keywords.pkl          # Keywords strongly associated with sentiment
│
├── 📂 data/                            # Folder for datasets
│   ├── 📂 raw/                         # Unprocessed Reddit data
│   ├── 📂 processed/                   # Cleaned and analyzed data
│
├── requirements.txt                    # Dependencies
├── README.md                           # Documentation


Installation & Setup

1️⃣ Clone the Repository

git clone https://github.com/delisha02/research-engineering-intern-assignment.git
cd research-engineering-intern-assignment

2️⃣ Create Virtual Environment and Install Dependencies

# Create a virtual environment named "venv"
python -m venv venv
# Activate the virtual environment
# On macOS and Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

3️⃣ Running the Scripts (from root directory: research-engineering-intern-assignment):

# Data cleaning 
python src/preprocessing/clean_data.py
# Topic modeling
python src/dashboard/models/topic_modeling.py
# Sentiment analysis
python src/dashboard/models/sentiment_analysis.py

4️⃣ Run the Dashboard

streamlit run src/dashboard/app.py

How It Works

1️⃣ Data Processing (src/preprocessing/clean_data.py)

  • Loads raw Reddit JSON data.
  • Cleans and filters posts based on political keywords.
  • Converts timestamps and computes engagement metrics.
  • Saves cleaned data to data/processed/reddit_data_final.csv.
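
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the contents of clean_data.py: the keyword list, the flat post layout (`title`, `selftext`, `subreddit`, `created_utc`, `score`, `num_comments`), and the `engagement` formula are all assumptions made for the example.

```python
import csv
from datetime import datetime, timezone

# Hypothetical keyword list; the project's real list is not shown in this README.
POLITICAL_KEYWORDS = {"election", "senate", "congress", "policy", "vote"}


def is_political(post):
    """Return True if the title or selftext mentions any keyword."""
    text = f"{post.get('title', '')} {post.get('selftext', '')}".lower()
    return any(keyword in text for keyword in POLITICAL_KEYWORDS)


def clean_post(post):
    """Normalize one raw post: trim text, convert the timestamp, add engagement."""
    created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
    score = post.get("score", 0)
    num_comments = post.get("num_comments", 0)
    return {
        "subreddit": post.get("subreddit", ""),
        "title": post.get("title", "").strip(),
        "selftext": post.get("selftext", "").strip(),
        "created": created.isoformat(),
        "score": score,
        "num_comments": num_comments,
        # Assumed engagement metric: upvote score plus comment count.
        "engagement": score + num_comments,
    }


def clean_posts(raw_posts):
    """Keep only political posts and normalize each one."""
    return [clean_post(post) for post in raw_posts if is_political(post)]


def save_csv(rows, path):
    """Write cleaned rows to a CSV file such as data/processed/reddit_data_final.csv."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Raw Reddit exports often nest each post under extra keys, so a real loader would likely need an unwrapping step before `clean_posts` is called.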

2️⃣ Sentiment Analysis (src/dashboard/models/sentiment_analysis.py)

  • Uses NLTK's VADER to compute sentiment for post titles & selftext.
  • Assigns sentiment categories (Very Negative, Negative, Neutral, Positive, Very Positive).
  • Saves sentiment-enhanced data.
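
As an illustration of the five-category step, the sketch below bins a VADER compound score (a value in [-1, 1], as returned under the `"compound"` key by NLTK's `SentimentIntensityAnalyzer().polarity_scores(text)`) into the labels listed above. The ±0.05 cutoff for Neutral follows the convention commonly used with VADER; the ±0.5 cutoff for the "Very" labels is an assumption, and the project's actual thresholds may differ.

```python
def categorize(compound):
    """Map a VADER compound score in [-1, 1] to one of five sentiment labels."""
    # Thresholds are illustrative assumptions, not the project's confirmed values.
    if compound <= -0.5:
        return "Very Negative"
    if compound <= -0.05:
        return "Negative"
    if compound < 0.05:
        return "Neutral"
    if compound < 0.5:
        return "Positive"
    return "Very Positive"
```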

3️⃣ Topic Modeling (src/dashboard/models/topic_modeling.py)

  • Uses BERTopic to extract discussion topics.
  • Assigns the top 3 topics per post with confidence scores.
  • Stores topics for visualization in the dashboard.
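
One way to pick the top 3 topics per post is to rank a per-post probability vector, such as a row of the matrix saved in probs.npy. The helper below is a hypothetical sketch in plain Python; it assumes each post has a list of probabilities indexed by topic ID, which may not match how topic_modeling.py stores them.

```python
import heapq


def top_topics(probabilities, k=3):
    """Return the k most probable (topic_id, probability) pairs, best first."""
    ranked = heapq.nlargest(k, enumerate(probabilities), key=lambda pair: pair[1])
    return [(topic_id, float(prob)) for topic_id, prob in ranked]
```

For a post with probabilities `[0.1, 0.6, 0.05, 0.25]`, this ranks topic 1 first, then topic 3, then topic 0.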

4️⃣ Interactive Dashboard (src/dashboard/app.py)

  • Allows users to filter posts by date, subreddit, and sentiment.
  • Displays key statistics, time-series trends, and topic distributions.
  • Provides a post explorer for sorting by sentiment, comments, and engagement.
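
The filtering behavior can be sketched independently of Streamlit. The function below is a minimal example, assuming each post is a dict with hypothetical `date` (a `datetime.date`), `subreddit`, and `sentiment` fields; in the dashboard, the arguments would presumably come from sidebar widgets.

```python
from datetime import date


def filter_posts(posts, start, end, subreddits=None, sentiments=None):
    """Keep posts within [start, end] that match the optional subreddit/sentiment sets."""
    selected = []
    for post in posts:
        if not (start <= post["date"] <= end):
            continue
        # An empty or missing selection means "no restriction" for that filter.
        if subreddits and post["subreddit"] not in subreddits:
            continue
        if sentiments and post["sentiment"] not in sentiments:
            continue
        selected.append(post)
    return selected


def sort_posts(posts, key="engagement", descending=True):
    """Sort posts for the explorer view by a numeric field such as comments or engagement."""
    return sorted(posts, key=lambda post: post[key], reverse=descending)
```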

Key Components

📌 src/dashboard/app.py – The Streamlit Dashboard

  • Loads data & applies sidebar filters (date, subreddit).
  • Displays key statistics (Total Posts, Avg. Sentiment).
  • Renders interactive charts & graphs using Plotly.
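
Before charting, the key statistics and time-series trends need an aggregation step. The sketch below shows one possible version in plain Python; the `date` and `compound` field names are assumptions, and the resulting rows could then be handed to a plotting library such as Plotly.

```python
from collections import defaultdict
from statistics import mean


def summary_stats(posts):
    """Compute the headline numbers: total posts and average sentiment."""
    scores = [post["compound"] for post in posts]
    return {
        "total_posts": len(posts),
        "avg_sentiment": mean(scores) if scores else 0.0,
    }


def daily_trend(posts):
    """Group posts by day and return post counts and mean sentiment per day."""
    by_day = defaultdict(list)
    for post in posts:
        by_day[post["date"]].append(post["compound"])
    return [
        {"date": day, "posts": len(scores), "avg_sentiment": mean(scores)}
        for day, scores in sorted(by_day.items())
    ]
```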

🔍 src/dashboard/models/sentiment_analysis.py – Sentiment Scoring

  • Uses NLTK VADER to compute sentiment.
  • Adjusts weights dynamically (title vs. selftext).
  • Classifies sentiment into 5 categories.
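
The README does not spell out how the title and selftext weights are adjusted, so the following is only one plausible scheme: fall back to the title score when the body is empty or removed, and otherwise give longer bodies more weight up to a cap. The function name, the 0.3 to 0.7 weight range, and the 500-word scale are all hypothetical.

```python
def combined_compound(title_score, selftext_score, selftext):
    """Blend title and selftext compound scores, weighting the body by its length."""
    body = selftext.strip()
    # With no usable body text, the title is the only signal available.
    if not body or body in {"[removed]", "[deleted]"}:
        return title_score
    # Assumed weighting: start at 0.3 and grow with word count, capped at 0.7.
    body_weight = min(0.7, 0.3 + len(body.split()) / 500)
    return (1 - body_weight) * title_score + body_weight * selftext_score
```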

📒 src/dashboard/models/topic_modeling.py – Topic Discovery

  • Trains BERTopic using UMAP & HDBSCAN clustering.
  • Extracts keywords & representative posts per topic.
  • Saves topic distributions for visualization.

🔄 src/dashboard/data_loader.py – Loads Processed Data

  • Fetches reddit_data_final.csv.
  • Loads the trained topic model (.pkl file).
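
A loader along these lines might look like the sketch below. It is a generic illustration, not the project's data_loader.py: the function names are hypothetical, and returning `None` for a missing model file is an assumed behavior. Unpickling a real BERTopic model also requires the bertopic package to be installed, and pickle files should only be loaded from trusted sources.

```python
import csv
import pickle
from pathlib import Path


def load_posts(csv_path):
    """Read the processed CSV (e.g. data/processed/reddit_data_final.csv) into dict rows."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def load_topic_model(pkl_path):
    """Load a pickled topic model, or return None if the file does not exist."""
    path = Path(pkl_path)
    if not path.exists():
        return None
    with path.open("rb") as handle:
        return pickle.load(handle)
```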

🎨 src/dashboard/static/styles.css – UI Enhancements

  • Customizes header colors & sidebar styles.
  • Adjusts font size for better readability.

Dashboard Overview

Here's how the dashboard looks:

Dashboard Sidebar to Filter


Future Enhancements

🚀 Real-time Reddit API integration for live updates.
🤖 More advanced NLP models (e.g., RoBERTa for sentiment).
📊 Dashboard customization with user-defined topic filters.
🔍 Named Entity Recognition (NER) for tracking politicians & policies.



Contributor

Delisha Naik: delisha02
