This project is an interactive dashboard that visualizes political discussions on Reddit. It provides insights into how information narratives evolved, which communities were most active, and how users engaged with different types of content.
- Interactive Dashboard: Built with Streamlit to explore Reddit data dynamically.
- Sentiment Analysis: Uses VADER to classify posts as positive, neutral, or negative.
- Topic Modeling: Implements BERTopic to identify key discussion topics.
- Engagement Metrics: Shows post scores, comments, and subreddit activity.
- Time-Series Analysis: Tracks sentiment and post frequency trends over time.
Deployed using Streamlit Cloud.
Reddit's Political Spectrum Analysis
research-engineering-intern-assignment/
│
├── src/
│   ├── dashboard/
│   │   ├── app.py  # Main Streamlit dashboard
│   │   ├── data_loader.py  # Loads Reddit dataset & topic model
│   │   ├── static/  # Static assets (CSS, images, etc.)
│   │   │   └── styles.css  # Custom CSS for UI enhancements
│   │   └── models/  # Machine learning models for dashboard
│   │       ├── sentiment_analysis.py  # Sentiment scoring
│   │       └── topic_modeling.py  # Topic modeling with BERTopic
│   │
│   └── preprocessing/
│       └── clean_data.py  # Data cleaning and preprocessing
│
├── models/  # Folder for trained models
│   ├── topic_model/  # Trained BERTopic models and data
│   │   ├── topic_model.pkl  # Trained BERTopic model
│   │   ├── topics.npy  # Topic assignments per post
│   │   ├── probs.npy  # Probability scores of topics
│   │   ├── topic_labels.pkl  # Topic names generated from BERTopic
│   │   ├── topic_words.pkl  # Top words per topic
│   │   ├── topic_counts.csv  # Number of posts per topic
│   │   └── topic_info.csv  # Topic metadata for visualization
│   │
│   └── sentiment_analysis/  # Trained sentiment analysis models and data
│       ├── topic_sentiment.csv  # Sentiment data per topic
│       ├── topic_sentiment_pivot.csv  # Pivot table of topics vs sentiment
│       ├── topic_sentiment_pivot_pct.csv  # Percentage-based topic sentiment
│       ├── sentiment_stats.pkl  # Sentiment statistics
│       └── sentiment_keywords.pkl  # Keywords strongly associated with sentiment
│
├── data/  # Folder for datasets
│   ├── raw/  # Unprocessed Reddit data
│   └── processed/  # Cleaned and analyzed data
│
├── requirements.txt  # Dependencies
└── README.md  # Documentation
git clone https://github.com/delisha02/research-engineering-intern-assignment.git
cd research-engineering-intern-assignment
# Create a virtual environment
python -m venv venv
# Activate the virtual environment:
# On macOS and Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Data cleaning
python src/preprocessing/clean_data.py
# Topic modeling
python src/dashboard/models/topic_modeling.py
# Sentiment analysis
python src/dashboard/models/sentiment_analysis.py
# Launch the dashboard
streamlit run src/dashboard/app.py

- Loads raw Reddit JSON data.
- Cleans and filters posts based on political keywords.
- Converts timestamps and computes engagement metrics.
- Saves cleaned data to data/processed/reddit_data_final.csv.
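
The cleaning steps above could look roughly like the following. This is a minimal sketch using only the standard library, not the contents of `clean_data.py`: the field names (`title`, `selftext`, `subreddit`, `created_utc`, `score`, `num_comments`), the keyword list, the assumption that the raw file is a JSON array, and the `engagement` formula are all illustrative.

```python
import csv
import json
from datetime import datetime, timezone

# Hypothetical keyword list; the project's actual keywords are not shown here.
POLITICAL_KEYWORDS = {"election", "senate", "congress", "policy", "vote"}


def load_raw(path):
    """Load raw posts, assuming the file holds a JSON array of post objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def is_political(post):
    """Return True if the title or selftext mentions any keyword."""
    text = f"{post.get('title', '')} {post.get('selftext', '')}".lower()
    return any(keyword in text for keyword in POLITICAL_KEYWORDS)


def clean_post(post):
    """Convert the Unix timestamp and add a simple engagement score."""
    created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
    score = int(post.get("score", 0))
    comments = int(post.get("num_comments", 0))
    return {
        "title": post.get("title", "").strip(),
        "selftext": post.get("selftext", "").strip(),
        "subreddit": post.get("subreddit", ""),
        "created": created.isoformat(),
        "score": score,
        "num_comments": comments,
        # Assumed metric: upvotes plus comments.
        "engagement": score + comments,
    }


def clean_posts(raw_posts):
    """Keep only political posts and normalize their fields."""
    return [clean_post(post) for post in raw_posts if is_political(post)]


def save_csv(rows, path):
    """Write the cleaned rows to a CSV file with a header row."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

A run would then chain the helpers, for example `save_csv(clean_posts(load_raw(raw_path)), "data/processed/reddit_data_final.csv")`, where `raw_path` is whichever file sits under `data/raw/`.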
- Uses NLTK's VADER to compute sentiment for post titles & selftext.
- Assigns sentiment categories (Very Negative, Negative, Neutral, Positive, Very Positive).
- Saves sentiment-enhanced data.
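
As an illustration of the five-way classification, the sketch below maps a VADER compound score (which ranges from -1 to 1) to the category names listed above. The cut-offs are assumptions: ±0.05 follows the neutral band commonly used with VADER, while ±0.6 is an arbitrary choice for the "Very" labels and may differ from what the project uses.

```python
def sentiment_category(compound):
    """Map a VADER compound score in [-1, 1] to one of five labels."""
    # Thresholds are assumptions chosen for illustration.
    if compound <= -0.6:
        return "Very Negative"
    if compound <= -0.05:
        return "Negative"
    if compound < 0.05:
        return "Neutral"
    if compound < 0.6:
        return "Positive"
    return "Very Positive"


def score_text(text):
    """Return the VADER compound score for a piece of text."""
    # Imported lazily so sentiment_category works without NLTK installed.
    # Requires a one-time nltk.download("vader_lexicon").
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return analyzer.polarity_scores(text)["compound"]
```

With NLTK and the lexicon installed, `sentiment_category(score_text(title))` would label a single post title.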
- Uses BERTopic to extract discussion topics.
- Assigns the top 3 topics per post with confidence scores.
- Stores topics for visualization in the dashboard.
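
Picking the top 3 topics for a post can be sketched as below. It assumes each post has one probability per topic (for example, one row of `probs.npy` when BERTopic is fitted with `calculate_probabilities=True`); the helper name `top_topics` is hypothetical.

```python
import heapq


def top_topics(probabilities, k=3):
    """Return the k most likely (topic_id, probability) pairs, best first."""
    indexed = enumerate(probabilities)
    return heapq.nlargest(k, indexed, key=lambda pair: pair[1])


# Example: probabilities for five topics on a single post.
example = top_topics([0.1, 0.5, 0.05, 0.3, 0.05])
```

Here `example` holds topic 1 first, then topic 3, then topic 0, each paired with its probability as the confidence score.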
- Allows users to filter posts by date, subreddit, and sentiment.
- Displays key statistics, time-series trends, and topic distributions.
- Provides a post explorer for sorting by sentiment, comments, and engagement.
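
The filtering and sorting behaviour described above might be expressed as two small helpers. This is a sketch on plain dictionaries rather than the dashboard's actual code; the keys `date`, `subreddit`, `sentiment`, and `num_comments` are assumed names.

```python
from datetime import date


def filter_posts(posts, start, end, subreddits=None, sentiments=None):
    """Keep posts dated within [start, end] that match the optional selections."""
    selected = []
    for post in posts:
        if not (start <= post["date"] <= end):
            continue
        if subreddits and post["subreddit"] not in subreddits:
            continue
        if sentiments and post["sentiment"] not in sentiments:
            continue
        selected.append(post)
    return selected


def sort_posts(posts, key, descending=True):
    """Order posts by a numeric field such as comments or engagement."""
    return sorted(posts, key=lambda post: post[key], reverse=descending)
```

Passing `None` (or an empty set) for `subreddits` or `sentiments` leaves that dimension unfiltered, which mirrors a sidebar where nothing has been selected yet.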
- Loads data & applies sidebar filters (date, subreddit).
- Displays key statistics (Total Posts, Avg. Sentiment).
- Renders interactive charts & graphs using Plotly.
- Uses NLTK VADER to compute sentiment.
- Adjusts weights dynamically (title vs. selftext).
- Classifies sentiment into 5 categories.
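
One way to weight title against selftext dynamically is to let the title's share shrink as the body grows. The sketch below is an assumption about how such a scheme could work, not the rule used in `sentiment_analysis.py`; the parameters `base`, `min_weight`, and `full_length` are hypothetical.

```python
def title_weight(selftext, base=0.7, min_weight=0.3, full_length=500):
    """Weight given to the title; it shrinks as the selftext gets longer."""
    if not selftext.strip():
        # No body text: the title carries the whole score.
        return 1.0
    share = min(len(selftext) / full_length, 1.0)
    return base - (base - min_weight) * share


def combined_sentiment(title_score, body_score, selftext):
    """Blend title and selftext compound scores using the dynamic weight."""
    weight = title_weight(selftext)
    return weight * title_score + (1.0 - weight) * body_score
```

A link post with no body is scored on its title alone, while a long text post leans mostly on the selftext score.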
- Trains BERTopic using UMAP & HDBSCAN clustering.
- Extracts keywords & representative posts per topic.
- Saves topic distributions for visualization.
- Fetches reddit_data_final.csv.
- Loads the trained topic model (.pkl file).
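
The two loading steps could be sketched with the standard library as follows. The function names are hypothetical, and `data_loader.py` may well use pandas and Streamlit caching instead; this only illustrates the shape of the task.

```python
import csv
import pickle
from pathlib import Path


def load_posts(csv_path):
    """Read the processed posts CSV into a list of dictionaries."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def load_topic_model(pkl_path):
    """Unpickle the trained topic model, or return None if the file is missing."""
    path = Path(pkl_path)
    if not path.exists():
        return None
    # Only unpickle files you created yourself; pickle can run arbitrary code.
    with path.open("rb") as handle:
        return pickle.load(handle)
```

Returning `None` for a missing model lets a dashboard fall back to showing only the non-topic views instead of failing at startup.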
- Customizes header colors & sidebar styles.
- Adjusts font size for better readability.
Here's how the dashboard looks:
- Real-time Reddit API integration for live updates.
- More advanced NLP models (e.g., RoBERTa for sentiment).
- Dashboard customization with user-defined topic filters.
- Named Entity Recognition (NER) for tracking politicians & policies.
Delisha Naik: delisha02








