150 changes: 149 additions & 1 deletion README.md
@@ -1 +1,149 @@
# research-engineering-intern-assignment
# Research-engineering-intern-assignment

# Social Media Analytics Dashboard

A comprehensive research engineering assignment designed to visualize and analyze social media data. This project features a modern React-based dashboard for interactive data exploration and a Python-based data processing pipeline for sentiment analysis, topic modeling, and network graph generation.

![Dashboard Preview](https://via.placeholder.com/800x400?text=Dashboard+Preview)
Deployed link: https://research-media-project.netlify.app/

## 🚀 Features

- **Overview Dashboard**: High-level metrics on posts, authors, and sentiment trends.
- **Event Correlation**: Correlate social media spikes with real-world events to understand external triggers.
- **Narrative Analysis**: Topic modeling using Latent Dirichlet Allocation (LDA) to discover hidden themes.
- **Network Visualization**: Interactive force-directed graph showing relationships between authors and domains.
- **Analyst Copilot**: An AI-powered chat interface to query top sources, analyze specific domains, and search for trends.
- **Sentiment Analysis**: VADER-based sentiment scoring for all posts.

## 🛠️ Tech Stack

- **Frontend**: React, TypeScript, Vite, Tailwind CSS
- **Visualization**: Recharts, React Force Graph
- **Styling**: Tailwind CSS, Lucide React (Icons)
- **Data Processing**: Python 3.x
- **Analysis Libraries**: pandas, scikit-learn (Topic Modeling), vaderSentiment (Sentiment), NetworkX (Graph Theory)

## 📋 Prerequisites

Before you begin, ensure you have the following installed:
- **Node.js** (v18 or higher)
- **npm** (usually comes with Node.js)
- **Python** (v3.8 or higher)

## ⚙️ Setup & Installation

### 1. Clone the Repository
```bash
git clone <repository-url>
cd research-engineering-intern-assignment
```

### 2. Frontend Setup
Navigate to the UI directory and install dependencies:
```bash
cd my-data-ui
npm install
```

### 3. Python Environment Setup
It is recommended to use a virtual environment for the data processing scripts.

**Windows:**
```powershell
# Create virtual environment
python -m venv venv

# Activate it
.\venv\Scripts\Activate

# Install dependencies
pip install -r requirements.txt
```

**macOS/Linux:**
```bash
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

## 📊 Data Pipeline

The dashboard relies on pre-computed JSON data generated by the Python script. This ensures the frontend remains fast and responsive.

1. Ensure your raw data is located at `my-data-ui/public/data.json` (or ensure the script points to the correct source).
2. Run the analysis script:

```bash
# From the my-data-ui directory
python scripts/precompute_analysis.py
```

This command will generate the following files in `public/precomputed/`:
- `overview.json`
- `topics.json`
- `network.json`
- `sentiment.json`
- `event_correlations.json`

> **Note:** If you deploy this application, ensure these generated files are committed to your repository, as the Python script does not run in the browser.
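The shape of this precompute step can be sketched as follows. This is a minimal, hypothetical outline, not the actual `precompute_analysis.py`: field names such as `author` and `created_utc` are assumptions about the raw data format.

```python
from collections import Counter
from datetime import datetime, timezone

def build_overview(posts):
    """Aggregate raw posts into the lightweight summary the dashboard reads."""
    per_day = Counter()
    authors = set()
    for p in posts:
        day = datetime.fromtimestamp(p["created_utc"], tz=timezone.utc).date().isoformat()
        per_day[day] += 1
        authors.add(p["author"])
    return {
        "total_posts": len(posts),
        "unique_authors": len(authors),
        "posts_per_day": dict(sorted(per_day.items())),
    }

# Toy input standing in for my-data-ui/public/data.json
sample = [
    {"author": "a1", "created_utc": 1700000000, "title": "hello"},
    {"author": "a2", "created_utc": 1700086400, "title": "world"},
    {"author": "a1", "created_utc": 1700086400, "title": "again"},
]
overview = build_overview(sample)
# The real script would then write this out, e.g. to public/precomputed/overview.json
print(overview["total_posts"], overview["unique_authors"])  # → 3 2
```

The same pattern (pure function over the raw posts, serialized to a small JSON file) repeats for each of the outputs listed above.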

## 🖥️ Running Locally

To start the development server:

```bash
# Inside my-data-ui directory
npm run dev
```

Open your browser and navigate to `http://localhost:5173` (or the port shown in your terminal).

## 📦 Building for Production

To create a production-ready build:

```bash
npm run build
```

This will generate a `dist` folder containing the static assets. You can preview the production build locally using:

```bash
npm run preview
```

## 🚀 Deployment

This project is optimized for deployment on platforms like **Vercel** or **Netlify**.

1. **Push to GitHub**: Ensure your code (including `public/precomputed` files) is pushed to a GitHub repository.
2. **Connect to Vercel**: Import the project.
3. **Settings**:
- Framework Preset: **Vite**
- Build Command: `npm run build`
- Output Directory: `dist`
4. **Deploy**: Click deploy and your dashboard will be live!

## 🤝 Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## 📸 Screenshots

<img width="1366" height="768" alt="image" src="https://github.com/user-attachments/assets/7b9c508c-cc7a-4fd6-bed6-759ad3a61414" />

<img width="1366" height="768" alt="image" src="https://github.com/user-attachments/assets/324e8526-5081-4822-a45c-68fe12157e58" />

<img width="1366" height="768" alt="image" src="https://github.com/user-attachments/assets/b83d54b0-c33b-435b-bd2b-3da5f9e52645" />

<img width="1366" height="768" alt="image" src="https://github.com/user-attachments/assets/cccaf796-8686-4fce-b18c-9b4fc37ed8c7" />




80 changes: 80 additions & 0 deletions SYSTEM_DESIGN.md
@@ -0,0 +1,80 @@
# System Design & Implementation Logic

**Author:** Veeresh
**Project:** MisinfoOps - Social Media Analytics Dashboard

---

## 1. Architectural Philosophy: "Pre-Compute & Serve"

The core design challenge for this dashboard was handling heavy data analysis (Topic Modeling, Sentiment Analysis) without compromising the frontend user experience. Running complex NLP algorithms like LDA (Latent Dirichlet Allocation) or VADER sentiment analysis directly in the browser would cause significant lag and memory issues.

**Solution:** I adopted a **Pre-computation Architecture**.
- **Backend (Python):** Acts as an ETL (Extract, Transform, Load) pipeline. It ingests the raw `data.json`, performs all heavy mathematical operations, and exports lightweight, optimized JSON files (`overview.json`, `topics.json`, `network.json`, `event_correlations.json`).
- **Frontend (React/Vite):** Acts purely as a visualization layer. It fetches these pre-computed JSONs instantly, ensuring the dashboard feels "snappy" and responsive, regardless of the dataset size.

### System Architecture Diagram

```text
+------------------+ +-----------------------------+ +----------------------+
| Raw Data | | Python ETL Pipeline | | Static JSON Assets |
| | | (scripts/precompute.py) | | (public/precomputed) |
| [ data.json ] +------>+ +------>+ [ overview.json ] |
| [ events.json ] | | 1. Clean Text | | [ topics.json ] |
+------------------+ | 2. VADER Sentiment | | [ network.json ] |
| 3. LDA Topic Modeling | | [ correlations.json]|
| 4. Event Correlation | +----------+-----------+
+-----------------------------+ |
|
v
+----------------------+
| React Frontend |
| |
| [ Overview Tab ] |
| [ Narrative Tab ] |
| [ Network Tab ] |
| [ Analyst Copilot ] |
+----------------------+
```

---

## 2. Data Pipeline (Python)

The `scripts/precompute_analysis.py` script is the engine of the system.

### Key Components:
1. **Data Cleaning:**
- Raw social media text is noisy. I implemented a cleaning function to remove URLs, special characters, and handle `null` values (a bug I caught during development where `selftext` was missing).
2. **Sentiment Analysis (VADER):**
- I chose VADER (Valence Aware Dictionary and sEntiment Reasoner) because it is specifically tuned for social media text (handling emojis, slang, and capitalization) better than generic NLP models.
3. **Topic Modeling (LDA):**
- To discover hidden narratives, I used Scikit-Learn's Latent Dirichlet Allocation. This unsupervised learning algorithm groups posts into "topics" based on word co-occurrence, allowing us to see *what* people are talking about without manual labeling.
4. **Event Correlation (New Feature):**
- I added a logic layer that reads a separate `events.json` (real-world events). It aligns these dates with the post time-series to calculate "Volume Spikes" and "Sentiment Shifts" on those specific days, bridging the gap between online discourse and offline reality.

---

## 3. Frontend Design (React + Tailwind)

The UI was built with a focus on **"Investigative Storytelling"**.

### Key Decisions:
1. **Component Modularity:**
- The dashboard is split into distinct tabs (`Overview`, `Narrative`, `Network`, `Copilot`). Each tab is a self-contained React component, making the codebase easy to maintain and extend.
2. **Visual Hierarchy:**
- I used Tailwind CSS to create a clean, professional aesthetic. The "Key Metrics" row at the top gives immediate context (Total Posts, Active Authors), followed by the main timeline, and then deeper drills (Sentiment, Network).
3. **Interactive Visualization:**
- **Recharts:** Used for the time-series and pie charts because of its declarative nature and smooth animations.
- **Force Graph:** Used for the Network tab to visualize the "echo chambers" of authors and domains.
4. **Analyst Copilot (AI Feature):**
- Instead of a generic chatbot, I implemented a **Rule-Based Intent Parser**. It detects specific user intents (e.g., "top sources", "search for X") and queries the local dataset in real-time. This provides instant, deterministic answers without needing an expensive backend LLM API for every query.
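A rule-based intent parser of this kind can be sketched in a few lines. The intent names and regex patterns below are illustrative only; the component's actual implementation lives in the React frontend in TypeScript.

```python
import re

# Ordered (pattern, intent) rules: first match wins
RULES = [
    (re.compile(r"\btop\s+sources?\b", re.I), "TOP_SOURCES"),
    (re.compile(r"\banaly[sz]e\s+(?P<domain>[\w.-]+\.\w+)", re.I), "ANALYZE_DOMAIN"),
    (re.compile(r"\bsearch\s+(?:for\s+)?(?P<query>.+)", re.I), "SEARCH"),
]

def parse_intent(message):
    """Map a free-text query to an intent plus any captured slots."""
    for pattern, intent in RULES:
        m = pattern.search(message)
        if m:
            return {"intent": intent, "slots": m.groupdict()}
    return {"intent": "FALLBACK", "slots": {}}

print(parse_intent("show me the top sources"))           # intent: TOP_SOURCES
print(parse_intent("analyze breitbart.com for me"))      # intent: ANALYZE_DOMAIN
print(parse_intent("search for election fraud claims"))  # intent: SEARCH
```

Because matching is deterministic, each intent can dispatch directly to a local query over the precomputed JSON, with a fallback response when nothing matches.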

---

## 4. AI Usage & Verification

I utilized AI (GitHub Copilot) as a "Force Multiplier" throughout the project. My strategy was **Context-First**:
- I always established the file structure and goal before asking for code.
- I critically reviewed every output. For example, when the AI suggested a date format that didn't match my JSON, I caught it, refined the prompt, and fixed the pipeline.
- This approach allowed me to build a robust, error-free system much faster than traditional coding, while maintaining full understanding and control over the logic.
24 changes: 24 additions & 0 deletions my-data-ui/.gitignore
@@ -0,0 +1,24 @@
# Logs
logs
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*
pnpm-debug.log*
lerna-debug.log*

node_modules
dist
dist-ssr
*.local

# Editor directories and files
.vscode/*
!.vscode/extensions.json
.idea
.DS_Store
*.suo
*.ntvs*
*.njsproj
*.sln
*.sw?
23 changes: 23 additions & 0 deletions my-data-ui/eslint.config.js
@@ -0,0 +1,23 @@
import js from '@eslint/js'
import globals from 'globals'
import reactHooks from 'eslint-plugin-react-hooks'
import reactRefresh from 'eslint-plugin-react-refresh'
import tseslint from 'typescript-eslint'
import { defineConfig, globalIgnores } from 'eslint/config'

export default defineConfig([
globalIgnores(['dist']),
{
files: ['**/*.{ts,tsx}'],
extends: [
js.configs.recommended,
tseslint.configs.recommended,
reactHooks.configs.flat.recommended,
reactRefresh.configs.vite,
],
languageOptions: {
ecmaVersion: 2020,
globals: globals.browser,
},
},
])
13 changes: 13 additions & 0 deletions my-data-ui/index.html
@@ -0,0 +1,13 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<link rel="icon" type="image/svg+xml" href="/vite.svg" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>my-data-ui</title>
</head>
<body>
<div id="root"></div>
<script type="module" src="/src/main.tsx"></script>
</body>
</html>