-
Notifications
You must be signed in to change notification settings - Fork 0
Description
https://github.com/sailuh/sentiment_classifier
Purpose
Explore the GHTorrent 2004 schema to identify all tables and columns that can provide contextual information about the sentiment-labeled comments from the GitHub Gold Standard dataset.
Background
This work connects the GitHub Gold Standard sentiment dataset to the GHTorrent 2004 database to add contextual information (e.g., project data, user data, timestamps). The contextual data will enable Kaiaulu to re-download 2004-2025 project data for temporal expansion.
Related: See Kaiaulu issue #226 for the comment schema structure that Kaiaulu expects.
Current Schema
What we're starting with:
- Github Gold Standard dataset loaded into MySQL as
comment_sentimenttable (per SQL queries provided by @carlosparadis in Discussion Post #6) - Has 7,122 sentiment-labeled comments
- Columns:
ID(comment ID),Polarity(sentiment),Text(comments) - Key connection:
comment_sentiment.IDmaps tocommit_comments.comment_idin GHTorrent database
Here are the first few rows from the GitHub Gold Standard dataset:

What we're connecting to:
- GHTorrent has project/user/timestamp data spread across multiple tables
- Current JOIN path:
comment_sentiment→commit_comments→commits→projects
This is the GHTorrent schema structure (retrieved here):

What we need to produce:
- Kaiaulu-compatible format (see issue #226)
- Contextual dataset with enough info for Kaiaulu to re-download project data
Process
- Examine table structures, relationships, and sample data in the GHTorrent 2004 database
- Focus on tables that can link to comment data (projects, users, commits, issues, pull requests)
- Analyze join behavior (INNER vs LEFT JOIN)
- Document findings in the
sentiment_github_datasetrepository - Iterate with @carlosparadis to determine which columns are most relevant
Task List
- Investigate row count reduction (priority for this week):
- Confirm if INNER JOINs reduces sentiment-labeled rows
- Use LEFT JOINs to identify which comments have no GHTorrent dataset matches
- Document which contextual data is NULL/missing
- Query and document the GHTorrent 2004 schema structure (relevant tables, columns, data types)
- Identify all comment-related tables (
commit_comments,issue_comments,pull_request_comments) - Identify project/repository tables (
projects,repositories) - Identify user tables (
users) and their relationship to comments - Create schema mapping document showing GHTorrent → Kaiaulu format
- Share findings with @carlosparadis for feedback