Skip to content

Explore GHTorrent 2004 Schema for Contextual Data #1

@splimon

Description

@splimon

https://github.com/sailuh/sentiment_classifier

Purpose

Explore the GHTorrent 2004 schema to identify all tables and columns that can provide contextual information about the sentiment-labeled comments from the GitHub Gold Standard dataset.

Background

This work connects the GitHub Gold Standard sentiment dataset to the GHTorrent 2004 database to add contextual information (e.g., project data, user data, timestamps). The contextual data will enable Kaiaulu to re-download 2004-2025 project data for temporal expansion.

Related: See Kaiaulu issue #226 for the comment schema structure that Kaiaulu expects.

Current Schema

What we're starting with:

  • Github Gold Standard dataset loaded into MySQL as comment_sentiment table (per SQL queries provided by @carlosparadis in Discussion Post #6)
  • Has 7,122 sentiment-labeled comments
  • Columns: ID (comment ID), Polarity (sentiment), Text (comments)
  • Key connection: comment_sentiment.ID maps to commit_comments.comment_id in GHTorrent database

Here are the first few rows from the GitHub Gold Standard dataset:
github_gold

What we're connecting to:

  • GHTorrent has project/user/timestamp data spread across multiple tables
  • Current JOIN path: comment_sentimentcommit_commentscommitsprojects

This is the GHTorrent schema structure (retrieved here):
ghtorrent_schema

What we need to produce:

  • Kaiaulu-compatible format (see issue #226)
  • Contextual dataset with enough info for Kaiaulu to re-download project data
kaiaulu_schema

Process

  1. Examine table structures, relationships, and sample data in the GHTorrent 2004 database
  2. Focus on tables that can link to comment data (projects, users, commits, issues, pull requests)
  3. Analyze join behavior (INNER vs LEFT JOIN)
  4. Document findings in the sentiment_github_dataset repository
  5. Iterate with @carlosparadis to determine which columns are most relevant

Task List

  • Investigate row count reduction (priority for this week):
  • Confirm if INNER JOINs reduces sentiment-labeled rows
  • Use LEFT JOINs to identify which comments have no GHTorrent dataset matches
  • Document which contextual data is NULL/missing
  • Query and document the GHTorrent 2004 schema structure (relevant tables, columns, data types)
  • Identify all comment-related tables (commit_comments, issue_comments, pull_request_comments)
  • Identify project/repository tables (projects, repositories)
  • Identify user tables (users) and their relationship to comments
  • Create schema mapping document showing GHTorrent → Kaiaulu format
  • Share findings with @carlosparadis for feedback

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions