---
Let's start with question 5. You can't build a model without data.

### Running the code

In the current repo you posted, the codebase uses what I will just refer to as "Haotian's thesis data". We would like to try loading the "GitHub Gold Dataset" instead and see how what we currently have fares in performance: https://figshare.com/articles/dataset/A_gold_standard_for_polarity_of_emotions_of_software_developers_in_GitHub/11604597?file=21001260

The GitHub data should look like this:

Note: to run this codebase's script, you will invoke it from Kaiaulu.

### Preparing the GitHub dataset further

A limitation of this data is that there are only 3 columns. The context of where the data came from, beyond these IDs, is lost. A while back I tried to answer the question of where the data came from. You can obtain a MySQL database dump from here: https://web.archive.org/web/20150206005357/http://ghtorrent.org/msr14.html (please use the MySQL dump). In it, you will find this:

As you can see, far more columns. You will also find a data schema on the Internet Archive website. Here it is for reference:
According to the authors of this dataset, they manually annotated comments from some tables of this database to create the .csv file. So it stands to reason we could simply create a table in said dump, load the data from the .csv file in, and inner join it around:

```sql
CREATE TABLE `comment_sentiment` (
  `ID` int DEFAULT NULL,
  `Polarity` varchar(256) DEFAULT NULL,
  `Text` varchar(256) DEFAULT NULL
);
```

```sql
-- Run inside mysql after changing to the right db: create_commit_comment_sentiment_table.sql first
-- Run on bash: invoke mysql like this to avoid permission errors when loading:
-- mysql --local-infile=1 -u root -p
load data local infile 'comment_sentiment.csv' into table comment_sentiment
fields terminated by ';'
enclosed by '"'
lines terminated by '\n'
IGNORE 1 LINES
(ID, Polarity, Text);
```

The two SQL statements above load the data. We can then start creating queries:

```sql
SELECT * FROM comment_sentiment s
INNER JOIN commit_comments cc ON s.ID = cc.comment_id
INNER JOIN commits c ON c.id = cc.commit_id
INNER JOIN projects p ON c.project_id = p.id;
```

and another:

```sql
SELECT name, count(name) AS count FROM comment_sentiment s
INNER JOIN commit_comments cc ON s.ID = cc.comment_id
INNER JOIN commits c ON c.id = cc.commit_id
INNER JOIN projects p ON c.project_id = p.id
GROUP BY name
ORDER BY count DESC;
```

What I would like you to do here is explore the schema to add as much information as possible to this .csv as additional columns. In an ideal world, we uncover sufficient information to then run Kaiaulu and re-obtain the data of these projects. Re-running with Kaiaulu will give us more data in "Kaiaulu format" (also .csv files), and we can then inner join the .csv sentiment polarities above to Kaiaulu's output.

Again, I would like this effort to prepare and understand this GitHub dataset to go in a separate repo (which I will create), since it is data preparation. Since much of the effort to find, load, and query the data is done, and all that is left is to write SQL queries and learn the dataset for analysis, I am considering this the short M1.

### Milestone 2

What I would like from here is for you to do some cleanup on this codebase with respect to its architecture. The code organization is still not there yet, so we need some refactoring and abstractions. This will make more sense when you try running the code for M1. Again, I don't expect this to be much effort, as two previous ICS 496 students already fixed much of the codebase.

### Milestone 3

We will do the same effort as above, but using emotion data and another database. This time around, you are starting from the dump with less work done upfront from me. More on this when we get there.
---
### 1/30/26 Meeting Notes (Sentiment Classifier)

Hi @carlosparadis, here are the notes from today's meeting.

#### SQL Query Documentation and Versioning Documented in M1

The meeting covered the approach for documenting the data preparation work currently underway for M1. The GHTorrent 2004 MySQL dump has been successfully loaded, and the GitHub Gold Standard data has been injected into the `comment_sentiment` table. All SQL queries and data preparation work will be documented in the separate `sentiment_github_dataset` repo.

#### Next Steps added to M1

The immediate focus will be on organizing and documenting the SQL queries in that repo. We will iterate with @carlosparadis to determine column selection. Next week's meeting will focus on establishing the formal versioning approach and finalizing the M1 output format (whether to export the expanded dataset as CSV or provide documented SQL queries).
---
### Kaiaulu-Sentiment System Architecture with Contextualized Dataset

This diagram shows how three repositories work together to enable sentiment analysis on GitHub developer communication data.

#### Simplified Architecture

#### Detailed Architecture

#### System Overview

The workflow involves three main components: `sentiment_github_dataset`, Kaiaulu, and `sentiment_classifier`.

For technical implementation details (SQL queries, table schemas, etc.), see Issue #1.
---
### Kaiaulu-Sentiment System Architecture with Contextualized Dataset

This diagram shows how three repositories work together to enable sentiment analysis on GitHub developer communication data.

#### sentiment_github_dataset to Kaiaulu

#### Kaiaulu to sentiment_classifier

#### Full System Architecture

To view the full sentiment_github_dataset → Kaiaulu → sentiment_classifier pipeline, you can access the editable file here.

#### System Overview

The workflow involves three main components: `sentiment_github_dataset`, Kaiaulu, and `sentiment_classifier`.

For technical implementation details (SQL queries, table schemas, etc.), see Issue #1.
-
### Project Overview
The sentiment analysis system is a cross-platform tool that classifies developer communication (GitHub comments, Jira issues, emails) into three sentiment categories: positive, negative, and neutral. The system spans both R (Kaiaulu) and Python, where Kaiaulu handles data extraction and preprocessing, while Python ML models perform the actual sentiment classification. Users interact exclusively with Kaiaulu in R, which invokes the Python sentiment models behind the scenes.
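Since the three categories are plain label strings, downstream analysis often wants them on a numeric scale. A minimal sketch of such a mapping; the exact label spellings used in the dataset are an assumption here, not confirmed by the source:

```python
# Map the three polarity labels to a numeric scale. The label
# spellings ("positive"/"negative"/"neutral") are assumed, so the
# lookup normalizes case and whitespace first.
POLARITY_SCORE = {"negative": -1, "neutral": 0, "positive": 1}

def polarity_score(label: str) -> int:
    """Return -1, 0, or +1 for a polarity label (case-insensitive)."""
    return POLARITY_SCORE[label.strip().lower()]

print([polarity_score(l) for l in ["Positive", "neutral", "NEGATIVE"]])
# → [1, 0, -1]
```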
Users can use Kaiaulu (R) to analyze sentiment data via ML models in `sentiment_classifier` (Python), which consists of three main parts. For more detail, here is the fully fleshed-out layout of the system architecture:
The goal of this project is to extend Kaiaulu's sentiment classifier infrastructure using both sentiment and emotion data. This project will focus on adding contextual data to the GitHub Gold Standard dataset, enabling researchers to perform temporal expansion (tracking sentiment evolution across 2004-2025) and horizontal expansion (linking project data to sentiment), and demonstrating pipeline reusability across multiple data sources (GitHub and Jira).
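As an illustration of the temporal-expansion idea, once the Gold Standard polarities are joined to GHTorrent timestamps, polarity counts can be tracked per year. The column names below are assumptions for the sketch, not the real schema:

```python
import pandas as pd

# Hypothetical contextualized rows: polarity labels joined to
# comment timestamps (invented data, assumed column names).
df = pd.DataFrame({
    "created_at": pd.to_datetime(["2012-01-05", "2012-06-10", "2013-03-02"]),
    "Polarity": ["positive", "negative", "positive"],
})

# Temporal expansion: polarity counts per year.
by_year = (df.groupby([df["created_at"].dt.year, "Polarity"])
             .size()
             .unstack(fill_value=0))
print(by_year)
```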
### Project Description
This project consists of three milestones:
#### Milestone 1 (M1): Dataset Preparation

**Goal:** Transition from Haotian's original dataset to a new contextualized dataset, creating a current-day data infrastructure for researchers to use Kaiaulu.

- Obtain the GitHub Gold Standard dataset (`comment_sentiment.csv`)
- Load the GHTorrent 2004 MySQL dump and export it (`ghtorrent_2004_github.csv`)
- Create the `comment_sentiment` table and load the GitHub Gold Standard data into it
- Produce the contextualized dataset (`db_ghtorrent_2004_github.csv`)

#### Why M1 Matters
The GitHub Gold Standard dataset contains manually labeled sentiment data that required significant effort to create. It's used by researchers worldwide. However, it contains no information about which projects the comments came from, who wrote them, when they were created, or any other context.
Fortunately, we have access to the GHTorrent MySQL database dump, which contains contextual data on GitHub projects, users, commits, and comments from 2004.
Joining these datasets unlocks two research capabilities through Kaiaulu: temporal expansion (tracking sentiment evolution over time) and horizontal expansion (linking project data to sentiment).
#### Milestone 2 (M2): Codebase and Architecture Cleanup

**Goal:** Clean up the `sentiment_classifier` codebase based on insights from M1.

#### Milestone 3 (M3): Jira Emotion Dataset

**Goal:** Demonstrate the updated pipeline's reusability across different data sources.