This project is a complete Sentiment Analysis pipeline that fine-tunes a RoBERTa model to classify text into six different emotions. The project handles dataset class imbalance using weighted loss and includes an interactive web demo built with Gradio and deployed to Hugging Face Spaces.
We use the dair-ai/emotion dataset. It contains English Twitter messages labeled with six basic emotions:
| Label ID | Emotion |
|---|---|
| 0 | Sadness 😢 |
| 1 | Joy 😂 |
| 2 | Love 🥰 |
| 3 | Anger 😡 |
| 4 | Fear 😱 |
| 5 | Surprise 😲 |
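The label order above can be captured in a small mapping, useful when decoding model predictions back to emotion names (a minimal sketch; the dict names `id2label`/`label2id` follow the common Hugging Face config convention):

```python
# Label IDs used by the dair-ai/emotion dataset.
id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}
label2id = {name: i for i, name in id2label.items()}

# Decoding a predicted class ID back to its emotion name:
print(id2label[5])  # surprise
```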
This project goes beyond standard fine-tuning by addressing class imbalance in the training data:
- Data Preprocessing: Tokenization using `RobertaTokenizer` with truncation to a max length of 128.
- Class Weights: We compute class weights using `sklearn.utils.class_weight` to penalize the model more for misclassifying minority classes (like Surprise).
- Custom Trainer: A custom `WeightedTrainer` (subclassing Hugging Face's `Trainer`) is implemented to override the `compute_loss` method, injecting the calculated class weights into the `CrossEntropyLoss`.
- Model: Fine-tuning `roberta-base` for sequence classification.
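The class-weight step can be sketched as follows. This is a minimal illustration using sklearn's `"balanced"` heuristic; the per-class counts below are made up for the example and are not the real distribution of the dataset:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative (not real) per-class sample counts for the six emotions;
# Surprise (id 5) is the rarest, so it should receive the largest weight.
counts = {0: 5000, 1: 6000, 2: 1500, 3: 2400, 4: 2000, 5: 600}
y = np.concatenate([np.full(n, label) for label, n in counts.items()])

# "balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
weights = compute_class_weight("balanced", classes=np.arange(6), y=y)
print(dict(zip(range(6), weights.round(3))))
```

In the full pipeline, these weights would be converted to a torch tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss` inside the `WeightedTrainer.compute_loss` override.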