Tigrinya Abusive Language Dataset (TiALD) is a large-scale, multi-task benchmark dataset for abusive language detection in the Tigrinya language. It consists of 13,717 YouTube comments annotated for abusiveness, sentiment, and topic tasks. The dataset includes comments written in both the Ge’ez script and prevalent non-standard Latin-based transliterations to mirror real-world usage.
The dataset also includes contextual metadata such as video titles and VLM-generated and LLM-enhanced descriptions of the corresponding video content, enabling context-aware modeling.
This work accompanies the paper "A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings", accepted to the Datasets & Benchmarks Track at NeurIPS 2025, San Diego, December 2-7, 2025.
Outline:
- Dataset Overview
- Baseline Models and Results
- Dataset Details
- Intended Usage of TiALD Dataset
- Evaluation and Computing Metrics
- Citation
- License
- Data Source: YouTube comments from 51 popular channels in the Tigrinya-speaking community.
- Scope: 13,717 human-annotated comments from 7,373 videos with over 1.2 billion cumulative views at the time of collection.
- Sampling: Comments selected using an embedding-based semantic expansion strategy from an initial pool of ~4.1 million comments across ~34.5k videos.
- For data construction methodology, baseline results, and task formulation, see the associated paper.
TiALD supports multi-task modeling of three complementary tasks: abusiveness, sentiment, and topic classification, with the following classes:
- Abusiveness: Binary (`Abusive`, `Not Abusive`)
- Sentiment: 4-way (`Positive`, `Neutral`, `Negative`, `Mixed`)
- Topic: 5-way (`Political`, `Racial`, `Sexist`, `Religious`, `Other`)
A schematic overview of the dataset tasks and classes is shown below:
A stable version of the TiALD dataset is available on the 🤗 Hugging Face Hub.
You can head over to: https://huggingface.co/datasets/fgaim/tigrinya-abusive-language-detection
Or pull it from anywhere as follows:
from datasets import load_dataset
dataset = load_dataset("fgaim/tigrinya-abusive-language-detection")
print(dataset["validation"][5]) # Inspect a sampleSome strong performing trained models trained on TiALD can be found on Hugging Face Hub:
The training and inference code for the three baseline approaches discussed in the paper can be found in the baselines directory.
The following tables show the performance of the baseline models reported in the paper:
| Model | Abusiveness | Sentiment | Topic | TiALD Score |
|---|---|---|---|---|
| Fine-tuned Single-task Models | ||||
| TiELECTRA-small | 82.33 | 42.39 | 26.90 | 50.54 |
| TiRoBERTa-base | 86.67 | 52.82 | 54.23 | 64.57 |
| AfriBERTa-base | 83.42 | 50.81 | 53.20 | 62.48 |
| Afro-XLMR-Large-76L | 85.20 | 54.94 | 51.42 | 63.86 |
| XLM-RoBERTa-base | 81.08 | 30.17 | 43.97 | 51.74 |
| Fine-tuned Multi-task Models | ||||
| TiELECTRA-small | 84.21 | 43.44 | 29.27 | 52.30 |
| TiRoBERTa-base | 86.11 | 53.41 | 54.91 | 64.81 |
| AfriBERTa-base | 83.66 | 50.19 | 53.49 | 62.45 |
| Afro-XLMR-Large-76L | 85.44 | 54.50 | 52.46 | 64.13 |
| XLM-RoBERTa-base | 79.87 | 45.40 | 35.50 | 53.59 |
| Zero-shot Prompted LLMs | ||||
| GPT-4o | 71.05 | 20.55 | 26.25 | 39.28 |
| Claude Sonnet 3.7 | 59.20 | 22.64 | 25.25 | 35.70 |
| Gemma-3 4B | 59.35 | 29.47 | 35.24 | 41.35 |
| LLaMA-3.2 3B | 49.98 | 25.30 | 16.55 | 30.61 |
| Few-shot Prompted LLMs | ||||
| GPT-4o | 72.06 | 21.88 | 27.56 | 40.50 |
| Claude Sonnet 3.7 | 79.31 | 23.39 | 27.92 | 43.54 |
| Gemma-3 4B | 58.37 | 30.46 | 39.49 | 42.78 |
| LLaMA-3.2 3B | 45.65 | 19.94 | 21.68 | 29.09 |
Performance of fine-tuned encoder models (single and multi-task) and prompted generative LLMs (zero-shot and few-shot) evaluated on user comments across all three tasks. The TiALD Score is the average macro F1 across the three tasks. Overall task-level best scores are in bold; category-best scores are italicized.
| Model | Abusiveness | Sentiment | Topic | TiALD Score |
|---|---|---|---|---|
| Fine-tuned Single-task Models | ||||
| TiELECTRA-small | 81.67 | 39.40 | 27.81 | 49.62 |
| TiRoBERTa-base | 86.17 | 54.97 | 54.55 | 65.23 |
| AfriBERTa-base | 82.44 | 51.33 | 52.10 | 61.96 |
| Afro-XLMR-Large-76L | 84.20 | 52.64 | 54.11 | 63.65 |
| XLM-RoBERTa-base | 75.09 | 43.47 | 41.60 | 53.39 |
| Zero-shot Prompted LLMs | ||||
| GPT-4o | 75.59 | 41.03 | 55.52 | 57.38 |
| Claude Sonnet 3.7 | 67.64 | 44.39 | 50.10 | 54.05 |
| Gemma-3 4B | 58.41 | 29.27 | 34.44 | 40.71 |
| LLaMA-3.2 3B | 44.13 | 21.85 | 15.91 | 27.30 |
| Few-shot Prompted LLMs | ||||
| GPT-4o | 75.89 | 45.50 | 58.59 | 59.99 |
| Claude Sonnet 3.7 | 80.29 | 48.01 | 59.45 | 62.58 |
| Gemma-3 4B | 59.39 | 30.43 | 39.60 | 43.14 |
| LLaMA-3.2 3B | 48.29 | 20.19 | 20.20 | 29.56 |
Performance of models with the video title as context. Fine-tuned models were trained on the concatenation of the user comment and the video title; LLMs were prompted with both the comment and the video title. Overall task-level best scores are in bold; category-best scores are italicized.
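To give a sense of how this pairing can be implemented, below is a minimal sketch that feeds the video title and the cleaned comment to an encoder tokenizer as a text pair; the checkpoint (`xlm-roberta-base`), pair encoding, and maximum length are illustrative assumptions, not the paper's exact preprocessing.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("fgaim/tigrinya-abusive-language-detection")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # illustrative choice

def build_context_input(example):
    # Encode the video title and the comment as a text pair; the tokenizer
    # inserts its own separator tokens between the two segments.
    return tokenizer(
        example["video_title"],
        example["comment_clean"],
        truncation=True,
        max_length=256,
    )

encoded_train = dataset["train"].map(build_context_input)
```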
| Model | Comment Only: Zero-shot | Comment Only: Few-shot | Video Title + Comment: Zero-shot | Video Title + Comment: Few-shot |
|---|---|---|---|---|
| Closed Frontier Models | ||||
| GPT-4o | 71.05 | 72.06 | 75.59 | 75.89 |
| Claude Sonnet 3.7 | 59.20 | 79.31 | 67.64 | 80.29 |
| Open-weight Models | ||||
| Gemma-3 4B | 59.35 | 58.37 | 58.41 | 59.39 |
| LLaMA-3.2 3B | 49.98 | 45.65 | 44.13 | 48.29 |
Performance of LLMs on Abusiveness Detection with Cross-Modality Contextual Information: user comment augmented with video_title and auto-generated video_description. Best scores for each prompting approach are in bold; highest scores within model category are italicized.
†LLaMA-3.2 3B produced invalid responses for over 61% of queries in both few-shot settings, mainly due to its limited Tigrinya text understanding.
The final prediction files from the baseline models reported in the paper can be found under the model-predictions folder.
A table summarizing the dataset splits and distributions of samples:
| Split | Samples | Abusive | Not Abusive | Political | Racial | Sexist | Religious | Other Topics | Positive | Neutral | Negative | Mixed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Train | 12,317 | 6,980 | 5,337 | 4,037 | 633 | 564 | 244 | 6,839 | 2,433 | 1,671 | 6,907 | 1,306 |
| Test | 900 | 450 | 450 | 279 | 113 | 78 | 157 | 273 | 226 | 129 | 474 | 71 |
| Dev | 500 | 250 | 250 | 159 | 23 | 21 | 11 | 286 | 108 | 71 | 252 | 69 |
| Total | 13,717 | 7,680 | 6,037 | 4,475 | 769 | 663 | 412 | 7,398 | 2,767 | 1,871 | 7,633 | 1,446 |
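The split sizes and label distributions above can be verified directly from the Hub release. The snippet below assumes the label columns hold the string values listed in the features table further down; `Counter` works the same way if they are stored as class indices.

```python
from collections import Counter
from datasets import load_dataset

dataset = load_dataset("fgaim/tigrinya-abusive-language-detection")
for split in ("train", "validation", "test"):
    ds = dataset[split]
    print(f"{split}: {len(ds)} samples")
    print("  abusiveness:", Counter(ds["abusiveness"]))
    print("  sentiment:  ", Counter(ds["sentiment"]))
    print("  topic:      ", Counter(ds["topic"]))
```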
Below is a complete list of features in the dataset, grouped by type:
| Feature | Type | Description |
|---|---|---|
| `sample_id` | Integer | Unique identifier for the sample. |
| Comment Information | ||
| `comment_id` | String | YouTube comment identifier. |
| `comment_original` | String | Original unprocessed comment text. |
| `comment_clean` | String | Cleaned version of the comment for modeling purposes. |
| `comment_script` | Categorical | Writing system of the comment: `geez`, `latin`, or `mixed`. |
| `comment_publish_date` | String | Year and month when the comment was published, e.g., `2021.11`. |
| Comment Annotations | ||
| `abusiveness` | Categorical | Whether the comment is `Abusive` or `Not Abusive`. |
| `topic` | Categorical | One of: `Political`, `Racial`, `Religious`, `Sexist`, or `Other`. |
| `sentiment` | Categorical | One of: `Positive`, `Neutral`, `Negative`, or `Mixed`. |
| `annotator_id` | String | Unique identifier of the annotator. |
| Video Information | ||
| `video_id` | String | YouTube video identifier. |
| `video_title` | String | Title of the YouTube video. |
| `video_publish_year` | Integer | Year the video was published, e.g., `2022`. |
| `video_num_views` | Integer | Number of views at the time of data collection. |
| `video_description` | String | Generated description of the video content using a vision-language model, refined by an LLM. |
| Channel Information | ||
| `channel_id` | String | Identifier for the YouTube channel. |
| `channel_name` | String | Name of the YouTube channel. |
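For example, a single sample's comment text, task labels, and contextual metadata can be accessed through the field names listed above:

```python
from datasets import load_dataset

dataset = load_dataset("fgaim/tigrinya-abusive-language-detection")
sample = dataset["train"][0]

# Comment text and the three task labels
print(sample["comment_clean"], "|", sample["comment_script"])
print(sample["abusiveness"], sample["sentiment"], sample["topic"])

# Contextual metadata for context-aware modeling
print(sample["video_title"])
print(sample["video_description"])
```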
To assess annotation quality, a subset of 900 comments was double-annotated: 546 examples had exact agreement across all tasks and 354 had partial disagreement.
Aggregate IAA Scores:
| Task | Cohen's Kappa | Remark |
|---|---|---|
| Abusiveness detection | 0.758 | Substantial agreement |
| Sentiment analysis | 0.649 | Substantial agreement |
| Topic classification | 0.603 | Moderate agreement |
Gold label: Expert adjudication was used to determine the final label of the test set, enabling a gold-standard evaluation.
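For reference, the agreement statistic above is Cohen's kappa, which can be computed with scikit-learn; the annotator labels below are made up purely to illustrate the computation and are not part of the dataset.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same five comments.
annotator_a = ["Abusive", "Not Abusive", "Abusive", "Abusive", "Not Abusive"]
annotator_b = ["Abusive", "Not Abusive", "Not Abusive", "Abusive", "Not Abusive"]

print(cohen_kappa_score(annotator_a, annotator_b))
```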
Croissant is an open, standardized metadata format designed to describe machine learning (ML) datasets. Its primary goal is to make datasets easily discoverable, interoperable, and usable across various ML tools, frameworks, and repositories without changing the underlying data files themselves.
The Croissant metadata for the TiALD dataset can be found at TiALD.Croissant.json.
The dataset is designed solely to support:
- Research in abusive language detection in low-resource languages
- Context-aware abusiveness, sentiment, and topic modeling
- Multi-task and transfer learning with digraphic scripts
- Evaluation of multilingual and fine-tuned language models
Researchers and developers should avoid using this dataset for direct moderation or enforcement tasks without human oversight.
- Sensitive content: Contains toxic and offensive language. Use for research purposes only.
- Cultural sensitivity: Abuse is context-dependent; annotations were made by native speakers to account for cultural nuance.
- Bias mitigation: Data sampling and annotation were carefully designed to minimize reinforcement of stereotypes.
- Privacy: All the source content for the dataset is publicly available on YouTube.
- Respect for expression: The dataset should not be used for automated censorship without human review.
This research received IRB approval (Ref: KH2022-133) from Korea Advanced Institute of Science and Technology (KAIST) and followed all ethical data collection and annotation practices, including informed consent of annotators.
Before computing metrics, you need to save model predictions for one or more of the three TiALD tasks into a JSON file.
For consistency, we recommend saving the predictions into a file with the following format:
{
"config": {
"model_name": "<unique model name>",
"test_date": "<yyyymmdd>",
"<custom-field>": "<e.g., model type, hyperparams>"
},
"abusiveness_predictions": {
"<cid>": "<Abusive | Not Abusive>"
},
"topic_predictions": {
"<cid>": "<Political | Religious | Sexist | Racial | Other>"
},
"sentiment_predictions": {
"<cid>": "<Positive | Negative | Neutral | Mixed>"
}
}

Given an existing predictions file for the samples in the TiALD test set, the compute_tiald_metrics.py script can be used to compute all metrics discussed in the paper (task-level and per-class).
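A minimal sketch of producing such a predictions file is shown below. It assumes the `<cid>` keys correspond to the `comment_id` field of the test split, and `predict_abusiveness`, `predict_sentiment`, and `predict_topic` are placeholders for your model's inference functions.

```python
import json
from datasets import load_dataset

test_set = load_dataset("fgaim/tigrinya-abusive-language-detection", split="test")

predictions = {
    "config": {"model_name": "my-model", "test_date": "20250101"},
    "abusiveness_predictions": {},
    "sentiment_predictions": {},
    "topic_predictions": {},
}

for sample in test_set:
    cid = sample["comment_id"]
    text = sample["comment_clean"]
    # Replace the placeholders below with your model's actual inference calls.
    predictions["abusiveness_predictions"][cid] = predict_abusiveness(text)
    predictions["sentiment_predictions"][cid] = predict_sentiment(text)
    predictions["topic_predictions"][cid] = predict_topic(text)

with open("my-model-predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)
```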
Install dependencies:
pip install scikit-learn datasets

Then run the script as follows:
python compute_tiald_metrics.py \
--prediction_file <path-to-model-predictions.json> \
[--output_file <output-file-to-save-results.json>]
[--append_metrics <append metrics to the prediction file>]

The script automatically loads the TiALD dataset and computes the following metrics:
- Accuracy for each task
- Macro F1 scores for each task
- Per-class precision, recall, and F1 scores
The summary of results is logged to the terminal and can optionally be saved to a detailed JSON file using the --output_file flag.
The aggregate TiALD Score reported in the paper is the arithmetic mean of the task-level macro F1 scores.
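As a simplified illustration of that aggregation (the reference implementation is `compute_tiald_metrics.py`), the per-task macro F1 and the TiALD Score could be computed as follows, given parallel lists of gold and predicted labels for each task:

```python
from sklearn.metrics import f1_score

def tiald_score(gold_by_task, pred_by_task):
    """Macro F1 per task and their arithmetic mean (the TiALD Score)."""
    macro_f1 = {
        task: f1_score(gold_by_task[task], pred_by_task[task], average="macro")
        for task in ("abusiveness", "sentiment", "topic")
    }
    return macro_f1, sum(macro_f1.values()) / len(macro_f1)
```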
If you use TiALD in your work, please cite:
@misc{gaim-etal-2025-tiald-benchmark,
title = {A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings},
author = {Fitsum Gaim and Hoyun Song and Huije Lee and Changgeon Ko and Eui Jun Hwang and Jong C. Park},
year = {2025},
eprint = {2505.12116},
archiveprefix = {arXiv},
primaryclass = {cs.CL},
url = {https://arxiv.org/abs/2505.12116}
}

This dataset is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

