TiQuAD: Tigrinya Question Answering Dataset


This repository accompanies our ACL 2023 paper "Question-Answering in a Low-resourced Language: Benchmark Dataset and Models for Tigrinya", which was selected for the Outstanding Paper Award.

Overview

Question-Answering (QA) has seen significant advances recently, achieving near human-level performance on some benchmarks. However, these advances have focused on high-resource languages such as English, while the task remains unexplored for most other languages, mainly due to the lack of annotated datasets. This work presents TiQuAD, the first human-annotated QA dataset for Tigrinya, an East African language. The dataset contains 10.6K question-answer pairs (6.5K unique questions) spanning 572 paragraphs extracted from 290 news articles on various topics. The paper also presents the dataset construction method, which is applicable to building similar resources for related languages.

In addition to the gold-standard TiQuAD, we develop Tigrinya-SQuAD, a silver-standard dataset created by machine-translating and filtering the English SQuAD v1.1 dataset, which serves as an additional training resource.

We present comprehensive experiments and analyses of several resource-efficient approaches to QA, including monolingual, cross-lingual, and multilingual setups, along with comparisons against machine-translated silver data. Our strong baseline models reach 81% F1, while the estimated human performance is 92%, indicating that the benchmark presents a good challenge for future work.

Datasets

1. TiQuAD v1

A human-annotated question-answering dataset with <Paragraph, Question, Answer> entries.

📥 Download via HuggingFace Hub

Split    Articles   Paragraphs   Questions   Answers
Train         205          408       4,452     4,454
Dev            43           76         934     2,805
Test*          42           96       1,122     3,378
Total         290          572       6,508    10,637

Data Statistics of TiQuAD: The number of Articles, Paragraphs, Questions, and Answers. The dataset is partitioned by articles.

Note: The test set is not publicly available, to maintain evaluation integrity. See the TiQuAD Test Set Access section below.

TiQuAD Dataset Construction Pipeline

TiQuAD Dataset Construction Pipeline. The five-stage process includes article collection, context selection, question-answer pair annotation, additional answers annotation for evaluation sets, and quality-focused post-processing.

2. Tigrinya-SQuAD v1 (Extra Training Data)

The training split of the English SQuAD v1.1 dataset, machine-translated into Tigrinya and filtered.

📥 Download via HuggingFace Hub

Split    Articles   Paragraphs   Questions   Answers
Train         442       17,391      46,737    46,737

Data Statistics of Tigrinya-SQuAD: The number of Articles, Paragraphs, Questions, and Answers in the Tigrinya translation of SQuAD v1.1 training set.

Tigrinya-SQuAD Dataset Construction Pipeline

Loading TiQuAD and Tigrinya-SQuAD Datasets

Install the latest datasets library by running pip install -U datasets in a terminal; older versions may not load the data properly.

Then pull and load the datasets from Python as follows:

TiQuAD:

from datasets import load_dataset

# Load TiQuAD
tiquad = load_dataset("fgaim/tiquad")
print(tiquad)

Output:

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'context', 'answers', 'article_title', 'context_id'],
        num_rows: 4452
    })
    validation: Dataset({
        features: ['id', 'question', 'context', 'answers', 'article_title', 'context_id'],
        num_rows: 934
    })
})

Tigrinya-SQuAD:

from datasets import load_dataset

# Load Tigrinya-SQuAD
tigrinya_squad = load_dataset("fgaim/tigrinya-squad")
print(tigrinya_squad)

Output:

DatasetDict({
    train: Dataset({
        features: ['id', 'question', 'context', 'answers', 'article_title', 'context_id'],
        num_rows: 46737
    })
})

A sample entry from the TiQuAD validation set:

{
    "id": "5dda7d3e-f76f-4500-a3af-07648a1afa51",
    "question": "แŠฃแˆแ‰ณแˆ•แˆชแˆญ แŠ“แ‹ญ แ‰…แ‹ตแˆ แˆตแˆ› แŠฅแŠ•แ‰ณแ‹ญ แŠ”แˆฉ?",
    "context": "แˆƒแ‰ฅแ‰ถแˆ แŠญแ‰ฅแˆจแŠฃแ‰ฅ (แˆžแŒ€)\nแˆžแŒ€ แŠฃแ‰ฅ 80โ€™แ‰ณแ‰ตแŠ• แŠฃแ‰ฅ แˆแˆˆแˆ› 90โ€™แ‰ณแ‰ตแŠ• แŠซแ‰ฅแ‰ถแˆ แŠ“แ‹ญ แŠญแˆˆแ‰ฅ แŠฃแˆแ‰ณแˆ•แˆชแˆญ แŠ•แ‰แ‹“แ‰ต แ‰ฐแŠธแˆ‹แŠธแˆแ‰ฒ แАแ‹ญแˆฉแฃ แ‰ฅแ‹ตแˆ•แˆชโ€™แ‹šโ€™แ‹แŠ• แŠฃแ‰ฅ แ‹แ‹ตแ‹ตแˆซแ‰ต แˆ“แ‹ญแˆแ‰ณแ‰ต แˆแŠญแˆแŠปแˆ แŠ•แŠ•แ‹แˆ• แ‹แ‰ แˆˆ แ‹“แˆ˜แ‰ณแ‰ต แŠจแˆ แŠฃแˆฐแˆแŒฃแŠ’ แŠญแˆˆแ‰ฅ แ‰ แŠ’แˆแˆญ แŠฎแ‹ญแŠ‘ แ‹แАแŒฅแ แ‹˜แˆŽ แŒˆแ‹ฒแˆ แ‰ฐแŒปแ‹‹แ‰ณแ‹ญแŠ• แŠฃแˆฐแˆแŒฃแŠ•แŠ•โ€™แ‹ฉแข แˆแˆ‰แŠฅ แˆตแˆ™ แˆƒแ‰ฅแ‰ถแˆ แŠญแ‰ฅแˆญแŠฃแ‰ฅ (แˆžแŒ€) แŠฅแ‹ฉแข แˆžแŒ€ แ‰ฅ1968 แŠฃแ‰ฅ แŠฃแˆตแˆ˜แˆซ แ‰ฐแ‹ˆแˆŠแ‹ฑ แ‹“แ‰ฅแ‹ฉแข แŠ•แˆฑ แŠซแ‰ฅ แŠ•แŠกแˆต แ‹•แ‹ตแˆšแŠก แ‰ฅแŠฉแ‹•แˆถ แŒจแˆญแ‰‚ แŒธแ‹ˆแ‰ณ แŒ€แˆšแˆฉแข แ‰ฅแ‹ตแˆ•แˆชแŠก แ‰ฅแ‹ฐแˆจแŒƒ แˆแˆแˆ•แ‹ณแˆญ แŠฃแ‰ฅ แ‹แŠซแ‹จแ‹ต แ‹แАแ‰ แˆจ แŠ“แ‹ญ โ€˜แ‰€แ‰ แˆŒโ€™ แŒธแ‹ˆแ‰ณแ‰ณแ‰ต แˆแˆต แŒธแˆ“แ‹ญ แ‰ แˆญแ‰‚ แˆแˆต แŠฅแ‰ตแ‰ แˆƒแˆ แŒ‹แŠ•แ‰ณ แ‰ฐแŒปแ‹Šแ‰ฑแข แŠฃแ‰ฅ 1987 แˆแˆต แ‹ณแˆ…แˆ‹แŠญ แŠฅแ‰ตแ‰ แˆƒแˆ แ‹แАแ‰ แˆจแ‰ต แŒ‹แŠ•แ‰ณ แŠ•แˆ“แ‹ฐ แ‹“แˆ˜แ‰ต แ‹ตแˆ•แˆช แˆแŒฝแ‹‹แ‰ฑ แŠจแŠฃ แŠฃแ‰ฅ แˆ˜แ‹ˆแ‹ณแŠฅแ‰ณ แ‹ˆแˆญแˆ’ 1987 แŠ“แ‰ฅ แŒ‹แŠ•แ‰ณ แ–แˆŠแˆต (แŠ“แ‹ญ แˆŽแˆš แŠฃแˆแ‰ณแˆ•แˆชแˆญ) แ‰ฅแˆแŒฝแŠ•แ‰ฃแˆญ แŠญแˆณแ‰ฅ 1988 แ‰ฐแŒปแ‹Šแ‰ฑแข แˆแˆตแ‰ณ แŠ“แ‹ญ แ‰…แ‹ตแˆš แŠ“แŒฝแАแ‰ต แŒ‹แŠ•แ‰ณ แ–แˆŠแˆต แŠฃแ‰ฅ แ‹แ‰ฐแŒปแ‹ˆแ‰ฐแˆ‰ แˆฐแˆˆแˆตแ‰ฐ แ‹“แˆ˜แ‰ณแ‰ต แŠจแŠฃ แ‹แ‰ฐแˆแˆ‹แˆˆแ‹จ แ‹“แ‹ˆแ‰ณแ‰ต แ‰ฐแŒแŠ“แŒบแ‰ แ‹‹แŠ“แŒฉ แŠจแˆแ‹•แˆ แ‰ แ‰’แ‹‘โ€™แ‹ฉแข แ‹ตแˆ•แˆช แŠ“แŒฝแАแ‰ต แˆตแˆ แŠญแˆˆแ‰ก แŠฃแˆแ‰ณแˆ•แˆชแˆญ แˆแˆต แ‰ฐแ‰แ‹จแˆจแฃ แˆžแŒ€ แŠ“แ‹ญแ‰ณ แŠญแˆˆแ‰ฅ แ‰ฐแŒปแ‹‹แ‰ณแ‹ญ แŠฎแ‹ญแŠ‘ แ‹แ‹ตแ‹ตแˆฉ แ‰€แŒบแˆ‰แข แŠฃแ‰ฅ แˆ˜แŒ€แˆ˜แˆญแ‰ณ แŠ“แŒฝแАแ‰ต (1991) แŠฃแ‰ฅ แ‹แ‰ฐแŠปแ‹จแ‹ฐ แŠ“แ‹ญ แ‹แˆแˆ›แ‹ญ แ‹‹แŠ•แŒซ แˆตแ‹แŠฃแ‰ต แˆ˜แŠ• แ‹“แ‰ฐแˆจ แ‹แ‹ตแ‹ตแˆญ แˆžแŒ€ แˆแˆต แŠญแˆˆแ‰ก แŠฃแˆแ‰ณแˆ•แˆชแˆญ แ‹‹แŠ•แŒซ แŠจแˆแ‹•แˆ แ‰ แ‰’แ‹‘แข แ‰ฅแ‹˜แ‹ญแŠซโ€™แ‹š แŠฃแ‰ฅ 1992 แ‰ฅแ‰ฅแˆ‰แŒปแ‰ต แ‰ฐแŒปแ‹ˆแ‰ตแ‰ฒ แ‰ฐแ‰ฐแŠฝแ‰ฒแŠป แ‹แАแ‰ แˆจแ‰ต แŠฃแˆแ‰ณแˆ•แˆชแˆญ แŠ“แ‹ญ แ‹แˆแˆ›แ‹ญ แ‹‹แŠ•แŒซ แŠ“แŒฝแАแ‰ต แŠจแˆแŠกโ€™แ‹แŠ• แˆปแˆแ•แ‹ฎแŠ• แŠญแ‰ตแŠจแ‹แŠ• แŠจแˆ‹ แˆžแŒ€ แŠฃแ‰ฃแˆโ€™แ‰ณ แŒ‹แŠ•แ‰ณ แАแ‹ญแˆฉแข แˆแˆต แŠญแˆˆแ‰ฅ แŠฃแˆแ‰ณแˆ•แˆชแˆญ แแ‰•แˆญแŠ• แˆ•แ‹แАแ‰ตแŠ• แ‹แˆ˜แˆแŠฆ แˆแ‰แˆญ แŠ“แ‹ญ แŒธแ‹ˆแ‰ณ แ‹˜แˆ˜แŠ• แŠจแˆ แ‹˜แˆ•แˆˆแˆ แ‹แŒ แ‰…แˆต แˆžแŒ€แฃ แˆแˆต แŠฃแˆแ‰ณแˆ•แˆชแˆญ แŠ“แ‰ฅ แŠจแˆ แˆฑแ‹ณแŠ•แŠ• แŠขแ‰ตแ‹ฎแŒตแ‹ซแŠ• แ‹แŠฃแˆ˜แˆฐแˆ‹ แˆƒแŒˆแˆซแ‰ต แ‰ฅแˆแŒ‹แˆฝ แŠฃแˆ…แŒ‰แˆซแ‹Š แŒธแ‹ˆแ‰ณแ‰ณแ‰ตโ€™แ‹แŠ• แŠฃแŠซแ‹ญแ‹ฑโ€™แ‹ฉแข",
    "answers": [
        {"answer_start": 414, "text": "แ–แˆŠแˆต"},
        {"answer_start": 414, "text": "แ–แˆŠแˆต"},
        {"answer_start": 410, "text": "แŒ‹แŠ•แ‰ณ แ–แˆŠแˆต"},
    ],
    "article_title": "แˆƒแ‰ฅแ‰ถแˆ แŠญแ‰ฅแˆจแŠฃแ‰ฅ (แˆžแŒ€)",
    "context_id": "17.1",
}

Note: Samples in the validation and test sets of TiQuAD have up to three answers labeled by different annotators.
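
For multi-reference evaluation it can be handy to collect each entry's unique reference texts. Below is a minimal sketch, assuming the entry layout printed above (a list of {"answer_start", "text"} dicts); the Hub version may instead expose answers in the columnar SQuAD-style format ({"text": [...], "answer_start": [...]}), which the sketch also handles.

# Minimal sketch: collect the unique reference answer texts of one entry.
def reference_texts(entry):
    answers = entry["answers"]
    if isinstance(answers, dict):  # columnar layout: {"text": [...], "answer_start": [...]}
        texts = answers["text"]
    else:                          # list of {"answer_start": ..., "text": ...} dicts
        texts = [a["text"] for a in answers]
    return list(dict.fromkeys(texts))  # de-duplicate while preserving order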

TiQuAD Test Set Access

To maintain evaluation integrity and avoid data contamination, the TiQuAD test set is not publicly available.

Researchers who wish to access the test set for evaluation purposes should email the first author of the paper with the following details:

  • Subject: TiQuAD Test Set Request
  • Your full name and affiliation
  • Research purpose and usage plan
  • Acknowledgment that the dataset will be used for evaluation only

We review requests to ensure legitimate research use while maintaining benchmark integrity.

Experimental Results

Pre-trained Language Models

Model               Layers   Attn. heads   Params   Pretrain langs.   Tigrinya in pretraining
tielectra-small         12             4      14M                 1                       yes
tiroberta-base          12            12     125M                 1                       yes
afriberta_base           8             6     112M                11                       yes
xlm-roberta-base        12            12     278M               100                        no
xlm-roberta-large       24            16     560M               100                        no
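
Any of these encoders can be loaded for extractive QA with the transformers library. A minimal sketch, using the public xlm-roberta-base checkpoint (the Hub IDs of the Tigrinya-specific models are not listed here and would need to be substituted):

from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# xlm-roberta-base is a public Hub checkpoint; swap in another baseline model as needed.
model_id = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)  # adds an untrained QA span head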

Training Datasets

  • MT: Tigrinya-SQuAD (machine-translated SQuAD v1.1 train set) – Tigrinya
  • Native: TiQuAD train set – Tigrinya
  • SQuAD: SQuAD v1.1 train set – English
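
For example, the "MT+Native" rows below combine the two Tigrinya training sets. A minimal sketch of building that mixture, assuming both splits share the feature schema shown earlier:

from datasets import load_dataset, concatenate_datasets

native = load_dataset("fgaim/tiquad", split="train")        # Native: TiQuAD train set
mt = load_dataset("fgaim/tigrinya-squad", split="train")     # MT: Tigrinya-SQuAD train set

# Concatenate and shuffle; both sets expose the same columns, so no remapping is needed.
mixed = concatenate_datasets([native, mt]).shuffle(seed=42)
print(mixed)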

Results of Models and Dataset Mixtures

 #   Dataset           Model               Epochs   Batch   Dev EM   Dev F1   Test EM   Test F1
 1   MT                tielectra-small          3      16    38.54    46.04     39.25     48.36
 2   MT                tiroberta-base           3      16    48.50    56.39     48.17     58.81
 3   MT                afriberta_base           3      16    40.36    48.72     40.68     52.96
 4   MT                xlm-roberta-base         3      16    51.71    59.64     53.17     62.61
 5   MT                xlm-roberta-large        3      16    59.85    67.06     61.55     70.85
 6   Native            tielectra-small          5       8    36.19    43.06     28.81     37.00
 7   Native            tiroberta-base           5       8    56.21    64.36     53.08     61.82
 8   Native            afriberta_base           5       8    38.01    44.85     35.06     44.24
 9   Native            xlm-roberta-base         5       8    56.53    65.37     55.75     65.49
10   Native            xlm-roberta-large        5       8    63.17    71.32     64.94     72.62
11   MT+Native         tielectra-small          3      16    46.36    53.60     47.46     56.64
12   MT+Native         tiroberta-base           3      16    62.42    70.12     62.18     70.42
13   MT+Native         afriberta_base           3      16    52.68    59.38     47.37     58.35
14   MT+Native         xlm-roberta-base         3      16    61.99    70.44     64.76     73.53
15   MT+Native         xlm-roberta-large        3      16    70.88    77.96     74.67     82.31
16   SQuAD             tielectra-small          3      16     9.85    20.91      9.81     20.41
17   SQuAD             tiroberta-base           3      16    10.71    20.88     10.88     20.69
18   SQuAD             afriberta_base           3      16    20.24    32.05     20.52     32.95
19   SQuAD             xlm-roberta-base         3      16    17.99    27.81     22.66     34.44
20   SQuAD             xlm-roberta-large        3      16    29.12    40.26     34.70     43.96
21   SQuAD+MT          tielectra-small          3      16    37.69    46.06     39.07     49.07
22   SQuAD+MT          tiroberta-base           3      16    51.28    59.25     51.12     60.75
23   SQuAD+MT          afriberta_base           3      16    44.33    51.43     45.58     56.36
24   SQuAD+MT          xlm-roberta-base         3      16    52.89    61.06     57.36     66.37
25   SQuAD+MT          xlm-roberta-large        3      16    61.03    67.75     61.91     71.05
26   SQuAD+Native      tielectra-small          3      16    33.73    41.51     32.74     40.53
27   SQuAD+Native      tiroberta-base           3      16    57.07    65.75     59.05     67.30
28   SQuAD+Native      afriberta_base           3      16    51.93    59.66     51.38     62.13
29   SQuAD+Native      xlm-roberta-base         3      16    62.42    69.95     63.07     71.76
30   SQuAD+Native      xlm-roberta-large        3      16    67.24    76.19     71.54     78.39
31   SQuAD+MT+Native   tielectra-small          3      16    45.72    53.40     47.73     57.10
32   SQuAD+MT+Native   tiroberta-base           3      16    65.20    71.88     62.53     71.08
33   SQuAD+MT+Native   afriberta_base           3      16    51.93    59.47     53.26     63.22
34   SQuAD+MT+Native   xlm-roberta-base         3      16    64.78    72.80     68.06     76.58
35   SQuAD+MT+Native   xlm-roberta-large        3      16    72.59    79.66     74.13     81.39

EM and F1 scores are reported on the TiQuAD Dev and Test sets.

The xlm-roberta-large experiments were added after the paper was published. The model outperforms the others mainly due to its larger parameter count, demonstrating the transfer capability of fine-tuned multilingual models even with minimal or no exposure to the target language during pre-training.

TiQuAD Evaluation

We provide the official evaluation script evaluate-tiquad.py for computing TiQuAD benchmark scores. The script supports evaluation against both the HuggingFace dataset and local JSON files. Install dependencies by running pip install -U datasets numpy.

The script reports the following metrics (the core metric logic is sketched after the list):

  • Exact Match (EM): Percentage of predictions that match ground truth exactly
  • Token-level F1: F1 score computed over tokens
  • Multi-reference handling: Max score across multiple reference answers
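
Below is a minimal sketch of this SQuAD-style metric logic; it is not the official evaluate-tiquad.py (which typically also normalizes text before comparison), only an illustration of whitespace-token F1, exact match, and taking the maximum over multiple references.

from collections import Counter

def token_f1(prediction, reference):
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def score(prediction, references):
    # Take the best score over all available reference answers.
    em = max(float(prediction == ref) for ref in references)
    f1 = max(token_f1(prediction, ref) for ref in references)
    return em, f1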

Predictions File Format

Your predictions file should be a JSON file with question IDs as keys and predicted answer texts as values:

{
  "5dda7d3e-...": "แŒ‹แŠ•แ‰ณ แ–แˆŠแˆต",
  ...
}
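
One way to produce such a file is with a transformers question-answering pipeline over the validation split. A minimal sketch, where "path/to/your-finetuned-model" is a placeholder for a checkpoint fine-tuned on one of the training mixtures above:

import json
from datasets import load_dataset
from transformers import pipeline

qa = pipeline("question-answering", model="path/to/your-finetuned-model")  # placeholder checkpoint
dev = load_dataset("fgaim/tiquad", split="validation")

# Map each question ID to the predicted answer text.
predictions = {
    example["id"]: qa(question=example["question"], context=example["context"])["answer"]
    for example in dev
}

with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)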

Usage Examples

# Evaluate against HuggingFace dataset (specific split)
python evaluate-tiquad.py predictions.json --use-hf-dataset --split validation

# Evaluate against a local JSON file (TiQuAD/SQuAD format)
python evaluate-tiquad.py predictions.json --eval-set-path eval-set-v1.json

Add the --verbose option to print more details.

Sample Output:

Loading predictions from: predictions.json
Loading validation set from HF dataset...
Computing evaluation scores...

===================================
TiQuAD EVALUATION RESULTS
===================================
Exact Match (EM): 0.6542 (65.42%)
F1 Score:         0.7321 (73.21%)
Questions evaluated: 934
===================================

Citation

This work can be cited as follows:

@inproceedings{gaim-etal-2023-tiquad,
    title = "Question-Answering in a Low-resourced Language: Benchmark Dataset and Models for {T}igrinya",
    author = "Fitsum Gaim and Wonsuk Yang and Hancheol Park and Jong C. Park",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.661",
    pages = "11857--11870",
}

Acknowledgments

  • Native Tigrinya speakers who contributed to the annotation process of TiQuAD
  • The Hadas Ertra newspaper and the Eritrean Ministry of Information (shabait.com) for the source articles
  • The SQuAD team for the foundational work used as the source for Tigrinya-SQuAD

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

