PuoBERTa: A Curated Setswana Language Model


A RoBERTa-based language model specially designed for Setswana, trained on the PuoData dataset for accurate and culturally relevant NLP applications.

Try it now: Interactive Demo | Model on HuggingFace | Paper

Give Feedback 📑: DSFSI Resource Feedback Form


Table of Contents

  • Quick Start
  • Model Details
  • Installation
  • Usage Examples
  • Downstream Models
  • Downstream Performance
  • Pre-Training Dataset
  • Citation Information
  • Contributing
  • Model Card Authors
  • Model Card Contact


Quick Start

Try Online (No Installation Required)

Visit our Interactive Demo to try all PuoBERTa models in your browser:

  • Fill-Mask: Predict masked words in Setswana text
  • News Classification: Categorize Setswana news articles
  • Named Entity Recognition (NER): Extract entities from text
  • Part-of-Speech (POS) Tagging: Identify grammatical roles of words

Quick Start with Code

Get started with PuoBERTa in just a few lines of code:

from transformers import pipeline

# Use the fill-mask pipeline
fill_mask = pipeline('fill-mask', model='dsfsi/PuoBERTa')
result = fill_mask("Setswana ke puo ya <mask>.")
print(result)

For more detailed examples, check out the examples directory with ready-to-run scripts for various use cases.


Model Details

Model Description

This is a masked language model trained on Setswana corpora, making it a valuable tool for a range of downstream applications from translation to content creation. It's powered by the PuoData dataset to ensure accuracy and cultural relevance.

  • Developed by: Vukosi Marivate (@vukosi), Moseli Mots'Oehli (@MoseliMotsoehli), Valencia Wagner, Richard Lastrucci, and Isheanesu Dzingirai
  • Model type: RoBERTa Model
  • Language(s) (NLP): Setswana (BCP-47: tn)
  • License: CC BY 4.0
  • Training Dataset: PuoData
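
As a quick orientation, here is a minimal sketch (assuming the standard transformers Auto classes resolve this checkpoint to its RoBERTa implementation) that loads the model and prints a few configuration details:

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the checkpoint and its tokenizer
tokenizer = AutoTokenizer.from_pretrained('dsfsi/PuoBERTa')
model = AutoModelForMaskedLM.from_pretrained('dsfsi/PuoBERTa')

# Inspect basic configuration details
print(f"Vocabulary size: {model.config.vocab_size}")
print(f"Hidden size: {model.config.hidden_size}")
print(f"Number of layers: {model.config.num_hidden_layers}")
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")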

Installation

Install the required dependencies:

pip install transformers torch

For fine-tuning and advanced usage:

pip install transformers torch datasets accelerate
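
A quick, optional sanity check that the dependencies are installed and importable:

import torch
import transformers

# Report installed versions and whether a GPU is visible
print(f"transformers version: {transformers.__version__}")
print(f"torch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")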

Usage Examples

1. Masked Language Modeling (Fill-Mask)

Use PuoBERTa to predict masked words in Setswana text:

from transformers import pipeline

# Create a fill-mask pipeline
fill_mask = pipeline('fill-mask', model='dsfsi/PuoBERTa')

# Predict masked tokens
text = "Setswana ke puo ya <mask>."
results = fill_mask(text)

for result in results:
    print(f"Token: {result['token_str']}, Score: {result['score']:.4f}")

2. Getting Text Embeddings

Extract contextual embeddings for Setswana text:

from transformers import RobertaTokenizer, RobertaModel
import torch

# Load model and tokenizer
tokenizer = RobertaTokenizer.from_pretrained('dsfsi/PuoBERTa')
model = RobertaModel.from_pretrained('dsfsi/PuoBERTa')

# Encode text
text = "Dumela! Ke a leboga."
inputs = tokenizer(text, return_tensors="pt")

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state

print(f"Embeddings shape: {embeddings.shape}")

3. Fine-tuning for Text Classification

Fine-tune PuoBERTa for downstream tasks like news categorization:

from transformers import RobertaForSequenceClassification, Trainer, TrainingArguments
from transformers import RobertaTokenizer

# Load model for classification (e.g., 10 news categories)
model = RobertaForSequenceClassification.from_pretrained(
    'dsfsi/PuoBERTa',
    num_labels=10
)
tokenizer = RobertaTokenizer.from_pretrained('dsfsi/PuoBERTa')

# Prepare your dataset
# train_dataset, eval_dataset = ...

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

# Create trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Train the model
trainer.train()
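
The script above leaves dataset preparation as a placeholder. One possible way to fill it in, assuming you have your own labelled CSV files with text and label columns (the file names and column names below are placeholders, not files shipped with this repository), is to use the datasets library:

from datasets import load_dataset

# Placeholder CSV files with a 'text' column and an integer 'label' column
# (class ids 0-9 for the 10 categories in the example above)
raw = load_dataset('csv', data_files={'train': 'train.csv', 'eval': 'eval.csv'})

def tokenize(batch):
    # Truncate/pad to a fixed length so batches can be collated
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=256)

tokenized = raw.map(tokenize, batched=True)
train_dataset = tokenized['train']
eval_dataset = tokenized['eval']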

4. Using Pre-trained Downstream Models

We provide ready-to-use models for specific tasks:

from transformers import pipeline

# News categorization (10 categories)
news_classifier = pipeline('text-classification', model='dsfsi/PuoBERTa-News')
result = news_classifier("Palamente e ne e kopana gompieno go tlotla melao e mesha.")
print(f"Category: {result[0]['label']}, Score: {result[0]['score']:.4f}")

# Named Entity Recognition (PER, LOC, ORG, DATE)
ner = pipeline('ner', model='dsfsi/PuoBERTa-NER', aggregation_strategy="simple")
entities = ner("Vukosi Marivate o tswa kwa University of Pretoria.")
for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.4f})")

# Part-of-Speech Tagging
pos_tagger = pipeline('token-classification', model='dsfsi/PuoBERTa-POS', aggregation_strategy="simple")
pos_tags = pos_tagger("Ke rata go bala dibuka.")
for tag in pos_tags:
    print(f"{tag['word']}: {tag['entity_group']}")

Downstream Models

We provide fine-tuned models for the downstream tasks covered in this repository:

  • dsfsi/PuoBERTa-News: Setswana news categorization (Daily News Dikgang)
  • dsfsi/PuoBERTa-NER: named entity recognition (MasakhaNER)
  • dsfsi/PuoBERTa-POS: part-of-speech tagging (MasakhaPOS)

Downstream Performance

PuoBERTa has been evaluated on multiple downstream tasks and shows competitive performance against multilingual models while being specifically optimized for Setswana.

Daily News Dikgang (News Categorization)

Performance on the Setswana news categorization task using the Daily News Dikgang dataset. Learn more about the dataset in the Dataset Folder.

Model                       | 5-fold Cross Validation F1 | Test F1
Logistic Regression + TFIDF | 60.1                       | 56.2
NCHLT TSN RoBERTa           | 64.7                       | 60.3
PuoBERTa                    | 63.8                       | 62.9
PuoBERTa+JW300              | 66.2                       | 65.4

Pre-trained model: dsfsi/PuoBERTa-News
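
To sanity-check the released news classifier on your own labelled examples, a minimal sketch using scikit-learn (an extra dependency; the example text and gold label below are placeholders for illustration) might look like this:

from transformers import pipeline
from sklearn.metrics import f1_score

classifier = pipeline('text-classification', model='dsfsi/PuoBERTa-News')

# The model's own category names (your gold labels must use these)
print(classifier.model.config.id2label)

# Placeholder evaluation data: substitute your own texts and gold labels
texts = ["Palamente e ne e kopana gompieno go tlotla melao e mesha."]
gold = ["politics"]  # hypothetical label name for illustration

predictions = [classifier(t)[0]['label'] for t in texts]
print(f"Macro F1: {f1_score(gold, predictions, average='macro'):.3f}")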

MasakhaPOS (Part-of-Speech Tagging)

Performance on the MasakhaPOS downstream task for Setswana.

Model             | Test Performance
Multilingual Models
AfroLM            | 83.8
AfriBERTa         | 82.5
AfroXLMR-base     | 82.7
AfroXLMR-large    | 83.0
Monolingual Models
NCHLT TSN RoBERTa | 82.3
PuoBERTa          | 83.4
PuoBERTa+JW300    | 84.1

Pre-trained model: dsfsi/PuoBERTa-POS

MasakhaNER (Named Entity Recognition)

Performance on the MasakhaNER downstream task for Setswana.

Model             | Test Performance (F1 score)
Multilingual Models
AfriBERTa         | 83.2
AfroXLMR-base     | 87.7
AfroXLMR-large    | 89.4
Monolingual Models
NCHLT TSN RoBERTa | 74.2
PuoBERTa          | 78.2
PuoBERTa+JW300    | 80.2

Pre-trained model: dsfsi/PuoBERTa-NER


Pre-Training Dataset

PuoBERTa was trained on PuoData, a large, curated corpus of Setswana text designed to ensure the model is well-trained and culturally attuned to the language.

Access the dataset:

The dataset includes diverse sources of Setswana text to provide comprehensive language coverage for robust model training.
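
If PuoData is published on the Hugging Face Hub, it can presumably be loaded with the datasets library. The dataset ID below is an assumption based on the naming used elsewhere in this repository and should be checked against the dataset links above:

from datasets import load_dataset

# Assumed Hub ID -- verify against the dataset links above
puodata = load_dataset('dsfsi/PuoData')

print(puodata)              # available splits and their sizes
print(puodata['train'][0])  # inspect a sample record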


Citation Information

Bibtex Reference

@inproceedings{marivate2023puoberta,
  title        = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
  author       = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
  year         = {2023},
  booktitle    = {Artificial Intelligence Research. SACAIR 2023. Communications in Computer and Information Science},
  url          = {https://link.springer.com/chapter/10.1007/978-3-031-49002-6_17},
  keywords     = {NLP},
  preprint_url = {https://arxiv.org/abs/2310.09141},
  dataset_url  = {https://github.com/dsfsi/PuoBERTa},
  software_url = {https://huggingface.co/dsfsi/PuoBERTa}
}

Contributing

We welcome contributions from the community! Whether you want to:

  • Add new examples or improve documentation
  • Report bugs or suggest features
  • Share your fine-tuned models
  • Contribute datasets or use cases

Please see our Contributing Guidelines for detailed information on how to get started.


Model Card Authors

Vukosi Marivate

Model Card Contact

For more details, reach out or check our website.

Email: vukosi.marivate@cs.up.ac.za

Enjoy exploring Setswana through AI!
