HuGME is an advanced evaluation framework designed to assess Large Language Models (LLMs) with a focus on Hungarian language proficiency and cultural understanding. Built on DeepEval, it provides a structured assessment of model performance across multiple dimensions.
Installation via PyPI:

```shell
pip install hugme
```

To install the library for testing and development, use the following commands:

```shell
git clone https://github.com/nytud/hugme
pip install .
```

You can execute HuGME with:
```shell
hugme --model-name /path/to/your/model --tasks bias --parameters config.json
```

| Parameter | Description |
|---|---|
| `--model-name` | Name of the model (a local Hugging Face model or an OpenAI model). |
| `--tasks` | Tasks to evaluate (`bias`, `toxicity`, `faithfulness`, `summarization`, `answer-relevancy`, `mmlu`, `spelling`, `truthfulqa`, `prompt-alignment`, `readability`, `needle-in-haystack`). |
| `--judge` | Default: `gpt-3.5-turbo-1106`. Specifies the judge model for evaluations. |
| `--use-cuda` | Default: `True`. Enables GPU acceleration. |
| `--cuda-id` | Default: `1`. Specifies which GPU to use; indexing starts from 0. |
| `--seed` | Sets a random seed for reproducibility. |
| `--parameters` | Required. Path to a JSON configuration file for model parameters. See below for an example. |
| `--save-results` | Default: `True`. Whether to save evaluation results. |
| `--use-gen-results` | Path to a file of outputs already generated by the model, to be evaluated directly. |
| `--provider` | Default: `False`. Provider to use. Choices: `openai`. |
| `--thinking` | Default: `False`. Enables thinking mode. |
| `--use-alpaca-prompt` | Default: `False`. Uses the Alpaca prompt format. |
| `--sample-size` | Default: `1.0`. Fraction (percentage) of the task's dataset to sample. |
Before running HuGME, you must set the `DATASETS` environment variable so that the framework can access the datasets required for the evaluation tasks. Make sure the specified path points to the directory containing them.

```shell
export DATASETS=/path/to/datasets
```

The spelling task additionally requires the following environment variable:

```shell
export BERT_MODEL=/path/to/bert-model
```

HuGME requires model parameters to be configured via a JSON file, which is passed on to either Hugging Face's transformers library or OpenAI's library. Set the file path with the `--parameters` flag. Example:
```json
{
    "max_new_tokens": 50,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 150,
    "repetition_penalty": 0.98,
    "diversity_penalty": 0,
    "do_sample": true,
    "return_full_text": false
}
```

To authenticate with OpenAI or Hugging Face, set your API keys as environment variables:
```shell
export OPENAI_API_KEY=sk-examplekey        # judge model for DeepEval-based metrics
export HF_TOKEN=hf-exampletoken            # using Hugging Face models
export PROVIDER_API_KEY=provider-api-key   # using a custom (OpenAI-package-compatible) provider
export PROVIDER_URL=hf-provider-url        # using a custom (OpenAI-package-compatible) provider
```

Alternatively, provide them inline when running the evaluation:

```shell
OPENAI_API_KEY=sk-examplekey hugme --model-name NYTK/PULI-LlumiX-32K --tasks mmlu
```

After running metrics and/or benchmarks, all generation and evaluation outputs are saved in the `results/` directory.
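The parameters file passed via `--parameters` is plain JSON and can be loaded with a few lines of standard-library Python. The helper below is purely illustrative (it is not part of HuGME); the sanity checks are assumptions about reasonable value ranges:

```python
import json
import tempfile

# Hypothetical helper (not part of HuGME): load the generation
# parameters file and apply a few basic sanity checks.
def load_generation_params(path):
    with open(path, encoding="utf-8") as f:
        params = json.load(f)
    if "temperature" in params and not 0.0 <= params["temperature"] <= 2.0:
        raise ValueError("temperature should be between 0.0 and 2.0")
    if "top_p" in params and not 0.0 < params["top_p"] <= 1.0:
        raise ValueError("top_p should be in (0.0, 1.0]")
    return params

# Example: write a config like the one above to a temporary file and load it.
config = {"max_new_tokens": 50, "temperature": 0.7, "top_p": 0.9, "do_sample": True}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(config, f)
params = load_generation_params(f.name)
print(params["max_new_tokens"])  # 50
```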
HuGME includes multiple tasks to evaluate different aspects of LLM performance in Hungarian. The score calculations can also be found here.
Bias

Assesses language model outputs for biased content through systematic opinion analysis across gender, politics, race/ethnicity, and geographical dimensions. It employs a dataset of 100 carefully crafted queries designed to potentially elicit biased responses, with models required to prefix their outputs with opinion indicators (such as Szerintem 'I think', Úgy gondolom 'I believe', or Véleményem szerint 'In my opinion'). This prefixing requirement facilitates opinion extraction, which is crucial since unbiased responses typically lack opinionated content.
Toxicity

Evaluates language models' tendency to generate harmful or offensive content by analyzing opinions extracted from model responses to 100 specialized queries. An opinion is classified as toxic if it contains personal attacks, mockery, hate speech, dismissive statements, or threats that degrade or intimidate others, while non-toxic opinions are characterized by respectful engagement, openness to discussion, and constructive critique of ideas rather than individuals.
Answer relevancy

Evaluates the model's ability to generate contextually appropriate responses by comparing individual output statements against the input query. Using 100 diverse test queries spanning history, logic, and Hungarian idioms, the module assesses whether responses stay on topic and avoid contradictions, focusing on relevance rather than factual accuracy.
Faithfulness

Examines factual accuracy by comparing model outputs against provided context across 100 queries. Each query includes detailed context, with the evaluation focused on verifying that extracted claims align with the given factual information.
Summarization

Tests the model's ability to condense Hungarian texts while retaining key information. Using 50 texts, evaluation is based on whether four predefined yes/no questions can be answered from each generated summary, ensuring critical details remain while allowing flexibility in presentation.
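Under this scheme, a natural per-summary score is the fraction of the four predefined questions that remain answerable. The sketch below is illustrative only (it is not HuGME's actual implementation, and the judgment of answerability would come from a judge model):

```python
# Illustrative sketch: score a summary by the fraction of its
# predefined yes/no questions that can still be answered from it.
def summarization_score(answerable_flags):
    """answerable_flags: one boolean per predefined question."""
    return sum(answerable_flags) / len(answerable_flags)

# Example: a judge decided 3 of the 4 questions remain answerable.
print(summarization_score([True, True, True, False]))  # 0.75
```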
Prompt alignment

Evaluates models' ability to execute Hungarian commands accurately. It uses 100 queries, each containing specific instructions, with evaluation based on whether the model follows all instructions completely and precisely. The `max_new_tokens` parameter must be at least 256 for this task.
Spelling

Evaluates adherence to Hungarian orthography using a custom dictionary trained on index.hu texts and pyspellchecker. Flagged words from readability test outputs are verified by GPT-4 to minimize false positives, with the final score calculated as the ratio of correctly spelled words.
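The final ratio can be sketched in a few lines. This is a toy illustration only: the real metric relies on a custom index.hu-trained dictionary, pyspellchecker, and GPT-4 verification, none of which are reproduced here, and the tiny word list below is hypothetical:

```python
import re

# Toy sketch of the spelling score: the share of words found in a
# dictionary. The dictionary here is a hypothetical three-word list.
def spelling_score(text, dictionary):
    words = re.findall(r"\w+", text.lower())
    correct = sum(1 for w in words if w in dictionary)
    return correct / len(words) if words else 1.0

toy_dictionary = {"a", "kutya", "ugat"}  # "the", "dog", "barks"
print(spelling_score("A kutya ugat", toy_dictionary))   # 1.0
print(spelling_score("A kutya ugatt", toy_dictionary))  # one misspelling out of three words
```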
Readability

Evaluates how well models adapt their output complexity to match input texts. It uses 20 texts across four complexity levels (fairy tales, 6th grade, 10th grade, and academic), with readability assessed using an average of the Coleman-Liau Index and textstat's text_standard scores.
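The Coleman-Liau Index itself is a simple closed-form formula: CLI = 0.0588·L − 0.296·S − 15.8, where L is the average number of letters per 100 words and S the average number of sentences per 100 words. The sketch below uses crude word and sentence splitting and is not HuGME's implementation (which pairs this index with textstat's text_standard):

```python
import re

# Sketch of the Coleman-Liau Index with naive tokenization:
# words are \w+ runs, sentences are counted by terminal punctuation.
def coleman_liau_index(text):
    words = re.findall(r"\w+", text)
    letters = sum(len(w) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    L = letters / len(words) * 100    # letters per 100 words
    S = sentences / len(words) * 100  # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

print(round(coleman_liau_index("Ez egy rövid mondat. Ez is."), 2))
```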
TruthfulQA

Adapts the TruthfulQA dataset for Hungarian by translating questions and adding culturally specific content, resulting in 747 questions across 37 categories.
MMLU

Adapts the MMLU benchmark for Hungarian by machine-translating and manually refining multiple-choice questions across 38 subjects to ensure cultural relevance and accurate assessment.
Needle in a haystack

Tests LLM performance in extracting specific information ("needle") from large bodies of Hungarian text ("haystack"), assessing the model's ability to focus on relevant details within a complex context. A target sentence is embedded in various sections of a Hungarian novel, and the model must locate and extract it.
Providers like OpenAI are currently unsupported for this metric.
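Constructing such a test case amounts to inserting the needle sentence at a chosen position in the haystack. The sketch below is illustrative only (HuGME embeds the sentence in a Hungarian novel; the paragraphs and needle here are made up):

```python
# Illustrative sketch: build a needle-in-a-haystack input by inserting
# the needle sentence before the paragraph at the given position.
def embed_needle(haystack_paragraphs, needle, position):
    paragraphs = list(haystack_paragraphs)
    paragraphs.insert(position, needle)
    return "\n\n".join(paragraphs)

haystack = ["Első bekezdés.", "Második bekezdés.", "Harmadik bekezdés."]  # "First/Second/Third paragraph."
needle = "A titkos kód: 4242."  # "The secret code: 4242."
text = embed_needle(haystack, needle, 1)
print(needle in text)  # True
```

Varying `position` across the document probes whether retrieval quality depends on where the needle sits in the context window.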
Contributions to HuGME are welcome! If you find a bug, want to add new evaluation modules, or improve existing ones, please feel free to open an issue or submit a pull request.