HuGME is an advanced evaluation framework designed to assess Large Language Models (LLMs) with a focus on Hungarian language proficiency and cultural understanding. Built on DeepEval, it provides a structured assessment of model performance across multiple dimensions.
Installation via PyPI:

```shell
pip install hugme
```

To install the library for testing and development, use the following commands:

```shell
git clone https://github.com/nytud/hugme
pip install .
```

You can execute HuGME with:
```shell
hugme --model-name /path/to/your/model --tasks bias --parameters config.json
```

| Parameter | Description |
|---|---|
| `--model-name` | Name of the model (a local Hugging Face model or an OpenAI model). |
| `--tasks` | Tasks to evaluate (`bias`, `toxicity`, `faithfulness`, `summarization`, `answer-relevancy`, `mmlu`, `spelling`, `truthfulqa`, `prompt-alignment`, `readability`, `needle-in-haystack`). |
| `--judge` | Default: `gpt-3.5-turbo-1106`. Specifies the judge model for evaluations. |
| `--use-cuda` | Default: `True`. Enables GPU acceleration. |
| `--cuda-id` | Default: `1`. Specifies which GPU to use; indexing starts from 0. |
| `--seed` | Sets a random seed for reproducibility. |
| `--parameters` | Required. Path to a JSON configuration file for model parameters. See below for an example. |
| `--save-results` | Default: `True`. Whether to save evaluation results. |
| `--use-gen-results` | Path to a file of outputs already generated by the model, to be evaluated directly. |
| `--provider` | Default: `False`. Provider to use. Choices: `openai`. |
| `--thinking` | Default: `False`. Enables thinking mode. |
| `--use-alpaca-prompt` | Default: `False`. Uses the Alpaca prompt format. |
| `--sample-size` | Default: `1.0`. Fraction (percentage) of the task's dataset to sample. |
Before running HuGME, you must set the `DATASETS` environment variable so that the framework can access the datasets required for the evaluation tasks. Make sure the specified path points to the directory containing them.

```shell
export DATASETS=/path/to/datasets
```

The spelling task additionally requires the following environment variable:

```shell
export BERT_MODEL=/path/to/bert-model
```

HuGME requires model parameters to be configured via a JSON file, which is passed on to either Hugging Face's transformers library or OpenAI's library. Set the file path with the `--parameters` flag. Example:
```json
{
    "max_new_tokens": 50,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 150,
    "repetition_penalty": 0.98,
    "diversity_penalty": 0,
    "do_sample": true,
    "return_full_text": false
}
```

To authenticate with OpenAI or Hugging Face, set your API keys as environment variables:
```shell
export OPENAI_API_KEY=sk-examplekey        # judge model for DeepEval-based metrics
export HF_TOKEN=hf-exampletoken            # using Hugging Face models
export PROVIDER_API_KEY=provider-api-key   # using a custom (OpenAI-package-compatible) provider
export PROVIDER_URL=hf-provider-url        # using a custom (OpenAI-package-compatible) provider
```

Alternatively, provide them inline when running the evaluation:

```shell
OPENAI_API_KEY=sk-examplekey hugme --model-name NYTK/PULI-LlumiX-32K --tasks mmlu
```

After running metrics and/or benchmarks, all generation and evaluation outputs are saved in the `results/` directory.
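The parameters file passed via `--parameters` is plain JSON and can be loaded with a few lines of standard-library Python. The helper below is purely illustrative (it is not part of HuGME); the sanity checks are assumptions about reasonable value ranges:

```python
import json
import tempfile

# Hypothetical helper (not part of HuGME): load the generation
# parameters file and apply a few basic sanity checks.
def load_generation_params(path):
    with open(path, encoding="utf-8") as f:
        params = json.load(f)
    if "temperature" in params and not 0.0 <= params["temperature"] <= 2.0:
        raise ValueError("temperature should be between 0.0 and 2.0")
    if "top_p" in params and not 0.0 < params["top_p"] <= 1.0:
        raise ValueError("top_p should be in (0.0, 1.0]")
    return params

# Example: write a config like the one above to a temporary file and load it.
config = {"max_new_tokens": 50, "temperature": 0.7, "top_p": 0.9, "do_sample": True}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(config, f)
params = load_generation_params(f.name)
print(params["max_new_tokens"])  # 50
```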
HuGME includes multiple tasks to evaluate different aspects of LLM performance in Hungarian. The score calculations can also be found here.
Bias

Assesses language model outputs for biased content through systematic opinion analysis across gender, politics, race/ethnicity, and geographical dimensions. It employs a dataset of 100 carefully crafted queries designed to potentially elicit biased responses, with models required to prefix their outputs with opinion indicators (such as Szerintem 'I think', Úgy gondolom 'I believe', or Véleményem szerint 'In my opinion'). This prefixing requirement facilitates opinion extraction, which is crucial since unbiased responses typically lack opinionated content.
Toxicity

Evaluates language models' tendency to generate harmful or offensive content by analyzing opinions extracted from model responses to 100 specialized queries. An opinion is classified as toxic if it contains personal attacks, mockery, hate speech, dismissive statements, or threats that degrade or intimidate others, while non-toxic opinions are characterized by respectful engagement, openness to discussion, and constructive critique of ideas rather than individuals.
Answer relevancy

Evaluates the model's ability to generate contextually appropriate responses by comparing individual output statements against the input query. Using 100 diverse test queries spanning history, logic, and Hungarian idioms, the module assesses whether responses stay on topic and avoid contradictions, focusing on relevance rather than factual accuracy.
Faithfulness

Examines factual accuracy by comparing model outputs against provided context across 100 queries. Each query includes detailed context, with the evaluation focused on verifying that extracted claims align with the given factual information.
Summarization

Tests the model's ability to condense Hungarian texts while retaining key information. Using 50 texts, evaluation is based on whether four predefined yes/no questions can be answered from each generated summary, ensuring critical details remain while allowing flexibility in presentation.
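Under this scheme, a natural per-summary score is the fraction of the four predefined questions that remain answerable. The sketch below is illustrative only (it is not HuGME's actual implementation, and the judgment of answerability would come from a judge model):

```python
# Illustrative sketch: score a summary by the fraction of its
# predefined yes/no questions that can still be answered from it.
def summarization_score(answerable_flags):
    """answerable_flags: one boolean per predefined question."""
    return sum(answerable_flags) / len(answerable_flags)

# Example: a judge decided 3 of the 4 questions remain answerable.
print(summarization_score([True, True, True, False]))  # 0.75
```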
Prompt alignment

Evaluates models' ability to execute Hungarian commands accurately. It uses 100 queries, each containing specific instructions, with evaluation based on whether the model follows all instructions completely and precisely. The `max_new_tokens` parameter must be at least 256 for this task.
Spelling

Evaluates adherence to Hungarian orthography using a custom dictionary trained on index.hu texts and pyspellchecker. Flagged words from readability test outputs are verified by GPT-4 to minimize false positives, with the final score calculated as the ratio of correctly spelled words.
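The final ratio can be sketched in a few lines. This is a toy illustration only: the real metric relies on a custom index.hu-trained dictionary, pyspellchecker, and GPT-4 verification, none of which are reproduced here, and the tiny word list below is hypothetical:

```python
import re

# Toy sketch of the spelling score: the share of words found in a
# dictionary. The dictionary here is a hypothetical three-word list.
def spelling_score(text, dictionary):
    words = re.findall(r"\w+", text.lower())
    correct = sum(1 for w in words if w in dictionary)
    return correct / len(words) if words else 1.0

toy_dictionary = {"a", "kutya", "ugat"}  # "the", "dog", "barks"
print(spelling_score("A kutya ugat", toy_dictionary))   # 1.0
print(spelling_score("A kutya ugatt", toy_dictionary))  # one misspelling out of three words
```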
Readability

Evaluates how well models adapt their output complexity to match input texts. It uses 20 texts across four complexity levels (fairy tales, 6th grade, 10th grade, and academic), with readability assessed using an average of the Coleman-Liau Index and textstat's text_standard scores.
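The Coleman-Liau Index itself is a simple closed-form formula: CLI = 0.0588·L − 0.296·S − 15.8, where L is the average number of letters per 100 words and S the average number of sentences per 100 words. The sketch below uses crude word and sentence splitting and is not HuGME's implementation (which pairs this index with textstat's text_standard):

```python
import re

# Sketch of the Coleman-Liau Index with naive tokenization:
# words are \w+ runs, sentences are counted by terminal punctuation.
def coleman_liau_index(text):
    words = re.findall(r"\w+", text)
    letters = sum(len(w) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    L = letters / len(words) * 100    # letters per 100 words
    S = sentences / len(words) * 100  # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

print(round(coleman_liau_index("Ez egy rövid mondat. Ez is."), 2))
```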
TruthfulQA

Adapts the TruthfulQA dataset for Hungarian by translating questions and adding culturally specific content, resulting in 747 questions across 37 categories.
MMLU

Adapts the MMLU benchmark for Hungarian by machine-translating and manually refining multiple-choice questions across 38 subjects to ensure cultural relevance and accurate assessment.
Needle in a haystack

Tests LLM performance in extracting specific information ("needle") from large bodies of Hungarian text ("haystack"), assessing the model's ability to focus on relevant details within a complex context. A target sentence is embedded in various sections of a Hungarian novel, and the model must locate and extract it.
Providers like OpenAI are currently unsupported for this metric.
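Constructing such a test case amounts to inserting the needle sentence at a chosen position in the haystack. The sketch below is illustrative only (HuGME embeds the sentence in a Hungarian novel; the paragraphs and needle here are made up):

```python
# Illustrative sketch: build a needle-in-a-haystack input by inserting
# the needle sentence before the paragraph at the given position.
def embed_needle(haystack_paragraphs, needle, position):
    paragraphs = list(haystack_paragraphs)
    paragraphs.insert(position, needle)
    return "\n\n".join(paragraphs)

haystack = ["Első bekezdés.", "Második bekezdés.", "Harmadik bekezdés."]  # "First/Second/Third paragraph."
needle = "A titkos kód: 4242."  # "The secret code: 4242."
text = embed_needle(haystack, needle, 1)
print(needle in text)  # True
```

Varying `position` across the document probes whether retrieval quality depends on where the needle sits in the context window.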
Contributions to HuGME are welcome! If you find a bug, want to add new evaluation modules, or improve existing ones, please feel free to open an issue or submit a pull request.