6 changes: 3 additions & 3 deletions .github/workflows/validate-data.yml
@@ -4,12 +4,12 @@ on:
push:
branches: [ main ]
paths:
- 'scripts/validate_data.py'
- 'utils/validate_data.py'
- 'eval.schema.json'
- 'data/**'
pull_request:
paths:
- 'scripts/validate_data.py'
- 'utils/validate_data.py'
- 'eval.schema.json'
- 'data/**'

@@ -34,7 +34,7 @@ jobs:
enable-cache: false

- name: Check for duplicate entries
run: uv run python scripts/check_duplicate_entries.py data
run: uv run python utils/check_duplicate_entries.py data

- name: Validate data
run: uv run pre-commit run --all-files
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -3,7 +3,7 @@ repos:
hooks:
- id: validate-data
name: Validate data
entry: uv run python scripts/validate_data.py --schema-path eval.schema.json
entry: uv run python utils/validate_data.py --schema-path eval.schema.json
exclude: ^(tests/|.*\.schema\.json$)
language: system
types_or: [json]
13 changes: 9 additions & 4 deletions README.md
@@ -28,8 +28,8 @@ Note: Each file can contain multiple individual results related to one model. Se
1. Add a new folder under `/data` with a codename for your eval.
2. For each model, use the HuggingFace (`developer_name/model_name`) naming convention to create a 2-tier folder structure.
3. Add a JSON file with results for each model and name it `{uuid}.json`.
4. [Optional] Include a `scripts` folder in your eval name folder with any scripts used to generate the data.
5. [Validate] Validation Script: Adds workflow (`workflows/validate-data.yml`) that runs validation script (`scripts/validate_data.py`) to check JSON files against schema and report errors before merging.
4. [Optional] Include a `utils` folder in your eval name folder with any scripts used to generate the data.
5. [Validate] The validation workflow (`workflows/validate-data.yml`) runs `utils/validate_data.py` to check each JSON file against the schema and report errors before merging (an example layout and local check are sketched below).
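For illustration only, a hypothetical layout for an eval codenamed `myeval` (the developer, model, and UUID below are made up):

```
data/
  myeval/                                        # eval codename
    openai/                                      # developer_name
      gpt-4o/                                    # model_name
        123e4567-e89b-12d3-a456-426614174000.json
    utils/                                       # optional: scripts used to generate the data
```

The same checks that run in CI can be run locally with `uv run pre-commit run --all-files` before opening a pull request.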

### Schema Instructions

@@ -188,8 +188,13 @@ Each evaluation (e.g., `livecodebenchpro`, `hfopenllm_v2`) has its own directory

Run the following bash commands to generate Pydantic classes for `eval.schema.json` and `instance_level_eval.schema.json` (for easier use in data converter scripts):

```bash
uv run datamodel-codegen --input eval.schema.json --output eval_types.py --class-name EvaluationLog --output-model-type pydantic_v2.BaseModel --input-file-type jsonschema --formatters ruff-format ruff-check
uv run datamodel-codegen --input instance_level_eval.schema.json --output instance_level_types.py --class-name InstanceLevelEvaluationLog --output-model-type pydantic_v2.BaseModel --input-file-type jsonschema --formatters ruff-format ruff-check
```
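As a quick sanity check of the generated models, here is a minimal sketch of loading and validating one result file (the file path is hypothetical):

```python
# Minimal sketch: validate a single result JSON against the generated EvaluationLog model.
# The path below is illustrative; substitute any result file under data/.
import json
from pathlib import Path

from eval_types import EvaluationLog  # generated by datamodel-codegen above

raw = json.loads(Path("data/myeval/openai/gpt-4o/123e4567-e89b-12d3-a456-426614174000.json").read_text())
log = EvaluationLog.model_validate(raw)  # raises pydantic.ValidationError if the file does not match the schema
print(log.model_dump_json(indent=2))
```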

## Eval Converters

We have prepared converters to make adapting to our schema as easy as possible. At the moment, we support converting local evaluations in `Inspect AI` and `HELM` formats into our unified schema.

For more information, see the README in `eval_converters`.
20 changes: 10 additions & 10 deletions scripts/eval_converters/README.md → eval_converters/README.md
@@ -1,5 +1,5 @@
## Automatic Evaluation Log Converters
A collection of scripts to convert evaluation logs from local runs of evaluation benchmarks (e.g., Inspect AI and lm-eval-harness).
A collection of scripts to convert evaluation logs from local runs of evaluation frameworks (e.g., `Inspect AI` and `lm-eval-harness`).

### Installation
- Install the required dependencies:
@@ -9,18 +9,18 @@ uv sync
```

### Inspect
Convert eval log from Inspect AI into json format with following command:
`
The conversion script from `Inspect AI` to the unified schema can be run using `eval_converters/inspect/__main__.py`.

```bash
uv run inspect log convert path_to_eval_file_generated_by_inspect --to json --output-dir inspect_json
```

Then we can convert Inspect evaluation log into unified schema via `eval_converters/inspect/__main__.py`. Conversion for example data can be generated via below script:
Using the `--log_path` argument, you can choose one of three ways to specify evaluations to convert:
- Provide an `Inspect AI` evaluation log with the `.eval` extension (e.g., `2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.eval`)
- Provide an `Inspect AI` evaluation log with the `.json` extension (e.g., `2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.json`)
- Provide a directory containing multiple `Inspect AI` evaluation logs

for example:
The exact command for converting an example evaluation log is:

```bash
uv run python3 -m scripts.eval_converters.inspect --log_path tests/data/inspect/2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.json
uv run python3 -m eval_converters.inspect --log_path tests/data/inspect/2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.json
```


@@ -48,7 +48,7 @@ options:
You can convert a HELM evaluation log into the unified schema via `eval_converters/helm/__main__.py`. For example:

```bash
uv run python3 -m scripts.eval_converters.helm --log_path tests/data/helm
uv run python3 -m eval_converters.helm --log_path tests/data/helm
```

The automatic conversion script requires the following files generated by HELM to work correctly:
File renamed without changes.
@@ -7,7 +7,7 @@
from pathlib import Path
from typing import Any, Dict, List, Tuple, Union

from scripts.eval_converters.common.error import AdapterError, TransformationError
from eval_converters.common.error import AdapterError, TransformationError
from eval_types import EvaluationLog

@dataclass
@@ -6,7 +6,7 @@
from pathlib import Path
from typing import Any, Dict, List, Union

from scripts.eval_converters.helm.adapter import HELMAdapter
from eval_converters.helm.adapter import HELMAdapter
from eval_types import (
EvaluatorRelationship,
EvaluationLog
@@ -25,8 +25,8 @@
SourceType
)

from scripts.eval_converters.common.adapter import AdapterMetadata, BaseEvaluationAdapter, SupportedLibrary
from scripts.eval_converters import SCHEMA_VERSION
from eval_converters.common.adapter import AdapterMetadata, BaseEvaluationAdapter, SupportedLibrary
from eval_converters import SCHEMA_VERSION

# run this just once in your process to initialize the registry
register_builtin_configs_from_helm_package()
File renamed without changes.
@@ -6,7 +6,7 @@
from pathlib import Path
from typing import Any, Dict, List, Tuple, Union

from scripts.eval_converters.inspect.adapter import InspectAIAdapter
from eval_converters.inspect.adapter import InspectAIAdapter
from eval_types import (
EvaluatorRelationship,
EvaluationLog
@@ -49,24 +49,24 @@
Uncertainty
)

from scripts.eval_converters.common.adapter import (
from eval_converters.common.adapter import (
AdapterMetadata,
BaseEvaluationAdapter,
SupportedLibrary
)

from scripts.eval_converters.common.error import AdapterError
from scripts.eval_converters.common.utils import (
from eval_converters.common.error import AdapterError
from eval_converters.common.utils import (
convert_timestamp_to_unix_format,
get_current_unix_timestamp
)
from scripts.eval_converters.inspect.instance_level_adapter import (
from eval_converters.inspect.instance_level_adapter import (
InspectInstanceLevelDataAdapter
)
from scripts.eval_converters.inspect.utils import (
from eval_converters.inspect.utils import (
extract_model_info_from_model_path, sha256_file
)
from scripts.eval_converters import SCHEMA_VERSION
from eval_converters import SCHEMA_VERSION

class InspectAIAdapter(BaseEvaluationAdapter):
"""
@@ -25,8 +25,8 @@
ToolCall
)

from scripts.eval_converters import SCHEMA_VERSION
from scripts.eval_converters.inspect.utils import sha256_string
from eval_converters import SCHEMA_VERSION
from eval_converters.inspect.utils import sha256_string


class InspectInstanceLevelDataAdapter:
@@ -9,7 +9,7 @@
InferenceEngine,
ModelInfo
)
from scripts.eval_converters.common.utils import get_model_organization_info
from eval_converters.common.utils import get_model_organization_info


class ModelPathHandler:
9 changes: 0 additions & 9 deletions scripts/HELM/parse_helm_leaderboards.sh

This file was deleted.

@@ -27,7 +27,7 @@
"solver_args_passed": {},
"dataset": {
"name": "GAIA",
"location": "/Users/damians/Library/Caches/inspect_evals/gaia_dataset/GAIA",
"location": "inspect_evals/gaia_dataset/GAIA",
"samples": 165,
"sample_ids": [
"c61d22de-5f6c-4958-a7f6-5e9707bd3466",
6 changes: 3 additions & 3 deletions tests/test_helm_adapter.py
@@ -1,10 +1,10 @@
from pathlib import Path

from scripts.eval_converters.helm.adapter import HELMAdapter
from eval_converters.helm.adapter import HELMAdapter
from eval_types import (
EvaluationLog,
EvaluatorRelationship,
SourceData,
SourceDataHf,
SourceMetadata
)

@@ -13,7 +13,7 @@ def _load_eval(adapter, filepath, metadata_args):
eval_dirpath = Path(filepath)
converted_eval = adapter.transform_from_directory(eval_dirpath, metadata_args=metadata_args)
assert isinstance(converted_eval, EvaluationLog)
assert isinstance(converted_eval.source_data, SourceData)
assert isinstance(converted_eval.source_data, SourceDataHf)

assert converted_eval.source_metadata.source_name == 'helm'
assert converted_eval.source_metadata.source_type.value == 'evaluation_run'
8 changes: 4 additions & 4 deletions tests/test_inspect_adapter.py
@@ -1,18 +1,18 @@
from pathlib import Path

from scripts.eval_converters.inspect.adapter import InspectAIAdapter
from scripts.eval_converters.inspect.utils import extract_model_info_from_model_path
from eval_converters.inspect.adapter import InspectAIAdapter
from eval_converters.inspect.utils import extract_model_info_from_model_path
from eval_types import (
EvaluationLog,
EvaluatorRelationship,
SourceData
SourceDataHf
)

def _load_eval(adapter, filepath, metadata_args):
eval_path = Path(filepath)
converted_eval = adapter.transform_from_file(eval_path, metadata_args=metadata_args)
assert isinstance(converted_eval, EvaluationLog)
assert isinstance(converted_eval.source_data, SourceData)
assert isinstance(converted_eval.source_data, SourceDataHf)

assert converted_eval.source_metadata.source_name == 'inspect_ai'
assert converted_eval.source_metadata.source_type.value == 'evaluation_run'
File renamed without changes.
File renamed without changes.
@@ -6,7 +6,7 @@
- Global MMLU Lite: Kaggle Benchmarks API (cohere-labs/global-mmlu-lite)

Usage:
uv run python -m scripts.global-mmlu-lite.adapter
uv run python -m utils.global-mmlu-lite.adapter
"""

import time
@@ -28,7 +28,7 @@
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))

from utils import (
from helpers import (
fetch_json,
get_developer,
make_source_metadata,
4 changes: 2 additions & 2 deletions scripts/HELM/adapter.py → utils/helm/adapter.py
@@ -9,7 +9,7 @@
- HELM_MMLU

Usage:
uv run python -m scripts.helm.adapter --leaderboard_name HELM_Lite --source_data_url <url>
uv run python -m utils.helm.adapter --leaderboard_name HELM_Lite --source_data_url <url>
"""

import math
@@ -32,7 +32,7 @@
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))

from utils import (
from helpers import (
fetch_json,
get_developer,
make_model_info,
9 changes: 9 additions & 0 deletions utils/helm/parse_helm_leaderboards.sh
@@ -0,0 +1,9 @@
uv run python3 -m utils.helm.adapter --leaderboard_name HELM_Capabilities --source_data_url https://storage.googleapis.com/crfm-helm-public/capabilities/benchmark_output/releases/v1.12.0/groups/core_scenarios.json

uv run python3 -m utils.helm.adapter --leaderboard_name HELM_Lite --source_data_url https://storage.googleapis.com/crfm-helm-public/lite/benchmark_output/releases/v1.13.0/groups/core_scenarios.json

uv run python3 -m utils.helm.adapter --leaderboard_name HELM_Classic --source_data_url https://storage.googleapis.com/crfm-helm-public/benchmark_output/releases/v0.4.0/groups/core_scenarios.json

uv run python3 -m utils.helm.adapter --leaderboard_name HELM_Instruct --source_data_url https://storage.googleapis.com/crfm-helm-public/instruct/benchmark_output/releases/v1.0.0/groups/instruction_following.json

uv run python3 -m utils.helm.adapter --leaderboard_name HELM_MMLU --source_data_url https://storage.googleapis.com/crfm-helm-public/mmlu/benchmark_output/releases/v1.13.0/groups/mmlu_subjects.json
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
@@ -5,7 +5,7 @@
- HF Open LLM Leaderboard v2 API: https://open-llm-leaderboard-open-llm-leaderboard.hf.space/api/leaderboard/formatted

Usage:
uv run python -m scripts.hfopenllm_v2.adapter
uv run python -m utils.hfopenllm_v2.adapter
"""

import time
@@ -25,7 +25,7 @@
import sys
sys.path.insert(0, str(Path(__file__).parent.parent))

from utils import (
from helpers import (
fetch_json,
get_developer,
make_model_info,
@@ -5,7 +5,7 @@
using SourceDataUrl, matching each URL to its evaluation by difficulty.

Usage:
uv run python scripts/livecodebenchpro/adapter.py
uv run python utils/livecodebenchpro/adapter.py
"""

import json
@@ -7,7 +7,7 @@
- RewardBench v2: JSON files from allenai/reward-bench-2-results dataset (eval-set/{org}/{model}.json)

Usage:
uv run python -m scripts.rewardbench.adapter
uv run python -m utils.rewardbench.adapter
"""

import re
@@ -31,7 +31,7 @@
import sys
sys.path.insert(0, str(Path(__file__).parent.parent))

from utils import (
from helpers import (
fetch_csv,
fetch_json,
get_developer,
@@ -14,7 +14,7 @@
source_data = {"dataset_name": "RewardBench 2", "source_type": "hf_dataset", "hf_repo": "allenai/reward-bench-2-results"}

Usage:
python -m scripts.rewardbench.migrate_to_v020
python -m utils.rewardbench.migrate_to_v020
"""

import json
File renamed without changes.
File renamed without changes.