diff --git a/.github/workflows/validate-data.yml b/.github/workflows/validate-data.yml
index e689397a7..6c12b18bc 100644
--- a/.github/workflows/validate-data.yml
+++ b/.github/workflows/validate-data.yml
@@ -4,12 +4,12 @@ on:
   push:
     branches: [ main ]
     paths:
-      - 'scripts/validate_data.py'
+      - 'utils/validate_data.py'
       - 'eval.schema.json'
       - 'data/**'
 
   pull_request:
     paths:
-      - 'scripts/validate_data.py'
+      - 'utils/validate_data.py'
       - 'eval.schema.json'
       - 'data/**'
@@ -34,7 +34,7 @@ jobs:
           enable-cache: false
 
       - name: Check for duplicate entries
-        run: uv run python scripts/check_duplicate_entries.py data
+        run: uv run python utils/check_duplicate_entries.py data
 
       - name: Validate data
         run: uv run pre-commit run --all-files
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 669160648..13d281dd1 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -3,7 +3,7 @@ repos:
     hooks:
       - id: validate-data
         name: Validate data
-        entry: uv run python scripts/validate_data.py --schema-path eval.schema.json
+        entry: uv run python utils/validate_data.py --schema-path eval.schema.json
         exclude: ^(tests/|.*\.schema\.json$)
         language: system
         types_or: [json]
diff --git a/README.md b/README.md
index 7760495ca..8b54d4da3 100644
--- a/README.md
+++ b/README.md
@@ -28,8 +28,8 @@ Note: Each file can contain multiple individual results related to one model. Se
 1. Add a new folder under `/data` with a codename for your eval.
 2. For each model, use the HuggingFace (`developer_name/model_name`) naming convention to create a 2-tier folder structure.
 3. Add a JSON file with results for each model and name it `{uuid}.json`.
-4. [Optional] Include a `scripts` folder in your eval name folder with any scripts used to generate the data.
-5. [Validate] Validation Script: Adds workflow (`workflows/validate-data.yml`) that runs validation script (`scripts/validate_data.py`) to check JSON files against schema and report errors before merging.
+4. [Optional] Include a `utils` folder in your eval name folder with any scripts used to generate the data.
+5. [Validate] A validation workflow (`workflows/validate-data.yml`) runs the validation script (`utils/validate_data.py`) to check JSON files against the schema and report errors before merging.
 
 ### Schema Instructions
 
@@ -188,8 +188,19 @@ Each evaluation (e.g., `livecodebenchpro`, `hfopenllm_v2`) has its own directory
 
 Run following bash commands to generate pydantic classes for `eval.schema.json` and `instance_level_eval.schema.json` (to easier use in data converter scripts):
 
+```bash
 uv run datamodel-codegen --input eval.schema.json --output eval_types.py --class-name EvaluationLog --output-model-type pydantic_v2.BaseModel --input-file-type jsonschema --formatters ruff-format ruff-check
 uv run datamodel-codegen --input instance_level_eval.schema.json --output instance_level_types.py --class-name InstanceLevelEvaluationLog --output-model-type pydantic_v2.BaseModel --input-file-type jsonschema --formatters ruff-format ruff-check
-
-```bash
 ```
+
+## Eval Converters
+
+We have prepared converters to make adapting to our schema as easy as possible. At the moment, we support converting local evaluations in `Inspect AI` and `HELM` formats into our unified schema.
+
+For more information, see the README in `eval_converters`.
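+
+As a quick example (see `eval_converters/README.md` for the full set of options), an `Inspect AI` log can be converted with a command along these lines, where the `--log_path` value is a placeholder for your own log file:
+
+```bash
+uv run python3 -m eval_converters.inspect --log_path <path_to_inspect_log>
+```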
\ No newline at end of file
diff --git a/scripts/eval_converters/README.md b/eval_converters/README.md
similarity index 73%
rename from scripts/eval_converters/README.md
rename to eval_converters/README.md
index 79e1504cb..f45952cae 100644
--- a/scripts/eval_converters/README.md
+++ b/eval_converters/README.md
@@ -1,5 +1,5 @@
 ## Automatic Evaluation Log Converters
-A collection of scripts to convert evaluation logs from local runs of evaluation benchmarks (e.g., Inspect AI and lm-eval-harness).
+A collection of scripts to convert evaluation logs from local runs of evaluation frameworks (e.g., `Inspect AI` and `lm-eval-harness`).
 
 ### Installation
 - Install the required dependencies:
@@ -9,18 +9,17 @@
 uv sync
 ```
 
 ### Inspect
-Convert eval log from Inspect AI into json format with following command:
+The conversion script from `Inspect AI` to the unified schema can be run using `eval_converters/inspect/__main__.py`.
 
-```bash
-uv run inspect log convert path_to_eval_file_generated_by_inspect --to json --output-dir inspect_json
-```
-
-Then we can convert Inspect evaluation log into unified schema via `eval_converters/inspect/__main__.py`. Conversion for example data can be generated via below script:
+Using the `--log_path` argument, you can choose one of three ways to specify evaluations to convert:
+- Provide an `Inspect AI` evaluation log with the `.eval` extension (e.g., `2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.eval`)
+- Provide an `Inspect AI` evaluation log with the `.json` extension (e.g., `2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.json`)
+- Provide a directory containing multiple `Inspect AI` evaluation logs
 
-for example:
+The exact command for converting an example evaluation log is:
 
 ```bash
-uv run python3 -m scripts.eval_converters.inspect --log_path tests/data/inspect/2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.json
+uv run python3 -m eval_converters.inspect --log_path tests/data/inspect/2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.json
 ```
 
@@ -48,7 +48,7 @@ options:
 You can convert HELM evaluation log into unified schema via `eval_converters/helm/__main__.py`.
 For example:
 ```bash
-uv run python3 -m scripts.eval_converters.helm --log_path tests/data/helm
+uv run python3 -m eval_converters.helm --log_path tests/data/helm
 ```
 
 The automatic conversion script requires following files generated by HELM to work correctly:
diff --git a/scripts/eval_converters/__init__.py b/eval_converters/__init__.py
similarity index 100%
rename from scripts/eval_converters/__init__.py
rename to eval_converters/__init__.py
diff --git a/scripts/eval_converters/common/__init__.py b/eval_converters/common/__init__.py
similarity index 100%
rename from scripts/eval_converters/common/__init__.py
rename to eval_converters/common/__init__.py
diff --git a/scripts/eval_converters/common/adapter.py b/eval_converters/common/adapter.py
similarity index 98%
rename from scripts/eval_converters/common/adapter.py
rename to eval_converters/common/adapter.py
index dfde2cc0f..b1462aabf 100644
--- a/scripts/eval_converters/common/adapter.py
+++ b/eval_converters/common/adapter.py
@@ -7,7 +7,7 @@
 from pathlib import Path
 from typing import Any, Dict, List, Tuple, Union
 
-from scripts.eval_converters.common.error import AdapterError, TransformationError
+from eval_converters.common.error import AdapterError, TransformationError
 from eval_types import EvaluationLog
 
 @dataclass
diff --git a/scripts/eval_converters/common/error.py b/eval_converters/common/error.py
similarity index 100%
rename from scripts/eval_converters/common/error.py
rename to eval_converters/common/error.py
diff --git a/scripts/eval_converters/common/utils.py b/eval_converters/common/utils.py
similarity index 100%
rename from scripts/eval_converters/common/utils.py
rename to eval_converters/common/utils.py
diff --git a/scripts/eval_converters/helm/__init__.py b/eval_converters/helm/__init__.py
similarity index 100%
rename from scripts/eval_converters/helm/__init__.py
rename to eval_converters/helm/__init__.py
diff --git a/scripts/eval_converters/helm/__main__.py b/eval_converters/helm/__main__.py
similarity index 98%
rename from scripts/eval_converters/helm/__main__.py
rename to eval_converters/helm/__main__.py
index 838b3625f..355b150d1 100644
--- a/scripts/eval_converters/helm/__main__.py
+++ b/eval_converters/helm/__main__.py
@@ -6,7 +6,7 @@
 from pathlib import Path
 from typing import Any, Dict, List, Union
 
-from scripts.eval_converters.helm.adapter import HELMAdapter
+from eval_converters.helm.adapter import HELMAdapter
 from eval_types import (
     EvaluatorRelationship,
     EvaluationLog
diff --git a/scripts/eval_converters/helm/adapter.py b/eval_converters/helm/adapter.py
similarity index 98%
rename from scripts/eval_converters/helm/adapter.py
rename to eval_converters/helm/adapter.py
index 3987eecc7..797e6d8a1 100644
--- a/scripts/eval_converters/helm/adapter.py
+++ b/eval_converters/helm/adapter.py
@@ -25,8 +25,8 @@
     SourceType
 )
 
-from scripts.eval_converters.common.adapter import AdapterMetadata, BaseEvaluationAdapter, SupportedLibrary
-from scripts.eval_converters import SCHEMA_VERSION
+from eval_converters.common.adapter import AdapterMetadata, BaseEvaluationAdapter, SupportedLibrary
+from eval_converters import SCHEMA_VERSION
 
 # run this just once in your process to initialize the registry
 register_builtin_configs_from_helm_package()
diff --git a/scripts/eval_converters/helm/utils.py b/eval_converters/helm/utils.py
similarity index 100%
rename from scripts/eval_converters/helm/utils.py
rename to eval_converters/helm/utils.py
diff --git a/scripts/eval_converters/inspect/__init__.py b/eval_converters/inspect/__init__.py
similarity index 100%
rename from scripts/eval_converters/inspect/__init__.py
rename to eval_converters/inspect/__init__.py
diff --git a/scripts/eval_converters/inspect/__main__.py b/eval_converters/inspect/__main__.py
similarity index 98%
rename from scripts/eval_converters/inspect/__main__.py
rename to eval_converters/inspect/__main__.py
index 5ae386449..7f8327c3b 100644
--- a/scripts/eval_converters/inspect/__main__.py
+++ b/eval_converters/inspect/__main__.py
@@ -6,7 +6,7 @@
 from pathlib import Path
 from typing import Any, Dict, List, Tuple, Union
 
-from scripts.eval_converters.inspect.adapter import InspectAIAdapter
+from eval_converters.inspect.adapter import InspectAIAdapter
 from eval_types import (
     EvaluatorRelationship,
     EvaluationLog
diff --git a/scripts/eval_converters/inspect/adapter.py b/eval_converters/inspect/adapter.py
similarity index 97%
rename from scripts/eval_converters/inspect/adapter.py
rename to eval_converters/inspect/adapter.py
index cf9273909..6f29cc4f4 100644
--- a/scripts/eval_converters/inspect/adapter.py
+++ b/eval_converters/inspect/adapter.py
@@ -49,24 +49,24 @@
     Uncertainty
 )
 
-from scripts.eval_converters.common.adapter import (
+from eval_converters.common.adapter import (
     AdapterMetadata,
     BaseEvaluationAdapter,
     SupportedLibrary
 )
-from scripts.eval_converters.common.error import AdapterError
-from scripts.eval_converters.common.utils import (
+from eval_converters.common.error import AdapterError
+from eval_converters.common.utils import (
     convert_timestamp_to_unix_format,
     get_current_unix_timestamp
 )
-from scripts.eval_converters.inspect.instance_level_adapter import (
+from eval_converters.inspect.instance_level_adapter import (
     InspectInstanceLevelDataAdapter
 )
-from scripts.eval_converters.inspect.utils import (
+from eval_converters.inspect.utils import (
     extract_model_info_from_model_path,
     sha256_file
 )
-from scripts.eval_converters import SCHEMA_VERSION
+from eval_converters import SCHEMA_VERSION
 
 class InspectAIAdapter(BaseEvaluationAdapter):
     """
diff --git a/scripts/eval_converters/inspect/instance_level_adapter.py b/eval_converters/inspect/instance_level_adapter.py
similarity index 98%
rename from scripts/eval_converters/inspect/instance_level_adapter.py
rename to eval_converters/inspect/instance_level_adapter.py
index d8bfab51c..26ba6c094 100644
--- a/scripts/eval_converters/inspect/instance_level_adapter.py
+++ b/eval_converters/inspect/instance_level_adapter.py
@@ -25,8 +25,8 @@
     ToolCall
 )
 
-from scripts.eval_converters import SCHEMA_VERSION
-from scripts.eval_converters.inspect.utils import sha256_string
+from eval_converters import SCHEMA_VERSION
+from eval_converters.inspect.utils import sha256_string
 
 
 class InspectInstanceLevelDataAdapter:
diff --git a/scripts/eval_converters/inspect/utils.py b/eval_converters/inspect/utils.py
similarity index 99%
rename from scripts/eval_converters/inspect/utils.py
rename to eval_converters/inspect/utils.py
index e45ba253b..3e09d180a 100644
--- a/scripts/eval_converters/inspect/utils.py
+++ b/eval_converters/inspect/utils.py
@@ -9,7 +9,7 @@
     InferenceEngine,
     ModelInfo
 )
 
-from scripts.eval_converters.common.utils import get_model_organization_info
+from eval_converters.common.utils import get_model_organization_info
 
 class ModelPathHandler:
diff --git a/scripts/HELM/parse_helm_leaderboards.sh b/scripts/HELM/parse_helm_leaderboards.sh
deleted file mode 100755
index ad09b437c..000000000
--- a/scripts/HELM/parse_helm_leaderboards.sh
+++ /dev/null
@@ -1,9 +0,0 @@
-uv run python3 -m scripts.HELM.convert_to_schema --leaderboard_name HELM_Capabilities --source_data_url https://storage.googleapis.com/crfm-helm-public/capabilities/benchmark_output/releases/v1.12.0/groups/core_scenarios.json
-
-uv run python3 -m scripts.HELM.convert_to_schema --leaderboard_name HELM_Lite --source_data_url https://storage.googleapis.com/crfm-helm-public/lite/benchmark_output/releases/v1.13.0/groups/core_scenarios.json
-
-uv run python3 -m scripts.HELM.convert_to_schema --leaderboard_name HELM_Classic --source_data_url https://storage.googleapis.com/crfm-helm-public/benchmark_output/releases/v0.4.0/groups/core_scenarios.json
-
-uv run python3 -m scripts.HELM.convert_to_schema --leaderboard_name HELM_Instruct --source_data_url https://storage.googleapis.com/crfm-helm-public/instruct/benchmark_output/releases/v1.0.0/groups/instruction_following.json
-
-uv run python3 -m scripts.HELM.convert_to_schema --leaderboard_name HELM_MMLU --source_data_url https://storage.googleapis.com/crfm-helm-public/mmlu/benchmark_output/releases/v1.13.0/groups/mmlu_subjects.json
\ No newline at end of file
diff --git a/tests/data/inspect/2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.json b/tests/data/inspect/2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.json
index 58ca9ea09..ec5a1402b 100644
--- a/tests/data/inspect/2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.json
+++ b/tests/data/inspect/2026-02-07T11-26-57+00-00_gaia_4V8zHbbRKpU5Yv2BMoBcjE.json
@@ -27,7 +27,7 @@
     "solver_args_passed": {},
     "dataset": {
       "name": "GAIA",
-      "location": "/Users/damians/Library/Caches/inspect_evals/gaia_dataset/GAIA",
+      "location": "inspect_evals/gaia_dataset/GAIA",
       "samples": 165,
       "sample_ids": [
         "c61d22de-5f6c-4958-a7f6-5e9707bd3466",
diff --git a/tests/test_helm_adapter.py b/tests/test_helm_adapter.py
index d75c812d8..de1345efd 100644
--- a/tests/test_helm_adapter.py
+++ b/tests/test_helm_adapter.py
@@ -1,10 +1,10 @@
 from pathlib import Path
 
-from scripts.eval_converters.helm.adapter import HELMAdapter
+from eval_converters.helm.adapter import HELMAdapter
 from eval_types import (
     EvaluationLog,
     EvaluatorRelationship,
-    SourceData,
+    SourceDataHf,
     SourceMetadata
 )
 
@@ -13,7 +13,7 @@
 def _load_eval(adapter, filepath, metadata_args):
     eval_dirpath = Path(filepath)
     converted_eval = adapter.transform_from_directory(eval_dirpath, metadata_args=metadata_args)
     assert isinstance(converted_eval, EvaluationLog)
-    assert isinstance(converted_eval.source_data, SourceData)
+    assert isinstance(converted_eval.source_data, SourceDataHf)
     assert converted_eval.source_metadata.source_name == 'helm'
     assert converted_eval.source_metadata.source_type.value == 'evaluation_run'
diff --git a/tests/test_inspect_adapter.py b/tests/test_inspect_adapter.py
index 74391113d..8bb799dd8 100644
--- a/tests/test_inspect_adapter.py
+++ b/tests/test_inspect_adapter.py
@@ -1,18 +1,18 @@
 from pathlib import Path
 
-from scripts.eval_converters.inspect.adapter import InspectAIAdapter
-from scripts.eval_converters.inspect.utils import extract_model_info_from_model_path
+from eval_converters.inspect.adapter import InspectAIAdapter
+from eval_converters.inspect.utils import extract_model_info_from_model_path
 from eval_types import (
     EvaluationLog,
     EvaluatorRelationship,
-    SourceData
+    SourceDataHf
 )
 
 
 def _load_eval(adapter, filepath, metadata_args):
     eval_path = Path(filepath)
     converted_eval = adapter.transform_from_file(eval_path, metadata_args=metadata_args)
     assert isinstance(converted_eval, EvaluationLog)
-    assert isinstance(converted_eval.source_data, SourceData)
+    assert isinstance(converted_eval.source_data, SourceDataHf)
     assert converted_eval.source_metadata.source_name == 'inspect_ai'
     assert converted_eval.source_metadata.source_type.value == 'evaluation_run'
diff --git a/scripts/__init__.py b/utils/__init__.py
similarity index 100%
rename from scripts/__init__.py
rename to utils/__init__.py
diff --git a/scripts/check_duplicate_entries.py b/utils/check_duplicate_entries.py
similarity index 100%
rename from scripts/check_duplicate_entries.py
rename to utils/check_duplicate_entries.py
diff --git a/scripts/global-mmlu-lite/__init__.py b/utils/global-mmlu-lite/__init__.py
similarity index 100%
rename from scripts/global-mmlu-lite/__init__.py
rename to utils/global-mmlu-lite/__init__.py
diff --git a/scripts/global-mmlu-lite/adapter.py b/utils/global-mmlu-lite/adapter.py
similarity index 98%
rename from scripts/global-mmlu-lite/adapter.py
rename to utils/global-mmlu-lite/adapter.py
index 0a0cbe441..78ba408f4 100644
--- a/scripts/global-mmlu-lite/adapter.py
+++ b/utils/global-mmlu-lite/adapter.py
@@ -6,7 +6,7 @@
 - Global MMLU Lite: Kaggle Benchmarks API (cohere-labs/global-mmlu-lite)
 
 Usage:
-    uv run python -m scripts.global-mmlu-lite.adapter
+    uv run python -m utils.global-mmlu-lite.adapter
 """
 
 import time
@@ -28,7 +28,7 @@
 from pathlib import Path
 
 sys.path.insert(0, str(Path(__file__).parent.parent))
-from utils import (
+from helpers import (
     fetch_json,
     get_developer,
     make_source_metadata,
diff --git a/scripts/HELM/adapter.py b/utils/helm/adapter.py
similarity index 98%
rename from scripts/HELM/adapter.py
rename to utils/helm/adapter.py
index 2542be157..3297cfac9 100644
--- a/scripts/HELM/adapter.py
+++ b/utils/helm/adapter.py
@@ -9,7 +9,7 @@
 - HELM_MMLU
 
 Usage:
-    uv run python -m scripts.helm.adapter --leaderboard_name HELM_Lite --source_data_url
+    uv run python -m utils.helm.adapter --leaderboard_name HELM_Lite --source_data_url
 """
 
 import math
@@ -32,7 +32,7 @@
 from pathlib import Path
 
 sys.path.insert(0, str(Path(__file__).parent.parent))
-from utils import (
+from helpers import (
     fetch_json,
     get_developer,
     make_model_info,
diff --git a/utils/helm/parse_helm_leaderboards.sh b/utils/helm/parse_helm_leaderboards.sh
new file mode 100755
index 000000000..f00e7dca6
--- /dev/null
+++ b/utils/helm/parse_helm_leaderboards.sh
@@ -0,0 +1,9 @@
+uv run python3 -m utils.helm.adapter --leaderboard_name HELM_Capabilities --source_data_url https://storage.googleapis.com/crfm-helm-public/capabilities/benchmark_output/releases/v1.12.0/groups/core_scenarios.json
+
+uv run python3 -m utils.helm.adapter --leaderboard_name HELM_Lite --source_data_url https://storage.googleapis.com/crfm-helm-public/lite/benchmark_output/releases/v1.13.0/groups/core_scenarios.json
+
+uv run python3 -m utils.helm.adapter --leaderboard_name HELM_Classic --source_data_url https://storage.googleapis.com/crfm-helm-public/benchmark_output/releases/v0.4.0/groups/core_scenarios.json
+
+uv run python3 -m utils.helm.adapter --leaderboard_name HELM_Instruct --source_data_url https://storage.googleapis.com/crfm-helm-public/instruct/benchmark_output/releases/v1.0.0/groups/instruction_following.json
+
+uv run python3 -m utils.helm.adapter --leaderboard_name HELM_MMLU --source_data_url https://storage.googleapis.com/crfm-helm-public/mmlu/benchmark_output/releases/v1.13.0/groups/mmlu_subjects.json
\ No newline at end of file
diff --git a/scripts/utils/__init__.py b/utils/helpers/__init__.py
similarity index 100%
rename from scripts/utils/__init__.py
rename to utils/helpers/__init__.py
diff --git a/scripts/utils/developer.py b/utils/helpers/developer.py
similarity index 100%
rename from scripts/utils/developer.py
rename to utils/helpers/developer.py
diff --git a/scripts/utils/fetch.py b/utils/helpers/fetch.py
similarity index 100%
rename from scripts/utils/fetch.py
rename to utils/helpers/fetch.py
diff --git a/scripts/utils/io.py b/utils/helpers/io.py
similarity index 100%
rename from scripts/utils/io.py
rename to utils/helpers/io.py
diff --git a/scripts/utils/schema.py b/utils/helpers/schema.py
similarity index 100%
rename from scripts/utils/schema.py
rename to utils/helpers/schema.py
diff --git a/scripts/hfopenllm_v2/adapter.py b/utils/hfopenllm_v2/adapter.py
similarity index 98%
rename from scripts/hfopenllm_v2/adapter.py
rename to utils/hfopenllm_v2/adapter.py
index 98ccf56db..3c1a97c13 100644
--- a/scripts/hfopenllm_v2/adapter.py
+++ b/utils/hfopenllm_v2/adapter.py
@@ -5,7 +5,7 @@
 - HF Open LLM Leaderboard v2 API: https://open-llm-leaderboard-open-llm-leaderboard.hf.space/api/leaderboard/formatted
 
 Usage:
-    uv run python -m scripts.hfopenllm_v2.adapter
+    uv run python -m utils.hfopenllm_v2.adapter
 """
 
 import time
@@ -25,7 +25,7 @@
 import sys
 
 sys.path.insert(0, str(Path(__file__).parent.parent))
-from utils import (
+from helpers import (
     fetch_json,
     get_developer,
     make_model_info,
diff --git a/scripts/livecodebenchpro/adapter.py b/utils/livecodebenchpro/adapter.py
similarity index 98%
rename from scripts/livecodebenchpro/adapter.py
rename to utils/livecodebenchpro/adapter.py
index 94532357b..0c6031443 100644
--- a/scripts/livecodebenchpro/adapter.py
+++ b/utils/livecodebenchpro/adapter.py
@@ -5,7 +5,7 @@
 using SourceDataUrl, matching each URL to its evaluation by difficulty.
 
 Usage:
-    uv run python scripts/livecodebenchpro/adapter.py
+    uv run python utils/livecodebenchpro/adapter.py
 """
 
 import json
diff --git a/scripts/rewardbench/adapter.py b/utils/rewardbench/adapter.py
similarity index 99%
rename from scripts/rewardbench/adapter.py
rename to utils/rewardbench/adapter.py
index 09ceef8d1..304509d53 100644
--- a/scripts/rewardbench/adapter.py
+++ b/utils/rewardbench/adapter.py
@@ -7,7 +7,7 @@
 - RewardBench v2: JSON files from allenai/reward-bench-2-results dataset (eval-set/{org}/{model}.json)
 
 Usage:
-    uv run python -m scripts.rewardbench.adapter
+    uv run python -m utils.rewardbench.adapter
 """
 
 import re
@@ -31,7 +31,7 @@
 import sys
 
 sys.path.insert(0, str(Path(__file__).parent.parent))
-from utils import (
+from helpers import (
     fetch_csv,
     fetch_json,
     get_developer,
diff --git a/scripts/rewardbench/migrate_to_v020.py b/utils/rewardbench/migrate_to_v020.py
similarity index 98%
rename from scripts/rewardbench/migrate_to_v020.py
rename to utils/rewardbench/migrate_to_v020.py
index e36880a16..0346d7c69 100644
--- a/scripts/rewardbench/migrate_to_v020.py
+++ b/utils/rewardbench/migrate_to_v020.py
@@ -14,7 +14,7 @@
     source_data = {"dataset_name": "RewardBench 2", "source_type": "hf_dataset", "hf_repo": "allenai/reward-bench-2-results"}
 
 Usage:
-    python -m scripts.rewardbench.migrate_to_v020
+    python -m utils.rewardbench.migrate_to_v020
 """
 
 import json
diff --git a/scripts/test.py b/utils/test.py
similarity index 100%
rename from scripts/test.py
rename to utils/test.py
diff --git a/scripts/validate_data.py b/utils/validate_data.py
similarity index 100%
rename from scripts/validate_data.py
rename to utils/validate_data.py