Add task validation CLI by MagellaX · Pull Request #302 · hud-evals/hud-python

MagellaX · 2026-01-26T08:51:00Z

Summary

Add hud validate to check task files or HF datasets without running them
Surface per-task validation errors with clear CLI output
Include a small unit test for valid/invalid task files

Note

Low Risk
Adds a new CLI command and validation-only code paths; main risk is false positives/negatives in task parsing/validation rather than runtime behavior changes.

Overview
Adds a new hud validate command that loads tasks from a local .json/.jsonl file or a dataset slug and validates each entry without running an eval.

Validation now checks v4-style tasks via validate_v4_task (when detected) and always attempts Pydantic Task construction, aggregating and printing per-task errors before exiting non-zero on failure. Includes unit tests covering valid tasks, missing required fields, and non-dict entries in the tasks list.
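The load-then-validate flow described above can be sketched roughly as follows. This is a stdlib-only illustration, not the PR's code: `REQUIRED_FIELDS`, `load_raw_entries`, and `validate_entries` are hypothetical names, and the required-field check is only a stand-in for the real `validate_v4_task` and Pydantic `Task` construction steps.

```python
import json
import tempfile
from pathlib import Path
from typing import Any

# Hypothetical stand-in for whatever fields the real Task model requires.
REQUIRED_FIELDS = ("prompt", "mcp_config")


def load_raw_entries(path: Path) -> list[Any]:
    """Load entries from a .jsonl file (one JSON value per line) or a .json file."""
    text = path.read_text(encoding="utf-8")
    if path.suffix == ".jsonl":
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    data = json.loads(text)
    # A .json file may hold a list of tasks or a single task object.
    return data if isinstance(data, list) else [data]


def validate_entries(entries: list[Any]) -> list[str]:
    """Collect one message per problem instead of stopping at the first one."""
    errors: list[str] = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            errors.append(
                f"task {index}: expected an object, got {type(entry).__name__}"
            )
            continue
        for field in REQUIRED_FIELDS:
            if field not in entry:
                errors.append(f"task {index}: missing required field '{field}'")
    return errors


def demo() -> list[str]:
    """Write a small mixed task file to a temp dir and validate it."""
    entries = [
        {"prompt": "open the page", "mcp_config": {}},  # valid
        {"prompt": "no config"},  # missing a required field
        "not a task",  # non-dict entry
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "tasks.json"
        path.write_text(json.dumps(entries), encoding="utf-8")
        return validate_entries(load_raw_entries(path))
```

A CLI wrapper around this would print each collected message and exit non-zero when the list is not empty, which matches the behavior the overview describes.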

Written by Cursor Bugbot for commit 0191b4d. This will update automatically on new commits.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9b4f45c79d


hud/cli/validate.py

hud/cli/__init__.py

hud/cli/validate.py

hud/cli/__init__.py

hud/cli/validate.py

hud/tests/test_validate_cli.py

MagellaX · 2026-01-27T19:10:50Z

any thoughts? @lorenss-m

cursor · 2026-02-02T07:36:01Z

hud/datasets/runner.py

+        task_list = [t if isinstance(t, Task) else Task.from_v4(t) for t in tasks]
+
+    if not task_list:
+        raise ValueError("No tasks to run")


Duplicated task normalization logic in runner functions

Medium Severity

The task normalization logic in run_dataset_async (lines 159-175) is nearly identical to run_dataset (lines 76-94). Both functions normalize agent_type from string to AgentType enum and normalize tasks from various input types to list[Task]. This ~17 lines of duplicated code should be extracted into a shared helper function like _normalize_tasks().

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.


cursor · 2026-02-02T07:54:02Z

hud/cli/validate.py

+
+def _load_raw_tasks(source: str) -> tuple[list[dict[str, Any]], list[str]]:
+    path = Path(source)
+    if path.exists() and path.suffix.lower() in {".json", ".jsonl"}:


Case sensitivity mismatch between validation and loading

Low Severity

The _load_raw_tasks and _load_raw_from_file functions use case-insensitive extension matching via .suffix.lower(), while the existing load_tasks function in loader.py uses case-sensitive matching. This means a file like tasks.JSONL would pass validation but fail when actually loaded via load_tasks, because loader.py wouldn't recognize the uppercase extension and would incorrectly try to fetch it as a HuggingFace dataset.

Additional Locations (1)

hud/cli/validate.py#L74-L75
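The mismatch can be reproduced in isolation. The two functions below are hypothetical: `is_task_file_strict` assumes the loader compares the raw suffix, as the review describes, and `is_task_file_lenient` mirrors the `.suffix.lower()` check quoted from the diff. Neither is copied from the repository.

```python
from pathlib import PurePosixPath

SUPPORTED_SUFFIXES = {".json", ".jsonl"}


def is_task_file_strict(source: str) -> bool:
    """Case-sensitive check: '.JSONL' is not recognized."""
    return PurePosixPath(source).suffix in SUPPORTED_SUFFIXES


def is_task_file_lenient(source: str) -> bool:
    """Case-insensitive check: the suffix is lowercased before comparing."""
    return PurePosixPath(source).suffix.lower() in SUPPORTED_SUFFIXES


# Show how each check classifies a few sources.
for source in ("tasks.jsonl", "tasks.JSONL", "org/dataset"):
    print(source, is_task_file_strict(source), is_task_file_lenient(source))
```

For tasks.JSONL the two checks disagree, so a validator using one and a loader using the other would treat the same path differently. Sharing a single predicate between both code paths would remove the discrepancy.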

cursor · 2026-02-02T07:54:02Z

hud/tests/test_validate_cli.py

+    module = importlib.util.module_from_spec(spec)  # type: ignore[arg-type]
+    assert spec and spec.loader
+    spec.loader.exec_module(module)
+    return module.validate_command


Test uses unnecessarily complex importlib module loading

Medium Severity

The _load_validate_command() function uses importlib.util.spec_from_file_location to manually load the module when a simple import would work: from hud.cli.validate import validate_command. This pattern is inconsistent with other tests in hud/cli/tests/ which use standard imports.
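For comparison, here is a self-contained sketch of the file-based loading pattern the review describes. It loads a throwaway module written to a temporary directory, so `fake_validate` and `load_module_from_path` are illustrative names and nothing here touches the real hud package. The standard alternative the review recommends is the single line `from hud.cli.validate import validate_command`.

```python
import importlib.util
import tempfile
from pathlib import Path
from types import ModuleType


def load_module_from_path(name: str, path: Path) -> ModuleType:
    """Load a module from an explicit file path instead of the import system."""
    spec = importlib.util.spec_from_file_location(name, path)
    # spec_from_file_location can return None, so check before using the spec.
    assert spec is not None and spec.loader is not None
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def demo() -> str:
    """Write a tiny module to disk, load it by path, and call its function."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "fake_validate.py"
        path.write_text(
            "def validate_command():\n    return 'ok'\n", encoding="utf-8"
        )
        module = load_module_from_path("fake_validate", path)
        return module.validate_command()
```

Path-based loading is mainly useful when a file is not importable as part of a package. When the module is already importable, the plain import is shorter and keeps the test consistent with its neighbors.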

MagellaX and others added 2 commits January 26, 2026 14:06

Add task validation command

9b4f45c

Merge branch 'main' into feature/task-validate-cli

0236c5d

chatgpt-codex-connector bot reviewed Jan 26, 2026

hud/cli/validate.py (outdated)

Validate: flag non-object task entries

edcd13d

cursor bot reviewed Jan 26, 2026

hud/cli/__init__.py (outdated)

hud/cli/__init__.py (outdated)

hud/cli/validate.py

Fix eval CLI selection and strict v4 validation

fa8aade

cursor bot reviewed Jan 26, 2026

hud/cli/__init__.py (outdated)

hud/cli/__init__.py (outdated)

hud/cli/validate.py (outdated)

Restore eval command registration and tighten types

7e57738

cursor bot reviewed Jan 26, 2026

hud/tests/test_validate_cli.py

Fix ruff formatting and JSONL validation

0191b4d

cursor bot reviewed Feb 2, 2026


MagellaX force-pushed the feature/task-validate-cli branch from 579b4f0 to 0191b4d on February 2, 2026 07:41

cursor bot reviewed Feb 2, 2026

MagellaX wants to merge 6 commits into hud-evals:main from MagellaX:feature/task-validate-cli
