Eval-design: MERFISH z-score mean metric is invariant and trivially passable without preprocessing #2

@zhennyang

Description

The eval merfish_brain_log_zscore_gad2_mean computes the mean z-scored expression of Gad2 after normalization, log1p, and z-scoring. Because z-scoring subtracts the gene's mean across cells, this value is ≈ 0 by construction (up to floating-point error) for any input data, which makes the eval gameable.

An agent can return 0 without loading data or executing any steps and still pass.
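The invariance can be checked on synthetic data. The sketch below is an assumption-laden stand-in, not the eval's actual pipeline: it uses plain NumPy for normalize → log1p → z-score, random Poisson counts instead of vizgen_mouse_brain_aging_raw.h5ad, and hypothetical names (`preprocess`, `zscore_mean_for_seed`, `target_sum`). It only illustrates that the mean z-score of a gene comes out near 0 whatever the input is.

```python
import numpy as np

def preprocess(counts, target_sum=1e4):
    """normalize -> log1p -> per-gene z-score on a cells x genes array."""
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0            # avoid dividing by zero for empty cells
    normalized = counts / totals * target_sum
    logged = np.log1p(normalized)
    mean = logged.mean(axis=0)           # per-gene mean across cells
    std = logged.std(axis=0)             # per-gene standard deviation across cells
    std[std == 0] = 1.0                  # leave constant genes at zero
    return (logged - mean) / std

def zscore_mean_for_seed(seed, n_cells=500, n_genes=20, gene_idx=0):
    """Mean z-score of one gene for a randomly generated count matrix."""
    rng = np.random.default_rng(seed)
    rates = rng.uniform(0.5, 20.0, size=n_genes)
    counts = rng.poisson(lam=rates, size=(n_cells, n_genes)).astype(float)
    z = preprocess(counts)
    return float(z[:, gene_idx].mean())

# Five unrelated datasets, one mean z-score each.
means = [zscore_mean_for_seed(seed) for seed in range(5)]
print(means)
```

Each dataset is drawn with different rates, yet every printed value should differ from 0 only by floating-point error, so a constant answer of 0 falls within any reasonable numeric tolerance.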

The original intent was to verify that the agent can (1) execute a multi-step preprocessing pipeline and (2) compute a derived statistic. The current metric does not achieve that.

Steps to Reproduce

Run vizgen/normalization/merfish_brain_log_zscore_gad2_mean.json

Expected vs Actual

Expected:
The output should depend on actually executing preprocessing and computing on the data.

Actual:
The eval passes with a constant output, independent of data or computation.

Environment

  • Dataset: vizgen_mouse_brain_aging_raw.h5ad
  • Eval: merfish_brain_log_zscore_gad2_mean
  • Type: numeric_tolerance
  • Steps: normalize → log1p → z-score

Proposed Fix

Replace the mean z-score with a non-invariant summary of the z-scored values, e.g.:

  • Fraction of cells with Gad2 z-score > 1
  • 95th percentile of Gad2 z-scores
  • IQR of Gad2 z-scores (not the variance, which is ≈ 1 by construction after z-scoring, just as the mean is ≈ 0)

Example:

After z-scoring, compute the fraction of cells (a value between 0 and 1, not a percentage) with Gad2 z-score > 1.
Return EXACTLY: {"pct_cells_gad2_z_gt_1": <float>}
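As a rough illustration of why these summaries are harder to guess, the sketch below z-scores two toy expression vectors and reports the proposed statistics. It assumes the input is already normalized and log1p-transformed Gad2 expression; the function name `gad2_zscore_summaries` and the keys `gad2_z_p95` and `gad2_z_iqr` are hypothetical, and only `pct_cells_gad2_z_gt_1` comes from the example above.

```python
import json
import numpy as np

def gad2_zscore_summaries(log_expr):
    """Z-score one gene across cells and return data-dependent summaries."""
    x = np.asarray(log_expr, dtype=float)
    std = x.std()
    if std == 0:
        raise ValueError("Gad2 expression is constant; z-scores are undefined")
    z = (x - x.mean()) / std
    q25, q75, q95 = np.percentile(z, [25, 75, 95])
    return {
        "pct_cells_gad2_z_gt_1": float(np.mean(z > 1.0)),  # fraction in [0, 1]
        "gad2_z_p95": float(q95),
        "gad2_z_iqr": float(q75 - q25),
    }

# Toy vectors: Gad2 is "on" in 10% of cells in one case and 30% in the other.
sparse = gad2_zscore_summaries([0.0] * 90 + [10.0] * 10)
dense = gad2_zscore_summaries([0.0] * 70 + [10.0] * 30)

print(json.dumps({"pct_cells_gad2_z_gt_1": sparse["pct_cells_gad2_z_gt_1"]}))
print(json.dumps({"pct_cells_gad2_z_gt_1": dense["pct_cells_gad2_z_gt_1"]}))
```

The two vectors give different fractions because the statistic depends on the shape of the expression distribution, so an agent has to load the data and run the pipeline to get the right value.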
