Eval-design: MERFISH z-score mean metric is invariant and trivially passable without preprocessing
Description
The eval merfish_brain_log_zscore_gad2_mean computes the mean z-scored expression of Gad2 after normalization and log1p. Because z-scoring subtracts the gene's mean, this value is 0 by construction (up to floating-point error) regardless of the data, which makes the eval gameable.
An agent can return 0 without loading the data or executing any step and still pass.
The original intent was to verify that the agent can (1) execute a multi-step preprocessing pipeline and (2) compute a derived statistic. The current metric does not achieve that.
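The invariance can be checked with a minimal NumPy sketch. The synthetic Poisson counts and the `zscore` helper here are illustrative assumptions, not the eval's actual data or pipeline code:

```python
import numpy as np

def zscore(x):
    """Standardize a 1-D array: subtract its mean, divide by its std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(0)

# Three very different synthetic "expression" distributions (assumed data).
z_means = []
for lam in (0.5, 5.0, 50.0):
    counts = rng.poisson(lam, size=10_000)
    logged = np.log1p(counts)
    z_means.append(zscore(logged).mean())

# Each entry is zero up to floating-point error, whatever the input was.
print(z_means)
```

Since every input yields the same answer, a numeric_tolerance check against 0 cannot distinguish an agent that ran the pipeline from one that did not.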
Steps to Reproduce
Run vizgen/normalization/merfish_brain_log_zscore_gad2_mean.json
Expected vs Actual
Expected:
The output should depend on actually executing preprocessing and computing on the data.
Actual:
The eval passes with a constant output, independent of data or computation.
Environment
- Dataset: vizgen_mouse_brain_aging_raw.h5ad
- Eval: merfish_brain_log_zscore_gad2_mean
- Type: numeric_tolerance
- Steps: normalize → log1p → z-score
Proposed Fix
Replace the mean z-score with a non-invariant summary of the z-scored values, e.g.:
- Fraction of cells with Gad2 z-score > 1
- 95th percentile of Gad2 z-scores
- Variance / IQR of Gad2 z-scores (note: the variance of unclipped z-scores is 1 by construction, so IQR is the safer choice)
Example:
After z-scoring, compute the fraction of cells with Gad2 z-score > 1.
Return EXACTLY: {"pct_cells_gad2_z_gt_1": <float>}
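The candidate summaries above could be computed roughly as follows. This is a sketch under stated assumptions: the helper name `gad2_z_summaries`, the `target_sum` of 1e4, the synthetic count matrix, and every output key other than `pct_cells_gad2_z_gt_1` are hypothetical, and the eval's real pipeline may normalize or scale differently.

```python
import numpy as np

def gad2_z_summaries(counts, gene_idx, target_sum=1e4):
    """normalize -> log1p -> z-score one gene, then summarize the z-scores."""
    counts = np.asarray(counts, dtype=float)

    # Per-cell total-count normalization (target_sum is an assumed value).
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty cells
    norm = counts / totals * target_sum

    # log1p, then z-score the chosen gene across cells.
    logged = np.log1p(norm[:, gene_idx])
    z = (logged - logged.mean()) / logged.std()

    q25, q75, q95 = np.percentile(z, [25, 75, 95])
    return {
        "pct_cells_gad2_z_gt_1": float((z > 1).mean()),  # key from the proposal
        "gad2_z_p95": float(q95),                        # hypothetical key
        "gad2_z_iqr": float(q75 - q25),                  # hypothetical key
        "gad2_z_mean": float(z.mean()),                  # the invariant metric
    }

# Synthetic cells x genes matrix; column 1 stands in for Gad2 (assumption).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=[2.0, 1.0, 5.0], size=(5000, 3))
summaries = gad2_z_summaries(counts, gene_idx=1)
print(summaries)
```

Unlike the mean, the fraction, percentile, and IQR depend on the shape of the distribution, so they cannot be guessed without running the preprocessing steps on the actual data.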